|
# BitTransformerLM User Guide |
|
|
|
**Version:** 0.1.0 Experimental |
|
**Last Updated:** August 2025 |
|
**Recommended Setup:** Use with [Claude Code](https://claude.ai/code) for optimal experience |
|
|
|
## Table of Contents |
|
|
|
1. [Quick Start](#quick-start) |
|
2. [Architecture Overview](#architecture-overview) |
|
3. [Core Features](#core-features) |
|
4. [Installation & Setup](#installation--setup) |
|
5. [Basic Usage Examples](#basic-usage-examples) |
|
6. [Advanced Features](#advanced-features) |
|
7. [Training Your Own Models](#training-your-own-models) |
|
8. [Safety and Monitoring](#safety-and-monitoring) |
|
9. [Distributed Training](#distributed-training) |
|
10. [Performance Optimization](#performance-optimization) |
|
11. [Troubleshooting](#troubleshooting) |
|
12. [Best Practices](#best-practices) |
|
|
|
--- |
|
|
|
## Quick Start |
|
|
|
BitTransformerLM is an experimental transformer language model that operates directly on binary sequences (bits) rather than tokens. This unique approach enables fine-grained control over information processing and built-in safety monitoring. |
|
|
|
### Minimal Example |
|
```python |
|
from bit_transformer import BitTransformerLM, example_training_step |
|
|
|
# Run basic example |
|
loss, telemetry = example_training_step() |
|
print(f"Training loss: {loss}") |
|
print(f"Available telemetry: {list(telemetry.keys())}") |
|
``` |
|
|
|
### Text Processing Example |
|
```python |
|
import torch

from bit_transformer import BitTransformerLM, text_to_bits, bits_to_text
|
|
|
# Create model |
|
model = BitTransformerLM( |
|
d_model=128, |
|
nhead=4, |
|
num_layers=2, |
|
dim_feedforward=256, |
|
max_seq_len=256 |
|
) |
|
|
|
# Convert text to bits and process |
|
text = "Hello, world!" |
|
bits = text_to_bits(text) |
|
bit_tensor = torch.tensor(bits).unsqueeze(0) |
|
|
|
# Forward pass |
|
logits, telemetry = model(bit_tensor) |
|
print(f"Input bits: {len(bits)}") |
|
print(f"Output shape: {logits.shape}") |
|
print(f"Telemetry metrics: {list(telemetry.keys())}") |
|
``` |
|
|
|
--- |
|
|
|
## Architecture Overview |
|
|
|
### Bit-Native Processing |
|
Unlike traditional language models that use token embeddings, BitTransformerLM processes raw binary sequences: |
|
|
|
- **Input**: Text → UTF-8 bytes → Bits with parity protection (9 bits per byte; see the round-trip sketch below)
|
- **Processing**: Multi-head attention on bit embeddings |
|
- **Output**: Probability distribution over next bit (0 or 1) |
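
The encoding can be sanity-checked with a quick round trip. The sketch below uses the `text_to_bits`/`bits_to_text` helpers shown later in this guide and assumes, per the description above, 9 bits per byte and an exact inverse decode:

```python
from bit_transformer import text_to_bits, bits_to_text

text = "Hi!"
bits = text_to_bits(text)                      # 9 bits per byte: 8 data bits + 1 parity bit
assert len(bits) == 9 * len(text.encode("utf-8"))

assert bits_to_text(bits) == text              # decoding checks parity and restores the text
print(f"{text!r} -> {len(bits)} bits")
```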
|
|
|
### Key Innovations |
|
|
|
#### 1. **Reversible Transformer Layers** |
|
- Memory-efficient computation that doesn't store intermediate activations |
|
- Enables training of deeper models with the same memory footprint
|
- Mathematically reversible operations for gradient computation |
|
|
|
#### 2. **Built-in Safety Telemetry** |
|
- **K (Negentropy)**: Measures information content vs random noise |
|
- **C (LZ Complexity)**: Proxy for compressibility and pattern complexity |
|
- **S (Symbiosis)**: Alignment with reference distributions |
|
- Real-time monitoring and safety gates |
|
|
|
#### 3. **Dual-Mode Operation** |
|
- **Causal Mode**: Traditional autoregressive generation |
|
- **Diffusion Mode**: Bidirectional denoising for higher quality output |
|
|
|
#### 4. **Progressive Scaling** |
|
- Dynamic architecture expansion based on validation performance |
|
- Automatic addition of layers, width, or context length |
|
- Curriculum learning from simple to complex patterns |
|
|
|
--- |
|
|
|
## Core Features |
|
|
|
### Text Processing |
|
- **Parity-Protected Encoding**: Each byte gets a parity bit for error detection (see the toy example below)
|
- **UTF-8 Support**: Full Unicode text processing capability |
|
- **Bidirectional Processing**: Support for both causal and diffusion modes |
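
For intuition, here is a toy parity computation for a single byte; the library's exact bit ordering and parity convention (even vs. odd) are assumptions in this sketch, so treat it purely as an illustration:

```python
def byte_to_bits_with_parity(byte: int) -> list[int]:
    """Illustrative 8+1 encoding: 8 data bits (MSB first) plus an even-parity bit."""
    data_bits = [(byte >> i) & 1 for i in range(7, -1, -1)]
    parity = sum(data_bits) % 2       # appending this bit makes the total count of 1s even
    return data_bits + [parity]

print(byte_to_bits_with_parity(ord("A")))   # 0b01000001 -> [0, 1, 0, 0, 0, 0, 0, 1, 0]
```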
|
|
|
### Safety & Monitoring |
|
- **Real-time Telemetry**: K/C/S metrics computed during inference |
|
- **Safety Gates**: Automatic blocking of unsafe outputs |
|
- **Metric Drift Detection**: Alerts when model behavior changes |
|
- **Human-in-the-Loop**: Safe inference with retry mechanisms |
|
|
|
### Memory Efficiency |
|
- **Reversible Layers**: Significant memory savings for deep models |
|
- **Gradient Checkpointing**: Trade compute for memory in training |
|
- **Dynamic Quantization**: Runtime INT8 conversion for inference |
|
- **4-bit QAT**: Quantization-aware training for extreme efficiency |
|
|
|
### Advanced Training |
|
- **Distributed Training**: FSDP and pipeline parallelism support |
|
- **Mixed Precision**: FP16/BF16 optimization with CPU autocast |
|
- **Compression Pipeline**: Run-length encoding for efficient storage |
|
- **Progressive Curriculum**: Automatic difficulty scaling |
|
|
|
--- |
|
|
|
## Installation & Setup |
|
|
|
### Requirements |
|
- Python 3.10 or later |
|
- PyTorch 2.7.1 or later |
|
- CUDA (optional, for GPU acceleration) |
|
|
|
### Installation |
|
```bash |
|
# Clone repository |
|
git clone https://huggingface.co/WCNegentropy/BitTransformerLM |
|
cd BitTransformerLM |
|
|
|
# Install dependencies |
|
pip install -r requirements.txt |
|
|
|
# For GPU support (optional) |
|
pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118 |
|
``` |
|
|
|
### Quick Test |
|
```bash |
|
# Run basic example |
|
python example.py |
|
|
|
# Expected output: |
|
# Training loss: [some value] |
|
# Available telemetry: ['activations', 'attention_maps', ...] |
|
``` |
|
|
|
### **🤖 Recommended: Setup with Claude Code** |
|
|
|
For the best experience, we recommend using [Claude Code](https://claude.ai/code) to set up and work with BitTransformerLM: |
|
|
|
1. **Open Claude Code** and navigate to your project directory |
|
2. **Clone the repository**: Claude Code can help with git operations and dependency management |
|
3. **Interactive Setup**: Claude Code can guide you through configuration options and explain parameters |
|
4. **Real-time Assistance**: Get help with model architecture, training parameters, and debugging |
|
5. **Code Generation**: Generate custom training scripts and experiments with AI assistance |
|
|
|
Claude Code provides contextual understanding of BitTransformerLM's unique architecture and can help you avoid common pitfalls when working with bit-native processing. |
|
|
|
--- |
|
|
|
## Basic Usage Examples |
|
|
|
### 1. Creating Models |
|
|
|
```python |
|
from bit_transformer import BitTransformerLM |
|
|
|
# Small model for experimentation |
|
small_model = BitTransformerLM( |
|
d_model=64, # Embedding dimension |
|
nhead=4, # Number of attention heads |
|
num_layers=2, # Number of transformer layers |
|
dim_feedforward=128, # Feedforward dimension |
|
max_seq_len=128, # Maximum sequence length |
|
reversible=True, # Use memory-efficient reversible layers |
|
use_checkpoint=True # Enable gradient checkpointing |
|
) |
|
|
|
# Medium model for research |
|
medium_model = BitTransformerLM( |
|
d_model=512, |
|
nhead=8, |
|
num_layers=8, |
|
dim_feedforward=2048, |
|
max_seq_len=512, |
|
reversible=True, |
|
use_checkpoint=True, |
|
chunk_size=64, # Chunked attention for long sequences |
|
lambda_K=0.1, # Negentropy regularization weight |
|
lambda_C=0.1, # Complexity regularization weight |
|
lambda_S=0.1 # Symbiosis regularization weight |
|
) |
|
``` |
|
|
|
### 2. Text Generation |
|
|
|
```python |
|
from bit_transformer.bit_io import sample_text |
|
|
|
# Generate text from prompt |
|
prompt = "The future of AI is" |
|
generated = sample_text( |
|
model, |
|
prompt=prompt, |
|
max_new_tokens=20, # Generate ~20 new characters |
|
temperature=0.8, # Sampling temperature |
|
top_p=0.9 # Nucleus sampling |
|
) |
|
print(f"Generated: {generated}") |
|
``` |
|
|
|
### 3. Safe Inference |
|
|
|
```python |
|
from bit_transformer import hil_safe_inference, text_to_bits |
|
import torch |
|
|
|
# Convert text to bits |
|
text = "Hello, world!" |
|
bits = torch.tensor(text_to_bits(text)).unsqueeze(0) |
|
|
|
# Safe inference with telemetry monitoring |
|
try: |
|
output_bits, telemetry = hil_safe_inference( |
|
model, |
|
bits, |
|
c_floor=0.3, # Minimum complexity threshold |
|
s_floor=0.5, # Minimum symbiosis threshold |
|
strict=True # Throw error if thresholds not met |
|
) |
|
print("✅ Safe inference completed") |
|
print(f"K (Negentropy): {telemetry.get('negentropy_logits', 'N/A')}") |
|
print(f"C (Complexity): {telemetry.get('lz_complexity_logits', 'N/A')}") |
|
print(f"S (Symbiosis): {telemetry.get('symbiosis_score', 'N/A')}") |
|
except Exception as e: |
|
print(f"⚠️ Safety check failed: {e}") |
|
``` |
|
|
|
### 4. Interactive Dashboard |
|
|
|
```python |
|
# Launch the interactive dashboard from the command line:
#   python unified_workflow.py --dashboard

# Or launch it programmatically:
from bit_transformer.dashboard_app import run_dashboard

run_dashboard(host="localhost", port=5000)
|
``` |
|
|
|
The dashboard provides: |
|
- Real-time training monitoring |
|
- Telemetry visualization |
|
- Model configuration controls |
|
- HuggingFace checkpoint management |
|
- Safe inference testing interface |
|
|
|
--- |
|
|
|
## Advanced Features |
|
|
|
### 1. Diffusion Mode Training |
|
|
|
Diffusion mode enables bidirectional processing for higher quality generation: |
|
|
|
```bash
|
# Train with diffusion mode |
|
python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32 |
|
|
|
# Different noise schedules |
|
python unified_workflow.py --diffusion --noise-schedule cosine --diffusion-steps 16 |
|
|
|
# Diffusion curriculum (noise decay over epochs) |
|
python unified_workflow.py --diffusion --diffusion-curriculum |
|
``` |
|
|
|
**Diffusion Parameters:** |
|
- `--diffusion-steps`: Number of denoising steps (higher = better quality) |
|
- `--noise-schedule`: `linear`, `cosine`, or `exp` noise decay |
|
- `--diffusion-curriculum`: Gradually reduce noise over training epochs |
|
|
|
### 2. Progressive Scaling |
|
|
|
Enable automatic model growth based on performance: |
|
|
|
```python |
|
import torch

from bit_transformer import BitTransformerLM
from bit_transformer.scale import expand_model
from bit_transformer.training import train_loop
|
|
|
# Training loop with automatic scaling |
|
model = BitTransformerLM(d_model=64, nhead=4, num_layers=2, dim_feedforward=128) |
|
train_data = torch.randint(0, 2, (1000, 64)) |
|
|
|
# Train with progressive scaling |
|
train_loop( |
|
model, |
|
train_data, |
|
epochs=10, |
|
batch_size=8, |
|
# Progressive scaling will automatically trigger when validation loss plateaus |
|
) |
|
|
|
# Manual model expansion |
|
expanded_model = expand_model(model, strategy="depth") # Add layers |
|
expanded_model = expand_model(model, strategy="width") # Increase width |
|
expanded_model = expand_model(model, strategy="context") # Extend context |
|
``` |
|
|
|
### 3. Compression Pipeline |
|
|
|
BitTransformerLM includes run-length encoding for efficient data storage: |
|
|
|
```python |
|
import torch

from bit_transformer import compress_bits, decompress_bits, train_loop
|
|
|
# Compress bit sequences |
|
original_bits = torch.tensor([0, 0, 0, 1, 1, 0, 1, 1, 1]) |
|
compressed = compress_bits(original_bits) |
|
decompressed = decompress_bits(compressed) |
|
|
|
print(f"Original: {original_bits}") |
|
print(f"Compressed: {compressed}") |
|
print(f"Decompressed: {decompressed}") |
|
print(f"Compression ratio: {len(original_bits) / len(compressed):.2f}") |
|
|
|
# Use compression in training |
|
train_loop( |
|
model, |
|
data, |
|
compress_prob=0.5, # 50% of training uses compressed data |
|
compress_warmup=100 # Start compression after 100 steps |
|
) |
|
``` |
|
|
|
### 4. Quantization and Optimization |
|
|
|
```python |
|
import torch

from bit_transformer import quantize_dynamic, prepare_qat_fx, convert_qat_fx, train_loop
|
|
|
# Dynamic quantization for inference |
|
quantized_model = quantize_dynamic(model, dtype=torch.qint8) |
|
|
|
# 4-bit quantization-aware training |
|
qat_model = prepare_qat_fx(model) |
|
# ... train qat_model ... |
|
final_model = convert_qat_fx(qat_model) |
|
|
|
# Enable mixed precision and compilation |
|
train_loop( |
|
model, |
|
data, |
|
amp=True, # Enable automatic mixed precision |
|
compile_model=True # Use torch.compile for speedup |
|
) |
|
``` |
|
|
|
--- |
|
|
|
## Training Your Own Models |
|
|
|
### Basic Training Script |
|
|
|
```python |
|
import torch |
|
from bit_transformer import BitTransformerLM, train_loop, configure_optimizer |
|
from bit_transformer.bit_io import text_to_bits |
|
|
|
# Prepare training data |
|
texts = ["Hello world", "How are you?", "BitTransformer is working!"] |
|
all_bits = [] |
|
for text in texts: |
|
bits = text_to_bits(text) |
|
all_bits.extend(bits) |
|
|
|
# Convert to tensor and create sequences |
|
data = torch.tensor(all_bits) |
|
sequences = data.unfold(0, 64, 32) # 64-bit sequences with 32-bit stride |
|
|
|
# Create model |
|
model = BitTransformerLM( |
|
d_model=128, |
|
nhead=8, |
|
num_layers=4, |
|
dim_feedforward=512, |
|
max_seq_len=64, |
|
reversible=True |
|
) |
|
|
|
# Configure optimizer |
|
optimizer = configure_optimizer(model, lr=0.001, weight_decay=0.01) |
|
|
|
# Training loop |
|
train_loop( |
|
model, |
|
sequences, |
|
epochs=10, |
|
batch_size=4, |
|
optimizer=optimizer, |
|
amp=True, # Mixed precision |
|
log=True # Enable logging |
|
) |
|
``` |
|
|
|
### Advanced Training Configuration |
|
|
|
```python |
|
# Advanced training with all features enabled |
|
train_loop( |
|
model, |
|
data, |
|
epochs=20, |
|
batch_size=8, |
|
accum_steps=4, # Gradient accumulation |
|
amp=True, # Mixed precision |
|
compile_model=True, # torch.compile optimization |
|
|
|
# Compression settings |
|
compress_prob=0.3, # 30% compression probability |
|
compress_warmup=50, # Start compression after 50 steps |
|
|
|
# Diffusion settings |
|
diffusion=True, # Enable diffusion mode |
|
diffusion_curriculum=True, # Decay noise over epochs |
|
|
|
# Direct bit training |
|
direct_prob=0.1, # 10% direct bit prediction |
|
|
|
# Logging |
|
log=True # Enable detailed logging |
|
) |
|
``` |
|
|
|
### Custom Training Loop |
|
|
|
```python |
|
import torch
import torch.nn.functional as F

from bit_transformer.utils import set_dropout
|
|
|
# Manual training loop for full control |
|
model.train() |
|
set_dropout(model, 0.1) # Enable dropout for training |
|
|
|
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001) |
|
criterion = F.cross_entropy |
|
|
|
for epoch in range(10): |
|
total_loss = 0 |
|
for batch in data_loader: |
|
optimizer.zero_grad() |
|
|
|
# Forward pass |
|
logits, telemetry = model(batch) |
|
|
|
# Compute loss |
|
if logits.dim() == 3: # (batch, seq, 2) |
|
targets = batch[:, 1:] # Next bit prediction |
|
logits = logits[:, :-1] # Remove last prediction |
|
loss = criterion(logits.reshape(-1, 2), targets.reshape(-1)) |
|
else: |
|
loss = criterion(logits, batch) |
|
|
|
# Add telemetry regularization |
|
if model.lambda_K > 0: |
|
loss += model.lambda_K * (1 - telemetry.get('negentropy_logits', 0)) |
|
if model.lambda_C > 0: |
|
loss += model.lambda_C * (1 - telemetry.get('lz_complexity_logits', 0)) |
|
|
|
# Backward pass |
|
loss.backward() |
|
|
|
# Gradient clipping |
|
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) |
|
|
|
optimizer.step() |
|
total_loss += loss.item() |
|
|
|
# Safety check |
|
if telemetry.get('symbiosis_score', 1.0) < 0.3: |
|
print("⚠️ Low symbiosis score detected") |
|
|
|
print(f"Epoch {epoch}: Average loss = {total_loss / len(data_loader):.4f}") |
|
``` |
|
|
|
--- |
|
|
|
## Safety and Monitoring |
|
|
|
### Telemetry Metrics |
|
|
|
BitTransformerLM provides three key safety metrics: |
|
|
|
#### K (Negentropy) - Information Content |
|
- **Range**: 0-1 (0 = random noise, 1 = perfectly ordered) |
|
- **Purpose**: Measures departure from randomness |
|
- **Interpretation**: |
|
- Very low K (< 0.1): Output is noise-like |
|
- Moderate K (0.3-0.7): Structured but varied output |
|
- Very high K (> 0.9): Repetitive or overly structured |
|
|
|
#### C (LZ Complexity) - Pattern Complexity |
|
- **Range**: 0-1 (higher = more complex patterns) |
|
- **Purpose**: Proxy for Lempel-Ziv compressibility |
|
- **Interpretation**: |
|
- Low C (< 0.3): Highly repetitive patterns |
|
- Moderate C (0.3-0.7): Balanced complexity |
|
- High C (> 0.8): Complex, varied patterns |
|
|
|
#### S (Symbiosis) - Distribution Alignment |
|
- **Range**: 0-1 (higher = better alignment) |
|
- **Purpose**: Agreement with reference distributions via KL divergence |
|
- **Interpretation**: |
|
- Low S (< 0.3): Poor alignment with expected patterns |
|
- Moderate S (0.5-0.8): Good alignment |
|
- High S (> 0.8): Excellent alignment |
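
As a rough illustration of what K captures, the toy function below scores a bit sequence by how far its empirical bit distribution is from a fair coin (1 minus the binary entropy). The library computes K over model logits and may define it differently, so this is only for building intuition:

```python
import math
import random


def toy_negentropy(bits: list[int]) -> float:
    """Return 1 - H(p) for the empirical rate of 1s: near 0 for coin-flip noise, 1.0 for a constant stream."""
    p = sum(bits) / len(bits)
    if p in (0.0, 1.0):
        return 1.0
    entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return 1.0 - entropy


random.seed(0)
noise = [random.getrandbits(1) for _ in range(256)]
print(round(toy_negentropy(noise), 3))   # close to 0: fair-coin statistics
print(toy_negentropy([1] * 64))          # 1.0: maximally ordered
```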
|
|
|
### Safety Gates |
|
|
|
```python |
|
from bit_transformer.safety import SafetyGate, safe_sample_with_retry |
|
|
|
# Configure safety gate |
|
gate = SafetyGate( |
|
c_floor=0.3, # Minimum complexity |
|
s_floor=0.5, # Minimum symbiosis |
|
decay=0.9, # EMA decay factor |
|
burn_in=10 # Steps before gating starts |
|
) |
|
|
|
# Check if output should be blocked |
|
should_block = gate.should_trigger(c_val=0.2, s_val=0.4) # True - below thresholds |
|
|
|
# Safe sampling with automatic retry |
|
output = safe_sample_with_retry( |
|
model, |
|
input_bits, |
|
max_retries=3, |
|
retry_strategy="diffusion" # Try diffusion mode on failure |
|
) |
|
``` |
|
|
|
### Metric Drift Detection |
|
|
|
```python |
|
from bit_transformer.telemetry import detect_metric_drift |
|
|
|
# Monitor metric stability over time |
|
metrics_history = [ |
|
{"K": 0.5, "C": 0.6, "S": 0.7}, |
|
{"K": 0.52, "C": 0.58, "S": 0.69}, |
|
{"K": 0.8, "C": 0.9, "S": 0.4}, # Drift detected! |
|
# ... more metrics |
|
] |
|
|
|
drift_detected = detect_metric_drift( |
|
metrics_history, |
|
window=10, # Look back 10 steps |
|
threshold=0.2 # Alert if change > 0.2 |
|
) |
|
|
|
if drift_detected: |
|
print("⚠️ Model behavior drift detected!") |
|
``` |
|
|
|
--- |
|
|
|
## Distributed Training |
|
|
|
### FSDP (Fully Sharded Data Parallel) |
|
|
|
```python |
|
import torch.distributed as dist

from bit_transformer import BitTransformerLM, train_loop
from bit_transformer.distributed import wrap_fsdp, setup_distributed
|
|
|
# Initialize distributed training |
|
setup_distributed(rank=0, world_size=4) |
|
|
|
# Wrap model with FSDP |
|
model = BitTransformerLM(d_model=1024, nhead=16, num_layers=12) |
|
fsdp_model = wrap_fsdp( |
|
model, |
|
sharding_strategy="FULL_SHARD", # or "SHARD_GRAD_OP", "NO_SHARD" |
|
mixed_precision=True, |
|
device_id=0 |
|
) |
|
|
|
# Train with FSDP |
|
train_loop( |
|
fsdp_model, |
|
data, |
|
epochs=10, |
|
batch_size=2, # Smaller batch per GPU |
|
amp=True |
|
) |
|
``` |
|
|
|
### Pipeline Parallelism |
|
|
|
```python |
|
from bit_transformer.distributed import make_pipeline |
|
|
|
# Create pipeline parallel model |
|
pipeline_model = make_pipeline( |
|
model, |
|
balance=[2, 2, 2, 2], # Split 8 layers across 4 GPUs |
|
devices=[0, 1, 2, 3], |
|
checkpoint="never" # or "always", "except_last" |
|
) |
|
|
|
# Pipeline training requires special handling |
|
# See unified_workflow.py for complete implementation |
|
``` |
|
|
|
### Multi-GPU Training Script |
|
|
|
```bash |
|
# Single node, multiple GPUs |
|
python -m torch.distributed.launch \ |
|
--nproc_per_node=4 \ |
|
unified_workflow.py \ |
|
--distributed \ |
|
--batch-size 2 \ |
|
--epochs 10 |
|
|
|
# Multiple nodes |
|
python -m torch.distributed.launch \ |
|
--nnodes=2 \ |
|
--node_rank=0 \ |
|
--master_addr="192.168.1.100" \ |
|
--master_port=29500 \ |
|
--nproc_per_node=4 \ |
|
unified_workflow.py \ |
|
--distributed |
|
``` |
|
|
|
--- |
|
|
|
## Performance Optimization |
|
|
|
### Memory Optimization |
|
|
|
```python |
|
# Enable all memory optimizations |
|
model = BitTransformerLM( |
|
d_model=512, |
|
nhead=8, |
|
num_layers=8, |
|
reversible=True, # Reversible layers save ~50% memory |
|
use_checkpoint=True, # Gradient checkpointing |
|
chunk_size=64, # Chunked attention for long sequences |
|
full_attn_logging=False # Skip full attention reconstruction |
|
) |
|
|
|
# Training optimizations |
|
train_loop( |
|
model, |
|
data, |
|
batch_size=4, # Smaller batches |
|
accum_steps=8, # Gradient accumulation |
|
amp=True, # Mixed precision |
|
compile_model=True # torch.compile |
|
) |
|
``` |
|
|
|
### CPU Optimization |
|
|
|
```python |
|
from bit_transformer.torch_utils import cpu_autocast |
|
|
|
# Enable BF16 on CPU |
|
with cpu_autocast(): |
|
logits, telemetry = model(bits) |
|
|
|
# Or enable for entire model |
|
model = BitTransformerLM(use_autocast=True) # Automatically uses CPU BF16 |
|
``` |
|
|
|
### Inference Optimization |
|
|
|
```python |
|
# Quantize for inference |
|
import torch

from bit_transformer import quantize_dynamic
from bit_transformer.utils import set_dropout
|
|
|
# Switch to evaluation mode |
|
model.eval() |
|
set_dropout(model, 0.0) |
|
|
|
# Dynamic quantization |
|
quantized = quantize_dynamic(model, dtype=torch.qint8) |
|
|
|
# Optimize for inference |
|
with torch.no_grad(): |
|
logits, _ = quantized(input_bits) |
|
``` |
|
|
|
### Long Sequence Processing |
|
|
|
```python |
|
import torch

from bit_transformer import text_to_bits
from bit_transformer.model import infer_long_sequence
|
|
|
# Process sequences longer than max_seq_len |
|
long_text = "Very long text..." * 1000 |
|
bits = text_to_bits(long_text) |
|
|
|
output = infer_long_sequence( |
|
model, |
|
torch.tensor(bits).unsqueeze(0), |
|
chunk_size=256, # Process in 256-bit chunks |
|
overlap=32, # 32-bit overlap between chunks |
|
stride=224 # 224-bit stride (256-32) |
|
) |
|
``` |
|
|
|
--- |
|
|
|
## Troubleshooting |
|
|
|
### Common Issues |
|
|
|
#### 1. **Memory Errors** |
|
``` |
|
RuntimeError: CUDA out of memory |
|
``` |
|
**Solutions:** |
|
- Enable reversible layers: `reversible=True` |
|
- Enable gradient checkpointing: `use_checkpoint=True` |
|
- Reduce batch size or use gradient accumulation |
|
- Use chunked attention: `chunk_size=64` |
|
- Enable mixed precision: `amp=True` |
|
|
|
#### 2. **Tensor Shape Mismatches** |
|
``` |
|
RuntimeError: view size is not compatible with input tensor's size |
|
``` |
|
**Solutions:** |
|
- Always use `.reshape()` instead of `.view()` with BitTransformerLM (see the example below)
|
- Check that input sequences are properly formatted (1D for bits) |
|
- Ensure batch dimensions are consistent |
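
For example, when flattening logits for the loss, `reshape` falls back to a copy when the tensor is not contiguous, whereas `view` raises the error above:

```python
# logits: (batch, seq, 2) and targets: (batch, seq), prepared as in the training examples above
loss = F.cross_entropy(logits.reshape(-1, 2), targets.reshape(-1))  # safe even for non-contiguous logits
```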
|
|
|
#### 3. **Parity Check Failures** |
|
``` |
|
ValueError: Parity check failed |
|
``` |
|
**Solutions:** |
|
- Use `enforce_parity()` to fix parity bits in generated sequences |
|
- Check that text encoding/decoding is consistent |
|
- Verify bit sequences have correct 9-bit (8+parity) structure |
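
A minimal usage sketch follows; the import path and signature of `enforce_parity` are assumptions here, so check the package's exports for the actual interface:

```python
from bit_transformer import bits_to_text, enforce_parity  # enforce_parity import path assumed

fixed_bits = enforce_parity(generated_bits)  # assumed to repair the parity bits of a generated sequence
text = bits_to_text(fixed_bits)
```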
|
|
|
#### 4. **Safety Gate Triggering** |
|
``` |
|
SafetyError: Output blocked by safety gate |
|
``` |
|
**Solutions:** |
|
- Lower safety thresholds: `c_floor=0.2, s_floor=0.4` |
|
- Increase burn-in period: `burn_in=20` |
|
- Use retry with diffusion: `safe_sample_with_retry()` |
|
- Check model training quality |
|
|
|
### Debug Mode |
|
|
|
```python |
|
# Enable detailed logging |
|
import logging

import torch

from bit_transformer import BitTransformerLM
|
logging.basicConfig(level=logging.DEBUG) |
|
|
|
# Model with debug telemetry |
|
model = BitTransformerLM( |
|
d_model=64, |
|
nhead=4, |
|
num_layers=2, |
|
full_attn_logging=True, # Log full attention maps |
|
chunk_size=None # Disable chunking for debugging |
|
) |
|
|
|
# Inspect telemetry |
|
logits, telemetry = model(input_bits) |
|
print("Telemetry keys:", list(telemetry.keys())) |
|
print("Attention maps shape:", [a.shape for a in telemetry['attention_maps']]) |
|
print("Activation stats:", torch.stack(telemetry['activations']).describe()) |
|
``` |
|
|
|
### Performance Profiling |
|
|
|
```python |
|
import torch
import torch.nn.functional as F
import torch.profiler
|
|
|
# Profile training step |
|
with torch.profiler.profile( |
|
activities=[ |
|
torch.profiler.ProfilerActivity.CPU, |
|
torch.profiler.ProfilerActivity.CUDA, |
|
], |
|
record_shapes=True, |
|
with_stack=True, |
|
) as prof: |
|
logits, telemetry = model(input_bits) |
|
loss = F.cross_entropy(logits.reshape(-1, 2), targets.reshape(-1)) |
|
loss.backward() |
|
|
|
print(prof.key_averages().table(sort_by="cuda_time_total")) |
|
``` |
|
|
|
--- |
|
|
|
## Best Practices |
|
|
|
### Model Configuration |
|
|
|
#### For Experimentation (< 1M parameters) |
|
```python |
|
model = BitTransformerLM( |
|
d_model=64, |
|
nhead=4, |
|
num_layers=2, |
|
dim_feedforward=128, |
|
max_seq_len=128, |
|
reversible=False, # Simpler for debugging |
|
use_checkpoint=False |
|
) |
|
``` |
|
|
|
#### For Research (1M-100M parameters) |
|
```python |
|
model = BitTransformerLM( |
|
d_model=256, |
|
nhead=8, |
|
num_layers=6, |
|
dim_feedforward=1024, |
|
max_seq_len=512, |
|
reversible=True, # Enable memory efficiency |
|
use_checkpoint=True, |
|
chunk_size=128, |
|
lambda_K=0.05, # Light regularization |
|
lambda_C=0.05, |
|
lambda_S=0.05 |
|
) |
|
``` |
|
|
|
#### For Large-Scale (100M+ parameters) |
|
```python |
|
model = BitTransformerLM( |
|
d_model=1024, |
|
nhead=16, |
|
num_layers=20, |
|
dim_feedforward=4096, |
|
max_seq_len=2048, |
|
reversible=True, |
|
use_checkpoint=True, |
|
chunk_size=256, |
|
full_attn_logging=False, # Save memory |
|
lambda_K=0.1, |
|
lambda_C=0.1, |
|
lambda_S=0.1 |
|
) |
|
``` |
|
|
|
### Training Best Practices |
|
|
|
1. **Always validate on held-out data** to monitor overfitting |
|
2. **Use gradient clipping** to prevent training instability (see the sketch after this list)
|
3. **Monitor telemetry metrics** for signs of model degradation |
|
4. **Start with smaller models** before scaling up |
|
5. **Use safety gates** in production deployments |
|
6. **Enable logging** to track training progress |
|
7. **Save checkpoints frequently** to prevent loss of progress |
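
A minimal sketch of points 2 and 7 using plain PyTorch calls (the checkpoint path and save cadence are placeholders):

```python
import torch

# Inside the training loop, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

# Periodically persist progress:
if step % 1000 == 0:
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        f"checkpoints/bitlm_step_{step}.pt",
    )
```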
|
|
|
### Data Preparation |
|
|
|
```python |
|
import torch

from bit_transformer import text_to_bits

# Good: Clean, well-formatted text
|
texts = [ |
|
"The quick brown fox jumps over the lazy dog.", |
|
"Machine learning is transforming technology.", |
|
"BitTransformer processes information at the bit level." |
|
] |
|
|
|
# Convert to training sequences |
|
all_bits = [] |
|
for text in texts: |
|
bits = text_to_bits(text) |
|
all_bits.extend(bits) |
|
|
|
# Create overlapping sequences for better learning |
|
data = torch.tensor(all_bits) |
|
seq_len = 128 |
|
stride = 64 |
|
sequences = [] |
|
for i in range(0, len(data) - seq_len, stride): |
|
sequences.append(data[i:i + seq_len]) |
|
|
|
training_data = torch.stack(sequences) |
|
``` |
|
|
|
### Production Deployment |
|
|
|
```python |
|
import logging

import torch

from bit_transformer import quantize_dynamic, text_to_bits
from bit_transformer.safety import SafetyGate, safe_sample_with_retry
from bit_transformer.utils import set_dropout

# Production-ready model setup
|
model.eval() # Disable dropout |
|
set_dropout(model, 0.0) |
|
|
|
# Enable safety monitoring |
|
gate = SafetyGate(c_floor=0.3, s_floor=0.5, burn_in=5) |
|
|
|
# Quantize for efficiency |
|
production_model = quantize_dynamic(model) |
|
|
|
# Safe inference with monitoring |
|
def safe_generate(input_text, max_length=100): |
|
try: |
|
        bits = torch.tensor(text_to_bits(input_text)).unsqueeze(0)
        return safe_sample_with_retry(
            production_model,
            bits,
            max_retries=3
        )
|
except Exception as e: |
|
logging.error(f"Generation failed: {e}") |
|
return "Error: Unable to generate safe output" |
|
``` |
|
|
|
--- |
|
|
|
## Getting Help |
|
|
|
### Documentation Resources |
|
- **ABOUTME.md**: Project overview and quick start |
|
- **README.md**: Professional model card and specifications |
|
- **RESEARCH_STATUS.md**: Current research status and limitations |
|
- **EMPIRICAL_VALIDATION.md**: Evidence-based analysis of capabilities |
|
|
|
### Community Support |
|
- **GitHub Issues**: Report bugs and request features |
|
- **Discussions**: Ask questions and share experiences |
|
- **Examples**: Check the `tests/` directory for usage examples |
|
|
|
### **🤖 Recommended: Use with Claude Code** |
|
|
|
For the best experience with BitTransformerLM, we recommend using [Claude Code](https://claude.ai/code): |
|
|
|
- **Interactive Setup**: Get step-by-step guidance for configuration |
|
- **Real-time Debugging**: Immediate help when things go wrong |
|
- **Code Generation**: Custom scripts and experiments tailored to your needs |
|
- **Architecture Explanation**: Deep understanding of bit-native processing |
|
- **Best Practices**: Learn optimal configurations for your use case |
|
|
|
Claude Code understands BitTransformerLM's unique architecture and can help you navigate the complexities of bit-level language modeling. |
|
|
|
--- |
|
|
|
**Remember: BitTransformerLM is experimental research software. Always validate results thoroughly and use safety monitoring in any deployment.** |
|
|
|
Happy experimenting! 🤖✨ |