# BitTransformerLM User Guide

**Version:** 0.1.0 Experimental

**Last Updated:** August 2025

**Recommended Setup:** Use with [Claude Code](https://claude.ai/code) for optimal experience

## Table of Contents

1. [Quick Start](#quick-start)
2. [Architecture Overview](#architecture-overview)
3. [Core Features](#core-features)
4. [Installation & Setup](#installation--setup)
5. [Basic Usage Examples](#basic-usage-examples)
6. [Advanced Features](#advanced-features)
7. [Training Your Own Models](#training-your-own-models)
8. [Safety and Monitoring](#safety-and-monitoring)
9. [Distributed Training](#distributed-training)
10. [Performance Optimization](#performance-optimization)
11. [Troubleshooting](#troubleshooting)
12. [Best Practices](#best-practices)
13. [Getting Help](#getting-help)

---
## Quick Start

BitTransformerLM is an experimental transformer language model that operates directly on binary sequences (bits) rather than tokens. This approach allows fine-grained control over information processing and supports built-in safety monitoring.

### Minimal Example

```python
from bit_transformer import BitTransformerLM, example_training_step

# Run basic example
loss, telemetry = example_training_step()
print(f"Training loss: {loss}")
print(f"Available telemetry: {list(telemetry.keys())}")
```
### Text Processing Example

```python
import torch
from bit_transformer import BitTransformerLM, text_to_bits, bits_to_text

# Create model
model = BitTransformerLM(
    d_model=128,
    nhead=4,
    num_layers=2,
    dim_feedforward=256,
    max_seq_len=256
)

# Convert text to bits and add a batch dimension
text = "Hello, world!"
bits = text_to_bits(text)
bit_tensor = torch.tensor(bits).unsqueeze(0)

# Forward pass
logits, telemetry = model(bit_tensor)
print(f"Input bits: {len(bits)}")
print(f"Output shape: {logits.shape}")
print(f"Telemetry metrics: {list(telemetry.keys())}")
```

---
## Architecture Overview

### Bit-Native Processing

Unlike traditional language models that use token embeddings, BitTransformerLM processes raw binary sequences:

- **Input**: Text → UTF-8 bytes → Bits with parity protection (9 bits per byte)
- **Processing**: Multi-head attention on bit embeddings
- **Output**: Probability distribution over next bit (0 or 1)
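
The input path above (text → UTF-8 bytes → 9 bits per byte) can be illustrated with a small standalone sketch. This is not the library's `text_to_bits` implementation: the function names are hypothetical, and both the even-parity convention and the most-significant-bit-first ordering are assumptions made for illustration.

```python
def encode_with_parity(text: str) -> list[int]:
    """Encode text as UTF-8, emitting 8 data bits plus 1 parity bit per byte."""
    bits: list[int] = []
    for byte in text.encode("utf-8"):
        # Most significant bit first (assumed ordering).
        data = [(byte >> shift) & 1 for shift in range(7, -1, -1)]
        # Even parity (assumed): the 9-bit group always holds an even number of 1s.
        parity = sum(data) % 2
        bits.extend(data + [parity])
    return bits


def decode_with_parity(bits: list[int]) -> str:
    """Check every 9-bit group and rebuild the original text."""
    if len(bits) % 9 != 0:
        raise ValueError("bit stream length must be a multiple of 9")
    out = bytearray()
    for start in range(0, len(bits), 9):
        *data, parity = bits[start:start + 9]
        if sum(data) % 2 != parity:
            raise ValueError(f"parity check failed at byte {start // 9}")
        value = 0
        for bit in data:
            value = (value << 1) | bit
        out.append(value)
    return out.decode("utf-8")


encoded = encode_with_parity("Hi")
print(len(encoded))                 # 18 bits: 2 bytes x 9 bits
print(decode_with_parity(encoded))  # Hi
```

A single flipped bit inside any 9-bit group changes the parity and makes `decode_with_parity` raise, which is the error-detection property the parity bit is there to provide.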
### Key Innovations

#### 1. **Reversible Transformer Layers**

- Memory-efficient computation: layer inputs are reconstructed from layer outputs during the backward pass instead of being stored
- Enables training deeper models within the same memory footprint
- Mathematically reversible operations for gradient computation
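
To make "mathematically reversible" concrete, here is a minimal sketch of the additive coupling used by RevNet-style reversible blocks. Whether BitTransformerLM uses exactly this formulation is an assumption; the helper names and the toy sub-functions `f` and `g` are hypothetical and stand in for attention and feedforward sub-layers.

```python
from typing import Callable, List, Tuple

Vector = List[float]
SubLayer = Callable[[Vector], Vector]


def _add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def _sub(a: Vector, b: Vector) -> Vector:
    return [x - y for x, y in zip(a, b)]


def reversible_forward(x1: Vector, x2: Vector, f: SubLayer, g: SubLayer) -> Tuple[Vector, Vector]:
    """y1 = x1 + f(x2); y2 = x2 + g(y1)."""
    y1 = _add(x1, f(x2))
    y2 = _add(x2, g(y1))
    return y1, y2


def reversible_inverse(y1: Vector, y2: Vector, f: SubLayer, g: SubLayer) -> Tuple[Vector, Vector]:
    """Recover the inputs from the outputs, so the inputs need not be stored."""
    x2 = _sub(y2, g(y1))
    x1 = _sub(y1, f(x2))
    return x1, x2


# Toy sub-layers (placeholders for attention / feedforward).
f = lambda v: [2.0 * x for x in v]
g = lambda v: [x * x for x in v]

y1, y2 = reversible_forward([1.0, 2.0], [3.0, 4.0], f, g)
print(reversible_inverse(y1, y2, f, g))  # ([1.0, 2.0], [3.0, 4.0])
```

Because the inverse only needs `f`, `g`, and the block outputs, a stack of such blocks can discard its intermediate inputs during the forward pass and rebuild them layer by layer during backpropagation.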
#### 2. **Built-in Safety Telemetry**

- **K (Negentropy)**: Measures information content vs random noise
- **C (LZ Complexity)**: Proxy for compressibility and pattern complexity
- **S (Symbiosis)**: Alignment with reference distributions
- Real-time monitoring and safety gates

#### 3. **Dual-Mode Operation**

- **Causal Mode**: Traditional autoregressive generation
- **Diffusion Mode**: Bidirectional denoising for higher quality output

#### 4. **Progressive Scaling**

- Dynamic architecture expansion based on validation performance
- Automatic addition of layers, width, or context length
- Curriculum learning from simple to complex patterns
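
One way to decide when to expand, as described in the list above, is to watch for a plateau in validation loss. The sketch below is an illustration only: `should_expand`, `patience`, and `min_delta` are hypothetical names, and the library's actual trigger may use a different rule. It returns `True` when the most recent `patience` validation losses fail to improve on the earlier best by at least `min_delta`.

```python
from typing import Sequence


def should_expand(val_losses: Sequence[float], patience: int = 3, min_delta: float = 1e-3) -> bool:
    """Return True when validation loss has plateaued for `patience` evaluations."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    # Plateau: nothing in the recent window beats the earlier best by min_delta.
    return recent_best > best_before - min_delta


print(should_expand([1.0, 0.8, 0.6, 0.5]))       # False: still improving
print(should_expand([1.0, 0.5, 0.5, 0.5, 0.5]))  # True: no progress for 3 evaluations
```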

---

## Core Features

### Text Processing

- **Parity-Protected Encoding**: Each byte gets a parity bit for error detection
- **UTF-8 Support**: Full Unicode text processing capability
- **Bidirectional Processing**: Support for both causal and diffusion modes

### Safety & Monitoring

- **Real-time Telemetry**: K/C/S metrics computed during inference
- **Safety Gates**: Automatic blocking of unsafe outputs
- **Metric Drift Detection**: Alerts when model behavior changes
- **Human-in-the-Loop**: Safe inference with retry mechanisms

### Memory Efficiency

- **Reversible Layers**: Significant memory savings for deep models
- **Gradient Checkpointing**: Trade compute for memory in training
- **Dynamic Quantization**: Runtime INT8 conversion for inference
- **4-bit QAT**: Quantization-aware training for extreme efficiency

### Advanced Training

- **Distributed Training**: FSDP and pipeline parallelism support
- **Mixed Precision**: FP16/BF16 optimization with CPU autocast
- **Compression Pipeline**: Run-length encoding for efficient storage
- **Progressive Curriculum**: Automatic difficulty scaling

---
## Installation & Setup

### Requirements

- Python 3.10 or later
- PyTorch 2.7.1 or later
- CUDA (optional, for GPU acceleration)

### Installation

```bash
# Clone repository
git clone https://huggingface.co/WCNegentropy/BitTransformerLM
cd BitTransformerLM

# Install dependencies
pip install -r requirements.txt

# For GPU support (optional)
pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118
```

### Quick Test

```bash
# Run basic example
python example.py

# Expected output:
# Training loss: [some value]
# Available telemetry: ['activations', 'attention_maps', ...]
```
### **🤖 Recommended: Setup with Claude Code**

For the best experience, we recommend using [Claude Code](https://claude.ai/code) to set up and work with BitTransformerLM:

1. **Open Claude Code** and navigate to your project directory
2. **Clone the repository**: Claude Code can help with git operations and dependency management
3. **Interactive Setup**: Claude Code can guide you through configuration options and explain parameters
4. **Real-time Assistance**: Get help with model architecture, training parameters, and debugging
5. **Code Generation**: Generate custom training scripts and experiments with AI assistance

Claude Code provides contextual understanding of BitTransformerLM's unique architecture and can help you avoid common pitfalls when working with bit-native processing.

---
## Basic Usage Examples

### 1. Creating Models

```python
from bit_transformer import BitTransformerLM

# Small model for experimentation
small_model = BitTransformerLM(
    d_model=64,           # Embedding dimension
    nhead=4,              # Number of attention heads
    num_layers=2,         # Number of transformer layers
    dim_feedforward=128,  # Feedforward dimension
    max_seq_len=128,      # Maximum sequence length
    reversible=True,      # Use memory-efficient reversible layers
    use_checkpoint=True   # Enable gradient checkpointing
)

# Medium model for research
medium_model = BitTransformerLM(
    d_model=512,
    nhead=8,
    num_layers=8,
    dim_feedforward=2048,
    max_seq_len=512,
    reversible=True,
    use_checkpoint=True,
    chunk_size=64,  # Chunked attention for long sequences
    lambda_K=0.1,   # Negentropy regularization weight
    lambda_C=0.1,   # Complexity regularization weight
    lambda_S=0.1    # Symbiosis regularization weight
)
```
### 2. Text Generation

```python
from bit_transformer.bit_io import sample_text

# Generate text from a prompt.
# `model` is a BitTransformerLM instance, e.g. one created in the previous section.
prompt = "The future of AI is"
generated = sample_text(
    model,
    prompt=prompt,
    max_new_tokens=20,  # Generate ~20 new characters
    temperature=0.8,    # Sampling temperature
    top_p=0.9           # Nucleus sampling
)
print(f"Generated: {generated}")
```
### 3. Safe Inference

```python
import torch
from bit_transformer import hil_safe_inference, text_to_bits

# Convert text to bits
text = "Hello, world!"
bits = torch.tensor(text_to_bits(text)).unsqueeze(0)

# Safe inference with telemetry monitoring
try:
    output_bits, telemetry = hil_safe_inference(
        model,
        bits,
        c_floor=0.3,  # Minimum complexity threshold
        s_floor=0.5,  # Minimum symbiosis threshold
        strict=True   # Raise an error if thresholds are not met
    )
    print("✅ Safe inference completed")
    print(f"K (Negentropy): {telemetry.get('negentropy_logits', 'N/A')}")
    print(f"C (Complexity): {telemetry.get('lz_complexity_logits', 'N/A')}")
    print(f"S (Symbiosis): {telemetry.get('symbiosis_score', 'N/A')}")
except Exception as e:
    print(f"⚠️ Safety check failed: {e}")
```
### 4. Interactive Dashboard

```python
# Launch the interactive dashboard from a shell:
#   python unified_workflow.py --dashboard

# Or start it programmatically
from bit_transformer.dashboard_app import run_dashboard
run_dashboard(host="localhost", port=5000)
```

The dashboard provides:

- Real-time training monitoring
- Telemetry visualization
- Model configuration controls
- HuggingFace checkpoint management
- Safe inference testing interface

---
## Advanced Features

### 1. Diffusion Mode Training

Diffusion mode enables bidirectional processing for higher quality generation:

```bash
# Train with diffusion mode
python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32

# Different noise schedules
python unified_workflow.py --diffusion --noise-schedule cosine --diffusion-steps 16

# Diffusion curriculum (noise decay over epochs)
python unified_workflow.py --diffusion --diffusion-curriculum
```

**Diffusion Parameters:**

- `--diffusion-steps`: Number of denoising steps (more steps generally improve quality at the cost of compute)
- `--noise-schedule`: `linear`, `cosine`, or `exp` noise decay
- `--diffusion-curriculum`: Gradually reduce noise over training epochs
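
The three schedule names above describe how the noise level decays across denoising steps. The guide does not give the formulas, so the sketch below uses common textbook shapes as assumptions: the function name `noise_level`, the peak flip probability `p_max`, and the decay constant in the exponential branch are all hypothetical and may not match what `unified_workflow.py` computes.

```python
import math


def noise_level(step: int, total_steps: int, schedule: str = "linear", p_max: float = 0.5) -> float:
    """Bit-flip probability at `step` (0-based) for a run of `total_steps` steps."""
    if total_steps < 1 or not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    # Progress through the schedule: 0.0 at the first step, 1.0 at the last.
    frac = step / max(total_steps - 1, 1)
    if schedule == "linear":
        scale = 1.0 - frac
    elif schedule == "cosine":
        scale = math.cos(0.5 * math.pi * frac)
    elif schedule == "exp":
        scale = math.exp(-5.0 * frac)  # decay constant chosen for illustration
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return p_max * scale


for name in ("linear", "cosine", "exp"):
    print(name, [round(noise_level(t, 8, name), 3) for t in range(8)])
```

Under these assumed shapes, the cosine schedule keeps noise high for longer before dropping, while the exponential schedule removes most of the noise in the first few steps.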
### 2. Progressive Scaling

Enable automatic model growth based on performance:

```python
import torch
from bit_transformer import BitTransformerLM
from bit_transformer.training import train_loop
from bit_transformer.scale import expand_model

# Training loop with automatic scaling
model = BitTransformerLM(d_model=64, nhead=4, num_layers=2, dim_feedforward=128)
train_data = torch.randint(0, 2, (1000, 64))  # 1000 random 64-bit sequences

# Train with progressive scaling
train_loop(
    model,
    train_data,
    epochs=10,
    batch_size=8,
    # Progressive scaling will automatically trigger when validation loss plateaus
)

# Manual model expansion (each call returns a new, larger model)
expanded_model = expand_model(model, strategy="depth")    # Add layers
expanded_model = expand_model(model, strategy="width")    # Increase width
expanded_model = expand_model(model, strategy="context")  # Extend context
```
### 3. Compression Pipeline

BitTransformerLM includes run-length encoding for efficient data storage:

```python
import torch
from bit_transformer import compress_bits, decompress_bits

# Compress bit sequences
original_bits = torch.tensor([0, 0, 0, 1, 1, 0, 1, 1, 1])
compressed = compress_bits(original_bits)
decompressed = decompress_bits(compressed)
print(f"Original: {original_bits}")
print(f"Compressed: {compressed}")
print(f"Decompressed: {decompressed}")
print(f"Compression ratio: {len(original_bits) / len(compressed):.2f}")

# Use compression in training
# (`train_loop`, `model`, and `data` as defined in the training examples)
train_loop(
    model,
    data,
    compress_prob=0.5,   # 50% of training uses compressed data
    compress_warmup=100  # Start compression after 100 steps
)
```
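
For readers unfamiliar with run-length encoding, the following standalone sketch shows the idea on a plain Python list. It does not reproduce the output format of `compress_bits`, which this guide does not specify; `rle_encode` and `rle_decode` are hypothetical names used only for illustration.

```python
from itertools import groupby
from typing import List, Tuple


def rle_encode(bits: List[int]) -> List[Tuple[int, int]]:
    """Collapse each run of identical bits into a (bit, run_length) pair."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Expand (bit, run_length) pairs back into the original bit list."""
    out: List[int] = []
    for bit, count in runs:
        out.extend([bit] * count)
    return out


bits = [0, 0, 0, 1, 1, 0, 1, 1, 1]
runs = rle_encode(bits)
print(runs)                      # [(0, 3), (1, 2), (0, 1), (1, 3)]
print(rle_decode(runs) == bits)  # True
```

Run-length encoding only pays off when runs are long; a stream that alternates on almost every bit produces more pairs than it had bits.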
### 4. Quantization and Optimization

```python
import torch
from bit_transformer import quantize_dynamic, prepare_qat_fx, convert_qat_fx

# Dynamic quantization for inference
quantized_model = quantize_dynamic(model, dtype=torch.qint8)

# 4-bit quantization-aware training
qat_model = prepare_qat_fx(model)
# ... train qat_model ...
final_model = convert_qat_fx(qat_model)

# Enable mixed precision and compilation
train_loop(
    model,
    data,
    amp=True,           # Enable automatic mixed precision
    compile_model=True  # Use torch.compile for speedup
)
```

---
## Training Your Own Models

### Basic Training Script

```python
import torch
from bit_transformer import BitTransformerLM, train_loop, configure_optimizer
from bit_transformer.bit_io import text_to_bits

# Prepare training data
texts = ["Hello world", "How are you?", "BitTransformer is working!"]
all_bits = []
for text in texts:
    bits = text_to_bits(text)
    all_bits.extend(bits)

# Convert to tensor and create sequences
data = torch.tensor(all_bits)
sequences = data.unfold(0, 64, 32)  # 64-bit sequences with 32-bit stride

# Create model
model = BitTransformerLM(
    d_model=128,
    nhead=8,
    num_layers=4,
    dim_feedforward=512,
    max_seq_len=64,
    reversible=True
)

# Configure optimizer
optimizer = configure_optimizer(model, lr=0.001, weight_decay=0.01)

# Training loop
train_loop(
    model,
    sequences,
    epochs=10,
    batch_size=4,
    optimizer=optimizer,
    amp=True,  # Mixed precision
    log=True   # Enable logging
)
```
### Advanced Training Configuration

```python
# Advanced training with all features enabled
train_loop(
    model,
    data,
    epochs=20,
    batch_size=8,
    accum_steps=4,       # Gradient accumulation
    amp=True,            # Mixed precision
    compile_model=True,  # torch.compile optimization
    # Compression settings
    compress_prob=0.3,   # 30% compression probability
    compress_warmup=50,  # Start compression after 50 steps
    # Diffusion settings
    diffusion=True,             # Enable diffusion mode
    diffusion_curriculum=True,  # Decay noise over epochs
    # Direct bit training
    direct_prob=0.1,  # 10% direct bit prediction
    # Logging
    log=True  # Enable detailed logging
)
```
### Custom Training Loop

```python
import torch
import torch.nn.functional as F
from bit_transformer.utils import set_dropout

# Manual training loop for full control.
# `model` is a BitTransformerLM; `data_loader` yields batches of bit tensors.
model.train()
set_dropout(model, 0.1)  # Enable dropout for training
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
criterion = F.cross_entropy

for epoch in range(10):
    total_loss = 0
    for batch in data_loader:
        optimizer.zero_grad()

        # Forward pass
        logits, telemetry = model(batch)

        # Compute loss
        if logits.dim() == 3:  # (batch, seq, 2)
            targets = batch[:, 1:]   # Next bit prediction
            logits = logits[:, :-1]  # Remove last prediction
            loss = criterion(logits.reshape(-1, 2), targets.reshape(-1))
        else:
            loss = criterion(logits, batch)

        # Add telemetry regularization
        if model.lambda_K > 0:
            loss += model.lambda_K * (1 - telemetry.get('negentropy_logits', 0))
        if model.lambda_C > 0:
            loss += model.lambda_C * (1 - telemetry.get('lz_complexity_logits', 0))

        # Backward pass
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        total_loss += loss.item()

        # Safety check
        if telemetry.get('symbiosis_score', 1.0) < 0.3:
            print("⚠️ Low symbiosis score detected")

    print(f"Epoch {epoch}: Average loss = {total_loss / len(data_loader):.4f}")
```

---
## Safety and Monitoring

### Telemetry Metrics

BitTransformerLM provides three key safety metrics:

#### K (Negentropy) - Information Content

- **Range**: 0-1 (0 = random noise, 1 = perfectly ordered)
- **Purpose**: Measures departure from randomness
- **Interpretation**:
  - Very low K (< 0.1): Output is noise-like
  - Moderate K (0.3-0.7): Structured but varied output
  - Very high K (> 0.9): Repetitive or overly structured

#### C (LZ Complexity) - Pattern Complexity

- **Range**: 0-1 (higher = more complex patterns)
- **Purpose**: Proxy for Lempel-Ziv compressibility
- **Interpretation**:
  - Low C (< 0.3): Highly repetitive patterns
  - Moderate C (0.3-0.7): Balanced complexity
  - High C (> 0.8): Complex, varied patterns

#### S (Symbiosis) - Distribution Alignment

- **Range**: 0-1 (higher = better alignment)
- **Purpose**: Agreement with reference distributions via KL divergence
- **Interpretation**:
  - Low S (< 0.3): Poor alignment with expected patterns
  - Moderate S (0.5-0.8): Good alignment
  - High S (> 0.8): Excellent alignment
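
To build intuition for the three ranges above, the sketch below computes simple stand-ins for K, C, and S on plain Python lists. These are assumptions for illustration, not the library's telemetry code: K is taken as one minus the Shannon entropy of the bit frequencies, C as a normalised Lempel-Ziv (1976) phrase count, and S as `exp(-KL)`, which is just one possible way to map a KL divergence into the 0-1 range. All function names are hypothetical.

```python
import math
from typing import List


def negentropy(bits: List[int]) -> float:
    """K proxy: 1 minus the Shannon entropy (in bits) of the 0/1 frequencies."""
    if not bits:
        raise ValueError("need at least one bit")
    p_one = sum(bits) / len(bits)
    if p_one in (0.0, 1.0):
        return 1.0  # a constant stream has zero entropy
    entropy = -(p_one * math.log2(p_one) + (1 - p_one) * math.log2(1 - p_one))
    return 1.0 - entropy


def lz76_phrase_count(bits: List[int]) -> int:
    """Number of phrases in a Lempel-Ziv (1976) style parsing of the stream."""
    text = "".join(str(b) for b in bits)
    pos, phrases = 0, 0
    while pos < len(text):
        length = 1
        # Grow the phrase while it can still be copied from earlier text.
        while pos + length <= len(text) and text[pos:pos + length] in text[:pos + length - 1]:
            length += 1
        phrases += 1
        pos += length
    return phrases


def lz_complexity(bits: List[int]) -> float:
    """C proxy: phrase count scaled by n / log2(n) and clipped to [0, 1]."""
    n = len(bits)
    if n < 2:
        return 0.0
    return min(1.0, lz76_phrase_count(bits) * math.log2(n) / n)


def symbiosis(p: List[float], q: List[float]) -> float:
    """S proxy: exp(-KL(p || q)); assumes q is strictly positive."""
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return math.exp(-kl)


print(negentropy([0] * 8))                 # 1.0 (perfectly ordered)
print(negentropy([0, 1] * 4))              # 0.0 (balanced frequencies)
print(lz_complexity([0] * 64))             # 0.1875 (highly repetitive)
print(symbiosis([0.5, 0.5], [0.5, 0.5]))   # 1.0 (identical distributions)
```

Note that the frequency-based K proxy scores the alternating stream `0101...` as 0.0 even though it is perfectly predictable, which is one reason a complexity measure such as C is useful alongside it.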
### Safety Gates

```python
from bit_transformer.safety import SafetyGate, safe_sample_with_retry

# Configure safety gate
gate = SafetyGate(
    c_floor=0.3,  # Minimum complexity
    s_floor=0.5,  # Minimum symbiosis
    decay=0.9,    # EMA decay factor
    burn_in=10    # Steps before gating starts
)

# Check if output should be blocked.
# Both values are below their floors; gating only starts after the burn-in steps.
should_block = gate.should_trigger(c_val=0.2, s_val=0.4)

# Safe sampling with automatic retry
output = safe_sample_with_retry(
    model,
    input_bits,
    max_retries=3,
    retry_strategy="diffusion"  # Try diffusion mode on failure
)
```
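
The interaction between `decay`, `burn_in`, and the two floors can be sketched with a small standalone class. This is an illustrative assumption about how an EMA-based gate might behave, not the source of `SafetyGate`; the class name `EmaGate` is hypothetical, and details such as whether the comparison is strict may differ in the library.

```python
from typing import Optional


class EmaGate:
    """Toy gate: smooth C and S with an exponential moving average, then compare to floors."""

    def __init__(self, c_floor: float = 0.3, s_floor: float = 0.5,
                 decay: float = 0.9, burn_in: int = 10) -> None:
        self.c_floor = c_floor
        self.s_floor = s_floor
        self.decay = decay
        self.burn_in = burn_in
        self.step = 0
        self.c_ema: Optional[float] = None
        self.s_ema: Optional[float] = None

    def should_trigger(self, c_val: float, s_val: float) -> bool:
        self.step += 1
        if self.c_ema is None or self.s_ema is None:
            # First observation seeds the averages.
            self.c_ema, self.s_ema = c_val, s_val
        else:
            self.c_ema = self.decay * self.c_ema + (1 - self.decay) * c_val
            self.s_ema = self.decay * self.s_ema + (1 - self.decay) * s_val
        if self.step <= self.burn_in:
            return False  # never block during burn-in
        return self.c_ema < self.c_floor or self.s_ema < self.s_floor


gate = EmaGate(burn_in=2)
print([gate.should_trigger(0.2, 0.4) for _ in range(3)])  # [False, False, True]
```

With a decay of 0.9, a single bad reading moves the average by only 10% of the gap, so the gate reacts to sustained drops rather than one-off outliers.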
### Metric Drift Detection

```python
from bit_transformer.telemetry import detect_metric_drift

# Monitor metric stability over time
metrics_history = [
    {"K": 0.5, "C": 0.6, "S": 0.7},
    {"K": 0.52, "C": 0.58, "S": 0.69},
    {"K": 0.8, "C": 0.9, "S": 0.4},  # Drift detected!
    # ... more metrics
]

drift_detected = detect_metric_drift(
    metrics_history,
    window=10,     # Look back 10 steps
    threshold=0.2  # Alert if change > 0.2
)

if drift_detected:
    print("⚠️ Model behavior drift detected!")
```
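
As an illustration of what a windowed drift check can look like, the sketch below compares the mean of each metric over the latest `window` entries with its mean over the `window` entries before that. The comparison rule is an assumption and `mean_shift_drift` is a hypothetical name; `detect_metric_drift` may define drift differently.

```python
from typing import Dict, List


def mean_shift_drift(history: List[Dict[str, float]], window: int = 10, threshold: float = 0.2) -> bool:
    """Return True if any metric's windowed mean moved by more than `threshold`."""
    if window < 1 or len(history) < 2 * window:
        return False  # not enough history for two full windows
    previous = history[-2 * window:-window]
    recent = history[-window:]
    for key in recent[0]:
        prev_mean = sum(entry[key] for entry in previous) / window
        recent_mean = sum(entry[key] for entry in recent) / window
        if abs(recent_mean - prev_mean) > threshold:
            return True
    return False


stable = [{"K": 0.5, "C": 0.6, "S": 0.7}] * 4
shifted = stable[:2] + [{"K": 0.8, "C": 0.9, "S": 0.4}] * 2
print(mean_shift_drift(stable, window=2))   # False
print(mean_shift_drift(shifted, window=2))  # True
```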

---

## Distributed Training

### FSDP (Fully Sharded Data Parallel)

```python
import torch.distributed as dist
from bit_transformer import BitTransformerLM, train_loop
from bit_transformer.distributed import wrap_fsdp, setup_distributed

# Initialize distributed training
setup_distributed(rank=0, world_size=4)

# Wrap model with FSDP
model = BitTransformerLM(d_model=1024, nhead=16, num_layers=12)
fsdp_model = wrap_fsdp(
    model,
    sharding_strategy="FULL_SHARD",  # or "SHARD_GRAD_OP", "NO_SHARD"
    mixed_precision=True,
    device_id=0
)

# Train with FSDP
train_loop(
    fsdp_model,
    data,
    epochs=10,
    batch_size=2,  # Smaller batch per GPU
    amp=True
)
```
### Pipeline Parallelism

```python
from bit_transformer.distributed import make_pipeline

# Create pipeline parallel model
pipeline_model = make_pipeline(
    model,
    balance=[2, 2, 2, 2],  # Split 8 layers across 4 GPUs
    devices=[0, 1, 2, 3],
    checkpoint="never"     # or "always", "except_last"
)

# Pipeline training requires special handling
# See unified_workflow.py for complete implementation
```
### Multi-GPU Training Script

```bash
# Note: torch.distributed.launch is deprecated in recent PyTorch releases;
# `torchrun` accepts the same launcher flags shown here.

# Single node, multiple GPUs
python -m torch.distributed.launch \
    --nproc_per_node=4 \
    unified_workflow.py \
    --distributed \
    --batch-size 2 \
    --epochs 10

# Multiple nodes (run on each node with its own --node_rank)
python -m torch.distributed.launch \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="192.168.1.100" \
    --master_port=29500 \
    --nproc_per_node=4 \
    unified_workflow.py \
    --distributed
```

---
## Performance Optimization

### Memory Optimization

```python
# Enable all memory optimizations
model = BitTransformerLM(
    d_model=512,
    nhead=8,
    num_layers=8,
    reversible=True,         # Reversible layers save ~50% memory
    use_checkpoint=True,     # Gradient checkpointing
    chunk_size=64,           # Chunked attention for long sequences
    full_attn_logging=False  # Skip full attention reconstruction
)

# Training optimizations
train_loop(
    model,
    data,
    batch_size=4,       # Smaller batches
    accum_steps=8,      # Gradient accumulation
    amp=True,           # Mixed precision
    compile_model=True  # torch.compile
)
```
### CPU Optimization

```python
from bit_transformer.torch_utils import cpu_autocast

# Enable BF16 on CPU
with cpu_autocast():
    logits, telemetry = model(bits)

# Or enable for entire model
model = BitTransformerLM(use_autocast=True)  # Automatically uses CPU BF16
```
### Inference Optimization

```python
import torch
from bit_transformer import quantize_dynamic
from bit_transformer.utils import set_dropout

# Switch to evaluation mode
model.eval()
set_dropout(model, 0.0)

# Dynamic quantization
quantized = quantize_dynamic(model, dtype=torch.qint8)

# Run inference without tracking gradients
with torch.no_grad():
    logits, _ = quantized(input_bits)
```
### Long Sequence Processing

```python
import torch
from bit_transformer import text_to_bits
from bit_transformer.model import infer_long_sequence

# Process sequences longer than max_seq_len
long_text = "Very long text..." * 1000
bits = text_to_bits(long_text)

output = infer_long_sequence(
    model,
    torch.tensor(bits).unsqueeze(0),
    chunk_size=256,  # Process in 256-bit chunks
    overlap=32,      # 32-bit overlap between chunks
    stride=224       # 224-bit stride (256-32)
)
```
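
The relationship between `chunk_size`, `overlap`, and `stride` in the call above (stride = chunk size minus overlap) can be checked with a short standalone sketch that lists the windows a long input would be cut into. The helper name `chunk_spans` is hypothetical, and how `infer_long_sequence` merges the overlapping regions is not covered here.

```python
from typing import List, Tuple


def chunk_spans(total_len: int, chunk_size: int = 256, overlap: int = 32) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs covering `total_len` with overlapping windows."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if total_len <= 0:
        return []
    stride = chunk_size - overlap  # e.g. 256 - 32 = 224
    spans: List[Tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + chunk_size, total_len)
        spans.append((start, end))
        if end >= total_len:
            break
        start += stride
    return spans


print(chunk_spans(600))  # [(0, 256), (224, 480), (448, 600)]
```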

---

## Troubleshooting

### Common Issues

#### 1. **Memory Errors**

```
RuntimeError: CUDA out of memory
```

**Solutions:**

- Enable reversible layers: `reversible=True`
- Enable gradient checkpointing: `use_checkpoint=True`
- Reduce batch size or use gradient accumulation
- Use chunked attention: `chunk_size=64`
- Enable mixed precision: `amp=True`

#### 2. **Tensor Shape Mismatches**

```
RuntimeError: view size is not compatible with input tensor's size
```

**Solutions:**

- Always use `.reshape()` instead of `.view()` with BitTransformerLM
- Check that input sequences are properly formatted (1D for bits)
- Ensure batch dimensions are consistent

#### 3. **Parity Check Failures**

```
ValueError: Parity check failed
```

**Solutions:**

- Use `enforce_parity()` to fix parity bits in generated sequences
- Check that text encoding/decoding is consistent
- Verify bit sequences have correct 9-bit (8+parity) structure

#### 4. **Safety Gate Triggering**

```
SafetyError: Output blocked by safety gate
```

**Solutions:**

- Lower safety thresholds: `c_floor=0.2, s_floor=0.4`
- Increase burn-in period: `burn_in=20`
- Use retry with diffusion: `safe_sample_with_retry()`
- Check model training quality
### Debug Mode

```python
# Enable detailed logging
import logging
import torch
from bit_transformer import BitTransformerLM

logging.basicConfig(level=logging.DEBUG)

# Model with debug telemetry
model = BitTransformerLM(
    d_model=64,
    nhead=4,
    num_layers=2,
    full_attn_logging=True,  # Log full attention maps
    chunk_size=None          # Disable chunking for debugging
)

# Inspect telemetry
logits, telemetry = model(input_bits)
print("Telemetry keys:", list(telemetry.keys()))
print("Attention maps shape:", [a.shape for a in telemetry['attention_maps']])
# torch.Tensor has no .describe(); report summary statistics explicitly
acts = torch.stack(telemetry['activations']).float()
print(f"Activation stats: mean={acts.mean():.4f}, std={acts.std():.4f}, "
      f"min={acts.min():.4f}, max={acts.max():.4f}")
```
### Performance Profiling

```python
import torch.nn.functional as F
import torch.profiler

# Next-bit targets, shifted as in the custom training loop
targets = input_bits[:, 1:]

# Profile training step
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_stack=True,
) as prof:
    logits, telemetry = model(input_bits)
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, 2), targets.reshape(-1))
    loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total"))
```

---
## Best Practices

### Model Configuration

#### For Experimentation (< 1M parameters)

```python
model = BitTransformerLM(
    d_model=64,
    nhead=4,
    num_layers=2,
    dim_feedforward=128,
    max_seq_len=128,
    reversible=False,  # Simpler for debugging
    use_checkpoint=False
)
```

#### For Research (1M-100M parameters)

```python
model = BitTransformerLM(
    d_model=256,
    nhead=8,
    num_layers=6,
    dim_feedforward=1024,
    max_seq_len=512,
    reversible=True,  # Enable memory efficiency
    use_checkpoint=True,
    chunk_size=128,
    lambda_K=0.05,    # Light regularization
    lambda_C=0.05,
    lambda_S=0.05
)
```

#### For Large-Scale (100M+ parameters)

```python
model = BitTransformerLM(
    d_model=1024,
    nhead=16,
    num_layers=20,
    dim_feedforward=4096,
    max_seq_len=2048,
    reversible=True,
    use_checkpoint=True,
    chunk_size=256,
    full_attn_logging=False,  # Save memory
    lambda_K=0.1,
    lambda_C=0.1,
    lambda_S=0.1
)
```
### Training Best Practices

1. **Always validate on held-out data** to monitor overfitting
2. **Use gradient clipping** to prevent training instability
3. **Monitor telemetry metrics** for signs of model degradation
4. **Start with smaller models** before scaling up
5. **Use safety gates** in production deployments
6. **Enable logging** to track training progress
7. **Save checkpoints frequently** to prevent loss of progress
### Data Preparation

```python
import torch
from bit_transformer import text_to_bits

# Good: Clean, well-formatted text
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is transforming technology.",
    "BitTransformer processes information at the bit level."
]

# Convert to training sequences
all_bits = []
for text in texts:
    bits = text_to_bits(text)
    all_bits.extend(bits)

# Create overlapping sequences for better learning
data = torch.tensor(all_bits)
seq_len = 128
stride = 64
sequences = []
# The + 1 keeps the final full-length window when it aligns with the stride
for i in range(0, len(data) - seq_len + 1, stride):
    sequences.append(data[i:i + seq_len])
training_data = torch.stack(sequences)
```
### Production Deployment

```python
import logging
import torch
from bit_transformer import quantize_dynamic, text_to_bits
from bit_transformer.safety import SafetyGate, safe_sample_with_retry
from bit_transformer.utils import set_dropout

# Production-ready model setup
model.eval()  # Disable dropout
set_dropout(model, 0.0)

# Enable safety monitoring
gate = SafetyGate(c_floor=0.3, s_floor=0.5, burn_in=5)

# Quantize for efficiency
production_model = quantize_dynamic(model)

# Safe inference with monitoring
def safe_generate(input_text, max_length=100):
    try:
        # Batched bit tensor, as in the other inference examples
        input_bits = torch.tensor(text_to_bits(input_text)).unsqueeze(0)
        return safe_sample_with_retry(
            production_model,
            input_bits,
            max_retries=3
        )
    except Exception as e:
        logging.error(f"Generation failed: {e}")
        return "Error: Unable to generate safe output"
```

---
## Getting Help

### Documentation Resources

- **ABOUTME.md**: Project overview and quick start
- **README.md**: Professional model card and specifications
- **RESEARCH_STATUS.md**: Current research status and limitations
- **EMPIRICAL_VALIDATION.md**: Evidence-based analysis of capabilities

### Community Support

- **GitHub Issues**: Report bugs and request features
- **Discussions**: Ask questions and share experiences
- **Examples**: Check the `tests/` directory for usage examples

### **🤖 Recommended: Use with Claude Code**

For the best experience with BitTransformerLM, we recommend using [Claude Code](https://claude.ai/code):

- **Interactive Setup**: Get step-by-step guidance for configuration
- **Real-time Debugging**: Immediate help when things go wrong
- **Code Generation**: Custom scripts and experiments tailored to your needs
- **Architecture Explanation**: Deep understanding of bit-native processing
- **Best Practices**: Learn optimal configurations for your use case

Claude Code understands BitTransformerLM's unique architecture and can help you navigate the complexities of bit-level language modeling.

---

**Remember: BitTransformerLM is experimental research software. Always validate results thoroughly and use safety monitoring in any deployment.**

Happy experimenting! 🤖✨