# BitTransformerLM User Guide

**Version:** 0.1.0 Experimental
**Last Updated:** August 2025
**Recommended Setup:** Use with [Claude Code](https://claude.ai/code) for optimal experience

## Table of Contents

1. [Quick Start](#quick-start)
2. [Architecture Overview](#architecture-overview)
3. [Core Features](#core-features)
4. [Installation & Setup](#installation--setup)
5. [Basic Usage Examples](#basic-usage-examples)
6. [Advanced Features](#advanced-features)
7. [Training Your Own Models](#training-your-own-models)
8. [Safety and Monitoring](#safety-and-monitoring)
9. [Distributed Training](#distributed-training)
10. [Performance Optimization](#performance-optimization)
11. [Troubleshooting](#troubleshooting)
12. [Best Practices](#best-practices)

---

## Quick Start

BitTransformerLM is an experimental transformer language model that operates directly on binary sequences (bits) rather than tokens. This unique approach enables fine-grained control over information processing and built-in safety monitoring.

### Minimal Example

```python
from bit_transformer import BitTransformerLM, example_training_step

# Run basic example
loss, telemetry = example_training_step()
print(f"Training loss: {loss}")
print(f"Available telemetry: {list(telemetry.keys())}")
```

### Text Processing Example

```python
import torch

from bit_transformer import BitTransformerLM, text_to_bits, bits_to_text

# Create model
model = BitTransformerLM(
    d_model=128,
    nhead=4,
    num_layers=2,
    dim_feedforward=256,
    max_seq_len=256
)

# Convert text to bits and process
text = "Hello, world!"
bits = text_to_bits(text)
bit_tensor = torch.tensor(bits).unsqueeze(0)

# Forward pass
logits, telemetry = model(bit_tensor)
print(f"Input bits: {len(bits)}")
print(f"Output shape: {logits.shape}")
print(f"Telemetry metrics: {list(telemetry.keys())}")
```

---

## Architecture Overview

### Bit-Native Processing

Unlike traditional language models that use token embeddings, BitTransformerLM processes raw binary sequences:

- **Input**: Text → UTF-8 bytes → Bits with parity protection (9 bits per byte)
- **Processing**: Multi-head attention on bit embeddings
- **Output**: Probability distribution over the next bit (0 or 1)

### Key Innovations

#### 1. **Reversible Transformer Layers**
- Memory-efficient computation that doesn't store intermediate activations
- Enables training of deeper models with the same memory footprint
- Mathematically reversible operations for gradient computation

#### 2. **Built-in Safety Telemetry**
- **K (Negentropy)**: Measures information content vs. random noise
- **C (LZ Complexity)**: Proxy for compressibility and pattern complexity
- **S (Symbiosis)**: Alignment with reference distributions
- Real-time monitoring and safety gates

#### 3. **Dual-Mode Operation**
- **Causal Mode**: Traditional autoregressive generation
- **Diffusion Mode**: Bidirectional denoising for higher-quality output

#### 4. **Progressive Scaling**
- Dynamic architecture expansion based on validation performance
- Automatic addition of layers, width, or context length
- Curriculum learning from simple to complex patterns
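To build intuition for the K (Negentropy) metric introduced above, the sketch below computes a simple negentropy-style score: 1 minus the normalized Shannon entropy of the empirical bit distribution. This is an illustrative approximation that only looks at marginal bit frequencies; it is not BitTransformerLM's actual telemetry formula.

```python
import torch

def negentropy_score(bits: torch.Tensor) -> float:
    """Illustrative K-style score: 1 - normalized Shannon entropy.

    Returns ~0 for a balanced random bit stream and ~1 for a constant
    (perfectly ordered) stream. Captures only the marginal bit
    distribution; the library's telemetry is richer than this.
    """
    p1 = bits.float().mean().clamp(1e-6, 1 - 1e-6)  # P(bit == 1)
    p = torch.tensor([1 - p1, p1])
    entropy = -(p * p.log2()).sum()  # in bits; max is 1 for a binary source
    return float(1 - entropy)

random_bits = torch.randint(0, 2, (1024,))
ordered_bits = torch.ones(1024, dtype=torch.long)
print(f"K(random)  = {negentropy_score(random_bits):.3f}")   # near 0
print(f"K(ordered) = {negentropy_score(ordered_bits):.3f}")  # near 1
```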
---

## Core Features

### Text Processing
- **Parity-Protected Encoding**: Each byte gets a parity bit for error detection
- **UTF-8 Support**: Full Unicode text processing capability
- **Bidirectional Processing**: Support for both causal and diffusion modes

### Safety & Monitoring
- **Real-time Telemetry**: K/C/S metrics computed during inference
- **Safety Gates**: Automatic blocking of unsafe outputs
- **Metric Drift Detection**: Alerts when model behavior changes
- **Human-in-the-Loop**: Safe inference with retry mechanisms

### Memory Efficiency
- **Reversible Layers**: Significant memory savings for deep models
- **Gradient Checkpointing**: Trade compute for memory in training
- **Dynamic Quantization**: Runtime INT8 conversion for inference
- **4-bit QAT**: Quantization-aware training for extreme efficiency

### Advanced Training
- **Distributed Training**: FSDP and pipeline parallelism support
- **Mixed Precision**: FP16/BF16 optimization with CPU autocast
- **Compression Pipeline**: Run-length encoding for efficient storage
- **Progressive Curriculum**: Automatic difficulty scaling

---

## Installation & Setup

### Requirements
- Python 3.10 or later
- PyTorch 2.7.1 or later
- CUDA (optional, for GPU acceleration)

### Installation

```bash
# Clone repository
git clone https://huggingface.co/WCNegentropy/BitTransformerLM
cd BitTransformerLM

# Install dependencies
pip install -r requirements.txt

# For GPU support (optional)
pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118
```

### Quick Test

```bash
# Run basic example
python example.py

# Expected output:
# Training loss: [some value]
# Available telemetry: ['activations', 'attention_maps', ...]
```

### **🤖 Recommended: Setup with Claude Code**

For the best experience, we recommend using [Claude Code](https://claude.ai/code) to set up and work with BitTransformerLM:

1. **Open Claude Code** and navigate to your project directory
2. **Clone the repository**: Claude Code can help with git operations and dependency management
3. **Interactive Setup**: Claude Code can guide you through configuration options and explain parameters
4. **Real-time Assistance**: Get help with model architecture, training parameters, and debugging
5. **Code Generation**: Generate custom training scripts and experiments with AI assistance

Claude Code provides contextual understanding of BitTransformerLM's unique architecture and can help you avoid common pitfalls when working with bit-native processing.
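As one more smoke test before the usage examples, the snippet below round-trips text through the parity-protected bit encoding. This is a minimal sketch using the `text_to_bits`/`bits_to_text` helpers shown in Quick Start; the length check assumes the 9-bits-per-byte (8 data + 1 parity) scheme described under Core Features, and that `bits_to_text` accepts the list produced by `text_to_bits`.

```python
from bit_transformer import text_to_bits, bits_to_text

text = "BitTransformerLM"
bits = text_to_bits(text)

# Parity-protected encoding uses 9 bits per byte (8 data + 1 parity),
# so the bit count should be 9x the UTF-8 byte count
assert len(bits) == 9 * len(text.encode("utf-8"))

# Decoding should recover the original text exactly
assert bits_to_text(bits) == text
print("Round-trip OK:", len(bits), "bits")
```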
---

## Basic Usage Examples

### 1. Creating Models

```python
from bit_transformer import BitTransformerLM

# Small model for experimentation
small_model = BitTransformerLM(
    d_model=64,           # Embedding dimension
    nhead=4,              # Number of attention heads
    num_layers=2,         # Number of transformer layers
    dim_feedforward=128,  # Feedforward dimension
    max_seq_len=128,      # Maximum sequence length
    reversible=True,      # Use memory-efficient reversible layers
    use_checkpoint=True   # Enable gradient checkpointing
)

# Medium model for research
medium_model = BitTransformerLM(
    d_model=512,
    nhead=8,
    num_layers=8,
    dim_feedforward=2048,
    max_seq_len=512,
    reversible=True,
    use_checkpoint=True,
    chunk_size=64,   # Chunked attention for long sequences
    lambda_K=0.1,    # Negentropy regularization weight
    lambda_C=0.1,    # Complexity regularization weight
    lambda_S=0.1     # Symbiosis regularization weight
)
```

### 2. Text Generation

```python
from bit_transformer.bit_io import sample_text

# Generate text from a prompt
prompt = "The future of AI is"
generated = sample_text(
    model,
    prompt=prompt,
    max_new_tokens=20,  # Generate ~20 new characters
    temperature=0.8,    # Sampling temperature
    top_p=0.9           # Nucleus sampling
)
print(f"Generated: {generated}")
```

### 3. Safe Inference

```python
import torch

from bit_transformer import hil_safe_inference, text_to_bits

# Convert text to bits
text = "Hello, world!"
bits = torch.tensor(text_to_bits(text)).unsqueeze(0)

# Safe inference with telemetry monitoring
try:
    output_bits, telemetry = hil_safe_inference(
        model,
        bits,
        c_floor=0.3,  # Minimum complexity threshold
        s_floor=0.5,  # Minimum symbiosis threshold
        strict=True   # Raise an error if thresholds are not met
    )
    print("✅ Safe inference completed")
    print(f"K (Negentropy): {telemetry.get('negentropy_logits', 'N/A')}")
    print(f"C (Complexity): {telemetry.get('lz_complexity_logits', 'N/A')}")
    print(f"S (Symbiosis): {telemetry.get('symbiosis_score', 'N/A')}")
except Exception as e:
    print(f"⚠️ Safety check failed: {e}")
```

### 4. Interactive Dashboard

```bash
# Launch the interactive dashboard
python unified_workflow.py --dashboard
```

```python
# Or programmatically
from bit_transformer.dashboard_app import run_dashboard

run_dashboard(host="localhost", port=5000)
```

The dashboard provides:
- Real-time training monitoring
- Telemetry visualization
- Model configuration controls
- HuggingFace checkpoint management
- Safe inference testing interface

---

## Advanced Features

### 1. Diffusion Mode Training

Diffusion mode enables bidirectional processing for higher-quality generation:

```bash
# Train with diffusion mode
python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32

# Different noise schedules
python unified_workflow.py --diffusion --noise-schedule cosine --diffusion-steps 16

# Diffusion curriculum (noise decay over epochs)
python unified_workflow.py --diffusion --diffusion-curriculum
```

**Diffusion Parameters:**
- `--diffusion-steps`: Number of denoising steps (higher = better quality)
- `--noise-schedule`: `linear`, `cosine`, or `exp` noise decay (see the sketch below)
- `--diffusion-curriculum`: Gradually reduce noise over training epochs
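For intuition, here is a minimal sketch of what linear, cosine, and exponential noise-decay curves typically look like as a function of the denoising step. The shapes are hypothetical illustrations of the `--noise-schedule` options; BitTransformerLM's exact formulas may differ.

```python
import math

def noise_level(step: int, total_steps: int, schedule: str = "linear") -> float:
    """Illustrative noise-decay schedules (1.0 = full noise, 0.0 = clean).

    Hypothetical shapes for the --noise-schedule options; the library's
    exact formulas may differ.
    """
    t = step / max(total_steps - 1, 1)  # progress in [0, 1]
    if schedule == "linear":
        return 1.0 - t
    if schedule == "cosine":
        return 0.5 * (1 + math.cos(math.pi * t))
    if schedule == "exp":
        return math.exp(-5.0 * t)  # decay rate chosen for illustration only
    raise ValueError(f"unknown schedule: {schedule}")

for name in ("linear", "cosine", "exp"):
    levels = [noise_level(i, 8, name) for i in range(8)]
    print(name, [f"{x:.2f}" for x in levels])
```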
### 2. Progressive Scaling

Enable automatic model growth based on performance:

```python
import torch

from bit_transformer import BitTransformerLM
from bit_transformer.training import train_loop
from bit_transformer.scale import expand_model

# Training loop with automatic scaling
model = BitTransformerLM(d_model=64, nhead=4, num_layers=2, dim_feedforward=128)
train_data = torch.randint(0, 2, (1000, 64))

# Train with progressive scaling; scaling triggers automatically
# when validation loss plateaus
train_loop(
    model,
    train_data,
    epochs=10,
    batch_size=8,
)

# Manual model expansion
expanded_model = expand_model(model, strategy="depth")    # Add layers
expanded_model = expand_model(model, strategy="width")    # Increase width
expanded_model = expand_model(model, strategy="context")  # Extend context
```

### 3. Compression Pipeline

BitTransformerLM includes run-length encoding for efficient data storage:

```python
import torch

from bit_transformer import compress_bits, decompress_bits

# Compress bit sequences
original_bits = torch.tensor([0, 0, 0, 1, 1, 0, 1, 1, 1])
compressed = compress_bits(original_bits)
decompressed = decompress_bits(compressed)

print(f"Original: {original_bits}")
print(f"Compressed: {compressed}")
print(f"Decompressed: {decompressed}")
print(f"Compression ratio: {len(original_bits) / len(compressed):.2f}")

# Use compression in training
train_loop(
    model,
    data,
    compress_prob=0.5,   # 50% of training uses compressed data
    compress_warmup=100  # Start compression after 100 steps
)
```

### 4. Quantization and Optimization

```python
import torch

from bit_transformer import quantize_dynamic, prepare_qat_fx, convert_qat_fx

# Dynamic quantization for inference
quantized_model = quantize_dynamic(model, dtype=torch.qint8)

# 4-bit quantization-aware training
qat_model = prepare_qat_fx(model)
# ... train qat_model ...
final_model = convert_qat_fx(qat_model)

# Enable mixed precision and compilation
train_loop(
    model,
    data,
    amp=True,           # Enable automatic mixed precision
    compile_model=True  # Use torch.compile for speedup
)
```

---

## Training Your Own Models

### Basic Training Script

```python
import torch

from bit_transformer import BitTransformerLM, train_loop, configure_optimizer
from bit_transformer.bit_io import text_to_bits

# Prepare training data
texts = ["Hello world", "How are you?", "BitTransformer is working!"]
all_bits = []
for text in texts:
    bits = text_to_bits(text)
    all_bits.extend(bits)

# Convert to tensor and create sequences
data = torch.tensor(all_bits)
sequences = data.unfold(0, 64, 32)  # 64-bit sequences with 32-bit stride

# Create model
model = BitTransformerLM(
    d_model=128,
    nhead=8,
    num_layers=4,
    dim_feedforward=512,
    max_seq_len=64,
    reversible=True
)

# Configure optimizer
optimizer = configure_optimizer(model, lr=0.001, weight_decay=0.01)

# Training loop
train_loop(
    model,
    sequences,
    epochs=10,
    batch_size=4,
    optimizer=optimizer,
    amp=True,  # Mixed precision
    log=True   # Enable logging
)
```

### Advanced Training Configuration

```python
# Advanced training with all features enabled
train_loop(
    model,
    data,
    epochs=20,
    batch_size=8,
    accum_steps=4,              # Gradient accumulation (see the sketch below)
    amp=True,                   # Mixed precision
    compile_model=True,         # torch.compile optimization

    # Compression settings
    compress_prob=0.3,          # 30% compression probability
    compress_warmup=50,         # Start compression after 50 steps

    # Diffusion settings
    diffusion=True,             # Enable diffusion mode
    diffusion_curriculum=True,  # Decay noise over epochs

    # Direct bit training
    direct_prob=0.1,            # 10% direct bit prediction

    # Logging
    log=True                    # Enable detailed logging
)
```
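Gradient accumulation (`accum_steps` above) trades compute for memory: gradients from several micro-batches are summed before a single optimizer step, giving an effective batch of `batch_size * accum_steps`. Here is a minimal sketch of the underlying pattern in plain PyTorch; `train_loop` handles this for you, and the toy model and data are placeholders.

```python
import torch

model = torch.nn.Linear(8, 2)  # stand-in for any model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
accum_steps = 4                # effective batch = micro-batch size * accum_steps

for step, batch in enumerate(torch.randn(16, 3, 8)):  # 16 toy micro-batches
    logits = model(batch)
    loss = logits.pow(2).mean() / accum_steps  # scale so the sum averages correctly
    loss.backward()                            # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```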
### Custom Training Loop

```python
import torch
import torch.nn.functional as F

from bit_transformer.utils import set_dropout

# Manual training loop for full control
# (assumes `model` and a `data_loader` of bit-sequence batches exist)
model.train()
set_dropout(model, 0.1)  # Enable dropout for training

optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
criterion = F.cross_entropy

for epoch in range(10):
    total_loss = 0
    for batch in data_loader:
        optimizer.zero_grad()

        # Forward pass
        logits, telemetry = model(batch)

        # Compute loss
        if logits.dim() == 3:  # (batch, seq, 2)
            targets = batch[:, 1:]   # Next-bit prediction
            logits = logits[:, :-1]  # Remove last prediction
            loss = criterion(logits.reshape(-1, 2), targets.reshape(-1))
        else:
            loss = criterion(logits, batch)

        # Add telemetry regularization
        if model.lambda_K > 0:
            loss += model.lambda_K * (1 - telemetry.get('negentropy_logits', 0))
        if model.lambda_C > 0:
            loss += model.lambda_C * (1 - telemetry.get('lz_complexity_logits', 0))

        # Backward pass
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

        total_loss += loss.item()

        # Safety check
        if telemetry.get('symbiosis_score', 1.0) < 0.3:
            print("⚠️ Low symbiosis score detected")

    print(f"Epoch {epoch}: Average loss = {total_loss / len(data_loader):.4f}")
```

---

## Safety and Monitoring

### Telemetry Metrics

BitTransformerLM provides three key safety metrics:

#### K (Negentropy) - Information Content
- **Range**: 0-1 (0 = random noise, 1 = perfectly ordered)
- **Purpose**: Measures departure from randomness
- **Interpretation**:
  - Very low K (< 0.1): Output is noise-like
  - Moderate K (0.3-0.7): Structured but varied output
  - Very high K (> 0.9): Repetitive or overly structured

#### C (LZ Complexity) - Pattern Complexity
- **Range**: 0-1 (higher = more complex patterns)
- **Purpose**: Proxy for Lempel-Ziv compressibility
- **Interpretation**:
  - Low C (< 0.3): Highly repetitive patterns
  - Moderate C (0.3-0.7): Balanced complexity
  - High C (> 0.8): Complex, varied patterns

#### S (Symbiosis) - Distribution Alignment
- **Range**: 0-1 (higher = better alignment)
- **Purpose**: Agreement with reference distributions via KL divergence
- **Interpretation**:
  - Low S (< 0.3): Poor alignment with expected patterns
  - Moderate S (0.5-0.8): Good alignment
  - High S (> 0.8): Excellent alignment

### Safety Gates

```python
from bit_transformer.safety import SafetyGate, safe_sample_with_retry

# Configure safety gate
gate = SafetyGate(
    c_floor=0.3,  # Minimum complexity
    s_floor=0.5,  # Minimum symbiosis
    decay=0.9,    # EMA decay factor
    burn_in=10    # Steps before gating starts
)

# Check whether output should be blocked
should_block = gate.should_trigger(c_val=0.2, s_val=0.4)  # True - below thresholds

# Safe sampling with automatic retry
output = safe_sample_with_retry(
    model,
    input_bits,
    max_retries=3,
    retry_strategy="diffusion"  # Try diffusion mode on failure
)
```
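The gating mechanism is easier to reason about with a toy model. The sketch below shows one plausible way an EMA-based gate can work: each metric is smoothed with an exponential moving average, and the gate only fires once the burn-in period has passed and a smoothed value sits below its floor. This illustrates the idea behind the `decay` and `burn_in` parameters; it is not `SafetyGate`'s actual implementation.

```python
class ToyEMAGate:
    """Illustrative EMA-smoothed safety gate (not the library's SafetyGate)."""

    def __init__(self, c_floor=0.3, s_floor=0.5, decay=0.9, burn_in=10):
        self.c_floor, self.s_floor = c_floor, s_floor
        self.decay = decay
        self.burn_in = burn_in
        self.step = 0
        self.c_ema = None
        self.s_ema = None

    def _update(self, ema, value):
        # Standard EMA: old values fade at rate `decay`
        return value if ema is None else self.decay * ema + (1 - self.decay) * value

    def should_trigger(self, c_val: float, s_val: float) -> bool:
        self.c_ema = self._update(self.c_ema, c_val)
        self.s_ema = self._update(self.s_ema, s_val)
        self.step += 1
        if self.step <= self.burn_in:
            return False  # never gate during burn-in
        return self.c_ema < self.c_floor or self.s_ema < self.s_floor

gate = ToyEMAGate(decay=0.5, burn_in=1)
for c, s in [(0.6, 0.7), (0.2, 0.3), (0.1, 0.2), (0.1, 0.2)]:
    print(gate.should_trigger(c, s))  # False, False, True, True
```

Note how a single bad reading does not trip the gate; the EMA has to be dragged below the floor by sustained low metrics, which makes the gate robust to one-off noise.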
### Metric Drift Detection

```python
from bit_transformer.telemetry import detect_metric_drift

# Monitor metric stability over time
metrics_history = [
    {"K": 0.5, "C": 0.6, "S": 0.7},
    {"K": 0.52, "C": 0.58, "S": 0.69},
    {"K": 0.8, "C": 0.9, "S": 0.4},  # Drift detected!
    # ... more metrics
]

drift_detected = detect_metric_drift(
    metrics_history,
    window=10,     # Look back 10 steps
    threshold=0.2  # Alert if change > 0.2
)

if drift_detected:
    print("⚠️ Model behavior drift detected!")
```

---

## Distributed Training

### FSDP (Fully Sharded Data Parallel)

```python
from bit_transformer.distributed import wrap_fsdp, setup_distributed

# Initialize distributed training
setup_distributed(rank=0, world_size=4)

# Wrap model with FSDP
model = BitTransformerLM(d_model=1024, nhead=16, num_layers=12)
fsdp_model = wrap_fsdp(
    model,
    sharding_strategy="FULL_SHARD",  # or "SHARD_GRAD_OP", "NO_SHARD"
    mixed_precision=True,
    device_id=0
)

# Train with FSDP
train_loop(
    fsdp_model,
    data,
    epochs=10,
    batch_size=2,  # Smaller batch per GPU
    amp=True
)
```

### Pipeline Parallelism

```python
from bit_transformer.distributed import make_pipeline

# Create pipeline parallel model
pipeline_model = make_pipeline(
    model,
    balance=[2, 2, 2, 2],  # Split 8 layers across 4 GPUs
    devices=[0, 1, 2, 3],
    checkpoint="never"     # or "always", "except_last"
)

# Pipeline training requires special handling
# See unified_workflow.py for the complete implementation
```

### Multi-GPU Training Script

```bash
# Single node, multiple GPUs
python -m torch.distributed.launch \
    --nproc_per_node=4 \
    unified_workflow.py \
    --distributed \
    --batch-size 2 \
    --epochs 10

# Multiple nodes
python -m torch.distributed.launch \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="192.168.1.100" \
    --master_port=29500 \
    --nproc_per_node=4 \
    unified_workflow.py \
    --distributed
```

---

## Performance Optimization

### Memory Optimization

```python
# Enable all memory optimizations
model = BitTransformerLM(
    d_model=512,
    nhead=8,
    num_layers=8,
    reversible=True,         # Reversible layers save ~50% memory
    use_checkpoint=True,     # Gradient checkpointing
    chunk_size=64,           # Chunked attention for long sequences
    full_attn_logging=False  # Skip full attention reconstruction
)

# Training optimizations
train_loop(
    model,
    data,
    batch_size=4,       # Smaller batches
    accum_steps=8,      # Gradient accumulation
    amp=True,           # Mixed precision
    compile_model=True  # torch.compile
)
```

### CPU Optimization

```python
from bit_transformer.torch_utils import cpu_autocast

# Enable BF16 on CPU
with cpu_autocast():
    logits, telemetry = model(bits)

# Or enable for the entire model
model = BitTransformerLM(use_autocast=True)  # Automatically uses CPU BF16
```

### Inference Optimization

```python
import torch

from bit_transformer import quantize_dynamic
from bit_transformer.utils import set_dropout

# Switch to evaluation mode
model.eval()
set_dropout(model, 0.0)

# Dynamic quantization
quantized = quantize_dynamic(model, dtype=torch.qint8)

# Optimize for inference
with torch.no_grad():
    logits, _ = quantized(input_bits)
```

### Long Sequence Processing

```python
import torch

from bit_transformer import text_to_bits
from bit_transformer.model import infer_long_sequence

# Process sequences longer than max_seq_len
long_text = "Very long text..." * 1000
bits = text_to_bits(long_text)

output = infer_long_sequence(
    model,
    torch.tensor(bits).unsqueeze(0),
    chunk_size=256,  # Process in 256-bit chunks
    overlap=32,      # 32-bit overlap between chunks
    stride=224       # 224-bit stride (256 - 32)
)
```
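The chunking arithmetic above (stride = chunk_size - overlap) is worth seeing concretely. The sketch below computes the window boundaries an overlapped chunker visits; it illustrates the windowing scheme, not `infer_long_sequence`'s internals.

```python
def chunk_windows(total_len: int, chunk_size: int = 256, overlap: int = 32):
    """Yield (start, end) windows with stride = chunk_size - overlap."""
    stride = chunk_size - overlap
    start = 0
    while start < total_len:
        yield start, min(start + chunk_size, total_len)
        if start + chunk_size >= total_len:
            break
        start += stride

# A 600-bit sequence with 256-bit chunks and 32-bit overlap:
print(list(chunk_windows(600)))
# [(0, 256), (224, 480), (448, 600)]
```

The 32-bit overlap means each chunk sees the tail of its predecessor, which helps the model stitch predictions together across chunk boundaries.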
---

## Troubleshooting

### Common Issues

#### 1. **Memory Errors**

```
RuntimeError: CUDA out of memory
```

**Solutions:**
- Enable reversible layers: `reversible=True`
- Enable gradient checkpointing: `use_checkpoint=True`
- Reduce batch size or use gradient accumulation
- Use chunked attention: `chunk_size=64`
- Enable mixed precision: `amp=True`

#### 2. **Tensor Shape Mismatches**

```
RuntimeError: view size is not compatible with input tensor's size
```

**Solutions:**
- Always use `.reshape()` instead of `.view()` with BitTransformerLM
- Check that input sequences are properly formatted (1D for bits)
- Ensure batch dimensions are consistent

#### 3. **Parity Check Failures**

```
ValueError: Parity check failed
```

**Solutions:**
- Use `enforce_parity()` to fix parity bits in generated sequences
- Check that text encoding/decoding is consistent
- Verify bit sequences have the correct 9-bit (8 + parity) structure

#### 4. **Safety Gate Triggering**

```
SafetyError: Output blocked by safety gate
```

**Solutions:**
- Lower safety thresholds: `c_floor=0.2, s_floor=0.4`
- Increase the burn-in period: `burn_in=20`
- Use retry with diffusion: `safe_sample_with_retry()`
- Check model training quality

### Debug Mode

```python
import logging

import torch

from bit_transformer import BitTransformerLM

# Enable detailed logging
logging.basicConfig(level=logging.DEBUG)

# Model with debug telemetry
model = BitTransformerLM(
    d_model=64,
    nhead=4,
    num_layers=2,
    full_attn_logging=True,  # Log full attention maps
    chunk_size=None          # Disable chunking for debugging
)

# Inspect telemetry
logits, telemetry = model(input_bits)
print("Telemetry keys:", list(telemetry.keys()))
print("Attention maps shape:", [a.shape for a in telemetry['attention_maps']])
activations = torch.stack(telemetry['activations'])
print(f"Activation stats: mean={activations.mean():.4f}, std={activations.std():.4f}")
```

### Performance Profiling

```python
import torch.profiler
import torch.nn.functional as F

# Profile a training step (assumes input_bits and targets are defined)
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_stack=True,
) as prof:
    logits, telemetry = model(input_bits)
    loss = F.cross_entropy(logits.reshape(-1, 2), targets.reshape(-1))
    loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total"))
```

---

## Best Practices

### Model Configuration

#### For Experimentation (< 1M parameters)

```python
model = BitTransformerLM(
    d_model=64,
    nhead=4,
    num_layers=2,
    dim_feedforward=128,
    max_seq_len=128,
    reversible=False,  # Simpler for debugging
    use_checkpoint=False
)
```

#### For Research (1M-100M parameters)

```python
model = BitTransformerLM(
    d_model=256,
    nhead=8,
    num_layers=6,
    dim_feedforward=1024,
    max_seq_len=512,
    reversible=True,  # Enable memory efficiency
    use_checkpoint=True,
    chunk_size=128,
    lambda_K=0.05,  # Light regularization
    lambda_C=0.05,
    lambda_S=0.05
)
```

#### For Large-Scale (100M+ parameters)

```python
model = BitTransformerLM(
    d_model=1024,
    nhead=16,
    num_layers=20,
    dim_feedforward=4096,
    max_seq_len=2048,
    reversible=True,
    use_checkpoint=True,
    chunk_size=256,
    full_attn_logging=False,  # Save memory
    lambda_K=0.1,
    lambda_C=0.1,
    lambda_S=0.1
)
```

### Training Best Practices

1. **Always validate on held-out data** to monitor overfitting
2. **Use gradient clipping** to prevent training instability
3. **Monitor telemetry metrics** for signs of model degradation
4. **Start with smaller models** before scaling up
5. **Use safety gates** in production deployments
6. **Enable logging** to track training progress
7. **Save checkpoints frequently** to prevent loss of progress (see the sketch below)
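For point 7 above, here is a minimal checkpointing pattern in plain PyTorch. The helper names, path, and save cadence are placeholders to adapt to your setup; the custom training loop shown earlier already covers gradient clipping (point 2).

```python
import torch

def save_checkpoint(model, optimizer, epoch: int, path: str) -> None:
    """Save model + optimizer state so training can resume after a crash."""
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path: str) -> int:
    """Restore a checkpoint; returns the epoch to resume from."""
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1

# Usage inside a training loop (names are placeholders):
# save_checkpoint(model, optimizer, epoch, f"checkpoints/bitlm_{epoch}.pt")
```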
### Data Preparation

```python
import torch

from bit_transformer.bit_io import text_to_bits

# Good: Clean, well-formatted text
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is transforming technology.",
    "BitTransformer processes information at the bit level."
]

# Convert to training sequences
all_bits = []
for text in texts:
    bits = text_to_bits(text)
    all_bits.extend(bits)

# Create overlapping sequences for better learning
data = torch.tensor(all_bits)
seq_len = 128
stride = 64

sequences = []
for i in range(0, len(data) - seq_len, stride):
    sequences.append(data[i:i + seq_len])

training_data = torch.stack(sequences)
```

### Production Deployment

```python
import logging

import torch

from bit_transformer import quantize_dynamic, text_to_bits
from bit_transformer.safety import SafetyGate, safe_sample_with_retry
from bit_transformer.utils import set_dropout

# Production-ready model setup
model.eval()             # Disable dropout
set_dropout(model, 0.0)

# Enable safety monitoring
gate = SafetyGate(c_floor=0.3, s_floor=0.5, burn_in=5)

# Quantize for efficiency
production_model = quantize_dynamic(model)

# Safe inference with monitoring
def safe_generate(input_text, max_length=100):
    try:
        input_bits = torch.tensor(text_to_bits(input_text)).unsqueeze(0)
        return safe_sample_with_retry(
            production_model,
            input_bits,
            max_retries=3
        )
    except Exception as e:
        logging.error(f"Generation failed: {e}")
        return "Error: Unable to generate safe output"
```

---

## Getting Help

### Documentation Resources
- **ABOUTME.md**: Project overview and quick start
- **README.md**: Professional model card and specifications
- **RESEARCH_STATUS.md**: Current research status and limitations
- **EMPIRICAL_VALIDATION.md**: Evidence-based analysis of capabilities

### Community Support
- **GitHub Issues**: Report bugs and request features
- **Discussions**: Ask questions and share experiences
- **Examples**: Check the `tests/` directory for usage examples

### **🤖 Recommended: Use with Claude Code**

For the best experience with BitTransformerLM, we recommend using [Claude Code](https://claude.ai/code):

- **Interactive Setup**: Get step-by-step guidance for configuration
- **Real-time Debugging**: Immediate help when things go wrong
- **Code Generation**: Custom scripts and experiments tailored to your needs
- **Architecture Explanation**: Deep understanding of bit-native processing
- **Best Practices**: Learn optimal configurations for your use case

Claude Code understands BitTransformerLM's unique architecture and can help you navigate the complexities of bit-level language modeling.

---

**Remember: BitTransformerLM is experimental research software. Always validate results thoroughly and use safety monitoring in any deployment.**

Happy experimenting! 🤖✨