Commit 58b962e (verified) · 1 parent: cd203a2
WCNegentropy committed

Add comprehensive user handbook

Files changed (1): USER_GUIDE.md (+957, -0)
USER_GUIDE.md ADDED
@@ -0,0 +1,957 @@
# BitTransformerLM User Guide

**Version:** 0.1.0 Experimental
**Last Updated:** August 2025
**Recommended Setup:** Use with [Claude Code](https://claude.ai/code) for optimal experience

## Table of Contents

1. [Quick Start](#quick-start)
2. [Architecture Overview](#architecture-overview)
3. [Core Features](#core-features)
4. [Installation & Setup](#installation--setup)
5. [Basic Usage Examples](#basic-usage-examples)
6. [Advanced Features](#advanced-features)
7. [Training Your Own Models](#training-your-own-models)
8. [Safety and Monitoring](#safety-and-monitoring)
9. [Distributed Training](#distributed-training)
10. [Performance Optimization](#performance-optimization)
11. [Troubleshooting](#troubleshooting)
12. [Best Practices](#best-practices)

---

## Quick Start

BitTransformerLM is an experimental transformer language model that operates directly on binary sequences (bits) rather than tokens. This unique approach enables fine-grained control over information processing and built-in safety monitoring.

### Minimal Example
```python
from bit_transformer import BitTransformerLM, example_training_step

# Run basic example
loss, telemetry = example_training_step()
print(f"Training loss: {loss}")
print(f"Available telemetry: {list(telemetry.keys())}")
```

### Text Processing Example
```python
import torch

from bit_transformer import BitTransformerLM, text_to_bits, bits_to_text

# Create model
model = BitTransformerLM(
    d_model=128,
    nhead=4,
    num_layers=2,
    dim_feedforward=256,
    max_seq_len=256
)

# Convert text to bits and process
text = "Hello, world!"
bits = text_to_bits(text)
bit_tensor = torch.tensor(bits).unsqueeze(0)

# Forward pass
logits, telemetry = model(bit_tensor)
print(f"Input bits: {len(bits)}")
print(f"Output shape: {logits.shape}")
print(f"Telemetry metrics: {list(telemetry.keys())}")
```

---

## Architecture Overview

### Bit-Native Processing
Unlike traditional language models that use token embeddings, BitTransformerLM processes raw binary sequences:

- **Input**: Text → UTF-8 bytes → bits with parity protection (9 bits per byte; see the sketch below)
- **Processing**: Multi-head attention on bit embeddings
- **Output**: Probability distribution over the next bit (0 or 1)

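The parity scheme above can be pictured with a small, self-contained sketch. It is illustrative only: it assumes an MSB-first data layout with a trailing even-parity bit, which may not match the exact ordering used by `text_to_bits` / `bits_to_text`. Treat the library functions as the source of truth.

```python
def encode_with_parity(text: str) -> list[int]:
    """Illustrative 9-bit encoding: 8 data bits (MSB first) + 1 even-parity bit per byte."""
    out = []
    for byte in text.encode("utf-8"):
        data = [(byte >> shift) & 1 for shift in range(7, -1, -1)]
        parity = sum(data) % 2
        out.extend(data + [parity])
    return out

def decode_with_parity(bits: list[int]) -> str:
    """Inverse of the sketch above, raising on a failed parity check."""
    assert len(bits) % 9 == 0, "expected 9 bits per byte"
    raw = bytearray()
    for i in range(0, len(bits), 9):
        data, parity = bits[i:i + 8], bits[i + 8]
        if sum(data) % 2 != parity:
            raise ValueError(f"Parity check failed for byte starting at bit {i}")
        raw.append(int("".join(map(str, data)), 2))
    return raw.decode("utf-8")

bits = encode_with_parity("Hi")
print(len(bits))                 # 18 bits = 2 bytes x 9 bits
print(decode_with_parity(bits))  # "Hi"
```
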
### Key Innovations

#### 1. **Reversible Transformer Layers**
- Memory-efficient computation that doesn't store intermediate activations
- Enables training deeper models within the same memory footprint
- Mathematically reversible operations for gradient computation (sketched below)

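The memory saving behind reversible layers comes from the standard additive-coupling trick: a layer's inputs can be recomputed exactly from its outputs during the backward pass, so activations need not be cached. Below is a minimal, generic sketch of that idea, not the library's actual layer implementation:

```python
import torch
import torch.nn as nn

class ReversibleCouplingBlock(nn.Module):
    """Toy additive coupling: y1 = x1 + F(x2), y2 = x2 + G(y1)."""

    def __init__(self, dim: int):
        super().__init__()
        self.F = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.G = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Recover the inputs exactly, so activations need not be stored.
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2

block = ReversibleCouplingBlock(dim=16)
x1, x2 = torch.randn(2, 8, 16), torch.randn(2, 8, 16)
y1, y2 = block(x1, x2)
r1, r2 = block.inverse(y1, y2)
print(torch.allclose(r1, x1, atol=1e-6), torch.allclose(r2, x2, atol=1e-6))
```

Recomputing instead of storing activations is also why `reversible=True` is paired with `use_checkpoint=True` in the example configurations later in this guide.
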
#### 2. **Built-in Safety Telemetry**
- **K (Negentropy)**: Measures information content vs random noise
- **C (LZ Complexity)**: Proxy for compressibility and pattern complexity
- **S (Symbiosis)**: Alignment with reference distributions
- Real-time monitoring and safety gates

#### 3. **Dual-Mode Operation**
- **Causal Mode**: Traditional autoregressive generation
- **Diffusion Mode**: Bidirectional denoising for higher-quality output (the masking difference is sketched below)

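The practical difference between the two modes is which positions each bit may attend to. The following is a generic PyTorch illustration, not BitTransformerLM internals: causal mode applies an upper-triangular mask so position *i* only sees positions up to *i*, while a bidirectional denoising pass leaves attention unmasked.

```python
import torch

seq_len = 6

# Causal (autoregressive) mask: True marks positions that must NOT be attended to.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Bidirectional (denoising-style) pass: no positions are masked out.
bidirectional_mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)

print(causal_mask.int())
print(bidirectional_mask.int())
```
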
#### 4. **Progressive Scaling**
- Dynamic architecture expansion based on validation performance
- Automatic addition of layers, width, or context length
- Curriculum learning from simple to complex patterns

---

## Core Features

### Text Processing
- **Parity-Protected Encoding**: Each byte gets a parity bit for error detection
- **UTF-8 Support**: Full Unicode text processing capability
- **Bidirectional Processing**: Support for both causal and diffusion modes

### Safety & Monitoring
- **Real-time Telemetry**: K/C/S metrics computed during inference
- **Safety Gates**: Automatic blocking of unsafe outputs
- **Metric Drift Detection**: Alerts when model behavior changes
- **Human-in-the-Loop**: Safe inference with retry mechanisms

### Memory Efficiency
- **Reversible Layers**: Significant memory savings for deep models
- **Gradient Checkpointing**: Trade compute for memory in training
- **Dynamic Quantization**: Runtime INT8 conversion for inference
- **4-bit QAT**: Quantization-aware training for extreme efficiency

### Advanced Training
- **Distributed Training**: FSDP and pipeline parallelism support
- **Mixed Precision**: FP16/BF16 optimization with CPU autocast
- **Compression Pipeline**: Run-length encoding for efficient storage
- **Progressive Curriculum**: Automatic difficulty scaling

---

## Installation & Setup

### Requirements
- Python 3.10 or later
- PyTorch 2.7.1 or later
- CUDA (optional, for GPU acceleration)

### Installation
```bash
# Clone repository
git clone https://huggingface.co/WCNegentropy/BitTransformerLM
cd BitTransformerLM

# Install dependencies
pip install -r requirements.txt

# For GPU support (optional)
pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118
```

### Quick Test
```bash
# Run basic example
python example.py

# Expected output:
# Training loss: [some value]
# Available telemetry: ['activations', 'attention_maps', ...]
```

### **🤖 Recommended: Setup with Claude Code**

For the best experience, we recommend using [Claude Code](https://claude.ai/code) to set up and work with BitTransformerLM:

1. **Open Claude Code** and navigate to your project directory
2. **Clone the repository**: Claude Code can help with git operations and dependency management
3. **Interactive Setup**: Claude Code can guide you through configuration options and explain parameters
4. **Real-time Assistance**: Get help with model architecture, training parameters, and debugging
5. **Code Generation**: Generate custom training scripts and experiments with AI assistance

Claude Code provides contextual understanding of BitTransformerLM's unique architecture and can help you avoid common pitfalls when working with bit-native processing.

---

## Basic Usage Examples

### 1. Creating Models

```python
from bit_transformer import BitTransformerLM

# Small model for experimentation
small_model = BitTransformerLM(
    d_model=64,            # Embedding dimension
    nhead=4,               # Number of attention heads
    num_layers=2,          # Number of transformer layers
    dim_feedforward=128,   # Feedforward dimension
    max_seq_len=128,       # Maximum sequence length
    reversible=True,       # Use memory-efficient reversible layers
    use_checkpoint=True    # Enable gradient checkpointing
)

# Medium model for research
medium_model = BitTransformerLM(
    d_model=512,
    nhead=8,
    num_layers=8,
    dim_feedforward=2048,
    max_seq_len=512,
    reversible=True,
    use_checkpoint=True,
    chunk_size=64,   # Chunked attention for long sequences
    lambda_K=0.1,    # Negentropy regularization weight
    lambda_C=0.1,    # Complexity regularization weight
    lambda_S=0.1     # Symbiosis regularization weight
)
```

### 2. Text Generation

```python
from bit_transformer.bit_io import sample_text

# Generate text from prompt
prompt = "The future of AI is"
generated = sample_text(
    model,
    prompt=prompt,
    max_new_tokens=20,   # Generate ~20 new characters
    temperature=0.8,     # Sampling temperature
    top_p=0.9            # Nucleus sampling
)
print(f"Generated: {generated}")
```

### 3. Safe Inference

```python
from bit_transformer import hil_safe_inference, text_to_bits
import torch

# Convert text to bits
text = "Hello, world!"
bits = torch.tensor(text_to_bits(text)).unsqueeze(0)

# Safe inference with telemetry monitoring
try:
    output_bits, telemetry = hil_safe_inference(
        model,
        bits,
        c_floor=0.3,   # Minimum complexity threshold
        s_floor=0.5,   # Minimum symbiosis threshold
        strict=True    # Throw error if thresholds not met
    )
    print("✅ Safe inference completed")
    print(f"K (Negentropy): {telemetry.get('negentropy_logits', 'N/A')}")
    print(f"C (Complexity): {telemetry.get('lz_complexity_logits', 'N/A')}")
    print(f"S (Symbiosis): {telemetry.get('symbiosis_score', 'N/A')}")
except Exception as e:
    print(f"⚠️ Safety check failed: {e}")
```

### 4. Interactive Dashboard

```bash
# Launch the interactive dashboard
python unified_workflow.py --dashboard
```

```python
# Or launch it programmatically
from bit_transformer.dashboard_app import run_dashboard

run_dashboard(host="localhost", port=5000)
```

The dashboard provides:
- Real-time training monitoring
- Telemetry visualization
- Model configuration controls
- HuggingFace checkpoint management
- Safe inference testing interface

---

## Advanced Features

### 1. Diffusion Mode Training

Diffusion mode enables bidirectional processing for higher-quality generation:

```bash
# Train with diffusion mode
python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32

# Different noise schedules
python unified_workflow.py --diffusion --noise-schedule cosine --diffusion-steps 16

# Diffusion curriculum (noise decay over epochs)
python unified_workflow.py --diffusion --diffusion-curriculum
```

**Diffusion Parameters:**
- `--diffusion-steps`: Number of denoising steps (higher = better quality)
- `--noise-schedule`: `linear`, `cosine`, or `exp` noise decay (compared in the sketch below)
- `--diffusion-curriculum`: Gradually reduce noise over training epochs

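The three schedule names map onto familiar decay curves. The sketch below is a generic illustration of linear, cosine, and exponential noise decay over the denoising steps; the exact formulas and constants used by `unified_workflow.py` are assumptions here, not the project's definitions.

```python
import math

def noise_level(schedule: str, step: int, total_steps: int) -> float:
    """Illustrative noise fraction in [0, 1] for denoising step `step` (0-based)."""
    t = step / max(total_steps - 1, 1)  # progress through the schedule, 0 -> 1
    if schedule == "linear":
        return 1.0 - t                              # straight-line decay
    if schedule == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * t))  # slow start, slow finish
    if schedule == "exp":
        return math.exp(-5.0 * t)                   # fast early decay, long tail
    raise ValueError(f"unknown schedule: {schedule}")

for name in ("linear", "cosine", "exp"):
    levels = [round(noise_level(name, s, 8), 3) for s in range(8)]
    print(f"{name:>6}: {levels}")
```
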
### 2. Progressive Scaling

Enable automatic model growth based on performance:

```python
import torch

from bit_transformer import BitTransformerLM
from bit_transformer.training import train_loop
from bit_transformer.scale import expand_model

# Training loop with automatic scaling
model = BitTransformerLM(d_model=64, nhead=4, num_layers=2, dim_feedforward=128)
train_data = torch.randint(0, 2, (1000, 64))

# Train with progressive scaling
train_loop(
    model,
    train_data,
    epochs=10,
    batch_size=8,
    # Progressive scaling will automatically trigger when validation loss plateaus
)

# Manual model expansion
expanded_model = expand_model(model, strategy="depth")    # Add layers
expanded_model = expand_model(model, strategy="width")    # Increase width
expanded_model = expand_model(model, strategy="context")  # Extend context
```

### 3. Compression Pipeline

BitTransformerLM includes run-length encoding for efficient data storage:

```python
from bit_transformer import compress_bits, decompress_bits

# Compress bit sequences
original_bits = torch.tensor([0, 0, 0, 1, 1, 0, 1, 1, 1])
compressed = compress_bits(original_bits)
decompressed = decompress_bits(compressed)

print(f"Original: {original_bits}")
print(f"Compressed: {compressed}")
print(f"Decompressed: {decompressed}")
print(f"Compression ratio: {len(original_bits) / len(compressed):.2f}")

# Use compression in training
train_loop(
    model,
    data,
    compress_prob=0.5,    # 50% of training uses compressed data
    compress_warmup=100   # Start compression after 100 steps
)
```

### 4. Quantization and Optimization

```python
import torch

from bit_transformer import quantize_dynamic, prepare_qat_fx, convert_qat_fx

# Dynamic quantization for inference
quantized_model = quantize_dynamic(model, dtype=torch.qint8)

# 4-bit quantization-aware training
qat_model = prepare_qat_fx(model)
# ... train qat_model ...
final_model = convert_qat_fx(qat_model)

# Enable mixed precision and compilation
train_loop(
    model,
    data,
    amp=True,            # Enable automatic mixed precision
    compile_model=True   # Use torch.compile for speedup
)
```

---

## Training Your Own Models

### Basic Training Script

```python
import torch
from bit_transformer import BitTransformerLM, train_loop, configure_optimizer
from bit_transformer.bit_io import text_to_bits

# Prepare training data
texts = ["Hello world", "How are you?", "BitTransformer is working!"]
all_bits = []
for text in texts:
    bits = text_to_bits(text)
    all_bits.extend(bits)

# Convert to tensor and create sequences
data = torch.tensor(all_bits)
sequences = data.unfold(0, 64, 32)  # 64-bit sequences with 32-bit stride

# Create model
model = BitTransformerLM(
    d_model=128,
    nhead=8,
    num_layers=4,
    dim_feedforward=512,
    max_seq_len=64,
    reversible=True
)

# Configure optimizer
optimizer = configure_optimizer(model, lr=0.001, weight_decay=0.01)

# Training loop
train_loop(
    model,
    sequences,
    epochs=10,
    batch_size=4,
    optimizer=optimizer,
    amp=True,   # Mixed precision
    log=True    # Enable logging
)
```

### Advanced Training Configuration

```python
# Advanced training with all features enabled
train_loop(
    model,
    data,
    epochs=20,
    batch_size=8,
    accum_steps=4,        # Gradient accumulation
    amp=True,             # Mixed precision
    compile_model=True,   # torch.compile optimization

    # Compression settings
    compress_prob=0.3,    # 30% compression probability
    compress_warmup=50,   # Start compression after 50 steps

    # Diffusion settings
    diffusion=True,              # Enable diffusion mode
    diffusion_curriculum=True,   # Decay noise over epochs

    # Direct bit training
    direct_prob=0.1,      # 10% direct bit prediction

    # Logging
    log=True              # Enable detailed logging
)
```

### Custom Training Loop

```python
import torch
import torch.nn.functional as F
from bit_transformer.utils import set_dropout

# Manual training loop for full control
model.train()
set_dropout(model, 0.1)  # Enable dropout for training

optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
criterion = F.cross_entropy

for epoch in range(10):
    total_loss = 0
    for batch in data_loader:
        optimizer.zero_grad()

        # Forward pass
        logits, telemetry = model(batch)

        # Compute loss
        if logits.dim() == 3:  # (batch, seq, 2)
            targets = batch[:, 1:]   # Next bit prediction
            logits = logits[:, :-1]  # Remove last prediction
            loss = criterion(logits.reshape(-1, 2), targets.reshape(-1))
        else:
            loss = criterion(logits, batch)

        # Add telemetry regularization
        if model.lambda_K > 0:
            loss += model.lambda_K * (1 - telemetry.get('negentropy_logits', 0))
        if model.lambda_C > 0:
            loss += model.lambda_C * (1 - telemetry.get('lz_complexity_logits', 0))

        # Backward pass
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        total_loss += loss.item()

        # Safety check
        if telemetry.get('symbiosis_score', 1.0) < 0.3:
            print("⚠️ Low symbiosis score detected")

    print(f"Epoch {epoch}: Average loss = {total_loss / len(data_loader):.4f}")
```

---

## Safety and Monitoring

### Telemetry Metrics

BitTransformerLM provides three key safety metrics (a toy illustration of how such scores can be approximated follows the descriptions below):

#### K (Negentropy) - Information Content
- **Range**: 0-1 (0 = random noise, 1 = perfectly ordered)
- **Purpose**: Measures departure from randomness
- **Interpretation**:
  - Very low K (< 0.1): Output is noise-like
  - Moderate K (0.3-0.7): Structured but varied output
  - Very high K (> 0.9): Repetitive or overly structured

#### C (LZ Complexity) - Pattern Complexity
- **Range**: 0-1 (higher = more complex patterns)
- **Purpose**: Proxy for Lempel-Ziv compressibility
- **Interpretation**:
  - Low C (< 0.3): Highly repetitive patterns
  - Moderate C (0.3-0.7): Balanced complexity
  - High C (> 0.8): Complex, varied patterns

#### S (Symbiosis) - Distribution Alignment
- **Range**: 0-1 (higher = better alignment)
- **Purpose**: Agreement with reference distributions via KL divergence
- **Interpretation**:
  - Low S (< 0.3): Poor alignment with expected patterns
  - Moderate S (0.5-0.8): Good alignment
  - High S (> 0.8): Excellent alignment

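For intuition about the first two metrics, here is a rough, self-contained approximation over a plain bit list: a negentropy-style score from the Shannon entropy of the bit distribution, and a complexity score from the fraction of possible fixed-length windows that actually occur. These are toy stand-ins with arbitrary normalizations, not the formulas implemented in `bit_transformer.telemetry`.

```python
import math
import random

def negentropy_score(bits: list[int]) -> float:
    """1 - H(p)/1 bit, where p is the fraction of ones: 0 for fair-coin noise, 1 for constant output."""
    p = sum(bits) / len(bits)
    if p in (0.0, 1.0):
        return 1.0
    entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return 1.0 - entropy

def lz_like_complexity(bits: list[int], n: int = 8) -> float:
    """Fraction of possible length-n windows that actually appear (crude compressibility proxy)."""
    windows = [tuple(bits[i:i + n]) for i in range(len(bits) - n + 1)]
    return len(set(windows)) / min(len(windows), 2 ** n)

random.seed(0)
noisy = [random.randint(0, 1) for _ in range(512)]
ordered = [1] * 512

print(round(negentropy_score(noisy), 3), round(lz_like_complexity(noisy), 3))      # noise: K near 0, C high
print(round(negentropy_score(ordered), 3), round(lz_like_complexity(ordered), 3))  # constant: K = 1, C near 0
```

Note that the real metrics are reported on model outputs (for example the `negentropy_logits` and `lz_complexity_logits` telemetry keys), not on raw input bits.
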
### Safety Gates

```python
from bit_transformer.safety import SafetyGate, safe_sample_with_retry

# Configure safety gate
gate = SafetyGate(
    c_floor=0.3,   # Minimum complexity
    s_floor=0.5,   # Minimum symbiosis
    decay=0.9,     # EMA decay factor
    burn_in=10     # Steps before gating starts
)

# Check if output should be blocked
should_block = gate.should_trigger(c_val=0.2, s_val=0.4)  # True - below thresholds

# Safe sampling with automatic retry
output = safe_sample_with_retry(
    model,
    input_bits,
    max_retries=3,
    retry_strategy="diffusion"   # Try diffusion mode on failure
)
```

### Metric Drift Detection

```python
from bit_transformer.telemetry import detect_metric_drift

# Monitor metric stability over time
metrics_history = [
    {"K": 0.5, "C": 0.6, "S": 0.7},
    {"K": 0.52, "C": 0.58, "S": 0.69},
    {"K": 0.8, "C": 0.9, "S": 0.4},  # Drift detected!
    # ... more metrics
]

drift_detected = detect_metric_drift(
    metrics_history,
    window=10,      # Look back 10 steps
    threshold=0.2   # Alert if change > 0.2
)

if drift_detected:
    print("⚠️ Model behavior drift detected!")
```

---

## Distributed Training

### FSDP (Fully Sharded Data Parallel)

```python
from bit_transformer.distributed import wrap_fsdp, setup_distributed
import torch.distributed as dist

# Initialize distributed training
setup_distributed(rank=0, world_size=4)

# Wrap model with FSDP
model = BitTransformerLM(d_model=1024, nhead=16, num_layers=12)
fsdp_model = wrap_fsdp(
    model,
    sharding_strategy="FULL_SHARD",   # or "SHARD_GRAD_OP", "NO_SHARD"
    mixed_precision=True,
    device_id=0
)

# Train with FSDP
train_loop(
    fsdp_model,
    data,
    epochs=10,
    batch_size=2,   # Smaller batch per GPU
    amp=True
)
```

### Pipeline Parallelism

```python
from bit_transformer.distributed import make_pipeline

# Create pipeline parallel model
pipeline_model = make_pipeline(
    model,
    balance=[2, 2, 2, 2],   # Split 8 layers across 4 GPUs
    devices=[0, 1, 2, 3],
    checkpoint="never"      # or "always", "except_last"
)

# Pipeline training requires special handling
# See unified_workflow.py for complete implementation
```

### Multi-GPU Training Script

```bash
# Single node, multiple GPUs
python -m torch.distributed.launch \
    --nproc_per_node=4 \
    unified_workflow.py \
    --distributed \
    --batch-size 2 \
    --epochs 10

# Multiple nodes
python -m torch.distributed.launch \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="192.168.1.100" \
    --master_port=29500 \
    --nproc_per_node=4 \
    unified_workflow.py \
    --distributed
```

---

## Performance Optimization

### Memory Optimization

```python
# Enable all memory optimizations
model = BitTransformerLM(
    d_model=512,
    nhead=8,
    num_layers=8,
    reversible=True,          # Reversible layers save ~50% memory
    use_checkpoint=True,      # Gradient checkpointing
    chunk_size=64,            # Chunked attention for long sequences
    full_attn_logging=False   # Skip full attention reconstruction
)

# Training optimizations
train_loop(
    model,
    data,
    batch_size=4,        # Smaller batches
    accum_steps=8,       # Gradient accumulation
    amp=True,            # Mixed precision
    compile_model=True   # torch.compile
)
```

### CPU Optimization

```python
from bit_transformer.torch_utils import cpu_autocast

# Enable BF16 on CPU
with cpu_autocast():
    logits, telemetry = model(bits)

# Or enable for the entire model
model = BitTransformerLM(use_autocast=True)  # Automatically uses CPU BF16
```

### Inference Optimization

```python
# Quantize for inference
import torch

from bit_transformer import quantize_dynamic
from bit_transformer.utils import set_dropout

# Switch to evaluation mode
model.eval()
set_dropout(model, 0.0)

# Dynamic quantization
quantized = quantize_dynamic(model, dtype=torch.qint8)

# Optimize for inference
with torch.no_grad():
    logits, _ = quantized(input_bits)
```

### Long Sequence Processing

```python
from bit_transformer.model import infer_long_sequence

# Process sequences longer than max_seq_len
long_text = "Very long text..." * 1000
bits = text_to_bits(long_text)

output = infer_long_sequence(
    model,
    torch.tensor(bits).unsqueeze(0),
    chunk_size=256,   # Process in 256-bit chunks
    overlap=32,       # 32-bit overlap between chunks
    stride=224        # 224-bit stride (256 - 32)
)
```

---

## Troubleshooting

### Common Issues

#### 1. **Memory Errors**
```
RuntimeError: CUDA out of memory
```
**Solutions:**
- Enable reversible layers: `reversible=True`
- Enable gradient checkpointing: `use_checkpoint=True`
- Reduce batch size or use gradient accumulation
- Use chunked attention: `chunk_size=64`
- Enable mixed precision: `amp=True`

#### 2. **Tensor Shape Mismatches**
```
RuntimeError: view size is not compatible with input tensor's size
```
**Solutions:**
- Always use `.reshape()` instead of `.view()` with BitTransformerLM (illustrated below)
- Check that input sequences are properly formatted (1D for bits)
- Ensure batch dimensions are consistent

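The `.reshape()` versus `.view()` advice follows from a general PyTorch rule rather than anything specific to this repository: `.view()` requires a contiguous memory layout, while `.reshape()` silently falls back to a copy when needed. A generic illustration:

```python
import torch

x = torch.arange(12).reshape(3, 4)
t = x.t()                 # transpose produces a non-contiguous view

print(t.is_contiguous())  # False
print(t.reshape(-1)[:5])  # works: reshape copies if it has to

try:
    t.view(-1)            # fails on a non-contiguous tensor
except RuntimeError as e:
    print("view failed:", e)
```
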
#### 3. **Parity Check Failures**
```
ValueError: Parity check failed
```
**Solutions:**
- Use `enforce_parity()` to fix parity bits in generated sequences
- Check that text encoding/decoding is consistent
- Verify bit sequences have the correct 9-bit (8 data + parity) structure (a small checker follows)

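When this error appears, it can help to locate which byte groups break parity. The checker below is a minimal sketch that assumes the same illustrative "8 data bits + trailing even-parity bit" layout used earlier in this guide; `enforce_parity()` remains the authoritative fix.

```python
def find_parity_errors(bits: list[int]) -> list[int]:
    """Return byte indices whose 9-bit group fails the even-parity check."""
    bad = []
    for byte_idx in range(len(bits) // 9):
        group = bits[byte_idx * 9:(byte_idx + 1) * 9]
        if sum(group[:8]) % 2 != group[8]:
            bad.append(byte_idx)
    return bad

sample = [0, 1, 0, 0, 1, 0, 0, 0, 0,   # 'H' with correct parity
          0, 1, 1, 0, 1, 0, 0, 1, 1]   # 'i' with a corrupted parity bit
print(find_parity_errors(sample))      # [1]
```
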
#### 4. **Safety Gate Triggering**
```
SafetyError: Output blocked by safety gate
```
**Solutions:**
- Lower safety thresholds: `c_floor=0.2, s_floor=0.4`
- Increase burn-in period: `burn_in=20`
- Use retry with diffusion: `safe_sample_with_retry()`
- Check model training quality

### Debug Mode

```python
# Enable detailed logging
import logging
import torch

logging.basicConfig(level=logging.DEBUG)

# Model with debug telemetry
model = BitTransformerLM(
    d_model=64,
    nhead=4,
    num_layers=2,
    full_attn_logging=True,   # Log full attention maps
    chunk_size=None           # Disable chunking for debugging
)

# Inspect telemetry
logits, telemetry = model(input_bits)
print("Telemetry keys:", list(telemetry.keys()))
print("Attention maps shape:", [a.shape for a in telemetry['attention_maps']])
activations = torch.stack(telemetry['activations'])
print(f"Activation stats: mean={activations.mean().item():.4f}, std={activations.std().item():.4f}")
```

### Performance Profiling

```python
import torch.profiler
import torch.nn.functional as F

# Profile a training step (assumes model, input_bits, and targets are defined as above)
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_stack=True,
) as prof:
    logits, telemetry = model(input_bits)
    loss = F.cross_entropy(logits.reshape(-1, 2), targets.reshape(-1))
    loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total"))
```

---

## Best Practices

### Model Configuration

#### For Experimentation (< 1M parameters)
```python
model = BitTransformerLM(
    d_model=64,
    nhead=4,
    num_layers=2,
    dim_feedforward=128,
    max_seq_len=128,
    reversible=False,   # Simpler for debugging
    use_checkpoint=False
)
```

#### For Research (1M-100M parameters)
```python
model = BitTransformerLM(
    d_model=256,
    nhead=8,
    num_layers=6,
    dim_feedforward=1024,
    max_seq_len=512,
    reversible=True,    # Enable memory efficiency
    use_checkpoint=True,
    chunk_size=128,
    lambda_K=0.05,      # Light regularization
    lambda_C=0.05,
    lambda_S=0.05
)
```

#### For Large-Scale (100M+ parameters)
```python
model = BitTransformerLM(
    d_model=1024,
    nhead=16,
    num_layers=20,
    dim_feedforward=4096,
    max_seq_len=2048,
    reversible=True,
    use_checkpoint=True,
    chunk_size=256,
    full_attn_logging=False,   # Save memory
    lambda_K=0.1,
    lambda_C=0.1,
    lambda_S=0.1
)
```

### Training Best Practices

1. **Always validate on held-out data** to monitor overfitting
2. **Use gradient clipping** to prevent training instability
3. **Monitor telemetry metrics** for signs of model degradation
4. **Start with smaller models** before scaling up
5. **Use safety gates** in production deployments
6. **Enable logging** to track training progress
7. **Save checkpoints frequently** to prevent loss of progress (a minimal example follows this list)

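For point 7, plain PyTorch checkpointing is sufficient if you are not using the dashboard's checkpoint management. The directory name, file naming, and save frequency below are arbitrary illustrative choices:

```python
import os
import torch

def save_checkpoint(model, optimizer, epoch, directory="checkpoints"):
    """Save model and optimizer state for later resumption."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f"bitlm_epoch{epoch}.pt")
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )
    return path

def load_checkpoint(model, optimizer, path):
    """Restore model and optimizer state; returns the saved epoch."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"]

# e.g. inside a training loop:
# if epoch % 5 == 0:
#     save_checkpoint(model, optimizer, epoch)
```
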
### Data Preparation

```python
import torch

from bit_transformer import text_to_bits

# Good: clean, well-formatted text
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is transforming technology.",
    "BitTransformer processes information at the bit level."
]

# Convert to training sequences
all_bits = []
for text in texts:
    bits = text_to_bits(text)
    all_bits.extend(bits)

# Create overlapping sequences for better learning
data = torch.tensor(all_bits)
seq_len = 128
stride = 64
sequences = []
for i in range(0, len(data) - seq_len, stride):
    sequences.append(data[i:i + seq_len])

training_data = torch.stack(sequences)
```

### Production Deployment

```python
import logging

import torch

# Production-ready model setup
model.eval()  # Disable dropout
set_dropout(model, 0.0)

# Enable safety monitoring
gate = SafetyGate(c_floor=0.3, s_floor=0.5, burn_in=5)

# Quantize for efficiency
production_model = quantize_dynamic(model)

# Safe inference with monitoring
def safe_generate(input_text, max_length=100):
    try:
        input_bits = torch.tensor(text_to_bits(input_text)).unsqueeze(0)
        return safe_sample_with_retry(
            production_model,
            input_bits,
            max_retries=3
        )
    except Exception as e:
        logging.error(f"Generation failed: {e}")
        return "Error: Unable to generate safe output"
```

---

## Getting Help

### Documentation Resources
- **ABOUTME.md**: Project overview and quick start
- **README.md**: Professional model card and specifications
- **RESEARCH_STATUS.md**: Current research status and limitations
- **EMPIRICAL_VALIDATION.md**: Evidence-based analysis of capabilities

### Community Support
- **GitHub Issues**: Report bugs and request features
- **Discussions**: Ask questions and share experiences
- **Examples**: Check the `tests/` directory for usage examples

### **🤖 Recommended: Use with Claude Code**

For the best experience with BitTransformerLM, we recommend using [Claude Code](https://claude.ai/code):

- **Interactive Setup**: Get step-by-step guidance for configuration
- **Real-time Debugging**: Immediate help when things go wrong
- **Code Generation**: Custom scripts and experiments tailored to your needs
- **Architecture Explanation**: Deep understanding of bit-native processing
- **Best Practices**: Learn optimal configurations for your use case

Claude Code understands BitTransformerLM's unique architecture and can help you navigate the complexities of bit-level language modeling.

---

**Remember: BitTransformerLM is experimental research software. Always validate results thoroughly and use safety monitoring in any deployment.**

Happy experimenting! 🤖✨