BitTransformerLM User Guide

Version: 0.1.0 Experimental
Last Updated: August 2025
Recommended Setup: Use with Claude Code for optimal experience

Table of Contents

  1. Quick Start
  2. Architecture Overview
  3. Core Features
  4. Installation & Setup
  5. Basic Usage Examples
  6. Advanced Features
  7. Training Your Own Models
  8. Safety and Monitoring
  9. Distributed Training
  10. Performance Optimization
  11. Troubleshooting
  12. Best Practices

Quick Start

BitTransformerLM is an experimental transformer language model that operates directly on binary sequences (bits) rather than tokens. This unique approach enables fine-grained control over information processing and built-in safety monitoring.

Minimal Example

from bit_transformer import BitTransformerLM, example_training_step

# Run basic example
loss, telemetry = example_training_step()
print(f"Training loss: {loss}")
print(f"Available telemetry: {list(telemetry.keys())}")

Text Processing Example

import torch
from bit_transformer import BitTransformerLM, text_to_bits, bits_to_text

# Create model
model = BitTransformerLM(
    d_model=128,
    nhead=4,
    num_layers=2,
    dim_feedforward=256,
    max_seq_len=256
)

# Convert text to bits and process
text = "Hello, world!"
bits = text_to_bits(text)
bit_tensor = torch.tensor(bits).unsqueeze(0)

# Forward pass
logits, telemetry = model(bit_tensor)
print(f"Input bits: {len(bits)}")
print(f"Output shape: {logits.shape}")
print(f"Telemetry metrics: {list(telemetry.keys())}")

Architecture Overview

Bit-Native Processing

Unlike traditional language models that use token embeddings, BitTransformerLM processes raw binary sequences:

  • Input: Text → UTF-8 bytes → Bits with parity protection (9 bits per byte)
  • Processing: Multi-head attention on bit embeddings
  • Output: Probability distribution over next bit (0 or 1)
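
For intuition, here is a minimal sketch of parity-protected encoding (illustrative only; even parity and MSB-first bit order are assumptions here, so use the library's text_to_bits / bits_to_text for the real format):

def byte_to_bits_with_parity(byte: int) -> list[int]:
    """One byte -> 8 data bits plus a parity bit (even parity assumed)."""
    bits = [(byte >> (7 - i)) & 1 for i in range(8)]  # MSB first
    return bits + [sum(bits) % 2]

def encode_text(text: str) -> list[int]:
    out: list[int] = []
    for b in text.encode("utf-8"):
        out.extend(byte_to_bits_with_parity(b))
    return out

assert len(encode_text("Hi")) == 2 * 9  # 9 bits per byte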

Key Innovations

1. Reversible Transformer Layers

  • Memory-efficient computation that doesn't store intermediate activations
  • Enables training of deeper models with same memory footprint
  • Mathematically reversible operations for gradient computation
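
The underlying idea is the reversible residual coupling popularized by RevNets; the sketch below is illustrative and not BitTransformerLM's exact layer implementation:

import torch
import torch.nn as nn

def rev_forward(x1, x2, F, G):
    """Forward coupling: the outputs fully determine the inputs."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2, F, G):
    """Reconstruct inputs from outputs, so activations need not be stored."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

F_fn, G_fn = nn.Linear(8, 8), nn.Linear(8, 8)
x1, x2 = torch.randn(2, 8), torch.randn(2, 8)
y1, y2 = rev_forward(x1, x2, F_fn, G_fn)
r1, r2 = rev_inverse(y1, y2, F_fn, G_fn)
assert torch.allclose(x1, r1, atol=1e-5) and torch.allclose(x2, r2, atol=1e-5)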

2. Built-in Safety Telemetry

  • K (Negentropy): Measures information content vs random noise
  • C (LZ Complexity): Proxy for compressibility and pattern complexity
  • S (Symbiosis): Alignment with reference distributions
  • Real-time monitoring and safety gates

3. Dual-Mode Operation

  • Causal Mode: Traditional autoregressive generation
  • Diffusion Mode: Bidirectional denoising for higher quality output

4. Progressive Scaling

  • Dynamic architecture expansion based on validation performance
  • Automatic addition of layers, width, or context length
  • Curriculum learning from simple to complex patterns

Core Features

Text Processing

  • Parity-Protected Encoding: Each byte gets a parity bit for error detection
  • UTF-8 Support: Full Unicode text processing capability
  • Bidirectional Processing: Support for both causal and diffusion modes

Safety & Monitoring

  • Real-time Telemetry: K/C/S metrics computed during inference
  • Safety Gates: Automatic blocking of unsafe outputs
  • Metric Drift Detection: Alerts when model behavior changes
  • Human-in-the-Loop: Safe inference with retry mechanisms

Memory Efficiency

  • Reversible Layers: Significant memory savings for deep models
  • Gradient Checkpointing: Trade compute for memory in training
  • Dynamic Quantization: Runtime INT8 conversion for inference
  • 4-bit QAT: Quantization-aware training for extreme efficiency

Advanced Training

  • Distributed Training: FSDP and pipeline parallelism support
  • Mixed Precision: FP16/BF16 optimization with CPU autocast
  • Compression Pipeline: Run-length encoding for efficient storage
  • Progressive Curriculum: Automatic difficulty scaling

Installation & Setup

Requirements

  • Python 3.10 or later
  • PyTorch 2.7.1 or later
  • CUDA (optional, for GPU acceleration)

Installation

# Clone repository
git clone https://huggingface.co/WCNegentropy/BitTransformerLM
cd BitTransformerLM

# Install dependencies
pip install -r requirements.txt

# For GPU support (optional)
pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118

Quick Test

# Run basic example
python example.py

# Expected output:
# Training loss: [some value]
# Available telemetry: ['activations', 'attention_maps', ...]

🤖 Recommended: Setup with Claude Code

For the best experience, we recommend using Claude Code to set up and work with BitTransformerLM:

  1. Open Claude Code and navigate to your project directory
  2. Clone the repository: Claude Code can help with git operations and dependency management
  3. Interactive Setup: Claude Code can guide you through configuration options and explain parameters
  4. Real-time Assistance: Get help with model architecture, training parameters, and debugging
  5. Code Generation: Generate custom training scripts and experiments with AI assistance

Claude Code provides contextual understanding of BitTransformerLM's unique architecture and can help you avoid common pitfalls when working with bit-native processing.


Basic Usage Examples

1. Creating Models

from bit_transformer import BitTransformerLM

# Small model for experimentation
small_model = BitTransformerLM(
    d_model=64,           # Embedding dimension
    nhead=4,              # Number of attention heads
    num_layers=2,         # Number of transformer layers
    dim_feedforward=128,  # Feedforward dimension
    max_seq_len=128,      # Maximum sequence length
    reversible=True,      # Use memory-efficient reversible layers
    use_checkpoint=True   # Enable gradient checkpointing
)

# Medium model for research
medium_model = BitTransformerLM(
    d_model=512,
    nhead=8, 
    num_layers=8,
    dim_feedforward=2048,
    max_seq_len=512,
    reversible=True,
    use_checkpoint=True,
    chunk_size=64,        # Chunked attention for long sequences
    lambda_K=0.1,         # Negentropy regularization weight
    lambda_C=0.1,         # Complexity regularization weight
    lambda_S=0.1          # Symbiosis regularization weight
)

2. Text Generation

from bit_transformer.bit_io import sample_text

# Generate text from prompt
prompt = "The future of AI is"
generated = sample_text(
    model,
    prompt=prompt,
    max_new_tokens=20,    # Maximum number of new tokens to generate
    temperature=0.8,      # Sampling temperature
    top_p=0.9            # Nucleus sampling
)
print(f"Generated: {generated}")

3. Safe Inference

from bit_transformer import hil_safe_inference, text_to_bits
import torch

# Convert text to bits
text = "Hello, world!"
bits = torch.tensor(text_to_bits(text)).unsqueeze(0)

# Safe inference with telemetry monitoring
try:
    output_bits, telemetry = hil_safe_inference(
        model, 
        bits,
        c_floor=0.3,     # Minimum complexity threshold
        s_floor=0.5,     # Minimum symbiosis threshold
        strict=True      # Throw error if thresholds not met
    )
    print("✅ Safe inference completed")
    print(f"K (Negentropy): {telemetry.get('negentropy_logits', 'N/A')}")
    print(f"C (Complexity): {telemetry.get('lz_complexity_logits', 'N/A')}")
    print(f"S (Symbiosis): {telemetry.get('symbiosis_score', 'N/A')}")
except Exception as e:
    print(f"⚠️ Safety check failed: {e}")

4. Interactive Dashboard

# Launch the interactive dashboard
python unified_workflow.py --dashboard

# Or programmatically
from bit_transformer.dashboard_app import run_dashboard
run_dashboard(host="localhost", port=5000)

The dashboard provides:

  • Real-time training monitoring
  • Telemetry visualization
  • Model configuration controls
  • HuggingFace checkpoint management
  • Safe inference testing interface

Advanced Features

1. Diffusion Mode Training

Diffusion mode enables bidirectional processing for higher quality generation:

# Train with diffusion mode
python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32

# Different noise schedules
python unified_workflow.py --diffusion --noise-schedule cosine --diffusion-steps 16

# Diffusion curriculum (noise decay over epochs)
python unified_workflow.py --diffusion --diffusion-curriculum

Diffusion Parameters:

  • --diffusion-steps: Number of denoising steps (higher = better quality)
  • --noise-schedule: linear, cosine, or exp noise decay
  • --diffusion-curriculum: Gradually reduce noise over training epochs
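
These schedules can be pictured with a small sketch (illustrative; the exact formulas in unified_workflow.py may differ):

import math

def noise_level(step: int, total_steps: int, schedule: str = "linear") -> float:
    """Fraction of bits corrupted at a given denoising step (1.0 = pure noise)."""
    t = step / max(total_steps - 1, 1)  # progress through denoising, in [0, 1]
    if schedule == "linear":
        return 1.0 - t
    if schedule == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * t))
    if schedule == "exp":
        return math.exp(-5.0 * t)  # decay constant chosen arbitrarily for illustration
    raise ValueError(f"unknown schedule: {schedule}")

# Noise starts high and decays toward zero as denoising proceeds
print([round(noise_level(s, 8, "cosine"), 2) for s in range(8)])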

2. Progressive Scaling

Enable automatic model growth based on performance:

from bit_transformer.training import train_loop
from bit_transformer.scale import expand_model

# Training loop with automatic scaling
model = BitTransformerLM(d_model=64, nhead=4, num_layers=2, dim_feedforward=128)
train_data = torch.randint(0, 2, (1000, 64))

# Train with progressive scaling
train_loop(
    model,
    train_data,
    epochs=10,
    batch_size=8,
    # Progressive scaling will automatically trigger when validation loss plateaus
)

# Manual model expansion
expanded_model = expand_model(model, strategy="depth")  # Add layers
expanded_model = expand_model(model, strategy="width")  # Increase width
expanded_model = expand_model(model, strategy="context")  # Extend context

3. Compression Pipeline

BitTransformerLM includes run-length encoding for efficient data storage:

from bit_transformer import compress_bits, decompress_bits

# Compress bit sequences
original_bits = torch.tensor([0, 0, 0, 1, 1, 0, 1, 1, 1])
compressed = compress_bits(original_bits)
decompressed = decompress_bits(compressed)

print(f"Original: {original_bits}")
print(f"Compressed: {compressed}")  
print(f"Decompressed: {decompressed}")
print(f"Compression ratio: {len(original_bits) / len(compressed):.2f}")

# Use compression in training
train_loop(
    model,
    data,
    compress_prob=0.5,    # 50% of training uses compressed data
    compress_warmup=100   # Start compression after 100 steps
)
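
For intuition, a minimal run-length encoder over bits might look like the sketch below (illustrative only; the actual format produced by compress_bits may differ):

def rle_encode(bits: list[int]) -> list[tuple[int, int]]:
    """Encode a bit list as (value, run_length) pairs."""
    runs: list[tuple[int, int]] = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs

def rle_decode(runs: list[tuple[int, int]]) -> list[int]:
    return [b for b, n in runs for _ in range(n)]

bits = [0, 0, 0, 1, 1, 0, 1, 1, 1]
assert rle_decode(rle_encode(bits)) == bits
print(rle_encode(bits))  # [(0, 3), (1, 2), (0, 1), (1, 3)]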

4. Quantization and Optimization

from bit_transformer import quantize_dynamic, prepare_qat_fx, convert_qat_fx

# Dynamic quantization for inference
quantized_model = quantize_dynamic(model, dtype=torch.qint8)

# 4-bit quantization-aware training
qat_model = prepare_qat_fx(model)
# ... train qat_model ...
final_model = convert_qat_fx(qat_model)

# Enable mixed precision and compilation
train_loop(
    model,
    data,
    amp=True,           # Enable automatic mixed precision
    compile_model=True  # Use torch.compile for speedup
)

Training Your Own Models

Basic Training Script

import torch
from bit_transformer import BitTransformerLM, train_loop, configure_optimizer
from bit_transformer.bit_io import text_to_bits

# Prepare training data
texts = ["Hello world", "How are you?", "BitTransformer is working!"]
all_bits = []
for text in texts:
    bits = text_to_bits(text)
    all_bits.extend(bits)

# Convert to tensor and create sequences
data = torch.tensor(all_bits)
sequences = data.unfold(0, 64, 32)  # 64-bit sequences with 32-bit stride

# Create model
model = BitTransformerLM(
    d_model=128,
    nhead=8,
    num_layers=4,
    dim_feedforward=512,
    max_seq_len=64,
    reversible=True
)

# Configure optimizer
optimizer = configure_optimizer(model, lr=0.001, weight_decay=0.01)

# Training loop
train_loop(
    model,
    sequences,
    epochs=10,
    batch_size=4,
    optimizer=optimizer,
    amp=True,          # Mixed precision
    log=True           # Enable logging
)

Advanced Training Configuration

# Advanced training with all features enabled
train_loop(
    model,
    data,
    epochs=20,
    batch_size=8,
    accum_steps=4,            # Gradient accumulation
    amp=True,                 # Mixed precision
    compile_model=True,       # torch.compile optimization
    
    # Compression settings
    compress_prob=0.3,        # 30% compression probability
    compress_warmup=50,       # Start compression after 50 steps
    
    # Diffusion settings  
    diffusion=True,           # Enable diffusion mode
    diffusion_curriculum=True, # Decay noise over epochs
    
    # Direct bit training
    direct_prob=0.1,          # 10% direct bit prediction
    
    # Logging
    log=True                  # Enable detailed logging
)

Custom Training Loop

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from bit_transformer.utils import set_dropout

# Manual training loop for full control
model.train()
set_dropout(model, 0.1)  # Enable dropout for training

optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
criterion = F.cross_entropy
data_loader = DataLoader(sequences, batch_size=4)  # `sequences` from the basic training script above

for epoch in range(10):
    total_loss = 0
    for batch in data_loader:
        optimizer.zero_grad()
        
        # Forward pass
        logits, telemetry = model(batch)
        
        # Compute loss
        if logits.dim() == 3:  # (batch, seq, 2)
            targets = batch[:, 1:]  # Next bit prediction
            logits = logits[:, :-1]  # Remove last prediction
            loss = criterion(logits.reshape(-1, 2), targets.reshape(-1))
        else:
            loss = criterion(logits, batch)
        
        # Add telemetry regularization
        if model.lambda_K > 0:
            loss += model.lambda_K * (1 - telemetry.get('negentropy_logits', 0))
        if model.lambda_C > 0:
            loss += model.lambda_C * (1 - telemetry.get('lz_complexity_logits', 0))
            
        # Backward pass
        loss.backward()
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        total_loss += loss.item()
        
        # Safety check
        if telemetry.get('symbiosis_score', 1.0) < 0.3:
            print("⚠️ Low symbiosis score detected")
    
    print(f"Epoch {epoch}: Average loss = {total_loss / len(data_loader):.4f}")

Safety and Monitoring

Telemetry Metrics

BitTransformerLM provides three key safety metrics:

K (Negentropy) - Information Content

  • Range: 0-1 (0 = random noise, 1 = perfectly ordered)
  • Purpose: Measures departure from randomness
  • Interpretation:
    • Very low K (< 0.1): Output is noise-like
    • Moderate K (0.3-0.7): Structured but varied output
    • Very high K (> 0.9): Repetitive or overly structured

C (LZ Complexity) - Pattern Complexity

  • Range: 0-1 (higher = more complex patterns)
  • Purpose: Proxy for Lempel-Ziv compressibility
  • Interpretation:
    • Low C (< 0.3): Highly repetitive patterns
    • Moderate C (0.3-0.7): Balanced complexity
    • High C (> 0.8): Complex, varied patterns

S (Symbiosis) - Distribution Alignment

  • Range: 0-1 (higher = better alignment)
  • Purpose: Agreement with reference distributions via KL divergence
  • Interpretation:
    • Low S (< 0.3): Poor alignment with expected patterns
    • Moderate S (0.5-0.8): Good alignment
    • High S (> 0.8): Excellent alignment
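
As a rough illustration of how a metric like K can fall out of a raw bit sequence (one plausible formulation; the library's exact definitions of K, C, and S may differ):

import math

def negentropy(bits: list[int]) -> float:
    """Toy K: 1 minus the Shannon entropy of the empirical bit distribution."""
    p = sum(bits) / len(bits)
    if p in (0.0, 1.0):
        return 1.0  # constant stream: perfectly ordered
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # entropy in bits, max 1.0
    return 1.0 - h

print(negentropy([0, 1] * 32))  # 0.0: balanced mix looks maximally random
print(negentropy([1] * 64))     # 1.0: perfectly ordered

A compressibility proxy for C could be built analogously, e.g. from the run-length encoding sketched earlier in this guide.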

Safety Gates

from bit_transformer.safety import SafetyGate, safe_sample_with_retry

# Configure safety gate
gate = SafetyGate(
    c_floor=0.3,      # Minimum complexity
    s_floor=0.5,      # Minimum symbiosis  
    decay=0.9,        # EMA decay factor
    burn_in=10        # Steps before gating starts
)

# Check if output should be blocked
should_block = gate.should_trigger(c_val=0.2, s_val=0.4)  # True - below thresholds

# Safe sampling with automatic retry
output = safe_sample_with_retry(
    model,
    input_bits,
    max_retries=3,
    retry_strategy="diffusion"  # Try diffusion mode on failure
)

Metric Drift Detection

from bit_transformer.telemetry import detect_metric_drift

# Monitor metric stability over time
metrics_history = [
    {"K": 0.5, "C": 0.6, "S": 0.7},
    {"K": 0.52, "C": 0.58, "S": 0.69},  
    {"K": 0.8, "C": 0.9, "S": 0.4},   # Drift detected!
    # ... more metrics
]

drift_detected = detect_metric_drift(
    metrics_history,
    window=10,        # Look back 10 steps
    threshold=0.2     # Alert if change > 0.2
)

if drift_detected:
    print("⚠️ Model behavior drift detected!")

Distributed Training

FSDP (Fully Sharded Data Parallel)

from bit_transformer.distributed import wrap_fsdp, setup_distributed
import torch.distributed as dist

# Initialize distributed training
setup_distributed(rank=0, world_size=4)

# Wrap model with FSDP
model = BitTransformerLM(d_model=1024, nhead=16, num_layers=12)
fsdp_model = wrap_fsdp(
    model,
    sharding_strategy="FULL_SHARD",  # or "SHARD_GRAD_OP", "NO_SHARD"
    mixed_precision=True,
    device_id=0
)

# Train with FSDP
train_loop(
    fsdp_model,
    data,
    epochs=10,
    batch_size=2,    # Smaller batch per GPU
    amp=True
)

Pipeline Parallelism

from bit_transformer.distributed import make_pipeline

# Create pipeline parallel model
pipeline_model = make_pipeline(
    model,
    balance=[2, 2, 2, 2],  # Split 8 layers across 4 GPUs
    devices=[0, 1, 2, 3],
    checkpoint="never"     # or "always", "except_last"
)

# Pipeline training requires special handling
# See unified_workflow.py for complete implementation

Multi-GPU Training Script

# Single node, multiple GPUs
# (torch.distributed.launch is deprecated in recent PyTorch; `torchrun` accepts the same arguments)
python -m torch.distributed.launch \
    --nproc_per_node=4 \
    unified_workflow.py \
    --distributed \
    --batch-size 2 \
    --epochs 10

# Multiple nodes
python -m torch.distributed.launch \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="192.168.1.100" \
    --master_port=29500 \
    --nproc_per_node=4 \
    unified_workflow.py \
    --distributed

Performance Optimization

Memory Optimization

# Enable all memory optimizations
model = BitTransformerLM(
    d_model=512,
    nhead=8,
    num_layers=8,
    reversible=True,          # Reversible layers save ~50% memory
    use_checkpoint=True,      # Gradient checkpointing
    chunk_size=64,            # Chunked attention for long sequences
    full_attn_logging=False   # Skip full attention reconstruction
)

# Training optimizations
train_loop(
    model,
    data,
    batch_size=4,            # Smaller batches
    accum_steps=8,           # Gradient accumulation  
    amp=True,                # Mixed precision
    compile_model=True       # torch.compile
)

CPU Optimization

from bit_transformer.torch_utils import cpu_autocast

# Enable BF16 on CPU
with cpu_autocast():
    logits, telemetry = model(bits)

# Or enable for entire model
model = BitTransformerLM(use_autocast=True)  # Automatically uses CPU BF16

Inference Optimization

# Quantize for inference
from bit_transformer import quantize_dynamic

# Switch to evaluation mode
model.eval()
set_dropout(model, 0.0)

# Dynamic quantization
quantized = quantize_dynamic(model, dtype=torch.qint8)

# Optimize for inference
with torch.no_grad():
    logits, _ = quantized(input_bits)

Long Sequence Processing

from bit_transformer.model import infer_long_sequence

# Process sequences longer than max_seq_len
long_text = "Very long text..." * 1000
bits = text_to_bits(long_text)

output = infer_long_sequence(
    model,
    torch.tensor(bits).unsqueeze(0),
    chunk_size=256,      # Process in 256-bit chunks
    overlap=32,          # 32-bit overlap between chunks
    stride=224           # 224-bit stride (256-32)
)

Troubleshooting

Common Issues

1. Memory Errors

RuntimeError: CUDA out of memory

Solutions:

  • Enable reversible layers: reversible=True
  • Enable gradient checkpointing: use_checkpoint=True
  • Reduce batch size or use gradient accumulation
  • Use chunked attention: chunk_size=64
  • Enable mixed precision: amp=True

2. Tensor Shape Mismatches

RuntimeError: view size is not compatible with input tensor's size

Solutions:

  • Always use .reshape() instead of .view() with BitTransformerLM (see the snippet after this list)
  • Check that input sequences are properly formatted (1D for bits)
  • Ensure batch dimensions are consistent
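
The difference shows up on non-contiguous tensors, for example after a transpose:

import torch

t = torch.arange(6).reshape(2, 3).t()  # transpose makes the tensor non-contiguous
print(t.reshape(-1))                   # works: copies when necessary
try:
    t.view(-1)                         # raises on non-contiguous memory
except RuntimeError as e:
    print("view failed:", e)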

3. Parity Check Failures

ValueError: Parity check failed

Solutions:

  • Use enforce_parity() to fix parity bits in generated sequences (a manual check is sketched after this list)
  • Check that text encoding/decoding is consistent
  • Verify bit sequences have correct 9-bit (8+parity) structure
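
enforce_parity() handles repair inside the library; for intuition, a manual check of the 9-bit structure might look like this (even parity assumed, matching the encoding sketch earlier in this guide):

def parity_ok(bits: list[int]) -> bool:
    """Check every 9-bit group: 8 data bits followed by an even-parity bit."""
    if len(bits) % 9 != 0:
        return False
    for i in range(0, len(bits), 9):
        group = bits[i:i + 9]
        if sum(group[:8]) % 2 != group[8]:
            return False
    return True

data = [1, 0, 1, 0, 0, 0, 0, 0]                # 8 data bits
print(parity_ok(data + [sum(data) % 2]))       # True
print(parity_ok(data + [1 - sum(data) % 2]))   # False: flipped parity bit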

4. Safety Gate Triggering

SafetyError: Output blocked by safety gate

Solutions:

  • Lower safety thresholds: c_floor=0.2, s_floor=0.4
  • Increase burn-in period: burn_in=20
  • Use retry with diffusion: safe_sample_with_retry()
  • Check model training quality

Debug Mode

# Enable detailed logging
import logging
logging.basicConfig(level=logging.DEBUG)

# Model with debug telemetry
model = BitTransformerLM(
    d_model=64,
    nhead=4,
    num_layers=2,
    full_attn_logging=True,  # Log full attention maps
    chunk_size=None          # Disable chunking for debugging
)

# Inspect telemetry
logits, telemetry = model(input_bits)
print("Telemetry keys:", list(telemetry.keys()))
print("Attention maps shape:", [a.shape for a in telemetry['attention_maps']])
print("Activation stats:", torch.stack(telemetry['activations']).describe())

Performance Profiling

import torch.profiler

# Profile training step
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_stack=True,
) as prof:
    logits, telemetry = model(input_bits)
    loss = F.cross_entropy(logits.reshape(-1, 2), targets.reshape(-1))
    loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total"))

Best Practices

Model Configuration

For Experimentation (< 1M parameters)

model = BitTransformerLM(
    d_model=64,
    nhead=4,
    num_layers=2,
    dim_feedforward=128,
    max_seq_len=128,
    reversible=False,    # Simpler for debugging
    use_checkpoint=False
)

For Research (1M-100M parameters)

model = BitTransformerLM(
    d_model=256,
    nhead=8,
    num_layers=6,
    dim_feedforward=1024,
    max_seq_len=512,
    reversible=True,     # Enable memory efficiency
    use_checkpoint=True,
    chunk_size=128,
    lambda_K=0.05,       # Light regularization
    lambda_C=0.05,
    lambda_S=0.05
)

For Large-Scale (100M+ parameters)

model = BitTransformerLM(
    d_model=1024,
    nhead=16, 
    num_layers=20,
    dim_feedforward=4096,
    max_seq_len=2048,
    reversible=True,
    use_checkpoint=True,
    chunk_size=256,
    full_attn_logging=False,  # Save memory
    lambda_K=0.1,
    lambda_C=0.1,
    lambda_S=0.1
)

Training Best Practices

  1. Always validate on held-out data to monitor overfitting
  2. Use gradient clipping to prevent training instability
  3. Monitor telemetry metrics for signs of model degradation
  4. Start with smaller models before scaling up
  5. Use safety gates in production deployments
  6. Enable logging to track training progress
  7. Save checkpoints frequently to prevent loss of progress (a minimal pattern is sketched below)
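
For item 7, a minimal checkpointing pattern in plain PyTorch (the path and dict layout are arbitrary choices here):

import torch

def save_checkpoint(model, optimizer, epoch, path="checkpoint.pt"):
    """Persist everything needed to resume training."""
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"]  # resume from the next epoch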

Data Preparation

# Good: Clean, well-formatted text
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is transforming technology.",
    "BitTransformer processes information at the bit level."
]

# Convert to training sequences
all_bits = []
for text in texts:
    bits = text_to_bits(text)
    all_bits.extend(bits)

# Create overlapping sequences for better learning
data = torch.tensor(all_bits)
seq_len = 128
stride = 64
sequences = []
for i in range(0, len(data) - seq_len, stride):
    sequences.append(data[i:i + seq_len])

training_data = torch.stack(sequences)

Production Deployment

# Production-ready model setup
model.eval()  # Disable dropout
set_dropout(model, 0.0)

# Enable safety monitoring
gate = SafetyGate(c_floor=0.3, s_floor=0.5, burn_in=5)

# Quantize for efficiency
production_model = quantize_dynamic(model)

# Safe inference with monitoring
def safe_generate(input_text):
    bits = torch.tensor(text_to_bits(input_text)).unsqueeze(0)
    try:
        return safe_sample_with_retry(
            production_model,
            bits,
            max_retries=3
        )
    except Exception as e:
        logging.error(f"Generation failed: {e}")
        return "Error: Unable to generate safe output"

Getting Help

Documentation Resources

  • ABOUTME.md: Project overview and quick start
  • README.md: Professional model card and specifications
  • RESEARCH_STATUS.md: Current research status and limitations
  • EMPIRICAL_VALIDATION.md: Evidence-based analysis of capabilities

Community Support

  • GitHub Issues: Report bugs and request features
  • Discussions: Ask questions and share experiences
  • Examples: Check the tests/ directory for usage examples

🤖 Recommended: Use with Claude Code

For the best experience with BitTransformerLM, we recommend using Claude Code:

  • Interactive Setup: Get step-by-step guidance for configuration
  • Real-time Debugging: Immediate help when things go wrong
  • Code Generation: Custom scripts and experiments tailored to your needs
  • Architecture Explanation: Deep understanding of bit-native processing
  • Best Practices: Learn optimal configurations for your use case

Claude Code understands BitTransformerLM's unique architecture and can help you navigate the complexities of bit-level language modeling.


Remember: BitTransformerLM is experimental research software. Always validate results thoroughly and use safety monitoring in any deployment.

Happy experimenting! 🤖✨