BitTransformerLM User Guide
Version: 0.1.0 Experimental
Last Updated: August 2025
Recommended Setup: Use with Claude Code for optimal experience
Table of Contents
- Quick Start
- Architecture Overview
- Core Features
- Installation & Setup
- Basic Usage Examples
- Advanced Features
- Training Your Own Models
- Safety and Monitoring
- Distributed Training
- Performance Optimization
- Troubleshooting
- Best Practices
Quick Start
BitTransformerLM is an experimental transformer language model that operates directly on binary sequences (bits) rather than tokens. This unique approach enables fine-grained control over information processing and built-in safety monitoring.
Minimal Example
from bit_transformer import BitTransformerLM, example_training_step
# Run basic example
loss, telemetry = example_training_step()
print(f"Training loss: {loss}")
print(f"Available telemetry: {list(telemetry.keys())}")
Text Processing Example
import torch
from bit_transformer import BitTransformerLM, text_to_bits, bits_to_text
# Create model
model = BitTransformerLM(
d_model=128,
nhead=4,
num_layers=2,
dim_feedforward=256,
max_seq_len=256
)
# Convert text to bits and process
text = "Hello, world!"
bits = text_to_bits(text)
bit_tensor = torch.tensor(bits).unsqueeze(0)
# Forward pass
logits, telemetry = model(bit_tensor)
print(f"Input bits: {len(bits)}")
print(f"Output shape: {logits.shape}")
print(f"Telemetry metrics: {list(telemetry.keys())}")
Architecture Overview
Bit-Native Processing
Unlike traditional language models that use token embeddings, BitTransformerLM processes raw binary sequences:
- Input: Text → UTF-8 bytes → Bits with parity protection (9 bits per byte)
- Processing: Multi-head attention on bit embeddings
- Output: Probability distribution over next bit (0 or 1)
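For intuition, here is a minimal standalone sketch of the encoding pipeline above. It assumes an even-parity convention (each 9-bit group sums to an even number of 1s); the library's text_to_bits may differ in detail, so treat this as illustrative rather than a drop-in replacement.
def encode_with_parity(text: str) -> list[int]:
    """Convert text to UTF-8 bytes, then to 9-bit groups (8 data bits + parity)."""
    out = []
    for byte in text.encode("utf-8"):
        data_bits = [(byte >> i) & 1 for i in range(7, -1, -1)]  # MSB first
        parity = sum(data_bits) % 2  # even parity assumed; the library's convention may differ
        out.extend(data_bits + [parity])
    return out

bits = encode_with_parity("Hi")
print(len(bits))  # 18 bits: 2 bytes x 9 bits each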
Key Innovations
1. Reversible Transformer Layers
- Memory-efficient computation that doesn't store intermediate activations
- Enables training of deeper models with the same memory footprint
- Mathematically reversible operations for gradient computation
2. Built-in Safety Telemetry
- K (Negentropy): Measures information content vs random noise
- C (LZ Complexity): Proxy for compressibility and pattern complexity
- S (Symbiosis): Alignment with reference distributions
- Real-time monitoring and safety gates
3. Dual-Mode Operation
- Causal Mode: Traditional autoregressive generation
- Diffusion Mode: Bidirectional denoising for higher quality output
4. Progressive Scaling
- Dynamic architecture expansion based on validation performance
- Automatic addition of layers, width, or context length
- Curriculum learning from simple to complex patterns
Core Features
Text Processing
- Parity-Protected Encoding: Each byte gets a parity bit for error detection
- UTF-8 Support: Full Unicode text processing capability
- Bidirectional Processing: Support for both causal and diffusion modes
Safety & Monitoring
- Real-time Telemetry: K/C/S metrics computed during inference
- Safety Gates: Automatic blocking of unsafe outputs
- Metric Drift Detection: Alerts when model behavior changes
- Human-in-the-Loop: Safe inference with retry mechanisms
Memory Efficiency
- Reversible Layers: Significant memory savings for deep models
- Gradient Checkpointing: Trade compute for memory in training
- Dynamic Quantization: Runtime INT8 conversion for inference
- 4-bit QAT: Quantization-aware training for extreme efficiency
Advanced Training
- Distributed Training: FSDP and pipeline parallelism support
- Mixed Precision: FP16/BF16 optimization with CPU autocast
- Compression Pipeline: Run-length encoding for efficient storage
- Progressive Curriculum: Automatic difficulty scaling
Installation & Setup
Requirements
- Python 3.10 or later
- PyTorch 2.7.1 or later
- CUDA (optional, for GPU acceleration)
Installation
# Clone repository
git clone https://huggingface.co/WCNegentropy/BitTransformerLM
cd BitTransformerLM
# Install dependencies
pip install -r requirements.txt
# For GPU support (optional)
pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118
Quick Test
# Run basic example
python example.py
# Expected output:
# Training loss: [some value]
# Available telemetry: ['activations', 'attention_maps', ...]
🤖 Recommended: Setup with Claude Code
For the best experience, we recommend using Claude Code to set up and work with BitTransformerLM:
- Open Claude Code and navigate to your project directory
- Clone the repository: Claude Code can help with git operations and dependency management
- Interactive Setup: Claude Code can guide you through configuration options and explain parameters
- Real-time Assistance: Get help with model architecture, training parameters, and debugging
- Code Generation: Generate custom training scripts and experiments with AI assistance
Claude Code provides contextual understanding of BitTransformerLM's unique architecture and can help you avoid common pitfalls when working with bit-native processing.
Basic Usage Examples
1. Creating Models
from bit_transformer import BitTransformerLM
# Small model for experimentation
small_model = BitTransformerLM(
d_model=64, # Embedding dimension
nhead=4, # Number of attention heads
num_layers=2, # Number of transformer layers
dim_feedforward=128, # Feedforward dimension
max_seq_len=128, # Maximum sequence length
reversible=True, # Use memory-efficient reversible layers
use_checkpoint=True # Enable gradient checkpointing
)
# Medium model for research
medium_model = BitTransformerLM(
d_model=512,
nhead=8,
num_layers=8,
dim_feedforward=2048,
max_seq_len=512,
reversible=True,
use_checkpoint=True,
chunk_size=64, # Chunked attention for long sequences
lambda_K=0.1, # Negentropy regularization weight
lambda_C=0.1, # Complexity regularization weight
lambda_S=0.1 # Symbiosis regularization weight
)
2. Text Generation
from bit_transformer.bit_io import sample_text
# Generate text from prompt
prompt = "The future of AI is"
generated = sample_text(
model,
prompt=prompt,
max_new_tokens=20, # Generate ~20 new characters
temperature=0.8, # Sampling temperature
top_p=0.9 # Nucleus sampling
)
print(f"Generated: {generated}")
3. Safe Inference
from bit_transformer import hil_safe_inference, text_to_bits
import torch
# Convert text to bits
text = "Hello, world!"
bits = torch.tensor(text_to_bits(text)).unsqueeze(0)
# Safe inference with telemetry monitoring
try:
    output_bits, telemetry = hil_safe_inference(
        model,
        bits,
        c_floor=0.3,  # Minimum complexity threshold
        s_floor=0.5,  # Minimum symbiosis threshold
        strict=True   # Raise an error if thresholds are not met
    )
    print("✅ Safe inference completed")
    print(f"K (Negentropy): {telemetry.get('negentropy_logits', 'N/A')}")
    print(f"C (Complexity): {telemetry.get('lz_complexity_logits', 'N/A')}")
    print(f"S (Symbiosis): {telemetry.get('symbiosis_score', 'N/A')}")
except Exception as e:
    print(f"⚠️ Safety check failed: {e}")
4. Interactive Dashboard
# Launch the interactive dashboard
python unified_workflow.py --dashboard
# Or programmatically
from bit_transformer.dashboard_app import run_dashboard
run_dashboard(host="localhost", port=5000)
The dashboard provides:
- Real-time training monitoring
- Telemetry visualization
- Model configuration controls
- HuggingFace checkpoint management
- Safe inference testing interface
Advanced Features
1. Diffusion Mode Training
Diffusion mode enables bidirectional processing for higher quality generation:
# Train with diffusion mode
python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32
# Different noise schedules
python unified_workflow.py --diffusion --noise-schedule cosine --diffusion-steps 16
# Diffusion curriculum (noise decay over epochs)
python unified_workflow.py --diffusion --diffusion-curriculum
Diffusion Parameters:
- --diffusion-steps: Number of denoising steps (higher = better quality)
- --noise-schedule: linear, cosine, or exp noise decay (see the schedule sketch below)
- --diffusion-curriculum: Gradually reduce noise over training epochs
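The exact schedule formulas live inside the library, but the three named shapes can be sketched as plain functions. The constants below (for example the exponential rate of 5.0) are illustrative assumptions, not BitTransformerLM's actual coefficients.
import math

def noise_level(step: int, total_steps: int, schedule: str = "linear") -> float:
    """Fraction of bits to corrupt at a given denoising step (1.0 = full noise)."""
    t = step / max(total_steps - 1, 1)  # progress in [0, 1]
    if schedule == "linear":
        return 1.0 - t
    if schedule == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * t))
    if schedule == "exp":
        return math.exp(-5.0 * t)  # rate constant chosen for illustration only
    raise ValueError(f"Unknown schedule: {schedule}")

print([round(noise_level(s, 8, "cosine"), 3) for s in range(8)])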
2. Progressive Scaling
Enable automatic model growth based on performance:
from bit_transformer.training import train_loop
from bit_transformer.scale import expand_model
# Training loop with automatic scaling
model = BitTransformerLM(d_model=64, nhead=4, num_layers=2, dim_feedforward=128)
train_data = torch.randint(0, 2, (1000, 64))
# Train with progressive scaling
train_loop(
model,
train_data,
epochs=10,
batch_size=8,
# Progressive scaling will automatically trigger when validation loss plateaus
)
# Manual model expansion
expanded_model = expand_model(model, strategy="depth") # Add layers
expanded_model = expand_model(model, strategy="width") # Increase width
expanded_model = expand_model(model, strategy="context") # Extend context
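A minimal plateau-driven growth loop might look like the sketch below. The patience heuristic, the eval_loss helper, and val_data are assumptions for illustration; the built-in scaling inside train_loop may use different criteria.
import torch
import torch.nn.functional as F

def eval_loss(model, val_data) -> float:
    """Hypothetical helper: average next-bit cross-entropy on a validation tensor."""
    model.eval()
    with torch.no_grad():
        # assumes logits of shape (batch, seq, 2), as in the custom training loop below
        logits, _ = model(val_data)
        loss = F.cross_entropy(logits[:, :-1].reshape(-1, 2), val_data[:, 1:].reshape(-1))
    model.train()
    return loss.item()

val_data = torch.randint(0, 2, (100, 64))  # held-out bits, analogous to train_data
best, stale, patience = float("inf"), 0, 3
for epoch in range(20):
    train_loop(model, train_data, epochs=1, batch_size=8)
    loss = eval_loss(model, val_data)
    if loss < best - 1e-3:
        best, stale = loss, 0
    else:
        stale += 1
    if stale >= patience:  # validation loss has plateaued: grow, then keep training
        model = expand_model(model, strategy="depth")
        stale = 0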
3. Compression Pipeline
BitTransformerLM includes run-length encoding for efficient data storage:
from bit_transformer import compress_bits, decompress_bits
# Compress bit sequences
original_bits = torch.tensor([0, 0, 0, 1, 1, 0, 1, 1, 1])
compressed = compress_bits(original_bits)
decompressed = decompress_bits(compressed)
print(f"Original: {original_bits}")
print(f"Compressed: {compressed}")
print(f"Decompressed: {decompressed}")
print(f"Compression ratio: {len(original_bits) / len(compressed):.2f}")
# Use compression in training
train_loop(
model,
data,
compress_prob=0.5, # 50% of training uses compressed data
compress_warmup=100 # Start compression after 100 steps
)
4. Quantization and Optimization
from bit_transformer import quantize_dynamic, prepare_qat_fx, convert_qat_fx
# Dynamic quantization for inference
quantized_model = quantize_dynamic(model, dtype=torch.qint8)
# 4-bit quantization-aware training
qat_model = prepare_qat_fx(model)
# ... train qat_model ...
final_model = convert_qat_fx(qat_model)
# Enable mixed precision and compilation
train_loop(
model,
data,
amp=True, # Enable automatic mixed precision
compile_model=True # Use torch.compile for speedup
)
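Dynamic quantization mainly benefits CPU inference, and a quick timing comparison is an easy sanity check. This sketch assumes the model and quantized_model from the snippet above and uses only standard PyTorch.
import time
import torch

def time_forward(m, x, iters: int = 20) -> float:
    """Average seconds per forward pass, measured under no_grad."""
    with torch.no_grad():
        m(x)  # warm-up pass
        start = time.perf_counter()
        for _ in range(iters):
            m(x)
    return (time.perf_counter() - start) / iters

sample = torch.randint(0, 2, (1, 128))  # assumes the model accepts bit tensors of this shape
print(f"fp32: {time_forward(model, sample):.4f}s")
print(f"int8: {time_forward(quantized_model, sample):.4f}s")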
Training Your Own Models
Basic Training Script
import torch
from bit_transformer import BitTransformerLM, train_loop, configure_optimizer
from bit_transformer.bit_io import text_to_bits
# Prepare training data
texts = ["Hello world", "How are you?", "BitTransformer is working!"]
all_bits = []
for text in texts:
    bits = text_to_bits(text)
    all_bits.extend(bits)
# Convert to tensor and create sequences
data = torch.tensor(all_bits)
sequences = data.unfold(0, 64, 32) # 64-bit sequences with 32-bit stride
# Create model
model = BitTransformerLM(
d_model=128,
nhead=8,
num_layers=4,
dim_feedforward=512,
max_seq_len=64,
reversible=True
)
# Configure optimizer
optimizer = configure_optimizer(model, lr=0.001, weight_decay=0.01)
# Training loop
train_loop(
model,
sequences,
epochs=10,
batch_size=4,
optimizer=optimizer,
amp=True, # Mixed precision
log=True # Enable logging
)
Advanced Training Configuration
# Advanced training with all features enabled
train_loop(
model,
data,
epochs=20,
batch_size=8,
accum_steps=4, # Gradient accumulation
amp=True, # Mixed precision
compile_model=True, # torch.compile optimization
# Compression settings
compress_prob=0.3, # 30% compression probability
compress_warmup=50, # Start compression after 50 steps
# Diffusion settings
diffusion=True, # Enable diffusion mode
diffusion_curriculum=True, # Decay noise over epochs
# Direct bit training
direct_prob=0.1, # 10% direct bit prediction
# Logging
log=True # Enable detailed logging
)
Custom Training Loop
import torch.nn.functional as F
from bit_transformer.utils import set_dropout
# Manual training loop for full control
model.train()
set_dropout(model, 0.1) # Enable dropout for training
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
criterion = F.cross_entropy
for epoch in range(10):
    total_loss = 0
    for batch in data_loader:
        optimizer.zero_grad()
        # Forward pass
        logits, telemetry = model(batch)
        # Compute loss
        if logits.dim() == 3:  # (batch, seq, 2)
            targets = batch[:, 1:]   # Next-bit prediction targets
            logits = logits[:, :-1]  # Drop the final prediction (no target for it)
            loss = criterion(logits.reshape(-1, 2), targets.reshape(-1))
        else:
            loss = criterion(logits, batch)
        # Add telemetry regularization
        if model.lambda_K > 0:
            loss += model.lambda_K * (1 - telemetry.get('negentropy_logits', 0))
        if model.lambda_C > 0:
            loss += model.lambda_C * (1 - telemetry.get('lz_complexity_logits', 0))
        # Backward pass
        loss.backward()
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        total_loss += loss.item()
        # Safety check
        if telemetry.get('symbiosis_score', 1.0) < 0.3:
            print("⚠️ Low symbiosis score detected")
    print(f"Epoch {epoch}: Average loss = {total_loss / len(data_loader):.4f}")
Safety and Monitoring
Telemetry Metrics
BitTransformerLM provides three key safety metrics:
K (Negentropy) - Information Content
- Range: 0-1 (0 = random noise, 1 = perfectly ordered)
- Purpose: Measures departure from randomness
- Interpretation:
- Very low K (< 0.1): Output is noise-like
- Moderate K (0.3-0.7): Structured but varied output
- Very high K (> 0.9): Repetitive or overly structured
C (LZ Complexity) - Pattern Complexity
- Range: 0-1 (higher = more complex patterns)
- Purpose: Proxy for Lempel-Ziv compressibility
- Interpretation:
- Low C (< 0.3): Highly repetitive patterns
- Moderate C (0.3-0.7): Balanced complexity
- High C (> 0.8): Complex, varied patterns
S (Symbiosis) - Distribution Alignment
- Range: 0-1 (higher = better alignment)
- Purpose: Agreement with reference distributions via KL divergence
- Interpretation:
- Low S (< 0.3): Poor alignment with expected patterns
- Moderate S (0.5-0.8): Good alignment
- High S (> 0.8): Excellent alignment
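The library computes these metrics internally, but rough stand-ins help build intuition. The formulas below (a Shannon-entropy-based negentropy and a bit-transition proxy for complexity) are illustrative approximations, not BitTransformerLM's exact definitions.
import math

def negentropy_proxy(bits: list[int]) -> float:
    """1 - H(p) for the empirical bit distribution: 0 for a fair coin, 1 for constant output."""
    p = sum(bits) / len(bits)
    if p in (0.0, 1.0):
        return 1.0
    h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return 1.0 - h

def complexity_proxy(bits: list[int]) -> float:
    """Fraction of positions where the bit flips: a crude pattern-complexity proxy."""
    transitions = sum(a != b for a, b in zip(bits, bits[1:]))
    return transitions / (len(bits) - 1)

print(negentropy_proxy([0, 1] * 8))  # 0.0 (balanced bits, maximally entropic)
print(complexity_proxy([0, 1] * 8))  # 1.0 (alternates at every step)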
Safety Gates
from bit_transformer.safety import SafetyGate, safe_sample_with_retry
# Configure safety gate
gate = SafetyGate(
c_floor=0.3, # Minimum complexity
s_floor=0.5, # Minimum symbiosis
decay=0.9, # EMA decay factor
burn_in=10 # Steps before gating starts
)
# Check whether output should be blocked (gating only activates after the burn-in period)
should_block = gate.should_trigger(c_val=0.2, s_val=0.4)  # True once active: both values fall below their floors
# Safe sampling with automatic retry
output = safe_sample_with_retry(
model,
input_bits,
max_retries=3,
retry_strategy="diffusion" # Try diffusion mode on failure
)
Metric Drift Detection
from bit_transformer.telemetry import detect_metric_drift
# Monitor metric stability over time
metrics_history = [
{"K": 0.5, "C": 0.6, "S": 0.7},
{"K": 0.52, "C": 0.58, "S": 0.69},
{"K": 0.8, "C": 0.9, "S": 0.4}, # Drift detected!
# ... more metrics
]
drift_detected = detect_metric_drift(
metrics_history,
window=10, # Look back 10 steps
threshold=0.2 # Alert if change > 0.2
)
if drift_detected:
    print("⚠️ Model behavior drift detected!")
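Conceptually, windowed drift detection compares each metric's latest value against its recent average. The sketch below is one plausible reading of that idea, not the actual detect_metric_drift implementation.
def drift_check(history: list[dict], window: int = 10, threshold: float = 0.2) -> bool:
    """Flag drift if any metric's latest value deviates from its windowed mean."""
    if len(history) < 2:
        return False
    recent = history[-window:]
    latest = history[-1]
    for key in latest:
        mean = sum(m[key] for m in recent) / len(recent)
        if abs(latest[key] - mean) > threshold:
            return True
    return False

print(drift_check(metrics_history))  # True: C jumps by more than 0.2 relative to its windowed mean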
Distributed Training
FSDP (Fully Sharded Data Parallel)
from bit_transformer.distributed import wrap_fsdp, setup_distributed
import torch.distributed as dist
# Initialize distributed training
setup_distributed(rank=0, world_size=4)
# Wrap model with FSDP
model = BitTransformerLM(d_model=1024, nhead=16, num_layers=12)
fsdp_model = wrap_fsdp(
model,
sharding_strategy="FULL_SHARD", # or "SHARD_GRAD_OP", "NO_SHARD"
mixed_precision=True,
device_id=0
)
# Train with FSDP
train_loop(
fsdp_model,
data,
epochs=10,
batch_size=2, # Smaller batch per GPU
amp=True
)
Pipeline Parallelism
from bit_transformer.distributed import make_pipeline
# Create pipeline parallel model
pipeline_model = make_pipeline(
model,
balance=[2, 2, 2, 2], # Split 8 layers across 4 GPUs
devices=[0, 1, 2, 3],
checkpoint="never" # or "always", "except_last"
)
# Pipeline training requires special handling
# See unified_workflow.py for complete implementation
Multi-GPU Training Script
# Single node, multiple GPUs (torch.distributed.launch is deprecated in recent
# PyTorch releases; torchrun accepts the same arguments)
python -m torch.distributed.launch \
--nproc_per_node=4 \
unified_workflow.py \
--distributed \
--batch-size 2 \
--epochs 10
# Multiple nodes
python -m torch.distributed.launch \
--nnodes=2 \
--node_rank=0 \
--master_addr="192.168.1.100" \
--master_port=29500 \
--nproc_per_node=4 \
unified_workflow.py \
--distributed
Performance Optimization
Memory Optimization
# Enable all memory optimizations
model = BitTransformerLM(
d_model=512,
nhead=8,
num_layers=8,
reversible=True, # Reversible layers save ~50% memory
use_checkpoint=True, # Gradient checkpointing
chunk_size=64, # Chunked attention for long sequences
full_attn_logging=False # Skip full attention reconstruction
)
# Training optimizations
train_loop(
model,
data,
batch_size=4, # Smaller batches
accum_steps=8, # Gradient accumulation
amp=True, # Mixed precision
compile_model=True # torch.compile
)
CPU Optimization
from bit_transformer.torch_utils import cpu_autocast
# Enable BF16 on CPU
with cpu_autocast():
    logits, telemetry = model(bits)
# Or enable for entire model
model = BitTransformerLM(use_autocast=True) # Automatically uses CPU BF16
Inference Optimization
# Quantize for inference
from bit_transformer import quantize_dynamic
from bit_transformer.utils import set_dropout
# Switch to evaluation mode
model.eval()
set_dropout(model, 0.0)
# Dynamic quantization
quantized = quantize_dynamic(model, dtype=torch.qint8)
# Optimize for inference
with torch.no_grad():
    logits, _ = quantized(input_bits)
Long Sequence Processing
from bit_transformer.model import infer_long_sequence
# Process sequences longer than max_seq_len
long_text = "Very long text..." * 1000
bits = text_to_bits(long_text)
output = infer_long_sequence(
model,
torch.tensor(bits).unsqueeze(0),
chunk_size=256, # Process in 256-bit chunks
overlap=32, # 32-bit overlap between chunks
stride=224 # 224-bit stride (256-32)
)
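Under the hood, sliding-window inference processes overlapping chunks and reconciles predictions where windows overlap. The sketch below illustrates only the window arithmetic; infer_long_sequence's actual stitching strategy may differ.
def chunk_windows(total_len: int, chunk_size: int = 256, overlap: int = 32):
    """Yield (start, end) index pairs for overlapping windows over a sequence."""
    stride = chunk_size - overlap
    start = 0
    while start < total_len:
        end = min(start + chunk_size, total_len)
        yield start, end
        if end == total_len:
            break
        start += stride

print(list(chunk_windows(600)))
# [(0, 256), (224, 480), (448, 600)]; each window overlaps the previous one by 32 bits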
Troubleshooting
Common Issues
1. Memory Errors
RuntimeError: CUDA out of memory
Solutions:
- Enable reversible layers: reversible=True
- Enable gradient checkpointing: use_checkpoint=True
- Reduce batch size or use gradient accumulation
- Use chunked attention: chunk_size=64
- Enable mixed precision: amp=True
2. Tensor Shape Mismatches
RuntimeError: view size is not compatible with input tensor's size
Solutions:
- Always use .reshape() instead of .view() with BitTransformerLM
- Check that input sequences are properly formatted (1D for bits)
- Ensure batch dimensions are consistent
3. Parity Check Failures
ValueError: Parity check failed
Solutions:
- Use enforce_parity() to fix parity bits in generated sequences
- Check that text encoding/decoding is consistent
- Verify bit sequences have the correct 9-bit (8 data + parity) structure; the sketch below shows a quick parity check
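A quick way to locate bad bytes is to re-check parity per 9-bit group. This sketch assumes the same even-parity convention as the encoding sketch earlier in this guide; adjust if the library uses a different convention.
def find_parity_errors(bits: list[int]) -> list[int]:
    """Return the indices of 9-bit groups whose parity bit does not match."""
    bad = []
    for group_idx in range(len(bits) // 9):
        group = bits[group_idx * 9:(group_idx + 1) * 9]
        if sum(group) % 2 != 0:  # even parity: each full group must sum to an even count
            bad.append(group_idx)
    return bad

print(find_parity_errors([0, 1, 0, 0, 1, 0, 0, 0, 0]))  # [] (parity holds)
print(find_parity_errors([0, 1, 0, 0, 1, 0, 0, 0, 1]))  # [0] (flipped parity bit)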
4. Safety Gate Triggering
SafetyError: Output blocked by safety gate
Solutions:
- Lower safety thresholds: c_floor=0.2, s_floor=0.4
- Increase the burn-in period: burn_in=20
- Use retry with diffusion: safe_sample_with_retry()
- Check model training quality
Debug Mode
# Enable detailed logging
import logging
logging.basicConfig(level=logging.DEBUG)
# Model with debug telemetry
model = BitTransformerLM(
d_model=64,
nhead=4,
num_layers=2,
full_attn_logging=True, # Log full attention maps
chunk_size=None # Disable chunking for debugging
)
# Inspect telemetry
logits, telemetry = model(input_bits)
print("Telemetry keys:", list(telemetry.keys()))
print("Attention maps shape:", [a.shape for a in telemetry['attention_maps']])
print("Activation stats:", torch.stack(telemetry['activations']).describe())
Performance Profiling
import torch.profiler
import torch.nn.functional as F
# Profile a single training step (assumes input_bits and targets already exist)
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_stack=True,
) as prof:
    logits, telemetry = model(input_bits)
    loss = F.cross_entropy(logits.reshape(-1, 2), targets.reshape(-1))
    loss.backward()
print(prof.key_averages().table(sort_by="cuda_time_total"))
Best Practices
Model Configuration
For Experimentation (< 1M parameters)
model = BitTransformerLM(
d_model=64,
nhead=4,
num_layers=2,
dim_feedforward=128,
max_seq_len=128,
reversible=False, # Simpler for debugging
use_checkpoint=False
)
For Research (1M-100M parameters)
model = BitTransformerLM(
d_model=256,
nhead=8,
num_layers=6,
dim_feedforward=1024,
max_seq_len=512,
reversible=True, # Enable memory efficiency
use_checkpoint=True,
chunk_size=128,
lambda_K=0.05, # Light regularization
lambda_C=0.05,
lambda_S=0.05
)
For Large-Scale (100M+ parameters)
model = BitTransformerLM(
d_model=1024,
nhead=16,
num_layers=20,
dim_feedforward=4096,
max_seq_len=2048,
reversible=True,
use_checkpoint=True,
chunk_size=256,
full_attn_logging=False, # Save memory
lambda_K=0.1,
lambda_C=0.1,
lambda_S=0.1
)
Training Best Practices
- Always validate on held-out data to monitor overfitting
- Use gradient clipping to prevent training instability
- Monitor telemetry metrics for signs of model degradation
- Start with smaller models before scaling up
- Use safety gates in production deployments
- Enable logging to track training progress
- Save checkpoints frequently to prevent loss of progress; a minimal sketch follows this list
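A minimal checkpointing routine with plain torch.save covers the last bullet; the repository may ship its own save/load helpers, so treat this as a generic fallback.
import torch

def save_checkpoint(model, optimizer, epoch: int, path: str = "checkpoint.pt") -> None:
    """Persist model and optimizer state so training can resume after a crash."""
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path: str = "checkpoint.pt") -> int:
    """Restore state in place and return the epoch to resume from."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1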
Data Preparation
# Good: Clean, well-formatted text
texts = [
"The quick brown fox jumps over the lazy dog.",
"Machine learning is transforming technology.",
"BitTransformer processes information at the bit level."
]
# Convert to training sequences
all_bits = []
for text in texts:
bits = text_to_bits(text)
all_bits.extend(bits)
# Create overlapping sequences for better learning
data = torch.tensor(all_bits)
seq_len = 128
stride = 64
sequences = []
for i in range(0, len(data) - seq_len, stride):
sequences.append(data[i:i + seq_len])
training_data = torch.stack(sequences)
Production Deployment
import logging
import torch
from bit_transformer import quantize_dynamic, text_to_bits
from bit_transformer.safety import SafetyGate, safe_sample_with_retry
from bit_transformer.utils import set_dropout
# Production-ready model setup
model.eval() # Disable dropout
set_dropout(model, 0.0)
# Enable safety monitoring
gate = SafetyGate(c_floor=0.3, s_floor=0.5, burn_in=5)
# Quantize for efficiency
production_model = quantize_dynamic(model)
# Safe inference with monitoring
def safe_generate(input_text, max_length=100):
    try:
        bits = torch.tensor(text_to_bits(input_text)).unsqueeze(0)  # add batch dimension
        return safe_sample_with_retry(
            production_model,
            bits,
            max_retries=3
        )
    except Exception as e:
        logging.error(f"Generation failed: {e}")
        return "Error: Unable to generate safe output"
Getting Help
Documentation Resources
- ABOUTME.md: Project overview and quick start
- README.md: Professional model card and specifications
- RESEARCH_STATUS.md: Current research status and limitations
- EMPIRICAL_VALIDATION.md: Evidence-based analysis of capabilities
Community Support
- GitHub Issues: Report bugs and request features
- Discussions: Ask questions and share experiences
- Examples: Check the tests/ directory for usage examples
🤖 Recommended: Use with Claude Code
For the best experience with BitTransformerLM, we recommend using Claude Code:
- Interactive Setup: Get step-by-step guidance for configuration
- Real-time Debugging: Immediate help when things go wrong
- Code Generation: Custom scripts and experiments tailored to your needs
- Architecture Explanation: Deep understanding of bit-native processing
- Best Practices: Learn optimal configurations for your use case
Claude Code understands BitTransformerLM's unique architecture and can help you navigate the complexities of bit-level language modeling.
Remember: BitTransformerLM is experimental research software. Always validate results thoroughly and use safety monitoring in any deployment.
Happy experimenting! 🤖✨