# BitTransformerLM User Guide

**Version:** 0.1.0 Experimental  
**Last Updated:** August 2025  
**Recommended Setup:** Use with [Claude Code](https://claude.ai/code) for optimal experience  

## Table of Contents

1. [Quick Start](#quick-start)
2. [Architecture Overview](#architecture-overview)
3. [Core Features](#core-features)
4. [Installation & Setup](#installation--setup)
5. [Basic Usage Examples](#basic-usage-examples)
6. [Advanced Features](#advanced-features)
7. [Training Your Own Models](#training-your-own-models)
8. [Safety and Monitoring](#safety-and-monitoring)
9. [Distributed Training](#distributed-training)
10. [Performance Optimization](#performance-optimization)
11. [Troubleshooting](#troubleshooting)
12. [Best Practices](#best-practices)

---

## Quick Start

BitTransformerLM is an experimental transformer language model that operates directly on binary sequences (bits) rather than tokens. This unique approach enables fine-grained control over information processing and built-in safety monitoring.

### Minimal Example
```python
from bit_transformer import BitTransformerLM, example_training_step

# Run basic example
loss, telemetry = example_training_step()
print(f"Training loss: {loss}")
print(f"Available telemetry: {list(telemetry.keys())}")
```

### Text Processing Example
```python
import torch

from bit_transformer import BitTransformerLM, text_to_bits, bits_to_text

# Create model
model = BitTransformerLM(
    d_model=128,
    nhead=4,
    num_layers=2,
    dim_feedforward=256,
    max_seq_len=256
)

# Convert text to bits and process
text = "Hello, world!"
bits = text_to_bits(text)
bit_tensor = torch.tensor(bits).unsqueeze(0)

# Forward pass
logits, telemetry = model(bit_tensor)
print(f"Input bits: {len(bits)}")
print(f"Output shape: {logits.shape}")
print(f"Telemetry metrics: {list(telemetry.keys())}")
```

---

## Architecture Overview

### Bit-Native Processing
Unlike traditional language models that use token embeddings, BitTransformerLM processes raw binary sequences:

- **Input**: Text → UTF-8 bytes → Bits with parity protection (9 bits per byte; see the encoding sketch after this list)
- **Processing**: Multi-head attention on bit embeddings
- **Output**: Probability distribution over next bit (0 or 1)
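
A minimal sketch of the 9-bits-per-byte encoding is shown below. It is illustrative only: the exact bit order and parity convention used by the library's `text_to_bits` may differ.

```python
def encode_with_parity(text: str) -> list[int]:
    """Toy encoder: 8 data bits (MSB first) plus one even-parity bit per byte."""
    bits = []
    for byte in text.encode("utf-8"):
        byte_bits = [(byte >> i) & 1 for i in range(7, -1, -1)]
        parity = sum(byte_bits) % 2  # makes each 9-bit group sum to an even number
        bits.extend(byte_bits + [parity])
    return bits

print(len(encode_with_parity("Hi")))  # 2 bytes -> 18 bits
```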

### Key Innovations

#### 1. **Reversible Transformer Layers**
- Memory-efficient computation that doesn't store intermediate activations
- Enables training of deeper models with same memory footprint
- Mathematically reversible operations for gradient computation (see the coupling sketch after this list)
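
The memory saving comes from a coupling structure like the RevNet-style sketch below, which lets each layer's inputs be recomputed from its outputs instead of being stored. This is a generic illustration, not BitTransformerLM's exact block layout.

```python
import torch
import torch.nn as nn

class ReversibleCoupling(nn.Module):
    """Generic reversible coupling (RevNet-style), shown for intuition only."""

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1: torch.Tensor, y2: torch.Tensor):
        # Inputs are exactly recoverable, so activations need not be cached.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2
```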

#### 2. **Built-in Safety Telemetry** 
- **K (Negentropy)**: Measures information content vs random noise
- **C (LZ Complexity)**: Proxy for compressibility and pattern complexity  
- **S (Symbiosis)**: Alignment with reference distributions
- Real-time monitoring and safety gates

#### 3. **Dual-Mode Operation**
- **Causal Mode**: Traditional autoregressive generation
- **Diffusion Mode**: Bidirectional denoising for higher quality output

#### 4. **Progressive Scaling**
- Dynamic architecture expansion based on validation performance
- Automatic addition of layers, width, or context length
- Curriculum learning from simple to complex patterns

---

## Core Features

### Text Processing
- **Parity-Protected Encoding**: Each byte gets a parity bit for error detection
- **UTF-8 Support**: Full Unicode text processing capability
- **Bidirectional Processing**: Support for both causal and diffusion modes

### Safety & Monitoring
- **Real-time Telemetry**: K/C/S metrics computed during inference
- **Safety Gates**: Automatic blocking of unsafe outputs
- **Metric Drift Detection**: Alerts when model behavior changes
- **Human-in-the-Loop**: Safe inference with retry mechanisms

### Memory Efficiency
- **Reversible Layers**: Significant memory savings for deep models
- **Gradient Checkpointing**: Trade compute for memory in training
- **Dynamic Quantization**: Runtime INT8 conversion for inference
- **4-bit QAT**: Quantization-aware training for extreme efficiency

### Advanced Training
- **Distributed Training**: FSDP and pipeline parallelism support
- **Mixed Precision**: FP16/BF16 optimization with CPU autocast
- **Compression Pipeline**: Run-length encoding for efficient storage
- **Progressive Curriculum**: Automatic difficulty scaling

---

## Installation & Setup

### Requirements
- Python 3.10 or later
- PyTorch 2.7.1 or later
- CUDA (optional, for GPU acceleration)

### Installation
```bash
# Clone repository
git clone https://huggingface.co/WCNegentropy/BitTransformerLM
cd BitTransformerLM

# Install dependencies
pip install -r requirements.txt

# For GPU support (optional)
pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118
```

### Quick Test
```bash
# Run basic example
python example.py

# Expected output:
# Training loss: [some value]
# Available telemetry: ['activations', 'attention_maps', ...]
```

### **🤖 Recommended: Setup with Claude Code**

For the best experience, we recommend using [Claude Code](https://claude.ai/code) to set up and work with BitTransformerLM:

1. **Open Claude Code** and navigate to your project directory
2. **Clone the repository**: Claude Code can help with git operations and dependency management  
3. **Interactive Setup**: Claude Code can guide you through configuration options and explain parameters
4. **Real-time Assistance**: Get help with model architecture, training parameters, and debugging
5. **Code Generation**: Generate custom training scripts and experiments with AI assistance

Claude Code provides contextual understanding of BitTransformerLM's unique architecture and can help you avoid common pitfalls when working with bit-native processing.

---

## Basic Usage Examples

### 1. Creating Models

```python
from bit_transformer import BitTransformerLM

# Small model for experimentation
small_model = BitTransformerLM(
    d_model=64,           # Embedding dimension
    nhead=4,              # Number of attention heads
    num_layers=2,         # Number of transformer layers
    dim_feedforward=128,  # Feedforward dimension
    max_seq_len=128,      # Maximum sequence length
    reversible=True,      # Use memory-efficient reversible layers
    use_checkpoint=True   # Enable gradient checkpointing
)

# Medium model for research
medium_model = BitTransformerLM(
    d_model=512,
    nhead=8, 
    num_layers=8,
    dim_feedforward=2048,
    max_seq_len=512,
    reversible=True,
    use_checkpoint=True,
    chunk_size=64,        # Chunked attention for long sequences
    lambda_K=0.1,         # Negentropy regularization weight
    lambda_C=0.1,         # Complexity regularization weight
    lambda_S=0.1          # Symbiosis regularization weight
)
```

### 2. Text Generation

```python
from bit_transformer.bit_io import sample_text

# Generate text from prompt
prompt = "The future of AI is"
generated = sample_text(
    model,
    prompt=prompt,
    max_new_tokens=20,    # Generate ~20 new characters
    temperature=0.8,      # Sampling temperature
    top_p=0.9            # Nucleus sampling
)
print(f"Generated: {generated}")
```

### 3. Safe Inference

```python
from bit_transformer import hil_safe_inference, text_to_bits
import torch

# Convert text to bits
text = "Hello, world!"
bits = torch.tensor(text_to_bits(text)).unsqueeze(0)

# Safe inference with telemetry monitoring
try:
    output_bits, telemetry = hil_safe_inference(
        model, 
        bits,
        c_floor=0.3,     # Minimum complexity threshold
        s_floor=0.5,     # Minimum symbiosis threshold
        strict=True      # Throw error if thresholds not met
    )
    print("✅ Safe inference completed")
    print(f"K (Negentropy): {telemetry.get('negentropy_logits', 'N/A')}")
    print(f"C (Complexity): {telemetry.get('lz_complexity_logits', 'N/A')}")
    print(f"S (Symbiosis): {telemetry.get('symbiosis_score', 'N/A')}")
except Exception as e:
    print(f"⚠️ Safety check failed: {e}")
```

### 4. Interactive Dashboard

```bash
# Launch the interactive dashboard
python unified_workflow.py --dashboard
```

Or programmatically:

```python
from bit_transformer.dashboard_app import run_dashboard

run_dashboard(host="localhost", port=5000)
```

The dashboard provides:
- Real-time training monitoring
- Telemetry visualization  
- Model configuration controls
- HuggingFace checkpoint management
- Safe inference testing interface

---

## Advanced Features

### 1. Diffusion Mode Training

Diffusion mode enables bidirectional processing for higher quality generation:

```bash
# Train with diffusion mode
python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32

# Different noise schedules
python unified_workflow.py --diffusion --noise-schedule cosine --diffusion-steps 16

# Diffusion curriculum (noise decay over epochs)
python unified_workflow.py --diffusion --diffusion-curriculum
```

**Diffusion Parameters:**
- `--diffusion-steps`: Number of denoising steps (higher = better quality)
- `--noise-schedule`: `linear`, `cosine`, or `exp` noise decay
- `--diffusion-curriculum`: Gradually reduce noise over training epochs

### 2. Progressive Scaling

Enable automatic model growth based on performance:

```python
import torch

from bit_transformer import BitTransformerLM
from bit_transformer.training import train_loop
from bit_transformer.scale import expand_model

# Training loop with automatic scaling
model = BitTransformerLM(d_model=64, nhead=4, num_layers=2, dim_feedforward=128)
train_data = torch.randint(0, 2, (1000, 64))

# Train with progressive scaling
train_loop(
    model,
    train_data,
    epochs=10,
    batch_size=8,
    # Progressive scaling will automatically trigger when validation loss plateaus
)

# Manual model expansion
expanded_model = expand_model(model, strategy="depth")  # Add layers
expanded_model = expand_model(model, strategy="width")  # Increase width
expanded_model = expand_model(model, strategy="context")  # Extend context
```

### 3. Compression Pipeline

BitTransformerLM includes run-length encoding for efficient data storage:

```python
import torch

from bit_transformer import compress_bits, decompress_bits

# Compress bit sequences
original_bits = torch.tensor([0, 0, 0, 1, 1, 0, 1, 1, 1])
compressed = compress_bits(original_bits)
decompressed = decompress_bits(compressed)

print(f"Original: {original_bits}")
print(f"Compressed: {compressed}")  
print(f"Decompressed: {decompressed}")
print(f"Compression ratio: {len(original_bits) / len(compressed):.2f}")

# Use compression in training
train_loop(
    model,
    data,
    compress_prob=0.5,    # 50% of training uses compressed data
    compress_warmup=100   # Start compression after 100 steps
)
```

### 4. Quantization and Optimization

```python
import torch

from bit_transformer import quantize_dynamic, prepare_qat_fx, convert_qat_fx

# Dynamic quantization for inference
quantized_model = quantize_dynamic(model, dtype=torch.qint8)

# 4-bit quantization-aware training
qat_model = prepare_qat_fx(model)
# ... train qat_model ...
final_model = convert_qat_fx(qat_model)

# Enable mixed precision and compilation
train_loop(
    model,
    data,
    amp=True,           # Enable automatic mixed precision
    compile_model=True  # Use torch.compile for speedup
)
```

---

## Training Your Own Models

### Basic Training Script

```python
import torch
from bit_transformer import BitTransformerLM, train_loop, configure_optimizer
from bit_transformer.bit_io import text_to_bits

# Prepare training data
texts = ["Hello world", "How are you?", "BitTransformer is working!"]
all_bits = []
for text in texts:
    bits = text_to_bits(text)
    all_bits.extend(bits)

# Convert to tensor and create sequences
data = torch.tensor(all_bits)
sequences = data.unfold(0, 64, 32)  # 64-bit sequences with 32-bit stride

# Create model
model = BitTransformerLM(
    d_model=128,
    nhead=8,
    num_layers=4,
    dim_feedforward=512,
    max_seq_len=64,
    reversible=True
)

# Configure optimizer
optimizer = configure_optimizer(model, lr=0.001, weight_decay=0.01)

# Training loop
train_loop(
    model,
    sequences,
    epochs=10,
    batch_size=4,
    optimizer=optimizer,
    amp=True,          # Mixed precision
    log=True           # Enable logging
)
```

### Advanced Training Configuration

```python
# Advanced training with all features enabled
train_loop(
    model,
    data,
    epochs=20,
    batch_size=8,
    accum_steps=4,            # Gradient accumulation
    amp=True,                 # Mixed precision
    compile_model=True,       # torch.compile optimization
    
    # Compression settings
    compress_prob=0.3,        # 30% compression probability
    compress_warmup=50,       # Start compression after 50 steps
    
    # Diffusion settings  
    diffusion=True,           # Enable diffusion mode
    diffusion_curriculum=True, # Decay noise over epochs
    
    # Direct bit training
    direct_prob=0.1,          # 10% direct bit prediction
    
    # Logging
    log=True                  # Enable detailed logging
)
```

### Custom Training Loop

```python
import torch
import torch.nn.functional as F
from bit_transformer.utils import set_dropout

# Manual training loop for full control
model.train()
set_dropout(model, 0.1)  # Enable dropout for training

optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
criterion = F.cross_entropy

for epoch in range(10):
    total_loss = 0
    for batch in data_loader:
        optimizer.zero_grad()
        
        # Forward pass
        logits, telemetry = model(batch)
        
        # Compute loss
        if logits.dim() == 3:  # (batch, seq, 2)
            targets = batch[:, 1:]  # Next bit prediction
            logits = logits[:, :-1]  # Remove last prediction
            loss = criterion(logits.reshape(-1, 2), targets.reshape(-1))
        else:
            loss = criterion(logits, batch)
        
        # Add telemetry regularization
        if model.lambda_K > 0:
            loss += model.lambda_K * (1 - telemetry.get('negentropy_logits', 0))
        if model.lambda_C > 0:
            loss += model.lambda_C * (1 - telemetry.get('lz_complexity_logits', 0))
            
        # Backward pass
        loss.backward()
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        total_loss += loss.item()
        
        # Safety check
        if telemetry.get('symbiosis_score', 1.0) < 0.3:
            print("⚠️ Low symbiosis score detected")
    
    print(f"Epoch {epoch}: Average loss = {total_loss / len(data_loader):.4f}")
```

---

## Safety and Monitoring

### Telemetry Metrics

BitTransformerLM provides three key safety metrics:

#### K (Negentropy) - Information Content
- **Range**: 0-1 (0 = random noise, 1 = perfectly ordered)
- **Purpose**: Measures departure from randomness (a toy calculation follows these definitions)
- **Interpretation**: 
  - Very low K (< 0.1): Output is noise-like
  - Moderate K (0.3-0.7): Structured but varied output  
  - Very high K (> 0.9): Repetitive or overly structured

#### C (LZ Complexity) - Pattern Complexity
- **Range**: 0-1 (higher = more complex patterns)
- **Purpose**: Proxy for Lempel-Ziv compressibility
- **Interpretation**:
  - Low C (< 0.3): Highly repetitive patterns
  - Moderate C (0.3-0.7): Balanced complexity
  - High C (> 0.8): Complex, varied patterns

#### S (Symbiosis) - Distribution Alignment  
- **Range**: 0-1 (higher = better alignment)
- **Purpose**: Agreement with reference distributions via KL divergence
- **Interpretation**:
  - Low S (< 0.3): Poor alignment with expected patterns
  - Moderate S (0.5-0.8): Good alignment
  - High S (> 0.8): Excellent alignment
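
To build intuition for these numbers, here is a toy negentropy-style score computed on a raw bit sequence. It is illustrative only; the model's telemetry derives K/C/S from its own predicted distributions rather than this exact formula.

```python
import math

def negentropy_score(bits: list[int]) -> float:
    """Toy K-like score: 1 minus the Shannon entropy of the bit distribution."""
    p = sum(bits) / len(bits)
    if p in (0.0, 1.0):
        return 1.0  # constant stream: perfectly ordered
    entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # in [0, 1] bits
    return 1.0 - entropy

print(negentropy_score([0, 1] * 32))  # 0.0 -> balanced, noise-like
print(negentropy_score([1] * 64))     # 1.0 -> perfectly ordered
```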

### Safety Gates

```python
from bit_transformer.safety import SafetyGate, safe_sample_with_retry

# Configure safety gate
gate = SafetyGate(
    c_floor=0.3,      # Minimum complexity
    s_floor=0.5,      # Minimum symbiosis  
    decay=0.9,        # EMA decay factor
    burn_in=10        # Steps before gating starts
)

# Check if output should be blocked
should_block = gate.should_trigger(c_val=0.2, s_val=0.4)  # True - below thresholds

# Safe sampling with automatic retry
output = safe_sample_with_retry(
    model,
    input_bits,
    max_retries=3,
    retry_strategy="diffusion"  # Try diffusion mode on failure
)
```

### Metric Drift Detection

```python
from bit_transformer.telemetry import detect_metric_drift

# Monitor metric stability over time
metrics_history = [
    {"K": 0.5, "C": 0.6, "S": 0.7},
    {"K": 0.52, "C": 0.58, "S": 0.69},  
    {"K": 0.8, "C": 0.9, "S": 0.4},   # Drift detected!
    # ... more metrics
]

drift_detected = detect_metric_drift(
    metrics_history,
    window=10,        # Look back 10 steps
    threshold=0.2     # Alert if change > 0.2
)

if drift_detected:
    print("⚠️ Model behavior drift detected!")
```

---

## Distributed Training

### FSDP (Fully Sharded Data Parallel)

```python
from bit_transformer.distributed import wrap_fsdp, setup_distributed
import torch.distributed as dist

# Initialize distributed training
setup_distributed(rank=0, world_size=4)

# Wrap model with FSDP
model = BitTransformerLM(d_model=1024, nhead=16, num_layers=12)
fsdp_model = wrap_fsdp(
    model,
    sharding_strategy="FULL_SHARD",  # or "SHARD_GRAD_OP", "NO_SHARD"
    mixed_precision=True,
    device_id=0
)

# Train with FSDP
train_loop(
    fsdp_model,
    data,
    epochs=10,
    batch_size=2,    # Smaller batch per GPU
    amp=True
)
```

### Pipeline Parallelism

```python  
from bit_transformer.distributed import make_pipeline

# Create pipeline parallel model
pipeline_model = make_pipeline(
    model,
    balance=[2, 2, 2, 2],  # Split 8 layers across 4 GPUs
    devices=[0, 1, 2, 3],
    checkpoint="never"     # or "always", "except_last"
)

# Pipeline training requires special handling
# See unified_workflow.py for complete implementation
```

### Multi-GPU Training Script

```bash
# Single node, multiple GPUs
python -m torch.distributed.launch \
    --nproc_per_node=4 \
    unified_workflow.py \
    --distributed \
    --batch-size 2 \
    --epochs 10

# Multiple nodes
python -m torch.distributed.launch \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="192.168.1.100" \
    --master_port=29500 \
    --nproc_per_node=4 \
    unified_workflow.py \
    --distributed
```

---

## Performance Optimization

### Memory Optimization

```python
# Enable all memory optimizations
model = BitTransformerLM(
    d_model=512,
    nhead=8,
    num_layers=8,
    reversible=True,          # Reversible layers save ~50% memory
    use_checkpoint=True,      # Gradient checkpointing
    chunk_size=64,            # Chunked attention for long sequences
    full_attn_logging=False   # Skip full attention reconstruction
)

# Training optimizations
train_loop(
    model,
    data,
    batch_size=4,            # Smaller batches
    accum_steps=8,           # Gradient accumulation  
    amp=True,                # Mixed precision
    compile_model=True       # torch.compile
)
```

### CPU Optimization

```python
from bit_transformer.torch_utils import cpu_autocast

# Enable BF16 on CPU
with cpu_autocast():
    logits, telemetry = model(bits)

# Or enable for entire model
model = BitTransformerLM(use_autocast=True)  # Automatically uses CPU BF16
```

### Inference Optimization

```python
# Quantize for inference
import torch

from bit_transformer import quantize_dynamic
from bit_transformer.utils import set_dropout

# Switch to evaluation mode
model.eval()
set_dropout(model, 0.0)

# Dynamic quantization
quantized = quantize_dynamic(model, dtype=torch.qint8)

# Optimize for inference
with torch.no_grad():
    logits, _ = quantized(input_bits)
```

### Long Sequence Processing

```python
import torch

from bit_transformer import text_to_bits
from bit_transformer.model import infer_long_sequence

# Process sequences longer than max_seq_len
long_text = "Very long text..." * 1000
bits = text_to_bits(long_text)

output = infer_long_sequence(
    model,
    torch.tensor(bits).unsqueeze(0),
    chunk_size=256,      # Process in 256-bit chunks
    overlap=32,          # 32-bit overlap between chunks
    stride=224           # 224-bit stride (256-32)
)
```

---

## Troubleshooting

### Common Issues

#### 1. **Memory Errors**
```
RuntimeError: CUDA out of memory
```
**Solutions:**
- Enable reversible layers: `reversible=True`
- Enable gradient checkpointing: `use_checkpoint=True`  
- Reduce batch size or use gradient accumulation
- Use chunked attention: `chunk_size=64`
- Enable mixed precision: `amp=True`

#### 2. **Tensor Shape Mismatches**
```
RuntimeError: view size is not compatible with input tensor's size
```
**Solutions:**
- Always use `.reshape()` instead of `.view()` with BitTransformerLM (see the sketch below)
- Check that input sequences are properly formatted (1D for bits)
- Ensure batch dimensions are consistent
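
The `.reshape()` versus `.view()` point can be reproduced in isolation: `.view()` requires a contiguous memory layout, while `.reshape()` copies when necessary.

```python
import torch

x = torch.randint(0, 2, (4, 8))
t = x.t()             # transposing makes the tensor non-contiguous
# t.view(-1)          # raises the "view size is not compatible ..." error
flat = t.reshape(-1)  # reshape copies when needed, so it always succeeds
print(flat.shape)     # torch.Size([32])
```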

#### 3. **Parity Check Failures**
```
ValueError: Parity check failed
```
**Solutions:**
- Use `enforce_parity()` to fix parity bits in generated sequences (a toy parity check is sketched below)
- Check that text encoding/decoding is consistent
- Verify bit sequences have correct 9-bit (8+parity) structure
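
As a sanity check, a toy validator for the 8+parity structure might look like the sketch below. It assumes an even-parity convention (each 9-bit group sums to an even number); the library's `enforce_parity()` is the supported path and its convention may differ.

```python
def parity_ok(bits: list[int]) -> bool:
    """Toy check that every 9-bit group (8 data bits + parity) has even parity."""
    if len(bits) % 9 != 0:
        return False
    return all(sum(bits[i:i + 9]) % 2 == 0 for i in range(0, len(bits), 9))
```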

#### 4. **Safety Gate Triggering**
```
SafetyError: Output blocked by safety gate
```
**Solutions:**
- Lower safety thresholds: `c_floor=0.2, s_floor=0.4`
- Increase burn-in period: `burn_in=20`
- Use retry with diffusion: `safe_sample_with_retry()`
- Check model training quality

### Debug Mode

```python
# Enable detailed logging
import logging
logging.basicConfig(level=logging.DEBUG)

# Model with debug telemetry
model = BitTransformerLM(
    d_model=64,
    nhead=4,
    num_layers=2,
    full_attn_logging=True,  # Log full attention maps
    chunk_size=None          # Disable chunking for debugging
)

# Inspect telemetry
logits, telemetry = model(input_bits)
print("Telemetry keys:", list(telemetry.keys()))
print("Attention maps shape:", [a.shape for a in telemetry['attention_maps']])
print("Activation stats:", torch.stack(telemetry['activations']).describe())
```

### Performance Profiling

```python
import torch.profiler

# Profile training step
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_stack=True,
) as prof:
    logits, telemetry = model(input_bits)
    loss = F.cross_entropy(logits.reshape(-1, 2), targets.reshape(-1))
    loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total"))
```

---

## Best Practices

### Model Configuration

#### For Experimentation (< 1M parameters)
```python
model = BitTransformerLM(
    d_model=64,
    nhead=4,
    num_layers=2,
    dim_feedforward=128,
    max_seq_len=128,
    reversible=False,    # Simpler for debugging
    use_checkpoint=False
)
```

#### For Research (1M-100M parameters)  
```python
model = BitTransformerLM(
    d_model=256,
    nhead=8,
    num_layers=6,
    dim_feedforward=1024,
    max_seq_len=512,
    reversible=True,     # Enable memory efficiency
    use_checkpoint=True,
    chunk_size=128,
    lambda_K=0.05,       # Light regularization
    lambda_C=0.05,
    lambda_S=0.05
)
```

#### For Large-Scale (100M+ parameters)
```python
model = BitTransformerLM(
    d_model=1024,
    nhead=16, 
    num_layers=20,
    dim_feedforward=4096,
    max_seq_len=2048,
    reversible=True,
    use_checkpoint=True,
    chunk_size=256,
    full_attn_logging=False,  # Save memory
    lambda_K=0.1,
    lambda_C=0.1,
    lambda_S=0.1
)
```

### Training Best Practices

1. **Always validate on held-out data** to monitor overfitting
2. **Use gradient clipping** to prevent training instability  
3. **Monitor telemetry metrics** for signs of model degradation
4. **Start with smaller models** before scaling up
5. **Use safety gates** in production deployments
6. **Enable logging** to track training progress
7. **Save checkpoints frequently** to prevent loss of progress (a minimal pattern is sketched below)
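
Item 7 can be as simple as the generic `torch.save` pattern sketched below; the paths are placeholders, and the dashboard's HuggingFace checkpoint management is a separate, project-specific path.

```python
import os
import torch

def save_checkpoint(model, optimizer, epoch, path="checkpoints/bitlm.pt"):
    """Minimal checkpoint: model and optimizer state plus the current epoch."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path="checkpoints/bitlm.pt"):
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["epoch"]
```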

### Data Preparation

```python
# Good: Clean, well-formatted text
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is transforming technology.",
    "BitTransformer processes information at the bit level."
]

# Convert to training sequences
all_bits = []
for text in texts:
    bits = text_to_bits(text)
    all_bits.extend(bits)

# Create overlapping sequences for better learning
data = torch.tensor(all_bits)
seq_len = 128
stride = 64
sequences = []
for i in range(0, len(data) - seq_len, stride):
    sequences.append(data[i:i + seq_len])

training_data = torch.stack(sequences)
```

### Production Deployment

```python
# Production-ready model setup
model.eval()  # Disable dropout
set_dropout(model, 0.0)

# Enable safety monitoring
gate = SafetyGate(c_floor=0.3, s_floor=0.5, burn_in=5)

# Quantize for efficiency
production_model = quantize_dynamic(model)

# Safe inference with monitoring
def safe_generate(input_text, max_length=100):
    try:
        return safe_sample_with_retry(
            production_model,
            text_to_bits(input_text),
            max_retries=3
        )
    except Exception as e:
        logging.error(f"Generation failed: {e}")
        return "Error: Unable to generate safe output"
```

---

## Getting Help

### Documentation Resources
- **ABOUTME.md**: Project overview and quick start
- **README.md**: Professional model card and specifications  
- **RESEARCH_STATUS.md**: Current research status and limitations
- **EMPIRICAL_VALIDATION.md**: Evidence-based analysis of capabilities

### Community Support
- **GitHub Issues**: Report bugs and request features
- **Discussions**: Ask questions and share experiences
- **Examples**: Check the `tests/` directory for usage examples

### **🤖 Recommended: Use with Claude Code**

For the best experience with BitTransformerLM, we recommend using [Claude Code](https://claude.ai/code):

- **Interactive Setup**: Get step-by-step guidance for configuration
- **Real-time Debugging**: Immediate help when things go wrong
- **Code Generation**: Custom scripts and experiments tailored to your needs
- **Architecture Explanation**: Deep understanding of bit-native processing
- **Best Practices**: Learn optimal configurations for your use case

Claude Code understands BitTransformerLM's unique architecture and can help you navigate the complexities of bit-level language modeling.

---

**Remember: BitTransformerLM is experimental research software. Always validate results thoroughly and use safety monitoring in any deployment.**

Happy experimenting! 🤖✨