# BitTransformerLM User Guide
**Version:** 0.1.0 Experimental
**Last Updated:** August 2025
**Recommended Setup:** Use with [Claude Code](https://claude.ai/code) for optimal experience
## Table of Contents
1. [Quick Start](#quick-start)
2. [Architecture Overview](#architecture-overview)
3. [Core Features](#core-features)
4. [Installation & Setup](#installation--setup)
5. [Basic Usage Examples](#basic-usage-examples)
6. [Advanced Features](#advanced-features)
7. [Training Your Own Models](#training-your-own-models)
8. [Safety and Monitoring](#safety-and-monitoring)
9. [Distributed Training](#distributed-training)
10. [Performance Optimization](#performance-optimization)
11. [Troubleshooting](#troubleshooting)
12. [Best Practices](#best-practices)
---
## Quick Start
BitTransformerLM is an experimental transformer language model that operates directly on binary sequences (bits) rather than tokens. This unique approach enables fine-grained control over information processing and built-in safety monitoring.
### Minimal Example
```python
from bit_transformer import BitTransformerLM, example_training_step
# Run basic example
loss, telemetry = example_training_step()
print(f"Training loss: {loss}")
print(f"Available telemetry: {list(telemetry.keys())}")
```
### Text Processing Example
```python
import torch
from bit_transformer import BitTransformerLM, text_to_bits, bits_to_text
# Create model
model = BitTransformerLM(
d_model=128,
nhead=4,
num_layers=2,
dim_feedforward=256,
max_seq_len=256
)
# Convert text to bits and process
text = "Hello, world!"
bits = text_to_bits(text)
bit_tensor = torch.tensor(bits).unsqueeze(0)
# Forward pass
logits, telemetry = model(bit_tensor)
print(f"Input bits: {len(bits)}")
print(f"Output shape: {logits.shape}")
print(f"Telemetry metrics: {list(telemetry.keys())}")
```
---
## Architecture Overview
### Bit-Native Processing
Unlike traditional language models that use token embeddings, BitTransformerLM processes raw binary sequences:
- **Input**: Text → UTF-8 bytes → Bits with parity protection (9 bits per byte; see the sketch after this list)
- **Processing**: Multi-head attention on bit embeddings
- **Output**: Probability distribution over next bit (0 or 1)
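To make the encoding concrete, the sketch below walks a string through UTF-8 bytes into 9-bit groups of 8 data bits plus one parity bit. The even-parity, MSB-first layout here is an illustrative assumption; use `text_to_bits`/`bits_to_text` from the package for the canonical encoding.
```python
# Illustrative only: 8 data bits + 1 parity bit per byte (even parity assumed).
# The library's text_to_bits/bits_to_text implement the canonical encoding.
def encode_with_parity(text: str) -> list[int]:
    bits = []
    for byte in text.encode("utf-8"):
        data = [(byte >> i) & 1 for i in range(7, -1, -1)]  # MSB-first data bits
        parity = sum(data) % 2                               # even-parity bit
        bits.extend(data + [parity])
    return bits

bits = encode_with_parity("Hi")
print(len(bits))  # 18 bits: 2 bytes * 9 bits each
```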
### Key Innovations
#### 1. **Reversible Transformer Layers**
- Memory-efficient computation that recomputes intermediate activations during the backward pass instead of storing them
- Enables training deeper models within the same memory footprint
- Mathematically reversible operations for gradient computation
#### 2. **Built-in Safety Telemetry**
- **K (Negentropy)**: Measures information content vs random noise
- **C (LZ Complexity)**: Proxy for compressibility and pattern complexity
- **S (Symbiosis)**: Alignment with reference distributions
- Real-time monitoring and safety gates
#### 3. **Dual-Mode Operation**
- **Causal Mode**: Traditional autoregressive generation
- **Diffusion Mode**: Bidirectional denoising for higher quality output
#### 4. **Progressive Scaling**
- Dynamic architecture expansion based on validation performance
- Automatic addition of layers, width, or context length
- Curriculum learning from simple to complex patterns
---
## Core Features
### Text Processing
- **Parity-Protected Encoding**: Each byte gets a parity bit for error detection
- **UTF-8 Support**: Full Unicode text processing capability
- **Bidirectional Processing**: Support for both causal and diffusion modes
### Safety & Monitoring
- **Real-time Telemetry**: K/C/S metrics computed during inference
- **Safety Gates**: Automatic blocking of unsafe outputs
- **Metric Drift Detection**: Alerts when model behavior changes
- **Human-in-the-Loop**: Safe inference with retry mechanisms
### Memory Efficiency
- **Reversible Layers**: Significant memory savings for deep models
- **Gradient Checkpointing**: Trade compute for memory in training
- **Dynamic Quantization**: Runtime INT8 conversion for inference
- **4-bit QAT**: Quantization-aware training for extreme efficiency
### Advanced Training
- **Distributed Training**: FSDP and pipeline parallelism support
- **Mixed Precision**: FP16/BF16 optimization with CPU autocast
- **Compression Pipeline**: Run-length encoding for efficient storage
- **Progressive Curriculum**: Automatic difficulty scaling
---
## Installation & Setup
### Requirements
- Python 3.10 or later
- PyTorch 2.7.1 or later
- CUDA (optional, for GPU acceleration)
### Installation
```bash
# Clone repository
git clone https://huggingface.co/WCNegentropy/BitTransformerLM
cd BitTransformerLM
# Install dependencies
pip install -r requirements.txt
# For GPU support (optional)
pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118
```
### Quick Test
```bash
# Run basic example
python example.py
# Expected output:
# Training loss: [some value]
# Available telemetry: ['activations', 'attention_maps', ...]
```
### **🤖 Recommended: Setup with Claude Code**
For the best experience, we recommend using [Claude Code](https://claude.ai/code) to set up and work with BitTransformerLM:
1. **Open Claude Code** and navigate to your project directory
2. **Clone the repository**: Claude Code can help with git operations and dependency management
3. **Interactive Setup**: Claude Code can guide you through configuration options and explain parameters
4. **Real-time Assistance**: Get help with model architecture, training parameters, and debugging
5. **Code Generation**: Generate custom training scripts and experiments with AI assistance
Claude Code provides contextual understanding of BitTransformerLM's unique architecture and can help you avoid common pitfalls when working with bit-native processing.
---
## Basic Usage Examples
### 1. Creating Models
```python
from bit_transformer import BitTransformerLM
# Small model for experimentation
small_model = BitTransformerLM(
d_model=64, # Embedding dimension
nhead=4, # Number of attention heads
num_layers=2, # Number of transformer layers
dim_feedforward=128, # Feedforward dimension
max_seq_len=128, # Maximum sequence length
reversible=True, # Use memory-efficient reversible layers
use_checkpoint=True # Enable gradient checkpointing
)
# Medium model for research
medium_model = BitTransformerLM(
d_model=512,
nhead=8,
num_layers=8,
dim_feedforward=2048,
max_seq_len=512,
reversible=True,
use_checkpoint=True,
chunk_size=64, # Chunked attention for long sequences
lambda_K=0.1, # Negentropy regularization weight
lambda_C=0.1, # Complexity regularization weight
lambda_S=0.1 # Symbiosis regularization weight
)
```
### 2. Text Generation
```python
from bit_transformer.bit_io import sample_text
# Generate text from prompt
prompt = "The future of AI is"
generated = sample_text(
model,
prompt=prompt,
max_new_tokens=20, # Generate ~20 new characters
temperature=0.8, # Sampling temperature
top_p=0.9 # Nucleus sampling
)
print(f"Generated: {generated}")
```
### 3. Safe Inference
```python
from bit_transformer import hil_safe_inference, text_to_bits
import torch
# Convert text to bits
text = "Hello, world!"
bits = torch.tensor(text_to_bits(text)).unsqueeze(0)
# Safe inference with telemetry monitoring
try:
output_bits, telemetry = hil_safe_inference(
model,
bits,
c_floor=0.3, # Minimum complexity threshold
s_floor=0.5, # Minimum symbiosis threshold
strict=True # Throw error if thresholds not met
)
print("✅ Safe inference completed")
print(f"K (Negentropy): {telemetry.get('negentropy_logits', 'N/A')}")
print(f"C (Complexity): {telemetry.get('lz_complexity_logits', 'N/A')}")
print(f"S (Symbiosis): {telemetry.get('symbiosis_score', 'N/A')}")
except Exception as e:
print(f"⚠️ Safety check failed: {e}")
```
### 4. Interactive Dashboard
```bash
# Launch the interactive dashboard from the command line
python unified_workflow.py --dashboard
```
```python
# Or programmatically
from bit_transformer.dashboard_app import run_dashboard
run_dashboard(host="localhost", port=5000)
```
The dashboard provides:
- Real-time training monitoring
- Telemetry visualization
- Model configuration controls
- HuggingFace checkpoint management
- Safe inference testing interface
---
## Advanced Features
### 1. Diffusion Mode Training
Diffusion mode enables bidirectional processing for higher quality generation:
```bash
# Train with diffusion mode
python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32
# Different noise schedules
python unified_workflow.py --diffusion --noise-schedule cosine --diffusion-steps 16
# Diffusion curriculum (noise decay over epochs)
python unified_workflow.py --diffusion --diffusion-curriculum
```
**Diffusion Parameters:**
- `--diffusion-steps`: Number of denoising steps (higher = better quality)
- `--noise-schedule`: `linear`, `cosine`, or `exp` noise decay (illustrated in the sketch after this list)
- `--diffusion-curriculum`: Gradually reduce noise over training epochs
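To make the schedule options above concrete, here is an illustrative sketch of how the noise fraction could decay per denoising step under each schedule; the exact curves used by `unified_workflow.py` may differ.
```python
import math

# Illustrative noise schedules; the library's actual implementation may differ.
def noise_level(step: int, total_steps: int, schedule: str = "linear") -> float:
    """Return the fraction of bits to corrupt at a given denoising step (1.0 = pure noise)."""
    t = step / max(total_steps - 1, 1)               # progress in [0, 1]
    if schedule == "linear":
        return 1.0 - t                                # straight-line decay
    if schedule == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * t))    # slow start, slow finish
    if schedule == "exp":
        return math.exp(-5.0 * t)                     # fast early decay
    raise ValueError(f"unknown schedule: {schedule}")

for s in range(4):
    print(s, round(noise_level(s, 4, "cosine"), 3))
```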
### 2. Progressive Scaling
Enable automatic model growth based on performance:
```python
from bit_transformer.training import train_loop
from bit_transformer.scale import expand_model
# Training loop with automatic scaling
model = BitTransformerLM(d_model=64, nhead=4, num_layers=2, dim_feedforward=128)
train_data = torch.randint(0, 2, (1000, 64))
# Train with progressive scaling
train_loop(
model,
train_data,
epochs=10,
batch_size=8,
# Progressive scaling will automatically trigger when validation loss plateaus
)
# Manual model expansion
expanded_model = expand_model(model, strategy="depth") # Add layers
expanded_model = expand_model(model, strategy="width") # Increase width
expanded_model = expand_model(model, strategy="context") # Extend context
```
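The comment in `train_loop` above refers to plateau-based expansion. As a rough illustration of that idea (not the library's internal heuristic), you could track validation losses yourself and call `expand_model` when improvement stalls:
```python
# Illustrative plateau check (not the library's internal heuristic): expand the
# model when validation loss stops improving for `patience` consecutive epochs.
# Uses expand_model imported from bit_transformer.scale above.
def maybe_expand(model, val_losses, patience=3, min_delta=1e-3, strategy="depth"):
    if len(val_losses) <= patience:
        return model
    recent_best = min(val_losses[-patience:])
    earlier_best = min(val_losses[:-patience])
    if earlier_best - recent_best < min_delta:   # no meaningful improvement
        return expand_model(model, strategy=strategy)
    return model
```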
### 3. Compression Pipeline
BitTransformerLM includes run-length encoding for efficient data storage:
```python
from bit_transformer import compress_bits, decompress_bits
# Compress bit sequences
original_bits = torch.tensor([0, 0, 0, 1, 1, 0, 1, 1, 1])
compressed = compress_bits(original_bits)
decompressed = decompress_bits(compressed)
print(f"Original: {original_bits}")
print(f"Compressed: {compressed}")
print(f"Decompressed: {decompressed}")
print(f"Compression ratio: {len(original_bits) / len(compressed):.2f}")
# Use compression in training
train_loop(
model,
data,
compress_prob=0.5, # 50% of training uses compressed data
compress_warmup=100 # Start compression after 100 steps
)
```
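Under the hood this is run-length encoding. A conceptual sketch of the idea on a plain Python list is shown below; the library's `compress_bits` may use a different container format and operates on tensors.
```python
# Conceptual run-length encoding of a bit sequence (illustration only; the
# library's compress_bits may store runs in a different format).
def rle_encode(bits):
    runs, count = [], 1
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append((prev, count))
            count = 1
    runs.append((bits[-1], count))
    return runs

print(rle_encode([0, 0, 0, 1, 1, 0, 1, 1, 1]))  # [(0, 3), (1, 2), (0, 1), (1, 3)]
```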
### 4. Quantization and Optimization
```python
from bit_transformer import quantize_dynamic, prepare_qat_fx, convert_qat_fx
# Dynamic quantization for inference
quantized_model = quantize_dynamic(model, dtype=torch.qint8)
# 4-bit quantization-aware training
qat_model = prepare_qat_fx(model)
# ... train qat_model ...
final_model = convert_qat_fx(qat_model)
# Enable mixed precision and compilation
train_loop(
model,
data,
amp=True, # Enable automatic mixed precision
compile_model=True # Use torch.compile for speedup
)
```
---
## Training Your Own Models
### Basic Training Script
```python
import torch
from bit_transformer import BitTransformerLM, train_loop, configure_optimizer
from bit_transformer.bit_io import text_to_bits
# Prepare training data
texts = ["Hello world", "How are you?", "BitTransformer is working!"]
all_bits = []
for text in texts:
bits = text_to_bits(text)
all_bits.extend(bits)
# Convert to tensor and create sequences
data = torch.tensor(all_bits)
sequences = data.unfold(0, 64, 32) # 64-bit sequences with 32-bit stride
# Create model
model = BitTransformerLM(
d_model=128,
nhead=8,
num_layers=4,
dim_feedforward=512,
max_seq_len=64,
reversible=True
)
# Configure optimizer
optimizer = configure_optimizer(model, lr=0.001, weight_decay=0.01)
# Training loop
train_loop(
model,
sequences,
epochs=10,
batch_size=4,
optimizer=optimizer,
amp=True, # Mixed precision
log=True # Enable logging
)
```
### Advanced Training Configuration
```python
# Advanced training with all features enabled
train_loop(
model,
data,
epochs=20,
batch_size=8,
accum_steps=4, # Gradient accumulation
amp=True, # Mixed precision
compile_model=True, # torch.compile optimization
# Compression settings
compress_prob=0.3, # 30% compression probability
compress_warmup=50, # Start compression after 50 steps
# Diffusion settings
diffusion=True, # Enable diffusion mode
diffusion_curriculum=True, # Decay noise over epochs
# Direct bit training
direct_prob=0.1, # 10% direct bit prediction
# Logging
log=True # Enable detailed logging
)
```
### Custom Training Loop
```python
import torch.nn.functional as F
from bit_transformer.utils import set_dropout
# Manual training loop for full control
model.train()
set_dropout(model, 0.1) # Enable dropout for training
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
criterion = F.cross_entropy
for epoch in range(10):
total_loss = 0
for batch in data_loader:
optimizer.zero_grad()
# Forward pass
logits, telemetry = model(batch)
# Compute loss
if logits.dim() == 3: # (batch, seq, 2)
targets = batch[:, 1:] # Next bit prediction
logits = logits[:, :-1] # Remove last prediction
loss = criterion(logits.reshape(-1, 2), targets.reshape(-1))
else:
loss = criterion(logits, batch)
# Add telemetry regularization
if model.lambda_K > 0:
loss += model.lambda_K * (1 - telemetry.get('negentropy_logits', 0))
if model.lambda_C > 0:
loss += model.lambda_C * (1 - telemetry.get('lz_complexity_logits', 0))
# Backward pass
loss.backward()
# Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
total_loss += loss.item()
# Safety check
if telemetry.get('symbiosis_score', 1.0) < 0.3:
print("⚠️ Low symbiosis score detected")
print(f"Epoch {epoch}: Average loss = {total_loss / len(data_loader):.4f}")
```
---
## Safety and Monitoring
### Telemetry Metrics
BitTransformerLM provides three key safety metrics:
#### K (Negentropy) - Information Content
- **Range**: 0-1 (0 = random noise, 1 = perfectly ordered)
- **Purpose**: Measures departure from randomness
- **Interpretation**:
- Very low K (< 0.1): Output is noise-like
- Moderate K (0.3-0.7): Structured but varied output
- Very high K (> 0.9): Repetitive or overly structured
#### C (LZ Complexity) - Pattern Complexity
- **Range**: 0-1 (higher = more complex patterns)
- **Purpose**: Proxy for Lempel-Ziv compressibility
- **Interpretation**:
- Low C (< 0.3): Highly repetitive patterns
- Moderate C (0.3-0.7): Balanced complexity
- High C (> 0.8): Complex, varied patterns
#### S (Symbiosis) - Distribution Alignment
- **Range**: 0-1 (higher = better alignment)
- **Purpose**: Agreement with reference distributions via KL divergence
- **Interpretation**:
- Low S (< 0.3): Poor alignment with expected patterns
- Moderate S (0.5-0.8): Good alignment
- High S (> 0.8): Excellent alignment
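If you want a feel for what K and C capture, the rough proxies below compute a bit-level negentropy and an LZ78-style phrase count on a plain Python list of bits. These are illustrative only; the model's telemetry computes its own versions of these metrics (on logits as well as bits), so the exact values will differ.
```python
import math

# Rough, illustrative proxies for K and C; not the library's exact formulas.
def negentropy_proxy(bits):
    p1 = sum(bits) / len(bits)
    if p1 in (0.0, 1.0):
        return 1.0                          # perfectly ordered
    entropy = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))
    return 1.0 - entropy                    # 0 = max-entropy noise, 1 = ordered

def lz_complexity_proxy(bits):
    # Count distinct phrases an LZ78-style parser would emit, normalized to [0, 1].
    phrases, current = set(), ""
    for b in bits:
        current += str(b)
        if current not in phrases:
            phrases.add(current)
            current = ""
    return min(len(phrases) / (len(bits) / 2), 1.0)

# An all-ones sequence is highly ordered: high K, low C.
print(negentropy_proxy([1] * 64), lz_complexity_proxy([1] * 64))
```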
### Safety Gates
```python
from bit_transformer.safety import SafetyGate, safe_sample_with_retry
# Configure safety gate
gate = SafetyGate(
c_floor=0.3, # Minimum complexity
s_floor=0.5, # Minimum symbiosis
decay=0.9, # EMA decay factor
burn_in=10 # Steps before gating starts
)
# Check if output should be blocked
should_block = gate.should_trigger(c_val=0.2, s_val=0.4) # True - below thresholds
# Safe sampling with automatic retry
output = safe_sample_with_retry(
model,
input_bits,
max_retries=3,
retry_strategy="diffusion" # Try diffusion mode on failure
)
```
### Metric Drift Detection
```python
from bit_transformer.telemetry import detect_metric_drift
# Monitor metric stability over time
metrics_history = [
{"K": 0.5, "C": 0.6, "S": 0.7},
{"K": 0.52, "C": 0.58, "S": 0.69},
{"K": 0.8, "C": 0.9, "S": 0.4}, # Drift detected!
# ... more metrics
]
drift_detected = detect_metric_drift(
metrics_history,
window=10, # Look back 10 steps
threshold=0.2 # Alert if change > 0.2
)
if drift_detected:
print("⚠️ Model behavior drift detected!")
```
---
## Distributed Training
### FSDP (Fully Sharded Data Parallel)
```python
from bit_transformer.distributed import wrap_fsdp, setup_distributed
import torch.distributed as dist
# Initialize distributed training
setup_distributed(rank=0, world_size=4)
# Wrap model with FSDP
model = BitTransformerLM(d_model=1024, nhead=16, num_layers=12)
fsdp_model = wrap_fsdp(
model,
sharding_strategy="FULL_SHARD", # or "SHARD_GRAD_OP", "NO_SHARD"
mixed_precision=True,
device_id=0
)
# Train with FSDP
train_loop(
fsdp_model,
data,
epochs=10,
batch_size=2, # Smaller batch per GPU
amp=True
)
```
### Pipeline Parallelism
```python
from bit_transformer.distributed import make_pipeline
# Create pipeline parallel model
pipeline_model = make_pipeline(
model,
balance=[2, 2, 2, 2], # Split 8 layers across 4 GPUs
devices=[0, 1, 2, 3],
checkpoint="never" # or "always", "except_last"
)
# Pipeline training requires special handling
# See unified_workflow.py for complete implementation
```
### Multi-GPU Training Script
```bash
# Single node, multiple GPUs
python -m torch.distributed.launch \
--nproc_per_node=4 \
unified_workflow.py \
--distributed \
--batch-size 2 \
--epochs 10
# Multiple nodes
python -m torch.distributed.launch \
--nnodes=2 \
--node_rank=0 \
--master_addr="192.168.1.100" \
--master_port=29500 \
--nproc_per_node=4 \
unified_workflow.py \
--distributed
```
---
## Performance Optimization
### Memory Optimization
```python
# Enable all memory optimizations
model = BitTransformerLM(
d_model=512,
nhead=8,
num_layers=8,
reversible=True, # Reversible layers save ~50% memory
use_checkpoint=True, # Gradient checkpointing
chunk_size=64, # Chunked attention for long sequences
full_attn_logging=False # Skip full attention reconstruction
)
# Training optimizations
train_loop(
model,
data,
batch_size=4, # Smaller batches
accum_steps=8, # Gradient accumulation
amp=True, # Mixed precision
compile_model=True # torch.compile
)
```
### CPU Optimization
```python
from bit_transformer.torch_utils import cpu_autocast
# Enable BF16 on CPU
with cpu_autocast():
logits, telemetry = model(bits)
# Or enable for entire model
model = BitTransformerLM(use_autocast=True) # Automatically uses CPU BF16
```
### Inference Optimization
```python
# Quantize for inference
from bit_transformer import quantize_dynamic
from bit_transformer.utils import set_dropout
# Switch to evaluation mode
model.eval()
set_dropout(model, 0.0)
# Dynamic quantization
quantized = quantize_dynamic(model, dtype=torch.qint8)
# Optimize for inference
with torch.no_grad():
logits, _ = quantized(input_bits)
```
### Long Sequence Processing
```python
import torch
from bit_transformer import text_to_bits
from bit_transformer.model import infer_long_sequence
# Process sequences longer than max_seq_len
long_text = "Very long text..." * 1000
bits = text_to_bits(long_text)
output = infer_long_sequence(
model,
torch.tensor(bits).unsqueeze(0),
chunk_size=256, # Process in 256-bit chunks
overlap=32, # 32-bit overlap between chunks
stride=224 # 224-bit stride (256-32)
)
```
---
## Troubleshooting
### Common Issues
#### 1. **Memory Errors**
```
RuntimeError: CUDA out of memory
```
**Solutions:**
- Enable reversible layers: `reversible=True`
- Enable gradient checkpointing: `use_checkpoint=True`
- Reduce batch size or use gradient accumulation
- Use chunked attention: `chunk_size=64`
- Enable mixed precision: `amp=True`
#### 2. **Tensor Shape Mismatches**
```
RuntimeError: view size is not compatible with input tensor's size
```
**Solutions:**
- Always use `.reshape()` instead of `.view()` with BitTransformerLM
- Check that input sequences are properly formatted (1D for bits)
- Ensure batch dimensions are consistent
#### 3. **Parity Check Failures**
```
ValueError: Parity check failed
```
**Solutions:**
- Use `enforce_parity()` to fix parity bits in generated sequences
- Check that text encoding/decoding is consistent
- Verify bit sequences have the correct 9-bit (8 data + parity) structure; a quick check is sketched below
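As a quick sanity check for the 9-bit structure, the sketch below assumes even parity stored in the last position of each group; adjust if your encoding uses a different convention.
```python
# Quick structural check, assuming 9-bit groups with an even-parity bit in the
# last position; adjust if the library uses a different parity convention.
def check_parity(bits):
    if len(bits) % 9 != 0:
        return False
    for i in range(0, len(bits), 9):
        data, parity = bits[i:i + 8], bits[i + 8]
        if sum(data) % 2 != parity:
            return False
    return True

# 'A' = 0b01000001 -> two set bits -> even parity bit 0
print(check_parity([0, 1, 0, 0, 0, 0, 0, 1, 0]))  # True
```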
#### 4. **Safety Gate Triggering**
```
SafetyError: Output blocked by safety gate
```
**Solutions:**
- Lower safety thresholds: `c_floor=0.2, s_floor=0.4`
- Increase burn-in period: `burn_in=20`
- Use retry with diffusion: `safe_sample_with_retry()`
- Check model training quality
### Debug Mode
```python
# Enable detailed logging
import logging
logging.basicConfig(level=logging.DEBUG)
# Model with debug telemetry
model = BitTransformerLM(
d_model=64,
nhead=4,
num_layers=2,
full_attn_logging=True, # Log full attention maps
chunk_size=None # Disable chunking for debugging
)
# Inspect telemetry
logits, telemetry = model(input_bits)
print("Telemetry keys:", list(telemetry.keys()))
print("Attention maps shape:", [a.shape for a in telemetry['attention_maps']])
print("Activation stats:", torch.stack(telemetry['activations']).describe())
```
### Performance Profiling
```python
import torch.profiler
import torch.nn.functional as F
# Profile training step
with torch.profiler.profile(
activities=[
torch.profiler.ProfilerActivity.CPU,
torch.profiler.ProfilerActivity.CUDA,
],
record_shapes=True,
with_stack=True,
) as prof:
logits, telemetry = model(input_bits)
loss = F.cross_entropy(logits.reshape(-1, 2), targets.reshape(-1))  # targets prepared as in the custom training loop above
loss.backward()
print(prof.key_averages().table(sort_by="cuda_time_total"))
```
---
## Best Practices
### Model Configuration
#### For Experimentation (< 1M parameters)
```python
model = BitTransformerLM(
d_model=64,
nhead=4,
num_layers=2,
dim_feedforward=128,
max_seq_len=128,
reversible=False, # Simpler for debugging
use_checkpoint=False
)
```
#### For Research (1M-100M parameters)
```python
model = BitTransformerLM(
d_model=256,
nhead=8,
num_layers=6,
dim_feedforward=1024,
max_seq_len=512,
reversible=True, # Enable memory efficiency
use_checkpoint=True,
chunk_size=128,
lambda_K=0.05, # Light regularization
lambda_C=0.05,
lambda_S=0.05
)
```
#### For Large-Scale (100M+ parameters)
```python
model = BitTransformerLM(
d_model=1024,
nhead=16,
num_layers=20,
dim_feedforward=4096,
max_seq_len=2048,
reversible=True,
use_checkpoint=True,
chunk_size=256,
full_attn_logging=False, # Save memory
lambda_K=0.1,
lambda_C=0.1,
lambda_S=0.1
)
```
### Training Best Practices
1. **Always validate on held-out data** to monitor overfitting
2. **Use gradient clipping** to prevent training instability
3. **Monitor telemetry metrics** for signs of model degradation
4. **Start with smaller models** before scaling up
5. **Use safety gates** in production deployments
6. **Enable logging** to track training progress
7. **Save checkpoints frequently** to prevent loss of progress (a minimal save/load sketch follows)
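For checkpointing (item 7 above), a minimal save/load sketch using plain PyTorch `state_dict`s looks like this; adapt the path and any extra state (scheduler, telemetry history) to your own training script.
```python
import torch

# Minimal checkpoint save/load sketch; adjust paths and extra state as needed.
def save_checkpoint(model, optimizer, epoch, path="checkpoint.pt"):
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": epoch,
        },
        path,
    )

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"]
```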
### Data Preparation
```python
# Good: Clean, well-formatted text
texts = [
"The quick brown fox jumps over the lazy dog.",
"Machine learning is transforming technology.",
"BitTransformer processes information at the bit level."
]
# Convert to training sequences
all_bits = []
for text in texts:
bits = text_to_bits(text)
all_bits.extend(bits)
# Create overlapping sequences for better learning
data = torch.tensor(all_bits)
seq_len = 128
stride = 64
sequences = []
for i in range(0, len(data) - seq_len, stride):
sequences.append(data[i:i + seq_len])
training_data = torch.stack(sequences)
```
### Production Deployment
```python
# Production-ready model setup
import logging
from bit_transformer import quantize_dynamic, text_to_bits
from bit_transformer.safety import SafetyGate, safe_sample_with_retry
from bit_transformer.utils import set_dropout
model.eval() # Disable dropout
set_dropout(model, 0.0)
# Enable safety monitoring
gate = SafetyGate(c_floor=0.3, s_floor=0.5, burn_in=5)
# Quantize for efficiency
production_model = quantize_dynamic(model)
# Safe inference with monitoring
def safe_generate(input_text, max_length=100):
try:
return safe_sample_with_retry(
production_model,
text_to_bits(input_text),
max_retries=3
)
except Exception as e:
logging.error(f"Generation failed: {e}")
return "Error: Unable to generate safe output"
```
---
## Getting Help
### Documentation Resources
- **ABOUTME.md**: Project overview and quick start
- **README.md**: Professional model card and specifications
- **RESEARCH_STATUS.md**: Current research status and limitations
- **EMPIRICAL_VALIDATION.md**: Evidence-based analysis of capabilities
### Community Support
- **GitHub Issues**: Report bugs and request features
- **Discussions**: Ask questions and share experiences
- **Examples**: Check the `tests/` directory for usage examples
### **🤖 Recommended: Use with Claude Code**
For the best experience with BitTransformerLM, we recommend using [Claude Code](https://claude.ai/code):
- **Interactive Setup**: Get step-by-step guidance for configuration
- **Real-time Debugging**: Immediate help when things go wrong
- **Code Generation**: Custom scripts and experiments tailored to your needs
- **Architecture Explanation**: Deep understanding of bit-native processing
- **Best Practices**: Learn optimal configurations for your use case
Claude Code understands BitTransformerLM's unique architecture and can help you navigate the complexities of bit-level language modeling.
---
**Remember: BitTransformerLM is experimental research software. Always validate results thoroughly and use safety monitoring in any deployment.**
Happy experimenting! 🤖✨