# BitTransformerLM User Guide
**Version:** 0.1.0 Experimental
**Last Updated:** August 2025
**Recommended Setup:** Use with [Claude Code](https://claude.ai/code) for optimal experience
## Table of Contents
1. [Quick Start](#quick-start)
2. [Architecture Overview](#architecture-overview)
3. [Core Features](#core-features)
4. [Installation & Setup](#installation--setup)
5. [Basic Usage Examples](#basic-usage-examples)
6. [Advanced Features](#advanced-features)
7. [Training Your Own Models](#training-your-own-models)
8. [Safety and Monitoring](#safety-and-monitoring)
9. [Distributed Training](#distributed-training)
10. [Performance Optimization](#performance-optimization)
11. [Troubleshooting](#troubleshooting)
12. [Best Practices](#best-practices)
---
## Quick Start
BitTransformerLM is an experimental transformer language model that operates directly on binary sequences (bits) rather than tokens. This unique approach enables fine-grained control over information processing and built-in safety monitoring.
### Minimal Example
```python
from bit_transformer import BitTransformerLM, example_training_step
# Run basic example
loss, telemetry = example_training_step()
print(f"Training loss: {loss}")
print(f"Available telemetry: {list(telemetry.keys())}")
```
### Text Processing Example
```python
from bit_transformer import BitTransformerLM, text_to_bits, bits_to_text
import torch

# Create model
model = BitTransformerLM(
    d_model=128,
    nhead=4,
    num_layers=2,
    dim_feedforward=256,
    max_seq_len=256,
)

# Convert text to bits and process
text = "Hello, world!"
bits = text_to_bits(text)
bit_tensor = torch.tensor(bits).unsqueeze(0)
# Forward pass
logits, telemetry = model(bit_tensor)
print(f"Input bits: {len(bits)}")
print(f"Output shape: {logits.shape}")
print(f"Telemetry metrics: {list(telemetry.keys())}")
```
---
## Architecture Overview
### Bit-Native Processing
Unlike traditional language models that use token embeddings, BitTransformerLM processes raw binary sequences:
- **Input**: Text → UTF-8 bytes → bits with parity protection (9 bits per byte; sketched below)
- **Processing**: Multi-head attention on bit embeddings
- **Output**: Probability distribution over next bit (0 or 1)
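For intuition, here is a minimal sketch of the 9-bits-per-byte scheme described above: 8 data bits plus one parity bit per byte. It assumes an even-parity, MSB-first convention for illustration; the library's own `text_to_bits` may differ in bit order or parity details.

```python
# Illustrative only: 8 data bits per byte plus an even-parity bit (9 bits total).
def byte_to_bits_with_parity(byte: int) -> list[int]:
    data_bits = [(byte >> (7 - i)) & 1 for i in range(8)]  # MSB first
    parity = sum(data_bits) % 2                            # even-parity bit
    return data_bits + [parity]

bits = [b for ch in "Hi".encode("utf-8") for b in byte_to_bits_with_parity(ch)]
print(len(bits))  # 2 bytes -> 18 bits
```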
### Key Innovations
#### 1. **Reversible Transformer Layers**
- Memory-efficient computation that doesn't store intermediate activations
- Enables training of deeper models with the same memory footprint
- Mathematically reversible operations allow activations to be recomputed for gradient computation (see the sketch below)
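To illustrate the principle, below is a generic additive-coupling reversible block (RevNet-style): the inverse recovers the inputs exactly, so activations do not need to be stored for the backward pass. This is only a sketch of the idea, not BitTransformerLM's actual layer implementation.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Generic additive-coupling reversible block (illustrative only)."""

    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Linear(dim, dim)
        self.g = nn.Linear(dim, dim)

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2                 # no intermediate activations need to be stored

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)          # inputs are recomputed exactly
        x1 = y1 - self.f(x2)
        return x1, x2

block = ReversibleBlock(16)
x1, x2 = torch.randn(2, 16), torch.randn(2, 16)
y1, y2 = block(x1, x2)
r1, r2 = block.inverse(y1, y2)
print(torch.allclose(x1, r1), torch.allclose(x2, r2))  # True True
```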
#### 2. **Built-in Safety Telemetry**
- **K (Negentropy)**: Measures information content vs random noise
- **C (LZ Complexity)**: Proxy for compressibility and pattern complexity
- **S (Symbiosis)**: Alignment with reference distributions
- Real-time monitoring and safety gates
#### 3. **Dual-Mode Operation**
- **Causal Mode**: Traditional autoregressive generation
- **Diffusion Mode**: Bidirectional denoising for higher quality output
#### 4. **Progressive Scaling**
- Dynamic architecture expansion based on validation performance
- Automatic addition of layers, width, or context length
- Curriculum learning from simple to complex patterns
---
## Core Features
### Text Processing
- **Parity-Protected Encoding**: Each byte gets a parity bit for error detection
- **UTF-8 Support**: Full Unicode text processing capability
- **Bidirectional Processing**: Support for both causal and diffusion modes
### Safety & Monitoring
- **Real-time Telemetry**: K/C/S metrics computed during inference
- **Safety Gates**: Automatic blocking of unsafe outputs
- **Metric Drift Detection**: Alerts when model behavior changes
- **Human-in-the-Loop**: Safe inference with retry mechanisms
### Memory Efficiency
- **Reversible Layers**: Significant memory savings for deep models
- **Gradient Checkpointing**: Trade compute for memory in training
- **Dynamic Quantization**: Runtime INT8 conversion for inference
- **4-bit QAT**: Quantization-aware training for extreme efficiency
### Advanced Training
- **Distributed Training**: FSDP and pipeline parallelism support
- **Mixed Precision**: FP16/BF16 optimization with CPU autocast
- **Compression Pipeline**: Run-length encoding for efficient storage
- **Progressive Curriculum**: Automatic difficulty scaling
---
## Installation & Setup
### Requirements
- Python 3.10 or later
- PyTorch 2.7.1 or later
- CUDA (optional, for GPU acceleration)
### Installation
```bash
# Clone repository
git clone https://huggingface.co/WCNegentropy/BitTransformerLM
cd BitTransformerLM
# Install dependencies
pip install -r requirements.txt
# For GPU support (optional)
pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118
```
### Quick Test
```bash
# Run basic example
python example.py
# Expected output:
# Training loss: [some value]
# Available telemetry: ['activations', 'attention_maps', ...]
```
### **🤖 Recommended: Setup with Claude Code**
For the best experience, we recommend using [Claude Code](https://claude.ai/code) to set up and work with BitTransformerLM:
1. **Open Claude Code** and navigate to your project directory
2. **Clone the repository**: Claude Code can help with git operations and dependency management
3. **Interactive Setup**: Claude Code can guide you through configuration options and explain parameters
4. **Real-time Assistance**: Get help with model architecture, training parameters, and debugging
5. **Code Generation**: Generate custom training scripts and experiments with AI assistance
Claude Code provides contextual understanding of BitTransformerLM's unique architecture and can help you avoid common pitfalls when working with bit-native processing.
---
## Basic Usage Examples
### 1. Creating Models
```python
from bit_transformer import BitTransformerLM
# Small model for experimentation
small_model = BitTransformerLM(
    d_model=64,            # Embedding dimension
    nhead=4,               # Number of attention heads
    num_layers=2,          # Number of transformer layers
    dim_feedforward=128,   # Feedforward dimension
    max_seq_len=128,       # Maximum sequence length
    reversible=True,       # Use memory-efficient reversible layers
    use_checkpoint=True,   # Enable gradient checkpointing
)

# Medium model for research
medium_model = BitTransformerLM(
    d_model=512,
    nhead=8,
    num_layers=8,
    dim_feedforward=2048,
    max_seq_len=512,
    reversible=True,
    use_checkpoint=True,
    chunk_size=64,   # Chunked attention for long sequences
    lambda_K=0.1,    # Negentropy regularization weight
    lambda_C=0.1,    # Complexity regularization weight
    lambda_S=0.1,    # Symbiosis regularization weight
)
```
### 2. Text Generation
```python
from bit_transformer.bit_io import sample_text
# Generate text from prompt
prompt = "The future of AI is"
generated = sample_text(
    model,
    prompt=prompt,
    max_new_tokens=20,  # Generate ~20 new characters
    temperature=0.8,    # Sampling temperature
    top_p=0.9,          # Nucleus sampling
)
print(f"Generated: {generated}")
```
### 3. Safe Inference
```python
from bit_transformer import hil_safe_inference, text_to_bits
import torch
# Convert text to bits
text = "Hello, world!"
bits = torch.tensor(text_to_bits(text)).unsqueeze(0)
# Safe inference with telemetry monitoring
try:
    output_bits, telemetry = hil_safe_inference(
        model,
        bits,
        c_floor=0.3,   # Minimum complexity threshold
        s_floor=0.5,   # Minimum symbiosis threshold
        strict=True,   # Raise an error if thresholds are not met
    )
    print("✅ Safe inference completed")
    print(f"K (Negentropy): {telemetry.get('negentropy_logits', 'N/A')}")
    print(f"C (Complexity): {telemetry.get('lz_complexity_logits', 'N/A')}")
    print(f"S (Symbiosis): {telemetry.get('symbiosis_score', 'N/A')}")
except Exception as e:
    print(f"⚠️ Safety check failed: {e}")
```
### 4. Interactive Dashboard
```python
# Launch the interactive dashboard from the command line:
#   python unified_workflow.py --dashboard

# Or programmatically:
from bit_transformer.dashboard_app import run_dashboard

run_dashboard(host="localhost", port=5000)
```
The dashboard provides:
- Real-time training monitoring
- Telemetry visualization
- Model configuration controls
- HuggingFace checkpoint management
- Safe inference testing interface
---
## Advanced Features
### 1. Diffusion Mode Training
Diffusion mode enables bidirectional processing for higher quality generation:
```python
# Train with diffusion mode
python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32
# Different noise schedules
python unified_workflow.py --diffusion --noise-schedule cosine --diffusion-steps 16
# Diffusion curriculum (noise decay over epochs)
python unified_workflow.py --diffusion --diffusion-curriculum
```
**Diffusion Parameters:**
- `--diffusion-steps`: Number of denoising steps (higher = better quality)
- `--noise-schedule`: `linear`, `cosine`, or `exp` noise decay (compared in the sketch below)
- `--diffusion-curriculum`: Gradually reduce noise over training epochs
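For intuition, the following sketch shows how the three schedule names might map to noise levels over the denoising steps; the exact parameterization used by `unified_workflow.py` may differ.

```python
import math

# Hypothetical helper illustrating the schedule names above; the decay constants
# are assumptions for the sake of the example.
def noise_level(step: int, total_steps: int, schedule: str = "linear") -> float:
    t = step / max(total_steps - 1, 1)              # progress in [0, 1]
    if schedule == "linear":
        return 1.0 - t                              # noise decays linearly
    if schedule == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * t))  # smooth cosine decay
    if schedule == "exp":
        return math.exp(-5.0 * t)                   # fast exponential decay
    raise ValueError(f"unknown schedule: {schedule}")

print([round(noise_level(s, 8, "cosine"), 3) for s in range(8)])
```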
### 2. Progressive Scaling
Enable automatic model growth based on performance:
```python
import torch

from bit_transformer import BitTransformerLM
from bit_transformer.training import train_loop
from bit_transformer.scale import expand_model

# Training loop with automatic scaling
model = BitTransformerLM(d_model=64, nhead=4, num_layers=2, dim_feedforward=128)
train_data = torch.randint(0, 2, (1000, 64))

# Train with progressive scaling
train_loop(
    model,
    train_data,
    epochs=10,
    batch_size=8,
    # Progressive scaling triggers automatically when validation loss plateaus
)
# Manual model expansion
expanded_model = expand_model(model, strategy="depth") # Add layers
expanded_model = expand_model(model, strategy="width") # Increase width
expanded_model = expand_model(model, strategy="context") # Extend context
```
### 3. Compression Pipeline
BitTransformerLM includes run-length encoding for efficient data storage:
```python
import torch

from bit_transformer import compress_bits, decompress_bits

# Compress bit sequences
original_bits = torch.tensor([0, 0, 0, 1, 1, 0, 1, 1, 1])
compressed = compress_bits(original_bits)
decompressed = decompress_bits(compressed)

print(f"Original: {original_bits}")
print(f"Compressed: {compressed}")
print(f"Decompressed: {decompressed}")
print(f"Compression ratio: {len(original_bits) / len(compressed):.2f}")

# Use compression in training
train_loop(
    model,
    data,
    compress_prob=0.5,    # 50% of training batches use compressed data
    compress_warmup=100,  # Start compression after 100 steps
)
```
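Conceptually, run-length encoding replaces runs of identical bits with (value, length) pairs. The sketch below is illustrative only and does not match the on-disk format used by `compress_bits`.

```python
import torch

# Conceptual run-length encoding of a bit tensor (illustrative, not the
# library's actual compression format).
def run_length_encode(bits: torch.Tensor) -> list[tuple[int, int]]:
    runs, count, current = [], 1, int(bits[0])
    for b in bits[1:].tolist():
        if b == current:
            count += 1
        else:
            runs.append((current, count))
            current, count = b, 1
    runs.append((current, count))
    return runs

print(run_length_encode(torch.tensor([0, 0, 0, 1, 1, 0, 1, 1, 1])))
# [(0, 3), (1, 2), (0, 1), (1, 3)]
```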
### 4. Quantization and Optimization
```python
import torch

from bit_transformer import quantize_dynamic, prepare_qat_fx, convert_qat_fx

# Dynamic quantization for inference
quantized_model = quantize_dynamic(model, dtype=torch.qint8)

# 4-bit quantization-aware training
qat_model = prepare_qat_fx(model)
# ... train qat_model ...
final_model = convert_qat_fx(qat_model)

# Enable mixed precision and compilation
train_loop(
    model,
    data,
    amp=True,            # Enable automatic mixed precision
    compile_model=True,  # Use torch.compile for speedup
)
```
---
## Training Your Own Models
### Basic Training Script
```python
import torch
from bit_transformer import BitTransformerLM, train_loop, configure_optimizer
from bit_transformer.bit_io import text_to_bits
# Prepare training data
texts = ["Hello world", "How are you?", "BitTransformer is working!"]
all_bits = []
for text in texts:
    bits = text_to_bits(text)
    all_bits.extend(bits)

# Convert to tensor and create sequences
data = torch.tensor(all_bits)
sequences = data.unfold(0, 64, 32)  # 64-bit sequences with a 32-bit stride

# Create model
model = BitTransformerLM(
    d_model=128,
    nhead=8,
    num_layers=4,
    dim_feedforward=512,
    max_seq_len=64,
    reversible=True,
)

# Configure optimizer
optimizer = configure_optimizer(model, lr=0.001, weight_decay=0.01)

# Training loop
train_loop(
    model,
    sequences,
    epochs=10,
    batch_size=4,
    optimizer=optimizer,
    amp=True,  # Mixed precision
    log=True,  # Enable logging
)
```
### Advanced Training Configuration
```python
# Advanced training with all features enabled
train_loop(
    model,
    data,
    epochs=20,
    batch_size=8,
    accum_steps=4,              # Gradient accumulation
    amp=True,                   # Mixed precision
    compile_model=True,         # torch.compile optimization
    # Compression settings
    compress_prob=0.3,          # 30% compression probability
    compress_warmup=50,         # Start compression after 50 steps
    # Diffusion settings
    diffusion=True,             # Enable diffusion mode
    diffusion_curriculum=True,  # Decay noise over epochs
    # Direct bit training
    direct_prob=0.1,            # 10% direct bit prediction
    # Logging
    log=True,                   # Enable detailed logging
)
```
### Custom Training Loop
```python
import torch
import torch.nn.functional as F

from bit_transformer.utils import set_dropout

# Manual training loop for full control
model.train()
set_dropout(model, 0.1)  # Enable dropout for training

optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
criterion = F.cross_entropy

for epoch in range(10):
    total_loss = 0
    for batch in data_loader:
        optimizer.zero_grad()

        # Forward pass
        logits, telemetry = model(batch)

        # Compute next-bit prediction loss
        if logits.dim() == 3:  # (batch, seq, 2)
            targets = batch[:, 1:]   # Next-bit targets
            logits = logits[:, :-1]  # Remove the last prediction
            loss = criterion(logits.reshape(-1, 2), targets.reshape(-1))
        else:
            loss = criterion(logits, batch)

        # Add telemetry regularization
        if model.lambda_K > 0:
            loss += model.lambda_K * (1 - telemetry.get('negentropy_logits', 0))
        if model.lambda_C > 0:
            loss += model.lambda_C * (1 - telemetry.get('lz_complexity_logits', 0))

        # Backward pass
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

        total_loss += loss.item()

        # Safety check
        if telemetry.get('symbiosis_score', 1.0) < 0.3:
            print("⚠️ Low symbiosis score detected")

    print(f"Epoch {epoch}: Average loss = {total_loss / len(data_loader):.4f}")
```
---
## Safety and Monitoring
### Telemetry Metrics
BitTransformerLM provides three key safety metrics:
#### K (Negentropy) - Information Content
- **Range**: 0-1 (0 = random noise, 1 = perfectly ordered)
- **Purpose**: Measures departure from randomness
- **Interpretation**:
  - Very low K (< 0.1): Output is noise-like
  - Moderate K (0.3-0.7): Structured but varied output
  - Very high K (> 0.9): Repetitive or overly structured

#### C (LZ Complexity) - Pattern Complexity
- **Range**: 0-1 (higher = more complex patterns)
- **Purpose**: Proxy for Lempel-Ziv compressibility
- **Interpretation**:
  - Low C (< 0.3): Highly repetitive patterns
  - Moderate C (0.3-0.7): Balanced complexity
  - High C (> 0.8): Complex, varied patterns

#### S (Symbiosis) - Distribution Alignment
- **Range**: 0-1 (higher = better alignment)
- **Purpose**: Agreement with reference distributions via KL divergence
- **Interpretation**:
  - Low S (< 0.3): Poor alignment with expected patterns
  - Moderate S (0.5-0.8): Good alignment
  - High S (> 0.8): Excellent alignment
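As a rough illustration of what a negentropy-style score captures, the sketch below computes K-like values directly from raw bit statistics. The model's built-in metrics are computed from logits, so they will not match these numbers exactly.

```python
import math

# Illustrative only: a negentropy-style score K = 1 - H(p), where H(p) is the
# Shannon entropy (in bits) of the empirical bit distribution and 1 bit is the
# maximum entropy of a binary source.
def negentropy(bits: list[int]) -> float:
    p1 = sum(bits) / len(bits)
    if p1 in (0.0, 1.0):
        return 1.0  # perfectly ordered sequence
    h = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))
    return 1.0 - h  # 0 = random-looking, 1 = highly ordered

print(negentropy([0, 1] * 32))          # balanced bits -> 0.0
print(negentropy([1] * 60 + [0] * 4))   # skewed bits -> roughly 0.66
```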
### Safety Gates
```python
from bit_transformer.safety import SafetyGate, safe_sample_with_retry
# Configure safety gate
gate = SafetyGate(
    c_floor=0.3,  # Minimum complexity
    s_floor=0.5,  # Minimum symbiosis
    decay=0.9,    # EMA decay factor
    burn_in=10,   # Steps before gating starts
)

# Check whether an output should be blocked
should_block = gate.should_trigger(c_val=0.2, s_val=0.4)  # True - below thresholds

# Safe sampling with automatic retry
output = safe_sample_with_retry(
    model,
    input_bits,
    max_retries=3,
    retry_strategy="diffusion",  # Fall back to diffusion mode on failure
)
```
### Metric Drift Detection
```python
from bit_transformer.telemetry import detect_metric_drift
# Monitor metric stability over time
metrics_history = [
    {"K": 0.5, "C": 0.6, "S": 0.7},
    {"K": 0.52, "C": 0.58, "S": 0.69},
    {"K": 0.8, "C": 0.9, "S": 0.4},  # Drift detected!
    # ... more metrics
]

drift_detected = detect_metric_drift(
    metrics_history,
    window=10,      # Look back 10 steps
    threshold=0.2,  # Alert if a metric changes by more than 0.2
)

if drift_detected:
    print("⚠️ Model behavior drift detected!")
```
---
## Distributed Training
### FSDP (Fully Sharded Data Parallel)
```python
from bit_transformer.distributed import wrap_fsdp, setup_distributed
import torch.distributed as dist
# Initialize distributed training
setup_distributed(rank=0, world_size=4)
# Wrap model with FSDP
model = BitTransformerLM(d_model=1024, nhead=16, num_layers=12)
fsdp_model = wrap_fsdp(
    model,
    sharding_strategy="FULL_SHARD",  # or "SHARD_GRAD_OP", "NO_SHARD"
    mixed_precision=True,
    device_id=0,
)

# Train with FSDP
train_loop(
    fsdp_model,
    data,
    epochs=10,
    batch_size=2,  # Smaller batch per GPU
    amp=True,
)
```
### Pipeline Parallelism
```python
from bit_transformer.distributed import make_pipeline
# Create pipeline parallel model
pipeline_model = make_pipeline(
    model,
    balance=[2, 2, 2, 2],  # Split 8 layers across 4 GPUs
    devices=[0, 1, 2, 3],
    checkpoint="never",    # or "always", "except_last"
)
# Pipeline training requires special handling
# See unified_workflow.py for complete implementation
```
### Multi-GPU Training Script
```bash
# Single node, multiple GPUs
python -m torch.distributed.launch \
    --nproc_per_node=4 \
    unified_workflow.py \
    --distributed \
    --batch-size 2 \
    --epochs 10

# Multiple nodes
python -m torch.distributed.launch \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="192.168.1.100" \
    --master_port=29500 \
    --nproc_per_node=4 \
    unified_workflow.py \
    --distributed
```
---
## Performance Optimization
### Memory Optimization
```python
# Enable all memory optimizations
model = BitTransformerLM(
    d_model=512,
    nhead=8,
    num_layers=8,
    reversible=True,          # Reversible layers save ~50% memory
    use_checkpoint=True,      # Gradient checkpointing
    chunk_size=64,            # Chunked attention for long sequences
    full_attn_logging=False,  # Skip full attention reconstruction
)

# Training optimizations
train_loop(
    model,
    data,
    batch_size=4,        # Smaller batches
    accum_steps=8,       # Gradient accumulation
    amp=True,            # Mixed precision
    compile_model=True,  # torch.compile
)
```
### CPU Optimization
```python
from bit_transformer.torch_utils import cpu_autocast
# Enable BF16 autocast on CPU
with cpu_autocast():
    logits, telemetry = model(bits)

# Or enable it for the entire model
model = BitTransformerLM(use_autocast=True)  # Automatically uses BF16 on CPU
```
### Inference Optimization
```python
import torch

from bit_transformer import quantize_dynamic
from bit_transformer.utils import set_dropout

# Switch to evaluation mode
model.eval()
set_dropout(model, 0.0)

# Dynamic quantization for inference
quantized = quantize_dynamic(model, dtype=torch.qint8)

# Run inference without gradient tracking
with torch.no_grad():
    logits, _ = quantized(input_bits)
```
### Long Sequence Processing
```python
from bit_transformer.model import infer_long_sequence
# Process sequences longer than max_seq_len
long_text = "Very long text..." * 1000
bits = text_to_bits(long_text)
output = infer_long_sequence(
    model,
    torch.tensor(bits).unsqueeze(0),
    chunk_size=256,  # Process in 256-bit chunks
    overlap=32,      # 32-bit overlap between chunks
    stride=224,      # 224-bit stride (256 - 32)
)
```
---
## Troubleshooting
### Common Issues
#### 1. **Memory Errors**
```
RuntimeError: CUDA out of memory
```
**Solutions:**
- Enable reversible layers: `reversible=True`
- Enable gradient checkpointing: `use_checkpoint=True`
- Reduce batch size or use gradient accumulation
- Use chunked attention: `chunk_size=64`
- Enable mixed precision: `amp=True`
#### 2. **Tensor Shape Mismatches**
```
RuntimeError: view size is not compatible with input tensor's size
```
**Solutions:**
- Always use `.reshape()` instead of `.view()` with BitTransformerLM (see the example below)
- Check that input sequences are properly formatted (1D for bits)
- Ensure batch dimensions are consistent
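The following standalone snippet shows why `.reshape()` is the safer choice: `.view()` fails on non-contiguous tensors (such as transposed attention or telemetry tensors), while `.reshape()` copies the data when necessary.

```python
import torch

# A transpose makes the tensor non-contiguous, which breaks .view()
x = torch.arange(6).reshape(2, 3).t()
try:
    x.view(-1)  # raises RuntimeError on non-contiguous memory
except RuntimeError as e:
    print("view failed:", e)

print(x.reshape(-1))  # reshape copies when needed and succeeds
```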
#### 3. **Parity Check Failures**
```
ValueError: Parity check failed
```
**Solutions:**
- Use `enforce_parity()` to fix parity bits in generated sequences
- Check that text encoding/decoding is consistent
- Verify bit sequences follow the correct 9-bit (8 data bits + parity) structure, as in the sketch below
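As a reference for the expected layout, here is a small illustrative parity check assuming the even-parity, 9-bit-per-byte convention; in practice, use the library's `enforce_parity` helper.

```python
# Illustrative check of the 9-bit (8 data + parity) layout, assuming even parity.
def parity_ok(bits: list[int]) -> bool:
    if len(bits) % 9 != 0:
        return False
    for i in range(0, len(bits), 9):
        byte = bits[i:i + 9]
        if sum(byte[:8]) % 2 != byte[8]:  # parity bit must match the data bits
            return False
    return True

print(parity_ok([0, 1, 0, 0, 1, 0, 0, 0, 0]))  # 'H' with correct parity -> True
```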
#### 4. **Safety Gate Triggering**
```
SafetyError: Output blocked by safety gate
```
**Solutions:**
- Lower safety thresholds: `c_floor=0.2, s_floor=0.4`
- Increase burn-in period: `burn_in=20`
- Use retry with diffusion: `safe_sample_with_retry()`
- Check model training quality
### Debug Mode
```python
# Enable detailed logging
import logging
import torch

logging.basicConfig(level=logging.DEBUG)

# Model with debug telemetry
model = BitTransformerLM(
    d_model=64,
    nhead=4,
    num_layers=2,
    full_attn_logging=True,  # Log full attention maps
    chunk_size=None,         # Disable chunking for debugging
)

# Inspect telemetry
logits, telemetry = model(input_bits)
print("Telemetry keys:", list(telemetry.keys()))
print("Attention map shapes:", [a.shape for a in telemetry['attention_maps']])
activations = torch.stack(telemetry['activations'])
print(f"Activation stats: mean={activations.mean():.4f}, std={activations.std():.4f}")
```
### Performance Profiling
```python
import torch
import torch.nn.functional as F
import torch.profiler

# Profile a single training step
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_stack=True,
) as prof:
    logits, telemetry = model(input_bits)
    loss = F.cross_entropy(logits.reshape(-1, 2), targets.reshape(-1))
    loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total"))
```
---
## Best Practices
### Model Configuration
#### For Experimentation (< 1M parameters)
```python
model = BitTransformerLM(
    d_model=64,
    nhead=4,
    num_layers=2,
    dim_feedforward=128,
    max_seq_len=128,
    reversible=False,      # Simpler for debugging
    use_checkpoint=False,
)
```
#### For Research (1M-100M parameters)
```python
model = BitTransformerLM(
    d_model=256,
    nhead=8,
    num_layers=6,
    dim_feedforward=1024,
    max_seq_len=512,
    reversible=True,       # Enable memory efficiency
    use_checkpoint=True,
    chunk_size=128,
    lambda_K=0.05,         # Light regularization
    lambda_C=0.05,
    lambda_S=0.05,
)
```
#### For Large-Scale (100M+ parameters)
```python
model = BitTransformerLM(
    d_model=1024,
    nhead=16,
    num_layers=20,
    dim_feedforward=4096,
    max_seq_len=2048,
    reversible=True,
    use_checkpoint=True,
    chunk_size=256,
    full_attn_logging=False,  # Save memory
    lambda_K=0.1,
    lambda_C=0.1,
    lambda_S=0.1,
)
```
### Training Best Practices
1. **Always validate on held-out data** to monitor overfitting
2. **Use gradient clipping** to prevent training instability
3. **Monitor telemetry metrics** for signs of model degradation
4. **Start with smaller models** before scaling up
5. **Use safety gates** in production deployments
6. **Enable logging** to track training progress
7. **Save checkpoints frequently** to avoid losing progress (a minimal pattern is sketched below)
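The snippet below sketches a minimal validate-and-checkpoint loop reflecting these practices. The tiny stand-in model and random bit data are placeholders; substitute your BitTransformerLM model and real training/validation splits.

```python
import torch
import torch.nn.functional as F

# Placeholder model and random data stand in for a real BitTransformerLM setup.
model = torch.nn.Linear(8, 2)
data = torch.randint(0, 2, (128, 8)).float()
targets = torch.randint(0, 2, (128,))
train_x, val_x = data[:96], data[96:]          # held-out validation split
train_y, val_y = targets[:96], targets[96:]
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

best_val = float("inf")
for epoch in range(5):
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(train_x), train_y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = F.cross_entropy(model(val_x), val_y).item()  # monitor overfitting
    if val_loss < best_val:
        best_val = val_loss
        torch.save(model.state_dict(), "best_checkpoint.pt")    # keep the best checkpoint
    print(f"epoch {epoch}: val_loss={val_loss:.4f}")
```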
### Data Preparation
```python
import torch

from bit_transformer.bit_io import text_to_bits

# Good: clean, well-formatted text
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is transforming technology.",
    "BitTransformer processes information at the bit level.",
]

# Convert to training bits
all_bits = []
for text in texts:
    bits = text_to_bits(text)
    all_bits.extend(bits)

# Create overlapping sequences for better learning
data = torch.tensor(all_bits)
seq_len = 128
stride = 64
sequences = []
for i in range(0, len(data) - seq_len, stride):
    sequences.append(data[i:i + seq_len])

training_data = torch.stack(sequences)
```
### Production Deployment
```python
import logging

import torch

from bit_transformer import quantize_dynamic, text_to_bits
from bit_transformer.safety import SafetyGate, safe_sample_with_retry
from bit_transformer.utils import set_dropout

# Production-ready model setup
model.eval()             # Disable dropout
set_dropout(model, 0.0)

# Enable safety monitoring
gate = SafetyGate(c_floor=0.3, s_floor=0.5, burn_in=5)

# Quantize for efficiency
production_model = quantize_dynamic(model)

# Safe inference with monitoring
def safe_generate(input_text, max_length=100):
    input_bits = torch.tensor(text_to_bits(input_text)).unsqueeze(0)
    try:
        return safe_sample_with_retry(
            production_model,
            input_bits,
            max_retries=3,
        )
    except Exception as e:
        logging.error(f"Generation failed: {e}")
        return "Error: Unable to generate safe output"
```
---
## Getting Help
### Documentation Resources
- **ABOUTME.md**: Project overview and quick start
- **README.md**: Professional model card and specifications
- **RESEARCH_STATUS.md**: Current research status and limitations
- **EMPIRICAL_VALIDATION.md**: Evidence-based analysis of capabilities
### Community Support
- **GitHub Issues**: Report bugs and request features
- **Discussions**: Ask questions and share experiences
- **Examples**: Check the `tests/` directory for usage examples
### **🤖 Recommended: Use with Claude Code**
For the best experience with BitTransformerLM, we recommend using [Claude Code](https://claude.ai/code):
- **Interactive Setup**: Get step-by-step guidance for configuration
- **Real-time Debugging**: Immediate help when things go wrong
- **Code Generation**: Custom scripts and experiments tailored to your needs
- **Architecture Explanation**: Deep understanding of bit-native processing
- **Best Practices**: Learn optimal configurations for your use case
Claude Code understands BitTransformerLM's unique architecture and can help you navigate the complexities of bit-level language modeling.
---
**Remember: BitTransformerLM is experimental research software. Always validate results thoroughly and use safety monitoring in any deployment.**
Happy experimenting! 🤖✨