# BitTransformerLM User Guide

**Version:** 0.1.0 Experimental  
**Last Updated:** August 2025  
**Recommended Setup:** Use with [Claude Code](https://claude.ai/code) for optimal experience  

## Table of Contents

1. [Quick Start](#quick-start)
2. [Architecture Overview](#architecture-overview)
3. [Core Features](#core-features)
4. [Installation & Setup](#installation--setup)
5. [Basic Usage Examples](#basic-usage-examples)
6. [Advanced Features](#advanced-features)
7. [Training Your Own Models](#training-your-own-models)
8. [Safety and Monitoring](#safety-and-monitoring)
9. [Distributed Training](#distributed-training)
10. [Performance Optimization](#performance-optimization)
11. [Troubleshooting](#troubleshooting)
12. [Best Practices](#best-practices)

---

## Quick Start

BitTransformerLM is an experimental transformer language model that operates directly on binary sequences (bits) rather than tokens. This unique approach enables fine-grained control over information processing and built-in safety monitoring.

### Minimal Example
```python
from bit_transformer import BitTransformerLM, example_training_step

# Run basic example
loss, telemetry = example_training_step()
print(f"Training loss: {loss}")
print(f"Available telemetry: {list(telemetry.keys())}")
```

### Text Processing Example
```python
import torch

from bit_transformer import BitTransformerLM, text_to_bits, bits_to_text

# Create model
model = BitTransformerLM(
    d_model=128,
    nhead=4,
    num_layers=2,
    dim_feedforward=256,
    max_seq_len=256
)

# Convert text to bits and process
text = "Hello, world!"
bits = text_to_bits(text)
bit_tensor = torch.tensor(bits).unsqueeze(0)

# Forward pass
logits, telemetry = model(bit_tensor)
print(f"Input bits: {len(bits)}")
print(f"Output shape: {logits.shape}")
print(f"Telemetry metrics: {list(telemetry.keys())}")
```

---

## Architecture Overview

### Bit-Native Processing
Unlike traditional language models that use token embeddings, BitTransformerLM processes raw binary sequences:

- **Input**: Text → UTF-8 bytes → Bits with parity protection (9 bits per byte; see the encoding sketch after this list)
- **Processing**: Multi-head attention on bit embeddings
- **Output**: Probability distribution over next bit (0 or 1)
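
A minimal sketch of the 9-bits-per-byte encoding is shown below. It is illustrative only: the exact bit order and parity convention used by the library's `text_to_bits` may differ.

```python
def encode_with_parity(text: str) -> list[int]:
    """Toy encoder: 8 data bits (MSB first) plus one even-parity bit per byte."""
    bits = []
    for byte in text.encode("utf-8"):
        byte_bits = [(byte >> i) & 1 for i in range(7, -1, -1)]
        parity = sum(byte_bits) % 2  # makes each 9-bit group sum to an even number
        bits.extend(byte_bits + [parity])
    return bits

print(len(encode_with_parity("Hi")))  # 2 bytes -> 18 bits
```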

### Key Innovations

#### 1. **Reversible Transformer Layers**
- Memory-efficient computation that doesn't store intermediate activations
- Enables training of deeper models with same memory footprint
- Mathematically reversible operations for gradient computation (see the coupling sketch after this list)
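
The memory saving comes from a coupling structure like the RevNet-style sketch below, which lets each layer's inputs be recomputed from its outputs instead of being stored. This is a generic illustration, not BitTransformerLM's exact block layout.

```python
import torch
import torch.nn as nn

class ReversibleCoupling(nn.Module):
    """Generic reversible coupling (RevNet-style), shown for intuition only."""

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1: torch.Tensor, y2: torch.Tensor):
        # Inputs are exactly recoverable, so activations need not be cached.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2
```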

#### 2. **Built-in Safety Telemetry** 
- **K (Negentropy)**: Measures information content vs random noise
- **C (LZ Complexity)**: Proxy for compressibility and pattern complexity  
- **S (Symbiosis)**: Alignment with reference distributions
- Real-time monitoring and safety gates

#### 3. **Dual-Mode Operation**
- **Causal Mode**: Traditional autoregressive generation
- **Diffusion Mode**: Bidirectional denoising for higher quality output

#### 4. **Progressive Scaling**
- Dynamic architecture expansion based on validation performance
- Automatic addition of layers, width, or context length
- Curriculum learning from simple to complex patterns

---

## Core Features

### Text Processing
- **Parity-Protected Encoding**: Each byte gets a parity bit for error detection
- **UTF-8 Support**: Full Unicode text processing capability
- **Bidirectional Processing**: Support for both causal and diffusion modes

### Safety & Monitoring
- **Real-time Telemetry**: K/C/S metrics computed during inference
- **Safety Gates**: Automatic blocking of unsafe outputs
- **Metric Drift Detection**: Alerts when model behavior changes
- **Human-in-the-Loop**: Safe inference with retry mechanisms

### Memory Efficiency
- **Reversible Layers**: Significant memory savings for deep models
- **Gradient Checkpointing**: Trade compute for memory in training
- **Dynamic Quantization**: Runtime INT8 conversion for inference
- **4-bit QAT**: Quantization-aware training for extreme efficiency

### Advanced Training
- **Distributed Training**: FSDP and pipeline parallelism support
- **Mixed Precision**: FP16/BF16 optimization with CPU autocast
- **Compression Pipeline**: Run-length encoding for efficient storage
- **Progressive Curriculum**: Automatic difficulty scaling

---

## Installation & Setup

### Requirements
- Python 3.10 or later
- PyTorch 2.7.1 or later
- CUDA (optional, for GPU acceleration)

### Installation
```bash
# Clone repository
git clone https://huggingface.co/WCNegentropy/BitTransformerLM
cd BitTransformerLM

# Install dependencies
pip install -r requirements.txt

# For GPU support (optional)
pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118
```

### Quick Test
```bash
# Run basic example
python example.py

# Expected output:
# Training loss: [some value]
# Available telemetry: ['activations', 'attention_maps', ...]
```

### **🤖 Recommended: Setup with Claude Code**

For the best experience, we recommend using [Claude Code](https://claude.ai/code) to set up and work with BitTransformerLM:

1. **Open Claude Code** and navigate to your project directory
2. **Clone the repository**: Claude Code can help with git operations and dependency management  
3. **Interactive Setup**: Claude Code can guide you through configuration options and explain parameters
4. **Real-time Assistance**: Get help with model architecture, training parameters, and debugging
5. **Code Generation**: Generate custom training scripts and experiments with AI assistance

Claude Code provides contextual understanding of BitTransformerLM's unique architecture and can help you avoid common pitfalls when working with bit-native processing.

---

## Basic Usage Examples

### 1. Creating Models

```python
from bit_transformer import BitTransformerLM

# Small model for experimentation
small_model = BitTransformerLM(
    d_model=64,           # Embedding dimension
    nhead=4,              # Number of attention heads
    num_layers=2,         # Number of transformer layers
    dim_feedforward=128,  # Feedforward dimension
    max_seq_len=128,      # Maximum sequence length
    reversible=True,      # Use memory-efficient reversible layers
    use_checkpoint=True   # Enable gradient checkpointing
)

# Medium model for research
medium_model = BitTransformerLM(
    d_model=512,
    nhead=8, 
    num_layers=8,
    dim_feedforward=2048,
    max_seq_len=512,
    reversible=True,
    use_checkpoint=True,
    chunk_size=64,        # Chunked attention for long sequences
    lambda_K=0.1,         # Negentropy regularization weight
    lambda_C=0.1,         # Complexity regularization weight
    lambda_S=0.1          # Symbiosis regularization weight
)
```

### 2. Text Generation

```python
from bit_transformer.bit_io import sample_text

# Generate text from prompt
prompt = "The future of AI is"
generated = sample_text(
    model,
    prompt=prompt,
    max_new_tokens=20,    # Generate ~20 new characters
    temperature=0.8,      # Sampling temperature
    top_p=0.9            # Nucleus sampling
)
print(f"Generated: {generated}")
```

### 3. Safe Inference

```python
from bit_transformer import hil_safe_inference, text_to_bits
import torch

# Convert text to bits
text = "Hello, world!"
bits = torch.tensor(text_to_bits(text)).unsqueeze(0)

# Safe inference with telemetry monitoring
try:
    output_bits, telemetry = hil_safe_inference(
        model, 
        bits,
        c_floor=0.3,     # Minimum complexity threshold
        s_floor=0.5,     # Minimum symbiosis threshold
        strict=True      # Throw error if thresholds not met
    )
    print("✅ Safe inference completed")
    print(f"K (Negentropy): {telemetry.get('negentropy_logits', 'N/A')}")
    print(f"C (Complexity): {telemetry.get('lz_complexity_logits', 'N/A')}")
    print(f"S (Symbiosis): {telemetry.get('symbiosis_score', 'N/A')}")
except Exception as e:
    print(f"⚠️ Safety check failed: {e}")
```

### 4. Interactive Dashboard

```bash
# Launch the interactive dashboard
python unified_workflow.py --dashboard
```

Or programmatically:

```python
from bit_transformer.dashboard_app import run_dashboard

run_dashboard(host="localhost", port=5000)
```

The dashboard provides:
- Real-time training monitoring
- Telemetry visualization  
- Model configuration controls
- HuggingFace checkpoint management
- Safe inference testing interface

---

## Advanced Features

### 1. Diffusion Mode Training

Diffusion mode enables bidirectional processing for higher quality generation:

```bash
# Train with diffusion mode
python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32

# Different noise schedules
python unified_workflow.py --diffusion --noise-schedule cosine --diffusion-steps 16

# Diffusion curriculum (noise decay over epochs)
python unified_workflow.py --diffusion --diffusion-curriculum
```

**Diffusion Parameters:**
- `--diffusion-steps`: Number of denoising steps (higher = better quality)
- `--noise-schedule`: `linear`, `cosine`, or `exp` noise decay
- `--diffusion-curriculum`: Gradually reduce noise over training epochs

### 2. Progressive Scaling

Enable automatic model growth based on performance:

```python
import torch

from bit_transformer import BitTransformerLM
from bit_transformer.training import train_loop
from bit_transformer.scale import expand_model

# Training loop with automatic scaling
model = BitTransformerLM(d_model=64, nhead=4, num_layers=2, dim_feedforward=128)
train_data = torch.randint(0, 2, (1000, 64))

# Train with progressive scaling
train_loop(
    model,
    train_data,
    epochs=10,
    batch_size=8,
    # Progressive scaling will automatically trigger when validation loss plateaus
)

# Manual model expansion
expanded_model = expand_model(model, strategy="depth")  # Add layers
expanded_model = expand_model(model, strategy="width")  # Increase width
expanded_model = expand_model(model, strategy="context")  # Extend context
```

### 3. Compression Pipeline

BitTransformerLM includes run-length encoding for efficient data storage:

```python
import torch

from bit_transformer import compress_bits, decompress_bits

# Compress bit sequences
original_bits = torch.tensor([0, 0, 0, 1, 1, 0, 1, 1, 1])
compressed = compress_bits(original_bits)
decompressed = decompress_bits(compressed)

print(f"Original: {original_bits}")
print(f"Compressed: {compressed}")  
print(f"Decompressed: {decompressed}")
print(f"Compression ratio: {len(original_bits) / len(compressed):.2f}")

# Use compression in training
train_loop(
    model,
    data,
    compress_prob=0.5,    # 50% of training uses compressed data
    compress_warmup=100   # Start compression after 100 steps
)
```

### 4. Quantization and Optimization

```python
import torch

from bit_transformer import quantize_dynamic, prepare_qat_fx, convert_qat_fx

# Dynamic quantization for inference
quantized_model = quantize_dynamic(model, dtype=torch.qint8)

# 4-bit quantization-aware training
qat_model = prepare_qat_fx(model)
# ... train qat_model ...
final_model = convert_qat_fx(qat_model)

# Enable mixed precision and compilation
train_loop(
    model,
    data,
    amp=True,           # Enable automatic mixed precision
    compile_model=True  # Use torch.compile for speedup
)
```

---

## Training Your Own Models

### Basic Training Script

```python
import torch
from bit_transformer import BitTransformerLM, train_loop, configure_optimizer
from bit_transformer.bit_io import text_to_bits

# Prepare training data
texts = ["Hello world", "How are you?", "BitTransformer is working!"]
all_bits = []
for text in texts:
    bits = text_to_bits(text)
    all_bits.extend(bits)

# Convert to tensor and create sequences
data = torch.tensor(all_bits)
sequences = data.unfold(0, 64, 32)  # 64-bit sequences with 32-bit stride

# Create model
model = BitTransformerLM(
    d_model=128,
    nhead=8,
    num_layers=4,
    dim_feedforward=512,
    max_seq_len=64,
    reversible=True
)

# Configure optimizer
optimizer = configure_optimizer(model, lr=0.001, weight_decay=0.01)

# Training loop
train_loop(
    model,
    sequences,
    epochs=10,
    batch_size=4,
    optimizer=optimizer,
    amp=True,          # Mixed precision
    log=True           # Enable logging
)
```

### Advanced Training Configuration

```python
# Advanced training with all features enabled
train_loop(
    model,
    data,
    epochs=20,
    batch_size=8,
    accum_steps=4,            # Gradient accumulation
    amp=True,                 # Mixed precision
    compile_model=True,       # torch.compile optimization
    
    # Compression settings
    compress_prob=0.3,        # 30% compression probability
    compress_warmup=50,       # Start compression after 50 steps
    
    # Diffusion settings  
    diffusion=True,           # Enable diffusion mode
    diffusion_curriculum=True, # Decay noise over epochs
    
    # Direct bit training
    direct_prob=0.1,          # 10% direct bit prediction
    
    # Logging
    log=True                  # Enable detailed logging
)
```

### Custom Training Loop

```python
import torch
import torch.nn.functional as F
from bit_transformer.utils import set_dropout

# Manual training loop for full control
model.train()
set_dropout(model, 0.1)  # Enable dropout for training

optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
criterion = F.cross_entropy

for epoch in range(10):
    total_loss = 0
    for batch in data_loader:
        optimizer.zero_grad()
        
        # Forward pass
        logits, telemetry = model(batch)
        
        # Compute loss
        if logits.dim() == 3:  # (batch, seq, 2)
            targets = batch[:, 1:]  # Next bit prediction
            logits = logits[:, :-1]  # Remove last prediction
            loss = criterion(logits.reshape(-1, 2), targets.reshape(-1))
        else:
            loss = criterion(logits, batch)
        
        # Add telemetry regularization
        if model.lambda_K > 0:
            loss += model.lambda_K * (1 - telemetry.get('negentropy_logits', 0))
        if model.lambda_C > 0:
            loss += model.lambda_C * (1 - telemetry.get('lz_complexity_logits', 0))
            
        # Backward pass
        loss.backward()
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        total_loss += loss.item()
        
        # Safety check
        if telemetry.get('symbiosis_score', 1.0) < 0.3:
            print("⚠️ Low symbiosis score detected")
    
    print(f"Epoch {epoch}: Average loss = {total_loss / len(data_loader):.4f}")
```

---

## Safety and Monitoring

### Telemetry Metrics

BitTransformerLM provides three key safety metrics:

#### K (Negentropy) - Information Content
- **Range**: 0-1 (0 = random noise, 1 = perfectly ordered)
- **Purpose**: Measures departure from randomness (a toy calculation follows these definitions)
- **Interpretation**: 
  - Very low K (< 0.1): Output is noise-like
  - Moderate K (0.3-0.7): Structured but varied output  
  - Very high K (> 0.9): Repetitive or overly structured

#### C (LZ Complexity) - Pattern Complexity
- **Range**: 0-1 (higher = more complex patterns)
- **Purpose**: Proxy for Lempel-Ziv compressibility
- **Interpretation**:
  - Low C (< 0.3): Highly repetitive patterns
  - Moderate C (0.3-0.7): Balanced complexity
  - High C (> 0.8): Complex, varied patterns

#### S (Symbiosis) - Distribution Alignment  
- **Range**: 0-1 (higher = better alignment)
- **Purpose**: Agreement with reference distributions via KL divergence
- **Interpretation**:
  - Low S (< 0.3): Poor alignment with expected patterns
  - Moderate S (0.5-0.8): Good alignment
  - High S (> 0.8): Excellent alignment
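
To build intuition for these numbers, here is a toy negentropy-style score computed on a raw bit sequence. It is illustrative only; the model's telemetry derives K/C/S from its own predicted distributions rather than this exact formula.

```python
import math

def negentropy_score(bits: list[int]) -> float:
    """Toy K-like score: 1 minus the Shannon entropy of the bit distribution."""
    p = sum(bits) / len(bits)
    if p in (0.0, 1.0):
        return 1.0  # constant stream: perfectly ordered
    entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # in [0, 1] bits
    return 1.0 - entropy

print(negentropy_score([0, 1] * 32))  # 0.0 -> balanced, noise-like
print(negentropy_score([1] * 64))     # 1.0 -> perfectly ordered
```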

### Safety Gates

```python
from bit_transformer.safety import SafetyGate, safe_sample_with_retry

# Configure safety gate
gate = SafetyGate(
    c_floor=0.3,      # Minimum complexity
    s_floor=0.5,      # Minimum symbiosis  
    decay=0.9,        # EMA decay factor
    burn_in=10        # Steps before gating starts
)

# Check if output should be blocked
should_block = gate.should_trigger(c_val=0.2, s_val=0.4)  # True - below thresholds

# Safe sampling with automatic retry
output = safe_sample_with_retry(
    model,
    input_bits,
    max_retries=3,
    retry_strategy="diffusion"  # Try diffusion mode on failure
)
```

### Metric Drift Detection

```python
from bit_transformer.telemetry import detect_metric_drift

# Monitor metric stability over time
metrics_history = [
    {"K": 0.5, "C": 0.6, "S": 0.7},
    {"K": 0.52, "C": 0.58, "S": 0.69},  
    {"K": 0.8, "C": 0.9, "S": 0.4},   # Drift detected!
    # ... more metrics
]

drift_detected = detect_metric_drift(
    metrics_history,
    window=10,        # Look back 10 steps
    threshold=0.2     # Alert if change > 0.2
)

if drift_detected:
    print("⚠️ Model behavior drift detected!")
```

---

## Distributed Training

### FSDP (Fully Sharded Data Parallel)

```python
from bit_transformer.distributed import wrap_fsdp, setup_distributed
import torch.distributed as dist

# Initialize distributed training
setup_distributed(rank=0, world_size=4)

# Wrap model with FSDP
model = BitTransformerLM(d_model=1024, nhead=16, num_layers=12)
fsdp_model = wrap_fsdp(
    model,
    sharding_strategy="FULL_SHARD",  # or "SHARD_GRAD_OP", "NO_SHARD"
    mixed_precision=True,
    device_id=0
)

# Train with FSDP
train_loop(
    fsdp_model,
    data,
    epochs=10,
    batch_size=2,    # Smaller batch per GPU
    amp=True
)
```

### Pipeline Parallelism

```python  
from bit_transformer.distributed import make_pipeline

# Create pipeline parallel model
pipeline_model = make_pipeline(
    model,
    balance=[2, 2, 2, 2],  # Split 8 layers across 4 GPUs
    devices=[0, 1, 2, 3],
    checkpoint="never"     # or "always", "except_last"
)

# Pipeline training requires special handling
# See unified_workflow.py for complete implementation
```

### Multi-GPU Training Script

```bash
# Single node, multiple GPUs
python -m torch.distributed.launch \
    --nproc_per_node=4 \
    unified_workflow.py \
    --distributed \
    --batch-size 2 \
    --epochs 10

# Multiple nodes
python -m torch.distributed.launch \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="192.168.1.100" \
    --master_port=29500 \
    --nproc_per_node=4 \
    unified_workflow.py \
    --distributed
```

---

## Performance Optimization

### Memory Optimization

```python
# Enable all memory optimizations
model = BitTransformerLM(
    d_model=512,
    nhead=8,
    num_layers=8,
    reversible=True,          # Reversible layers save ~50% memory
    use_checkpoint=True,      # Gradient checkpointing
    chunk_size=64,            # Chunked attention for long sequences
    full_attn_logging=False   # Skip full attention reconstruction
)

# Training optimizations
train_loop(
    model,
    data,
    batch_size=4,            # Smaller batches
    accum_steps=8,           # Gradient accumulation  
    amp=True,                # Mixed precision
    compile_model=True       # torch.compile
)
```

### CPU Optimization

```python
from bit_transformer.torch_utils import cpu_autocast

# Enable BF16 on CPU
with cpu_autocast():
    logits, telemetry = model(bits)

# Or enable for entire model
model = BitTransformerLM(use_autocast=True)  # Automatically uses CPU BF16
```

### Inference Optimization

```python
# Quantize for inference
import torch

from bit_transformer import quantize_dynamic
from bit_transformer.utils import set_dropout

# Switch to evaluation mode
model.eval()
set_dropout(model, 0.0)

# Dynamic quantization
quantized = quantize_dynamic(model, dtype=torch.qint8)

# Optimize for inference
with torch.no_grad():
    logits, _ = quantized(input_bits)
```

### Long Sequence Processing

```python
import torch

from bit_transformer import text_to_bits
from bit_transformer.model import infer_long_sequence

# Process sequences longer than max_seq_len
long_text = "Very long text..." * 1000
bits = text_to_bits(long_text)

output = infer_long_sequence(
    model,
    torch.tensor(bits).unsqueeze(0),
    chunk_size=256,      # Process in 256-bit chunks
    overlap=32,          # 32-bit overlap between chunks
    stride=224           # 224-bit stride (256-32)
)
```

---

## Troubleshooting

### Common Issues

#### 1. **Memory Errors**
```
RuntimeError: CUDA out of memory
```
**Solutions:**
- Enable reversible layers: `reversible=True`
- Enable gradient checkpointing: `use_checkpoint=True`  
- Reduce batch size or use gradient accumulation
- Use chunked attention: `chunk_size=64`
- Enable mixed precision: `amp=True`

#### 2. **Tensor Shape Mismatches**
```
RuntimeError: view size is not compatible with input tensor's size
```
**Solutions:**
- Always use `.reshape()` instead of `.view()` with BitTransformerLM (see the sketch below)
- Check that input sequences are properly formatted (1D for bits)
- Ensure batch dimensions are consistent
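
The `.reshape()` versus `.view()` point can be reproduced in isolation: `.view()` requires a contiguous memory layout, while `.reshape()` copies when necessary.

```python
import torch

x = torch.randint(0, 2, (4, 8))
t = x.t()             # transposing makes the tensor non-contiguous
# t.view(-1)          # raises the "view size is not compatible ..." error
flat = t.reshape(-1)  # reshape copies when needed, so it always succeeds
print(flat.shape)     # torch.Size([32])
```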

#### 3. **Parity Check Failures**
```
ValueError: Parity check failed
```
**Solutions:**
- Use `enforce_parity()` to fix parity bits in generated sequences (a toy parity check is sketched below)
- Check that text encoding/decoding is consistent
- Verify bit sequences have correct 9-bit (8+parity) structure
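
As a sanity check, a toy validator for the 8+parity structure might look like the sketch below. It assumes an even-parity convention (each 9-bit group sums to an even number); the library's `enforce_parity()` is the supported path and its convention may differ.

```python
def parity_ok(bits: list[int]) -> bool:
    """Toy check that every 9-bit group (8 data bits + parity) has even parity."""
    if len(bits) % 9 != 0:
        return False
    return all(sum(bits[i:i + 9]) % 2 == 0 for i in range(0, len(bits), 9))
```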

#### 4. **Safety Gate Triggering**
```
SafetyError: Output blocked by safety gate
```
**Solutions:**
- Lower safety thresholds: `c_floor=0.2, s_floor=0.4`
- Increase burn-in period: `burn_in=20`
- Use retry with diffusion: `safe_sample_with_retry()`
- Check model training quality

### Debug Mode

```python
# Enable detailed logging
import logging
logging.basicConfig(level=logging.DEBUG)

# Model with debug telemetry
model = BitTransformerLM(
    d_model=64,
    nhead=4,
    num_layers=2,
    full_attn_logging=True,  # Log full attention maps
    chunk_size=None          # Disable chunking for debugging
)

# Inspect telemetry
logits, telemetry = model(input_bits)
print("Telemetry keys:", list(telemetry.keys()))
print("Attention maps shape:", [a.shape for a in telemetry['attention_maps']])
print("Activation stats:", torch.stack(telemetry['activations']).describe())
```

### Performance Profiling

```python
import torch.profiler

# Profile training step
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_stack=True,
) as prof:
    logits, telemetry = model(input_bits)
    loss = F.cross_entropy(logits.reshape(-1, 2), targets.reshape(-1))
    loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total"))
```

---

## Best Practices

### Model Configuration

#### For Experimentation (< 1M parameters)
```python
model = BitTransformerLM(
    d_model=64,
    nhead=4,
    num_layers=2,
    dim_feedforward=128,
    max_seq_len=128,
    reversible=False,    # Simpler for debugging
    use_checkpoint=False
)
```

#### For Research (1M-100M parameters)  
```python
model = BitTransformerLM(
    d_model=256,
    nhead=8,
    num_layers=6,
    dim_feedforward=1024,
    max_seq_len=512,
    reversible=True,     # Enable memory efficiency
    use_checkpoint=True,
    chunk_size=128,
    lambda_K=0.05,       # Light regularization
    lambda_C=0.05,
    lambda_S=0.05
)
```

#### For Large-Scale (100M+ parameters)
```python
model = BitTransformerLM(
    d_model=1024,
    nhead=16, 
    num_layers=20,
    dim_feedforward=4096,
    max_seq_len=2048,
    reversible=True,
    use_checkpoint=True,
    chunk_size=256,
    full_attn_logging=False,  # Save memory
    lambda_K=0.1,
    lambda_C=0.1,
    lambda_S=0.1
)
```

### Training Best Practices

1. **Always validate on held-out data** to monitor overfitting
2. **Use gradient clipping** to prevent training instability  
3. **Monitor telemetry metrics** for signs of model degradation
4. **Start with smaller models** before scaling up
5. **Use safety gates** in production deployments
6. **Enable logging** to track training progress
7. **Save checkpoints frequently** to prevent loss of progress (a minimal pattern is sketched below)
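
Item 7 can be as simple as the generic `torch.save` pattern sketched below; the paths are placeholders, and the dashboard's HuggingFace checkpoint management is a separate, project-specific path.

```python
import os
import torch

def save_checkpoint(model, optimizer, epoch, path="checkpoints/bitlm.pt"):
    """Minimal checkpoint: model and optimizer state plus the current epoch."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path="checkpoints/bitlm.pt"):
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["epoch"]
```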

### Data Preparation

```python
# Good: Clean, well-formatted text
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is transforming technology.",
    "BitTransformer processes information at the bit level."
]

# Convert to training sequences
all_bits = []
for text in texts:
    bits = text_to_bits(text)
    all_bits.extend(bits)

# Create overlapping sequences for better learning
data = torch.tensor(all_bits)
seq_len = 128
stride = 64
sequences = []
for i in range(0, len(data) - seq_len, stride):
    sequences.append(data[i:i + seq_len])

training_data = torch.stack(sequences)
```

### Production Deployment

```python
# Production-ready model setup
model.eval()  # Disable dropout
set_dropout(model, 0.0)

# Enable safety monitoring
gate = SafetyGate(c_floor=0.3, s_floor=0.5, burn_in=5)

# Quantize for efficiency
production_model = quantize_dynamic(model)

# Safe inference with monitoring
def safe_generate(input_text, max_length=100):
    try:
        return safe_sample_with_retry(
            production_model,
            text_to_bits(input_text),
            max_retries=3
        )
    except Exception as e:
        logging.error(f"Generation failed: {e}")
        return "Error: Unable to generate safe output"
```

---

## Getting Help

### Documentation Resources
- **ABOUTME.md**: Project overview and quick start
- **README.md**: Professional model card and specifications  
- **RESEARCH_STATUS.md**: Current research status and limitations
- **EMPIRICAL_VALIDATION.md**: Evidence-based analysis of capabilities

### Community Support
- **GitHub Issues**: Report bugs and request features
- **Discussions**: Ask questions and share experiences
- **Examples**: Check the `tests/` directory for usage examples

### **🤖 Recommended: Use with Claude Code**

For the best experience with BitTransformerLM, we recommend using [Claude Code](https://claude.ai/code):

- **Interactive Setup**: Get step-by-step guidance for configuration
- **Real-time Debugging**: Immediate help when things go wrong
- **Code Generation**: Custom scripts and experiments tailored to your needs
- **Architecture Explanation**: Deep understanding of bit-native processing
- **Best Practices**: Learn optimal configurations for your use case

Claude Code understands BitTransformerLM's unique architecture and can help you navigate the complexities of bit-level language modeling.

---

**Remember: BitTransformerLM is experimental research software. Always validate results thoroughly and use safety monitoring in any deployment.**

Happy experimenting! 🤖✨