Commit 58b962e (verified) · 1 parent: cd203a2
WCNegentropy committed

Add comprehensive user handbook

Files changed (1): USER_GUIDE.md (+957, -0)
USER_GUIDE.md ADDED
@@ -0,0 +1,957 @@
# BitTransformerLM User Guide

**Version:** 0.1.0 Experimental
**Last Updated:** August 2025
**Recommended Setup:** Use with [Claude Code](https://claude.ai/code) for optimal experience

## Table of Contents

1. [Quick Start](#quick-start)
2. [Architecture Overview](#architecture-overview)
3. [Core Features](#core-features)
4. [Installation & Setup](#installation--setup)
5. [Basic Usage Examples](#basic-usage-examples)
6. [Advanced Features](#advanced-features)
7. [Training Your Own Models](#training-your-own-models)
8. [Safety and Monitoring](#safety-and-monitoring)
9. [Distributed Training](#distributed-training)
10. [Performance Optimization](#performance-optimization)
11. [Troubleshooting](#troubleshooting)
12. [Best Practices](#best-practices)

---

## Quick Start

BitTransformerLM is an experimental transformer language model that operates directly on binary sequences (bits) rather than tokens. This unique approach enables fine-grained control over information processing and built-in safety monitoring.

### Minimal Example
```python
from bit_transformer import BitTransformerLM, example_training_step

# Run basic example
loss, telemetry = example_training_step()
print(f"Training loss: {loss}")
print(f"Available telemetry: {list(telemetry.keys())}")
```

### Text Processing Example
```python
import torch

from bit_transformer import BitTransformerLM, text_to_bits, bits_to_text

# Create model
model = BitTransformerLM(
    d_model=128,
    nhead=4,
    num_layers=2,
    dim_feedforward=256,
    max_seq_len=256
)

# Convert text to bits and process
text = "Hello, world!"
bits = text_to_bits(text)
bit_tensor = torch.tensor(bits).unsqueeze(0)

# Forward pass
logits, telemetry = model(bit_tensor)
print(f"Input bits: {len(bits)}")
print(f"Output shape: {logits.shape}")
print(f"Telemetry metrics: {list(telemetry.keys())}")
```

---

## Architecture Overview

### Bit-Native Processing
Unlike traditional language models that use token embeddings, BitTransformerLM processes raw binary sequences:

- **Input**: Text → UTF-8 bytes → bits with parity protection (9 bits per byte; see the sketch below)
- **Processing**: Multi-head attention on bit embeddings
- **Output**: Probability distribution over the next bit (0 or 1)

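The parity scheme above can be pictured with a small, self-contained sketch. It is illustrative only: it assumes an MSB-first data layout with a trailing even-parity bit, which may not match the exact ordering used by `text_to_bits` / `bits_to_text`. Treat the library functions as the source of truth.

```python
def encode_with_parity(text: str) -> list[int]:
    """Illustrative 9-bit encoding: 8 data bits (MSB first) + 1 even-parity bit per byte."""
    out = []
    for byte in text.encode("utf-8"):
        data = [(byte >> shift) & 1 for shift in range(7, -1, -1)]
        parity = sum(data) % 2
        out.extend(data + [parity])
    return out

def decode_with_parity(bits: list[int]) -> str:
    """Inverse of the sketch above, raising on a failed parity check."""
    assert len(bits) % 9 == 0, "expected 9 bits per byte"
    raw = bytearray()
    for i in range(0, len(bits), 9):
        data, parity = bits[i:i + 8], bits[i + 8]
        if sum(data) % 2 != parity:
            raise ValueError(f"Parity check failed for byte starting at bit {i}")
        raw.append(int("".join(map(str, data)), 2))
    return raw.decode("utf-8")

bits = encode_with_parity("Hi")
print(len(bits))                 # 18 bits = 2 bytes x 9 bits
print(decode_with_parity(bits))  # "Hi"
```
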
### Key Innovations

#### 1. **Reversible Transformer Layers**
- Memory-efficient computation that doesn't store intermediate activations
- Enables training deeper models within the same memory footprint
- Mathematically reversible operations for gradient computation (sketched below)

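The memory saving behind reversible layers comes from the standard additive-coupling trick: a layer's inputs can be recomputed exactly from its outputs during the backward pass, so activations need not be cached. Below is a minimal, generic sketch of that idea, not the library's actual layer implementation:

```python
import torch
import torch.nn as nn

class ReversibleCouplingBlock(nn.Module):
    """Toy additive coupling: y1 = x1 + F(x2), y2 = x2 + G(y1)."""

    def __init__(self, dim: int):
        super().__init__()
        self.F = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.G = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Recover the inputs exactly, so activations need not be stored.
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2

block = ReversibleCouplingBlock(dim=16)
x1, x2 = torch.randn(2, 8, 16), torch.randn(2, 8, 16)
y1, y2 = block(x1, x2)
r1, r2 = block.inverse(y1, y2)
print(torch.allclose(r1, x1, atol=1e-6), torch.allclose(r2, x2, atol=1e-6))
```

Recomputing instead of storing activations is also why `reversible=True` is paired with `use_checkpoint=True` in the example configurations later in this guide.
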
#### 2. **Built-in Safety Telemetry**
- **K (Negentropy)**: Measures information content vs random noise
- **C (LZ Complexity)**: Proxy for compressibility and pattern complexity
- **S (Symbiosis)**: Alignment with reference distributions
- Real-time monitoring and safety gates

#### 3. **Dual-Mode Operation**
- **Causal Mode**: Traditional autoregressive generation
- **Diffusion Mode**: Bidirectional denoising for higher-quality output (the masking difference is sketched below)

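The practical difference between the two modes is which positions each bit may attend to. The following is a generic PyTorch illustration, not BitTransformerLM internals: causal mode applies an upper-triangular mask so position *i* only sees positions up to *i*, while a bidirectional denoising pass leaves attention unmasked.

```python
import torch

seq_len = 6

# Causal (autoregressive) mask: True marks positions that must NOT be attended to.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Bidirectional (denoising-style) pass: no positions are masked out.
bidirectional_mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)

print(causal_mask.int())
print(bidirectional_mask.int())
```
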
#### 4. **Progressive Scaling**
- Dynamic architecture expansion based on validation performance
- Automatic addition of layers, width, or context length
- Curriculum learning from simple to complex patterns

---

## Core Features

### Text Processing
- **Parity-Protected Encoding**: Each byte gets a parity bit for error detection
- **UTF-8 Support**: Full Unicode text processing capability
- **Bidirectional Processing**: Support for both causal and diffusion modes

### Safety & Monitoring
- **Real-time Telemetry**: K/C/S metrics computed during inference
- **Safety Gates**: Automatic blocking of unsafe outputs
- **Metric Drift Detection**: Alerts when model behavior changes
- **Human-in-the-Loop**: Safe inference with retry mechanisms

### Memory Efficiency
- **Reversible Layers**: Significant memory savings for deep models
- **Gradient Checkpointing**: Trade compute for memory in training
- **Dynamic Quantization**: Runtime INT8 conversion for inference
- **4-bit QAT**: Quantization-aware training for extreme efficiency

### Advanced Training
- **Distributed Training**: FSDP and pipeline parallelism support
- **Mixed Precision**: FP16/BF16 optimization with CPU autocast
- **Compression Pipeline**: Run-length encoding for efficient storage
- **Progressive Curriculum**: Automatic difficulty scaling

---

## Installation & Setup

### Requirements
- Python 3.10 or later
- PyTorch 2.7.1 or later
- CUDA (optional, for GPU acceleration)

### Installation
```bash
# Clone repository
git clone https://huggingface.co/WCNegentropy/BitTransformerLM
cd BitTransformerLM

# Install dependencies
pip install -r requirements.txt

# For GPU support (optional)
pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118
```

### Quick Test
```bash
# Run basic example
python example.py

# Expected output:
# Training loss: [some value]
# Available telemetry: ['activations', 'attention_maps', ...]
```

### **🤖 Recommended: Setup with Claude Code**

For the best experience, we recommend using [Claude Code](https://claude.ai/code) to set up and work with BitTransformerLM:

1. **Open Claude Code** and navigate to your project directory
2. **Clone the repository**: Claude Code can help with git operations and dependency management
3. **Interactive Setup**: Claude Code can guide you through configuration options and explain parameters
4. **Real-time Assistance**: Get help with model architecture, training parameters, and debugging
5. **Code Generation**: Generate custom training scripts and experiments with AI assistance

Claude Code provides contextual understanding of BitTransformerLM's unique architecture and can help you avoid common pitfalls when working with bit-native processing.

---

## Basic Usage Examples

### 1. Creating Models

```python
from bit_transformer import BitTransformerLM

# Small model for experimentation
small_model = BitTransformerLM(
    d_model=64,            # Embedding dimension
    nhead=4,               # Number of attention heads
    num_layers=2,          # Number of transformer layers
    dim_feedforward=128,   # Feedforward dimension
    max_seq_len=128,       # Maximum sequence length
    reversible=True,       # Use memory-efficient reversible layers
    use_checkpoint=True    # Enable gradient checkpointing
)

# Medium model for research
medium_model = BitTransformerLM(
    d_model=512,
    nhead=8,
    num_layers=8,
    dim_feedforward=2048,
    max_seq_len=512,
    reversible=True,
    use_checkpoint=True,
    chunk_size=64,   # Chunked attention for long sequences
    lambda_K=0.1,    # Negentropy regularization weight
    lambda_C=0.1,    # Complexity regularization weight
    lambda_S=0.1     # Symbiosis regularization weight
)
```

### 2. Text Generation

```python
from bit_transformer.bit_io import sample_text

# Generate text from prompt
prompt = "The future of AI is"
generated = sample_text(
    model,
    prompt=prompt,
    max_new_tokens=20,   # Generate ~20 new characters
    temperature=0.8,     # Sampling temperature
    top_p=0.9            # Nucleus sampling
)
print(f"Generated: {generated}")
```

### 3. Safe Inference

```python
from bit_transformer import hil_safe_inference, text_to_bits
import torch

# Convert text to bits
text = "Hello, world!"
bits = torch.tensor(text_to_bits(text)).unsqueeze(0)

# Safe inference with telemetry monitoring
try:
    output_bits, telemetry = hil_safe_inference(
        model,
        bits,
        c_floor=0.3,   # Minimum complexity threshold
        s_floor=0.5,   # Minimum symbiosis threshold
        strict=True    # Throw error if thresholds not met
    )
    print("✅ Safe inference completed")
    print(f"K (Negentropy): {telemetry.get('negentropy_logits', 'N/A')}")
    print(f"C (Complexity): {telemetry.get('lz_complexity_logits', 'N/A')}")
    print(f"S (Symbiosis): {telemetry.get('symbiosis_score', 'N/A')}")
except Exception as e:
    print(f"⚠️ Safety check failed: {e}")
```

### 4. Interactive Dashboard

```bash
# Launch the interactive dashboard
python unified_workflow.py --dashboard
```

```python
# Or launch it programmatically
from bit_transformer.dashboard_app import run_dashboard

run_dashboard(host="localhost", port=5000)
```

The dashboard provides:
- Real-time training monitoring
- Telemetry visualization
- Model configuration controls
- HuggingFace checkpoint management
- Safe inference testing interface

---

## Advanced Features

### 1. Diffusion Mode Training

Diffusion mode enables bidirectional processing for higher-quality generation:

```bash
# Train with diffusion mode
python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32

# Different noise schedules
python unified_workflow.py --diffusion --noise-schedule cosine --diffusion-steps 16

# Diffusion curriculum (noise decay over epochs)
python unified_workflow.py --diffusion --diffusion-curriculum
```

**Diffusion Parameters:**
- `--diffusion-steps`: Number of denoising steps (higher = better quality)
- `--noise-schedule`: `linear`, `cosine`, or `exp` noise decay (compared in the sketch below)
- `--diffusion-curriculum`: Gradually reduce noise over training epochs

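The three schedule names map onto familiar decay curves. The sketch below is a generic illustration of linear, cosine, and exponential noise decay over the denoising steps; the exact formulas and constants used by `unified_workflow.py` are assumptions here, not the project's definitions.

```python
import math

def noise_level(schedule: str, step: int, total_steps: int) -> float:
    """Illustrative noise fraction in [0, 1] for denoising step `step` (0-based)."""
    t = step / max(total_steps - 1, 1)  # progress through the schedule, 0 -> 1
    if schedule == "linear":
        return 1.0 - t                              # straight-line decay
    if schedule == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * t))  # slow start, slow finish
    if schedule == "exp":
        return math.exp(-5.0 * t)                   # fast early decay, long tail
    raise ValueError(f"unknown schedule: {schedule}")

for name in ("linear", "cosine", "exp"):
    levels = [round(noise_level(name, s, 8), 3) for s in range(8)]
    print(f"{name:>6}: {levels}")
```
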
### 2. Progressive Scaling

Enable automatic model growth based on performance:

```python
import torch

from bit_transformer import BitTransformerLM
from bit_transformer.training import train_loop
from bit_transformer.scale import expand_model

# Training loop with automatic scaling
model = BitTransformerLM(d_model=64, nhead=4, num_layers=2, dim_feedforward=128)
train_data = torch.randint(0, 2, (1000, 64))

# Train with progressive scaling
train_loop(
    model,
    train_data,
    epochs=10,
    batch_size=8,
    # Progressive scaling will automatically trigger when validation loss plateaus
)

# Manual model expansion
expanded_model = expand_model(model, strategy="depth")    # Add layers
expanded_model = expand_model(model, strategy="width")    # Increase width
expanded_model = expand_model(model, strategy="context")  # Extend context
```

### 3. Compression Pipeline

BitTransformerLM includes run-length encoding for efficient data storage:

```python
from bit_transformer import compress_bits, decompress_bits

# Compress bit sequences
original_bits = torch.tensor([0, 0, 0, 1, 1, 0, 1, 1, 1])
compressed = compress_bits(original_bits)
decompressed = decompress_bits(compressed)

print(f"Original: {original_bits}")
print(f"Compressed: {compressed}")
print(f"Decompressed: {decompressed}")
print(f"Compression ratio: {len(original_bits) / len(compressed):.2f}")

# Use compression in training
train_loop(
    model,
    data,
    compress_prob=0.5,    # 50% of training uses compressed data
    compress_warmup=100   # Start compression after 100 steps
)
```

### 4. Quantization and Optimization

```python
import torch

from bit_transformer import quantize_dynamic, prepare_qat_fx, convert_qat_fx

# Dynamic quantization for inference
quantized_model = quantize_dynamic(model, dtype=torch.qint8)

# 4-bit quantization-aware training
qat_model = prepare_qat_fx(model)
# ... train qat_model ...
final_model = convert_qat_fx(qat_model)

# Enable mixed precision and compilation
train_loop(
    model,
    data,
    amp=True,            # Enable automatic mixed precision
    compile_model=True   # Use torch.compile for speedup
)
```

---

## Training Your Own Models

### Basic Training Script

```python
import torch
from bit_transformer import BitTransformerLM, train_loop, configure_optimizer
from bit_transformer.bit_io import text_to_bits

# Prepare training data
texts = ["Hello world", "How are you?", "BitTransformer is working!"]
all_bits = []
for text in texts:
    bits = text_to_bits(text)
    all_bits.extend(bits)

# Convert to tensor and create sequences
data = torch.tensor(all_bits)
sequences = data.unfold(0, 64, 32)  # 64-bit sequences with 32-bit stride

# Create model
model = BitTransformerLM(
    d_model=128,
    nhead=8,
    num_layers=4,
    dim_feedforward=512,
    max_seq_len=64,
    reversible=True
)

# Configure optimizer
optimizer = configure_optimizer(model, lr=0.001, weight_decay=0.01)

# Training loop
train_loop(
    model,
    sequences,
    epochs=10,
    batch_size=4,
    optimizer=optimizer,
    amp=True,   # Mixed precision
    log=True    # Enable logging
)
```

### Advanced Training Configuration

```python
# Advanced training with all features enabled
train_loop(
    model,
    data,
    epochs=20,
    batch_size=8,
    accum_steps=4,        # Gradient accumulation
    amp=True,             # Mixed precision
    compile_model=True,   # torch.compile optimization

    # Compression settings
    compress_prob=0.3,    # 30% compression probability
    compress_warmup=50,   # Start compression after 50 steps

    # Diffusion settings
    diffusion=True,              # Enable diffusion mode
    diffusion_curriculum=True,   # Decay noise over epochs

    # Direct bit training
    direct_prob=0.1,      # 10% direct bit prediction

    # Logging
    log=True              # Enable detailed logging
)
```

### Custom Training Loop

```python
import torch
import torch.nn.functional as F
from bit_transformer.utils import set_dropout

# Manual training loop for full control
model.train()
set_dropout(model, 0.1)  # Enable dropout for training

optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
criterion = F.cross_entropy

for epoch in range(10):
    total_loss = 0
    for batch in data_loader:
        optimizer.zero_grad()

        # Forward pass
        logits, telemetry = model(batch)

        # Compute loss
        if logits.dim() == 3:  # (batch, seq, 2)
            targets = batch[:, 1:]   # Next bit prediction
            logits = logits[:, :-1]  # Remove last prediction
            loss = criterion(logits.reshape(-1, 2), targets.reshape(-1))
        else:
            loss = criterion(logits, batch)

        # Add telemetry regularization
        if model.lambda_K > 0:
            loss += model.lambda_K * (1 - telemetry.get('negentropy_logits', 0))
        if model.lambda_C > 0:
            loss += model.lambda_C * (1 - telemetry.get('lz_complexity_logits', 0))

        # Backward pass
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        total_loss += loss.item()

        # Safety check
        if telemetry.get('symbiosis_score', 1.0) < 0.3:
            print("⚠️ Low symbiosis score detected")

    print(f"Epoch {epoch}: Average loss = {total_loss / len(data_loader):.4f}")
```

---

## Safety and Monitoring

### Telemetry Metrics

BitTransformerLM provides three key safety metrics (a toy illustration of how such scores can be approximated follows the descriptions below):

#### K (Negentropy) - Information Content
- **Range**: 0-1 (0 = random noise, 1 = perfectly ordered)
- **Purpose**: Measures departure from randomness
- **Interpretation**:
  - Very low K (< 0.1): Output is noise-like
  - Moderate K (0.3-0.7): Structured but varied output
  - Very high K (> 0.9): Repetitive or overly structured

#### C (LZ Complexity) - Pattern Complexity
- **Range**: 0-1 (higher = more complex patterns)
- **Purpose**: Proxy for Lempel-Ziv compressibility
- **Interpretation**:
  - Low C (< 0.3): Highly repetitive patterns
  - Moderate C (0.3-0.7): Balanced complexity
  - High C (> 0.8): Complex, varied patterns

#### S (Symbiosis) - Distribution Alignment
- **Range**: 0-1 (higher = better alignment)
- **Purpose**: Agreement with reference distributions via KL divergence
- **Interpretation**:
  - Low S (< 0.3): Poor alignment with expected patterns
  - Moderate S (0.5-0.8): Good alignment
  - High S (> 0.8): Excellent alignment

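For intuition about the first two metrics, here is a rough, self-contained approximation over a plain bit list: a negentropy-style score from the Shannon entropy of the bit distribution, and a complexity score from the fraction of possible fixed-length windows that actually occur. These are toy stand-ins with arbitrary normalizations, not the formulas implemented in `bit_transformer.telemetry`.

```python
import math
import random

def negentropy_score(bits: list[int]) -> float:
    """1 - H(p)/1 bit, where p is the fraction of ones: 0 for fair-coin noise, 1 for constant output."""
    p = sum(bits) / len(bits)
    if p in (0.0, 1.0):
        return 1.0
    entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return 1.0 - entropy

def lz_like_complexity(bits: list[int], n: int = 8) -> float:
    """Fraction of possible length-n windows that actually appear (crude compressibility proxy)."""
    windows = [tuple(bits[i:i + n]) for i in range(len(bits) - n + 1)]
    return len(set(windows)) / min(len(windows), 2 ** n)

random.seed(0)
noisy = [random.randint(0, 1) for _ in range(512)]
ordered = [1] * 512

print(round(negentropy_score(noisy), 3), round(lz_like_complexity(noisy), 3))      # noise: K near 0, C high
print(round(negentropy_score(ordered), 3), round(lz_like_complexity(ordered), 3))  # constant: K = 1, C near 0
```

Note that the real metrics are reported on model outputs (for example the `negentropy_logits` and `lz_complexity_logits` telemetry keys), not on raw input bits.
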
### Safety Gates

```python
from bit_transformer.safety import SafetyGate, safe_sample_with_retry

# Configure safety gate
gate = SafetyGate(
    c_floor=0.3,   # Minimum complexity
    s_floor=0.5,   # Minimum symbiosis
    decay=0.9,     # EMA decay factor
    burn_in=10     # Steps before gating starts
)

# Check if output should be blocked
should_block = gate.should_trigger(c_val=0.2, s_val=0.4)  # True - below thresholds

# Safe sampling with automatic retry
output = safe_sample_with_retry(
    model,
    input_bits,
    max_retries=3,
    retry_strategy="diffusion"   # Try diffusion mode on failure
)
```

### Metric Drift Detection

```python
from bit_transformer.telemetry import detect_metric_drift

# Monitor metric stability over time
metrics_history = [
    {"K": 0.5, "C": 0.6, "S": 0.7},
    {"K": 0.52, "C": 0.58, "S": 0.69},
    {"K": 0.8, "C": 0.9, "S": 0.4},  # Drift detected!
    # ... more metrics
]

drift_detected = detect_metric_drift(
    metrics_history,
    window=10,      # Look back 10 steps
    threshold=0.2   # Alert if change > 0.2
)

if drift_detected:
    print("⚠️ Model behavior drift detected!")
```

---

## Distributed Training

### FSDP (Fully Sharded Data Parallel)

```python
from bit_transformer.distributed import wrap_fsdp, setup_distributed
import torch.distributed as dist

# Initialize distributed training
setup_distributed(rank=0, world_size=4)

# Wrap model with FSDP
model = BitTransformerLM(d_model=1024, nhead=16, num_layers=12)
fsdp_model = wrap_fsdp(
    model,
    sharding_strategy="FULL_SHARD",   # or "SHARD_GRAD_OP", "NO_SHARD"
    mixed_precision=True,
    device_id=0
)

# Train with FSDP
train_loop(
    fsdp_model,
    data,
    epochs=10,
    batch_size=2,   # Smaller batch per GPU
    amp=True
)
```

### Pipeline Parallelism

```python
from bit_transformer.distributed import make_pipeline

# Create pipeline parallel model
pipeline_model = make_pipeline(
    model,
    balance=[2, 2, 2, 2],   # Split 8 layers across 4 GPUs
    devices=[0, 1, 2, 3],
    checkpoint="never"      # or "always", "except_last"
)

# Pipeline training requires special handling
# See unified_workflow.py for complete implementation
```

### Multi-GPU Training Script

```bash
# Single node, multiple GPUs
python -m torch.distributed.launch \
    --nproc_per_node=4 \
    unified_workflow.py \
    --distributed \
    --batch-size 2 \
    --epochs 10

# Multiple nodes
python -m torch.distributed.launch \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="192.168.1.100" \
    --master_port=29500 \
    --nproc_per_node=4 \
    unified_workflow.py \
    --distributed
```

---

## Performance Optimization

### Memory Optimization

```python
# Enable all memory optimizations
model = BitTransformerLM(
    d_model=512,
    nhead=8,
    num_layers=8,
    reversible=True,          # Reversible layers save ~50% memory
    use_checkpoint=True,      # Gradient checkpointing
    chunk_size=64,            # Chunked attention for long sequences
    full_attn_logging=False   # Skip full attention reconstruction
)

# Training optimizations
train_loop(
    model,
    data,
    batch_size=4,        # Smaller batches
    accum_steps=8,       # Gradient accumulation
    amp=True,            # Mixed precision
    compile_model=True   # torch.compile
)
```

### CPU Optimization

```python
from bit_transformer.torch_utils import cpu_autocast

# Enable BF16 on CPU
with cpu_autocast():
    logits, telemetry = model(bits)

# Or enable for the entire model
model = BitTransformerLM(use_autocast=True)  # Automatically uses CPU BF16
```

### Inference Optimization

```python
# Quantize for inference
import torch

from bit_transformer import quantize_dynamic
from bit_transformer.utils import set_dropout

# Switch to evaluation mode
model.eval()
set_dropout(model, 0.0)

# Dynamic quantization
quantized = quantize_dynamic(model, dtype=torch.qint8)

# Optimize for inference
with torch.no_grad():
    logits, _ = quantized(input_bits)
```

### Long Sequence Processing

```python
from bit_transformer.model import infer_long_sequence

# Process sequences longer than max_seq_len
long_text = "Very long text..." * 1000
bits = text_to_bits(long_text)

output = infer_long_sequence(
    model,
    torch.tensor(bits).unsqueeze(0),
    chunk_size=256,   # Process in 256-bit chunks
    overlap=32,       # 32-bit overlap between chunks
    stride=224        # 224-bit stride (256 - 32)
)
```

---

## Troubleshooting

### Common Issues

#### 1. **Memory Errors**
```
RuntimeError: CUDA out of memory
```
**Solutions:**
- Enable reversible layers: `reversible=True`
- Enable gradient checkpointing: `use_checkpoint=True`
- Reduce batch size or use gradient accumulation
- Use chunked attention: `chunk_size=64`
- Enable mixed precision: `amp=True`

#### 2. **Tensor Shape Mismatches**
```
RuntimeError: view size is not compatible with input tensor's size
```
**Solutions:**
- Always use `.reshape()` instead of `.view()` with BitTransformerLM (illustrated below)
- Check that input sequences are properly formatted (1D for bits)
- Ensure batch dimensions are consistent

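The `.reshape()` versus `.view()` advice follows from a general PyTorch rule rather than anything specific to this repository: `.view()` requires a contiguous memory layout, while `.reshape()` silently falls back to a copy when needed. A generic illustration:

```python
import torch

x = torch.arange(12).reshape(3, 4)
t = x.t()                 # transpose produces a non-contiguous view

print(t.is_contiguous())  # False
print(t.reshape(-1)[:5])  # works: reshape copies if it has to

try:
    t.view(-1)            # fails on a non-contiguous tensor
except RuntimeError as e:
    print("view failed:", e)
```
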
#### 3. **Parity Check Failures**
```
ValueError: Parity check failed
```
**Solutions:**
- Use `enforce_parity()` to fix parity bits in generated sequences
- Check that text encoding/decoding is consistent
- Verify bit sequences have the correct 9-bit (8 data + parity) structure (a small checker follows)

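When this error appears, it can help to locate which byte groups break parity. The checker below is a minimal sketch that assumes the same illustrative "8 data bits + trailing even-parity bit" layout used earlier in this guide; `enforce_parity()` remains the authoritative fix.

```python
def find_parity_errors(bits: list[int]) -> list[int]:
    """Return byte indices whose 9-bit group fails the even-parity check."""
    bad = []
    for byte_idx in range(len(bits) // 9):
        group = bits[byte_idx * 9:(byte_idx + 1) * 9]
        if sum(group[:8]) % 2 != group[8]:
            bad.append(byte_idx)
    return bad

sample = [0, 1, 0, 0, 1, 0, 0, 0, 0,   # 'H' with correct parity
          0, 1, 1, 0, 1, 0, 0, 1, 1]   # 'i' with a corrupted parity bit
print(find_parity_errors(sample))      # [1]
```
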
#### 4. **Safety Gate Triggering**
```
SafetyError: Output blocked by safety gate
```
**Solutions:**
- Lower safety thresholds: `c_floor=0.2, s_floor=0.4`
- Increase burn-in period: `burn_in=20`
- Use retry with diffusion: `safe_sample_with_retry()`
- Check model training quality

### Debug Mode

```python
# Enable detailed logging
import logging
import torch

logging.basicConfig(level=logging.DEBUG)

# Model with debug telemetry
model = BitTransformerLM(
    d_model=64,
    nhead=4,
    num_layers=2,
    full_attn_logging=True,   # Log full attention maps
    chunk_size=None           # Disable chunking for debugging
)

# Inspect telemetry
logits, telemetry = model(input_bits)
print("Telemetry keys:", list(telemetry.keys()))
print("Attention maps shape:", [a.shape for a in telemetry['attention_maps']])
activations = torch.stack(telemetry['activations'])
print(f"Activation stats: mean={activations.mean().item():.4f}, std={activations.std().item():.4f}")
```

### Performance Profiling

```python
import torch.profiler
import torch.nn.functional as F

# Profile a training step (assumes model, input_bits, and targets are defined as above)
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_stack=True,
) as prof:
    logits, telemetry = model(input_bits)
    loss = F.cross_entropy(logits.reshape(-1, 2), targets.reshape(-1))
    loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total"))
```

---

## Best Practices

### Model Configuration

#### For Experimentation (< 1M parameters)
```python
model = BitTransformerLM(
    d_model=64,
    nhead=4,
    num_layers=2,
    dim_feedforward=128,
    max_seq_len=128,
    reversible=False,   # Simpler for debugging
    use_checkpoint=False
)
```

#### For Research (1M-100M parameters)
```python
model = BitTransformerLM(
    d_model=256,
    nhead=8,
    num_layers=6,
    dim_feedforward=1024,
    max_seq_len=512,
    reversible=True,    # Enable memory efficiency
    use_checkpoint=True,
    chunk_size=128,
    lambda_K=0.05,      # Light regularization
    lambda_C=0.05,
    lambda_S=0.05
)
```

#### For Large-Scale (100M+ parameters)
```python
model = BitTransformerLM(
    d_model=1024,
    nhead=16,
    num_layers=20,
    dim_feedforward=4096,
    max_seq_len=2048,
    reversible=True,
    use_checkpoint=True,
    chunk_size=256,
    full_attn_logging=False,   # Save memory
    lambda_K=0.1,
    lambda_C=0.1,
    lambda_S=0.1
)
```

### Training Best Practices

1. **Always validate on held-out data** to monitor overfitting
2. **Use gradient clipping** to prevent training instability
3. **Monitor telemetry metrics** for signs of model degradation
4. **Start with smaller models** before scaling up
5. **Use safety gates** in production deployments
6. **Enable logging** to track training progress
7. **Save checkpoints frequently** to prevent loss of progress (a minimal example follows this list)

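For point 7, plain PyTorch checkpointing is sufficient if you are not using the dashboard's checkpoint management. The directory name, file naming, and save frequency below are arbitrary illustrative choices:

```python
import os
import torch

def save_checkpoint(model, optimizer, epoch, directory="checkpoints"):
    """Save model and optimizer state for later resumption."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f"bitlm_epoch{epoch}.pt")
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )
    return path

def load_checkpoint(model, optimizer, path):
    """Restore model and optimizer state; returns the saved epoch."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"]

# e.g. inside a training loop:
# if epoch % 5 == 0:
#     save_checkpoint(model, optimizer, epoch)
```
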
### Data Preparation

```python
import torch

from bit_transformer import text_to_bits

# Good: clean, well-formatted text
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is transforming technology.",
    "BitTransformer processes information at the bit level."
]

# Convert to training sequences
all_bits = []
for text in texts:
    bits = text_to_bits(text)
    all_bits.extend(bits)

# Create overlapping sequences for better learning
data = torch.tensor(all_bits)
seq_len = 128
stride = 64
sequences = []
for i in range(0, len(data) - seq_len, stride):
    sequences.append(data[i:i + seq_len])

training_data = torch.stack(sequences)
```

### Production Deployment

```python
import logging

import torch

# Production-ready model setup
model.eval()  # Disable dropout
set_dropout(model, 0.0)

# Enable safety monitoring
gate = SafetyGate(c_floor=0.3, s_floor=0.5, burn_in=5)

# Quantize for efficiency
production_model = quantize_dynamic(model)

# Safe inference with monitoring
def safe_generate(input_text, max_length=100):
    try:
        input_bits = torch.tensor(text_to_bits(input_text)).unsqueeze(0)
        return safe_sample_with_retry(
            production_model,
            input_bits,
            max_retries=3
        )
    except Exception as e:
        logging.error(f"Generation failed: {e}")
        return "Error: Unable to generate safe output"
```

---

## Getting Help

### Documentation Resources
- **ABOUTME.md**: Project overview and quick start
- **README.md**: Professional model card and specifications
- **RESEARCH_STATUS.md**: Current research status and limitations
- **EMPIRICAL_VALIDATION.md**: Evidence-based analysis of capabilities

### Community Support
- **GitHub Issues**: Report bugs and request features
- **Discussions**: Ask questions and share experiences
- **Examples**: Check the `tests/` directory for usage examples

### **🤖 Recommended: Use with Claude Code**

For the best experience with BitTransformerLM, we recommend using [Claude Code](https://claude.ai/code):

- **Interactive Setup**: Get step-by-step guidance for configuration
- **Real-time Debugging**: Immediate help when things go wrong
- **Code Generation**: Custom scripts and experiments tailored to your needs
- **Architecture Explanation**: Deep understanding of bit-native processing
- **Best Practices**: Learn optimal configurations for your use case

Claude Code understands BitTransformerLM's unique architecture and can help you navigate the complexities of bit-level language modeling.

---

**Remember: BitTransformerLM is experimental research software. Always validate results thoroughly and use safety monitoring in any deployment.**

Happy experimenting! 🤖✨