# BitTransformerLM Claude Code Integration Guide

## Overview

BitTransformerLM is designed to work smoothly with [Claude Code](https://claude.ai/code), which can assist with setup, development, and research workflows. This document offers guidelines for working with BitTransformerLM both in Claude Code and in standalone development.

## Why Claude Code?

BitTransformerLM's unique bit-native architecture has several complexities that Claude Code can help navigate:

- **Complex Architecture**: Understanding bit-level processing, reversible layers, and safety telemetry
- **Parameter Tuning**: Optimizing model configurations for different use cases
- **Safety Monitoring**: Interpreting K/C/S metrics and configuring safety gates
- **Distributed Training**: Setting up FSDP and pipeline parallelism correctly
- **Debugging**: Identifying issues specific to bit-native processing

Claude Code understands these nuances and can provide real-time assistance.

---

## Repository Scope and Architecture

### Core Capabilities
BitTransformerLM implements bit-native language modeling with:
- **Bit-Native Processing**: Direct binary sequence modeling with parity protection
- **Reversible Layers**: Reversible transformer blocks that cut activation memory by roughly 50%
- **Safety Telemetry**: Real-time K/C/S (Negentropy/Complexity/Symbiosis) monitoring
- **Diffusion Mode**: Bidirectional denoising with multiple noise schedules
- **Progressive Scaling**: Automatic model expansion based on validation performance
- **Distributed Training**: FSDP and pipeline parallelism for large-scale training
- **Interactive Dashboard**: Real-time training control and visualization

### Experimental Status
**Important**: BitTransformerLM is experimental research software requiring:
- Rigorous baseline comparisons against standard transformers
- Validation on established language modeling benchmarks
- Statistical significance testing across multiple runs
- Careful interpretation of safety metrics and claims

---

## Environment Setup

### Requirements
- **Python 3.10+** (required for modern PyTorch features)
- **PyTorch 2.7.1+** with appropriate CUDA support if using GPUs

### Installation Options

#### CPU-Only Installation
```bash
pip install --extra-index-url https://download.pytorch.org/whl/cpu -r requirements.txt
```

#### GPU Installation  
```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118
pip install -r requirements.txt
```
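
After either install, it is worth a quick sanity check that PyTorch sees your hardware; this uses only the standard PyTorch API:

```python
import torch

# Report the installed PyTorch version and whether a CUDA device is visible.
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```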

#### Claude Code Assisted Setup
When using Claude Code, you can simply ask:
- "Help me set up BitTransformerLM for my system"
- "Configure BitTransformerLM for GPU training"
- "Set up a development environment for bit-native language modeling"

Claude Code will guide you through hardware detection, dependency installation, and initial configuration.

---

## Repository Structure

```
BitTransformerLM/
β”œβ”€β”€ bit_transformer/              # Core package
β”‚   β”œβ”€β”€ model.py                  # BitTransformerLM architecture
β”‚   β”œβ”€β”€ telemetry.py              # K/C/S safety metrics
β”‚   β”œβ”€β”€ safety.py                 # Safety gates and monitoring
β”‚   β”œβ”€β”€ bit_io.py                 # Text ↔ bits conversion
β”‚   β”œβ”€β”€ compression.py            # Run-length encoding
β”‚   β”œβ”€β”€ training.py               # Training utilities
β”‚   β”œβ”€β”€ distributed.py            # FSDP and pipeline parallelism
β”‚   β”œβ”€β”€ dashboard_app.py          # Interactive web dashboard
β”‚   β”œβ”€β”€ quantization.py           # INT8/4-bit quantization
β”‚   └── [other modules...]        # Additional functionality
β”œβ”€β”€ tests/                        # Test suite and results
β”œβ”€β”€ example.py                    # Basic usage example
β”œβ”€β”€ unified_workflow.py           # Main training script
β”œβ”€β”€ mcp_server.py                 # Management Control Protocol server
β”œβ”€β”€ USER_GUIDE.md                 # Comprehensive user documentation
└── [other scripts...]            # Utilities and examples
```
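
To get a feel for the bit-native pipeline, `bit_io.py` handles the text ↔ bits conversion listed above. The helper names in this sketch (`text_to_bits`, `bits_to_text`) are assumptions for illustration only; check `bit_transformer/bit_io.py` for the actual API:

```python
# Hypothetical round-trip sketch; the function names are assumed,
# so verify them against bit_transformer/bit_io.py before use.
from bit_transformer.bit_io import text_to_bits, bits_to_text

bits = text_to_bits("hello world")   # binary sequence (with parity protection)
text = bits_to_text(bits)            # should reconstruct the original string
print(len(bits), text)
```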

---

## Development Workflow with Claude Code

### Getting Started

1. **Initial Setup**
   ```
   "Help me understand BitTransformerLM's architecture"
   "Create a simple training script for bit-native language modeling"
   "Explain the difference between causal and diffusion modes"
   ```

2. **Model Configuration**
   ```
   "Configure a BitTransformerLM for [my specific use case]"
   "What are optimal hyperparameters for a [size] model?"
   "Help me enable reversible layers and gradient checkpointing"
   ```

3. **Training and Monitoring**
   ```
   "Set up distributed training with FSDP"
   "Interpret these K/C/S telemetry values: K=0.3, C=0.6, S=0.4"
   "Debug this memory error during training"
   ```

### Claude Code Advantages

**Real-time Assistance**: Get immediate help with:
- Parameter configuration and tuning
- Error diagnosis and resolution  
- Architecture modification and experimentation
- Safety metric interpretation
- Performance optimization

**Context-Aware Suggestions**: Claude Code understands:
- BitTransformerLM's unique bit-native processing
- The relationship between safety metrics
- Memory optimization strategies
- Distributed training complexities

---

## Key Commands and Workflows

### Basic Usage
```bash
# Run simple example
python example.py

# Launch interactive dashboard
python unified_workflow.py --dashboard

# Train with diffusion mode
python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32
```

### Advanced Training
```bash
# Distributed training with FSDP
python unified_workflow.py --distributed --batch-size 2 --epochs 10

# Mixed precision with quantization
python unified_workflow.py --amp --qat

# Progressive scaling with curriculum learning
python unified_workflow.py --progressive --diffusion-curriculum
```

### Dashboard and Monitoring
```bash
# Start MCP server and dashboard
python mcp_server.py &
MCP_SERVER_ADDR=http://127.0.0.1:7000 python -m bit_transformer.dashboard_app
```

**Dashboard Features:**
- Real-time telemetry visualization
- Interactive model configuration
- HuggingFace checkpoint management
- Safe inference testing interface

---

## Safety and Telemetry

### Core Metrics

| Metric | Full Name | Range | Interpretation |
|--------|-----------|-------|----------------|
| **K** | Negentropy | 0-1 | Information content (0=noise, 1=ordered) |
| **C** | LZ Complexity | 0-1 | Pattern complexity (higher=more complex) |
| **S** | Symbiosis | 0-1 | Alignment with reference (higher=better) |
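
As a concrete illustration of the K metric, negentropy can be read as one minus the normalized Shannon entropy of the bit distribution: all-identical bits score 1, while an even mix of 0s and 1s scores 0. The snippet below is a back-of-the-envelope version for a single sequence and is not necessarily the exact computation in `telemetry.py`:

```python
import math
import random

def bit_negentropy(bits: list[int]) -> float:
    """Approximate K = 1 - H(p) / H_max for a sequence of 0s and 1s."""
    if not bits:
        return 0.0
    p1 = sum(bits) / len(bits)
    entropy = -sum(p * math.log2(p) for p in (p1, 1.0 - p1) if p > 0)
    return 1.0 - entropy  # H_max = 1 bit for a binary distribution

print(bit_negentropy([random.randint(0, 1) for _ in range(256)]))  # near 0: noise-like
print(bit_negentropy([1] * 64))                                    # 1.0: fully ordered
```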

### Using with Claude Code

```
"Explain what K=0.2, C=0.8, S=0.3 means for my model"
"Configure safety gates for production use"  
"My model is generating repetitive output, what safety metrics should I check?"
"Set up drift detection for telemetry monitoring"
```

Claude Code can help interpret these metrics in context and suggest appropriate safety thresholds.

### Safety Gate Configuration
```python
from bit_transformer.safety import SafetyGate

# Production-ready safety gate
gate = SafetyGate(
    c_floor=0.3,      # Minimum complexity
    s_floor=0.5,      # Minimum symbiosis
    decay=0.9,        # EMA decay factor
    burn_in=10        # Steps before gating starts
)
```
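
The EMA decay smooths per-step readings so a single noisy step does not trip the gate, and `burn_in` gives the first few steps a grace period before enforcement. How the gate is wired into a loop depends on the actual API in `safety.py`; the `gate.update()` call below is a hypothetical placeholder showing where the check would sit:

```python
# Hypothetical usage sketch; gate.update() is an assumed method name,
# so check bit_transformer/safety.py for the real interface.
dummy_telemetry = [(0.6, 0.7)] * 12 + [(0.1, 0.2)] * 5  # stand-in (C, S) readings

for step, (c_value, s_value) in enumerate(dummy_telemetry):
    if gate.update(c=c_value, s=s_value):  # assumed to return True when gating
        print(f"Safety gate tripped at step {step}: C={c_value:.2f}, S={s_value:.2f}")
        break
```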

---

## Best Practices for Claude Code Development

### 1. **Always Validate Research Claims**
Ask Claude Code to help you:
- Set up proper baseline comparisons
- Design statistical significance tests
- Implement evaluation on standard benchmarks
- Document limitations and assumptions

### 2. **Use Progressive Development**
```
"Start me with a minimal BitTransformerLM example"
"Now add safety monitoring"
"Scale up to distributed training"
"Add diffusion mode capabilities"
```
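
A minimal concrete starting point, using only constructor arguments shown elsewhere in this guide, might look like the sketch below. The forward-pass contract (a batch of bit tensors in, logits plus telemetry out) is an assumption to verify against `model.py` and `example.py`:

```python
import torch
from bit_transformer.model import BitTransformerLM

# Tiny configuration; all arguments appear in the larger example later in this guide.
model = BitTransformerLM(
    d_model=128,
    nhead=4,
    num_layers=2,
    dim_feedforward=512,
    max_seq_len=256,
)

# Hypothetical forward pass on a random batch of bits; the exact input dtype/shape
# and return value are assumptions -- see model.py and example.py for the real contract.
bits = torch.randint(0, 2, (1, 256))
output = model(bits)
print(type(output))
```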

### 3. **Leverage Claude Code for Architecture Understanding**
```
"Explain how reversible layers save memory"
"Walk me through the bit encoding process"
"How does the safety telemetry system work?"
"Compare BitTransformerLM to standard transformers"
```

### 4. **Get Help with Complex Configurations**
```python
# Ask Claude Code to help configure models like:
model = BitTransformerLM(
    d_model=1024,           # Claude Code can suggest optimal values
    nhead=16,               # Based on your hardware and use case
    num_layers=20,
    dim_feedforward=4096,
    max_seq_len=2048,
    reversible=True,        # Memory optimization
    use_checkpoint=True,    # Gradient checkpointing
    chunk_size=256,         # Attention chunking
    lambda_K=0.1,           # Regularization weights
    lambda_C=0.1,
    lambda_S=0.1
)
```

---

## Development Guidelines

### Code Style
- **Functions**: `snake_case` (e.g., `train_loop`, `safe_inference`)
- **Classes**: `CamelCase` (e.g., `BitTransformerLM`, `SafetyGate`)
- **Constants**: `UPPER_SNAKE_CASE` (e.g., `MAX_SEQ_LEN`)
- **Keep functions under 300 lines** and minimize deep nesting

### Security and Safety
- **Never reintroduce the deprecated `/exec` endpoint**
- **Always use safety gates in production**
- **Validate all user inputs** in dashboard and API endpoints
- **Monitor telemetry metrics** for anomalous behavior
- **Use the `cpu_autocast()` helper** instead of calling `torch.amp.autocast` directly

### Memory Management
```python
# Good: Memory-efficient configuration
model = BitTransformerLM(
    reversible=True,        # Enable reversible layers
    use_checkpoint=True,    # Gradient checkpointing  
    chunk_size=128,         # Chunked attention
    full_attn_logging=False # Skip full attention reconstruction
)

# Training with memory optimizations
train_loop(
    model, data,
    amp=True,              # Mixed precision
    accum_steps=4,         # Gradient accumulation
    compile_model=True     # torch.compile optimization
)
```
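
These options compound: reversible layers avoid storing most intermediate activations by recomputing them in the backward pass, checkpointing and chunked attention trade extra compute for lower peak memory, and mixed precision plus gradient accumulation preserves the effective batch size while shrinking the per-step footprint.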

### Testing and Validation
```bash
# Run tests after changes
pytest -q
```

```python
# Model evaluation modes
model.train()    # For training
model.eval()     # For inference/evaluation
set_dropout(model, 0.0)  # Disable dropout for reproducible results
```

---

## Getting Help from Claude Code

### Specific Areas Where Claude Code Excels

1. **Architecture Design**
   - "Design a BitTransformerLM architecture for [specific task]"
   - "Optimize this model configuration for memory efficiency"
   - "Explain the trade-offs between reversible and standard layers"

2. **Training Optimization**
   - "My training is running out of memory, help optimize"
   - "Configure distributed training for 4 GPUs"
   - "Set up a training curriculum for bit-native language modeling"

3. **Safety and Monitoring**
   - "Interpret these telemetry readings and suggest adjustments"
   - "Set up production-ready safety monitoring"
   - "Debug why my safety gate is triggering constantly"

4. **Research and Evaluation**
   - "Design a rigorous evaluation comparing BitTransformerLM to GPT-2"
   - "Set up statistical significance testing for my experiments"
   - "Help me write up my research findings honestly"

### Sample Claude Code Interactions

```
User: "I'm getting OOM errors training a 500M parameter BitTransformerLM"

Claude Code: "Let me help optimize your memory usage. Here's a configuration 
that should work better for your model size:

model = BitTransformerLM(
    d_model=768,
    nhead=12, 
    num_layers=12,
    reversible=True,          # Critical for large models
    use_checkpoint=True,      # Trade compute for memory
    chunk_size=64,            # Reduce attention memory
    full_attn_logging=False   # Skip expensive logging
)

train_loop(
    model, data,
    batch_size=1,             # Small batch size
    accum_steps=16,           # Maintain effective batch size
    amp=True                  # Mixed precision training
)

This should reduce memory usage by ~60% compared to standard configuration."
```

---

## Licensing and Distribution

BitTransformerLM is available under dual licensing:
- **Open Source**: AGPLv3 for research and open source use
- **Commercial**: Contact **[email protected]** for commercial licensing

When working with Claude Code, ensure compliance with the AGPLv3 license for any derivatives or modifications you create.

---

## Research Integrity

**Important Reminder**: BitTransformerLM is experimental research software. When using Claude Code:

1. **Always validate claims** through proper baseline comparisons
2. **Document limitations** honestly in any publications or reports  
3. **Use statistical significance testing** for any performance claims
4. **Follow established ML research best practices**
5. **Share negative results** as well as positive ones

Claude Code can help you design rigorous experiments and avoid common pitfalls in ML research.

---

## Support and Community

### Getting Help
- **Claude Code**: Real-time AI assistance with BitTransformerLM
- **GitHub Issues**: Bug reports and feature requests
- **Discussions**: Community questions and sharing
- **User Guide**: Comprehensive documentation (`USER_GUIDE.md`)
- **Project Overview**: Complete project information (`ABOUTME.md`)

### Contributing
When contributing to BitTransformerLM:
1. Use Claude Code to ensure code quality and consistency
2. Follow the development guidelines in this document
3. Add tests for new functionality
4. Update documentation as needed
5. Ensure all safety and security practices are followed

---

**BitTransformerLM + Claude Code provides a powerful combination for exploring bit-native language modeling with AI assistance. Start experimenting responsibly and share your findings with the research community!** πŸ€–βœ¨