WCNegentropy committed
Commit cd203a2 · verified · 1 parent: 7cf71dd

Add Claude Code integration guide

Files changed (1): CLAUDE.md (+404, -0)
# BitTransformerLM Claude Code Integration Guide

## Overview

BitTransformerLM is designed for use with [Claude Code](https://claude.ai/code), which provides AI-assisted setup, development, and research workflows. This document provides guidelines for working with BitTransformerLM both in Claude Code and in standalone development.

## Why Claude Code?

BitTransformerLM's unique bit-native architecture has several complexities that Claude Code can help navigate:

- **Complex Architecture**: Understanding bit-level processing, reversible layers, and safety telemetry
- **Parameter Tuning**: Optimizing model configurations for different use cases
- **Safety Monitoring**: Interpreting K/C/S metrics and configuring safety gates
- **Distributed Training**: Setting up FSDP and pipeline parallelism correctly
- **Debugging**: Identifying issues specific to bit-native processing

Claude Code understands these nuances and can provide real-time assistance.

---

## Repository Scope and Architecture

### Core Capabilities
BitTransformerLM implements bit-native language modeling with:
- **Bit-Native Processing**: Direct binary sequence modeling with parity protection (see the sketch after this list)
- **Reversible Layers**: Memory-efficient transformer blocks that save ~50% memory
- **Safety Telemetry**: Real-time K/C/S (Negentropy/Complexity/Symbiosis) monitoring
- **Diffusion Mode**: Bidirectional denoising with multiple noise schedules
- **Progressive Scaling**: Automatic model expansion based on validation performance
- **Distributed Training**: FSDP and pipeline parallelism for large-scale training
- **Interactive Dashboard**: Real-time training control and visualization

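For intuition, here is a minimal, self-contained sketch of bit-native encoding with per-byte parity protection. It illustrates the concept only; `bit_io.py`'s actual API and bit layout may differ.

```python
def text_to_bits(text: str) -> list[int]:
    """Illustrative encoding: 8 data bits per UTF-8 byte (MSB-first),
    followed by one even-parity bit. Not bit_io.py's real implementation."""
    bits: list[int] = []
    for byte in text.encode("utf-8"):
        payload = [(byte >> i) & 1 for i in range(7, -1, -1)]
        bits.extend(payload + [sum(payload) % 2])  # parity bit protects each byte
    return bits

def parity_ok(bits: list[int]) -> bool:
    """Check that every 9-bit group sums to an even number."""
    return all(sum(bits[i:i + 9]) % 2 == 0 for i in range(0, len(bits), 9))

assert parity_ok(text_to_bits("Hello, bits!"))
```
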
### Experimental Status
**Important**: BitTransformerLM is experimental research software requiring:
- Rigorous baseline comparisons against standard transformers
- Validation on established language modeling benchmarks
- Statistical significance testing across multiple runs
- Careful interpretation of safety metrics and claims

---

## Environment Setup

### Requirements
- **Python 3.10+** (required for modern PyTorch features)
- **PyTorch 2.7.1+** with appropriate CUDA support if using GPUs

### Installation Options

#### CPU-Only Installation
```bash
pip install --extra-index-url https://download.pytorch.org/whl/cpu -r requirements.txt
```

#### GPU Installation
```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118
pip install -r requirements.txt
```
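
After either install, a quick sanity check confirms the PyTorch version and whether CUDA is visible:

```python
import torch

print(torch.__version__)          # expect 2.7.1 or newer
print(torch.cuda.is_available())  # True only on a working GPU install
```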

#### Claude Code Assisted Setup
When using Claude Code, simply ask:
- "Help me set up BitTransformerLM for my system"
- "Configure BitTransformerLM for GPU training"
- "Set up a development environment for bit-native language modeling"

Claude Code will guide you through hardware detection, dependency installation, and initial configuration.

---

## Repository Structure

```
BitTransformerLM/
├── bit_transformer/          # Core package
│   ├── model.py              # BitTransformerLM architecture
│   ├── telemetry.py          # K/C/S safety metrics
│   ├── safety.py             # Safety gates and monitoring
│   ├── bit_io.py             # Text ↔ bits conversion
│   ├── compression.py        # Run-length encoding
│   ├── training.py           # Training utilities
│   ├── distributed.py        # FSDP and pipeline parallelism
│   ├── dashboard_app.py      # Interactive web dashboard
│   ├── quantization.py       # INT8/4-bit quantization
│   └── [other modules...]    # Additional functionality
├── tests/                    # Test suite and results
├── example.py                # Basic usage example
├── unified_workflow.py       # Main training script
├── mcp_server.py             # Management Control Protocol server
├── USER_GUIDE.md             # Comprehensive user documentation
└── [other scripts...]        # Utilities and examples
```

---

## Development Workflow with Claude Code

### Getting Started

1. **Initial Setup**
   ```
   "Help me understand BitTransformerLM's architecture"
   "Create a simple training script for bit-native language modeling"
   "Explain the difference between causal and diffusion modes"
   ```

2. **Model Configuration**
   ```
   "Configure a BitTransformerLM for [my specific use case]"
   "What are optimal hyperparameters for a [size] model?"
   "Help me enable reversible layers and gradient checkpointing"
   ```

3. **Training and Monitoring**
   ```
   "Set up distributed training with FSDP"
   "Interpret these K/C/S telemetry values: K=0.3, C=0.6, S=0.4"
   "Debug this memory error during training"
   ```

### Claude Code Advantages

**Real-time Assistance**: Get immediate help with:
- Parameter configuration and tuning
- Error diagnosis and resolution
- Architecture modification and experimentation
- Safety metric interpretation
- Performance optimization

**Context-Aware Suggestions**: Claude Code understands:
- BitTransformerLM's unique bit-native processing
- The relationships between the safety metrics
- Memory optimization strategies
- Distributed training complexities

---

## Key Commands and Workflows

### Basic Usage
```bash
# Run the simple example
python example.py

# Launch the interactive dashboard
python unified_workflow.py --dashboard

# Train with diffusion mode
python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32
```

### Advanced Training
```bash
# Distributed training with FSDP
python unified_workflow.py --distributed --batch-size 2 --epochs 10

# Mixed precision with quantization-aware training
python unified_workflow.py --amp --qat

# Progressive scaling with curriculum learning
python unified_workflow.py --progressive --diffusion-curriculum
```

### Dashboard and Monitoring
```bash
# Start the MCP server and dashboard
python mcp_server.py &
MCP_SERVER_ADDR=http://127.0.0.1:7000 python -m bit_transformer.dashboard_app
```

**Dashboard Features:**
- Real-time telemetry visualization
- Interactive model configuration
- HuggingFace checkpoint management
- Safe inference testing interface

---

## Safety and Telemetry

### Core Metrics

| Metric | Full Name | Range | Interpretation |
|--------|-----------|-------|----------------|
| **K** | Negentropy | 0-1 | Information content (0 = noise, 1 = ordered) |
| **C** | LZ Complexity | 0-1 | Pattern complexity (higher = more complex) |
| **S** | Symbiosis | 0-1 | Alignment with reference (higher = better) |

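For intuition about K and C, the sketch below implements one common formulation of each metric on raw bit lists. These formulations are assumptions for illustration; `telemetry.py`'s exact definitions and normalizations may differ.

```python
import math

def negentropy_k(bits: list[int]) -> float:
    """Illustrative K: 1 minus the Shannon entropy (in bits) of the
    empirical 0/1 distribution. 0 ~ balanced noise, 1 ~ fully ordered."""
    p = sum(bits) / len(bits)
    if p in (0.0, 1.0):
        return 1.0
    return 1.0 + p * math.log2(p) + (1 - p) * math.log2(1 - p)

def lz_complexity_c(bits: list[int]) -> float:
    """Illustrative C: LZ76-style phrase count, normalized by n / log2(n)."""
    s = "".join(map(str, bits))
    i, phrases = 0, 0
    while i < len(s):
        length = 1
        while i + length <= len(s) and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases / (len(s) / math.log2(len(s)))

ordered = [1] * 256                      # constant stream: K = 1, very low C
balanced = [i % 2 for i in range(256)]   # alternating: balanced 0/1 so K = 0, still low C
print(negentropy_k(ordered), lz_complexity_c(ordered))
print(negentropy_k(balanced), lz_complexity_c(balanced))
```
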
### Using with Claude Code

```
"Explain what K=0.2, C=0.8, S=0.3 means for my model"
"Configure safety gates for production use"
"My model is generating repetitive output; what safety metrics should I check?"
"Set up drift detection for telemetry monitoring"
```

Claude Code can help interpret these metrics in context and suggest appropriate safety thresholds.

### Safety Gate Configuration
```python
from bit_transformer.safety import SafetyGate

# Production-ready safety gate
gate = SafetyGate(
    c_floor=0.3,  # Minimum complexity
    s_floor=0.5,  # Minimum symbiosis
    decay=0.9,    # EMA decay factor
    burn_in=10,   # Steps before gating starts
)
```
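
The sketch below shows how a gate with these parameters could behave: exponential moving averages of C and S are tracked, and gating begins only after the burn-in period. This is a hypothetical illustration of the mechanics, not `SafetyGate`'s actual implementation or method names.

```python
class EmaGateSketch:
    """Hypothetical EMA-based gate mirroring the parameters above."""

    def __init__(self, c_floor=0.3, s_floor=0.5, decay=0.9, burn_in=10):
        self.c_floor, self.s_floor = c_floor, s_floor
        self.decay, self.burn_in = decay, burn_in
        self.c_ema = self.s_ema = None
        self.steps = 0

    def should_gate(self, c: float, s: float) -> bool:
        """Update the EMAs; trip the gate when either EMA falls below its floor."""
        self.steps += 1
        self.c_ema = c if self.c_ema is None else self.decay * self.c_ema + (1 - self.decay) * c
        self.s_ema = s if self.s_ema is None else self.decay * self.s_ema + (1 - self.decay) * s
        if self.steps <= self.burn_in:  # no gating during burn-in
            return False
        return self.c_ema < self.c_floor or self.s_ema < self.s_floor

gate = EmaGateSketch()
print(gate.should_gate(c=0.1, s=0.9))  # False: still inside the burn-in window
```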

---

## Best Practices for Claude Code Development

### 1. **Always Validate Research Claims**
Ask Claude Code to help you:
- Set up proper baseline comparisons
- Design statistical significance tests
- Implement evaluation on standard benchmarks
- Document limitations and assumptions

### 2. **Use Progressive Development**
```
"Start me with a minimal BitTransformerLM example"
"Now add safety monitoring"
"Scale up to distributed training"
"Add diffusion mode capabilities"
```

### 3. **Leverage Claude Code for Architecture Understanding**
```
"Explain how reversible layers save memory"
"Walk me through the bit encoding process"
"How does the safety telemetry system work?"
"Compare BitTransformerLM to standard transformers"
```

### 4. **Get Help with Complex Configurations**
```python
# Ask Claude Code to help configure models like:
model = BitTransformerLM(
    d_model=1024,          # Claude Code can suggest optimal values
    nhead=16,              # based on your hardware and use case
    num_layers=20,
    dim_feedforward=4096,
    max_seq_len=2048,
    reversible=True,       # Memory optimization
    use_checkpoint=True,   # Gradient checkpointing
    chunk_size=256,        # Attention chunking
    lambda_K=0.1,          # Regularization weights
    lambda_C=0.1,
    lambda_S=0.1,
)
```

---

## Development Guidelines

### Code Style
- **Functions**: `snake_case` (e.g., `train_loop`, `safe_inference`)
- **Classes**: `CamelCase` (e.g., `BitTransformerLM`, `SafetyGate`)
- **Constants**: `UPPER_SNAKE_CASE` (e.g., `MAX_SEQ_LEN`)
- **Keep functions under 300 lines** and minimize deep nesting

### Security and Safety
- **Never reintroduce the deprecated `/exec` endpoint**
- **Always use safety gates in production**
- **Validate all user inputs** in dashboard and API endpoints
- **Monitor telemetry metrics** for anomalous behavior
- **Use the `cpu_autocast()` helper** instead of calling `torch.amp.autocast` directly

### Memory Management
```python
# Good: memory-efficient configuration
model = BitTransformerLM(
    reversible=True,          # Enable reversible layers
    use_checkpoint=True,      # Gradient checkpointing
    chunk_size=128,           # Chunked attention
    full_attn_logging=False,  # Skip full attention reconstruction
)

# Training with memory optimizations
train_loop(
    model, data,
    amp=True,            # Mixed precision
    accum_steps=4,       # Gradient accumulation
    compile_model=True,  # torch.compile optimization
)
```
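
Note that with gradient accumulation, the number of sequences contributing to each optimizer step is `batch_size × accum_steps`; the quick check below uses illustrative values.

```python
batch_size = 2   # sequences held in memory at once (illustrative value)
accum_steps = 4  # matches accum_steps in the train_loop call above
print(batch_size * accum_steps)  # 8 sequences per optimizer step
```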

### Testing and Validation
```bash
# Run tests after changes
pytest -q
```

```python
# Model evaluation modes
model.train()            # For training
model.eval()             # For inference/evaluation
set_dropout(model, 0.0)  # Disable dropout for reproducible results
```

---

## Getting Help from Claude Code

### Specific Areas Where Claude Code Excels

1. **Architecture Design**
   - "Design a BitTransformerLM architecture for [specific task]"
   - "Optimize this model configuration for memory efficiency"
   - "Explain the trade-offs between reversible and standard layers"

2. **Training Optimization**
   - "My training is running out of memory; help me optimize"
   - "Configure distributed training for 4 GPUs"
   - "Set up a training curriculum for bit-native language modeling"

3. **Safety and Monitoring**
   - "Interpret these telemetry readings and suggest adjustments"
   - "Set up production-ready safety monitoring"
   - "Debug why my safety gate is triggering constantly"

4. **Research and Evaluation**
   - "Design a rigorous evaluation comparing BitTransformerLM to GPT-2"
   - "Set up statistical significance testing for my experiments"
   - "Help me write up my research findings honestly"

### Sample Claude Code Interactions

```
User: "I'm getting OOM errors training a 500M parameter BitTransformerLM"

Claude Code: "Let me help optimize your memory usage. Here's a configuration
that should work better for your model size:

model = BitTransformerLM(
    d_model=768,
    nhead=12,
    num_layers=12,
    reversible=True,         # Critical for large models
    use_checkpoint=True,     # Trade compute for memory
    chunk_size=64,           # Reduce attention memory
    full_attn_logging=False  # Skip expensive logging
)

train_loop(
    model, data,
    batch_size=1,    # Small batch size
    accum_steps=16,  # Maintain effective batch size
    amp=True         # Mixed precision training
)

This should reduce memory usage by ~60% compared to the standard configuration."
```

---

## Licensing and Distribution

BitTransformerLM is available under dual licensing:
- **Open Source**: AGPLv3 for research and open source use
- **Commercial**: Contact **[email protected]** for commercial licensing

When working with Claude Code, ensure compliance with the AGPLv3 license for any derivatives or modifications you create.

---

## Research Integrity

**Important Reminder**: BitTransformerLM is experimental research software. When using Claude Code:

1. **Always validate claims** through proper baseline comparisons
2. **Document limitations** honestly in any publications or reports
3. **Use statistical significance testing** for any performance claims
4. **Follow established ML research best practices**
5. **Share negative results** as well as positive ones

Claude Code can help you design rigorous experiments and avoid common pitfalls in ML research.

---

## Support and Community

### Getting Help
- **Claude Code**: Real-time AI assistance with BitTransformerLM
- **GitHub Issues**: Bug reports and feature requests
- **Discussions**: Community questions and sharing
- **User Guide**: Comprehensive documentation (`USER_GUIDE.md`)
- **Project Overview**: Complete project information (`ABOUTME.md`)

### Contributing
When contributing to BitTransformerLM:
1. Use Claude Code to ensure code quality and consistency
2. Follow the development guidelines in this document
3. Add tests for new functionality
4. Update documentation as needed
5. Ensure all safety and security practices are followed

---

**BitTransformerLM + Claude Code provides a powerful combination for exploring bit-native language modeling with AI assistance. Start experimenting responsibly and share your findings with the research community!** 🤖✨