Add Claude Code integration guide
CLAUDE.md
ADDED
# BitTransformerLM Claude Code Integration Guide

## Overview

BitTransformerLM is designed to work smoothly with [Claude Code](https://claude.ai/code), which can assist with setup, development, and research workflows. This document provides guidelines for working with BitTransformerLM both inside Claude Code and in standalone development.

## Why Claude Code?

BitTransformerLM's unique bit-native architecture has several complexities that Claude Code can help navigate:

- **Complex Architecture**: Understanding bit-level processing, reversible layers, and safety telemetry
- **Parameter Tuning**: Optimizing model configurations for different use cases
- **Safety Monitoring**: Interpreting K/C/S metrics and configuring safety gates
- **Distributed Training**: Setting up FSDP and pipeline parallelism correctly
- **Debugging**: Identifying issues specific to bit-native processing

Claude Code understands these nuances and can provide real-time assistance.

---

## Repository Scope and Architecture

### Core Capabilities

BitTransformerLM implements bit-native language modeling with:

- **Bit-Native Processing**: Direct binary sequence modeling with parity protection (see the round-trip sketch below)
- **Reversible Layers**: Memory-efficient transformer blocks that save ~50% memory
- **Safety Telemetry**: Real-time K/C/S (Negentropy/Complexity/Symbiosis) monitoring
- **Diffusion Mode**: Bidirectional denoising with multiple noise schedules
- **Progressive Scaling**: Automatic model expansion based on validation performance
- **Distributed Training**: FSDP and pipeline parallelism for large-scale training
- **Interactive Dashboard**: Real-time training control and visualization
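
The round trip from text to bits and back underpins everything else. Here is a minimal sketch assuming 8 data bits plus one even-parity bit per byte; the actual layout in `bit_transformer/bit_io.py` may differ, so treat this as an illustration rather than the library's API:

```python
# Illustrative text -> bits round trip with one even-parity bit per byte.
# The real scheme in bit_transformer/bit_io.py may differ in detail.

def text_to_bits(text: str) -> list[int]:
    bits = []
    for byte in text.encode("utf-8"):
        data = [(byte >> i) & 1 for i in range(7, -1, -1)]  # MSB first
        parity = sum(data) % 2                              # even parity
        bits.extend(data + [parity])
    return bits

def bits_to_text(bits: list[int]) -> str:
    out = bytearray()
    for i in range(0, len(bits), 9):
        chunk = bits[i:i + 9]
        data, parity = chunk[:8], chunk[8]
        assert sum(data) % 2 == parity, "parity error"      # corruption check
        out.append(int("".join(map(str, data)), 2))
    return out.decode("utf-8")

assert bits_to_text(text_to_bits("hi")) == "hi"
```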

### Experimental Status

**Important**: BitTransformerLM is experimental research software requiring:

- Rigorous baseline comparisons against standard transformers
- Validation on established language modeling benchmarks
- Statistical significance testing across multiple runs
- Careful interpretation of safety metrics and claims

---

## Environment Setup

### Requirements

- **Python 3.10+** (required for modern PyTorch features)
- **PyTorch 2.7.1+** with appropriate CUDA support if using GPUs

### Installation Options

#### CPU-Only Installation

```bash
pip install --extra-index-url https://download.pytorch.org/whl/cpu -r requirements.txt
```

#### GPU Installation

```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118
pip install -r requirements.txt
```

#### Claude Code Assisted Setup

When using Claude Code, simply ask for:

- "Help me set up BitTransformerLM for my system"
- "Configure BitTransformerLM for GPU training"
- "Set up a development environment for bit-native language modeling"

Claude Code will guide you through hardware detection, dependency installation, and initial configuration.
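
Whichever route you take, a quick sanity check confirms that PyTorch resolved correctly before you start training (standard PyTorch calls, nothing BitTransformerLM-specific):

```python
# Verify the PyTorch install and CUDA visibility.
import torch

print("PyTorch:", torch.__version__)            # expect 2.7.1 or newer
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```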

---

## Repository Structure

```
BitTransformerLM/
├── bit_transformer/          # Core package
│   ├── model.py              # BitTransformerLM architecture
│   ├── telemetry.py          # K/C/S safety metrics
│   ├── safety.py             # Safety gates and monitoring
│   ├── bit_io.py             # Text ↔ bits conversion
│   ├── compression.py        # Run-length encoding
│   ├── training.py           # Training utilities
│   ├── distributed.py        # FSDP and pipeline parallelism
│   ├── dashboard_app.py      # Interactive web dashboard
│   ├── quantization.py       # INT8/4-bit quantization
│   └── [other modules...]    # Additional functionality
├── tests/                    # Test suite and results
├── example.py                # Basic usage example
├── unified_workflow.py       # Main training script
├── mcp_server.py             # Management Control Protocol server
├── USER_GUIDE.md             # Comprehensive user documentation
└── [other scripts...]        # Utilities and examples
```

---

## Development Workflow with Claude Code

### Getting Started

1. **Initial Setup**
   ```
   "Help me understand BitTransformerLM's architecture"
   "Create a simple training script for bit-native language modeling"
   "Explain the difference between causal and diffusion modes"
   ```

2. **Model Configuration**
   ```
   "Configure a BitTransformerLM for [my specific use case]"
   "What are optimal hyperparameters for a [size] model?"
   "Help me enable reversible layers and gradient checkpointing"
   ```

3. **Training and Monitoring**
   ```
   "Set up distributed training with FSDP"
   "Interpret these K/C/S telemetry values: K=0.3, C=0.6, S=0.4"
   "Debug this memory error during training"
   ```

### Claude Code Advantages

**Real-time Assistance**: Get immediate help with:

- Parameter configuration and tuning
- Error diagnosis and resolution
- Architecture modification and experimentation
- Safety metric interpretation
- Performance optimization

**Context-Aware Suggestions**: Claude Code understands:

- BitTransformerLM's unique bit-native processing
- How the safety metrics relate to one another
- Memory optimization strategies
- Distributed training complexities

---

## Key Commands and Workflows

### Basic Usage

```bash
# Run simple example
python example.py

# Launch interactive dashboard
python unified_workflow.py --dashboard

# Train with diffusion mode
python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32
```

### Advanced Training

```bash
# Distributed training with FSDP
python unified_workflow.py --distributed --batch-size 2 --epochs 10

# Mixed precision with quantization
python unified_workflow.py --amp --qat

# Progressive scaling with curriculum learning
python unified_workflow.py --progressive --diffusion-curriculum
```

### Dashboard and Monitoring

```bash
# Start MCP server and dashboard
python mcp_server.py &
MCP_SERVER_ADDR=http://127.0.0.1:7000 python -m bit_transformer.dashboard_app
```

**Dashboard Features:**

- Real-time telemetry visualization
- Interactive model configuration
- HuggingFace checkpoint management
- Safe inference testing interface

---

## Safety and Telemetry

### Core Metrics

| Metric | Full Name | Range | Interpretation |
|--------|-----------|-------|----------------|
| **K** | Negentropy | 0–1 | Information content (0 = noise, 1 = ordered) |
| **C** | LZ Complexity | 0–1 | Pattern complexity (higher = more complex) |
| **S** | Symbiosis | 0–1 | Alignment with reference (higher = better) |
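
For intuition about what these numbers measure, the sketch below computes rough analogues of K and C on a raw bit sequence. These are illustrative formulas only; the exact definitions used for gating live in `bit_transformer/telemetry.py` and may differ. S is defined against a reference, so it has no single-sequence analogue here.

```python
# Illustrative stand-ins for K and C; not the library's definitions.
import math
import zlib

def negentropy(bits: list[int]) -> float:
    """1 minus the Shannon entropy of the bit distribution (rough K)."""
    p = sum(bits) / len(bits)
    if p in (0.0, 1.0):
        return 1.0                    # fully ordered
    h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return 1.0 - h                    # 0 = fair-coin noise, 1 = constant

def lz_complexity(bits: list[int]) -> float:
    """Compression ratio as a cheap proxy for C (rough, not exact)."""
    raw = bytes(bits)
    return min(1.0, len(zlib.compress(raw)) / len(raw))

print(negentropy([0, 1] * 32), lz_complexity([0, 1] * 32))
```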

### Using with Claude Code

```
"Explain what K=0.2, C=0.8, S=0.3 means for my model"
"Configure safety gates for production use"
"My model is generating repetitive output, what safety metrics should I check?"
"Set up drift detection for telemetry monitoring"
```

Claude Code can help interpret these metrics in context and suggest appropriate safety thresholds.

### Safety Gate Configuration

```python
from bit_transformer.safety import SafetyGate

# Production-ready safety gate
gate = SafetyGate(
    c_floor=0.3,  # Minimum complexity
    s_floor=0.5,  # Minimum symbiosis
    decay=0.9,    # EMA decay factor
    burn_in=10    # Steps before gating starts
)
```
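
The `decay` and `burn_in` parameters are easiest to read as an exponential moving average over per-step metrics, with gating suppressed during the first few steps. The sketch below shows that logic in isolation; it is an illustration, not the actual `SafetyGate` implementation in `bit_transformer/safety.py`:

```python
# Illustrative EMA gating logic; SafetyGate's real behavior may differ.
class EmaGateSketch:
    def __init__(self, c_floor=0.3, s_floor=0.5, decay=0.9, burn_in=10):
        self.c_floor, self.s_floor = c_floor, s_floor
        self.decay, self.burn_in = decay, burn_in
        self.step = 0
        self.c_ema = self.s_ema = None

    def update(self, c: float, s: float) -> bool:
        """Return True if output should be blocked this step."""
        self.step += 1
        mix = lambda ema, x: x if ema is None else self.decay * ema + (1 - self.decay) * x
        self.c_ema = mix(self.c_ema, c)
        self.s_ema = mix(self.s_ema, s)
        if self.step <= self.burn_in:
            return False              # no gating during burn-in
        return self.c_ema < self.c_floor or self.s_ema < self.s_floor
```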

---

## Best Practices for Claude Code Development

### 1. **Always Validate Research Claims**

Ask Claude Code to help you:

- Set up proper baseline comparisons
- Design statistical significance tests
- Implement evaluation on standard benchmarks
- Document limitations and assumptions

### 2. **Use Progressive Development**

```
"Start me with a minimal BitTransformerLM example"
"Now add safety monitoring"
"Scale up to distributed training"
"Add diffusion mode capabilities"
```

### 3. **Leverage Claude Code for Architecture Understanding**

```
"Explain how reversible layers save memory"
"Walk me through the bit encoding process"
"How does the safety telemetry system work?"
"Compare BitTransformerLM to standard transformers"
```

### 4. **Get Help with Complex Configurations**

```python
# Ask Claude Code to help configure models like:
model = BitTransformerLM(
    d_model=1024,            # Claude Code can suggest optimal values
    nhead=16,                # based on your hardware and use case
    num_layers=20,
    dim_feedforward=4096,
    max_seq_len=2048,
    reversible=True,         # Memory optimization
    use_checkpoint=True,     # Gradient checkpointing
    chunk_size=256,          # Attention chunking
    lambda_K=0.1,            # Regularization weights
    lambda_C=0.1,
    lambda_S=0.1
)
```
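
When sizing a configuration like the one above, a back-of-the-envelope parameter count helps check hardware fit before training. The estimate below uses standard transformer formulas (attention and feed-forward weights only; layer norms and the small bit-level embedding are ignored):

```python
def approx_params(d_model: int, num_layers: int, dim_feedforward: int) -> int:
    """Rough transformer parameter count: attention + feed-forward blocks."""
    attn = 4 * d_model * d_model         # Q, K, V, and output projections
    ffn = 2 * d_model * dim_feedforward  # two feed-forward matrices
    return num_layers * (attn + ffn)

# The configuration above: d_model=1024, num_layers=20, dim_feedforward=4096
print(f"{approx_params(1024, 20, 4096) / 1e6:.0f}M parameters (approx.)")  # ~252M
```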

---

## Development Guidelines

### Code Style

- **Functions**: `snake_case` (e.g., `train_loop`, `safe_inference`)
- **Classes**: `CamelCase` (e.g., `BitTransformerLM`, `SafetyGate`)
- **Constants**: `UPPER_SNAKE_CASE` (e.g., `MAX_SEQ_LEN`)
- **Keep functions under 300 lines** and minimize deep nesting

### Security and Safety

- **Never reintroduce the deprecated `/exec` endpoint**
- **Always use safety gates in production**
- **Validate all user inputs** in dashboard and API endpoints
- **Monitor telemetry metrics** for anomalous behavior
- **Use the `cpu_autocast()` helper** instead of calling `torch.amp.autocast` directly

### Memory Management

```python
# Good: memory-efficient configuration
model = BitTransformerLM(
    reversible=True,          # Enable reversible layers
    use_checkpoint=True,      # Gradient checkpointing
    chunk_size=128,           # Chunked attention
    full_attn_logging=False   # Skip full attention reconstruction
)

# Training with memory optimizations
train_loop(
    model, data,
    amp=True,            # Mixed precision
    accum_steps=4,       # Gradient accumulation
    compile_model=True   # torch.compile optimization
)
```
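
The `accum_steps` option trades step frequency for memory: gradients from several micro-batches are summed before a single optimizer step, keeping the effective batch size constant. The sketch below shows the generic PyTorch pattern this corresponds to; `model`, `loader`, and `optimizer` are placeholders and the loss function is assumed for illustration (`train_loop` handles this internally):

```python
import torch.nn.functional as F

def accumulated_steps(model, loader, optimizer, accum_steps: int = 4) -> None:
    """Generic gradient-accumulation pattern that accum_steps corresponds to."""
    optimizer.zero_grad()
    for i, (bits, targets) in enumerate(loader):
        logits = model(bits)                                   # placeholder forward
        loss = F.cross_entropy(logits, targets) / accum_steps  # scale so the sum averages
        loss.backward()                                        # grads accumulate in .grad
        if (i + 1) % accum_steps == 0:
            optimizer.step()                                   # one update per accum_steps batches
            optimizer.zero_grad()
```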

### Testing and Validation

```bash
# Run tests after changes
pytest -q
```

```python
# Model evaluation modes
model.train()            # For training
model.eval()             # For inference/evaluation
set_dropout(model, 0.0)  # Disable dropout for reproducible results
```
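
For fully reproducible evaluation runs it also helps to pin every RNG source. This is generic PyTorch/NumPy hygiene rather than anything BitTransformerLM-specific:

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    """Pin all common RNG sources so evaluation runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only installs
```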

---

## Getting Help from Claude Code

### Specific Areas Where Claude Code Excels

1. **Architecture Design**
   - "Design a BitTransformerLM architecture for [specific task]"
   - "Optimize this model configuration for memory efficiency"
   - "Explain the trade-offs between reversible and standard layers"

2. **Training Optimization**
   - "My training is running out of memory, help optimize"
   - "Configure distributed training for 4 GPUs"
   - "Set up a training curriculum for bit-native language modeling"

3. **Safety and Monitoring**
   - "Interpret these telemetry readings and suggest adjustments"
   - "Set up production-ready safety monitoring"
   - "Debug why my safety gate is triggering constantly"

4. **Research and Evaluation**
   - "Design a rigorous evaluation comparing BitTransformerLM to GPT-2"
   - "Set up statistical significance testing for my experiments"
   - "Help me write up my research findings honestly"

### Sample Claude Code Interactions

```
User: "I'm getting OOM errors training a 500M parameter BitTransformerLM"

Claude Code: "Let me help optimize your memory usage. Here's a configuration
that should work better for your model size:

model = BitTransformerLM(
    d_model=768,
    nhead=12,
    num_layers=12,
    reversible=True,         # Critical for large models
    use_checkpoint=True,     # Trade compute for memory
    chunk_size=64,           # Reduce attention memory
    full_attn_logging=False  # Skip expensive logging
)

train_loop(
    model, data,
    batch_size=1,    # Small batch size
    accum_steps=16,  # Maintain effective batch size
    amp=True         # Mixed precision training
)

This should reduce memory usage by ~60% compared to the standard configuration."
```

---

## Licensing and Distribution

BitTransformerLM is available under dual licensing:

- **Open Source**: AGPLv3 for research and open-source use
- **Commercial**: Contact **[email protected]** for commercial licensing

When working with Claude Code, ensure compliance with the AGPLv3 license for any derivatives or modifications you create.

---

## Research Integrity

**Important Reminder**: BitTransformerLM is experimental research software. When using Claude Code:

1. **Always validate claims** through proper baseline comparisons
2. **Document limitations** honestly in any publications or reports
3. **Use statistical significance testing** for any performance claims (see the sketch below)
4. **Follow established ML research best practices**
5. **Share negative results** as well as positive ones

Claude Code can help you design rigorous experiments and avoid common pitfalls in ML research.
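
For point 3 above, a paired test across matched seeds is a reasonable starting point. Here is a minimal sketch using SciPy; the loss values are hypothetical placeholders for the per-seed validation results you would collect yourself:

```python
from scipy import stats

# Hypothetical per-seed validation losses (same 5 seeds for both models).
bitlm_losses    = [0.412, 0.398, 0.405, 0.421, 0.401]
baseline_losses = [0.397, 0.391, 0.402, 0.408, 0.395]

t, p = stats.ttest_rel(bitlm_losses, baseline_losses)
print(f"paired t = {t:.3f}, p = {p:.3f}")  # report p alongside effect size
```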

---

## Support and Community

### Getting Help

- **Claude Code**: Real-time AI assistance with BitTransformerLM
- **GitHub Issues**: Bug reports and feature requests
- **Discussions**: Community questions and sharing
- **User Guide**: Comprehensive documentation (`USER_GUIDE.md`)
- **Project Overview**: Complete project information (`ABOUTME.md`)

### Contributing

When contributing to BitTransformerLM:

1. Use Claude Code to ensure code quality and consistency
2. Follow the development guidelines in this document
3. Add tests for new functionality
4. Update documentation as needed
5. Ensure all safety and security practices are followed

---

**BitTransformerLM + Claude Code provides a powerful combination for exploring bit-native language modeling with AI assistance. Start experimenting responsibly and share your findings with the research community!** 🤖✨