WCNegentropy committed on
Commit 7cf71dd · verified · 1 Parent(s): 0f9d62d

Add Project overview and quick start guide

Files changed (1):
  1. ABOUTME.md +238 -102
ABOUTME.md CHANGED
@@ -1,110 +1,246 @@
- Here’s a menu of additional, “pure-PyTorch” extensions that can further close the gap to a production-grade LLM:
-
- 1. Native Low-Rank & MoE Layers (DO LAST)
-
- Why: Expert mixtures and low-rank adapters let you balloon the effective parameter count without a proportional increase in compute.
- • Mixture-of-Experts: Implement a tiny gating network (one or two linear layers) that routes each token’s representation to one of E experts (each a small FFN). Only that expert runs on that position, so compute per token stays constant while total capacity grows by E×.
- • PyTorch sketch:
-
-     import torch
-     import torch.nn as nn
-     import torch.nn.functional as F
-
-     class MoE(nn.Module):
-         def __init__(self, d_model, d_ff, n_experts=4):
-             super().__init__()
-             self.gate = nn.Linear(d_model, n_experts)
-             self.experts = nn.ModuleList(
-                 [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
-                  for _ in range(n_experts)]
-             )
-
-         def forward(self, x):
-             # x: [T, B, D]
-             logits = self.gate(x)                  # [T, B, E]
-             w = F.softmax(logits, dim=-1)          # [T, B, E]
-             # Dense (soft) mixture: every expert runs and outputs are weighted.
-             # A production router would instead dispatch each token to its top-1 expert.
-             y = torch.stack([expert(x) for expert in self.experts], dim=-1)  # [T, B, D, E]
-             out = (y * w.unsqueeze(2)).sum(-1)     # weighted sum over experts -> [T, B, D]
-             return out
-
- • Trade-off: You’ll need a load-balancing loss term (e.g. to encourage the gate to spread load) and telemetry on expert usage, but the code stays pure PyTorch.
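-
- One simple form of such a load-balancing term (an illustrative sketch, not code from this repository) penalizes routing mass that concentrates on a few experts:
-
-     def load_balance_loss(w):
-         # w: softmax gate weights [T, B, E]
-         E = w.shape[-1]
-         importance = w.mean(dim=(0, 1))        # average routing weight per expert, sums to 1
-         # E * sum(importance^2) is minimized (value 1) at the uniform 1/E distribution.
-         return E * (importance * importance).sum() - 1.0
-
- Adding a small multiple of this term to the task loss discourages the gate from collapsing onto a single expert.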
 
 
 
-
- 2. [x] Adaptive Computation Time (ACT)
-
- Why: Let the model learn to spend more depth on “hard” bits and skip layers on easier ones.
- • Implementation: Add a tiny halting unit after each layer—e.g. a single linear + sigmoid per token that predicts whether to halt. Accumulate the halting probability across layers and stop processing a token once it crosses a threshold.
- • Benefit: On average you do fewer layer passes per token, reducing compute without touching PyTorch internals.
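-
- A minimal sketch of such a per-token halting unit (illustrative only; names like `halt_threshold` are assumptions, not this project's API):
-
-     import torch
-     import torch.nn as nn
-
-     class HaltingUnit(nn.Module):
-         def __init__(self, d_model):
-             super().__init__()
-             self.proj = nn.Linear(d_model, 1)
-
-         def forward(self, x, cum_halt, halt_threshold=0.99):
-             # x: [T, B, D]; cum_halt: [T, B] running probability that each token has halted
-             p = torch.sigmoid(self.proj(x)).squeeze(-1)   # [T, B]
-             cum_halt = cum_halt + (1 - cum_halt) * p      # accumulate halting mass
-             still_running = cum_halt < halt_threshold     # tokens that keep processing
-             return cum_halt, still_running
-
- Tokens whose mask goes False simply skip the remaining layers, e.g. by gating their residual updates.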
-
- 3. [x] Advanced PyTorch-Native Quantization
-
- Why: Move beyond static 4-bit packing to full QAT / dynamic quantization.
- • FX-graph QAT: Use torch.quantization.prepare_qat_fx on your SparseQuantTransformerLayer with a custom 4-bit observer (we sketched one earlier). Then convert_fx to int8 or 4-bit weights—no external libs needed.
- • Dynamic quant for inference: Wrap your model in torch.quantization.quantize_dynamic(...), quantizing only Linear modules to int8 on the fly. This gives a big speed/memory win at inference time on CPU.
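-
- For example, dynamic int8 quantization of the Linear layers is a one-liner (a sketch; `model` stands in for whatever nn.Module you trained):
-
-     import torch
-     import torch.nn as nn
-
-     quantized = torch.quantization.quantize_dynamic(
-         model, {nn.Linear}, dtype=torch.qint8
-     )
-     # `quantized` dequantizes its Linear weights on the fly during CPU inference.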
-
- 4. [x] Chunked & Overlapping Attention
-
- Why: Emulate sparse attention in pure PyTorch with no for-loops.
- • How: Break your sequence into fixed-size chunks (e.g. 512 bits) and attend within each chunk plus a small overlap window into its neighbors.
- • Pure PyTorch: Use unfold + batched torch.matmul to compute all chunked attention in parallel:
-
-     # x: [B, L, D], chunk_size = C, overlap = O
-     pads = (O, O)
-     x_padded = F.pad(x, (0, 0) + pads)           # pad the sequence dim
-     chunks = x_padded.unfold(1, C + 2 * O, C)    # [B, n_chunks, D, C+2O]
-     chunks = chunks.transpose(2, 3)              # [B, n_chunks, C+2O, D]
-     # then project Q, K, V per chunk and do the matmuls batchwise
-
- • Benefit: You get an O(L·(C+2O)) algorithm without Python loops, all in tensor ops.
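-
- Continuing the sketch (illustrative only; assumes a single head, `F` is torch.nn.functional, and `Wq`, `Wk`, `Wv` are [D, D] projection matrices):
-
-     B, n_chunks, W, D = chunks.shape              # W = C + 2*O
-     q = chunks @ Wq                               # [B, n_chunks, W, D]
-     k = chunks @ Wk
-     v = chunks @ Wv
-     scores = q @ k.transpose(-2, -1) / D ** 0.5   # [B, n_chunks, W, W]
-     attn = F.softmax(scores, dim=-1)
-     out = attn @ v                                # [B, n_chunks, W, D]
-     # keep only the central C positions of each window and re-stitch the sequence
-     out = out[:, :, O:O + C, :].reshape(B, -1, D)[:, :x.shape[1], :]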
-
- 5. Functorch-Based Vectorization & vmap
-
- Why: Fuse your per-head or per-expert loops automatically.
- • Use functorch.vmap (exposed as torch.vmap in recent PyTorch) to turn your per-head attention code (the one inside the `for t in range(T)` loop) into a single batched kernel.
- • Benefit: Cleaner code, fewer Python loops, and TorchInductor can fuse it just as well as hand-written loops.
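-
- A small standalone illustration of the idea (not this project's attention code), batching a single-head routine over the head dimension with torch.vmap:
-
-     import torch
-
-     def one_head(q, k, v):
-         # q, k, v: [T, d_head]
-         scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
-         return torch.softmax(scores, dim=-1) @ v
-
-     q = k = v = torch.randn(8, 128, 64)           # [n_heads, T, d_head]
-     out = torch.vmap(one_head)(q, k, v)           # [8, 128, 64], no Python loop over heads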
-
- 6. [x] Fully-Sharded Data Parallel & Pipeline Parallel (PyTorch-Native)
-
- Why: Scale out to multiple GPUs without external frameworks.
- • FSDP: Wrap your model in torch.distributed.fsdp.FullyShardedDataParallel to shard both parameters and optimizer state across GPUs.
- • Pipe: Use torch.distributed.pipeline.sync.Pipe to split your 40+ layer model across GPUs as pipeline stages.
- • Benefit: Zero external deps—pure PyTorch DDP/FSDP/Pipe—so you can train 100M+ parameter models.
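-
- Minimal FSDP wrapping, as a sketch (assumes one process per GPU launched via torchrun, and `MyModel` is a placeholder for your module):
-
-     import torch.distributed as dist
-     from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
-
-     dist.init_process_group("nccl")     # one process per GPU, e.g. started by torchrun
-     model = MyModel().cuda()
-     model = FSDP(model)                 # parameters and optimizer state are sharded across ranks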
-
- 7. [x] Mixed Precision & Autocast on CPU (bfloat16)
-
- Why: PyTorch now supports `torch.amp.autocast('cpu')` for bfloat16 on some architectures.
- • Wrap your forward pass in a `with torch.amp.autocast('cpu'):` block to cut memory and speed up linear/attention kernels, even on CPU.
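-
- For example (a sketch; `model` and `bits` are placeholders):
-
-     import torch
-
-     with torch.amp.autocast('cpu', dtype=torch.bfloat16):
-         out = model(bits)    # linear/attention kernels run in bfloat16 where supported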
-
- 8. [x] Optimized Learning-Rate Schedules & Optimizers
-
- Why: Achieve GPT-level convergence behavior…
- • Implement OneCycleLR or CosineAnnealingWarmRestarts directly via torch.optim.lr_scheduler.
- • Swap to AdamW with decoupled weight decay (torch.optim.AdamW) and dynamic gradient clipping (torch.nn.utils.clip_grad_norm_).
- • All of these live in core PyTorch.
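-
- A typical pure-PyTorch setup (illustrative values; `model`, `loss`, and `steps_per_epoch` are placeholders):
-
-     import torch
-
-     opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
-     sched = torch.optim.lr_scheduler.OneCycleLR(
-         opt, max_lr=3e-4, steps_per_epoch=steps_per_epoch, epochs=10
-     )
-     loss.backward()
-     torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
-     opt.step(); sched.step(); opt.zero_grad()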
-
- Putting It All Together
- 1. MoE + ACT lets you scale capacity (E× experts) while controlling average compute.
- 2. FX QAT + dynamic quant gives you 4-bit integer inference with no external libs.
- 3. Chunked attention + vmap replaces loops with giant fused tensor ops.
- 4. FSDP + Pipe moves you onto multi-GPU purely in torch.distributed.
- 5. Autocast (bfloat16) on CPU/GPU gives mixed-precision speed.
-
- By layering these techniques, you can:
- • Reach hundreds of millions (even billions) of effective parameters
- • Maintain single-library purity (just PyTorch)
- • Hit LLM-class throughput (hundreds of tokens/sec on GPU, tens on CPU)
- • Keep full NRB telemetry available for safety checks
 
+ # BitTransformerLM
+
+ **Project Status:** Experimental Research Implementation
+ **Codebase Maturity:** 57 Python files, 10,699 lines of research code
+ **Current Stage:** Pre-release requiring validation and baseline comparisons
+
+ BitTransformerLM is an experimental **bit-native transformer language model** with built-in safety telemetry, exploring a novel approach to language modeling at the bit level. This research implementation includes distributed training capabilities, real-time monitoring, automated scaling, and comprehensive safety mechanisms. The architecture demonstrates potential for memory-efficient processing through reversible layers and fine-grained control via bit-level operations.
+
+ ## Historical Background
+ - **Early Experiments** – Initial prototypes explored mapping text to parity-protected bits and training a minimal transformer on random data.
+ - **Telemetry & Safety** – Added negentropy, LZ complexity and symbiosis scoring to measure information flow and gate unsafe outputs.
+ - **Progressive Scaling** – Introduced reversible layers and automatic depth/width expansion for efficient curriculum training. The schedule now triggers expansions only when validation loss plateaus and decays the learning rate by √2 after each growth with a 100-step warm‑up.
+ - **Compression Support** – Integrated run-length encoding and packed bit I/O with optional multi-task training on compressed sequences.
+ - **Context Extension** – Implemented chunked attention and sliding-window inference for long sequences with optional overlapping windows.
+ - **Attention Logging Toggle** – ``full_attn_logging=False`` skips reconstructing full ``T×T`` attention maps during chunked attention, cutting memory use for very long sequences.
+ - **Diffusion LM Mode** – Enable bidirectional denoising by setting ``causal=False`` or toggling **Diffusion LM** in the dashboard. Chunked attention is automatically disabled in this mode and restored afterward.
+ - **Dashboard & MCP Server** – Built a lightweight web UI backed by a management server for real‑time training, inference and model collapse. New `/metrics` and `/model_config` endpoints surface live telemetry and hyperparameters, and `/save_checkpoint` and `/download_checkpoint` enable Hugging Face weight sync. The insecure `/exec` route has been removed.
+ - **Phase 1 Optimizations** – Configurable batch sizes with aligned OneCycle scheduling, gradient accumulation, mixed‑precision, memory‑mapped dataset streaming, scheduled compression ramps, selective ``torch.compile``, and an EMA‑smoothed safety gate with burn‑in to cut false positives.
+
+ The codebase includes comprehensive testing and experimental validation, representing a complete research implementation with potential for production deployment pending rigorous evaluation against standard baselines.
+
+ ## 🧪 Experimental Feature Matrix
+
+ ### Core Architecture Innovations
+ - ✅ **Bit-Native Processing**: Direct 0/1 computation without token intermediates
+ - ✅ **Reversible Layers**: 50%+ memory reduction through mathematically reversible blocks
+ - ✅ **Safety-First Design**: Built-in K/C/S (Negentropy/Complexity/Symbiosis) telemetry
+ - ✅ **Progressive Scaling**: Dynamic architecture expansion based on performance metrics
+ - ✅ **Diffusion Mode**: Bidirectional denoising for advanced generation capabilities
+
+ ### Distributed Training Framework
+ - ✅ **Multi-GPU FSDP**: Fully Sharded Data Parallel implementation (tested up to 771M parameters)
+ - ✅ **Pipeline Parallelism**: Distributed training infrastructure
+ - ✅ **Mixed Precision**: FP16/BF16 optimization with CPU autocast support
+ - ✅ **Gradient Checkpointing**: Memory-efficient training for large models
+ - ✅ **Dynamic Quantization**: Runtime INT8 conversion + experimental 4-bit QAT
+
+ ### Experimental Safety & Monitoring
+ - ✅ **Real-Time Telemetry**: Live K/C/S metric tracking with drift detection
+ - ✅ **Safety Gates**: EMA-smoothed thresholds with configurable burn-in
+ - ✅ **Metric Synthesis**: Clustering-based activation analysis
+ - ✅ **Collapse Detection**: Automated model collapse prevention and recovery
+ - ✅ **Human-in-Loop**: Safe inference with retry mechanisms
+
+ ### Research Tools
+ - ✅ **Interactive Dashboard**: Real-time training control and visualization
+ - ✅ **MCP Server**: Management Control Protocol for research workflows
+ - ✅ **HuggingFace Integration**: Model weight sharing and checkpoint management
+ - ✅ **Enhanced Checkpointing**: Multi-run management with cloud backup
+ - ✅ **CLI Standardization**: Unified command-line interface across tools
+
+ ### Development Infrastructure
+ - ✅ **Comprehensive Testing**: 11 test modules with automated CI validation
+ - ✅ **Type Safety**: Full type annotations with custom type system
+ - ✅ **Error Recovery**: Robust error handling with automatic retry logic
+ - ✅ **Memory Management**: Intelligent caching with automatic cleanup
+ - ✅ **Documentation**: Research-grade docstrings and API reference
+
+ ### Performance Optimizations
+ - ✅ **Torch.Compile**: Selective compilation for performance-critical paths
+ - ✅ **Chunked Attention**: Memory-efficient processing of long sequences
+ - ✅ **Compression Pipeline**: Lossless bit compression with performance ramps
+ - ✅ **Context Extension**: Sliding window inference for arbitrary lengths
+ - ✅ **ACT Integration**: Adaptive Computation Time for dynamic depth
+
+ **Research Status**: BitTransformerLM provides a complete experimental framework for bit-native language modeling research, requiring baseline comparisons and rigorous evaluation for production use.
+
+ ## Quick Start
+ Install dependencies using the CPU wheel of PyTorch (default):
+ ```bash
+ pip install --extra-index-url https://download.pytorch.org/whl/cpu -r requirements.txt
+ ```
+ When GPU acceleration is toggled in the dashboard, the application automatically
+ installs the CUDA-enabled wheel:
+ ```bash
+ pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118
+ ```
+ Run the example script:
+ ```bash
+ python example.py
+ ```
+ Adaptive scaling demo: the legacy `progressive_scaleup.py` script is retained for
+ reference but has been superseded by `integration_schedule.py`, which offers a more
+ flexible scaling workflow.
+
+ Run the unified workflow:
+ ```bash
+ python unified_workflow.py --dashboard
+ # disable gradient checkpointing for faster but memory-hungry runs
+ python unified_workflow.py --no-checkpoint
+ # use standard (non-reversible) transformer blocks
+ python unified_workflow.py --no-reversible
+ # enable 4-bit quantization-aware training
+ python unified_workflow.py --qat
+ ```
+
+ For faster CPU execution, BitTransformerLM exposes a `cpu_autocast()` helper
+ that enables bfloat16 mixed precision. Models created with
+ `use_autocast=True` apply this automatically, or you can wrap individual
+ forward passes:
+
+ ```python
+ from bit_transformer.torch_utils import cpu_autocast
+
+ with cpu_autocast():
+     logits, telemetry = model(bits)
+ ```
+
+ Reduce memory use when chunked attention is active by disabling full
+ attention logging:
+
+ ```python
+ model = BitTransformerLM(chunk_size=128, full_attn_logging=False)
+ ```
+
+ Enable Diffusion LM training and sampling:
+ ```bash
+ python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32
+ # choose noise schedule: linear, cosine, exp
+ python unified_workflow.py --diffusion --noise-schedule cosine --diffusion-steps 16 --dataset-size 32
+ # linearly decay noise over epochs
+ python unified_workflow.py --diffusion --diffusion-curriculum --dataset-size 32
+ ```
+ Higher `--diffusion-steps` values (8–16) improve sample quality at the cost of compute. When using the dashboard, enable the **Diffusion LM** toggle to run the model without causal masking or chunked attention.
+ Generated samples automatically fix parity bits so they can be decoded back to text.
+
+ To resume training across machines using Hugging Face storage:
+ ```bash
+ python unified_workflow.py --hf-repo your-username/bittransformerlm --hf-token $HF_TOKEN
+ ```
+ The dashboard exposes matching controls under **Hugging Face Checkpoints**. Provide a repository ID and optional token (falling back to the `HF_TOKEN` environment variable) and click **Upload weights** or **Download weights** to sync the model.
+
+ Run the unit tests:
+ ```bash
+ pytest -q
+ ```
+
+ ### Mode management
+
+ During training, ensure the model is in training mode with dropout enabled:
+
+ ```python
+ from bit_transformer.utils import set_dropout
+
+ model.train()
+ set_dropout(model, 0.1)
+ ```
+
+ Before running tests, performing inference, or committing weights to the repository, switch the model to evaluation mode and disable dropout:
+
+ ```python
+ model.eval()
+ set_dropout(model, 0.0)
+ ```
+
+ This prevents CI failures from accidentally pushing weights that still have active dropout.
+
+ ## Telemetry Metrics Explained
+ BitTransformerLM reports three bounded metrics in ``[0, 1]`` during training and inference:
+
+ - **Negentropy (K)** – departure from random noise; ``1`` denotes perfectly ordered bits while ``0`` is uniform randomness.
+ - **LZ Complexity (C)** – differentiable proxy for Lempel–Ziv compressibility; low values imply repetitive patterns and high values frequent transitions.
+ - **Symbiosis (S)** – agreement between model predictions and a reference distribution via KL divergence; scores near ``1`` show strong alignment.
+
+ An Adaptive Computation Time (ACT) mechanism lets layers halt early once confidence exceeds a threshold. Halt probabilities are exported as ``halt_probs`` in telemetry for inspection.
+
+ These metrics are logged alongside losses and can trigger safety gates when thresholds are violated. The dashboard monitors drift and emits warnings when recent values deviate beyond a configurable threshold.
+
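+ As a rough illustration of how a bounded score of this kind can be derived from bit statistics (a simplified stand-in, not the exact formulas used in `bit_transformer`), a negentropy-style metric can be computed from the empirical bit distribution:
+
+ ```python
+ import torch
+
+ def negentropy_proxy(bits: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
+     """Illustrative K-style score: 1 minus the normalized Shannon entropy of the bits."""
+     p1 = bits.float().mean()                      # empirical probability of a 1-bit
+     p = torch.stack([1 - p1, p1]).clamp_min(eps)  # Bernoulli distribution over {0, 1}
+     entropy = -(p * p.log2()).sum()               # at most 1 bit for a binary source
+     return 1.0 - entropy                          # 1 = perfectly ordered, 0 = uniform noise
+ ```
+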
+ ## Core Features
+ - **Bit-Native Modeling** – Works directly on 0/1 inputs with positional encodings and parity-protected text helpers.
+ - **Telemetry Synthesizer** – Clusters activation summaries to surface coherent subspaces and detect drift.
+ - **Submodel Distillation** – `TelemetrySynthesizer` selects representative sequences for `collapse_submodel`, which deepens
+   and widens once (`width_scale` = 1.5) if telemetry floors aren't met; `save_distilled_model` places a `metrics.json` summary
+   beside the distilled weights.
+ - **Safety Gate** – `hil_safe_inference` enforces minimum complexity and symbiosis scores at runtime with EMA smoothing and a configurable burn‑in period.
+ - **Quantization** – CPU inference can be quantized to int8 or trained with 4-bit QAT using the `--qat` flag.
+ - **Distributed Training** – FSDP and pipeline helpers allow multi‑GPU scaling when hardware is available.
+ - **Interactive Dashboard** – Live control of training, scaling and compression with optional GPU acceleration. The dashboard now exposes reversible layers, gradient checkpointing, ACT thresholds, λ floors, 4‑bit QAT and Diffusion LM toggles, real‑time telemetry charts powered by Chart.js, and Hugging Face checkpoint upload/download controls with `HF_TOKEN` fallback. Settings persist via `localStorage`.
+ - **CI/CD Pipeline** – GitHub Actions install dependencies, run the tests and build distribution artifacts on every push.
+
+ ## Development Workflow
+ 1. Start the MCP server:
+ ```bash
+ python mcp_server.py
+ ```
+ 2. Launch the dashboard in another terminal:
+ ```bash
+ MCP_SERVER_ADDR=http://127.0.0.1:7000 python -m bit_transformer.dashboard_app
+ ```
+ 3. Submit training batches, scale the model and monitor telemetry from the web UI.
+    The dashboard's appearance is controlled by `bit_transformer/static/style.css`.
+
+ A `watcher.py` script can automatically restart the server and run tests when files change during local development.
+
+ ## Container Deployment
+ A `Dockerfile` and `start.sh` script build a minimal container image that launches both the MCP server and dashboard.
+
+ ```bash
+ docker build -t bittransformerlm .
+ docker run -p 5000:5000 -p 7000:7000 bittransformerlm
+ ```
+
+ By default the container installs the CPU-only PyTorch wheel. Set the build
+ argument `TORCH_CUDA=cu118` to preinstall the GPU version (see the example below). The container sets
+ `MCP_SERVER_ADDR=http://127.0.0.1:7000` and exposes the dashboard on port 5000.
+
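+ For example, a GPU-enabled build uses the standard Docker build-argument syntax (assuming the image's `TORCH_CUDA` argument selects the wheel as described above):
+ ```bash
+ docker build --build-arg TORCH_CUDA=cu118 -t bittransformerlm .
+ ```
+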
+ ## Research Development Roadmap
+
+ ### ✅ **COMPLETED - Experimental Implementation**
+ - **Architecture**: Bit-native transformer with reversible layers ✅
+ - **Safety Systems**: K/C/S telemetry with real-time monitoring ✅
+ - **Distributed Training**: FSDP implementation (tested up to 771M parameters) ✅
+ - **Research Tools**: Dashboard, MCP server, HF integration ✅
+ - **Testing & Validation**: Comprehensive test suite with CI ✅
+ - **Documentation**: Research-grade API documentation ✅
+ - **Performance**: Memory optimization, quantization, compression ✅
+
+ ### 🎯 **VALIDATION TARGETS**
+ - **Baseline Comparisons**: Rigorous evaluation against standard transformers
+ - **Statistical Analysis**: Multiple runs with proper significance testing
+ - **Long-Duration Training**: Training convergence studies on real datasets
+ - **Scaling Studies**: Systematic evaluation of model sizes and architectures
+
+ ### 🚀 **FUTURE RESEARCH DIRECTIONS**
+ - **Scale Validation**: Multi-billion parameter experiments with proper baselines
+ - **Hardware Optimization**: Custom CUDA kernels and neuromorphic support
+ - **Application Studies**: Real-world deployment case studies with evaluation
+ - **Academic Validation**: Peer review and publication processes
+
+ **Current Status**: Complete experimental framework requiring rigorous validation against established baselines before production deployment.
 
+ ## Licensing
+
+ BitTransformerLM is available under a dual licensing scheme:
+
+ * **Open Source License:** AGPLv3 (see `LICENSE/LICENSE.txt`)
+ * **Commercial License:** Available by contacting **[email protected]**
+
+ Additional licensing documents in the `LICENSE/` directory:
+
+ * `COMMERCIAL_LICENSE.txt`: Information about commercial licensing options
+ * `DISCLAIMER.txt`: Important legal disclaimers and limitations
+ * `TRADEMARK_POLICY.txt`: Guidelines for using project trademarks
+ * `CONTRIBUTOR_LICENSE_AGREEMENT.txt`: Terms for contributors
+
+ For commercial use cases that require different licensing terms than AGPLv3, please contact **[email protected]** to discuss commercial licensing options.