BitTransformerLM

Project Status: Experimental Research Implementation
Codebase Maturity: 57 Python files, 10,699 lines of research code
Current Stage: Pre-release requiring validation and baseline comparisons

BitTransformerLM is an experimental bit-native transformer language model with built-in safety telemetry, exploring a novel approach to language modeling at the bit level. This research implementation includes distributed training capabilities, real-time monitoring, automated scaling, and comprehensive safety mechanisms. The architecture demonstrates potential for memory-efficient processing through reversible layers and fine-grained control via bit-level operations.

Historical Background

  • Early Experiments – Initial prototypes explored mapping text to parity-protected bits and training a minimal transformer on random data.
  • Telemetry & Safety – Added negentropy, LZ complexity and symbiosis scoring to measure information flow and gate unsafe outputs.
  • Progressive Scaling – Introduced reversible layers and automatic depth/width expansion for efficient curriculum training. The schedule now triggers expansions only when validation loss plateaus and decays the learning rate by √2 after each growth with a 100-step warm‑up.
  • Compression Support – Integrated run-length encoding and packed bit I/O with optional multi-task training on compressed sequences.
  • Context Extension – Implemented chunked attention and sliding-window inference for long sequences with optional overlapping windows.
  • Attention Logging Toggle – Setting full_attn_logging=False skips reconstructing full T×T attention maps during chunked attention, cutting memory use for very long sequences.
  • Diffusion LM Mode – Enable bidirectional denoising by setting causal=False or toggling Diffusion LM in the dashboard. Chunked attention is automatically disabled in this mode and restored afterward.
  • Dashboard & MCP Server – Built a lightweight web UI backed by a management server for real‑time training, inference and model collapse. New /metrics and /model_config endpoints surface live telemetry and hyperparameters, and /save_checkpoint and /download_checkpoint enable Hugging Face weight sync. The insecure /exec route has been removed.
  • Phase 1 Optimizations – Configurable batch sizes with aligned OneCycle scheduling, gradient accumulation, mixed‑precision, memory‑mapped dataset streaming, scheduled compression ramps, selective torch.compile, and an EMA‑smoothed safety gate with burn‑in to cut false positives.

The codebase includes comprehensive testing and experimental validation. It is a complete research implementation, but production deployment remains contingent on rigorous evaluation against standard baselines.

🧪 Experimental Feature Matrix

Core Architecture Innovations

  • Bit-Native Processing: Direct 0/1 computation without token intermediates
  • Reversible Layers: 50%+ memory reduction through mathematically reversible blocks
  • Safety-First Design: Built-in K/C/S (Negentropy/Complexity/Symbiosis) telemetry
  • Progressive Scaling: Dynamic architecture expansion based on performance metrics
  • Diffusion Mode: Bidirectional denoising for advanced generation capabilities

Distributed Training Framework

  • Multi-GPU FSDP: Fully Sharded Data Parallel implementation (tested up to 771M parameters)
  • Pipeline Parallelism: Distributed training infrastructure
  • Mixed Precision: FP16/BF16 optimization with CPU autocast support
  • Gradient Checkpointing: Memory-efficient training for large models
  • Dynamic Quantization: Runtime INT8 conversion + experimental 4-bit QAT

Experimental Safety & Monitoring

  • Real-Time Telemetry: Live K/C/S metric tracking with drift detection
  • Safety Gates: EMA-smoothed thresholds with configurable burn-in
  • Metric Synthesis: Clustering-based activation analysis
  • Collapse Detection: Automated model collapse prevention and recovery
  • Human-in-Loop: Safe inference with retry mechanisms

Research Tools

  • Interactive Dashboard: Real-time training control and visualization
  • MCP Server: Management Control Protocol for research workflows
  • HuggingFace Integration: Model weight sharing and checkpoint management
  • Enhanced Checkpointing: Multi-run management with cloud backup
  • CLI Standardization: Unified command-line interface across tools

Development Infrastructure

  • Comprehensive Testing: 11 test modules with automated CI validation
  • Type Safety: Full type annotations with custom type system
  • Error Recovery: Robust error handling with automatic retry logic
  • Memory Management: Intelligent caching with automatic cleanup
  • Documentation: Research-grade docstrings and API reference

Performance Optimizations

  • Torch.Compile: Selective compilation for performance-critical paths
  • Chunked Attention: Memory-efficient processing of long sequences
  • Compression Pipeline: Lossless bit compression with performance ramps
  • Context Extension: Sliding window inference for arbitrary lengths
  • ACT Integration: Adaptive Computation Time for dynamic depth

Research Status: BitTransformerLM provides a complete experimental framework for bit-native language modeling research, requiring baseline comparisons and rigorous evaluation for production use.

Quick Start

Install dependencies using the CPU wheel of PyTorch (default):

pip install --extra-index-url https://download.pytorch.org/whl/cpu -r requirements.txt

When GPU acceleration is toggled in the dashboard, the application automatically installs the CUDA-enabled wheel:

pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118

Run the example script:

python example.py

Adaptive scaling demo: The legacy progressive_scaleup.py script is retained for reference but has been superseded by integration_schedule.py, which offers a more flexible scaling workflow.

Run the unified workflow:

python unified_workflow.py --dashboard
# disable gradient checkpointing for faster but memory-hungry runs
python unified_workflow.py --no-checkpoint
# use standard (non-reversible) transformer blocks
python unified_workflow.py --no-reversible
# enable 4-bit quantization-aware training
python unified_workflow.py --qat

For faster CPU execution, BitTransformerLM exposes a cpu_autocast() helper that enables bfloat16 mixed precision. Models created with use_autocast=True apply this automatically, or you can wrap individual forward passes:

from bit_transformer.torch_utils import cpu_autocast

with cpu_autocast():
    logits, telemetry = model(bits)

Reduce memory use when chunked attention is active by disabling full attention logging:

model = BitTransformerLM(chunk_size=128, full_attn_logging=False)

Enable Diffusion LM training and sampling:

python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32
# choose noise schedule: linear, cosine, exp
python unified_workflow.py --diffusion --noise-schedule cosine --diffusion-steps 16 --dataset-size 32
# linearly decay noise over epochs
python unified_workflow.py --diffusion --diffusion-curriculum --dataset-size 32

Higher --diffusion-steps (8–16) improves sample quality at the cost of compute. When using the dashboard, enable the Diffusion LM toggle to run the model without causal masking or chunked attention. Generated samples automatically fix parity bits so they can be decoded back to text.

To resume training across machines using Hugging Face storage:

python unified_workflow.py --hf-repo your-username/bittransformerlm --hf-token $HF_TOKEN

The dashboard exposes matching controls under Hugging Face Checkpoints. Provide a repository ID and optional token (falling back to the HF_TOKEN environment variable) and click Upload weights or Download weights to sync the model.

Run the unit tests:

pytest -q

Mode management

During training, ensure the model is in training mode with dropout enabled:

from bit_transformer.utils import set_dropout

model.train()
set_dropout(model, 0.1)

Before running tests, performing inference, or committing weights to the repository, switch the model to evaluation mode and disable dropout:

model.eval()
set_dropout(model, 0.0)

This prevents CI failures from accidentally pushing weights that still have active dropout.

Telemetry Metrics Explained

BitTransformerLM reports three bounded metrics in [0, 1] during training and inference:

  • Negentropy (K) – departure from random noise; 1 denotes perfectly ordered bits while 0 is uniform randomness.
  • LZ Complexity (C) – differentiable proxy for Lempel–Ziv compressibility; low values indicate repetitive patterns, while high values indicate frequent transitions.
  • Symbiosis (S) – agreement between model predictions and a reference distribution via KL divergence; scores near 1 show strong alignment.
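For intuition, the negentropy score can be pictured as one minus the normalized binary entropy of the bit stream. The following standalone sketch illustrates that idea; it is not necessarily the exact formula used inside bit_transformer, which computes a differentiable variant:

import torch

def negentropy_score(bits: torch.Tensor) -> float:
    """Illustrative bounded K: 1 minus the binary Shannon entropy of the bit stream."""
    p = bits.float().mean().clamp(1e-6, 1 - 1e-6)  # empirical probability of a 1
    entropy = -(p * torch.log2(p) + (1 - p) * torch.log2(1 - p))  # in [0, 1] for binary symbols
    return float(1.0 - entropy)  # 1 = perfectly ordered, 0 = uniform randomness

print(negentropy_score(torch.randint(0, 2, (1024,))))  # random bits: close to 0
print(negentropy_score(torch.zeros(1024)))             # constant bits: close to 1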

An Adaptive Computation Time (ACT) mechanism lets layers halt early once confidence exceeds a threshold. Halt probabilities are exported as halt_probs in telemetry for inspection.
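For example, after a forward pass the halting probabilities can be read directly from the returned telemetry (assuming the telemetry object behaves like a dictionary, as the other examples in this README suggest):

logits, telemetry = model(bits)
print(telemetry["halt_probs"])  # per-layer ACT halting probabilities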

These metrics are logged alongside losses and can trigger safety gates when thresholds are violated. The dashboard monitors drift and emits warnings when recent values deviate beyond a configurable threshold.
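As a rough illustration of how an EMA-smoothed gate with burn-in suppresses spurious triggers, consider this standalone sketch; the decay and burn-in values are illustrative, not the library's defaults:

class EMAGate:
    """Fire only when the smoothed metric stays below its floor after a burn-in period."""

    def __init__(self, floor: float, decay: float = 0.9, burn_in: int = 100) -> None:
        self.floor, self.decay, self.burn_in = floor, decay, burn_in
        self.ema = None   # running EMA of the metric
        self.steps = 0    # number of observations seen so far

    def update(self, value: float) -> bool:
        self.ema = value if self.ema is None else self.decay * self.ema + (1 - self.decay) * value
        self.steps += 1
        return self.steps > self.burn_in and self.ema < self.floor  # True = gate fires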

Core Features

  • Bit-Native Modeling – Works directly on 0/1 inputs with positional encodings and parity-protected text helpers (a standalone parity sketch follows this list).
  • Telemetry Synthesizer – Clusters activation summaries to surface coherent subspaces and detect drift.
  • Submodel Distillation – TelemetrySynthesizer selects representative sequences for collapse_submodel, which deepens and widens once (width_scale = 1.5) if telemetry floors aren't met; save_distilled_model places a metrics.json summary beside the distilled weights.
  • Safety Gate – hil_safe_inference enforces minimum complexity and symbiosis scores at runtime with EMA smoothing and a configurable burn‑in period.
  • Quantization – CPU inference can be quantized to int8 or trained with 4-bit QAT using the --qat flag.
  • Distributed Training – FSDP and pipeline helpers allow multi‑GPU scaling when hardware is available.
  • Interactive Dashboard – Live control of training, scaling and compression with optional GPU acceleration. The dashboard now exposes reversible layers, gradient checkpointing, ACT thresholds, λ floors, 4‑bit QAT and Diffusion LM toggles, real‑time telemetry charts powered by Chart.js, and Hugging Face checkpoint upload/download controls with HF_TOKEN fallback. Settings persist via localStorage.
  • CI/CD Pipeline – GitHub Actions install dependencies, run the tests and build distribution artifacts on every push.
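To make the parity-protected bit encoding concrete, here is a small standalone illustration that encodes each UTF-8 byte as eight data bits followed by an even-parity check bit. It is deliberately independent of the library; the actual bit_transformer helpers may use a different framing or bit order:

def byte_to_parity_bits(byte: int) -> list[int]:
    """Encode one byte as 8 data bits (MSB first) plus an even-parity check bit."""
    bits = [(byte >> i) & 1 for i in range(7, -1, -1)]
    bits.append(sum(bits) % 2)  # parity bit makes the count of 1s even
    return bits

def demo_text_to_bits(text: str) -> list[int]:
    """Flatten UTF-8 bytes into a parity-protected bit stream."""
    return [bit for b in text.encode("utf-8") for bit in byte_to_parity_bits(b)]

print(demo_text_to_bits("Hi"))  # two 9-bit groups: data bits plus parity

A decoder can strip every ninth bit, verify parity, and reassemble the original bytes, which is how a generated bit stream can be mapped back to text.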

Development Workflow

  1. Start the MCP server:
    python mcp_server.py
    
  2. Launch the dashboard in another terminal:
    MCP_SERVER_ADDR=http://127.0.0.1:7000 python -m bit_transformer.dashboard_app
    
  3. Submit training batches, scale the model and monitor telemetry from the web UI. The dashboard's appearance is controlled by bit_transformer/static/style.css.

A watcher.py script can automatically restart the server and run tests when files change during local development.

Container Deployment

A Dockerfile and start.sh script build a minimal VM image that launches both the MCP server and dashboard.

docker build -t bittransformerlm .
docker run -p 5000:5000 -p 7000:7000 bittransformerlm

By default the container installs the CPU-only PyTorch wheel. Set the build argument TORCH_CUDA=cu118 to preinstall the GPU version. The container sets MCP_SERVER_ADDR=http://127.0.0.1:7000 and exposes the dashboard on port 5000.
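For example, assuming the Dockerfile consumes the TORCH_CUDA build argument as described above, a GPU-enabled image could be built with:

docker build --build-arg TORCH_CUDA=cu118 -t bittransformerlm .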

Research Development Roadmap

COMPLETED - Experimental Implementation

  • Architecture: Bit-native transformer with reversible layers ✅
  • Safety Systems: K/C/S telemetry with real-time monitoring ✅
  • Distributed Training: FSDP implementation (tested up to 771M parameters) ✅
  • Research Tools: Dashboard, MCP server, HF integration ✅
  • Testing & Validation: Comprehensive test suite with CI ✅
  • Documentation: Research-grade API documentation ✅
  • Performance: Memory optimization, quantization, compression ✅

🎯 VALIDATION TARGETS

  • Baseline Comparisons: Rigorous evaluation against standard transformers
  • Statistical Analysis: Multiple runs with proper significance testing
  • Long-Duration Training: Training convergence studies on real datasets
  • Scaling Studies: Systematic evaluation of model sizes and architectures

🚀 FUTURE RESEARCH DIRECTIONS

  • Scale Validation: Multi-billion parameter experiments with proper baselines
  • Hardware Optimization: Custom CUDA kernels and neuromorphic support
  • Application Studies: Real-world deployment case studies with evaluation
  • Academic Validation: Peer review and publication processes

Current Status: Complete experimental framework requiring rigorous validation against established baselines before production deployment.

Licensing

BitTransformerLM is available under a dual licensing scheme:

  • Open Source License: AGPLv3 (see LICENSE/LICENSE.txt)
  • Commercial License: Available by contacting [email protected]

Additional licensing documents in the LICENSE/ directory:

  • COMMERCIAL_LICENSE.txt: Information about commercial licensing options
  • DISCLAIMER.txt: Important legal disclaimers and limitations
  • TRADEMARK_POLICY.txt: Guidelines for using project trademarks
  • CONTRIBUTOR_LICENSE_AGREEMENT.txt: Terms for contributors

For commercial use cases that require different licensing terms than AGPLv3, please contact [email protected] to discuss commercial licensing options.