BitTransformerLM Model Card
Model Details
- Model Type: Experimental Bit-Native Transformer Language Model
- Architecture: Transformer with reversible layers and bit-level processing
- Developer: WCNegentropy Research
- Release Date: August 2025
- Version: Pre-release Experimental
- License: AGPLv3 (see LICENSE/ directory)
Model Description
BitTransformerLM is an experimental language model that processes text at the bit level rather than using traditional token-based approaches. The architecture explores potential memory efficiency improvements through reversible transformer layers and provides built-in safety monitoring through real-time telemetry.
Architecture Details
- Input Processing: Direct binary sequence processing (0/1 bits; see the encoding sketch after this list)
- Attention Mechanism: Multi-head self-attention on bit embeddings
- Layer Design: Reversible transformer blocks for memory efficiency
- Safety Features: Built-in K/C/S (Negentropy/Complexity/Symbiosis) telemetry
- Training Modes: Causal autoregressive and experimental diffusion mode
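The repository's public input API is not documented in this card, so the following is only a minimal sketch of what bit-native input preparation could look like. The helper name `text_to_bits` and the tensor layout are assumptions for illustration, not the project's confirmed interface.

```python
import torch

def text_to_bits(text: str) -> torch.Tensor:
    """Illustrative only: encode UTF-8 bytes as a flat 0/1 tensor (MSB first)."""
    data = text.encode("utf-8")
    bits = [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]
    return torch.tensor(bits, dtype=torch.long)

# A short prompt becomes a binary sequence the model attends over directly.
sequence = text_to_bits("hi")      # shape: (16,) for two bytes
batch = sequence.unsqueeze(0)      # shape: (1, 16) -- batch of one
print(batch)
```

In this framing the model's vocabulary is just {0, 1}, which is what allows the architecture to skip a learned tokenizer entirely.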
Training Data and Methodology
Experimental Configurations Tested
Small-scale CPU Training (793K parameters)
- Dataset: 4 samples at a sequence length of 16
- Training time: 0.21 seconds
- Convergence: Achieved on toy data
Large-scale GPU Training (771M parameters)
- Dataset: 5 text samples with zero-padding
- Hardware: Single GPU (despite multi-GPU claims in some documentation)
- Training time: 11.47 seconds
- Architecture: d_model=1792, 20 layers, 28 attention heads
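As a rough sanity check on the 771M figure, the standard estimate of about 12·d_model² parameters per transformer block (attention projections plus a 4× feed-forward) matches the reported configuration. The calculation below is an approximation that ignores embeddings, the output head, and layer norms.

```python
d_model, n_layers = 1792, 20

# Per block: ~4*d^2 for Q/K/V/output projections + ~8*d^2 for a 4x MLP.
params_per_block = 12 * d_model ** 2
total = n_layers * params_per_block

print(f"{total:,} parameters (~{total / 1e6:.0f}M)")  # 770,703,360 (~771M)
```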
Limitations Identified
- Limited Training Data: Experiments used minimal datasets insufficient for language modeling evaluation
- No Baseline Comparisons: Missing comparative evaluation against standard transformers
- Scale Claims: Some documentation overstated parameter counts and GPU usage
- Training Duration: Short training periods insufficient for convergence assessment
Performance and Evaluation
Empirical Results (from test data)
Small Model (793K parameters):
- Final Loss: 0.629
- Best Loss: 0.571
- Success Rate: 100% on a single test prompt
- Telemetry: Empty (minimal data)
Large Model (771M parameters):
- Training Loss Progression: 11.84 → 18.65 → 17.15 → 8.15 → 5.35
- Peak Memory Usage: 15.28 GB
- Inference Success: 100% on 5 test prompts
- Telemetry Metrics: K ≈ 0.0013, C ≈ 0.52, S ≈ 0.46
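The exact K/C/S definitions live in the repository. As an illustration only, the sketch below computes two stand-in quantities over a bit sequence: a negentropy-style score (1 minus normalized Shannon entropy) and a compression-ratio proxy for complexity. These formulas are assumptions for exposition, not the model's actual telemetry code.

```python
import math
import zlib

def negentropy(bits: list[int]) -> float:
    """1 - normalized Shannon entropy of the 0/1 distribution (illustrative)."""
    p1 = sum(bits) / len(bits)
    if p1 in (0.0, 1.0):
        return 1.0
    entropy = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))
    return 1.0 - entropy

def complexity_proxy(bits: list[int]) -> float:
    """Compression ratio as a crude stand-in for an LZ-style complexity score."""
    raw = bytes(bits)
    return len(zlib.compress(raw)) / len(raw)

sample = [0, 1] * 64  # highly regular, perfectly balanced sequence
print(negentropy(sample), complexity_proxy(sample))
```

Under this stand-in, a balanced near-random bit stream yields a negentropy close to zero, which is consistent in spirit with the small K value reported above.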
Known Issues and Limitations
- Experimental Status: This is research code requiring rigorous validation
- Training Data: Evaluated only on toy datasets, not real language modeling tasks
- Baseline Gaps: No systematic comparison to established transformer architectures
- Scale Verification: Largest validated model is 771M parameters, not 1B+ as claimed elsewhere
- Convergence: Training times too short to establish genuine convergence behavior
Intended Use and Applications
Research Applications ✅
- Bit-level language modeling research
- Memory-efficient transformer architecture studies
- Safety telemetry and monitoring system development
- Experimental diffusion-based text generation
Production Applications ⚠️
- Not Recommended: Requires extensive validation and baseline comparisons
- Missing: Proper evaluation on standard datasets and benchmarks
- Needs: Long-duration training studies and statistical significance testing
Ethical Considerations and Risks
Potential Benefits
- Enhanced interpretability through bit-level processing
- Built-in safety monitoring and gating mechanisms
- Memory-efficient architecture exploration
- Open research contributing to AI safety
Potential Risks
- Overstated Capabilities: Early documentation contained inflated claims
- Incomplete Evaluation: Missing critical baseline comparisons
- Research Maturity: Experimental status requires careful interpretation of results
Recommendations
- Use for research and experimentation only
- Conduct rigorous baseline comparisons before any production use
- Validate claims through independent evaluation
- Follow established ML research best practices
Technical Specifications
Model Architecture
- Bit Embedding Size: Configurable (16-1792 tested)
- Attention Heads: Configurable (2-28 tested)
- Layers: Configurable (1-20 tested)
- Max Sequence Length: Configurable (16-512 tested)
- Reversible Layers: Optional memory-efficient computation
- Quantization: Experimental 4-bit QAT support
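The configuration knobs above map naturally onto a plain settings object. The sketch below shows one hypothetical way to express a small configuration within the tested ranges; the field names and the `BitTransformerLM` constructor are assumptions for illustration, not the repository's verified API.

```python
from dataclasses import dataclass

@dataclass
class BitTransformerConfig:
    d_model: int = 128           # bit embedding size (16-1792 tested)
    n_heads: int = 4             # attention heads (2-28 tested)
    n_layers: int = 4            # transformer blocks (1-20 tested)
    max_seq_len: int = 256       # maximum bit-sequence length (16-512 tested)
    reversible: bool = True      # memory-efficient reversible blocks
    quantize_4bit: bool = False  # experimental 4-bit QAT

config = BitTransformerConfig()
# model = BitTransformerLM(config)  # hypothetical constructor; see the repository
```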
System Requirements
- Minimum: Python 3.10+, PyTorch 2.7.1, 8GB RAM
- Recommended: 16GB+ RAM, CUDA-capable GPU for larger models
- Dependencies: See requirements.txt for complete specification
Training Features
- FSDP distributed training support
- Mixed precision (FP16/BF16) training (a minimal loop sketch follows this list)
- Progressive scaling and curriculum learning
- Real-time telemetry and safety monitoring
- Interactive dashboard for training control
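To make the mixed-precision item concrete, here is a minimal FP16 training loop using only stock PyTorch AMP APIs; it stands in for whatever wrapper the repository actually provides. The tiny stand-in model and random bit batches are placeholders, and a CUDA device is assumed.

```python
import torch

# Stand-in model and data; the real BitTransformerLM and dataloader live in the repository.
model = torch.nn.Sequential(torch.nn.Embedding(2, 64), torch.nn.Linear(64, 2)).cuda()
bit_batches = [torch.randint(0, 2, (8, 128)) for _ in range(10)]

scaler = torch.cuda.amp.GradScaler()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

for bits in bit_batches:                             # (B, T) tensors of 0/1 values
    bits = bits.cuda()
    inputs, targets = bits[:, :-1], bits[:, 1:]      # causal next-bit prediction
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(inputs)                       # (B, T-1, 2) logits over {0, 1}
        loss = criterion(logits.reshape(-1, 2), targets.reshape(-1))
    scaler.scale(loss).backward()                    # FP16 gradients need loss scaling
    scaler.step(optimizer)
    scaler.update()
```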
Citation
If you use BitTransformerLM in your research, please cite:
@software{bittransformerlm2025,
  title={BitTransformerLM: Experimental Bit-Native Transformer Language Model},
  author={WCNegentropy Research},
  year={2025},
  url={https://github.com/WCNegentropy/BitTransformerLM},
  note={Experimental research implementation}
}
Additional Resources
- Repository: https://github.com/WCNegentropy/BitTransformerLM
- Documentation: README.md, AGENTS.md
- License: AGPLv3 with additional terms (see LICENSE/ directory)
- Issues: GitHub Issues for bug reports and feature requests
Disclaimer: This is experimental research code. Claims in some historical documentation may be overstated. Users should conduct independent evaluation and validation before any production use. The model requires rigorous baseline comparisons and statistical validation to establish its capabilities relative to standard approaches.