# AGENTS Guidelines for BitTransformerLM
## Repository Scope and Purpose
- BitTransformerLM models raw binary streams using reversible transformer blocks and safety telemetry. The project is the canonical implementation under WCNegentropy.
- Core capabilities include bit-native modeling, telemetry metrics (negentropy, LZ complexity, symbiosis), progressive scaling, compression, context extension, diffusion mode (linear/cosine/exp noise schedules with parity correction), dashboard control, distributed training, and quantization.
- Phase 1 optimizations provide configurable batch sizing, gradient accumulation, mixed precision, memory-mapped dataset streaming, scheduled compression ramps, selective `torch.compile`, and an EMA-smoothed safety gate with burn-in.
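
Most of these optimizations follow standard PyTorch patterns. For reference only, a generic gradient-accumulation loop looks like the sketch below; it is not the repository's actual training code, and `ACCUM_STEPS`, the loss, and the batch format are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

ACCUM_STEPS = 4  # illustrative value; the real accumulation factor is configurable


def accumulate_and_step(model, optimizer, batches):
    """Generic gradient-accumulation pattern, not the repository's exact training loop."""
    optimizer.zero_grad()
    for i, (bits, targets) in enumerate(batches):
        logits = model(bits)
        # Scale the loss so the accumulated gradient matches one large batch.
        loss = F.cross_entropy(logits, targets) / ACCUM_STEPS
        loss.backward()
        if (i + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```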
## Environment Setup
- Requires Python 3.10+.
- Install dependencies:
  - CPU: `pip install --extra-index-url https://download.pytorch.org/whl/cpu -r requirements.txt`
  - Optional GPU: `pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118`
- The package name is `bit-transformer`; project metadata lives in `pyproject.toml`.
## Repository Layout
- `bit_transformer/` – core package (`model`, `compression`, `telemetry`, `safety`, `dashboard_app`, `quantization`, etc.).
- `tests/` – pytest suite and historical `TEST_RESULTS.md`.
- Scripts: `example.py`, `unified_workflow.py`, `full_bits_train.py`, `build_full_bits.py`, `mcp_server.py`, and the `wikitext_*` utilities. The legacy `progressive_scaleup.py` is retained for reference but superseded by `integration_schedule.py`.
- Docs and specs: `README.md`, `state_of_the_repo_audit.md`, and the licensing files in `LICENSE/`.
## Development Practices
- Follow snake_case for functions and CamelCase for classes.
- Keep functions under ~300 lines and minimize deeply nested control flow.
- Avoid reintroducing the deprecated dashboard `/exec` endpoint or other insecure code paths.
- Use the `/status` endpoint for model introspection; all routes return JSON and surface errors with stack traces.
- Ensure compression, decompression, and halting logic stay consistent with the current implementation.
- Use the `cpu_autocast()` helper for BF16 mixed precision on CPU instead of calling `torch.amp.autocast` directly (see the sketch after this list).
- Adaptive training now expands depth, width, or context only when validation loss plateaus and automatically decays the base learning rate by √2 after each expansion, with a 100-step warm-up.
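
A minimal sketch of the autocast rule, assuming `cpu_autocast` is importable from the top-level `bit_transformer` package (the exact import path may differ) and `model` is an already-constructed model:

```python
import torch
from bit_transformer import cpu_autocast  # assumed import path for the helper named above


def forward_bf16(model, bits: torch.Tensor) -> torch.Tensor:
    # Run the forward pass under the repo's CPU autocast helper rather than
    # calling torch.amp.autocast directly.
    with cpu_autocast():
        return model(bits)
```

For the adaptive-training rule, decaying by √2 means each expansion roughly corresponds to `new_lr = old_lr / math.sqrt(2)`, applied together with the 100-step warm-up.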
## Workflow & Commands
- Run the example: `python example.py`.
- Adaptive scaling now lives in `integration_schedule.py`; `progressive_scaleup.py` is deprecated.
- Unified workflow (optionally with dashboard or diffusion): `python unified_workflow.py --dashboard` or `python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32`.
- Increase `--diffusion-steps` for higher fidelity (8–16) and add `--diffusion-curriculum` to linearly decay noise over epochs.
- Disable checkpointing or reversible blocks when speed is prioritized over memory: `python unified_workflow.py --no-checkpoint --no-reversible`.
- Enable 4-bit quantization-aware training: `python unified_workflow.py --qat`.
- Skip full attention logging during chunked attention for memory savings by constructing the model with `full_attn_logging=False`.
- Start the MCP server with `python mcp_server.py` and launch the dashboard with `MCP_SERVER_ADDR=http://127.0.0.1:7000 python -m bit_transformer.dashboard_app`.
- The `/metrics` and `/model_config` endpoints expose telemetry streams and hyperparameters (see the sketch after this list).
- The `/save_checkpoint` and `/download_checkpoint` endpoints sync weights with Hugging Face (the token defaults to `HF_TOKEN`).
- Container build: `docker build -t bittransformerlm .`; run with exposed ports `5000` (dashboard) and `7000` (MCP).
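
A minimal sketch for polling these HTTP routes, assuming the dashboard at port 5000 serves them and that they accept plain GET requests; the `/status` route from Development Practices is included, and the actual route-to-server mapping may differ in your setup.

```python
import requests

DASHBOARD = "http://127.0.0.1:5000"  # assumed dashboard address from the ports note above

# Every route returns JSON; errors surface with stack traces in the payload.
status = requests.get(f"{DASHBOARD}/status").json()        # model introspection
metrics = requests.get(f"{DASHBOARD}/metrics").json()       # telemetry streams (K, C, S, ...)
config = requests.get(f"{DASHBOARD}/model_config").json()   # current hyperparameters
print(status, metrics, config)
```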
## Telemetry Metrics
| Metric | Meaning | Range |
|---|---|---|
| K | Negentropy – deviation from random noise | 0–1 (1 = ordered) |
| C | LZ Complexity – compressibility proxy | 0–1 (higher = more changes) |
| S | Symbiosis – agreement with reference distribution | 0–1 (1 = aligned) |
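
For intuition only, the sketch below gives illustrative definitions that match the ranges in the table; these are not the repository's exact formulas (the real implementations presumably live in the `telemetry` module).

```python
import torch


def negentropy_k(bits: torch.Tensor) -> float:
    """Illustrative K: 1 minus the Shannon entropy of the bit distribution (1 = fully ordered)."""
    p = bits.float().mean().clamp(1e-6, 1 - 1e-6)
    entropy = -(p * torch.log2(p) + (1 - p) * torch.log2(1 - p))
    return float(1.0 - entropy)


def change_rate_c(bits: torch.Tensor) -> float:
    """Illustrative C: fraction of adjacent bit positions that flip (higher = more changes)."""
    return float((bits[1:] != bits[:-1]).float().mean())


def symbiosis_s(p_model: torch.Tensor, p_ref: torch.Tensor) -> float:
    """Illustrative S: 1 minus total variation distance to the reference distribution (1 = aligned)."""
    return float(1.0 - 0.5 * (p_model - p_ref).abs().sum())
```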
ACT halting exports `halt_probs` in telemetry, showing how many layers executed. For robust sampling under safety constraints, call `safe_sample_with_retry(model, bits)`, which retries with diffusion mode and exponential backoff.
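
A minimal usage sketch, assuming `safe_sample_with_retry` is importable from the top-level package (the actual import path may differ) and `model` is a trained model already in scope:

```python
import torch
from bit_transformer import safe_sample_with_retry  # assumed import path for the helper above

prompt_bits = torch.randint(0, 2, (1, 256))  # illustrative prompt shape: (batch, bit sequence)
# Retries with diffusion mode and exponential backoff if the safety gate rejects a sample.
sample = safe_sample_with_retry(model, prompt_bits)
```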
`TelemetrySynthesizer.cluster_sequences` can be used to select representative training samples before invoking `collapse_submodel`. The distillation helper deepens the model and widens once (`width_scale = 1.5`) if floors are missed, and `save_distilled_model` emits a `metrics.json` summary beside the weights.
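
The call shapes below are hypothetical (import locations, argument names, ordering, and the output path are assumptions, not documented signatures); they only illustrate the order of operations described above, with `train_sequences` and a trained `model` assumed to be in scope.

```python
from bit_transformer import (  # assumed import locations
    TelemetrySynthesizer,
    collapse_submodel,
    save_distilled_model,
)

synth = TelemetrySynthesizer()                       # hypothetical no-argument construction
clusters = synth.cluster_sequences(train_sequences)  # pick representative samples for distillation
student = collapse_submodel(model, clusters)         # hypothetical (teacher, samples) ordering
save_distilled_model(student, "distilled/")          # writes weights plus a metrics.json summary
```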
## Testing
- Run unit tests after any change: `pytest -q`.
- Use `watcher.py` for auto-reload and re-testing during local development if desired.
- During training, call `model.train()` and keep dropout probabilities around `0.1–0.2`.
- Before running tests, inference, or pushing weights, switch to `model.eval()` and set all dropout probabilities to `0` to avoid flaky results (see the sketch after this list).
- The dashboard warns if telemetry metrics drift by more than 0.2 over the last 10 steps; adjust via `ModelManager(drift_window, drift_threshold)` as needed.
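
A minimal sketch of the eval-mode rule, assuming the model is a standard `torch.nn.Module` whose dropout layers are `nn.Dropout` instances:

```python
import torch.nn as nn


def prepare_for_eval(model: nn.Module) -> None:
    """Switch to eval mode and zero every dropout probability to avoid flaky results."""
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = 0.0
```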
## Licensing
- The project is governed by the documents in `LICENSE/` (AGPLv3, commercial terms, disclaimers, etc.). Ensure compliance before contributing or distributing.
These guidelines keep the repository consistent with the project roadmap and previous audits. Maintain security, style, and testing discipline to keep BitTransformerLM production-ready.