Increasing the BitTransformerLM context window

Current limitations and mechanisms

The default max_seq_len in BitTransformerLM is 1,024 bits. Since text is encoded using parity bits (9 bits per byte), this translates to roughly 113 bytes (≈113 characters) of input. The model uses full self-attention, so memory grows quadratically with sequence length. To train on very long sequences, train_full_sequence slides a fixed-size context window along a long bit tensor and periodically detaches the computation graph. Compression can shorten sequences via run-length encoding, and chunked attention can divide long inputs into overlapping windows for the attention computation. The length of the positional encoding, however, still defines a hard upper bound on the usable context.
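
The arithmetic above follows from the 9-bits-per-byte encoding. The sketch below illustrates it with hypothetical helper names (not the repository's actual bit-I/O functions), assuming one even-parity bit appended to each byte's 8 data bits; the real helpers may order or place the bits differently.

```python
import torch

def byte_to_parity_bits(byte: int) -> list[int]:
    """Encode one byte as 8 data bits plus 1 even-parity bit (9 bits total)."""
    bits = [(byte >> i) & 1 for i in range(7, -1, -1)]  # data bits, MSB first
    bits.append(sum(bits) % 2)                          # even-parity bit (assumed layout)
    return bits

def text_to_bit_tensor(text: str) -> torch.Tensor:
    """Hypothetical helper: UTF-8 text -> flat bit tensor at 9 bits per byte."""
    bits = [b for byte in text.encode("utf-8") for b in byte_to_parity_bits(byte)]
    return torch.tensor(bits, dtype=torch.long)

max_seq_len = 1024
print(max_seq_len // 9)                   # 113 -> bytes that fit in the default context
print(text_to_bit_tensor("hello").shape)  # torch.Size([45]) -> 5 bytes * 9 bits
```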

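To make the run-length compression mechanism concrete, here is a toy sketch of the idea (not the repository's implementation): runs of identical bits are collapsed into (value, count) pairs, which shortens highly repetitive bit streams before they reach the attention window.

```python
import torch

def rle_compress(bits: torch.Tensor) -> list[tuple[int, int]]:
    """Toy run-length encoder: collapse runs of identical bits into (value, count) pairs."""
    runs, prev, count = [], int(bits[0]), 1
    for b in bits[1:].tolist():
        if b == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = b, 1
    runs.append((prev, count))
    return runs

def rle_decompress(runs: list[tuple[int, int]]) -> torch.Tensor:
    """Inverse of rle_compress."""
    return torch.tensor([v for value, count in runs for v in [value] * count])

bits = torch.tensor([0] * 50 + [1] * 3 + [0] * 47)   # 100 bits with long runs
runs = rle_compress(bits)
assert torch.equal(rle_decompress(runs), bits)
print(len(bits), "bits ->", len(runs), "runs")       # 100 bits -> 3 runs
```
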
Strategies to reach a 2,000-word context (≈18,000 bits)

Increase max_seq_len and the positional encoding. The positional encoding precomputes a [max_len, d_model] matrix. Raising max_len to accommodate ~18,000 bits (≈2,000 words × 9 bits per word) is possible but memory-intensive. At d_model=128 the positional encoding holds roughly 18,000 × 128 ≈ 2.3 M floats (≈9 MB at 32-bit precision), which is reasonable for a CPU VM. Codex can change the default max_seq_len and update any dependent tests.

Use chunked attention and overlapping windows. LoggingTransformerEncoderLayer already supports chunk_size and overlap parameters. Setting chunk_size (e.g., 2,048 bits) with an overlap of, say, 128 bits lets the model handle sequences far longer than a single attention window while still allowing information to flow across chunks. Codex can expose chunk_size and overlap through the dashboard and CLI so users can tune them for longer contexts. Codex prompt example: “Modify the dashboard /init endpoint to accept chunk_size and overlap fields and pass them to BitTransformerLM. Update the HTML template to include input fields for these parameters.”

Apply sliding-window training and inference. The train_full_sequence method trains on long bit tensors by sliding a context window and detaching the graph every ctx_bits bits. For inference, a similar sliding approach could produce outputs for long sequences. Codex can add an infer_long_sequence method that divides a long bit sequence into overlapping windows, runs the model with causal=True to preserve ordering, and stitches the outputs together (see the sketch after this list). Prompt example: “Implement def infer_long_sequence(model: BitTransformerLM, bits: torch.Tensor, ctx_bits: int = 4096, overlap: int = 256): that processes a long bit tensor in sliding windows with overlap, uses causal=True, and returns the concatenated output bits.”

Exploit run-length compression more aggressively. Since binary data often contains runs of identical bits (e.g., long sequences of zeros), a higher compression ratio reduces the effective sequence length. Codex could add further compression schemes (e.g., bit-packing into bytes using numpy.packbits) and integrate them into the model's I/O pipeline; care must be taken to preserve the parity bits used for error detection. Prompt example: “Add functions pack_bits and unpack_bits that use numpy.packbits to pack 8 bits into a byte. Modify train_loop so that when direct_prob>0 the model is trained on packed bits with a suitable embedding.”

Memory-efficient attention alternatives. For even larger contexts, full attention could be replaced with sparse, local, or linear attention mechanisms. However, that would change the core architecture, which this task seeks to avoid; using the existing chunked attention together with reversible layers is therefore preferred.

Dynamic quantization and mixed precision. Larger contexts increase activation memory. Enabling use_autocast=True to compute in bfloat16 and applying quantize_dynamic after training reduce memory usage. Codex can create scripts that measure memory usage and automatically toggle these features when large contexts are requested.
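
The following is a hedged sketch of the sliding-window inference described above. It assumes the model can be called as model(window, causal=True) and returns per-position bit logits of shape [1, T, 2]; BitTransformerLM's actual forward signature and telemetry return values may differ, so treat this as a starting point rather than the repository's API.

```python
import torch

def infer_long_sequence(
    model,                  # assumption: nn.Module, model(bits[1, T], causal=True) -> logits[1, T, 2]
    bits: torch.Tensor,     # 1-D tensor of 0/1 values, arbitrary length
    ctx_bits: int = 4096,
    overlap: int = 256,
) -> torch.Tensor:
    """Run a long bit sequence through the model in overlapping windows.

    Only the non-overlapping tail of each window is kept, so every emitted
    position (after the first window) has at least `overlap` bits of left context.
    """
    assert 0 <= overlap < ctx_bits
    step = ctx_bits - overlap
    outputs = []
    model.eval()
    with torch.no_grad():
        for start in range(0, len(bits), step):
            window = bits[start : start + ctx_bits].unsqueeze(0)  # [1, <=ctx_bits]
            logits = model(window, causal=True)                   # assumed output shape [1, T, 2]
            preds = logits.argmax(dim=-1).squeeze(0)              # predicted bits for the window
            keep_from = 0 if start == 0 else overlap              # drop bits already emitted
            outputs.append(preds[keep_from:])
            if start + ctx_bits >= len(bits):
                break
    return torch.cat(outputs)[: len(bits)]
```
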
Proposed Codex tasks to implement context extension

Expose context parameters in the API/UI. Extend the dashboard and MCP server to allow clients to specify max_seq_len, chunk_size, overlap, and ctx_bits when initializing a model or running long inference. Prompt example: “Add optional parameters max_seq_len, chunk_size and overlap to the /init endpoint and pass them into BitTransformerLM and ModelManager. Update the HTML template to include these fields.”

Implement sliding-window inference. Add a function infer_long_sequence as described above and expose it via the dashboard and MCP server. Prompt example: “Add a new endpoint /infer_long to mcp_server.py that accepts a list of bits and processes them using a sliding window with overlap. The endpoint should return the predicted bits and telemetry summaries for each window.”

Allow dynamic context scaling. Add a method to BitTransformerLM that adjusts its pos_enc buffer when the requested context exceeds the current max_seq_len. This can be done by creating a new positional-encoding tensor of the new length and copying the existing values into it (see the sketch after this list). Prompt example: “Implement BitTransformerLM.expand_positional_encoding(new_len: int) that creates a new positional encoding buffer of size new_len and copies the existing encoding. Update the model’s max_seq_len accordingly.”

Integrate aggressive compression. Implement alternative compression schemes (e.g., bit-packing or general-purpose compressors) and add toggles for them in training and inference. Evaluate compression ratio and latency to decide when each should be used.

Benchmark and tune hyperparameters. Write scripts to benchmark model memory use and throughput for various max_seq_len, chunk_size, reversible, use_act, and quantization settings. These benchmarks can inform safe defaults for the VM build.
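
For the dynamic context-scaling task, a minimal sketch of expand_positional_encoding is shown below. It assumes the model stores a sinusoidal encoding in a buffer named pos_enc of shape [max_len, d_model] (as described above) and exposes a max_seq_len attribute; the actual buffer name, shape (e.g., an extra batch dimension), and encoding scheme in BitTransformerLM should be verified before adapting this.

```python
import math
import torch

def sinusoidal_encoding(length: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding of shape [length, d_model] (assumes even d_model)."""
    position = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(length, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

def expand_positional_encoding(model, new_len: int) -> None:
    """Sketch: grow the model's positional-encoding buffer to new_len positions.

    Assumes model.pos_enc is a buffer of shape [max_len, d_model]; existing rows
    are copied verbatim so encodings for earlier positions are unchanged.
    """
    old_pe = model.pos_enc                    # assumption: [max_len, d_model] buffer
    old_len, d_model = old_pe.shape
    if new_len <= old_len:
        return
    new_pe = sinusoidal_encoding(new_len, d_model).to(old_pe.device, old_pe.dtype)
    new_pe[:old_len] = old_pe                 # preserve the original rows
    model.register_buffer("pos_enc", new_pe)  # replace the old buffer
    model.max_seq_len = new_len
```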