IncunabuLM
Model Description
IncunabuLM is a decoder-only transformer language model designed for text generation tasks. The model implements a custom architecture with RMSNorm normalization and modern attention mechanisms, optimized for resource-efficient training and inference.
Model Details
Model Type
- Architecture: Decoder-only Transformer
- Language(s): Primarily trained on Polish text (lectures)
- Model size: 111.7M parameters
Source and Base Model
- Base Architecture: Custom implementation inspired by modern transformer architectures
- Training Approach: Trained from scratch on Polish text corpus
- Educational Source: Implementation follows principles from Andrej Karpathy's "Let's build GPT: from scratch, in code, spelled out" tutorial
- Tutorial Reference: https://www.youtube.com/watch?v=kCc8FmEb1nY
- Custom Modifications (a minimal sketch of the first two follows this list):
  - RMSNorm instead of LayerNorm for improved stability
  - SiLU activation functions in feed-forward networks
  - Optimized for resource-efficient training and inference
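The two modified layers can be expressed compactly in PyTorch. The snippet below is a minimal sketch of the standard RMSNorm and SiLU feed-forward formulations, not the model's actual source; the class names, the 4x hidden expansion, and the reuse of the 0.2 dropout value from the training configuration are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization: rescales by the RMS of the
    activations with a learned gain, without mean subtraction or bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class FeedForward(nn.Module):
    """Position-wise feed-forward block with SiLU (Swish) activation."""
    def __init__(self, dim: int = 768, hidden: int = 3072, dropout: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```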
Architecture Details
- Layers: 12 transformer blocks
- Hidden size: 768
- Attention heads: 12
- Head dimension: 64
- Context length: 2048 tokens
- Vocabulary size: 16,384 tokens
- Normalization: RMSNorm (Root Mean Square Layer Normalization)
- Activation: SiLU (Swish) in feed-forward networks
- Attention: Causal self-attention with triangular masking
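As a reference for how these dimensions fit together, here is a minimal sketch of causal self-attention with a triangular mask, sized to the figures above (768 hidden units, 12 heads of dimension 64, 2048-token context). It is illustrative only and does not reproduce the model's actual implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention with a lower-triangular mask."""
    def __init__(self, dim: int = 768, n_heads: int = 12,
                 max_len: int = 2048, dropout: float = 0.2):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = dim // n_heads          # 768 / 12 = 64
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.dropout = nn.Dropout(dropout)
        # Lower-triangular mask: position t may attend only to positions <= t.
        mask = torch.tril(torch.ones(max_len, max_len)).bool()
        self.register_buffer("mask", mask.view(1, 1, max_len, max_len))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=-1)
        # Reshape to (B, n_heads, T, head_dim)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        att = att.masked_fill(~self.mask[:, :, :T, :T], float("-inf"))
        att = self.dropout(F.softmax(att, dim=-1))
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)
```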
Key Features
- RMSNorm: Simpler and cheaper than LayerNorm (no mean subtraction or bias term), with comparable or better training stability
- SiLU Activation: Smooth activation with non-zero gradients for negative inputs, avoiding ReLU's dead-neuron problem
- BPE Tokenization: Byte-level BPE with 16K vocabulary
- Mixed Precision: Support for bfloat16/float16 training
- Generation Controls: Temperature, top-k sampling, repetition penalty
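The generation controls listed above combine naturally in a single sampling loop. The helper below is a hypothetical sketch, not the model's bundled generation method: it assumes the model returns logits of shape (batch, time, vocab), works on a batch of one, and applies a simplified repetition penalty that divides the logits of previously generated tokens.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens=100, temperature=0.8,
             top_k=50, repetition_penalty=1.2, context_len=2048):
    """Autoregressive sampling with temperature, top-k filtering and a
    simple repetition penalty (illustrative helper, assumes batch size 1)."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_len:]        # crop to the context window
        logits = model(idx_cond)[:, -1, :]      # logits for the last position
        # Simplified repetition penalty: dampen tokens already generated.
        # (Standard implementations treat negative logits differently.)
        for token in set(idx[0].tolist()):
            logits[0, token] /= repetition_penalty
        logits = logits / temperature
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = float("-inf")
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1)
    return idx
```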
Training Details
Training Data
- Dataset: Custom Polish text corpus
- Preprocessing: Byte-level BPE tokenization
- Split: 90% training, 10% validation
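One possible preprocessing pipeline, using the Hugging Face tokenizers library (the card does not specify the actual tooling): train a byte-level BPE tokenizer with the 16,384-token vocabulary, encode the corpus, and take a 90/10 train/validation split. The file and directory names are placeholders.

```python
import os
import numpy as np
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer with the 16,384-token vocabulary.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["corpus_pl.txt"], vocab_size=16_384, min_frequency=2)
os.makedirs("tokenizer", exist_ok=True)
tokenizer.save_model("tokenizer")  # writes vocab.json and merges.txt

# Encode the corpus and make a 90/10 train/validation split.
with open("corpus_pl.txt", "r", encoding="utf-8") as f:
    ids = tokenizer.encode(f.read()).ids
ids = np.array(ids, dtype=np.uint16)          # a 16K vocabulary fits in uint16
split = int(0.9 * len(ids))
train_ids, val_ids = ids[:split], ids[split:]
```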
Training Configuration
- Batch size: 8 (with gradient accumulation steps: 8)
- Effective batch size: 64
- Context length: 2048 tokens
- Training steps: 50,000
- Optimizer: AdamW
- Learning rate: 3e-4 (peak)
- Learning rate schedule: Cosine decay with linear warmup (see the schedule sketch after this list)
- Warmup steps: 2,000
- Weight decay: 0.1
- Gradient clipping: 1.0
- Dropout: 0.2
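The learning-rate schedule above can be written as a small function of the step number. The sketch below follows the listed hyperparameters (3e-4 peak, 2,000 warmup steps, 50,000 total steps); the minimum learning rate of 0 is an assumption, since the card does not state a floor.

```python
import math

def lr_at(step: int, peak_lr: float = 3e-4, warmup: int = 2_000,
          max_steps: int = 50_000, min_lr: float = 0.0) -> float:
    """Linear warmup to the peak learning rate, then cosine decay to min_lr."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, max_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(2_000), lr_at(26_000), lr_at(50_000))
# ~1.5e-7 during early warmup, 3e-4 at the peak, 1.5e-4 halfway through decay, 0 at the end
```

The effective batch size of 64 follows from accumulating 8 micro-batches of 8 sequences before each optimizer step; the mixed-precision sketch in the next subsection shows that accumulation pattern.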
Training Infrastructure
- Hardware: 1× NVIDIA A100 (80 GB)
- Precision: Mixed precision (bfloat16/float16)
- Gradient scaling: Automatic mixed precision with GradScaler (see the sketch after this list)
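A single optimizer step with gradient accumulation, autocast, and GradScaler might look like the following. This is a schematic sketch, not the training script: the tiny stand-in model and the random-token batch loader are placeholders, and GradScaler is only strictly needed for float16 (bfloat16 usually trains without loss scaling).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

device = "cuda"
# Placeholder stand-in for the real 111.7M-parameter model.
model = nn.Sequential(nn.Embedding(16_384, 768), nn.Linear(768, 16_384)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scaler = torch.cuda.amp.GradScaler()
grad_accum = 8   # 8 micro-batches of 8 sequences -> effective batch size 64

def get_batch(batch_size=8, block_size=2_048):
    """Hypothetical loader returning random token ids, for illustration only."""
    x = torch.randint(0, 16_384, (batch_size, block_size), device=device)
    return x, x.clone()

for _ in range(grad_accum):
    x, y = get_batch()
    # Forward pass runs in reduced precision under autocast.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    scaler.scale(loss / grad_accum).backward()   # scale loss for accumulation

scaler.unscale_(optimizer)                               # unscale before clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad(set_to_none=True)
```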
Performance
Model Size and Efficiency
- Parameters: 111.7M (111,718,144 total; see the counting sketch after this list)
- Context window: 2048 tokens
- Inference speed: Optimized for single-GPU inference
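The headline parameter count is simply the sum over all weight tensors. The generic helpers below show how such a figure is obtained and give a rough weight-only memory estimate (about 223 MB in float16 for 111,718,144 parameters, well within the 4 GB inference recommendation); they are not code shipped with the model.

```python
import torch
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total parameter count; for IncunabuLM this should report 111,718,144."""
    return sum(p.numel() for p in model.parameters())

def weight_memory_mb(model: nn.Module, dtype: torch.dtype = torch.float16) -> float:
    """Rough weight-only memory footprint when the model is loaded in `dtype`
    (excludes activations and the KV cache)."""
    bytes_per_param = torch.finfo(dtype).bits // 8
    return count_parameters(model) * bytes_per_param / 1e6
```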
Training Metrics
- Final training loss: 4.5544
- Final validation loss: 4.7100
- Training time: ~8 hours
Limitations and Biases
Known Limitations
- Context Length: Limited to 2048 tokens; longer documents must be truncated or processed in chunks
- Language Scope: Primarily designed for Polish text, may have reduced performance on other languages
- Model Size: At 111M parameters, may have limited knowledge compared to larger models
- Training Data: Performance heavily dependent on training corpus quality and diversity
Potential Biases
- Language Bias: Optimized for Polish language patterns
- Domain Bias: Reflects the domain distribution of training data
- Temporal Bias: Training data cutoff affects knowledge of recent events
- Cultural Bias: May reflect cultural perspectives present in training data
Technical Specifications
Hardware Recommendations
- Minimum: 4 GB of GPU memory for inference
- CPU: Compatible but significantly slower
Jakub Sztyber