IncunabuLM
Model Description
IncunabuLM is a decoder-only transformer language model designed for text generation tasks. The model implements a custom architecture with RMSNorm normalization and modern attention mechanisms, optimized for resource-efficient training and inference.
Model Details
Model Type
- Architecture: Decoder-only Transformer
- Language(s): Primarily trained on Polish text (lectures)
- Model size: 111.7M parameters
Source and Base Model
- Base Architecture: Custom implementation inspired by modern transformer architectures
- Training Approach: Trained from scratch on Polish text corpus
- Educational Source: Implementation follows principles from Andrej Karpathy's "Let's build GPT: from scratch, in code, spelled out" tutorial
- Tutorial Reference: https://www.youtube.com/watch?v=kCc8FmEb1nY
- Custom Modifications (a minimal sketch of the first two follows this list):
  - RMSNorm instead of LayerNorm for improved stability
  - SiLU activation functions in feed-forward networks
  - Optimized for resource-efficient training and inference
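The two modified layers can be expressed compactly in PyTorch. The snippet below is a minimal sketch of the standard RMSNorm and SiLU feed-forward formulations, not the model's actual source; the class names, the 4x hidden expansion, and the reuse of the 0.2 dropout value from the training configuration are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization: rescales by the RMS of the
    activations with a learned gain, without mean subtraction or bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class FeedForward(nn.Module):
    """Position-wise feed-forward block with SiLU (Swish) activation."""
    def __init__(self, dim: int = 768, hidden: int = 3072, dropout: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```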
Architecture Details
- Layers: 12 transformer blocks
- Hidden size: 768
- Attention heads: 12
- Head dimension: 64
- Context length: 2048 tokens
- Vocabulary size: 16,384 tokens
- Normalization: RMSNorm (Root Mean Square Layer Normalization)
- Activation: SiLU (Swish) in feed-forward networks
- Attention: Causal self-attention with triangular masking
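As a reference for how these dimensions fit together, here is a minimal sketch of causal self-attention with a triangular mask, sized to the figures above (768 hidden units, 12 heads of dimension 64, 2048-token context). It is illustrative only and does not reproduce the model's actual implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention with a lower-triangular mask."""
    def __init__(self, dim: int = 768, n_heads: int = 12,
                 max_len: int = 2048, dropout: float = 0.2):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = dim // n_heads          # 768 / 12 = 64
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.dropout = nn.Dropout(dropout)
        # Lower-triangular mask: position t may attend only to positions <= t.
        mask = torch.tril(torch.ones(max_len, max_len)).bool()
        self.register_buffer("mask", mask.view(1, 1, max_len, max_len))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=-1)
        # Reshape to (B, n_heads, T, head_dim)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        att = att.masked_fill(~self.mask[:, :, :T, :T], float("-inf"))
        att = self.dropout(F.softmax(att, dim=-1))
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)
```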
Key Features
- RMSNorm: Simpler and cheaper than LayerNorm (no mean subtraction or bias term), with comparable or better training stability
- SiLU Activation: Smooth activation with non-zero gradients for negative inputs, avoiding ReLU's dead-neuron problem
- BPE Tokenization: Byte-level BPE with 16K vocabulary
- Mixed Precision: Support for bfloat16/float16 training
- Generation Controls: Temperature, top-k sampling, repetition penalty
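The generation controls listed above combine naturally in a single sampling loop. The helper below is a hypothetical sketch, not the model's bundled generation method: it assumes the model returns logits of shape (batch, time, vocab), works on a batch of one, and applies a simplified repetition penalty that divides the logits of previously generated tokens.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens=100, temperature=0.8,
             top_k=50, repetition_penalty=1.2, context_len=2048):
    """Autoregressive sampling with temperature, top-k filtering and a
    simple repetition penalty (illustrative helper, assumes batch size 1)."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_len:]        # crop to the context window
        logits = model(idx_cond)[:, -1, :]      # logits for the last position
        # Simplified repetition penalty: dampen tokens already generated.
        # (Standard implementations treat negative logits differently.)
        for token in set(idx[0].tolist()):
            logits[0, token] /= repetition_penalty
        logits = logits / temperature
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = float("-inf")
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1)
    return idx
```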
Training Details
Training Data
- Dataset: Custom Polish text corpus
- Preprocessing: Byte-level BPE tokenization
- Split: 90% training, 10% validation
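One possible preprocessing pipeline, using the Hugging Face tokenizers library (the card does not specify the actual tooling): train a byte-level BPE tokenizer with the 16,384-token vocabulary, encode the corpus, and take a 90/10 train/validation split. The file and directory names are placeholders.

```python
import os
import numpy as np
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer with the 16,384-token vocabulary.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["corpus_pl.txt"], vocab_size=16_384, min_frequency=2)
os.makedirs("tokenizer", exist_ok=True)
tokenizer.save_model("tokenizer")  # writes vocab.json and merges.txt

# Encode the corpus and make a 90/10 train/validation split.
with open("corpus_pl.txt", "r", encoding="utf-8") as f:
    ids = tokenizer.encode(f.read()).ids
ids = np.array(ids, dtype=np.uint16)          # a 16K vocabulary fits in uint16
split = int(0.9 * len(ids))
train_ids, val_ids = ids[:split], ids[split:]
```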
Training Configuration
- Batch size: 8 (with gradient accumulation steps: 8)
- Effective batch size: 64
- Context length: 2048 tokens
- Training steps: 50,000
- Optimizer: AdamW
- Learning rate: 3e-4 (peak)
- Learning rate schedule: Cosine decay with linear warmup (see the schedule sketch after this list)
- Warmup steps: 2,000
- Weight decay: 0.1
- Gradient clipping: 1.0
- Dropout: 0.2
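The learning-rate schedule above can be written as a small function of the step number. The sketch below follows the listed hyperparameters (3e-4 peak, 2,000 warmup steps, 50,000 total steps); the minimum learning rate of 0 is an assumption, since the card does not state a floor.

```python
import math

def lr_at(step: int, peak_lr: float = 3e-4, warmup: int = 2_000,
          max_steps: int = 50_000, min_lr: float = 0.0) -> float:
    """Linear warmup to the peak learning rate, then cosine decay to min_lr."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, max_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(2_000), lr_at(26_000), lr_at(50_000))
# ~1.5e-7 during early warmup, 3e-4 at the peak, 1.5e-4 halfway through decay, 0 at the end
```

The effective batch size of 64 follows from accumulating 8 micro-batches of 8 sequences before each optimizer step; the mixed-precision sketch in the next subsection shows that accumulation pattern.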
Training Infrastructure
- Hardware: 1× NVIDIA A100 (80 GB)
- Precision: Mixed precision (bfloat16/float16)
- Gradient scaling: Automatic mixed precision with GradScaler (see the sketch after this list)
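A single optimizer step with gradient accumulation, autocast, and GradScaler might look like the following. This is a schematic sketch, not the training script: the tiny stand-in model and the random-token batch loader are placeholders, and GradScaler is only strictly needed for float16 (bfloat16 usually trains without loss scaling).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

device = "cuda"
# Placeholder stand-in for the real 111.7M-parameter model.
model = nn.Sequential(nn.Embedding(16_384, 768), nn.Linear(768, 16_384)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scaler = torch.cuda.amp.GradScaler()
grad_accum = 8   # 8 micro-batches of 8 sequences -> effective batch size 64

def get_batch(batch_size=8, block_size=2_048):
    """Hypothetical loader returning random token ids, for illustration only."""
    x = torch.randint(0, 16_384, (batch_size, block_size), device=device)
    return x, x.clone()

for _ in range(grad_accum):
    x, y = get_batch()
    # Forward pass runs in reduced precision under autocast.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    scaler.scale(loss / grad_accum).backward()   # scale loss for accumulation

scaler.unscale_(optimizer)                               # unscale before clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad(set_to_none=True)
```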
Performance
Model Size and Efficiency
- Parameters: 111.7M (111,718,144 total; see the counting sketch after this list)
- Context window: 2048 tokens
- Inference speed: Optimized for single-GPU inference
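The headline parameter count is simply the sum over all weight tensors. The generic helpers below show how such a figure is obtained and give a rough weight-only memory estimate (about 223 MB in float16 for 111,718,144 parameters, well within the 4 GB inference recommendation); they are not code shipped with the model.

```python
import torch
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total parameter count; for IncunabuLM this should report 111,718,144."""
    return sum(p.numel() for p in model.parameters())

def weight_memory_mb(model: nn.Module, dtype: torch.dtype = torch.float16) -> float:
    """Rough weight-only memory footprint when the model is loaded in `dtype`
    (excludes activations and the KV cache)."""
    bytes_per_param = torch.finfo(dtype).bits // 8
    return count_parameters(model) * bytes_per_param / 1e6
```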
Training Metrics
- Final training loss: 4.5544
- Final validation loss: 4.7100
- Training time: ~8 hours
Limitations and Biases
Known Limitations
- Context Length: Limited to 2048 tokens; longer documents must be truncated or processed in chunks
- Language Scope: Primarily designed for Polish text, may have reduced performance on other languages
- Model Size: At 111M parameters, may have limited knowledge compared to larger models
- Training Data: Performance heavily dependent on training corpus quality and diversity
Potential Biases
- Language Bias: Optimized for Polish language patterns
- Domain Bias: Reflects the domain distribution of training data
- Temporal Bias: Training data cutoff affects knowledge of recent events
- Cultural Bias: May reflect cultural perspectives present in training data
Technical Specifications
Hardware Recommendations
- Minimum: 4 GB of GPU memory for inference
- CPU: Compatible but significantly slower
Jakub Sztyber