IncunabuLM

Model Description

IncunabuLM is a decoder-only transformer language model designed for text generation tasks. The model implements a custom architecture with RMSNorm normalization and causal self-attention, optimized for resource-efficient training and inference.

Model Details

Model Type

  • Architecture: Decoder-only Transformer
  • Language(s): Primarily trained on Polish text (lectures)
  • Model size: 111.7M parameters

Source and Base Model

  • Base Architecture: Custom implementation inspired by modern transformer architectures
  • Training Approach: Trained from scratch on Polish text corpus
  • Educational Source: Implementation follows principles from Andrej Karpathy's "Let's build GPT: from scratch, in code, spelled out" tutorial
  • Tutorial Reference: https://www.youtube.com/watch?v=kCc8FmEb1nY
  • Custom Modifications:
    • RMSNorm instead of LayerNorm for improved stability (see the sketch after this list)
    • SiLU activation functions in feed-forward networks
    • Optimized for resource-efficient training and inference
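
The exact implementation is not reproduced in this card; below is a minimal PyTorch sketch of RMSNorm as it is commonly written. The epsilon value and the absence of a bias term are assumptions, not confirmed details of IncunabuLM.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization: rescale activations by their RMS
    with a learned gain, without LayerNorm's mean-centering and bias."""

    def __init__(self, dim: int, eps: float = 1e-6):  # eps value is an assumption
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS over the feature dimension, then rescale and apply the learned gain
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```

Compared with LayerNorm, this drops the mean subtraction and the bias term, which saves work per call and is the usual basis for the efficiency claim.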

Architecture Details

  • Layers: 12 transformer blocks
  • Hidden size: 768
  • Attention heads: 12
  • Head dimension: 64
  • Context length: 2048 tokens
  • Vocabulary size: 16,384 tokens
  • Normalization: RMSNorm (Root Mean Square Layer Normalization)
  • Activation: SiLU (Swish) in feed-forward networks
  • Attention: Causal self-attention with triangular masking (see the sketch below)
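
The listed dimensions (768 hidden, 12 heads of 64, 2048-token context) fit a standard multi-head causal attention block. The sketch below assumes that layout; bias-free projections and the dropout placement are guesses in the style of the referenced Karpathy tutorial.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a lower-triangular (causal) mask."""

    def __init__(self, n_embd: int = 768, n_head: int = 12,
                 block_size: int = 2048, dropout: float = 0.2):
        super().__init__()
        self.n_head = n_head
        self.head_dim = n_embd // n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)
        self.proj = nn.Linear(n_embd, n_embd, bias=False)
        self.dropout = nn.Dropout(dropout)
        # triangular mask: position t may only attend to positions <= t
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=-1)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.dropout(self.proj(y))
```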

Key Features

  • RMSNorm: More stable and efficient than LayerNorm
  • SiLU Activation: Better gradient flow than ReLU
  • BPE Tokenization: Byte-level BPE with 16K vocabulary
  • Mixed Precision: Support for bfloat16/float16 training
  • Generation Controls: Temperature, top-k sampling, repetition penalty (see the sketch after this list)
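
One way these generation controls combine in a single decoding step is sketched below; the exact repetition-penalty formulation and the default values are assumptions, not documented settings of the model.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_next_token(logits: torch.Tensor, generated: torch.Tensor,
                      temperature: float = 0.8, top_k: int = 50,
                      repetition_penalty: float = 1.1) -> torch.Tensor:
    """One decoding step: repetition penalty, then temperature, then top-k, then sample.
    `logits` has shape (vocab_size,); `generated` holds the token ids emitted so far."""
    logits = logits.clone()
    # discourage tokens that already appear in the output (CTRL-style penalty)
    for tok in set(generated.tolist()):
        logits[tok] = logits[tok] / repetition_penalty if logits[tok] > 0 \
            else logits[tok] * repetition_penalty
    logits = logits / temperature            # <1.0 sharpens, >1.0 flattens the distribution
    kth = torch.topk(logits, top_k).values[-1]
    logits[logits < kth] = float("-inf")     # keep only the k most likely tokens
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```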

Training Details

Training Data

  • Dataset: Custom Polish text corpus
  • Preprocessing: Byte-level BPE tokenization
  • Split: 90% training, 10% validation (see the sketch below)
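
A sketch of the tokenized-data handling implied above; the file name, the uint16 storage format, and random-window batching are assumptions.

```python
import numpy as np
import torch

# Flat array of byte-level BPE token ids; uint16 is sufficient for a 16,384-token vocabulary.
tokens = np.memmap("corpus_tokens.bin", dtype=np.uint16, mode="r")  # illustrative file name
n = int(0.9 * len(tokens))                        # 90% train / 10% validation
train_data, val_data = tokens[:n], tokens[n:]

def get_batch(data, batch_size: int = 8, block_size: int = 2048):
    """Sample random contiguous windows; targets are the inputs shifted by one token."""
    ix = torch.randint(len(data) - block_size - 1, (batch_size,)).tolist()
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y
```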

Training Configuration

  • Batch size: 8 (with gradient accumulation steps: 8)
  • Effective batch size: 64
  • Context length: 2048 tokens
  • Training steps: 50,000
  • Optimizer: AdamW
  • Learning rate: 3e-4 (peak)
  • Learning rate schedule: Cosine with linear warmup (see the sketch after this list)
  • Warmup steps: 2,000
  • Weight decay: 0.1
  • Gradient clipping: 1.0
  • Dropout: 0.2
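
The optimizer and schedule above map onto the following sketch; the stand-in model and the decay-to-zero floor are assumptions.

```python
import math
import torch

model = torch.nn.Linear(8, 8)          # stand-in; the real IncunabuLM module goes here

max_steps, warmup_steps = 50_000, 2_000
peak_lr, weight_decay = 3e-4, 0.1

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=weight_decay)

def lr_at(step: int) -> float:
    """Linear warmup to the peak LR, then cosine decay over the remaining steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# applied once per step, before optimizer.step()
for group in optimizer.param_groups:
    group["lr"] = lr_at(1_000)          # e.g. step 1,000 is still in warmup: lr = 1.5e-4
```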

Training Infrastructure

  • Hardware: 1x NVIDIA A100 (80 GB)
  • Precision: Mixed precision (bfloat16/float16)
  • Gradient scaling: Automatic mixed precision with GradScaler (see the sketch below)
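
A sketch of how gradient accumulation, clipping, and the GradScaler fit together in one optimizer update; the model's forward signature and the choice of bfloat16 autocast are assumptions.

```python
import torch
import torch.nn.functional as F

scaler = torch.cuda.amp.GradScaler()    # needed for float16; harmless but redundant under bfloat16
accum_steps, grad_clip = 8, 1.0

def train_step(model, optimizer, get_batch):
    """One update: 8 accumulated micro-batches, autocast forward, clipped AdamW step."""
    optimizer.zero_grad(set_to_none=True)
    for _ in range(accum_steps):
        x, y = get_batch()                               # (B, T) inputs and shifted targets
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            logits = model(x)                            # assumes the model returns raw logits
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        scaler.scale(loss / accum_steps).backward()      # average over the accumulated micro-batches
    scaler.unscale_(optimizer)                           # clip the unscaled gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    scaler.step(optimizer)
    scaler.update()
```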

Performance

Model Size and Efficiency

  • Parameters: 111.7M (111,718,144 total parameters)
  • Context window: 2048 tokens
  • Inference speed: Optimized for single-GPU inference

Training Metrics

  • Final training loss: 4.5544
  • Final validation loss: 4.7100 (see the perplexity conversion below)
  • Training time: ~8 hours
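
Assuming these are mean per-token cross-entropy losses in nats (the PyTorch convention), they correspond to perplexities of roughly 95 (training) and 111 (validation):

```python
import math

print(math.exp(4.5544))  # ≈ 95.0  training perplexity
print(math.exp(4.7100))  # ≈ 111.0 validation perplexity
```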

Limitations and Biases

Known Limitations

  1. Context Length: Limited to 2048 tokens, so longer documents must be truncated or processed in chunks
  2. Language Scope: Primarily designed for Polish text; performance on other languages is likely much weaker
  3. Model Size: At 111.7M parameters, the model has limited knowledge compared to larger models
  4. Training Data: Performance depends heavily on the quality and diversity of the training corpus

Potential Biases

  • Language Bias: Optimized for Polish language patterns
  • Domain Bias: Reflects the domain distribution of training data
  • Temporal Bias: Training data cutoff affects knowledge of recent events
  • Cultural Bias: May reflect cultural perspectives present in training data

Technical Specifications

Hardware Recommendations

  • Minimum: 4 GB of GPU memory for inference (see the estimate after this list)
  • CPU: Compatible but significantly slower
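
A back-of-the-envelope check on the 4 GB figure; batch size 1, float16 storage, and a full-length KV cache are assumptions, and activation memory is not counted.

```python
n_params = 111_718_144                    # from the Performance section
weights_fp16 = 2 * n_params               # bytes for the weights in float16
kv_cache_fp16 = 12 * 2 * 2048 * 768 * 2   # 12 layers x (K, V) x 2048 tokens x 768 dims x 2 bytes
print(f"weights  = {weights_fp16 / 1e6:.0f} MB")    # ≈ 223 MB
print(f"KV cache = {kv_cache_fp16 / 1e6:.0f} MB")   # ≈ 75 MB
# Both fit comfortably under 4 GB; the remainder covers activations and framework overhead.
```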

Author: Jakub Sztyber
