SmolLM3-3B German Embeddings

Experimental German text embedding model based on SmolLM3-3B, trained using the LLM2Vec approach to transform a decoder-only LLM into a powerful text encoder.

Model Description

This model represents German text as dense vectors suitable for semantic search, clustering, and similarity tasks. It was created by adapting SmolLM3-3B through a two-stage training process that enables bidirectional attention and teaches the model to generate meaningful text representations.

Key Features

  • Architecture: SmolLM3-3B with bidirectional attention
  • Embedding Dimension: 2048
  • Max Sequence Length: 512 tokens
  • Language: German (primary), may have some cross-lingual capabilities
  • Training Method: LLM2Vec (MNTP + Supervised Contrastive Learning)

Training Process

Stage 1: Bidirectional Conversion & MNTP (Masked Next Token Prediction)

  1. Model Transformation: Modified SmolLM3-3B architecture to enable bidirectional attention by:

    • Removing causal attention masks
    • Allowing every token to attend to both preceding and following tokens
    • Preserving the original model weights
  2. MNTP Training:

    • Dataset: 50,000 samples from German Wikipedia
    • Task: Predicting masked tokens using bidirectional context (a minimal sketch follows this list)
    • Training Steps: 1,000
    • Batch Size: 512 (64 per device × 8 gradient accumulation)
    • LoRA Configuration: rank=16, alpha=32
    • Learning Rate: 1e-4 with warmup
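
The masking step can be illustrated with standard Transformers utilities. The sketch below is a simplified illustration of the MNTP objective from the LLM2Vec paper rather than the actual training script: the mask token, the 20% masking ratio, and the label shifting are assumptions made for demonstration purposes.

import torch
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
if tokenizer.mask_token is None:
    tokenizer.add_special_tokens({"mask_token": "<mask>"})  # MNTP needs a mask token (assumed choice)

# Randomly mask tokens, as in masked language modelling
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.2)
batch = collator([tokenizer("Berlin ist die Hauptstadt von Deutschland.")])

# MNTP twist: the masked token at position i is predicted from the logits at
# position i-1, so the labels are shifted one step to the left before the usual
# cross-entropy over the causal-LM head (with bidirectional attention enabled)
labels = batch["labels"]
mntp_labels = torch.full_like(labels, -100)
mntp_labels[:, :-1] = labels[:, 1:]
# loss = F.cross_entropy(logits.view(-1, logits.size(-1)), mntp_labels.view(-1), ignore_index=-100)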

Stage 2: Supervised Contrastive Learning

  1. Supervised Fine-tuning:
    • Dataset: German text pairs with positive/negative examples
    • Training Format: Contrastive learning using (query, positive, negative) triplets
    • Training Steps: 500
    • Batch Size: 32 (16 per device × 2 gradient accumulation)
    • Learning Rate: 2e-4 with warmup
    • Loss: Contrastive loss to maximize similarity between semantically related texts (sketched below)
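
A minimal sketch of this objective, written as an InfoNCE-style loss over (query, positive, negative) triplets in which the other in-batch positives and negatives also act as distractors; the temperature value and the in-batch-negative handling are illustrative assumptions, not confirmed details of the training run.

import torch
import torch.nn.functional as F

def contrastive_loss(q, pos, neg, temperature=0.05):
    # q, pos, neg: (batch, dim) embeddings; normalise so dot products are cosine similarities
    q, pos, neg = (F.normalize(x, dim=-1) for x in (q, pos, neg))
    candidates = torch.cat([pos, neg], dim=0)           # each query scores all positives + negatives
    logits = q @ candidates.T / temperature             # (batch, 2 * batch)
    targets = torch.arange(q.size(0), device=q.device)  # the i-th positive is the correct match
    return F.cross_entropy(logits, targets)

# Toy example with random vectors in the model's 2048-dim embedding space
loss = contrastive_loss(torch.randn(4, 2048), torch.randn(4, 2048), torch.randn(4, 2048))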

Training Infrastructure

  • Hardware: NVIDIA RTX A6000 (48GB VRAM)
  • Precision: bfloat16
  • Framework: Transformers + PEFT + LLM2Vec

Usage

Using with LLM2Vec Library

from llm2vec import LLM2Vec
import torch

# Load model
model = LLM2Vec.from_pretrained(
    "mayflowergmbh/smollm3-3b-embed-de",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Encode German texts
texts = [
    "Berlin ist die Hauptstadt von Deutschland.",
    "Die deutsche Hauptstadt ist Berlin.",
    "München ist eine Stadt in Bayern."
]

embeddings = model.encode(texts)

# encode returns a torch tensor (bfloat16 here); cast to float32 NumPy
# before handing it to scikit-learn
embeddings = embeddings.float().cpu().numpy()

# Calculate similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(embeddings)

Using with Sentence Transformers

from sentence_transformers import SentenceTransformer

# Note: requires an adapter/configuration for sentence-transformers compatibility;
# the path below is a local placeholder
model = SentenceTransformer('path/to/smollm3-3b-embed-de')
embeddings = model.encode(texts)  # texts from the example above

Intended Uses

Primary Use Cases

  • Semantic Search: Find relevant documents in German text corpora
  • Text Classification: Use embeddings as features for downstream classifiers
  • Clustering: Group similar German texts together
  • Duplicate Detection: Identify semantically similar content
  • Question Answering: Match questions with relevant answers

Example: Semantic Search

# Create document embeddings
documents = [
    "Die Katze sitzt auf dem Sofa.",
    "Der Hund spielt im Garten.",
    "Python ist eine Programmiersprache.",
    "Machine Learning revolutioniert die Technologie."
]
doc_embeddings = model.encode(documents).float().cpu().numpy()

# Search with a query
query = "Haustiere und ihre Aktivitäten"
query_embedding = model.encode([query]).float().cpu().numpy()

# Find most similar documents
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
top_indices = similarities.argsort()[-3:][::-1]

for idx in top_indices:
    print(f"Score: {similarities[idx]:.3f} - {documents[idx]}")

Performance Characteristics

Strengths

  • Excellent German language understanding
  • Strong performance on semantic similarity tasks
  • Efficient inference despite larger model size
  • Benefits from SmolLM3's strong foundation

Limitations

  • Larger than typical embedding models (3B parameters)
  • Requires GPU for optimal performance
  • Limited to 512 token sequences
  • Primarily optimized for German (cross-lingual performance not evaluated)

Model Architecture Details

Base Model: SmolLM3-3B
- Hidden Size: 2048
- Intermediate Size: 11008
- Number of Layers: 36
- Number of Attention Heads: 16
- Vocabulary Size: 128256
- Position Embeddings: 65536 (RoPE)
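
These values can be cross-checked against the published configuration, assuming the repository ships a standard Transformers config.json:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("mayflowergmbh/smollm3-3b-embed-de")
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads, config.vocab_size)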

Training Hyperparameters

MNTP Stage (LoRA setup sketched after this list):

  • Learning Rate: 1e-4
  • Batch Size: 512
  • Max Sequence Length: 512
  • Gradient Accumulation: 8
  • LoRA r: 16
  • LoRA alpha: 32
  • Warmup Steps: 100
  • Total Steps: 1000
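
The LoRA numbers above correspond to a PEFT configuration along the following lines; the target modules are assumed (typical attention projections) and not stated in this card:

from peft import LoraConfig

mntp_lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not confirmed
    bias="none",
)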

Supervised Stage (see the training-arguments sketch after this list):

  • Learning Rate: 2e-4
  • Batch Size: 32
  • Max Sequence Length: 256
  • Training Epochs: 3
  • Warmup Steps: 100
  • Weight Decay: 0.01
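
For orientation, the supervised-stage numbers map onto Transformers TrainingArguments roughly as follows; the output directory and logging cadence are placeholders:

from transformers import TrainingArguments

supervised_args = TrainingArguments(
    output_dir="smollm3-3b-embed-de-supervised",  # placeholder
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,   # effective batch size 32
    num_train_epochs=3,
    warmup_steps=100,
    weight_decay=0.01,
    bf16=True,
    logging_steps=50,                # placeholder
)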

Ethical Considerations

  • Bias: Model may reflect biases present in German Wikipedia and training data
  • Use Cases: Should not be used for making decisions about individuals
  • Privacy: Do not use with personally identifiable information

Citation

If you use this model, please cite:

@misc{smollm3-embed-de,
  title={SmolLM3-3B German Embeddings},
  author={Johann-Peter Hartmann},
  year={2025},
  publisher={Mayflower GmbH},
  url={https://huggingface.co/mayflowergmbh/smollm3-3b-embed-de}
}

@article{llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and others},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}

Acknowledgments

This model builds on SmolLM3-3B by Hugging Face and on the LLM2Vec method by BehnamGhader et al.

Contact

For questions or issues, please open an issue on the GitHub repository.
