# SmolLM3-3B German Embeddings
Experimental German text embedding model based on SmolLM3-3B, trained using the LLM2Vec approach to transform a decoder-only LLM into a powerful text encoder.
## Model Description
This model represents German text as dense vectors suitable for semantic search, clustering, and similarity tasks. It was created by adapting SmolLM3-3B through a two-stage training process that enables bidirectional attention and teaches the model to generate meaningful text representations.
## Key Features
- Architecture: SmolLM3-3B with bidirectional attention
- Embedding Dimension: 2048
- Max Sequence Length: 512 tokens
- Language: German (primary); may retain some cross-lingual capability, but this has not been evaluated
- Training Method: LLM2Vec (MNTP + Supervised Contrastive Learning)
## Training Process

### Stage 1: Bidirectional Conversion & MNTP (Masked Next Token Prediction)

**Model Transformation:** Modified the SmolLM3-3B architecture to enable bidirectional attention (illustrated below) by:
- Removing causal attention masks
- Enabling position-agnostic attention computation
- Preserving the original model weights
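Conceptually, the conversion swaps the decoder's causal (lower-triangular) attention mask for a full mask so that every token can attend to every other token. A minimal sketch of that difference, not the actual LLM2Vec patch (which modifies the attention modules in place):

```python
import torch

seq_len = 6

# Causal mask of a decoder-only LM: token i may only attend to tokens <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Mask after the bidirectional conversion: every token attends to every token.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal_mask.int())
print(bidirectional_mask.int())
```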
**MNTP Training** (a toy sketch of the masking objective follows the list):
- Dataset: 50,000 samples from German Wikipedia
- Task: Predicting masked tokens using bidirectional context
- Training Steps: 1,000
- Batch Size: 512 (64 per device × 8 gradient accumulation)
- LoRA Configuration: rank=16, alpha=32
- Learning Rate: 1e-4 with warmup
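As a rough illustration of the MNTP objective (a toy sketch, not the training code): tokens are masked as in standard masked language modelling, but because a decoder's LM head predicts the *next* token, the label for a masked position i is scored against the logits at position i − 1, i.e. the labels are shifted left by one.

```python
import torch

# Toy token IDs standing in for a tokenized Wikipedia sentence.
input_ids = torch.tensor([[101, 2054, 3099, 4821, 777, 1505, 102]])
mask_token_id = 0          # placeholder ID for the mask token
mask_prob = 0.15

# Standard MLM-style masking: hide a random subset of tokens.
labels = input_ids.clone()
masked = torch.rand(input_ids.shape) < mask_prob
labels[~masked] = -100                                    # loss only on masked positions
masked_inputs = input_ids.masked_fill(masked, mask_token_id)

# MNTP twist: the masked token at position i is predicted from the logits
# at position i - 1, so the labels are shifted left by one position.
mntp_labels = torch.roll(labels, shifts=-1, dims=1)
mntp_labels[:, -1] = -100
```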
### Stage 2: Supervised Contrastive Learning

- Dataset: German text pairs with positive and negative examples
- Training Format: Contrastive learning on (query, positive, negative) triplets
- Training Steps: 500
- Batch Size: 32 (16 per device × 2 gradient accumulation)
- Learning Rate: 2e-4 with warmup
- Loss: Contrastive loss that maximizes similarity between semantically related texts (a minimal formulation is sketched below)
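The model card does not spell out the exact loss formulation, so the snippet below shows only a common InfoNCE-style variant with an assumed temperature of 0.05: each query should score its own positive higher than the hard negatives and the other in-batch candidates.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query, positive, negative, temperature=0.05):
    """InfoNCE-style loss over (query, positive, negative) triplets.

    All inputs have shape (batch, dim); the hard negatives are appended to
    the in-batch positives as additional candidates.
    """
    q = F.normalize(query, dim=-1)
    candidates = F.normalize(torch.cat([positive, negative], dim=0), dim=-1)
    logits = q @ candidates.T / temperature       # (batch, 2 * batch)
    targets = torch.arange(q.size(0))             # the positive for query i is column i
    return F.cross_entropy(logits, targets)

# Toy usage with random vectors in the model's 2048-dimensional embedding space.
q, pos, neg = (torch.randn(4, 2048) for _ in range(3))
print(contrastive_loss(q, pos, neg))
```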
### Training Infrastructure
- Hardware: NVIDIA RTX A6000 (48GB VRAM)
- Precision: bfloat16
- Framework: Transformers + PEFT + LLM2Vec
## Usage

### Using with LLM2Vec Library

```python
from llm2vec import LLM2Vec
import torch

# Load model
model = LLM2Vec.from_pretrained(
    "mayflowergmbh/smollm3-3b-embed-de",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Encode German texts
texts = [
    "Berlin ist die Hauptstadt von Deutschland.",
    "Die deutsche Hauptstadt ist Berlin.",
    "München ist eine Stadt in Bayern."
]
embeddings = model.encode(texts)

# Calculate similarity
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(embeddings)
```
### Using with Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Note: requires an adapter/configuration for sentence-transformers compatibility
model = SentenceTransformer('path/to/smollm3-3b-embed-de')

texts = ["Berlin ist die Hauptstadt von Deutschland."]
embeddings = model.encode(texts)
```
## Intended Uses

### Primary Use Cases
- Semantic Search: Find relevant documents in German text corpora
- Text Classification: Use embeddings as features for downstream classifiers
- Clustering: Group similar German texts together
- Duplicate Detection: Identify semantically similar content
- Question Answering: Match questions with relevant answers
### Example: Semantic Search

```python
from sklearn.metrics.pairwise import cosine_similarity

# Create document embeddings
documents = [
    "Die Katze sitzt auf dem Sofa.",
    "Der Hund spielt im Garten.",
    "Python ist eine Programmiersprache.",
    "Machine Learning revolutioniert die Technologie."
]
doc_embeddings = model.encode(documents)

# Search with a query
query = "Haustiere und ihre Aktivitäten"
query_embedding = model.encode([query])

# Find most similar documents
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
top_indices = similarities.argsort()[-3:][::-1]

for idx in top_indices:
    print(f"Score: {similarities[idx]:.3f} - {documents[idx]}")
```
## Performance Characteristics

### Strengths

- Strong German language understanding inherited from the SmolLM3-3B base model
- Tuned specifically for semantic similarity through contrastive training
- Efficient inference despite the comparatively large model size
- Benefits from SmolLM3's strong pretraining foundation
### Limitations
- Larger than typical embedding models (3B parameters)
- Requires GPU for optimal performance
- Limited to 512 token sequences
- Primarily optimized for German (cross-lingual performance not evaluated)
## Model Architecture Details

**Base Model:** SmolLM3-3B
- Hidden Size: 2048
- Intermediate Size: 11008
- Number of Layers: 36
- Number of Attention Heads: 16
- Vocabulary Size: 128256
- Position Embeddings: 65536 (RoPE)
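These figures come from the base model's published configuration and can be checked directly (a small verification snippet, assuming network access to the Hugging Face Hub):

```python
from transformers import AutoConfig

# Should print the values listed above.
config = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM3-3B")
print(config.hidden_size)              # 2048
print(config.intermediate_size)        # 11008
print(config.num_hidden_layers)        # 36
print(config.num_attention_heads)      # 16
print(config.vocab_size)               # 128256
print(config.max_position_embeddings)  # 65536
```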
## Training Hyperparameters

**MNTP Stage** (an illustrative PEFT/Trainer configuration follows the list):
- Learning Rate: 1e-4
- Batch Size: 512
- Max Sequence Length: 512
- Gradient Accumulation: 8
- LoRA r: 16
- LoRA alpha: 32
- Warmup Steps: 100
- Total Steps: 1000
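The settings above map onto a PEFT `LoraConfig` and Hugging Face `TrainingArguments` roughly as sketched below; `target_modules`, `lora_dropout`, and `output_dir` are assumptions not stated in this card.

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not stated in the card
    lora_dropout=0.05,                                        # assumed
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="smollm3-mntp-de",      # placeholder
    per_device_train_batch_size=64,
    gradient_accumulation_steps=8,     # effective batch size 512
    learning_rate=1e-4,
    warmup_steps=100,
    max_steps=1000,
    bf16=True,
)
```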
**Supervised Stage:**
- Learning Rate: 2e-4
- Batch Size: 32
- Max Sequence Length: 256
- Training Epochs: 3
- Warmup Steps: 100
- Weight Decay: 0.01
## Ethical Considerations
- Bias: Model may reflect biases present in German Wikipedia and training data
- Use Cases: Should not be used for making decisions about individuals
- Privacy: Do not use with personally identifiable information
## Citation

If you use this model, please cite:

```bibtex
@misc{smollm3-embed-de,
  title     = {SmolLM3-3B German Embeddings},
  author    = {Johann-Peter Hartmann},
  year      = {2025},
  publisher = {Mayflower GmbH},
  url       = {https://huggingface.co/mayflowergmbh/smollm3-3b-embed-de}
}

@article{llm2vec,
  title   = {LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author  = {BehnamGhader, Parishad and others},
  journal = {arXiv preprint arXiv:2404.05961},
  year    = {2024}
}
```
## Acknowledgments
- Base model: HuggingFaceTB/SmolLM3-3B
- Training methodology: McGill-NLP/LLM2Vec
- Training data: German Wikipedia
## Contact
For questions or issues, please open an issue on the GitHub repository.