Experimental SmolLM3 3B German Embedding Model
This is an experimental German text embedding model based on SmolLM3 3B, fine-tuned for retrieval with LoRA (Low-Rank Adaptation). It is trained specifically for German information retrieval and semantic similarity tasks.
Model Details
- Base Model: SmolLM3 3B (HuggingFaceTB/SmolLM3-3B)
- Language: German (de)
- Model Type: Sentence Transformers
- Embedding Dimension: 2048
- Max Sequence Length: 512
- Pooling Strategy: Mean pooling
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- Training Data: German retrieval datasets
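These properties can be verified directly from the loaded model; a quick sanity check, assuming the repository name used in the Usage section below:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('mayflowergmbh/smollm3-3b-german-embed')
print(model.get_sentence_embedding_dimension())  # expected: 2048
print(model.max_seq_length)                      # expected: 512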
Key Features
🚀 Retrieval-Optimized: Specifically fine-tuned for information retrieval tasks
🇩🇪 German-Focused: Optimized for German language understanding
⚡ High Performance: Noticeably better German retrieval quality than the untuned base model (see Evaluation results below)
📏 Standard Format: Compatible with sentence-transformers library
Usage
Installation
pip install sentence-transformers
Basic Usage
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer('mayflowergmbh/smollm3-3b-german-embed')
# Encode sentences
sentences = [
"Was ist die Hauptstadt von Deutschland?",
"Berlin ist die Hauptstadt von Deutschland.",
"München ist eine große Stadt in Bayern."
]
embeddings = model.encode(sentences)
print(f"Embeddings shape: {embeddings.shape}")
# Compute similarities
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(embeddings)
print(similarities)
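Alternatively, the util module bundled with sentence-transformers computes the same pairwise cosine matrix without pulling in scikit-learn:

from sentence_transformers import util

# cos_sim returns a torch tensor of pairwise cosine similarities
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)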
Information Retrieval Example
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('mayflowergmbh/smollm3-3b-german-embed')
# Query and documents
query = "Was ist die Hauptstadt von Deutschland?"
documents = [
"Berlin ist die Hauptstadt und größte Stadt Deutschlands.",
"München ist die Hauptstadt des Freistaates Bayern.",
"Hamburg ist eine Hansestadt im Norden Deutschlands.",
"Köln ist eine Großstadt in Nordrhein-Westfalen."
]
# Encode query and documents (normalized so the dot product equals cosine similarity)
query_embedding = model.encode([query], normalize_embeddings=True)
doc_embeddings = model.encode(documents, normalize_embeddings=True)
# Compute cosine similarities
similarities = np.dot(query_embedding, doc_embeddings.T)[0]
# Rank documents by relevance
ranked_indices = np.argsort(similarities)[::-1]
print("Query:", query)
print("\nRanked Results:")
for i, idx in enumerate(ranked_indices):
print(f"{i+1}. {documents[idx]} (Score: {similarities[idx]:.3f})")
Technical Details
Architecture
The model uses the LLM2Vec approach to convert the decoder-only SmolLM3 model into an effective encoder:
- Bidirectional Attention: The causal attention mask is removed so every token can attend to the full sequence
- Mean Pooling: Averages token embeddings over the attention mask, ignoring padding (see the sketch after this list)
- LoRA Fine-tuning: Parameter-efficient adaptation targeting Q and V projection layers
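For illustration, masked mean pooling reduces to a few tensor operations; a minimal sketch in PyTorch (the actual pooling is handled inside the sentence-transformers pipeline):

import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len) with 1 = real token
    mask = attention_mask.unsqueeze(-1).float()
    summed = (hidden_states * mask).sum(dim=1)   # sum embeddings of non-padding tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)     # number of non-padding tokens per sequence
    return summed / counts                       # masked mean, shape (batch, dim)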
Training Process
- Base Model: Started with SmolLM3 3B converted to LLM2Vec format
- LoRA Configuration (see the sketch after this list):
- Rank (r): 8
- Alpha: 16
- Target modules: q_proj, v_proj
- Dropout: 0.05
- Training Data: German retrieval datasets with contrastive learning
- Optimization: Hard negative mining for improved discrimination
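The hyperparameters above map onto a peft LoraConfig roughly as follows; this is a sketch under the stated settings, not the exact training script:

from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],  # adapt only the Q and V projections
    lora_dropout=0.05,
    bias="none",
    task_type="FEATURE_EXTRACTION",
)
# get_peft_model(base_model, lora_config) would wrap the LLM2Vec-converted backbone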
Model Card Metadata
- Developed by: mayflowergmbh
- Model type: Sentence Transformer
- Language(s): German (de)
- License: Apache 2.0
- Base model: HuggingFaceTB/SmolLM3-3B
- Training approach: LoRA fine-tuning
- Primary use: Information retrieval, semantic similarity
Limitations and Bias
- Language Scope: Optimized specifically for German; performance on other languages not evaluated
- Domain: Best performance on factual/informational content similar to training data
- Sequence Length: Maximum 512 tokens; longer texts are silently truncated (see the snippet after this list)
- Computational Requirements: Requires ~6GB GPU memory for inference
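Note that sentence-transformers truncates inputs to max_seq_length rather than raising an error; a small illustration, reusing the loaded model from above:

# Texts beyond 512 tokens are cut off at model.max_seq_length
long_text = "Berlin " * 2000
embedding = model.encode(long_text)  # only the first 512 tokens contribute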
Citation
If you use this model in your research, please cite:
@misc{smollm3-german-embed-retrieval,
title={SmolLM3 3B German Embedding Model (Retrieval-Optimized)},
author={mayflowergmbh},
year={2025},
howpublished={\url{https://huggingface.co/mayflowergmbh/smollm3-3b-german-embed}},
note={Retrieval-optimized German embedding model using LoRA fine-tuning}
}
Contact
For questions or issues, please open an issue on the model repository or contact the author.
Generated on 2025-07-16
Evaluation results
Self-reported results on the GermanQuAD test set:
- Mean Reciprocal Rank (MRR): 0.917
- Recall@1: 0.872
- Recall@5: 0.980
- Recall@10: 0.986