SmolLM3-3B German Embeddings

Experimental German text embedding model based on SmolLM3-3B, trained using the LLM2Vec approach to transform a decoder-only LLM into a powerful text encoder.

Model Description

This model represents German text as dense vectors suitable for semantic search, clustering, and similarity tasks. It was created by adapting SmolLM3-3B through a two-stage training process that enables bidirectional attention and teaches the model to generate meaningful text representations.

Key Features

  • Architecture: SmolLM3-3B with bidirectional attention
  • Embedding Dimension: 2048
  • Max Sequence Length: 512 tokens
  • Language: German (primary), may have some cross-lingual capabilities
  • Training Method: LLM2Vec (MNTP + Supervised Contrastive Learning)

Training Process

Stage 1: Bidirectional Conversion & MNTP (Masked Next Token Prediction)

  1. Model Transformation: Modified SmolLM3-3B architecture to enable bidirectional attention by:

    • Removing causal attention masks
    • Allowing every token to attend to both preceding and following tokens
    • Preserving the original model weights
  2. MNTP Training:

    • Dataset: 50,000 samples from German Wikipedia
    • Task: Predicting masked tokens using bidirectional context (a minimal sketch follows this list)
    • Training Steps: 1,000
    • Batch Size: 512 (64 per device × 8 gradient accumulation)
    • LoRA Configuration: rank=16, alpha=32
    • Learning Rate: 1e-4 with warmup
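
The masking step can be illustrated with standard Transformers utilities. The sketch below is a simplified illustration of the MNTP objective from the LLM2Vec paper rather than the actual training script: the mask token, the 20% masking ratio, and the label shifting are assumptions made for demonstration purposes.

import torch
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
if tokenizer.mask_token is None:
    tokenizer.add_special_tokens({"mask_token": "<mask>"})  # MNTP needs a mask token (assumed choice)

# Randomly mask tokens, as in masked language modelling
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.2)
batch = collator([tokenizer("Berlin ist die Hauptstadt von Deutschland.")])

# MNTP twist: the masked token at position i is predicted from the logits at
# position i-1, so the labels are shifted one step to the left before the usual
# cross-entropy over the causal-LM head (with bidirectional attention enabled)
labels = batch["labels"]
mntp_labels = torch.full_like(labels, -100)
mntp_labels[:, :-1] = labels[:, 1:]
# loss = F.cross_entropy(logits.view(-1, logits.size(-1)), mntp_labels.view(-1), ignore_index=-100)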

Stage 2: Supervised Contrastive Learning

  1. Supervised Fine-tuning:
    • Dataset: German text pairs with positive/negative examples
    • Training Format: Contrastive learning using (query, positive, negative) triplets
    • Training Steps: 500
    • Batch Size: 32 (16 per device × 2 gradient accumulation)
    • Learning Rate: 2e-4 with warmup
    • Loss: Contrastive loss to maximize similarity between semantically related texts (sketched below)
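
A minimal sketch of this objective, written as an InfoNCE-style loss over (query, positive, negative) triplets in which the other in-batch positives and negatives also act as distractors; the temperature value and the in-batch-negative handling are illustrative assumptions, not confirmed details of the training run.

import torch
import torch.nn.functional as F

def contrastive_loss(q, pos, neg, temperature=0.05):
    # q, pos, neg: (batch, dim) embeddings; normalise so dot products are cosine similarities
    q, pos, neg = (F.normalize(x, dim=-1) for x in (q, pos, neg))
    candidates = torch.cat([pos, neg], dim=0)           # each query scores all positives + negatives
    logits = q @ candidates.T / temperature             # (batch, 2 * batch)
    targets = torch.arange(q.size(0), device=q.device)  # the i-th positive is the correct match
    return F.cross_entropy(logits, targets)

# Toy example with random vectors in the model's 2048-dim embedding space
loss = contrastive_loss(torch.randn(4, 2048), torch.randn(4, 2048), torch.randn(4, 2048))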

Training Infrastructure

  • Hardware: NVIDIA RTX A6000 (48GB VRAM)
  • Precision: bfloat16
  • Framework: Transformers + PEFT + LLM2Vec

Usage

Using with LLM2Vec Library

from llm2vec import LLM2Vec
import torch

# Load model
model = LLM2Vec.from_pretrained(
    "mayflowergmbh/smollm3-3b-embed-de",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Encode German texts
texts = [
    "Berlin ist die Hauptstadt von Deutschland.",
    "Die deutsche Hauptstadt ist Berlin.",
    "München ist eine Stadt in Bayern."
]

embeddings = model.encode(texts)

# encode returns a torch tensor (bfloat16 here); cast to float32 NumPy
# before handing it to scikit-learn
embeddings = embeddings.float().cpu().numpy()

# Calculate similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(embeddings)

Using with Sentence Transformers

from sentence_transformers import SentenceTransformer

# Note: requires an adapter/configuration for sentence-transformers compatibility;
# the path below is a local placeholder
model = SentenceTransformer('path/to/smollm3-3b-embed-de')
embeddings = model.encode(texts)  # texts from the example above

Intended Uses

Primary Use Cases

  • Semantic Search: Find relevant documents in German text corpora
  • Text Classification: Use embeddings as features for downstream classifiers
  • Clustering: Group similar German texts together
  • Duplicate Detection: Identify semantically similar content
  • Question Answering: Match questions with relevant answers

Example: Semantic Search

# Create document embeddings
documents = [
    "Die Katze sitzt auf dem Sofa.",
    "Der Hund spielt im Garten.",
    "Python ist eine Programmiersprache.",
    "Machine Learning revolutioniert die Technologie."
]
doc_embeddings = model.encode(documents).float().cpu().numpy()

# Search with a query
query = "Haustiere und ihre Aktivitäten"
query_embedding = model.encode([query]).float().cpu().numpy()

# Find most similar documents
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
top_indices = similarities.argsort()[-3:][::-1]

for idx in top_indices:
    print(f"Score: {similarities[idx]:.3f} - {documents[idx]}")

Performance Characteristics

Strengths

  • Excellent German language understanding
  • Strong performance on semantic similarity tasks
  • Efficient inference despite larger model size
  • Benefits from SmolLM3's strong foundation

Limitations

  • Larger than typical embedding models (3B parameters)
  • Requires GPU for optimal performance
  • Limited to 512 token sequences
  • Primarily optimized for German (cross-lingual performance not evaluated)

Model Architecture Details

Base Model: SmolLM3-3B
- Hidden Size: 2048
- Intermediate Size: 11008
- Number of Layers: 36
- Number of Attention Heads: 16
- Vocabulary Size: 128256
- Position Embeddings: 65536 (RoPE)
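
These values can be cross-checked against the published configuration, assuming the repository ships a standard Transformers config.json:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("mayflowergmbh/smollm3-3b-embed-de")
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads, config.vocab_size)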

Training Hyperparameters

MNTP Stage (LoRA setup sketched after this list):

  • Learning Rate: 1e-4
  • Batch Size: 512
  • Max Sequence Length: 512
  • Gradient Accumulation: 8
  • LoRA r: 16
  • LoRA alpha: 32
  • Warmup Steps: 100
  • Total Steps: 1000
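
The LoRA numbers above correspond to a PEFT configuration along the following lines; the target modules are assumed (typical attention projections) and not stated in this card:

from peft import LoraConfig

mntp_lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not confirmed
    bias="none",
)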

Supervised Stage (see the training-arguments sketch after this list):

  • Learning Rate: 2e-4
  • Batch Size: 32
  • Max Sequence Length: 256
  • Training Epochs: 3
  • Warmup Steps: 100
  • Weight Decay: 0.01
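
For orientation, the supervised-stage numbers map onto Transformers TrainingArguments roughly as follows; the output directory and logging cadence are placeholders:

from transformers import TrainingArguments

supervised_args = TrainingArguments(
    output_dir="smollm3-3b-embed-de-supervised",  # placeholder
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,   # effective batch size 32
    num_train_epochs=3,
    warmup_steps=100,
    weight_decay=0.01,
    bf16=True,
    logging_steps=50,                # placeholder
)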

Ethical Considerations

  • Bias: Model may reflect biases present in German Wikipedia and training data
  • Use Cases: Should not be used for making decisions about individuals
  • Privacy: Do not use with personally identifiable information

Citation

If you use this model, please cite:

@misc{smollm3-embed-de,
  title={SmolLM3-3B German Embeddings},
  author={Johann-Peter Hartmann},
  year={2025},
  publisher={Mayflower GmbH},
  url={https://huggingface.co/mayflowergmbh/smollm3-3b-embed-de}
}

@article{llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and others},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}

Acknowledgments

This model builds on SmolLM3-3B by Hugging Face and on the LLM2Vec method by BehnamGhader et al.

Contact

For questions or issues, please open an issue on the GitHub repository.
