+---
+license: apache-2.0
+base_model: intfloat/multilingual-e5-small
+tags:
+- sentence-transformers
+- feature-extraction
+- sentence-similarity
+- transformers
+- multilingual
+- embedding
+- text-embedding
+library_name: sentence-transformers
+pipeline_tag: feature-extraction
+language:
+- multilingual
+- id
+- en
+model-index:
+- name: toolify-text-embedding-001
+  results:
+  - task:
+      type: feature-extraction
+      name: Feature Extraction
+    dataset:
+      type: custom
+      name: Custom Dataset
+    metrics:
+    - type: cosine_similarity
+      value: 0.85
+      name: Cosine Similarity
+    - type: spearman_correlation
+      value: 0.82
+      name: Spearman Correlation
+---
+# toolify-text-embedding-001
+`toolify-text-embedding-001` is a fine-tuned version of [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) for text embedding, aimed at multilingual input with an emphasis on Indonesian and English.
+## Model Details
+- **Base Model**: intfloat/multilingual-e5-small
+- **Model Type**: Sentence Transformer / Text Embedding Model
+- **Language Support**: Multilingual (optimized for Indonesian and English)
+- **Fine-tuning**: Custom dataset for improved embedding quality
+- **Vector Dimension**: 384 (inherited from base model)
+## Intended Use
+This model is designed for:
+- **Semantic Search**: Finding similar documents or texts
+- **Text Similarity**: Measuring semantic similarity between texts
+- **Information Retrieval**: Document ranking and retrieval systems
+- **Clustering**: Grouping similar texts together
+- **Classification**: Text classification tasks using embeddings
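Most of the use cases above reduce to ranking candidate vectors by cosine similarity against a query vector. The following is a minimal sketch of that ranking step, using small hand-written vectors in place of real 384-dimensional embeddings. The vectors, document names, and the helper names `cosine` and `rank` are illustrative assumptions, not part of this model or its API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional stand-ins for real embeddings (hypothetical values)
docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 1.0, 0.0],
}
query = [1.0, 0.1, 0.0]

for doc_id, score in rank(query, docs):
    print(f"{doc_id}: {score:.3f}")
```

In practice the vectors would come from encoding the query and the documents with the model; the ranking logic stays the same.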
+## Usage
+### Using Sentence Transformers
+```python
+from sentence_transformers import SentenceTransformer
+from sentence_transformers.util import cos_sim
+# Load the model
+model = SentenceTransformer('wardydev/toolify-text-embedding-001')
+# Encode sentences
+sentences = [
+    "Ini adalah contoh kalimat dalam bahasa Indonesia",
+    "This is an example sentence in English",
+    "Model ini dapat memproses teks multibahasa"
+]
+embeddings = model.encode(sentences)
+print(f"Embedding shape: {embeddings.shape}")
+# Calculate cosine similarity between the first two sentences
+similarity = cos_sim(embeddings[0], embeddings[1])
+print(f"Similarity: {similarity.item()}")
+```
+### Using Transformers Library
+```python
+from transformers import AutoTokenizer, AutoModel
+import torch
+import torch.nn.functional as F
+# Load model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained('wardydev/toolify-text-embedding-001')
+model = AutoModel.from_pretrained('wardydev/toolify-text-embedding-001')
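+def mean_pooling(model_output, attention_mask):
+    # Average the token embeddings, using the attention mask to ignore padding positions
+    token_embeddings = model_output[0]  # last hidden state: (batch, seq_len, hidden)
+    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+    # clamp guards against division by zero when a mask is all zeros
+    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)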
+# Encode text
+sentences = ["Your text here"]
+encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+with torch.no_grad():
+    model_output = model(**encoded_input)
+embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+embeddings = F.normalize(embeddings, p=2, dim=1)
+print(f"Embeddings: {embeddings}")
+```
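To make the pooling step concrete, the sketch below reproduces masked mean pooling followed by L2 normalization in plain Python on a single toy sequence. The token vectors and mask are made-up values, and `masked_mean` and `l2_normalize` are illustrative helper names rather than library functions.

```python
import math

def masked_mean(token_vectors, mask):
    """Average token vectors where mask == 1, skipping padding (mask == 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    for vec, keep in zip(token_vectors, mask):
        if keep:
            for i, value in enumerate(vec):
                total[i] += value
    count = max(sum(mask), 1e-9)  # guard against an all-zero mask
    return [t / count for t in total]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

# Two real tokens followed by one padding token (hypothetical values)
tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)  # the padding vector does not contribute
unit = l2_normalize(pooled)
print(pooled, unit)
```

Because the third token is masked out, its large values have no effect on the pooled vector, which is the reason for passing the attention mask to the pooling function.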
+## Performance
+The model has been fine-tuned on a custom dataset to improve performance on:
+- Indonesian text understanding
+- Cross-lingual similarity tasks
+- Domain-specific text embedding
+## Training Details
+- **Base Model**: intfloat/multilingual-e5-small
+- **Training Framework**: Sentence Transformers
+- **Fine-tuning Method**: Custom training on domain-specific data
+- **Training Environment**: Google Colab
+## Technical Specifications
+- **Model Size**: ~118M parameters (inherited from base model)
+- **Embedding Dimension**: 384
+- **Max Sequence Length**: 512 tokens
+- **Architecture**: BERT-based encoder
+- **Pooling**: Mean pooling
+## Evaluation
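+The model card metadata reports a cosine similarity of 0.85 and a Spearman correlation of 0.82 on a custom dataset. No per-task benchmark figures are given; the reported areas of improvement are: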
+- Semantic textual similarity tasks
+- Cross-lingual retrieval
+- Indonesian language understanding
+- Domain-specific embedding quality
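Semantic textual similarity is commonly scored by the Spearman rank correlation between model similarity scores and human ratings. The sketch below computes that correlation in plain Python. The scores and ratings are made-up illustrations, not results for this model, and the helper names are assumptions. Tied values receive the average of their ranks.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical cosine similarities and human ratings for four sentence pairs
model_scores = [0.91, 0.35, 0.62, 0.48]
human_ratings = [4.8, 1.0, 2.5, 3.9]
print(f"Spearman: {spearman(model_scores, human_ratings):.2f}")
```

Only the ordering of the scores matters here, so the metric is insensitive to the absolute scale of the cosine similarities.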
+## Limitations
+- Performance may vary on out-of-domain texts
+- Optimal performance requires proper text preprocessing
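+- Inputs are limited to 512 tokens; longer text must be truncated or split before encoding
+- May require specific prompt formatting for best results; the base model expects inputs prefixed with `query: ` or `passage: `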
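The base E5 models expect each input to start with `query: ` or `passage: `. This card does not state whether the fine-tuned model keeps that convention, so treat the sketch below as an assumption to verify against your own results. It also shows one rough way to split long text before encoding, using whitespace-separated words as a stand-in for tokens; real token counts come from the model tokenizer and will differ. The helper names `with_prefix` and `chunk_words` are hypothetical.

```python
def with_prefix(text, kind):
    """Prepend an E5-style role prefix ('query' or 'passage') to the text."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return f"{kind}: {text.strip()}"

def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping windows of at most max_words words.

    Word counts only approximate token counts, so max_words should stay
    well below the 512-token limit.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks

long_text = "kata " * 450  # hypothetical long document
passages = [with_prefix(chunk, "passage") for chunk in chunk_words(long_text)]
query = with_prefix("contoh pencarian", "query")
print(len(passages), query)
```

The resulting strings can then be passed to the encoder in place of the raw text.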
+## License
+This model is released under the Apache 2.0 license; the base model, intfloat/multilingual-e5-small, is published under the MIT license.
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{toolify-text-embedding-001,
+  title={toolify-text-embedding-001: Fine-tuned Multilingual Text Embedding Model},
+  author={wardydev},
+  year={2024},
+  publisher={Hugging Face},
+  url={https://huggingface.co/wardydev/toolify-text-embedding-001}
+}
+```
+## Contact
+For questions or issues, please reach out through the Hugging Face model repository.