# 🚀 Setup Guide for Hugging Face Deployment

## Prerequisites

1. **Install required packages:**
   ```bash
   pip install huggingface_hub sentence-transformers
   ```

2. **Log in to Hugging Face:**
   ```bash
   huggingface-cli login
   ```
   Enter your Hugging Face token when prompted.

## 📦 Repository Contents

```
final_repo/
├── README.md                       # Main model documentation
├── USAGE_EXAMPLES.md               # Comprehensive usage examples
├── SETUP.md                        # This setup guide
├── push_to_hf.py                   # Upload script
├── .gitignore                      # Git ignore rules
├── model.safetensors               # Model weights
├── config.json                     # Model configuration
├── tokenizer.json                  # Tokenizer
├── vocab.txt                       # Vocabulary
├── sentence_bert_config.json       # Sentence-BERT config
├── modules.json                    # Model modules
├── 1_Pooling/config.json           # Pooling configuration
├── training_metadata.json          # Training information
└── configuration_hf_nomic_bert.py  # Model architecture
```

## 🔄 Push to Hugging Face

### Option 1: Automated Upload (Recommended)

```bash
cd final_repo
python push_to_hf.py
```

### Option 2: Manual Upload

```bash
cd final_repo

# Clone/create the repo next to final_repo, so the copy below does not
# try to copy the clone into itself
git clone https://huggingface.co/asmud/nomic-embed-indonesian ../nomic-embed-indonesian
# OR create new: huggingface-cli repo create nomic-embed-indonesian

# Copy files (the * glob skips dotfiles, so .gitignore is listed explicitly)
cp -r * .gitignore ../nomic-embed-indonesian/
cd ../nomic-embed-indonesian/

# Git commands
git add .
git commit -m "Add Indonesian text embedding model

- Fine-tuned from nomic-embed-text-v1.5
- Optimized for Indonesian language
- 6,294 training examples across 17 categories
- Conservative training to prevent embedding collapse
- Maintains base model performance with Indonesian specialization"

git push
```

## ✅ Verification Steps

After uploading, verify the model works:

```python
from sentence_transformers import SentenceTransformer

# Load the uploaded model
model = SentenceTransformer("asmud/nomic-embed-indonesian", trust_remote_code=True)

# Test Indonesian text
texts = [
    "search_query: Apa itu kecerdasan buatan?",
    "search_document: Kecerdasan buatan adalah teknologi yang memungkinkan mesin belajar",
    "classification: Produk ini sangat berkualitas (sentimen: positif)"
]

embeddings = model.encode(texts)
print(f"✅ Model working! Embedding shape: {embeddings.shape}")
```

## 📊 Model Information

- **Base Model**: nomic-ai/nomic-embed-text-v1.5
- **Language**: Indonesian (Bahasa Indonesia)
- **Embedding Dimension**: 768
- **Max Sequence Length**: 8192
- **Training Examples**: 6,294 (balanced positive/negative)
- **Categories**: 17 Indonesian content domains
- **Loss Function**: MultipleNegativesRankingLoss
- **Training**: Conservative approach to prevent embedding collapse

## 🎯 Model Performance

- **Search Retrieval**: Maintains base performance (1.000 precision@1)
- **Classification**: Stable performance (0.667 accuracy)
- **Clustering**: Excellent performance (1.000 accuracy)
- **Semantic Similarity**: High correlation (0.794)
- **Embedding Health**: Healthy diversity range (0.625-0.898)

## 📝 License & Attribution

This model inherits the license from nomic-ai/nomic-embed-text-v1.5. Please refer to the base model's license terms.

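
This guide does not define how the "Embedding Health" range listed under Model Performance is computed. One plausible reading, and it is only an assumption, is the range of pairwise cosine similarities between embeddings of different texts: a collapsed model would push every pair toward 1.0. The sketch below computes that range in plain Python. The helper names `cosine` and `similarity_range` are hypothetical and not part of sentence-transformers.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_range(embeddings):
    """Return (min, max) cosine similarity over all distinct pairs of rows."""
    sims = [cosine(u, v) for u, v in combinations(embeddings, 2)]
    return min(sims), max(sims)


# Toy vectors standing in for real embeddings (assumption for illustration).
# With the real model, the array returned by model.encode(texts) can be
# passed to similarity_range() in the same way.
toy = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
low, high = similarity_range(toy)
print(f"Pairwise cosine similarity range: {low:.3f}-{high:.3f}")
```

If unrelated texts all score close to 1.0, that would suggest the embeddings have collapsed; what counts as "too close" is a judgment call this guide does not specify.
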
## 🔗 Links

- **Model Repository**: https://huggingface.co/asmud/nomic-embed-indonesian
- **Base Model**: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
- **Sentence Transformers**: https://www.sbert.net

## 🐛 Troubleshooting

### Common Issues:

1. **Authentication Error**:
   ```bash
   huggingface-cli login
   ```

2. **Large File Upload Issues**:
   ```bash
   git lfs install
   git lfs track "*.safetensors"
   ```

3. **Model Loading Error**:
   ```python
   from sentence_transformers import SentenceTransformer
   # This model uses custom architecture code, so pass trust_remote_code=True
   model = SentenceTransformer("asmud/nomic-embed-indonesian", trust_remote_code=True)
   ```

4. **Memory Issues**:
   ```python
   from sentence_transformers import SentenceTransformer
   # Use the CPU if GPU memory is insufficient
   model = SentenceTransformer("asmud/nomic-embed-indonesian", trust_remote_code=True, device="cpu")
   ```
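
This guide does not show the contents of `push_to_hf.py`. As a rough idea of what an automated upload (Option 1) can look like, here is a minimal sketch built on `HfApi.create_repo` and `HfApi.upload_folder` from `huggingface_hub`. It is not the actual script: the helper names `missing_files` and `push_folder`, the list of required files, and the ignore patterns are all assumptions made for illustration.

```python
from pathlib import Path

# Assumed minimum set of files, taken from the repository listing in this guide.
REQUIRED_FILES = (
    "config.json",
    "model.safetensors",
    "tokenizer.json",
    "modules.json",
    "1_Pooling/config.json",
)


def missing_files(folder, required=REQUIRED_FILES):
    """Return the required files that are absent from `folder`."""
    root = Path(folder)
    return [name for name in required if not (root / name).is_file()]


def push_folder(folder, repo_id):
    """Create the model repo if it does not exist, then upload the folder."""
    # Imported lazily so the pre-flight check above runs without huggingface_hub.
    from huggingface_hub import HfApi

    missing = missing_files(folder)
    if missing:
        raise FileNotFoundError(f"Missing files: {', '.join(missing)}")

    api = HfApi()  # picks up the token saved by `huggingface-cli login`
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(
        folder_path=str(folder),
        repo_id=repo_id,
        repo_type="model",
        commit_message="Add Indonesian text embedding model",
        ignore_patterns=["__pycache__/*", "*.pyc"],  # assumed exclusions
    )
```

Calling `push_folder(".", "asmud/nomic-embed-indonesian")` from inside `final_repo/` would upload over HTTP, which should handle large files such as `model.safetensors` without a manual `git lfs` setup. The pre-flight check fails early if a required file is missing, before anything is sent.
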