# 🚀 Setup Guide for Hugging Face Deployment

## Prerequisites

1. **Install required packages:**
```bash
pip install huggingface_hub sentence-transformers
```

2. **Login to Hugging Face:**
```bash
huggingface-cli login
```
Paste your Hugging Face access token when prompted. The token needs write permission; a read-only token cannot push to the Hub.

## 📦 Repository Contents

```
final_repo/
├── README.md                         # Main model documentation
├── USAGE_EXAMPLES.md                 # Comprehensive usage examples
├── SETUP.md                          # This setup guide
├── push_to_hf.py                     # Upload script
├── .gitignore                        # Git ignore rules
├── model.safetensors                 # Model weights
├── config.json                       # Model configuration
├── tokenizer.json                    # Tokenizer
├── vocab.txt                         # Vocabulary
├── sentence_bert_config.json         # Sentence-BERT config
├── modules.json                      # Model modules
├── 1_Pooling/config.json             # Pooling configuration
├── training_metadata.json            # Training information
└── configuration_hf_nomic_bert.py    # Model architecture
```
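
Before pushing, it can help to confirm that the files listed above are actually present. The following is a minimal sketch using only the standard library; the helper name `missing_files` and the choice of which files to treat as required are assumptions made for illustration, not part of the repository.

```python
from pathlib import Path

# File names copied from the repository tree above. Treating exactly this
# subset as "required" is an assumption; adjust it to your needs.
EXPECTED_FILES = [
    "README.md",
    "model.safetensors",
    "config.json",
    "tokenizer.json",
    "vocab.txt",
    "sentence_bert_config.json",
    "modules.json",
    "1_Pooling/config.json",
    "configuration_hf_nomic_bert.py",
]


def missing_files(repo_dir):
    """Return the expected files that are absent from repo_dir."""
    root = Path(repo_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_files("final_repo")` returns an empty list when every expected file exists, and otherwise the relative paths that still need to be added.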

## 🔄 Push to Hugging Face

### Option 1: Automated Upload (Recommended)
```bash
cd final_repo
python push_to_hf.py
```
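
The contents of `push_to_hf.py` are not shown in this guide. As a rough idea of what an automated upload involves, the sketch below calls `create_repo` and `upload_folder`, both methods of `huggingface_hub.HfApi`. The helper name `push_folder`, the commit message, and the ignore patterns are illustrative assumptions, not the script's actual code.

```python
from pathlib import Path


def push_folder(api, folder, repo_id):
    """Create the model repo if needed, then upload every file in folder.

    `api` is expected to be a huggingface_hub.HfApi instance (assumption:
    the caller is already logged in with a write token).
    """
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"{folder} is not a directory")

    # exist_ok=True makes the call a no-op when the repo already exists.
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)

    # Upload the whole folder as a single commit.
    return api.upload_folder(
        folder_path=str(folder),
        repo_id=repo_id,
        repo_type="model",
        commit_message="Add Indonesian text embedding model",
        ignore_patterns=["__pycache__/*", "*.pyc"],  # hypothetical choice
    )
```

With `huggingface_hub` installed, this could be invoked from inside `final_repo` as `push_folder(HfApi(), ".", "asmud/nomic-embed-indonesian")` after `from huggingface_hub import HfApi`. Passing the API object in as an argument keeps the helper easy to test without network access.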

### Option 2: Manual Upload
```bash
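# Run from the directory that contains final_repo/ so the clone lands beside it

# Clone the existing repo
git clone https://huggingface.co/asmud/nomic-embed-indonesian
# OR create it first, then clone: huggingface-cli repo create nomic-embed-indonesian

# Copy files (the * glob skips dotfiles, so copy .gitignore explicitly)
cp -r final_repo/* nomic-embed-indonesian/
cp final_repo/.gitignore nomic-embed-indonesian/
cd nomic-embed-indonesian/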

# Git commands
git add .
git commit -m "Add Indonesian text embedding model

- Fine-tuned from nomic-embed-text-v1.5
- Optimized for Indonesian language
- 6,294 training examples across 17 categories
- Conservative training to prevent embedding collapse
- Maintains base model performance with Indonesian specialization"

git push
```

## ✅ Verification Steps

After uploading, verify the model works:

```python
from sentence_transformers import SentenceTransformer

# Load the uploaded model
model = SentenceTransformer("asmud/nomic-embed-indonesian", trust_remote_code=True)

# Test Indonesian text
texts = [
    "search_query: Apa itu kecerdasan buatan?",
    "search_document: Kecerdasan buatan adalah teknologi yang memungkinkan mesin belajar",
    "classification: Produk ini sangat berkualitas (sentimen: positif)"
]

embeddings = model.encode(texts)
print(f"✅ Model working! Embedding shape: {embeddings.shape}")
```
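
A shape check only confirms that the model loads. A slightly stronger sanity check is to compare similarities: one would expect the query about artificial intelligence to sit closer to the matching document than to the unrelated product review. The sketch below is a dependency-free illustration of that comparison; the function names are hypothetical and no particular score is guaranteed.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def query_prefers_first_document(query, doc_a, doc_b):
    """True when query is closer to doc_a than to doc_b."""
    return cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b)
```

With the `embeddings` array from the check above, `query_prefers_first_document(embeddings[0], embeddings[1], embeddings[2])` compares the query against the related document and the classification sentence.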

## 📊 Model Information

- **Base Model**: nomic-ai/nomic-embed-text-v1.5
- **Language**: Indonesian (Bahasa Indonesia)
- **Embedding Dimension**: 768
- **Max Sequence Length**: 8192
- **Training Examples**: 6,294 (balanced positive/negative)
- **Categories**: 17 Indonesian content domains
- **Loss Function**: MultipleNegativesRankingLoss
- **Training**: Conservative approach to prevent embedding collapse

## 🎯 Model Performance

- **Search Retrieval**: Maintains base performance (1.000 precision@1)
- **Classification**: Stable performance (0.667 accuracy)
- **Clustering**: Excellent performance (1.000 accuracy)
- **Semantic Similarity**: High correlation (0.794)
- **Embedding Health**: Healthy diversity range (0.625-0.898)
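
This guide does not define how the "diversity range" is computed. One plausible reading, offered here purely as an assumption, is the minimum and maximum pairwise cosine similarity over a set of embeddings: if every pair scores close to 1.0, the embeddings have collapsed toward a single direction. A minimal sketch of that check, with a hypothetical function name:

```python
import math
from itertools import combinations


def _normalize(vec):
    """Scale a vector to unit length."""
    values = [float(x) for x in vec]
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in values]


def pairwise_similarity_range(vectors):
    """Return (min, max) cosine similarity over all distinct pairs.

    Assumption: this mirrors the 'diversity range' metric; the original
    evaluation code may compute it differently.
    """
    unit = [_normalize(v) for v in vectors]
    if len(unit) < 2:
        raise ValueError("need at least two vectors")
    sims = [sum(x * y for x, y in zip(a, b)) for a, b in combinations(unit, 2)]
    return min(sims), max(sims)
```

Running `pairwise_similarity_range(model.encode(texts))` on a varied set of sentences gives a quick signal: a very narrow range near 1.0 would suggest collapse, while a wider spread suggests the embeddings still separate different inputs.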

## 📝 License & Attribution

This model inherits the license from nomic-ai/nomic-embed-text-v1.5. Please refer to the base model's license terms.

## 🔗 Links

- **Model Repository**: https://huggingface.co/asmud/nomic-embed-indonesian
- **Base Model**: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
- **Sentence Transformers**: https://www.sbert.net

## 🐛 Troubleshooting

### Common Issues:

1. **Authentication Error**:
```bash
huggingface-cli login
```

2. **Large File Upload Issues**:
```bash
git lfs install
git lfs track "*.safetensors"
git add .gitattributes  # commit the tracking rule along with the weights
```

3. **Model Loading Error**:
```python
from sentence_transformers import SentenceTransformer

# The repo ships custom model code, so loading requires trust_remote_code=True
model = SentenceTransformer("asmud/nomic-embed-indonesian", trust_remote_code=True)
```

4. **Memory Issues**:
```python
from sentence_transformers import SentenceTransformer

# Fall back to CPU if GPU memory is insufficient
model = SentenceTransformer("asmud/nomic-embed-indonesian", trust_remote_code=True, device="cpu")
```
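
For the large file upload issue (item 2), it can be useful to see which files would need Git LFS before pushing. The sketch below lists files above a size threshold. The 10 MB default is an assumption based on the Hub's usual limit for regular Git files, and `large_files` is a hypothetical helper name.

```python
from pathlib import Path


def large_files(repo_dir, limit_bytes=10 * 1024 * 1024):
    """Return (relative_path, size_in_bytes) for files larger than limit_bytes.

    The default threshold of 10 MB is an assumption; check the Hub
    documentation for the current limit.
    """
    root = Path(repo_dir)
    found = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # Skip Git's own object store; it is never uploaded as content.
        if ".git" in relative.parts:
            continue
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                found.append((relative.as_posix(), size))
    return found
```

Any path returned by `large_files("final_repo")` should match a pattern registered with `git lfs track`; in this repository that is most likely `model.safetensors`.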