---
language:
- km
- en
library_name: unsloth
license: llama3
base_model: unsloth/llama-3-8b-bnb-4bit
tags:
- khmer
- cambodian
- llama-3
- continue-pretraining
- unsloth
- lora
- text-generation
datasets:
- metythorn/khmer-corpus
model-index:
- name: llama-3-8b-bnb-4bit-khmer
  results: []
---

# Llama-3-8B Continued Pretraining on a Khmer Corpus

This model is a continued-pretraining version of [unsloth/llama-3-8b-bnb-4bit](https://huggingface.co/unsloth/llama-3-8b-bnb-4bit) trained on the [metythorn/khmer-corpus](https://huggingface.co/datasets/metythorn/khmer-corpus) dataset.

## Model Description

This is a Llama-3-8B model that has been continually pretrained with the Unsloth framework to improve performance on Khmer (Cambodian) text generation tasks. The model uses LoRA (Low-Rank Adaptation) for efficient training on top of 4-bit quantization.

## Training Details

### Training Data

- **Dataset**: [metythorn/khmer-corpus](https://huggingface.co/datasets/metythorn/khmer-corpus)
- **Language**: Primarily Khmer, with some English
- **Dataset Split**: Training split

### Training Configuration

- **Base Model**: unsloth/llama-3-8b-bnb-4bit
- **Training Framework**: Unsloth with LoRA
- **Quantization**: 4-bit (bnb-4bit)
- **Max Sequence Length**: 2048
- **LoRA Rank (r)**: 128
- **LoRA Alpha**: 32
- **LoRA Dropout**: 0
- **Target Modules**: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens, lm_head
- **Use RSLoRA**: True
- **Gradient Checkpointing**: unsloth

### Training Hyperparameters

- **Epochs**: 1
- **Batch Size**: 2 (per device)
- **Gradient Accumulation Steps**: 8
- **Learning Rate**: 5e-5
- **Embedding Learning Rate**: 5e-6
- **Warmup Ratio**: 0.1
- **Optimizer**: adamw_8bit
- **LR Scheduler**: cosine
- **Weight Decay**: 0.0
- **Seed**: 3407
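For reference, the configuration above maps onto Unsloth's continued-pretraining API roughly as follows. This is a minimal sketch reconstructed from the listed hyperparameters, not the original training script: the dataset text column name (`text`) and `output_dir` are assumptions, and exact trainer arguments may differ across Unsloth/TRL versions.

```python
from unsloth import FastLanguageModel, UnslothTrainer, UnslothTrainingArguments
from datasets import load_dataset

# Load the 4-bit base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,  # auto-detect
    load_in_4bit=True,
)

# Attach LoRA adapters; embed_tokens and lm_head are included so the
# embeddings can adapt to Khmer during continued pretraining
model = FastLanguageModel.get_peft_model(
    model,
    r=128,
    lora_alpha=32,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",
                    "embed_tokens", "lm_head"],
    use_rslora=True,
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

dataset = load_dataset("metythorn/khmer-corpus", split="train")

trainer = UnslothTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # assumed column name
    max_seq_length=2048,
    args=UnslothTrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=5e-5,
        embedding_learning_rate=5e-6,  # lower LR for embed_tokens / lm_head
        warmup_ratio=0.1,
        optim="adamw_8bit",
        lr_scheduler_type="cosine",
        weight_decay=0.0,
        seed=3407,
        output_dir="outputs",  # assumed
    ),
)
trainer.train()
```

`UnslothTrainingArguments` adds `embedding_learning_rate`, which lets the embedding and output layers train at a lower rate than the LoRA adapters, the usual recipe when extending a model to a new script.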
## Usage

### Basic Usage with Unsloth

```python
from unsloth import FastLanguageModel
import torch

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="metythorn/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,  # None for auto detection
    load_in_4bit=True,
)

# Enable inference mode
FastLanguageModel.for_inference(model)

# Simple generation
prompt = "សួស្តី"  # "Hello" in Khmer
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Khmer-Optimized Streaming Generation

For proper Khmer text streaming that handles Unicode combining characters:

```python
from transformers import TextIteratorStreamer
from threading import Thread
import unicodedata

# Khmer-aware text streamer
text_streamer = TextIteratorStreamer(
    tokenizer,
    skip_prompt=True,          # Skip the input prompt
    skip_special_tokens=True,  # Skip special tokens
)

# Buffer a few tokens before displaying so Khmer combining
# characters are not printed as broken half-clusters
token_buffer = ""
buffer_size = 3

# Before running inference
FastLanguageModel.for_inference(model)
inputs = tokenizer(["ហាយ"], return_tensors="pt").to("cuda")  # "Hi" in Khmer

generation_kwargs = dict(
    inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Run generation in a background thread so we can consume the stream
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

token_count = 0
for j, new_text in enumerate(text_streamer):
    # Add new text to the buffer
    token_buffer += new_text
    token_count += 1

    # Flush once enough tokens have accumulated
    # (the first token is flushed immediately for responsiveness)
    if token_count >= buffer_size or j == 0:
        # Normalize Unicode for proper Khmer display
        display_text = unicodedata.normalize("NFC", token_buffer)
        print(display_text, end="", flush=True)
        # Reset the buffer
        token_buffer = ""
        token_count = 0

# Handle any remaining tokens in the buffer
if token_buffer:
    print(unicodedata.normalize("NFC", token_buffer), end="", flush=True)

thread.join()
print()  # Final newline
```
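To see why buffering matters, note that a single visible Khmer syllable is typically composed of several Unicode code points (a base consonant plus coeng, vowel, and diacritic signs), and a token boundary can fall anywhere inside that sequence. The illustrative, stdlib-only snippet below prints the code points of one Khmer word:

```python
import unicodedata

word = "ខ្ញុំ"  # "I" in Khmer: one visual cluster, several code points
for ch in word:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch, 'UNKNOWN')}")
```

If a flush lands between a consonant and its vowel or diacritic sign, many renderers display a dotted-circle placeholder; buffering a few tokens before printing makes this far less likely.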
### Using with Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "metythorn/llama-3-8b-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Generate text
prompt = "ប្រទេសកម្ពុជា"  # "Cambodia" in Khmer
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Batch Generation for Multiple Prompts

```python
def generate_khmer_batch(prompts, max_new_tokens=256):
    FastLanguageModel.for_inference(model)

    # Decoder-only models should be left-padded for batched generation
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Slice off the prompt tokens so only the newly generated text is decoded
    prompt_length = inputs["input_ids"].shape[1]
    responses = [
        tokenizer.decode(output[prompt_length:], skip_special_tokens=True).strip()
        for output in outputs
    ]
    return responses

# Example usage
prompts = ["សួស្តី", "ខ្ញុំឈ្មោះ", "ប្រទេសកម្ពុជា"]
results = generate_khmer_batch(prompts)
for prompt, result in zip(prompts, results):
    print(f"Input: {prompt}")
    print(f"Output: {result}")
    print("---")
```

## Model Performance

This model has been continually pretrained to understand and generate Khmer text more effectively than the base Llama-3-8B model. The training focused on:

- **Improved Khmer language understanding**: Better comprehension of Khmer syntax and semantics
- **Enhanced Khmer text generation**: More natural and coherent Khmer text output
- **Unicode handling**: Proper support for Khmer combining characters and complex scripts
- **Maintained multilingual capabilities**: Preserves English and other language abilities
- **Efficient inference**: Optimized with 4-bit quantization for faster generation

### Special Features for Khmer

- **Proper Unicode Normalization**: Handles Khmer combining characters correctly
- **Streaming Support**: Includes optimized streaming generation for real-time applications
- **Batch Processing**: Efficient handling of multiple Khmer prompts simultaneously
- **Context Awareness**: Better understanding of Khmer cultural and linguistic context

### Recommended Usage Patterns

- Use the **Khmer-optimized streaming** for real-time chat applications
- Use **batch generation** for processing multiple texts efficiently
- Use **simple generation** for basic text completion tasks
- Buffer tokens (3-5) when streaming to ensure proper Khmer character display

## Limitations and Biases

- The model's performance is limited by the quality and size of the training dataset
- May exhibit biases present in the training data
- Performance may vary across Khmer dialects and specialized domains
- 4-bit quantization may slightly reduce quality compared to full precision
- **Khmer-specific limitations**:
  - Streaming requires token buffering for proper Unicode character display
  - Performance may vary with different Khmer romanization systems
  - Limited understanding of highly specialized Khmer terminology
  - May occasionally mix Khmer and English in responses

## Important Notes for Khmer Usage

⚠️ **Streaming Considerations**: When implementing streaming generation with Khmer text, always use token buffering (3-5 tokens) and Unicode normalization to prevent broken character display.

✅ **Best Practices**:

- Use `skip_prompt=True` in `TextIteratorStreamer` for cleaner output
- Apply `unicodedata.normalize('NFC', text)` for proper Khmer character composition
- Set `pad_token_id=tokenizer.eos_token_id` to avoid generation issues
- Use a temperature of 0.7-0.9 for more natural Khmer text generation

## Technical Specifications

- **Model Size**: ~4.5 GB (4-bit quantized)
- **Architecture**: Llama-3-8B with LoRA adapters
- **Precision**: 4-bit quantized base weights with LoRA adapters in higher precision
- **Memory Requirements**: ~6-8 GB VRAM for inference
- **Framework**: Compatible with Transformers and Unsloth

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{llama3-8b-khmer-2024,
  title={Llama-3-8B Continually Pretrained on a Khmer Corpus},
  author={metythorn},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/metythorn/llama-3-8b-bnb-4bit}
}
```

## Acknowledgments

- Meta AI for the Llama-3 model
- The Unsloth team for the efficient fine-tuning framework
- The Khmer corpus dataset contributors

## License

This model is released under the same license as the base Llama-3 model. Please refer to the [Llama-3 license](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE) for more details.