---
language:
  - km
  - en
library_name: unsloth
license: llama3
base_model: unsloth/llama-3-8b-bnb-4bit
tags:
  - khmer
  - cambodian
  - llama-3
  - continue-pretraining
  - unsloth
  - lora
  - text-generation
datasets:
  - metythorn/khmer-corpus
model-index:
  - name: llama-3-8b-bnb-4bit-khmer
    results: []
---

Llama-3-8B Continued Pretraining on Khmer Corpus

This model is a continued-pretraining version of unsloth/llama-3-8b-bnb-4bit, trained on the metythorn/khmer-corpus dataset.

Model Description

This is a Llama-3-8B model that has been continually pretrained with the Unsloth framework to improve performance on Khmer (Cambodian) text generation tasks. The model uses LoRA (Low-Rank Adaptation) with 4-bit quantization for memory-efficient training.

Training Details

Training Data

The model was continually pretrained on the metythorn/khmer-corpus dataset, a corpus of Khmer-language text.

Training Configuration

  • Base Model: unsloth/llama-3-8b-bnb-4bit
  • Training Framework: Unsloth with LoRA
  • Quantization: 4-bit (bnb-4bit)
  • Max Sequence Length: 2048
  • LoRA Rank (r): 128
  • LoRA Alpha: 32
  • LoRA Dropout: 0
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens, lm_head
  • Use RSLoRA: True
  • Gradient Checkpointing: unsloth
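
As a rough reference, the configuration above maps onto Unsloth's API as follows (a sketch, not the exact training script; the argument values come from the list above, while reusing the training seed as random_state is an assumption):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,         # auto-detect (bf16 on Ampere+, otherwise fp16)
    load_in_4bit=True,  # bnb 4-bit quantization
)

# Attach LoRA adapters to the attention, MLP, embedding, and output layers
model = FastLanguageModel.get_peft_model(
    model,
    r=128,
    lora_alpha=32,
    lora_dropout=0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "embed_tokens", "lm_head",  # also trained for continued pretraining
    ],
    use_rslora=True,                       # rank-stabilized LoRA
    use_gradient_checkpointing="unsloth",
    random_state=3407,                     # assumption: the training seed
)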

Training Hyperparameters

  • Epochs: 1
  • Batch Size: 2 (per device)
  • Gradient Accumulation Steps: 8
  • Learning Rate: 5e-5
  • Embedding Learning Rate: 5e-6
  • Warmup Ratio: 0.1
  • Optimizer: adamw_8bit
  • LR Scheduler: cosine
  • Weight Decay: 0.0
  • Seed: 3407
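
Under the same assumptions, the hyperparameters above plug into Unsloth's continued-pretraining trainer roughly like this (a sketch continuing from the previous block; the dataset split and the "text" column name are assumptions):

import torch
from datasets import load_dataset
from unsloth import UnslothTrainer, UnslothTrainingArguments

dataset = load_dataset("metythorn/khmer-corpus", split="train")

trainer = UnslothTrainer(
    model=model,                # model and tokenizer from the sketch above
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # assumption: name of the raw-text column
    max_seq_length=2048,
    args=UnslothTrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=5e-5,
        embedding_learning_rate=5e-6,  # slower LR for embed_tokens / lm_head
        warmup_ratio=0.1,
        optim="adamw_8bit",
        lr_scheduler_type="cosine",
        weight_decay=0.0,
        seed=3407,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        output_dir="outputs",
    ),
)
trainer.train()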

Usage

Basic Usage with Unsloth

from unsloth import FastLanguageModel
import torch

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="metythorn/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,  # None for auto detection
    load_in_4bit=True,
)

# Enable inference mode
FastLanguageModel.for_inference(model)

# Simple generation
prompt = "αžŸαž½αžŸαŸ’αžαžΈ"  # "Hello" in Khmer
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Khmer-Optimized Streaming Generation

For proper Khmer text streaming that handles Unicode combining characters:

from transformers import TextIteratorStreamer
from threading import Thread
import unicodedata

# Khmer-aware text streamer
text_streamer = TextIteratorStreamer(
    tokenizer,
    skip_prompt=True,  # Skip the input prompt
    skip_special_tokens=True  # Skip special tokens
)

# Buffer to collect tokens for proper Khmer display
token_buffer = ""
buffer_size = 3  # Collect a few tokens before displaying

# Before running inference
FastLanguageModel.for_inference(model)

inputs = tokenizer(["αž αžΆαž™"], return_tensors="pt").to("cuda")

generation_kwargs = dict(
    inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

token_count = 0

for j, new_text in enumerate(text_streamer):
    # Add new text to buffer
    token_buffer += new_text
    token_count += 1
    
    # Display the first chunk immediately for responsiveness, then only
    # once the buffer holds enough tokens for stable rendering
    should_display = token_count >= buffer_size
    
    if should_display or j == 0:
        # Normalize Unicode for proper Khmer display
        display_text = unicodedata.normalize('NFC', token_buffer)
        print(display_text, end="", flush=True)
        
        # Reset buffer
        token_buffer = ""
        token_count = 0

# Handle any remaining tokens in buffer
if token_buffer:
    display_text = unicodedata.normalize('NFC', token_buffer)
    print(display_text, end="", flush=True)

thread.join()
print()  # Final newline

Using with Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "metythorn/llama-3-8b-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Generate text
prompt = "αž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆ"  # Cambodia in Khmer
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # keep inputs on the model's device
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Batch Generation for Multiple Prompts

def generate_khmer_batch(prompts, max_new_tokens=256):
    FastLanguageModel.for_inference(model)
    
    # Decoder-only models must be left-padded so generation continues
    # directly from the last real token of every prompt
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    # Decode only the newly generated tokens; with left padding every
    # prompt occupies the same number of leading positions
    prompt_length = inputs["input_ids"].shape[1]
    responses = [
        tokenizer.decode(output[prompt_length:], skip_special_tokens=True).strip()
        for output in outputs
    ]
    return responses

# Example usage
prompts = ["αžŸαž½αžŸαŸ’αžαžΈ", "αžαŸ’αž‰αž»αŸ†αžˆαŸ’αž˜αŸ„αŸ‡", "αž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆ"]
results = generate_khmer_batch(prompts)
for prompt, result in zip(prompts, results):
    print(f"Input: {prompt}")
    print(f"Output: {result}")
    print("---")

Model Performance

This model has been continually pretrained to understand and generate Khmer text more effectively than the base Llama-3-8B model. The training focused on:

  • Improved Khmer language understanding: Better comprehension of Khmer syntax and semantics
  • Enhanced Khmer text generation: More natural and coherent Khmer text output
  • Unicode handling: Proper support for Khmer combining characters and complex scripts
  • Maintained multilingual capabilities: Preserves English and other language abilities
  • Efficient inference: Optimized with 4-bit quantization for faster generation

Special Features for Khmer

  • Proper Unicode Normalization: Handles Khmer combining characters correctly
  • Streaming Support: Includes optimized streaming generation for real-time applications
  • Batch Processing: Efficient handling of multiple Khmer prompts simultaneously
  • Context Awareness: Better understanding of Khmer cultural and linguistic context

Recommended Usage Patterns

  • Use the Khmer-optimized streaming for real-time chat applications
  • Use batch generation for processing multiple texts efficiently
  • Use simple generation for basic text completion tasks
  • Buffer tokens (3-5) when streaming to ensure proper Khmer character display (see the helper sketch below)
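
The buffering pattern from the streaming example can be factored into a small reusable generator. This is a sketch (the stream_khmer name is ours, not part of any library) that wraps any TextIteratorStreamer:

import unicodedata

def stream_khmer(streamer, buffer_size=4):
    # Accumulate a few raw chunks so Khmer combining marks (vowel signs,
    # COENG clusters) are emitted together with their base consonant,
    # then NFC-normalize before handing the chunk to the caller.
    buffer, count = "", 0
    for piece in streamer:
        buffer += piece
        count += 1
        if count >= buffer_size:
            yield unicodedata.normalize("NFC", buffer)
            buffer, count = "", 0
    if buffer:  # flush whatever remains once generation ends
        yield unicodedata.normalize("NFC", buffer)

# Usage with the TextIteratorStreamer from the streaming example:
# for chunk in stream_khmer(text_streamer):
#     print(chunk, end="", flush=True)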

Limitations and Biases

  • The model's performance is limited by the quality and size of the training dataset
  • May exhibit biases present in the training data
  • Performance may vary for different Khmer dialects or specialized domains
  • 4-bit quantization may slightly impact model quality compared to full precision
  • Khmer-specific limitations:
    • Streaming requires token buffering for proper Unicode character display
    • Performance may vary with different Khmer romanization systems
    • Limited understanding of very specialized Khmer terminology
    • May occasionally mix Khmer and English in responses

Important Notes for Khmer Usage

⚠️ Streaming Considerations: When implementing streaming generation with Khmer text, always use token buffering (3-5 tokens) and Unicode normalization to prevent broken character display.

βœ… Best Practices:

  • Use skip_prompt=True in TextIteratorStreamer for cleaner output
  • Apply unicodedata.normalize('NFC', text) for proper Khmer character composition
  • Set pad_token_id=tokenizer.eos_token_id to avoid generation issues
  • Use temperature 0.7-0.9 for more natural Khmer text generation

Technical Specifications

  • Model Size: ~4.5GB (4-bit quantized)
  • Architecture: Llama-3-8B with LoRA adapters
  • Precision: 4-bit quantization with LoRA in higher precision
  • Memory Requirements: ~6-8GB VRAM for inference
  • Framework: Compatible with Transformers and Unsloth
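
To check whether the stated VRAM budget is available before loading, a quick PyTorch query (a sketch) is:

import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes on the current device
    print(f"Free VRAM: {free / 1024**3:.1f} / {total / 1024**3:.1f} GiB")
else:
    print("No CUDA device found; bnb 4-bit inference requires a GPU.")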

Citation

If you use this model in your research, please cite:

@misc{llama3-8b-khmer-2024,
  title={Llama-3-8B Fine-tuned on Khmer Corpus},
  author={metythorn},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/metythorn/llama-3-8b-bnb-4bit}
}

Acknowledgments

  • Meta AI for the Llama-3 model
  • Unsloth team for the efficient fine-tuning framework
  • The Khmer corpus dataset contributors

License

This model is released under the same license as the base Llama-3 model. Please refer to the Llama-3 license for more details.