language:
- km
- en
library_name: unsloth
license: llama3
base_model: unsloth/llama-3-8b-bnb-4bit
tags:
- khmer
- cambodian
- llama-3
- continue-pretraining
- unsloth
- lora
- text-generation
datasets:
- metythorn/khmer-corpus
model-index:
- name: llama-3-8b-bnb-4bit-khmer
  results: []
Llama-3-8B Continued Pretraining on Khmer Corpus
This model is a continued-pretraining version of unsloth/llama-3-8b-bnb-4bit, trained on the metythorn/khmer-corpus dataset.
Model Description
This is a Llama-3-8B model that has undergone continued pretraining with the Unsloth framework to improve performance on Khmer (Cambodian) text generation tasks. Training used LoRA (Low-Rank Adaptation) adapters on top of the 4-bit quantized base model to keep memory use low.
Training Details
Training Data
- Dataset: metythorn/khmer-corpus
- Language: Primarily Khmer with some English
- Dataset Split: Training split
Training Configuration
- Base Model: unsloth/llama-3-8b-bnb-4bit
- Training Framework: Unsloth with LoRA
- Quantization: 4-bit (bnb-4bit)
- Max Sequence Length: 2048
- LoRA Rank (r): 128
- LoRA Alpha: 32
- LoRA Dropout: 0
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens, lm_head
- Use RSLoRA: True
- Gradient Checkpointing: unsloth
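With rank-stabilized LoRA enabled, the low-rank update is scaled by alpha / sqrt(r) instead of the standard alpha / r. The sketch below works through that arithmetic for the rank and alpha listed above. The helper name lora_scale is invented for this illustration and is not part of Unsloth or PEFT.

```python
import math

def lora_scale(alpha: float, r: int, use_rslora: bool) -> float:
    """Scaling factor applied to the low-rank update B @ A."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r

# Values taken from the training configuration above
rslora_scale = lora_scale(alpha=32, r=128, use_rslora=True)
standard_scale = lora_scale(alpha=32, r=128, use_rslora=False)

print(f"rsLoRA scale:   {rslora_scale:.3f}")    # 2.828
print(f"standard scale: {standard_scale:.3f}")  # 0.250
```

At rank 128 the standard scaling would shrink the update to a quarter of its raw size, while the rank-stabilized variant keeps it close to 2.8, which is one reason RSLoRA is often paired with high ranks.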
Training Hyperparameters
- Epochs: 1
- Batch Size: 2 (per device)
- Gradient Accumulation Steps: 8
- Learning Rate: 5e-5
- Embedding Learning Rate: 5e-6
- Warmup Ratio: 0.1
- Optimizer: adamw_8bit
- LR Scheduler: cosine
- Weight Decay: 0.0
- Seed: 3407
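To make these numbers concrete, the following sketch computes the effective batch size per device and traces a linear-warmup, cosine-decay learning rate curve. It is a plain-Python approximation of the schedule shape, not the trainer's own code, and TOTAL_STEPS is a hypothetical value because the real step count depends on the corpus size.

```python
import math

PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 8
MAX_SEQ_LENGTH = 2048
BASE_LR = 5e-5
WARMUP_RATIO = 0.1

# Sequences contributing to each optimizer step on a single device
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
# Upper bound on tokens per optimizer step (sequences may be shorter)
max_tokens_per_step = effective_batch * MAX_SEQ_LENGTH

def lr_at(step: int, total_steps: int, base_lr: float = BASE_LR) -> float:
    """Linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 1000  # hypothetical; depends on the size of the corpus
print("effective batch:", effective_batch)              # 16
print("max tokens per step:", max_tokens_per_step)      # 32768
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```

Passing base_lr=5e-6 gives the corresponding curve for the embedding learning rate, assuming it follows the same schedule.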
Usage
Basic Usage with Unsloth
from unsloth import FastLanguageModel
import torch

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="metythorn/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,  # None for auto detection
    load_in_4bit=True,
)

# Enable inference mode
FastLanguageModel.for_inference(model)

# Simple generation
prompt = "សួស្តី"  # Khmer text ("Hello")
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
Khmer-Optimized Streaming Generation
For proper Khmer text streaming that handles Unicode combining characters:
from transformers import TextIteratorStreamer
from threading import Thread
import unicodedata
# Khmer-aware text streamer
text_streamer = TextIteratorStreamer(
    tokenizer,
    skip_prompt=True,  # Skip the input prompt
    skip_special_tokens=True,  # Skip special tokens
)
# Buffer to collect tokens for proper Khmer display
token_buffer = ""
buffer_size = 3 # Collect a few tokens before displaying
# Before running inference
FastLanguageModel.for_inference(model)
inputs = tokenizer(["សួស្តី"], return_tensors="pt").to("cuda")
generation_kwargs = dict(
    inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
token_count = 0
for j, new_text in enumerate(text_streamer):
    # Add new text to buffer
    token_buffer += new_text
    token_count += 1
    # Flush once enough chunks have accumulated (and always on the first chunk)
    should_display = token_count >= buffer_size
    if should_display or j == 0:
        # Normalize Unicode for proper Khmer display
        display_text = unicodedata.normalize("NFC", token_buffer)
        print(display_text, end="", flush=True)
        # Reset buffer
        token_buffer = ""
        token_count = 0

# Handle any remaining text in the buffer
if token_buffer:
    display_text = unicodedata.normalize("NFC", token_buffer)
    print(display_text, end="", flush=True)

thread.join()
print()  # Final newline
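A fixed chunk count can still flush in the middle of a Khmer syllable, for example between a base consonant and a subscript or vowel sign that arrives in the next chunk. One possible refinement, sketched below, is to hold back the last cluster of the buffer until more text arrives. The helper split_at_last_cluster is a hypothetical name, and its cluster detection is a simplification based on Unicode general categories and the coeng sign, not a full grapheme segmentation.

```python
import unicodedata

COENG = "\u17d2"  # Khmer sign coeng: stacks the following consonant as a subscript

def split_at_last_cluster(buffer: str) -> tuple[str, str]:
    """Return (ready, held): text safe to print now, and the trailing cluster to keep back."""
    for i in range(len(buffer) - 1, -1, -1):
        is_mark = unicodedata.category(buffer[i]) in ("Mn", "Mc", "Me")
        follows_coeng = i > 0 and buffer[i - 1] == COENG
        if not is_mark and not follows_coeng:
            # buffer[i] starts the last cluster; more marks for it may still arrive
            return buffer[:i], buffer[i:]
    return "", buffer  # only marks or subscripts so far: hold everything

# "សួស្តី" written with escapes: SA, vowel UA, SA, COENG, TA, vowel II
ready, held = split_at_last_cluster("\u179f\u17bd\u179f\u17d2\u178f\u17b8")
print(repr(ready), repr(held))
```

In a streaming loop, ready would be printed immediately and held would be carried over and prepended to the next chunk, with whatever is left flushed once generation ends.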
Using with Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "metythorn/llama-3-8b-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Generate text
prompt = "ប្រទេសកម្ពុជា"  # Cambodia in Khmer
# Move the inputs to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
Batch Generation for Multiple Prompts
def generate_khmer_batch(prompts, max_new_tokens=256):
    FastLanguageModel.for_inference(model)
    # Decoder-only models should be left-padded for batched generation
    tokenizer.padding_side = "left"
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Keep only the newly generated tokens. Slicing the decoded string by
    # len(prompt) can misalign if decoding does not reproduce the prompt exactly.
    prompt_length = inputs["input_ids"].shape[1]
    responses = []
    for output in outputs:
        generated = tokenizer.decode(output[prompt_length:], skip_special_tokens=True)
        responses.append(generated.strip())
    return responses

# Example usage
prompts = ["សួស្តី", "ខ្ញុំឈ្មោះ", "ប្រទេសកម្ពុជា"]
results = generate_khmer_batch(prompts)
for prompt, result in zip(prompts, results):
    print(f"Input: {prompt}")
    print(f"Output: {result}")
    print("---")
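When prompts in a batch have different lengths, one way to separate the generated text from the prompt is to slice token ids rather than decoded strings. With left padding every prompt ends at the same column, so the new tokens start at the same index in each row. The toy illustration below uses plain lists. The token ids, the PAD value, and both helper names are invented for the sketch, and no model is involved.

```python
PAD = 0  # hypothetical pad token id

def left_pad(sequences: list[list[int]], pad_id: int = PAD) -> list[list[int]]:
    """Pad every sequence on the left to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [[pad_id] * (width - len(seq)) + seq for seq in sequences]

def strip_prompt(outputs: list[list[int]], prompt_width: int) -> list[list[int]]:
    """Drop the padded prompt columns, keeping only newly generated ids."""
    return [row[prompt_width:] for row in outputs]

toy_prompts = [[11, 12, 13], [21]]   # toy token ids for two prompts
padded = left_pad(toy_prompts)       # [[11, 12, 13], [0, 0, 21]]

# Pretend the model appended two new tokens to each row
toy_outputs = [padded[0] + [14, 15], padded[1] + [22, 23]]
new_tokens = strip_prompt(toy_outputs, prompt_width=len(padded[0]))
print(new_tokens)  # [[14, 15], [22, 23]]
```

With right padding the pad ids would sit between the shorter prompt and its continuation, so a single slice index would no longer work for every row.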
Model Performance
This model has undergone continued pretraining with the goal of understanding and generating Khmer text more effectively than the base Llama-3-8B model. The training focused on:
- Improved Khmer language understanding: Better comprehension of Khmer syntax and semantics
- Enhanced Khmer text generation: More natural and coherent Khmer text output
- Unicode handling: Proper support for Khmer combining characters and complex scripts
- Maintained multilingual capabilities: Preserves English and other language abilities
- Efficient inference: 4-bit quantization reduces memory use during generation
Special Features for Khmer
- Proper Unicode Normalization: Handles Khmer combining characters correctly
- Streaming Support: Includes optimized streaming generation for real-time applications
- Batch Processing: Efficient handling of multiple Khmer prompts simultaneously
- Context Awareness: Better understanding of Khmer cultural and linguistic context
Recommended Usage Patterns
- Use the Khmer-optimized streaming for real-time chat applications
- Use batch generation for processing multiple texts efficiently
- Use simple generation for basic text completion tasks
- Buffer tokens (3-5) when streaming to ensure proper Khmer character display
Limitations and Biases
- The model's performance is limited by the quality and size of the training dataset
- May exhibit biases present in the training data
- Performance may vary for different Khmer dialects or specialized domains
- 4-bit quantization may slightly impact model quality compared to full precision
- Khmer-specific limitations:
  - Streaming requires token buffering for proper Unicode character display
  - Performance may vary with different Khmer romanization systems
  - Limited understanding of very specialized Khmer terminology
  - May occasionally mix Khmer and English in responses
Important Notes for Khmer Usage
⚠️ Streaming Considerations: When implementing streaming generation with Khmer text, always use token buffering (3-5 tokens) and Unicode normalization to prevent broken character display.
✅ Best Practices:
- Use skip_prompt=True in TextIteratorStreamer for cleaner output
- Apply unicodedata.normalize('NFC', text) for proper Khmer character composition
- Set pad_token_id=tokenizer.eos_token_id to avoid generation issues
- Use temperature 0.7-0.9 for more natural Khmer text generation
Technical Specifications
- Model Size: ~4.5GB (4-bit quantized)
- Architecture: Llama-3-8B with LoRA adapters
- Precision: 4-bit quantization with LoRA in higher precision
- Memory Requirements: ~6-8GB VRAM for inference
- Framework: Compatible with Transformers and Unsloth
Citation
If you use this model in your research, please cite:
@misc{llama3-8b-khmer-2024,
  title={Llama-3-8B Fine-tuned on Khmer Corpus},
  author={metythorn},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/metythorn/llama-3-8b-bnb-4bit}
}
Acknowledgments
- Meta AI for the Llama-3 model
- Unsloth team for the efficient fine-tuning framework
- The Khmer corpus dataset contributors
License
This model is released under the same license as the base Llama-3 model. Please refer to the Llama-3 license for more details.