---
language:
- km
- en
library_name: unsloth
license: llama3
base_model: unsloth/llama-3-8b-bnb-4bit
tags:
- khmer
- cambodian
- llama-3
- continue-pretraining
- unsloth
- lora
- text-generation
datasets:
- metythorn/khmer-corpus
model-index:
- name: llama-3-8b-bnb-4bit-khmer
  results: []
---

# Llama-3-8B Continued Pretraining on a Khmer Corpus

This model is a continued-pretraining version of [unsloth/llama-3-8b-bnb-4bit](https://huggingface.co/unsloth/llama-3-8b-bnb-4bit) trained on the [metythorn/khmer-corpus](https://huggingface.co/datasets/metythorn/khmer-corpus) dataset.

## Model Description

This is a Llama-3-8B model that has been continually pretrained with the Unsloth framework to improve performance on Khmer (Cambodian) text generation tasks. The model uses LoRA (Low-Rank Adaptation) for efficient training on top of 4-bit quantization.

## Training Details

### Training Data

- **Dataset**: [metythorn/khmer-corpus](https://huggingface.co/datasets/metythorn/khmer-corpus)
- **Language**: Primarily Khmer, with some English
- **Dataset Split**: Training split

### Training Configuration

- **Base Model**: unsloth/llama-3-8b-bnb-4bit
- **Training Framework**: Unsloth with LoRA
- **Quantization**: 4-bit (bnb-4bit)
- **Max Sequence Length**: 2048
- **LoRA Rank (r)**: 128
- **LoRA Alpha**: 32
- **LoRA Dropout**: 0
- **Target Modules**: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens, lm_head
- **Use RSLoRA**: True
- **Gradient Checkpointing**: unsloth

### Training Hyperparameters

- **Epochs**: 1
- **Batch Size**: 2 (per device)
- **Gradient Accumulation Steps**: 8
- **Learning Rate**: 5e-5
- **Embedding Learning Rate**: 5e-6
- **Warmup Ratio**: 0.1
- **Optimizer**: adamw_8bit
- **LR Scheduler**: cosine
- **Weight Decay**: 0.0
- **Seed**: 3407
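For reference, the configuration above maps onto Unsloth's continued-pretraining API roughly as follows. This is a minimal sketch reconstructed from the listed hyperparameters, not the original training script: the dataset text column name (`text`) and `output_dir` are assumptions, and exact trainer arguments may differ across Unsloth/TRL versions.

```python
from unsloth import FastLanguageModel, UnslothTrainer, UnslothTrainingArguments
from datasets import load_dataset

# Load the 4-bit base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,  # auto-detect
    load_in_4bit=True,
)

# Attach LoRA adapters; embed_tokens and lm_head are included so the
# embeddings can adapt to Khmer during continued pretraining
model = FastLanguageModel.get_peft_model(
    model,
    r=128,
    lora_alpha=32,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",
                    "embed_tokens", "lm_head"],
    use_rslora=True,
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

dataset = load_dataset("metythorn/khmer-corpus", split="train")

trainer = UnslothTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # assumed column name
    max_seq_length=2048,
    args=UnslothTrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=5e-5,
        embedding_learning_rate=5e-6,  # lower LR for embed_tokens / lm_head
        warmup_ratio=0.1,
        optim="adamw_8bit",
        lr_scheduler_type="cosine",
        weight_decay=0.0,
        seed=3407,
        output_dir="outputs",  # assumed
    ),
)
trainer.train()
```

`UnslothTrainingArguments` adds `embedding_learning_rate`, which lets the embedding and output layers train at a lower rate than the LoRA adapters, the usual recipe when extending a model to a new script.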
## Usage

### Basic Usage with Unsloth

```python
from unsloth import FastLanguageModel
import torch

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="metythorn/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,  # None for auto detection
    load_in_4bit=True,
)

# Enable inference mode
FastLanguageModel.for_inference(model)

# Simple generation
prompt = "សួស្តី"  # "Hello" in Khmer
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Khmer-Optimized Streaming Generation

For proper Khmer text streaming that handles Unicode combining characters:

```python
from transformers import TextIteratorStreamer
from threading import Thread
import unicodedata

# Khmer-aware text streamer
text_streamer = TextIteratorStreamer(
    tokenizer,
    skip_prompt=True,          # Skip the input prompt
    skip_special_tokens=True,  # Skip special tokens
)

# Buffer a few tokens before displaying so Khmer combining
# characters are not printed as broken half-clusters
token_buffer = ""
buffer_size = 3

# Before running inference
FastLanguageModel.for_inference(model)
inputs = tokenizer(["ហាយ"], return_tensors="pt").to("cuda")  # "Hi" in Khmer

generation_kwargs = dict(
    inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Run generation in a background thread so we can consume the stream
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

token_count = 0
for j, new_text in enumerate(text_streamer):
    # Add new text to the buffer
    token_buffer += new_text
    token_count += 1

    # Flush once enough tokens have accumulated
    # (the first token is flushed immediately for responsiveness)
    if token_count >= buffer_size or j == 0:
        # Normalize Unicode for proper Khmer display
        display_text = unicodedata.normalize("NFC", token_buffer)
        print(display_text, end="", flush=True)
        # Reset the buffer
        token_buffer = ""
        token_count = 0

# Handle any remaining tokens in the buffer
if token_buffer:
    print(unicodedata.normalize("NFC", token_buffer), end="", flush=True)

thread.join()
print()  # Final newline
```
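To see why buffering matters, note that a single visible Khmer syllable is typically composed of several Unicode code points (a base consonant plus coeng, vowel, and diacritic signs), and a token boundary can fall anywhere inside that sequence. The illustrative, stdlib-only snippet below prints the code points of one Khmer word:

```python
import unicodedata

word = "ខ្ញុំ"  # "I" in Khmer: one visual cluster, several code points
for ch in word:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch, 'UNKNOWN')}")
```

If a flush lands between a consonant and its vowel or diacritic sign, many renderers display a dotted-circle placeholder; buffering a few tokens before printing makes this far less likely.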
### Using with Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "metythorn/llama-3-8b-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Generate text
prompt = "ប្រទេសកម្ពុជា"  # "Cambodia" in Khmer
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Batch Generation for Multiple Prompts

```python
def generate_khmer_batch(prompts, max_new_tokens=256):
    FastLanguageModel.for_inference(model)

    # Decoder-only models should be left-padded for batched generation
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Slice off the prompt tokens so only the newly generated text is decoded
    prompt_length = inputs["input_ids"].shape[1]
    responses = [
        tokenizer.decode(output[prompt_length:], skip_special_tokens=True).strip()
        for output in outputs
    ]
    return responses

# Example usage
prompts = ["សួស្តី", "ខ្ញុំឈ្មោះ", "ប្រទេសកម្ពុជា"]
results = generate_khmer_batch(prompts)
for prompt, result in zip(prompts, results):
    print(f"Input: {prompt}")
    print(f"Output: {result}")
    print("---")
```

## Model Performance

This model has been continually pretrained to understand and generate Khmer text more effectively than the base Llama-3-8B model. The training focused on:

- **Improved Khmer language understanding**: Better comprehension of Khmer syntax and semantics
- **Enhanced Khmer text generation**: More natural and coherent Khmer text output
- **Unicode handling**: Proper support for Khmer combining characters and complex scripts
- **Maintained multilingual capabilities**: Preserves English and other language abilities
- **Efficient inference**: Optimized with 4-bit quantization for faster generation

### Special Features for Khmer

- **Proper Unicode Normalization**: Handles Khmer combining characters correctly
- **Streaming Support**: Includes optimized streaming generation for real-time applications
- **Batch Processing**: Efficient handling of multiple Khmer prompts simultaneously
- **Context Awareness**: Better understanding of Khmer cultural and linguistic context

### Recommended Usage Patterns

- Use the **Khmer-optimized streaming** for real-time chat applications
- Use **batch generation** for processing multiple texts efficiently
- Use **simple generation** for basic text completion tasks
- Buffer tokens (3-5) when streaming to ensure proper Khmer character display

## Limitations and Biases

- The model's performance is limited by the quality and size of the training dataset
- May exhibit biases present in the training data
- Performance may vary across Khmer dialects and specialized domains
- 4-bit quantization may slightly reduce quality compared to full precision
- **Khmer-specific limitations**:
  - Streaming requires token buffering for proper Unicode character display
  - Performance may vary with different Khmer romanization systems
  - Limited understanding of highly specialized Khmer terminology
  - May occasionally mix Khmer and English in responses

## Important Notes for Khmer Usage

⚠️ **Streaming Considerations**: When implementing streaming generation with Khmer text, always use token buffering (3-5 tokens) and Unicode normalization to prevent broken character display.

✅ **Best Practices**:

- Use `skip_prompt=True` in `TextIteratorStreamer` for cleaner output
- Apply `unicodedata.normalize('NFC', text)` for proper Khmer character composition
- Set `pad_token_id=tokenizer.eos_token_id` to avoid generation issues
- Use a temperature of 0.7-0.9 for more natural Khmer text generation

## Technical Specifications

- **Model Size**: ~4.5 GB (4-bit quantized)
- **Architecture**: Llama-3-8B with LoRA adapters
- **Precision**: 4-bit quantized base weights with LoRA adapters in higher precision
- **Memory Requirements**: ~6-8 GB VRAM for inference
- **Framework**: Compatible with Transformers and Unsloth

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{llama3-8b-khmer-2024,
  title={Llama-3-8B Continually Pretrained on a Khmer Corpus},
  author={metythorn},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/metythorn/llama-3-8b-bnb-4bit}
}
```

## Acknowledgments

- Meta AI for the Llama-3 model
- The Unsloth team for the efficient fine-tuning framework
- The Khmer corpus dataset contributors

## License

This model is released under the same license as the base Llama-3 model. Please refer to the [Llama-3 license](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE) for more details.