---
language:
  - km
  - en
library_name: unsloth
license: llama3
base_model: unsloth/llama-3-8b-bnb-4bit
tags:
  - khmer
  - cambodian
  - llama-3
  - continue-pretraining
  - unsloth
  - lora
  - text-generation
datasets:
  - metythorn/khmer-corpus
model-index:
  - name: llama-3-8b-bnb-4bit-khmer
    results: []
---

Llama-3-8B Continued Pretraining on Khmer Corpus

This model is a continued-pretraining version of unsloth/llama-3-8b-bnb-4bit, trained on the metythorn/khmer-corpus dataset.

Model Description

This is a Llama-3-8B model that has been continually pretrained with the Unsloth framework to improve performance on Khmer (Cambodian) text generation tasks. The model uses LoRA (Low-Rank Adaptation) with 4-bit quantization for memory-efficient training.

Training Details

Training Data

The model was continually pretrained on the metythorn/khmer-corpus dataset, a corpus of Khmer-language text.

Training Configuration

  • Base Model: unsloth/llama-3-8b-bnb-4bit
  • Training Framework: Unsloth with LoRA
  • Quantization: 4-bit (bnb-4bit)
  • Max Sequence Length: 2048
  • LoRA Rank (r): 128
  • LoRA Alpha: 32
  • LoRA Dropout: 0
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens, lm_head
  • Use RSLoRA: True
  • Gradient Checkpointing: unsloth
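
As a rough reference, the configuration above maps onto Unsloth's API as follows (a sketch, not the exact training script; the argument values come from the list above, while reusing the training seed as random_state is an assumption):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,         # auto-detect (bf16 on Ampere+, otherwise fp16)
    load_in_4bit=True,  # bnb 4-bit quantization
)

# Attach LoRA adapters to the attention, MLP, embedding, and output layers
model = FastLanguageModel.get_peft_model(
    model,
    r=128,
    lora_alpha=32,
    lora_dropout=0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "embed_tokens", "lm_head",  # also trained for continued pretraining
    ],
    use_rslora=True,                       # rank-stabilized LoRA
    use_gradient_checkpointing="unsloth",
    random_state=3407,                     # assumption: the training seed
)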

Training Hyperparameters

  • Epochs: 1
  • Batch Size: 2 (per device)
  • Gradient Accumulation Steps: 8
  • Learning Rate: 5e-5
  • Embedding Learning Rate: 5e-6
  • Warmup Ratio: 0.1
  • Optimizer: adamw_8bit
  • LR Scheduler: cosine
  • Weight Decay: 0.0
  • Seed: 3407
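
Under the same assumptions, the hyperparameters above plug into Unsloth's continued-pretraining trainer roughly like this (a sketch continuing from the previous block; the dataset split and the "text" column name are assumptions):

import torch
from datasets import load_dataset
from unsloth import UnslothTrainer, UnslothTrainingArguments

dataset = load_dataset("metythorn/khmer-corpus", split="train")

trainer = UnslothTrainer(
    model=model,                # model and tokenizer from the sketch above
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # assumption: name of the raw-text column
    max_seq_length=2048,
    args=UnslothTrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=5e-5,
        embedding_learning_rate=5e-6,  # slower LR for embed_tokens / lm_head
        warmup_ratio=0.1,
        optim="adamw_8bit",
        lr_scheduler_type="cosine",
        weight_decay=0.0,
        seed=3407,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        output_dir="outputs",
    ),
)
trainer.train()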

Usage

Basic Usage with Unsloth

from unsloth import FastLanguageModel
import torch

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="metythorn/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,  # None for auto detection
    load_in_4bit=True,
)

# Enable inference mode
FastLanguageModel.for_inference(model)

# Simple generation
prompt = "αžŸαž½αžŸαŸ’αžαžΈ"  # "Hello" in Khmer
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Khmer-Optimized Streaming Generation

For proper Khmer text streaming that handles Unicode combining characters:

from transformers import TextIteratorStreamer
from threading import Thread
import unicodedata

# Khmer-aware text streamer
text_streamer = TextIteratorStreamer(
    tokenizer,
    skip_prompt=True,  # Skip the input prompt
    skip_special_tokens=True  # Skip special tokens
)

# Buffer to collect tokens for proper Khmer display
token_buffer = ""
buffer_size = 3  # Collect a few tokens before displaying

# Before running inference
FastLanguageModel.for_inference(model)

inputs = tokenizer(["αž αžΆαž™"], return_tensors="pt").to("cuda")

generation_kwargs = dict(
    inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

token_count = 0

for j, new_text in enumerate(text_streamer):
    # Add new text to buffer
    token_buffer += new_text
    token_count += 1
    
    # Display the first chunk immediately for responsiveness, then only
    # once the buffer holds enough tokens for stable rendering
    should_display = token_count >= buffer_size
    
    if should_display or j == 0:
        # Normalize Unicode for proper Khmer display
        display_text = unicodedata.normalize('NFC', token_buffer)
        print(display_text, end="", flush=True)
        
        # Reset buffer
        token_buffer = ""
        token_count = 0

# Handle any remaining tokens in buffer
if token_buffer:
    display_text = unicodedata.normalize('NFC', token_buffer)
    print(display_text, end="", flush=True)

thread.join()
print()  # Final newline

Using with Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "metythorn/llama-3-8b-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Generate text
prompt = "αž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆ"  # Cambodia in Khmer
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # keep inputs on the model's device
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Batch Generation for Multiple Prompts

def generate_khmer_batch(prompts, max_new_tokens=256):
    FastLanguageModel.for_inference(model)
    
    # Decoder-only models must be left-padded so generation continues
    # directly from the last real token of every prompt
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    # Decode only the newly generated tokens; with left padding every
    # prompt occupies the same number of leading positions
    prompt_length = inputs["input_ids"].shape[1]
    responses = [
        tokenizer.decode(output[prompt_length:], skip_special_tokens=True).strip()
        for output in outputs
    ]
    return responses

# Example usage
prompts = ["αžŸαž½αžŸαŸ’αžαžΈ", "αžαŸ’αž‰αž»αŸ†αžˆαŸ’αž˜αŸ„αŸ‡", "αž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆ"]
results = generate_khmer_batch(prompts)
for prompt, result in zip(prompts, results):
    print(f"Input: {prompt}")
    print(f"Output: {result}")
    print("---")

Model Performance

This model has been continually pretrained to understand and generate Khmer text more effectively than the base Llama-3-8B model. The training focused on:

  • Improved Khmer language understanding: Better comprehension of Khmer syntax and semantics
  • Enhanced Khmer text generation: More natural and coherent Khmer text output
  • Unicode handling: Proper support for Khmer combining characters and complex scripts
  • Maintained multilingual capabilities: Preserves English and other language abilities
  • Efficient inference: Optimized with 4-bit quantization for faster generation

Special Features for Khmer

  • Proper Unicode Normalization: Handles Khmer combining characters correctly
  • Streaming Support: Includes optimized streaming generation for real-time applications
  • Batch Processing: Efficient handling of multiple Khmer prompts simultaneously
  • Context Awareness: Better understanding of Khmer cultural and linguistic context

Recommended Usage Patterns

  • Use the Khmer-optimized streaming for real-time chat applications
  • Use batch generation for processing multiple texts efficiently
  • Use simple generation for basic text completion tasks
  • Buffer tokens (3-5) when streaming to ensure proper Khmer character display (see the helper sketch below)
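
The buffering pattern from the streaming example can be factored into a small reusable generator. This is a sketch (the stream_khmer name is ours, not part of any library) that wraps any TextIteratorStreamer:

import unicodedata

def stream_khmer(streamer, buffer_size=4):
    # Accumulate a few raw chunks so Khmer combining marks (vowel signs,
    # COENG clusters) are emitted together with their base consonant,
    # then NFC-normalize before handing the chunk to the caller.
    buffer, count = "", 0
    for piece in streamer:
        buffer += piece
        count += 1
        if count >= buffer_size:
            yield unicodedata.normalize("NFC", buffer)
            buffer, count = "", 0
    if buffer:  # flush whatever remains once generation ends
        yield unicodedata.normalize("NFC", buffer)

# Usage with the TextIteratorStreamer from the streaming example:
# for chunk in stream_khmer(text_streamer):
#     print(chunk, end="", flush=True)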

Limitations and Biases

  • The model's performance is limited by the quality and size of the training dataset
  • May exhibit biases present in the training data
  • Performance may vary for different Khmer dialects or specialized domains
  • 4-bit quantization may slightly impact model quality compared to full precision
  • Khmer-specific limitations:
    • Streaming requires token buffering for proper Unicode character display
    • Performance may vary with different Khmer romanization systems
    • Limited understanding of very specialized Khmer terminology
    • May occasionally mix Khmer and English in responses

Important Notes for Khmer Usage

⚠️ Streaming Considerations: When implementing streaming generation with Khmer text, always use token buffering (3-5 tokens) and Unicode normalization to prevent broken character display.

βœ… Best Practices:

  • Use skip_prompt=True in TextIteratorStreamer for cleaner output
  • Apply unicodedata.normalize('NFC', text) for proper Khmer character composition
  • Set pad_token_id=tokenizer.eos_token_id to avoid generation issues
  • Use temperature 0.7-0.9 for more natural Khmer text generation

Technical Specifications

  • Model Size: ~4.5GB (4-bit quantized)
  • Architecture: Llama-3-8B with LoRA adapters
  • Precision: 4-bit quantization with LoRA in higher precision
  • Memory Requirements: ~6-8GB VRAM for inference
  • Framework: Compatible with Transformers and Unsloth
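
To check whether the stated VRAM budget is available before loading, a quick PyTorch query (a sketch) is:

import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes on the current device
    print(f"Free VRAM: {free / 1024**3:.1f} / {total / 1024**3:.1f} GiB")
else:
    print("No CUDA device found; bnb 4-bit inference requires a GPU.")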

Citation

If you use this model in your research, please cite:

@misc{llama3-8b-khmer-2024,
  title={Llama-3-8B Fine-tuned on Khmer Corpus},
  author={metythorn},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/metythorn/llama-3-8b-bnb-4bit}
}

Acknowledgments

  • Meta AI for the Llama-3 model
  • Unsloth team for the efficient fine-tuning framework
  • The Khmer corpus dataset contributors

License

This model is released under the same license as the base Llama-3 model. Please refer to the Llama-3 license for more details.