Gemma2-2B Tamil 16-bit Instruct

A fine-tuned Tamil instruction-following model based on Google's Gemma2-2B

This model is a specialized version of Google's Gemma2-2B, fine-tuned specifically for Tamil-language instruction-following tasks. It is optimized to understand and respond to instructions in Tamil while maintaining its capabilities in English.

Model Details

  • Model Type: Causal Language Model (Instruct-tuned)
  • Base Model: google/gemma-2-2b
  • Language: Tamil (primary), English (secondary)
  • Parameters: 2.6B
  • Precision: 16-bit (FP16)
  • Training Framework: Unsloth
  • Fine-tuning Method: Supervised Fine-Tuning (SFT)

Training Details

Datasets Used

This model was trained on high-quality Tamil instruction-following datasets, including the Tamil-Llama instruction data by Abhinand Balachandran (see Dataset Citations below).

Training Configuration

  • Fine-tuning Method: LoRA (Low-Rank Adaptation) with Unsloth (see the configuration sketch after this list)
  • Sequence Length: 2048 tokens
  • Batch Size: 8 (effective batch size with gradient accumulation)
  • Learning Rate: 2e-4
  • Training Steps: 200+
  • Optimizer: AdamW with 8-bit optimization
  • Hardware: GPU with 15GB VRAM
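
For reference, the sketch below shows how a run matching this configuration might be set up with Unsloth and TRL. The dataset name, LoRA rank, and target modules are illustrative assumptions, not the exact training script, and API details vary across Unsloth/TRL versions.

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load the base model with Unsloth (dtype is auto-detected)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-2-2b",
    max_seq_length=2048,
)

# Attach LoRA adapters; rank and alpha here are typical values, not confirmed ones
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Assumed dataset: the card cites the Tamil-Llama instruction data
dataset = load_dataset("abhinand/tamil-alpaca", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # assumes prompts pre-formatted with the Alpaca template
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size of 8
        learning_rate=2e-4,
        max_steps=200,
        optim="adamw_8bit",
        fp16=True,
        output_dir="outputs",
    ),
)
trainer.train()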

Capabilities

This model excels at:

  • 📝 Tamil Text Generation - Natural and fluent Tamil text creation
  • ❓ Question Answering - Answering questions in Tamil across various domains
  • 💻 Code Generation - Writing Python code with Tamil explanations
  • 🧮 Mathematical Reasoning - Solving math problems with Tamil explanations
  • 📚 Literature & Culture - Tamil literature, history, and cultural knowledge
  • 🔄 Translation - English ↔ Tamil translation tasks
  • 💬 Conversational AI - Natural dialogue in Tamil

Usage

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "sabaridsnfuji/gemma2-2b-tamil-16bit-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Alpaca prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Example usage
def generate_response(instruction, input_text="", max_new_tokens=512):
    prompt = alpaca_prompt.format(instruction, input_text, "")
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    # Decode only the newly generated tokens, skipping the prompt
    generated = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True).strip()

# Test the model
instruction = "தமிழ் மொழியின் சிறப்புகள் என்ன?"  # "What are the distinctive features of the Tamil language?"
response = generate_response(instruction)
print(response)

Streaming Generation

from transformers import TextStreamer

# Initialize text streamer for real-time output
text_streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

def stream_response(instruction, input_text=""):
    prompt = alpaca_prompt.format(instruction, input_text, "")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    print("Tamil Response:")
    print("-" * 30)
    
    model.generate(
        **inputs,
        streamer=text_streamer,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Example with streaming
stream_response("பைத்தானில் ஒரு எளிய for loop எழுதவும்")  # "Write a simple for loop in Python"

Batch Processing

def generate_batch_responses(instructions, max_new_tokens=512):
    """Generate a response for each instruction, processed sequentially."""
    results = []
    
    for instruction in instructions:
        prompt = alpaca_prompt.format(instruction, "", "")
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.7,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id
            )
        
        # Decode only the newly generated tokens, skipping the prompt
        generated = outputs[0][inputs["input_ids"].shape[1]:]
        clean_response = tokenizer.decode(generated, skip_special_tokens=True).strip()
        results.append(clean_response)
    
    return results

# Test with multiple instructions
tamil_instructions = [
    "கோடையில் குளிர்ச்சியாக இருக்க என்ன செய்யலாம்?",  # "What can one do to stay cool in summer?"
    "தமிழ் இலக்கியத்தில் திருக்குறளின் முக்கியத்துவம் என்ன?",  # "What is the importance of the Thirukkural in Tamil literature?"
    "சுற்றுச்சூழல் பாதுகாப்பு ஏன் முக்கியம்?"  # "Why is environmental protection important?"
]

responses = generate_batch_responses(tamil_instructions)
for i, response in enumerate(responses):
    print(f"Q{i+1}: {tamil_instructions[i]}")
    print(f"A{i+1}: {response}\n")

Model Performance

Qualitative Assessment

No formal benchmark scores are reported for this model; the points below are qualitative observations:

  • Tamil Language Understanding: Comparable to specialized Tamil models
  • Code Generation: Strong performance in Python with Tamil explanations
  • Mathematical Reasoning: Accurate solutions with clear Tamil explanations
  • Cultural Knowledge: Deep understanding of Tamil culture and literature

Sample Outputs

Example 1: Mathematical Problem

Input: ஒரு பாதி மற்றும் மூன்றில் ஒரு பங்கின் கூட்டுத்தொகையைக் கணக்கிடுங்கள் ("Calculate the sum of one half and one third")

Output:

1/2 + 1/3 இன் கூட்டுத்தொகையைக் கணக்கிட வேண்டும்.

முதலில் இரு பின்னங்களுக்கும் பொதுவான கணக்காளர் கண்டுபிடிக்க வேண்டும்:
- 2 மற்றும் 3 இன் மீச்சிறு பொதுமடங்கு = 6

எனவே:
1/2 = 3/6
1/3 = 2/6

கூட்டுத்தொகை: 3/6 + 2/6 = 5/6

விடை: 5/6

Example 2: Code Generation

Input: பைத்தானில் ஒரு எளிய பைபோனச்சி வரிசை நிரல் எழுதவும் ("Write a simple Fibonacci sequence program in Python")

Output:

def fibonacci(n):
    """பைபோனச்சி வரிசையை உருவாக்கும் செயல்பாடு"""
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)

# முதல் 10 பைபோனச்சி எண்களை அச்சிடுதல்
print("பைபோனச்சி வரிசை:")
for i in range(10):
    print(f"F({i}) = {fibonacci(i)}")

Example 3: Cultural Knowledge

Input: திருக்குறள் ஏன் உலகப் பொதுமறை என்று அழைக்கப்படுகிறது? ("Why is the Thirukkural called the universal scripture?")

Output:

திருக்குறள் "உலகப் பொதுமறை" என்று அழைக்கப்படுவதற்கு பல காரணங்கள் உள்ளன:

1. **சர்வமத ஏற்பு**: அனைத்து மதங்களும் ஏற்கக்கூடிய நெறிமுறைகள்
2. **காலவேறுபாடின்மை**: எந்தக் காலத்திலும் பொருந்தும் நன்னெறிகள்
3. **உலகளாவிய மதிப்புகள்**: மனித அறம், நேர்மை, அன்பு போன்ற பொதுவான கோட்பாடுகள்
4. **எளிமையான மொழி**: இரண்டு வரிகளில் ஆழமான கருத்துகள்
5. **வாழ்க்கை வழிகாட்டி**: அன்றாட வாழ்வில் பின்பற்றக்கூடிய நெறிகள்

இதனால் உலகத்தின் பொதுவான அறநூலாக திருக்குறள் போற்றப்படுகிறது.

Limitations

  • Context Length: Limited to 2048 tokens (see the truncation sketch after this list)
  • Domain Specialization: May require additional fine-tuning for highly specialized domains
  • Resource Requirements: Requires GPU for optimal performance
  • Language Mixing: Occasional code-switching between Tamil and English
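
Because the context window is 2048 tokens, long prompts should be truncated explicitly at tokenization time. A minimal sketch, reusing the tokenizer, model, and alpaca_prompt from Quick Start:

# Reserve room for generation: cap the prompt at 2048 minus max_new_tokens
prompt = alpaca_prompt.format("தமிழ் மொழியின் சிறப்புகள் என்ன?", "", "")
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,
    max_length=2048 - 512,
).to(model.device)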

Technical Specifications

  • Architecture: Gemma2 (RMSNorm, GeGLU, RoPE)
  • Attention: Grouped-query attention (GQA)
  • Vocabulary: 256,000 tokens (includes Tamil script)
  • Training Precision: Mixed precision (FP16)
  • Inference: Optimized for GPU inference
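
These figures can be cross-checked against the published model config; a quick sketch:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("sabaridsnfuji/gemma2-2b-tamil-16bit-instruct")
print(cfg.num_attention_heads, cfg.num_key_value_heads)  # attention / GQA layout
print(cfg.vocab_size)                                    # vocabulary size
print(cfg.max_position_embeddings)                       # base architecture's position-embedding limit (fine-tuned at 2048)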

Hardware Requirements

Minimum Requirements

  • GPU: 4GB VRAM (with 4-bit quantization; see the loading sketch below)
  • RAM: 8GB system RAM
  • Storage: 5GB free space
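
To fit the 4GB-VRAM minimum, the model can be loaded in 4-bit with bitsandbytes. A minimal sketch; the quantization settings are common defaults, not values taken from this card:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_name = "sabaridsnfuji/gemma2-2b-tamil-16bit-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)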

Recommended Requirements

  • GPU: 8GB+ VRAM (RTX 3070/4060 or better)
  • RAM: 16GB system RAM
  • Storage: 10GB free space

Installation

# Install required packages
pip install transformers torch accelerate

# For 8-bit optimizers and 4-bit quantization
pip install bitsandbytes

Citation

If you use this model in your research or applications, please cite:

@misc{gemma2-tamil-instruct-2024,
  title={Gemma2-2B Tamil 16-bit Instruct: A Fine-tuned Tamil Instruction-Following Model},
  author={Sabarids N Fuji},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/sabaridsnfuji/gemma2-2b-tamil-16bit-instruct}
}

Base Model Citation

@article{gemma_2024,
  title={Gemma: Open Models Based on Gemini Research and Technology},
  author={Gemma Team},
  year={2024},
  journal={arXiv preprint arXiv:2403.08295}
}

Dataset Citations

@misc{balachandran2023tamilllama,
  title={Tamil-Llama: A New Tamil Language Model Based on Llama 2},
  author={Abhinand Balachandran},
  year={2023},
  eprint={2311.05845},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

License

This model is released under the Apache 2.0 License. Note that it is based on Gemma2, so it is also subject to the Gemma Terms of Use.

Disclaimer

This model is for research and educational purposes. Please ensure responsible use and consider potential biases in the training data. The model may occasionally generate incorrect or biased content.

Acknowledgments

  • Google: For the base Gemma2-2B model
  • Unsloth Team: For the efficient fine-tuning framework
  • Abhinand Balachandran: For the Tamil datasets and evaluation framework
  • Tamil NLP Community: For ongoing support and contributions

Support

For issues, questions, or contributions, please open a discussion on the model's Hugging Face page.

Made with ❤️ for the Tamil NLP community
