Gemma2-2B Tamil 16-bit Instruct
A fine-tuned Tamil instruction-following model based on Google's Gemma2-2B
This model is a specialized version of Google's Gemma2-2B, fine-tuned specifically for Tamil language instruction following tasks. It has been optimized to understand and respond to instructions in Tamil while maintaining capabilities in English.
Model Details
- Model Type: Causal Language Model (Instruct-tuned)
- Base Model: google/gemma-2-2b
- Language: Tamil (primary), English (secondary)
- Parameters: 2.6B
- Precision: 16-bit (FP16)
- Training Framework: Unsloth
- Fine-tuning Method: Supervised Fine-Tuning (SFT)
Training Details
Datasets Used
This model was trained on high-quality Tamil instruction-following datasets:
- abhinand/tamil-alpaca - Tamil-translated version of the Alpaca dataset
- abhinand/tamil-alpaca-orca - Tamil-translated subset of the OpenOrca dataset
- Custom Tamil instruction dataset - Additional Tamil language tasks
Training Configuration
- Fine-tuning Method: LoRA (Low-Rank Adaptation) with Unsloth (see the sketch below)
- Sequence Length: 2048 tokens
- Effective Batch Size: 8 (via gradient accumulation)
- Learning Rate: 2e-4
- Training Steps: 200+
- Optimizer: AdamW with 8-bit optimization
- Hardware: single GPU with ~15GB VRAM
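For reference, a minimal Unsloth training setup matching these hyperparameters might look like the sketch below. This is an illustrative reconstruction, not the exact training script: train_dataset is a placeholder for the Alpaca-formatted Tamil data, and the LoRA rank, target modules, and 4-bit loading are assumptions based on common Unsloth defaults.

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load the base model through Unsloth (4-bit loading assumed, to fit ~15GB VRAM)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-2-2b",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters (rank and target modules are assumptions)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # placeholder: Alpaca-formatted Tamil instructions
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size of 8
        learning_rate=2e-4,
        max_steps=200,
        optim="adamw_8bit",
        fp16=True,
        output_dir="outputs",
    ),
)
trainer.train()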
Capabilities
This model excels at:
- 📝 Tamil Text Generation - Natural and fluent Tamil text creation
- ❓ Question Answering - Answering questions in Tamil across various domains
- 💻 Code Generation - Writing Python code with Tamil explanations
- 🧮 Mathematical Reasoning - Solving math problems with Tamil explanations
- 📚 Literature & Culture - Tamil literature, history, and cultural knowledge
- 🔄 Translation - English ↔ Tamil translation tasks
- 💬 Conversational AI - Natural dialogue in Tamil
Usage
Quick Start
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "sabaridsnfuji/gemma2-2b-tamil-16bit-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Alpaca prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Example usage
def generate_response(instruction, input_text="", max_new_tokens=512):
    prompt = alpaca_prompt.format(instruction, input_text, "")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens; slicing the decoded string by
    # len(prompt) is unreliable because tokenization can alter the prompt text
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Test the model ("What are the special features of the Tamil language?")
instruction = "தமிழ் மொழியின் சிறப்புகள் என்ன?"
response = generate_response(instruction)
print(response)
Streaming Generation
from transformers import TextStreamer

# Initialize text streamer for real-time output
text_streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

def stream_response(instruction, input_text=""):
    prompt = alpaca_prompt.format(instruction, input_text, "")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    print("Tamil Response:")
    print("-" * 30)
    model.generate(
        **inputs,
        streamer=text_streamer,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Example with streaming ("Write a simple for loop in Python")
stream_response("பைத்தானில் ஒரு எளிய for loop எழுதவும்")
Batch Processing
def generate_batch_responses(instructions, max_new_tokens=512):
    """Generate responses for a list of instructions, one prompt at a time."""
    results = []
    for instruction in instructions:
        prompt = alpaca_prompt.format(instruction, "", "")
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.7,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id,
            )
        # Decode only the newly generated tokens
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        results.append(tokenizer.decode(new_tokens, skip_special_tokens=True).strip())
    return results

# Test with multiple instructions:
# 1. "What can one do to stay cool in summer?"
# 2. "What is the importance of the Thirukkural in Tamil literature?"
# 3. "Why is environmental protection important?"
tamil_instructions = [
    "கோடையில் குளிர்ச்சியாக இருக்க என்ன செய்யலாம்?",
    "தமிழ் இலக்கியத்தில் திருக்குறளின் முக்கியத்துவம் என்ன?",
    "சுற்றுச்சூழல் பாதுகாப்பு ஏன் முக்கியம்?"
]
responses = generate_batch_responses(tamil_instructions)
for i, response in enumerate(responses):
    print(f"Q{i+1}: {tamil_instructions[i]}")
    print(f"A{i+1}: {response}\n")
Model Performance
Qualitative Observations
No formal benchmark scores are reported for this checkpoint; the points below are qualitative observations (a simple perplexity probe is sketched after the list):
- Tamil Language Understanding: comparable to specialized Tamil models
- Code Generation: strong performance in Python with Tamil explanations
- Mathematical Reasoning: accurate solutions with clear Tamil explanations
- Cultural Knowledge: deep familiarity with Tamil culture and literature
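For a quick quantitative sanity check, you can run a perplexity probe on held-out Tamil text. This is a minimal sketch: sample_text is a placeholder (use a longer evaluation passage in practice), and perplexity is only a rough proxy for generation quality.

import torch

# Placeholder evaluation text; substitute a real held-out Tamil passage
sample_text = "தமிழ் உலகின் மிகப் பழமையான மொழிகளில் ஒன்று."

enc = tokenizer(sample_text, return_tensors="pt").to(model.device)
with torch.no_grad():
    # For a causal LM, passing labels == input_ids yields the mean next-token loss
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"Perplexity: {torch.exp(loss).item():.2f}")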
Sample Outputs
Example 1: Mathematical Problem
Input: ஒரு பாதி மற்றும் மூன்றில் ஒரு பங்கின் கூட்டுத்தொகையைக் கணக்கிடுங்கள் ("Calculate the sum of one half and one third")
Output:
1/2 + 1/3 இன் கூட்டுத்தொகையைக் கணக்கிட வேண்டும்.
முதலில் இரு பின்னங்களுக்கும் பொதுவான கணக்காளர் கண்டுபிடிக்க வேண்டும்:
- 2 மற்றும் 3 இன் மீச்சிறு பொதுமடங்கு = 6
எனவே:
1/2 = 3/6
1/3 = 2/6
கூட்டுத்தொகை: 3/6 + 2/6 = 5/6
விடை: 5/6
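The arithmetic can be verified in a couple of lines of Python:

from fractions import Fraction

# 1/2 + 1/3 = 3/6 + 2/6 = 5/6, matching the model's answer
print(Fraction(1, 2) + Fraction(1, 3))  # 5/6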
Example 2: Code Generation
Input: பைத்தானில் ஒரு எளிய பைபோனச்சி வரிசை நிரல் எழுதவும் ("Write a simple Fibonacci sequence program in Python")
Output:
def fibonacci(n):
    """பைபோனச்சி வரிசையை உருவாக்கும் செயல்பாடு"""
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)

# முதல் 10 பைபோனச்சி எண்களை அச்சிடுதல்
print("பைபோனச்சி வரிசை:")
for i in range(10):
    print(f"F({i}) = {fibonacci(i)}")
Example 3: Cultural Knowledge
Input: திருக்குறள் ஏன் உலகப் பொதுமறை என்று அழைக்கப்படுகிறது? ("Why is the Thirukkural called the universal scripture?")
Output:
திருக்குறள் "உலகப் பொதுமறை" என்று அழைக்கப்படுவதற்கு பல காரணங்கள் உள்ளன:
1. **சர்வமத ஏற்பு**: அனைத்து மதங்களும் ஏற்கக்கூடிய நெறிமுறைகள்
2. **காலவேறுபாடின்மை**: எந்தக் காலத்திலும் பொருந்தும் நன்னெறிகள்
3. **உலகளாவிய மதிப்புகள்**: மனித அறம், நேர்மை, அன்பு போன்ற பொதுவான கோட்பாடுகள்
4. **எளிமையான மொழி**: இரண்டு வரிகளில் ஆழமான கருத்துகள்
5. **வாழ்க்கை வழிகாட்டி**: அன்றாட வாழ்வில் பின்பற்றக்கூடிய நெறிகள்
இதனால் உலகத்தின் பொதுவான அறநூலாக திருக்குறள் போற்றப்படுகிறது.
Limitations
- Context Length: fine-tuned with 2048-token sequences, shorter than the base Gemma2 model's 8192-token context
- Domain Specialization: May require additional fine-tuning for highly specialized domains
- Resource Requirements: Requires GPU for optimal performance
- Language Mixing: Occasional code-switching between Tamil and English
Technical Specifications
- Architecture: Gemma2 (RMSNorm, GeGLU, RoPE)
- Attention: Grouped-query attention (8 query heads, 4 key-value heads)
- Vocabulary: 256,000 tokens (includes Tamil script)
- Training Precision: Mixed precision (FP16)
- Inference: Optimized for GPU inference
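These figures can be checked directly against the published model configuration; a quick sketch using the standard AutoConfig API:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("sabaridsnfuji/gemma2-2b-tamil-16bit-instruct")
print(config.num_attention_heads, config.num_key_value_heads)  # attention head layout
print(config.vocab_size)                                       # vocabulary size
print(config.hidden_size, config.num_hidden_layers)            # model width and depth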
Hardware Requirements
Minimum Requirements
- GPU: 4GB VRAM (with 4-bit quantization; see the loading sketch below)
- RAM: 8GB system RAM
- Storage: 5GB free space
Recommended Requirements
- GPU: 8GB+ VRAM (RTX 3070/4060 or better)
- RAM: 16GB system RAM
- Storage: 10GB free space
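To fit the minimum 4GB-VRAM profile, the model can be loaded in 4-bit with bitsandbytes. A minimal sketch using the standard BitsAndBytesConfig API (quantization trades some output quality for memory):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit weights with FP16 compute: the usual QLoRA-style inference setup
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "sabaridsnfuji/gemma2-2b-tamil-16bit-instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("sabaridsnfuji/gemma2-2b-tamil-16bit-instruct")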
Installation
# Install required packages
pip install transformers torch accelerate

# Optional: bitsandbytes enables 4-bit and 8-bit quantized inference
pip install bitsandbytes
Citation
If you use this model in your research or applications, please cite:
@misc{gemma2-tamil-instruct-2024,
  title={Gemma2-2B Tamil 16-bit Instruct: A Fine-tuned Tamil Instruction-Following Model},
  author={Sabarids N Fuji},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/sabaridsnfuji/gemma2-2b-tamil-16bit-instruct}
}
Base Model Citation
@article{gemma_2024,
  title={Gemma: Open Models Based on Gemini Research and Technology},
  author={Gemma Team},
  year={2024},
  journal={arXiv preprint arXiv:2403.08295}
}
Dataset Citations
@misc{balachandran2023tamilllama,
  title={Tamil-Llama: A New Tamil Language Model Based on Llama 2},
  author={Abhinand Balachandran},
  year={2023},
  eprint={2311.05845},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
License
This model is released under the Apache 2.0 License. Please note that this model is based on Gemma2, so it also inherits the Gemma Terms of Use.
Disclaimer
This model is for research and educational purposes. Please ensure responsible use and consider potential biases in the training data. The model may occasionally generate incorrect or biased content.
Acknowledgments
- Google: For the base Gemma2-2B model
- Unsloth Team: For the efficient fine-tuning framework
- Abhinand Balachandran: For the Tamil datasets and evaluation framework
- Tamil NLP Community: For ongoing support and contributions
Support
For issues, questions, or contributions:
- 🐛 Report Issues
- 💬 Community Discussions
- 📧 Contact:
Made with ❤️ for the Tamil NLP community