Gemma2-2B Tamil 16-bit Instruct

A fine-tuned Tamil instruction-following model based on Google's Gemma2-2B

This model is a specialized version of Google's Gemma2-2B, fine-tuned specifically for Tamil-language instruction-following tasks. It is optimized to understand and respond to instructions in Tamil while maintaining its capabilities in English.

Model Details

  • Model Type: Causal Language Model (Instruct-tuned)
  • Base Model: google/gemma-2-2b
  • Language: Tamil (primary), English (secondary)
  • Parameters: 2.6B
  • Precision: 16-bit (FP16)
  • Training Framework: Unsloth
  • Fine-tuning Method: Supervised Fine-Tuning (SFT)

Training Details

Datasets Used

This model was trained on high-quality Tamil instruction-following datasets, including the Tamil-Llama instruction data by Abhinand Balachandran (see Dataset Citations below).

Training Configuration

  • Fine-tuning Method: LoRA (Low-Rank Adaptation) with Unsloth (see the configuration sketch after this list)
  • Sequence Length: 2048 tokens
  • Batch Size: 8 (effective batch size with gradient accumulation)
  • Learning Rate: 2e-4
  • Training Steps: 200+
  • Optimizer: AdamW with 8-bit optimization
  • Hardware: GPU with 15GB VRAM
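
For reference, the sketch below shows how a run matching this configuration might be set up with Unsloth and TRL. The dataset name, LoRA rank, and target modules are illustrative assumptions, not the exact training script, and API details vary across Unsloth/TRL versions.

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load the base model with Unsloth (dtype is auto-detected)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-2-2b",
    max_seq_length=2048,
)

# Attach LoRA adapters; rank and alpha here are typical values, not confirmed ones
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Assumed dataset: the card cites the Tamil-Llama instruction data
dataset = load_dataset("abhinand/tamil-alpaca", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # assumes prompts pre-formatted with the Alpaca template
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size of 8
        learning_rate=2e-4,
        max_steps=200,
        optim="adamw_8bit",
        fp16=True,
        output_dir="outputs",
    ),
)
trainer.train()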

Capabilities

This model excels at:

  • 📝 Tamil Text Generation - Natural and fluent Tamil text creation
  • ❓ Question Answering - Answering questions in Tamil across various domains
  • 💻 Code Generation - Writing Python code with Tamil explanations
  • 🧮 Mathematical Reasoning - Solving math problems with Tamil explanations
  • 📚 Literature & Culture - Tamil literature, history, and cultural knowledge
  • 🔄 Translation - English ↔ Tamil translation tasks
  • 💬 Conversational AI - Natural dialogue in Tamil

Usage

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "sabaridsnfuji/gemma2-2b-tamil-16bit-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Alpaca prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Example usage
def generate_response(instruction, input_text="", max_new_tokens=512):
    prompt = alpaca_prompt.format(instruction, input_text, "")
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    # Decode only the newly generated tokens, skipping the prompt
    generated = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True).strip()

# Test the model
instruction = "தமிழ் மொழியின் சிறப்புகள் என்ன?"  # "What are the distinctive features of the Tamil language?"
response = generate_response(instruction)
print(response)

Streaming Generation

from transformers import TextStreamer

# Initialize text streamer for real-time output
text_streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

def stream_response(instruction, input_text=""):
    prompt = alpaca_prompt.format(instruction, input_text, "")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    print("Tamil Response:")
    print("-" * 30)
    
    model.generate(
        **inputs,
        streamer=text_streamer,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Example with streaming
stream_response("பைத்தானில் ஒரு எளிய for loop எழுதவும்")  # "Write a simple for loop in Python"

Batch Processing

def generate_batch_responses(instructions, max_new_tokens=512):
    """Generate a response for each instruction, processed sequentially."""
    results = []
    
    for instruction in instructions:
        prompt = alpaca_prompt.format(instruction, "", "")
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.7,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id
            )
        
        # Decode only the newly generated tokens, skipping the prompt
        generated = outputs[0][inputs["input_ids"].shape[1]:]
        clean_response = tokenizer.decode(generated, skip_special_tokens=True).strip()
        results.append(clean_response)
    
    return results

# Test with multiple instructions
tamil_instructions = [
    "கோடையில் குளிர்ச்சியாக இருக்க என்ன செய்யலாம்?",  # "What can one do to stay cool in summer?"
    "தமிழ் இலக்கியத்தில் திருக்குறளின் முக்கியத்துவம் என்ன?",  # "What is the importance of the Thirukkural in Tamil literature?"
    "சுற்றுச்சூழல் பாதுகாப்பு ஏன் முக்கியம்?"  # "Why is environmental protection important?"
]

responses = generate_batch_responses(tamil_instructions)
for i, response in enumerate(responses):
    print(f"Q{i+1}: {tamil_instructions[i]}")
    print(f"A{i+1}: {response}\n")

Model Performance

Qualitative Assessment

No formal benchmark scores are reported for this model; the points below are qualitative observations:

  • Tamil Language Understanding: Comparable to specialized Tamil models
  • Code Generation: Strong performance in Python with Tamil explanations
  • Mathematical Reasoning: Accurate solutions with clear Tamil explanations
  • Cultural Knowledge: Deep understanding of Tamil culture and literature

Sample Outputs

Example 1: Mathematical Problem

Input: ஒரு பாதி மற்றும் மூன்றில் ஒரு பங்கின் கூட்டுத்தொகையைக் கணக்கிடுங்கள் ("Calculate the sum of one half and one third")

Output:

1/2 + 1/3 இன் கூட்டுத்தொகையைக் கணக்கிட வேண்டும்.

முதலில் இரு பின்னங்களுக்கும் பொதுவான கணக்காளர் கண்டுபிடிக்க வேண்டும்:
- 2 மற்றும் 3 இன் மீச்சிறு பொதுமடங்கு = 6

எனவே:
1/2 = 3/6
1/3 = 2/6

கூட்டுத்தொகை: 3/6 + 2/6 = 5/6

விடை: 5/6

Example 2: Code Generation

Input: பைத்தானில் ஒரு எளிய பைபோனச்சி வரிசை நிரல் எழுதவும் ("Write a simple Fibonacci sequence program in Python")

Output:

def fibonacci(n):
    """பைபோனச்சி வரிசையை உருவாக்கும் செயல்பாடு"""
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)

# முதல் 10 பைபோனச்சி எண்களை அச்சிடுதல்
print("பைபோனச்சி வரிசை:")
for i in range(10):
    print(f"F({i}) = {fibonacci(i)}")

Example 3: Cultural Knowledge

Input: திருக்குறள் ஏன் உலகப் பொதுமறை என்று அழைக்கப்படுகிறது? ("Why is the Thirukkural called the universal scripture?")

Output:

திருக்குறள் "உலகப் பொதுமறை" என்று அழைக்கப்படுவதற்கு பல காரணங்கள் உள்ளன:

1. **சர்வமத ஏற்பு**: அனைத்து மதங்களும் ஏற்கக்கூடிய நெறிமுறைகள்
2. **காலவேறுபாடின்மை**: எந்தக் காலத்திலும் பொருந்தும் நன்னெறிகள்
3. **உலகளாவிய மதிப்புகள்**: மனித அறம், நேர்மை, அன்பு போன்ற பொதுவான கோட்பாடுகள்
4. **எளிமையான மொழி**: இரண்டு வரிகளில் ஆழமான கருத்துகள்
5. **வாழ்க்கை வழிகாட்டி**: அன்றாட வாழ்வில் பின்பற்றக்கூடிய நெறிகள்

இதனால் உலகத்தின் பொதுவான அறநூலாக திருக்குறள் போற்றப்படுகிறது.

Limitations

  • Context Length: Limited to 2048 tokens (see the truncation sketch after this list)
  • Domain Specialization: May require additional fine-tuning for highly specialized domains
  • Resource Requirements: Requires GPU for optimal performance
  • Language Mixing: Occasional code-switching between Tamil and English
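
Because the context window is 2048 tokens, long prompts should be truncated explicitly at tokenization time. A minimal sketch, reusing the tokenizer, model, and alpaca_prompt from Quick Start:

# Reserve room for generation: cap the prompt at 2048 minus max_new_tokens
prompt = alpaca_prompt.format("தமிழ் மொழியின் சிறப்புகள் என்ன?", "", "")
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,
    max_length=2048 - 512,
).to(model.device)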

Technical Specifications

  • Architecture: Gemma2 (RMSNorm, GeGLU, RoPE)
  • Attention: Grouped-query attention (GQA)
  • Vocabulary: 256,000 tokens (includes Tamil script)
  • Training Precision: Mixed precision (FP16)
  • Inference: Optimized for GPU inference
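
These figures can be cross-checked against the published model config; a quick sketch:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("sabaridsnfuji/gemma2-2b-tamil-16bit-instruct")
print(cfg.num_attention_heads, cfg.num_key_value_heads)  # attention / GQA layout
print(cfg.vocab_size)                                    # vocabulary size
print(cfg.max_position_embeddings)                       # base architecture's position-embedding limit (fine-tuned at 2048)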

Hardware Requirements

Minimum Requirements

  • GPU: 4GB VRAM (with 4-bit quantization; see the loading sketch below)
  • RAM: 8GB system RAM
  • Storage: 5GB free space
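
To fit the 4GB-VRAM minimum, the model can be loaded in 4-bit with bitsandbytes. A minimal sketch; the quantization settings are common defaults, not values taken from this card:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_name = "sabaridsnfuji/gemma2-2b-tamil-16bit-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)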

Recommended Requirements

  • GPU: 8GB+ VRAM (RTX 3070/4060 or better)
  • RAM: 16GB system RAM
  • Storage: 10GB free space

Installation

# Install required packages
pip install transformers torch accelerate

# For 8-bit optimizers and 4-bit quantization
pip install bitsandbytes

Citation

If you use this model in your research or applications, please cite:

@misc{gemma2-tamil-instruct-2024,
  title={Gemma2-2B Tamil 16-bit Instruct: A Fine-tuned Tamil Instruction-Following Model},
  author={Sabarids N Fuji},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/sabaridsnfuji/gemma2-2b-tamil-16bit-instruct}
}

Base Model Citation

@article{gemma_2024,
  title={Gemma: Open Models Based on Gemini Research and Technology},
  author={Gemma Team},
  year={2024},
  journal={arXiv preprint arXiv:2403.08295}
}

Dataset Citations

@misc{balachandran2023tamilllama,
  title={Tamil-Llama: A New Tamil Language Model Based on Llama 2},
  author={Abhinand Balachandran},
  year={2023},
  eprint={2311.05845},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

License

This model is released under the Apache 2.0 License. Note that it is based on Gemma2, so it is also subject to the Gemma Terms of Use.

Disclaimer

This model is for research and educational purposes. Please ensure responsible use and consider potential biases in the training data. The model may occasionally generate incorrect or biased content.

Acknowledgments

  • Google: For the base Gemma2-2B model
  • Unsloth Team: For the efficient fine-tuning framework
  • Abhinand Balachandran: For the Tamil datasets and evaluation framework
  • Tamil NLP Community: For ongoing support and contributions

Support

For issues, questions, or contributions, please open a discussion on the model's Hugging Face page.

Made with ❤️ for the Tamil NLP community
