---
license: apache-2.0
datasets:
- dbhaskarganesh/TeluguTinnystories
language:
- te
metrics:
- accuracy
- perplexity
base_model:
- TinyLlama/TinyLlama-1.1B-Chat-v1.0
pipeline_tag: text-generation
tags:
- TinyStories
- NLP
---

# 🪔 Adapting TinyLlama-1B for Telugu

## Model Description

This model is a **fine-tuned version of TinyLlama-1.1B-Chat** trained on a custom **Telugu TinyStories dataset**.
It was developed as part of **CISC7021 – Applied Natural Language Processing, University of Macau** to explore **low-resource language adaptation** of lightweight LLMs.

- **Base model:** [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
- **Language:** Telugu (te)
- **Model type:** Decoder-only transformer (LLaMA-style)
- **Training objective:** Continual pre-training on a Telugu corpus to improve Telugu language modeling and text generation

---

## Intended Uses

- **Text generation** in Telugu (stories, descriptions, prompts).
- **Research** on low-resource language adaptation.
- **Educational purposes** for understanding continual pre-training with Hugging Face & PyTorch.

⚠️ **Not recommended** for production or sensitive applications (e.g., medical, financial, or legal use).

---

## Training Data

- Dataset: [`dbhaskarganesh/TeluguTinnystories`](https://huggingface.co/datasets/dbhaskarganesh/TeluguTinnystories) (a loading sketch is shown below)
- Approx. size: 11,500 tokens
- Derived from TinyStories-style narratives adapted into Telugu.
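
The snippet below is a quick sketch of pulling the stories from the Hub with the Hugging Face `datasets` library; the `train` split and the record layout are assumptions, so adjust them to the dataset's actual schema.

```python
# Load and inspect the Telugu TinyStories dataset (split/column names assumed).
from datasets import load_dataset

stories = load_dataset("dbhaskarganesh/TeluguTinnystories", split="train")
print(stories)      # row count and column names
print(stories[0])   # first record
```
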
---

## Training Procedure

- **Base model:** TinyLlama-1.1B-Chat
- **Framework:** PyTorch + Hugging Face Transformers
- **Hardware:** Google Colab (free tier) and an NVIDIA RTX 4090 (24 GB)
- **Settings** (see the sketch after this list):
  - Batch size = 3
  - Max sequence length = 512
  - Learning rate = 2e-5
  - Optimizer = AdamW
  - Decoding examples: temperature = 0.6, max_new_tokens = 850
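
The following is a minimal sketch of how such a continual pre-training run can be set up with the Hugging Face `Trainer`, using the settings listed above. It is not the exact training script: the `text` column name, the single epoch, and the output path are assumptions.

```python
# Rough continual pre-training setup (illustrative, not the exact script).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token      # TinyLlama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base)

dataset = load_dataset("dbhaskarganesh/TeluguTinnystories", split="train")

def tokenize(batch):
    # assumes the stories live in a "text" column; adjust to the actual schema
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="tinyllama-telugu",       # illustrative output path
    per_device_train_batch_size=3,       # batch size = 3
    learning_rate=2e-5,                  # AdamW is the Trainer's default optimizer
    num_train_epochs=1,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

`mlm=False` makes the collator build causal language-modeling labels, which matches the continual pre-training objective described above.
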
---

## Evaluation

- **Metrics:** accuracy, perplexity
- **Perplexity results** (a measurement sketch is shown below):
  - English test set: ~4.92
  - Telugu test set: ~2.42
- **Qualitative evaluation:**
  The model generates coherent Telugu sentences, though with occasional repetition or off-topic responses.
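
As a rough illustration of how the perplexity numbers above can be reproduced on a held-out text, the sketch below exponentiates the model's average token-level cross-entropy loss. The repository path and the example sentence (reused from the usage section) are placeholders, not the actual evaluation data.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "dbhaskarganesh/tinyllama-telugu"  # replace with your repo path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "ఒక చిన్న కథను వ్రాయండి."  # placeholder Telugu sentence; use the real test set
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # passing the input ids as labels returns the mean cross-entropy loss
    loss = model(**enc, labels=enc["input_ids"]).loss

print("perplexity:", math.exp(loss.item()))  # exp(mean NLL) = perplexity
```
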
---

## Limitations

- Small model (1.1B parameters) → not competitive with large LLMs.
- Limited dataset coverage → may not generalize well.
- Possible biases and hallucinations due to the training data.

---

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "dbhaskarganesh/tinyllama-telugu"  # replace with your repo path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "ఒక చిన్న కథను వ్రాయండి."  # "Write a short story."
inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=True is needed for the temperature setting to affect decoding
outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```