pipeline_tag: text-classification
tags:
- TinnyStories
- NLP
---

# 🪔 Adapting TinyLlama-1B for Telugu

## Model Description
This model is a **fine-tuned version of TinyLlama-1.1B-Chat** trained on a custom **Telugu TinyStories dataset**.
It was developed as part of **CISC7021 – Applied Natural Language Processing, University of Macau** to explore **low-resource language adaptation** of lightweight LLMs.

- **Base model:** [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
- **Language:** Telugu (te)
- **Model type:** Decoder-only transformer (LLaMA-style)
- **Training objective:** Continual pre-training on a Telugu corpus for better language modeling and text generation

---

## Intended Uses
- **Text generation** in Telugu (stories, descriptions, prompts).
- **Research** on low-resource language adaptation.
- **Educational purposes** for understanding continual pre-training with Hugging Face & PyTorch.

⚠️ **Not recommended** for production or sensitive applications (e.g., medical, financial, or legal use).

---

## Training Data
- Dataset: [`dbhaskarganesh/TeluguTinnystories`](https://huggingface.co/datasets/dbhaskarganesh/TeluguTinnystories) (loading sketch below)
- Approx. size: 11,500 tokens
- Derived from TinyStories-style narratives adapted into Telugu.
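
The corpus can be pulled directly from the Hub with the `datasets` library. The sketch below is one way to load it and tokenize it to the 512-token context length used in training; the `train` split and `text` column names are assumptions, since they are not documented on the dataset card.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the Telugu TinyStories corpus from the Hub
# (the "train" split and "text" column are assumptions)
dataset = load_dataset("dbhaskarganesh/TeluguTinnystories", split="train")

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

def tokenize(batch):
    # Truncate to the 512-token context length listed under Training Procedure
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
print(tokenized)
```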

---

## Training Procedure
- **Base model:** TinyLlama-1.1B-Chat
- **Framework:** PyTorch + Hugging Face Transformers
- **GPU:** Google Colab (free tier) and NVIDIA RTX 4090 (24 GB)
- **Settings** (see the sketch after this list):
  - Batch size = 3
  - Max sequence length = 512
  - Learning rate = 2e-5
  - Optimizer = AdamW
  - Example decoding settings: temperature = 0.6, `max_new_tokens` = 850
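
The exact training script is not published with this card. The following is a minimal sketch of how the settings above could be wired into the Hugging Face `Trainer`, reusing `tokenizer` and `tokenized` from the Training Data sketch; the epoch count and other unstated details are placeholders.

```python
from transformers import (
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA-style tokenizers ship without a pad token

args = TrainingArguments(
    output_dir="tinyllama-telugu",
    per_device_train_batch_size=3,  # batch size = 3
    learning_rate=2e-5,             # AdamW is the Trainer's default optimizer
    num_train_epochs=1,             # assumption: epoch count is not stated on the card
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,  # tokenized Telugu corpus (max length 512)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```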

---

## Evaluation
- **Metrics:** accuracy, perplexity
- **Perplexity results** (a computation sketch follows this list):
  - English test set: ~4.92
  - Telugu test set: ~2.42
- **Qualitative evaluation:** the model generates coherent Telugu sentences, though with occasional repetition or off-topic responses.
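
The evaluation script is not included here either; perplexity for a causal LM is typically the exponential of the mean cross-entropy loss on held-out text, roughly as sketched below. The test sentences are placeholders, and the model path is the same assumed repo path used in How to Use.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "dbhaskarganesh/tinyllama-telugu"  # replace with your repo path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

test_texts = ["..."]  # placeholder: held-out Telugu (or English) sentences

losses = []
with torch.no_grad():
    for text in test_texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        # With labels set to the input ids, the model returns the mean
        # cross-entropy loss over the sequence
        out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())

perplexity = math.exp(sum(losses) / len(losses))
print(f"Perplexity: {perplexity:.2f}")
```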

---

## Limitations
- Small model (1B parameters) → not competitive with large LLMs.
- Limited dataset coverage → may not generalize well.
- Possible biases and hallucinations due to training data.

---

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "dbhaskarganesh/tinyllama-telugu"  # replace with your repo path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Telugu prompt: "Write a short story."
prompt = "ఒక చిన్న కథను వ్రాయండి."
inputs = tokenizer(prompt, return_tensors="pt")
# Enable sampling so that the temperature setting takes effect
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
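
Because the base checkpoint is a chat-tuned model, prompts can also be wrapped in its chat template. Whether the continually pre-trained weights still respond better to that format has not been verified here, so treat the variant below (reusing `tokenizer` and `model` from the snippet above) as optional.

```python
# Optional variant: format the prompt with the TinyLlama chat template
# (only useful if the fine-tuned model still follows the chat format)
messages = [{"role": "user", "content": "ఒక చిన్న కథను వ్రాయండి."}]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(chat_inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```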