---
license: mit
language:
- en
- hi
- bn
- ta
- te
- ur
- gu
- kn
- ml
- pa
- or
- as
- mr
tags:
- qwen2
- indian-languages
- conversational-ai
- localized-ai
- indic-nlp
- multilingual
- hindi
- bengali
- tamil
- telugu
- urdu
- gujarati
- kannada
- malayalam
- punjabi
- odia
- assamese
- marathi
base_model: Qwen/Qwen2.5-0.5B
pipeline_tag: text-generation
library_name: transformers
datasets:
- ai4bharat/indic-corpus
- indicnlp/hindi-corpus
- custom-indian-datasets
metrics:
- perplexity
- bleu
- rouge
model-index:
- name: anki-qwen-2.5
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: indian-benchmark
      name: Indian Language Evaluation
    metrics:
    - type: perplexity
      value: 12.5
      name: Perplexity
---

# 🇮🇳 Anki Qwen 2.5 - Indian Market-Centric LLM

<div align="center">
<img src="https://img.shields.io/badge/Language-Indic%20Languages-orange" alt="Languages">
<img src="https://img.shields.io/badge/Base%20Model-Qwen%202.5-blue" alt="Base Model">
<img src="https://img.shields.io/badge/Size-494M-green" alt="Model Size">
<img src="https://img.shields.io/badge/License-MIT-yellow" alt="License">
</div>

## 🚀 Model Overview

**Anki Qwen 2.5** is a specialized large language model built for the Indian market and ecosystem. Based on the Qwen 2.5 architecture, it has been fine-tuned and optimized to understand the local languages, cultural contexts, and use cases prevalent across India.

This model bridges the gap between global AI capabilities and local Indian needs, offering enhanced performance in:
- **Indic Language Understanding**: Deep comprehension of Hindi, Bengali, Tamil, Telugu, Urdu, Gujarati, Kannada, Malayalam, Punjabi, Odia, Assamese, and Marathi
- **Cultural Context Awareness**: Understanding of Indian customs, festivals, traditions, and social dynamics
- **Market-Specific Applications**: Tailored for Indian business scenarios, educational contexts, and daily life interactions

## ✨ Key Features

### 🌐 Indic Language Excellence
- **Multi-script Support**: Handles Devanagari, Bengali, Tamil, Telugu, Urdu, Gujarati, and other Indian scripts
- **Code-mixing Capability**: Seamlessly processes Hinglish and other code-mixed variants of Indian English (see the short sketch after this list)
- **Regional Dialects**: Understanding of regional variations and colloquialisms
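
For the code-mixing item above, a minimal sketch with a Hinglish prompt. The prompt text and sampling settings are illustrative only, and `model` and `tokenizer` are assumed to be loaded as shown in the Quick Start section further down.

```python
# Hinglish (code-mixed) prompt, roughly: "Write a catchy tagline for my
# online store's Diwali sale." Assumes `model`/`tokenizer` from Quick Start.
prompt = "Mere online store ke liye Diwali sale ka ek catchy tagline likho."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=60, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```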

### 💬 Advanced Conversational Ability
- **Contextual Conversations**: Maintains context across long dialogues in multiple languages
- **Cultural Sensitivity**: Responds appropriately to Indian cultural references and contexts
- **Formal & Informal Registers**: Adapts tone based on conversation requirements

### 🎯 Market Specificity
- **Indian Business Context**: Understanding of Indian market dynamics, regulations, and practices
- **Educational Alignment**: Aligned with Indian educational curricula and learning patterns
- **Rural-Urban Bridge**: Capable of addressing both urban and rural use cases effectively

## 🔧 Technical Details

### Architecture
- **Base Model**: Qwen 2.5 (0.5B parameters)
- **Fine-tuning**: Specialized training on Indian datasets
- **Model Size**: 494M parameters
- **Precision**: F32 tensor type
- **Context Length**: Up to 8K tokens
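
The architecture figures above (model type, context length, precision, parameter count) can be read directly from the checkpoint. A minimal sketch using the standard `transformers` config fields; the exact values printed depend on the uploaded checkpoint:

```python
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("anktechsol/anki-qwen-2.5")
print(config.model_type)               # e.g. "qwen2" for a Qwen 2.5 fine-tune
print(config.max_position_embeddings)  # maximum context length in tokens
print(config.torch_dtype)              # precision the weights were saved in (may be None)

model = AutoModelForCausalLM.from_pretrained("anktechsol/anki-qwen-2.5")
print(f"{model.num_parameters():,} parameters")
```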

### Training Data
- **Indic Corpus**: Comprehensive collection from AI4Bharat
- **Hindi Literature**: Classical and contemporary Hindi texts
- **Multilingual Datasets**: Balanced representation across 12+ Indian languages
- **Domain-Specific Data**: Business, education, healthcare, and government domains
- **Cultural Content**: Festivals, traditions, mythology, and historical references

### Licensing
- **Weights**: Open weights under the MIT License
- **Commercial Use**: Permitted with attribution
- **Research Use**: Fully open for academic and research purposes

## 🎯 Use Cases

### 🎬 Hindi/Indian Language Content Creation
```python
# Generate Hindi poetry or stories
# (prompt: "Write a beautiful poem in Hindi about Holi")
# Assumes `model` and `tokenizer` are loaded as shown in the Quick Start below.
prompt = "हिंदी में एक सुंदर कविता लिखें होली के बारे में"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=200, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```

### 📊 Market Analysis & Business Intelligence
- Indian market trend analysis
- Customer sentiment analysis in local languages (see the sketch after this list)
- Regional business strategy recommendations
- Compliance and regulatory guidance
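
As an illustration of the sentiment-analysis item above, sentiment can be elicited with a plain prompt; there is no dedicated classification head. The prompt wording, labels, and decoding settings below are assumptions for the sketch, with `model` and `tokenizer` loaded as in the Quick Start.

```python
# Prompt-based sentiment check on a Hindi customer review.
# Review: "Delivery was fast but the product quality is very poor."
# Instruction: "State the sentiment of the review below (positive/negative/neutral)."
review = "डिलीवरी तेज़ थी लेकिन प्रोडक्ट की क्वालिटी बहुत खराब है।"
prompt = (
    "नीचे दिए गए ग्राहक रिव्यू का सेंटिमेंट बताइए (सकारात्मक / नकारात्मक / तटस्थ):\n"
    f"{review}\n"
    "सेंटिमेंट:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)
# Decode only the newly generated tokens, i.e. the predicted label.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```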

### 🌾 Rural Technology Enablement
- Agricultural advisory in local languages
- Government scheme explanations
- Digital literacy support
- Local language interfaces for apps

### 🎓 Educational Support
- Multilingual tutoring assistance
- Curriculum-aligned content generation
- Language learning support
- Cultural education resources

### 💼 Enterprise Applications
- Customer support in regional languages
- Document translation and summarization (see the sketch after this list)
- Indian law and regulation interpretation
- HR and recruitment assistance
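
For the document-summarization item above, one option is to drive the model through the `transformers` pipeline API with a summarization instruction in the target language. This is a hedged sketch: the Marathi prompt ("Give a brief summary of the following text in Marathi") and the placeholder document are illustrative, not part of any released tooling.

```python
from transformers import pipeline

# Text-generation pipeline around the published checkpoint.
generator = pipeline("text-generation", model="anktechsol/anki-qwen-2.5")

document = "..."  # e.g. a Marathi policy circular or a support ticket thread
prompt = f"खालील मजकुराचा मराठीत थोडक्यात सारांश द्या:\n{document}\nसारांश:"

# return_full_text=False keeps only the generated summary, not the prompt.
summary = generator(prompt, max_new_tokens=120, do_sample=False,
                    return_full_text=False)[0]["generated_text"]
print(summary)
```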

## 🛠️ How to Use

### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the model and tokenizer
model_name = "anktechsol/anki-qwen-2.5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    device_map="auto"
)

# Generate text in Hindi (prompt: "The future of AI in India")
prompt = "भारत में AI का भविष्य"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=100,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Advanced Usage

```python
# Multi-language conversation
# (user message: "I need a marketing strategy for my business.")
conversation = [
    {"role": "user", "content": "मुझे अपने बिजनेस के लिए एक मार्केटिंग स्ट्रैटेजी चाहिए।"},
]

# Apply the chat template bundled with the tokenizer
# (requires the checkpoint to ship a chat template)
formatted_prompt = tokenizer.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True
)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=512, do_sample=True, temperature=0.8)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```

### Integration with Popular Frameworks

```python
# Using with LangChain for Indian applications
# (HuggingFacePipeline is provided by langchain_community; the newer
#  langchain-huggingface package exposes the same class)
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline

# Create pipeline
pipe = pipeline(
    "text-generation",
    model="anktechsol/anki-qwen-2.5",
    tokenizer="anktechsol/anki-qwen-2.5",
    max_length=512
)

# Wrap with LangChain
llm = HuggingFacePipeline(pipeline=pipe)

# Use in your Indian language applications
response = llm.invoke("Explain GST rules in Hindi")
```
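
Beyond LangChain, the same pipeline can be wrapped in a small web demo for quick manual testing. The sketch below uses Gradio purely as an example; the interface title and generation settings are assumptions, not part of any released tooling.

```python
import gradio as gr
from transformers import pipeline

pipe = pipeline("text-generation", model="anktechsol/anki-qwen-2.5")

def chat(prompt: str) -> str:
    # Return only the generated continuation, without echoing the prompt.
    result = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7,
                  return_full_text=False)
    return result[0]["generated_text"]

demo = gr.Interface(fn=chat, inputs="text", outputs="text", title="Anki Qwen 2.5 demo")
demo.launch()
```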

## 🤝 Community & Contributions

### 📢 Call to Action
We invite the Indian AI community to:

- **🔬 Experiment**: Try the model with your specific use cases and share results
- **📝 Feedback**: Report performance insights, especially for regional languages
- **🌍 Language Expansion**: Help us improve coverage for underrepresented Indian languages
- **🤝 Collaborate**: Contribute training data, evaluation benchmarks, or model improvements
- **📚 Research**: Use this model as a foundation for Indian language research

### 💬 Community Channels
- **Discussions**: Use the Community tab above for questions and suggestions
- **Issues**: Report bugs or request features in our repository
- **Research**: Cite this model in your academic work and share findings

### 🎯 Specific Areas Seeking Community Input
- **Regional Dialects**: Help improve understanding of local variations
- **Domain Expertise**: Contribute specialized knowledge (legal, medical, technical)
- **Evaluation Metrics**: Develop Indian language-specific benchmarks
- **Cultural Nuances**: Enhance cultural context understanding

## 🙏 Acknowledgments

### 📊 Datasets & Resources
- **AI4Bharat**: For the comprehensive Indic language corpus
- **IndicNLP**: For Hindi language resources and benchmarks
- **CDAC**: For language technology tools and resources
- **IIT Madras**: For Tamil language processing contributions
- **ISI Kolkata**: For Bengali language datasets

### 🤝 Contributors & Community
- **Anktechsol Team**: Core development and fine-tuning
- **Indian AI Research Community**: Feedback and validation
- **Open Source Contributors**: Bug fixes and improvements
- **Beta Testers**: Early adopters who provided crucial feedback

### 🏢 Institutional Support
- **Qwen Team**: For the excellent base model architecture
- **Hugging Face**: For the model hosting and distribution platform
- **Indian Language Technology Consortium**: For linguistic resources

### 📖 Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{anki-qwen-2.5,
  title={Anki Qwen 2.5: An Indian Market-Centric Large Language Model},
  author={Anktechsol},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/anktechsol/anki-qwen-2.5}},
}
```

---

<div align="center">
<b>🚀 Ready to explore AI in Indian languages? Start using Anki Qwen 2.5 today!</b>
<br>
<i>Made with ❤️ for the Indian AI community</i>
</div>

## 📋 Model Information

| Attribute | Value |
|-----------|-------|
| Model Size | 494M parameters |
| Base Model | Qwen 2.5 |
| Languages | 12+ Indian languages + English |
| License | MIT |
| Context Length | 8K tokens |
| Precision | F32 |
| Training Data | Indian-centric multilingual corpus |
| Use Cases | Conversational AI, Content Generation, Market Analysis |

---

*For technical support, feature requests, or collaborations, please reach out through the Community discussions or contact anktechsol directly.*