---
license: mit
language:
- en
- hi
- bn
- ta
- te
- ur
- gu
- kn
- ml
- pa
- or
- as
- mr
tags:
- indian-languages
- conversational-ai
- localized-ai
- indic-nlp
- multilingual
- hindi
- bengali
- tamil
- telugu
- urdu
- gujarati
- kannada
- malayalam
- punjabi
- odia
- assamese
- marathi
pipeline_tag: text-generation
library_name: transformers
datasets:
- ai4bharat/indic-corpus
- indicnlp/hindi-corpus
- custom-indian-datasets
metrics:
- perplexity
- bleu
- rouge
model-index:
- name: anki-2.5
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: indian-benchmark
      name: Indian Language Evaluation
    metrics:
    - type: perplexity
      value: 12.5
      name: Perplexity
---

# 🇮🇳 Anki 2.5 - Indian Market-Centric LLM

<div align="center">
<img src="https://img.shields.io/badge/Language-Indic%20Languages-orange" alt="Languages">
<img src="https://img.shields.io/badge/Base%20Model-Transformer-blue" alt="Base Model">
<img src="https://img.shields.io/badge/Size-494M-green" alt="Model Size">
<img src="https://img.shields.io/badge/License-MIT-yellow" alt="License">
</div>

## 🚀 Model Overview

**Anki 2.5** is a large language model built for the Indian market and ecosystem. It is a transformer-based model, fine-tuned to understand the local languages, cultural contexts, and use cases common across India.

The model aims to bridge the gap between global AI capabilities and local Indian needs, with a focus on:

- **Indic Language Understanding**: Comprehension of Hindi, Bengali, Tamil, Telugu, Urdu, Gujarati, Kannada, Malayalam, Punjabi, Odia, Assamese, and Marathi
- **Cultural Context Awareness**: Understanding of Indian customs, festivals, traditions, and social dynamics
- **Market-Specific Applications**: Tailored for Indian business scenarios, educational contexts, and daily life interactions

## ✨ Key Features

### 🌐 Indic Language Excellence
- **Multi-script Support**: Handles Devanagari, Bengali, Tamil, Telugu, Perso-Arabic (Urdu), Gujarati, and other Indian scripts
- **Code-mixing Capability**: Processes Hinglish and other code-mixed Indian English variants
- **Regional Dialects**: Understanding of regional variations and colloquialisms
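
To make "multi-script" and "code-mixed" input concrete at the character level, here is a minimal sketch that tallies the Unicode scripts present in a string. It is not part of Anki 2.5 or its tokenizer: `script_counts` and `is_code_mixed` are hypothetical helper names, and the script is guessed from the first word of each character's Unicode name, which is a heuristic rather than a true script-property lookup.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count letters and combining marks per script, using Unicode names as a heuristic."""
    counts: Counter = Counter()
    for ch in text:
        # Keep letters (L*) and combining marks (M*); skip digits, spaces, punctuation.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        name = unicodedata.name(ch, "")
        if name:
            # e.g. "DEVANAGARI LETTER NA" -> "DEVANAGARI", "LATIN SMALL LETTER H" -> "LATIN"
            counts[name.split()[0]] += 1
    return counts


def is_code_mixed(text: str) -> bool:
    """Treat text as code-mixed when it contains more than one script."""
    return len(script_counts(text)) > 1


sample = "कल meeting है"  # Hindi sentence with an English word
print(script_counts(sample))
print(is_code_mixed(sample))
```

A check like this could be used to route or log inputs by script before generation. Note its limits: Urdu text is reported as `ARABIC`, Punjabi as `GURMUKHI`, Odia as `ORIYA`, and Bengali and Assamese share `BENGALI`, so a script is not the same as a language. Romanized Hinglish written entirely in Latin letters will not be flagged as code-mixed.
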

### 💬 Advanced Conversational Ability
- **Contextual Conversations**: Maintains context across long dialogues in multiple languages
- **Cultural Sensitivity**: Responds appropriately to Indian cultural references and contexts
- **Formal & Informal Registers**: Adapts tone based on conversation requirements

### 🎯 Market Specificity
- **Indian Business Context**: Understanding of Indian market dynamics, regulations, and practices
- **Educational Alignment**: Aligned with Indian educational curricula and learning patterns
- **Rural-Urban Bridge**: Capable of addressing both urban and rural use cases effectively

## 🔧 Technical Details

### Architecture
- **Base Model**: Transformer (0.5B parameters)
- **Fine-tuning**: Specialized training on Indian datasets
- **Model Size**: 494M parameters
- **Precision**: F32 tensor type
- **Context Length**: Up to 8K tokens
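
As a back-of-the-envelope aid for sizing hardware, the parameter count and precision above translate into an approximate weight footprint. The sketch below is plain arithmetic, not a measurement of this model: `weight_memory_gib` is a hypothetical helper, the bytes-per-parameter values are the standard sizes of each tensor type, and only F32 is stated in this card (the other dtypes are shown for comparison and assume you cast the weights yourself).

```python
# Bytes per parameter for common tensor types (standard sizes, not specific to this model).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: int, dtype: str = "float32") -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


NUM_PARAMS = 494_000_000  # "494M parameters" from the list above

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.2f} GiB")
```

Activations, the key-value cache at long context lengths, and framework overhead come on top of this figure, so treat it as a lower bound.
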

### Training Data
- **Indic Corpus**: Comprehensive collection from AI4Bharat
- **Hindi Literature**: Classical and contemporary Hindi texts
- **Multilingual Datasets**: Balanced representation across 12+ Indian languages
- **Domain-Specific Data**: Business, education, healthcare, and government domains
- **Cultural Content**: Festivals, traditions, mythology, and historical references

### Licensing
- **Weights**: Open weights under the MIT License
- **Commercial Use**: Permitted with attribution (the MIT License requires keeping the copyright and license notice)
- **Research Use**: Fully open for academic and research purposes

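## 🎯 Use Cases

### 🎬 Hindi/Indian Language Content Creation
```python
# Generate Hindi poetry or stories
# (assumes `model` and `tokenizer` are loaded as shown in Quick Start)
# Prompt: "Write a beautiful poem in Hindi about Holi"
prompt = "हिंदी में एक सुंदर कविता लिखें होली के बारे में"
# generate() expects token IDs, so tokenize the prompt first
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
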
### 📊 Market Analysis & Business Intelligence
- Indian market trend analysis
- Customer sentiment analysis in local languages
- Regional business strategy recommendations
- Compliance and regulatory guidance

### 🌾 Rural Technology Enablement
- Agricultural advisory in local languages
- Government scheme explanations
- Digital literacy support
- Local language interfaces for apps

### 🎓 Educational Support
- Multilingual tutoring assistance
- Curriculum-aligned content generation
- Language learning support
- Cultural education resources

### 💼 Enterprise Applications
- Customer support in regional languages
- Document translation and summarization
- Indian law and regulation interpretation
- HR and recruitment assistance

## 🛠️ How to Use

### Quick Start
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the model and tokenizer
model_name = "anktechsol/anki-2.5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    device_map="auto"  # requires the `accelerate` package
)

# Generate text in Hindi
# Prompt: "The future of AI in India"
prompt = "भारत में AI का भविष्य"
# Tokenize (this also builds the attention mask) and move tensors to the model's device
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=100,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Advanced Usage
```python
# Multi-language conversation
# (assumes `model` and `tokenizer` are loaded as shown in Quick Start)
conversation = [
    # "I need a marketing strategy for my business."
    {"role": "user", "content": "मुझे अपने बिजनेस के लिए एक मार्केटिंग स्ट्रैटेजी चाहिए।"},
]

# Apply chat template
formatted_prompt = tokenizer.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True
)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
# do_sample=True is required for temperature to have any effect
outputs = model.generate(**inputs, max_length=512, temperature=0.8, do_sample=True)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
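
Because the context window is limited (up to 8K tokens per the Technical Details), long multi-turn conversations eventually need trimming before the chat template is applied. The following is a minimal sketch of one common strategy, dropping the oldest turns first. It is an illustration under stated assumptions, not the model's own behaviour: `trim_conversation` is a hypothetical helper, the whitespace-based counter is a stand-in for a real tokenizer, and the budget used in the example is arbitrary.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def trim_conversation(
    messages: List[Message],
    max_prompt_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[Message]:
    """Keep the most recent messages whose combined token count fits the budget.

    The newest message is always kept, even if it alone exceeds the budget.
    """
    kept: List[Message] = []
    used = 0
    # Walk from newest to oldest so that recent context survives.
    for message in reversed(messages):
        cost = count_tokens(message["content"])
        if kept and used + cost > max_prompt_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def whitespace_count(text: str) -> int:
    """Stand-in token counter (assumption); a real tokenizer will give different counts."""
    return len(text.split())


history = [
    {"role": "user", "content": "a b c"},
    {"role": "assistant", "content": "d e"},
    {"role": "user", "content": "f g h i"},
]
print(trim_conversation(history, max_prompt_tokens=6, count_tokens=whitespace_count))
```

With the real tokenizer, a counter such as `lambda text: len(tokenizer.encode(text))` could be passed instead. The budget should leave room for the tokens you plan to generate and for the extra tokens the chat template adds around each turn, which this sketch does not account for.
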

### Integration with Popular Frameworks
```python
# Using with LangChain for Indian applications
# Assumes the `langchain-huggingface` package is installed; older LangChain
# releases exposed this class at `langchain.llms.huggingface_pipeline`
from langchain_huggingface import HuggingFacePipeline
from transformers import pipeline

# Create pipeline
pipe = pipeline(
    "text-generation",
    model="anktechsol/anki-2.5",
    tokenizer="anktechsol/anki-2.5",
    max_length=512
)

# Wrap with LangChain
llm = HuggingFacePipeline(pipeline=pipe)

# Use in your Indian language applications
response = llm.invoke("Explain GST rules in Hindi")
```

## 🤝 Community & Contributions

### 📢 Call to Action
We invite the Indian AI community to:
- **🔬 Experiment**: Try the model with your specific use cases and share results
- **📝 Feedback**: Report performance insights, especially for regional languages
- **🌍 Language Expansion**: Help us improve coverage for underrepresented Indian languages
- **🤝 Collaborate**: Contribute training data, evaluation benchmarks, or model improvements
- **📚 Research**: Use this model as a foundation for Indian language research

### 💬 Community Channels
- **Discussions**: Use the Community tab above for questions and suggestions
- **Issues**: Report bugs or request features in our repository
- **Research**: Cite this model in your academic work and share findings

### 🎯 Specific Areas Seeking Community Input
- **Regional Dialects**: Help improve understanding of local variations
- **Domain Expertise**: Contribute specialized knowledge (legal, medical, technical)
- **Evaluation Metrics**: Develop Indian language-specific benchmarks
- **Cultural Nuances**: Enhance cultural context understanding

## 🙏 Acknowledgments

### 📊 Datasets & Resources
- **AI4Bharat**: For the comprehensive Indic language corpus
- **IndicNLP**: For Hindi language resources and benchmarks
- **CDAC**: For language technology tools and resources
- **IIT Madras**: For Tamil language processing contributions
- **ISI Kolkata**: For Bengali language datasets

### 🤝 Contributors & Community
- **Anktechsol Team**: Core development and fine-tuning
- **Indian AI Research Community**: Feedback and validation
- **Open Source Contributors**: Bug fixes and improvements
- **Beta Testers**: Early adopters who provided crucial feedback

### 🏢 Institutional Support
- **Transformer Architecture Community**: For the excellent base model architecture
- **Hugging Face**: For model hosting and distribution platform
- **Indian Language Technology Consortium**: For linguistic resources

### 📖 Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{anki-2.5,
  title={Anki 2.5: An Indian Market-Centric Large Language Model},
  author={Anktechsol},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/anktechsol/anki-2.5}},
}
```

---

<div align="center">
<b>🚀 Ready to explore AI in Indian languages? Start using Anki 2.5 today!</b>
<br>
<i>Made with ❤️ for the Indian AI community</i>
</div>

## 📋 Model Information

| Attribute | Value |
|-----------|-------|
| Model Size | 494M parameters |
| Base Model | Transformer |
| Languages | 12+ Indian languages + English |
| License | MIT |
| Context Length | 8K tokens |
| Precision | F32 |
| Training Data | Indian-centric multilingual corpus |
| Use Cases | Conversational AI, Content Generation, Market Analysis |

---

*For technical support, feature requests, or collaborations, please reach out through the Community discussions or contact anktechsol directly.*