	---
	license: mit
	language:
	- en
	- hi
	- bn
	- ta
	- te
	- ur
	- gu
	- kn
	- ml
	- pa
	- or
	- as
	- mr
	tags:
	- indian-languages
	- conversational-ai
	- localized-ai
	- indic-nlp
	- multilingual
	- hindi
	- bengali
	- tamil
	- telugu
	- urdu
	- gujarati
	- kannada
	- malayalam
	- punjabi
	- odia
	- assamese
	- marathi
	pipeline_tag: text-generation
	library_name: transformers
	datasets:
	- ai4bharat/indic-corpus
	- indicnlp/hindi-corpus
	- custom-indian-datasets
	metrics:
	- perplexity
	- bleu
	- rouge
	model-index:
	- name: anki-2.5
	  results:
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      type: indian-benchmark
	      name: Indian Language Evaluation
	    metrics:
	    - type: perplexity
	      value: 12.5
	      name: Perplexity
	---

	# 🇮🇳 Anki 2.5 - Indian Market-Centric LLM

	<div align="center">
	<img src="https://img.shields.io/badge/Language-Indic%20Languages-orange" alt="Languages">
	<img src="https://img.shields.io/badge/Base%20Model-Transformer-blue" alt="Base Model">
	<img src="https://img.shields.io/badge/Size-494M-green" alt="Model Size">
	<img src="https://img.shields.io/badge/License-MIT-yellow" alt="License">
	</div>

	## 🚀 Model Overview

	Anki 2.5 is a large language model designed for the Indian market and ecosystem. Built on a transformer architecture, it has been fine-tuned to understand the local languages, cultural contexts, and use cases prevalent across India.

	This model bridges the gap between global AI capabilities and local Indian needs, offering enhanced performance in:

	- Indic Language Understanding: Deep comprehension of Hindi, Bengali, Tamil, Telugu, Urdu, Gujarati, Kannada, Malayalam, Punjabi, Odia, Assamese, and Marathi
	- Cultural Context Awareness: Understanding of Indian customs, festivals, traditions, and social dynamics
	- Market-Specific Applications: Tailored for Indian business scenarios, educational contexts, and daily life interactions

	## ✨ Key Features

	### 🌐 Indic Language Excellence
	- Multi-script Support: Handles Devanagari, Bengali, Tamil, Telugu, Perso-Arabic (Urdu), Gujarati, and other Indian scripts
	- Code-mixing Capability: Processes Hinglish and other code-mixed text that blends English with Indian languages
	- Regional Dialects: Understanding of regional variations and colloquialisms
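
As a rough illustration of what multi-script and code-mixed input looks like to an application, the sketch below labels the scripts present in a prompt using only the Python standard library. It is not part of the model; the helper names `scripts_in` and `is_code_mixed` are hypothetical, and the script label is simply the first word of each letter's Unicode character name.

```python
import unicodedata


def scripts_in(text: str) -> set[str]:
    """Return rough script labels for the letters in `text`.

    The label is the first word of the Unicode character name,
    e.g. "DEVANAGARI", "LATIN", "TAMIL", "ARABIC".
    """
    found = set()
    for ch in text:
        if ch.isalpha():  # skip digits, spaces, punctuation, combining marks
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


def is_code_mixed(text: str) -> bool:
    """True if the text mixes Latin letters with at least one other script."""
    scripts = scripts_in(text)
    return "LATIN" in scripts and len(scripts) > 1


print(scripts_in("मुझे GST समझाओ"))
print(is_code_mixed("मुझे GST समझाओ"))
```

Note that this only detects mixing at the script level. Hinglish written entirely in Latin letters (romanized Hindi) would be reported as a single script, so a real application would need a language-level check as well.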

	### 💬 Advanced Conversational Ability
	- Contextual Conversations: Maintains context across long dialogues in multiple languages
	- Cultural Sensitivity: Responds appropriately to Indian cultural references and contexts
	- Formal & Informal Registers: Adapts tone based on conversation requirements

	### 🎯 Market Specificity
	- Indian Business Context: Understanding of Indian market dynamics, regulations, and practices
	- Educational Alignment: Aligned with Indian educational curricula and learning patterns
	- Rural-Urban Bridge: Capable of addressing both urban and rural use cases effectively

	## 🔧 Technical Details

	### Architecture
	- Base Model: Transformer (0.5B parameters)
	- Fine-tuning: Specialized training on Indian datasets
	- Model Size: 494M parameters
	- Precision: F32 tensor type
	- Context Length: Up to 8K tokens
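
For capacity planning, the parameter count and precision above translate directly into a weight footprint. The sketch below is simple arithmetic, not a measurement: it covers the weights only and ignores activations, the KV cache, and framework overhead. The `weight_memory_gb` helper and the non-F32 entries in the table are illustrative assumptions.

```python
# Bytes used per parameter for common tensor types.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str = "float32") -> float:
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


params = 494_000_000  # parameter count stated above
print(f"float32: {weight_memory_gb(params):.2f} GB")             # float32: 1.98 GB
print(f"float16: {weight_memory_gb(params, 'float16'):.2f} GB")  # float16: 0.99 GB
```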

	### Training Data
	- Indic Corpus: Comprehensive collection from AI4Bharat
	- Hindi Literature: Classical and contemporary Hindi texts
	- Multilingual Datasets: Balanced representation across 12+ Indian languages
	- Domain-Specific Data: Business, education, healthcare, and government domains
	- Cultural Content: Festivals, traditions, mythology, and historical references

	### Licensing
	- Weights: Open weights under MIT License
	- Commercial Use: Permitted with attribution
	- Research Use: Fully open for academic and research purposes

	## 🎯 Use Cases

	### 🎬 Hindi/Indian Language Content Creation
	```python
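	# Generate Hindi poetry or stories.
	# Assumes `model` and `tokenizer` are loaded as shown in Quick Start.
	# Prompt in English: "Write a beautiful poem in Hindi about Holi"
	prompt = "हिंदी में एक सुंदर कविता लिखें होली के बारे में"
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	outputs = model.generate(**inputs, max_length=200)
	response = tokenizer.decode(outputs[0], skip_special_tokens=True)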
	```

	### 📊 Market Analysis & Business Intelligence
	- Indian market trend analysis
	- Customer sentiment analysis in local languages
	- Regional business strategy recommendations
	- Compliance and regulatory guidance
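
As one way to approach sentiment analysis in local languages, the sketch below builds a chat-style prompt and parses a one-word label from the reply. The helper names, the instruction wording, and the label set are all hypothetical, and how reliably the model follows such an instruction has not been established here; treat it as a starting point to evaluate, not a documented capability.

```python
def build_sentiment_messages(review: str, language: str) -> list[dict]:
    """Build a chat-style prompt asking for a one-word sentiment label."""
    instruction = (
        f"Classify the sentiment of the following {language} customer review "
        "as exactly one word: positive, negative, or neutral.\n\n"
        f"Review: {review}"
    )
    return [{"role": "user", "content": instruction}]


def parse_label(model_output: str) -> str:
    """Map free-form model output to a label, defaulting to 'unknown'."""
    text = model_output.lower()
    for label in ("positive", "negative", "neutral"):
        if label in text:
            return label
    return "unknown"


# Review in English: "The service was very good"
messages = build_sentiment_messages("सेवा बहुत अच्छी थी", "Hindi")
print(messages[0]["content"])
```

The returned list has the same shape as the `conversation` list used with `tokenizer.apply_chat_template`, so it can be passed through the same generation path.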

	### 🌾 Rural Technology Enablement
	- Agricultural advisory in local languages
	- Government scheme explanations
	- Digital literacy support
	- Local language interfaces for apps

	### 🎓 Educational Support
	- Multilingual tutoring assistance
	- Curriculum-aligned content generation
	- Language learning support
	- Cultural education resources

	### 💼 Enterprise Applications
	- Customer support in regional languages
	- Document translation and summarization
	- Indian law and regulation interpretation
	- HR and recruitment assistance

	## 🛠️ How to Use

	### Quick Start
	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	# Load the model and tokenizer
	model_name = "anktechsol/anki-2.5"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    torch_dtype=torch.float32,
	    device_map="auto",  # requires the `accelerate` package
	)

	# Generate text in Hindi (prompt in English: "The future of AI in India")
	prompt = "भारत में AI का भविष्य"
	# Move the tokenized inputs to the device the model was placed on
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

	with torch.no_grad():
	    outputs = model.generate(
	        **inputs,
	        max_length=100,
	        temperature=0.7,
	        do_sample=True,
	        pad_token_id=tokenizer.eos_token_id,
	    )

	response = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(response)
	```

	### Advanced Usage
	```python
	# Multi-language conversation.
	# Assumes `model` and `tokenizer` are loaded as shown in Quick Start.
	# User message in English: "I need a marketing strategy for my business."
	conversation = [
	    {"role": "user", "content": "मुझे अपने बिजनेस के लिए एक मार्केटिंग स्ट्रैटेजी चाहिए।"},
	]

	# Apply chat template
	formatted_prompt = tokenizer.apply_chat_template(
	    conversation,
	    tokenize=False,
	    add_generation_prompt=True,
	)

	# Generate response; do_sample=True is needed for temperature to take effect
	inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
	outputs = model.generate(**inputs, max_length=512, temperature=0.8, do_sample=True)
	response = tokenizer.decode(outputs[0], skip_special_tokens=True)
	```
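
For longer multi-turn conversations, the history eventually has to fit inside the context window. The sketch below keeps the most recent messages that fit a token budget. The `trim_history` helper is hypothetical, the default of 8192 is an assumed reading of "8K tokens", and the count ignores the extra tokens a chat template adds, so leave some headroom.

```python
from typing import Callable


def trim_history(
    conversation: list[dict],
    count_tokens: Callable[[str], int],
    max_tokens: int = 8192,        # assumed value for the 8K context length
    reserve_for_reply: int = 512,  # room left for the generated answer
) -> list[dict]:
    """Keep the most recent messages whose total token count fits the budget."""
    budget = max_tokens - reserve_for_reply
    kept: list[dict] = []
    used = 0
    # Walk backwards so the newest messages are kept first.
    for message in reversed(conversation):
        cost = count_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))


# Stand-in counter for demonstration; with a loaded tokenizer you could use
# count_tokens = lambda text: len(tokenizer(text)["input_ids"])
history = [
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user", "content": "six"},
]
print(trim_history(history, lambda s: len(s.split()), max_tokens=5, reserve_for_reply=2))
```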

	### Integration with Popular Frameworks
	```python
	# Using with LangChain for Indian applications
	# (requires the `langchain-huggingface` package; older LangChain releases
	# exposed this class as langchain.llms.huggingface_pipeline.HuggingFacePipeline)
	from langchain_huggingface import HuggingFacePipeline
	from transformers import pipeline

	# Create pipeline
	pipe = pipeline(
	    "text-generation",
	    model="anktechsol/anki-2.5",
	    tokenizer="anktechsol/anki-2.5",
	    max_length=512,
	)

	# Wrap with LangChain
	llm = HuggingFacePipeline(pipeline=pipe)

	# Use in your Indian language applications
	response = llm.invoke("Explain GST rules in Hindi")
	```

	## 🤝 Community & Contributions

	### 📢 Call to Action
	We invite the Indian AI community to:
	- 🔬 Experiment: Try the model with your specific use cases and share results
	- 📝 Feedback: Report performance insights, especially for regional languages
	- 🌍 Language Expansion: Help us improve coverage for underrepresented Indian languages
	- 🤝 Collaborate: Contribute training data, evaluation benchmarks, or model improvements
	- 📚 Research: Use this model as a foundation for Indian language research

	### 💬 Community Channels
	- Discussions: Use the Community tab above for questions and suggestions
	- Issues: Report bugs or request features in our repository
	- Research: Cite this model in your academic work and share findings

	### 🎯 Specific Areas Seeking Community Input
	- Regional Dialects: Help improve understanding of local variations
	- Domain Expertise: Contribute specialized knowledge (legal, medical, technical)
	- Evaluation Metrics: Develop Indian language-specific benchmarks
	- Cultural Nuances: Enhance cultural context understanding
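
For contributors working on evaluation, the model metadata lists perplexity as a metric. A minimal sketch of the standard definition follows: perplexity is the exponential of the mean per-token negative log-likelihood (natural log). The `perplexity` helper is illustrative and does not reproduce the evaluation setup behind the reported figure, which is not described here.

```python
import math


def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 1/8 to every token has perplexity of about 8.
nlls = [-math.log(1 / 8)] * 10
print(perplexity(nlls))
```

Because the value is computed per token, it depends on the tokenizer: scores from models with different tokenizers, or on different test sets, are not directly comparable.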

	## 🙏 Acknowledgments

	### 📊 Datasets & Resources
	- AI4Bharat: For the comprehensive Indic language corpus
	- IndicNLP: For Hindi language resources and benchmarks
	- CDAC: For language technology tools and resources
	- IIT Madras: For Tamil language processing contributions
	- ISI Kolkata: For Bengali language datasets

	### 🤝 Contributors & Community
	- Anktechsol Team: Core development and fine-tuning
	- Indian AI Research Community: Feedback and validation
	- Open Source Contributors: Bug fixes and improvements
	- Beta Testers: Early adopters who provided crucial feedback

	### 🏢 Institutional Support
	- Transformer Architecture Community: For the excellent base model architecture
	- Hugging Face: For model hosting and distribution platform
	- Indian Language Technology Consortium: For linguistic resources

	### 📖 Citation
	If you use this model in your research or applications, please cite:
	```bibtex
	@misc{anki-2.5,
	  title={Anki 2.5: An Indian Market-Centric Large Language Model},
	  author={Anktechsol},
	  year={2025},
	  publisher={Hugging Face},
	  howpublished={\url{https://huggingface.co/anktechsol/anki-2.5}},
	}
	```

	---

	<div align="center">
	<b>🚀 Ready to explore AI in Indian languages? Start using Anki 2.5 today!</b>
	<br>
	<i>Made with ❤️ for the Indian AI community</i>
	</div>

	## 📋 Model Information

	| Attribute | Value |
	|-----------|-------|
	| Model Size | 494M parameters |
	| Base Model | Transformer |
	| Languages | 12+ Indian languages + English |
	| License | MIT |
	| Context Length | 8K tokens |
	| Precision | F32 |
	| Training Data | Indian-centric multilingual corpus |
	| Use Cases | Conversational AI, Content Generation, Market Analysis |

	---

	For technical support, feature requests, or collaborations, please reach out through the Community discussions or contact anktechsol directly.