---
library_name: transformers
license: apache-2.0
datasets:
- ai4bharat/naamapadam
language:
- bn
base_model:
- openai-community/gpt2
---

# Model Card for AddaGPT 2.0

AddaGPT 2.0 is a Bengali language model based on GPT-2, fine-tuned using LoRA adapters for academic and low-resource applications. While GPT-2 was originally trained only on English data, this model has been adapted to Bengali using the AI4Bharat NaamaPadam dataset, a corpus focused on Named Entity Recognition (NER).

This project is intended as a proof of concept to explore how small, pretrained models like GPT-2 can be extended to Indic languages using low-rank adaptation (LoRA) techniques, even under limited compute settings (e.g., free Kaggle GPUs). It lays the foundation for future work in adapting language models for low-bandwidth, regional, and offline-first use cases that support local communities.

## Model Details

| **Attribute**                      | **Description**                                                                                                          |
| ---------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| **Base Model**                     | GPT-2 (117M parameters)                                                                                                  |
| **Fine-tuned Using**               | [LoRA (Low-Rank Adaptation)](https://arxiv.org/abs/2106.09685)                                                           |
| **Language**                       | Bengali (`bn`)                                                                                                           |
| **Training Dataset**               | [`ai4bharat/naamapadam`](https://huggingface.co/datasets/ai4bharat/naamapadam) – Bengali NER corpus (train split only) |
| **Sentences Seen During Training** | ~9.6 million Bengali sentences                                                                                           |
| **Training Platform**              | Kaggle (free T4 GPUs)                                                                                                    |
| **Frameworks**                     | 🤗 Transformers + PEFT (Parameter-Efficient Fine-Tuning) + Safetensors                                                   |
| **Trainable Parameters**           | 294,912                                                                                                                  |
| **Total Parameters**               | 124,734,720                                                                                                              |
| **Percentage Fine-Tuned**          | 0.2364%                                                                                                                  |
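The exact training hyperparameters are not published with this card, but the counts above are consistent with rank-8 LoRA adapters on GPT-2's fused attention projection (`c_attn`) across all 12 layers: 12 × 8 × (768 + 2304) = 294,912 trainable parameters, i.e., 0.2364% of the 124,734,720 total. The snippet below is a minimal sketch of one such setup; `r` and `target_modules` are chosen to reproduce the reported counts, while `lora_alpha` and `lora_dropout` are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Start from the English GPT-2 base model
base_model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

# Rank-8 adapters on the fused attention projection (c_attn).
# Per layer: 8 * (768 + 2304) = 24,576 LoRA weights; over 12 layers -> 294,912.
lora_config = LoraConfig(
    r=8,                        # rank inferred from the reported parameter count
    lora_alpha=16,              # assumed scaling factor
    target_modules=["c_attn"],
    lora_dropout=0.05,          # assumed
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# reports 294,912 trainable parameters out of 124,734,720 total (0.2364%)
```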
### Model Description

- **Developed by:** Swastik Guha Roy
- **Funded by:** Self-funded

### Uses

AddaGPT 2.0 is an academic proof-of-concept project designed to explore how low-resource, low-compute setups (like Kaggle T4 GPUs) can be used to adapt pretrained language models such as GPT-2 for Indic languages, specifically Bengali.

### Intended Use Cases

- Academic research on low-rank adaptation (LoRA) for regional languages
- Language modeling experimentation in Bengali
- Demonstration of fine-tuning techniques in resource-constrained environments
- Baseline comparison for future Bengali language model development
- Educational purposes for students and ML enthusiasts working on low-resource NLP

### Intended Users

- ML/NLP researchers exploring parameter-efficient tuning
- Students building regional language models
- Developers prototyping Bengali language tools (with limitations)
- Community contributors interested in advancing open-source Bengali AI

## Limitations

This model is not capable of generating grammatically or syntactically correct Bengali sentences. Instead, it outputs individual Bengali words or word-like tokens that are often meaningful on their own, a direct result of training on a NER-style dataset rather than on full natural language text.

- This version does not produce grammatically coherent Bengali sentences.
- Because it was trained on a NER dataset, it mostly outputs individual Bengali words.
- It is not yet suitable for downstream tasks such as summarization, translation, or question answering.

## How to Get Started with the Model

Load the necessary libraries:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
```

Load the model and tokenizer:

```python
model = AutoModelForCausalLM.from_pretrained("SwastikGuhaRoy/AddaGPT2.0")
tokenizer = AutoTokenizer.from_pretrained("SwastikGuhaRoy/AddaGPT2.0_tokenizer")
```

Initialize the generation pipeline:

```python
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
```

Run inference:

```python
prompt = "রবীন্দ্রনাথ ঠাকুর একজন"  # "Rabindranath Tagore is a ..."
output = text_generator(
    prompt,
    max_new_tokens=30,
    temperature=0.7,
    top_p=0.95,
    do_sample=True
)
print(output[0]["generated_text"])
```

## Evaluation

### Results

The model was evaluated on the validation split of the `ai4bharat/naamapadam` dataset to measure how well it models Bengali text.

**Metric: Perplexity (lower is better)**

| Model                   | Validation Perplexity |
| ----------------------- | --------------------- |
| **AddaGPT 2.0**         | **25.61**             |
| Vanilla GPT-2 (English) | 144.53                |

- AddaGPT 2.0 shows a significantly lower perplexity, indicating a better fit to Bengali text.
- Vanilla GPT-2 struggles with Bengali due to the lack of Bengali data during pretraining.

### Summary

Despite the lower perplexity, the model still generates mostly isolated Bengali words rather than grammatically complete sentences, due to the nature of the training dataset (a NER corpus). A minimal sketch for approximately reproducing the perplexity numbers is included at the end of this card.

## Citation

If you use this model, please cite:

```bibtex
@misc{addagpt2.0,
  author       = {Swastik Guha Roy},
  title        = {AddaGPT 2.0: Bengali Finetuned GPT-2 with LoRA},
  year         = 2025,
  howpublished = {\url{https://huggingface.co/SwastikGuhaRoy/AddaGPT2.0}},
}
```
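## Reproducing the Perplexity Evaluation

The snippet below is a minimal sketch of how the validation perplexity reported above can be approximated. It assumes the `bn` configuration of `ai4bharat/naamapadam` (with examples storing pre-tokenized words in a `tokens` field), the published checkpoints from the usage example, and a simple token-weighted average of the causal language modeling loss; the original evaluation setup may differ.

```python
import math

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumed: the Bengali ("bn") configuration and its validation split
dataset = load_dataset("ai4bharat/naamapadam", "bn", split="validation")
tokenizer = AutoTokenizer.from_pretrained("SwastikGuhaRoy/AddaGPT2.0_tokenizer")
model = AutoModelForCausalLM.from_pretrained("SwastikGuhaRoy/AddaGPT2.0")
model.eval()

total_loss, total_tokens = 0.0, 0
for example in dataset.select(range(min(1000, len(dataset)))):  # subsample for a quick estimate
    # NER examples are word lists; join them back into rough running text
    text = " ".join(example["tokens"])
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    n_tokens = inputs["input_ids"].size(1)
    total_loss += outputs.loss.item() * n_tokens
    total_tokens += n_tokens

print(f"Approximate validation perplexity: {math.exp(total_loss / total_tokens):.2f}")
```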