---
license: mit
datasets:
- mikasenghaas/wikitext-2
language:
- en
metrics:
- bleu
- rouge
- perplexity
- accuracy
base_model:
- openai-community/gpt2
tags:
- Quantized
- Pruned
- Small
- Nano
- SBC
pipeline_tag: text-generation
---

# Model Card: Pruned & Quantized GPT-2 Fine-Tuned on WikiText-2

## Model Summary

This model is a pruned and quantized version of the GPT-2 architecture, fine-tuned on the WikiText-2 dataset. Pruning and quantization reduce the model's size and computational requirements, making it suitable for deployment in resource-constrained environments such as edge devices or applications with limited computational power.

## Model Details

### Developed by

- **Developer:** SynSci
- **Contact:** swayam.singal@gmail.com

### Model Description

- **Architecture:** GPT-2 (Generative Pre-trained Transformer 2)
- **Model Type:** Transformer-based language model
- **Base Model:** [openai-community/gpt2](https://huggingface.co/openai-community/gpt2)
- **Language:** English
- **License:** MIT
- **Fine-tuned on:** [mikasenghaas/wikitext-2](https://huggingface.co/datasets/mikasenghaas/wikitext-2)
- **Modifications:**
  - **Pruning:** Redundant weights removed to decrease model size and inference time.
  - **Quantization:** Weights quantized to 8-bit integers to reduce memory footprint and improve efficiency.

### Direct Use

- Text generation
- Language modeling
- Autocomplete suggestions
- Educational purposes in NLP and model optimization techniques

### Downstream Use

- Integration into applications requiring efficient language models
- Deployment on devices with limited computational resources

### Out-of-Scope Use

- Generation of misleading or harmful content
- Applications requiring understanding of languages other than English
- Tasks demanding high-precision language understanding beyond the model's capabilities

## Bias, Risks, and Limitations

### Biases

The model inherits biases present in the pretrained GPT-2 weights and in the WikiText-2 dataset, which consists of Wikipedia articles. These biases may include the underrepresentation of certain topics or perspectives.

### Risks

- Potential generation of biased or inappropriate content
- Misinterpretation of generated text as factual information

### Limitations

- Reduced performance compared to the full-sized GPT-2 model due to pruning and quantization
- Limited to English language understanding and generation
- Not suitable for tasks requiring real-time processing of large-scale data

### Recommendations

Users should:

- Implement content filtering mechanisms to prevent the generation of inappropriate content.
- Avoid using the model for critical applications without thorough evaluation.
- Be aware of the model's limitations in understanding nuanced language and context.

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("swayamsingal/NanoQuant")
model = AutoModelForCausalLM.from_pretrained("swayamsingal/NanoQuant")

input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

### Training Data

- **Dataset:** [mikasenghaas/wikitext-2](https://huggingface.co/datasets/mikasenghaas/wikitext-2)
- **Description:** A collection of over 2 million tokens extracted from verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.
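For reference, the sketch below shows one way the corpus could be loaded and tokenized with the Hugging Face `datasets` and `transformers` libraries. The `text` column name and split layout are assumed from the standard WikiText format; this is not necessarily the exact preprocessing pipeline used for this model.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the fine-tuning corpus from the Hugging Face Hub.
dataset = load_dataset("mikasenghaas/wikitext-2")

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token by default

def tokenize(batch):
    # Truncate to GPT-2's 1024-token context window.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
```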
### Training Procedure

- **Preprocessing:** Standard tokenization and formatting compatible with GPT-2 requirements.
- **Training Regime:** Fine-tuning performed with mixed-precision training to balance performance and resource utilization.
- **Pruning:** Magnitude-based pruning applied to remove weights below a fixed threshold.
- **Quantization:** Post-training dynamic quantization of weights to 8-bit integers.

### Hyperparameters

- **Learning Rate:** 5e-5
- **Batch Size:** 32
- **Epochs:** 3
- **Optimizer:** AdamW
- **Weight Decay:** 0.01

### Speeds, Sizes, Times

- **Original Model Size:** ~500 MB
- **Pruned & Quantized Model Size:** ~6 MB
- **Training Time:** Approximately 2 hours on a single Apple Silicon machine (MPS backend)

## Evaluation

### Testing Data

- **Dataset:** [mikasenghaas/wikitext-2](https://huggingface.co/datasets/mikasenghaas/wikitext-2)
- **Split:** Validation set used for evaluation

### Metrics

- **Perplexity:** 155.43
- **BLEU Score:** 0.0498
- **ROUGE-1 Score:** 0.1836
- **Accuracy:** 93.2%

### Results Summary

The pruned and quantized model achieves competitive performance on the WikiText-2 validation set, with a significant reduction in model size and inference time compared to the original GPT-2 model.

## Model Examination

No dedicated interpretability analyses were conducted. Because the architecture remains that of GPT-2, standard transformer interpretability techniques can be applied.

## Environmental Impact

- **Hardware Type:** MacBook (Apple MPS) [🙂‍↕️ can't afford a good CUDA GPU]
- **Training Duration:** 2 hours
- **Energy Consumption:** Approximately 0.5 kWh
- **Carbon Emitted:** Estimated 0.2 kg CO₂

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** Transformer decoder with 12 layers, 12 attention heads, and a hidden size of 768.
- **Objective:** Causal language modeling (predicting the next token in a sequence).

### Compute Infrastructure

- **Hardware:** Apple Silicon (MPS backend)
- **Software:** PyTorch and the Hugging Face Transformers library

## Citation

If you use this model, please cite:

```bibtex
@misc{NanoQuant,
  title={NanoQuant},
  author={swayamsingal},
  year={2025},
  howpublished={\url{https://huggingface.co/swayamsingal/NanoQuant}},
}
```

## Glossary

- **Pruning:** The process of removing weights from a neural network to reduce its size and computational requirements.
- **Quantization:** The process of reducing the precision of the weights in a neural network, typically to 8-bit integers, to decrease model size and increase inference speed.
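To make these terms concrete, the sketch below illustrates magnitude-based pruning and post-training dynamic quantization in PyTorch. It is a generic illustration under stated assumptions (the 30% sparsity level and the targeted module types are illustrative), not the exact pipeline used to produce this model.

```python
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM
from transformers.pytorch_utils import Conv1D

model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

# Pruning: zero out the 30% smallest-magnitude weights in each Conv1D
# projection (GPT-2 implements its attention/MLP projections as Conv1D).
# The 30% sparsity level is purely illustrative.
for module in model.modules():
    if isinstance(module, Conv1D):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the zeros into the weight tensor

# Quantization: post-training dynamic quantization stores the weights of
# supported nn.Linear modules as 8-bit integers and quantizes activations
# on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```

Note that `quantize_dynamic` as written only covers `nn.Linear` modules; quantizing GPT-2's `Conv1D` projections would require an additional module mapping, which is beyond the scope of this illustration.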