Model Card: Pruned & Quantized GPT-2 Fine-Tuned on WikiText-2
Model Summary
This model is a pruned and quantized version of the GPT-2 architecture, fine-tuned on the WikiText-2 dataset. The pruning and quantization techniques reduce the model's size and computational requirements, making it suitable for deployment in resource-constrained environments, such as edge devices or applications with limited computational power.
Model Details
Developed by
- Developer: SynSci
- Contact: [email protected]
Model Description
- Architecture: GPT-2 (Generative Pre-trained Transformer 2)
- Model Type: Transformer-based language model
- Base Model: openai-community/gpt2
- Language: English
- License: MIT
- Fine-tuned on: mikasenghaas/wikitext-2
- Modifications:
  - Pruning: Redundant weights removed to decrease model size and inference time.
  - Quantization: Weights quantized to 8-bit integers to reduce memory footprint and improve efficiency.
Direct Use
- Text generation
- Language modeling
- Autocomplete suggestions
- Educational purposes in NLP and model optimization techniques
Downstream Use
- Integration into applications requiring efficient language models
- Deployment on devices with limited computational resources
Out-of-Scope Use
- Generation of misleading or harmful content
- Applications requiring understanding of languages other than English
- Tasks demanding high-precision language understanding beyond the model's capabilities
Bias, Risks, and Limitations
Biases
The model inherits biases present in the GPT-2 architecture and the WikiText-2 dataset, which consists of Wikipedia articles. These biases may include underrepresentation of certain topics or perspectives.
Risks
- Potential generation of biased or inappropriate content
- Misinterpretation of generated text as factual information
Limitations
- Reduced performance compared to the full-sized GPT-2 model due to pruning and quantization
- Limited to English language understanding and generation
- Not suitable for tasks requiring real-time processing of large-scale data
Recommendations
Users should:
- Implement content filtering mechanisms to prevent the generation of inappropriate content (a minimal filtering sketch follows the getting-started snippet below).
- Avoid using the model for critical applications without thorough evaluation.
- Be aware of the model's limitations in understanding nuanced language and context.
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the pruned & quantized model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("swayamsingal/NanoQuant")
model = AutoModelForCausalLM.from_pretrained("swayamsingal/NanoQuant")

# Generate a short continuation of a prompt
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
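As recommended above, a lightweight content filter can be layered on top of generation. The sketch below uses the bad_words_ids argument of generate to block a blocklist of terms; the blocklist and the tokenization of blocked terms are illustrative only and should be adapted to your application.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("swayamsingal/NanoQuant")
model = AutoModelForCausalLM.from_pretrained("swayamsingal/NanoQuant")

# Hypothetical blocklist; replace with terms relevant to your deployment.
blocked_terms = ["example_bad_word"]
bad_words_ids = [tokenizer(term, add_special_tokens=False).input_ids for term in blocked_terms]

inputs = tokenizer("Once upon a time", return_tensors="pt")
# bad_words_ids prevents the listed token sequences from appearing in the output.
outputs = model.generate(**inputs, max_new_tokens=50, bad_words_ids=bad_words_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Blocklists catch only exact token sequences; production deployments typically pair them with a separate moderation classifier or service.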
Training Details
Training Data
- Dataset: mikasenghaas/wikitext-2
- Description: A collection of roughly 2 million training tokens extracted from verified Good and Featured articles on Wikipedia (the WikiText-2 subset of the WikiText corpus). The dataset is available under the Creative Commons Attribution-ShareAlike License.
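A minimal sketch of loading and tokenizing the corpus, assuming the Hub dataset exposes the standard train/validation/test splits with a "text" column (the split and column names are assumptions, not documented by this card):

from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed layout: train/validation/test splits, each with a "text" column
dataset = load_dataset("mikasenghaas/wikitext-2")

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token by default

def tokenize(batch):
    # Truncate to GPT-2's 1024-token context window
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])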
Training Procedure
- Preprocessing: Standard tokenization and formatting compatible with GPT-2 requirements.
- Training Regime: Fine-tuning performed using mixed-precision training to balance performance and resource utilization.
- Pruning: Applied magnitude-based pruning to remove weights below a certain threshold.
- Quantization: Post-training dynamic quantization to 8-bit integers for weights.
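Below is a minimal sketch of the two optimization steps named above, using PyTorch's built-in pruning and dynamic-quantization utilities. The 30% pruning ratio and the set of targeted modules are illustrative; the exact threshold used for this checkpoint is not documented. Note that in the Transformers implementation of GPT-2 most projection layers are Conv1D modules rather than nn.Linear, and the default dynamic-quantization mapping only covers nn.Linear, so this sketch quantizes only a subset of the weights.

import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM
from transformers.pytorch_utils import Conv1D

model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

# Magnitude-based pruning: zero out the 30% smallest-magnitude weights
# in every projection layer (the ratio is illustrative).
for module in model.modules():
    if isinstance(module, (torch.nn.Linear, Conv1D)):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the zeros into the weight tensor

# Post-training dynamic quantization: 8-bit integer weights for supported layers
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)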
Hyperparameters
- Learning Rate: 5e-5
- Batch Size: 32
- Epochs: 3
- Optimizer: AdamW
- Weight Decay: 0.01
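A hedged sketch of how these hyperparameters might be expressed with the Transformers Trainer API, reusing the model, tokenizer, and tokenized dataset from the sketches above. The output directory is a placeholder, and fp16 mixed precision is not available on every backend (it is unsupported on MPS, for example).

from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Pads batches and copies input_ids into labels for causal language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="nanoquant-gpt2",        # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,                  # applied by the default AdamW optimizer
    fp16=True,                          # mixed precision; depends on backend support
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=data_collator,
)
trainer.train()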
Speeds, Sizes, Times
- Original Model Size: ~500 MB
- Pruned & Quantized Model Size: ~6 MB
- Training Time: Approximately 2 hours on a single Apple Silicon MacBook (MPS backend)
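The size figures can be reproduced approximately by serializing each model's weights to disk, as in the sketch below, which reuses model and quantized_model from the pruning and quantization sketch (the temporary file name is a placeholder). Magnitude pruning alone leaves zeros stored in dense tensors, so most of the on-disk saving comes from 8-bit weight storage and any additional compression applied when exporting the checkpoint.

import os
import torch

def state_dict_size_mb(m, path="tmp_weights.pt"):
    # Serialize the weights and report the on-disk size in megabytes
    torch.save(m.state_dict(), path)
    size_mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return size_mb

print(f"original:         {state_dict_size_mb(model):.1f} MB")
print(f"pruned+quantized: {state_dict_size_mb(quantized_model):.1f} MB")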
Evaluation
Testing Data
- Dataset: mikasenghaas/wikitext-2
- Split: Validation set used for evaluation
Metrics
- Perplexity: 155.43
- BLEU Score: 0.0498
- ROUGE-1 Score: 0.1836
- Accuracy: 93.2%
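The perplexity figure can be computed by averaging the causal language modeling loss over the validation split and exponentiating. The sketch below reuses model, tokenizer, and dataset from the earlier sketches and simplifies the usual strided sliding-window evaluation to one pass per line:

import math
import torch

model.eval()
losses = []
with torch.no_grad():
    for example in dataset["validation"]:
        enc = tokenizer(example["text"], return_tensors="pt", truncation=True, max_length=1024)
        if enc["input_ids"].size(1) < 2:
            continue  # skip empty or single-token lines
        # With labels equal to input_ids, the model returns the mean cross-entropy loss
        out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())

perplexity = math.exp(sum(losses) / len(losses))
print(f"perplexity: {perplexity:.2f}")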
Results Summary
The pruned and quantized model achieves competitive performance on the WikiText-2 validation set, with a significant reduction in model size and inference time compared to the original GPT-2 model.
Model Examination
While specific interpretability analyses were not conducted, the model's architecture remains consistent with GPT-2, and standard transformer interpretability techniques can be applied.
Environmental Impact
- Hardware Type: MacBook with Apple Silicon (MPS backend; no CUDA GPU was available)
- Training Duration: 2 hours
- Energy Consumption: Approximately 0.5 kWh
- Carbon Emitted: Estimated 0.2 kg CO₂
Technical Specifications
Model Architecture and Objective
- Architecture: Transformer decoder with 12 layers, 12 attention heads, and a hidden size of 768.
- Objective: Causal language modeling (predicting the next token in a sequence).
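These dimensions match the base GPT-2 configuration and can be checked directly from its config, as in this small sketch:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("openai-community/gpt2")
# 12 layers, 12 attention heads, hidden size 768
print(config.n_layer, config.n_head, config.n_embd)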
Compute Infrastructure
- Hardware: MacBook with Apple Silicon (MPS backend)
- Software: PyTorch, Transformers library by Hugging Face
Citation
If you use this model, please cite:
@misc{NanoQuant,
  title={NanoQuant},
  author={swayamsingal},
  year={2025},
  howpublished={\url{https://huggingface.co/swayamsingal/NanoQuant}},
}
Glossary
- Pruning: The process of removing weights from a neural network to reduce its size and computational requirements.
- Quantization: The process of reducing the precision of the weights in a neural network, typically to 8-bit integers, to decrease model size and increase inference speed.