Model Card: Pruned & Quantized GPT-2 Fine-Tuned on WikiText-2

Model Summary

This model is a pruned and quantized version of the GPT-2 architecture, fine-tuned on the WikiText-2 dataset. The pruning and quantization techniques reduce the model's size and computational requirements, making it suitable for deployment in resource-constrained environments, such as edge devices or applications with limited computational power.

Model Details

Developed by

  • swayamsingal

Model Description

  • Architecture: GPT-2 (Generative Pre-trained Transformer 2)
  • Model Type: Transformer-based language model
  • Base Model: openai-community/gpt2
  • Language: English
  • License: MIT
  • Fine-tuned on: mikasenghaas/wikitext-2
  • Modifications:
    • Pruning: Redundant weights removed to decrease model size and inference time.
    • Quantization: Weights quantized to 8-bit integers to reduce memory footprint and improve efficiency.
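
As a quick way to see the effect of these modifications, the sketch below loads the released checkpoint and reports the fraction of weights that were zeroed out by pruning. It assumes the weights load through transformers exactly as shown in "How to Get Started with the Model" below; the reported sparsity depends on the actual checkpoint.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("swayamsingal/NanoQuant")

# Count zero entries in the 2-D weight matrices (1-D biases and LayerNorm
# parameters are skipped).
total, zeros = 0, 0
for name, param in model.named_parameters():
    if param.dim() == 2:
        total += param.numel()
        zeros += (param == 0).sum().item()

print(f"Weight sparsity: {zeros / total:.2%}")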

Direct Use

  • Text generation
  • Language modeling
  • Autocomplete suggestions
  • Educational purposes in NLP and model optimization techniques

Downstream Use

  • Integration into applications requiring efficient language models
  • Deployment on devices with limited computational resources

Out-of-Scope Use

  • Generation of misleading or harmful content
  • Applications requiring understanding of languages other than English
  • Tasks demanding high-precision language understanding beyond the model's capabilities

Bias, Risks, and Limitations

Biases

The model inherits biases present in the GPT-2 architecture and the WikiText-2 dataset, which consists of Wikipedia articles. These biases may include underrepresentation of certain topics or perspectives.

Risks

  • Potential generation of biased or inappropriate content
  • Misinterpretation of generated text as factual information

Limitations

  • Reduced performance compared to the full-sized GPT-2 model due to pruning and quantization
  • Limited to English language understanding and generation
  • Not suitable for tasks requiring real-time processing of large-scale data

Recommendations

Users should:

  • Implement content filtering mechanisms to prevent the generation of inappropriate content.
  • Avoid using the model for critical applications without thorough evaluation.
  • Be aware of the model's limitations in understanding nuanced language and context.

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the pruned & quantized checkpoint from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("swayamsingal/NanoQuant")
model = AutoModelForCausalLM.from_pretrained("swayamsingal/NanoQuant")

# Generate a short continuation of a prompt.
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Training Data

  • Dataset: mikasenghaas/wikitext-2
  • Description: A collection of roughly 2 million tokens extracted from verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

Training Procedure

  • Preprocessing: Standard tokenization and formatting compatible with GPT-2 requirements.
  • Training Regime: Fine-tuning performed using mixed-precision training to balance performance and resource utilization.
  • Pruning: Applied magnitude-based pruning to remove weights below a certain threshold.
  • Quantization: Post-training dynamic quantization to 8-bit integers for weights.
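
A minimal sketch of the two post-training steps named above, built from stock PyTorch utilities. The 30% pruning ratio and the module selection are illustrative assumptions rather than the exact settings used for this checkpoint, and GPT-2's Conv1D blocks would need to be converted to nn.Linear for dynamic quantization to cover them.

import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM
from transformers.pytorch_utils import Conv1D

model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

# Magnitude-based pruning: zero the smallest 30% of weights (by absolute value)
# in every Linear/Conv1D layer, then make the zeros permanent.
for module in model.modules():
    if isinstance(module, (torch.nn.Linear, Conv1D)):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")

# Post-training dynamic quantization: nn.Linear weights are stored as int8,
# activations stay in floating point at runtime.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)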

Hyperparameters

  • Learning Rate: 5e-5
  • Batch Size: 32
  • Epochs: 3
  • Optimizer: AdamW
  • Weight Decay: 0.01
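
These values map directly onto the Hugging Face TrainingArguments, as in the sketch below. The exact arguments used for this checkpoint are not published, so treat the remaining settings (output directory, mixed-precision flag) as assumptions.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gpt2-wikitext2",   # assumed output path
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,             # AdamW is the Trainer's default optimizer
    fp16=False,                    # enable fp16/bf16 where the hardware supports it
)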

Speeds, Sizes, Times

  • Original Model Size: ~500 MB
  • Pruned & Quantized Model Size: ~6 MB
  • Training Time: Approximately 2 hours on an Apple Silicon GPU (PyTorch MPS backend)

Evaluation

Testing Data

  • WikiText-2 validation split (mikasenghaas/wikitext-2)

Metrics

  • Perplexity: 155.43
  • BLEU Score: 0.0498
  • ROUGE-1 Score: 0.1836
  • Accuracy: 93.2%
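
For reference, a perplexity figure of this kind can be reproduced roughly as in the sketch below, which scores the validation split in fixed-length chunks. It assumes the mikasenghaas/wikitext-2 repository exposes a validation split with a text column, as the upstream WikiText-2 release does; the chunk length and per-chunk averaging are simplifications.

import math
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("swayamsingal/NanoQuant")
model = AutoModelForCausalLM.from_pretrained("swayamsingal/NanoQuant")
model.eval()

# Concatenate the validation text and tokenize it once.
text = "\n\n".join(load_dataset("mikasenghaas/wikitext-2", split="validation")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

# Score fixed-length chunks; out.loss is the mean cross-entropy for the chunk.
max_len, losses = 1024, []
for start in range(0, ids.size(1), max_len):
    chunk = ids[:, start : start + max_len]
    if chunk.size(1) < 2:
        continue
    with torch.no_grad():
        out = model(chunk, labels=chunk)
    losses.append(out.loss.item())

print(f"Perplexity: {math.exp(sum(losses) / len(losses)):.2f}")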

Results Summary

The pruned and quantized model achieves competitive performance on the WikiText-2 validation set, with a significant reduction in model size and inference time compared to the original GPT-2 model.

Model Examination

While specific interpretability analyses were not conducted, the model's architecture remains consistent with GPT-2, and standard transformer interpretability techniques can be applied.

Environmental Impact

  • Hardware Type: Apple MacBook with Apple Silicon (PyTorch MPS backend; no CUDA GPU was available)
  • Training Duration: 2 hours
  • Energy Consumption: Approximately 0.5 kWh
  • Carbon Emitted: Estimated 0.2 kg CO₂

Technical Specifications

Model Architecture and Objective

  • Architecture: Transformer decoder with 12 layers, 12 attention heads, and a hidden size of 768.
  • Objective: Causal language modeling (predicting the next token in a sequence).
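
These figures match the standard GPT-2 small configuration and can be checked against the shipped config (assuming the repository includes a transformers-readable config.json):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("swayamsingal/NanoQuant")
print(config.n_layer, config.n_head, config.n_embd)  # expected: 12 12 768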

Compute Infrastructure

  • Hardware: Apple Silicon GPU on a MacBook (PyTorch MPS backend)
  • Software: PyTorch, Transformers library by Hugging Face

Citation

If you use this model, please cite:

@misc{NanoQuant,
  title={NanoQuant},
  author={swayamsingal},
  year={2025},
  howpublished={\url{https://huggingface.co/swayamsingal/NanoQuant}},
}

Glossary

  • Pruning: The process of removing weights from a neural network to reduce its size and computational requirements.
  • Quantization: The process of reducing the precision of the weights in a neural network, typically to 8-bit integers, to decrease model size and increase inference speed.