Model Card: Pruned & Quantized GPT-2 Fine-Tuned on WikiText-2

Model Summary

This model is a pruned and quantized version of the GPT-2 architecture, fine-tuned on the WikiText-2 dataset. The pruning and quantization techniques reduce the model's size and computational requirements, making it suitable for deployment in resource-constrained environments, such as edge devices or applications with limited computational power.

Model Details

Developed by

  • swayamsingal

Model Description

  • Architecture: GPT-2 (Generative Pre-trained Transformer 2)
  • Model Type: Transformer-based language model
  • Base Model: openai-community/gpt2
  • Language: English
  • License: MIT
  • Fine-tuned on: mikasenghaas/wikitext-2
  • Modifications:
    • Pruning: Redundant weights removed to decrease model size and inference time.
    • Quantization: Weights quantized to 8-bit integers to reduce memory footprint and improve efficiency.
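
As a quick way to see the effect of these modifications, the sketch below loads the released checkpoint and reports the fraction of weights that were zeroed out by pruning. It assumes the weights load through transformers exactly as shown in "How to Get Started with the Model" below; the reported sparsity depends on the actual checkpoint.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("swayamsingal/NanoQuant")

# Count zero entries in the 2-D weight matrices (1-D biases and LayerNorm
# parameters are skipped).
total, zeros = 0, 0
for name, param in model.named_parameters():
    if param.dim() == 2:
        total += param.numel()
        zeros += (param == 0).sum().item()

print(f"Weight sparsity: {zeros / total:.2%}")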

Direct Use

  • Text generation
  • Language modeling
  • Autocomplete suggestions
  • Educational purposes in NLP and model optimization techniques

Downstream Use

  • Integration into applications requiring efficient language models
  • Deployment on devices with limited computational resources

Out-of-Scope Use

  • Generation of misleading or harmful content
  • Applications requiring understanding of languages other than English
  • Tasks demanding high-precision language understanding beyond the model's capabilities

Bias, Risks, and Limitations

Biases

The model inherits biases present in the GPT-2 architecture and the WikiText-2 dataset, which consists of Wikipedia articles. These biases may include underrepresentation of certain topics or perspectives.

Risks

  • Potential generation of biased or inappropriate content
  • Misinterpretation of generated text as factual information

Limitations

  • Reduced performance compared to the full-sized GPT-2 model due to pruning and quantization
  • Limited to English language understanding and generation
  • Not suitable for tasks requiring real-time processing of large-scale data

Recommendations

Users should:

  • Implement content filtering mechanisms to prevent the generation of inappropriate content.
  • Avoid using the model for critical applications without thorough evaluation.
  • Be aware of the model's limitations in understanding nuanced language and context.

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the pruned & quantized checkpoint from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("swayamsingal/NanoQuant")
model = AutoModelForCausalLM.from_pretrained("swayamsingal/NanoQuant")

# Generate a short continuation of a prompt.
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Training Data

  • Dataset: mikasenghaas/wikitext-2
  • Description: A collection of roughly 2 million tokens extracted from verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

Training Procedure

  • Preprocessing: Standard tokenization and formatting compatible with GPT-2 requirements.
  • Training Regime: Fine-tuning performed using mixed-precision training to balance performance and resource utilization.
  • Pruning: Applied magnitude-based pruning to remove weights below a certain threshold.
  • Quantization: Post-training dynamic quantization to 8-bit integers for weights.
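
A minimal sketch of the two post-training steps named above, built from stock PyTorch utilities. The 30% pruning ratio and the module selection are illustrative assumptions rather than the exact settings used for this checkpoint, and GPT-2's Conv1D blocks would need to be converted to nn.Linear for dynamic quantization to cover them.

import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM
from transformers.pytorch_utils import Conv1D

model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

# Magnitude-based pruning: zero the smallest 30% of weights (by absolute value)
# in every Linear/Conv1D layer, then make the zeros permanent.
for module in model.modules():
    if isinstance(module, (torch.nn.Linear, Conv1D)):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")

# Post-training dynamic quantization: nn.Linear weights are stored as int8,
# activations stay in floating point at runtime.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)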

Hyperparameters

  • Learning Rate: 5e-5
  • Batch Size: 32
  • Epochs: 3
  • Optimizer: AdamW
  • Weight Decay: 0.01
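
These values map directly onto the Hugging Face TrainingArguments, as in the sketch below. The exact arguments used for this checkpoint are not published, so treat the remaining settings (output directory, mixed-precision flag) as assumptions.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gpt2-wikitext2",   # assumed output path
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,             # AdamW is the Trainer's default optimizer
    fp16=False,                    # enable fp16/bf16 where the hardware supports it
)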

Speeds, Sizes, Times

  • Original Model Size: ~500 MB
  • Pruned & Quantized Model Size: ~6 MB
  • Training Time: Approximately 2 hours on an Apple Silicon GPU (PyTorch MPS backend)

Evaluation

Testing Data

  • WikiText-2 validation split (mikasenghaas/wikitext-2)

Metrics

  • Perplexity: 155.43
  • BLEU Score: 0.0498
  • ROUGE-1 Score: 0.1836
  • Accuracy: 93.2%
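
For reference, a perplexity figure of this kind can be reproduced roughly as in the sketch below, which scores the validation split in fixed-length chunks. It assumes the mikasenghaas/wikitext-2 repository exposes a validation split with a text column, as the upstream WikiText-2 release does; the chunk length and per-chunk averaging are simplifications.

import math
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("swayamsingal/NanoQuant")
model = AutoModelForCausalLM.from_pretrained("swayamsingal/NanoQuant")
model.eval()

# Concatenate the validation text and tokenize it once.
text = "\n\n".join(load_dataset("mikasenghaas/wikitext-2", split="validation")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

# Score fixed-length chunks; out.loss is the mean cross-entropy for the chunk.
max_len, losses = 1024, []
for start in range(0, ids.size(1), max_len):
    chunk = ids[:, start : start + max_len]
    if chunk.size(1) < 2:
        continue
    with torch.no_grad():
        out = model(chunk, labels=chunk)
    losses.append(out.loss.item())

print(f"Perplexity: {math.exp(sum(losses) / len(losses)):.2f}")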

Results Summary

The pruned and quantized model achieves competitive performance on the WikiText-2 validation set, with a significant reduction in model size and inference time compared to the original GPT-2 model.

Model Examination

While specific interpretability analyses were not conducted, the model's architecture remains consistent with GPT-2, and standard transformer interpretability techniques can be applied.

Environmental Impact

  • Hardware Type: Apple MacBook with Apple Silicon (PyTorch MPS backend; no CUDA GPU was available)
  • Training Duration: 2 hours
  • Energy Consumption: Approximately 0.5 kWh
  • Carbon Emitted: Estimated 0.2 kg CO₂

Technical Specifications

Model Architecture and Objective

  • Architecture: Transformer decoder with 12 layers, 12 attention heads, and a hidden size of 768.
  • Objective: Causal language modeling (predicting the next token in a sequence).
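
These figures match the standard GPT-2 small configuration and can be checked against the shipped config (assuming the repository includes a transformers-readable config.json):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("swayamsingal/NanoQuant")
print(config.n_layer, config.n_head, config.n_embd)  # expected: 12 12 768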

Compute Infrastructure

  • Hardware: Apple Silicon GPU on a MacBook (PyTorch MPS backend)
  • Software: PyTorch, Transformers library by Hugging Face

Citation

If you use this model, please cite:

@misc{NanoQuant,
  title={NanoQuant},
  author={swayamsingal},
  year={2025},
  howpublished={\url{https://huggingface.co/swayamsingal/NanoQuant}},
}

Glossary

  • Pruning: The process of removing weights from a neural network to reduce its size and computational requirements.
  • Quantization: The process of reducing the precision of the weights in a neural network, typically to 8-bit integers, to decrease model size and increase inference speed.