
ZamAI Bloom Pashto - checkpoint5207 (and Final Model)

This model card is for checkpoint5207 and the final fine-tuned version of a Bloom model for Pashto text generation, developed under the ZamAI Bloom project.

Model Description

This model is a fine-tuned version of bigscience/bloom-560m on a Pashto text corpus. The goal of this project was to create a language model proficient in generating coherent and contextually relevant Pashto text.

  • Base Model: bigscience/bloom-560m
  • Fine-tuning Checkpoint: checkpoint5207
  • Final Model: tasal9/zamai-bloom-ps-final

Intended Uses & Limitations

Intended Uses

This model is intended for:

  • Generating Pashto text.
  • Assisting with Pashto language content creation.
  • Research in Pashto NLP.
  • Educational purposes for Pashto language learning.

Limitations and Bias

  • The model's performance is dependent on the quality and diversity of the training data. It may generate text that reflects biases present in the data.
  • It might produce factually incorrect or nonsensical text, especially for complex topics or out-of-domain prompts.
  • The model may not be suitable for critical applications without further evaluation and mitigation of potential harms.
  • Performance on specific Pashto dialects might vary depending on their representation in the training data.

How to use

You can use this model with the Hugging Face transformers library for text generation.

First, install the library:

pip install transformers torch

Then, you can use the model in Python:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tasal9/zamai-bloom-ps-final" # Or the specific checkpoint identifier if using a checkpoint directly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "په پښتو ژبه کې یو شعر ولیکئ د پسرلي په اړه" # Example prompt: "Write a poem in Pashto about spring"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text
# Adjust generation parameters as needed (max_length, num_beams, do_sample, top_k, top_p, etc.)
outputs = model.generate(**inputs, max_length=100, num_beams=5, early_stopping=True)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_text)
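
The snippet above uses beam search, which tends to produce fluent but conservative output. For more varied generations, sampling can be used instead; the parameter values below are illustrative defaults, not settings validated for this model.

# Sampling-based generation (illustrative values; tune for your use case)
outputs = model.generate(
    **inputs,
    max_length=100,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))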

Training Data

Describe the dataset(s) used for fine-tuning.

  • Source: [e.g., Web scraped data, specific Pashto corpora, data from datasets/base_pashto/]
  • Size: [e.g., Number of documents, tokens, GBs]
  • Preprocessing: [e.g., Cleaning steps, tokenization details]
  • Language Variety: [e.g., Predominant dialects, formal/informal text]

If your dataset is on the Hugging Face Hub, link to it.
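
As a rough illustration only, a local plain-text corpus (the file path is taken from the Training Details section below) could be loaded with the datasets library as sketched here; the 10% validation split is an assumption, and this is not the project's actual prepare_base_dataset.py pipeline.

from datasets import load_dataset

# Path taken from the Training Details section below; adjust as needed.
raw_datasets = load_dataset(
    "text", data_files={"train": "pashto_data/base_model/cleaned_base_data.txt"}
)

# Hold out a small validation split (10% is an arbitrary choice).
raw_datasets = raw_datasets["train"].train_test_split(test_size=0.1, seed=42)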

Training Procedure

Preprocessing

The texts were tokenized using the AutoTokenizer associated with the base Bloom model. [Add any other specific preprocessing steps you took.]
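
The exact preprocessing script is not included in this card; a minimal sketch of a standard causal-LM tokenization step with the base model's tokenizer, continuing from the loading sketch above and assuming truncation to 512-token blocks, might look like this.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

def tokenize_function(examples):
    # The 512-token truncation length is an assumed value, not taken from the card.
    return tokenizer(examples["text"], truncation=True, max_length=512)

# raw_datasets comes from the data-loading sketch in the Training Data section.
tokenized_datasets = raw_datasets.map(
    tokenize_function, batched=True, remove_columns=["text"]
)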

Fine-tuning

The model was fine-tuned using the Hugging Face transformers library with PyTorch (a configuration sketch based on the hyperparameters below follows the list).

  • Training script: [Link to your train_base_model.py if applicable]
  • Hyperparameters:
    • Learning rate: 2e-5
    • Batch size: 4 (adjust to available GPU memory, e.g., 8 or 16)
    • Number of epochs: 3 (adjust based on convergence and overfitting)
    • Optimizer: AdamW
    • Weight decay: 0.01
    • Warmup steps: 500 (alternatively a warmup ratio, e.g., 0.1)
    • Gradient accumulation steps: 1 (increase to raise the effective batch size when GPU memory limits the per-device batch size)
    • Seed: 42 (for reproducibility)
  • Infrastructure:
    • Hardware: [e.g., 1x NVIDIA A100 40GB, or specify your hardware]
    • Training time: [e.g., X hours]
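
The training script itself is not shown in this card; the sketch below expresses the listed hyperparameters with the transformers Trainer API. It is a plausible reconstruction, not the actual train_base_model.py, and the output directory is a placeholder.

from transformers import (
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

training_args = TrainingArguments(
    output_dir="zamai-bloom-ps",     # placeholder output path
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    warmup_steps=500,
    gradient_accumulation_steps=1,
    seed=42,
)                                    # AdamW is the Trainer's default optimizer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],  # from the preprocessing sketch above
    eval_dataset=tokenized_datasets["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()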

This specific model card refers to checkpoint5207, which was saved at step 5207 of the training process. The final model represents the model after the completion of all training epochs/steps.

Evaluation Results

Provide quantitative results if available (e.g., perplexity or BLEU scores on a held-out test set); a perplexity-estimation sketch follows this list.

  • Test set: [Describe your test set]
  • Metrics: [e.g., Perplexity, BLEU, ROUGE]
  • Results for checkpoint5207:
  • Results for final model:

Qualitative observations can also be included.
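
No evaluation script accompanies this card; perplexity on a held-out Pashto test set could be estimated roughly as below. test_texts is a placeholder for a list of held-out strings, and the unweighted average over examples is a simplification (a token-weighted average is more precise).

import math
import torch

model.eval()
losses = []
for text in test_texts:  # test_texts: held-out Pashto strings (placeholder)
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    losses.append(out.loss.item())

# Perplexity is the exponential of the mean cross-entropy loss.
print("perplexity:", math.exp(sum(losses) / len(losses)))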

Model Card Contact

Author: Yaqoob Tasal
Username: tasal9
Organization: ZamAI
GitHub: https://github.com/tasal9

Citation

If you use this model or its checkpoints, please consider citing:

@misc{zamai_bloom_pashto_2025,
  author    = {Yaqoob Tasal},
  title     = {ZamAI Bloom Pashto - Fine-tuned Language Model},
  year      = {2025},
  publisher = {Hugging Face},
  journal   = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/tasal9/zamai-bloom-ps-final}}
}

And the original Bloom model:

@article{scao2022bloom,
  title={BLOOM: A 176B-Parameter Open-Access Multilingual Language Model},
  author={Le Scao, Teven and Fan, Angela and Akiki, Christopher and others},
  journal={arXiv preprint arXiv:2211.05100},
  year={2022}
}

Note: remaining placeholders (dataset details, hyperparameters, evaluation results) should be replaced with the actual project details; this file serves as the README.md of the model repository on the Hugging Face Hub.

Training Details (Cleaned Base Model - June 2025)

This model version was trained from bigscience/bloom-560m using the train_base_model.py script.

  • Training Data: The model was trained on a locally prepared dataset located at datasets/base_pashto_clean. This dataset was created using prepare_base_dataset.py and is derived from pashto_data/base_model/cleaned_base_data.txt, which primarily contains Pashto text from a bilingual Pashto-English glossary.
  • Training Objective: To establish a foundational Pashto language model with improved coherence and fewer issues (e.g., repetition, off-language generation) than earlier versions trained on noisier data.
  • Output Directory (during training): models/pashto-bloom-base-clean-colab
  • Key Training Hyperparameters (see the configuration sketch after this list):
    • Epochs: 3
    • Per Device Batch Size: 2
    • Gradient Accumulation Steps: 4
    • Learning Rate: 5e-5
    • FP16 (Mixed Precision): True
    • Optimizer: AdamW
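
For reference, these hyperparameters map onto TrainingArguments roughly as follows. This is a sketch, not the actual train_base_model.py; the effective batch size is 2 × 4 = 8 per device.

from transformers import TrainingArguments

# Sketch of the cleaned-base run configuration (output_dir taken from above).
training_args = TrainingArguments(
    output_dir="models/pashto-bloom-base-clean-colab",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch size: 2 x 4 = 8 per device
    learning_rate=5e-5,
    fp16=True,                       # mixed precision
)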