# ZamAI Bloom Pashto - checkpoint5207 (and Final Model)

This model card is for `checkpoint5207` and the final fine-tuned version of a Bloom model for Pashto text generation, developed under the ZamAI Bloom project.
## Model Description

This model is a fine-tuned version of [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m) on a Pashto text corpus. The goal of this project was to create a language model proficient in generating coherent and contextually relevant Pashto text.

- Base Model: `bigscience/bloom-560m`
- Fine-tuning Checkpoint: `checkpoint5207`
- Final Model: [tasal9/zamai-bloom-ps-final](https://huggingface.co/tasal9/zamai-bloom-ps-final)
## Intended Uses & Limitations

### Intended Uses

This model is intended for:
- Generating Pashto text.
- Assisting with Pashto language content creation.
- Research in Pashto NLP.
- Educational purposes for Pashto language learning.
### Limitations and Bias
- The model's performance is dependent on the quality and diversity of the training data. It may generate text that reflects biases present in the data.
- It might produce factually incorrect or nonsensical text, especially for complex topics or out-of-domain prompts.
- The model may not be suitable for critical applications without further evaluation and mitigation of potential harms.
- Performance on specific Pashto dialects might vary depending on their representation in the training data.
## How to use

You can use this model with the Hugging Face `transformers` library for text generation.

First, install the required dependencies:

```bash
pip install transformers torch
```

Then, you can use the model in Python:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tasal9/zamai-bloom-ps-final"  # Or the specific checkpoint identifier if using a checkpoint directly

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "په پښتو ژبه کې یو شعر ولیکئ د پسرلي په اړه"  # Example prompt: "Write a poem in Pashto about spring"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text
# Adjust generation parameters as needed (max_length, num_beams, do_sample, top_k, top_p, etc.)
outputs = model.generate(**inputs, max_length=100, num_beams=5, early_stopping=True)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
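
Alternatively, the model can be loaded through the high-level `pipeline` API. The snippet below is a minimal sketch using sampling-based generation; the sampling parameters shown are illustrative defaults, not values tuned for this model:

```python
from transformers import pipeline

# Load the fine-tuned model into a text-generation pipeline.
generator = pipeline("text-generation", model="tasal9/zamai-bloom-ps-final")

prompt = "په پښتو ژبه کې یو شعر ولیکئ د پسرلي په اړه"  # "Write a poem in Pashto about spring"

# do_sample with top_k/top_p/temperature gives more varied output than beam search;
# these values are illustrative, not tuned for this model.
result = generator(
    prompt,
    max_length=100,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
)
print(result[0]["generated_text"])
```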
## Training Data

Describe the dataset(s) used for fine-tuning.

- Source: [e.g., web-scraped data, specific Pashto corpora, data from `datasets/base_pashto/`]
- Size: [e.g., number of documents, tokens, GBs]
- Preprocessing: [e.g., Cleaning steps, tokenization details]
- Language Variety: [e.g., Predominant dialects, formal/informal text]
If your dataset is on the Hugging Face Hub, link to it.
## Training Procedure

### Preprocessing

The texts were tokenized using the `AutoTokenizer` associated with the base Bloom model.

[Add any other specific preprocessing steps you took.]
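
As a rough illustration, tokenizing a plain-text Pashto corpus with the base model's tokenizer could look like the sketch below; the file name, context length, and column name are illustrative assumptions, not confirmed details of this project:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenizer of the base model, so the fine-tuned weights share its vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

# Hypothetical plain-text corpus with one Pashto document per line.
dataset = load_dataset("text", data_files={"train": "pashto_corpus.txt"})

def tokenize(batch):
    # Truncate to a fixed context length; 512 is an illustrative choice.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
```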
### Fine-tuning

The model was fine-tuned using the Hugging Face `transformers` library with PyTorch.

- Training script: [Link to your `train_base_model.py` if applicable]
- Hyperparameters (see the configuration sketch after this list):
  - Learning rate: 2e-5
  - Batch size: 4 (adjust based on your GPU memory, e.g., 8 or 16)
  - Number of epochs: 3 (adjust based on convergence and overfitting)
  - Optimizer: AdamW
  - Weight decay: 0.01
  - Warmup steps: 500 (or a warmup ratio, e.g., 0.1)
  - Gradient accumulation steps: 1 (increase if the per-device batch size is limited by GPU memory)
  - Seed: 42 (for reproducibility)
- Infrastructure:
  - Hardware: [e.g., 1x NVIDIA A100 40GB, or specify your hardware]
  - Training time: [e.g., X hours]
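
As referenced above, a minimal `TrainingArguments` configuration matching these hyperparameters might look like the sketch below; it illustrates the setup described in this list and is not the exact contents of `train_base_model.py` (the output directory and save/logging intervals are assumptions):

```python
from transformers import TrainingArguments

# Sketch of a configuration matching the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="models/zamai-bloom-ps",  # illustrative output path
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    warmup_steps=500,
    gradient_accumulation_steps=1,
    seed=42,
    save_steps=500,     # illustrative checkpointing interval
    logging_steps=100,  # illustrative logging interval
)
# AdamW is the Trainer's default optimizer, so it needs no explicit flag here.
```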
This specific model card refers to `checkpoint5207`, which was saved at step 5207 of the training process. The final model corresponds to the state of the weights after all training epochs/steps were completed.
## Evaluation Results

Provide quantitative results if available (e.g., perplexity, BLEU scores on a held-out test set).
- Test set: [Describe your test set]
- Metrics: [e.g., Perplexity, BLEU, ROUGE]
- Results for checkpoint5207:
- Results for final model:
Qualitative observations can also be included.
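
For example, held-out perplexity could be estimated with a sketch like the following; the test file path is a hypothetical placeholder, and this is not an evaluation script used by the project:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tasal9/zamai-bloom-ps-final"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Hypothetical held-out Pashto test file, one document per line.
with open("pashto_test.txt", encoding="utf-8") as f:
    texts = [line.strip() for line in f if line.strip()]

losses = []
with torch.no_grad():
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        outputs = model(**inputs, labels=inputs["input_ids"])
        losses.append(outputs.loss.item())

# Perplexity = exp(mean cross-entropy loss); averaged per document here for simplicity
# rather than weighted by token count.
print("Perplexity:", math.exp(sum(losses) / len(losses)))
```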
## Model Card Contact

- Author: Yaqoob Tasal
- Username: tasal9
- Organization: ZamAI
- GitHub: https://github.com/tasal9
## Citation

If you use this model or its checkpoints, please consider citing:

```bibtex
@misc{zamai_bloom_pashto_2025,
  author       = {Yaqoob Tasal},
  title        = {ZamAI Bloom Pashto - Fine-tuned Language Model},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/tasal9/zamai-bloom-ps-final}}
}
```
And the original BLOOM model:

```bibtex
@article{scao2022bloom,
  title   = {BLOOM: A 176B-Parameter Open-Access Multilingual Language Model},
  author  = {Scao, Teven Le and Fan, Angela and Akiki, Christopher and others},
  journal = {arXiv preprint arXiv:2211.05100},
  year    = {2022}
}
```
## Training Details (Cleaned Base Model - June 2025)

This model version was trained from `bigscience/bloom-560m` using the `train_base_model.py` script.

- Training Data: The model was trained on a locally prepared dataset located at `datasets/base_pashto_clean`. This dataset was created using `prepare_base_dataset.py` and is derived from `pashto_data/base_model/cleaned_base_data.txt`, which primarily contains Pashto text from a bilingual Pashto-English glossary.
- Training Objective: To establish a foundational Pashto language model with improved coherence and fewer issues (e.g., repetition, off-language generation) compared to prior versions trained on noisier data.
- Output Directory (during training): `models/pashto-bloom-base-clean-colab`
- Key Training Hyperparameters:
  - Epochs: 3
  - Per-Device Batch Size: 2
  - Gradient Accumulation Steps: 4
  - Learning Rate: 5e-5
  - FP16 (Mixed Precision): True
  - Optimizer: AdamW
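
As a rough end-to-end sketch of how these settings fit together (effective batch size of 2 × 4 = 8 per device), see below; the use of `datasets.load_from_disk` and a pre-tokenized `train` split are assumptions about how `prepare_base_dataset.py` stored the data, not confirmed details:

```python
from datasets import load_from_disk
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

# Assumes the cleaned, already-tokenized dataset was saved with datasets' save_to_disk.
dataset = load_from_disk("datasets/base_pashto_clean")

training_args = TrainingArguments(
    output_dir="models/pashto-bloom-base-clean-colab",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size of 8 per device
    learning_rate=5e-5,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],  # assumes a DatasetDict with a "train" split
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```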