# Model Card: GPT-2 Fine-tuned for the Karakalpak Language

## Model Overview
This is a GPT-2 (Generative Pre-trained Transformer 2) model fine-tuned for the Karakalpak language. The model is designed to generate coherent, contextually relevant text in Karakalpak and is capable of basic conversational responses based on its training data.
## Training Data
The model was fine-tuned on a custom dataset combining:
- Approximately 300,000 lines of parallel text data
- Approximately 150 question-answer pairs of conversational dialogue in Karakalpak, focusing on simple greetings and basic exchanges.
Data Preparation:
- The training data was consolidated into a single text file.
- Special tokens (`<|startoftext|>`, `<|endoftext|>`) were used to delineate conversational turns in the dialogue data, allowing the model to learn conversational structure.
- Karakalpak-specific characters (Á, á, Ǵ, ǵ, Ń, ń, Ó, ó, Ú, ú, Í, ı) were added as special tokens to the tokenizer to ensure proper handling of the language's unique alphabet (see the sketch below).
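The preparation script itself is not included in this card. The following is a minimal sketch of how these tokens and characters might be registered with the tokenizer using the standard Transformers API; the `<|sep|>` token is taken from the usage example further below, and everything else follows the list above:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Dialogue delimiters described above; <|sep|> separates question from answer
# in the dialogue prompts shown under "How to Use".
tokenizer.add_special_tokens({
    "bos_token": "<|startoftext|>",
    "eos_token": "<|endoftext|>",
    "additional_special_tokens": ["<|sep|>"],
})

# Karakalpak-specific characters, added so they are not split into byte pieces.
tokenizer.add_tokens(["Á", "á", "Ǵ", "ǵ", "Ń", "ń", "Ó", "ó", "Ú", "ú", "Í", "ı"])

# Grow the embedding matrix to cover the newly added tokens.
model.resize_token_embeddings(len(tokenizer))
```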
## Training Details
- Base Model: `gpt2` (from Hugging Face Transformers)
- Framework: PyTorch with the Hugging Face Transformers `Trainer` API
- GPU: NVIDIA RTX 6000 Ada (48 GB VRAM)
- Batch Size: 8 (or higher, depending on VRAM)
- Number of Epochs: 5
- Optimizer: AdamW
- Learning Rate: 5e-5
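The training script is not part of this card. Below is a minimal sketch of a fine-tuning setup consistent with the settings above (5 epochs, batch size 8, learning rate 5e-5, AdamW via the `Trainer` default); the corpus file name, maximum sequence length, and output directory are assumptions:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")
# (special tokens and Karakalpak characters would be added here, as in the tokenizer sketch above)

# Hypothetical path to the consolidated training file described under "Training Data".
dataset = load_dataset("text", data_files={"train": "karakalpak_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_dataset = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Causal language modelling: labels are the inputs themselves, so mlm=False.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2_karakalpak",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=5e-5,   # AdamW is the Trainer's default optimizer
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=collator,
)
trainer.train()
```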
## Performance Metrics
After fine-tuning, the model achieved the following performance on the training dataset:
- Final Training Loss: 6.68
- Perplexity (PPL): exp(2.68) ≈ 14.58
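Perplexity for a causal language model is the exponential of the mean token-level cross-entropy loss. A minimal sketch of the computation (the `eval_loss` lookup and the loss value shown are illustrative, not additional reported numbers):

```python
import math

def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """PPL = exp(loss), where loss is the mean per-token cross-entropy in nats."""
    return math.exp(mean_cross_entropy)

# With a Trainer, the loss is available as trainer.evaluate()["eval_loss"].
print(perplexity_from_loss(2.68))  # ≈ 14.58
```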
## How to Use
You can use this model with the `transformers` library in Python for text generation:
```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "nickoo004/gpt2_karakalpak"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Use the first GPU if one is available, otherwise run on CPU.
device = 0 if torch.cuda.is_available() else -1
karakalpak_generator = pipeline('text-generation', model=model, tokenizer=tokenizer, device=device)

# Plain text generation
prompt_text = "Nókis qalası"
generated_text = karakalpak_generator(prompt_text, max_new_tokens=50, num_return_sequences=1)[0]['generated_text']
print(f"Generated text: {generated_text}")

# Dialogue-style generation using the special tokens from training
dialog_prompt = "<|startoftext|>Ayta alasiz ba, sizdi kim jaratqan?<|sep|>"
generated_dialog = karakalpak_generator(dialog_prompt, max_new_tokens=100, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)[0]['generated_text']

# Keep only the text before the end-of-text token, if the model produced one.
if '<|endoftext|>' in generated_dialog:
    generated_dialog = generated_dialog.split('<|endoftext|>')[0]
print(f"Generated dialogue: {generated_dialog}")
```