---
license: apache-2.0
language:
- ku
- en
metrics:
- accuracy
pipeline_tag: text-classification
library_name: adapter-transformers
---

# Kurdish Language Detector Model

This model is a fine-tuned version of `abdulhade/RoBERTa-large-SizeCorpus_1B` for language identification: given a text segment, it classifies the text as Kurdish or English. It was trained on a custom bilingual corpus of the two languages.

## Model Overview

- **Model Type**: Text classification (language detection)
- **Base Model**: `abdulhade/RoBERTa-large-SizeCorpus_1B`
- **Languages Supported**: English, Kurdish
- **Training Data**: Custom bilingual corpus of English and Kurdish text
- **Primary Use Case**: Identifying whether input text is in English or Kurdish

## Model Performance

The model achieved the following results on its evaluation set:

- **Evaluation Loss**: 0.0012
- **Evaluation Accuracy**: 99.99%
- **Evaluation F1 Score**: 0.9999
- **Evaluation Precision**: 0.99999
- **Evaluation Recall**: 0.99983

### Training Details

- **Training Loss**: 0.027
- **Training Runtime**: 40,500.85 seconds
- **Samples per Second (Training)**: 72.35
- **Steps per Second (Training)**: 4.52
- **Epochs**: 3

### Evaluation Details

- **Evaluation Runtime**: 4,111.17 seconds
- **Samples per Second (Evaluation)**: 237.58
- **Steps per Second (Evaluation)**: 14.85

### Hardware and Environment

- **Training Environment**: GPU-accelerated hardware
- **Default Inference Device**: CPU (pass `device=0` to the pipeline to run on the first GPU)
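
Rather than hard-coding `device=0`, the device can be chosen at runtime. This is a minimal sketch, assuming `torch` is installed; the commented-out pipeline call mirrors the Quickstart below:

```python
import torch

# transformers pipelines accept device=-1 for CPU and device=N for GPU N
device = 0 if torch.cuda.is_available() else -1

# The detector can then be created on the chosen device, e.g.:
# kurdish_detector = pipeline('text-classification',
#                             model='abdulhade/kurdishRoBERTa-language-detector-1B',
#                             device=device)
```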

## Quickstart Guide

### Installation

Ensure you have the `transformers` library and `torch` installed:

```bash
pip install transformers torch
```

### Usage

```python
from transformers import pipeline

# Load the Kurdish Language Detector
kurdish_detector = pipeline('text-classification',
                            model='abdulhade/kurdishRoBERTa-language-detector-1B',
                            tokenizer='abdulhade/kurdishRoBERTa-language-detector-1B')

# Perform a prediction
result = kurdish_detector("Insert your text here")
print(result)  # Outputs: [{'label': 'LABEL_1', 'score': <probability>}]

# Custom function to map the raw labels to language names
def map_labels(prediction):
    label_mapping = {
        'LABEL_0': 'English',
        'LABEL_1': 'Kurdish'
    }
    # Map the label and keep the score as is
    return {'label': label_mapping[prediction['label']], 'score': prediction['score']}

# Test the model with new input and map the labels
input_text_1 = "Hello World"
input_text_2 = "Hi dear برام دەنگ و باست"

# Get predictions
predictions_1 = kurdish_detector(input_text_1)
predictions_2 = kurdish_detector(input_text_2)

# Map and print results
mapped_predictions_1 = [map_labels(pred) for pred in predictions_1]
mapped_predictions_2 = [map_labels(pred) for pred in predictions_2]
print(input_text_1)
print(mapped_predictions_1)  # Expected output: [{'label': 'English', 'score': <score>}]
print(input_text_2)
print(mapped_predictions_2)  # Expected output: [{'label': 'Kurdish', 'score': <score>}]
```
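
The metrics above are measured on in-domain data; real-world input can be mixed-language or out-of-domain. A small wrapper around the label mapping (a hypothetical helper, not part of the model) can flag low-confidence predictions instead of forcing a language label:

```python
def map_labels_with_threshold(predictions, threshold=0.9):
    """Map raw pipeline labels to language names, flagging low-confidence results.

    `predictions` is the list of {'label': ..., 'score': ...} dicts that the
    pipeline returns; labels outside the known set map to 'Unknown'.
    """
    label_mapping = {'LABEL_0': 'English', 'LABEL_1': 'Kurdish'}
    mapped = []
    for pred in predictions:
        language = label_mapping.get(pred['label'], 'Unknown')
        if pred['score'] < threshold:
            language = 'Uncertain'
        mapped.append({'label': language, 'score': pred['score']})
    return mapped

# Works directly on the pipeline's output shape:
sample = [{'label': 'LABEL_1', 'score': 0.98}, {'label': 'LABEL_0', 'score': 0.55}]
print(map_labels_with_threshold(sample))
# [{'label': 'Kurdish', 'score': 0.98}, {'label': 'Uncertain', 'score': 0.55}]
```

The 0.9 threshold is an arbitrary starting point; tune it against your own data.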