---
language: tr
---
|
|
|
# An easy-to-use NER application for the Turkish language

**An easy-to-use Python NER (BERT + transfer learning) (named entity recognition) model for Turkish...**
|
|
|
|
|
|
|
# Citation
|
|
|
Please cite the following if you use this model in your work:
|
|
|
|
|
```
@misc{yildirim2024finetuning,
      title={Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks},
      author={Savas Yildirim},
      year={2024},
      eprint={2401.17396},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@book{yildirim2021mastering,
      title={Mastering Transformers: Build state-of-the-art models from scratch with advanced natural language processing techniques},
      author={Yildirim, Savas and Asgari-Chenaghlu, Meysam},
      year={2021},
      publisher={Packt Publishing Ltd}
}
```
|
|
|
|
|
# Other details
|
|
|
|
|
Thanks to @stefan-it, I applied the following steps for training:
|
|
|
|
|
```
cd tr-data

for file in train.txt dev.txt test.txt labels.txt
do
  wget https://schweter.eu/storage/turkish-bert-wikiann/$file
done

cd ..
```
|
This downloads the pre-processed dataset, with training, dev, and test splits, into the tr-data folder.
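
As a quick sanity check that the download worked, the splits can be inspected with a few lines of Python. This is a minimal sketch assuming the usual CoNLL-style layout these files use: one token and its label per line, with blank lines separating sentences.

```
# Count sentences in each split; assumes one "token label" pair per
# line with blank lines separating sentences (the format the run_ner
# example scripts expect).
from pathlib import Path

def count_sentences(path):
    sentences, in_sentence = 0, False
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            in_sentence = True
        elif in_sentence:
            sentences += 1
            in_sentence = False
    return sentences + (1 if in_sentence else 0)

for split in ("train.txt", "dev.txt", "test.txt"):
    print(split, count_sentences(f"tr-data/{split}"))

# The label inventory used for training
print(Path("tr-data/labels.txt").read_text(encoding="utf-8").split())
```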
|
|
|
Run fine-tuning

After downloading the dataset, fine-tuning can be started. Just set the following environment variables:
|
```
export MAX_LENGTH=128
export BERT_MODEL=dbmdz/bert-base-turkish-cased
export OUTPUT_DIR=tr-new-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=625
export SEED=1
```
|
Then run fine-tuning:
|
```
python3 run_ner_old.py --data_dir ./tr-data \
  --model_type bert \
  --labels ./tr-data/labels.txt \
  --model_name_or_path $BERT_MODEL \
  --output_dir $OUTPUT_DIR-$SEED \
  --max_seq_length $MAX_LENGTH \
  --num_train_epochs $NUM_EPOCHS \
  --per_gpu_train_batch_size $BATCH_SIZE \
  --save_steps $SAVE_STEPS \
  --seed $SEED \
  --do_train \
  --do_eval \
  --do_predict \
  --fp16
```
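
When the run finishes, the script writes its metrics as eval_results.txt and test_results.txt under the output directory ($OUTPUT_DIR-$SEED, i.e. tr-new-model-1 with the settings above); the results sections below were read from exactly those files. A small sketch to print them:

```
# Print the metric files written into the output directory; the path
# assumes OUTPUT_DIR=tr-new-model and SEED=1, as exported above.
from pathlib import Path

out_dir = Path("tr-new-model-1")
for name in ("eval_results.txt", "test_results.txt"):
    results = out_dir / name
    if results.exists():
        print(f"== {name} ==")
        print(results.read_text(encoding="utf-8"))
```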
|
|
|
|
|
# Usage
|
|
|
```
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("savasy/bert-base-turkish-ner-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-ner-cased")
ner = pipeline("ner", model=model, tokenizer=tokenizer)
ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı.")
```
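
The pipeline above returns one prediction per sub-word token. If your installed transformers version is recent enough (4.x), you can ask the pipeline to merge sub-words into whole entity spans via aggregation_strategy; a minimal sketch:

```
# Merge sub-word tokens into whole entity spans; aggregation_strategy
# requires a reasonably recent transformers release (4.x).
from transformers import pipeline

ner_grouped = pipeline(
    "ner",
    model="savasy/bert-base-turkish-ner-cased",
    aggregation_strategy="simple",
)
for entity in ner_grouped("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı."):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```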
|
# Some results

Data1: results on the dataset downloaded above

Eval results:
|
|
|
* precision = 0.916400580551524
* recall = 0.9342309684101502
* f1 = 0.9252298787412536
* loss = 0.11335893666411284
|
|
|
Test results:

* precision = 0.9192058759362955
* recall = 0.9303010230367262
* f1 = 0.9247201697271198
* loss = 0.11182546521618497
|
|
|
|
|
|
|
Data2: https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt

The performance on the dataset provided by @kemalaraz is as follows:
|
|
|
savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat eval_results.txt

* precision = 0.9461980692049029
* recall = 0.959309358847465
* f1 = 0.9527086063783312
* loss = 0.037054269206847804
|
|
|
savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat test_results.txt

* precision = 0.9458370635631155
* recall = 0.9588201928530913
* f1 = 0.952284378344882
* loss = 0.035431676572445225
|
|
|
|