---
language:
- tr
arXiv: 2403.01308
library_name: transformers
pipeline_tag: text2text-generation
widget:
- text: >-
    Soru yarat: cevap: Alan Mathison Turing İngiliz matematikçi, bilgisayar
    bilimcisi ve kriptolog. II. Dünya Savaşı sırasında Alman şifrelerinin
    kırılmasında çok önemli bir rol oynadığı için savaş kahramanı sayılmıştır.
    Ayrıca Manchester Üniversitesi'nde çalıştığı yıllarda, Turing makinesi
    denilen algoritma tanımı ile modern bilgisayarların kavramsal temelini
    atmıştır.
  example_title: Question generation
- text: >-
    Soru cevapla: Turing makinesi denilen algoritma tanımı ile modern
    bilgisayarların kavramsal temelini atan bilim insanı kimdir? kaynak: Alan
    Mathison Turing İngiliz matematikçi, bilgisayar bilimcisi ve kriptolog.
    II. Dünya Savaşı sırasında Alman şifrelerinin kırılmasında çok önemli bir
    rol oynadığı için savaş kahramanı sayılmıştır. Ayrıca Manchester
    Üniversitesi'nde çalıştığı yıllarda, Turing makinesi denilen algoritma
    tanımı ile modern bilgisayarların kavramsal temelini atmıştır.
  example_title: Question answering
- text: >-
    yanıtları çıkar: Alan Mathison Turing İngiliz matematikçi, bilgisayar
    bilimcisi ve kriptolog. II. Dünya Savaşı sırasında Alman şifrelerinin
    kırılmasında çok önemli bir rol oynadığı için savaş kahramanı sayılmıştır.
    Ayrıca Manchester Üniversitesi'nde çalıştığı yıllarda, Turing makinesi
    denilen algoritma tanımı ile modern bilgisayarların kavramsal temelini
    atmıştır.
  example_title: Answer Extraction
license: cc-by-nc-sa-4.0
---

# VBART Model Card

## Model Description

This repo contains pretrained TensorFlow and safetensors weights of VBART, the first sequence-to-sequence model trained on Turkish corpora from scratch. VBART was trained by VNGRS in February 2023.
The model is capable of text transformation tasks such as summarization, paraphrasing, and title generation when fine-tuned. It outperforms its multilingual counterparts despite being much smaller than other implementations.

This repository contains fine-tuned weights of VBART for the question answering and question generation tasks described in the [paper](https://doi.org/10.55730/1300-0632.3914).

- **Developed by:** [VNGRS-AI](https://vngrs.com/ai/)
- **Model type:** Transformer encoder-decoder based on the mBART architecture
- **Language(s) (NLP):** Turkish
- **License:** CC BY-NC-SA 4.0
- **Fine-tuned from:** VBART-Large
- **Paper:** [arXiv](https://arxiv.org/abs/2403.01308)

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(
    "vngrs-ai/VBART-Large-QAQG",
    model_input_names=['input_ids', 'attention_mask']
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "vngrs-ai/VBART-Large-QAQG",
    # device_map="auto",  # uncomment to load the model on GPU
)

context = "..."
question = "..."
highlighted_context = "..."

# Prompt for question generation
qg_prompt = f'Soru yarat: cevap: {context}'
# Prompt for question answering
qa_prompt = f'Soru cevapla: {question} kaynak: {context}'
# Prompt for answer extraction
ae_prompt = f'yanıtları çıkar: {highlighted_context}'

token_input = tokenizer(ae_prompt, return_tensors="pt")  # .to('cuda') if on GPU
outputs = model.generate(**token_input)
print(tokenizer.decode(outputs[0]))
```
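For a concrete end-to-end run, here is a minimal question-answering sketch that follows the snippet above. The Turkish context and question are taken from the widget examples on this card; `max_new_tokens` and `skip_special_tokens` are illustrative decoding choices, not settings prescribed by the model card.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(
    "vngrs-ai/VBART-Large-QAQG",
    model_input_names=['input_ids', 'attention_mask']
)
model = AutoModelForSeq2SeqLM.from_pretrained("vngrs-ai/VBART-Large-QAQG")

# Context and question reused from the widget examples above
context = (
    "Alan Mathison Turing İngiliz matematikçi, bilgisayar bilimcisi ve kriptolog. "
    "II. Dünya Savaşı sırasında Alman şifrelerinin kırılmasında çok önemli bir rol "
    "oynadığı için savaş kahramanı sayılmıştır. Ayrıca Manchester Üniversitesi'nde "
    "çalıştığı yıllarda, Turing makinesi denilen algoritma tanımı ile modern "
    "bilgisayarların kavramsal temelini atmıştır."
)
question = (
    "Turing makinesi denilen algoritma tanımı ile modern bilgisayarların "
    "kavramsal temelini atan bilim insanı kimdir?"
)

# Question-answering prompt format from the snippet above
qa_prompt = f'Soru cevapla: {question} kaynak: {context}'

token_input = tokenizer(qa_prompt, return_tensors="pt")
# max_new_tokens=64 is an illustrative limit, not an official recommendation
outputs = model.generate(**token_input, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same pattern applies to question generation and answer extraction; only the prompt string changes.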
## Training Details

### Fine-tuning prompt

This model is trained on three tasks:

- Question answering: answer a question from a given context. Prompted as in the code example above: `Soru cevapla: {question} kaynak: {context}`
- Question generation: generate a question from a given context. Accepts a highlight token to specify the answer the generated question should target. Prompted as `Soru yarat: cevap: {context}`
- Answer extraction: extract possible answers from a highlighted range (marked with the same highlight token). Prompted as `yanıtları çıkar: {highlighted_context}`

### Training Data

The base model is pre-trained on cleaned and filtered versions of a mixed corpus made of Turkish parts of the [OSCAR-2201](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201) and [mC4](https://huggingface.co/datasets/mc4) datasets. These datasets consist of documents of unstructured web crawl data. More information about them can be found on their respective pages. Data is filtered using a set of heuristics and certain rules, explained in the appendix of our [paper](https://arxiv.org/abs/2403.01308).

The fine-tuning dataset is [TQuAD](https://github.com/obss/turkish-question-generation), which has two versions. We concatenated them and dropped duplicate samples. More information about this process can be found in Appendix B of our [paper](https://arxiv.org/abs/2403.01308).

### Limitations

This model is fine-tuned for question answering and question generation with specific prompts. It is not intended for any other use case, and it cannot be fine-tuned to another task while retaining the full performance of the base model. It is also not guaranteed to work correctly without the specified prompts.

### Training Procedure

Pre-trained for 30 days on a total of 708B tokens, then fine-tuned for 5 epochs.

#### Hardware

- **GPUs**: 8 x Nvidia A100-80 GB

#### Software

- TensorFlow

#### Hyperparameters

##### Pretraining

- **Training regime:** fp16 mixed precision
- **Training objective**: Sentence permutation and span masking (using mask lengths sampled from a Poisson distribution with λ = 3.5, masking 30% of tokens)
- **Optimizer**: Adam optimizer (β1 = 0.9, β2 = 0.98, Ɛ = 1e-6)
- **Scheduler**: Linear decay scheduler (20,000 warm-up steps)
- **Dropout**: 0.1 (dropped to 0.05 and then to 0 in the last 165k and 205k steps, respectively)
- **Initial learning rate**: 5e-6
- **Training tokens**: 708B

##### Fine-tuning

- **Training regime:** fp16 mixed precision
- **Optimizer**: Adam optimizer (β1 = 0.9, β2 = 0.98, Ɛ = 1e-6)
- **Scheduler**: Linear decay scheduler
- **Dropout**: 0.1
- **Learning rate**: 5e-5
- **Fine-tune epochs**: 5

#### Metrics

![image/png](https://cdn-uploads.huggingface.co/production/uploads/62f8b3c84588fe31f435a92b/D-Epasj5C4icAu0ykqt10.png)

## Citation

```
@article{turker2024vbart,
  title={VBART: The Turkish LLM},
  author={Turker, Meliksah and Ari, Erdi and Han, Aydin},
  journal={arXiv preprint arXiv:2403.01308},
  year={2024}
}
```