---
license: mit
tags:
- unsloth
- legal
language:
- tk
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
datasets:
- maximilianshwarzmullers/hukukchy
metrics:
- bleu
new_version: maximilianshwarzmullers/tm-hukuk
library_name: transformers
---
# Model Card for maximilianshwarzmullers/tm-hukuk

### Model Description

This is a Turkmen LLM finetuned on Turkmen legal data.
- **Developed by:** Annamyrat Saparow
- **Funded date:** 30.04.2025
- **Shared date:** 07.05.2025
- **Model type:** Instruction-tuned model
- **Language(s) (NLP):** Turkmen, English
- **License:** MIT
- **Finetuned from model:** meta-llama/Llama-3.1-8B-Instruct
### Model Sources

- **Repository:** maximilianshwarzmullers/tm-hukuk
### Training Data

The training data is maximilianshwarzmullers/hukukchy, a Turkmen legal dataset built by the author.
### Training Procedure

The model was finetuned with QLoRA (LoRA adapters on a quantized base model) using Unsloth.
#### Preprocessing

```python
from datasets import load_dataset

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["question"]
    outputs = examples["answer"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

dataset = load_dataset("maximilianshwarzmullers/hukukchy", split="train")
dataset = dataset.map(formatting_prompts_func, batched=True)
```
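As a sanity check, the sketch below shows what one formatted training example looks like. The instruction, question, and answer strings and the EOS token here are toy placeholders, not drawn from the actual dataset or tokenizer:

```python
# Same Alpaca-style template as used in preprocessing
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Placeholder; during training this comes from tokenizer.eos_token
EOS_TOKEN = "<|eot_id|>"

example = alpaca_prompt.format(
    "Answer the legal question.",        # toy instruction
    "What does the MIT license permit?",  # toy question
    "Broad reuse with attribution.",      # toy answer
) + EOS_TOKEN

# Every example ends with the EOS token so generation stops cleanly
print(example.endswith(EOS_TOKEN))  # → True
```

The same template is filled at inference time with the response slot left empty, so the model learns to stop at the EOS token it saw during training.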
#### Training Hyperparameters

```python
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,  # packing can make training 5x faster for short sequences
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1,  # set this for one full training run
        max_steps = 750,
        learning_rate = 3e-5,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",  # change to "wandb" etc. to enable logging
    ),
)
```