Commit afee77b (verified) by RegenAI · Parent: 9470d28

Create README.md

Files changed (1): README.md (+66)
---
license: mit
language:
- tr
metrics:
- rouge
- meteor
base_model:
- google/umt5-small
pipeline_tag: text2text-generation
---

# 📝 umt5-small Turkish Abstractive Summarization

## 🧠 Abstract

This model is a fine-tuned version of `umt5-small`, adapted for **abstractive summarization** of Turkish-language text. Leveraging the multilingual pretraining of the umT5 architecture, it was trained on a high-quality Turkish summarization dataset of diverse news articles paired with human-written summaries. The goal is to generate coherent, concise, and semantically accurate summaries of long-form Turkish content, making the model suitable for real-world applications such as news aggregation, document compression, and information retrieval.

Despite its small size (~300M parameters), the model performs well on standard evaluation metrics, including **ROUGE** and **METEOR**, achieving results within the commonly accepted ranges for Turkish-language summarization. It strikes a practical balance between quality and efficiency, making it well suited to resource-constrained environments.

---

## 🔍 Metric Interpretation (Specific to Turkish)

- **ROUGE-1:** Measures unigram (word-level) overlap between the generated summary and the reference text. For Turkish summarization, scores below **0.30** generally indicate weak lexical alignment, while scores above **0.40** indicate strong, fluent output.

- **ROUGE-2:** Evaluates bigram (two-word sequence) overlap. Because Turkish is an agglutinative language with rich morphology, high bigram overlap is harder to achieve, so a range of **0.15–0.30** is considered average and acceptable for Turkish.

- **ROUGE-L:** Captures the longest common subsequence, reflecting sentence-level fluency and structural similarity. Acceptable ranges for Turkish generally track ROUGE-1, typically **0.28–0.40**.

- **METEOR:** Unlike ROUGE, METEOR also rewards stem and synonym matches, so it behaves relatively well on morphologically rich languages like Turkish. Scores of **0.25–0.38** are commonly observed and considered good for Turkish summarization.

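To make ROUGE-1 concrete, here is a minimal, dependency-free sketch of unigram F1 scoring. The whitespace tokenization is a deliberate simplification for an agglutinative language, and the Turkish example sentences are invented; real evaluation would use a library such as `rouge_score` or `evaluate`, ideally with subword or stemmed tokens:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # each unigram counted at most min(cand, ref) times
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Toy example: 3 of 4 candidate tokens also appear in the 5-token reference.
ref = "hükümet yeni ekonomi paketini açıkladı"
cand = "hükümet ekonomi paketini duyurdu"
print(round(rouge1_f1(cand, ref), 2))  # → 0.67
```

This is why agglutination depresses the scores discussed above: `paketini` and `paketi` would count as different unigrams unless the tokenizer stems or splits suffixes.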
---

## 📊 Acceptable Metric Ranges

| Metric  | Acceptable Range | Interpretation                        |
|---------|------------------|---------------------------------------|
| ROUGE-1 | 0.30 – 0.45      | Weak below 0.30, good above 0.40      |
| ROUGE-2 | 0.15 – 0.30      | Typical for bigram overlap in Turkish |
| ROUGE-L | 0.28 – 0.40      | Similar profile to ROUGE-1            |
| METEOR  | 0.25 – 0.38      | Balanced lexical and semantic match   |

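The ROUGE-L row is based on the longest common subsequence (LCS) between candidate and reference. A small sketch of that computation, again with naive whitespace tokenization and invented example strings, might look like:

```python
def lcs_len(a: list, b: list) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: LCS-based precision/recall over whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Unlike ROUGE-2, the matched tokens need not be adjacent, only in order.
print(round(rouge_l_f1("hükümet ekonomi paketini duyurdu",
                       "hükümet yeni ekonomi paketini açıkladı"), 2))  # → 0.67
```

Because LCS only requires in-order (not contiguous) matches, ROUGE-L tends to sit near ROUGE-1 and well above ROUGE-2, consistent with the ranges in the table.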
---

## 🚀 Usage Example

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Replace the repository ID below with the model's actual Hub ID.
tokenizer = AutoTokenizer.from_pretrained("your_username/umt5-small-turkish-summary")
model = AutoModelForSeq2SeqLM.from_pretrained("your_username/umt5-small-turkish-summary")

text = "Insert Turkish text to summarize."
inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)

# Beam search with early stopping keeps the summary short and fluent.
summary_ids = model.generate(
    **inputs,
    max_length=100,
    num_beams=4,
    early_stopping=True,
)

summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
```