Sengil committed · Commit fd68a15 · verified · 1 Parent(s): f93df2f

Update README.md

Files changed (1)
  1. README.md +81 -141
README.md CHANGED
@@ -1,171 +1,111 @@
  ---
  library_name: transformers
  tags:
- - Aspect Term Extraction
  - transformers
- - t5
  language:
  - tr
  metrics:
- - micro-f1
  base_model:
- - Turkish-NLP/t5-efficient-base-turkish
- pipeline_tag: text2text-generation
  ---

- # **Sengil/t5-turkish-aspect-term-extractor** 🇹🇷
-
- A Turkish sequence-to-sequence model based on `Turkish-NLP/t5-efficient-base-turkish`, fine-tuned for **Aspect Term Extraction (ATE)** from customer reviews and sentences.
-
- Given a Turkish sentence, the model generates a list of **aspect terms** (e.g., *kahve*, *servis*, *fiyatlar*) that name the main entities or features under discussion.
-
- ---

- ## ✨ Example

  ```python
- from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
  import torch
- import re
- from collections import Counter
-
- # Load model
- MODEL_ID = "Sengil/t5-turkish-aspect-term-extractor"
- DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-
- tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
- model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID).to(DEVICE)
- model.eval()
-
- # Small stopword list used to filter function words out of the output.
- TURKISH_STOPWORDS = {
-     "ve", "çok", "ama", "bir", "bu", "daha", "gibi", "ile", "için",
-     "de", "da", "ki", "o", "şu", "sen", "biz", "siz", "onlar"
- }
-
- def is_valid_aspect(word):
-     word = word.strip().lower()
-     return (
-         len(word) > 1
-         and word not in TURKISH_STOPWORDS
-         and word.isalpha()
-     )
-
- def extract_and_rank_aspects(text, max_tokens=64, beams=5):
-     inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(DEVICE)
-
-     with torch.no_grad():
-         outputs = model.generate(
-             input_ids=inputs["input_ids"],
-             attention_mask=inputs["attention_mask"],
-             max_new_tokens=max_tokens,
-             num_beams=beams,
-             num_return_sequences=beams,
-             early_stopping=True,
-         )
-
-     all_predictions = [
-         tokenizer.decode(output, skip_special_tokens=True)
-         for output in outputs
-     ]
-
-     # Split each beam output on common separators and keep valid aspect terms.
-     all_terms = []
-     for pred in all_predictions:
-         candidates = re.split(r"\s*[;,–—-]\s*", pred)
-         all_terms.extend(w.strip().lower() for w in candidates if is_valid_aspect(w))
-
-     # Rank terms by how many beam outputs they appear in.
-     ranked = Counter(all_terms).most_common()
-     return ranked
-
- # Inference. The review text roughly translates to: "Pros: a great atmosphere
- # with a lake view; a restaurant with good air conditioning, given Ipoh's
- # always-hot weather; waiters offering good, fast service; e-wallet accepting
- # contactless payment; free parking, though open under the hot sun; the food
- # tastes good."
- text = "Artılar: Göl manzarasıyla harika bir atmosfer, Ipoh'un her zaman sıcak olan havası nedeniyle iyi bir klima olan restoran, iyi ve hızlı hizmet sunan garsonlar, temassız ödeme kabul eden e-cüzdan, ücretsiz otopark ama sıcak güneş altında açık, yemeklerin tadı güzel."
- ranked_aspects = extract_and_rank_aspects(text)
-
- print("Sorted Aspect Terms:")
- for term, score in ranked_aspects:
-     print(f"{term:<15} skor: {score}")
- ```
-
- **Output:**

  ```
- Sorted Aspect Terms:
- atmosfer       skor: 1
- servis         skor: 1
- restoran       skor: 1
- hizmet         skor: 1
  ```

- ---
-
- ## 📌 Model Details
-
- | Detail | Value |
- | -------------------- | -------------------------------------------- |
- | **Model Type** | `AutoModelForSeq2SeqLM` (T5-style) |
- | **Base Model** | `Turkish-NLP/t5-efficient-base-turkish` |
- | **Languages** | `tr` (Turkish) |
- | **Fine-tuning Task** | Aspect Term Extraction (sequence generation) |
- | **Framework** | 🤗 Transformers |
- | **License** | Apache-2.0 |
- | **Tokenizer** | SentencePiece (T5-style) |
-
- ---
-
- ## 📊 Dataset & Training
-
- * Total samples: 37,000+ Turkish review sentences
- * Input: raw sentence (e.g., `"Pilav çok lezzetliydi ama servis yavaştı."`)
- * Target: comma-separated aspect terms (e.g., `"pilav, servis"`), as sketched below
-
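A minimal sketch of how such input/target pairs could be tokenized for T5 fine-tuning, assuming the standard 🤗 `text_target` API; the helper `prepare_example` is hypothetical and not part of the original card:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Turkish-NLP/t5-efficient-base-turkish")

def prepare_example(sentence, aspects):
    # Hypothetical helper (not from the card): encode one training pair,
    # using the max lengths reported in the configuration below.
    enc = tokenizer(sentence, max_length=128, truncation=True)
    labels = tokenizer(text_target=", ".join(aspects), max_length=64, truncation=True)
    enc["labels"] = labels["input_ids"]
    return enc

example = prepare_example("Pilav çok lezzetliydi ama servis yavaştı.", ["pilav", "servis"])
```
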
122
- ### Training Configuration
123
-
124
- | Setting | Value |
125
- | --------------------- | ------------------ |
126
- | **Epochs** | 3 |
127
- | **Batch size** | 8 |
128
- | **Max input length** | 128 tokens |
129
- | **Max output length** | 64 tokens |
130
- | **Optimizer** | AdamW |
131
- | **Learning rate** | 3e-5 |
132
- | **Scheduler** | Linear |
133
- | **Precision** | FP32 |
134
- | **Hardware** | 1× Tesla T4 / P100 |
135
-
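As a sketch of how this table maps onto 🤗 `Seq2SeqTrainingArguments` (an assumption; the actual training script is not part of this card):

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: mirrors the configuration table above; output_dir is illustrative.
training_args = Seq2SeqTrainingArguments(
    output_dir="t5-turkish-ate",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    lr_scheduler_type="linear",
    fp16=False,  # FP32, per the table
)
```
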
- ---
-
- ### 🔍 Evaluation
-
- The model was evaluated with exact-match micro-F1 on a held-out test set: a predicted term counts as correct only if it exactly matches a gold term.
-
- | Metric | Score |
- | --------------- | ----: |
- | **Micro-F1** | 0.84+ |
- | **Exact Match** | ~78% |
-
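A sketch of how exact-match micro-F1 can be computed over predicted and gold term lists (the evaluation script itself is not included in the card):

```python
from collections import Counter

def micro_f1(pred_term_lists, gold_term_lists):
    # Sketch: micro-averaged F1 over exact term matches, pooled across examples.
    tp = fp = fn = 0
    for pred, gold in zip(pred_term_lists, gold_term_lists):
        p, g = Counter(pred), Counter(gold)
        overlap = sum((p & g).values())
        tp += overlap
        fp += sum(p.values()) - overlap
        fn += sum(g.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(micro_f1([["pilav", "servis"]], [["pilav", "servis", "fiyat"]]))  # 0.8
```
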
- ---
-
- ## 💡 Use Cases
-
- * 💬 Opinion mining in Turkish product and service reviews
- * 🧾 Preprocessing for aspect-level sentiment analysis
- * 📊 Feature-based review summarization in NLP pipelines
-
- ---
-
- ## 📦 Model Card / Citation
-
- ```bibtex
- @misc{Sengil2025T5AspectTR,
-   title = {Sengil/t5-turkish-aspect-term-extractor: Turkish Aspect Term Extraction with T5},
    author = {Şengil, Mert},
    year = {2025},
-   url = {https://huggingface.co/Sengil/t5-turkish-aspect-term-extractor}
  }
  ```

  ---
-
- For contributions, improvements, or issue reports, feel free to open a GitHub/Hugging Face issue or contact **[Mert Şengil](https://www.linkedin.com/in/mertsengil/)**.
 
  ---
  library_name: transformers
  tags:
+ - Dissonance Detection
  - transformers
+ - bert
  language:
  - tr
  metrics:
+ - accuracy
  base_model:
+ - ytu-ce-cosmos/turkish-base-bert-uncased
+ pipeline_tag: text-classification
  ---
+ # **Sengil/ytu-bert-base-dissonance-tr** 🇹🇷

+ A Turkish BERT-based model fine-tuned for three-way sentiment classification of single-sentence discourse.
+ The model assigns each input sentence to one of the following classes:

+ - **Dissonance:** the sentence contains conflicting or contradictory sentiments
+   _e.g.,_ "Telefon çok kaliteli ve hızlı bitiyor şarjı" ("The phone is very high quality, and its battery drains fast")
+ - **Consonance:** the sentence expresses harmonizing or mutually reinforcing sentiments
+   _e.g.,_ "Yemeklerde çok güzel manzarada mükemmel" (roughly, "The food is great and the view is excellent")
+ - **Neither:** the sentence is neutral or does not clearly express either dissonance or consonance
+   _e.g.,_ "Bu gün hava çok güzel" ("The weather is very nice today")

+ The model was trained on 37,368 Turkish samples and evaluated on separate validation and test sets of 4,671 samples each.
+ It achieved 97.5% accuracy and a 97.5% macro-F1 score on the test set, demonstrating strong performance in distinguishing subtle semantic contrasts in Turkish sentences.
+
+ | **Model Details** | |
+ | -------------------- | ----------------------------------------------------- |
+ | **Developed by** | Mert Şengil |
+ | **Model type** | `BertForSequenceClassification` |
+ | **Base model** | `ytu-ce-cosmos/turkish-base-bert-uncased` |
+ | **Languages** | `tr` (Turkish) |
+ | **License** | Apache-2.0 |
+ | **Fine-tuning task** | 3-class sentiment (dissonance / consonance / neither) |

+ ## Uses

  ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
  import torch

+ model_id = "Sengil/ytu-bert-base-dissonance-tr"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForSequenceClassification.from_pretrained(model_id)

+ # "I love them very much, and I don't trust them."
+ text = "onu çok seviyorum ve güvenmiyorum."
+ # Turkish-aware lowercasing: map capital I to dotless ı before lower(),
+ # since the base model is uncased.
+ text = text.replace("I", "ı").lower()
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
+ with torch.no_grad():
+     logits = model(**inputs).logits

+ label_id = int(logits.argmax())
+ id2label = {0: "Dissonance", 1: "Consonance", 2: "Neither"}

+ print({"label": id2label[label_id], "id": label_id})
  ```

+ Output:

  ```
+ {'label': 'Dissonance', 'id': 0}
+ ```
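The `id` field above is the predicted class index, not a confidence. As a small addition (not part of the original card), a softmax over the logits yields a probability for the predicted class:

```python
# Convert logits to a softmax probability for the predicted class.
probs = torch.softmax(logits, dim=-1)
print({"label": id2label[label_id], "score": float(probs[0, label_id])})
```
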
+ | **Training Details** | |
+ | ---------------------- | ---------------------------------------------- |
+ | **Training samples** | 37,368 |
+ | **Validation samples** | 4,671 |
+ | **Test samples** | 4,671 |
+ | **Epochs** | 4 |
+ | **Batch size** | 32 (train) / 16 (eval) |
+ | **Optimizer** | `AdamW` (lr = 2 × 10⁻⁵, weight_decay = 0.005) |
+ | **Scheduler** | Linear with 10% warm-up |
+ | **Precision** | FP32 |
+ | **Hardware** | 1× P100 GPU |
+
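For reference, the optimizer and scheduler rows could be realized roughly as follows (a sketch; the training script is not part of this card, and the step counts are illustrative):

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Sketch only: AdamW + linear warm-up matching the table above.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.005)
num_training_steps = (37368 // 32) * 4              # batches per epoch × 4 epochs
num_warmup_steps = int(0.10 * num_training_steps)   # 10% warm-up
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps)
```
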
+ ### Training Loss Progression
+ | Epoch | Train Loss | Val Loss |
+ | ----: | ---------: | ---------: |
+ | 1 | 0.2661 | 0.0912 |
+ | 2 | 0.0784 | 0.0812 |
+ | 3 | 0.0520 | 0.0859 |
+ | 4 | **0.0419** | **0.0859** |
+
+ ## Evaluation

+ | Metric | Value |
+ | ------------------- | ---------: |
+ | **Accuracy (test)** | **0.9750** |
+ | **Macro-F1 (test)** | **0.9749** |

+ | **Environmental Impact** | |
+ | ------------------------ | -------------------- |
+ | **Hardware** | 1× A100-40 GB |
+ | **Training time** | ≈ 4 × 7 min ≈ 0.47 h |

+ ## Citation

+ ```bibtex
+ @misc{Sengil2025DisConBERT,
+   title = {Sengil/ytu-bert-base-dissonance-tr: A Three-way Dissonance/Consonance Classifier},
    author = {Şengil, Mert},
    year = {2025},
+   url = {https://huggingface.co/Sengil/ytu-bert-base-dissonance-tr}
  }
  ```

  ---
+ I would like to thank YTU for the open-source contributions that supported the development of this model.
+ For issues or questions, please open an issue on the Hub repository or contact **[Mert Şengil](https://www.linkedin.com/in/mertsengil/)**.