Sengil/t5-turkish-aspect-term-extractor 🇹🇷

A Turkish sequence-to-sequence model based on Turkish-NLP/t5-efficient-base-turkish, fine-tuned for Aspect Term Extraction (ATE) from customer reviews and sentences.

Given a Turkish sentence, the model generates a list of aspect terms (e.g., kahve, servis, fiyatlar) naming the main entities or features the sentence discusses.

✨ Example

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import re
from collections import Counter

# Load the model and tokenizer
MODEL_ID = "Sengil/t5-turkish-aspect-term-extractor"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID).to(DEVICE)
model.eval()

# Small stopword list used to filter out tokens that are not aspect terms
TURKISH_STOPWORDS = {
    "ve", "çok", "ama", "bir", "bu", "daha", "gibi", "ile", "için",
    "de", "da", "ki", "o", "şu", "sen", "biz", "siz", "onlar"
}

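def is_valid_aspect(word):
    """Keep alphabetic, non-stopword tokens longer than one character.

    Note: str.isalpha() is False for strings containing spaces, so multi-word terms are dropped.
    """
    word = word.strip().lower()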
    return (
        len(word) > 1 and
        word not in TURKISH_STOPWORDS and
        word.isalpha()
    )

def extract_and_rank_aspects(text, max_tokens=64, beams=5):
    """Generate `beams` candidate outputs and rank terms by how often they occur across them."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(DEVICE)

    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=max_tokens,
            num_beams=beams,
            num_return_sequences=beams,
            early_stopping=True
        )

    all_predictions = [
        tokenizer.decode(output, skip_special_tokens=True)
        for output in outputs
    ]
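
    # Split each candidate on commas, semicolons, and dashes, then keep valid terms
    all_terms = []
    for pred in all_predictions:
        candidates = re.split(r"[;,–—-]", pred)
        all_terms.extend(w.strip().lower() for w in candidates if is_valid_aspect(w))

    # Count occurrences across all candidates; the most frequent terms come first
    ranked = Counter(all_terms).most_common()
    return ranked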


# Run inference on a sample review
text = "Artılar: Göl manzarasıyla harika bir atmosfer, Ipoh'un her zaman sıcak olan havası nedeniyle iyi bir klima olan restoran, iyi ve hızlı hizmet sunan garsonlar, temassız ödeme kabul eden e-cüzdan, ücretsiz otopark ama sıcak güneş altında açık, yemeklerin tadı güzel."
ranked_aspects = extract_and_rank_aspects(text)

print("Sorted Aspect Terms:")
for term, count in ranked_aspects:
    print(f"{term:<15}  count: {count}")

Output:

Sorted Aspect Terms:
atmosfer         count: 1
servis           count: 1
restoran         count: 1
hizmet           count: 1
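
One caveat about the post-processing in the example: Python's built-in `str.lower()` is not locale-aware, so the Turkish capitals `I` and `İ` are not mapped to `ı` and `i`. The sketch below shows a Turkish-aware alternative. The helper name `turkish_lower` is hypothetical and is not part of this model's published code.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish casing rules for dotted and dotless I."""
    # Map the two Turkish-specific capitals first, then let str.lower()
    # handle every other character.
    return text.replace("İ", "i").replace("I", "ı").lower()
```

Swapping this helper in for `.lower()` inside `is_valid_aspect` and the term-collection loop would keep capitalized model outputs consistent with lowercase ones when they are counted.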

📌 Model Details

Detail	Value
Model Type	`AutoModelForSeq2SeqLM` (T5-style)
Base Model	`Turkish-NLP/t5-efficient-base-turkish`
Languages	`tr` (Turkish)
Fine-tuning Task	Aspect Term Extraction (sequence generation)
Framework	🤗 Transformers
License	Apache-2.0
Tokenizer	SentencePiece (T5-style)

📊 Dataset & Training

Total samples: 37,000+ Turkish review sentences
Input: Raw sentence (e.g., "Pilav çok lezzetliydi ama servis yavaştı.")
Target: Comma-separated aspect terms (e.g., "pilav, servis")
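
The input/target format above can be illustrated with a small sketch. The helper names `make_target` and `parse_target` are hypothetical, and the exact normalization applied when the training set was built is not documented here, so treat the details as assumptions.

```python
def make_target(aspects: list[str]) -> str:
    """Join gold aspect terms into the comma-separated target string."""
    return ", ".join(term.strip() for term in aspects if term.strip())


def parse_target(target: str) -> list[str]:
    """Split a comma-separated target (or model output) back into terms."""
    return [term.strip() for term in target.split(",") if term.strip()]


# One training pair in the format described above
sentence = "Pilav çok lezzetliydi ama servis yavaştı."
target = make_target(["pilav", "servis"])
pair = {"input": sentence, "target": target}
```

The same `parse_target` helper can be applied to generated text at inference time, since the model is trained to emit the same comma-separated format.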

Training Configuration

Setting	Value
Epochs	3
Batch size	8
Max input length	128 tokens
Max output length	64 tokens
Optimizer	AdamW
Learning rate	3e-5
Scheduler	Linear
Precision	FP32
Hardware	1× Tesla T4 / P100
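
To give a sense of scale for this configuration, the arithmetic below estimates the number of optimizer steps and sketches a linear learning-rate decay. It assumes that exactly 37,000 samples are used for training (the card only says 37,000+ in total), with no gradient accumulation and no warmup. None of these are stated in the card, so the numbers are approximate.

```python
import math

NUM_SAMPLES = 37_000  # assumption: the card reports "37,000+" total samples
BATCH_SIZE = 8
EPOCHS = 3
BASE_LR = 3e-5

steps_per_epoch = math.ceil(NUM_SAMPLES / BATCH_SIZE)  # 4625
total_steps = steps_per_epoch * EPOCHS                 # 13875


def linear_lr(step: int) -> float:
    """Learning rate after `step` updates, decaying linearly to zero (no warmup assumed)."""
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / total_steps
```

Under these assumptions, training runs for roughly 13,900 updates, with the learning rate starting at 3e-5 and reaching zero at the final step.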

🔍 Evaluation

The model was evaluated on a held-out test set, reporting micro-F1 over exactly matching aspect terms together with the exact-match rate.

Metric	Score
Micro-F1	0.84+
Exact Match	~78%
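
The card does not include its evaluation script. As a sketch of how micro-F1 and exact match are commonly computed for aspect term extraction, the code below treats each prediction and each gold label as a set of terms. The function name and this matching convention are assumptions, not the author's confirmed method.

```python
def micro_f1_and_exact_match(predictions, references):
    """Return (micro_f1, exact_match_rate) for parallel lists of aspect-term sets."""
    tp = fp = fn = exact = 0
    for pred, gold in zip(predictions, references):
        pred, gold = set(pred), set(gold)
        tp += len(pred & gold)        # terms predicted and present in gold
        fp += len(pred - gold)        # predicted terms not in gold
        fn += len(gold - pred)        # gold terms the model missed
        exact += int(pred == gold)    # all terms of the sentence match exactly
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    exact_rate = exact / len(predictions) if predictions else 0.0
    return f1, exact_rate


preds = [{"pilav", "servis"}, {"kahve"}]
golds = [{"pilav", "servis"}, {"kahve", "fiyatlar"}]
f1, em = micro_f1_and_exact_match(preds, golds)
```

In this toy example, three of the four gold terms are recovered with no false positives, so precision is 1.0, recall is 0.75, and one of the two sentences is an exact match.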

💡 Use Cases

💬 Opinion mining in Turkish product or service reviews
🧾 Aspect-level sentiment analysis preprocessing
📊 Feature-based review summarization in NLP pipelines

📦 Model Card / Citation

@misc{Sengil2025T5AspectTR,
  title   = {Sengil/t5-turkish-aspect-term-extractor: Turkish Aspect Term Extraction with T5},
  author  = {Şengil, Mert},
  year    = {2025},
  url     = {https://huggingface.co/Sengil/t5-turkish-aspect-term-extractor}
}

For contributions, improvements, or issue reporting, feel free to open a GitHub/Hugging Face issue or contact Mert Şengil.
