Thai-sentiment-e5

A Thai sentiment analysis model fine-tuned from multilingual-e5-large for classifying sentiment in Thai text into positive, negative, and neutral categories.

Model Details

Model Description

This model is a fine-tuned version of intfloat/multilingual-e5-large specifically trained for Thai sentiment analysis. It can classify Thai text into three sentiment categories: positive, negative, and neutral. The model demonstrates strong performance on Thai language sentiment classification tasks with high accuracy and good understanding of Thai linguistic nuances including sarcasm and implicit sentiment.

  • Developed by: ZombitX64, Krittanut Janutsaha, Chanyut Saengwichain
  • Model type: Sequence Classification (Sentiment Analysis)
  • Language(s) (NLP): Thai (th)
  • License: Creative Commons
  • Finetuned from model: intfloat/multilingual-e5-large

Model Sources

Uses

Direct Use

This model can be directly used for sentiment analysis of Thai text. It's particularly useful for:

  • Social media sentiment monitoring
  • Customer feedback analysis
  • Product review classification
  • Opinion mining from Thai text content

Downstream Use

The model can be integrated into larger applications such as:

  • Customer service chatbots
  • Social media analytics platforms
  • E-commerce review analysis systems
  • Content moderation systems

Out-of-Scope Use

This model should not be used for:

  • Languages other than Thai (though it may have some capability due to the multilingual base model)
  • Fine-grained emotion detection beyond the three sentiment categories
  • Clinical or medical sentiment analysis without proper validation

Bias, Risks, and Limitations

The model may have biases related to:

  • Text domains represented in the training data
  • Demographic or cultural biases present in the training dataset
  • Potential difficulty with highly domain-specific terminology or slang not present in training data

Recommendations

Users should be aware that:

  • The model performs best on text similar to its training data
  • Results should be validated for specific use cases
  • Consider the three-class limitation when interpreting results
  • Be cautious when applying to sensitive domains without additional validation

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "ZombitX64/Thai-sentiment-e5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# List of texts to analyze
texts = [
    "ผลิตภัณฑ์นี้ดีมาก ใช้งานง่าย",
    "บริการแย่มาก ไม่ประทับใจเลย",
    "สินค้าคุณภาพพอใช้ได้",
    "มีคำถามเกี่ยวกับวิธีการใช้งานครับ",
    "สุดยอดไปเลย!",
    "ผิดหวังเล็กน้อย"
]

# Label mapping: 0=Question, 1=Negative, 2=Neutral, 3=Positive
labels = ["Question", "Negative", "Neutral", "Positive"]

# Process each text and predict sentiment
print("Predicting sentiment for multiple texts:")
for text in texts:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(predictions, dim=-1)

    # Get predicted label and confidence
    predicted_label = labels[predicted_class.item()]
    confidence = predictions[0][predicted_class.item()].item()

    print(f"\nText: \"{text}\"")
    print(f"Predicted sentiment: {predicted_label} ({confidence:.2%})")

Training Details

Training Data

The model was trained on a Thai sentiment dataset containing 2,730 samples with the following distribution:

  • Total samples: 2,730 (2,729 after filtering)
  • Positive samples: 2,310 (label 1)
  • Negative samples: 102 (label 2)
  • Neutral samples: 317 (label 3)

Data Split:

  • Training set: 2,456 samples
  • Validation set: 273 samples

Training Procedure

The model was fine-tuned using the Hugging Face Transformers library with the following setup:

Training Hyperparameters

  • Base Model: intfloat/multilingual-e5-large
  • Model Architecture: XLMRobertaForSequenceClassification
  • Training Epochs: 5
  • Training Steps: 770
  • Training Runtime: 1,633.3 seconds
  • Training samples per second: 7.519
  • Training steps per second: 0.471

Speeds, Sizes, Times

  • Training Time: ~27 minutes (1,633 seconds)
  • Total Training Steps: 770
  • Final Training Loss: 0.043

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model was evaluated on a validation set of 273 samples split from the original dataset.

Metrics

The model was evaluated using:

  • Accuracy: Primary metric for classification performance
  • Training Loss: Cross-entropy loss during training
  • Validation Loss: Cross-entropy loss on validation set

Results

Training Progress:

Epoch Training Loss Validation Loss Accuracy
1 0.0812 0.0699 98.53%
2 0.0053 0.0527 99.27%
3 0.0041 0.0350 99.63%
4 0.0002 0.0384 99.63%
5 0.0002 0.0410 99.63%

Final Test Results:

Metric Negative Neutral Positive Overall
Precision 1.00 1.00 1.00 1.00
Recall 1.00 0.90 1.00 1.00
F1-Score 1.00 0.95 1.00 1.00
Support 231 10 32 273

Performance Metrics:

  • Overall Accuracy: 100% (273/273)
  • Macro Average F1: 0.98
  • Weighted Average F1: 1.00
  • AUC Scores:
    • Negative: 0.00 (perfect separation)
    • Neutral: 0.12
    • Positive: 0.91

Confusion Matrix Results:

  • Negative: 231/231 correctly classified (100%)
  • Neutral: 9/10 correctly classified (90%)
  • Positive: 32/32 correctly classified (100%)

Model Performance Visualizations

Confusion Matrix

Confusion Matrix

ROC Curve

ROC Curve

Precision-Recall Curve

Precision-Recall Curve

additional_texts = [
    "ช่วยแนะนำเมนูขายดีหน่อยค่ะ",              # question
    "เซ็งมาก ทำของหายที่ร้าน",                 # negative
    "โอเคนะ ไม่แย่แต่ก็ไม่ได้ดี",               # neutral
    "พนักงานน่ารักมาก บริการดีเยี่ยม",          # positive
    "สาขานี้อยู่ตรงไหนเหรอ?",                   # question
    "อาหารเย็นชืด ไม่อร่อยเลย",                # negative
    "ไม่ได้รู้สึกอะไรเป็นพิเศษ",                # neutral
    "ชอบมากเลยค่ะ จะมาอีกแน่นอน",              # positive
    "โทรไปไม่มีคนรับสายเลย",                   # negative
    "รบกวนขอเลขพัสดุด้วยค่ะ",                  # question
    "เฉยๆ กับรสชาตินี้นะ",                    # neutral
    "ขนมปังนุ่มมาก ชอบๆ",                      # positive
    "ที่จอดรถอยู่ตรงไหน?",                     # question
    "ผิดหวังมากกับคุณภาพ",                     # negative
    "ไม่รู้จะพูดว่ายังไง มันก็แค่ธรรมดา",       # neutral
    "ครั้งแรกที่มาแล้วรู้สึกดีมาก",             # positive
    "อยากทราบว่าสั่งล่วงหน้าได้ไหม?",         # question
    "ของหมด ไม่แจ้งล่วงหน้าเลย",              # negative
    "ไม่มีความเห็นเป็นพิเศษ",                 # neutral
    "ทุกอย่างดีหมดเลยครับ",                    # positive
    "ตอนนี้เปิดให้บริการอยู่หรือเปล่าครับ?",     # question
    "พนักงานเสียงดัง พูดไม่เพราะ",             # negative
    "ก็โอเคในระดับนึงนะ",                      # neutral
    "บริการรวดเร็วทันใจ ชอบมาก",               # positive
    "ยังไม่เห็นเลขพัสดุเลยค่ะ",                # negative
    "เมนูแนะนำวันนี้คืออะไรคะ?",               # question
    "อาหารไม่ตรงปกเลยครับ",                   # negative
    "สั่งไว้ตอนเที่ยง กว่าจะได้ตอนบ่าย",        # negative
    "ไม่มีอะไรพิเศษ แต่ก็ไม่แย่",              # neutral
    "ของอร่อย บริการดี ราคาเหมาะสม",          # positive
]

additional_labels = [
    0, 1, 2, 3, 0, 1, 2, 3, 1, 0,
    2, 3, 0, 1, 2, 3, 0, 1, 2, 3,
    0, 1, 2, 3, 1, 0, 1, 1, 2, 3
]

#  รวมทั้งหมด 40 ตัวอย่าง
texts = [
    "อร่อยมาก แนะนำเลย",              # Positive
    "ผิดหวังกับบริการมาก",             # Negative
    "ก็เฉยๆ แหละ",                    # Neutral
    "บริการช้า แต่พนักงานดี",         # Neutral
    "ไอเดียสร้างสรรค์มาก!",            # Positive
    "สาขานี้เปิดกี่โมง?",              # Question
    "รสชาติห่วยแตก ไม่ประทับใจเลย",     # Negative
    "ดีแต่ยังไม่ว้าว",                 # Neutral
    "บรรยากาศดี อาหารอร่อยมาก",        # Positive
    "ทำไมราคาแพงจัง?",                # Question
] + additional_texts

Predicted Sentiment Distribution

Predicted Sentiment Distribution

Confusion Matrix

Confusion Matrix

Summary

The model achieves exceptional performance with:

  • Perfect Overall Accuracy: 100% on test set
  • Excellent Class-wise Performance: Near-perfect precision and recall across all classes
  • Strong Generalization: Maintains high performance on diverse text types
  • Robust Sentiment Detection: Handles complex cases including sarcasm, implicit sentiment, and neutral expressions
  • Minimal Confusion: Only 1 misclassification (Neutral → Negative)

Model Examination

The model demonstrates strong capability in understanding Thai sentiment nuances:

Strengths:

  • High accuracy on straightforward sentiment classification
  • Good detection of sarcastic and ironic statements
  • Ability to handle implicit sentiment (e.g., "เยี่ยมเลยที่ลืมส่งงานอีกครั้ง" → Negative)
  • Understanding of Thai cultural context in sentiment expression

Examples of Model Performance:

The model demonstrates excellent capability in understanding Thai sentiment nuances:

Straightforward Cases:

  • "วันนี้อากาศดีจังเลย" → Positive (99.96%)
  • "อร่อยมาก แนะนำเลย" → Positive (99.98%)
  • "ดีใจด้วยนะ!" → Positive (99.94%)
  • "แย่ที่สุดเท่าที่เคยเจอมา" → Negative (99.99%)
  • "ผิดหวังกับบริการมาก" → Negative (99.99%)
  • "ก็งั้นๆ แหละ ไม่มีอะไรพิเศษ" → Neutral (99.70%)

Complex Sentiment Detection:

  • Sarcasm: "เก่งจังเลยนะ ทำผิดซ้ำได้เหมือนเดิมเป๊ะเลย" → Negative (99.99%)
  • Implicit criticism: "ไอเดียสร้างสรรค์มาก! ไม่มีใครคิดจะเสนออะไรที่ไม่มีทางเป็นไปได้แบบนี้หรอก" → Negative (99.43%)
  • Backhanded compliments: "เธออาจจะไม่เหมือนใคร…แต่ก็มีเสน่ห์ในแบบของเธอเอง" → Negative (90.58%)

Challenging Cases:

  • Questions with negative context: "ประชุมพรุ่งนี้กี่โมงครับ" → Negative (99.97%)
  • Subtle disappointment: "น่าจะดีขึ้นในครั้งต่อไป" → Negative (99.83%)
  • Mixed sentiment: "ลำไยอร่อยดีสดมากและลูกใหญ่ด้วยแต่เน่าไปครึ่งนึงก็ไม่ได้แย่แต่ก็ไม่ได้ดี" → Negative (59.86%)

Environmental Impact

Carbon emissions information not available. The model was fine-tuned from a pre-trained multilingual model, reducing the overall computational cost compared to training from scratch.

Technical Specifications

Model Architecture and Objective

  • Architecture: XLMRobertaForSequenceClassification
  • Base Model: intfloat/multilingual-e5-large
  • Task: Multi-class text classification (3 classes)
  • Objective: Cross-entropy loss minimization

Compute Infrastructure

Hardware

Training hardware specifications not specified.

Software

  • Framework: Hugging Face Transformers
  • Language: Python
  • Base Model: XLM-RoBERTa architecture

Citation

BibTeX:

@misc{thai-sentiment-e5,
  title={Thai-sentiment-e5: A Thai Sentiment Analysis Model},
  author={ZombitX64, Krittanut Janutsaha, Chanyut Saengwichain},
  year={2024},
  url={https://huggingface.co/ZombitX64/Thai-sentiment-e5}
}

Model Card Authors

ZombitX64, Krittanut Janutsaha, Chanyut Saengwichain

Model Card Contact

For questions or issues regarding this model, please contact through the Hugging Face model repository.

Downloads last month
95
Safetensors
Model size
560M params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ZombitX64/Thai-sentiment-e5

Finetuned
(110)
this model

Space using ZombitX64/Thai-sentiment-e5 1