---
license: apache-2.0
library_name: scikit-learn
tags:
- scikit-learn
- sklearn
- text-classification
- vietnamese
- nlp
- pulse
- tf-idf
- logistic-regression
- svc
- support-vector-classification
- aspect-sentiment-analysis
- banking
- financial-nlp
datasets:
- undertheseanlp/UTS2017_Bank
- ura-hcmut/vlsp2016
metrics:
- accuracy
- precision
- recall
- f1-score
model-index:
- name: pulse-core-1
  results:
  - task:
      type: text-classification
      name: Vietnamese General Sentiment Analysis
    dataset:
      name: VLSP2016
      type: ura-hcmut/vlsp2016
    metrics:
    - type: accuracy
      value: 0.7114
      name: Test Accuracy (SVC Linear)
    - type: accuracy
      value: 0.7019
      name: Test Accuracy (Logistic Regression)
    - type: f1-score
      value: 0.713
      name: Weighted F1-Score (SVC)
    - type: f1-score
      value: 0.703
      name: Weighted F1-Score (Logistic Regression)
  - task:
      type: text-classification
      name: Vietnamese Banking Aspect Sentiment Analysis
    dataset:
      name: UTS2017_Bank
      type: undertheseanlp/UTS2017_Bank
    metrics:
    - type: accuracy
      value: 0.7172
      name: Test Accuracy (SVC)
    - type: accuracy
      value: 0.6818
      name: Test Accuracy (Logistic Regression)
    - type: precision
      value: 0.65
      name: Weighted Precision (SVC)
    - type: recall
      value: 0.72
      name: Weighted Recall (SVC)
    - type: f1-score
      value: 0.66
      name: Weighted F1-Score (SVC)
    - type: f1-score
      value: 0.66
      name: Weighted F1-Score (Logistic Regression)
language:
- vi
pipeline_tag: text-classification
---

# Pulse Core 1 - Vietnamese Sentiment Analysis System

A machine learning-based sentiment analysis system for Vietnamese text. It combines a TF-IDF feature extraction pipeline with classical machine learning classifiers. With Support Vector Classification (SVC), it achieves **71.14% accuracy** on the VLSP2016 general sentiment dataset and **71.72% accuracy** on the UTS2017_Bank banking aspect sentiment dataset.

📋 **[View Detailed System Card](https://huggingface.co/undertheseanlp/pulse_core_1/blob/main/paper/pulse_core_1_technical_report.tex)** for comprehensive model documentation, performance analysis, and limitations.

## Model Description

**Pulse Core 1** is a Vietnamese sentiment analysis system that supports both general sentiment classification and specialized banking aspect sentiment analysis. It classifies general Vietnamese text as positive, negative, or neutral, and labels banking text with a combined banking aspect and sentiment polarity. It is designed for Vietnamese text analysis across multiple domains, with specialized support for banking customer feedback analysis and financial service categorization.

### Model Architecture

- **Algorithm**: TF-IDF + SVC/Logistic Regression Pipeline
- **Feature Extraction**: CountVectorizer with 20,000 max features
- **N-gram Support**: Unigram and bigram (1-2)
- **TF-IDF**: Transformation with IDF weighting
- **Classifier**: Support Vector Classification (SVC) / Logistic Regression with optimized parameters
- **Framework**: scikit-learn ≥1.6
- **Caching System**: Hash-based caching for efficient processing

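
The architecture listed above can be sketched as a scikit-learn `Pipeline`. This is a minimal illustration, not the project's actual `train.py`: the helper name `build_pipeline`, the step names, and the classifier settings (`max_iter`, default regularization) are assumptions, and the "optimized parameters" mentioned above are not reproduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def build_pipeline(model_name="svc_linear", max_features=20000, ngram_range=(1, 2)):
    """Hypothetical helper: token counts -> TF-IDF weighting -> classifier."""
    if model_name == "svc_linear":
        classifier = SVC(kernel="linear")
    elif model_name == "logistic":
        # max_iter is an assumed setting, not taken from the project.
        classifier = LogisticRegression(max_iter=1000)
    else:
        raise ValueError(f"unsupported model: {model_name}")
    return Pipeline([
        # Keep the most frequent unigrams and bigrams, capped at max_features.
        ("vect", CountVectorizer(max_features=max_features, ngram_range=ngram_range)),
        # Re-weight raw counts with inverse document frequency.
        ("tfidf", TfidfTransformer(use_idf=True)),
        ("clf", classifier),
    ])


# Toy data for illustration only; real training uses VLSP2016 or UTS2017_Bank.
texts = ["sản phẩm rất tốt", "dịch vụ quá tệ", "rất hài lòng", "quá thất vọng"]
labels = ["positive", "negative", "positive", "negative"]

pipeline = build_pipeline("logistic")
pipeline.fit(texts, labels)
prediction = pipeline.predict(["sản phẩm tốt"])
print(prediction)
```

Splitting the vectorizer and the TF-IDF transformer into separate steps mirrors the bullet list above, where counting and IDF weighting are described as two stages.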
## Supported Datasets & Categories

### VLSP2016 Dataset - General Sentiment Analysis (3 classes)

**Sentiment Categories:**
- **positive** - Positive sentiment towards products/services
- **negative** - Negative sentiment towards products/services
- **neutral** - Neutral or mixed sentiment

**Dataset Statistics:**
- Training samples: 5,100 (1,700 per class)
- Test samples: 1,050 (350 per class)
- Balanced distribution across all sentiment classes
- Domain: General product and service reviews

### UTS2017_Bank Dataset - Banking Aspect Sentiment (35 combined classes)

**Banking Aspects:**
1. **ACCOUNT** - Account services
2. **CARD** - Card services
3. **CUSTOMER_SUPPORT** - Customer support
4. **DISCOUNT** - Discount offers
5. **INTEREST_RATE** - Interest rate information
6. **INTERNET_BANKING** - Internet banking services
7. **LOAN** - Loan services
8. **MONEY_TRANSFER** - Money transfer services
9. **OTHER** - Other services
10. **PAYMENT** - Payment services
11. **PROMOTION** - Promotional offers
12. **SAVING** - Savings accounts
13. **SECURITY** - Security features
14. **TRADEMARK** - Trademark/branding

**Sentiments:**
- **positive** - Positive sentiment
- **negative** - Negative sentiment
- **neutral** - Neutral sentiment

**Combined Labels:** The model predicts combined aspect-sentiment labels in the format `<aspect>#<sentiment>`, such as:
- `CUSTOMER_SUPPORT#negative` - Negative feedback about customer support
- `LOAN#positive` - Positive opinion about loan services
- `TRADEMARK#positive` - Positive brand perception

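
Downstream code usually needs the aspect and the sentiment as separate fields. The sketch below splits a combined `<aspect>#<sentiment>` label; the names `AspectSentiment` and `parse_label` are hypothetical and not part of the project.

```python
from typing import NamedTuple

SENTIMENTS = {"positive", "negative", "neutral"}


class AspectSentiment(NamedTuple):
    aspect: str
    sentiment: str


def parse_label(label: str) -> AspectSentiment:
    """Split a combined label such as 'LOAN#positive' into its two parts."""
    # Split on the last '#' so the aspect keeps any earlier characters intact.
    aspect, separator, sentiment = label.rpartition("#")
    if not separator or not aspect or sentiment not in SENTIMENTS:
        raise ValueError(f"not an <aspect>#<sentiment> label: {label!r}")
    return AspectSentiment(aspect, sentiment)


parsed = parse_label("CUSTOMER_SUPPORT#negative")
print(parsed.aspect, parsed.sentiment)  # CUSTOMER_SUPPORT negative
```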
## Installation

```bash
# Quote the requirement so the shell does not treat ">" as a redirect
pip install "scikit-learn>=1.6" joblib
```

## Usage

### Training the Model

#### Dataset Selection and Training

**VLSP2016 Dataset (General Sentiment Analysis):**
```bash
# Train on VLSP2016 with Logistic Regression
python train.py --dataset vlsp2016 --model logistic

# Train with SVC for better performance
python train.py --dataset vlsp2016 --model svc_linear

# Compare n-gram ranges
python train.py --dataset vlsp2016 --model svc_linear --ngram-min 1 --ngram-max 2
python train.py --dataset vlsp2016 --model svc_linear --ngram-min 1 --ngram-max 3

# Export model for deployment
python train.py --dataset vlsp2016 --model svc_linear --export-model
```

**UTS2017_Bank Dataset (Banking Aspect Sentiment Analysis):**
```bash
# Train on UTS2017_Bank (default dataset)
python train.py --dataset uts2017 --model logistic

# Train with SVC for better performance
python train.py --dataset uts2017 --model svc_linear

# With specific parameters
python train.py --dataset uts2017 --model logistic --max-features 20000 --ngram-min 1 --ngram-max 2

# Export model for deployment
python train.py --dataset uts2017 --model logistic --export-model

# Compare multiple models on specific dataset
python train.py --dataset vlsp2016 --compare-models logistic svc_linear
```

### Training from Scratch

```python
from train import train_notebook

# Train VLSP2016 general sentiment model
results = train_notebook(
    dataset="vlsp2016",
    model_name="svc_linear",
    max_features=20000,
    ngram_min=1,
    ngram_max=2,
    export_model=True
)

# Train UTS2017_Bank aspect sentiment model
results = train_notebook(
    dataset="uts2017",
    model_name="logistic",
    max_features=20000,
    ngram_min=1,
    ngram_max=2,
    export_model=True
)

# Compare multiple models on VLSP2016
comparison_results = train_notebook(
    dataset="vlsp2016",
    compare=True
)
```

## Performance Metrics

### VLSP2016 General Sentiment Analysis Performance
- **Training Accuracy**: 94.57% (SVC Linear)
- **Test Accuracy**: 71.14% (SVC Linear, 1-2 ngram) / 70.67% (SVC Linear, 1-3 ngram) / 70.19% (Logistic Regression)
- **Training Samples**: 5,100 (balanced: 1,700 per class)
- **Test Samples**: 1,050 (balanced: 350 per class)
- **Number of Classes**: 3 sentiment polarities
- **Training Time**: ~24.95 seconds (SVC) / 0.75 seconds (LR)
- **Per-Class Performance (SVC Linear)**:
  - **Positive**: 80% precision, 72% recall, 76% F1-score
  - **Negative**: 70% precision, 72% recall, 71% F1-score
  - **Neutral**: 65% precision, 69% recall, 67% F1-score
- **Key Insights**: Per-class F1-scores stay within a 9-point range (67% to 76%), consistent with the balanced dataset
- **Optimal N-gram**: The unigram and bigram range (1-2) outperforms the range extended to trigrams (1-3) by 0.47 percentage points

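
The per-class figures above can be cross-checked against the reported weighted F1-score of 0.713. The sketch below recomputes F1 from the rounded precision and recall values and weights each class by its 350 test samples. Because the inputs are rounded to two decimals, the result only approximates the reported value.

```python
# Precision and recall per class, as reported above for SVC Linear on VLSP2016.
per_class = {
    "positive": (0.80, 0.72),
    "negative": (0.70, 0.72),
    "neutral": (0.65, 0.69),
}
# The test set is balanced: 350 samples per class.
support = {"positive": 350, "negative": 350, "neutral": 350}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_scores = {label: f1(p, r) for label, (p, r) in per_class.items()}

# Weighted F1: each class contributes in proportion to its support.
weighted_f1 = sum(f1_scores[label] * support[label] for label in f1_scores) / sum(support.values())

for label, score in f1_scores.items():
    print(f"{label}: {score:.2f}")
print(f"weighted F1: {weighted_f1:.3f}")  # 0.712 from rounded inputs, vs. 0.713 reported
```

With equal support per class, the weighted average reduces to the plain mean of the three per-class F1-scores.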
### UTS2017_Bank Aspect Sentiment Analysis Performance
- **Training Accuracy**: 94.57% (SVC)
- **Test Accuracy**: 71.72% (SVC) / 68.18% (Logistic Regression)
- **Training Samples**: 1,581
- **Test Samples**: 396
- **Number of Classes**: 35 aspect-sentiment combinations
- **Training Time**: ~5.3 seconds (SVC) / 2.13 seconds (LR)
- **Best Performing Classes**:
  - `TRADEMARK#positive`: 90% F1-score
  - `CUSTOMER_SUPPORT#positive`: 88% F1-score
  - `LOAN#negative`: 67% F1-score (SVC improvement over LR)
  - `CUSTOMER_SUPPORT#negative`: 65% F1-score
- **Challenges**: Class imbalance hurts performance on minority aspect-sentiment combinations
- **Key Finding**: SVC predicts a more diverse set of categories than Logistic Regression

### Cross-Dataset Performance Analysis
- **Consistent SVC Performance**: ~71% accuracy on both the 3-class (VLSP2016) and 35-class (UTS2017_Bank) tasks
- **Balance Impact**: The balanced VLSP2016 dataset yields consistent per-class results, while the imbalanced UTS2017_Bank dataset shows large per-class variation
- **Training Efficiency**: The larger VLSP2016 dataset takes longer to train with SVC (~24.95 seconds vs. ~5.3 seconds) but gives more stable per-class results

## Using the Pre-trained Models

### Local Model (Vietnamese Banking Aspect Sentiment Analysis)

```python
import joblib

# Load VLSP2016 general sentiment model
general_model = joblib.load("vlsp2016_sentiment_20250929_075529.joblib")

# Load UTS2017_Bank aspect sentiment model
banking_model = joblib.load("uts2017_sentiment_20250928_131716.joblib")

# Or use inference script directly
from inference import predict_text

# General sentiment analysis
general_text = "Sản phẩm này rất tốt, tôi rất hài lòng"
prediction, confidence, top_predictions = predict_text(general_model, general_text)
print(f"General Sentiment: {prediction}")  # Expected: positive

# Banking aspect sentiment analysis
bank_text = "Lãi suất vay mua nhà hiện tại quá cao"
prediction, confidence, top_predictions = predict_text(banking_model, bank_text)
print(f"Banking Aspect-Sentiment: {prediction}")  # Expected: INTEREST_RATE#negative

print(f"Confidence: {confidence:.3f}")
print("Top 3 predictions:")
for i, (category, prob) in enumerate(top_predictions, 1):
    print(f"  {i}. {category}: {prob:.3f}")

# Example output for banking text:
# Banking Aspect-Sentiment: INTEREST_RATE#negative
# Confidence: 0.509
# Top 3 predictions:
#   1. INTEREST_RATE#negative: 0.509
#   2. LOAN#negative: 0.218
#   3. CUSTOMER_SUPPORT#negative: 0.095
```

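
The internals of `predict_text` are not shown in this README. As a rough sketch of how a `(prediction, confidence, top_predictions)` triple could be derived, the hypothetical helper below ranks label/probability pairs. It assumes the loaded model exposes `classes_` and `predict_proba`, which may not hold for every classifier. The three probabilities are taken from the example output above; the rest of the distribution is omitted.

```python
def top_k_predictions(labels, probabilities, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    if len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must have the same length")
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Values copied from the example output above, listed in arbitrary order.
labels = ["CUSTOMER_SUPPORT#negative", "INTEREST_RATE#negative", "LOAN#negative"]
probabilities = [0.095, 0.509, 0.218]

top_predictions = top_k_predictions(labels, probabilities)
prediction, confidence = top_predictions[0]
print(prediction, confidence)  # INTEREST_RATE#negative 0.509
```

With a fitted scikit-learn classifier that supports probabilities, the inputs would come from `model.classes_` and `model.predict_proba([text])[0]`.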
### Using the Inference Script

```bash
# Interactive mode
python inference.py

# Single prediction
python inference.py --text "Lãi suất vay mua nhà hiện tại quá cao"

# Test with examples
python inference.py --test-examples

# List available models
python inference.py --list-models
```

## Model Parameters

- `dataset`: Dataset selection ("vlsp2016" for general sentiment, "uts2017" for banking aspect sentiment)
- `model`: Model type ("logistic", "svc_linear", "svc_rbf", "naive_bayes", "decision_tree", "random_forest", etc.); passed as `model_name` to `train_notebook`
- `max_features`: Maximum number of TF-IDF features (default: 20000)
- `ngram_min` / `ngram_max`: N-gram range (default: 1-2, the best-performing range in the VLSP2016 comparison)
- `split_ratio`: Fraction of the data held out for testing (default: 0.2, only used for uts2017)
- `n_samples`: Optional sample limit for quick testing
- `export_model`: Export model for deployment (creates `<dataset>_sentiment_<timestamp>.joblib`)
- `compare`: Compare multiple model configurations
- `compare_models`: Specify models to compare

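
The default `split_ratio` of 0.2 is consistent with the UTS2017_Bank sample counts reported in this card (1,581 training and 396 test samples). The sketch below reproduces those counts, assuming the test portion is rounded up, as scikit-learn's `train_test_split` does for a fractional `test_size`. The helper `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples: int, test_ratio: float = 0.2) -> tuple[int, int]:
    """Return (n_train, n_test), rounding the test portion up."""
    n_test = math.ceil(n_samples * test_ratio)
    return n_samples - n_test, n_test


# 1,581 training + 396 test samples = 1,977 samples in total.
n_train, n_test = split_sizes(1581 + 396)
print(n_train, n_test)  # 1581 396
```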
## Project Management

### Cleanup Utility

The project includes a cleanup script to manage training runs:

```bash
# Preview runs that will be deleted (without exported models)
uv run python clean.py --dry-run --verbose

# Clean up runs without exported models
uv run python clean.py --yes

# Interactive cleanup with confirmation
uv run python clean.py
```

**Features:**
- Automatically identifies runs without exported model files
- Shows space that will be freed
- Dry-run mode for safe previewing
- Detailed information about each run
- Preserves runs with exported models

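
The behavior described above can be sketched with the standard library. This is not the actual `clean.py`: the directory layout (one folder per run, with any exported `.joblib` file stored inside it) and the function names are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def find_runs_without_export(runs_dir: Path):
    """Yield (run_path, size_in_bytes) for run folders that contain no .joblib file."""
    for run in sorted(p for p in runs_dir.iterdir() if p.is_dir()):
        if any(run.rglob("*.joblib")):
            continue  # preserve runs that have an exported model
        size = sum(f.stat().st_size for f in run.rglob("*") if f.is_file())
        yield run, size


def clean(runs_dir, dry_run=True):
    """Delete runs without exported models; return the number of bytes freed."""
    freed = 0
    for run, size in find_runs_without_export(Path(runs_dir)):
        freed += size
        action = "would delete" if dry_run else "deleting"
        print(f"{action} {run} ({size} bytes)")
        if not dry_run:
            shutil.rmtree(run)
    return freed
```

Calling `clean("runs", dry_run=True)` on a hypothetical `runs` directory only reports what would be removed, which matches the dry-run mode listed above.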
## Limitations

1. **Language Specificity**: Only works with Vietnamese text
2. **Domain Coverage**: Limited to two domains (general sentiment and banking aspect sentiment)
3. **Feature Limitations**: Limited to the 20,000 most frequent features
4. **Class Imbalance Sensitivity**: Per-class performance degrades significantly on imbalanced datasets (evident in UTS2017_Bank)
5. **Specific Weaknesses**:
   - **VLSP2016**: Minor performance variation between sentiment classes
   - **UTS2017_Bank**: Poor performance on minority aspect-sentiment classes due to insufficient training data
   - **N-gram Limitation**: Extending the range to trigrams did not improve accuracy over bigrams on VLSP2016 (70.67% vs. 71.14%) while increasing computational cost
   - Banking domain aspects are limited to predefined categories (account, loan, card, etc.)

## Ethical Considerations

- **Dataset Bias**: Models reflect biases present in the training datasets (VLSP2016 general reviews, UTS2017_Bank banking feedback)
- **Performance Variation**: Per-class performance differs significantly between the balanced (VLSP2016) and imbalanced (UTS2017_Bank) datasets
- **Domain Validation**: Validate the models on the target domain before deployment
- **Class Imbalance**: Consider dataset balance when interpreting results, especially for banking aspect sentiment
- **Representation**: VLSP2016 provides more equitable performance across sentiment classes due to balanced training data

## Citation

If you use this model, please cite:

```bibtex
@misc{undertheseanlp_2025,
    author       = { Vu Anh },
    organization = { UnderTheSea NLP },
    title        = { Pulse Core 1 - Vietnamese Sentiment Analysis System },
    year         = 2025,
    url          = { https://huggingface.co/undertheseanlp/pulse_core_1 },
    doi          = { 10.57967/hf/6605 },
    publisher    = { Hugging Face }
}
```