---
license: apache-2.0
library_name: scikit-learn
tags:
- scikit-learn
- sklearn
- text-classification
- vietnamese
- nlp
- pulse
- tf-idf
- logistic-regression
- svc
- support-vector-classification
- aspect-sentiment-analysis
- banking
- financial-nlp
datasets:
- undertheseanlp/UTS2017_Bank
- ura-hcmut/vlsp2016
metrics:
- accuracy
- precision
- recall
- f1-score
model-index:
- name: pulse-core-1
  results:
  - task:
      type: text-classification
      name: Vietnamese General Sentiment Analysis
    dataset:
      name: VLSP2016
      type: ura-hcmut/vlsp2016
    metrics:
    - type: accuracy
      value: 0.7114
      name: Test Accuracy (SVC Linear)
    - type: accuracy
      value: 0.7019
      name: Test Accuracy (Logistic Regression)
    - type: f1-score
      value: 0.713
      name: Weighted F1-Score (SVC)
    - type: f1-score
      value: 0.703
      name: Weighted F1-Score (Logistic Regression)
  - task:
      type: text-classification
      name: Vietnamese Banking Aspect Sentiment Analysis
    dataset:
      name: UTS2017_Bank
      type: undertheseanlp/UTS2017_Bank
    metrics:
    - type: accuracy
      value: 0.7172
      name: Test Accuracy (SVC)
    - type: accuracy
      value: 0.6818
      name: Test Accuracy (Logistic Regression)
    - type: precision
      value: 0.65
      name: Weighted Precision (SVC)
    - type: recall
      value: 0.72
      name: Weighted Recall (SVC)
    - type: f1-score
      value: 0.66
      name: Weighted F1-Score (SVC)
    - type: f1-score
      value: 0.66
      name: Weighted F1-Score (Logistic Regression)
language:
- vi
pipeline_tag: text-classification
---
# Pulse Core 1 - Vietnamese Sentiment Analysis System
A machine learning-based sentiment analysis system for Vietnamese text. It combines a TF-IDF feature extraction pipeline with classical machine learning classifiers and reaches **71.14% accuracy** on the VLSP2016 general sentiment dataset and **71.72% accuracy** on the UTS2017_Bank banking aspect sentiment dataset, both with Support Vector Classification (SVC).
📋 **[View Detailed System Card](https://huggingface.co/undertheseanlp/pulse_core_1/blob/main/paper/pulse_core_1_technical_report.tex)** for comprehensive model documentation, performance analysis, and limitations.
## Model Description
**Pulse Core 1** is a versatile Vietnamese sentiment analysis system that supports both general sentiment classification and specialized banking aspect sentiment analysis. The system can analyze general Vietnamese text sentiment (positive/negative/neutral) and banking-specific aspect sentiment (combining banking aspects with sentiment polarities). It's designed for Vietnamese text analysis across multiple domains, with specialized capabilities for banking customer feedback analysis and financial service categorization.
### Model Architecture
- **Algorithm**: TF-IDF + SVC/Logistic Regression Pipeline
- **Feature Extraction**: CountVectorizer with 20,000 max features
- **N-gram Support**: Unigram and bigram (1-2)
- **TF-IDF**: Transformation with IDF weighting
- **Classifier**: Support Vector Classification (SVC) / Logistic Regression with optimized parameters
- **Framework**: scikit-learn ≥1.6
- **Caching System**: Hash-based caching for efficient processing
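The architecture listed above can be sketched with standard scikit-learn components. This is a minimal sketch, not the project's actual `train.py`: the helper name `build_pipeline`, the step names, and the choice of `SVC(kernel="linear")` to stand in for the `svc_linear` option are assumptions, and the four-sentence corpus is purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def build_pipeline(max_features=20000, ngram_range=(1, 2)):
    """Assemble a CountVectorizer -> TfidfTransformer -> linear SVC pipeline."""
    return Pipeline([
        # Raw term counts over unigrams and bigrams, capped at max_features.
        ("vect", CountVectorizer(max_features=max_features, ngram_range=ngram_range)),
        # Re-weight the counts with IDF.
        ("tfidf", TfidfTransformer(use_idf=True)),
        # Assumption: a linear-kernel SVC stands in for "svc_linear".
        ("clf", SVC(kernel="linear")),
    ])


# Tiny illustrative corpus (not taken from either dataset).
texts = [
    "sản phẩm rất tốt",
    "dịch vụ rất tốt",
    "sản phẩm quá tệ",
    "dịch vụ quá tệ",
]
labels = ["positive", "positive", "negative", "negative"]

pipeline = build_pipeline()
pipeline.fit(texts, labels)
prediction = pipeline.predict(["dịch vụ rất tốt"])[0]
print(prediction)
```

Swapping the last step for `LogisticRegression` would give the other classifier named above; the vectorizer steps stay the same.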
## Supported Datasets & Categories
### VLSP2016 Dataset - General Sentiment Analysis (3 classes)
**Sentiment Categories:**
- **positive** - Positive sentiment towards products/services
- **negative** - Negative sentiment towards products/services
- **neutral** - Neutral or mixed sentiment
**Dataset Statistics:**
- Training samples: 5,100 (1,700 per class)
- Test samples: 1,050 (350 per class)
- Balanced distribution across all sentiment classes
- Domain: General product and service reviews
### UTS2017_Bank Dataset - Banking Aspect Sentiment (35 combined classes)
**Banking Aspects:**
1. **ACCOUNT** - Account services
2. **CARD** - Card services
3. **CUSTOMER_SUPPORT** - Customer support
4. **DISCOUNT** - Discount offers
5. **INTEREST_RATE** - Interest rate information
6. **INTERNET_BANKING** - Internet banking services
7. **LOAN** - Loan services
8. **MONEY_TRANSFER** - Money transfer services
9. **OTHER** - Other services
10. **PAYMENT** - Payment services
11. **PROMOTION** - Promotional offers
12. **SAVING** - Savings accounts
13. **SECURITY** - Security features
14. **TRADEMARK** - Trademark/branding
**Sentiments:**
- **positive** - Positive sentiment
- **negative** - Negative sentiment
- **neutral** - Neutral sentiment
**Combined Labels:** The model predicts combined aspect-sentiment labels in the format `<aspect>#<sentiment>`, such as:
- `CUSTOMER_SUPPORT#negative` - Negative feedback about customer support
- `LOAN#positive` - Positive opinion about loan services
- `TRADEMARK#positive` - Positive brand perception
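The `<aspect>#<sentiment>` format can be built and parsed with plain string operations. The helper names `make_label` and `split_label` below are hypothetical and only illustrate the format. Note that 14 aspects and 3 sentiments allow 42 combinations, so the 35 classes reported for this dataset presumably reflect only the combinations that occur in the data; that reading is an inference, not something the source states.

```python
ASPECTS = [
    "ACCOUNT", "CARD", "CUSTOMER_SUPPORT", "DISCOUNT", "INTEREST_RATE",
    "INTERNET_BANKING", "LOAN", "MONEY_TRANSFER", "OTHER", "PAYMENT",
    "PROMOTION", "SAVING", "SECURITY", "TRADEMARK",
]
SENTIMENTS = ["positive", "negative", "neutral"]


def make_label(aspect, sentiment):
    """Join an aspect and a sentiment into one combined class label."""
    return f"{aspect}#{sentiment}"


def split_label(label):
    """Split a combined label back into (aspect, sentiment)."""
    aspect, sentiment = label.rsplit("#", 1)
    return aspect, sentiment


# Every combination that the label format could express.
all_labels = [make_label(a, s) for a in ASPECTS for s in SENTIMENTS]
print(len(all_labels))  # 42
print(split_label("CUSTOMER_SUPPORT#negative"))  # ('CUSTOMER_SUPPORT', 'negative')
```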
## Installation
```bash
# Quote the requirement so the shell does not treat ">" as a redirect
pip install "scikit-learn>=1.6" joblib
```
## Usage
### Training the Model
#### Dataset Selection and Training
**VLSP2016 Dataset (General Sentiment Analysis):**
```bash
# Train on VLSP2016 with Logistic Regression
python train.py --dataset vlsp2016 --model logistic
# Train with SVC for better performance
python train.py --dataset vlsp2016 --model svc_linear
# Compare n-gram ranges
python train.py --dataset vlsp2016 --model svc_linear --ngram-min 1 --ngram-max 2
python train.py --dataset vlsp2016 --model svc_linear --ngram-min 1 --ngram-max 3
# Export model for deployment
python train.py --dataset vlsp2016 --model svc_linear --export-model
```
**UTS2017_Bank Dataset (Banking Aspect Sentiment Analysis):**
```bash
# Train on UTS2017_Bank (default dataset)
python train.py --dataset uts2017 --model logistic
# Train with SVC for better performance
python train.py --dataset uts2017 --model svc_linear
# With specific parameters
python train.py --dataset uts2017 --model logistic --max-features 20000 --ngram-min 1 --ngram-max 2
# Export model for deployment
python train.py --dataset uts2017 --model logistic --export-model
# Compare multiple models on a specific dataset (any supported --dataset value)
python train.py --dataset vlsp2016 --compare-models logistic svc_linear
```
### Training from Scratch
```python
from train import train_notebook

# Train VLSP2016 general sentiment model
results = train_notebook(
    dataset="vlsp2016",
    model_name="svc_linear",
    max_features=20000,
    ngram_min=1,
    ngram_max=2,
    export_model=True
)

# Train UTS2017_Bank aspect sentiment model
results = train_notebook(
    dataset="uts2017",
    model_name="logistic",
    max_features=20000,
    ngram_min=1,
    ngram_max=2,
    export_model=True
)

# Compare multiple models on VLSP2016
comparison_results = train_notebook(
    dataset="vlsp2016",
    compare=True
)
```
## Performance Metrics
### VLSP2016 General Sentiment Analysis Performance
- **Training Accuracy**: 94.57% (SVC Linear)
- **Test Accuracy**: 71.14% (SVC Linear, 1-2 ngram) / 70.67% (SVC Linear, 1-3 ngram) / 70.19% (Logistic Regression)
- **Training Samples**: 5,100 (balanced: 1,700 per class)
- **Test Samples**: 1,050 (balanced: 350 per class)
- **Number of Classes**: 3 sentiment polarities
- **Training Time**: ~24.95 seconds (SVC) / 0.75 seconds (LR)
- **Per-Class Performance (SVC Linear)**:
  - **Positive**: 80% precision, 72% recall, 76% F1-score
  - **Negative**: 70% precision, 72% recall, 71% F1-score
  - **Neutral**: 65% precision, 69% recall, 67% F1-score
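- **Key Insights**: Per-class F1-scores stay within a 9-point band (67% to 76%), which is consistent with the balanced class distribution
- **Optimal N-gram**: The unigram-plus-bigram range (1-2) outperforms the range extended to trigrams (1-3) by 0.47 percentage points (71.14% vs. 70.67%)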
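The per-class F1-scores above follow from the reported precision and recall through the harmonic mean, and the n-gram gap is a plain difference of the two test accuracies. The short check below reproduces both; the function name `f1_from` is only illustrative.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, as reported for SVC Linear on VLSP2016.
per_class = {
    "positive": (0.80, 0.72),
    "negative": (0.70, 0.72),
    "neutral": (0.65, 0.69),
}
f1_by_class = {name: round(f1_from(p, r), 2) for name, (p, r) in per_class.items()}
print(f1_by_class)  # {'positive': 0.76, 'negative': 0.71, 'neutral': 0.67}

# Accuracy gap between the (1-2) and (1-3) n-gram ranges, in percentage points.
ngram_gap = round(71.14 - 70.67, 2)
print(ngram_gap)  # 0.47
```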
### UTS2017_Bank Aspect Sentiment Analysis Performance
- **Training Accuracy**: 94.57% (SVC)
- **Test Accuracy**: 71.72% (SVC) / 68.18% (Logistic Regression)
- **Training Samples**: 1,581
- **Test Samples**: 396
- **Number of Classes**: 35 aspect-sentiment combinations
- **Training Time**: ~5.3 seconds (SVC) / 2.13 seconds (LR)
- **Best Performing Classes**:
  - `TRADEMARK#positive`: 90% F1-score
  - `CUSTOMER_SUPPORT#positive`: 88% F1-score
  - `LOAN#negative`: 67% F1-score (SVC improvement over LR)
  - `CUSTOMER_SUPPORT#negative`: 65% F1-score
- **Challenges**: Class imbalance affects minority aspect-sentiment combinations
- **Key Finding**: SVC predicts a more diverse set of categories than Logistic Regression
### Cross-Dataset Performance Analysis
- **Consistent SVC Performance**: ~71% accuracy on both 3-class (VLSP2016) and 35-class (UTS2017_Bank) tasks
- **Balance Impact**: Balanced datasets (VLSP2016) yield consistent per-class results while imbalanced datasets create performance variations
- **Training Efficiency**: Larger balanced datasets require more training time but provide stable results
## Using the Pre-trained Models
### Local Models (General Sentiment and Banking Aspect Sentiment)
```python
import joblib
# Load VLSP2016 general sentiment model
general_model = joblib.load("vlsp2016_sentiment_20250929_075529.joblib")
# Load UTS2017_Bank aspect sentiment model
banking_model = joblib.load("uts2017_sentiment_20250928_131716.joblib")
# Or use inference script directly
from inference import predict_text
# General sentiment analysis
general_text = "Sản phẩm này rất tốt, tôi rất hài lòng"
prediction, confidence, top_predictions = predict_text(general_model, general_text)
print(f"General Sentiment: {prediction}") # Expected: positive
# Banking aspect sentiment analysis
bank_text = "Lãi suất vay mua nhà hiện tại quá cao"
prediction, confidence, top_predictions = predict_text(banking_model, bank_text)
print(f"Banking Aspect-Sentiment: {prediction}") # Expected: INTEREST_RATE#negative
print(f"Confidence: {confidence:.3f}")
print("Top 3 predictions:")
for i, (category, prob) in enumerate(top_predictions, 1):
    print(f" {i}. {category}: {prob:.3f}")
# Example output for banking text:
# Banking Aspect-Sentiment: INTEREST_RATE#negative
# Confidence: 0.509
# Top 3 predictions:
# 1. INTEREST_RATE#negative: 0.509
# 2. LOAN#negative: 0.218
# 3. CUSTOMER_SUPPORT#negative: 0.095
```
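The `(prediction, confidence, top_predictions)` triple shown above can be derived from any scikit-learn classifier that exposes `predict_proba`. The sketch below is not the project's `predict_text`; `top_k_predictions` is a hypothetical helper, and the small Logistic Regression model is trained on made-up sentences only to make the example runnable. An `SVC` would need `probability=True` to support this.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def top_k_predictions(model, text, k=3):
    """Return (best_label, best_probability, [(label, probability), ...])."""
    probabilities = model.predict_proba([text])[0]
    # Indices of the k largest probabilities, highest first.
    order = np.argsort(probabilities)[::-1][:k]
    ranked = [(str(model.classes_[i]), float(probabilities[i])) for i in order]
    best_label, best_probability = ranked[0]
    return best_label, best_probability, ranked


# Illustrative training data (not from UTS2017_Bank).
demo_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
demo_model.fit(
    [
        "lãi suất quá cao", "lãi suất rất cao",
        "nhân viên hỗ trợ tốt", "nhân viên rất nhiệt tình",
        "thẻ bị khóa", "thẻ không dùng được",
    ],
    [
        "INTEREST_RATE#negative", "INTEREST_RATE#negative",
        "CUSTOMER_SUPPORT#positive", "CUSTOMER_SUPPORT#positive",
        "CARD#negative", "CARD#negative",
    ],
)

label, confidence, ranked = top_k_predictions(demo_model, "lãi suất vay quá cao")
print(label, round(confidence, 3))
for position, (category, prob) in enumerate(ranked, 1):
    print(f" {position}. {category}: {prob:.3f}")
```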
### Using the Inference Script
```bash
# Interactive mode
python inference.py
# Single prediction
python inference.py --text "Lãi suất vay mua nhà hiện tại quá cao"
# Test with examples
python inference.py --test-examples
# List available models
python inference.py --list-models
```
## Model Parameters
- `dataset`: Dataset selection ("vlsp2016" for general sentiment, "uts2017" for banking aspect sentiment)
- `model`: Model type ("logistic", "svc_linear", "svc_rbf", "naive_bayes", "decision_tree", "random_forest", etc.)
- `max_features`: Maximum number of TF-IDF features (default: 20000)
- `ngram_min/max`: N-gram range (default: 1-2, the best-performing range in the VLSP2016 comparison)
- `split_ratio`: Fraction of the data held out for testing (default: 0.2, only used for uts2017)
- `n_samples`: Optional sample limit for quick testing
- `export_model`: Export model for deployment (creates `<dataset>_sentiment_<timestamp>.joblib`)
- `compare`: Compare multiple model configurations
- `compare_models`: Specify models to compare
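The default `split_ratio` of 0.2 is consistent with the UTS2017_Bank sample counts reported in the performance section (1,581 training and 396 test samples). The check below assumes the split is done with scikit-learn's `train_test_split`, which the source does not confirm; the `random_state` value is arbitrary and does not affect the split sizes.

```python
from sklearn.model_selection import train_test_split

# Total implied by the reported train and test sizes.
n_samples = 1581 + 396

indices = list(range(n_samples))
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=42)

# The test size is rounded up: ceil(0.2 * 1977) = 396, leaving 1581 for training.
print(len(train_idx), len(test_idx))  # 1581 396
```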
## Project Management
### Cleanup Utility
The project includes a cleanup script to manage training runs:
```bash
# Preview runs that will be deleted (without exported models)
uv run python clean.py --dry-run --verbose
# Clean up runs without exported models
uv run python clean.py --yes
# Interactive cleanup with confirmation
uv run python clean.py
```
**Features:**
- Automatically identifies runs without exported model files
- Shows space that will be freed
- Dry-run mode for safe previewing
- Detailed information about each run
- Preserves runs with exported models
## Limitations
1. **Language Specificity**: Only works with Vietnamese text
2. **Domain Coverage**: Trained and evaluated on only two domains (general product and service reviews, and banking feedback)
3. **Feature Limitations**: Limited to 20,000 most frequent features
4. **Class Imbalance Sensitivity**: Performance degrades significantly with imbalanced datasets (evident in UTS2017_Bank)
5. **Specific Weaknesses**:
   - **VLSP2016**: Minor performance variation between sentiment classes
   - **UTS2017_Bank**: Poor performance on minority aspect-sentiment classes due to insufficient training data
   - **N-gram Limitation**: Adding trigrams did not improve accuracy on VLSP2016 (70.67% vs. 71.14% with bigrams) while increasing computational cost
   - **Aspect Coverage**: Banking domain aspects are limited to the predefined categories (account, loan, card, etc.)
## Ethical Considerations
- **Dataset Bias**: Models reflect biases present in training datasets (VLSP2016 general reviews, UTS2017_Bank banking feedback)
- **Performance Variation**: Overall accuracy is similar on both datasets (~71%), but per-class performance varies far more on the imbalanced UTS2017_Bank dataset than on the balanced VLSP2016 dataset
- **Domain Validation**: Should be validated on target domain before deployment
- **Class Imbalance**: Consider dataset balance when interpreting results, especially for banking aspect sentiment
- **Representation**: VLSP2016 provides more equitable performance across sentiment classes due to balanced training data
## Citation
If you use this model, please cite:
```bibtex
@misc{undertheseanlp_2025,
author = { Vu Anh },
organization = { UnderTheSea NLP },
title = { Pulse Core 1 - Vietnamese Sentiment Analysis System },
year = 2025,
url = { https://huggingface.co/undertheseanlp/pulse_core_1 },
doi = { 10.57967/hf/6605 },
publisher = { Hugging Face }
}
```