	---
	license: apache-2.0
	library_name: scikit-learn
	tags:
	- scikit-learn
	- sklearn
	- text-classification
	- vietnamese
	- nlp
	- pulse
	- tf-idf
	- logistic-regression
	- svc
	- support-vector-classification
	- aspect-sentiment-analysis
	- banking
	- financial-nlp
	datasets:
	- undertheseanlp/UTS2017_Bank
	- ura-hcmut/vlsp2016
	metrics:
	- accuracy
	- precision
	- recall
	- f1-score
	model-index:
	- name: pulse-core-1
	  results:
	  - task:
	      type: text-classification
	      name: Vietnamese General Sentiment Analysis
	    dataset:
	      name: VLSP2016
	      type: ura-hcmut/vlsp2016
	    metrics:
	    - type: accuracy
	      value: 0.7114
	      name: Test Accuracy (SVC Linear)
	    - type: accuracy
	      value: 0.7019
	      name: Test Accuracy (Logistic Regression)
	    - type: f1-score
	      value: 0.713
	      name: Weighted F1-Score (SVC)
	    - type: f1-score
	      value: 0.703
	      name: Weighted F1-Score (Logistic Regression)
	  - task:
	      type: text-classification
	      name: Vietnamese Banking Aspect Sentiment Analysis
	    dataset:
	      name: UTS2017_Bank
	      type: undertheseanlp/UTS2017_Bank
	    metrics:
	    - type: accuracy
	      value: 0.7172
	      name: Test Accuracy (SVC)
	    - type: accuracy
	      value: 0.6818
	      name: Test Accuracy (Logistic Regression)
	    - type: precision
	      value: 0.65
	      name: Weighted Precision (SVC)
	    - type: recall
	      value: 0.72
	      name: Weighted Recall (SVC)
	    - type: f1-score
	      value: 0.66
	      name: Weighted F1-Score (SVC)
	    - type: f1-score
	      value: 0.66
	      name: Weighted F1-Score (Logistic Regression)
	language:
	- vi
	pipeline_tag: text-classification
	---

	# Pulse Core 1 - Vietnamese Sentiment Analysis System

	Pulse Core 1 is a machine learning-based sentiment analysis system for Vietnamese text. It combines a TF-IDF feature extraction pipeline with classical classifiers. With Support Vector Classification (SVC), it reaches 71.14% test accuracy on the VLSP2016 general sentiment dataset and 71.72% test accuracy on the UTS2017_Bank banking aspect sentiment dataset.

	📋 [View Detailed System Card](https://huggingface.co/undertheseanlp/pulse_core_1/blob/main/paper/pulse_core_1_technical_report.tex) for comprehensive model documentation, performance analysis, and limitations.

	## Model Description

	Pulse Core 1 supports two tasks: general sentiment classification (positive/negative/neutral) and banking aspect sentiment analysis, where each prediction combines a banking aspect with a sentiment polarity. It is designed for Vietnamese text, with a particular focus on analyzing banking customer feedback and categorizing financial services.

	### Model Architecture

	- Algorithm: TF-IDF + SVC/Logistic Regression Pipeline
	- Feature Extraction: CountVectorizer with 20,000 max features
	- N-gram Support: Unigram and bigram (1-2)
	- TF-IDF: Transformation with IDF weighting
	- Classifier: Support Vector Classification (SVC) / Logistic Regression with optimized parameters
	- Framework: scikit-learn ≥1.6
	- Caching System: Hash-based caching for efficient processing
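
The architecture listed above could be assembled in scikit-learn roughly as follows. This is a minimal sketch, not the project's actual `train.py`: the function name `build_pipeline`, the step names, the toy training data, and the classifier settings (`SVC(kernel="linear")`, `max_iter=1000`) are assumptions. Only the 20,000-feature cap, the 1-2 n-gram range, and the choice of classifiers come from the list.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def build_pipeline(model_name="svc_linear", max_features=20000, ngram_range=(1, 2)):
    """Hypothetical helper: token counts -> TF-IDF weighting -> classifier."""
    if model_name == "svc_linear":
        classifier = SVC(kernel="linear")  # assumed mapping for "svc_linear"
    elif model_name == "logistic":
        classifier = LogisticRegression(max_iter=1000)
    else:
        raise ValueError(f"unsupported model: {model_name}")
    return Pipeline([
        ("vect", CountVectorizer(max_features=max_features, ngram_range=ngram_range)),
        ("tfidf", TfidfTransformer(use_idf=True)),
        ("clf", classifier),
    ])


# Toy data for illustration only (not taken from VLSP2016 or UTS2017_Bank)
texts = [
    "sản phẩm rất tốt",
    "tôi rất hài lòng",
    "dịch vụ quá tệ",
    "chất lượng kém",
]
labels = ["positive", "positive", "negative", "negative"]

pipeline = build_pipeline("svc_linear")
pipeline.fit(texts, labels)
predictions = pipeline.predict(["sản phẩm tốt"])
print(predictions)
```

Keeping the vectorizer, the TF-IDF step, and the classifier in one `Pipeline` object means a single `joblib` file can carry the whole preprocessing and prediction chain.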

	## Supported Datasets & Categories

	### VLSP2016 Dataset - General Sentiment Analysis (3 classes)

	Sentiment Categories:
	- positive - Positive sentiment towards products/services
	- negative - Negative sentiment towards products/services
	- neutral - Neutral or mixed sentiment

	Dataset Statistics:
	- Training samples: 5,100 (1,700 per class)
	- Test samples: 1,050 (350 per class)
	- Balanced distribution across all sentiment classes
	- Domain: General product and service reviews

	### UTS2017_Bank Dataset - Banking Aspect Sentiment (35 combined classes)

	Banking Aspects:
	1. ACCOUNT - Account services
	2. CARD - Card services
	3. CUSTOMER_SUPPORT - Customer support
	4. DISCOUNT - Discount offers
	5. INTEREST_RATE - Interest rate information
	6. INTERNET_BANKING - Internet banking services
	7. LOAN - Loan services
	8. MONEY_TRANSFER - Money transfer services
	9. OTHER - Other services
	10. PAYMENT - Payment services
	11. PROMOTION - Promotional offers
	12. SAVING - Savings accounts
	13. SECURITY - Security features
	14. TRADEMARK - Trademark/branding

	Sentiments:
	- positive - Positive sentiment
	- negative - Negative sentiment
	- neutral - Neutral sentiment

	Combined Labels: The model predicts combined aspect-sentiment labels in the format `<aspect>#<sentiment>`, such as:
	- `CUSTOMER_SUPPORT#negative` - Negative feedback about customer support
	- `LOAN#positive` - Positive opinion about loan services
	- `TRADEMARK#positive` - Positive brand perception
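
Because each combined label packs two fields into one string, downstream code usually needs to split it again. A minimal sketch follows; `split_label` is a hypothetical helper, not part of the project. Note that 14 aspects and 3 sentiments allow 42 combinations, while the dataset is reported to have 35 classes, so presumably not every combination occurs in the data.

```python
ASPECTS = [
    "ACCOUNT", "CARD", "CUSTOMER_SUPPORT", "DISCOUNT", "INTEREST_RATE",
    "INTERNET_BANKING", "LOAN", "MONEY_TRANSFER", "OTHER", "PAYMENT",
    "PROMOTION", "SAVING", "SECURITY", "TRADEMARK",
]
SENTIMENTS = ["positive", "negative", "neutral"]


def split_label(label):
    """Split '<aspect>#<sentiment>' into (aspect, sentiment), validating both parts."""
    aspect, separator, sentiment = label.partition("#")
    if not separator or aspect not in ASPECTS or sentiment not in SENTIMENTS:
        raise ValueError(f"unexpected label: {label!r}")
    return aspect, sentiment


# Every combination that the label format can express
possible_labels = [f"{a}#{s}" for a in ASPECTS for s in SENTIMENTS]

example = split_label("CUSTOMER_SUPPORT#negative")
print(example, len(possible_labels))
```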

	## Installation

	```bash
	# Quote the requirement so the shell does not treat ">=" as a redirect
	pip install "scikit-learn>=1.6" joblib
	```

	## Usage

	### Training the Model

	#### Dataset Selection and Training

	VLSP2016 Dataset (General Sentiment Analysis):
	```bash
	# Train on VLSP2016 with Logistic Regression
	python train.py --dataset vlsp2016 --model logistic

	# Train with SVC for better performance
	python train.py --dataset vlsp2016 --model svc_linear

	# Compare n-gram ranges
	python train.py --dataset vlsp2016 --model svc_linear --ngram-min 1 --ngram-max 2
	python train.py --dataset vlsp2016 --model svc_linear --ngram-min 1 --ngram-max 3

	# Export model for deployment
	python train.py --dataset vlsp2016 --model svc_linear --export-model
	```

	UTS2017_Bank Dataset (Banking Aspect Sentiment Analysis):
	```bash
	# Train on UTS2017_Bank (default dataset)
	python train.py --dataset uts2017 --model logistic

	# Train with SVC for better performance
	python train.py --dataset uts2017 --model svc_linear

	# With specific parameters
	python train.py --dataset uts2017 --model logistic --max-features 20000 --ngram-min 1 --ngram-max 2

	# Export model for deployment
	python train.py --dataset uts2017 --model logistic --export-model

	# Compare multiple models on specific dataset
	python train.py --dataset vlsp2016 --compare-models logistic svc_linear
	```

	### Training from Scratch

	```python
	from train import train_notebook

	# Train VLSP2016 general sentiment model
	results = train_notebook(
	    dataset="vlsp2016",
	    model_name="svc_linear",
	    max_features=20000,
	    ngram_min=1,
	    ngram_max=2,
	    export_model=True,
	)

	# Train UTS2017_Bank aspect sentiment model
	results = train_notebook(
	    dataset="uts2017",
	    model_name="logistic",
	    max_features=20000,
	    ngram_min=1,
	    ngram_max=2,
	    export_model=True,
	)

	# Compare multiple models on VLSP2016
	comparison_results = train_notebook(
	    dataset="vlsp2016",
	    compare=True,
	)
	```

	## Performance Metrics

	### VLSP2016 General Sentiment Analysis Performance
	- Training Accuracy: 94.57% (SVC Linear)
	- Test Accuracy: 71.14% (SVC Linear, 1-2 ngram) / 70.67% (SVC Linear, 1-3 ngram) / 70.19% (Logistic Regression)
	- Training Samples: 5,100 (balanced: 1,700 per class)
	- Test Samples: 1,050 (balanced: 350 per class)
	- Number of Classes: 3 sentiment polarities
	- Training Time: ~24.95 seconds (SVC) / 0.75 seconds (LR)
	- Per-Class Performance (SVC Linear):
	  - Positive: 80% precision, 72% recall, 76% F1-score
	  - Negative: 70% precision, 72% recall, 71% F1-score
	  - Neutral: 65% precision, 69% recall, 67% F1-score
	- Key Insights: Consistent performance across all sentiment classes due to balanced dataset
	- Optimal N-gram: Bigrams (1-2) outperform trigrams (1-3) by 0.47 percentage points

	### UTS2017_Bank Aspect Sentiment Analysis Performance
	- Training Accuracy: 94.57% (SVC)
	- Test Accuracy: 71.72% (SVC) / 68.18% (Logistic Regression)
	- Training Samples: 1,581
	- Test Samples: 396
	- Number of Classes: 35 aspect-sentiment combinations
	- Training Time: ~5.3 seconds (SVC) / 2.13 seconds (LR)
	- Best Performing Classes:
	  - `TRADEMARK#positive`: 90% F1-score
	  - `CUSTOMER_SUPPORT#positive`: 88% F1-score
	  - `LOAN#negative`: 67% F1-score (SVC improvement over LR)
	  - `CUSTOMER_SUPPORT#negative`: 65% F1-score
	- Challenges: Class imbalance affects minority aspect-sentiment combinations
	- Key Finding: SVC predicts a more diverse set of aspect-sentiment categories than Logistic Regression

	### Cross-Dataset Performance Analysis
	- Consistent SVC Performance: ~71% accuracy on both 3-class (VLSP2016) and 35-class (UTS2017_Bank) tasks
	- Balance Impact: Balanced datasets (VLSP2016) yield consistent per-class results while imbalanced datasets create performance variations
	- Training Efficiency: Larger balanced datasets require more training time but provide stable results
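
The accuracy and weighted precision/recall/F1 figures reported for both datasets can be computed with scikit-learn as sketched below. The labels here are toy values, not model output. One useful sanity check: support-weighted recall is mathematically equal to accuracy, which is consistent with the UTS2017_Bank values in the model metadata (0.72 weighted recall, 0.7172 accuracy).

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy labels for illustration only: 4 samples of one class, 2 of another
y_true = ["LOAN#negative"] * 4 + ["CARD#positive"] * 2
y_pred = ["LOAN#negative"] * 3 + ["CARD#positive"] + ["CARD#positive", "LOAN#negative"]

accuracy = accuracy_score(y_true, y_pred)

# average="weighted" weights each class's score by its number of true samples
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```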

	## Using the Pre-trained Models

	### Local Models (General Sentiment and Banking Aspect Sentiment)

	```python
	import joblib

	# predict_text comes from the project's inference script
	from inference import predict_text

	# Load VLSP2016 general sentiment model
	general_model = joblib.load("vlsp2016_sentiment_20250929_075529.joblib")

	# Load UTS2017_Bank aspect sentiment model
	banking_model = joblib.load("uts2017_sentiment_20250928_131716.joblib")

	# General sentiment analysis
	general_text = "Sản phẩm này rất tốt, tôi rất hài lòng"
	prediction, confidence, top_predictions = predict_text(general_model, general_text)
	print(f"General Sentiment: {prediction}")  # Expected: positive

	# Banking aspect sentiment analysis
	bank_text = "Lãi suất vay mua nhà hiện tại quá cao"
	prediction, confidence, top_predictions = predict_text(banking_model, bank_text)
	print(f"Banking Aspect-Sentiment: {prediction}")  # Expected: INTEREST_RATE#negative

	# Confidence and ranked alternatives for the banking prediction
	print(f"Confidence: {confidence:.3f}")
	print("Top 3 predictions:")
	for i, (category, prob) in enumerate(top_predictions, 1):
	    print(f" {i}. {category}: {prob:.3f}")

	# Example output for banking text:
	# Banking Aspect-Sentiment: INTEREST_RATE#negative
	# Confidence: 0.509
	# Top 3 predictions:
	# 1. INTEREST_RATE#negative: 0.509
	# 2. LOAN#negative: 0.218
	# 3. CUSTOMER_SUPPORT#negative: 0.095
	```
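
`inference.py` is not reproduced here, but the `(prediction, confidence, top_predictions)` tuple could be derived from class probabilities roughly as follows. This is a sketch under assumptions: `top_k_predictions` is a hypothetical stand-in for `predict_text`, the model must expose `predict_proba`, and the toy pipeline below takes the place of the exported `.joblib` models.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def top_k_predictions(model, text, k=3):
    """Return (best label, its probability, top-k (label, probability) pairs)."""
    probabilities = model.predict_proba([text])[0]
    order = np.argsort(probabilities)[::-1][:k]  # indices of the k largest values
    top = [(str(model.classes_[i]), float(probabilities[i])) for i in order]
    prediction, confidence = top[0]
    return prediction, confidence, top


# Toy stand-in model (illustration only, not the exported model)
texts = ["lãi suất quá cao", "lãi suất vay cao", "nhân viên hỗ trợ tốt", "thẻ bị khóa"]
labels = [
    "INTEREST_RATE#negative",
    "INTEREST_RATE#negative",
    "CUSTOMER_SUPPORT#positive",
    "CARD#negative",
]
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

prediction, confidence, top_predictions = top_k_predictions(model, "lãi suất cao")
for rank, (category, prob) in enumerate(top_predictions, 1):
    print(f"{rank}. {category}: {prob:.3f}")
```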

	### Using the Inference Script

	```bash
	# Interactive mode
	python inference.py

	# Single prediction
	python inference.py --text "Lãi suất vay mua nhà hiện tại quá cao"

	# Test with examples
	python inference.py --test-examples

	# List available models
	python inference.py --list-models
	```


	## Model Parameters

	- `dataset`: Dataset selection ("vlsp2016" for general sentiment, "uts2017" for banking aspect sentiment)
	- `model`: Model type ("logistic", "svc_linear", "svc_rbf", "naive_bayes", "decision_tree", "random_forest", etc.)
	- `max_features`: Maximum number of TF-IDF features (default: 20000)
	- `ngram_min/max`: N-gram range (default: 1-2, the best-performing range in the VLSP2016 comparison)
	- `split_ratio`: Fraction of the data held out for testing (default: 0.2, only used for uts2017)
	- `n_samples`: Optional sample limit for quick testing
	- `export_model`: Export model for deployment (creates `<dataset>_sentiment_<timestamp>.joblib`)
	- `compare`: Compare multiple model configurations
	- `compare_models`: Specify models to compare
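
The reported UTS2017_Bank split sizes are consistent with a 0.2 test fraction applied to 1,977 samples (1,581 + 396). The sketch below reproduces those sizes with scikit-learn's `train_test_split`. Whether `train.py` actually uses this function, and the `random_state` value, are assumptions.

```python
from sklearn.model_selection import train_test_split

# Placeholder for the 1,581 + 396 labeled texts of UTS2017_Bank
samples = list(range(1977))

# test_size=0.2 holds out ceil(0.2 * 1977) = 396 samples for testing
train, test = train_test_split(samples, test_size=0.2, random_state=42)
print(len(train), len(test))  # 1581 396
```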

	## Project Management

	### Cleanup Utility

	The project includes a cleanup script to manage training runs:

	```bash
	# Preview runs that will be deleted (without exported models)
	uv run python clean.py --dry-run --verbose

	# Clean up runs without exported models
	uv run python clean.py --yes

	# Interactive cleanup with confirmation
	uv run python clean.py
	```

	Features:
	- Automatically identifies runs without exported model files
	- Shows space that will be freed
	- Dry-run mode for safe previewing
	- Detailed information about each run
	- Preserves runs with exported models
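
The behaviour described above can be sketched as follows. `clean.py` itself is not shown, so everything here is hypothetical: the one-directory-per-run layout, the use of a `.joblib` file as the marker of an exported model, and the function names.

```python
import shutil
import tempfile
from pathlib import Path


def dir_size(path):
    """Total size in bytes of all files under path."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())


def clean_runs(runs_dir, dry_run=True):
    """Select run directories that contain no exported .joblib file.

    Returns (names of selected runs, bytes freed or to be freed).
    Directories are only deleted when dry_run is False.
    """
    removed, freed = [], 0
    for run in sorted(p for p in Path(runs_dir).iterdir() if p.is_dir()):
        if any(run.rglob("*.joblib")):
            continue  # preserve runs with exported models
        removed.append(run.name)
        freed += dir_size(run)
        if not dry_run:
            shutil.rmtree(run)
    return removed, freed


# Demonstration in a temporary directory
root = Path(tempfile.mkdtemp())
(root / "run_a").mkdir()
(root / "run_a" / "metrics.json").write_text("{}")
(root / "run_b").mkdir()
(root / "run_b" / "model.joblib").write_bytes(b"\x00")

preview = clean_runs(root, dry_run=True)   # reports run_a, deletes nothing
result = clean_runs(root, dry_run=False)   # actually removes run_a
remaining = sorted(p.name for p in root.iterdir())
shutil.rmtree(root)
```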

	## Limitations

	1. Language Specificity: Only works with Vietnamese text
	2. Domain Coverage: Trained and evaluated on only two domains (general product/service reviews and banking feedback)
	3. Feature Limitations: Limited to 20,000 most frequent features
	4. Class Imbalance Sensitivity: Performance degrades significantly with imbalanced datasets (evident in UTS2017_Bank)
	5. Specific Weaknesses:
	   - VLSP2016: Minor performance variation between sentiment classes
	   - UTS2017_Bank: Poor performance on minority aspect-sentiment classes due to insufficient training data
	   - N-gram Limitation: Extending the range to trigrams did not improve VLSP2016 test accuracy (70.67% with 1-3 vs 71.14% with 1-2) while increasing computational cost
	   - Banking domain aspects limited to predefined categories (account, loan, card, etc.)

	## Ethical Considerations

	- Dataset Bias: Models reflect biases present in training datasets (VLSP2016 general reviews, UTS2017_Bank banking feedback)
	- Performance Variation: Per-class results are far less uniform on the imbalanced UTS2017_Bank dataset than on the balanced VLSP2016 dataset, even though overall accuracy is similar (~71%)
	- Domain Validation: Should be validated on target domain before deployment
	- Class Imbalance: Consider dataset balance when interpreting results, especially for banking aspect sentiment
	- Representation: VLSP2016 provides more equitable performance across sentiment classes due to balanced training data

	## Citation

	If you use this model, please cite:

	```bibtex
	@misc{undertheseanlp_2025,
	  author = { Vu Anh },
	  organization = { UnderTheSea NLP },
	  title = { Pulse Core 1 - Vietnamese Sentiment Analysis System },
	  year = 2025,
	  url = { https://huggingface.co/undertheseanlp/pulse_core_1 },
	  doi = { 10.57967/hf/6605 },
	  publisher = { Hugging Face }
	}
	```