QA Multi-Head DistilBERT for Helpline Quality Assessment

Model Description

This is a fine-tuned DistilBERT model designed for multi-head quality assurance (QA) classification of call center and helpline transcripts. Developed by BITZ IT Consulting as part of an AI pipeline for child helplines and crisis support services in East Africa, this model evaluates transcript quality across six key dimensions with 17 specific sub-metrics.

The model addresses a critical operational challenge in helpline services: most helpline calls between agents and callers go unmonitored due to the overwhelming manual effort required for quality assurance. Supervisors traditionally must listen to entire call recordings to evaluate performance, making comprehensive QA virtually impossible at scale. By automating this process through AI-powered QA scoring, this model significantly reduces the supervisory burden and enables systematic evaluation of call quality across all interactions, ensuring consistent service standards and targeted agent development.

Model Architecture

Base Model: DistilBERT (distilbert-base-uncased)
Architecture: Multi-head classifier with 6 specialized output heads
Input: Call center/helpline transcripts (max 512 tokens)
Output: Binary predictions for 17 quality assurance sub-metrics
Training: Fine-tuned on domain-specific helpline and call center data

QA Heads and Sub-metrics

Head	Sub-metrics	Count	Description
Opening	Use of call opening phrase	1	Evaluates proper call initiation protocols
Listening	Non-interruption, empathy, paraphrasing, politeness, confidence	5	Assesses active listening and communication skills
Proactiveness	Extra issue solving, satisfaction confirmation, follow-up	3	Measures proactive service approach
Resolution	Information accuracy, language use, consultation, process adherence, clarity	5	Evaluates problem-solving effectiveness
Hold	Hold explanation, gratitude for waiting	2	Assesses proper hold procedures
Closing	Proper closing phrase	1	Evaluates professional call conclusion

Total Sub-metrics: 17 across 6 main QA dimensions

Social Impact and Use Case

This model is specifically designed to support child helplines and crisis intervention services in East Africa. It addresses several critical challenges:

Consistent Care: Ensures uniform quality standards across different operators
Training Support: Provides objective feedback for helpline staff development
Scalable Monitoring: Enables quality assurance at scale for under-resourced services

The model is part of a broader AI pipeline that includes ASR (Automatic Speech Recognition), translation, Entity recognition, case classification and summarization components, all focused on protecting vulnerable populations.

Model Performance

Overall Performance

Overall Accuracy: ~87.5%
Average F1 Score: ~91.2%
Training Approach: Multi-task learning with BCEWithLogitsLoss per head
Evaluation: Comprehensive metrics across all QA dimensions

Per-Head Performance

Detailed Per-Head Performance

Head	Accuracy	Precision	Recall	F1 Score	Performance Level
Closing	100.0%	100.0%	100.0%	100.0%	Perfect
Resolution	90.5%	98.5%	98.5%	98.5%	Excellent
Hold	90.5%	66.7%	100.0%	80.0%	Good
Proactiveness	85.7%	91.7%	95.7%	93.6%	Good
Opening	85.7%	85.7%	85.7%	85.7%	Good
Listening	71.4%	98.5%	93.1%	95.7%	Mixed Performance

Performance Insights

Strongest Performance: Closing and Resolution heads achieve near-perfect scores
Consistent Performance: Opening, Proactiveness show balanced precision/recall
High Precision Models: Most heads demonstrate excellent precision (>85%)
Listening Head: Lower accuracy (71.4%) but exceptional F1 score (95.7%) indicates the model correctly identifies listening behaviors when present, with some false negatives
Hold Head: High accuracy but lower precision suggests conservative predictions - catches all positive cases but with some false positives

Installation and Usage

Quick Start

pip install transformers torch

Model Classes

import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertPreTrainedModel, AutoTokenizer

class MultiHeadQAClassifier(DistilBertPreTrainedModel):
    """
    Multi-head QA classifier for call center quality assessment.
    Each head corresponds to a different QA metric with specific sub-metrics.
    """
    
    def __init__(self, config):
        super().__init__(config)
        
        # QA heads configuration
        self.heads_config = getattr(config, 'heads_config', {
            "opening": 1,
            "listening": 5,
            "proactiveness": 3,
            "resolution": 5,
            "hold": 2,
            "closing": 1
        })
        
        self.bert = DistilBertModel(config)
        classifier_dropout = getattr(config, 'classifier_dropout', 0.1)
        self.dropout = nn.Dropout(classifier_dropout)

        # Multiple classification heads
        self.classifiers = nn.ModuleDict({
            head_name: nn.Linear(config.hidden_size, num_labels)
            for head_name, num_labels in self.heads_config.items()
        })
        
        # Initialize weights
        self.post_init()

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = self.dropout(outputs.last_hidden_state[:, 0])  # [CLS] token

        logits = {}
        losses = {}
        total_loss = 0

        for head_name, classifier in self.classifiers.items():
            head_logits = classifier(pooled_output)
            logits[head_name] = torch.sigmoid(head_logits)  # Convert to probabilities

            # Calculate loss if labels provided
            if labels is not None and head_name in labels:
                loss_fn = nn.BCEWithLogitsLoss()
                loss = loss_fn(head_logits, labels[head_name])
                losses[head_name] = loss.item()
                total_loss += loss

        return {
            "logits": logits,
            "loss": total_loss if labels is not None else None,
            "losses": losses if labels is not None else None
        }

Inference Function

def predict_qa_metrics(text: str, model, tokenizer, threshold: float = 0.5, device=None):
    """
    Predict QA metrics for a helpline transcript with beautiful output formatting.
    
    Args:
        text: Input transcript text
        model: Loaded MultiHeadQAClassifier model
        tokenizer: DistilBERT tokenizer
        threshold: Classification threshold (default: 0.5)
        device: Device to use for inference
    
    Returns:
        Dictionary with predictions and probabilities for each QA metric
    """
    if device is None:
        device = next(model.parameters()).device
    
    model.eval()
    
    # Sub-metric labels for formatted output
    HEAD_SUBMETRIC_LABELS = {
        "opening": ["Use of call opening phrase"],
        "listening": [
            "Caller was not interrupted",
            "Empathizes with the caller", 
            "Paraphrases or rephrases the issue",
            "Uses 'please' and 'thank you'",
            "Does not hesitate or sound unsure"
        ],
        "proactiveness": [
            "Willing to solve extra issues",
            "Confirms satisfaction with action points",
            "Follows up on case updates"
        ],
        "resolution": [
            "Gives accurate information",
            "Correct language use",
            "Consults if unsure",
            "Follows correct steps",
            "Explains solution process clearly"
        ],
        "hold": [
            "Explains before placing on hold",
            "Thanks caller for holding"
        ],
        "closing": ["Proper call closing phrase used"]
    }

    # Tokenize input
    encoding = tokenizer(
        text,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=512
    )
    
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)
    
    # Forward pass
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs["logits"]
    
    # Format results
    results = {}
    print(f"📞 Transcript: {text}\n")
    
    total_positive = 0
    total_metrics = 0
    
    for head_name, probs in logits.items():
        probs_np = probs.cpu().numpy()[0]
        submetrics = HEAD_SUBMETRIC_LABELS.get(head_name, [f"Submetric {i+1}" for i in range(len(probs_np))])
        
        print(f"🔹 {head_name.upper()}:")
        head_results = []
        
        for prob, submetric in zip(probs_np, submetrics):
            prediction = prob > threshold
            indicator = "✓" if prediction else "✗"
            
            if prediction:
                total_positive += 1
            total_metrics += 1
            
            result_item = {
                "submetric": submetric,
                "probability": float(prob),
                "prediction": bool(prediction),
                "indicator": indicator
            }
            head_results.append(result_item)
            
            print(f"  ➤ {submetric}: P={prob:.3f} → {indicator}")
        
        results[head_name] = head_results
    
    # Overall summary
    overall_accuracy = (total_positive / total_metrics) * 100
    print(f"\n Overall Score: {total_positive}/{total_metrics} ({overall_accuracy:.1f}%)")
    
    results["summary"] = {
        "total_positive": total_positive,
        "total_metrics": total_metrics,
        "accuracy": overall_accuracy
    }
    
    return results

Complete Usage Example

from transformers import AutoTokenizer
import torch

# Load model and tokenizer
MODEL_NAME = "openchs/qa-helpline-distilbert-v1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = MultiHeadQAClassifier.from_pretrained(MODEL_NAME)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Example helpline transcript
transcript = """
Hello, thank you for calling our child helpline. My name is Sarah, how can I help you today? 
I understand your concern completely and I want to help you through this difficult situation. 
Let me check what resources we have available for you. Please hold for just a moment while I 
look into this. Thank you for holding. I've found several support options that can help. 
Is there anything else I can assist you with today? Thank you for reaching out to us, 
and please don't hesitate to call again if you need further support.
"""

# Run prediction
results = predict_qa_metrics(transcript, model, tokenizer, threshold=0.5, device=device)

# Access specific results
opening_results = results["opening"]
listening_results = results["listening"]
overall_summary = results["summary"]

FastAPI Integration

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Optional

app = FastAPI(title="QA Helpline Metrics API")

class TranscriptInput(BaseModel):
    text: str
    threshold: Optional[float] = 0.5

@app.post("/predict")
async def predict_transcript_quality(input_data: TranscriptInput):
    try:
        results = predict_qa_metrics(
            text=input_data.text,
            model=model,
            tokenizer=tokenizer,
            threshold=input_data.threshold
        )
        return {"success": True, "predictions": results}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

Training Details

Training Data

Domain: Child helplines and crisis support transcripts
Languages: English
Size: Custom dataset with balanced QA metric annotations with no PII
Preprocessing: no PII removal, text normalization, quality filtering

Training Configuration

Base Model: distilbert-base-uncased
Optimizer: AdamW (lr=2e-5)
Loss Function: BCEWithLogitsLoss (per head)
Batch Size: 4
Max Length: 512 tokens
Epochs: 5
Training Framework: PyTorch + Transformers

Data Preprocessing Pipeline

Text cleaning and normalization
Token length validation
Quality assurance checks

Limitations and Considerations

Technical Limitations

Context Length: Limited to 512 tokens (longer transcripts need chunking)
Language Bias: Primary training on English
Domain Specificity: Optimized for helpline/call center contexts
Binary Classification: Each sub-metric is binary (present/absent)

Ethical Considerations

Human-in-the-Loop: Designed to assist and compliment, not replace human judgment
Privacy: Was trained on custom PII-less data
Bias Monitoring: Regular evaluation for demographic and linguistic bias
Sensitive Context: Special care needed when evaluating crisis support calls

Performance Considerations

Some heads (Listening, Proactiveness, Resolution) show room for improvement
Model performance may vary with transcript quality and length
Threshold tuning recommended based on specific use case requirements

Intended Use Cases

Primary Applications

Helpline Quality Assurance: Automated initial assessment of call quality
Agent Training: Provide structured feedback for skill development
Service Monitoring: Consistent evaluation across different operators
Performance Analytics: Track quality trends and improvement areas

Social Impact Applications

Child Protection: Ensure quality standards in child helpline services
Crisis Support: Maintain high standards in mental health and crisis calls
Language Accessibility: N/A
Capacity Building: Training support for under-resourced helpline services

Out of Scope Uses

Standalone Decision Making: Should not be used without human oversight
General Text Classification: Not optimized for non-helpline contexts
Real-time Critical Decisions: Not suitable for immediate intervention decisions
Legal/Medical Advice Evaluation: Not designed for professional advice assessment

Model Developers

BITZ IT Consulting - AI Solutions for Social Impact

Team:

Data Engineering Lead: Rogendo
Data Analysis: Shemmiriam
Quality Assurance: Nelsonadagi
ML Engineering: Collaborative team effort

Mission: Developing AI solutions that protect vulnerable populations and improve access to critical support services across East Africa.

Evaluation and Monitoring

Performance Tracking

Regular evaluation on held-out test sets
Cross-validation across different helpline types
Continuous monitoring for performance degradation
A/B testing for threshold optimization

Bias and Fairness

Demographic bias assessment
Language performance parity monitoring
Cultural appropriateness evaluation
Regular stakeholder feedback incorporation

Contributing and Support

Community Contributions

Feedback on model performance in different contexts
Contributions to multilingual support (especially East African languages)
Performance improvements and optimization suggestions
Documentation and usage examples

Research Collaboration

We welcome collaboration with:

Child protection organizations
Crisis support services
Academic researchers in NLP and social good
Other organizations serving vulnerable populations

Citation

@model{qa_helpline_distilbert_2025,
  title={QA Multi-Head DistilBERT for Helpline Quality Assessment},
  author={BITZ IT Consulting Team},
  year={2025},
  publisher={Hugging Face},
  journal={Hugging Face Model Hub},
  howpublished={\url{https://huggingface.co/openchs/qa-helpline-distilbert-v1}},
  note={AI for Social Impact: Child Helplines and Crisis Support in East Africa}
}

Model Card Contact

Organization: BITZ IT Consulting
Support: Technical questions and collaboratifzon inquiries welcome

Repository Issues: https://huggingface.co/openchs/qa-helpline-distilbert-v1/discussions

Making Technology Work for Those Who Need It Most

Downloads last month: 859

Model tree for openchs/qa-helpline-distilbert-v1

Base model

distilbert/distilbert-base-uncased

Finetuned

(10331)

this model

Dataset used to train openchs/qa-helpline-distilbert-v1

Evaluation results

Overall Accuracy
self-reported

0.850
Weighted F1 Score
self-reported

0.820

Metadata error: specify a dataset to view leaderboard