Precision vs Sensitivity: Which Should You Optimize for in an LLM Triage Classifier?
How to choose the right metric for your LLM triage system
So, you've built an LLM-based customer service triage bot. The system automatically escalates conversations when customers show signs of frustration, anger, or dissatisfaction. Stakeholders are happy with the test results, and you're ready to deploy.
But three weeks into production, you're getting complaints from two directions. Customer service managers are drowning in false escalations: routine conversations unnecessarily flagged as coming from frustrated customers. Meanwhile, genuinely upset customers are slipping through the cracks, their complaints going unnoticed until they vent on social media or cancel their subscriptions.
When Overall Accuracy Misleads
Most LLM triage systems in production deal with highly imbalanced datasets. Whether you're building customer service triage, email prioritization, or content moderation, you're typically looking for a small percentage of cases that need special handling among a large volume of routine interactions.
Consider a customer escalation system where only 2% of conversations actually require manager intervention. A naive model that always predicts "no escalation needed" would achieve 98% accuracy while being completely useless for its intended purpose.
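To see how hollow that number is, here is a minimal sketch of the arithmetic; the 2% escalation rate and the conversation counts are illustrative assumptions, not measurements from a real system:

```python
# Illustrative only: 10,000 conversations, 2% of which genuinely need escalation.
n_total = 10_000
n_needs_escalation = 200  # assumed 2% positive rate

# A "classifier" that never escalates anything.
true_positives = 0
true_negatives = n_total - n_needs_escalation  # 9,800 routine conversations

accuracy = (true_positives + true_negatives) / n_total
print(f"Accuracy of the always-'no escalation' model: {accuracy:.1%}")  # 98.0%
print(f"Frustrated customers caught: {true_positives}")                 # 0
```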
This is where understanding the confusion matrix becomes critical:
| | Triage System Escalates | Doesn't Escalate |
| ------------------- | ----------------------- | ------------------- |
| Escalation Required | True Positive (TP) | False Negative (FN) |
| Shouldn't Escalate | False Positive (FP) | True Negative (TN) |
In business terms:
- True Positives: Correctly flagged issues that needed attention
- False Positives: Unnecessary escalations that waste resources
- False Negatives: Missed issues that should have been escalated
- True Negatives: Correctly identified routine interactions
The real question isn't "How accurate is our model?" but rather "What's the business cost of each type of error?"
The Precision vs Sensitivity Trade-off
This leads us to the fundamental choice in classification system design:
Precision: Of all the cases we flag for escalation, what percentage actually need it?
- Formula: TP / (TP + FP)
- Business question: "How much can we trust a positive prediction?"
Sensitivity (Recall): Of all the cases that should be escalated, what percentage do we catch?
- Formula: TP / (TP + FN)
- Business question: "How many real issues are we missing?"
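To make the two formulas concrete, here is a minimal sketch that computes both from raw confusion-matrix counts; the counts themselves are invented for illustration:

```python
def precision(tp: int, fp: int) -> float:
    """Of everything we escalated, what fraction actually needed it?"""
    return tp / (tp + fp) if (tp + fp) else 0.0

def sensitivity(tp: int, fn: int) -> float:
    """Of everything that needed escalation, what fraction did we catch?"""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts from a week of triage traffic.
tp, fp, fn = 150, 600, 50

print(f"Precision:   {precision(tp, fp):.2f}")    # 0.20 -> 4 in 5 escalations were unnecessary
print(f"Sensitivity: {sensitivity(tp, fn):.2f}")  # 0.75 -> 1 in 4 real issues was missed
```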
For a given model, you cannot push both metrics up at once simply by moving the decision threshold: raising sensitivity admits more false positives, while raising precision means missing more true cases. The question becomes: which error is more costly for your specific business context?
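One way to see the trade-off is to sweep the decision threshold over the classifier's confidence scores and watch the two metrics pull against each other. A minimal sketch, where both the scores and the ground-truth labels are invented for illustration:

```python
import numpy as np

# Hypothetical escalation scores (model confidence) and true labels (1 = needs escalation).
scores = np.array([0.95, 0.80, 0.70, 0.55, 0.40, 0.35, 0.20, 0.10, 0.05, 0.02])
labels = np.array([1,    1,    0,    1,    0,    0,    1,    0,    0,    0])

for threshold in (0.1, 0.3, 0.5, 0.7, 0.9):
    predicted = scores >= threshold
    tp = np.sum(predicted & (labels == 1))
    fp = np.sum(predicted & (labels == 0))
    fn = np.sum(~predicted & (labels == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  sensitivity={sensitivity:.2f}")
```

Lowering the threshold catches more real issues (higher sensitivity) at the cost of more unnecessary escalations (lower precision), and vice versa.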
A Framework for Metric Selection
To make this decision systematically, evaluate three key cost factors:
1. Value of True Positives
What's the benefit when your system correctly identifies a case that needs attention? For customer service, this might mean retaining a frustrated customer. For security monitoring, it could prevent a significant breach.
2. Cost of False Positives
What happens when your system unnecessarily flags a routine case? Consider both direct costs (human review time) and indirect costs (alert fatigue, reduced trust in the system).
3. Cost of False Negatives
What's the impact when your system misses a case that should have been flagged? This often includes both immediate costs and potential downstream consequences.
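One way to turn these three factors into a concrete decision is to attach an estimated dollar value to each outcome and compare candidate operating points by expected value. The figures below are placeholder assumptions you would replace with your own estimates:

```python
# Illustrative cost model: every figure here is an assumption, not a benchmark.
VALUE_TRUE_POSITIVE = 400.0   # e.g. a retained customer who would otherwise have churned
COST_FALSE_POSITIVE = 15.0    # e.g. ten minutes of a manager's review time
COST_FALSE_NEGATIVE = 250.0   # e.g. refunds, churn risk, social-media fallout

def expected_value(tp: int, fp: int, fn: int) -> float:
    """Net business value of a classifier's outcomes over an evaluation window."""
    return (tp * VALUE_TRUE_POSITIVE
            - fp * COST_FALSE_POSITIVE
            - fn * COST_FALSE_NEGATIVE)

# Two candidate operating points from the same model at different thresholds.
high_sensitivity = expected_value(tp=190, fp=900, fn=10)   # catches almost everything
high_precision   = expected_value(tp=140, fp=100, fn=60)   # escalates sparingly

print(f"High-sensitivity operating point: ${high_sensitivity:,.0f}")
print(f"High-precision operating point:   ${high_precision:,.0f}")
```

With these particular made-up numbers the high-sensitivity operating point wins; change the cost of a false positive and the ranking can flip, which is exactly why it pays to write the costs down.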
Decision Patterns in Practice
Optimize for Sensitivity (Low False Negatives) When:
- Missing a true positive is expensive or dangerous
- You have robust downstream processes to handle false positives
- The cost of human review is relatively low
- Examples: Security threat detection, medical screening, fraud monitoring
Optimize for Precision (Low False Positives) When:
- False positives create significant operational burden
- Missing some true cases has limited immediate impact
- User trust and system adoption are critical
- Examples: Email prioritization, content recommendations, non-urgent notifications
Consider the Email Prioritization Example:
If your "urgent email" classifier flags every message as urgent, users will ignore the feature entirely. Here, false positives destroy the system's value. Missing one urgent email has limited cost; it will be seen eventually. Optimize for precision.
Contrast with Security Monitoring:
If your threat detection system misses a real attack, the consequences could be catastrophic. False positives cost analyst time but create manageable operational overhead. Optimize for sensitivity and build processes to efficiently handle false alarms.
Common Mistakes and How to Avoid Them
Mistake 1: Optimizing for the Wrong Metric
Teams often default to maximizing overall accuracy or F1 score without considering business context. This mathematical optimization rarely aligns with business value.
Solution: Start with business cost analysis before touching any technical metrics.
Mistake 2: Ignoring Downstream Processes
Many teams evaluate their classifier in isolation, ignoring how classification errors flow through their broader system.
Solution: Design multi-stage pipelines where initial high-sensitivity classification feeds into more precise secondary checks.
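As a sketch of that pattern, assume a cheap, high-sensitivity first pass followed by a slower, higher-precision LLM review; the function names, frustration markers, and thresholds below are hypothetical stand-ins:

```python
def cheap_frustration_score(conversation: str) -> float:
    """Stage 1 (assumed): fast, high-sensitivity frustration score in [0, 1].
    In practice this might be a small fine-tuned model or keyword heuristics."""
    frustration_markers = ("refund", "cancel", "unacceptable", "still waiting")
    hits = sum(marker in conversation.lower() for marker in frustration_markers)
    return min(1.0, 0.3 * hits)

def llm_escalation_review(conversation: str) -> bool:
    """Stage 2 (assumed): slower, higher-precision judgment, e.g. prompting an LLM
    to decide whether a manager genuinely needs to intervene. Stubbed for illustration."""
    return "cancel" in conversation.lower()  # replace with a real LLM call

def should_escalate(conversation: str, screen_threshold: float = 0.3) -> bool:
    # Stage 1 casts a wide net so real issues rarely slip past this point.
    if cheap_frustration_score(conversation) < screen_threshold:
        return False
    # Stage 2 is only paid for the small flagged subset, keeping precision high.
    return llm_escalation_review(conversation)

# Example usage (illustrative strings):
print(should_escalate("I'm still waiting on my refund and I want to cancel."))  # True
print(should_escalate("Thanks, that fixed my issue."))                          # False
```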
Mistake 3: Static Optimization
Setting your precision/sensitivity balance once and never revisiting it, even as business conditions change.
Solution: Monitor real-world costs continuously and be prepared to retune your system as operational priorities evolve.
Beyond Binary Choices
Remember that precision and sensitivity optimization isn't always binary. Consider these approaches:
- Confidence Thresholds: Use prediction confidence scores to create multiple escalation tiers (see the sketch after this list)
- Multi-Stage Pipelines: High-sensitivity initial screening followed by high-precision secondary analysis
- Human-in-the-Loop: Strategic human validation points for uncertain cases
- Active Learning: Continuously improve your system based on production feedback
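To illustrate the first of these approaches, here is a minimal sketch that routes conversations into escalation tiers based on an assumed model confidence score; the tier names and cut-offs are illustrative and should be tuned against your own cost analysis:

```python
def escalation_tier(confidence: float) -> str:
    """Map a model's escalation confidence in [0, 1] to an operational tier.
    Cut-offs are illustrative, not recommendations."""
    if confidence >= 0.90:
        return "immediate_manager_escalation"  # high-confidence zone: act automatically
    if confidence >= 0.60:
        return "human_review_queue"            # uncertain zone: human-in-the-loop
    if confidence >= 0.30:
        return "next_day_follow_up"            # low-urgency safety net
    return "no_action"

for score in (0.95, 0.72, 0.41, 0.08):
    print(f"confidence={score:.2f} -> {escalation_tier(score)}")
```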
Moving Forward Strategically
The most successful LLM classification systems align technical metrics with business reality from day one. Before you deploy your next triage system, ask yourself:
- What specific business outcomes are we trying to achieve?
- What's the true cost of each type of classification error?
- How will our classification decisions flow through existing operational processes?
- How will we measure and adjust our system's performance over time?
At our consulting practice, we help organizations design evaluation frameworks that align AI system performance with business outcomes. Whether you're building your first classification system or optimizing existing AI implementations, we'd be happy to discuss how systematic evaluation design could improve your system's real-world impact.
Please contact us at hello@verticalai.com.au or visit our website at https://verticalai.com.au
