Tuning Confidence Thresholds
This page covers the practical decisions: which threshold to start with, how to encode it in your application, and when to retune. For what a confidence score is and how to read the numbers, see Confidence scores in Concepts.
Starting thresholds by use case
The right threshold depends on the cost of a wrong prediction. Start conservative — you can always loosen later as you learn the model’s behavior in production.
| Use case | Cost of a wrong prediction | Suggested auto-accept threshold |
|---|---|---|
| Medical triage, legal review, fraud action | Very high | ≥ 0.95, route everything else to human review |
| Content moderation, financial routing | High | ≥ 0.90 |
| Support ticket routing, document classification | Medium | ≥ 0.85 |
| Product tagging, recommendation filtering | Low | ≥ 0.75 |
| Social-media surfacing, exploratory analytics | Very low | ≥ 0.60 or no threshold |
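One way to encode the table in an application is a small lookup keyed by cost tier. This is a minimal sketch: the tier keys and the `starting_threshold` helper are hypothetical names chosen for illustration, and the values simply mirror the suggested starting points above.

```python
# Hypothetical config: tier names are illustrative, values mirror the table above.
AUTO_ACCEPT_THRESHOLDS = {
    "very_high": 0.95,
    "high": 0.90,
    "medium": 0.85,
    "low": 0.75,
    "very_low": 0.60,
}


def starting_threshold(cost_of_error: str) -> float:
    """Return the suggested starting auto-accept threshold for a cost tier."""
    try:
        return AUTO_ACCEPT_THRESHOLDS[cost_of_error]
    except KeyError:
        # Unknown tier: fall back to the strictest value rather than guess.
        return max(AUTO_ACCEPT_THRESHOLDS.values())
```

Falling back to the strictest value follows the "start conservative" advice: a misconfigured tier sends more traffic to review instead of automating it.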
Three-zone routing in code
The standard pattern uses two thresholds to create three routing zones: auto-accept, human review, and reject.
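result = invoke(input_data)
# Auto-accept: confident enough to act without review.
if result["confidence"] >= 0.90:
    take_automated_action(result["labelName"])
# Human review: plausible, but not certain enough to automate.
elif result["confidence"] >= 0.60:
    queue_for_human_review(result)
# Reject: too uncertain to act on.
else:
    treat_as_uncertain(result)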
The exact numbers are your call. The pattern of three zones with two thresholds is what generalizes — it gives you an auto-accept path for throughput, a review queue that generates training data, and a reject path for inputs that don’t fit any of your defined labels.
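If the thresholds need to change without touching the branching logic, the routing decision can be pulled into a small pure function. The sketch below is an assumption about how you might structure it: `route`, `accept_at`, `review_at`, and the zone names are hypothetical, and the defaults reuse the 0.90 and 0.60 values from the example above.

```python
from typing import Literal

Zone = Literal["auto_accept", "human_review", "reject"]


def route(confidence: float, accept_at: float = 0.90, review_at: float = 0.60) -> Zone:
    """Map a confidence score to one of three routing zones."""
    # Guard against swapped or out-of-range thresholds.
    if not 0.0 <= review_at <= accept_at <= 1.0:
        raise ValueError("expected 0 <= review_at <= accept_at <= 1")
    if confidence >= accept_at:
        return "auto_accept"
    if confidence >= review_at:
        return "human_review"
    return "reject"
```

Keeping the function free of side effects makes it easy to unit-test threshold changes before they reach production traffic.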
When to retune
Threshold settings are not “set and forget.” Revisit them when:
- Too many correct predictions are being flagged for review. Lower the auto-accept threshold so confident predictions go straight through.
- Too many wrong predictions are slipping through automation. Raise the auto-accept threshold so borderline cases get sent for human review.
- The model has matured. After a few weeks of feedback, confidence on correct predictions usually rises. You can often raise both thresholds with little loss of auto-accept volume.
- Real-world inputs have shifted. If the confidence distribution in the console flattens, the model is seeing new kinds of inputs. Raise the thresholds so more of these predictions go to review, and add training samples for the new cases.
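When you retune, reviewed predictions give you something concrete to work from. The sketch below assumes you have collected `(confidence, was_correct)` pairs from your own review queue; the function name and the data shape are hypothetical, not part of any Nyckel API. It returns the lowest threshold at which the auto-accepted set would still meet a target precision on that sample.

```python
from typing import List, Optional, Tuple


def lowest_threshold_for_precision(
    samples: List[Tuple[float, bool]], target_precision: float
) -> Optional[float]:
    """Lowest confidence t such that predictions with confidence >= t
    meet the target precision on the reviewed sample, or None if no t does."""
    ordered = sorted(samples, key=lambda s: s[0], reverse=True)
    best = None
    correct = 0
    for i, (conf, ok) in enumerate(ordered, start=1):
        correct += int(ok)
        # Only evaluate once every sample sharing this confidence is counted.
        if i < len(ordered) and ordered[i][0] == conf:
            continue
        if correct / i >= target_precision:
            best = conf
    return best
```

For example, with six reviewed predictions of which the 0.88 and 0.70 ones were wrong, a target of 0.8 yields 0.85, while a target of 1.0 yields 0.92. A small sample gives a noisy estimate, so treat the result as a starting point rather than a final setting.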
Look at the distribution, not just averages
The Nyckel console shows the confidence distribution across recent predictions. Use it to set thresholds based on your actual traffic rather than guessing:
- A healthy function has most predictions clustered above your auto-accept threshold.
- A long flat tail in the medium range means there’s a real review workload to staff — or a real opportunity to add labeled samples in the categories where the model is uncertain.
- A bimodal distribution (lots of very high and very low predictions, little in the middle) usually means the model has learned the easy cases well and is genuinely unable to classify the rest — that’s a labeling-quality signal, not a threshold problem.
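If you also log confidence scores in your own application, you can reproduce a rough version of this view offline. The sketch below assumes a plain list of scores between 0 and 1 that you have collected yourself; `confidence_histogram` is a hypothetical helper, not a console or API feature.

```python
from typing import List, Sequence


def confidence_histogram(confidences: Sequence[float], bins: int = 10) -> List[int]:
    """Count scores per equal-width bucket over [0, 1]."""
    if bins < 1:
        raise ValueError("bins must be at least 1")
    counts = [0] * bins
    for c in confidences:
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"confidence out of range: {c}")
        # A score of exactly 1.0 belongs in the last bucket.
        idx = min(int(c * bins), bins - 1)
        counts[idx] += 1
    return counts
```

With the default ten buckets, a healthy function shows most of its counts in the top one or two buckets, a flat tail shows up as similar counts across the middle buckets, and a bimodal shape shows heavy first and last buckets with little in between.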