Tuning Confidence Thresholds

This page covers the practical decisions: which threshold to start with, how to encode it in your application, and when to retune. For what a confidence score is and how to read the numbers, see Confidence scores in Concepts.

Starting thresholds by use case

The right threshold depends on the cost of a wrong prediction. Start conservative — you can always loosen later as you learn the model’s behavior in production.

Use case	Cost of a wrong prediction	Suggested auto-accept threshold
Medical triage, legal review, fraud action	Very high	`≥ 0.95`, route everything else to human review
Content moderation, financial routing	High	`≥ 0.90`
Support ticket routing, document classification	Medium	`≥ 0.85`
Product tagging, recommendation filtering	Low	`≥ 0.75`
Social-media surfacing, exploratory analytics	Very low	`≥ 0.60` or no threshold
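
One way to carry these starting points into an application is a small lookup keyed by cost tier. This is only a sketch: the tier keys and the `auto_accept_threshold` helper are hypothetical names invented for illustration, and the numbers simply mirror the table above.

```python
# Suggested starting thresholds, copied from the table above.
# The tier keys are hypothetical names chosen for this sketch.
STARTING_THRESHOLDS = {
    "very_high": 0.95,
    "high": 0.90,
    "medium": 0.85,
    "low": 0.75,
    "very_low": 0.60,
}


def auto_accept_threshold(cost_tier):
    """Return the suggested starting auto-accept threshold for a cost tier."""
    try:
        return STARTING_THRESHOLDS[cost_tier]
    except KeyError:
        # Unknown tiers fall back to the most conservative value.
        return max(STARTING_THRESHOLDS.values())
```

Falling back to the strictest threshold for an unrecognized tier follows the "start conservative" advice; an application could equally choose to raise an error instead.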

Three-zone routing in code

The standard pattern uses two thresholds to create three routing zones: auto-accept, human review, and reject.

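```python
# invoke() and the three handlers are placeholders for your own application code.
result = invoke(input_data)

if result["confidence"] >= 0.90:
    # Auto-accept zone: act on the predicted label without review.
    take_automated_action(result["labelName"])
elif result["confidence"] >= 0.60:
    # Review zone: a person confirms or corrects the prediction.
    queue_for_human_review(result)
else:
    # Reject zone: confidence is too low to trust any label.
    treat_as_uncertain(result)
```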

The exact numbers are your call. The pattern of three zones with two thresholds is what generalizes — it gives you an auto-accept path for throughput, a review queue that generates training data, and a reject path for inputs that don’t fit any of your defined labels.
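
Because the numbers are likely to change, it can help to keep them out of the branching logic. The following is a minimal sketch of that idea, assuming a hypothetical `route` helper and zone names that are not part of any Nyckel API. It returns the zone as a string so the caller decides what each zone does.

```python
AUTO_ACCEPT = 0.90   # at or above this: act automatically
REVIEW_FLOOR = 0.60  # at or above this, but below AUTO_ACCEPT: human review


def route(confidence, auto_accept=AUTO_ACCEPT, review_floor=REVIEW_FLOOR):
    """Map a confidence score to one of three routing zones."""
    if not 0.0 <= review_floor <= auto_accept <= 1.0:
        raise ValueError("expected 0 <= review_floor <= auto_accept <= 1")
    if confidence >= auto_accept:
        return "auto_accept"
    if confidence >= review_floor:
        return "human_review"
    return "reject"
```

With the defaults, `route(0.95)` lands in the auto-accept zone, `route(0.75)` in human review, and `route(0.40)` in reject. Retuning then means changing two constants rather than editing conditionals.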

When to retune

Threshold settings are not “set and forget.” Revisit them when:

Too many correct predictions are being flagged for review. Lower the auto-accept threshold so confident predictions go straight through.
Too many wrong predictions are slipping through automation. Raise the auto-accept threshold so borderline cases get sent for human review.
The model has matured. After a few weeks of feedback, confidence on correct predictions usually rises. You can often raise both thresholds without pushing many more correct predictions into review, which cuts errors at little cost to throughput.
Real-world inputs have shifted. If you see the confidence distribution flatten in the console, the model is seeing new kinds of inputs. Tighten thresholds and add training samples for the new cases.
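
The first two signals above can be measured rather than eyeballed. The sketch below assumes you keep an audit log of `(confidence, was_correct)` pairs, including a sample of auto-accepted predictions that a person has checked. That log, the `retune_signals` helper, and the result keys are all assumptions of this example, not something the platform provides.

```python
def retune_signals(records, auto_accept=0.90, review_floor=0.60):
    """Summarize audited predictions into two retuning signals.

    records: iterable of (confidence, was_correct) pairs, where was_correct
    is the verdict of a human audit (an assumption of this sketch).
    """
    records = list(records)
    auto = [ok for conf, ok in records if conf >= auto_accept]
    review = [ok for conf, ok in records if review_floor <= conf < auto_accept]

    def rate(flags, wanted):
        # Share of flags equal to `wanted`; None when the zone is empty.
        if not flags:
            return None
        return sum(1 for ok in flags if bool(ok) == wanted) / len(flags)

    return {
        # High value: wrong predictions are slipping through automation.
        "auto_accept_error_rate": rate(auto, False),
        # High value: correct predictions are being flagged for review.
        "review_correct_rate": rate(review, True),
    }
```

A high `review_correct_rate` points toward lowering the auto-accept threshold, and a high `auto_accept_error_rate` points toward raising it. What counts as "high" depends on the cost of a wrong prediction for your use case.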

Look at the distribution, not just averages

The Nyckel console shows the confidence distribution across recent predictions. Use it to set thresholds based on your actual traffic rather than guessing:

A healthy function has most predictions clustered above your auto-accept threshold.
A long flat tail in the medium range means there’s a real review workload to staff — or a real opportunity to add labeled samples in the categories where the model is uncertain.
A bimodal distribution (many very high and very low confidence scores, few in the middle) usually means the model has learned the easy cases well and cannot classify the rest. That is a labeling-quality signal, not a threshold problem.
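
If you also log confidence scores on your side, the same shapes can be checked offline. This is a rough sketch that assumes a plain list of logged scores. The `zone_shares` and `histogram` helpers are hypothetical and do not reproduce the console's own view.

```python
from collections import Counter


def zone_shares(confidences, auto_accept=0.90, review_floor=0.60):
    """Return the fraction of predictions that fall in each routing zone."""
    if not confidences:
        raise ValueError("need at least one confidence score")
    counts = Counter(
        "auto_accept" if c >= auto_accept
        else "human_review" if c >= review_floor
        else "reject"
        for c in confidences
    )
    total = len(confidences)
    return {zone: counts[zone] / total
            for zone in ("auto_accept", "human_review", "reject")}


def histogram(confidences, bins=10):
    """Count scores in equal-width bins over [0, 1]; 1.0 goes in the last bin."""
    counts = [0] * bins
    for c in confidences:
        counts[min(int(c * bins), bins - 1)] += 1
    return counts
```

A large `human_review` share corresponds to the long flat tail described above, while a histogram with weight at both ends and little in between corresponds to the bimodal case.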

Tip: When reviewing predictions, prioritize the ones closest to your threshold boundary. These are the inputs where a correction has the highest impact on future model behavior.
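
One way to apply this tip is to sort the review queue by distance to the nearest threshold. The sketch below assumes each queued prediction is a dict with a `confidence` key. The `prioritize_for_review` helper and the `id` field in the usage note are hypothetical.

```python
def boundary_distance(confidence, thresholds=(0.60, 0.90)):
    """Distance from a confidence score to the nearest threshold."""
    return min(abs(confidence - t) for t in thresholds)


def prioritize_for_review(predictions, thresholds=(0.60, 0.90)):
    """Return predictions ordered so the most borderline ones come first."""
    return sorted(
        predictions,
        key=lambda p: boundary_distance(p["confidence"], thresholds),
    )
```

For example, given predictions with confidences 0.75, 0.89, and 0.62, the 0.89 item is reviewed first because it sits closest to the 0.90 boundary, followed by 0.62 and then 0.75.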