Confidence Scores

Every prediction from a Nyckel function includes two values: a predicted label and a confidence score.

{
  "labelName": "Billing",
  "confidence": 0.91
}

The label tells you what the model predicted. The confidence score tells you how certain the model is. Your application uses both to decide what to do next.
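As a minimal sketch, assuming the prediction reaches your application as a JSON body shaped like the example above, both fields could be read like this. The `Prediction` type and the `parse_prediction` helper are hypothetical names used for illustration and are not part of any Nyckel SDK.

```python
import json
from typing import NamedTuple


class Prediction(NamedTuple):
    label: str
    confidence: float


def parse_prediction(body: str) -> Prediction:
    """Extract the label and confidence from a JSON prediction body."""
    data = json.loads(body)
    confidence = float(data["confidence"])
    # Confidence is documented as a value between 0 and 1; reject anything else.
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return Prediction(label=data["labelName"], confidence=confidence)


prediction = parse_prediction('{"labelName": "Billing", "confidence": 0.91}')
print(prediction.label, prediction.confidence)
```

Keeping the two values together in one small type makes it harder to act on a label without also looking at its score.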

What the score represents

Confidence is a calibrated probability — a number between 0 and 1 indicating how likely the prediction is to be correct.

A score of 0.91 means the model is highly confident the input belongs to the predicted label. A score of 0.52 means the model sees it as a close call. Calibrated means the number can be read as a frequency: a prediction with confidence 0.90 should be correct roughly 90% of the time when measured across many predictions at that level.
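If you keep a log of predictions that were later checked by a reviewer, you can test the calibration claim on your own data. The sketch below is one way to do that, assuming a hypothetical list of `(confidence, was_correct)` pairs. It groups scores into ten equal-width bins and reports the observed accuracy in each. The sample records are made up for illustration.

```python
from collections import defaultdict


def calibration_table(records, n_bins=10):
    """Map each bin's lower edge to the observed accuracy inside that bin."""
    bins = defaultdict(lambda: [0, 0])  # bin index -> [correct, total]
    for confidence, was_correct in records:
        # A confidence of exactly 1.0 is placed in the top bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index][0] += int(was_correct)
        bins[index][1] += 1
    return {
        index / n_bins: correct / total
        for index, (correct, total) in sorted(bins.items())
    }


# Hypothetical reviewed predictions: (confidence, was the label correct?)
records = [
    (0.92, True), (0.95, True), (0.91, True), (0.97, False),
    (0.55, True), (0.52, False),
]
table = calibration_table(records)
print(table)
```

With well-calibrated scores and enough samples, the accuracy in each bin should sit close to the confidence values that fall in that bin. A handful of records, as here, is far too few to draw conclusions from.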

This makes confidence scores useful for decision-making. Rather than treating every prediction as equally reliable, you can differentiate the ones your application can act on immediately from the ones that need a closer look.

Reading high and low scores

High confidence (e.g. ≥ 0.85)

The model has seen many similar examples and is placing the input firmly in one label. These predictions are generally safe to act on automatically. In a support ticket routing system, a ticket classified as “Billing” with confidence 0.94 can be routed to the billing team without human review.

Medium confidence (e.g. 0.60 – 0.85)

The model sees the input as a reasonable match for the predicted label but has some uncertainty. These are good candidates for human review — not because the prediction is likely wrong, but because the cost of a mistake may outweigh the benefit of automation. A reviewer can confirm the label or correct it, and that correction becomes a training sample.

Low confidence (e.g. < 0.60)

The model is uncertain. This can mean:

The input is genuinely ambiguous and could belong to more than one label
The model has not seen many examples like this one
The input does not fit any of your defined labels well

A low-confidence prediction is not a failure — it is information. Low-confidence predictions that get reviewed and correctly labeled are often the most valuable additions to your training set, because they fill in the gaps where the model is weakest.
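The three ranges above can be expressed as a small helper. This is a sketch that uses the example cut-offs from this page (0.85 and 0.60). They are illustrative values, not recommendations, and `confidence_band` is a hypothetical name.

```python
HIGH_CUTOFF = 0.85  # example value from this page, not a recommendation
LOW_CUTOFF = 0.60   # example value from this page, not a recommendation


def confidence_band(confidence: float) -> str:
    """Classify a confidence score as 'high', 'medium', or 'low'."""
    if confidence >= HIGH_CUTOFF:
        return "high"
    if confidence >= LOW_CUTOFF:
        return "medium"
    return "low"


for score in (0.94, 0.70, 0.52):
    print(score, confidence_band(score))
```

A score that sits exactly on a cut-off is assigned to the higher band, which matches the "≥ 0.85" and "< 0.60" wording of the headings.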

Thresholds, briefly

A threshold is a confidence value your application uses to decide whether to act on a prediction automatically or send it for review. Picking the right numbers depends on the cost of a wrong prediction in your specific use case and the review capacity of your team. See Tuning confidence thresholds for starting values, the three-zone routing pattern, and when to retune.
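To see the trade-off a threshold creates, it can help to measure two numbers on reviewed predictions: how much traffic would be handled automatically, and how many of those automatic decisions would have been wrong. The sketch below assumes a hypothetical list of `(confidence, was_correct)` pairs collected by your own application. The data is made up.

```python
def threshold_tradeoff(records, threshold):
    """Return (share handled automatically, error rate among those)."""
    automated = [correct for confidence, correct in records if confidence >= threshold]
    if not automated:
        return 0.0, 0.0
    share = len(automated) / len(records)
    error_rate = automated.count(False) / len(automated)
    return share, error_rate


# Hypothetical reviewed predictions: (confidence, was the label correct?)
records = [
    (0.95, True), (0.90, True), (0.88, True), (0.80, False),
    (0.75, True), (0.65, True), (0.62, False), (0.55, False),
]
for threshold in (0.85, 0.60):
    share, error_rate = threshold_tradeoff(records, threshold)
    print(f"threshold={threshold:.2f} automated={share:.0%} errors={error_rate:.0%}")
```

Lowering the threshold automates more predictions but lets more mistakes through. Which point is acceptable depends on what a wrong prediction costs you and how much review capacity you have.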

Confidence across a label set

When a function has more than two labels, the predicted label competes against every other defined label. The reported confidence score reflects how strongly the model favors the predicted label over those alternatives.

This means:

In a two-label function (e.g. Spam / Not Spam), a score of 0.70 means a moderate lean toward one label, since an even split would already give each label 0.50.
In a ten-label function, a score of 0.70 may represent a strong win: the model placed the input well ahead of nine other options, where an even split would give each label only 0.10.
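One rough way to make this comparison concrete is to set a score against the uniform chance level, which is 1 divided by the number of labels. The helper below is an illustrative heuristic only. It is not how Nyckel computes or reports confidence, and `lift_over_chance` is a hypothetical name.

```python
def lift_over_chance(confidence: float, label_count: int) -> float:
    """How many times larger the score is than an even split across labels."""
    chance = 1 / label_count
    return round(confidence / chance, 2)


print(lift_over_chance(0.70, 2))   # two-label function
print(lift_over_chance(0.70, 10))  # ten-label function
```

The same 0.70 is about 1.4 times the chance level with two labels and about 7 times the chance level with ten, which is why a single cut-off can feel strict for one function and lenient for another.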

When tuning thresholds for a function with more than two labels, look at the confidence distribution across recent predictions in the Nyckel console. It shows how the scores cluster for each label, which helps you identify where the model is confident and where it is uncertain.
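If your application logs its own predictions, a similar per-label view can be approximated offline. The sketch below is an assumption-laden illustration, not a reproduction of the console: it takes a hypothetical list of `(label, confidence)` pairs and reports, for each label, the count, the median score, and the share of scores below an example cut-off of 0.60.

```python
from collections import defaultdict
from statistics import median


def summarize_by_label(predictions, low_cutoff=0.60):
    """Summarize logged confidence scores separately for each predicted label."""
    by_label = defaultdict(list)
    for label, confidence in predictions:
        by_label[label].append(confidence)
    return {
        label: {
            "count": len(scores),
            "median": median(scores),
            "share_low": sum(s < low_cutoff for s in scores) / len(scores),
        }
        for label, scores in by_label.items()
    }


# Hypothetical logged predictions: (predicted label, confidence)
predictions = [
    ("Billing", 0.94), ("Billing", 0.90), ("Billing", 0.88),
    ("Shipping", 0.55), ("Shipping", 0.72), ("Shipping", 0.58),
]
summary = summarize_by_label(predictions)
for label, stats in summary.items():
    print(label, stats)
```

A label with a low median or a large low-confidence share, like the made-up "Shipping" label here, is a candidate for more labeled samples.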

How confidence changes as the model improves

Confidence scores are not fixed — they shift as your model learns.

Early in a function’s life, when the labeled dataset is small, the model has limited information to draw from. More predictions will fall in the medium and low confidence ranges. As you add labeled samples and correct predictions, the model retrains on a richer dataset and becomes better at placing inputs firmly in one label. The result is a distribution that shifts toward higher confidence over time.

You can observe this directly in the Nyckel console:

Confidence distribution shows the spread of scores across recent predictions
Accuracy tracks correctness against held-out labeled samples
Label distribution shows how often each label is predicted, which can reveal imbalances in training data

A healthy, well-trained function shows most predictions concentrated at the high end of the confidence range, with only genuinely ambiguous inputs landing in the middle.

A function that is still developing — or one that has encountered new types of inputs it has not seen before — will show a flatter distribution with more medium and low-confidence predictions. That is a signal to review predictions and add more labeled samples, especially in the categories where the model is uncertain.
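The shift described above can also be tracked from your own logs by comparing the share of predictions in each range at two points in time. This sketch reuses the example cut-offs from this page (0.85 and 0.60). The two snapshots are made-up data and `band_shares` is a hypothetical helper.

```python
from collections import Counter


def band_shares(confidences, high=0.85, low=0.60):
    """Return the fraction of scores in the high, medium, and low ranges."""
    if not confidences:
        return {"high": 0.0, "medium": 0.0, "low": 0.0}
    counts = Counter(
        "high" if c >= high else "medium" if c >= low else "low"
        for c in confidences
    )
    total = len(confidences)
    return {band: counts[band] / total for band in ("high", "medium", "low")}


# Hypothetical snapshots of the same function at two stages.
early = [0.55, 0.62, 0.71, 0.90, 0.48]
later = [0.93, 0.88, 0.91, 0.67, 0.95]
print("early:", band_shares(early))
print("later:", band_shares(later))
```

A high-confidence share that stops growing, or drops after a change in incoming data, is the kind of signal that suggests reviewing predictions and adding labeled samples.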

Confidence and the feedback loop

Routing low-confidence predictions to human review, then annotating the results, is the core feedback mechanism that improves a Nyckel function over time.

1. Your application invokes the function and receives a prediction with a confidence score.
2. High-confidence predictions are acted on automatically.
3. Low-confidence predictions are routed to a reviewer.
4. The reviewer confirms or corrects the label.
5. That labeled sample is added to the function’s dataset.
6. The model retrains on the expanded dataset.
7. Similar inputs in the future receive higher-confidence predictions.

This loop means the volume of predictions requiring human review naturally decreases as the function matures — provided reviews are being completed and corrections are being submitted back to the function.
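The application side of this loop can be sketched as follows. Everything here is a stand-in: `predict` and `review` are hypothetical callables representing your prediction call and your review queue, `dataset` is a plain list standing in for the function's labeled samples, and the 0.85 threshold is only the example value used earlier on this page. Retraining is left out because it happens on the model side.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (input text, confirmed label)


def handle_input(
    text: str,
    predict: Callable[[str], Tuple[str, float]],
    review: Callable[[str, str], str],
    dataset: List[Sample],
    threshold: float = 0.85,  # example value, tune for your use case
) -> Tuple[str, str]:
    """Return (label, source), where source is 'auto' or 'review'."""
    label, confidence = predict(text)
    if confidence >= threshold:
        return label, "auto"
    confirmed = review(text, label)    # reviewer confirms or corrects the label
    dataset.append((text, confirmed))  # the result becomes a labeled sample
    return confirmed, "review"


# Stand-ins for a real prediction call and a real review step.
def fake_predict(text: str) -> Tuple[str, float]:
    return ("Billing", 0.94) if "invoice" in text else ("Billing", 0.52)


def fake_review(text: str, suggested: str) -> str:
    return "Shipping"


dataset: List[Sample] = []
first = handle_input("Question about my invoice", fake_predict, fake_review, dataset)
second = handle_input("Where is my parcel?", fake_predict, fake_review, dataset)
print(first, second, dataset)
```

Passing the prediction and review steps in as callables keeps the routing logic testable without a network call or a human in the loop.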

Tip: When reviewing predictions, prioritize the ones closest to your threshold boundary. These are the inputs where a correction has the highest impact on future model behavior.
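One simple way to apply this tip is to sort the review queue by distance from the threshold. The sketch below assumes a hypothetical list of `(item id, confidence)` pairs and the example threshold of 0.85. `review_order` is an illustrative helper, not a Nyckel feature.

```python
def review_order(predictions, threshold=0.85):
    """Sort predictions so the scores nearest the threshold come first."""
    return sorted(predictions, key=lambda item: abs(item[1] - threshold))


# Hypothetical review queue: (item id, confidence)
queue = [("a", 0.52), ("b", 0.84), ("c", 0.70), ("d", 0.87)]
ordered = review_order(queue)
print([item_id for item_id, _ in ordered])
```

Items just above the threshold are included on purpose: they were acted on automatically, so a spot check there covers both sides of the boundary.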