How Functions Improve from Feedback

A Nyckel function doesn’t get better the way a typical ML model does — there’s no “retrain” button, no notebook, no model versioning to manage. It gets better because feedback flows back into the same store the model is trained on, and Nyckel retrains continuously in the background.

This page explains the mechanics so you know what to expect.

The feedback signal

Labeled examples, meaning inputs paired with the correct output, are the only signal that makes a function better. The more you have, and the more closely they reflect what your endpoint sees in production, the better the function gets.

There are three ways labeled examples are added:

You confirm or correct predictions (for classification functions). Every confirmed prediction adds a sample with the predicted label; every correction adds a sample with the corrected label.
You upload training data in the console UI.
You upload training samples via the API — useful when your labels live in another system, or when you want labeling to happen inside your own application. For more on using the API, see the Developer Platform.
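The first source can be pictured as a small mapping from a review decision to a labeled sample. The sketch below is illustrative only: `Prediction`, `Sample`, and `review_to_sample` are hypothetical names invented for this example, not Nyckel API types, and the real request format lives in the API reference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Prediction:
    """A model output awaiting review (hypothetical shape, not a Nyckel type)."""
    data: str
    label: str
    confidence: float


@dataclass(frozen=True)
class Sample:
    """A labeled example: an input paired with its correct output."""
    data: str
    label: str
    source: str  # "confirmed" or "corrected"


def review_to_sample(pred: Prediction, corrected_label: Optional[str] = None) -> Sample:
    """Turn one review decision into one training sample.

    Confirming keeps the predicted label; correcting replaces it.
    """
    if corrected_label is None or corrected_label == pred.label:
        return Sample(pred.data, pred.label, "confirmed")
    return Sample(pred.data, corrected_label, "corrected")


# One confirmation and one correction, each producing exactly one sample.
confirmed = review_to_sample(Prediction("Win a free cruise!", "spam", 0.97))
corrected = review_to_sample(Prediction("Invoice attached", "spam", 0.55), "not spam")
print(confirmed.label, confirmed.source)  # spam confirmed
print(corrected.label, corrected.source)  # not spam corrected
```

Either way, one review yields one sample, which is why regular reviewing accumulates training data without a separate labeling project.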

For classification functions, the first source is the most powerful in the long run, because it’s an automatic byproduct of using the function — the model starts learning from the inputs your endpoint is actually being asked to classify in production. The more your application invokes the endpoint and the more you (or your team) review the results, the more training data you accumulate, with no separate labeling project.

The other two sources are great too, especially for seeding a function with known-good data up front, or when labels are produced somewhere else in your stack.

What happens after you add samples

Nyckel watches your samples store and retrains in the background when there’s a meaningful change. You don’t trigger this and you don’t wait for it. While retraining runs, the existing model keeps serving requests — the endpoint never goes down.

When a candidate model is ready, Nyckel benchmarks it against the current model on held-out data. If the candidate is better, the endpoint switches over automatically. If it isn’t, the current model stays. Models that don’t improve never reach production.

The benchmark uses held-out samples so improvements are measured on examples the candidate model hasn’t seen — that’s the only way to get an honest read on whether it’s actually better.
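The promotion rule described above is a champion/challenger comparison. Here is a minimal sketch of that idea, assuming plain accuracy as the benchmark metric; the source does not say which metric Nyckel uses, and `accuracy` and `choose_model` are hypothetical helpers, not Nyckel internals.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[str], str]   # maps an input to a predicted label
Example = Tuple[str, str]      # (input, correct label)


def accuracy(model: Model, held_out: Sequence[Example]) -> float:
    """Fraction of held-out examples the model labels correctly."""
    if not held_out:
        raise ValueError("held-out set must not be empty")
    correct = sum(1 for x, y in held_out if model(x) == y)
    return correct / len(held_out)


def choose_model(current: Model, candidate: Model, held_out: Sequence[Example]) -> Model:
    """Promote the candidate only if it strictly beats the current model."""
    if accuracy(candidate, held_out) > accuracy(current, held_out):
        return candidate
    return current


# Toy models: the current one calls everything "spam"; the candidate checks a keyword.
def current_model(text: str) -> str:
    return "spam"


def candidate_model(text: str) -> str:
    return "spam" if "free" in text.lower() else "not spam"


# Held-out examples: neither model was fitted on these.
held_out = [
    ("Free cruise inside", "spam"),
    ("Meeting moved to 3pm", "not spam"),
    ("Claim your FREE gift", "spam"),
    ("Quarterly report attached", "not spam"),
]

serving = choose_model(current_model, candidate_model, held_out)
print(accuracy(current_model, held_out), accuracy(candidate_model, held_out))  # 0.5 1.0
print(serving is candidate_model)  # True
```

The strict `>` means a tie keeps the current model, which matches the rule that a model that doesn’t improve never reaches production.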

How fast does it improve?

A few rules of thumb:

First model. For classification, a private model starts training as soon as you have 2 confirmed samples per label. It usually shows a clear improvement over the zero-shot baseline once you reach 10–20 samples per label.
Steep early gains. The biggest accuracy gains come from the first few hundred samples — that’s when the model is learning the most about your labels.
Long tail. Beyond a few hundred samples per label, gains come from reviewing the hard cases: predictions the current model got wrong or had low confidence on. Adding 50 hard samples is not the same as adding 50 random ones: the hard ones move the needle more.

What you should do

You don’t manage retraining, but you do influence how fast and how well a function improves. The two highest-leverage actions are:

Review predictions regularly — especially low-confidence ones. Each review becomes a sample.
Correct mistakes, don’t just confirm. A correction is a much stronger training signal than a confirmation, because it tells the model exactly where its decision boundary is wrong.
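One way to put "especially low-confidence ones" into practice is to sort recent predictions by confidence and review from the bottom up. The sketch below assumes you have already collected predictions with a confidence score in your own application; `ScoredPrediction` and `review_queue` are hypothetical names, not part of Nyckel.

```python
from typing import List, NamedTuple


class ScoredPrediction(NamedTuple):
    data: str
    label: str
    confidence: float  # assumed to be in [0, 1]


def review_queue(preds: List[ScoredPrediction], budget: int) -> List[ScoredPrediction]:
    """Return the `budget` least-confident predictions, least confident first."""
    return sorted(preds, key=lambda p: p.confidence)[:budget]


recent = [
    ScoredPrediction("Win a free cruise!", "spam", 0.99),
    ScoredPrediction("Invoice attached", "spam", 0.52),
    ScoredPrediction("Lunch tomorrow?", "not spam", 0.81),
    ScoredPrediction("Your account statement", "not spam", 0.60),
]

# With time to review only two items, pick the two the model was least sure about.
for pred in review_queue(recent, budget=2):
    print(f"{pred.confidence:.2f}  {pred.label:8}  {pred.data}")
```

A fixed review budget spent this way concentrates effort on the cases most likely to need a correction.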

For a deeper guide, see Review and Improve Predictions under Best Practices.

Box Detect and Search

Box Detect functions improve the same way, by adding more annotated training images with bounding boxes, but the review workflow is different. You don’t review every invocation; instead, you add training images with boxes when the function misses an instance or finds something that isn’t there.

Search functions don’t have a “right answer” to correct — they improve by curating the reference data. Adding, removing, or re-tagging items changes the search results immediately.
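To see why curating reference data takes effect immediately, consider a toy search in which results are computed from the reference set at query time, so there is no training step to wait for. This is a dependency-free illustration only: `ReferenceSearch` is a hypothetical class, and word overlap is a stand-in similarity that says nothing about how Nyckel actually ranks results.

```python
from typing import Dict, Set


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace-separated words of a text."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-overlap similarity: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class ReferenceSearch:
    """Toy search over a mutable reference set; there is no training step."""

    def __init__(self) -> None:
        self._items: Dict[str, Set[str]] = {}

    def add(self, item: str) -> None:
        self._items[item] = tokens(item)

    def remove(self, item: str) -> None:
        self._items.pop(item, None)

    def best_match(self, query: str) -> str:
        if not self._items:
            raise LookupError("reference set is empty")
        q = tokens(query)
        return max(self._items, key=lambda item: jaccard(q, self._items[item]))


index = ReferenceSearch()
index.add("red running shoes")
index.add("blue denim jacket")
before = index.best_match("leather jacket")

# Curate the reference data: the very next query reflects the change.
index.add("brown leather jacket")
after = index.best_match("leather jacket")

print(before)  # blue denim jacket
print(after)   # brown leather jacket
```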

Build and Improve a Spam Classifier — see the whole loop in five minutes.
Review and Improve Predictions: the practical guide to building a steady review workflow.