Training & Accuracy

How Nyckel turns annotated samples into a working model, how it reports accuracy back to you, what to feed it, and the input-size limits to know about.

Accuracy

While a function is training, its accuracy is shown in the left navigation panel of the Nyckel console.

The top bar reflects the overall accuracy — the number of correctly predicted samples divided by the total number of samples.

Below it are class-level accuracy bars. Each bar shows, for one class, what fraction of that class’s samples the function predicted correctly. Per-class bars surface where the model is strong and where it needs more or better training data.
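To make the two kinds of bars concrete, here is a minimal sketch of how overall and per-class accuracy can be computed from annotated labels and predictions. The function names and the sample labels are illustrative assumptions, not part of the Nyckel API.

```python
from collections import Counter


def overall_accuracy(true_labels, predicted_labels):
    """Correctly predicted samples divided by the total number of samples."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one sample")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def per_class_accuracy(true_labels, predicted_labels):
    """For each annotated class, the fraction of its samples predicted correctly."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / totals[label] for label in totals}


# Hypothetical annotations and predictions, for illustration only.
true_labels = ["cat", "cat", "cat", "dog", "dog", "bird"]
predicted = ["cat", "cat", "dog", "dog", "dog", "cat"]

overall = overall_accuracy(true_labels, predicted)     # 4 of 6 correct
by_class = per_class_accuracy(true_labels, predicted)  # cat 2/3, dog 2/2, bird 0/1
```

In this toy data the overall figure looks acceptable while the "bird" class is never predicted correctly, which is the kind of gap the per-class bars are meant to expose.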

Cross-validation

To estimate function accuracy honestly, Nyckel uses cross-validation.

Cross-validation means training multiple models, each on most, but not all, of the samples. Each model then predicts the labels for the samples that were held out of its training set. Stitched together, these held-out predictions cover every annotated sample, and no model is ever evaluated on data it was trained on. Nyckel reports the accuracy of those held-out predictions as your function’s accuracy.

After cross-validation runs, Nyckel trains a single final model using all annotated data. That final model is what serves predictions for new invokes. Cross-validation is purely for the accuracy estimate; production traffic always hits the full-data model.
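The procedure described above can be sketched in a few lines. This is an illustration under stated assumptions, not Nyckel's implementation: the number of folds (3), the way samples are assigned to folds, and the toy nearest-mean classifier are all stand-ins chosen to keep the example self-contained.

```python
def k_fold_indices(n_samples, k):
    """Split sample indices 0..n_samples-1 into k disjoint folds."""
    return [list(range(i, n_samples, k)) for i in range(k)]


def train_nearest_mean(features, labels):
    """Toy stand-in for a real model: predict the class with the nearest mean."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    means = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))


def cross_validate(features, labels, k=3):
    """Return (held-out accuracy, final model trained on all data)."""
    n = len(features)
    held_out_predictions = [None] * n
    for fold in k_fold_indices(n, k):
        held_out = set(fold)
        # Train on every sample that is NOT in the current fold.
        train_x = [features[i] for i in range(n) if i not in held_out]
        train_y = [labels[i] for i in range(n) if i not in held_out]
        model = train_nearest_mean(train_x, train_y)
        # Predict only the samples this model never saw.
        for i in fold:
            held_out_predictions[i] = model(features[i])
    accuracy = sum(p == y for p, y in zip(held_out_predictions, labels)) / n
    # The final model uses all annotated data and serves new predictions.
    final_model = train_nearest_mean(features, labels)
    return accuracy, final_model


# Hypothetical one-feature dataset, for illustration only.
features = [1.0, 5.0, 1.2, 5.2, 0.9, 4.9]
labels = ["low", "high", "low", "high", "low", "high"]
accuracy, final_model = cross_validate(features, labels, k=3)
```

Note that the accuracy comes entirely from the per-fold models, while `final_model` is trained separately on every sample and is never itself evaluated.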

Training data

Nyckel trains your function from the data you import and annotate. To get the best performance, follow these guidelines when choosing what to import:

Provide data similar to what your function will encounter in production. If possible, draw samples from the same system the function will be deployed in. A model trained on screenshots from one tool will be weaker on screenshots from a different tool, even when the task is “the same.”
Provide balanced data. Include roughly the same number of samples per class. A function with 10,000 examples of one class and 50 of another will learn the dominant class well and the rare class poorly.
Provide more data. More annotated samples beat almost every other lever. If you have to choose between cleaning a marginal sample and adding a new one, add a new one.
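Before importing, it can help to check how balanced the annotations are. The sketch below counts samples per class and reports the ratio between the largest and smallest class. The helper name and the label values are hypothetical, and what ratio counts as "too imbalanced" is left to your judgment.

```python
from collections import Counter


def class_balance(labels):
    """Return per-class counts and the largest-to-smallest class ratio."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("need at least one label")
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio


# Mirrors the imbalanced example above: 10,000 samples of one class, 50 of another.
labels = ["common"] * 10_000 + ["rare"] * 50
counts, ratio = class_balance(labels)  # ratio is 200.0
```

A ratio near 1 means the classes are roughly even; a large ratio points at the classes that most need additional samples.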

Context length

Nyckel’s Text and Tabular models rely on several large language models (LLMs) to read sample text. Every LLM has a context length — the maximum amount of text it can ingest (and therefore learn from) in a single sample. Nyckel’s LLMs currently have a context length of 512 tokens.

A “token” is a word, a piece of punctuation, or another language fragment. There is no clean 1-to-1 mapping between tokens and words or characters:

Most tokenizers use one token for the word cat.
The same tokenizers typically use two tokens for cat's.
The exact mapping depends on the tokenizer, which in turn depends on the LLM.

In practice, across a wide range of text, the Nyckel LLMs can ingest 300–500 words before hitting the limit. If your samples are routinely longer than that, you have two options:

Use the full text anyway. Only the part that fits within the context length is read, but the first 300–500 words often carry enough signal for Nyckel to learn a good model; the truncated tail is rarely the discriminating part.
Shorten the input in pre-processing by splitting long samples into several shorter ones, each annotated with the same label.
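The second option can be sketched as a small pre-processing step. This is an assumption-laden illustration: it splits on whitespace and uses a budget of 300 words (the low end of the range above) as a rough proxy for the 512-token limit, because the exact tokenizer is not specified here. The function name and the "invoice" label are hypothetical.

```python
def split_long_sample(text, label, max_words=300):
    """Split one long sample into shorter (text, label) pairs sharing a label."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    chunks = [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]
    # Every chunk inherits the annotation of the original sample.
    return [(chunk, label) for chunk in chunks]


# A hypothetical 700-word sample becomes three chunks of 300, 300, and 100 words.
long_text = " ".join(f"word{i}" for i in range(700))
pieces = split_long_sample(long_text, "invoice")
```

Word counts only approximate token counts, so a conservative budget leaves headroom for text that tokenizes into more pieces than usual.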