Get Reliable Predictions

The quality of your predictions depends heavily on how well your labels are defined and how representative your examples are.

Define clear, non-overlapping labels

Each label should represent a distinct category. If two labels could both apply to the same input, the model will struggle to learn the boundary.

Good label design:

Spam / Not spam — clear binary decision
Billing / Technical / Account — distinct support ticket categories

Problematic label design:

Negative / Somewhat negative — too similar; model will confuse them
Product issue / Complaint — overlapping intent

If you find yourself labeling the same example differently on different days, the label boundary is too ambiguous.
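
One way to spot this kind of ambiguity is to look for identical inputs that ended up with different labels. The following is a minimal sketch, assuming your labeled data is available as `(text, label)` pairs; the function name `find_conflicting_labels` and the sample data are hypothetical and not part of any product API.

```python
from collections import defaultdict

def find_conflicting_labels(examples):
    """Return texts that appear with more than one label.

    `examples` is an iterable of (text, label) pairs (an assumed shape).
    """
    labels_by_text = defaultdict(set)
    for text, label in examples:
        # Normalize lightly so whitespace and case differences still match.
        labels_by_text[text.strip().lower()].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }

# Hypothetical labeled examples.
examples = [
    ("The app crashes on login", "Product issue"),
    ("the app crashes on login ", "Complaint"),
    ("I was charged twice", "Billing"),
]
conflicts = find_conflicting_labels(examples)
print(conflicts)
```

Any text that shows up in the result was labeled inconsistently, which suggests the labels involved need a sharper definition or should be merged.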

Provide balanced examples

Aim for a roughly similar number of examples per label, especially when starting out. A heavily imbalanced dataset (for example, 200 examples of Spam and 5 of Not spam) can skew predictions toward the majority label.
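
A quick count per label makes imbalance visible before you train. This sketch assumes your labels are available as a plain list of strings; `label_balance` is a hypothetical helper, and the ratio it reports is only a rough indicator, not an official threshold.

```python
from collections import Counter

def label_balance(labels):
    """Return per-label counts and the largest-to-smallest count ratio."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("no labels provided")
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio

# The imbalanced dataset from the example above.
labels = ["Spam"] * 200 + ["Not spam"] * 5
counts, ratio = label_balance(labels)
for label, count in counts.most_common():
    print(f"{label}: {count}")
print(f"largest/smallest ratio: {ratio:.1f}")
```

A ratio close to 1 means the labels are roughly balanced. A label with no examples at all will not appear in the counts, so compare the result against your full label list as well.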

Use examples that reflect real inputs

Train on the same kinds of data your application will send at runtime. If your production inputs are short, informal messages, do not train only on long, formal text. Edge cases matter too: include examples that are tricky or borderline.
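
One rough way to check for a mismatch is to compare the typical length of your training examples with a sample of real inputs. This is only a sketch: the sample texts, the `median_word_count` helper, and the factor of 3 are illustrative assumptions, and length is just one of many ways the two sets can differ.

```python
from statistics import median

def median_word_count(texts):
    """Median number of whitespace-separated words per text."""
    return median(len(text.split()) for text in texts)

# Hypothetical samples; replace with your own training and production data.
training_texts = [
    "Dear support team, I am writing to report a problem with my latest invoice.",
    "I would like to request a detailed explanation of the charges on my account.",
]
production_texts = ["cant log in", "refund pls", "where is my order??"]

train_len = median_word_count(training_texts)
prod_len = median_word_count(production_texts)
print(f"median words: training={train_len}, production={prod_len}")

# The factor 3 is an arbitrary illustrative threshold, not a recommendation.
if train_len > 3 * prod_len:
    print("Training examples are much longer than production inputs.")
```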

Keep labels consistent over time

If your team labels data, make sure everyone applies the same rules. Inconsistency in labeling is one of the most common causes of poor accuracy.
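
If two people label the same set of examples, you can measure how consistently they apply the rules. The sketch below computes Cohen's kappa, a standard agreement statistic that corrects for agreement expected by chance. It is written by hand with the standard library only; the function name and sample labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators for the same four tickets.
annotator_1 = ["Billing", "Billing", "Technical", "Account"]
annotator_2 = ["Billing", "Technical", "Technical", "Account"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance. Low values suggest the labeling rules need to be clarified before you collect more data.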

Tip: Write down a one-sentence definition of each label before you start labeling. Refer to it when edge cases come up.