Get Reliable Predictions
The quality of your predictions depends heavily on how well your labels are defined and how representative your samples are.
Define clear, non-overlapping labels
Each label should represent a distinct category. If two labels could both apply to the same input, the model will struggle to learn the boundary.
Good label design:
- Spam / Not spam: clear binary decision
- Billing / Technical / Account: distinct support ticket categories
Problematic label design:
- Negative / Somewhat negative: too similar; the model will confuse them
- Product issue / Complaint: overlapping intent
If you find yourself labeling the same example differently on different days, the label boundary is too ambiguous.
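One way to spot an ambiguous boundary is to look for inputs that already carry more than one label in your dataset. The sketch below assumes your examples are available as `(text, label)` pairs; the function name `find_conflicting_labels` and the sample data are hypothetical and not part of any particular product API.

```python
from collections import defaultdict


def find_conflicting_labels(examples):
    """Return {normalized_text: sorted labels} for texts labeled more than one way."""
    labels_by_text = defaultdict(set)
    for text, label in examples:
        # Normalize case and whitespace so trivial variations count as the same input.
        key = " ".join(text.lower().split())
        labels_by_text[key].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }


# Hypothetical sample data for illustration only.
examples = [
    ("The app crashes on login", "Product issue"),
    ("the app crashes on login", "Complaint"),
    ("I was charged twice", "Billing"),
]

conflicts = find_conflicting_labels(examples)
for text, labels in conflicts.items():
    print(f"{text!r} is labeled as: {', '.join(labels)}")
```

Any text this reports is a candidate for merging the two labels or rewriting their definitions.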
Provide balanced examples
Aim for a roughly similar number of examples per label, especially when starting out. A heavily imbalanced dataset (for example, 200 examples of Spam and 5 of Not spam) can skew predictions toward the majority label.
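A quick count per label makes imbalance visible before you train. The following is a minimal sketch, assuming your labels are available as a plain list of strings; the helper name `label_balance` and the `max_ratio` threshold of 3 are illustrative assumptions, not recommended values from this guide.

```python
from collections import Counter


def label_balance(labels, max_ratio=3.0):
    """Count examples per label and compare the largest class to the smallest.

    max_ratio is an assumed threshold: the dataset is reported as balanced
    when the largest label has at most max_ratio times the examples of the smallest.
    """
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    ratio = largest / smallest
    return counts, ratio, ratio <= max_ratio


# The imbalanced example from the text: 200 Spam, 5 Not spam.
labels = ["Spam"] * 200 + ["Not spam"] * 5

counts, ratio, balanced = label_balance(labels)
print(dict(counts))
print(f"largest/smallest ratio: {ratio:.1f}, balanced: {balanced}")
```

If the check fails, collecting more examples for the smaller label is usually a safer first step than discarding examples from the larger one.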
Use examples that reflect real inputs
Train on the same kinds of data your application will send at runtime. If your production inputs are short, informal messages, do not train only on long, formal text. Edge cases matter too: include examples that are tricky or borderline.
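Input length is one simple, measurable way to compare training data against real traffic. The sketch below assumes you can collect a sample of production inputs as strings; the function names, the `tolerance` factor of 2, and the sample texts are hypothetical. Length is only a rough proxy and does not capture tone or vocabulary.

```python
from statistics import median


def median_word_count(texts):
    """Median number of whitespace-separated words per text."""
    return median(len(text.split()) for text in texts)


def length_mismatch(train_texts, production_texts, tolerance=2.0):
    """Flag a mismatch when one median length exceeds the other by more than `tolerance` times."""
    train_med = median_word_count(train_texts)
    prod_med = median_word_count(production_texts)
    # Guard against division by zero if one sample contains only empty strings.
    ratio = max(train_med, prod_med) / max(min(train_med, prod_med), 1)
    return train_med, prod_med, ratio > tolerance


# Hypothetical samples: formal training text versus short production messages.
train_texts = [
    "Dear support team, I am writing to report that my invoice is incorrect",
    "To whom it may concern, please review the attached account statement",
]
production_texts = ["cant log in", "refund pls", "app broken again"]

train_med, prod_med, mismatch = length_mismatch(train_texts, production_texts)
print(f"train median: {train_med}, production median: {prod_med}, mismatch: {mismatch}")
```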
Keep labels consistent over time
If your team labels data, make sure everyone applies the same rules. Inconsistency in labeling is one of the most common causes of poor accuracy.
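To measure how consistently two people apply the same rules, you can have both label the same set of inputs and compute their agreement. The sketch below uses Cohen's kappa, a standard agreement statistic that corrects for chance; it is offered as one possible check, not as a method this guide prescribes. The annotator data is made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on the same six tickets.
annotator_a = ["Billing", "Billing", "Technical", "Account", "Technical", "Billing"]
annotator_b = ["Billing", "Technical", "Technical", "Account", "Technical", "Billing"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")
```

A value near 1 indicates strong agreement, while a value near 0 means the annotators agree about as often as chance would predict, which suggests the labeling rules need to be written down more precisely.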