How MathANEX uses Nyckel to grade free-form math quizzes

Oscar Beijbom
Sep 2022

MathANEX gives math assessments to K–12 students. Rather than simply determining whether students’ answers are right or wrong and providing a numerical score, MathANEX looks deeply into how students approach math problems by examining how they explain their own work.

“With our Nyckel integration, work that took 1 month now takes 2 weeks.” – Jeremy, CTO, MathANEX

Students’ explanations of how they approach math questions are a powerful lens through which teachers can examine their mathematical knowledge and skill level, and a rich source of insight into how to re-engage them in their learning. The principle behind this approach to assessment (so-called assessment for learning, rather than assessment of learning) is well established in educational theory and practice. Simply telling students whether they are right or wrong is the form of feedback least likely to re-engage them in learning. Showing them the right method is more effective, and building out from a student’s current way of thinking is the most powerful way to engage them in further learning.

But a significant drawback of this approach historically has been that it doesn’t scale; the amount of time required for an expert assessor to read and evaluate long, discursive answers is the very reason why standardized tests so often rely on multiple choice questions.

“What we can do is parse through all these answers, understand how they are approaching it, what tools they’re using, and what it means for the teacher” – Jeremy, CTO, MathANEX

Let’s look at an example of how students answer one of MathANEX’s math questions.

Example math question

The “correct” answer is 9. You divide 33 by 4 and get 8.25 tables, but you can’t buy a quarter of a table, so you round the answer up to 9.

But students take a wide variety of approaches to this question. We used division to answer it, but students often apply a multiplication heuristic, e.g., starting from 10 × 4 = 40 and then dropping down to find the right answer. Others start from 1 × 4 = 4 and work their way up.

In all these cases, a student may arrive at the “correct” answer, but the details of their methods reveal a spread in their mathematical understanding. Analyzing students’ “incorrect” responses is illuminating for the same reason, allowing a teacher to distinguish, for example, students who answer 8 because they miscalculate from those who round down to 8 based on practical reasoning: buying a whole extra table for one person seems excessive, after all.

Rubric example

Simply grading students’ responses as “correct” or “incorrect” would obscure all this creative thinking and provide little insight into how to move students on in their learning.

“We’re getting hundreds of thousands of students explaining how they solved problems, so it’s a very large set of text that we hire analysts to go through” – Jeremy, CTO, MathANEX

With that much data coming in, MathANEX was looking for a way to automate the grading process. They had identified two ways Machine Learning (ML) could help. First, student explanations fall into just a handful of prototypical categories like those discussed above; ML can organize the whole dataset into those categories, making it easier for human markers to then provide systematic feedback to students. Second, ML can directly complement human reviewers in assessing students’ individual answers.

“The pricing is extremely reasonable. Not only is Nyckel cheaper than hiring human analysts, but it also saves on all the secondary costs: the hiring, the training, the managing; that, we don’t have to do.” – Jeremy, CTO, MathANEX

Great. So, how easy is it to implement an ML solution? Jeremy is a developer who had done some ML work on MathANEX’s data before coming to Nyckel. He was able to verify that an ML approach would be useful, but the amount of work required to do it in-house was prohibitive, both to reach good enough performance and to bring it to production.

“You’re way cheaper than a developer to implement this.” – Jeremy, CTO, MathANEX

Jeremy found Nyckel halfway through the spring semester of 2022 and plugged some data into its API. Training a text classification function meant uploading a CSV file with tags; after a few issues getting the formatting right, the model trained right away. The UI let him see the stats as well as individual predictions. The results, which Jeremy verified with a train / test split, were “way better” than what he had been able to implement independently. At this point, MathANEX was using Nyckel mostly as a double-checker for assessment review.
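For readers curious what that kind of integration looks like, here is a minimal Python sketch of the workflow described above: posting labelled answers to a Nyckel text classification function and then requesting predictions. The function ID, API token, CSV file name and column names, and the exact endpoint paths and payload shapes are assumptions for illustration, not details from MathANEX’s setup; check Nyckel’s API documentation for the current contract.

```python
import csv
import requests

# Assumptions (not from the article): a Nyckel text-classification function
# already exists, and FUNCTION_ID / API token / answers.csv are placeholders.
# Endpoint paths and payload shapes follow typical REST conventions and should
# be verified against Nyckel's API docs.
API_BASE = "https://www.nyckel.com/v1/functions"
FUNCTION_ID = "your-function-id"
HEADERS = {"Authorization": "Bearer your-api-token"}


def upload_labeled_samples(csv_path: str) -> None:
    """Read (answer, tag) rows from a CSV and post them as training samples."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # expects columns: "answer", "tag"
            resp = requests.post(
                f"{API_BASE}/{FUNCTION_ID}/samples",
                headers=HEADERS,
                json={"data": row["answer"], "annotation": {"labelName": row["tag"]}},
            )
            resp.raise_for_status()


def classify(answer_text: str) -> dict:
    """Ask the trained function for a label and confidence for one student answer."""
    resp = requests.post(
        f"{API_BASE}/{FUNCTION_ID}/invoke",
        headers=HEADERS,
        json={"data": answer_text},
    )
    resp.raise_for_status()
    return resp.json()  # assumed shape, e.g. {"labelName": "...", "confidence": 0.97}


if __name__ == "__main__":
    upload_labeled_samples("answers.csv")
    print(classify("I did 33 divided by 4 and rounded up to 9"))
```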

“Already at this stage, using Nyckel had saved more time than had been spent on the first implementation. So that was big. And that was all for free!” – Jeremy, CTO, MathANEX

The next step was training a production model for the fall semester. This time, Jeremy used Nyckel for the grouping problem. Training data came from human review of a few answers per question, graded according to MathANEX’s assessment rubric. Jeremy then trained a Nyckel function and used its predictions to assign rubric criteria to each answer; those assigned criteria form the groups that go to manual review. Once trained, the model could group answers before a human had looked at them, with around 50% of student responses grouped at above 95% confidence.
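In practice, that grouping step amounts to a confidence-threshold triage. The sketch below assumes a `predict` callable that returns a label and confidence for one answer (the dictionary keys are assumptions about the prediction shape); the 95% threshold is the figure quoted above, and everything else is illustrative rather than MathANEX’s actual code.

```python
from typing import Callable, Iterable, List, Tuple

# Responses predicted at or above this confidence are auto-grouped; the rest go
# to manual review. The 0.95 figure comes from the article.
CONFIDENCE_THRESHOLD = 0.95


def triage_responses(
    responses: Iterable[str],
    predict: Callable[[str], dict],
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split responses into auto-grouped (text, rubric label) pairs and a manual-review pile."""
    auto_grouped: List[Tuple[str, str]] = []
    needs_review: List[str] = []
    for text in responses:
        prediction = predict(text)  # assumed shape: {"labelName": ..., "confidence": ...}
        if prediction["confidence"] >= CONFIDENCE_THRESHOLD:
            auto_grouped.append((text, prediction["labelName"]))
        else:
            needs_review.append(text)
    return auto_grouped, needs_review
```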

“This was a game-changer. In a month-long project, in the first day we’re at like 50%... it helps us move much, much faster.” – Jeremy, CTO, MathANEX

Integrating the model into MathANEX’s production system via the API took a few hours. Having half of student responses automatically evaluated at this level of confidence cut MathANEX’s work in half: instead of taking one month to analyze a hundred thousand responses, it could now be done in two weeks. In a busy month, this translates to saving 10 times the amount of money invested in Nyckel’s services.

“For a small company, a 10x cost savings is huge. It’s quite meaningful.” – Jeremy, CTO, MathANEX

So, what’s next for MathANEX? In terms of assessment design, MathANEX prefers to keep answers as free-form as possible; instead of locking assessment to a number-plus-explanation format, they would like to leave answers completely open. Grouping fully free-form responses is more analytically challenging – at least for a human – but it is a reasonable next step for MathANEX and Nyckel.

“I love what you are doing.” – Jeremy, CTO, MathANEX

Jeremy is keen to continue to build on MathANEX’s positive relationship with Nyckel, giving the AI more responsibility as it gets better – especially as MathANEX continues to grow. One reasonable extension of their use of AI would be to apply it to the assessments themselves, rather than restricting it to the initial grouping step (which speeds up subsequent human classification) and the moderation step (which checks human answers). ML is also able to compensate for some forms of cognitive bias, such as recency bias, which can skew the assessments by analysts who are grading the same question over and over again. If an ML solution can achieve higher accuracy than human assessors across some large proportion of student responses, then those response classes are good candidates for automation.
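One way such an automation decision could be made is sketched below, under assumptions not taken from the article: score the model’s predictions for each rubric class against held-out human labels and flag the classes that clear a target accuracy. The 0.98 target and the data shape are illustrative placeholders, not MathANEX figures.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# Illustrative target: classes where the model matches held-out human labels at
# least this often are flagged as candidates for full automation.
TARGET_ACCURACY = 0.98


def automation_candidates(labelled_pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """labelled_pairs: (human_label, model_label) for held-out, human-reviewed answers."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for human_label, model_label in labelled_pairs:
        total[human_label] += 1
        correct[human_label] += int(model_label == human_label)
    return [
        label
        for label in total
        if correct[label] / total[label] >= TARGET_ACCURACY
    ]
```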

“Cost savings are great, but you all actually allow us to scale easier too. Because of the reduced labor needs.” – Jeremy, CTO, MathANEX