The research
Rubrica grows out of ongoing PhD research on rubric drift: the tendency of an LLM grader to assign different scores to equally correct versions of the same answer. We study where that comes from inside the model, measure it, and train it away.
We change an answer in ways that keep it correct, then watch the score move. Each kind of change points to a different weakness inside the model. Below are the four we study most closely; our research suggests there are more still to map.
Move a key step elsewhere while keeping the logic. Tied to positional attention bias and lost-in-the-middle effects.
Keep it correct but make it longer or shorter. Tied to length preference and evidence dilution.
Change wording or formatting only. Tied to style and format shortcuts and unstable parsing.
Add unrelated but harmless text. Tied to attention leaking onto irrelevant tokens.
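The four changes above can be sketched as small functions over an answer split into steps, with drift read off as the spread of scores across the variants. This is a minimal illustration, not the project's code: the function names, the sample answer, and `toy_grader` (a deliberately biased stand-in for an LLM grader) are all hypothetical, and the caller is assumed to pick only changes that keep the answer correct.

```python
KEY_FACT = "v = 4 m/s"  # hypothetical rubric item for the sample answer

def move_step(steps, src, dst):
    """Order change: move the step at index src to index dst."""
    steps = list(steps)
    steps.insert(dst, steps.pop(src))
    return steps

def pad(steps, filler="To restate, the reasoning above holds."):
    """Length change: append a sentence that adds no new information."""
    return list(steps) + [filler]

def restyle(steps):
    """Style change: same wording, presented as a bulleted list."""
    return ["- " + s for s in steps]

def distract(steps, aside="Unrelated note: the lab closes at 5 pm."):
    """Distractor change: prepend harmless, irrelevant text."""
    return [aside] + list(steps)

def toy_grader(steps):
    """Hypothetical stand-in for an LLM grader, biased on purpose.

    It rewards the key fact, but also its position and the step count,
    which is exactly the kind of behaviour drift testing should expose.
    """
    if not any(KEY_FACT in s for s in steps):
        return 0.0
    score = 6.0
    if KEY_FACT in steps[-1]:
        score += 2.0             # position bias
    score += 0.5 * len(steps)    # length bias
    return min(score, 10.0)

def score_spread(grader, variants):
    """Drift for one answer: highest minus lowest score across variants."""
    scores = [grader(v) for v in variants]
    return max(scores) - min(scores)

answer = [
    "Momentum is conserved, so m1*v1 = (m1 + m2)*v.",
    "Substituting gives v = 4 m/s.",
    "The collision is inelastic, so kinetic energy is not conserved.",
]
variants = {
    "original": answer,
    "order": move_step(answer, 2, 0),
    "length": pad(answer),
    "style": restyle(answer),
    "distractor": distract(answer),
}
scores = {name: toy_grader(v) for name, v in variants.items()}
drift = score_spread(toy_grader, variants.values())
```

A grader with no drift would give every variant the same score, so `drift` would be zero. The toy grader does not, because its position and length terms react to changes that leave correctness untouched.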
A way to measure the problem, and a way to train it out of the model.
A STEM grading benchmark that pairs each answer with rubric-equivalent rewrites, so a grader can finally be measured on how much it drifts, not just whether it is roughly right.
A consistency objective that pushes the model toward the same score distribution for answers that mean the same thing.
A regularizer that keeps the model attending to the tokens that carry the answer, rather than length, style, or distractor text.
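One way to picture the two training terms is as penalties added to the usual grading loss. The sketch below rests on assumptions the text does not confirm: symmetric KL divergence stands in for the consistency objective, the regularizer is modelled as the share of attention mass falling outside a mask of answer-bearing tokens, and the weights `lam` and `mu` and all function names are hypothetical.

```python
import math

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two discrete score distributions.

    Assumed here as one possible consistency measure; it is zero when
    the grader gives both answers the same distribution over scores.
    """
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)

def off_answer_mass(attention, answer_mask):
    """Share of attention mass on tokens not marked as answer-bearing."""
    total = sum(attention)
    off = sum(a for a, keep in zip(attention, answer_mask) if not keep)
    return off / total

def combined_loss(task_loss, p_original, p_rewrite, attention, answer_mask,
                  lam=1.0, mu=0.1):
    """Task loss plus consistency and attention penalties (weights assumed)."""
    return (task_loss
            + lam * symmetric_kl(p_original, p_rewrite)
            + mu * off_answer_mass(attention, answer_mask))

# Score distributions over a hypothetical 3-point rubric scale.
p_original = [0.1, 0.2, 0.7]
p_rewrite = [0.3, 0.4, 0.3]

# Attention over three tokens; the middle token is a distractor.
attention = [0.5, 0.3, 0.2]
answer_mask = [True, False, True]

loss = combined_loss(0.4, p_original, p_rewrite, attention, answer_mask)
```

Both penalties vanish in the ideal case: identical score distributions for equivalent answers, and all attention on the tokens that carry the answer. In a real training setup these would be computed on model tensors with automatic differentiation rather than on Python lists.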
The research brings together researchers from five universities, led from the University of Arizona. It is active and ongoing.
University of Arizona
Stanford University
Carnegie Mellon University
Northeastern University
Boise State University
We are looking for university partners to pilot the benchmark and the model on real coursework.
Get in touch