
The research

Why language models grade inconsistently, and how to fix it.

Rubrica grows out of ongoing PhD research on rubric drift: the tendency of an LLM grader to score the same correct answer differently when only its surface form changes. We study where that drift comes from inside the model, measure it, and train it away.

How a grade drifts

We change an answer in ways that keep it correct, then watch the score move. Each kind of change points to a different weakness inside the model. These are four we study closely, and our research suggests there are more still to map.

Position

Move a key step elsewhere while keeping the logic intact. Tied to positional attention bias and lost-in-the-middle effects.

Length

Keep it correct but make it longer or shorter. Tied to length preference and evidence dilution.

Style

Change wording or formatting only. Tied to style and format shortcuts, and to unstable parsing.

Distractor

Add unrelated but harmless text. Tied to attention leaking onto irrelevant tokens.
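
The probe described above can be illustrated with a minimal sketch: rewrite an answer without changing its correctness, then compare the grader's score before and after. Everything named here is hypothetical. `toy_grader` is a deliberately length-biased stand-in for an LLM grader, and the two rewrite functions are simplistic examples of the Distractor and Style changes, not the perturbations used in the research.

```python
from typing import Callable, Dict


def add_distractor(answer: str) -> str:
    # Distractor: append unrelated but harmless text (hypothetical example).
    return answer + " As an aside, I solved this problem after lunch."


def restyle(answer: str) -> str:
    # Style: put each sentence on its own bullet line, wording unchanged.
    # Naive split on "." is enough for this toy; it would break on decimals.
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    return "\n".join(f"- {s}." for s in sentences)


def toy_grader(answer: str) -> float:
    # Hypothetical stand-in for an LLM grader. It is deliberately biased:
    # the score grows with the word count, regardless of correctness.
    return min(10.0, 5.0 + 0.05 * len(answer.split()))


def score_shifts(
    answer: str,
    grader: Callable[[str], float],
    rewrites: Dict[str, Callable[[str], str]],
) -> Dict[str, float]:
    # Score each correctness-preserving rewrite and report how far the
    # score moved from the score of the original answer.
    base = grader(answer)
    return {name: grader(fn(answer)) - base for name, fn in rewrites.items()}


answer = "Force equals mass times acceleration. So a is F over m."
shifts = score_shifts(
    answer, toy_grader, {"distractor": add_distractor, "style": restyle}
)
for name, delta in shifts.items():
    print(f"{name}: score moved by {delta:+.2f}")
```

A grader with no drift would report a shift of zero for every rewrite. The toy grader does not, because both rewrites change the token count.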

Our approach

A way to measure the problem, and a way to train it out of the model.

Benchmark

EduR²Bench

A STEM grading benchmark that pairs each answer with rubric-equivalent rewrites, so a grader can finally be measured on how much it drifts, not just on whether it is roughly right.
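
To make the distinction between "roughly right" and "drifts" concrete, here is a small sketch that reports the two quantities separately. The `GradedItem` layout and both metric definitions are assumptions made for illustration. They are not the benchmark's actual schema or its official metrics.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class GradedItem:
    # Hypothetical record: one answer plus its rubric-equivalent rewrites.
    gold: float          # rubric score shared by every variant
    scores: List[float]  # grader scores for the original and each rewrite


def mean_abs_error(items: List[GradedItem]) -> float:
    # Accuracy: how far scores sit from the rubric score, on average.
    return mean(abs(s - item.gold) for item in items for s in item.scores)


def mean_drift(items: List[GradedItem]) -> float:
    # Drift (assumed definition): the spread of scores across variants
    # that should all have received the same score, averaged over items.
    return mean(max(item.scores) - min(item.scores) for item in items)


items = [
    GradedItem(gold=8.0, scores=[8.0, 7.0, 9.0]),  # right on average, drifts
    GradedItem(gold=5.0, scores=[5.0, 5.0, 5.0]),  # right and stable
]
print(f"mean absolute error: {mean_abs_error(items):.2f}")
print(f"mean drift: {mean_drift(items):.2f}")
```

The first item averages to its rubric score yet swings by two points across equivalent rewrites, which an accuracy-only metric would understate.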

Training

Rubric Invariance Training

A consistency objective that pushes the model toward the same score distribution for answers that mean the same thing.
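
One way such a consistency term could look is sketched below: a symmetric KL divergence between the score distributions a grader assigns to two equivalent answers. The choice of symmetric KL, the function names, and the example logits are all assumptions. The page does not specify which divergence the training objective uses.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    # Turn raw per-score logits into a probability distribution.
    # Subtracting the max keeps the exponentials numerically stable.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    # KL(p || q), with a small epsilon to avoid log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(logits_a: List[float], logits_b: List[float]) -> float:
    # Symmetric KL between the score distributions for two answers that
    # mean the same thing. It is zero when the distributions match.
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl(p, q) + kl(q, p))


# Hypothetical logits over three possible scores for an answer and a rewrite.
original = [0.2, 1.0, 3.0]
rewrite = [0.4, 2.5, 1.0]
print(f"consistency loss: {consistency_loss(original, rewrite):.4f}")
```

In training, a term like this would typically be added to the usual grading loss with a weighting coefficient, so the model is rewarded for agreeing with itself as well as with the rubric.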

Training

Evidence Attention Regularization

A regularizer that keeps the model attending to the tokens that carry the answer, rather than length, style, or distractor text.
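
As a rough illustration of the idea, the sketch below penalizes attention mass that falls outside the evidence tokens. The negative-log form, the binary evidence mask, and the function name are assumptions. The actual regularizer may be defined differently, for example per layer or per attention head.

```python
import math
from typing import Sequence


def evidence_attention_penalty(
    attention: Sequence[float],
    evidence_mask: Sequence[int],
    eps: float = 1e-12,
) -> float:
    # attention: weights over the answer tokens, assumed to sum to 1.
    # evidence_mask: 1 for tokens that carry the answer, 0 otherwise.
    if len(attention) != len(evidence_mask):
        raise ValueError("attention and evidence_mask must have equal length")
    on_evidence = sum(a for a, m in zip(attention, evidence_mask) if m)
    # The penalty is near zero when almost all attention lands on evidence
    # tokens and grows as attention leaks onto the other tokens.
    return -math.log(on_evidence + eps)


# Hypothetical example: token 0 is distractor text, tokens 1-2 are evidence.
mask = [0, 1, 1]
focused = evidence_attention_penalty([0.1, 0.6, 0.3], mask)
leaky = evidence_attention_penalty([0.7, 0.2, 0.1], mask)
print(f"focused: {focused:.3f}, leaky: {leaky:.3f}")
```

The leaky pattern receives the larger penalty, which is the behavior a regularizer of this kind is meant to discourage.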

A five-university collaboration

The project brings together researchers from five universities and is led from the University of Arizona. The work is ongoing.

University of Arizona
Stanford University
Carnegie Mellon University
Northeastern University
Boise State University

Want to follow the work?

We are looking for university partners to pilot the benchmark and the model on real coursework.

Get in touch