The research

Why language models grade inconsistently, and how to fix it.

Rubrica grows out of ongoing PhD research on rubric drift: the tendency of an LLM grader to give the same correct answer different scores when only its surface form changes. We study where that comes from inside the model, measure it, and train it away.

How a grade drifts

We change an answer in ways that keep it correct, then watch the score move. Each kind of change points to a different weakness inside the model. These are four we study closely, and our research suggests there are more still to map.

Position

Move a key step elsewhere in the answer while keeping the logic intact. Tied to positional attention bias and lost-in-the-middle effects.

Length

Keep the answer correct but make it longer or shorter. Tied to length preference and diluted evidence.

Style

Change wording or formatting only. Tied to shortcuts based on style and format, and to unstable parsing.

Distractor

Add unrelated but harmless text. Tied to attention leaking onto irrelevant tokens.
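
The probe described in this section can be sketched in a few lines of Python. This is a minimal illustration, not Rubrica's implementation: every name here (toy_grader, make_longer, restyle, add_distractor, drift_by_kind) is hypothetical, and the toy grader is deliberately biased against long answers so that drift shows up. Position changes are left out because moving a step safely requires knowing the logic of the answer.

```python
from typing import Callable, Dict

Grader = Callable[[str], float]


def make_longer(answer: str) -> str:
    """Length: restate the answer without adding new content."""
    return answer + " To restate: " + answer


def restyle(answer: str) -> str:
    """Style: change formatting only (here, a bullet prefix)."""
    return "- " + answer


def add_distractor(answer: str) -> str:
    """Distractor: append unrelated but harmless text."""
    return answer + " Unrelated note: the lab closes at five."


# Hypothetical registry of correctness-preserving changes.
PERTURBATIONS: Dict[str, Callable[[str], str]] = {
    "length": make_longer,
    "style": restyle,
    "distractor": add_distractor,
}


def toy_grader(answer: str) -> float:
    """Stand-in for an LLM grader (assumption): loses 0.1 points per word."""
    return max(0.0, 10.0 - 0.1 * len(answer.split()))


def drift_by_kind(grader: Grader, answer: str) -> Dict[str, float]:
    """Score change caused by each kind of perturbation, relative to the original."""
    base = grader(answer)
    return {kind: grader(change(answer)) - base
            for kind, change in PERTURBATIONS.items()}


if __name__ == "__main__":
    for kind, delta in drift_by_kind(
        toy_grader, "Force equals mass times acceleration."
    ).items():
        print(f"{kind}: {delta:+.2f}")
```

A grader that ignores surface form would report zero for every kind. Any nonzero entry points to the matching weakness.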

Our approach

A way to measure the problem, and a way to train it out of the model.

Benchmark

EduR²Bench

A STEM grading benchmark that pairs each answer with rubric-equivalent rewrites, so a grader can finally be measured on how much it drifts, not just on whether it is roughly right.
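
To make the distinction between drift and being roughly right concrete, here is a hedged Python sketch. The BenchItem schema and both metrics are assumptions made for illustration. They are not EduR²Bench's actual data format or scoring.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List

Grader = Callable[[str], float]


@dataclass
class BenchItem:
    """Hypothetical item: one answer plus rewrites that deserve the same score."""
    reference_score: float
    original: str
    rewrites: List[str]


def accuracy_error(grader: Grader, items: List[BenchItem]) -> float:
    """Mean absolute error on the original answers only."""
    return mean(abs(grader(it.original) - it.reference_score) for it in items)


def drift(grader: Grader, items: List[BenchItem]) -> float:
    """Mean score range (max minus min) over each answer and its rewrites."""
    spreads = []
    for it in items:
        scores = [grader(text) for text in [it.original, *it.rewrites]]
        spreads.append(max(scores) - min(scores))
    return mean(spreads)
```

Under these assumptions a grader can have zero accuracy error on the originals and still drift, which is why the two numbers are reported separately.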

Training