Language models give the same answer different grades, and they ask you to send student work to an outside company. Rubrica is a grading model that scores the same way every time and runs inside your institution, so student data stays where it belongs.
A research collaboration across five universities
University of Arizona
Stanford University
Carnegie Mellon
Northeastern
Boise State
The problem
AI grading already exists inside the tools schools use. Faculty still don't rely on it, because today's tools fall short in several ways at once.
Move a step around, change the wording, write more or less, or add a harmless extra sentence, and the model gives the same correct answer a different grade. A grade that changes with phrasing is hard to stand behind.
One answer, four equivalent rewrites, scored by a general model
Sending coursework to a third-party API runs straight into FERPA and student privacy. After recent breaches took whole campuses offline, keeping data inside the institution stopped being optional.
Existing tools hand back a grade with no reasoning and no feedback. Students learn nothing from it, and faculty can't see how the score was reached.
Models often reward polished phrasing and penalize non-native writing, even when the reasoning is the same. That becomes an equity problem the moment it touches a grade.
When a student appeals, a number from a black box is not an answer. Faculty need to see the evidence behind every score.
Per-call API pricing climbs fast across thousands of students and assignments. A model you host yourself keeps cost flat as you grow.
The solution
We adapt open-weight models with a training method built for scoring, then run them where your data already lives.
We train the model so that answers meaning the same thing receive the same score, no matter how they are written.
The model is guided to read the parts of an answer that actually matter, instead of being swayed by length, style, or filler.
It runs on your own GPUs. Student work never leaves campus, which keeps you on the right side of FERPA.
More than a score
Other tools return a number and stop there. Rubrica gives the grade, the reasoning behind it, and feedback the student can act on.
A consistent grade against your rubric, the same for any answer that means the same thing.
Which rubric criteria were met, and the exact part of the answer each judgment is based on. Faculty can check it and defend it.
Specific, usable notes on what to fix next, so the student actually learns from the grade instead of just receiving it.
Why it's different
Most tools try to patch inconsistent grading with cleverer prompts. We trace it to how the model attends to an answer, and change that through training.
We built a way to measure how much a grader drifts across equivalent answers, so schools can compare models on something that finally matters to them.
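One way to picture a drift measurement like this is sketched below. It is a minimal illustration and not Rubrica's benchmark: the function names (`score_spread`, `mean_drift`), the spread-based metric, and both toy graders are assumptions introduced here. The idea is to score several rewrites of the same answer and report how far those scores spread apart. A perfectly consistent grader has a spread of zero.

```python
from statistics import mean
from typing import Callable, Sequence


def score_spread(scores: Sequence[float]) -> float:
    """Gap between the highest and lowest score given to equivalent answers."""
    return max(scores) - min(scores)


def mean_drift(
    grader: Callable[[str], float],
    groups: Sequence[Sequence[str]],
) -> float:
    """Average score spread over groups of answers that mean the same thing."""
    return mean(score_spread([grader(a) for a in group]) for group in groups)


# Hypothetical toy graders, for illustration only.
def length_biased_grader(answer: str) -> float:
    # Rewards longer answers, capped at 10 points.
    return min(10.0, float(len(answer.split())))


def content_grader(answer: str) -> float:
    # Checks only for the key fact, ignoring length and style.
    return 10.0 if "9.8" in answer else 0.0


# One group: two answers with the same meaning, written differently.
equivalent_groups = [
    [
        "g is 9.8 m/s^2",
        "The acceleration due to gravity is about 9.8 m/s^2 near Earth's surface",
    ],
]

print(mean_drift(length_biased_grader, equivalent_groups))
print(mean_drift(content_grader, equivalent_groups))
```

Under these assumptions, a grader swayed by length shows a large spread on the example group, while one that checks only the content shows none. Comparing that single number across models is the kind of comparison a school could make.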
Universities are where we start. The same engine scores open-ended answers anywhere they are graded at scale.
The vision
Every place AI touches education needs to be reliable and private. We begin with grading because that is where inconsistency and privacy risk cost the most, then bring the same trusted model to the rest of the institution.
Consistent scoring for STEM short answers and assignments.
Consistency checks, appeals support, and records you can audit.
Self-hosted feedback that guides students without sending their work away.
Built on research
Rubrica is built by researchers from five universities studying rubric drift, the reason language models give the same answer different scores. The work includes a new benchmark for grading robustness and a training method to reduce that drift, and it is active and ongoing.
University of Arizona
Stanford University
Carnegie Mellon University
Northeastern University
Boise State University
We are working with a small group of universities and programs on early pilots, built together with their faculty.
Request a demo or email jingshao@arizona.edu