AI marking, data protection, copyright, GCSE, A-Level, safeguarding, edtech

AI marking is dangerous

By Stuart Bourhill, 2 July 2026, 12 min read

It's the wild west out there right now. Teachers are DIY'ing with ChatGPT and Copilot, non-teachers (shall we call them civilians?) are selling "vibe coded" apps, and international companies are slapping "GCSE & A-LEVEL" on existing products with little regard for UK education policy. It's the modern-day "cure-all" tonic being sold to desperately overworked teachers.

Somewhere, a man who three months ago was building crypto dashboards is now telling you his app "understands AO3" and safeguarding. This could very well end badly.

Most schools don't have AI policies in place, and the way most AI marking is being done right now is dangerous. Here's why, and what to actually check before you let anything near your students' assessments.

Let me explain what I mean.

The bit nobody's telling you

Over the last 12 months, a useful idea — AI can help with the soul-crushing volume of marking and admin — has collided with a less useful reality. Anyone with a laptop and a free weekend can now "vibe code" a website that looks like a marking tool. Pretty interface. Confident-sounding feedback. Zero idea what's happening in the marking, and zero idea what's happening to your data, your students' work, or the copyrighted exam materials you just fed into it.

Nobody knows how big this actually is

Nobody publishes a robust survey of how many UK teachers have quietly tried marking a script with ChatGPT, Copilot or some app that popped up on TeacherTok from yet another influencer who "escaped teaching" to make posts from Bali. It isn't tracked. Schools don't collect it — they don't have policies. Exam boards don't collect it. So instead of pretending we have a number, let's do the maths, using a conservative assumption instead of a made-up statistic.

England has roughly 260,000 secondary school teachers, according to the Department for Education's own School Workforce Census. Assume, conservatively, that 1 in 5 has at some point opened one of these tools and had a crack at marking something. Not a formal school rollout. Not anything sanctioned by SLT. Just a tired teacher on a Sunday night, staring at a stack of scripts, thinking surely this can help. Given how tempting that promise is, 1 in 5 feels like a floor, not a ceiling.

1 in 5 of 260,000 is 52,000 teachers.
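
For anyone who wants to test that assumption rather than take it on trust, the sum is easy to rerun. A minimal Python sketch: the 260,000 headcount and the 1-in-5 rate are the figures used above, while the other two rates are hypothetical alternatives, included only to show how far the total moves when the assumption changes.

```python
# Back-of-envelope estimate of how many teachers may have tried AI marking.
# 260,000 and the 1-in-5 rate are the article's figures; the other rates
# are hypothetical and exist only to show sensitivity to the assumption.
from fractions import Fraction

SECONDARY_TEACHERS = 260_000


def estimated_users(teachers: int, rate: Fraction) -> int:
    """Number of teachers implied by a given adoption rate (rounded down)."""
    return int(teachers * rate)


scenarios = {
    "1 in 10 (hypothetical)": Fraction(1, 10),
    "1 in 5 (the assumption used here)": Fraction(1, 5),
    "1 in 3 (hypothetical)": Fraction(1, 3),
}

for label, rate in scenarios.items():
    print(f"{label}: {estimated_users(SECONDARY_TEACHERS, rate):,} teachers")
```

Whichever rate you prefer, the answer stays in the tens of thousands, which is the point.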

That's over 50,000 people who may have uploaded a student's handwritten work, names, exam centre numbers or a copyrighted mark scheme into a tool with no idea where that data ends up — no idea whether they personally just handed over a British child's data to an international company. And that's the conservative estimate. Nobody actually knows the real number, because nobody's asked properly. That's not reassuring, and it's not meant to be.

The copyright landmine nobody reads the small print on

Go and check your exam board's copyright policy right now. I'll wait.

AQA's policy is unambiguous. They do not allow anybody, including schools and colleges, to use AQA materials to train artificial intelligence tools or technologies. That covers question papers, specifications, mark schemes, examiner reports, teaching resources — all of it. Pasting a mark scheme into generic ChatGPT or Copilot to see what it spits out isn't a grey area, whatever the Facebook comment section tells you. It's a breach of copyright and of the exam board's own policy. Pearson's position on Edexcel materials is just as strict.

So every time a teacher screenshots a mark scheme and drops it into an unverified AI tool to "give it a go," there's a chance that the tool's terms of service allow that input to be used, stored, or fed back into a model somewhere. Nobody asked you to read the terms (does anyone really read T&Cs?). And nobody told your students or your HoD that their handwritten answers to a live, copyrighted paper just left the building — possibly for a server farm you couldn't find on a map.

That's not scaremongering. That's what the exam boards themselves are saying, in writing, right now.

So where does that leave us? At DeepMark, we run on paid enterprise agreements with Anthropic and Google. Contractually, neither of them trains their models on the data that passes through. So when a teacher uploads a mark scheme to mark a set of scripts, that mark scheme isn't quietly becoming part of anyone's next model release. It does the job in front of it, then it can be deleted.

We're not going to pretend that makes the whole question disappear. Exam board copyright policies restrict more than just training — they also restrict where materials get copied to in the first place. We think the training risk is the one that matters most for what this piece is about (permanent, invisible absorption into someone else's model), and it's the one we can point to a contract for.

Where does the data actually go?

Then there's the other half of the problem.

The Department for Education published its policy on generative AI in education in June 2025, and it's been building on it since — most recently with product safety standards updated in January 2026. The guidance is clear that personal or sensitive data shouldn't be entered into generative AI tools without a proper, understood data protection route in place. Schools are expected to check whether a supplier can guarantee that student data won't be used to train a model. They're expected to know where that data is hosted, and whether it ever leaves the UK or EEA. They're expected to have a Data Protection Impact Assessment that actually covers the tool in question — not one copied from a webinar slide in 2023.

Ask yourself honestly. When you last tried that "free" AI site, did you know the answer to any of those questions? Did the tool even tell you?

"If a high-quality product is free, you are not the customer, you are the product." In this case the product is British children, teachers are uninformed, and schools have, by and large, not acted fast enough.

This is not a hypothetical risk. It's a live one, sitting in every staffroom where a teacher has quietly started using a tool to save themselves from exhaustion.

The vibe-coded app problem

Here's the uncomfortable bit, and I'm going to say it plainly because somebody needs to. Marking is hard. Genuinely hard. Getting an AI system to read messy handwriting, interpret a mark scheme the way an experienced examiner would, apply assessment objectives consistently across 60 scripts, and flag its own uncertainty rather than just guessing with confidence — all while building a platform that keeps a teacher in the loop on every decision that affects students — that is not a weekend project, no matter how many "I built this in 48 hours" posts you've seen this month.

So when something new lands in a Facebook group looking polished and promising the earth, ask the boring questions before you ask the exciting ones. How was it benchmarked, and against what? Who checked its mark scheme was even correct before it started grading against it? What happens to the mark scheme and the scripts once you've uploaded them? Does it tell you when it isn't sure, or does it just hand you a number and hope you don't check?

If those questions get vague answers, or no answers, that's your signal. "Pretty and professional-looking" and "reckless" are not mutually exclusive. In fact, right now they're flatmates.

The marking roulette nobody wants to disclose

Then there's the bit that should worry you even more than the copyright question, because it's less visible. Bias, and dice-roll inconsistency.

You've probably seen the clips doing the rounds. A teacher runs the exact same script through an AI marking tool twice and gets two different marks back. Same student, same handwriting, same answer — different grade depending on nothing more than which run of the dice it landed on. That's not a glitch. For a lot of tools, that's just how they work under the hood, and nobody's told you. Any tool advertising "97–98% accuracy!" should be met with extreme caution. A number with no methodology attached is not a statistic — it's a vibe wearing a lab coat.

And it gets worse. Researchers at ETS, the organisation behind the SAT (yes, America, fine), tested a leading AI model against more than 13,000 real student essays and found it consistently marked Asian students lower than human examiners did — purely because of patterns the model had absorbed during training on internet data, nothing to do with the quality of the writing itself.

A University of Cambridge study published in May 2026 tested three frontier models — Claude Opus 4.6, GPT-5.4 and Gemini 3 Flash — against 761 real undergraduate psychology essays from Cambridge, Nottingham and Manchester Metropolitan. The AI matched the human examiner's grade band somewhere between 35% and 65% of the time, depending on the university. That's not a typo. That's a coin flip with extra steps. The researchers found all three models suffered from what they politely called "central tendency bias" — the AI squashes everything toward the middle, docking marks from the genuinely brilliant essays and inflating the mediocre ones, because brilliance is harder to pattern-match than "sounds like an essay." On top of that, the models were oversensitive to length and vocabulary, essentially rewarding students for padding — the exact opposite of what a good examiner does.
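
Both headline measures, grade-band agreement and squashing toward the middle, are things you can check yourself on any batch of scripts that already has human marks. The Python sketch below is one plausible way to do it and is not the Cambridge team's method: the band boundaries, the function names and the sample marks are all illustrative assumptions.

```python
# Illustrative check for band agreement and central tendency bias.
# Band boundaries and the sample marks are assumptions, not study data.
from statistics import mean

# Hypothetical bands on a 0-100 scale: (lower bound, label), highest first.
BANDS = [(70, "First"), (60, "2:1"), (50, "2:2"), (40, "Third"), (0, "Fail")]


def band(mark: float) -> str:
    """Return the label of the highest band whose lower bound the mark meets."""
    for lower, label in BANDS:
        if mark >= lower:
            return label
    raise ValueError(f"mark out of range: {mark}")


def band_agreement(human, ai) -> float:
    """Fraction of scripts where the AI and the human land in the same band."""
    matches = sum(band(h) == band(a) for h, a in zip(human, ai))
    return matches / len(human)


def compression_slope(human, ai) -> float:
    """Least-squares slope of AI marks against human marks.

    Around 1.0 means the AI spreads marks as widely as the human did.
    Well below 1.0 suggests marks are being pulled toward the middle.
    """
    h_bar, a_bar = mean(human), mean(ai)
    cov = sum((h - h_bar) * (a - a_bar) for h, a in zip(human, ai))
    var = sum((h - h_bar) ** 2 for h in human)
    return cov / var


# Made-up marks: the AI inflates the weak scripts and docks the strong ones.
human_marks = [38, 48, 55, 62, 68, 74, 82]
ai_marks = [50, 54, 57, 61, 63, 66, 69]

print(f"Band agreement: {band_agreement(human_marks, ai_marks):.0%}")
print(f"Compression slope: {compression_slope(human_marks, ai_marks):.2f}")
```

A tool can post a respectable average error and still show a slope far below 1.0, which is why the average alone tells you very little.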

None of that is a reason to write off AI marking altogether. It's a reason to demand that whatever tool you use is actually fit for purpose, and is honest with you about what it found. If a supplier can't tell you how they checked for phrasing bias, verbosity bias, or whether the same script marked twice gives you the same result twice, they haven't checked. That's not a technicality. That's the whole exercise.

For what it's worth, this is the exact problem we obsessed over when building DeepMark. We've marked thousands of assessments, and we took the hardest possible case — 40-mark open creative writing essays — and marked the same scripts five times each under different configurations, just to see how much the number would wobble. In our current production setup, the exact same essay re-marked five times lands within 0.94 marks of itself on average, on a question where human examiners themselves routinely differ by several marks. We're not claiming that solves marking forever. We're saying "trust me" is not a methodology, and if a tool can't show you a number like that, ask why not.
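
If you want to ask a supplier for a repeatability figure, it helps to know what one could look like. The sketch below is an assumption about the shape of such a test, not a description of DeepMark's pipeline: it takes the statistic to be the average distance of each re-mark from that script's own mean, and the marks are invented for illustration.

```python
# One possible repeatability measure for a marker that is run several times
# on the same script. The statistic and the sample marks are assumptions.
from statistics import mean


def script_wobble(marks) -> float:
    """Mean absolute deviation of repeated marks from their own average."""
    centre = mean(marks)
    return mean(abs(m - centre) for m in marks)


def average_wobble(runs_by_script) -> float:
    """Average of the per-script wobble across every script in the batch."""
    return mean(script_wobble(marks) for marks in runs_by_script.values())


# Hypothetical 40-mark question, each script marked five times.
repeat_runs = {
    "script_A": [31, 32, 31, 33, 32],
    "script_B": [18, 18, 19, 17, 18],
    "script_C": [26, 27, 26, 26, 28],
}

for name, marks in repeat_runs.items():
    spread = max(marks) - min(marks)
    print(f"{name}: wobble {script_wobble(marks):.2f}, range {spread}")

print(f"Average wobble: {average_wobble(repeat_runs):.2f} marks")
```

Reporting the worst-case range alongside the average is worth insisting on, because a small average can hide one script that swung by several marks.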

One thing we're not going to pretend we've nailed. Cambridge's essays were marked holistically — one number out of 100 — rather than built up from a list of itemised criteria the way most GCSE and A-level mark schemes are. Central tendency bias looks exactly like the failure mode you'd expect on that shape of task: the longer and more open-ended the response, the more room a model has to quietly hedge toward a safe middle answer instead of committing to the extremes. Our repeatability numbers are strong on the questions GCSE and A-level papers actually contain, but we haven't yet stress-tested that specific failure mode on our biggest, most open-ended long-answer questions the way Cambridge did on full university essays. We think it's the most likely place any AI marker — including ours — will quietly bite you, and we're actively investigating it rather than assuming our numbers generalise upward. If a supplier tells you that's a solved problem, be suspicious. We're not telling you it's solved.

What responsible AI marking actually requires

This is the checklist we built DeepMark around, because we hit every one of these problems ourselves before we solved them.

Contractual data commitments, not vibes. A proper enterprise agreement with the underlying AI providers that guarantees your data is never used to train their models. Not a line in a blog post. A contract. We also redact student names before anything touches the platform, and honour GDPR deletion requests within 24 hours, well inside the statutory one-month window, because comfortably beating the legal deadline should be the floor, not the ceiling.

UK-hosted infrastructure. Student data should stay on UK infrastructure, full stop. If a tool can't tell you where the servers are, that's the answer.

Mark scheme provenance. The tool should be honest about which parts of a mark scheme came from the teacher, which were extracted from an upload, and which it invented itself to fill a gap. If it can't tell the difference, neither can you.

A confirmation gate before marking begins. No script should be marked against a mark scheme the teacher hasn't seen and approved. Silent upgrades and invisible autogeneration are how correct answers phrased in an unexpected way end up being marked wrong.

Proof of reading, not just a score. If all a tool gives you is a number, you have no way of knowing whether it actually engaged with what your student wrote or just produced a plausible-sounding guess. Every mark and comment should be anchored to the exact bit of text it's responding to, so you can check the working — the way you'd check a student's.

Published, honest accuracy claims. Not "99% accurate," which means nothing. A tool should tell you what it was benchmarked against, on what scale, and be honest about the size of that sample.

Bias and consistency testing, not just accuracy. A tool can be accurate on average and still be quietly unfair. Ask whether the same script marked twice gives the same result twice, and whether anyone has checked for phrasing bias — where a student who expresses a correct idea in unexpected language gets marked down for it.

Teacher judgement stays in control. The whole point of marking, beyond the grade, is the diagnostic insight a teacher gets from reading their own students' work. Any tool that tries to fully remove the teacher from that loop has misunderstood the job.
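
Three items on that list (mark scheme provenance, the confirmation gate and proof of reading) are concrete enough to sketch in code. What follows is a minimal Python illustration of the ideas, not DeepMark's implementation. Every class, field and function name is hypothetical, and so is the sample mark scheme.

```python
# Hypothetical sketch: provenance labels, a confirmation gate, and
# evidence anchoring. None of these names come from a real product.
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    TEACHER = "typed by the teacher"
    EXTRACTED = "extracted from an upload"
    GENERATED = "generated by the tool to fill a gap"


@dataclass
class Criterion:
    text: str
    max_marks: int
    source: Source          # provenance: where this criterion came from
    approved: bool = False  # should only be set by an explicit teacher action


@dataclass
class AnchoredMark:
    criterion: Criterion
    awarded: int
    quote: str  # the exact words in the script that earned the marks


def confirm_scheme(criteria) -> None:
    """Confirmation gate: refuse to mark until every criterion is approved."""
    pending = [c for c in criteria if not c.approved]
    if pending:
        raise PermissionError(f"{len(pending)} criteria await teacher approval")


def check_anchor(mark: AnchoredMark, script_text: str) -> bool:
    """Proof of reading: the evidence must appear verbatim in the script."""
    within_limit = 0 <= mark.awarded <= mark.criterion.max_marks
    return within_limit and bool(mark.quote) and mark.quote in script_text


scheme = [
    Criterion("Names the catalyst", 1, Source.EXTRACTED, approved=True),
    Criterion("Explains lower activation energy", 2, Source.GENERATED),
]

try:
    confirm_scheme(scheme)
except PermissionError as err:
    print(f"Marking blocked: {err}")

scheme[1].approved = True  # the teacher has now read and accepted it
confirm_scheme(scheme)     # passes silently once everything is approved

script = "The catalyst is manganese dioxide, which lowers the activation energy."
good = AnchoredMark(scheme[0], 1, "The catalyst is manganese dioxide")
bad = AnchoredMark(scheme[1], 2, "provides an alternative reaction pathway")
print(check_anchor(good, script), check_anchor(bad, script))
```

The design choice worth noticing is that the gate fails loudly. A criterion the tool invented cannot be used for marking until a person has looked at it, and a mark whose quoted evidence is not in the script gets flagged instead of trusted.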

So, is AI marking dangerous?

Done badly — without oversight, without checking where your data goes or whether you've just breached your exam board's copyright policy — yes. Genuinely, properly dangerous, in a way that could land on a Head of Department's desk with your name on it. And not the good kind of desk visit.

Done properly, with the right safeguards built in from day one, it's the most useful thing to happen to teacher workload in history.

Teacher judgement. Accelerated. Not replaced, and never at the cost of your students' data or copyright.

If you're currently marking with a tool you found on an Instagram reel last Thursday, or blindly using ChatGPT, it might be worth five minutes checking it against the list above. Your Head of Department will thank you for asking first. Your students definitely will.

DeepMark. Built by teachers, for the marking problem.

support@getdeepmark.com