AI marking, teacher workload, GCSE, A-Level, assessment, edtech

AI marking that doesn't suck

By Geoff Waugh | 26 June 2026 | 7 min read

Most AI marking tools aren't built for teachers. That makes sense. There are more students than teachers in the world. Tutoring platforms, practice tools, revision apps — it's a bigger market.

What that logic misses is that teachers are the ones drowning. Not in a general sense. In a specific, 4pm-on-a-Sunday-with-thirty-assessments-left sense. Marking isn't just time-consuming. It's the thing that eats evenings, weekends, and the mental bandwidth that should go to teaching. I am one of DeepMark's co-founders and a practising teacher. I have lived this, as have many of you reading this post. So consider this our hello: what we built, why, and what we're learning as we go.

So when we started looking at what already existed for teacher-facing AI marking, we found two categories: student tools that don't address the problem, and platforms that do try to help teachers but ask a lot in return. QR codes on every sheet. Proprietary ecosystems. Horrible user experiences with infrastructure you have to adopt before you can mark a single paper. The friction cost of switching to those tools is high enough that most teachers never bother.

Teachers are already using ChatGPT and Copilot to survive

Teachers are some of the heaviest ChatGPT users. Lesson planning, for sure. But marking too. According to recent research by OnlyForTeachers, 65% of UK teachers are already using general AI tools in their professional practice. When asked what their ideal AI tool would do, the most common answer was marking — specifically, something that can handle essay marking, provide meaningful feedback, and read student handwriting accurately. The demand is there. The adoption isn't. Marking-specific tools are used by just 4% of teachers.

AI tools like ChatGPT, Claude and Gemini have helped some teachers with their marking workloads, but they are general-purpose tools: the process is cumbersome, repeatability is questionable and the output is awkward to work with. Teachers are doing it anyway, because the pain of spending three hours marking a class set is worse than the pain of a clunky workaround. When people adopt friction-heavy workarounds, it tells you the underlying problem is real.

It also raises a concern we take seriously: ChatGPT isn't designed with student data in mind. Uploading scripts with pupil names means sending PII to a general-purpose consumer AI with opaque data practices. DeepMark doesn't train on uploaded scripts. Student data stays student data.

Challenges we face

One of our biggest challenges is market penetration. Teachers — our primary users — are what we call "problem aware" but not solution aware. They know marking is eating their evenings. They don't know that a tool exists that was built specifically to fix that. That's a distribution problem as much as a product problem.

The other challenge is AI scepticism, and it's a real one. Scepticism of AI isn't unique to education. It exists across society, and it's well founded. For an AI marking tool, trust isn't a nice-to-have. It's table stakes. We have to demonstrate with hard data that DeepMark produces reliable, professional-level marking that holds up to scrutiny. We have to ensure teachers feel in control of the process at every step, because if they don't, they shouldn't be using it.

That trust sits on two pillars: marking performance and data security. On data security, we redact student names before saving anything to the platform and we offer GDPR-compliant deletion within 24 hours of a request, significantly faster than the statutory one-month obligation. On marking performance, we need to show that re-marking the same script consistently lands within an acceptable margin of error, comparable to the variation seen between professional AQA markers.

Earning trust in an AI-sceptical world

Anyone who has marked the same paper twice — or asked two colleagues to mark the same paper — knows that marking is inherently variable. That variability is manageable when a trained human is applying professional judgement. It becomes a problem when an AI produces different marks every time you run the same script through it. That's closer to a game of roulette.

We spent a significant amount of time on this. Our goal was to make DeepMark's output consistent: run the same script against the same mark scheme twice and you should get the same mark, or very close to it. The margin of error should be small and predictable, not random.

The mechanism is fidelity to the mark scheme. When a teacher uploads their mark scheme, DeepMark treats it as ground truth. We're not applying a general sense of what a good answer looks like. We're applying the specific criteria the teacher and the exam board have defined. That constraint is what produces consistency.

We test this obsessively. Our repeatability study took the hardest case: ten student-authored 40-mark open creative writing essays, each marked five times under each of three configurations. The results:

Configuration	Mean variance per re-mark	Range across 5 runs
Original setting	±2.58 marks	6.6 marks
Current production	±0.94 marks	2.0 marks
High-precision mode	±0.16 marks	0.4 marks

Current production DeepMark re-marks the same 40-mark essay to within ±0.94 marks across five runs. On a question where human examiners routinely differ by several marks, that's a number we're comfortable standing behind.
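For readers who want to run a similar check on their own re-marks, here is a minimal Python sketch of one way to summarise repeatability. It is an assumption, not DeepMark's published method: the post does not state the exact formula behind the table, so the sketch uses the mean absolute deviation from each essay's average mark and the gap between the highest and lowest mark, both averaged over essays. The function name and the sample marks are made up for illustration.

```python
from statistics import mean


def repeatability(runs_per_essay):
    """Summarise how much repeated marks for the same essays move around.

    runs_per_essay: list of lists, one inner list of marks per essay.
    Returns (mean absolute deviation, mean range), each averaged over essays.
    """
    deviations = []
    ranges = []
    for marks in runs_per_essay:
        centre = mean(marks)
        # Average distance of each run from this essay's own average mark.
        deviations.append(mean(abs(m - centre) for m in marks))
        # Gap between the highest and lowest mark this essay received.
        ranges.append(max(marks) - min(marks))
    return mean(deviations), mean(ranges)


# Made-up marks for two essays, five runs each (not DeepMark data).
example = [
    [28, 29, 28, 30, 28],
    [33, 33, 34, 33, 32],
]
mad, spread = repeatability(example)
print(f"mean deviation: ±{mad:.2f} marks, mean range: {spread} marks")
```

A small deviation paired with a large range would point to occasional outlier runs rather than steady drift, which is why it helps to report both.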

One thing we want to be clear about is that this is repeatability data, not accuracy data. We know DeepMark is consistent. The next study — how DeepMark's mark compares to a trained teacher's mark on the same script — is underway. We'll publish that when we have it.

A Cambridge study published in May 2026 tested three frontier AI models across 761 authentic undergraduate essays and found AI matched human grade classifications only 35–65% of the time. It's rigorous work and worth reading. Our own data tells a different story, at least on consistency, and we ran it on less powerful models. At temperature zero, DeepMark marks the same 40-mark essay to within ±0.94 marks across five runs. We'd be genuinely interested to compare notes.

A few honest caveats: our study is smaller, and we haven't yet tested on 100-mark questions, which appear in the Cambridge dataset but not in GCSE or A-Level papers. That difference in context almost certainly matters, and we'd expect our accuracy picture to evolve as we gather more data. One finding from Cambridge we do recognise: AI tends to mark more strictly than human assessors. We've observed that tendency too. It's something we're actively working on.
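One simple way to put a number on "stricter" is the average signed difference between the AI mark and the teacher mark on the same scripts. The Python sketch below is illustrative only: the function name is hypothetical, the marks are invented, and it is not the method used in the Cambridge study or in DeepMark's own testing.

```python
from statistics import mean


def marking_bias(ai_marks, teacher_marks):
    """Return (mean signed difference, mean absolute difference), AI minus teacher."""
    if len(ai_marks) != len(teacher_marks):
        raise ValueError("need one AI mark and one teacher mark per script")
    diffs = [a - t for a, t in zip(ai_marks, teacher_marks)]
    return mean(diffs), mean(abs(d) for d in diffs)


# Made-up marks for four scripts (not DeepMark or Cambridge data).
bias, gap = marking_bias([24, 30, 18, 27], [26, 31, 18, 29])
print(f"bias: {bias:+.2f} marks, typical gap: {gap:.2f} marks")
```

A negative bias means the AI is marking lower than the teacher on average. The absolute gap is reported alongside it because over-marking and under-marking can cancel out in the signed figure.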

We're not claiming the problem is solved. We're saying our results are better than the headlines suggest, and we intend to keep publishing the data.

On student data, we take a simple position: it isn't ours. We redact student names before anything is saved to the platform. We don't train on uploaded scripts. And we offer GDPR-compliant deletion within 24 hours — well inside the statutory month.
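To make "redact before saving" concrete, here is a minimal Python sketch of roster-based redaction. It is an assumption about how such a step could work, not a description of DeepMark's pipeline: the function name, the placeholder and the idea of matching against a class list are all hypothetical.

```python
import re


def redact_names(text, names, placeholder="[STUDENT]"):
    """Replace each known name in text with a placeholder.

    names: full names from a class list, e.g. ["Priya Shah"].
    Matches whole words only and ignores case.
    """
    # Collect full names plus their individual parts (first name, surname).
    parts = set()
    for full_name in names:
        parts.add(full_name)
        parts.update(full_name.split())
    if not parts:
        return text
    # Longest first so a full name is replaced as one unit, not piece by piece.
    ordered = sorted(parts, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(placeholder, text)


redacted = redact_names("Name: Priya Shah\nPriya's answer: ...", ["Priya Shah"])
print(redacted)
```

A list-based approach like this has obvious limits: it misses names that aren't on the list, and it will also redact ordinary words that happen to be names, such as "Rose" or "Will", so a real system would need more than this.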

Trust has to be earned through behaviour, not claimed in a tagline. That's what we're trying to do.

What the experience actually looks like

We made a deliberate choice about how to present marking output. Rather than a dashboard of scores and statistics, DeepMark's editor is built around the script itself. Marks and annotations sit on the document, connected to the text they're responding to. A comment isn't floating in a sidebar — it's anchored to the specific words it refers to.

Teachers need to be able to verify the marking. Seeing that a comment is attached to a specific phrase in a student's answer is what makes it possible to check whether the AI has actually read the script, not just produced a plausible-sounding score. We call this proof of reading, and it's not optional.
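The idea of an anchored comment can be sketched as a small data structure: each annotation records where in the script it points and the exact words it claims to be responding to, so the claim can be checked against the script itself. This Python sketch is illustrative only. The class, field and function names are hypothetical and do not describe DeepMark's internal format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int    # index of the first character the comment refers to
    end: int      # index one past the last character
    quote: str    # the exact words the comment claims to respond to
    comment: str  # the feedback shown to the teacher


def is_anchored(script, annotation):
    """True if the quoted words really appear at the stated position."""
    in_bounds = 0 <= annotation.start <= annotation.end <= len(script)
    return in_bounds and script[annotation.start:annotation.end] == annotation.quote


script = "The fog crept in like a thief, stealing the colour from the street."
start = script.index("like a thief")
good = Annotation(start, start + len("like a thief"), "like a thief",
                  "Effective simile.")
bad = Annotation(0, 12, "like a ghost", "Effective simile.")

print(is_anchored(script, good), is_anchored(script, bad))
```

A comment that quotes words the student never wrote fails the check, and that is the kind of error a sidebar-only comment would let through unnoticed.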

The editor will feel familiar if you've used Google Docs or Microsoft Word. You can share a marked script with a colleague or your head of department for a second opinion. You can mark a script as human-reviewed. The export is a PDF, which means printing and distributing to students works the same way it always has.

Where we are now

DeepMark is live. The subjects where it performs best right now are the ones with the heaviest marking load: English and essay-based subjects where extended writing is the thing being assessed. Maths, physics, and geography are on the roadmap.

There's a free tier. You don't need to book a demo or talk to sales.

We built this because the problem is real and the existing tools don't solve it well enough. That's the whole pitch.

Try DeepMark →
