Microsoft Research Finds AI Model Degradation Is Quietly Corrupting Your Work Documents

Deveshi Dabbawala

May 4, 2026
You hand a task to an AI. It edits your document, makes changes, and returns it. At first glance, everything looks fine. But during that process, about 25% of your content may have been silently altered or lost.

This is not a theory. A recent Microsoft Research paper highlights this risk.

What the research shows

Published in April 2026 by Philippe Laban, Tobias Schnabel, and Jennifer Neville, this research examines a simple but overlooked question: can you rely on an LLM to edit documents without introducing errors?

Based on their findings, the answer is clear. Not yet.

What Is DELEGATE-52?

To test this, the researchers created a benchmark called DELEGATE-52. It simulates real-world workflows where AI handles long, multi-step document edits across 52 domains, including coding, music notation, crystallography, and legal writing.

This setup reflects how people actually use AI today. Instead of quick fixes, users delegate full documents with multiple instructions like updates, revisions, and additions. DELEGATE-52 was built to evaluate whether current LLMs can manage this kind of work reliably.

What the results reveal about AI model degradation

Across 19 models tested, including Gemini 2.5 Pro, Claude Opus 4, and GPT-4.5, every model showed document degradation during long workflows. Even the top performers corrupted around 25% of the content after extended interactions.

These errors are not random or easy to catch. Instead of many small mistakes, the models make a few critical ones that alter or remove key parts of the document. This makes the damage harder to detect and more risky in real use.

What makes it worse

The researchers also identified what makes the problem worse.

  • Larger documents increase the risk: As content grows, models struggle to keep track of what needs to stay unchanged.
  • Longer workflows add more errors: Each step builds on the previous one, so small mistakes quickly compound into bigger issues.
  • Multiple files create confusion: When several documents are in context, models often mix them up and edit the wrong parts.

They also tested whether giving AI tools like file access would improve results. It did not. The same patterns of degradation continued, even with these added capabilities.

Why this matters now

AI-assisted work is growing fast. Tasks like coding, document editing, and multi-step workflows are now commonly handled by AI systems.

However, this research shows that current trust in these tools is ahead of their actual reliability. The problem is not obvious mistakes; the models introduce subtle errors, such as incorrect formulas, quiet changes to facts, or code that runs but produces the wrong result.

The study highlights a specific problem. Output quality declines within a single workflow as interactions increase. This is different from the usual idea of model degradation over time. Here, the risk comes from errors that build step by step, often going unnoticed until the damage is significant.

The GoML perspective: this is exactly why architecture matters

At GoML, we build AI Solution Accelerators for industries like healthcare, finance, legal, and SaaS, where document accuracy is critical. A small error in these fields can lead to serious consequences.

This research confirms what we see in real deployments. Strong models alone are not enough. Without the right workflow design, even the best AI can produce unreliable results over long interactions.

To address this, we focus on structured system design.

  • We break workflows into smaller steps and verify each stage before moving forward. This helps catch errors early.
  • We separate reading and writing contexts, so the model only works on the intended document. This reduces confusion when multiple files are involved.
  • We include human reviews for high-impact changes. Automation handles routine tasks, while people validate critical outputs.

The key takeaway is simple: tools alone do not solve reliability issues. The system around them matters.
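The first practice above, verifying each stage before moving forward, can be sketched in a few lines. This is a minimal illustration, not GoML's actual implementation: the `edit_fn` stands in for an LLM call, and the 30% change threshold is an arbitrary example guardrail.

```python
import difflib

def apply_step(document: str, edit_fn, max_change_ratio: float = 0.3) -> str:
    """Apply one edit step, rejecting it if too much of the document changed.

    edit_fn is a placeholder for a model call; max_change_ratio is an
    illustrative threshold, not a recommended value.
    """
    edited = edit_fn(document)
    similarity = difflib.SequenceMatcher(None, document, edited).ratio()
    if 1 - similarity > max_change_ratio:
        # The step altered more of the document than expected:
        # keep the original and flag the step for human review.
        raise ValueError(
            f"Step changed {1 - similarity:.0%} of the document; flagged for review"
        )
    return edited

# Chain small, individually verified steps instead of one open-ended rewrite.
doc = "Revenue grew 4% in Q3. Costs were flat."
step = lambda d: d.replace("4%", "5%")  # stands in for a model edit
doc = apply_step(doc, step)
```

A step that silently deletes half the document would trip the guardrail instead of propagating the damage into later steps.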

If you are using AI for document-heavy workflows, this research highlights the need for careful design.

What should you do about it?

The paper highlights the problem, not the solution. But a practical approach already exists.

GoML’s Document Intelligence Accelerator, part of AI Matic, avoids open-ended document rewriting. It focuses on accurate retrieval using knowledge graphs and search agents. The model reads and surfaces information instead of freely editing, which helps preserve document integrity.

This approach has already shown results. A $9B hedge fund used it to process documents 99% faster without corruption risk.

If you are using LLMs today:

  • Do not assume clean output means correct output
  • Review changes carefully, section by section
  • Be cautious with long workflows and multiple files
  • Always verify before final use
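
The second point, reviewing changes section by section, can be made systematic rather than manual. A minimal sketch using Python's standard `difflib` (a generic technique, not a product feature) diffs the model's output against the original so a reviewer sees every altered line before accepting it:

```python
import difflib

def changed_lines(original: str, edited: str) -> list[str]:
    """Return a unified diff between the original and edited text, so a
    reviewer can inspect every altered line before accepting the output."""
    return list(difflib.unified_diff(
        original.splitlines(),
        edited.splitlines(),
        fromfile="original",
        tofile="edited",
        lineterm="",
    ))

before = "Clause 1: payment due in 30 days.\nClause 2: governed by NY law."
after = "Clause 1: payment due in 45 days.\nClause 2: governed by NY law."
for line in changed_lines(before, after):
    print(line)
```

Because the models make a few critical edits rather than many small ones, a diff like this surfaces exactly the lines that changed, making a silent alteration, such as a payment term quietly moving from 30 to 45 days, hard to miss.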

The takeaway is simple. AI is useful, but needs the right system design. AI Matic provides that structure with guardrails, checkpoints, and governance for reliable document workflows.

https://arxiv.org/html/2604.15597v1