The Unstructured Data Problem No One Wants to Talk About by Shreya Mishra

 

Shreya Mishra is a Frankfurt-based data professional with a master’s in Big Data and Business Intelligence and a foundation in Computer Science. She’s worked in a wide range of industries in the fast-paced world of fintech and specialised consulting, giving her a unique perspective on how to solve problems in ever-changing landscapes.
Currently, Shreya is a Data Engineering Consultant at DAQUMA, where she’s focused on the life sciences sector, taking complex technical workflows and turning them into simple, human-ready narratives that drive growth.
In this post, Shreya Mishra addresses a critical issue impacting the adoption of agentic AI. While most organisations have enormous reserves of unstructured data, many rush to deploy agentic AI systems without recognising that this data simply isn’t usable in its raw state. As Shreya explains, data engineers are the key to refining it – and to giving the organisation a significant competitive advantage:

Think of data as the crude oil of our era – and the analogy is more precise than it first appears. Oil does not emerge from the ground ready to fuel a jet engine. It arrives as a dark, viscous, chemically complex mixture: useful in theory, inert in practice. It requires extraction, refining, separation, and structured delivery through an intricate system of pipelines before it becomes the thing that actually powers anything. Remove the pipeline, and you have a reservoir. A vast, valuable, and entirely inaccessible reservoir.

Enterprise data is exactly this. Every organisation of scale is sitting atop what amounts to an open mine – an accumulation of sensor readings, contract documents, laboratory notebooks, email chains, engineering schematics, meeting recordings, regulatory filings, and handwritten technician notes stretching back decades. The reserves are enormous. The intelligence locked within them is extraordinary. And almost none of it, in its raw state, is usable by an AI system. Not because the AI is insufficiently capable. Because no pipeline has been built to refine it.

This is the foundational reality that the current discourse around agentic AI – systems capable of reasoning, using tools, and executing multi-step workflows autonomously – consistently fails to address. We are promised agents that will manage supply chains, settle insurance claims, and conduct scientific research without human intervention. Boardrooms are captivated. Investors are energised. The technology press cannot print headlines fast enough. Yet the quiet crisis in the engine room goes unremarked: agentic AI is only as competent as the context it is given, and in the enterprise, the vast majority of that context remains locked in the ground – unrefined, unstructured, and unreachable.

Just as the great oil economies of the twentieth century were not built by those who discovered the reservoirs, but by those who designed the refineries and laid the pipelines, the intelligence economies of this century will be built by those who engineer the data infrastructure that makes raw information usable. The data engineer is not a support function for AI. The data engineer is the refinery.

FROM A DATA ENGINEER’S LENS

In practice, this “refinery” role is not abstract. It means constantly asking uncomfortable questions:

  • Where did this data originate?
  • What assumptions were baked into it?
  • What is missing that the model will silently guess?

Most failures in AI systems are not loud crashes – they are quiet misinterpretations. As a data engineer, my job is to eliminate those silent failure paths before they ever reach an agent.

1. THE CONTEXT GAP:

WHY LLMS ALONE ARE NOT ENOUGH

The last decade of data engineering was dominated by one discipline: moving structured rows from relational databases into warehouses and data lakes, then presenting them to business intelligence tools for dashboards and reports. We became very good at this. But agentic AI does not operate on clean rows. It needs the “why” and the “how” – the institutional knowledge that lives in PDFs, email chains, engineering schematics, handwritten notebooks, and recorded meetings.

The industry’s dominant response has been the RAG-first approach – retrieval-augmented generation. The assumption is straightforward: load all your documents into a vector database, and an agent will retrieve what it needs. In practice, this assumption fails in three predictable and consequential ways.

THE THREE FAILURE MODES OF NAÏVE RETRIEVAL

• Semantic Noise

Standard text-chunking algorithms divide documents by a fixed token count, with no awareness of meaning. A paragraph that begins in one chunk and concludes in the next is severed. The agent retrieves the first half and misses the conclusion, generating a response that is technically grounded but substantively wrong. In a medical setting, where the severed conclusion may be a drug-interaction warning, the difference between that interaction being flagged and being missed can be the difference between safety and harm.
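The severing effect is easy to reproduce. The following is a minimal sketch, not a production chunker: it uses word counts as a stand-in for tokens, a simple punctuation-based sentence splitter, and an invented two-sentence text about hypothetical drugs. The function names are illustrative assumptions.

```python
import re

def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive chunking: cut every `size` words, ignoring sentence boundaries."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def sentence_aware_chunks(text: str, max_words: int) -> list[str]:
    """Pack whole sentences into chunks; never cut inside a sentence.
    A single sentence longer than `max_words` still gets its own chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Invented example text (hypothetical drugs, for illustration only).
text = (
    "Drug A is generally well tolerated in adults. "
    "However, Drug A must not be combined with Drug B because of a severe interaction."
)

naive = fixed_size_chunks(text, 12)      # first chunk stops mid-sentence at "must"
aware = sentence_aware_chunks(text, 16)  # each chunk is a complete sentence
```

With the naive splitter, the warning is cut in two and neither chunk carries it whole. The sentence-aware version keeps the warning in a single retrievable unit.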

• The Hierarchy Problem

A table inside a PDF is not merely text – it is a relational structure in which column headers, row identifiers, and cell values are semantically interdependent. Most ingestion pipelines flatten this into a stream of characters, destroying the structure entirely. An agent attempting to perform a financial calculation on flattened table data is working with meaningless strings. The schema is gone. The context is gone. The answer is wrong.
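A small sketch makes the contrast concrete. The table contents and column names below are invented for illustration. The point is only that a row keyed by its column headers supports a calculation, while the flattened string does not.

```python
# Hypothetical table as it might appear inside a PDF.
header = ["Quarter", "Revenue", "Cost"]
rows = [["Q1", "120", "80"], ["Q2", "150", "95"]]

# What a flattening extractor hands the agent: one undifferentiated string.
flattened = " ".join(header + [cell for row in rows for cell in row])

# Structure-preserving alternative: one record per row, keyed by column header.
records = [dict(zip(header, row)) for row in rows]

def margin(record: dict[str, str]) -> int:
    """Revenue minus cost for one row; only possible because headers survived."""
    return int(record["Revenue"]) - int(record["Cost"])

margins = {record["Quarter"]: margin(record) for record in records}
```

In `flattened`, nothing says which number is a revenue and which is a cost. In `records`, that relationship is explicit and the calculation is trivial.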

• The Provenance Gap

In regulated industries, the question of which document an agent used to reach a decision is not a preference – it is a legal requirement. If an agent approves a loan, flags a transaction, or updates a dosage recommendation based on a document that was a superseded draft rather than the authoritative version, the consequences are severe. Without a robust lineage layer, the answer to the regulator’s inevitable question – “why did it do that?” – is silence.
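One way to picture a lineage layer is as version metadata attached to every document, plus a record of which version backed each decision. The sketch below is an assumption-laden illustration: the `DocumentVersion` type, the status values, and the document names are all hypothetical, and a real system would also track effective dates and approvers.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DocumentVersion:
    doc_id: str
    version: int
    status: str  # hypothetical statuses: "draft", "approved", "superseded"

def authoritative_version(
    versions: list[DocumentVersion], doc_id: str
) -> Optional[DocumentVersion]:
    """Return the highest approved version of a document, or None if there is none."""
    candidates = [
        v for v in versions if v.doc_id == doc_id and v.status == "approved"
    ]
    return max(candidates, key=lambda v: v.version, default=None)

def lineage_entry(decision_id: str, version: DocumentVersion) -> dict:
    """Record which document version a decision relied on."""
    return {
        "decision_id": decision_id,
        "doc_id": version.doc_id,
        "version": version.version,
        "status": version.status,
    }

VERSIONS = [
    DocumentVersion("dosage-guidance", 1, "superseded"),
    DocumentVersion("dosage-guidance", 2, "approved"),
    DocumentVersion("dosage-guidance", 3, "draft"),
]

chosen = authoritative_version(VERSIONS, "dosage-guidance")
entry = lineage_entry("decision-001", chosen)
```

Here neither the superseded version 1 nor the unapproved draft 3 can be selected, and `entry` is what would later answer the regulator's question.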

WHAT I WATCH FOR IN PRODUCTION SYSTEMS

These failure modes are not theoretical; they show up immediately in live pipelines. I actively monitor:

Chunk integrity → Are we breaking meaning across boundaries?

Structure preservation → Are tables, diagrams, and hierarchies intact?

Lineage completeness → Can every answer be traced to a source?

If any one of these breaks, the agent doesn’t fail visibly – it fails convincingly.

2. THE PRACTITIONER’S REALITY:

THREE INDUSTRIES, THREE CRISES

Theory is useful. But the unstructured data problem reveals itself most sharply in specific, high-stakes, and irreducibly complex enterprise contexts. The three case studies below – drawn from hands-on project work – illustrate not merely the technical challenges, but the operational and institutional consequences of getting context engineering wrong.

ENERGY & MOBILITY:

WHEN THE INVOICE BECOMES THE OBSTACLE

A mobility services organisation managing electric vehicle charging transactions across multiple clients had, for years, run its reconciliation process entirely manually. Analysts received invoices from charge point operators, read them, and transcribed the relevant fields into Excel. The process worked – at low volume. As EV adoption accelerated and transaction counts scaled, the cracks became crises. Operators numbered in the dozens. Invoice volumes ran into the thousands monthly. And every single one arrived as a PDF.

The data reality was unsparing. Each charge point operator produced invoices in their own format. There was no industry standard. One operator placed session identifiers in the header; another embedded them mid-table; a third omitted them entirely. Column names for the same field – the chargeable amount, the session start time, the vehicle identifier – differed across every provider. Some operators rendered their billing tables as images within the PDF, invisible to any text-based parser. Others used multi-row merged cells that collapsed into a stream of undifferentiated characters when extracted. And a meaningful proportion of invoices arrived with missing transaction IDs altogether – the one field on which any downstream reconciliation depended.

The specific failure mode that crystallised the problem was deceptively quiet. When an analyst manually transcribed an invoice with a missing transaction ID, they would leave the field blank and move on – a reasonable human workaround. But when the same invoice passed through an automated pipeline with no awareness of the missing field, the record was ingested silently and incompletely. An agent querying the reconciliation data for that session had no basis on which to flag the gap. It had a record. The record appeared complete. The transaction ID was simply absent, and the agent had no way of knowing what it did not have. Errors of this kind do not surface as failures. They surface as billing discrepancies, weeks after the reconciliation cycle has closed.

A naïve extraction approach – parsing each PDF as a flat stream of text – could never have resolved this. The structural variation across operators was too great. A column labelled “Session Ref” in one invoice and “Transaction No.” in another and absent entirely in a third cannot be handled by a generic text parser. The table rendered as an image is invisible to it. The merged cell becomes noise. And the missing transaction ID is ingested as if it were intentional.

The engineering response was to build a layout-aware invoice parsing application – an automated extraction pipeline that treated each operator’s PDF format as a distinct document schema to be understood rather than a text stream to be read. The system used keyword matching and structural heuristics to locate fields regardless of their column name or position; a custom extraction mode allowed operators’ specific field conventions to be defined and applied consistently; invoices with missing transaction IDs were flagged explicitly rather than ingested silently, surfacing the gap at the point of entry rather than at the point of reconciliation failure. The entire output was exported as a structured, downloadable table. What had been a manual, analyst-dependent process running on transcription and goodwill became a reliable, auditable pipeline. The agent querying this data for reconciliation was now working with records that were complete, consistently structured, and annotated with their own limitations.
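The keyword-matching and explicit-flagging ideas can be sketched in a few lines. This is not the application described above, only an illustration under stated assumptions: the alias table, the `normalise_row` function, and the canonical field names are hypothetical. The labels "Session Ref" and "Transaction No." are the ones mentioned earlier, and the rows are assumed to have already been extracted from the PDF.

```python
# Hypothetical alias table: canonical field -> labels seen across operators.
FIELD_ALIASES = {
    "transaction_id": ("session ref", "transaction no.", "transaction id"),
    "amount": ("chargeable amount", "amount due", "total"),
}

def normalise_row(raw: dict[str, str]) -> dict:
    """Map one extracted invoice row onto canonical fields and flag gaps explicitly."""
    lookup = {key.strip().lower(): value.strip() for key, value in raw.items()}
    record: dict = {}
    issues: list[str] = []
    for field, aliases in FIELD_ALIASES.items():
        # First alias with a non-empty value wins; empty strings count as missing.
        value = next((lookup[a] for a in aliases if lookup.get(a)), None)
        record[field] = value
        if value is None:
            issues.append(f"missing {field}")
    record["issues"] = issues
    return record

complete = normalise_row({"Session Ref": "S-1001", "Total": "12.50"})
incomplete = normalise_row({"Amount Due": "9.00"})
```

The second record is still ingested, but it carries `"missing transaction_id"` in its `issues` list, so the gap is visible at the point of entry and not only at reconciliation.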

FINANCIAL SERVICES:

WHEN THE DATA IS NOT READY TO MOVE

A financial services organisation managing transaction data across multiple major clients decided to modernise, migrating from a legacy Oracle database to a cloud-based system. The decision was strategically sound. The execution exposed something that strategic decisions rarely anticipate: the data itself was not ready to move. Not because it was voluminous. Because it was scattered, uncontrolled, and partially invisible.

The data reality was unsparing. The Oracle database was under third-party maintenance, meaning the team had limited visibility into its schema and no direct control over its structure. A SharePoint environment served as a parallel operational layer, holding documents, reports, and records that existed nowhere in the database. And client portals – the interfaces through which external partners submitted transaction data – were unstructured by design; each client portal reflected its own conventions, its own field naming, its own approach to completeness. The target cloud system was asked to ingest all three. It could not. Pipelines built to bridge Oracle and the cloud retrieved records incompletely – fields that existed in the source arrived empty or missing on the other side. The SharePoint layer had no reliable pipeline at all. And the client portal data, unstructured and inconsistently formatted, had no clear pathway into the new system.

The specific failure mode that made this concrete was the invisibility of what was lost. When a pipeline extracts a record from a legacy database and delivers it to the cloud system with fields missing, the cloud system does not raise an alarm; it accepts the record as received. The missing fields do not appear as gaps. They simply do not appear. An agent querying the cloud system for a complete transaction history retrieves what is there and reports it as the whole picture. The fields that dropped in transit – residing in a third-party-maintained schema or in a document store that had no pipeline connection – are absent from the answer with no indication that the answer is incomplete. Decisions made on this basis are not wrong in ways that are immediately visible. They are wrong in ways that only surface when someone traces the original source and finds it does not match.

The engineering response required confronting each source on its own terms before attempting to unify them. The legacy database connection, constrained by third-party maintenance boundaries, required careful schema mapping to identify which fields were reliably extractable and which were structurally inaccessible – and to document that boundary explicitly, so that downstream gaps were visible rather than silent. The document store required a separate ingestion pathway that could traverse file structures and extract operational records that existed only in document form. And the client portal data required a normalisation layer that resolved each portal’s idiosyncratic field conventions into a common schema before any record reached the cloud system. None of this was glamorous. All of it was necessary. Without it, the cloud system would have accepted the migrated data, declared the migration complete, and quietly discarded a portion of the organisation’s operational history in the process – leaving any agent built on top of it to reason from an evidence base that was incomplete by construction.
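Making dropped fields visible can be as simple as comparing each source record with what arrived in the target. The sketch below assumes both records have already been mapped onto a common set of field names. The field names and values are invented, and `dropped_fields` is an illustrative helper, not part of any particular migration tool.

```python
EMPTY = (None, "")

def dropped_fields(source: dict, target: dict) -> list[str]:
    """Fields populated in the source record but empty or absent in the target."""
    return sorted(
        name
        for name, value in source.items()
        if value not in EMPTY and target.get(name) in EMPTY
    )

# Hypothetical record before and after the pipeline.
source_record = {"txn_id": "T-42", "amount": "19.90", "counterparty": "ACME", "memo": ""}
target_record = {"txn_id": "T-42", "amount": "19.90", "counterparty": None}

missing = dropped_fields(source_record, target_record)
```

A check of this kind, run per record at the pipeline boundary, turns a silent omission into a reportable one. Fields that were already empty in the source, such as `memo` here, are not reported.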

LIFE SCIENCES:

THE REGULATED DATA MIGRATION PROBLEM

A life sciences organisation migrating operational data from a legacy system into a regulated platform was confronting a challenge that appears, on the surface, to be a technical one: move records from one system to another. In practice, it was something considerably more demanding. In GxP-regulated environments – where Good Practice standards govern the integrity, traceability, and auditability of data across the entire product lifecycle – every record that moves between systems carries a compliance obligation. The question was not simply whether the data could be transferred, but whether it could be transferred without losing the contextual integrity that made it meaningful and defensible under regulatory scrutiny. And the data, accumulated over years in a system built to different conventions than the one it was moving into, had no intention of cooperating.

The data reality was immediate and concrete. The naming conventions of the legacy system and the target platform did not match. A field storing a person’s name was labelled one thing in the source and something else entirely in the destination. Date formats differed – the legacy system stored dates in one convention; the target platform expected another. These were not edge cases. They were pervasive. Every record in the migration carried some version of this mismatch: a field name that did not correspond, a format that did not translate, a value that would arrive in the target system correctly transferred but incorrectly interpreted. And layered beneath the structured fields was a further problem: years of operational notes, deviation records, and process observations stored as free text – unformatted, unconstrained, and carrying embedded information that had never been formalised into discrete fields. Dosage references written in shorthand. Batch identifiers noted parenthetically. Process decisions recorded in natural language with no controlled vocabulary. This was the genuinely unstructured layer – and it existed in every record that had ever required a human annotation.

Two specific failure modes crystallised the risk. The first was one that GxP environments make uniquely consequential: a record that arrived in the target system with a mismatched date format was not rejected – it was accepted, stored, and made queryable. From the platform’s perspective, the record was complete. From a regulatory audit perspective, a date recorded as the sixth of March and stored as the third of June is not a formatting quirk. It is a falsified record. An agent querying clinical or operational timelines from this system would retrieve those dates as authoritative. The second failure mode was subtler: the free-text notes fields, ingested as-is into the target platform, were queryable as text but not as data. An agent asked whether a specific process deviation had been recorded for a given batch would search for structured deviation flags – and find none – while the actual deviation sat in a natural-language note three records away, invisible to any query that expected a formal field.
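The date failure is worth seeing in code, because nothing raises an error. The sketch assumes, purely for illustration, that the source wrote dates as day/month/year and the target read them as month/day/year. The actual conventions of the two systems are not stated here.

```python
from datetime import datetime

raw = "06/03/2021"  # assumed source convention: day/month/year, i.e. 6 March 2021

as_recorded = datetime.strptime(raw, "%d/%m/%Y").date()  # 2021-03-06
as_misread = datetime.strptime(raw, "%m/%d/%Y").date()   # 2021-06-03, no error raised

def is_ambiguous(value: str) -> bool:
    """True when a slash-separated date parses under both conventions with different results."""
    first, second, _year = value.split("/")
    return int(first) <= 12 and int(second) <= 12 and first != second
```

Both parses succeed, which is exactly why the target platform accepts the record. Only an explicit, documented format at the pipeline boundary distinguishes the two readings.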

The engineering response required three parallel workstreams. The field mapping layer explicitly translated every legacy field name to its target platform equivalent – not assumed, not inferred, but defined and validated against the target schema before a single record moved. Date format normalisation was applied at the pipeline boundary, converting source conventions to target conventions with documented transformation rules. And the free-text notes fields required a structured extraction pass – a pipeline layer that parsed natural-language annotations for embedded data points, identified deviation records, batch references, and process flags written in operational shorthand, and surfaced them as discrete, queryable fields in the target system rather than opaque text blobs. The entire mapping framework was subject to lineage tracking: every transformation applied to every field in every record was recorded, so that a regulator asking “why does this field contain this value?” could be shown precisely which rule had produced it and from which source value it had been derived.
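The three workstreams can be sketched together. Everything named below is a hypothetical stand-in: the legacy and target field names, the day/month/year source format, the rule labels, and the "(batch ...)" shorthand pattern are assumptions chosen for illustration, not the project's actual rules.

```python
import re
from datetime import datetime

# Explicit mapping: legacy field name -> target field name (hypothetical names).
FIELD_MAP = {"pers_nm": "operator_name", "rec_dt": "record_date"}

def dmy_to_iso(value: str) -> str:
    """Convert an assumed day/month/year source date to ISO 8601."""
    return datetime.strptime(value, "%d/%m/%Y").date().isoformat()

# Named transformation rules per target field; unlisted fields are copied as-is.
TRANSFORMS = {"record_date": ("dmy_to_iso", dmy_to_iso)}

def migrate(record: dict[str, str]) -> tuple[dict, list[dict]]:
    """Apply mapping and transforms, logging one lineage entry per field."""
    target, lineage = {}, []
    for source_field, source_value in record.items():
        if source_field not in FIELD_MAP:
            # Nothing is inferred: an unmapped field stops the migration.
            raise KeyError(f"no mapping defined for {source_field!r}")
        target_field = FIELD_MAP[source_field]
        rule, transform = TRANSFORMS.get(target_field, ("copy", lambda v: v))
        target[target_field] = transform(source_value)
        lineage.append({
            "source_field": source_field,
            "source_value": source_value,
            "target_field": target_field,
            "target_value": target[target_field],
            "rule": rule,
        })
    return target, lineage

# Structured extraction pass over free text (assumed shorthand pattern).
BATCH_PATTERN = re.compile(r"\(batch\s+([A-Z0-9-]+)\)", re.IGNORECASE)

def extract_batch_ids(note: str) -> list[str]:
    """Pull parenthetical batch identifiers out of a free-text note."""
    return BATCH_PATTERN.findall(note)

migrated, lineage_log = migrate({"pers_nm": "A. Example", "rec_dt": "06/03/2021"})
batches = extract_batch_ids("Temp excursion noted (batch B-1042), operator informed.")
```

Each entry in `lineage_log` holds the source value, the target value, and the rule that connected them, which is the evidence needed to answer "why does this field contain this value?".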

This work was painstaking. Field-by-field mapping across two systems with different conventions, at the scale of a full platform migration, is not a task that can be automated away. It requires a data engineer who understands both the source system’s logic and the target platform’s requirements, and who is willing to validate every transformation rule against real records before declaring it correct. But once in place, every record in the target system was not merely present – it was correctly named, correctly formatted, and traceable to its origin. The agent querying this environment was working with data that meant what it appeared to mean.

3. THE ACE STACK:

A FRAMEWORK FOR CONTEXT ENGINEERING

These three case studies are not anomalies. They are representative of the unstructured data challenge as it presents itself across every industry that is seriously attempting to deploy agentic AI in production environments. And they converge on a common architectural response – a framework I have come to call the ACE Stack: the Agentic Context Engineering Stack.

The ACE Stack is not a product or a platform. It is a design philosophy: a set of engineering disciplines applied in deliberate sequence to transform the chaotic, multi-format reality of enterprise data into a structured, verifiable, and agent-navigable world model. It consists of four layers, each addressing a distinct and well-documented failure mode.

The ACE (Agentic Context Engineering) Stack

| Layer | Responsibility | Why It Matters |
|---|---|---|
| Ingestion | Multi-modal parsing – OCR, Computer Vision, Speech-to-Text | Eliminates blind spots. Most enterprise knowledge is locked in PDFs, schematics, recordings, and handwritten forms. Without this layer, agents are illiterate to the majority of the organisation’s institutional memory. |
| Refinement | Semantic chunking and metadata enrichment | Precision over volume. A document sliced by token count loses meaning at every boundary. This layer preserves complete thoughts as semantically coherent units and tags each with source, date, jurisdiction, and domain. |
| Graphing | Entity resolution and relationship mapping | Contextual wisdom. Agents do not just need facts; they need relationships. A graph layer connects an invoice to a transaction record, a legacy field to its target platform equivalent, a free-text deviation note to the structured batch record it belongs to – enabling chains of inference no vector search can replicate. |
| Governance | Lineage, versioning, and audit mapping | Trust and compliance. For any high-stakes decision – a loan approval, a drug interaction flag, a compliance ruling – the governance layer provides the auditable evidence chain that regulators and executives require. |

Table 1 – The ACE (Agentic Context Engineering) Stack

What distinguishes the ACE Stack from a conventional data pipeline is that every design decision is made in explicit service of the agent’s reasoning process – not the analyst’s querying needs. A traditional pipeline asks: How do I get this data into a format that a human can query? The ACE Stack asks: How do I present this data to an autonomous agent in a form that is complete, trustworthy, and navigable – without human mediation?

The Graphing layer deserves particular emphasis, as it represents the most significant departure from conventional pipeline thinking. Linking disparate entities – connecting an invoice to the transaction record it references, a legacy field name to its target platform equivalent, a free-text operational note to the structured record it annotates – creates a knowledge graph that allows the agent to follow chains of inference that no vector similarity search could surface. This is how agents begin to simulate the associative expertise that experienced domain professionals accumulate over careers.
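A toy graph shows what "following a chain of inference" means mechanically. The nodes, relation names, and identifiers below are invented. A production graph layer would sit in a graph database and be populated by entity resolution, but the traversal idea is the same.

```python
from collections import deque
from typing import Optional

# Hypothetical edges: node -> list of (relation, neighbour).
EDGES = {
    "invoice:INV-7": [("references", "transaction:T-42")],
    "transaction:T-42": [("belongs_to", "client:C-3")],
    "note:N-9": [("annotates", "batch:B-1042")],
}

def find_path(start: str, goal: str) -> Optional[list[tuple[str, str]]]:
    """Breadth-first search returning the chain of (relation, node) hops, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbour in EDGES.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [(relation, neighbour)]))
    return None

chain = find_path("invoice:INV-7", "client:C-3")
unrelated = find_path("invoice:INV-7", "batch:B-1042")
```

The invoice text never mentions the client, so similarity search over the invoice alone would not surface it. The explicit edges do, and the returned chain doubles as an explanation of how the link was made.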

HOW I APPLY THE ACE STACK IN PRACTICE

I don’t treat these layers as sequential checkboxes – they are continuously validated loops:

  • Ingestion is re-evaluated when new formats appear
  • Refinement evolves as business meaning changes
  • Graph relationships are constantly enriched
  • Governance is never ‘done’ – only tightened

The stack only works if it is alive, not static.

4. THE “SMARTER MODEL” FALLACY AND WHY IT IS DANGEROUS

There is a seductive belief circulating through AI strategy circles that the unstructured data problem will be resolved by the next generation of foundation models.

As context windows expand to millions of tokens and multi-modal capabilities process images, audio, and structured data natively, the argument goes, the need for specialised data engineering will diminish.

Simply present the model with everything, and trust it to identify what is relevant.

This argument misunderstands the nature of enterprise data at scale.

Even if a model could theoretically process every document in an organisation’s dataset simultaneously – an assumption that strains credibility on both technical and economic grounds – the fundamental problems of provenance, semantic structure, and version control would remain.

A model that has ingested both a superseded regulatory directive and its amendments does not automatically understand which takes precedence unless that relationship has been explicitly encoded.

A model that has read a PDF invoice and a structured transaction log does not automatically reconcile them unless the extraction pipeline has resolved the identifiers they share.

A model querying a migrated database does not automatically know that a date formatted as the 3rd of June was originally recorded as the 6th of March unless the transformation was documented at the pipeline boundary.

“The graphing layer is how agents begin to simulate the kind of associative expertise that experienced professionals take for granted. It is the difference between retrieval and reasoning.”

More fundamentally, the “smarter model” fallacy confuses capability with reliability.

In high-stakes domains – regulated data migration, financial transaction monitoring, invoice reconciliation at scale – the standard is not whether the model usually produces a correct output.

It is whether we can demonstrate, for any given decision, precisely which evidence the model used and why that evidence was authoritative.

That is an engineering discipline, not a modelling one.

No increase in benchmark performance changes this requirement.

Data engineers bring three things that no model improvement can substitute:

Truth – the systematic elimination of noise, duplication, and error from the evidence base, reducing the conditions under which hallucination occurs.

Breadth – the integration of siloed and multi-modal data sources that the model cannot access without explicit pipeline work.

Speed – the design of high-throughput retrieval architectures that return relevant context within the latency constraints of real-time decision workflows.

A Reality Check from Engineering

Every time a model improves, expectations shift, but data quality rarely does at the same pace.

In my experience:

  • Bigger context windows amplify bad data.
  • Multi-modal models expose unresolved inconsistencies.
  • Faster inference magnifies incorrect assumptions.

Better models don’t reduce the need for data engineering.

They increase the cost of ignoring it.

5. CONCLUSION:

THE COMPETITIVE MOAT IS IN THE MESS

The unstructured data problem is not a defect in the agentic AI story. It is, paradoxically, one of the most significant sources of competitive advantage available to the organisations that choose to take it seriously.

If your enterprise data is clean, well-structured, and easily parsed by off-the-shelf tools, then so is your competitors’. Any agent built on your data can be replicated within weeks by a competitor purchasing the same models and the same cloud infrastructure. The differentiation evaporates.

But if your organisation has invested in the engineering work required to transform the chaotic, multi-format, historically accumulated reality of your business operations into a high-fidelity world model – one that captures not just facts, but their relationships, their provenance, their effective dates, and their domain context – then you have built something that cannot be replicated simply by procuring a more capable model.

The three organisations whose challenges are described in this article were not attempting to build a chatbot. They were attempting to build the nervous system of a future enterprise: a system capable of sensing, interpreting, and acting on the full complexity of its operational environment without requiring human mediation at every decision point.

That ambition is realisable.

But it is realisable only if we are honest about where the foundational work lies.

It does not lie in model selection, parameter counts, or prompt engineering.

It lies in the patient, painstaking, domain-intensive discipline of making enterprise data legible to a machine – in all its multi-modal, version-controlled, entity-resolved, governance-compliant complexity.

That is the work of the data engineer.

And it is, quietly and without sufficient recognition, the foundational work of the intelligence age.

WHAT THIS MEANS FOR DATA ENGINEERS

The role is no longer about pipelines alone – it is about designing the cognitive foundation on which AI operates.

We are not just moving data anymore.

We are deciding what an AI system is allowed to understand as reality.


KEY TAKEAWAYS

  • Agentic AI is only as reliable as the context it receives. Model intelligence cannot compensate for a broken data foundation.
  • The three core failure modes of naïve retrieval – semantic noise, structural flattening, and provenance gaps – require engineering solutions, not model upgrades.
  • The ACE Stack (Ingestion, Refinement, Graphing, Governance) provides a four-layer architectural framework for production-grade context engineering.
  • Across energy and mobility, financial services, and life sciences, the decisive engineering challenge was not model selection but unstructured document parsing, multi-source pipeline integrity, and regulated data migration – the foundational work that makes an agent’s reasoning trustworthy.
  • The organisations that build the most reliable AI systems will be those that invest most seriously in data engineering – not those that procure the most powerful models.
  • The competitive moat in the age of agentic AI is built not in the model, but in the mess: the institutional, historical, unstructured data that only rigorous pipeline engineering can render legible.

“The future of AI is not about better models. It is about better data engineering. The competitive moat is not in the algorithm – it is built in the mess.”

© Data Science Talent Ltd, 2026. All Rights Reserved.