Ahmed Al Mubarak is a recognised global talent in machine learning and artificial intelligence, currently working as a Director of Business Intelligence and Data Science at Howden Re. Ahmed focuses on leveraging emerging technologies to drive innovation and efficiency across global business operations, combining technical expertise with strategic insight.
The rise of agentic AI – autonomous AI agents capable of decision-making and task execution – promises to transform how organisations operate. But how can real-world businesses take advantage of this emerging tech?
In our latest post, Ahmed Al Mubarak identifies a key problem in agentic AI adoption: messy, multi-format inputs impacting output quality. The solution, Ahmed explains, lies in context engineering, a disciplined way to design, structure, and deliver exactly the right information, in the right format, at the right time.
As large language models (LLMs) evolve, one principle has become unavoidable in my day-to-day work: the quality of the input governs the quality of the output, especially when the inputs are messy, multi-format, and spread across the enterprise. For years, many of us leaned on OCR, vision models, and increasingly elaborate prompts to coax LLMs into reproducing the structure of source documents. That works until it doesn’t. Real-world content varies wildly by layout and language, and often mixes dense prose with tables, figures, scans, handwriting, and screenshots. The result is brittle pipelines and prompts that need constant tweaking. What has emerged in response is context engineering: a disciplined way to design, structure, and deliver exactly the right information, in the right format, at the right time, so the model can actually do the task at hand (Schmid | 2025 | philschmid.de).
In my own practice as a data scientist, I’ve built solutions that transform raw reports into usable data for AI applications and business intelligence. Some models did a credible job turning financial statements with embedded tables into Markdown; most failed to preserve the table’s original semantics and layout. The deeper challenge was variability. Semi-structured tables in financial reports change year-to-year, company-to-company, and country-to-country. Language localisation adds another layer of ambiguity. Prompt engineering helped up to a point. But as I kept rewriting prompts to fit the next exception, it was clear that the problem wasn’t only the instruction. The problem was the context. Context engineering systematises the ‘what’ to unlock the ‘how’: we design the information environment (data, knowledge, tools, memory, structure, logic, and environment) that surrounds the model so it behaves predictably and productively (Mei et al. | 2025 | arxiv.org).
I use a practical definition: context engineering is the discipline of shaping what the model sees and can do before it generates a token, ensuring the task is plausibly solvable with minimal improvisation (Schmid | 2025 | philschmid.de). In other words, it is not just better prompting; it is the deliberate combination of content, constraints, and capabilities that make LLMs useful inside complex workflows. Anthropic makes a similar distinction between prompt engineering (instructions) and context engineering (curation and delivery of the right evidence and tools) for agentic systems (Anthropic | 2025 | anthropic.com).
To make this concrete, I frame context engineering as layered components that the agent can rely on throughout its run. I often present the following table when onboarding stakeholders, because it clarifies how each piece contributes to reliable outputs.
Data: layout-preserving sources (documents, tables, figures) converted into a canonical representation.
Knowledge: domain references, regulations, and approved boilerplate the agent can cite.
Tools: callable capabilities such as calculators, converters, and renderers.
Memory: reviewer feedback and prior runs that make the next run smarter.
Structure: target schemas the agent writes into, section by section.
Logic: guardrails and validation rules that govern how evidence is chosen and used.
Environment: the systems (email, SharePoint, S3, Slack) the agent operates within.
This framing aligns with the literature: surveys now describe context engineering as a holistic practice that couples retrieval, processing, and management of contextual inputs to LLMs (Mei et al. | 2025 | arxiv.org), while practitioner sources emphasise the importance of shaping what the model ‘sees’ and the tools it may invoke (Anthropic | 2025 | anthropic.com; LlamaIndex | 2025 | llamaindex.ai).
Unstructured content is everything that refuses to sit neatly in relational tables: emails, PDFs, slide decks, scanned forms, spreadsheets with evolving columns, charts, images, and handwritten notes. The practical obstacles are well-known to anyone who has tried to automate enterprise reporting. Reading order on multi-column pages is easily scrambled. Side-by-side tables lose header–cell relationships. Figures become detached from captions and units. Equations, stamps, and watermarks confuse naïve OCR. Multilingual fragments collide with domain jargon. Without a layout-preserving representation and a disciplined selection of context, LLMs hallucinate links between cells, misinterpret numbers, or simply ignore crucial evidence.
In my experience, two constraints matter most. First, selecting the right context: the agent must retrieve enough evidence to be correct but not so much that the window is clogged with irrelevant material. Second, fitting the token window: the context must be compact, structured, and deduplicated, so the model sees just what is needed, exactly once. That means we cannot treat the document as a blob of text. We must convert it into a structured substrate so the agent can query tables as cells with headers, figures with captions and alt text, sections with IDs, and entities with types, and then index those pieces for targeted retrieval (LlamaIndex | 2025 | llamaindex.ai).
A retrieval tool makes the pattern concrete; the docstring doubles as the description the agent reads when choosing tools (knowledge_index stands in for whatever retrieval backend you use):

```python
def retrieve_knowledge(query: str) -> str:
    """Useful for retrieving knowledge from a database containing information
    about XYZ. Each query should be a pointed and specific natural language
    question or query."""
    return knowledge_index.query(query)  # placeholder retrieval backend
```
Enterprise data is scattered by design. Some of it lives in email threads and calendar invites. Some sits in OneDrive or SharePoint, some in Slack or Confluence, some in S3 buckets with cryptic prefixes, some in line-of-business portals. Even ‘simple’ files arrive in a dozen guises: PDFs of varying provenance, images and scans, spreadsheets with hidden columns, or mixed-language reports from regional teams. If you manage to pull the right files, the next problem is semantic: extracting the right slices from unstructured or semi-structured formats while preserving their meaning.
I have found that the fastest path to reliability is to normalise all sources into a layout-preserving, canonical JSON that becomes the single source of truth for generation and export. Tools like modern document parsers and layout engines produce consistent elements (pages, blocks, tables, figures, entities, citations) without flattening away structure. On top of that canonical layer, we build a hybrid index (vector + lexical + metadata) and label each chunk with jurisdiction, line of business, effective dates, language, and freshness. The agent retrieves evidence bundles per section (top-k passages, required tables, and key figures) bounded to the token window, then writes into a target schema. Compared to ‘send the whole PDF to the model,’ this approach is auditable, scalable, and much cheaper.
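As a minimal sketch of the hybrid-index idea (field names and the scoring function are illustrative, not a specific product API), metadata filtering can run before relevance ranking; here a crude lexical overlap stands in for the vector component:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: str
    text: str
    metadata: dict = field(default_factory=dict)

def lexical_score(query: str, text: str) -> float:
    """Crude lexical relevance: fraction of query terms present in the chunk."""
    terms = set(query.lower().split())
    return len(terms & set(text.lower().split())) / max(len(terms), 1)

def hybrid_retrieve(query: str, chunks, filters: dict, top_k: int = 3):
    """Filter on metadata first, then rank survivors by relevance.
    A production system would blend this with vector similarity."""
    survivors = [
        c for c in chunks
        if all(c.metadata.get(k) == v for k, v in filters.items())
    ]
    ranked = sorted(survivors, key=lambda c: lexical_score(query, c.text),
                    reverse=True)
    return ranked[:top_k]

chunks = [
    Chunk("t1", "five-year claim table paid and case reserve",
          {"language": "en", "line_of_business": "property"}),
    Chunk("t2", "tableau des sinistres sur cinq ans",
          {"language": "fr", "line_of_business": "property"}),
]
hits = hybrid_retrieve("five-year claim table", chunks, {"language": "en"})
```

The metadata filter is what keeps jurisdiction, language, and freshness constraints enforceable rather than merely hoped-for.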
Building that canonical layer required several things:
- integrations with hundreds of foundational libraries, and relationships with their maintainers, contributing back to upstream projects through careful diplomacy;
- a representation language rich enough for complex spreadsheets yet simple enough for basic text documents;
- handling of ‘unhandleable’ documents (scanned PDFs, embedded charts, complex layouts) that require visual understanding via vision language models, object detection models, and computer vision techniques;
- and a generative LLM to handle the text in the document.
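A canonical, layout-preserving representation along these lines might look as follows (the field names are illustrative, not a specific parser's schema); the key property is that header-cell linkage survives, so cells stay addressable by header:

```python
import json

# One parsed artefact as canonical JSON: tables keep header-cell linkage,
# figures keep captions, and every element carries retrieval metadata.
doc = {
    "doc_id": "report-2025-q3",
    "language": "en",
    "elements": [
        {"id": "sec-1", "type": "section", "title": "Loss History",
         "metadata": {"jurisdiction": "UK", "effective_date": "2025-09-30"}},
        {"id": "tbl-1", "type": "table", "section": "sec-1",
         "headers": ["Year", "Paid", "Case Reserve"],
         "rows": [["2023", 1200000, 300000], ["2024", 950000, 410000]]},
        {"id": "fig-1", "type": "figure", "section": "sec-1",
         "caption": "Quarterly claim frequency",
         "alt_text": "Bar chart of claims per quarter"},
    ],
}

# A cell can be addressed by its header name, not by pixel position.
table = next(e for e in doc["elements"] if e["type"] == "table")
paid_2024 = dict(zip(table["headers"], table["rows"][1]))["Paid"]

# The serialised form is the single source of truth for generation and export.
canonical = json.dumps(doc, indent=2)
```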
One subtle but important shift in 2025 has been the rise of AI configuration files: machine-readable context and instructions checked into the repository, such as AGENTS.md, CLAUDE.md, or copilot-instructions.md. Instead of scattering the ‘first context’ across prompts and tribal knowledge, teams standardise how agents should behave in a specific codebase or project. The file encodes the project structure, build/test commands, code style, contribution rules, and references. Agent tools read the file automatically and inject its content into their working context (GitHub | 2025 | github.blog; agents.md | 2025 | agents.md).
A recent work-in-progress study examined 466 open-source repositories and found wide variation in both what is documented and how it is written (descriptive, prescriptive, prohibitive, explanatory, conditional), with no single canonical structure yet (Mohsenimofidi et al. | 2025 | arxiv.org). That variability matches my experience: AGENTS.md is most powerful when it encodes not just instructions but also context logic, the guardrails that determine how an agent chooses and uses evidence. If you want reproducible behaviour, version your context alongside your code.
When I propose context engineering to stakeholders, I describe it as a lightweight ‘operating system’ for agents. Its job is to orchestrate how data, knowledge, tools, memory, structure, logic, and environment come together at run time.
The ingestion pathway accepts sources from email, OneDrive/SharePoint, Slack/Confluence, S3, and direct uploads. Parsing combines OCR and layout detection to preserve headings, reading order, footnotes, side-by-side tables, figures with captions, and equations. Canonicalisation converts everything to JSON with strict linking between headers and cells, captions and figures, and cross-references. Indexing builds hybrid search over all elements with rich metadata. Selection composes compact evidence bundles per section that fit the window. Planning decomposes the task into sub-goals mapped to tools: compute KPIs, normalise units, summarise exposures, reason about compliance. Drafting writes section-by-section into target schemas with per-paragraph citations. Validation runs numeric, structural, and compliance checks. Export renders multiple formats (PDF, PPT, HTML) from the same canonical draft. Feedback writes reviewer edits into memory, so the next run starts smarter.
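The stages above can be sketched as an explicit pipeline of plain functions, so each step is testable, loggable, and swappable in isolation (names and the stub bodies are illustrative):

```python
# Minimal pipeline skeleton: every stage is a plain function over a state dict.
def ingest(sources):      return {"raw": sources}
def parse(state):         state["parsed"] = [f"parsed:{s}" for s in state["raw"]]; return state
def canonicalise(state):  state["canonical"] = {"elements": state["parsed"]}; return state
def index(state):         state["index"] = list(enumerate(state["canonical"]["elements"])); return state
def select(state):        state["bundle"] = state["index"][:2]; return state  # bounded evidence
def draft(state):         state["draft"] = {"sections": [e for _, e in state["bundle"]]}; return state
def validate(state):      state["valid"] = len(state["draft"]["sections"]) > 0; return state
def export(state):        state["outputs"] = {fmt: state["draft"] for fmt in ("pdf", "ppt", "html")}; return state

PIPELINE = [ingest, parse, canonicalise, index, select, draft, validate, export]

def run(sources):
    state = sources
    for stage in PIPELINE:
        state = stage(state)
    return state

result = run(["email/q3.eml", "s3://bucket/claims.xlsx"])
```

The point of the shape, not the stubs: because every format renders from the same canonical draft, there is nowhere for PDF and PPT to drift apart.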
This pipeline turns a static model into a programmable, tool-aware writer. It also reduces risk. Grounded evidence and per-paragraph citations curb hallucination. Target schemas unlock automated QA. And the context logic protects against failure modes such as stale regulations or missing loss runs (Anthropic | 2025 | anthropic.com; DAIR.AI | 2025 | promptingguide.ai).
I’ll anchor the above with a real-world scenario I build frequently: producing a quarterly insurance technical report from scattered content. The company asks the AI agent to deliver a print-ready PDF, a presentation-grade PPT, and an HTML page with live citations. The data arrives in every shape imaginable: scanned policy schedules; emails with embedded tables; spreadsheets whose columns change annually; photos of damage; prior reports in multiple languages. Prompt engineering alone will not tame this variety.
The run begins with intake that preserves layout. Content from email, OneDrive, Slack, Confluence, and S3 is parsed so reading order, headings, footnotes, and side-by-side tables survive extraction. Each artefact becomes canonical JSON, not a flattened blob. Tables retain header–cell linkage and include text grids, optional HTML, and a faithful image snapshot. Figures carry captions and alt text. This universal representation makes mixed-language pages and embedded charts intelligible to the agent without losing structure (LlamaIndex | 2025 | llamaindex.ai).
Before querying anything, the agent receives a short AGENTS.md-style brief. It states the audience and tone, the mandatory sections (Executive Summary, Exposure Profile, Loss History, Coverage Analysis, Recommendations, Appendices), the approved phrasing for sensitive topics, and the guardrails: cite every number, prefer the newest regulation, redact PII, use UK spelling. It also enumerates registered tools with their scopes: table reconstructor, risk ratio calculator, currency converter, date harmoniser, translation helper, and templating engines for PDF/PPT/HTML. This ‘first context’ constrains behaviour and prevents the tool from wandering.
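A brief of this kind might look like the following (the contents are an illustrative sketch, not a published standard; tool names match the hypothetical registry above):

```markdown
# AGENTS.md — Quarterly Technical Report

## Audience and tone
Underwriters and auditors; formal, concise, UK spelling.

## Mandatory sections
Executive Summary, Exposure Profile, Loss History, Coverage Analysis,
Recommendations, Appendices.

## Guardrails
- Cite every number with a source ID.
- Prefer the newest regulation when sources conflict.
- Redact PII before drafting.
- No evidence, no claim: flag gaps instead of inventing data.

## Registered tools
table_reconstructor, risk_ratio_calculator, currency_converter,
date_harmoniser, translation_helper, render_pdf, render_ppt, render_html
```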
For each section, the agent composes a bounded evidence bundle. For Loss History, it retrieves the five-year claim table with date, cause, paid, and case reserve; a figure showing quarterly frequency; and a passage that explains the spike in Q4. It computes frequency and severity, calculates loss ratios, normalises currencies and dates, and annotates each paragraph with its source IDs. Tables are rebuilt from structured cells, not pasted as screenshots. Captions remain attached to their charts. If evidence is missing, the agent flags the gap and suggests next steps instead of inventing data. This retrieval-aware writing loop produces text that is faithful, traceable, and ready for review (Anthropic | 2025 | anthropic.com; DAIR.AI | 2025 | promptingguide.ai).
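The Loss History computations reduce to simple arithmetic once the claim table is structured; the figures below are invented for illustration:

```python
# Claims as structured rows (from the canonical table, not a screenshot).
claims = [
    {"year": 2023, "paid": 1_200_000, "case_reserve": 300_000},
    {"year": 2024, "paid": 950_000, "case_reserve": 410_000},
]
premiums = {2023: 2_000_000, 2024: 2_000_000}

def incurred(row):
    """Incurred loss = paid + case reserve."""
    return row["paid"] + row["case_reserve"]

def loss_ratio(year):
    """Loss ratio = incurred losses / earned premium for the year."""
    year_incurred = sum(incurred(r) for r in claims if r["year"] == year)
    return year_incurred / premiums[year]

ratio_2023 = loss_ratio(2023)  # (1,200,000 + 300,000) / 2,000,000 = 0.75
```

Because the inputs are typed cells rather than OCR text, each resulting paragraph can cite the exact rows behind its numbers.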
Quality assurance runs in parallel. Validators check totals within tolerance, enforce header–cell alignment, and ensure every required section is present. Unit and date normalisers catch silent inconsistencies. A compliance pass redacts PII and scans for prohibited phrasing. Any discrepancy becomes a visible note for the agent to resolve by fetching better evidence or marking a decision for human review. Editorial changes flow back into memory, so the next report begins with the newest boilerplate, the latest regulatory references, and the team’s approved tone.
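Validation of this kind is cheap to implement over the canonical draft; the tolerance and section list here are illustrative:

```python
REQUIRED_SECTIONS = {"Executive Summary", "Loss History", "Recommendations"}

def check_totals(rows, stated_total, tolerance=0.005):
    """Recompute a table total and flag it if it drifts beyond tolerance."""
    computed = sum(rows)
    return abs(computed - stated_total) <= tolerance * max(abs(stated_total), 1)

def check_sections(draft):
    """Every mandatory section must be present; return the gaps."""
    return sorted(REQUIRED_SECTIONS - set(draft))

draft = {"Executive Summary": "...", "Loss History": "..."}
totals_ok = check_totals([1_200_000, 950_000], 2_150_000)
missing = check_sections(draft)
```

Returning the gap list (rather than raising) lets the agent turn each discrepancy into a visible note to resolve or escalate.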
Publishing is simply rendering the same canonical draft into multiple formats. The PDF is print-ready for underwriters and auditors, the PPT distils each section into a slide with highlights and exhibits, and the HTML carries deep links into the evidence log so reviewers can jump from a claim statistic to the source row. Because everything stems from the same structured draft, there is no drift between formats.
Several small techniques pay outsized dividends. I keep evidence bundles under a hard token budget and prioritise diversity of sources over repetition. I strip duplicate passages aggressively at retrieval time to avoid polluting the window. I store units and currencies explicitly alongside values in the canonical JSON and normalise them before generation. I treat tables as first-class objects: headers, data types, totals, footnotes, and units are embedded and validated. I make captions mandatory for figures and link them to the nearest paragraph to prevent orphaned visuals. I push style and compliance into context logic: casing, spelling (en-GB vs en-US), phrasing constraints for risk disclosures, and redaction rules. Importantly, I version the context (schemas, briefs, rules) alongside the code, because context is now part of the software artefact (GitHub | 2025 | github.blog; Mohsenimofidi et al. | 2025 | arxiv.org).
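The budget-and-dedup discipline fits in a few lines; token counting here is a crude word count, and a real system would use the model's own tokenizer:

```python
def approx_tokens(text: str) -> int:
    """Rough proxy: one token per word. Swap in a real tokenizer in practice."""
    return len(text.split())

def pack_bundle(passages, budget: int):
    """Keep passages in ranked order, drop duplicates, stop at the budget."""
    seen, bundle, used = set(), [], 0
    for p in passages:
        key = p.strip().lower()
        if key in seen:
            continue  # strip duplicate passages before they pollute the window
        cost = approx_tokens(p)
        if used + cost > budget:
            continue  # hard token budget per evidence bundle
        seen.add(key)
        bundle.append(p)
        used += cost
    return bundle

ranked = [
    "Q4 frequency spike driven by storm losses",
    "q4 frequency spike driven by storm losses",   # near-verbatim duplicate
    "Five-year claim table with paid and reserves",
]
bundle = pack_bundle(ranked, budget=20)
```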
When a model fabricates a number, the instinct is to tighten the prompt. In my experience, fabrication is more often a context failure. The model did not see the right evidence, or it saw too much contradictory evidence, or it lacked a rule that forbade speculation. Retrieval noise, stale documents, and silent unit drift are common culprits. The remedy is upstream: curate the index, privilege freshness in the ranking, annotate units and dates, enforce ‘no evidence → no claim,’ and keep explicit gaps visible in the output. Another failure mode is format fidelity: tables that look right but misalign headers and cells. Treating tables as objects with validations, not pictures, prevents subtle integrity loss. A third is scale drift as tasks grow longer and agents call more tools. Here the fix is to encode planning: require the agent to produce a plan that maps sub-tasks to tools, then execute with explicit inputs/outputs logged in memory. These patterns are echoed across practitioner guides and research agendas on agentic systems (Anthropic | 2025 | anthropic.com; Villamizar et al. | 2025 | arxiv.org; Baltes et al. | 2025 | arxiv.org).
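The planning fix can be as simple as forcing an explicit plan and logging each step's inputs and outputs; the tool registry below is illustrative:

```python
# Require a plan that maps sub-tasks to tools, then execute with logging.
TOOLS = {
    "normalise_units": lambda x: {"value": x["value"], "unit": "GBP"},
    "compute_kpi": lambda x: {"kpi": x["value"] / 1000},
}

def execute(plan, memory):
    """Run each step, recording explicit inputs/outputs in memory."""
    for step in plan:
        tool = TOOLS[step["tool"]]
        output = tool(step["input"])
        memory.append({"tool": step["tool"],
                       "input": step["input"],
                       "output": output})
    return memory

memory = execute(
    [{"tool": "normalise_units", "input": {"value": 5000, "unit": "EUR"}},
     {"tool": "compute_kpi", "input": {"value": 5000}}],
    memory=[],
)
```

The durable log is the point: when a run drifts, you can see exactly which step received which evidence.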
Context engineering benefits from clear metrics. I measure groundedness (share of sentences with traceable citations), section completeness (mandatory fields present), numeric integrity (totals within tolerance, unit consistency), layout fidelity (table header–cell alignment, caption attachment), and editorial effort (redlines per thousand words). On the operational side, I track token budget adherence, tool error rates, and time-to-publish. These are not abstract KPIs; they pinpoint which part of the context pipeline needs attention: index quality, selection heuristics, rules, or validators. Methodological guidance for LLM-in-SE studies underscores the need for rigorous, reproducible evaluation when agents interact with real artefacts (Baltes et al. | 2025 | arxiv.org).
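Most of these metrics are one-liners over the canonical draft; the sample data below is invented for illustration:

```python
def groundedness(sentences):
    """Share of sentences that carry at least one traceable citation."""
    cited = sum(1 for s in sentences if s["citations"])
    return cited / max(len(sentences), 1)

def redlines_per_kword(redlines: int, words: int) -> float:
    """Editorial effort: reviewer redlines per thousand words."""
    return redlines / (words / 1000)

sentences = [
    {"text": "Loss ratio rose to 68%.", "citations": ["tbl-1"]},
    {"text": "Q4 saw a frequency spike.", "citations": ["fig-1", "p-12"]},
    {"text": "Overall performance was stable.", "citations": []},
]
g = groundedness(sentences)            # 2 of 3 sentences are grounded
effort = redlines_per_kword(12, 4000)  # 3.0 redlines per thousand words
```

Trending these per run is what turns "the agent seems better" into a defensible engineering claim.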
Agentic coding tools have normalised the idea that we should document for machines, not just humans. The adoption of AGENTS.md-style files in open-source shows teams actively shaping their agents’ context (project structure, build/test routines, conventions, and guardrails) and then versioning that context in Git. The emerging research suggests styles and structures vary widely, but the trajectory is clear: context is becoming a first-class software artefact (Mohsenimofidi et al. | 2025 | arxiv.org). In my view, the same pattern will define enterprise reporting, customer service, finance ops, and risk management. As tasks lengthen and tool calls multiply, the backbone of reliable automation will be a context OS: layout-preserving data, explicit knowledge, callable tools, durable memory, structured outputs, enforceable logic, and a well-integrated environment.
I started this journey believing that better prompts would fix unreliable outputs. After building dozens of pipelines on top of documents that resist structure, I now believe something different. Prompting is necessary; context engineering is decisive. When we treat context as an engineered system – complete with schemas, indexes, rules, and validators – LLMs become competent collaborators rather than gifted improvisers. In the insurance report demo, context engineering turned chaotic inputs into a grounded, auditable narrative that ships as PDF, PPT, and HTML on schedule. The payoff is consistent across domains: fewer redlines, faster cycles, clearer audit trails, and models that learn from feedback. The discipline is still evolving, and conventions for documenting machine-readable context are far from settled. But the direction is unmistakable. The next generation of AI systems will be built on the quiet infrastructure we place around the model, not only on the cleverness we push into the prompt (Schmid | 2025 | philschmid.de; Anthropic | 2025 | anthropic.com; Mei et al. | 2025 | arxiv.org).
Anthropic. “Effective Context Engineering for AI Agents.” 2025. anthropic.com/engineering/effective-context-engineering-for-ai-agents
Baltes, S. et al. “Guidelines for Empirical Studies in Software Engineering involving Large Language Models.” 2025. arxiv.org/abs/2508.15503
DAIR.AI. “Elements of a Prompt | Prompt Engineering Guide.” 2025. promptingguide.ai/introduction/elements
GitHub. “Copilot Coding Agent Now Supports AGENTS.md Custom Instructions (Changelog).” 2025. github.blog/changelog/2025-08-28-copilot-coding-agent-now-supports-agents-md-custom-instructions/
Horthy, D. “Getting AI to Work in Complex Codebases.” 2025. github.com/humanlayer/advanced-context-engineering-for-coding-agents
LlamaIndex. “Context Engineering: What It Is and Techniques to Consider.” 2025. llamaindex.ai/blog/context-engineering-what-it-is-and-techniques-to-consider
Mei, L. et al. “A Survey of Context Engineering for Large Language Models.” 2025. arxiv.org/abs/2507.13334
Mohsenimofidi, S., Galster, M., Treude, C., and Baltes, S. “Context Engineering for AI Agents in Open-Source Software.” 2025. arxiv.org/abs/2510.21413
Schmid, P. “The New Skill in AI is Not Prompting, It’s Context Engineering.” 2025. philschmid.de/context-engineering
Villamizar, H. et al. “Prompts as Software Engineering Artifacts: A Research Agenda and Preliminary Findings.” 2025. arxiv.org/abs/2509.17548