Anatomy of an agentic harness

A dissection of a production AI platform for credit underwriting.

Jun 28, 2026 · 7 min read

AI, Agent, Architecture

When people talk about AI agents, they often focus on the model. In practice, the LLM is only one component. Building production-grade agentic systems is largely a systems engineering problem involving orchestration, observability, reliability, human feedback, and continuous improvement.

Over the past year, I've been building an AI-native platform for credit underwriting and risk analysis that accelerates document-heavy workflows while ensuring humans remain in control of critical decisions. Along the way, I've realized that most agentic harnesses share a common architecture.

This is the anatomy of the system we've built.

The harness around the reasoning engine - the model at the center, surrounded by the subsystems that make it dependable

Orchestration

The first challenge isn't prompting; it's coordination.

An agentic system needs to answer questions like:

Which agent should execute next?
What happens if an agent fails?
How do multiple agents share context?
How are long-running workflows resumed after interruptions?

We use workflow orchestration to model the underwriting lifecycle as a graph of dependent tasks instead of a linear pipeline.

This makes branching logic, retries, parallel execution, and human approval checkpoints first-class citizens.

Orchestration modelled as a graph of dependent tasks - branching, parallel execution, retries, and approval gates
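
To make the graph idea concrete, here is a minimal sketch using Python's standard-library `graphlib`. It groups tasks into waves that could run in parallel and adds a simple retry helper. The task names, `WORKFLOW`, `execution_waves`, and `run_with_retries` are hypothetical and chosen for illustration; they are not the platform's actual code, and a production system would more likely rely on a dedicated workflow engine.

```python
from graphlib import TopologicalSorter

# Hypothetical underwriting tasks; each key maps to the tasks it depends on.
WORKFLOW = {
    "ingest_documents": set(),
    "extract_bank_statement": {"ingest_documents"},
    "extract_financials": {"ingest_documents"},
    "risk_analysis": {"extract_bank_statement", "extract_financials"},
    "analyst_approval": {"risk_analysis"},  # human approval checkpoint
}


def execution_waves(graph):
    """Group tasks into waves; tasks in the same wave have no pending dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock dependents
    return waves


def run_with_retries(task, max_attempts=3):
    """Call task() until it succeeds or the attempt budget is used up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the failure to the orchestrator
```

Here the two extraction tasks land in the same wave because both depend only on `ingest_documents`, which is the kind of parallelism a linear pipeline cannot express.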

Durable Execution

Unlike chat applications, enterprise workflows may take several minutes or even hours to complete.

Documents arrive asynchronously. Humans review outputs. External services fail. Workers restart.

Durable execution persists the workflow's state as it progresses, so a crash or restart does not lose it. Every step can be retried, resumed, or replayed without starting from scratch, which lets the platform operate reliably at scale.

Durable execution - steps checkpoint into a state store so a crashed workflow resumes instead of restarting
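
The checkpointing pattern can be sketched in a few lines. This is a simplified illustration that assumes a local JSON file as the state store; `CheckpointStore` and `run_workflow` are hypothetical names, and a real deployment would typically use a database or a durable workflow engine instead.

```python
import json
from pathlib import Path


class CheckpointStore:
    """Persists completed step results as JSON (a stand-in for a real state store)."""

    def __init__(self, path):
        self.path = Path(path)
        # Reload whatever an earlier, possibly crashed, run managed to save.
        self.state = json.loads(self.path.read_text()) if self.path.exists() else {}

    def has(self, step):
        return step in self.state

    def get(self, step):
        return self.state[step]

    def save(self, step, result):
        self.state[step] = result
        self.path.write_text(json.dumps(self.state))


def run_workflow(steps, store):
    """Run (name, fn) steps in order, skipping any step already checkpointed."""
    results = {}
    for name, fn in steps:
        if not store.has(name):
            # Only steps without a checkpoint are executed.
            store.save(name, fn(results))
        results[name] = store.get(name)
    return results
```

If a worker dies midway, calling `run_workflow` again with a store pointing at the same file skips the steps that already completed and continues from the first one that did not.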

Shared Context

Each agent executes inside an isolated runtime.

Instead of maintaining shared in-memory state, agents communicate through a common storage layer that holds uploaded documents, extracted artifacts, intermediate outputs, metadata, and structured results. Because state lives in the storage layer rather than in the agents, the agents stay stateless, horizontally scalable, and independently deployable.

Isolated agents coupled only through a common storage layer holding documents, artifacts, outputs, metadata, and results
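
A toy version of this coupling might look like the following. The `ArtifactStore` interface, the two agents, and the sample document are all made up for illustration; in practice the store would be backed by object storage or a database rather than a dictionary.

```python
class ArtifactStore:
    """In-memory stand-in for the common storage layer (hypothetical interface)."""

    def __init__(self):
        self._items = {}

    def put(self, case_id, kind, payload):
        self._items[(case_id, kind)] = payload

    def get(self, case_id, kind):
        return self._items[(case_id, kind)]


def extraction_agent(store, case_id):
    """Reads the uploaded document and writes extracted fields back to the store."""
    document = store.get(case_id, "uploaded_document")
    holder = document["text"].split(":", 1)[1].strip()
    store.put(case_id, "extracted_fields", {"account_holder": holder})


def summary_agent(store, case_id):
    """Knows nothing about the extraction agent; it only reads what is in the store."""
    fields = store.get(case_id, "extracted_fields")
    summary = f"Account held by {fields['account_holder']}"
    store.put(case_id, "structured_result", {"summary": summary})


# Made-up sample data for the walkthrough.
store = ArtifactStore()
store.put("case-1", "uploaded_document", {"text": "Account holder: Jane Doe"})
extraction_agent(store, "case-1")
summary_agent(store, "case-1")
```

Neither agent holds a reference to the other or keeps state between calls, so either could be replaced or scaled out as long as it honors the same keys in the store.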

Observability

Traditional applications expose logs.

Agentic systems require much richer visibility.

For every execution we capture prompts, model responses, reasoning traces, latency, costs, evaluation metrics, and workflow execution paths. Without this level of observability, improving agents becomes largely guesswork.

Tracing every interaction makes prompt iteration and production debugging significantly easier.

Observability - every run captures prompts, responses, reasoning traces, latency, cost, metrics, and execution paths into a trace store
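
As a rough sketch of what one captured record could contain, the snippet below wraps a model call and stores a trace entry. `TraceRecord`, `TraceStore`, and `traced_call` are hypothetical names, the in-memory list stands in for a real trace store, and only a subset of the fields listed above is modelled.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TraceRecord:
    run_id: str
    step: str
    prompt: str
    response: str
    latency_ms: float
    cost_usd: float
    metrics: dict = field(default_factory=dict)  # evaluation scores, if any


class TraceStore:
    """In-memory stand-in for a trace store."""

    def __init__(self):
        self.records = []

    def execution_path(self, run_id):
        """Return the ordered list of steps recorded for one run."""
        return [r.step for r in self.records if r.run_id == run_id]


def traced_call(store, run_id, step, prompt, model_fn, cost_usd=0.0):
    """Call model_fn(prompt) and record prompt, response, latency, and cost."""
    started = time.perf_counter()
    response = model_fn(prompt)
    latency_ms = (time.perf_counter() - started) * 1000
    store.records.append(
        TraceRecord(run_id, step, prompt, response, latency_ms, cost_usd)
    )
    return response
```

Because every call goes through one wrapper, questions such as "which path did this run take?" or "which step dominates latency?" become queries over the trace store instead of guesswork.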

Human-in-the-Loop

One lesson becomes obvious very quickly:

The goal isn't to remove humans; it is to involve them only where their expertise creates the most value.

In underwriting, analysts review edge cases, validate extracted information, and approve high-impact decisions.

The workflow is intentionally designed around these checkpoints rather than attempting full automation.

Human-in-the-loop - a confidence gate automates routine cases and routes edge cases to an analyst for review
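
A confidence gate of this kind can be very small. In the sketch below, the `Extraction` type, the `route` function, and the 0.9 threshold are assumptions made for illustration; the real routing rules and thresholds would be tuned to the domain.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.9  # assumed value, not taken from the platform


@dataclass
class Extraction:
    field_name: str
    value: str
    confidence: float          # model-reported or calibrated score in [0, 1]
    high_impact: bool = False  # e.g. a field that drives a credit decision


def route(extraction, threshold=REVIEW_THRESHOLD):
    """Send high-impact or low-confidence results to an analyst; accept the rest."""
    if extraction.high_impact or extraction.confidence < threshold:
        return "analyst_review"
    return "auto_accept"
```

Note that high-impact items go to review regardless of confidence, which matches the idea that analysts approve consequential decisions even when the model is sure of itself.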

Learning from Feedback

The most interesting part of the architecture isn't the agents; it's how they improve.

Imagine an extraction agent processing a scanned bank statement. The account number is partially scratched out, causing the extraction to fail.

During review, an analyst correctly infers the missing digits using contextual information elsewhere in the document. Instead of treating this as a one-time correction, we capture it as structured feedback.

That feedback becomes an input for improving prompts, extraction heuristics, sub-agent strategies, orchestration policies, and evaluation datasets. Over time, every human correction makes the system incrementally better.

Rather than static automation, the platform evolves through operational feedback.

The learning loop - a human correction becomes structured feedback that improves prompts, heuristics, strategies, policies, and evaluation datasets
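
One way to picture "structured feedback" is as a record that can be turned directly into a regression case for the evaluation dataset. The `Correction` schema, `to_eval_case`, and `passes` below are hypothetical and only sketch the idea; the actual feedback format is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Correction:
    document_id: str
    field_name: str
    model_value: Optional[str]  # None when the extraction failed outright
    corrected_value: str
    rationale: str              # how the analyst arrived at the value


def to_eval_case(correction):
    """Turn an analyst correction into a regression case for the evaluation dataset."""
    return {
        "input": {
            "document_id": correction.document_id,
            "field": correction.field_name,
        },
        "expected": correction.corrected_value,
        "previous_output": correction.model_value,
        "source": "analyst_correction",
    }


def passes(eval_case, new_output):
    """Check whether a revised prompt or heuristic now produces the expected value."""
    return new_output == eval_case["expected"]
```

With the scratched-out account number from the example above, the analyst's inferred value becomes the `expected` field, and any later change to prompts or heuristics can be checked against it.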

Engineering the Platform

The system itself is composed of three major engineering domains:

Application - A TypeScript monorepo powering the web application, APIs, authentication, and user experience.
Agent Runtime - Python services responsible for workflow orchestration, agent execution, configuration, and observability.
Analytics Engine - Domain-specific financial intelligence including transaction categorization, financial formulas, ML inference, and business logic.

This separation allows each domain to evolve independently while keeping the overall platform modular.

The platform's three engineering domains - Application, Agent Runtime, and Analytics Engine

The Bigger Picture

Many discussions around AI agents revolve around prompting techniques or model selection.

In production, those are only a small part of the equation.

Reliable agentic systems require workflow orchestration, durable execution, observability, scalable infrastructure, human oversight, structured feedback loops, and domain-specific reasoning working together.

The LLM is the reasoning engine.

The harness around it is what turns reasoning into a dependable product.

And that's what makes building agentic AI such an interesting engineering challenge.