Anatomy of an agentic harness
When people talk about AI agents, they often focus on the model. In practice, the LLM is only one component. Building production-grade agentic systems is largely a systems engineering problem involving orchestration, observability, reliability, human feedback, and continuous improvement.
Over the past year, I've been building an AI-native platform for credit underwriting and risk analysis that accelerates document-heavy workflows while ensuring humans remain in control of critical decisions. Along the way, I've realized that most agentic harnesses share a common architecture.
This is the anatomy of the system we've built.
Orchestration
The first challenge isn't prompting; it's coordination.
An agentic system needs to answer questions like:
- Which agent should execute next?
- What happens if an agent fails?
- How do multiple agents share context?
- How are long-running workflows resumed after interruptions?
We use workflow orchestration to model the underwriting lifecycle as a graph of dependent tasks instead of a linear pipeline.
This makes branching logic, retries, parallel execution, and human approval checkpoints first-class citizens.
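To make the graph idea concrete, here is a minimal sketch using Python's standard-library graphlib module. The step names are hypothetical and this is not our production orchestrator; it only shows how declaring dependencies lets the scheduler work out which steps can run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical underwriting steps. Each key maps to the set of steps
# that must finish before it can start.
WORKFLOW = {
    "classify_documents": set(),
    "extract_bank_statement": {"classify_documents"},
    "extract_tax_return": {"classify_documents"},
    "compute_ratios": {"extract_bank_statement", "extract_tax_return"},
    "analyst_approval": {"compute_ratios"},
}


def execution_batches(graph):
    """Return a list of batches; steps inside one batch have no
    dependencies on each other and could run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # unblocks the steps that depend on this batch
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(execution_batches(WORKFLOW), start=1):
        print(number, batch)
```

In this sketch the two extraction steps land in the same batch, and the approval checkpoint is just another node in the graph rather than something bolted on afterwards.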
Durable Execution
Unlike chat applications, enterprise workflows may take several minutes or even hours to complete.
Documents arrive asynchronously. Humans review outputs. External services fail. Workers restart.
Durable execution persists workflow state after each step, so a crash or restart does not lose progress. Every step can be retried, resumed, or replayed without starting from scratch, which is what lets the platform operate reliably at scale.
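The core idea can be sketched in a few lines: persist the result of each step, and on restart skip whatever already finished. This is a toy illustration with a JSON checkpoint file, not a real durable execution engine; the function and step names are assumptions made for the example.

```python
import json
from pathlib import Path


def run_durably(steps, checkpoint: Path):
    """Run (name, function) steps in order, saving each result to disk.

    A rerun after a crash loads the checkpoint and skips completed steps.
    Step results must be JSON-serializable in this sketch.
    """
    state = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    for name, step in steps:
        if name in state:
            continue  # finished in an earlier run; do not repeat it
        state[name] = step(state)
        # Write to a temporary file, then rename, so a crash mid-write
        # does not leave a corrupt checkpoint behind.
        tmp = checkpoint.with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        tmp.replace(checkpoint)
    return state


if __name__ == "__main__":
    import tempfile

    attempts = {"score": 0}

    def extract(state):
        return {"monthly_income": 5000}

    def score(state):
        attempts["score"] += 1
        if attempts["score"] == 1:
            raise RuntimeError("simulated worker crash")
        return state["extract"]["monthly_income"] * 12

    steps = [("extract", extract), ("score", score)]
    with tempfile.TemporaryDirectory() as folder:
        path = Path(folder) / "workflow.json"
        try:
            run_durably(steps, path)
        except RuntimeError:
            pass  # first run dies after "extract" was checkpointed
        print(run_durably(steps, path))  # resumes at "score"
```

Production engines add much more (timers, retry policies, deterministic replay), but the contract is the same: a step that completed once is never redone just because a worker restarted.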
Shared Context
Each agent executes inside an isolated runtime.
Instead of maintaining shared in-memory state, agents communicate through a common storage layer that holds uploaded documents, extracted artifacts, intermediate outputs, metadata, and structured results. Because state lives in that layer rather than in the agents themselves, agents stay stateless, horizontally scalable, and independently deployable.
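As a rough sketch of this pattern, the example below uses a file-backed store and two agents that never call each other; they only read and write named artifacts. The ArtifactStore class, the artifact names, and the agent functions are all hypothetical, and a real deployment would more likely sit on object storage or a database.

```python
import json
from pathlib import Path


class ArtifactStore:
    """Hypothetical file-backed store, keyed by workflow id and artifact name."""

    def __init__(self, root: Path):
        self.root = root

    def put(self, workflow_id: str, name: str, payload: dict) -> None:
        path = self.root / workflow_id / f"{name}.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(payload))

    def get(self, workflow_id: str, name: str) -> dict:
        return json.loads((self.root / workflow_id / f"{name}.json").read_text())


def extraction_agent(store: ArtifactStore, workflow_id: str) -> None:
    """Reads the uploaded statement and writes a structured result."""
    document = store.get(workflow_id, "uploaded_statement")
    credits = [t["amount"] for t in document["transactions"] if t["amount"] > 0]
    store.put(workflow_id, "extracted_income", {"total_credits": sum(credits)})


def analysis_agent(store: ArtifactStore, workflow_id: str) -> None:
    """Knows nothing about the extraction agent, only about its artifact."""
    income = store.get(workflow_id, "extracted_income")
    store.put(workflow_id, "analysis", {"has_income": income["total_credits"] > 0})


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as folder:
        store = ArtifactStore(Path(folder))
        store.put(
            "wf-1",
            "uploaded_statement",
            {"transactions": [{"amount": 1200.0}, {"amount": -300.0}]},
        )
        extraction_agent(store, "wf-1")
        analysis_agent(store, "wf-1")
        print(store.get("wf-1", "analysis"))
```

Because each agent holds no state between calls, either one could be restarted, scaled out, or redeployed without the other noticing.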
Observability
Traditional applications expose logs.
Agentic systems require much richer visibility.
For every execution we capture prompts, model responses, reasoning traces, latency, costs, evaluation metrics, and workflow execution paths. Without this level of observability, improving agents becomes largely guesswork.
Tracing every interaction makes prompt iteration and production debugging significantly easier.
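A minimal sketch of what one trace record might look like is below. The Span fields, the in-memory TRACE list, and the fake_model function are stand-ins invented for this example; a real setup would ship these records to a tracing backend and call an actual model.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field


@dataclass
class Span:
    workflow_id: str
    step: str
    prompt: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    response: str = ""
    latency_s: float = 0.0
    cost_usd: float = 0.0
    error: str = ""


TRACE = []  # stand-in for a tracing backend


@contextmanager
def traced(workflow_id: str, step: str, prompt: str):
    """Record prompt, response, latency, cost, and any error for one call."""
    span = Span(workflow_id, step, prompt)
    start = time.perf_counter()
    try:
        yield span
    except Exception as exc:
        span.error = repr(exc)
        raise  # the failure is recorded, not swallowed
    finally:
        span.latency_s = time.perf_counter() - start
        TRACE.append(asdict(span))


def fake_model(prompt: str):
    """Stand-in for a model call; returns (text, cost in USD)."""
    return f"echo: {prompt}", 0.0004


if __name__ == "__main__":
    with traced("wf-1", "extract", "Find the account number") as span:
        span.response, span.cost_usd = fake_model(span.prompt)
    print(TRACE[-1]["step"], TRACE[-1]["cost_usd"])
```

The point of the sketch is that failed calls are recorded alongside successful ones, since those are usually the ones you need when debugging production.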
Human-in-the-Loop
One lesson becomes obvious very quickly:
The goal isn't to remove humans; it is to involve them only where their expertise creates the most value.
In underwriting, analysts review edge cases, validate extracted information, and approve high-impact decisions.
The workflow is intentionally designed around these checkpoints rather than attempting full automation.
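One way to picture such a checkpoint is a routing function that decides whether a case goes to an analyst and says why. The thresholds, field names, and the idea of a per-field confidence score are assumptions made for this sketch, not a description of our actual rules.

```python
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.85       # assumed threshold, for illustration only
HIGH_IMPACT_AMOUNT = 100_000  # assumed loan amount that always needs sign-off


@dataclass
class Extraction:
    field_name: str
    value: str
    confidence: float  # 0.0 to 1.0, from a hypothetical extractor


def review_reasons(extractions, loan_amount):
    """Return the reasons a case needs an analyst; an empty list means
    it can proceed without one."""
    reasons = [
        f"low confidence on {item.field_name}"
        for item in extractions
        if item.confidence < CONFIDENCE_FLOOR
    ]
    if loan_amount >= HIGH_IMPACT_AMOUNT:
        reasons.append("high-impact decision")
    return reasons


if __name__ == "__main__":
    case = [
        Extraction("account_number", "12345", 0.62),
        Extraction("account_holder", "A. Example", 0.99),
    ]
    print(review_reasons(case, loan_amount=20_000))
```

Returning reasons rather than a bare yes or no means the analyst sees why a case was routed to them, which keeps their attention on the part that needs judgment.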
Learning from Feedback
The most interesting part of the architecture isn't the agents; it's how they improve.
Imagine an extraction agent processing a scanned bank statement. The account number is partially scratched out, causing the extraction to fail.
During review, an analyst correctly infers the missing digits using contextual information elsewhere in the document. Instead of treating this as a one-time correction, we capture it as structured feedback.
That feedback becomes an input for improving prompts, extraction heuristics, sub-agent strategies, orchestration policies, and evaluation datasets. Over time, these human corrections make the system incrementally better.
Rather than static automation, the platform evolves through operational feedback.
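To illustrate what "structured feedback" could mean for the bank statement example, here is a small sketch that records an analyst's correction and turns it into an evaluation case. The Correction fields, the placeholder values, and the accuracy helper are hypothetical; they show one possible shape, not our actual schema.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Correction:
    document_id: str
    field_name: str
    model_value: Optional[str]  # None when the extraction failed outright
    analyst_value: str
    rationale: str              # how the analyst arrived at the value


def to_eval_case(correction: Correction) -> dict:
    """Turn a one-time fix into a reusable regression test case."""
    return {
        "document_id": correction.document_id,
        "field_name": correction.field_name,
        "expected": correction.analyst_value,
        "note": correction.rationale,
    }


def accuracy(cases: List[dict], extractor: Callable[[str, str], Optional[str]]) -> float:
    """Fraction of cases where the extractor matches the analyst's value."""
    if not cases:
        return 0.0
    hits = sum(
        1
        for case in cases
        if extractor(case["document_id"], case["field_name"]) == case["expected"]
    )
    return hits / len(cases)


if __name__ == "__main__":
    fix = Correction(
        document_id="statement-001",   # placeholder identifier
        field_name="account_number",
        model_value=None,
        analyst_value="0000123456",    # placeholder value
        rationale="Digits recovered from the footer of page 2",
    )
    cases = [to_eval_case(fix)]
    print(accuracy(cases, lambda doc, name: None))  # current extractor still fails
```

Once corrections live in a dataset like this, any change to a prompt or heuristic can be scored against the cases analysts have already resolved.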
Engineering the Platform
The system itself is composed of three major engineering domains:
- Application: a TypeScript monorepo powering the web application, APIs, authentication, and user experience.
- Agent Runtime: Python services responsible for workflow orchestration, agent execution, configuration, and observability.
- Analytics Engine: domain-specific financial intelligence, including transaction categorization, financial formulas, ML inference, and business logic.
This separation allows each domain to evolve independently while keeping the overall platform modular.
The Bigger Picture
Many discussions around AI agents revolve around prompting techniques or model selection.
In production, those are only a small part of the equation.
Reliable agentic systems require workflow orchestration, durable execution, observability, scalable infrastructure, human oversight, structured feedback loops, and domain-specific reasoning working together.
The LLM is the reasoning engine.
The harness around it is what turns reasoning into a dependable product.
And that's what makes building agentic AI such an interesting engineering challenge.