In the last post, I focused on why most AI agent demos struggle in production. This one addresses the next question teams ask once they accept those constraints:
What frameworks, retrieval methods, and abstractions should we actually use?
I questioned myself a lot when building my first agent system. The honest answer is still inconvenient:
It depends on the system you’re trying to run, not the tools you like or the blog posts you’ve read.
What this post offers instead is a way to reason about tradeoffs - and to recognize when an abstraction is helping versus when it’s just hiding complexity.
Agents Are Just Systems of LLM Calls
The first useful reframing is to stop treating “agents” as a special category.
Mechanically, an agent system is just coordinated LLM calls with shared context, routing logic, tool invocation, retries, and termination conditions. Once you see that, frameworks stop feeling magical and start feeling like what they are: opinions about control flow, state, and observability.
Writing Your Own Agent Loop vs Using a Framework
A few hundred lines of Python can suffice to implement the core pieces (model calls, a reasoning loop, and tool execution, optionally with memory/state control) and give you deep insight into how agents work under the hood. The core logic of an agent is just a controlled loop over reasoning and actions, and that’s perfectly fine for a small number of users or for learning the concepts.
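To make that concrete, here’s a minimal sketch of such a loop. It assumes the OpenAI Python SDK’s chat-completions tool-calling interface; the model name and the `get_weather` tool are placeholders - swap in whatever provider and tools you actually use.

```python
import json
from openai import OpenAI  # any provider with tool calling works; this is just one option

client = OpenAI()

def get_weather(city: str) -> str:
    """Toy tool. A real tool would hit an API or a database."""
    return f"Sunny in {city}"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def run_agent(user_input: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):  # termination condition: bounded number of steps
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=messages,
            tools=TOOLS,
        )
        msg = response.choices[0].message
        if not msg.tool_calls:       # model answered directly: we're done
            return msg.content
        messages.append(msg)         # keep the assistant turn in shared context
        for call in msg.tool_calls:  # execute each requested tool
            args = json.loads(call.function.arguments)
            result = get_weather(**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })
    return "Stopped: step budget exhausted"

print(run_agent("What's the weather in Oslo?"))
```

Everything a framework adds - retries, parallel tool calls, structured state, tracing - layers on top of a loop like this.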
Frameworks, on the other hand, earn their keep once complexity compounds - multiple tools, retries, parallelism, observability, structured outputs, or multi-agent coordination. They trade transparency for speed and consistency. In practice, the trade-off looks like this:
- Custom agent loop: Maximum control and debuggability, at the cost of more upfront effort and ongoing maintenance. You own the failure modes, but you also understand them deeply.
- Agent framework: Faster time-to-demo and shared infrastructure, but with opinionated abstractions that can obscure control flow and make edge cases harder to reason about.
A custom loop forces you to confront what actually breaks: where tokens are wasted, how retries amplify cost, and how small prompt changes ripple through the system. Even if you later adopt a framework, building a minimal agent once gives you a mental model most teams never quite develop - and a sharper sense of what frameworks truly provide versus what they hide.
Comparing Popular Agent Framework Styles
Most agent frameworks differ less in capability than in philosophy. They all wrap the same primitives - model calls, tools, state, and control flow - but make very different tradeoffs about how explicit those pieces should be. At a high level:
- LangChain-style frameworks optimize for velocity and ecosystem breadth. They make it easy to wire models, tools, and memory together, but often at the cost of opaque control flow.
- Graph- or DAG-based systems make state transitions explicit. You do more upfront design, but you get predictability, debuggability, and replayability (see the sketch below).
- Planner-heavy / auto-agent approaches maximize autonomy, letting the model decide what to do next with minimal orchestration. These are powerful for demos and research, but brittle in production.
Frameworks are rarely the bottleneck. Misunderstanding what they abstract away usually is.
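To show what “explicit state transitions” means in practice, here’s a framework-agnostic sketch in plain Python - the node names and state fields are made up. LangGraph and similar tools wrap the same idea in richer machinery (checkpointing, resumability, streaming).

```python
from dataclasses import dataclass, field

@dataclass
class State:
    question: str
    documents: list = field(default_factory=list)
    answer: str = ""

def retrieve(state: State) -> str:
    state.documents = ["doc about " + state.question]  # placeholder retrieval
    return "generate"                                   # name of the next node

def generate(state: State) -> str:
    state.answer = f"Answer based on {len(state.documents)} documents"
    return "done"

NODES = {"retrieve": retrieve, "generate": generate}

def run(state: State, start: str = "retrieve") -> State:
    node = start
    while node != "done":       # explicit, inspectable control flow
        print(f"step: {node}")  # every transition is observable and replayable
        node = NODES[node](state)
    return state

print(run(State(question="vector databases")).answer)
```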
Agent Framework Comparison (Real-World, 2025)
| Framework | Style / Philosophy | Strengths | Tradeoffs | Typical Use |
|---|---|---|---|---|
| LangChain | Chain & agent abstractions | Huge ecosystem, fast prototyping, broad model/tool support | Control flow can be implicit and hard to trace | Prototyping, internal tools, early-stage products |
| LangGraph | Explicit state machine / graph | Deterministic execution, resumability, production-friendly | More upfront design, less “plug-and-play” | Complex workflows, long-running agents |
| LlamaIndex | Data-centric agent framework | Strong RAG, indexing, retrieval primitives | Agent logic less flexible than LangChain | Knowledge-heavy apps, enterprise RAG |
| PydanticAI | Typed, schema-first agents | Strong structure, validation, predictable outputs | Less flexible for exploratory agents | Production systems, tool-heavy backends |
| CrewAI | Role-based multi-agent | Simple mental model, good demos, fast setup | Coordination logic can get implicit | Personal projects, agent “teams”, experiments |
| AutoGen | Conversational multi-agent | Flexible agent-to-agent interaction | Hard to reason about, non-deterministic | Research, simulations, experimentation |
Teams can also use more than one framework - a minimal loop or LangGraph-style core for production paths, and a higher-level framework for experimentation and iteration.
Retrieval Is Task-Specific, Not Vector-by-Default
One of the most common production mistakes is treating embeddings as the default retrieval solution. They aren’t.
Retrieval is a systems design problem, not a modeling one. Different tasks demand different primitives, and forcing everything through a vector index usually makes systems slower, noisier, and harder to debug. Here are some example use cases:
| Search Target / Task Type | Best Retrieval Method | Why |
|---|---|---|
| Code search | Lexical (BM25, trigrams) | Token-level precision and symbol matching matter |
| Exact record lookup | SQL / key-value | Deterministic, cheap, predictable |
| Structured entities (users, orders, configs) | SQL + indexes | Clear schema, no semantic ambiguity |
| Metadata filtering | SQL / column filters | Faster and more accurate than embeddings |
| Log / trace search | Lexical + time filters | Ordering and exact matches dominate |
| FAQ / doc QA | Embeddings | Semantic similarity helps recall |
| Natural-language to data | Hybrid (filters + embeddings) | Structure first, semantics second |
| Long-form research | Hybrid + reranking | Balance recall and precision |
| Mixed or unknown queries | Hybrid | Safest default in production |
Robust systems usually start with deterministic retrieval, layer in semantic search only where it clearly helps, and then reconcile results with ranking or filtering. This approach is cheaper, faster, and far more debuggable than “vector-everything” pipelines.
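As a sketch of that “reconcile with ranking” step, here’s reciprocal rank fusion over two ranked result lists. The doc ids and the `k` constant are placeholders; in practice the lists come from your BM25 index and your vector index.

```python
from collections import defaultdict

def rrf_merge(result_lists, k=60, top_n=5):
    """Merge several best-first lists of doc ids into one ranking via RRF."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] += 1.0 / (k + rank + 1)  # standard RRF contribution
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

lexical = ["doc7", "doc2", "doc9"]     # e.g. BM25 hits
semantic = ["doc2", "doc4", "doc7"]    # e.g. vector-index hits
print(rrf_merge([lexical, semantic]))  # docs ranked by combined evidence
```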
RAG Pipelines Are a Spectrum, Not a Pattern
Retrieval-augmented generation isn’t a single recipe - it’s a set of tradeoffs between simplicity, precision, and control.
Most production systems evolve along this path:
- Simple RAG (embed → retrieve → prompt): fast to build, easy to demo, but noisy and opaque.
- Filtered RAG (metadata filters + retrieval): adds precision, requires schema discipline.
- Hybrid RAG (lexical + embeddings + reranking): more moving parts, but significantly better quality and debuggability.
- Multi-stage RAG (iterative retrieval + reasoning): high recall, high latency, usually reserved for research or complex workflows.
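To make the second stage concrete, here’s a sketch of “structure first, semantics second” at prompt-assembly time. The `Chunk` fields and the pre-computed similarity scores are stand-ins for whatever your metadata schema and embedding model actually produce.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str
    product: str
    score: float = 0.0  # e.g. embedding similarity, attached by the retriever

def filtered_rag_prompt(question: str, chunks, product: str, top_k: int = 3) -> str:
    # 1. Deterministic filter first: metadata narrows the candidate set.
    candidates = [c for c in chunks if c.product == product]
    # 2. Semantic ranking second, only within the filtered subset.
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)[:top_k]
    # 3. Assemble the prompt with explicit provenance for debugging.
    context = "\n\n".join(f"[{c.source}] {c.text}" for c in ranked)
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"

chunks = [
    Chunk("Reset tokens expire after 24h.", "auth-docs", "auth", score=0.82),
    Chunk("Invoices are emailed monthly.", "billing-docs", "billing", score=0.91),
]
print(filtered_rag_prompt("How long do reset tokens last?", chunks, product="auth"))
```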
Databases Are Part of the Retrieval Design
Database choice maps fairly cleanly from constraints to reasonable defaults:
| Context | Reasonable Choice |
|---|---|
| Local / small datasets | SQLite + FTS5 |
| Local vector search | SQLite + sqlite-vec |
| Existing Postgres stack | Postgres + pgvector |
There are many great open-source projects optimized for search. LanceDB is a strong choice when you want fast local or cloud vector search with good developer ergonomics - useful once vector workloads become central, not required on day one. Meilisearch is excellent for search-first applications (docs, catalogs, dashboards); if search is a core product feature rather than a supporting capability, it can replace large parts of a custom retrieval stack.
The goal isn’t picking the “best” database up front. It’s choosing something that lets the retrieval system evolve as usage becomes real, instead of starting at maximum complexity and discovering later that most queries never needed embeddings in the first place.
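As a starting point for the “local / small datasets” row, here’s roughly what lexical search with SQLite’s FTS5 looks like (assuming your Python build of SQLite includes the FTS5 extension, which most do). The table, columns, and documents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path in a real project
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [
        ("Retry policy", "Exponential backoff with jitter for tool calls"),
        ("Retrieval notes", "BM25 before embeddings for code search"),
    ],
)

# bm25() is FTS5's built-in ranking function; smaller values mean better matches.
rows = conn.execute(
    "SELECT title, bm25(docs) FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
    ("retrieval OR bm25",),
).fetchall()
print(rows)
```

Swapping in sqlite-vec or pgvector later is a schema change, not a rewrite - which is exactly the kind of evolution this section is arguing for.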
Monitoring and Evaluation Are Not Optional
Once agents and RAG systems leave notebooks, observability becomes part of the architecture, not an afterthought.
Most popular frameworks already emit OpenTelemetry traces, or can be configured to. That means you have three realistic options:
Monitoring & Evaluation Approaches
| Observation Backend | Examples | Strengths | Tradeoffs | When It Makes Sense |
|---|---|---|---|---|
| Framework-native | LangSmith, LlamaIndex Observability | Zero-friction, framework-aware | Framework lock-in | Single-framework stacks |
| Framework-agnostic | Langfuse, Helicone, W&B | Cross-framework visibility | Integration effort | Mixed agent stacks |
| OTel-compatible backend | Jaeger, Grafana Tempo, Honeycomb | Vendor-neutral, infra-native | Lowest-level abstraction | Mature platforms |
Orthogonally, you can add custom evaluation metrics as needed. At a minimum, you want visibility into:
- prompt versions
- retrieved documents
- tool calls
- latency and token usage
- failure and retry paths
Without this, you’re tuning blind. With it, many “agent problems” turn out to be simple retrieval or prompt issues.
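A minimal, hand-instrumented version of that visibility list with the OpenTelemetry Python SDK might look like this. The attribute names are invented for illustration - align them with whatever semantic conventions your framework or backend already uses.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to stdout here; point this at Langfuse, Jaeger, etc. in practice.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent")

def answer(question: str) -> str:
    with tracer.start_as_current_span("agent.run") as span:
        # Attach the signals listed above as span attributes.
        span.set_attribute("prompt.version", "2025-01-qa-v3")
        span.set_attribute("retrieval.doc_ids", ["doc7", "doc2"])
        span.set_attribute("tool.calls", ["search_docs"])
        span.set_attribute("llm.tokens.total", 812)
        span.set_attribute("retry.count", 0)
        return "stub answer"  # the real model call goes here

answer("How do retries affect cost?")
```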
MCP Is a Coordination Tool, Not an Intelligence Upgrade
Model Context Protocol (MCP) is best understood as a boundary: a contract between agents and tools.
It shines when multiple agents share tooling, when execution must be clearly separated from reasoning, or when tools evolve independently. It’s overkill for tightly scoped systems where simplicity and latency dominate.
Like most abstractions, MCP solves coordination problems - not reasoning problems.
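For a sense of what that contract looks like on the tool side, here’s a rough sketch using the FastMCP helper from the official Python SDK. Treat the exact API, and the `get_order_status` tool itself, as assumptions to verify against the current SDK docs rather than a drop-in.

```python
# Sketch only: exposes one tool over MCP so any MCP-capable agent can call it.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("orders")

@mcp.tool()
def get_order_status(order_id: str) -> str:
    """Look up an order's status. A real implementation would query a backend."""
    return f"Order {order_id}: shipped"

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio, independently of any one agent
```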
Tool Choice Is About Constraints, Not Fashion
Across production systems, tools change constantly. Constraints don’t.
- Latency budgets.
- Cost ceilings.
- Data freshness.
- Observability.
- Failure tolerance.
Good tools make these constraints explicit and manageable. Bad ones hide them until the system becomes fragile.
The goal isn’t to pick the “best” framework or RAG pipeline. It’s to pick ones that fail in ways you can understand, debug, and recover from.
In the next post, I’ll walk through a concrete experiment: building a local agent with retrieval, monitoring, and evaluation that fits my constraints - not a generic benchmark.