AI Agents · Automation · LLMs · Engineering

Building AI Agents That Actually Work

Kavora Systems · March 7, 2026 · 10 min read

The gap between demo and production

Everyone has seen the demos. An AI agent books a flight, writes a report, queries three databases, and summarizes the results in a clean table. Impressive. Now try running that 10,000 times with messy real-world inputs and tell me how often it actually works.

The honest answer, for most teams shipping their first agent: somewhere between 60% and 80% of the time. That is not good enough for production. Your users do not care that "the LLM sometimes hallucinates" -- they care that your software broke.

Building reliable AI agents is not an AI problem. It is an engineering problem. The same discipline that makes traditional software reliable -- input validation, error handling, monitoring, graceful degradation -- applies here. The teams that treat agents like engineering systems ship agents that work. The teams that treat them like magic ship demos.

What makes a good agent use case

Before writing a single line of code, ask yourself three questions:

1. Is the scope well-defined? Agents that do one thing well outperform agents that try to do everything. "Answer customer questions about our billing policies using our docs" is a good use case. "Be a general-purpose assistant for our business" is not.

2. Are there clear success criteria? You need to know what "correct" looks like so you can measure it. If you cannot write an eval for the task, you cannot build a reliable agent for it -- a minimal eval sketch follows this list.

3. Can a human step in when it fails? Every agent needs a fallback path. The best agent architectures include confidence scoring and automatic escalation to a human when the agent is unsure. This is not a weakness -- it is what makes the system trustworthy.
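
As a concrete example, an eval can start as nothing more than a list of representative inputs plus facts the answer must contain, run against the agent on every change. A minimal sketch -- the answerBillingQuestion function and the test cases are hypothetical placeholders, not part of any framework:

typescript
// Hypothetical eval harness: representative inputs plus facts the answer must contain.
declare function answerBillingQuestion(input: string): Promise<string>;

const evalCases = [
  { input: "What plan is jane@example.com on?", mustContain: ["Pro"] },
  { input: "When was my last invoice issued?", mustContain: ["invoice"] },
];

async function runEvals() {
  let passed = 0;
  for (const c of evalCases) {
    const answer = await answerBillingQuestion(c.input);
    const ok = c.mustContain.every((s) => answer.includes(s));
    if (ok) passed++;
    else console.log("FAILED:", c.input);
  }
  console.log(`Pass rate: ${passed}/${evalCases.length}`);
}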

The strongest use cases we see in production: document processing and extraction, customer support triage with human escalation, code review assistants, data analysis pipelines with structured outputs, and workflow automation where the steps are known but the inputs vary.

Architecture patterns that hold up

The tool-calling agent

This is the workhorse pattern. Your agent has access to a set of tools (functions), the LLM decides which tools to call and in what order, and your code executes the tools and feeds results back.

Here is a basic tool-calling agent using the Vercel AI SDK with proper error handling:

typescript
import { generateText, tool } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

// Define tools with strict input schemas -- db stands in for your own data-access layer (not shown here)
const tools = {
  lookupCustomer: tool({
    description: "Look up a customer by email address",
    parameters: z.object({
      email: z.string().email(),
    }),
    execute: async ({ email }) => {
      const customer = await db.customers.findByEmail(email);
      if (!customer) {
        return { error: "Customer not found", email };
      }
      return {
        id: customer.id,
        name: customer.name,
        plan: customer.plan,
        billingStatus: customer.billingStatus,
      };
    },
  }),

  getRecentInvoices: tool({
    description: "Get the 5 most recent invoices for a customer",
    parameters: z.object({
      customerId: z.string().uuid(),
    }),
    execute: async ({ customerId }) => {
      const invoices = await db.invoices.findRecent(customerId, 5);
      return invoices.map((inv) => ({
        id: inv.id,
        amount: inv.amount,
        status: inv.status,
        date: inv.createdAt.toISOString(),
      }));
    },
  }),
};

async function runAgent(userQuery: string) {
  const MAX_STEPS = 5;
  const TIMEOUT_MS = 30_000;

  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), TIMEOUT_MS);

  try {
    const result = await generateText({
      model: openai("gpt-4o"),
      system: `You are a billing support agent. You help customers
        with billing questions. Only use the tools provided.
        If you cannot answer a question with the available tools,
        say so clearly. Never make up information.`,
      prompt: userQuery,
      tools,
      maxSteps: MAX_STEPS,
      abortSignal: controller.signal,
    });

    return {
      success: true,
      response: result.text,
      toolCalls: result.steps.flatMap((s) => s.toolCalls),
      tokenUsage: result.usage,
    };
  } catch (error) {
    if (error instanceof Error && error.name === "AbortError") {
      return {
        success: false,
        error: "Agent timed out after 30 seconds",
        escalate: true,
      };
    }
    return {
      success: false,
      error: "Agent encountered an error",
      escalate: true,
    };
  } finally {
    clearTimeout(timeout);
  }
}

A few things to notice. The tools have strict Zod schemas for inputs, so the LLM cannot pass garbage data. There is a maxSteps limit so the agent cannot loop forever. There is a hard timeout. And every failure path returns a structured error with an escalate flag so your application knows to route to a human.
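
At the call site, the escalate flag is what turns a model failure into an ordinary support workflow instead of a bad answer. A minimal sketch of a caller -- createSupportTicket is a hypothetical placeholder for your ticketing integration:

typescript
async function handleBillingQuestion(userId: string, question: string) {
  const result = await runAgent(question);

  if (result.success) {
    return { reply: result.response };
  }

  // Every failure path routes to a human instead of returning a guess.
  // createSupportTicket is a placeholder for your own ticketing system.
  if (result.escalate) {
    await createSupportTicket({ userId, question, reason: result.error });
  }
  return {
    reply: "I've passed your question to our billing team -- they will follow up shortly.",
  };
}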

RAG pipelines

Retrieval-Augmented Generation is the pattern for "answer questions using our documents." The agent retrieves relevant chunks from a vector database, then uses those chunks as context when generating an answer.

The critical engineering decisions in RAG are not about the LLM -- they are about the retrieval. How you chunk your documents, how you embed them, and how you rank results determines 80% of your answer quality.

Our recommendations for production RAG (a minimal retrieval sketch follows the list):

  • Chunk size: 512-1024 tokens with 10-20% overlap between chunks
  • Embedding model: OpenAI text-embedding-3-small for most use cases (good quality, low cost)
  • Vector store: Pinecone or pgvector if you are already on PostgreSQL
  • Retrieval: Hybrid search (vector similarity + keyword BM25) outperforms pure vector search
  • Reranking: Add a reranker (Cohere Rerank) between retrieval and generation for a significant quality boost
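
To make the shape of this concrete, here is a minimal retrieve-then-generate sketch using the AI SDK, OpenAI embeddings, and pgvector. It assumes a chunks table with content and embedding columns, and it skips hybrid search and reranking -- the table name, columns, and SQL are illustrative rather than a drop-in implementation:

typescript
import { embed, generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { Pool } from "pg";

const pool = new Pool(); // assumes the standard PG* environment variables

async function answerFromDocs(question: string) {
  // 1. Embed the user's question.
  const { embedding } = await embed({
    model: openai.embedding("text-embedding-3-small"),
    value: question,
  });

  // 2. Retrieve the closest chunks from pgvector (cosine distance).
  const { rows } = await pool.query(
    "SELECT content FROM chunks ORDER BY embedding <=> $1::vector LIMIT 8",
    [`[${embedding.join(",")}]`]
  );

  // 3. Generate an answer grounded only in the retrieved chunks.
  const context = rows.map((r) => r.content).join("\n---\n");
  const { text } = await generateText({
    model: openai("gpt-4o"),
    system:
      "Answer using only the provided context. If the context does not contain the answer, say you do not know.",
    prompt: `Context:\n${context}\n\nQuestion: ${question}`,
  });

  return text;
}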

Multi-step workflows

For complex tasks, break the work into discrete steps where each step validates the output of the previous one. This is more reliable than giving an agent a complex instruction and hoping it figures out the right sequence.

typescript
async function processDocument(doc: Document) {
  // Step 1: Classify the document type
  const classification = await classifyDocument(doc);
  if (classification.confidence < 0.85) {
    return { status: "needs_review", reason: "Low classification confidence" };
  }

  // Step 2: Extract structured data based on type
  const extracted = await extractFields(doc, classification.type);

  // Step 3: Validate extracted data against business rules
  const validation = validateExtraction(extracted);
  if (!validation.valid) {
    return { status: "needs_review", reason: validation.errors };
  }

  // Step 4: Write to database
  await saveExtraction(extracted);
  return { status: "processed", data: extracted };
}

Each step has a clear contract. Each step can fail independently. And failures route to human review instead of producing garbage output.
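
As an example of what one of those steps can look like, here is one way the classification step could be implemented with the AI SDK's generateObject, so the confidence check in step 1 has a real number to work with. The document types, the prompt, and the doc.text field are illustrative, and self-reported confidence is a routing signal rather than a guarantee:

typescript
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

async function classifyDocument(doc: { text: string }) {
  const { object } = await generateObject({
    model: openai("gpt-4o"),
    schema: z.object({
      type: z.enum(["invoice", "purchase_order", "contract", "other"]),
      // The model reports its own confidence; treat it as a routing
      // signal for the 0.85 threshold above, not as ground truth.
      confidence: z.number().min(0).max(1),
    }),
    prompt: `Classify this document:\n\n${doc.text}`,
  });

  return object; // { type, confidence }
}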

Guardrails are not optional

In traditional software, bad input produces an error. In agent systems, bad input produces confident-sounding wrong answers. That is worse. You need guardrails at every layer.

Input guardrails:

  • Validate and sanitize user inputs before they reach the LLM
  • Use a classifier to detect prompt injection attempts (see the sketch after this list)
  • Reject inputs that are out of scope for your agent
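
A cheap way to get started is a small screening call in front of the main agent. A sketch, assuming a fast model for the screening step -- the schema and categories are illustrative and will depend on your domain:

typescript
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

async function screenInput(userInput: string) {
  const { object } = await generateObject({
    model: openai("gpt-4o-mini"), // small, fast model for screening
    schema: z.object({
      inScope: z.boolean(), // is this actually a billing question?
      injectionSuspected: z.boolean(), // does it try to override the agent's instructions?
    }),
    prompt: `A user sent this message to a billing support agent. Classify it:\n\n${userInput}`,
  });

  return object.inScope && !object.injectionSuspected;
}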

Output guardrails:

  • Parse LLM outputs with Zod or similar schema validation
  • Check for hallucinated data (e.g., verify that customer IDs the agent references actually exist) -- see the sketch after this list
  • Implement confidence scoring -- if the agent is not sure, it should say so
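
For the hallucination check specifically, the idea is to treat any identifier in the agent's output as untrusted until you have looked it up yourself. A sketch, reusing the hypothetical db layer from the tool-calling example -- the regex assumes UUID-style customer IDs and should be adapted to whatever your real identifiers look like:

typescript
// Rough UUID matcher -- adjust to the shape of your real IDs.
const UUID_RE =
  /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi;

async function verifyReferencedCustomers(agentResponse: string) {
  const ids = agentResponse.match(UUID_RE) ?? [];
  for (const id of ids) {
    // db.customers.findById is assumed to exist alongside findByEmail above.
    const customer = await db.customers.findById(id);
    if (!customer) {
      // The agent mentioned a customer that does not exist:
      // escalate instead of sending the response to the user.
      return { ok: false, hallucinatedId: id };
    }
  }
  return { ok: true };
}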

Cost guardrails:

  • Set per-request token budgets
  • Monitor daily spend with hard circuit breakers (a minimal sketch follows this list)
  • Use maxSteps to prevent infinite tool-calling loops
  • Cache repeated queries to avoid redundant LLM calls
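
Circuit breakers do not have to be sophisticated to be effective. A minimal in-memory sketch -- a real implementation would persist counters in Redis or your database and price tokens per model, and the budget number here is a placeholder:

typescript
const DAILY_TOKEN_BUDGET = 5_000_000; // placeholder -- size to your own spend
let tokensUsedToday = 0; // reset by a daily cron in a real system

function recordUsage(tokens: number) {
  tokensUsedToday += tokens;
}

function assertBudgetAvailable() {
  if (tokensUsedToday >= DAILY_TOKEN_BUDGET) {
    // Fail fast and loudly instead of silently burning money.
    throw new Error("Daily LLM token budget exhausted -- agent disabled");
  }
}

// Usage around each agent call:
//   assertBudgetAvailable();
//   const result = await runAgent(query);
//   if (result.success) recordUsage(result.tokenUsage.totalTokens);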

Rate limiting:

  • Per-user rate limits to prevent abuse
  • Per-model rate limits to stay within provider quotas
  • Queue-based processing for non-real-time workloads

Monitoring and observability

You cannot improve what you cannot see. Agent systems need more observability than traditional software because failures are often semantic (the answer was wrong) rather than syntactic (the server returned a 500).

What to track (a per-request logging sketch follows the list):

  • Success rate by task type -- are certain queries failing more than others?
  • Latency distribution -- p50, p95, p99 for agent completion time
  • Token usage per request -- catch cost spikes before they hit your bill
  • Tool call patterns -- which tools are called most? Are there unexpected sequences?
  • Escalation rate -- what percentage of requests need human intervention?
  • User feedback -- thumbs up/down on agent responses, tracked over time
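
Most of these can be captured in one place: the function that calls the agent. A sketch that wraps the earlier runAgent with per-request logging -- metrics.write is a placeholder for whatever store or observability tool you use:

typescript
async function runAgentWithMetrics(userQuery: string, taskType: string) {
  const startedAt = Date.now();
  const result = await runAgent(userQuery);

  // metrics.write is a placeholder: a database table, StatsD, or an
  // LLM observability tool all work here.
  await metrics.write({
    taskType,
    success: result.success,
    escalated: !result.success && result.escalate === true,
    latencyMs: Date.now() - startedAt,
    totalTokens: result.success ? result.tokenUsage.totalTokens : 0,
    toolsCalled: result.success ? result.toolCalls.map((c) => c.toolName) : [],
  });

  return result;
}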

We use Langfuse for agent tracing in most of our projects. It gives you a Datadog-like experience for LLM calls: traces, latency, cost tracking, and the ability to replay any request. Open source, self-hostable, and it integrates with the Vercel AI SDK in a few lines of code.

Common failure modes

After shipping agents for a dozen clients, these are the failures we see most:

1. The infinite loop. The agent calls the same tool repeatedly with slightly different parameters, burning through tokens. Fix: set maxSteps limits and detect repeated tool calls (a simple check is sketched after this list).

2. The confident hallucination. The agent makes up data that sounds plausible but is completely wrong. Fix: validate every piece of data the agent references against your actual database.

3. The context window overflow. Too much context gets stuffed into the prompt, the model starts ignoring important information, and quality degrades. Fix: be surgical about what context you include. More is not always better.

4. The prompt injection. A user figures out how to make your agent ignore its instructions. Fix: input classification, output validation, and never give your agent access to tools that can do irreversible damage without human approval.

5. The cost explosion. A bug or edge case causes the agent to make 50 LLM calls per request instead of 3. Fix: hard token budgets, circuit breakers, and alerting on cost anomalies.
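
Detecting the loop pattern does not require anything clever: if the same tool is called with the same arguments more than once in a single run, something is wrong. A sketch that inspects the tool calls returned by the earlier runAgent -- it assumes each call exposes toolName and args, as in the tool-calling example above, and the "more than once" threshold is a judgment call:

typescript
function hasRepeatedToolCalls(
  toolCalls: { toolName: string; args: unknown }[]
): boolean {
  const seen = new Set<string>();
  for (const call of toolCalls) {
    const key = `${call.toolName}:${JSON.stringify(call.args)}`;
    if (seen.has(key)) {
      // Same tool, same arguments, twice in one run: treat it as a loop.
      return true;
    }
    seen.add(key);
  }
  return false;
}

// After a run: if (hasRepeatedToolCalls(result.toolCalls)) { escalate to a human }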

The bottom line

AI agents are powerful, but they are software -- and they need to be engineered like software. Define clear scope. Validate inputs and outputs. Set limits on cost and execution time. Monitor everything. Build human fallback paths.

The teams shipping reliable agents in production are not using better models or fancier frameworks. They are applying basic engineering discipline to a new category of system. That is the whole secret.

Start with the simplest agent that could work. Measure its reliability. Fix the failure modes. Ship the next version. Repeat. That is how you build agents that actually work.
