How to Build a Voice AI Agent: An Architecture Walkthrough (2026)
To build a voice AI agent you assemble a real-time streaming pipeline: phone audio comes in over telephony, speech-to-text transcribes it as the caller speaks, a turn detector decides when they’ve finished, an LLM reasons and (often) calls tools, text-to-speech speaks the reply, and the whole loop repeats — fast enough that the caller never feels the seams. Below is how I architect each stage, and the latency budget that holds it all together.
What are the parts of a voice AI agent?
A production voice agent is not one model — it is a chain of components, each of which can make or break the call. Conceptually the loop is: caller audio → telephony → STT → VAD/turn-taking → LLM (+ tools) → TTS → caller audio, with barge-in and handoff running alongside it. Here is what each stage does and the kind of tooling I reach for:
| Stage | What it does | Typical tools |
|---|---|---|
| Telephony | Connects the agent to the phone network for inbound/outbound calls and media streaming | Twilio, SIP / phone numbers |
| Speech-to-text (STT) | Streams the caller’s audio into live text | Deepgram, Whisper, OpenAI Realtime |
| Turn-taking / VAD | Detects when the caller has finished a turn so the agent replies at the right moment | Voice activity detection, end-of-turn models |
| LLM | Understands intent, holds conversation state, calls tools (calendar, CRM) | OpenAI, Claude, via LangGraph |
| Text-to-speech (TTS) | Turns the reply into natural, streamed speech | ElevenLabs, OpenAI Realtime, Cartesia |
| Orchestration platform | Wires the pipeline together and manages the call lifecycle | Retell AI, Vapi, custom LiveKit |
How does the telephony layer work?
Telephony is how your agent actually gets on a phone call. In practice that means a provider like Twilio handles the phone number, the carrier connection, and a bidirectional media stream of raw audio. Your agent subscribes to that stream, pushes audio to STT, and pushes synthesized speech back. The same layer handles inbound (the agent answers your line) and outbound (the agent places calls), call routing into your existing phone setup, and — critically — the warm transfer when a human needs to take over. Get this layer wrong and you fight audio glitches and dropped calls for weeks, so I treat it as real infrastructure, not an afterthought.
Why is turn-taking (VAD) the hardest part?
Turn-taking is deciding *when the caller has finished talking* so the agent knows it is its turn to speak. It sounds trivial; it is the single most common reason voice bots feel broken. End the turn too early and the agent cuts the caller off mid-sentence; end it too late and there is an awkward dead pause after every reply. Voice activity detection (VAD) plus an end-of-turn model listens for pauses, intonation, and silence thresholds to make that call in real time. This is also where barge-in lives — the ability for a caller to interrupt the agent mid-sentence and have it stop talking and listen, exactly like a person would. Without barge-in, the agent talks over people and the call collapses.
How do the LLM and TTS turn intent into a spoken reply?
Once a turn is closed, the transcript goes to the LLM. I do not drive these agents with one giant prompt and hope — I use an explicit conversation state machine (a LangGraph state graph) that sequences understanding, tool calls, and confirmation, so the agent follows your process instead of improvising. The LLM calls typed tools to check live calendar availability, qualify a lead against your rules, or write to your CRM. Its reply then streams into text-to-speech, which begins speaking the first words before the full sentence is even generated — streaming end to end is how you shave hundreds of milliseconds off perceived latency.
“Callers forgive a lot, but they do not forgive lag and they do not forgive being talked over. Fix those two and the agent already sounds human.”
— Saswat Mishra
What is the latency budget for a natural-feeling call?
The threshold where a phone conversation feels live rather than awkward is roughly 800ms of round-trip latency, and sub-500ms feels genuinely natural — that is the bar I hit on shipped agents like Podit. The only way to get there is to assign every stage a budget and engineer to it. Here is a realistic budget for a sub-700ms loop:
| Stage | Latency budget | How you protect it |
|---|---|---|
| STT (final transcript) | ~100–200ms | Streaming STT, partial results, fast model |
| Turn / end-of-turn detection | ~50–150ms | Tuned VAD thresholds, end-of-turn model |
| LLM first token | ~200–400ms | Smaller/faster model, short prompts, streaming |
| TTS first audio | ~100–200ms | Streaming TTS, start speaking on first chunk |
| Network / telephony overhead | ~50–100ms | Co-located services, persistent connections |
| Total round-trip target | ~500–700ms | Stream every stage; never wait for completion |
How do you handle barge-in and human handoff?
These two are what move an agent from demo to production. Barge-in: when the caller starts speaking while the agent is talking, the system must instantly stop TTS playback, flush the queued audio, and switch back to listening — otherwise it steamrolls the caller. Human handoff: on defined triggers — an explicit request for a person, low model confidence, or a high-stakes or sensitive request — the agent does a warm transfer to a live human and passes the full transcript and context along. Where no one is available, it falls back to voicemail or a scheduled callback so no caller ever hits a dead end. I build both of these into every voice agent; they are not optional.
How do you actually build one, step by step?
- Map the call flows and define the win — booked calls, qualified leads, deflected tickets — before writing any code.
- Stand up the telephony layer (Twilio number + media stream) and prove you can route audio in and out.
- Wire streaming STT and tune VAD/end-of-turn detection on recordings of real calls, not synthetic ones.
- Build the LLM as an explicit state graph with typed tools into your calendar, CRM, and APIs — not a single prompt.
- Add streaming TTS and barge-in, then tune the whole loop against the latency budget until it feels live.
- Wire the human handoff and fallback paths, then harden with transcripts, tracing, and evals over real and adversarial calls.
You do not have to build every layer from scratch. Managed platforms like Retell and Vapi get you to production quickly with solid turn-taking, while a custom LiveKit pipeline gives maximum control over latency and behavior for demanding cases. I choose based on your latency, integration, and cost requirements rather than defaulting to one — and I break down the trade-offs in my Retell vs Vapi vs Bland comparison.