Retail, Featured

SimplyGuest: Real-Time Voice AI Agent for Parking Search & Support

Production conversational voice AI that helps drivers find parking listings, understand how SimplyGuest works, and resolve common issues, with answers grounded in live marketplace tools rather than guesswork

LLM, Voice AI, Telephony, Plivo, Gemini Live, MCP, Tool Calling, Python, Marketplace

Overview

Use Case

Inbound voice customer support for SimplyGuest.com: parking search, listing details, availability checks, token flow questions, and issue resolution with human escalation

Scale

Production deployment with real-time audio streaming and tool-grounded answers from SimplyGuest's live listing system

Timeline

6 weeks from MVP to production, with ongoing iteration driven by prompt updates and observability

The Challenge

SimplyGuest serves drivers who need parking quickly—often while commuting. The support experience had to be fast, safe, and accurate, without turning into an expensive call-center bottleneck.

Driving-Safe UX Requirements

Callers can't listen to long explanations or navigate menus—responses must be short, clear, and actionable

Live Inventory Can't Be "Guessed"

Parking availability, pricing, and listing details change frequently. A voice agent that hallucinates even occasionally is worse than no agent

Strict Policy & Safety Constraints

Agent must never share parking owner contact details directly, must be transparent that listings aren't verified, and must communicate token/replacement-token policy consistently

Production Observability

Voice AI needs traceability: what the user said, what the model answered, which tools were called, and why a call failed—without relying on "it sounded fine" demos

The Solution

We built a real-time, production voice AI agent with three core pillars: low-latency telephony streaming, tool-grounded marketplace answers, and end-to-end call observability.

Real-Time Voice Conversation (Telephony → LLM → Telephony)

A bidirectional streaming pipeline enables natural interruption handling and quick responses:

• Inbound call audio streams via WebSocket from the telephony provider
• The model responds with native audio output (rather than generating text first and converting it to speech afterward)
• Model audio is resampled to telephony-compatible PCM and played back in real time
• Interruption events clear buffered audio to avoid "talking over" the caller
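The resampling and interruption steps above can be sketched as follows. This is a minimal illustration, not the production code: it assumes mono 16-bit PCM, a little-endian host, and a model output rate of 24 kHz (an assumption to confirm against the model's documentation). The names resample_pcm16 and PlaybackBuffer are hypothetical, and a real deployment would likely use a filtered resampler rather than plain linear interpolation.

```python
from array import array
from collections import deque
from typing import Optional


def resample_pcm16(pcm: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample mono 16-bit PCM by linear interpolation.

    Assumes the byte order of the input matches the host (little-endian
    on typical servers) and that len(pcm) is a multiple of 2.
    """
    samples = array("h")
    samples.frombytes(pcm)
    if src_rate == dst_rate or len(samples) == 0:
        return pcm
    out_len = len(samples) * dst_rate // src_rate
    last = len(samples) - 1
    out = array("h")
    for i in range(out_len):
        pos = i * src_rate / dst_rate      # fractional position in the source
        left = int(pos)
        right = min(left + 1, last)
        frac = pos - left
        out.append(int(round(samples[left] * (1 - frac) + samples[right] * frac)))
    return out.tobytes()


class PlaybackBuffer:
    """Queue of audio chunks waiting to be sent to the caller."""

    def __init__(self) -> None:
        self._chunks = deque()

    def push(self, pcm: bytes) -> None:
        self._chunks.append(pcm)

    def pop(self) -> Optional[bytes]:
        return self._chunks.popleft() if self._chunks else None

    def clear(self) -> int:
        """Drop everything queued (called on an interruption event)."""
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped
```

In this sketch, each model audio chunk would pass through resample_pcm16(chunk, 24000, 16000) before being pushed, and an interruption event would call clear() so stale speech is never played over the caller.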

Tool-Grounded Answers via MCP (Model Context Protocol)

Instead of relying on memory, the agent queries live SimplyGuest tools for anything that depends on current data:

• Natural language parking search by city/area/landmark
• Listing detail retrieval for "tell me more about this one"
• Availability checks for specific listings
• Review lookups when asked

This ensures responses reflect current marketplace state and reduces hallucination risk.
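One way to picture the grounding rule is a small router that turns every model function call into either a live result or an explicit failure. The sketch below is illustrative only: ToolRouter and the check_availability stub are hypothetical names, and the real tools are served over MCP rather than as local functions.

```python
from typing import Any, Callable, Dict

ToolFn = Callable[..., Dict[str, Any]]


class ToolRouter:
    """Maps tool names requested by the model to callables."""

    def __init__(self) -> None:
        self._tools: Dict[str, ToolFn] = {}

    def register(self, name: str, fn: ToolFn) -> None:
        self._tools[name] = fn

    def dispatch(self, name: str, args: Dict[str, Any]) -> Dict[str, Any]:
        """Run a tool and always return a structured payload.

        Failures are reported explicitly so the model can tell the caller
        it could not check, instead of improvising an answer.
        """
        fn = self._tools.get(name)
        if fn is None:
            return {"ok": False, "error": f"unknown tool: {name}"}
        try:
            return {"ok": True, "result": fn(**args)}
        except Exception as exc:  # bad arguments, upstream errors, timeouts
            return {"ok": False, "error": str(exc)}


def check_availability(listing_id: str) -> Dict[str, Any]:
    """Hypothetical stand-in for a live availability lookup."""
    if not listing_id:
        raise ValueError("listing_id is required")
    return {"listing_id": listing_id, "available": True}
```

The design choice worth noting is that dispatch never raises: an error payload goes back to the model like any other tool result, which keeps the "could not verify" path inside the conversation.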

Policy-Safe Customer Support Behavior

The voice agent was designed with strict conversational guardrails:

• Short answers (1–2 sentences) optimized for driving callers
• One question at a time to avoid cognitive overload
• Never reveal parking owner phone numbers directly
• Clear transparency: SimplyGuest is a listing platform; users should visit and verify
• Correct handling of token and replacement-token policy
• Human escalation path for urgent issues
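Prompt instructions alone are a weak guarantee for a rule like "never reveal owner phone numbers", so a common complement is to scrub tool results before the model ever sees them. The following is a hedged sketch of that idea, not a description of the deployed system: the key names in BLOCKED_KEYS and the phone-like pattern are assumptions, and the pattern is deliberately broad, so it may also redact other long digit sequences.

```python
import re
from typing import Any

# Hypothetical field names that should never reach the model.
BLOCKED_KEYS = {"owner_phone", "owner_contact", "phone"}

# Phone-like run: optional "+", then digits mixed with common separators.
PHONE_LIKE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")
MIN_DIGITS = 9  # shorter runs (for example an ISO date) are left alone


def _redact_match(match: "re.Match[str]") -> str:
    digits = sum(ch.isdigit() for ch in match.group())
    return "[redacted]" if digits >= MIN_DIGITS else match.group()


def scrub(value: Any) -> Any:
    """Recursively drop blocked keys and redact phone-like strings."""
    if isinstance(value, dict):
        return {k: scrub(v) for k, v in value.items() if k not in BLOCKED_KEYS}
    if isinstance(value, list):
        return [scrub(v) for v in value]
    if isinstance(value, str):
        return PHONE_LIKE.sub(_redact_match, value)
    return value
```

Applied to every tool result, a filter like this means the model cannot leak a number it was never given, whatever the caller asks.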

Full-Call Observability and QA Workflow

Every call produces structured artifacts for debugging, QA, and iteration:

• transcript.txt: human-readable call timeline
• events.jsonl: structured events (setup, interruptions, tool calls/results, errors)
• caller_in.wav + agent_out.wav: separate audio tracks
• stereo.wav (generated in post-processing): caller on the left channel, agent on the right, for easy review
• recording_meta.json: sample rates, durations, and recording notes

This enables fast iteration on prompts, tool schemas, and audio behavior without "black box" risk.
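As an illustration of the events.jsonl artifact, the sketch below appends one JSON object per line to a per-call file. The record fields (ts, type, and any extra keyword fields) and the EventLog class name are assumptions made for the example; the source does not specify the actual schema.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict


class EventLog:
    """Append-only JSON Lines log for a single call."""

    def __init__(self, call_directory: Path) -> None:
        self._path = call_directory / "events.jsonl"

    def write(self, event_type: str, **fields: Any) -> Dict[str, Any]:
        record = {
            "ts": datetime.now(timezone.utc).isoformat(),  # UTC timestamp
            "type": event_type,  # e.g. "setup", "interruption", "tool_call"
            **fields,
        }
        # Open, append, and close per event so a crash loses at most one line.
        with self._path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        return record
```

A call such as log.write("tool_call", tool="search_parking", latency_ms=182.4) would then add a single line that standard JSON Lines tooling can filter during QA.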

Technical Architecture

AI & Voice Stack

• Google Gemini Live (native audio) for low-latency conversational speech output
• Prompted for short, safe, policy-compliant CSR behavior
• Streaming transcription capture for both caller and agent turns

Telephony & Real-Time Streaming

• Plivo bidirectional <Stream> WebSocket integration
• 16 kHz PCM (audio/x-l16) streaming for telephony compatibility
• Real-time audio resampling from model output to telephony sample rate
• Buffer management + clearAudio on interruption events
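The message framing for this layer might look like the sketch below. The JSON field names (event, media, contentType, sampleRate, payload, streamId) reflect our understanding of Plivo's audio streaming messages and should be checked against Plivo's current documentation before use; the helper function names are hypothetical.

```python
import base64
import json
from typing import Optional


def play_audio_message(pcm16: bytes, sample_rate: int = 16000) -> str:
    """Frame a chunk of 16-bit PCM for playback on the call."""
    return json.dumps({
        "event": "playAudio",
        "media": {
            "contentType": "audio/x-l16",
            "sampleRate": sample_rate,
            "payload": base64.b64encode(pcm16).decode("ascii"),
        },
    })


def clear_audio_message(stream_id: str) -> str:
    """Ask the provider to drop audio it has buffered but not yet played."""
    return json.dumps({"event": "clearAudio", "streamId": stream_id})


def decode_inbound_media(raw: str) -> Optional[bytes]:
    """Return caller audio bytes from a media event, or None for other events."""
    message = json.loads(raw)
    if message.get("event") != "media":
        return None
    return base64.b64decode(message["media"]["payload"])
```

On an interruption, the server would send clear_audio_message(stream_id) over the same WebSocket and also empty its own outgoing queue, since audio can be buffered on both sides.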

Tool Integration Layer

• SimplyGuest tools exposed via MCP Streamable HTTP
• Dynamic tool declaration fetching (function schemas) at session start
• Tool-call logging with latency measurement and error capture
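A rough sketch of the two ideas in this layer: converting MCP tool listings into function declarations at session start, and wrapping each call with latency and error capture. MCP tools do expose name, description, and inputSchema; the target shape (name, description, parameters) is an assumption about what the model API expects, and some schema keywords may need stripping in practice. The call_fn and log_fn parameters are hypothetical injection points, not real library APIs.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


def to_function_declaration(mcp_tool: Dict[str, Any]) -> Dict[str, Any]:
    """Map one MCP tool listing entry to a function declaration dict."""
    return {
        "name": mcp_tool["name"],
        "description": mcp_tool.get("description", ""),
        "parameters": mcp_tool.get(
            "inputSchema", {"type": "object", "properties": {}}
        ),
    }


def timed_tool_call(
    name: str,
    args: Dict[str, Any],
    call_fn: Callable[[str, Dict[str, Any]], Any],
    log_fn: Callable[[Dict[str, Any]], None],
) -> Tuple[Any, Optional[str]]:
    """Invoke a tool, then log latency and any error as one structured event."""
    started = time.perf_counter()
    result: Any = None
    error: Optional[str] = None
    try:
        result = call_fn(name, args)
    except Exception as exc:
        error = f"{type(exc).__name__}: {exc}"
    latency_ms = (time.perf_counter() - started) * 1000
    log_fn({
        "type": "tool_call",
        "tool": name,
        "args": args,
        "latency_ms": round(latency_ms, 1),
        "ok": error is None,
        "error": error,
    })
    return result, error
```

Logging successes and failures through the same event shape is what makes aggregate figures such as success rate and median latency straightforward to compute later.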

Observability & Recording

• Per-call directory layout with UTC date partitioning
• Structured event logs for auditing tool behavior and failures
• Call recording with timeline alignment (including silence insertion for agent track alignment)
• Stereo WAV generation for fast human review
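The directory layout and stereo merge can be sketched with the standard library alone. This is an illustrative version under stated assumptions: both tracks are mono 16-bit PCM at the same sample rate, the host is little-endian (WAV data is little-endian), and the agent track's start offset in samples is already known. The names call_dir and merge_to_stereo and the root/date/call_id layout are hypothetical.

```python
import wave
from array import array
from datetime import datetime, timezone
from pathlib import Path


def call_dir(root: Path, call_id: str, started_at: datetime) -> Path:
    """Create root/YYYY-MM-DD/call_id, with the date taken in UTC."""
    day = started_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    path = root / day / call_id
    path.mkdir(parents=True, exist_ok=True)
    return path


def merge_to_stereo(
    caller: array,
    agent: array,
    agent_offset: int,
    out_path: Path,
    sample_rate: int = 16000,
) -> int:
    """Write caller (left) and agent (right) into one stereo WAV file.

    Leading silence of agent_offset samples aligns the agent track with
    the call timeline; the shorter track is padded with trailing silence.
    Returns the number of stereo frames written.
    """
    agent_aligned = array("h", [0] * agent_offset) + agent
    frames = max(len(caller), len(agent_aligned))
    left = caller + array("h", [0] * (frames - len(caller)))
    right = agent_aligned + array("h", [0] * (frames - len(agent_aligned)))

    stereo = array("h")
    for l_sample, r_sample in zip(left, right):  # interleave L, R, L, R, ...
        stereo.append(l_sample)
        stereo.append(r_sample)

    with wave.open(str(out_path), "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(stereo.tobytes())
    return frames
```

Keeping each speaker on its own channel lets a reviewer mute one side or spot overlapping speech visually in any audio editor.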

Results

Customer Experience

72% of callers found a relevant parking listing without needing human escalation
45 seconds median time from call start to first relevant parking recommendation
4.3/5 average satisfaction rating from post-call surveys (n=500+ calls)

Operational Efficiency

68% reduction in repetitive support queries handled by humans (parking search, availability checks, token policy explanations)
52% reduction in average handle time for escalated calls (full call transcripts + tool traces eliminate re-investigation)
96.8% tool-call success rate with 180 ms median latency (marketplace data + review lookups)

Technical Performance

240 ms median end-to-end audio response latency (caller speech → agent audio with tool calls)
99.7% uptime across production (including tool provider availability)
99.2% of calls with complete transcripts, events, and stereo recordings for QA review

Key Takeaways

Tool Grounding Wins in Marketplaces: Any system that answers from "memory" will drift from live inventory. Making tools the default path for availability/details dramatically reduces incorrect answers.

Voice UX Needs Hard Constraints: Short, single-question turns aren't just "nice"—they're essential for driving callers and for reducing conversational failure modes.

Observability is the Difference Between a Demo and Production: Transcripts, tool traces, and stereo recordings make it possible to debug and improve safely—without guessing what happened on a call.

Ready to deploy a production voice agent for your marketplace?

If you want a voice AI system that's grounded in real data, designed for production reliability, and instrumented for continuous improvement—we can help you ship it.
