When machines learn to speak
One API call from human-like AI conversation: the profound shift from typing to talking and what it means for human interaction
This post is part of my 2¢ series - my raw thoughts about recent topics in AI. Not always practical thoughts, but always thought-provoking. Some of my previous ones covered the economies of scale for foundation AI models, consolidation in the AI space, and autonomous agents.
This post is about the unprecedented shift happening in voice AI interfaces and what it means for human interaction. As these new capabilities become accessible through simple APIs, a massive opportunity is emerging for founders to build products that reimagine how we communicate with technology and each other.
A NotebookLM-powered podcast episode discussing this post:
June 2025. Sarah paces in her living room, rehearsing an important client presentation. Her AI companion listens intently, chiming in when relevant to offer real-time feedback on her delivery and content. "I think you rushed through the ROI section," it suggests in a warm, natural voice. "Let's try that part again, but this time—" Sarah cuts in mid-sentence, "Actually, can we focus on the opening first? And don't be so nitpicky!" The AI smoothly adjusts, without awkward pauses or robotic transitions. What was once a frustrating experience of rigid, unnatural interactions with voice assistants has evolved into fluid, human-like conversation.
I've spent considerable time lately thinking about and building in the voice AI space, and something unprecedented is emerging: for the first time in history, we have real-time, affordable, and competent artificial voice that's just one API call away. In just a few months, we've seen significant leaps forward from the likes of OpenAI’s Advanced Voice Mode (AVM) and new speech models, Google’s real-time conversational Gemini Flash, and Sesame’s emotionally intelligent AI.¹
This isn't just a technical milestone—it's a fundamental shift in how we interact with technology and, potentially, with each other. It will create numerous new opportunities for builders while redefining the very nature of human communication.
Gavin Purcell is arguing with Sesame’s realtime voice AI 👆
The dawn of natural voice AI
Remember the last time you called your bank's automated system? The familiar dance of repeated phrases, misunderstood words, and the desperate pressing of "0" to reach a human operator. That era is ending. OpenAI's release of Advanced Voice Mode (AVM) last September marked a pivotal moment when conversing with AI began to feel genuinely human.
This transformation stems from two key breakthroughs. First, the shift from cascading architectures (speech-to-text → text processing → text-to-speech) to direct speech-to-speech models eliminates the intermediate processing stages that previously slowed conversational AI interactions. Second, latency and cost have dropped dramatically. When OpenAI initially released its Realtime API, the price made it impractical for widespread adoption ($18/hour). But just four months later, Google's release of Gemini Flash 2.0 and OpenAI's 60% price reduction opened the floodgates for affordable, human-like voice AI applications that are one API call away.
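To make the architectural point concrete, here is a minimal sketch of a single turn in the older cascading approach, built from OpenAI's separate speech-to-text, chat, and text-to-speech endpoints. The helper name and file paths are my own illustration rather than code from any product mentioned here; the takeaway is that three sequential network calls sit between the user finishing a sentence and hearing a reply, which is precisely the latency that direct speech-to-speech models collapse into a single session.

```python
# A sketch of one conversational turn in a cascading voice pipeline:
# speech-to-text -> text reasoning -> text-to-speech.
# Function name and file paths are illustrative; each stage is a separate
# network round-trip whose latency adds up.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def one_voice_turn(input_audio_path: str, output_audio_path: str) -> str:
    # 1) Transcribe the caller's audio.
    with open(input_audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )

    # 2) Generate a text reply.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a concise phone assistant."},
            {"role": "user", "content": transcript.text},
        ],
    )
    reply_text = completion.choices[0].message.content

    # 3) Synthesize the reply back into speech.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
    speech.stream_to_file(output_audio_path)
    return reply_text

# A realtime speech-to-speech API replaces this whole loop with a single
# persistent, bidirectional audio session, removing the text hops entirely.
```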
Just last week, OpenAI unveiled its most human-like speech models yet, enabling developers to embed expressive cues like [WHISPERING] or [LAUGHING] directly into the text. Here's a quick demo from OpenAI.fm—a public tool launched alongside this release, showcasing what this new level of expressiveness sounds like in action:
Builders can now launch phone assistants that qualify sales leads, resolve customer support calls, automate insurance sales, or screen patients before their upcoming appointments. The necessary tools are already available and are just a single API call away.
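As a rough sketch of what that single API call might look like for the new expressive models, the snippet below embeds delivery cues directly in the input text and adds a style instruction. The model name, voice, and instructions parameter are assumptions based on OpenAI's announcement and the OpenAI.fm demo, so treat the details as illustrative rather than a definitive reference.

```python
# Illustrative only: expressive cues embedded in the text, plus a style hint.
# Model name, voice, and the `instructions` parameter are assumptions based on
# OpenAI's announcement; the exact cue syntax may differ in practice.
from openai import OpenAI

client = OpenAI()

speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="[WHISPERING] Don't tell anyone yet... [LAUGHING] okay, fine, tell everyone!",
    instructions="Warm and playful; honor the bracketed delivery cues.",
)
speech.stream_to_file("expressive_demo.mp3")
```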
The interruption problem
However, building truly natural voice interactions isn't just about faster processing and better voice synthesis. One of the most fascinating challenges lies in handling interruptions—a fundamental aspect of human conversation that AI still struggles with.
Current voice AI systems, including those mentioned above like OpenAI’s AVM, face several key challenges:
Oversensitivity to background noise (I always mute myself when not speaking)
Inability to distinguish between relevant speakers and ambient conversation
Lack of visual cues that humans use to anticipate and manage interruptions
Unlike human phone conversations, where near-zero latency and natural turn-taking make interruptions manageable, AI interactions often feel clunky when users try to interject.² Interestingly, humans tend to interrupt AI more frequently and aggressively than they would other humans, which poses a new challenge for voice AI developers while establishing a new interaction paradigm for human-AI conversation.
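To see why this is hard, here is a toy sketch of the barge-in logic a voice agent needs: while the assistant's audio is playing, a voice-activity detector keeps listening, and only if the user keeps speaking past a short hold time does the agent cancel playback and yield the turn. Every name and number here is hypothetical; real systems add speaker identification, noise filtering, and relevance checks on top of this skeleton.

```python
import asyncio

# Hypothetical barge-in sketch. speak() and detect_user_speech() stand in for
# real TTS playback and voice-activity detection over a microphone stream.

SPEECH_HOLD_SECONDS = 0.3  # require sustained speech so background noise doesn't trigger


async def speak(text: str) -> None:
    """Pretend to stream TTS audio, one 'chunk' per word."""
    for word in text.split():
        print(f"assistant: {word}")
        await asyncio.sleep(0.2)


async def detect_user_speech() -> None:
    """Return once the user has been speaking for SPEECH_HOLD_SECONDS.
    A real implementation would run a VAD model and filter out other speakers."""
    await asyncio.sleep(1.0)                  # simulate the user barging in after ~1s
    await asyncio.sleep(SPEECH_HOLD_SECONDS)  # confirm it is sustained speech


async def assistant_turn(text: str) -> None:
    playback = asyncio.create_task(speak(text))
    barge_in = asyncio.create_task(detect_user_speech())

    done, _ = await asyncio.wait({playback, barge_in}, return_when=asyncio.FIRST_COMPLETED)
    if barge_in in done:
        playback.cancel()   # stop talking immediately and hand the turn back
        print("assistant: (yields the floor to the user)")
    else:
        barge_in.cancel()   # turn finished without interruption


asyncio.run(assistant_turn("Let's revisit the ROI section of your presentation one more time."))
```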
The social impact
This voice revolution raises profound questions about human interaction and relationships:
Could the instant gratification of interruptible AI conversations and the ability to be rude without consequences degrade our patience and interpersonal skills, similar to how ubiquitous access to pornography has distorted societal expectations around intimacy?
The convenience of always-available AI consultation might reduce our reliance on human relationships. Consider how we once relied on reading maps and asking locals for directions—skills now largely abandoned as we defer to GPS. Could meaningful conversations be next?
Could we soon have more conversational exchanges with AI agents than with human companions?
Think: Would you rather rehearse a high-stakes presentation in front of a potentially judgmental friend or instantly consult a non-judgmental AI companion available 24/7?
What does this mean for our interpersonal relationships?
Cultural nuances in AI conversation
One size doesn't fit all in human conversation, and the same is true for AI. OpenAI's recent upgrade from GPT-4o to GPT-4.5 was mainly about moving away from the model's "corporate HR" tone, recognizing that natural conversation varies significantly across cultures and contexts.
Different cultures have distinct interruption patterns, politeness norms, and conversation styles. Today's systems largely fail to account for these cultural differences, creating a significant opportunity for AI builders to develop models that adapt to:
Cultural background
Individual user patterns
Contextual cues
Historical interactions
OpenAI already possesses such context through its Memory feature, and Google, of course, knows virtually everything about us.
I imagine the best conversational AI systems of the future will incorporate nuances that we take for granted.
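As a very rough illustration of the idea, one could fold this kind of context into the system prompt of a conversational model. The UserContext fields and prompt wording below are my own sketch, not how OpenAI's Memory or any production system actually represents this.

```python
from dataclasses import dataclass

# Illustrative only: a toy representation of the context a conversational
# system might adapt to. Field names and prompt wording are my own sketch.


@dataclass
class UserContext:
    culture: str                  # shapes politeness norms and turn-taking style
    interruption_tolerance: str   # "low" | "medium" | "high"
    preferred_formality: str      # "casual" | "professional"
    recent_topics: list[str]      # distilled from prior conversations ("memory")


def build_system_prompt(ctx: UserContext) -> str:
    return (
        "You are a voice assistant. Adapt your conversational style as follows:\n"
        f"- Politeness and turn-taking norms typical of a {ctx.culture} context\n"
        f"- The user's tolerance for being interrupted: {ctx.interruption_tolerance}\n"
        f"- Formality: {ctx.preferred_formality}\n"
        f"- Carry over earlier topics: {', '.join(ctx.recent_topics)}"
    )


print(build_system_prompt(UserContext(
    culture="Japanese",
    interruption_tolerance="high",
    preferred_formality="casual",
    recent_topics=["client presentation", "ROI section"],
)))
```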
Rethinking communication
The holy grail for conversational AI might be achieving the natural flow of a phone call between humans, where interruptions feel natural and turn-taking is seamless. But perhaps we need to aim higher. As AI systems gain multimodal capabilities (vision, touch, etc.), they could potentially surpass human conversation by reading subtle cues we often miss.
Figure's household robots learn tasks on the fly 👆
What surprises me most is how slowly Advanced Voice Mode is being adopted. Despite its impressive capabilities, many of my friends still default to typing or using Whisper (OpenAI's speech-to-text model) rather than having natural conversations with it. Perhaps this hesitation reflects our collective uncertainty about speaking naturally to machines, or simply a lack of awareness—after all, it only became available to free users last month, and many may not yet know how to use it. Either way, it suggests we're in an awkward adolescent phase of voice AI adoption—the technology is capable, but our habits and expectations haven't quite caught up.
The voice AI revolution isn't just about making machines sound more human—it's about fundamentally changing how we think about conversation, relationships, and human interaction. While we'll certainly see a proliferation of phone AI agents and computer assistants in the short term, there's a more profound transformation taking shape beneath the surface.
As we build these systems, we need to consider not just what's technically possible, but what's socially desirable. For now, it's clear that we're entering an era where the line between human and AI conversation is increasingly blurry—for better or worse.
To end on a lighter note, here’s a fun video of ChatGPT’s Voice Mode reimagining an alternate ending to Titanic.
¹ Sesame just released an open-sourced (Apache 2.0) version of its impressive voice assistant model.
² Word around San Francisco is that top AI labs are on the cusp of a breakthrough that could solve these challenges.