The Record of Dead AI Products
Essay · May 15, 2026 · 11 min read

The voice in the machine keeps dying

A decade of voice AI failures suggests the category has structural problems money cannot fix.

On March 23, 2016, Microsoft launched a Twitter chatbot named Tay. Sixteen hours later it was offline, having been coaxed by users into posting Holocaust denial, racial slurs, and an endorsement of genocide. Microsoft issued an apology, took the bot down, and never put it back up. Tay is usually filed under "AI safety" or "content moderation," but it was also a voice product in the broader sense: a conversational agent meant to speak in the cadence of a teenage girl. It died in less than a day because the interface was open, the training was live, and the adversaries were faster than the guardrails. Almost ten years later, the voice AI category has not solved any of those problems. It has mostly just stopped trying.

The pattern is now long enough to read. Microsoft Tay lasted sixteen hours. Microsoft Zo, the chastened follow-up launched in December 2016, lasted two and a half years before quiet retirement in July 2019. Microsoft Cortana, the company's flagship voice assistant introduced in April 2014, was retired on the consumer side in 2023 and folded into Copilot after nine years and an unknowable amount of internal investment. Sonantic, a London voice-synthesis startup that cloned Val Kilmer's voice for Top Gun: Maverick, was acqui-hired by Spotify in June 2022 after raising $6.88M, and absorbed into the AI DJ feature. Lyrebird, the Montreal voice-cloning lab whose synthesized Obama and Trump demos went viral in 2017, was absorbed into Descript by September 2019. Mycroft AI, the open-source voice assistant, shut down in February 2023 after a patent suit drained the cash.

Six entries, six different failure modes, one consistent outcome. None of them produced a standalone voice business that grew. The acqui-hires are not exceptions; they are the polite version of the same story. Spotify did not buy Sonantic because Sonantic had a viable product. Descript did not buy Lyrebird because Lyrebird had paying customers. In both cases the acquirer wanted the team and the model weights, and the original company stopped existing. That is the dominant outcome for voice startups: dissolution into something larger that uses the technology as a feature.

The Big Tech voice assistants are the control group, and the control group is not doing well. Apple Siri shipped on the iPhone 4S in October 2011 and has spent most of the last decade as an industry joke about unfulfilled promise. Amazon Alexa, the most-deployed voice assistant in history, has reportedly lost Amazon more than $10B in a single year, and the Worldwide Digital division that houses it was the subject of a major 2022 layoff round. Google Assistant is in the slow process of being absorbed into Gemini. Samsung Bixby exists. Cortana is the only one to receive a formal funeral, and it received the funeral because Microsoft had Copilot to replace it. The other four are alive in the sense that a long-running TV show with declining ratings is alive.

So what is the structural problem? Start with the hardware. Voice is a hardware interface problem first and an AI problem second. Every voice product needs a microphone good enough to hear the user, a network connection fast enough to reach a model, a speaker clear enough to be understood, a wake word reliable enough not to trigger constantly, and an ambient acoustic environment that does not destroy any of the above. Mycroft AI is the cautionary tale here. The Mark II smart speaker was supposed to ship at $99. Component costs and supply chain pressure pushed the real cost above $300. The company never recovered from the gap between the promised price and the buildable price. Text products do not have this problem. A web browser is free.

Then there is the buyer. Voice consumers buy on convenience, not on capability. The killer use case for Alexa, after a decade of investment, turned out to be setting kitchen timers, playing music, and turning off the bedroom light. Amazon's internal data, leaked repeatedly to the press, has consistently shown that the high-margin commerce use case never materialized. People do not buy things by talking to a cylinder. They check their phone. Improvements in the underlying AI past the "good enough for a timer" threshold do not unlock new behavior, because the behavior that exists is not capability-limited. It is habit-limited.

Privacy operates at a different threshold for voice than for text. Typing into a chat window is a deliberate act. The user is present, the user is composing, the user has selected a moment to share information. An always-on microphone in the kitchen is not a deliberate act; it is a standing condition. Every voice assistant has had at least one news cycle about recordings being reviewed by human contractors, accidental activations, or data sent to advertisers. The Echo and Google Home recording-review stories of 2019 hit all three major assistants in the same year. None of the companies fully recovered the trust they had before. Text products carry the same privacy issues in theory; in practice the bar is visibly lower.

Voice cloning sits in an even tighter box. The technology works. Sonantic proved it could synthesize a Hollywood-grade performance from a degraded source. Lyrebird proved it could clone a politician with a few minutes of audio. ElevenLabs proved it could be done at consumer scale. The commercial deployment of that technology is narrow on purpose, because the product application most easily monetized is also the application most easily abused. Fraud calls, deepfake political ads, romance scams, and impersonation of executives for wire transfer requests are already standard categories of crime. A voice-cloning company has to choose between aggressive verification, which kills consumer growth, and permissive access, which produces a 60 Minutes segment. There is no middle path that has produced a venture-scale outcome.

The Big Tech assistants have lost tens of billions of dollars between them, and not one has produced a clean win. Cortana's retirement is the canonical example because it is the most legible. Microsoft launched Cortana with high ambition, ran it for nine years, made it the default voice on Windows, integrated it with Xbox and Office, and then shut down the consumer product without a successor. The enterprise integrations were quietly renamed. The brand was retired. Copilot, the replacement, is a text product with optional voice. The order of those two words is the entire story.

The most striking data point is what happened after November 2022. The release of ChatGPT triggered the largest consumer adoption wave in software history. It also did not save voice. Text-based ChatGPT exploded to 100M users in two months. Voice-first products from the same period did not see a comparable lift. Humane's AI Pin shipped in late 2023 and was reviewed into oblivion within months. Rabbit R1 shipped in 2024 with a similar arc. ChatGPT eventually added a voice mode, and the voice mode is a feature of a text product, not the other way around. The 2022 LLM wave lifted the boats it lifted, and voice was not one of them.

There is a tempting counter-argument that the next generation of models will fix this. Real-time multimodal voice from OpenAI, Google, and a handful of startups in 2024 and 2025 has narrowed the latency gap and improved naturalness considerably. That is true, and it has not changed the structural points. The hardware stack is still expensive. The privacy bar is still higher. The killer use case is still a kitchen timer. The fraud surface is still wide open. Better models do not buy a microphone, do not lower the comfort threshold for an always-on device, and do not change consumer buying behavior on home commerce. The improvements are real; they are just not load-bearing.

Tay's lesson, in retrospect, was not "be careful with chatbots." It was that voice and conversational interfaces have an adversarial surface that text-with-deliberate-input does not. A user typing a prompt into ChatGPT is choosing to engage. A bot listening to a public Twitter feed, or a smart speaker listening to a room, is engaging without choosing. The asymmetry between defender and attacker is permanent. Microsoft learned this in sixteen hours and has spent ten years confirming it.

Zo, the follow-up to Tay, illustrates the other side. Microsoft made Zo so heavily filtered that the bot refused to discuss politics, religion, or anything topical. It survived two and a half years and was forgotten the entire time. The lesson there is that overcorrection produces a product that is technically alive and commercially dead. The category has spent a decade oscillating between Tay's openness and Zo's caution, and neither end of that spectrum has produced a hit.

The category will continue to exist. Apple will keep shipping Siri. Amazon will keep selling Echo devices. Google will keep refactoring Assistant into Gemini. Voice cloning will keep appearing in dubbing, audiobooks, accessibility tools, and fraud. The technology is real and the use cases are real. What appears unlikely, after ten years of evidence, is a standalone voice AI company that becomes large on its own terms. The graveyard so far contains a Microsoft chatbot, a Microsoft chatbot's sequel, a Microsoft assistant, a Spotify acquisition, a Descript acquisition, and an open-source project killed by patent law. The next entries will probably look similar. Voice has not failed because the models are bad. It has failed because the rest of the stack (the hardware, the buyer, the privacy bar, the abuse surface, and the habit) has not bent the way the category needed it to.
