If I stop performing this sanitized version of myself, will the people on the other side of this screen still find me useful?
It is a question few people in the modern workforce ask out loud, primarily because we are taught that “professionalism” is a synonym for “neutrality.” We are told that to be understood is to be efficient, and to be efficient, we must shave off the jagged edges of our origins.
But in my work as a grief counselor, I’ve found that the things people mourn most aren’t just the people they’ve lost, but the versions of themselves they felt forced to perform for the sake of a smooth transaction.
The Latency of Performance
Milliseconds of latency define the border between a conversation and a performance. Yusuf sits in a glass-walled conference room in Berlin, the afternoon sun hitting the table at a sharp angle that makes him squint.
He is presenting a proposal for a decentralized logistics network, a project he has lived with and shaped himself. He speaks into a sleek cardioid microphone that promises “studio-quality” clarity. But as he speaks, he watches the bottom of the screen, where a live transcription tool is attempting to turn his voice into text.
The “Performance Gap” (Natural: 200ms; Yusuf’s Screen: 700ms; Human Ear: 150ms): when processing time exceeds human conversational rhythm, identity is sacrificed for clarity.
Yusuf is from Istanbul. His English is technically perfect, but it carries the rhythmic swell of the Marmara Sea. When he says the word “logistics,” the software stammers. It prints “low-istics.” Then “lost sticks.” Yusuf pauses.
He feels the collective heat of six people waiting for the sentence to resolve. He tries again. This time, he flattens the “o.” He clips the “s.” He speaks slower, his voice becoming a series of disconnected percussive strikes, like he’s trying to communicate with a stubborn smoke detector.
The transcription finally gets it right, but the music has left his voice. The room nods at a sentence he never actually said, at least not in any way that felt like him.
The Anatomy of a Disconnect
The physical traversal of a human voice is an intricate, messy journey. It begins in the diaphragm, moves through the glottis, is shaped by the tongue against the palate, and finally vibrates the air in a room.
Origin: the diaphragm and the rhythmic swell of the Marmara.
Digitization: fiber optic cables buried under the Atlantic.
The Filter: a server farm in Northern Virginia comparing Yusuf to a “standard.”
From there, it hits the diaphragm of a microphone, is converted into a stream of binary data, travels through fiber optic cables buried under the Atlantic, and eventually lands in a server farm in Northern Virginia. At that data center, an algorithm compares Yusuf’s vocal patterns against a “standard” model.
Most speech recognition software was trained on a specific, narrow dataset, often the voices of people who live within a three-hour drive of the engineers who built it. When the software fails to understand Yusuf, its marketing copy frames the failure as a “challenge of accents.”
That framing implies that the obstacle belongs to Yusuf, and that the software is the generous tool trying to help him overcome it. This is a subtle, pervasive gaslighting. It shifts the burden of comprehension entirely onto the speaker. The message is that if the machine doesn’t understand you, you are the one who is broken.
Tissues and Filing Cabinets
I often walk into a room, usually my own office, to grab a box of tissues for a client who is mid-sob, and I find myself standing by the filing cabinet wondering what I came in for. The weight of someone else’s grief has a way of displacing your own immediate intentions.
I see a similar displacement in people who have spent years talking to machines. They start to forget the “tissue” (the original point of their communication) because they are so focused on the “filing cabinet” (the rigid structure of the tool they are trying to satisfy). They become linguistic contortionists, bending their natural cadence into shapes that the silicon ear finds acceptable.
Hidden Labor: mental energy spent “neutralizing” one’s origins to avoid looking foolish.
Lost Innovation: focus shifts from decentralized networks to vowel pronunciation.
This is the “phonetic tax.” It is the mental energy expended by millions of people every day to “neutralize” themselves so an algorithm doesn’t make them look foolish in a meeting. It is an invisible labor that drains creativity. When Yusuf is focused on how he is pronouncing “logistics,” he isn’t focused on the brilliance of his decentralized network. He is occupied by the fear of a glitch.
We are told that technology is a bridge, but many of these tools act more like a filter. They don’t just carry the message; they demand a tribute of identity before they let the message through. A system that blames your speech never has to improve its listening.
It creates a feedback loop where the only people who are heard are the people who already sound like the engineers. Everyone else is left to over-enunciate until they sound like a series of clicks and whirs, a human parody of a computer.
An accent is a map of a person’s history, a record of the languages their ancestors spoke, the geography of their childhood, and the migrations of their family. To “reduce” an accent is to ask someone to erase their map so they don’t get lost in your narrow hallway. It is an act of profound disrespect that we have rebranded as “clarity.”
If we are to build a world where technology actually facilitates human connection, we have to stop treating the human voice as a defective input. We need tools that are built with the understanding that variation is the natural state of humanity. We need systems that don’t ask Yusuf to sound like a news anchor from Nebraska just to get his point across.
A Radical Act of Listening
This is why the approach of
is so quietly radical.
Instead of forcing the speaker to adapt to a rigid, predetermined “standard,” it is designed to meet the speaker where they are. It treats the myriad textures of global languages and accents not as “problems” to be flattened, but as the fundamental reality of conversation.
By using AI to bridge the gap without demanding that speakers sacrifice their natural cadence, it allows people to remain themselves. It works inside the tools we already use (Zoom, Teams, Meet) without the intrusive presence of a “bot” that reminds everyone a translation is happening. It stays invisible so the humanity of the speaker can stay visible.
The Relief of Being Heard
In my grief work, I’ve noticed that when a person finally feels heard, truly heard, without having to translate their pain into a “socially acceptable” format, their entire body changes.
The drop of a shoulder when the mask falls away.
Their shoulders drop. Their breathing slows. The tension in their jaw, which they didn’t even know they were holding, evaporates.
We deserve that same relief in our professional lives. We shouldn’t have to carry the tension of being our own “accent reduction” coaches. The burden of understanding should be on the listener, and if the listener is a machine, that machine had better be a very good one.
The silence that followed Yusuf’s mangled transcription felt like an eternity. But imagine a different scenario. Imagine Yusuf speaks, his voice full of the history of Istanbul, and the technology simply… works.
It captures the nuance. It translates the meaning without stripping the soul. It allows him to focus on his logistics network rather than his vowels.
When we stop treating accents as obstacles, we open the door to a level of collaboration that isn’t possible when everyone is wearing a linguistic mask. We stop wasting energy on the performance of neutrality and start investing it in the substance of our ideas.
The silicon ear doesn’t have to be a judge; it can be a witness. But it can only be a witness if it’s designed to recognize that the music of a voice is just as important as the words themselves. If we keep building tools that demand we sound like machines, we shouldn’t be surprised when our conversations start to feel mechanical.
We have to choose tools that honor the texture of our lives, tools that let us walk into the room and remember exactly why we are there.
When the tool demands the flat tone we use on a stubborn smoke detector, the soul is left speaking only in the music the transcription was designed to ignore.
The Second Wave
We are currently living through a transition where “good enough” technology is being replaced by “human-centric” technology. The first wave of AI was about the machine being able to do the task at all. The second wave, the one we are entering now, is about the machine doing the task while respecting the human at the center of it.
I think about Yusuf often, and the thousands of Yusufs in glass-walled offices all over the world, staring at screens, wondering if they should repeat themselves or just give up. I think about the loss of all those unsaid things, the ideas that were swallowed because the “phonetic tax” was too high that day.
We can’t afford that loss anymore. We need to stop fixing people and start fixing the tools. We need to realize that the accent was never the problem; the problem was our refusal to listen to the person behind it.
