The voice of the machine

Low-cost AI voices aren't anywhere near good enough yet

Sep 16, 2024

Over the last few months, I’ve found myself thinking more and more in terms of the spoken word rather than the written word. As I’ve mentioned a few times already, I want to turn my short stories into readings, and I want to start writing audio dramas. The problem, of course, is finding voice actors. With the written word, what I put on the screen is pretty much the final product, other than the cover. With audio, there’s a whole additional stage to the production. (Or several, if you include music, foley, sound effects, and mixing.)

I can read most of my stories myself, and I’m very much looking forward to doing it over the winter. But there are some that just don’t work well with my voice. I don’t sound like an ancient New Englander mountain man, for example - let alone a woman. And when it comes to audio drama, I’m not even going to attempt to do all the parts myself.

But hiring voice actors is out of my budget right now, so I figured I’d take a look at whether AI could provide a stopgap solution. At least when it comes to audio dramas, I figured AI might be sufficient to create a working copy, much like orchestral composers create a synthesized version of their work to try it out before going anywhere near a real orchestra. That would be a great way to learn my craft, just as machinima was a great way to learn a lot of film-making techniques without having to worry about actors, sets, or equipment.

The good news is that there are plenty of free and low-cost voice generators to choose from, offering a wide range of voices with different tones, accents and speaking styles. Young, old, male, female, Chinese, Arabic, Texan, tired, peppy, authoritative… whatever you want, it’s probably out there somewhere. And if you can’t find what you’re looking for, you can easily create it from just a very small sample of someone’s voice. (Let’s not get into the ethics of that, though.)

The bad news is that they all sound like robots. They’re not as horrible as the ancient text-to-speech programs that you find reading public domain books, or the inbuilt text-to-speech readers that were designed to help people with vision impairments. In fact, for short clips, they’re often not bad. They’re perfectly adequate for running chatbots, generating voice output for an LLM, or doing a voice-over for a short corporate video.

But for reading fiction or voice acting, they’re awful. They can’t sustain a pleasing cadence for an entire paragraph, let alone an entire story: they just keep going in the same basic rhythm without pausing to draw breath. They don’t know how to stress a sentence for dramatic effect: they simply apply some basic grammatical rules to determine where the stress should be. They don’t know how to adjust their tone to convey emotion: they can’t switch from melancholy to resignation to hope to relief in response to what’s happening in the story. They’re flat, they’re boring, and they don’t engage the listener.

Admittedly, some of them do let you add mark-up and give the AI detailed instructions on how to perform, but it’s ridiculously time-consuming and often completely ineffective. As a director, I can just tell a human actor what I want, and they’ll generally understand. An AI needs to be given precise, detailed, specific instructions: it’ll do what you tell it, not what you want. So as a fast, easy workaround, it’s just not a practical solution.
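To give a sense of what that kind of mark-up involves, here is a minimal sketch using SSML, a W3C standard for speech synthesis mark-up. This is an assumption for illustration: the generators discussed here may use a different mark-up entirely, and support for individual SSML tags varies between services. The helper names `line` and `speak` and the sample sentences are hypothetical, and the `<speak>` wrapper is trimmed down (the full standard also expects version and namespace attributes).

```python
from xml.sax.saxutils import escape


def line(text, rate="medium", pitch="medium", pause_ms=0):
    """Wrap one line of narration in SSML prosody mark-up.

    rate and pitch are passed straight through as SSML attribute values;
    pause_ms, if non-zero, adds a trailing <break> of that many milliseconds.
    """
    ssml = f'<prosody rate="{rate}" pitch="{pitch}">{escape(text)}</prosody>'
    if pause_ms:
        ssml += f'<break time="{pause_ms}ms"/>'
    return ssml


def speak(*parts):
    """Join marked-up lines into a single (simplified) SSML document."""
    return "<speak>" + "".join(parts) + "</speak>"


# Two sentences, each with its delivery spelled out by hand (hypothetical text).
passage = speak(
    line("She read the letter twice.", rate="slow", pitch="low", pause_ms=800),
    line("Then, at last, she laughed.", rate="medium", pitch="high"),
)
print(passage)
```

Even for two sentences, the rate, pitch and pause each have to be chosen by hand, and a full story would need that for every line.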

Obviously, the latest (and more expensive) AIs can do a significantly better job, but if I could afford those, I’d hire actual voice actors instead. For now, however, I’ll just have to focus on recording the stories that will work with my voice, and begging for favors from my old friends in the machinima community.

Thank you for reading. This post is public so feel free to share it - especially if you know any voice actors who work for ExposureBucks!

Lindsay

Please, sir, I am begging you to check out some of the resources for audio drama and fiction podcast creators, like The Fiction Podcast Weekly newsletter (full disclosure: I edit and send this newsletter out for The Podcast Host), the Audio Drama Hub on Facebook, the Audio Drama Gazette on Substack, and the vast array of Discord servers for voice actors. Casting Call Club is another option. I hope more voice actors will post resources for you. I know if you post on the Audio Drama Hub on Facebook, "Hey, I'm a small-budget audio drama podcast creator with a big dream, how can I find living human voice actors?" the world will provide more information and resources than you can shake a stick at.

Just as human writers are unique and irreplaceable, so are human voices.

Dave Morris

Tim Child (the producer of the Knightmare TV series) was working on an "emotional markup language" for automated text-to-speech about twenty years ago. The logical extension of Pinter's pauses, perhaps!

Matt Kelland, Writer
