March 6th, 2018

One of the most powerful editing tools I use is the built-in text-to-speech facility of macOS. It’s as easy as highlighting some text and hitting ⌃-⌘-⌥-S. Of course, reading with my eyes is always the first pass, but there’s something wonderful and insidious about the human brain’s capacity to “fix” things for us. I could blitz right past the word “is” and not notice that it’s supposed to be “it.” That’s where listening to the words really helps. The computer gives me the unvarnished truth. Misplaced words jump right out at me - and this got me thinking. Now, being a total nerd and coder-by-day, I thought, hey, why not automate this and convert the whole book to mp3 files that I can listen to on my morning walk? I’d like to take a moment here and suggest in the strongest terms that this is in no way audio-book quality. But for editing purposes, it has become an essential tool.

Version 1: 

I did some Googling and found that there’s a command-line version of the same text-to-speech I had been using. It’s called ‘say’ and it can convert text files to .aiff files, with options for things like different voices and speeds. It also doesn’t work in ‘real-time’ - so you don’t have to wait five minutes to get five minutes of audio. Nice!

So, what the heck is a .aiff file? It’s Apple’s uncompressed audio format, and while the documentation says you can specify others, they didn’t work for me. So, one of the tasks was to find a utility that would convert these to mp3. I found one called ‘ffmpeg’ and it did the job. By the way, I use a package manager called Homebrew (the ‘brew’ command) to install these types of things.

If you were paying attention earlier, you may have noticed that I said ‘say’ could convert text files. It cannot, however, convert .rtf files, which is what Scrivener uses. So, the next task was to find a convenient way to extract the plain ASCII text from the .rtf files. I tried several things and realized that I actually needed some of the formatting, like italics. It just sounds wrong without the emphasis. So instead of the .rtf files, I used Scrivener’s ‘Export’ function and created a set of .html files complete with <i> tags, which I could convert to something ‘say’ would recognize as emphasis. These .html files would then need to be converted to .txt files. No install needed this time: macOS already ships with a utility called ‘textutil’ that does the job. Here’s the process:

Use Scrivener to export .html files

Replace <i> </i> italics with [[emph +]] in the .html files.

Use textutil to convert the .html files to .txt

Use ‘say’ to convert to .aiff files

Use ‘ffmpeg’ to convert to .mp3 files

Wow, what a slog! And I did each file by hand that first time. It took forever!
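
For anyone who wants to skip the by-hand stage, the steps above can be sketched as a small script. This is only a rough sketch in Python, not the script I actually used. The function names, the ‘Alex’ voice and the speaking rate of 180 are placeholders I picked for illustration. It also assumes that [[emph +]] stresses only the word that follows it, which is why the closing </i> tag is simply dropped.

```python
from __future__ import annotations

import re
import subprocess
from pathlib import Path


def mark_emphasis(html: str) -> str:
    """Swap <i> tags for the [[emph +]] embedded speech command."""
    # Assumption: [[emph +]] affects only the next word, so </i> is dropped.
    html = re.sub(r"<i>", "[[emph +]] ", html, flags=re.IGNORECASE)
    return re.sub(r"</i>", "", html, flags=re.IGNORECASE)


def build_commands(html_path: Path, voice: str = "Alex", rate: int = 180) -> list[list[str]]:
    """Return the textutil, say and ffmpeg commands for one exported chapter."""
    txt = html_path.with_suffix(".txt")
    aiff = html_path.with_suffix(".aiff")
    mp3 = html_path.with_suffix(".mp3")
    return [
        # .html -> .txt
        ["textutil", "-convert", "txt", str(html_path), "-output", str(txt)],
        # .txt -> .aiff (voice and rate are placeholder choices)
        ["say", "-v", voice, "-r", str(rate), "-f", str(txt), "-o", str(aiff)],
        # .aiff -> .mp3 (-y overwrites an existing output file)
        ["ffmpeg", "-y", "-i", str(aiff), str(mp3)],
    ]


def convert_chapter(html_path: Path) -> None:
    """Mark up italics in place, then run the three conversion steps."""
    original = html_path.read_text(encoding="utf-8")
    html_path.write_text(mark_emphasis(original), encoding="utf-8")
    for command in build_commands(html_path):
        subprocess.run(command, check=True)  # stop at the first failing step
```

Calling convert_chapter(Path("export/chapter01.html")) on a Mac with ffmpeg installed should leave a chapter01.mp3 next to the exported file. Note that it rewrites the .html file in place, so run it on a copy of the export.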

Version 2:

This is really just the same process, but I automated it with a PHP script. It also does some nice things like adding the chapter number and title to the top of each file, so the audio introduces each chapter. This is super-convenient for navigating if you lose your place. In this version, I also added a web page to play these files for me. I needed to be able to skip back 10 seconds for ‘what-did-he-say?’ moments. This was a little tedious, but it worked for several weeks - until I upgraded my Mac to the next version of the operating system. The text-to-speech in macOS High Sierra is broken. The output to files just hangs. This is a known bug, but they seem in no rush to fix it for me. That got me searching for an alternative.
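
The chapter-introduction trick is easy to reproduce. Here is a minimal sketch in Python (my real script was PHP, so treat this as an illustration only). The function name and the exact wording of the heading are my own placeholders.

```python
def with_intro(number: int, title: str, body: str) -> str:
    """Put a spoken heading in front of the chapter text."""
    # Trailing periods give the voice a natural pause after the title.
    heading = f"Chapter {number}. {title.strip().rstrip('.')}."
    return f"{heading}\n\n{body.lstrip()}"
```

Feeding the result to ‘say’ instead of the bare chapter text means every audio file opens by announcing where you are in the book.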

Version 3:

I have written iOS apps before, but they always came out pretty awful. Apple introduced a new programming language called Swift, and I had been wanting an excuse to dive into it. Arrrrg! Swift is not an improvement on Objective-C. It’s a lateral move at best. It’s every bit as abstruse. The quirky, hard-to-read syntax is just quirky and hard-to-read in a different way. It doesn’t make the job any easier. Still, I gave it a try. The first iteration was a desktop app that basically did the same text-to-speech as ‘say’. All I had to do was drag and drop the .rtf files (yes, I said .rtf files!) onto the app, and it would read them out through the speakers on the fly. It even sounded exactly the same! I don’t know why I expected it to be different. When I tried to convert to audio files, it failed in exactly the same way as the command-line utility.

Version 4:

Okay, so if it will only go to the speakers, I just need a mobile app to do the same as the desktop version. I created an iOS version. Now you might expect that this would be simple, right? Just tell Xcode to compile a version for iOS that I can run on my iPhone, right? Nope. Start all over because everything is different. It even sounds different. The quality is terrible! I went back and forth to compare and started noticing subtleties in the desktop version, like the sound of the ‘speaker’ taking a breath. How crazy is that? But it really makes a big difference. The iOS version was so bad it was distracting and made it hard to concentrate on the story.

Version 5:

I started looking around for alternatives that would improve the quality and still let me use the iOS app. I found a service on Amazon called “Polly” that will convert text-to-audio, and it’s pretty good! I found another with excellent quality called Acapela. Neither of these is free. Acapela is too expensive for my needs, but Amazon Polly is about $4 for 1,000,000 characters which translates to about 25 hours of audio. Not too bad!
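
To put those numbers against a real manuscript, here is a quick back-of-the-envelope sketch in Python. It uses only the two figures quoted above ($4 and roughly 25 hours per 1,000,000 characters). The function name and the 500,000-character manuscript are made up for illustration, and actual pricing and speaking speed will vary.

```python
from __future__ import annotations

PRICE_PER_MILLION_CHARS = 4.00   # dollars, the figure quoted above
HOURS_PER_MILLION_CHARS = 25.0   # rough audio length, also quoted above


def polly_estimate(char_count: int) -> tuple[float, float]:
    """Return (cost in dollars, hours of audio) for a given character count."""
    millions = char_count / 1_000_000
    return millions * PRICE_PER_MILLION_CHARS, millions * HOURS_PER_MILLION_CHARS


cost, hours = polly_estimate(500_000)  # hypothetical manuscript size
print(f"${cost:.2f} for about {hours:.1f} hours of audio")
# prints "$2.00 for about 12.5 hours of audio"
```

That works out to about 16 cents per hour of audio.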


I did get the Amazon Polly version to work, but the lag between tapping the play button and getting sound back was too frustrating. So, in the end, I just went back to the PHP script in Version 2, using a laptop that I haven’t upgraded yet. I had hopes of making this a free app to help other authors, but the poor quality and the complexity would just frustrate most people. But there is hope! The CPU power available to mobile apps gets better with every generation of phones, and soon we will have some that can do the same quality text-to-speech as the desktop. After that, it will get even better. Adobe has been teasing us for two years with VoCo (still not out), and I do think we will get very high quality one way or another. I don’t believe that we will ever get a replacement for real voice actors doing what they do for the audio-books I love. A computer that can do what Nick Podehl or John Hodgman does won’t be a computer. It will be an A.I. So, be careful what you wish for!

Comments? Hit me up on Twitter @BrainsInChains