TEXT TO SPEECH - The Audio Penguin

Lately I have received several promotions for software designed to take a text file and convert it into natural speech. The idea is one can produce narration or other oral information without hiring a professional narrator.

Although I will get to certain issues related to text to speech, it will be useful to begin with the development of digital music samplers. When analog synthesizers such as the Moog, Buchla, and EMS first came on the scene, they were often used to create novel or weird sounds that expanded the sonic “vocabulary” of music. But most of them were not all that good at accurately re-creating the sound of natural acoustic instruments. Nobody who bought the Wendy Carlos album “Switched-On Bach” ever thought it was made with acoustic instruments, and that was the idea: electronics were “switched on” and made Bach more relevant to a young generation.

Enter the sampling synthesizer (or just “sampler”). The earliest ones were very expensive. Around 1987 I first saw a demo of a Kurzweil sampler that cost nearly $20,000. Another early digital sampler was the Synclavier, but those were even more expensive, typically in the $200,000 – $300,000 range. Both of those instruments, and some others, could sound like real acoustic instruments, and many were used in TV commercials, where a nice instrumental background was wanted but the cost of hiring musicians for a 30-second spot was perceived as too high. At the time, most people perceived the sounds of the Kurzweil as “natural.”

I was told by someone who worked for Kurzweil that the live musicians for a major Broadway musical were threatening to go on strike for higher pay. Someone programmed two Kurzweil samplers with the score for the musical. When it was demonstrated to management and the union, and they heard how realistic the instrument sounds were, the union settled right away. They knew they could be replaced by two samplers that could play many very realistic instrument sounds simultaneously, without the audience realizing that there was no live orchestra in the pit.

The reason digital samplers sound so realistic is that they hold in memory sounds recorded from real instruments. Each sound is broken down into components (attack, sustain, and decay) and stored in the sampler. When one presses a key on the keyboard, the attack begins; as one holds down the key, the sustain continues; and when the key is released, the natural decay of the original acoustic instrument is heard. Some instruments, such as piano and harpsichord, do not really have a sustain. They start decaying right after the attack, though sudden damping of a string stops the sound with a different kind of quick decay. The sampler handles all this and, for instruments of varying loudness, the loudness of a given sound is determined by the strength of the keypress (often called “velocity”). I am over-simplifying the process here, but my main point is to note the realism of the sound. There were early analog attempts, such as tape-loop-based instruments or oscillator-based electronic organs, but they didn’t have the realism of digital sampling.
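
For readers who like to see the idea in code, here is a rough sketch of the attack, sustain, and decay behavior described above. It is only an illustration, not how Kurzweil or any real sampler does it. The function name `envelope_gain`, the timing values, the exponential fade for piano-like sounds, and the MIDI-style velocity range of 0 to 127 are all my own assumptions.

```python
import math

def envelope_gain(t, key_held_for, velocity,
                  attack=0.01, release=0.3, sustains=True, decay_rate=1.5):
    """Loudness multiplier (0.0 to 1.0) at t seconds after the key press.

    All parameter names and defaults are hypothetical, chosen for illustration.
    velocity is assumed to be MIDI-style, from 0 (softest) to 127 (hardest).
    """
    if t < 0:
        return 0.0
    peak = velocity / 127.0  # a harder keypress gives a louder note

    def level_while_held(time):
        if time < attack:
            # Attack: ramp up from silence to the peak level.
            return peak * time / attack
        if sustains:
            # Organ-like sounds hold steady while the key is down.
            return peak
        # Piano-like sounds start fading right after the attack.
        return peak * math.exp(-decay_rate * (time - attack))

    if t <= key_held_for:
        return level_while_held(t)

    # Key released: fade linearly to silence from wherever the note was.
    elapsed = t - key_held_for
    if elapsed >= release:
        return 0.0
    return level_while_held(key_held_for) * (1.0 - elapsed / release)
```

A real sampler would multiply the recorded sound by a value like this (and do a great deal more), but the sketch shows why an organ patch holds steady under a held key, while a piano patch called with `sustains=False` is already fading.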

The first time the original Kurzweil (the $20,000 one) was demonstrated to me, I was amazed. The presenter could bring up a very natural piano sound. It turns out that Kurzweil sampled a 9’ concert Steinway for the piano sound and even hired the strings of the Boston Symphony to play each note of the scale to be used by the Kurzweil for their orchestral string “patch.” I wanted a Kurzweil but could not initially afford one. (They later brought out some smaller, cheaper models, which I did buy and still own.)

As realistic as these sampled instrumental sounds were, they were not perfect. Many instruments allow for subtle expressions, more than can be stored on a digital sampler. A violin, for example, can start soft and while holding a note gradually make it louder. It can also produce harmonics and other expressive sounds.

As I heard commercials on TV that used the Kurzweil and the Synclavier, I started to recognize that these were sampled, and I could amaze my friends by saying things like “the instrumental background for this ad is being performed on a Kurzweil synth.”

Some instruments do sample well because they have a limited range of variability within a note. These include pipe organs, harpsichords, and pianos. But a saxophone, for example, has a large range of sounds it can produce. A later technology, physical modeling, was developed to handle such instruments better. The first physical modeling synth I saw was optimized for saxophone and had a keyboard for selecting the notes, plus a breath controller to modify the sounds as on a real sax.

Hardware and software technology have developed to extremes I would not have thought possible 30 years ago (except on large “supercomputers” that cost millions of dollars).

Now, as I mentioned at the start, we are seeing software one can run on a personal computer (Mac or Windows) that can “read” out loud from a text file and sound “natural.” But just as early sampling synthesizers and physical modeling synths could not exactly reproduce the subtlety of many acoustic instruments, I find, from the demos I have seen, that narration software captures only a small part of what the human voice can do in the hands of a professional narrator.

Narration software promotes itself as sounding so natural one need not hire a professional narrator, just as early music samplers promoted themselves as being able to create realistic acoustic instruments so one would not have to hire professional musicians. In the demos I have seen, short text paragraphs are read and, especially at first, seem to sound natural like a human narrator.

But at this stage of development, at least, the vocal range is quite limited. If one just wants a sonic introduction to one’s website, or verbal instructions on how to assemble an Ikea bookcase to accompany a video showing the process, then this could work well. But how about a six-hour audiobook with several characters and moods? I have not seen any examples that I think would fool the listener into thinking he/she was listening to a professional narrator. For a better test, why doesn’t one of these companies program the text of a scene from Shakespeare and play it for an audience to see if they believe it was really recorded by a professional acting company? I doubt it would fool anyone. By contrast, even a recording of an early Kurzweil music synth would fool many people into believing they were hearing an orchestra or chamber group.

In my opinion (at this stage of software development) the claims that “you can get professional sounding narration without having to hire a narrator” seem exaggerated for all but the most mundane content. Text-to-narration software seems, at the current stage, to be less developed than early sampling music synths. If anyone has used this software for serious oral interpretation (poetry, drama, audiobooks, etc.), please leave a comment about your experience, good or bad, with whatever software you used.

All for now.
Eric
