I did some research on text-to-speech solutions for a project we are doing. I was looking for a way to produce speech samples from short pieces of text and cache them. The clips cannot all be generated in one go (they need to be updated sometimes), but they do not need to be fully dynamic either (most of them will already be cached when they are served).
The first solution I considered was the text-to-speech system built into Mac OS. It has an AppleScript and a command-line interface, so it would be possible to pull all the text that we want to convert, generate many samples, and move them from a local machine to where they need to be.
For example, at the command line, this is the synopsis of the “say” command:
say [-v voice] [-o out.aiff | -n name:port ] [-f file | string ...]
So you can try:
say fitter happier
and this will save it to a file:
say -o fredsays.aiff we are now in the year two thousand
I was discussing this with Tommi, and the conversation made me realise that the Mac voice is a bit iconic. It is part of our audio culture: you hear it and you know it’s Fred from the Mac voices. A short Fred playlist: “Fitter Happier” on Radiohead’s “OK Computer”, “The Analyst” on Arpanet’s “Wireless Internet”. BTW, listen to the newest voices that they introduced with Leopard… Fast Darwin transform in the uncanny valley.
So the Mac text-to-speech system is already quite a good solution for what we want to do. The main problem will be keeping the content and the audio in sync. For now we don’t have access to a Mac OS X server, which means we would need to update the sounds of the system we’re building from a local computer. Ideally this would be done in batches, with something like a “generate and upload all new sounds” script that we would run each time there is a significant content update.
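A rough sketch of what the “generate” half of such a batch script could look like, assuming Python is available on the local Mac. The function names, the hash-based file naming and the default voice are my own choices for illustration, and the upload step is left out because it depends on where the files need to go. The only real command used is “say”, with the -v and -o options from the synopsis above.

```python
import hashlib
import subprocess
from pathlib import Path


def clip_name(text, voice="Fred"):
    """Derive a stable file name from the voice and the text (naming is an assumption)."""
    digest = hashlib.sha1(f"{voice}:{text}".encode("utf-8")).hexdigest()
    return f"{digest[:16]}.aiff"


def generate_new_clips(texts, out_dir, voice="Fred", run=subprocess.run):
    """Call `say` for every text that has no clip yet and return the new paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for text in texts:
        target = out_dir / clip_name(text, voice)
        if target.exists():
            continue  # already generated in an earlier batch
        # Same form as the synopsis: say [-v voice] [-o out.aiff] string
        run(["say", "-v", voice, "-o", str(target), text], check=True)
        created.append(target)
    return created
```

Because the file name is derived from the text itself, an edited text gets a new clip on the next run while unchanged texts are skipped. The returned list is what an upload step would need to send.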
To get around this, we would need to integrate the text-to-speech service into the content storage and administration system. There would then be callback functions on some fields: when those fields are created or modified, we would ping a “speech server” and get the content back as audio. There are two parts to this project: getting the speech server to run, and writing the API calls that fetch the samples to cache when some data is created or modified.
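To make the callback idea a bit more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the SpeechCache class, the on_field_saved hook and the synthesize function (which stands in for whatever request the speech server ends up accepting) are names I made up, and the cache is an in-memory dictionary where a real system would store files.

```python
import hashlib


class SpeechCache:
    """Keep one audio clip per distinct text, fetched on demand."""

    def __init__(self, synthesize):
        # `synthesize` is a hypothetical callable: text -> audio bytes.
        # In practice it would wrap the request to the speech server.
        self._synthesize = synthesize
        self._clips = {}

    @staticmethod
    def key(text):
        """Cache key derived from the text, so edited text gets a new clip."""
        return hashlib.sha1(text.encode("utf-8")).hexdigest()

    def on_field_saved(self, text):
        """Callback for a created or modified field; returns the cache key."""
        k = self.key(text)
        if k not in self._clips:
            # Only ping the speech server for text we have not seen before.
            self._clips[k] = self._synthesize(text)
        return k

    def get(self, text):
        """Return the cached clip for this text, or None if not generated."""
        return self._clips.get(self.key(text))
```

The point of passing synthesize in from outside is that the caching logic stays the same whichever speech server we end up with.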
The ideal solution here would be a web service that runs the server and provides an API. Text-to-speech is a well-explored domain, and there are several tools in the main programming languages (Java, C++) for building a speech server on a Unix machine without too much difficulty. There are plenty of companies that sell this kind of service, and a few free web services (vozMe and SpokenText, to name two), but the free ones do not seem to have documented, stable APIs (I found a pirate one), so we cannot go this way.
To set up our own server, I found that the open source Festival seems to be the tool of choice for a lot of people. It is available through several Linux package managers and it has a PHP client called pvox (from 2008), which would cover the API side. So it seems a good choice.
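I have not tried pvox yet, so this sketch does not use it. It only shows, in Python, how the server side could turn a piece of text into a WAV file with the text2wave script that ships with Festival, assuming Festival is installed and text2wave is on the path. The function names and the temporary input file are my own choices.

```python
import subprocess
import tempfile
from pathlib import Path


def festival_command(text_file, wav_file):
    """Build the text2wave call: text2wave <textfile> -o <wavfile>."""
    return ["text2wave", str(text_file), "-o", str(wav_file)]


def synthesize_to_wav(text, wav_file, run=subprocess.run):
    """Write the text to a temporary file and let Festival render it."""
    with tempfile.TemporaryDirectory() as tmp:
        text_file = Path(tmp) / "input.txt"
        text_file.write_text(text, encoding="utf-8")
        run(festival_command(text_file, wav_file), check=True)
    return Path(wav_file)
```

Note that the output is a WAV file, where “say” produces AIFF, so the cached clips may need converting if both routes end up in use.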
To conclude: I’m looking into installing Festival with pvox on one of our servers, and the Mac solution is our backup strategy.