Flite was primarily developed to address one of the most common complaints about the Festival Speech Synthesis System: Festival is large and slow. Software bloat is common amongst most products, and that bloat has helped push machines to get faster and to have more memory and larger disks, yet Festival is still criticized for its size.
Although this complaint is sometimes unfair, it is valid. Much work was done to ensure Festival can be trimmed and run fast, but it still requires substantial resources per utterance to run. After some investigation to see if Festival itself could be trimmed down, it became clear that a core set of functions was sufficient for synthesis, and that a new implementation containing only those necessary aspects would be easier than trimming down Festival itself.
Given that a new implementation was being considered, a number of problems with Festival could also be addressed at the same time. Festival is not thread-safe, and although it runs under Windows, in server mode it relies on the Unix-centric view of fast forks with copy-on-write shared memory for servicing clients. This is a perfectly safe and practical solution for Unix systems, but under Windows, where threads are the more common feature used for servicing multiple events and forking is expensive, a program that is not thread-safe can't be used as efficiently.
Festival is written in C++, which was a good decision at the time and perfectly suitable for a large program. However, what was discovered over the years of development is that C++ is not a portable language. Different C++ compilers are quite different, and it takes a significant amount of work to ensure compatibility of the code base over multiple compilers. What makes this worse is that new versions of each compiler are incompatible with earlier ones, so changes are required. At first this looked as if we were producing bad quality code, but after 10 years it is clear that the compilers themselves are also still maturing. Thus it is clear that Festival and the Edinburgh Speech Tools will continue to require constant support as new versions of compilers are released.
A second problem with C++ is the size and efficiency of the code produced. Proponents of C++ may rightly argue that Festival and the Edinburgh Speech Tools aren't properly designed, but irrespective of whether that is true, the code is much larger and slower than it need be for what it does. Throughout the design there is a constant trade-off between elegance and efficiency, which unfortunately at times in Festival requires untidy solutions: copying data out of objects, processing it, and copying it back, because direct access (particularly in some signal processing routines) is just too inefficient.
Another major criticism of Festival is the use of Scheme as the interpreter language. Even though it is a simple-to-implement language that is adequate for Festival's needs and can be easily included in the distribution, people still hate it. Often these people do learn to use it and come to appreciate that run-time configurability is very desirable and that new voices may be added without recompilation. Scheme does have garbage collection, which makes leaky programs much harder to write, and as some of the intended audience for developing in Festival will not be hard-core programmers, a safe programming language seems very desirable.
After taking all of the above into consideration it was decided to develop Flite as a new system written in ANSI C. C is much more portable than C++, as well as offering much lower level control of the size of the objects and data structures it uses.
Flite is not intended as a research and development platform for speech synthesis; Festival is and will continue to be the best platform for that. Flite, however, is designed as a run-time engine for when an application needs to be delivered. It specifically addresses two communities. The first is those who need an engine for small devices such as PDAs and telephones, where memory and CPU power are limited and which in some cases do not even have a conventional operating system.
The second community is those running synthesis servers for many clients. Here, although large fixed databases are acceptable, the amount of memory required per utterance and the speed with which utterances can be synthesized are crucial.
However, in spite of the decision to build a new synthesis engine, we see this as being tightly coupled to the existing free software synthesis tools, Festival and the FestVox voice building suite. Flite offers a companion run-time engine. Our intended mode of development is to build new voices in FestVox and debug and tune them in Festival. Then for deployment the FestVox format voice may be (semi-)automatically compiled into a form that can be used by Flite.
In case some people feel that development of a small run-time synthesizer is not an appropriate thing to do within a University and is more suited to commercial development, we have a few points which they should be aware of that to our mind justify this work.
We have long felt that research in speech and language should have an identifiable link to ultimate commercial use. In providing a platform that can be used in consumer products and that falls within the same framework as our research, we can better understand what research issues are actually important to the improvement of our work.
Considering small useful synthesizers forces a more explicit definition of what is necessary in a synthesizer, and also of how we can trade size, flexibility and speed against the quality of synthesized output. Defining that relationship is a research issue.
We are also advocates of speech technology within other research areas, and the ability to offer support on new platforms such as PDAs and wearables allows for more interesting speech applications, such as speech-to-speech translation, robots, and interactive personal digital assistants, that will prove to be new and interesting areas of research. Thus having a platform that others around us can more easily integrate into their research makes our work more satisfying.
The basic architecture of Festival is good. It is well proven. Paul Taylor, Alan W. Black and Richard Caley spent many hours debating low level aspects of representation and structure that would be adequate for current theories but also allow for future theories. The heterogeneous relation graphs (HRG) are theoretically adequate, computationally efficient and well proven. Thus, both because HRGs have such a background and because Flite is to be compatible with voices and models developed in Festival, Flite uses HRGs as its basic utterance representation structure.
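To make the idea of items shared between relations concrete, here is a minimal sketch in C. It is not Flite's actual utterance API: every name in it (demo_utt, demo_contents, demo_item and the demo_ functions) is hypothetical, each relation is reduced to a simple list, and the tree-shaped relations that make HRGs heterogeneous are left out for brevity. It assumes that one linguistic object has a single set of contents and one item in each relation it belongs to.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of an HRG-like structure; not Flite's real types. */
#define DEMO_MAX_RELS  4
#define DEMO_MAX_ITEMS 32

typedef struct demo_item demo_item;

/* One linguistic object: its features (here just a name) are stored once
   and shared by every relation the object takes part in. */
typedef struct {
    const char *name;
    demo_item *in_rel[DEMO_MAX_RELS]; /* this object's item in each relation, or NULL */
} demo_contents;

/* The position of an object within one relation (a list in this sketch). */
struct demo_item {
    demo_contents *contents;
    demo_item *next;
    demo_item *prev;
};

typedef struct {
    const char *rel_names[DEMO_MAX_RELS];
    demo_item *head[DEMO_MAX_RELS];
    demo_item *tail[DEMO_MAX_RELS];
    int n_rels;
    demo_item items[DEMO_MAX_ITEMS]; /* fixed pool, so no heap allocation */
    int n_items;
} demo_utt;

void demo_utt_init(demo_utt *u)
{
    assert(u != NULL);
    u->n_rels = 0;
    u->n_items = 0;
}

void demo_contents_init(demo_contents *c, const char *name)
{
    int r;
    c->name = name;
    for (r = 0; r < DEMO_MAX_RELS; r++)
        c->in_rel[r] = NULL;
}

/* Add a named relation; returns its index, or -1 if there is no room. */
int demo_relation_create(demo_utt *u, const char *name)
{
    int r;
    if (u->n_rels >= DEMO_MAX_RELS)
        return -1;
    r = u->n_rels++;
    u->rel_names[r] = name;
    u->head[r] = NULL;
    u->tail[r] = NULL;
    return r;
}

/* Append object c to the end of relation rel; returns the new item or NULL. */
demo_item *demo_relation_append(demo_utt *u, int rel, demo_contents *c)
{
    demo_item *it;
    if (rel < 0 || rel >= u->n_rels || u->n_items >= DEMO_MAX_ITEMS)
        return NULL;
    if (c->in_rel[rel] != NULL)
        return NULL; /* the object is already in this relation */
    it = &u->items[u->n_items++];
    it->contents = c;
    it->next = NULL;
    it->prev = u->tail[rel];
    if (u->tail[rel] != NULL)
        u->tail[rel]->next = it;
    else
        u->head[rel] = it;
    u->tail[rel] = it;
    c->in_rel[rel] = it;
    return it;
}

/* View the same object from another relation; NULL if it is not in it. */
demo_item *demo_item_as(const demo_item *it, int rel)
{
    if (rel < 0 || rel >= DEMO_MAX_RELS)
        return NULL;
    return it->contents->in_rel[rel];
}
```

A caller would create, say, a "Word" relation and a second relation over a subset of the same words, append the same demo_contents to both, and then use demo_item_as to step from a word's place in one relation to its place in the other without copying any features.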
Most of a synthesizer is in its data (lexicons, unit database etc.); the actual synthesis code is pretty small. In Festival most of that data exists in external files which are loaded on demand. This is obviously slow and memory expensive (you need both a copy of the data on disk and a copy in memory). As one of the principal targets for Flite is very small machines, we wanted to allow that core data to be in ROM, and be appropriately mapped into RAM without any explicit loading (some OSs call this XIP, execute in place). This can be done by various memory mapping functions (in Unix it's called mmap) and is the core technique used in shared libraries (called DLLs in some parts of the world). Thus the data should be in a format in which it can be directly accessed. If you are going to directly access data you need to ensure the byte layout is appropriate for the architecture you are running on: byte order and address width become crucial if you want to avoid any extra conversion code at access time (like byte swapping).
At first it was considered that synthesis data would be converted into binary files which could be mmap'ed into the runtime system, but building appropriate binary files for each architecture is quite a job. However, the C compiler does this in a standard way. Therefore the mode of operation for data within Flite is to convert it to C code (actually C structures) and use the C compiler to generate the appropriate binary structures.
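As an illustration of direct access through memory mapping, the following sketch maps a small binary file read-only with POSIX mmap and reads samples straight out of the mapping. The file layout, the DEMO_MAGIC value and all demo_ names are assumptions made up for this example, not Flite's voice format. The magic word is written in host byte order, so a file produced on a machine with a different byte order fails the check instead of being silently misread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical layout: a header followed by 16-bit samples, all in the
   byte order of the machine that wrote the file. */
#define DEMO_MAGIC 0x464c4954u

typedef struct {
    uint32_t magic;     /* equals DEMO_MAGIC only if the byte order matches */
    uint32_t n_samples; /* number of int16_t values after the header */
} demo_header;

typedef struct {
    void *base;             /* start of the mapping, kept for munmap */
    size_t length;          /* size of the mapping in bytes */
    const int16_t *samples; /* points directly into the mapped file */
    uint32_t n_samples;
} demo_map;

/* Write a file in the layout above. Returns 0 on success, -1 on error. */
int demo_write(const char *path, const int16_t *samples, uint32_t n)
{
    demo_header h;
    size_t bytes = (size_t)n * sizeof(int16_t);
    int ok;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    h.magic = DEMO_MAGIC;
    h.n_samples = n;
    ok = write(fd, &h, sizeof h) == (ssize_t)sizeof h
      && write(fd, samples, bytes) == (ssize_t)bytes;
    if (close(fd) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Map the file read-only; no sample data is copied into the process. */
int demo_map_open(const char *path, demo_map *out)
{
    struct stat st;
    size_t length;
    void *base;
    const demo_header *h;
    int fd;

    assert(out != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size < (off_t)sizeof(demo_header)) {
        close(fd);
        return -1;
    }
    length = (size_t)st.st_size;
    base = mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (base == MAP_FAILED)
        return -1;

    h = (const demo_header *)base;
    if (h->magic != DEMO_MAGIC ||
        (length - sizeof(demo_header)) / sizeof(int16_t) < h->n_samples) {
        munmap(base, length); /* wrong byte order or truncated file */
        return -1;
    }
    out->base = base;
    out->length = length;
    out->samples = (const int16_t *)((const char *)base + sizeof(demo_header));
    out->n_samples = h->n_samples;
    return 0;
}

void demo_map_close(demo_map *m)
{
    munmap(m->base, m->length);
}
```

After demo_map_open succeeds, m.samples[i] reads page by page from the file as needed, which is the behaviour wanted for large fixed databases. A portable file format would have to fix the byte order and field widths explicitly, which this sketch deliberately does not attempt.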
Using the C compiler is a good portable solution, but as these structures can be very big this can tax the C compiler somewhat. Also, because this data is not going to change at run time it can all be const, which means (in Unix) it will be in the text segment and hence read-only (this can be ROM on platforms which have that distinction). For structures to be const all their subparts must also be const, thus all relevant parts must be in the same file, hence the unit database files can be quite big.
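A rough picture of what data converted to C structures might look like is given below. The types and the tiny lexicon are invented for illustration (demo_lex_entry, demo_lexicon and demo_lex_lookup are hypothetical names, and the phone strings are arbitrary); the real generated files are far larger and use Flite's own declarations. The point is that every level, from the strings up to the top-level table, is declared const and lives in one file so the compiler can lay it all out as read-only data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types for illustration; not Flite's actual declarations. */
typedef struct {
    const char *word;
    const char *const *phones; /* const array of const strings */
    size_t n_phones;
} demo_lex_entry;

typedef struct {
    const char *name;
    const demo_lex_entry *entries;
    size_t n_entries;
} demo_lexicon;

/* Every subpart is const and defined in this same file, so the whole
   table can be placed in read-only memory by the compiler and linker. */
static const char *const demo_phones_hello[] = { "hh", "ax", "l", "ow" };
static const char *const demo_phones_world[] = { "w", "er", "l", "d" };

static const demo_lex_entry demo_entries[] = {
    { "hello", demo_phones_hello, 4 },
    { "world", demo_phones_world, 4 }
};

const demo_lexicon demo_lex = { "demo", demo_entries, 2 };

/* Look a word up directly in the compiled-in table; nothing is loaded
   or parsed at run time. Returns NULL if the word is not present. */
const demo_lex_entry *demo_lex_lookup(const demo_lexicon *lex, const char *word)
{
    size_t i;
    assert(lex != NULL && word != NULL);
    for (i = 0; i < lex->n_entries; i++) {
        if (strcmp(lex->entries[i].word, word) == 0)
            return &lex->entries[i];
    }
    return NULL;
}
```

Because demo_entries refers to demo_phones_hello and demo_phones_world by address in a static initializer, those arrays have to be visible where the table is defined, which is why related data ends up together in one large file.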
Of course, this all presumes that you have a C compiler robust enough to compile these files, hardware smart enough to treat flash ROM as memory rather than disk, or an operating system smart enough to demand-page executables. Certain "popular" operating systems and compilers fail in at least one of these respects, and therefore we have provided the flexibility to use memory-mapped file I/O on voice databases, where available, or simply to load them all into memory.