This speech corpus was collected as part of the DIPLOMAT project, carried out at Carnegie Mellon University from November 1996 to November 1998. The DIPLOMAT project was designed to test the feasibility of rapid-deployment, wearable, bi-directional speech translation systems. The project's goal was to develop a methodology that would allow implementation of complete recognition and synthesis systems at an acceptable level of quality within a few weeks after initial recording corpus design, with continual, graceful improvement to a good level of quality over a period of months.
The corpus contains read speech from 150 native speakers of Haitian Creole residing in three different countries. The first three recording phases took place in Pittsburgh (September 1997), Paris (December 1997) and New York City (January 1998) and included 45 speakers. The fourth phase took place in Haiti during March-April 1998 and included 105 speakers. The gender breakdown for all speakers was 90 male and 60 female. The directory /data contains speech from recordings in Haiti; the directory /data2 contains speech from the earlier recordings.
The material that speakers read was taken from texts available in electronic form on the internet and from non-governmental organizations and literacy institutes. The sources include novels, political speeches, newspapers, training manuals, literacy primers, and web pages. Two native speakers read through all of the selected texts to correct typographic errors and change foreign loanwords to their Haitian Creole forms. Extremely short (1-2 words) or long (beyond two lines on a laptop screen) sentences were eliminated. The material that remained contained over 1,200,000 words and over 33,000 unique word types. From this material, recording sentences were chosen using a greedy text selection method that yielded a set of phonetically rich and relatively balanced sentences. The final recording material consisted of over 2,000 sentences.
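The exact selection algorithm is not described here, so the following is only a minimal sketch of one common form of greedy text selection, not the project's actual method. It assumes that coverage is measured over within-word letter bigrams as a rough stand-in for phonetic units; the function names `units` and `greedy_select` and the `max_sentences` parameter are hypothetical.

```python
def units(sentence):
    """Return the set of within-word letter bigrams in a sentence.

    Assumption: letter bigrams stand in for phonetic units; a real
    system would more likely use phones or diphones from a lexicon.
    """
    found = set()
    for word in sentence.lower().split():
        letters = [c for c in word if c.isalpha()]
        found.update(a + b for a, b in zip(letters, letters[1:]))
    return found


def greedy_select(sentences, max_sentences):
    """Greedily pick sentences that add the most not-yet-covered units."""
    covered = set()
    selected = []
    remaining = list(sentences)
    while remaining and len(selected) < max_sentences:
        # Choose the sentence contributing the most new units
        # (ties go to the earliest sentence in the list).
        best = max(remaining, key=lambda s: len(units(s) - covered))
        gain = units(best) - covered
        if not gain:
            break  # no remaining sentence adds new coverage
        selected.append(best)
        covered |= gain
        remaining.remove(best)
    return selected, covered
```

Each pass rescans every remaining sentence, which is simple but quadratic in the number of candidates; for a pool drawn from over a million words, a practical implementation would probably cache unit sets or score sentences incrementally.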
The sentences were randomized and cut into subsets that speakers read. The number of sentences recorded in a single session by the first 13 speakers ranged from 99 to 231. For subsequent speakers, sessions were limited to approximately 30 minutes each (corresponding to subsets of roughly 15-25 sentences) to facilitate the recording task and encourage greater participation. In general, a given speaker recorded one or two sentence subsets.
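The randomize-and-cut step can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the function name `make_subsets`, the fixed `subset_size`, and the `seed` parameter are hypothetical, and the real subsets varied in size.

```python
import random


def make_subsets(sentences, subset_size, seed=None):
    """Shuffle the sentences and cut them into consecutive subsets.

    Assumption: every subset has the same target size; the last one
    may be shorter if the total does not divide evenly.
    """
    shuffled = list(sentences)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    return [
        shuffled[i:i + subset_size]
        for i in range(0, len(shuffled), subset_size)
    ]
```

For example, `make_subsets(all_sentences, 20, seed=1)` would split a list named `all_sentences` into reading sets of 20 sentences each, a size chosen here only because it falls within the 15-25 range mentioned above.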
The recording procedure was as follows. For each sentence, the participant read the sentence out loud for practice, recorded the sentence, re-recorded immediately if a mistake occurred, and listened to the sentence to verify that the recorded file accurately reflected the written form. A simple computer interface was used to allow participants to self-record, and a native or near-native speaker was present to provide assistance or catch any word pronounced in a manner different from the written form.