Public release of Haitian Creole language data by Carnegie Mellon

The Language Technologies Institute (LTI) of Carnegie Mellon University's School of Computer Science (CMU SCS) is making publicly available the Haitian Creole speech and text data that we have collected or produced. We are providing this data with minimal restrictions (see the license) so that others can develop language technology for Haiti, in parallel with our own efforts to help with this crisis. Because organizing the data in a useful form takes time, and collaborators are still producing more text data, we will publish the data incrementally on the web as it becomes available.


Note that several spelling systems exist for Haitian Creole.
Here we use the official Haitian orthography for Haitian Creole, established by the IPN (Institut Pedagogique National) in 1979.

Haitian Creole data

Data License

Haitian Creole Speech data
Update: Directory reorganized, ASR models added, noon EST on 28 January 2010.
Update: Additional speech data added (data2), 12:45 p.m. EST on 2 February 2010.
Update: Added detailed description of speech data collection methodology on 24 March 2010.
Speech data originally collected by the U.S. DARPA-funded DIPLOMAT project

Haitian Creole Text data
Update: There is an important update to this directory as of 1 p.m. EST on 27 January 2010. Please re-visit if you have used this data.
Various text data, including:


In addition to the members of the projects cited above:
Jeff Allen, SAP (formerly of Carnegie Mellon)
Vigdis Eriksen, Eriksen Translations Inc.
Manuel Stoeckl, Eriksen Translations Inc.
Karen Wallace
and these current Carnegie Mellon members:
Vamshi Ambati
Gopala Krishna Anumanchipalli
Alan W Black
Ralf Brown
Jaime Carbonell
Robert Frederking
Greg Hanneman
Sanjika Hewavitharana
David Huggins-Daines
Alon Lavie
Stephan Vogel
Contact for this page: Robert Frederking