The LDC Catalogue at Carnegie Mellon


This is a complete listing of LDC corpora, as taken from their catalogue pages. Many of these corpora are in the local Carnegie Mellon corpus library. The icons indicate the status of a particular item. Hover over an icon to read what it means. Catalogue numbers with links lead to an on-line copy of the corpus. Others may be obtained from the librarian (Barbara Sandling). Corpus names link to the corpus catalogue page at the LDC.

Please bear in mind that this listing is generated automatically from local data as well as from material on the LDC web site. All of these sources are maintained manually, so inconsistencies will creep in. If you notice something odd, do let me know (air@cs.cmu.edu) and I will try to figure it out.

You can keep track of changes to this collection by subscribing to the RSS feed.


 
LDC0000m   To be filled  [Hua Yu] [Laura Mayfield]
1993 
LDC93S10L   TIDIGITS 
LDC93S11L   Road Rally 
LDC93S12L  oHCRC Map Task Corpus 
LDC93S1L  oTIMIT Acoustic-Phonetic Continuous Speech Corpus 
LDC93S2L  oNTIMIT 
LDC93S3AL  oResource Management Complete Set 2.0 
LDC93S3B^   Resource Management RM1 2.0
LDC93S3C^   Resource Management RM2 2.0
LDC93S4Am   ATIS0 Complete 
LDC93S4Be   ATIS0 Pilot
LDC93S4B-2e   ATIS0 Read
LDC93S4B-3m   ATIS0 SD Read 
LDC93S5m   ATIS2 
LDC93S6Ae   CSR-I (WSJ0) Complete
LDC93S6Bm   CSR-I (WSJ0) Sennheiser 
LDC93S6Cm   CSR-I (WSJ0) Other 
LDC93S7m   Switchboard 
LDC93S8m   Switchboard Credit Card 
LDC93S9L   TI 46-Word 
LDC93T1m   ACL/DCI  [Larry Zitnick]
LDC93T2m   Penn Treebank 0.5 
LDC93T3AL   TIPSTER Complete
LDC93T3B^   TIPSTER Volume 1 
LDC93T3C^   TIPSTER Volume 2 
LDC93T3D^   TIPSTER Volume 3 
LDC93T4e   Switchboard-1 Transcripts
1994 
LDC94L1m   CELEX Lexical Database 
LDC94L2e   COMLEX English Syntax Lexicon
LDC94L3e   COMLEX Pronouncing Dictionary
LDC94S13Ae   CSR-II (WSJ1) Complete
LDC94S13Be   CSR-II (WSJ1) Sennheiser
LDC94S13Cm   CSR-II (WSJ1) Other 
LDC94S14Ae   Air Traffic Control Complete
LDC94S14Be   Air Traffic Control BOS
LDC94S14Ce   Air Traffic Control DCA
LDC94S14De   Air Traffic Control DFW
LDC94S15e   SPIDRE
LDC94S16L   YOHO Speaker Verification 
LDC94S17m   OGI Multilanguage Corpus 
LDC94S18m   OGI Spelled and Spoken Word 
LDC94S19m   ATIS3 Training Data 
LDC94S20L  oBRAMSHILL 
LDC94S21L  oMACROPHONE 
LDC94T4AL   UN Parallel Text (Complete) 
LDC94T4B-1^   UN Parallel Text (English) 
LDC94T4B-2^   UN Parallel Text (French) 
LDC94T4B-3^   UN Parallel Text (Spanish) 
LDC94T5m   ECI Multilingual Text  [Dan Dewey]
1995 
LDC95L4o  oCOMLEX English Syntax Lexicon 
LDC95L5o  oCOMLEX Pronouncing Dictionary 
LDC95S22e   KING Speaker Verification
LDC95S23m   CSR-III Speech 
LDC95S24L  oWSJCAM0 Cambridge Read News 
LDC95S25L  oTRAINS Spoken Dialog Corpus 
LDC95S26m   ATIS3 Test Data 
LDC95S27e   PhoneBook: NYNEX Isolated Words
LDC95S28m   LATINO-40 Spanish Read News 
LDC95T11L  oEuropean Language Newspaper Text 
LDC95T13L   Mandarin Chinese News Text 
LDC95T20L   Hansard French/English 
LDC95T21L   North American News Text Corpus 
LDC95T6e   CSR-III Text
LDC95T7m   Treebank-2  [Brian MacWhinney]
LDC95T8!   Japanese Business News Text 
LDC95T9m   Spanish News Text 
1996 
LDC96A02e   Switchboard Speaker ID Evaluation Test
LDC96A12e   Raw Data: 1996 Language Recognition Evaluation
LDC96L14o  oCELEX2 
LDC96L15o  oCALLHOME Mandarin Chinese Lexicon 
LDC96L16o  oCALLHOME Spanish Lexicon 
LDC96L17o  oCALLHOME Japanese Lexicon 
LDC96L6e   COMLEX English Syntax Lexicon
LDC96L7e   COMLEX Pronouncing Dictionary
LDC96S29e   Frontiers in Speech Processing 93
LDC96S30L  oCTIMIT 
LDC96S31e   CSR-IV HUB4
LDC96S32L  oFFMTIMIT 
LDC96S33e   CSR-IV HUB3
LDC96S34L  oCALLHOME Mandarin Chinese Speech 
LDC96S35m   CALLHOME Spanish Speech  [Alex Waibel] [Laura Mayfield]
LDC96S36L  oBoston University Radio Speech Corpus 
LDC96S37m   CALLHOME Japanese Speech  [Alex Waibel]
LDC96S38m   DCIEM/HCRC 
LDC96S39e   RM Isolated and Spelled Word Data
LDC96S40e   Frontiers in Speech Processing 94
LDC96S41m   VAHA (POLYPHONE II) 
LDC96S46L  oCALLFRIEND American English-Non-Southern Dialect 
LDC96S47m   CALLFRIEND American English-Southern Dialect 
LDC96S48m   CALLFRIEND Canadian French  [Dan Dewey]
LDC96S49m   CALLFRIEND Egyptian Arabic 
LDC96S50m   CALLFRIEND Farsi 
LDC96S51m   CALLFRIEND German 
LDC96S52m   CALLFRIEND Hindi 
LDC96S53L  oCALLFRIEND Japanese 
LDC96S54m   CALLFRIEND Korean 
LDC96S55m   CALLFRIEND Mandarin Chinese-Mainland Dialect 
LDC96S56m   CALLFRIEND Mandarin Chinese-Taiwan Dialect 
LDC96S57L  oCALLFRIEND Spanish-Caribbean Dialect 
LDC96S58L  oCALLFRIEND Spanish-Non-Caribbean Dialect 
LDC96S59m   CALLFRIEND Tamil 
LDC96S60m   CALLFRIEND Vietnamese 
LDC96S61L  o1996 Speaker Recognition Benchmark 
LDC96S64L   JEIDA/JCSD-Channel 0 Complete 
LDC96S64-1^   JEIDA/JCSD-Channel 0 City Names
LDC96S64-2^   JEIDA/JCSD-Channel 0 Control Words
LDC96S64-3^   JEIDA/JCSD-Channel 0 Isolated Digits
LDC96S64-4^   JEIDA/JCSD-Channel 0 Four Digit Sequences
LDC96S64-5^   JEIDA/JCSD-Channel 0 Mono Syllables
LDC96S65L   JEIDA/JCSD-Channel 1 Complete 
LDC96S65-1^   JEIDA/JCSD-Channel 1 City Names
LDC96S65-2^   JEIDA/JCSD-Channel 1 Control Words
LDC96S65-3^   JEIDA/JCSD-Channel 1 Isolated Digits
LDC96S65-4^   JEIDA/JCSD-Channel 1 Four Digit Sequences
LDC96S65-5^   JEIDA/JCSD-Channel 1 Mono Syllables
LDC96T10o  oMessage Understanding Conference (MUC) 6 Additional News Text 
LDC96T11o  oCOMLEX Syntax Text Corpus Version 2.0 
LDC96T16o  oCALLHOME Mandarin Chinese Transcripts 
LDC96T17o  oCALLHOME Spanish Transcripts 
LDC96T18o  oCALLHOME Japanese Transcripts 
1997 
LDC97E1o  oSpoken Document Retrieval Training 
LDC97E2e   Preliminary CALLHOME Egyptian Arabic Transcripts
LDC97E3e   Spoken Document Retrieval Speech Recognizer
LDC97E4o  oMUC-VII 
LDC97E5o  o1997 TREC SDR Test Set Text Data 
LDC97E6m   CSR-VI Hub 4 Spanish News Lexicon  [Maxine Eskenazi]
LDC97E7e   CSR-VI Hub 4 Mandarin Chinese News Lexicon
LDC97L18o  oCALLHOME German Lexicon 
LDC97L19o  oCALLHOME Egyptian Arabic Lexicon 
LDC97L20o  oCALLHOME American English Lexicon (PRONLEX) 
LDC97S42L   CALLHOME American English Speech 
LDC97S43m   CALLHOME German Speech  [Alex Waibel]
LDC97S44L   1996 English Broadcast News Speech (HUB4) 
LDC97S45m   CALLHOME Egyptian Arabic Speech  [Alex Waibel]
LDC97S62L  oSwitchboard-1 Release 2 
LDC97S63m   The CMU Kids Corpus  [Jack Mostow] [Maxine Eskenazi]
LDC97S66m   1996 English Broadcast News Dev and Eval (HUB4)  [Hua Yu]
LDC97T12o  oDSO Corpus of Sense-Tagged English 
LDC97T14o  oCALLHOME American English Transcripts 
LDC97T15o  oCALLHOME German Transcripts 
LDC97T19o  oCALLHOME Egyptian Arabic Transcripts 
LDC97T22o  o1996 English Broadcast News Transcripts (HUB4) 
1998 
LDC98E10o  o1998 HUB4 English UTF Transcript Compendium 
LDC98E11o  oBBN IE/NE-tagged HUB4 Training Transcripts 
LDC98E8o  o1998 SDR Textual Training Corpus 
LDC98E9o  o1998 TREC-7 SDR Evaluation Material 
LDC98L21o  oCOMLEX English Syntax Lexicon 
LDC98S67e   HTIMIT
LDC98S68e   LLHDB
LDC98S69L   HUB5 Mandarin Telephone Speech Corpus 
LDC98S70m   HUB5 Spanish Telephone Speech Corpus  [Laura Mayfield]
LDC98S71m   1997 English Broadcast News Speech (HUB4) 
LDC98S72L   Taiwanese Putonghua Speech and Transcripts 
LDC98S73L  o1997 Mandarin Broadcast News Speech (HUB4-NE) 
LDC98S74L   1997 Spanish Broadcast News Speech (HUB4-NE) 
LDC98S75L  oSwitchboard-2 Phase I 
LDC98S76m   1998 Speaker Recognition Benchmark 
LDC98S77m   Voicemail Corpus Part I 
LDC98T10   *UNKNOWN*
LDC98T23o  oCSR-VI Hub 4 Spanish News Transcripts 
LDC98T24o  o1997 Mandarin Broadcast News Transcripts (HUB4-NE) 
LDC98T25o  oTDT Pilot Study Corpus 
LDC98T26o  oHUB5 Mandarin Transcripts 
LDC98T27o  oHUB5 Spanish Transcripts 
LDC98T28o  o1997 English Broadcast News Transcripts (HUB4) 
LDC98T29o  o1997 Spanish Broadcast News Transcripts (HUB4-NE) 
LDC98T30L   North American News Text Supplement 
LDC98T31L  o1996 CSR HUB4 Language Model 
LDC98T32L   JURIS 
1999 
LDC99E12e   BBN_IENE_TRN99
LDC99L22o  oEgyptian Colloquial Arabic Lexicon 
LDC99L23m   American English Spoken Lexicon 
LDC99S78L   SUSAS 
LDC99S79L   Switchboard-2 Phase II 
LDC99S80m   1997 Speaker Recognition Benchmark  [Nadine Reaves]
LDC99S81m   1999 Speaker Recognition Benchmark  [Nadine Reaves]
LDC99S82e   USC Marketplace Broadcast News Speech
LDC99S83L   Tactical Speaker Identification Speech Corpus (TSID) 
LDC99S84m   TDT2 English Audio 
LDC99T33o  oSUSAS Transcripts 
LDC99T34!   Japanese Business News Text Supplement 
LDC99T35m   TDT2 English Text 
LDC99T36e   USC Marketplace Broadcast News Transcripts
LDC99T37e   TDT2 English Text, Version 2
LDC99T38L  oTDT2 Mandarin Text 
LDC99T39L   TDT2 Multilanguage Text Version 3.0 
LDC99T40L   Portuguese Newswire Text 
LDC99T41L   Spanish Newswire Text, Volume 2 
LDC99T42!  oTreebank-3 
2000 
LDC2000S85!   Santa Barbara Corpus of Spoken American English Part I  [Brian MacWhinney]
LDC2000S86!   1998 HUB4 Broadcast News Evaluation English Test Material 
LDC2000S87L   Speech in Noisy Environments (SPINE) Training Audio 
LDC2000S88!   1999 HUB4 Broadcast News Evaluation English Test Material 
LDC2000S89L   Voice of America (VOA) Czech Broadcast News Audio 
LDC2000S92L  oTDT2 Careful Transcription Audio 
LDC2000S96!   Speech in Noisy Environments (SPINE) Evaluation Audio 
LDC2000T43L   BLLIP 1987-89 WSJ Corpus Release 1 
LDC2000T44o  oTDT2 Careful Transcription Text 
LDC2000T45L   Korean Newswire 
LDC2000T46o  oHong Kong News Parallel Text 
LDC2000T47o  oHong Kong Laws Parallel Text 
LDC2000T48?   Chinese Treebank Final Release 
LDC2000T49o  oSpeech in Noisy Environments (SPINE) Training Transcripts 
LDC2000T50L   Hong Kong Hansards Parallel Text 
LDC2000T51L   TREC Spanish 
LDC2000T52!   TREC Mandarin 
LDC2000T53o  oVoice of America (VOA) Broadcast News Czech Transcript Corpus 
LDC2000T54o  oSpeech in Noisy Environments (SPINE) Evaluation Transcripts 
2001 
LDC2001S04L  oSpeech in Noisy Environments (SPINE2) Part 1 Audio 
LDC2001S06L  oSpeech in Noisy Environments (SPINE2) Part 2 Audio 
LDC2001S08L  oSpeech in Noisy Environments (SPINE2) Part 3 Audio 
LDC2001S13L   Switchboard Cellular Part 1 Audio 
LDC2001S15L  oSwitchboard Cellular Part 1 Transcribed Audio 
LDC2001S16L  oGrassfields Bantu Fieldwork: Ngomba Tone Paradigms 
LDC2001S91L  o1997 HUB4 Broadcast News Evaluation Non-English Test Material 
LDC2001S93o  oTDT2 Mandarin Audio Corpus 
LDC2001S94L  oTDT3 English Audio 
LDC2001S95L  oTDT3 Mandarin Audio 
LDC2001S97L  o2000 NIST Speaker Recognition Evaluation 
LDC2001S99L   Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio 
LDC2001T02o  oMessage Understanding Conference (MUC) 7 
LDC2001T05o  oSpeech in Noisy Environments (SPINE2) Part 1 Transcripts 
LDC2001T07o  oSpeech in Noisy Environments (SPINE2) Part 2 Transcripts 
LDC2001T09o  oSpeech in Noisy Environments (SPINE2) Part 3 Transcripts 
LDC2001T10L   Prague Dependency Treebank 1.0 
LDC2001T11o  oChinese Treebank 2.0 
LDC2001T14o  oSwitchboard Cellular Part 1 Transcription 
LDC2001T55!  oArabic Newswire Part 1  [NianLi Ma]
LDC2001T57L   TDT2 Multilanguage Text Version 4.0 
LDC2001T58L   TDT3 Multilanguage Text Version 2.0 
LDC2001T60o  oSyllable-Final /s/ Lenition 
LDC2001T61e   CALLHOME Spanish Dialogue Act Annotation
LDC2001T62L  oCETEMpublico 
2002 
LDC2002E14o  oChinese English Translation Lexicon Version 3-beta 
LDC2002E15L  oUN Arabic English Parallel Text Version 1 beta 
LDC2002E16o  oHong Kong News Parallel Text Version 2 beta 
LDC2002E17o  oEnglish Translation of Chinese Treebank Version 1 beta 
LDC2002E18o  oXinhua Chinese English Parallel News Text Version 1 beta 
LDC2002E19o  oHong Kong Hansard Parallel Text Version 2 beta 
LDC2002E27o  oChinese English Translation Dictionary v3.0 
LDC2002E32o  oTDT3 Arabic Text Version 0.1 
LDC2002E33e   ACE Phase 2 Training Data Version 6
LDC2002E36e   2002 DUC Evaluation Version 0.1
LDC2002E48o  oUmmah Arabic English Parallel News Text 
LDC2002E49o  oBuckwalter Arabic Morphological Analyzer 
LDC2002E50o  oName-Annotated TDT Corpus Supplement for ACE 
LDC2002E52e   TDT4 Multilanguage Text Corpus
LDC2002E53o  oMultiple-Translation Chinese Corpus 2.0 
LDC2002E54o  oMultiple-Translation Arabic Corpus 
LDC2002E55o  oArabic Treebank: Part 1 v 1.0 
LDC2002E58o  oSinorama Chinese English Parallel Text 
LDC2002L27o  oChinese-English Translation Lexicon Version 3.0 
LDC2002L49  oBuckwalter Arabic Morphological Analyzer Version 1.0
LDC2002S02L  oWest Point Arabic Speech Corpus 
LDC2002S04L   Translanguage English Database (TED) Speech 
LDC2002S06L  oSwitchboard-2 Phase III Audio 
LDC2002S10!   1998 HUB5 English Evaluation 
LDC2002S11L  o1997 HUB4 English Evaluation Speech and Transcripts 
LDC2002S12L  o2001 HUB5 Mandarin Evaluation 
LDC2002S13L  o2001 HUB5 English Evaluation 
LDC2002S22L  o1997 HUB5 Arabic Evaluation 
LDC2002S24L  o1997 HUB5 German Evaluation 
LDC2002S25L  o1997 HUB5 Spanish Evaluation 
LDC2002S28L  oEmotional Prosody Speech and Transcripts 
LDC2002S34!  o2001 NIST Speaker Recognition Evaluation Corpus 
LDC2002S35L  oVoicemail Corpus Part II 
LDC2002S37L  oCALLHOME Egyptian Arabic Speech Supplement 
LDC2002S56L  o2000 Communicator Evaluation 
LDC2002T01o  oMultiple-Translation Chinese Corpus 
LDC2002T03o  oTranslanguage English Database (TED) Transcripts 
LDC2002T07o  oRST Discourse Treebank 
LDC2002T26o  oKorean English Treebank Annotations 
LDC2002T31L  oThe AQUAINT Corpus of English News Text 
LDC2002T38o  oCALLHOME Egyptian Arabic Transcripts Supplement 
LDC2002T39   1997 HUB5 Arabic Transcripts
LDC2002T42e   1997 HUB5 Spanish Transcripts
2003 
LDC2003E01o  oChinese <-> English Name Entity Lists Version 1.0 beta 
LDC2003E02m   TDT4 Multilanguage Speech  [Hua Yu]
LDC2003E03o  oTDT4 Multilanguage Transcripts 
LDC2003E04o  oMultiple Translation Chinese Corpus Part 3 
LDC2003E05o  oArabic Translation Corpus Part 1 
LDC2003E06o  oChinese Treebank 3.0 
LDC2003E07o  oChinese Treebank English Parallel Corpus 
LDC2003E08o  oChinese News Translation Corpus Part 1 
LDC2003E09o  oArabic News Translation Corpus Part 2 
LDC2003E10e   Aquaint Xinhua for NTCIR Evaluation
LDC2003E11L  oUN Chinese English Parallel Text Version 1.0 beta 
LDC2003E12m   Fisher Training Speech Part 1  [Hua Yu]
LDC2003E12Bm   Fisher Training Speech Part 2  [Hua Yu]
LDC2003E12Cm   Fisher Training Speech Part 3  [Hua Yu]
LDC2003E12Dm   Fisher Training Speech Data, Part 4  [Hua Yu]
LDC2003E12Em   Fisher Training Speech Data, Part 5  [Hua Yu]
LDC2003E12Fm   Fisher Training Speech Data, Part 6  [Hua Yu]
LDC2003E13o  oFisher Quick Transcription Part 1 Version 1.0 
LDC2003E13Bo  oFisher Quick Transcription Part 2 Version 1.0 
LDC2003E13Co  oFisher Quick Transcription Part 3 Version 1.0 
LDC2003E13Do  oFisher Training Transcripts Part 4, v1.0 
LDC2003E13Eo  oFisher Training Transcripts, Part 5, v1.0 
LDC2003E14L  oFBIS Multilanguage Texts 
LDC2003E15L  oHARD GovDocs 
LDC2003E16o  oSIGHAN Bakeoff 
LDC2003E17e   Arabic Treebank: Part 2 v 1.0
LDC2003E18e   ACE3-V1.3
LDC2003E19e   EARS MDE RT-03F Training Corpus
LDC2003E20e   TDT4 Multilanguage Text Subset for TIDES Extraction 2003
LDC2003E21m   TDT4 Multilanguage Text Version 1.1  [Hua Yu] [Jamie Callan] [Jian Zhang]
LDC2003E22e   The SLX Corpus of Classic Sociolinguistic Interviews
LDC2003E24e   Arabic Treebank: Part 2 v 1.1
LDC2003E25L  oHong Kong News Parallel Text 
LDC2003E26e   ACE 2004 Pilot Corpus V1.0
LDC2003E27e   EARS MDE RT-03 DevTest and Evaluation Corpus
LDC2003L01L   Grassfields Bantu Fieldwork: Dschang Lexicon 
LDC2003L02o  oKorean Telephone Conversations Lexicon 
LDC2003S01L  o2001 Communicator Evaluation 
LDC2003S02L  oGrassfields Bantu Fieldwork: Dschang Tone Paradigms 
LDC2003S03L  oKorean Telephone Conversations Speech 
LDC2003S04e   Cross-Channel Forensic Speech for Automatic Speaker Recognition
LDC2003S05L   West Point Russian Speech 
LDC2003S06L  oSanta Barbara Corpus of Spoken American English Part II 
LDC2003S07e   Korean Telephone Conversations Complete Set
LDC2003T01o  o2001 HUB5 Mandarin Transcripts 
LDC2003T02o  o1998 HUB5 English Transcripts 
LDC2003T03o  o1997 HUB5 German Transcripts 
LDC2003T04o  o1997 HUB5 Spanish Transcripts 
LDC2003T05o  oEnglish Gigaword 
LDC2003T06o  oArabic Treebank: Part 1 v 2.0 
LDC2003T07o  oArabic Treebank: Part 1 - 10K-word English Translation 
LDC2003T08o  oKorean Telephone Conversations Transcripts 
LDC2003T09L  oChinese Gigaword 
LDC2003T10o  oSAID 
LDC2003T11L  oACE-2 Version 1.0 
LDC2003T12L  oArabic Gigaword 
LDC2003T13o  oMessage Understanding Conference (MUC) 6 
LDC2003T15L  oSLX Corpus of Classic Sociolinguistic Interviews 
LDC2003T16L   SummBank 1.0 
LDC2003T17o  oMultiple-Translation Chinese (MTC) Part 2 
LDC2003T18o  oMultiple-Translation Arabic (MTA) Part 1 
LDC2003T20L   American National Corpus (ANC) First Release 
LDC2003V01L  oFORM2 Kinematic Gesture 
2004 
LDC2004E01