A - F
ACE BNC Spoken CLC
ANC CANCODE COLT
BASE cbc4kids CPSA
BoE CELT ELISA
BNC CHRISTINE FLOB
BNC Baby CIC FROWN

   
 

../ ACE - Australian Corpus of English
Org:
Macquarie University, Sydney, Australia
Time:
  based on materials published in 1986
Size:
  1 million words
Content:
  heterogeneous; written Australian English; 500 samples of ~2000 words, distributed over 15 text categories
Access:
  available on the ICAME CD-Rom; check out the online sample
Notes:
  This corpus was modeled on the LOB and Brown corpora.
 
../ ANC - American National Corpus
Org:

Project Manager: Randi Reppen, Associate Professor, Department of English, Northern Arizona University

Time:
  ongoing (final release expected in fall 2005)
Size:
  First release 10 million words; final release 100 million words
Content:
  heterogeneous; written and spoken American English; The ANC will be the counterpart of the BNC.
Access:
  The first release (ANC v1.0) is available through the LDC.
Notes:
  The second release is scheduled for late spring 2005.
 

../ BoE - Bank of English
Org:
Cobuild group at University of Birmingham, UK (John Sinclair)
Time:
  The project was launched in 1991 and is ongoing.
Size:
  currently approx. 524 million words
Content:
  heterogeneous; written and spoken materials
Access:
  A subcorpus (Collins Wordbanks Online English corpus) can be accessed online for free.
Notes:
  The BoE is a so-called 'monitor corpus'.
 

../ BASE - British Academic Spoken English Corpus
Org:
Universities of Warwick and Reading, UK
Time:
  ongoing
Size:
  160 lecture and 40 seminar recordings (info recorded Jan. 2005)
Content:
  digital video recordings of seminars and lectures across different disciplines
Access:
  the project is still under development
Notes:
  BASE is being developed as a companion to the MICASE corpus.
 

../ BNC - British National Corpus
Org:
An industrial/academic consortium led by Oxford University Press (more info ...)
Time:
  1991 - 1994
Size:
  100 million words
Content:
  heterogeneous; written (~80%) and spoken (~20%) materials
Access:
  You can purchase the BNC World Edition on CD-ROM, access the BNC online through SARA, or purchase the 1 million word BNC sampler.
Notes:
  The BNC has been linguistically annotated with the CLAWS4 automatic tagger.
 

../ BNC Baby
Org:
Ylva Berglund and Martin Wynne of the Oxford Text Archive
Time:
  /
Size:
  4 million words
Content:
  The BNC Baby consists of samples taken from the BNC.
Access:
  You can order the BNC Baby on CD-Rom. The CD-Rom also contains a corpus search tool (Xaira) and an "experimental corpus containing the complete works of Shakespeare".
Notes:
  "BNC Baby is a collection of corpora and software designed to demonstrate the full potential of corpus linguistics in the teaching of English language and literature."
   
 

../ BNC Spoken
Org:
Longman Corpus Network
Time:
  1991-1994
Size:
  10 million words
Content:
  natural, spontaneous conversations, lectures, business meetings, etc.
Access:
  Currently restricted to members of Longman Corpus Network
Notes:
  The spoken corpus is part of the BNC.
   
 

../ BROWN - Brown Corpus
Org:
Brown University
Time:
  published in 1964; materials are from 1961
Size:
  1,014,312 words
Content:
  heterogeneous; written American English; 500 samples of ~2000 words, distributed over 15 text categories
Access:
  available on the ICAME CD-Rom; check out the online sample
Notes:
  A revised manual, published in 1979, contains information on the tagged version of the BROWN corpus. ICAME provides samples of both versions online.
 

../ CANCODE - Cambridge and Nottingham Corpus of Discourse in English
Org:
Cambridge University Press
Time:
  recordings collected between 1995 and 2000
Size:
  5 million words
Content:
  spoken materials; spontaneous speech only
Access:
  Access is currently restricted to members of Cambridge University Press
Notes:
  "[A]ll the recordings have been coded according to the relationship between the speakers."
 

../ cbc4kids - Reading Comprehension Corpus
Org:
The MITRE Corporation and The University of Edinburgh
Time:
  2000 -
Size:
  ?
Content:
  Written English news stories, edited to target a Canadian teenage audience. Texts, questions, and answers have been marked up automatically with 13 layers of linguistic knowledge. cbc4kids is a richly annotated subset of the 249-document MITRE Canadian Broadcasting Corpus (CBC), with 8-12 questions per text and their corresponding answers.
Access:
  freely available for research purposes
Notes:
  "multi-layer markup (POS, parse trees, lemmata)"
 

../ CELT - Corpus of Electronic Texts
Org:
University College Cork
Time:
   
Size:
   
Content:
  "Texts are taken from the best printed editions*, scanned, and proofread. Markup for structural and analytic features is added according to the recommendations of the Text Encoding Initiative (TEI). Conversions to HTML are made for online reading in the World-Wide Web, and the master files can be used to create versions in other formats, and for contextual searching, concordancing, and other analyses."
Access:
  An experimental search interface is available
Notes:
   
 

../ CHRISTINE Corpus
Org:
Geoffrey Sampson, University of Essex, UK
Time:
  Samples are from the early 1990s; corpus was first distributed in 2000
Size:
  100,000 words
Content:
  spoken materials; "...based on extracts from the 'demographically-sampled' speech section of the British National Corpus"
Access:
  freely available for download here
Notes:
  The CHRISTINE corpus is an extension of the SUSANNE corpus, which consists mainly of written materials. Both corpora are syntactically annotated. More detailed info is available in the Corpus Manual.
 

../ CIC - Cambridge International Corpus
Org:
Cambridge University Press
Time:
  ongoing; has been running for the past ten years
Size:
  more than 700 million words
Content:
  heterogeneous; written and spoken American and British English as well as Learner English
Access:
  Access is currently restricted to members of Cambridge University Press
Notes:
   
 

../ CLC - Cambridge Learner Corpus
Org:
Cambridge University Press
Time:
  ongoing
Size:
  ~ 20 million words
Content:
  scripts from ~50,000 students with over 100 different first languages, from 150 different countries
Access:
  Access is currently restricted to members of Cambridge University Press
Notes:
  The CLC is part of the CIC. Part of the CLC has been coded with a Learner Error Coding system.
 

../ COLT - Bergen Corpus of London Teenage Language
Org:

Department of English, University of Bergen, Norway

Time:
  materials were collected in 1993
Size:
  500,000 words; the pilot version consists of 151 texts
Content:
  transcripts of spoken 'London Teenage Language'
Access:
  available on the ICAME CD-ROM, check out the online sample
Notes:
  COLT Manual (doc), (pdf)
 

../ CPSA - Corpus of Spoken Professional American English
Org:
Michael Barlow, Athelstan
Time:
  materials recorded between 1994 and 1998
Size:
  2 main sub-corpora, 1 million words each
Content:
  short interchanges by 400 speakers, recorded during professional activities broadly tied to academia and politics
Access:
  Registered users only ($49 for the individual license)
Notes:
  The CPSA is also available tagged.
 

../ ELISA - English Language Interview Corpus as a Second-Language Learning Application
Org:
Department of Applied English Linguistics, University of Tübingen
Time:
  ongoing
Size:
  currently ~ 60,000 words
Content:
  spoken English; different varieties of English
Access:
  Demo version is available online.
Notes:
  "The corpus is intended as a resource for the creation of learning materials as well as for autonomous exploitation by learners."
 

../ FLOB - Freiburg LOB corpus of British English
Org:
Englisches Seminar, Albert-Ludwigs-Universität Freiburg
Time:
  early 1990s
Size:
  1 million words
Content:
  The aim was to "compile a set of corpora that would match the well-known and widely used Brown and LOB corpora with the only difference that they should represent the language of the early 1990s."
Access:
  available on the ICAME CD-Rom; check out the online sample
Notes:
  The FLOB has a simplified SGML-based mark-up.
 

../ FROWN - Freiburg Brown Corpus of American English
Org:

Englisches Seminar, Albert-Ludwigs-Universität Freiburg

Time:
  early 1990s
Size:
  1 million words
Content:
  The aim was to "compile a set of corpora that would match the well-known and widely used Brown and LOB corpora with the only difference that they should represent the language of the early 1990s."
Access:
  available on the ICAME CD-Rom; check out the online sample
Notes:
  The FROWN has a simplified SGML-based mark-up.