|
|
| |
|
| |
|
ACE - Australian Corpus of English |
Org: |
 |
Macquarie University,
Sydney, Australia |
Time: |
|
based on materials
published in 1986 |
Size: |
|
1 million words |
Content: |
|
heterogeneous;
written Australian English; 500 samples of ~2000 words, distributed
over 15 text categories |
Access: |
|
available on the ICAME
CD-Rom; check out the online
sample |
Notes: |
|
This corpus was
modeled on the LOB and Brown corpora. |
| |
|
 |
|
| |
|
ANC - American National Corpus |
Org: |
 |
Project Manager: Randi
Reppen, Associate Professor, Department
of English, Northern Arizona University |
Time: |
|
ongoing (final release
expected in fall 2005) |
Size: |
|
First release 10
million words; final release 100 million words |
Content: |
|
heterogeneous; written
and spoken American English. The ANC is intended to be the American counterpart
of the BNC. |
Access: |
|
The first release
(ANC v1.0) is available through the LDC. |
Notes: |
|
The second release is scheduled for late spring 2005. |
| |
|
 |
|
| |
|
Bank of English - BoE |
Org: |
 |
Cobuild group at
University of Birmingham, UK (John Sinclair) |
Time: |
|
The project was
launched in 1991 and is ongoing. |
Size: |
|
currently approx.
524 million words |
Content: |
|
heterogeneous; written
and spoken materials |
Access: |
|
A subcorpus (Collins
Wordbanks Online English corpus) can be accessed
online for free. |
Notes: |
|
The BoE is a so-called
'monitor corpus'. |
| |
|
 |
|
| |
|
BASE - British Academic Spoken English Corpus |
Org: |
 |
Universities of
Warwick and Reading, UK |
Time: |
|
ongoing |
Size: |
|
160 lecture and
40 seminar recordings (info recorded Jan. 2005) |
Content: |
|
digital video recordings
of seminars and lectures across different disciplines |
Access: |
|
the project is still
under development |
Notes: |
|
BASE is being developed
as a companion to the MICASE corpus. |
| |
|
 |
|
| |
|
BNC - British National Corpus |
Org: |
 |
An industrial/academic
consortium led by Oxford University Press |
Time: |
|
1991 - 1994 |
Size: |
|
100 million words |
Content: |
|
heterogeneous; written
(~80%) and spoken (~20%) materials |
Access: |
|
You can purchase the
BNC World Edition on CD-ROM, access the BNC online through SARA,
or purchase the 1 million word BNC
sampler. |
Notes: |
|
The BNC has been linguistically
annotated with the CLAWS4 automatic tagger. |
| |
|
 |
|
| |
|
BNC Baby |
Org: |
 |
Ylva Berglund and Martin Wynne
of the Oxford
Text Archive |
Time: |
|
/ |
Size: |
|
4 million words |
Content: |
|
The BNC Baby consists of samples
taken from the BNC. |
Access: |
|
You can order the BNC Baby on
CD-Rom. The CD-Rom also contains a corpus search tool (Xaira)
and an "experimental corpus containing the complete works
of Shakespeare". |
Notes: |
|
"BNC Baby is a collection
of corpora and software designed to demonstrate the full potential
of corpus linguistics in the teaching of English language and
literature." |
| |
|
|
|
| |
|
Org: |
 |
Longman Corpus Network |
Time: |
|
1991-1994 |
Size: |
|
10 million words |
Content: |
|
natural, spontaneous conversations,
lectures, business meetings, etc. |
Access: |
|
Currently restricted to members
of the Longman Corpus Network |
Notes: |
|
The spoken corpus is part of
the BNC. |
| |
|
|
|
| |
|
Brown Corpus |
Org: |
 |
Brown University |
Time: |
|
published in 1964;
materials are from 1961 |
Size: |
|
1,014,312 words |
Content: |
|
heterogeneous; written
American English; 500 samples of ~2000 words, distributed over
15 text categories |
Access: |
|
available on the ICAME
CD-Rom; check out the online
sample |
Notes: |
|
A revised manual,
published in 1979, contains information on the tagged
version of the BROWN corpus. ICAME provides samples
of both versions online. |
| |
|
 |
|
| |
| CANCODE - Cambridge and Nottingham Corpus of Discourse in English |
Org: |
 |
Cambridge University Press |
Time: |
|
recordings collected between 1995 and 2000 |
Size: |
|
5 million words |
Content: |
|
spoken materials; spontaneous speech only |
Access: |
|
Access is currently restricted to members of Cambridge University Press |
Notes: |
|
"[A]ll the recordings have been coded according to the relationship between the speakers." |
| |
|
 |
|
| |
|
cbc4kids - Reading Comprehension Corpus |
Org: |
 |
The MITRE
Corporation and The
University of Edinburgh |
Time: |
|
2000 - |
Size: |
|
? |
Content: |
|
Written English
news stories, edited so as to target a Canadian teenage audience.
Text, questions and answers have been marked up automatically
with 13 layers of linguistic knowledge. cbc4kids is a richly
annotated subset of the 249-document MITRE Canadian Broadcasting
Corpus (CBC) with 8-12 questions per text and their corresponding
answers. |
Access: |
|
freely available
for research purposes |
Notes: |
|
"multi-layer
markup (POS, parse trees, lemmata)" |
| |
|
 |
|
| |
|
CELT - Corpus of Electronic Texts |
Org: |
 |
University College Cork |
Time: |
|
|
Size: |
|
|
Content: |
|
"Texts are taken from the best printed editions*, scanned, and proofread. Markup for structural and analytic features is added according to the recommendations of the Text Encoding Initiative (TEI). Conversions to HTML are made for online reading in the World-Wide Web, and the master files can be used to create versions in other formats, and for contextual searching, concordancing, and other analyses." |
Access: |
|
An experimental search interface is available |
Notes: |
|
|
| |
|
 |
|
| |
|
CHRISTINE
Corpus |
Org: |
 |
Geoffrey
Sampson, University of Essex, UK |
Time: |
|
Samples are from
the early 1990s; corpus was first distributed in 2000 |
Size: |
|
100,000 words |
Content: |
|
spoken materials; "...based
on extracts from the 'demographically-sampled' speech section
of the British National Corpus " |
Access: |
|
freely available
for download |
Notes: |
|
The CHRISTINE corpus
is an extension of the SUSANNE corpus, which consists mainly
of written materials. Both corpora are syntactically annotated.
More detailed info is available in the Corpus
Manual. |
| |
|
 |
|
| |
|
CIC - Cambridge International Corpus |
Org: |
 |
Cambridge University
Press |
Time: |
|
Has been running
for the past ten years - ongoing |
Size: |
|
more than 700 million
words |
Content: |
|
heterogeneous;
written and spoken American and British English as well as Learner
English |
Access: |
|
Access is currently
restricted to members of Cambridge University Press |
Notes: |
|
|
| |
|
 |
|
| |
|
CLC - Cambridge Learner Corpus |
Org: |
 |
Cambridge University
Press |
Time: |
|
ongoing |
Size: |
|
~ 20 million words |
Content: |
|
Scripts from ~50,000
students representing over 100 first languages and 150
countries |
Access: |
|
Access is currently
restricted to members of Cambridge University Press |
Notes: |
|
The CLC is part
of the CIC. Part of the CLC has been coded with a Learner
Error Coding system. |
| |
|
 |
|
| |
|
| |
|
CPSA - Corpus of Spoken American English |
Org: |
 |
Michael Barlow,
Athelstan |
Time: |
|
materials recorded
between 1994 and 1998 |
Size: |
|
2 main sub-corpora,
1 million words each |
Content: |
|
short interchanges
by 400 speakers in professional activities broadly tied to academia
and politics |
Access: |
|
Registered users
only ($49 for the individual license) |
Notes: |
|
The CPSA is also
available tagged. |
| |
|
 |
|
| |
|
| |
|
FLOB - Freiburg LOB corpus of British English |
Org: |
 |
Englisches Seminar,
Albert-Ludwigs-Universität Freiburg |
Time: |
|
early 1990s |
Size: |
|
1 million words |
Content: |
|
The aim was to "compile
a set of corpora that would match the well-known and widely used
Brown and LOB corpora with the only difference that they should
represent the language of the early 1990s." |
Access: |
|
available on the ICAME
CD-Rom; check out the online
sample |
Notes: |
|
The FLOB has a simplified
SGML-based mark-up. |
| |
|
 |
|
| |
|
FROWN - Freiburg Brown Corpus of American English |
Org: |
 |
Englisches
Seminar, Albert-Ludwigs-Universität Freiburg |
Time: |
|
early 1990s |
Size: |
|
1 million words |
Content: |
|
The aim was to "compile
a set of corpora that would match the well-known and widely used
Brown and LOB corpora with the only difference that they should
represent the language of the early 1990s." |
Access: |
|
available on the ICAME
CD-Rom; check out the online
sample |
Notes: |
|
The FROWN has a
simplified SGML-based mark-up. |
| |
|
 |
|
| |
|