
Corpus

From TEFL World Wiki

A corpus (plural corpora) is a large collection of texts, written or transcribed from speech, which is used in computational linguistics to analyse the way language is used.



Types of Corpora

A corpus can be one or more of the following:

Analysis of a corpus will bring to light certain ways of language use within that group. For example, it may well show that scientific papers use the passive voice far more often than newspapers do.

Methods of Analysis

Corpora are generally searched and analysed using computers. Whilst a human analysis will show grammatical differences between these two sentences:

Time flies like an arrow.
Fruit flies like a banana.

a computer cannot, on its own, distinguish between the two uses of the words flies and like. To get around this, corpora are often tagged or annotated. Typically this involves adding part-of-speech tags, thus:

Time [noun] flies [verb] like [preposition] an [determiner] arrow [noun].
Fruit [noun] flies [noun] like [verb] a [determiner] banana [noun].

This allows a concordancer, for example, to find all uses of like as a verb as opposed to like as a preposition.
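As a rough illustration of how tags make this kind of search possible, here is a minimal sketch in Python. The tiny tagged corpus, the tag labels, and the helper function concordance are all invented for this example; a real concordancer works on a far larger corpus and a richer tagset.

```python
# A toy tagged corpus: each sentence is a list of (word, tag) pairs.
# The tag labels are this sketch's own choice, not a real tagset.
TAGGED_CORPUS = [
    [("Time", "noun"), ("flies", "verb"), ("like", "preposition"),
     ("an", "determiner"), ("arrow", "noun")],
    [("Fruit", "noun"), ("flies", "noun"), ("like", "verb"),
     ("a", "determiner"), ("banana", "noun")],
]


def concordance(corpus, word, tag):
    """Return every sentence (as a string) in which `word` carries `tag`."""
    hits = []
    for sentence in corpus:
        for token, token_tag in sentence:
            if token.lower() == word.lower() and token_tag == tag:
                hits.append(" ".join(w for w, _ in sentence))
                break  # report each matching sentence only once
    return hits


verb_hits = concordance(TAGGED_CORPUS, "like", "verb")
print(verb_hits)
```

Searching for flies with the tag verb instead would return only the first sentence, which is exactly the distinction an untagged word search cannot make.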


In the Classroom

Corpora can be used in the classroom, for example with a concordancer, by students working under the guidance of a teacher. This allows students to see how language is used by native speakers in everyday situations. As a teacher, you may be asked questions like, "Do we say the team is or the team are?" If this happens and you have access to the internet, you can have your students search a corpus for themselves and work out which is more appropriate and when.

Incidentally, an online search of the BNC (British National Corpus) shows 109 occurrences of the team is and just 37 occurrences of the team are. Without going into further analysis, this should tell your students that the team is is about three times as common as the team are, so, given the choice, it is the safer option!
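The comparison can be sketched in a few lines of Python. The two counts are the ones given for the BNC search; the function name count_phrase and the sample sentences are invented for illustration, and a real search would be run against the corpus itself.

```python
import re


def count_phrase(text, phrase):
    """Count whole-word, case-insensitive occurrences of `phrase` in `text`."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Invented sample text, standing in for a real corpus.
sample = (
    "The team is ready. I think the team are playing well. "
    "The team is travelling tomorrow."
)
print(count_phrase(sample, "the team is"))
print(count_phrase(sample, "the team are"))

# Counts reported for the BNC search mentioned in the text.
team_is, team_are = 109, 37
ratio = team_is / team_are
print(round(ratio, 1))  # → 2.9
```

A ratio like this shows only which form is more frequent in the corpus. Deciding when each form is appropriate still means reading the individual examples in context.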

Corpus Linguistics

The use of corpora in analysing language means that linguists can see how language is used in real-life situations. This allows them to produce a descriptive grammar of English (i.e. describing how grammar is actually used rather than how it "should" be used).


Notable Corpora

The British National Corpus is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late twentieth century from a wide variety of genres with the intention that it be a representative sample of spoken and written British English of that time.

The 10-million-word spoken corpus has two parts: a demographic part, containing transcriptions of spontaneous natural conversations made by members of the public, and a context-governed part, containing transcriptions of recordings made at specific types of meetings and events. All the original recordings transcribed for inclusion in the BNC have been deposited at the National Sound Archives of the British Library.

The corpus is marked up following the recommendations of the Text Encoding Initiative and includes full linguistic annotation and contextual information. The most recent edition, from March 2007, is distributed in XML format along with the XAIRA software. It is freely available under a licence and is very widely distributed.
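Because the corpus is distributed as XML, annotated words can be read with any standard XML parser. The following is a minimal sketch using Python's standard library. The fragment, the element name w, and the attribute name pos are simplified assumptions made for illustration and should not be taken as the actual BNC markup.

```python
import xml.etree.ElementTree as ET

# A hypothetical, simplified fragment: one sentence whose words carry a
# part-of-speech attribute. Real corpus files are far richer than this.
FRAGMENT = """
<s>
  <w pos="noun">Fruit</w>
  <w pos="noun">flies</w>
  <w pos="verb">like</w>
  <w pos="determiner">a</w>
  <w pos="noun">banana</w>
</s>
""".strip()

root = ET.fromstring(FRAGMENT)

# Collect (word, tag) pairs from every <w> element in the sentence.
tagged = [(w.text, w.get("pos")) for w in root.iter("w")]

# Keep only the words annotated as verbs.
verbs = [word for word, tag in tagged if tag == "verb"]
print(verbs)  # → ['like']
```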

The BNC can be searched online for specific words or phrases.

The American National Corpus is a paid, membership-based collaboratory with the aim of creating an electronic text corpus of American English. The collection will include texts and transcripts of spoken data produced from 1990 onwards, with the goal of a 100-million-word corpus.

ANC Consortium members include publishers, software companies, and academic members. Consortium members have exclusive access throughout the development period and for five years after the first installment of the corpus. The First Release of the American National Corpus (ANC) was made available in mid-fall, 2003. The data includes approximately 11 million words of American English, including written and spoken data and a variety of text types annotated for part of speech and lemma. The corpus is provided in XML format conformant to the XML Corpus Encoding Standard (XCES).

External Links

The British National Corpus
