This page is no longer being updated as it has been relocated to http://www.mcs.vuw.ac.nz/~elvis/db/FishBrainWiki?Corpus.

About the page

This page contains (hopefully) all the information I've collected regarding corpus analysis.

What is a Corpus?

A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language.

David Crystal, A Dictionary of Linguistics and Phonetics, Blackwell, 3rd Edition, 1991. [BNC]

A collection of naturally occurring language text, chosen to characterize a state or variety of a language.

John Sinclair, Corpus, Concordance, Collocation, OUP, 1991. [BNC]

The study of real language (shown through practical usage).
"Corpus linguistics is a methodology rather than an aspect of language requiring explanation or description." - [ICL]
Sampling techniques: longitudinal (progressing over time), large sampling studies.

Types of corpus

Sampling and Representativeness

Quotes from [CL-Ed] unless otherwise marked.

Often in linguistics we are not merely interested in an individual text or author, but a whole variety of language. In such cases we have two options for data collection:

As discussed in lecture 1, one of Chomsky's criticisms of the corpus approach was that language is infinite - therefore, any corpus would be skewed. In other words, some utterances would be excluded because they are rare, others which are much more common might be excluded by chance, and alternatively, extremely rare utterances might also be included several times. Although nowadays modern computer technology allows us to collect much larger corpora than those that Chomsky was thinking about, his criticisms still must be taken seriously. This does not mean that we should abandon corpus linguistics, but instead try to establish ways in which a much less biased and more representative corpus may be constructed.

We are therefore interested in creating a corpus which is maximally representative of the variety under examination, that is, which provides us with as accurate a picture as possible of the tendencies of that variety, as well as their proportions. What we are looking for is a broad range of authors and genres which, when taken together, may be considered to "average out" and provide a reasonably accurate picture of the entire language population in which we are interested.
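One way to picture "a broad range of authors and genres" that keeps the proportions of the population is proportional stratified sampling. The sketch below is only an illustration, not a method taken from [CL-Ed]: it assumes every text is already labelled with a genre, and the function name, the genre labels and the text identifiers are all made up for the example.

```python
import random
from collections import defaultdict

def stratified_sample(texts, sample_size, seed=0):
    """Pick texts so that each genre's share of the sample roughly
    matches its share of the population.

    texts: list of (genre, text_id) pairs (hypothetical labels).
    Returns a list of (genre, text_id) pairs.
    """
    rng = random.Random(seed)

    # Group the text identifiers by genre.
    by_genre = defaultdict(list)
    for genre, text_id in texts:
        by_genre[genre].append(text_id)

    total = len(texts)
    sample = []
    for genre, ids in sorted(by_genre.items()):
        # Proportional allocation, with at least one text per genre.
        # Because of rounding, the final size may differ slightly
        # from sample_size.
        k = max(1, round(sample_size * len(ids) / total))
        k = min(k, len(ids))
        sample.extend((genre, t) for t in rng.sample(ids, k))
    return sample

# A made-up population: 60 news texts, 30 fiction texts, 10 letters.
population = (
    [("news", f"n{i}") for i in range(60)]
    + [("fiction", f"f{i}") for i in range(30)]
    + [("letters", f"l{i}") for i in range(10)]
)
chosen = stratified_sample(population, sample_size=10)
```

With this population, a sample of 10 keeps the 6 : 3 : 1 split between news, fiction and letters, so no genre is left out by chance and none is over-represented.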

Known References

[CBRL] [VUWC] [ICL] [QGL] [CL] [CCEDMC] [CL-Ed] [BNC]

Misc jottings

Spreadsheets are already in machine readable form. But could they be marked up (XML) in a way that would make them easier to process?
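As a rough idea of what such markup might look like, the sketch below turns a small worksheet into XML using Python's standard xml.etree.ElementTree module. The element and attribute names (worksheet, cell, ref, kind) are invented for this illustration and do not follow any existing spreadsheet schema; treating anything that starts with "=" as a formula is likewise an assumption.

```python
import xml.etree.ElementTree as ET

def sheet_to_xml(name, cells):
    """Serialise one worksheet as XML.

    cells: dict mapping a cell reference such as 'A1' to its content
    as a string. Content starting with '=' is tagged as a formula
    (an assumption made for this sketch).
    """
    sheet = ET.Element("worksheet", {"name": name})
    for ref, content in sorted(cells.items()):
        kind = "formula" if content.startswith("=") else "value"
        cell = ET.SubElement(sheet, "cell", {"ref": ref, "kind": kind})
        cell.text = content
    return ET.tostring(sheet, encoding="unicode")

# A hypothetical two-cell worksheet.
xml_text = sheet_to_xml("Budget", {"A1": "10", "A2": "=A1*2"})
print(xml_text)
```

Tagging each cell as a value or a formula would let a later analysis step count or filter formulas without re-parsing the cell contents.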

I believe, for the purposes of this project, that there is an infinite possible population of spreadsheets. That is, it would not be possible to generate all the possible spreadsheets using a systematic program. That said, any single worksheet has only 256 columns and 65536 rows. I am uncertain whether the possible depth of worksheets is indeed infinite (it is at least greater than 256). Further research may be required to determine whether spreadsheets are infinite in any way, shape or form.
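The row and column limits can be turned into a quick back-of-the-envelope calculation. The sketch below uses only the two figures quoted above; the simplification that each cell is either empty or filled is an assumption for illustration, and real cell contents would make the count far larger.

```python
import math

MAX_COLUMNS = 256    # column limit quoted above
MAX_ROWS = 65536     # row limit quoted above

# Number of cells in a single worksheet.
cells_per_sheet = MAX_COLUMNS * MAX_ROWS

def digits_in_power_of_two(exponent):
    """Number of decimal digits in 2**exponent, computed without
    building the huge integer itself."""
    return math.floor(exponent * math.log10(2)) + 1

# If each cell were merely empty or filled (a deliberate
# simplification), one worksheet would have 2**cells_per_sheet
# possible fill patterns.
pattern_digits = digits_in_power_of_two(cells_per_sheet)

print(cells_per_sheet)   # 16777216, i.e. 2**24
print(pattern_digits)
```

So a single worksheet grid is finite, but even under this crude simplification the number of distinct worksheets runs to millions of decimal digits, which is why generating them all systematically is not practical.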

I have a feeling that there is a significant difference between corpus linguistics and the style of corpus analysis I will be undertaking. Although there are strong grammatical rules and dictionaries in language, part of what is done in corpus linguistics involves looking for possibly new rules or words, and for proof of usage of older rules and words. A spreadsheet, being a computational entity in the form I will be using it, is (in theory) strongly defined and limited by the environment it is programmed in. That is, the programmers decide what the language can and cannot do. However, all programs have bugs, backdoors and hacks (especially larger programs like Excel) that allow them to do things not envisaged at the time they were designed and programmed.

Noam Chomsky [NC]