Corpus insider #3: Corpus quirks
I love using
corpus tools to research language. They throw up some fascinating results to
feed into ELT materials in all kinds of ways. They can, however, also be
infuriating at times! In this post, I thought I'd look at a couple of the
quirks that it might help to be aware of when you're using corpus
tools.
Apostrophes, contractions and negatives
Corpus tools are
great when you're just searching for strings of letters, but throw an
apostrophe into the mix and all kinds of confusion seem to ensue! That's
particularly problematic when you want to start researching anything
grammatical. Searching for contractions, including auxiliary verbs and
negatives, gets tricky, and so does searching for possessives. Different corpus
tools deal with these forms in different ways: you may be able to just type in
the whole word, apostrophe and all; you may have to separate the contraction
from the main part of the word, so she [space] 's; or the software may treat a
negative as a separate entity, so you'd need to search for could [space]
n't.
The essential
thing, though, is that you don't miss out on these forms because you constructed a
search that didn't include them. If you're looking into future forms and you're
only getting examples of will in its full form, you're missing out on
loads of 'll and won't forms that may well be significantly more
common. That means you need to check how the corpus tools you're using deal
with them - look out for any help or FAQs for guidance.
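To make the point concrete, here's a minimal Python sketch. It assumes two made-up token streams for the same three sentences: one where the tool keeps contractions whole (she'll, won't) and one where it splits them (she + 'll, wo + n't). The data and function names are hypothetical and aren't taken from any particular corpus tool; real tools may well use other conventions.

```python
# Hypothetical token streams for the same three sentences, invented for
# illustration. Real tools may use other conventions; check their help pages.
WHOLE_WORD = ["i", "will", "call", ".", "she'll", "be", "late", ".",
              "they", "won't", "mind", "."]
SPLIT_CLITICS = ["i", "will", "call", ".", "she", "'ll", "be", "late", ".",
                 "they", "wo", "n't", "mind", "."]


def count_full_form_only(tokens):
    """Naive search: count only the full form 'will'."""
    return tokens.count("will")


def count_all_will_forms(tokens):
    """Count will, 'll and won't, whichever way the contractions are split."""
    count = 0
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok in ("will", "'ll", "won't") or tok.endswith("'ll"):
            # Full form, a detached or attached 'll, or an unsplit won't.
            count += 1
        elif tok == "wo" and nxt == "n't":
            # won't split into two tokens: wo + n't.
            count += 1
    return count


for name, tokens in [("whole-word", WHOLE_WORD), ("split", SPLIT_CLITICS)]:
    print(name, count_full_form_only(tokens), count_all_will_forms(tokens))
```

In both streams the full-form search finds one hit while the inclusive search finds three, so the naive query misses two thirds of the future forms in this toy sample.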
Slippery parts of speech
Most corpora
you're likely to use will be part-of-speech (PoS) tagged. That means that the data's
been automatically analysed to tag each word with a part of speech: noun, verb,
adjective, etc. That allows you to do lots of things. It means you can search
for a word like walk and decide you only want to see verb examples or
noun examples, usually by selecting PoS from some kind of menu. It helps with
searching for lemmas - so you just type in walk and select verb and
you'll get walk, walks, walked and walking. It also
allows collocate searches to categorize and display results by part of speech.
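As a rough illustration of what a lemma-plus-PoS search does behind the scenes, here's a small Python sketch. It assumes each token is stored as a (word form, lemma, tag) triple; the sample data, the tag labels VERB and NOUN, and the function name are all made up for the example rather than taken from a real tool.

```python
from collections import Counter

# Hypothetical tagged tokens as (word form, lemma, part of speech).
TAGGED = [
    ("walked", "walk", "VERB"),
    ("walk", "walk", "NOUN"),
    ("walking", "walk", "VERB"),
    ("walks", "walk", "VERB"),
    ("walks", "walk", "NOUN"),
    ("walk", "walk", "VERB"),
    ("walked", "walk", "VERB"),
]


def search_lemma(tagged, lemma, pos):
    """Return a frequency count of word forms matching a lemma and a PoS tag."""
    return Counter(form for form, lem, tag in tagged
                   if lem == lemma and tag == pos)


verb_hits = search_lemma(TAGGED, "walk", "VERB")
noun_hits = search_lemma(TAGGED, "walk", "NOUN")
print(sorted(verb_hits))  # all verb forms of the lemma walk
print(sorted(noun_hits))  # noun forms only
```

The verb search pulls in walk, walks, walked and walking from a single query, while the noun search returns just walk and walks. Everything depends on the tags being right in the first place.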
However, part of
speech tagging isn't perfect. Sometimes - well, quite often actually! - the
tech just gets it wrong. So in any search where a word has multiple parts of
speech, you're likely to get a few odd examples cropping up, especially where
the grammatical clues are misleading. In one text I was looking at, the
word weather was preceded by to in the sentence "Ice is very sensitive to
weather". That made it look like an infinitive (to + verb), so the software
incorrectly categorized it as a verb.
These occasional tagging errors aren't generally significant, but much more
problematic are words that are just difficult to classify. That's particularly
true of words that come before nouns. We tend to think of words that pre-modify
nouns as adjectives, but technically lots of them aren't. For example, we can
have nouns before nouns: an evening dress, a window cleaner, the table
decorations. These don't cause massive problems in corpus searches, but they're
worth noting as they don't often get a look-in in ELT materials.
Far more
difficult are verb participles. Think about these examples - which of the
-ing and -ed words are adjectives?
a boring meeting
another satisfied customer
an increasing number
the desired effect
a neatly folded piece of paper
It's just one of
those features of English that's annoyingly problematic for anyone who's
dealing with learners and doesn't want to get into over-technical jargon. If
you look in a learner's dictionary, you'll likely find the common ones (boring/bored,
tiring/tired, exciting/excited, etc.) listed as adjectives, but the less
common ones (increasing, desired) may or may not be there. It's an
editorial decision where the cut-off point is for these. It's unsurprising,
then, that corpus tools often struggle to handle them. Often, they just don't
recognize them as adjectives; instead, they get lumped in with the verb lemma.
That has a number of consequences:
- You may not be able to search for increasing as an adjective. Of course, you can usually search for the exact word form instead of the lemma, but that will leave you with a mix of adjectival and verbal examples.
- You can't always do collocate searches for these words, because the tools don't recognize them as adjectives.
- When you do collocate searches, you need to take into account that some of the collocates might be in the 'wrong' places, so participle adjectives will often show up (in their base form) in the verb column. If I search for temperature, for example, I might get verb collocates that include lower, withstand and measure, which are all genuine verb collocates, but then I also get rise, operate and desire. When I click through, rise turns out to be a mix of rising temperatures and the temperature rose, whereas desire and operate are always modifying the noun: the desired temperature and operating temperature.
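The kind of "click through and check" described in that last point can be sketched in Python. This assumes you've exported the hits behind each verb collocate as (word form, lemma, position relative to the noun) triples; the data, the function name and the simple -ing/-ed test are all hypothetical. It's a rough heuristic for spotting candidates, not a replacement for reading the examples.

```python
from collections import defaultdict

# Hypothetical hits for "verb" collocates of temperature:
# (word form, lemma, offset from the noun; -1 means directly before it).
HITS = [
    ("desired", "desire", -1),
    ("desired", "desire", -1),
    ("operating", "operate", -1),
    ("rising", "rise", -1),
    ("rose", "rise", 1),
    ("measure", "measure", -2),
    ("lower", "lower", -2),
]


def likely_premodifiers(hits):
    """Flag lemmas whose hits are all -ing/-ed forms directly before the noun."""
    by_lemma = defaultdict(list)
    for form, lemma, offset in hits:
        by_lemma[lemma].append((form, offset))
    flagged = []
    for lemma, occurrences in by_lemma.items():
        if all(offset == -1 and form.endswith(("ing", "ed"))
               for form, offset in occurrences):
            flagged.append(lemma)
    return sorted(flagged)


print(likely_premodifiers(HITS))
```

With this sample, desire and operate are flagged as probable participle adjectives, while rise is left alone because it also turns up as a genuine verb after the noun (the temperature rose).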
Embracing quirkiness
So what does all
this awkward quirkiness mean for the average corpus user? Well, it doesn’t mean
you should give up using corpora or stop trusting corpus results. It just means
you have to familiarize yourself with how your corpus tools deal with the odd
stuff and be on the look-out for apparent anomalies, like mis-tagged parts of
speech and unlikely collocates. And as you get more familiar with the tool
you’re using, you’ll get to know its quirks and, more importantly, how to get
around them or take them into account.