The occasional ramblings of a freelance lexicographer

Wednesday, May 02, 2018

Corpus insider #2: Frequency & typicality

Corpora are really great for checking collocations: words that are typically used together. Collocation's a really important aspect of language and a vital part of language teaching if we want to help students avoid 'doing' obvious mistakes. As expert speakers, we generally have a feel for an individual word's most typical collocates, but when you're writing materials, it's easy to get a particular combination stuck in your head or to start doubting your intuitions - do we say get a bus or take a bus? The more you say it to yourself, the sillier each one starts to sound. A bit of outside evidence can be really helpful.
If you want to use a corpus to check out collocations though, it's important to understand a few basics about the statistics behind what the corpus tools are showing you and what type of collocations might be appropriate for the materials you're writing.

Frequent vs Typical

The most important distinction to get to grips with is the difference between frequent collocations and typical or significant or strong collocations. Most corpus tools will show you which words most commonly co-occur just based on raw frequency, but some tools will also have an option to rank collocates by strength of attraction, shown as a score. That is, the software will take into account not just how often two words occur together, but how likely that combination is based on the relative frequency of the two items. So the chance of two very frequent words occurring together is quite high, and the combination is therefore often fairly predictable and uninteresting. If you look, for example, at the raw frequencies for words which modify the noun car, you'll come across a whole load of very common adjectives - new car, old car, small car, first car, other cars, etc. That doesn't really tell you an awful lot about language. Most students could probably guess these combinations. But if you rearrange the collocates by significance, combinations like electric car, sports car, rental car and police car start rising to the top, along with some cars that aren't even cars, like cable car. They're clearly much more interesting from a linguistic perspective, much less predictable and much more what we think of when we talk about teaching collocation. See this Sketch Engine blog post for more about this and more examples (although I kind of disagree with its conclusions re. language teaching!).
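To make the difference between the two rankings concrete, here is a minimal sketch in Python. It assumes the score is logDice, which is the measure Sketch Engine's documentation describes for its word sketches; the post itself doesn't say which statistic is behind the numbers, so treat that as an assumption. All the counts are invented purely for illustration and don't come from any real corpus.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice association score: 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: how often the two words occur together
    f_x, f_y: how often each word occurs overall
    """
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Invented toy counts, NOT taken from any real corpus.
NODE_FREQ = 100_000  # total occurrences of the noun "car" (hypothetical)

collocates = {
    # collocate: (co-occurrences with "car", total occurrences of the collocate)
    "new": (5_000, 2_000_000),
    "old": (2_000, 800_000),
    "electric": (1_500, 60_000),
    "rental": (1_200, 20_000),
}

# Ranking 1: raw frequency of the combination.
by_frequency = sorted(
    collocates, key=lambda w: collocates[w][0], reverse=True
)

# Ranking 2: strength of attraction, which also takes into account
# how common the collocate is in the corpus as a whole.
by_score = sorted(
    collocates,
    key=lambda w: log_dice(collocates[w][0], NODE_FREQ, collocates[w][1]),
    reverse=True,
)

print(by_frequency)
print(by_score)
```

With these made-up figures, new and old come out on top by raw frequency, but rental and electric overtake them by score, because new and old turn up next to almost any noun and so have far less of a special relationship with car.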

[Screenshot: collocates ranked by frequency (the underlined number). Source: Sketch Engine, English Web 2013 corpus]
[Screenshot: collocates ranked by score (the number on the right). Source: Sketch Engine, English Web 2013 corpus]

When you want typical

I started off using corpora as a lexicographer working on learner's dictionaries. In a dictionary, you want to show the range of a word and its usage, so looking at typical collocates is a great starting point for getting a feel for a word. It helps you to tease out different senses - like the AmE sense of car meaning carriage, as in rail car, train car, freight car, etc. - to identify possible compounds, phrases and idioms - car park, car pool, get car sick - and to pick out some of the most significant collocates you might want to exemplify and perhaps highlight.

The less obvious but typical collocations are important in teaching materials too, especially when an unpredictable collocation is also very frequent, like catch a bus or board a plane, which score highly on both types of measure. The typical collocations aren't, however, always what we want to focus on.

When you want vanilla

Many dictionary entries, especially for more frequent words, will start with what's known as a 'vanilla' example. That is, a simple example that illustrates the basic meaning of the word in a context that's authentic but doesn't contain other elements that distract from the word being exemplified. Information about less obvious collocations, phrases or colligational patterns will come later. So the Cambridge Dictionaries entry for car has the following example sentences:

They don't have a car. (the 'vanilla' example - 'have' is actually one of the top collocating verbs by raw frequency, but it's unremarkable)
Where did you park the car? ('park' is a more interesting collocate)
It's quicker by car.
a car chase/accident/factory

The same principle holds for many other teaching contexts.

When you're introducing potentially new vocabulary items, you want students to focus on those new words. Of course, you want to present them in a realistic context with appropriate collocates, but you don't want to overwhelm the student with extra information and especially not with collocates that are well above the level of the original target word. So if I was, say, teaching car for the first time, I probably wouldn't throw in sports car or rental car, but it might be appropriate to add a bit of variety to the material with simple combinations like new car or small car. Only later when car was a familiar vocabulary item might I want to extend students' range to talk about other types of cars as appropriate contexts cropped up.

When frequent isn't necessarily obvious

A particularly tricky case in English is the set of 'delexical' verbs (make, do, take, get, have, put, give, etc.) which are all incredibly frequent, but for a learner of English, not at all obvious in terms of which to choose. If we go back to what we do with buses, by far the most frequent collocating verb is take. If you look at collocates by frequency, it's right at the top for most corpora. If you switch to order collocates by significance though, because it's a very common verb, it drops way down the order to be replaced by board, ride, catch, park and drive. Obviously, that doesn't mean that we don't need to teach take the bus because it'll be obvious to our students … because it won't!

Weighing up the numbers

So what does all this mean? Which statistics should we be looking at? Well, the answer is probably both. When I'm researching the collocates of a word, I'll flick between both types of ranking to get an overall picture of how the word works, then make my choices based on the teaching context.
  • If I'm looking for a natural example for a new vocab item, I'll probably look at raw frequencies to find a collocate that's common but not distracting.
  • If a collocate, like catch a bus, is high on both scores, it's probably worth teaching, and maybe highlighting, early on.
  • If I'm looking to extend students' range and get them to use familiar words in more varied ways, then I'll investigate the more interesting collocates that come up when ranked by score.
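
If you can export a collocate list with both numbers, the flicking back and forth described above can be roughed out in a few lines of Python. This is only a sketch: the function name, the top_n cut-off and all the figures for verbs used with bus are made up for illustration, not taken from any corpus or tool.

```python
def compare_rankings(rows, top_n=3):
    """Sort collocates into three groups based on two rankings.

    rows: list of (collocate, raw_frequency, score) tuples.
    top_n: how far down each ranking to look (an arbitrary choice).
    """
    top_freq = {
        w for w, _, _ in sorted(rows, key=lambda r: r[1], reverse=True)[:top_n]
    }
    top_score = {
        w for w, _, _ in sorted(rows, key=lambda r: r[2], reverse=True)[:top_n]
    }
    return {
        "both": sorted(top_freq & top_score),           # worth teaching early
        "frequent_only": sorted(top_freq - top_score),  # common, 'vanilla'
        "typical_only": sorted(top_score - top_freq),   # for extending range
    }


# Invented figures for verbs used with "bus" (hypothetical).
rows = [
    ("take", 9000, 5.1),
    ("catch", 4000, 8.9),
    ("get", 3500, 4.2),
    ("board", 1200, 9.4),
    ("ride", 1000, 9.0),
    ("have", 800, 2.0),
]

groups = compare_rankings(rows)
print(groups)
```

With these invented numbers, catch lands in the "both" group, take and get are frequent only, and board and ride are typical only. The groups are just a starting point, of course; which collocates actually make it into the material still depends on the teaching context.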

A note about data

Finally, as ever with corpora, it’s also important to know what data you’re looking at. As I mentioned in my last corpus insider post, most corpora are made up of predominantly written data and, of course, that’s going to affect the type of results you get back. So, going back to my query at the start of this post about get the bus vs. take the bus, most of the corpora I looked at listed take as a top collocate by frequency, but get, which felt more natural to me, was much further down the lists (both by score and raw frequency). When I looked at the Spoken BNC2014 (a corpus of contemporary spoken British English) though, suddenly get the bus rocketed to the top, suggesting it's something we say, but maybe write slightly less often.
