Corpus frequencies: what exactly counts?
Senses, idioms and phrasal verbs
Following on from my last post, prompted by Michael Rundell’s webinar Tweets, blogs and corpora: How computer technology helps us make better dictionaries, there was one more question from a webinar participant which I think opens up a whole area of corpus research, and of word frequencies as shown in dictionaries, that tends to get glossed over.
“Do these numbers [corpus frequencies] consider all the meanings of a word or only the common ones?”
Response from another participant:
“I’d guess that the counting program [the corpus software] doesn’t understand the meaning so it is for all meanings of the word.”
An astute question and a correct answer! Corpus software is very clever at number crunching and identifying patterns, but computers still fall down when it comes to actually understanding language. When you do a corpus search, you can choose the part of speech you’re interested in (separating out noun and verb uses of a word like walk, for example) and you can search for a ‘lemma’ rather than just a string of letters (so searching for the verb walk will include walk, walks, walked and walking). When it comes to differentiating between different senses or uses of a word though, that can still only be done “by hand”, by a human being sorting through a sample of corpus lines one by one. Sometimes, where one sense is overwhelmingly more frequent, the sense frequencies are obvious at a glance. In other cases, especially with very polysemous words, it’s a trickier business. Thus, sense ordering by frequency is, to a degree, impressionistic and doesn’t involve exact statistics.
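To make the difference between a string search and a lemma search concrete, here is a minimal sketch in Python. It is not how any particular corpus tool works; the tiny corpus, its tags and the function names are all invented for illustration, and a real corpus would be tagged automatically rather than typed in by hand.

```python
# A tiny, invented corpus, already tagged as (word form, lemma, part of speech).
# The sentences and tags are made up for illustration only.
TAGGED_TOKENS = [
    ("She", "she", "PRON"), ("walks", "walk", "VERB"), ("to", "to", "ADP"),
    ("work", "work", "NOUN"),
    ("We", "we", "PRON"), ("walked", "walk", "VERB"), ("home", "home", "ADV"),
    ("A", "a", "DET"), ("long", "long", "ADJ"), ("walk", "walk", "NOUN"),
    ("They", "they", "PRON"), ("were", "be", "AUX"), ("walking", "walk", "VERB"),
    ("I", "i", "PRON"), ("walk", "walk", "VERB"), ("daily", "daily", "ADV"),
]


def count_string(tokens, form):
    """Count exact matches of a string of letters, ignoring case."""
    return sum(1 for word, _lemma, _pos in tokens if word.lower() == form)


def count_lemma(tokens, lemma, pos):
    """Count every inflected form of a lemma with the given part of speech."""
    return sum(
        1
        for _word, tok_lemma, tok_pos in tokens
        if tok_lemma == lemma and tok_pos == pos
    )


# The plain string "walk" mixes the noun and the verb and misses walks/walked/walking.
string_hits = count_string(TAGGED_TOKENS, "walk")
# The lemma search picks up all the verb forms and leaves the noun out.
verb_hits = count_lemma(TAGGED_TOKENS, "walk", "VERB")
noun_hits = count_lemma(TAGGED_TOKENS, "walk", "NOUN")
```

Note what the sketch cannot do: nothing in the tags says which sense of walk is meant in each line. That part is still the job of a person reading the lines.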
Does this matter? Again, my answer is “not really”, provided we’re only taking frequency information in a dictionary as a general guide. For many words, the most frequent sense(s) of a word will probably account for the majority of its occurrences, so it’s fair to say that overall its core meaning(s) will fall within a general frequency band. It’s unlikely that in many cases there will be lots of obscure senses of a word that significantly distort the frequency statistics.
Where caution may be required though is where it’s the less frequent senses you’re actually interested in. To take an example I came across recently working on EAP vocabulary, if you do a corpus search for chemist, physicist and biologist, chemist comes up as much more frequent – as reflected in most of the learner’s dictionaries. Now that isn’t because there are more scientists studying chemistry than there are physics or biology. But of course, in British English at least, a chemist can be a pharmacy or a pharmacist as well as a scientist in a lab with their test tubes.
And it’s not just the effect that different senses of a word might have on frequency that needs considering. Another big area to take into account is words that form part of a phrase of some kind. Going back to my example of walk, it crops up in various phrases or idioms – walk the walk, run before you can walk, walk free, etc. – and a whole list of phrasal verbs – walk away with, walk in on, walk off, walk out … In most learner’s dictionaries, these come at the end of the entry for the headword and in most of the major dictionaries (I think with the exception of Cambridge), they don’t have frequency information in their own right. Instead, they get lumped into the overall frequency for the whole entry. This has two consequences: firstly, it means that learners can’t see which phrases and phrasal verbs are most frequent, and secondly, it further undermines the frequency information for some words, as we can’t be certain what it’s referring to. Take the verb deal as an example, highlighted as frequent in most dictionaries. In fact, something like 85% of occurrences of the verb deal are actually instances of the phrasal verb deal with. Yet, in most dictionaries, it appears that the basic verb senses (giving out cards or drugs) are common, while the phrasal verb deal with has no highlighting at all.
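The deal example can be sketched as a small calculation: take a sample of corpus lines, label each one by hand, and see what share of the headword’s frequency each use accounts for. The labels below are invented (20 lines, chosen simply to mirror the rough 85% figure above), and the headword frequency of 100 per million is a hypothetical number, not taken from any dictionary or corpus.

```python
from collections import Counter

# Invented hand labels for 20 corpus lines containing the verb "deal".
# The proportions are chosen to mirror the rough 85% figure, not measured.
SAMPLE_LABELS = ["deal with"] * 17 + ["deal (cards)"] * 2 + ["deal (drugs)"] * 1


def usage_shares(labels):
    """Return each use's share of the hand-labelled sample."""
    counts = Counter(labels)
    total = len(labels)
    return {use: n / total for use, n in counts.items()}


def split_frequency(headword_per_million, shares):
    """Split one lumped headword frequency across its uses."""
    return {use: headword_per_million * share for use, share in shares.items()}


shares = usage_shares(SAMPLE_LABELS)
# Hypothetical lumped frequency for the whole entry, in occurrences per million words.
per_use = split_frequency(100, shares)
```

With these made-up numbers, almost all of the entry’s frequency belongs to deal with, and the basic senses that head the entry account for only a sliver. The estimate is only as good as the sample and the person labelling it, which is why such figures are better treated as a rough guide than as exact statistics.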
It is possible to construct corpus searches to find particular phrases or phrasal verbs, even where their form varies slightly – for example, where phrasal verbs have moveable particles. So it is possible to get frequency information for them, albeit not as simply or reliably as for single words. And if you look in specialist dictionaries of phrasal verbs or idioms, you’ll often find the most common ones highlighted. So why don’t most general learner’s dictionaries include this information? Well, firstly, it’s very time-consuming to research and secondly, it isn’t easy within a traditional dictionary format to devise a system that encompasses frequency information for both whole words (with senses lumped together) and individual usages in the form of phrases and phrasal verbs.
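As a rough illustration of a search that allows for a moveable particle, here is a sketch using a plain regular expression in Python. Real corpus tools have their own query languages and work over tagged text, so this is only an approximation of the idea; the function name, the `max_gap` setting and the example sentences are all my own inventions.

```python
import re


def phrasal_verb_pattern(verb_forms, particle, max_gap=2):
    """Build a pattern for a verb form followed by a particle,
    allowing up to max_gap words in between (walk off, walked it off)."""
    forms = "|".join(re.escape(form) for form in verb_forms)
    return re.compile(
        rf"\b(?:{forms})\b(?:\s+\w+){{0,{max_gap}}}?\s+{re.escape(particle)}\b",
        re.IGNORECASE,
    )


def count_lines(lines, pattern):
    """Count the lines that contain at least one match."""
    return sum(1 for line in lines if pattern.search(line))


WALK_OFF = phrasal_verb_pattern(["walk", "walks", "walked", "walking"], "off")

# Invented example lines, not real corpus data.
LINES = [
    "He walked off without a word.",
    "She walked the dinner off.",
    "Walk it off!",
    "They walk to the office.",
    "We walked home in the rain.",
]

hits = count_lines(LINES, WALK_OFF)
```

A pattern like this shows why the results are less reliable than for single words. It has no idea about grammar, so it will happily match a verb and a particle that just happen to sit near each other, and widening `max_gap` to catch longer objects lets in more of those false hits. The matches still need a human eye.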
So having completely ripped apart the frequency information in dictionaries, am I saying that it’s useless and should be ignored? No, far from it! As a broad guide to which words are generally more frequent (and so worth focusing on), I still think it’s an incredibly useful tool. But as in any area of life, statistics should always be approached critically, and before you rely too much on them, you need to understand what’s behind them, how they’re compiled and what caveats you might need to take into consideration.