Corpus insider #4: The problem with polysemy
It's a bit of a
standing joke that every talk I give includes the word polysemy, but it's such an important concept to bear in mind when
you're looking at language in any context and especially for any corpus
research. Recently, I gave a talk to students at Goldsmiths, University of
London, about careers in linguistics. I wanted to give them a taste of both
corpus research and lexicography, so I put together a small set of corpus lines
for them to look at to tease out the different senses of a word and organize
them into a dictionary entry.
Whilst it's
possible to do a corpus search for a specific lemma (e.g. rest as a verb: rest, rests,
rested, resting; or rest as a
noun: rest, rests), with reasonably
reliable (if not 100% accurate) results, corpus tools can't distinguish between the different
senses or uses of a polysemous word. If you think about the noun rest, which sense immediately springs to
mind? It's one of those words that highlights the difference between our intuitions
and the realities of usage. Quite likely, the first sense you thought of was to
do with 'taking a break or time to relax'. In fact, the rest (of) meaning 'what's remaining' or 'the others' is something
like three times as frequent.
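To make that concrete, here's a minimal sketch in Python of what a lemma search boils down to. The four example sentences and the names REST_FORMS and find_hits are invented for illustration, and this isn't meant to represent how any particular corpus tool works under the bonnet.

```python
import re

# Hypothetical mini-corpus, invented for illustration only.
LINES = [
    "She spent the rest of the day in the library.",
    "You should get some rest before the exam.",
    "The rest of the team arrived later.",
    "He rests his case.",
]

# Surface forms of the lemma "rest" (the noun and verb forms overlap).
REST_FORMS = re.compile(r"\b(rest|rests|rested|resting)\b", re.IGNORECASE)


def find_hits(lines):
    """Return (line, matched form) pairs for every hit on the lemma."""
    return [
        (line, match.group(0).lower())
        for line in lines
        for match in REST_FORMS.finditer(line)
    ]


hits = find_hits(LINES)
for line, form in hits:
    print(f"{form:8} {line}")
```

Every line comes back as a hit, but nothing in the results says whether it's the 'remaining' sense, the 'relax' sense or something else entirely.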
When
lexicographers are working with a corpus to put together a dictionary entry,
determining the sense division and ordering of senses is a manual process. You
can get a flavour of a word by looking at its collocates (for example, using
WordSketch in Sketch Engine), but that only tells part of the story - you'll
find the ‘relax’ sense of rest has
many more strong collocates than the duller, more functional the rest of.
Section of a WordSketch for rest (noun) - English Web 2015 via Sketch Engine
You can sort
concordance lines to the left and right of the node word and you start to see
the patterns emerge (here, the rest of
becomes very obvious). But ultimately, you just have to go through a sample of
cites manually to establish the different senses and uses (including as part of
phrases), and the frequency order. The actual statistical frequency of a
particular sense is almost impossible to determine in most cases, not least
because, for many words, there are senses which overlap and examples that are
ambiguous.
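As a rough sketch of that manual process, suppose you've read through a small sample of cites and given each one a sense label by hand. The ten lines and the labels below ('remainder', 'relax', 'ambiguous') are invented for illustration, not real corpus data, and sense_order is just a hypothetical helper name. All the code does is tally the labels to suggest an ordering.

```python
from collections import Counter

# Hypothetical, hand-labelled sample of cites for the noun "rest".
SAMPLE = [
    ("spent the rest of the afternoon reading", "remainder"),
    ("the rest of the team arrived later", "remainder"),
    ("ate the rest of the cake", "remainder"),
    ("for the rest of her life", "remainder"),
    ("the rest of the country", "remainder"),
    ("two stayed and the rest went home", "remainder"),
    ("get some rest before the exam", "relax"),
    ("a well-earned rest", "relax"),
    ("the doctor ordered complete rest", "relax"),
    ("they needed the rest", "ambiguous"),
]


def sense_order(sample):
    """Return (sense, count, share of sample) tuples, most frequent first."""
    counts = Counter(label for _, label in sample)
    total = len(sample)
    return [(sense, n, n / total) for sense, n in counts.most_common()]


for sense, n, share in sense_order(SAMPLE):
    print(f"{sense:10} {n:2}  {share:.0%}")
```

The shares only describe the sample you happened to label, and the 'ambiguous' pile is a reminder of why a precise figure for each sense is so hard to pin down.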
So what are the
practical implications of this?
Dictionary frequency information: A number of learner’s dictionaries (Collins
COBUILD, Macmillan, Longman) provide information about the frequency of a word
using a system of stars or dots. Whilst this is useful in giving you a
ball-park guide to more and less frequent words, the ratings are based on the
frequency of the whole word, not the
individual senses. For some words, all the senses may be relatively high
frequency, while in other cases, the first sense(s) may be high frequency and
others quite obscure.
Phrases: It is possible to find the frequency of many phrases with carefully
constructed corpus searches, but phrases with variable elements and those
containing very common words (such as phrasal verbs) which could co-occur in
different ways are much trickier to pin down. For that reason, they’re not
generally allocated their own frequency information and just get lumped in with
the individual headwords.
Word lists: Many frequency-based word lists also don’t take into account the
different senses of a word and their relative frequency. Unless words on the list come with definitions
attached, it’s difficult to know whether they just refer to the most frequent
sense or to other senses as well.
Text analysis tools: Tools such as Text Inspector or Lextutor, which allow you to
input a text and get a breakdown of the words by frequency or by their level in a
resource such as the English Vocabulary Profile (EVP), will generally allocate words according to
their overall frequency or most frequent sense. So, an obscure sense of a
common word, such as leg in the
context of a cricket match (see sense 5 here), will likely be labelled as high frequency. The paid
version of Text Inspector does allow the user to choose the relevant sense of a
word from a drop-down menu when looking at EVP labels, but it doesn’t offer
off-list options (including the cricketing sense of leg, which it just labels as A1) or allow you to allocate words to phrases that haven’t been
automatically detected.
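Here's a minimal sketch of why a word-level lookup behaves like this. It's an assumption about the general approach, not a description of how Text Inspector or Lextutor are actually built. Apart from leg being labelled A1, which is mentioned above, the bands in WORD_BANDS are made up, and profile is a hypothetical function name.

```python
import re

# Hypothetical word-level bands: one label per word form, whatever the sense.
WORD_BANDS = {"he": "A1", "the": "A1", "ball": "A1", "to": "A1", "leg": "A1"}


def profile(text, bands=WORD_BANDS):
    """Label each token by looking up its form only; context is ignored."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [(token, bands.get(token, "off-list")) for token in tokens]


result = profile("He glanced the ball to leg.")
for token, band in result:
    print(f"{token:8} {band}")
```

Because the lookup only sees the form leg, the cricketing sense gets the same label as the body part.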
So, does this
mean that all these tools are completely useless? Of course not. In many
cases, we’re using frequency information as a rough guide, so finer sense
distinctions don’t come into play. Like anything though, it’s important to know
the limitations of the sources and tools you use and to be on the look-out for
anything that doesn’t seem quite right.