Lexicoblog

The occasional ramblings of a freelance lexicographer

Monday, January 07, 2019

2018: Themes of the year


It's that time when you find yourself looking back on the past 12 months and ahead to the coming year. To be honest, 2018 wasn't the easiest year for me workwise. Through the first half of the year, I struggled with work as my chronic pain condition went through a particularly bad patch. This led to me taking two months off through the summer to rest and recover. It really helped from a health perspective, but meant a big financial hit. Then over the last few months of the year, I had the frustration of projects being delayed and cancelled, with more lost income and my cash-flow at less than a trickle!

Those things aside, it was a good year for ideas. Two of my highlights of 2018 were conference talks which reflect two of the themes of my professional year.

IATEFL: vocabulary learning and teaching

After many years of doing talks on behalf of publishers, I decided to submit my own proposal - Wordlists: snog, marry, avoid? - for IATEFL 2018 (summary here). When you put together a talk for a publisher, it's usually based on a project you've been working on, so apart from deciding on what angle to take, the content is generally fairly straightforward. Planning my own talk was a very different proposition. I had a few ideas floating around my head about vocab-related themes I’d like to tackle, but settling on a specific topic and then deciding exactly what to include was trickier. 



Over the past couple of years, I’ve been getting more interested in the principles behind vocabulary learning and teaching, and planning my talk sent me into a new flurry of reading and thinking (often in cafes and also on a rather lovely reading retreat). Vocabulary has long been my ‘thing’ and I’ve dipped into theory and research over the years and, of course, built up lots of accumulated knowledge from experience of working with vocab day in, day out.  I always felt my wider knowledge was a bit patchy though and I didn’t want to stand up in front of a roomful of ELT experts with a load of gaping holes in my arguments!  Although I know there’s still masses out there to read and digest, I do feel like I’ve now filled a few gaps and joined up a few dots. More importantly, perhaps, I feel like I’ve got something to say in my own right, which has been a bit of a revelation.

All my mulling over of vocab-related stuff led onto another talk about the principles I try to apply when I’m writing vocab materials at the joint MaWSIG/Oxford Brookes event in June (summary here) – another great event and lovely to get such a positive response from my peers, thanks guys :)

And I’ve still got lots of vocab-related ideas whirring around, so I think there’s more to come if I can just find the right outlets …

IVACS: corpus research

This time last year, I was at a bit of a turning point in my ELT career and I decided I needed to refocus on the areas of ELT that interest me most (see posts here and here). One of those areas was corpus research and it’s something that I have managed to get more involved in over the past year or so, with corpus research work for a couple of different publishers and, rather excitingly, my first talk at a corpus linguistics conference in Malta in June.


Unlike the ELT events I’m familiar with, corpus linguistics conferences tend to be much more academic affairs. So, although I felt confident that I had some interesting stuff to talk about (summary here), I wasn’t 100% certain about the reception I’d get from an audience of academics. Much to my relief, no one questioned my methodology or picked up on my lack of a reference list! In fact, many of the people I spoke to were quite excited to meet someone who actually does corpus research ‘in the real world’ and I had lots of great conversations with a wonderful range of fascinating people. It’s definitely a world I’d like to stay in touch with, and with a couple of new and interesting pieces of research under my belt this year, it’s something I’d like to talk more about … if I can find a way to fund it …

The cold, hard economics of it all

Although IATEFL and IVACS were highlights of my professional year, both were largely self-funded and, together with another couple of events, ate up a lot of cash which I didn’t really have to spare given the aforementioned patchy workflow and lack of income.

So this year, I’ve had to rule out going to events unless I have a sponsor to help out with costs. Luckily, I already have two conferences – the English UK academic conference in London on 19 Jan and TESOL Spain in Oviedo in March – lined up with backing from event organizers/publishers and maybe another one in the pipeline for the summer. I’ve had to cross another couple that I’d hoped to speak at off my list because I couldn’t get any backing, which is sad, but hey.

Perhaps more importantly for 2019, as well as the little inspirational blips that conferences provide, I need to refocus on the day-to-day work at my desk to pay the bills. January has got off to a busy start with one project finishing up and another quickie writing job in progress, but my schedule for February onwards is looking worryingly empty. I’d really like to see all my investment in reading and thinking and talking at events translate into interesting writing projects where I can put some of those ideas into practice.


Sunday, December 02, 2018

Corpus insider #4: The problem with polysemy


It's a bit of a standing joke that every talk I give includes the word polysemy, but it's such an important concept to bear in mind when you're looking at language in any context and especially for any corpus research. Recently, I gave a talk to students at Goldsmiths, University of London, about careers in linguistics. I wanted to give them a taste of both corpus research and lexicography, so I put together a small set of corpus lines for them to look at to tease out the different senses of a word and organize them into a dictionary entry.

Whilst it's possible to do a corpus search for a specific lemma (e.g. rest as a verb: rest, rests, rested, resting; or rest as a noun: rest, rests) with reasonably reliable (if not 100%) results, corpus tools can't distinguish between the different senses or uses of a polysemous word. If you think about the noun rest, which sense immediately springs to mind? It's one of those words that highlights the difference between our intuitions and the realities of usage. Quite likely, the first sense you thought of was to do with 'taking a break or time to relax'. In fact, the rest (of) meaning 'what's remaining' or 'the others' is something like three times as frequent.
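
If you fancy poking at this yourself, here's a minimal Python sketch of what corpus software can and can't count. I'm using NLTK's copy of the Brown corpus purely because it's freely available (my figures above come from much bigger corpora, so don't expect the proportions to match), and matching 'the rest' is only a crude stand-in for the 'remainder' sense:

    import nltk
    nltk.download("brown", quiet=True)
    from nltk.corpus import brown

    # Counting the *form* is easy...
    words = [w.lower() for w in brown.words()]
    total = words.count("rest")

    # ...but a sense can only be approximated; here, 'remainder' uses
    # are guessed at by checking whether "the" precedes the node word
    remainder = sum(1 for i, w in enumerate(words)
                    if w == "rest" and i > 0 and words[i - 1] == "the")

    print(f"'rest' occurs {total} times; roughly {remainder} look like 'the rest'")
    # Every other hit still has to be read in context to assign a sense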

When lexicographers are working with a corpus to put together a dictionary entry, determining the sense division and ordering of senses is a manual process. You can get a flavour of a word by looking at its collocates (for example, using WordSketch in Sketch Engine), but that only tells part of the story - you'll find the ‘relax’ sense of rest has far more strong collocates than the duller, more functional the rest of.

Section of a WordSketch for rest (noun) - English Web 2015 via Sketch Engine
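
As a rough illustration of what a collocate list is built from, here's a crude windowed count, reusing the words list from the sketch above. Bear in mind that a real WordSketch works from grammatical relations and statistical scores (like logDice), not raw counts, so treat this as illustration only:

    from collections import Counter

    window = 3  # look 3 tokens either side of the node word
    collocates = Counter()
    for i, w in enumerate(words):
        if w == "rest":
            span = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
            collocates.update(span)

    print(collocates.most_common(15))
    # Expect "the" and "of" to dominate: the functional 'the rest of'
    # swamps the raw counts even though it looks dull in a WordSketch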

You can sort concordance lines to the left and right of the node word and you start to see the patterns emerge (here, the rest of becomes very obvious). But ultimately, you just have to go through a sample of cites manually to establish the different senses and uses (including as part of phrases), and the frequency order. The actual statistical frequency of a particular sense is almost impossible to determine in most cases, not least because, for many words, there are senses which overlap and examples that are ambiguous.
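
Sorting is easy to mimic too. Again reusing the words list from above, this sketch builds bare-bones concordance lines and sorts them on the first word to the right of the node - which is exactly what makes the rest of leap out on screen:

    def kwic(tokens, node, width=4):
        # Collect (left context, right context) pairs for every hit
        lines = []
        for i, w in enumerate(tokens):
            if w == node:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((left, right))
        return lines

    # Sort on the right context (R1 first), as corpus tools do
    for left, right in sorted(kwic(words, "rest"), key=lambda l: l[1])[:15]:
        print(f"{left:>35} | rest | {right}")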

So what are the practical implications of this?

Dictionary frequency information: A number of learner’s dictionaries (Collins COBUILD, Macmillan, Longman) provide information about the frequency of a word using a system of stars or dots. Whilst this is useful in giving you a ball-park guide to more and less frequent words, the ratings are based on the frequency of the whole word, not the individual senses. For some words, all the senses may be relatively high frequency, while in other cases, the first sense(s) may be high frequency and others quite obscure.

Phrases: It is possible to find the frequency of many phrases with carefully constructed corpus searches, but phrases with variable elements, and those containing very common words (such as phrasal verbs) which could co-occur in different ways, are much trickier to pin down - there's a sketch of the problem below. For that reason, they’re not generally allocated their own frequency information and just get lumped in with the individual headwords.

Word lists: Many frequency-based word lists also don’t take into account the different senses of a word and their relative frequency.  Unless words on the list come with definitions attached, it’s difficult to know whether they just refer to the most frequent sense or to other senses as well.

Text analysis tools: Tools such as Text Inspector or Lextutor, which allow you to input a text and get a breakdown of the words by frequency or by EVP level, will generally allocate words according to their overall frequency or most frequent sense. So, an obscure sense of a common word, such as leg in the context of a cricket match (see sense 5 here), will likely be labelled as high frequency. The paid version of Text Inspector does allow the user to choose the relevant sense of a word when looking at EVP labels from a drop-down menu, but it doesn’t offer off-list options (including the cricketing sense of leg, which it just labels as A1) or allow you to allocate words to phrases that haven’t been automatically detected.
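
And here's that promised sketch of why phrases resist counting. I'm using plain regular expressions over the Brown tokens from earlier rather than a proper corpus query language, and give up is just my example - even a simple phrasal verb needs separate, leaky patterns for its contiguous and split forms:

    import re

    text = " ".join(words)  # flat text built from the Brown tokens above

    # Contiguous: "give up", "gave up", "giving up", ...
    contiguous = re.findall(r"\b(?:give|gives|gave|giving) up\b", text)

    # Split by a short object: "give it up", "gave the idea up"
    split = re.findall(r"\b(?:give|gives|gave|giving) (?:\w+ ){1,2}up\b", text)

    print(len(contiguous), len(split))
    # The split pattern over-matches ("gave me a leg up") and misses
    # longer objects entirely - which is why phrases rarely get their
    # own frequency ratings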

So, does this mean that all these tools are completely useless? Of course not. In many cases, we’re using frequency information as a rough guide, so finer sense distinctions don’t come into play. Like anything though, it’s important to know the limitations of the sources and tools you use and to be on the look-out for anything that doesn’t seem quite right.


Tuesday, September 04, 2018

Corpus insider #3: Corpus quirks


I love using corpus tools to research language. They throw up some fascinating results to feed into ELT materials in all kinds of ways. They can, however, also be infuriating at times! In this post, I thought I'd look at a couple of the quirks that it might help to be aware of when you're using corpus tools.

Apostrophes, contractions and negatives
Corpus tools are great when you're just searching for strings of letters, but throw an apostrophe into the mix and all kinds of confusion seem to ensue! That's particularly problematic when you want to start researching anything grammatical. Searching for contractions, including auxiliary verbs and negatives, gets tricky, as does searching for possessives. Different corpus tools deal with these forms in different ways: it may be that you can just type in the whole word, apostrophe and all; you may have to separate the contraction from the main part of the word, so she [space] 's; or the software may treat a negative as a separate entity, so you'd need to search for could [space] n't.

The essential thing, though, is that you don't miss out on these forms because you constructed a search that didn't include them. If you're looking into future forms and you're only getting examples of will in its full form, you're missing out on loads of 'll and won't forms that may well be significantly more common. That means you need to check how the corpus tools you're using deal with them - look out for any help pages or FAQs for guidance.

Some useful search tips from COCA (click to view larger)
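
You can see where the confusion comes from by looking at what a tokenizer does under the hood. Here's a quick sketch using NLTK's default word tokenizer - my own example sentence, and your corpus may well tokenize differently, but the principle is the same:

    import nltk
    nltk.download("punkt", quiet=True)
    from nltk.tokenize import word_tokenize

    print(word_tokenize("She's sure it won't rain, so we'll walk."))
    # ['She', "'s", 'sure', 'it', 'wo', "n't", 'rain', ',',
    #  'so', 'we', "'ll", 'walk', '.']
    # A search for the string "won't" finds nothing: it's stored
    # as two tokens, "wo" + "n't"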

Slippery parts of speech
Most corpora you're likely to use will be part-of-speech tagged. That means that the data's been automatically analysed to tag each word with a part of speech: noun, verb, adjective, etc. That allows you to do lots of things. It means you can search for a word like walk and decide you only want to see verb examples or noun examples, usually by selecting PoS from some kind of menu. It helps with searching for lemmas - so you just type in walk and select verb and you'll get walk, walks, walked and walking. It also allows collocate searches to categorize and display results by part of speech.

A collocate search from Sketch Engine
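
In sketch form, here's roughly what tagging buys you, using the tagged version of the Brown corpus via NLTK as a stand-in for whichever tagged corpus you're actually searching:

    import nltk
    nltk.download("brown", quiet=True)
    from nltk.corpus import brown

    # A lemma-style search: match the forms, then filter on the tag
    forms = {"walk", "walks", "walked", "walking"}
    tagged = brown.tagged_words()
    verb_hits = [(w, t) for w, t in tagged
                 if w.lower() in forms and t.startswith("VB")]
    noun_hits = [(w, t) for w, t in tagged
                 if w.lower() in forms and t.startswith("NN")]

    print(len(verb_hits), "verb uses;", len(noun_hits), "noun uses")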

However, part-of-speech tagging isn't perfect. Sometimes - well, quite often actually! - the tech just gets it wrong. So in any search where a word has multiple parts of speech, you're likely to get a few odd examples cropping up, especially where the grammatical clues are misleading. In one text I was looking at, the fact that the word weather was preceded by to in the sentence "Ice is very sensitive to weather" made it look like it might be a verb, so the software incorrectly categorized it as one.

These occasional tagging errors aren't generally significant, but much more problematic are words that are just difficult to classify. That's particularly true of words that come before nouns. We tend to think of words that pre-modify nouns as adjectives, but actually lots of them aren't technically. So, we can have nouns before nouns; an evening dress, a window cleaner, the table decorations. These don't cause massive problems in corpus searches, but they're worth noting as they don't often get a look-in in ELT materials.

Far more difficult are verb participles. Think about these examples - which of the words in bold are adjectives?

a boring meeting
another satisfied customer
an increasing number
the desired effect 
a neatly folded piece of paper

It's just one of those features of English that's annoyingly problematic for anyone who's dealing with learners and doesn't want to get into over-technical jargon. If you look in a learner's dictionary, you'll likely find the common ones (boring/bored, tiring/tired, exciting/excited, etc.) listed as adjectives, but the less common ones (increasing, desired) may or may not be there. It's an editorial decision where the cut-off point is for these. It's unsurprising, then, that corpus tools often struggle to handle them. Often, they just don't recognize them as adjectives; instead, they get lumped in with the verb lemma. That has a number of consequences (there's a quick sketch after this list):
  • You may not be able to search for increasing as an adjective. Of course, you can usually search for the exact word form instead of the lemma, but that will leave you with a mix of adjectival and verbal examples.
  • You can't always do collocate searches for these words, because the tools don't recognize them as adjectives.
  • When you do collocate searches, you need to take into account that some of the collocates might be in the 'wrong' places, so participle adjectives will often show up (in their base form) in the verb column. If I search for temperature, for example, I might get verb collocates that include lower, withstand and measure, which are all genuine verb collocates, but then I get rise, operate and desire. When I click through, rise turns out to be a mix of rising temperatures and the temperature rose, whereas desire and operate are always modifying the noun: the desired temperature and operating temperature.
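
If you want to see that wobble for yourself, try running the earlier examples through NLTK's off-the-shelf tagger - any standard tagger will do, and the exact labels will vary by model, which is rather the point:

    import nltk
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)
    from nltk import pos_tag, word_tokenize

    for phrase in ["a boring meeting", "another satisfied customer",
                   "an increasing number", "the desired effect"]:
        print(pos_tag(word_tokenize(phrase)))
    # Depending on the model, the participles come back as JJ
    # (adjective), VBG or VBN (verb forms) - the same inconsistency
    # dictionaries and corpus tools wrestle with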

Embracing quirkiness
So what does all this awkward quirkiness mean for the average corpus user? Well, it doesn’t mean you should give up using corpora or stop trusting corpus results. It just means you have to familiarize yourself with how your corpus tools deal with the odd stuff and be on the look-out for apparent anomalies, like mis-tagged parts of speech and unlikely collocates. And as you get more familiar with the tool you’re using, you’ll get to know its quirks and, more importantly, how to get around them or take them into account.
