Lexicoblog

The occasional ramblings of a freelance lexicographer

Tuesday, September 04, 2018

Corpus insider #3: Corpus quirks


I love using corpus tools to research language. They throw up some fascinating results to feed into ELT materials in all kinds of ways. They can, however, also be infuriating at times! In this post, I thought I'd look at a couple of the quirks that it might help to be aware of when you're using corpus tools.

Apostrophes, contractions and negatives
Corpus tools are great when you're just searching for strings of letters, but throw an apostrophe into the mix and all kinds of confusion seem to ensue! That's particularly problematic when you want to start researching anything grammatical. Searching for contractions, including auxiliary verbs and negatives, gets tricky, as does searching for possessives. Different corpus tools deal with these forms in different ways - it may be that you can just type in the whole word, apostrophe and all; you may have to separate the contraction from the main part of the word, so she [space] 's; or the software may treat a negative as a separate entity, so you'd need to search for could [space] n't.
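
To make the "separate entity" idea concrete, here's a rough sketch in Python of how a tool might split contractions before storing them. It isn't how any particular corpus tool works; the function names split_contraction and query_variants are made up for illustration, and the splitting rules are an assumption loosely modelled on the common convention of treating n't as its own token.

```python
import re

# Hypothetical helper: split a contraction into the pieces a corpus
# tool might store as separate tokens. The rules are an assumption for
# illustration only, not the behaviour of any specific tool.
def split_contraction(word):
    # Negatives: the split falls before the n, so couldn't -> could + n't
    match = re.fullmatch(r"(\w+)(n['’]t)", word, flags=re.IGNORECASE)
    if match:
        return [match.group(1), match.group(2)]
    # Possessives and contracted auxiliaries: she's -> she + 's
    match = re.fullmatch(r"(\w+)(['’](?:s|ll|re|ve|d|m))", word, flags=re.IGNORECASE)
    if match:
        return [match.group(1), match.group(2)]
    # No apostrophe form recognised: leave the word whole
    return [word]


# Hypothetical helper: list the search strings worth trying for a word,
# i.e. the whole word and, if it splits, the spaced-out version.
def query_variants(word):
    parts = split_contraction(word)
    if len(parts) == 1:
        return [word]
    return [word, " ".join(parts)]


print(query_variants("couldn't"))  # ["couldn't", "could n't"]
print(query_variants("she's"))     # ["she's", "she 's"]
```

If the first search string returns nothing in your tool, the second is worth a try before you conclude that the form isn't in the corpus.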

The essential thing though is you don't miss out on these forms because you constructed a search that didn't include them. If you're looking into future forms and you're only getting examples of will in its full form, you're missing out on loads of 'll and won't forms that may well be significantly more common. That means you need to check how the corpus tools you're using deal with them - look out for any help or FAQs for guidance.

Some useful search tips from COCA
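
As a small illustration of the point about future forms, the sketch below counts will, 'll and won't separately in a few invented sentences. The sample text and the name count_future_forms are made up for the example, and a real corpus tool would have its own query syntax, so treat this as a picture of the problem rather than a recipe.

```python
import re
from collections import Counter

# Invented sample sentences, purely for illustration.
SAMPLE = (
    "I'll call you tomorrow. She will be late. "
    "They won't mind. We'll see. He'll be fine, but it will not rain."
)

# One pattern per surface form of the 'will' future.
# Both straight (') and curly (’) apostrophes are accepted.
PATTERNS = {
    "will": r"\bwill\b",
    "'ll": r"\b\w+['’]ll\b",
    "won't": r"\bwon['’]t\b",
}


def count_future_forms(text):
    """Count each surface form separately so none gets overlooked."""
    counts = Counter()
    for label, pattern in PATTERNS.items():
        counts[label] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


counts = count_future_forms(SAMPLE)
print(counts["will"], counts["'ll"], counts["won't"])  # 2 3 1
```

Even in this tiny sample, a search for will on its own would pick up only two of the six future forms.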

Slippery parts of speech
Most corpora you're likely to use will be part-of-speech tagged. That means that the data's been automatically analysed to tag each word with a part of speech: noun, verb, adjective, etc. That allows you to do lots of things. It means you can search for a word like walk and decide you only want to see verb examples or noun examples, usually by selecting PoS from some kind of menu. It helps with searching for lemmas - so you just type in walk and select verb and you'll get walk, walks, walked and walking. It also allows collocate searches to categorize and display results by part of speech.

A collocate search from Sketch Engine
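
For anyone curious about what's going on behind that PoS menu, here's a minimal sketch of a lemma search with a part-of-speech filter. The tagged tokens are hand-made for the example and the tag labels (VERB, NOUN and so on) and the function name search_lemma are assumptions; real corpora use their own tagsets and are tagged automatically.

```python
# Hand-tagged tokens, invented for illustration:
# (word form, lemma, part of speech)
TAGGED = [
    ("We", "we", "PRON"), ("walked", "walk", "VERB"), ("home", "home", "ADV"),
    ("after", "after", "ADP"), ("a", "a", "DET"), ("long", "long", "ADJ"),
    ("walk", "walk", "NOUN"), ("She", "she", "PRON"), ("walks", "walk", "VERB"),
    ("daily", "daily", "ADV"), ("Walking", "walk", "VERB"), ("helps", "help", "VERB"),
]


def search_lemma(tokens, lemma, pos=None):
    """Return every word form of a lemma, optionally limited to one PoS."""
    return [
        word
        for word, token_lemma, token_pos in tokens
        if token_lemma == lemma and (pos is None or token_pos == pos)
    ]


print(search_lemma(TAGGED, "walk", "VERB"))  # ['walked', 'walks', 'Walking']
print(search_lemma(TAGGED, "walk", "NOUN"))  # ['walk']
```

The search is only ever as good as the tags: if a token has been given the wrong part of speech, it will turn up in the wrong set of results.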

However, part of speech tagging isn't perfect. Sometimes - well, quite often actually! - the tech just gets it wrong. So in any search where a word has multiple parts of speech, you're likely to get a few odd examples cropping up, especially where the grammatical clues are misleading. So in one text I was looking at, the fact that the word weather was preceded by to in the sentence "Ice is very sensitive to weather" made it look like it might be a verb so the software incorrectly categorized it as one.

These occasional tagging errors aren't generally significant, but much more problematic are words that are just difficult to classify. That's particularly true of words that come before nouns. We tend to think of words that pre-modify nouns as adjectives, but technically, lots of them aren't. So, we can have nouns before nouns: an evening dress, a window cleaner, the table decorations. These don't cause massive problems in corpus searches, but they're worth noting as they don't often get a look-in in ELT materials.

Far more difficult are verb participles. Think about these examples - which of the -ing and -ed words are adjectives?

a boring meeting
another satisfied customer
an increasing number
the desired effect 
a neatly folded piece of paper

It's just one of those features of English that's annoyingly problematic for anyone who's dealing with learners and doesn't want to get into over-technical jargon. If you look in a learner's dictionary, you'll likely find the common ones (boring/bored, tiring/tired, exciting/excited, etc.) listed as adjectives, but the less common ones (increasing, desired) may or may not be there. It's an editorial decision where the cut-off point is for these. It's unsurprising, then, that corpus tools often struggle to handle them. Often, they just don't recognize them as adjectives; instead, they get lumped in with the verb lemma. That has a number of consequences:
  • You may not be able to search for increasing as an adjective. Of course, you can usually search for the exact word form instead of the lemma, but that will leave you with a mix of adjectival and verbal examples.
  • You can't always do collocate searches for these words, because the tools don't recognize them as adjectives.
  • When you do collocate searches, you need to take into account that some of the collocates might be in the 'wrong' places, so participle adjectives will often show up (in their base form) in the verb column. If I search for temperature, for example, I might get verb collocates that include lower, withstand and measure, which are all genuine verb collocates, but then I get rise, operate and desire. When I click through, rise turns out to be a mix of rising temperatures and the temperature rose, whereas desire and operate are always modifying the noun: the desired temperature and operating temperature.
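
If you wanted to sort through a list of hits like these yourself, one very crude approach is to check whether an -ing or -ed word sits directly before the noun. The sketch below does just that. It's a positional rule of thumb I'm assuming for illustration, not what any corpus tool does, the name classify_hit is made up, and it will misfire on words that just happen to end in -ed or -ing (red, for instance).

```python
# Crude positional heuristic, assumed purely for illustration:
# an -ing/-ed word immediately before the noun is treated as a
# premodifier; anything else is labelled "other".
def classify_hit(sentence, noun_forms=("temperature", "temperatures")):
    # Lowercase each word and strip surrounding punctuation
    words = [w.strip(".,;:!?").lower() for w in sentence.split()]
    # Look at each pair of neighbouring words
    for previous, current in zip(words, words[1:]):
        if current in noun_forms and previous.endswith(("ing", "ed")):
            return "premodifier"
    return "other"


print(classify_hit("rising temperatures"))      # premodifier
print(classify_hit("the temperature rose"))     # other
print(classify_hit("the desired temperature"))  # premodifier
print(classify_hit("operating temperature"))    # premodifier
```

It's no substitute for clicking through and reading the examples, but it shows why rise gives a mixed picture while desire and operate come out as premodifiers every time.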

Embracing quirkiness
So what does all this awkward quirkiness mean for the average corpus user? Well, it doesn’t mean you should give up using corpora or stop trusting corpus results. It just means you have to familiarize yourself with how your corpus tools deal with the odd stuff and be on the look-out for apparent anomalies, like mis-tagged parts of speech and unlikely collocates. And as you get more familiar with the tool you’re using, you’ll get to know its quirks and, more importantly, how to get around them or take them into account.


Monday, August 13, 2018

Not working


Today is my first proper day back at my desk after roughly seven weeks of not working. That might sound like a fabulous long holiday, but it was actually an extended break to try and get my health back on track.

As some of you will know, I suffer from a chronic pain condition which makes managing work a bit of a juggling act at the best of times. My condition fluctuates enormously. I have good patches and bad patches, some long, some short, some which coincide with busy patches of work, some which don’t. At the start of this year, things were particularly bad. I put it down to a combination of several busy projects back-to-back and the cold, damp, winter weather. But even as winter finally morphed into spring and my workload settled down to what should have been a very manageable level, my pains kept getting worse, and by the beginning of May I was struggling to work at all. So I decided that maybe it was time for a complete break and an extended period of rest.

I finished a project at the end of June and made the decision not to take on any more work for the rest of the summer. As a freelancer, that’s quite a scary step because no work means no income. I figured though that I probably had enough in the bank to eke out a few weeks off if I budgeted carefully.

Not working

As a freelancer, you have to be pretty self-motivated, so you get used to just getting up in the morning and getting on with work. On the whole, I enjoy my work, so motivation isn’t usually a problem, and even when I’m feeling less excited about a project, just clocking up the hours and working towards the next invoice holds a certain satisfaction too. Given that mindset, not doing anything turns out to be actually quite difficult.

My natural reaction was one of: woo-hoo, time off to do lots of other stuff and fabulous summer weather too! But of course, the whole point of the exercise was to physically rest, so energetic gardening or DIY or days out traipsing around shops or galleries were also off the cards because they’re all just as likely to aggravate my pains. So, what have I been doing?

Taking it gently 

I’ve been perfecting the art of “pottering”. Rather than rush at things full throttle, I've picked a couple of small jobs each day, spaced them out, taken time wandering to the post office, maybe stopping off for a coffee en route. The hot weather helped to slow me down - you can’t rush anywhere when it’s 30°C! 


I haven’t always managed life in the slow lane smoothly though. When you’re used to being busy, it’s surprisingly difficult to change gear. Some days, I got it spot-on with enough achieved in a day to be satisfying but without overdoing it. Other days, I faffed about and didn’t really settle to anything and got to the end of the day feeling listless and frustrated. I had to keep reminding myself that even if I’d done nothing else, I’d achieved “resting”.

Not dropping off the radar

I made a decision early on that I didn’t want to switch off from work completely. Partly, that’s because I didn’t want to give the impression of being “unavailable” for fear that the unavailable tag would stick in people’s minds far beyond the summer. And partly, I just enjoy the social part of my work. I like keeping up with what people are up to and joining in the social media chat. Plus, without the cash or the energy to rush around visiting friends, it would have been easy just to sit at home and feel isolated. That means that I’ve been spending a bit of time most mornings at my desk browsing through social media, commenting or re-tweeting, reading the odd interesting article or blog post and writing a few blog posts of my own too. 


Easing back in

So, my pains have calmed down enormously now. I’m still achy and not completely pain-free, but after 20 years of chronic pain, I know not to expect miracles! I think I feel ready though to ease back into work. Plus, I’ve started getting twitchy about not working and I really need to start earning again. I’m determined to take it gently though. The first project on my desk should be about 2 weeks’ work, so I’m starting it 3 weeks before the deadline to give myself plenty of leeway.

Looking ahead, I’m making all my usual resolutions to work smart – to spend short bursts at my desk and take plenty of breaks, to pay attention to my posture, to keep exercising regularly, to use my voice recognition software a bit more and to keep my workload at a sensible level. Yes, I know, all easier said than done - especially the last one! - but I’m definitely going to try.
