Lexicoblog

Tuesday, September 04, 2018

Corpus insider #3: Corpus quirks

I love using corpus tools to research language. They throw up some fascinating results to feed into ELT materials in all kinds of ways. They can, however, also be infuriating at times! In this post, I thought I'd look at a couple of the quirks that it might help to be aware of when you're using corpus tools.

Apostrophes, contractions and negatives

Corpus tools are great when you're just searching for strings of letters, but throw an apostrophe into the mix and all kinds of confusion seem to ensue! That's particularly problematic when you want to start researching anything grammatical. Searching for contractions, including auxiliary verbs and negatives, gets tricky, as does searching for possessives. Different corpus tools deal with these forms in different ways. It may be that you can just type in the whole word, apostrophe and all. You may have to separate the contraction from the main part of the word, so she [space] 's. Or the software may treat a negative as a separate entity, so you'd need to search for could [space] n't.

The essential thing though is that you don't miss out on these forms because you constructed a search that didn't include them. If you're looking into future forms and you're only getting examples of will in its full form, you're missing out on loads of 'll and won't forms that may well be significantly more common. That means you need to check how the corpus tools you're using deal with them - look out for any help or FAQs for guidance.
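To make the point concrete, here's a rough sketch in Python of how a tool might split contractions into separate tokens, and why a search for will alone undercounts. The tokenizer is a toy written for illustration; the names tokenize and count_will_forms and the list of clitics are assumptions of mine, not the behaviour of COCA, Sketch Engine or any other real tool.

```python
import re
from collections import Counter

# Toy rule (an assumption, not any real tool's): split n't and common
# clitics such as 'll and 's away from the word they attach to.
CLITIC_RE = re.compile(r"^(.+?)(n't|'ll|'s|'re|'ve|'d|'m)$")


def tokenize(text):
    """Lowercase the text and split contractions into two tokens."""
    text = text.replace("\u2019", "'").lower()  # treat curly apostrophes as straight
    tokens = []
    for word in re.findall(r"[a-z']+", text):
        word = word.strip("'")  # drop quote marks wrapped around a word
        if not word:
            continue
        match = CLITIC_RE.match(word)
        if match:
            tokens.extend(match.groups())  # e.g. "won't" -> "wo", "n't"
        else:
            tokens.append(word)
    return tokens


def count_will_forms(text):
    """Count full will, contracted 'll and negative won't separately."""
    tokens = tokenize(text)
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == "will":
            counts["will"] += 1
        elif token == "'ll":
            counts["'ll"] += 1
        elif token == "n't" and i > 0 and tokens[i - 1] == "wo":
            counts["won't"] += 1
    return counts


sample = "I will call you. She'll be late, but we won't wait. They'll manage."
print(tokenize("She'll say she won't"))
print(count_will_forms(sample))
```

In this made-up sample, a search for will on its own would find one of the four future forms. Whether your own tool splits won't into wo and n't, or keeps it whole, is exactly the kind of thing the help pages should tell you.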

Some useful search tips from COCA

Slippery parts of speech

Most corpora you're likely to use will be part-of-speech tagged. That means that the data's been automatically analysed to tag each word with a part of speech: noun, verb, adjective, etc. That allows you to do lots of things. It means you can search for a word like walk and decide you only want to see verb examples or noun examples, usually by selecting PoS from some kind of menu. It helps with searching for lemmas - so you just type in walk and select verb and you'll get walk, walks, walked and walking. It also allows collocate searches to categorize and display results by part of speech.
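If it helps to picture what's going on behind the menu, here's a minimal sketch in Python of a lemma search over tagged data. The handful of tagged tokens, the tag labels and the function name search_lemma are all invented for illustration; real corpora use their own tagsets and query interfaces.

```python
from collections import Counter

# Hand-tagged toy data (an assumption): (word form, lemma, part of speech).
TAGGED = [
    ("we", "we", "PRON"), ("walk", "walk", "VERB"),
    ("to", "to", "ADP"), ("work", "work", "NOUN"),
    ("she", "she", "PRON"), ("walks", "walk", "VERB"), ("fast", "fast", "ADV"),
    ("a", "a", "DET"), ("long", "long", "ADJ"), ("walk", "walk", "NOUN"),
    ("they", "they", "PRON"), ("walked", "walk", "VERB"), ("home", "home", "ADV"),
    ("two", "two", "NUM"), ("walks", "walk", "NOUN"),
    ("he", "he", "PRON"), ("is", "be", "VERB"), ("walking", "walk", "VERB"),
]


def search_lemma(tagged, lemma, pos):
    """Count every word form that shares the given lemma and part of speech."""
    return Counter(
        word
        for word, token_lemma, token_pos in tagged
        if token_lemma == lemma and token_pos == pos
    )


print(search_lemma(TAGGED, "walk", "VERB"))  # walk, walks, walked, walking
print(search_lemma(TAGGED, "walk", "NOUN"))  # walk, walks
```

The search is only ever as good as the tags it filters on, which is where the trouble starts.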

A collocate search from Sketch Engine

However, part of speech tagging isn't perfect. Sometimes - well, quite often actually! - the tech just gets it wrong. In any search where a word has multiple parts of speech, you're likely to get a few odd examples cropping up, especially where the grammatical clues are misleading. In one text I was looking at, for example, the word weather was preceded by to in the sentence "Ice is very sensitive to weather", which made it look like it might be a verb, so the software incorrectly categorized it as one.

These occasional tagging errors aren't generally significant, but much more problematic are words that are just difficult to classify. That's particularly true of words that come before nouns. We tend to think of words that pre-modify nouns as adjectives, but technically lots of them aren't. So, we can have nouns before nouns: an evening dress, a window cleaner, the table decorations. These don't cause massive problems in corpus searches, but they're worth noting as they don't often get a look-in in ELT materials.

Far more difficult are verb participles. Think about these examples - which of the participles (boring, satisfied, increasing, desired, folded) are adjectives?

a boring meeting

another satisfied customer

an increasing number

the desired effect

a neatly folded piece of paper

It's just one of those features of English that's annoyingly problematic for anyone who's dealing with learners and doesn't want to get into over-technical jargon. If you look in a learner's dictionary, you'll likely find the common ones (boring/bored, tiring/tired, exciting/excited, etc.) listed as adjectives, but the less common ones (increasing, desired) may or may not be there. Where the cut-off point falls is an editorial decision. It's unsurprising, then, that corpus tools often struggle to handle them. Often, they just don't recognize them as adjectives. Instead, they get lumped in with the verb lemma. That has a number of consequences:

You may not be able to search for increasing as an adjective. Of course, you can usually search for the exact word form instead of the lemma, but that will leave you with a mix of adjectival and verbal examples.
You can't always do collocate searches for these words, because the tools don't recognize them as adjectives.
When you do collocate searches, you need to take into account that some of the collocates might be in the 'wrong' places, so participle adjectives will often show up (in their base form) in the verb column. If I search for temperature, for example, I might get verb collocates that include lower, withstand and measure, which are all genuine verb collocates, but then I get rise, operate and desire. When I click through, rise turns out to be a mix of rising temperatures and the temperature rose, whereas desire and operate are always modifying the noun: the desired temperature and operating temperature.
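The last point can be sketched in Python. The sample below is hand-made and tagged the way a tool might tag it, with participles filed under the verb lemma. The function verb_collocates and its rule of thumb (a verb-tagged -ing or -ed form sitting directly before the noun is probably a modifier) are my own assumptions, and a crude heuristic rather than how any real corpus tool works; it would, for instance, wrongly flag measuring temperature as a modifier.

```python
from collections import defaultdict

# Hand-made sample (an assumption): each token is (word form, lemma, tag),
# with participles tagged VERB under their base-form lemma.
SENTENCES = [
    [("the", "the", "DET"), ("desired", "desire", "VERB"),
     ("temperature", "temperature", "NOUN")],
    [("rising", "rise", "VERB"), ("temperatures", "temperature", "NOUN")],
    [("the", "the", "DET"), ("temperature", "temperature", "NOUN"),
     ("rose", "rise", "VERB")],
    [("lower", "lower", "VERB"), ("the", "the", "DET"),
     ("temperature", "temperature", "NOUN")],
    [("the", "the", "DET"), ("operating", "operate", "VERB"),
     ("temperature", "temperature", "NOUN")],
]


def verb_collocates(sentences, node_lemma, window=2):
    """Group verb-tagged collocates of a noun by their likely role."""
    result = defaultdict(lambda: {"premodifier": 0, "verb": 0})
    for sentence in sentences:
        for i, (_, lemma, pos) in enumerate(sentence):
            if lemma != node_lemma or pos != "NOUN":
                continue
            start = max(0, i - window)
            end = min(len(sentence), i + window + 1)
            for j in range(start, end):
                word, coll_lemma, coll_pos = sentence[j]
                if j == i or coll_pos != "VERB":
                    continue
                # Rule of thumb: a participle directly before the noun
                # is probably modifying it rather than acting as a verb.
                is_participle = word.endswith(("ing", "ed"))
                if j == i - 1 and is_participle:
                    result[coll_lemma]["premodifier"] += 1
                else:
                    result[coll_lemma]["verb"] += 1
    return {lemma: dict(counts) for lemma, counts in result.items()}


for lemma, counts in verb_collocates(SENTENCES, "temperature").items():
    print(lemma, counts)
```

On this sample, rise comes out as a mix of the two roles, while desire and operate only ever appear as modifiers, which matches what clicking through to the examples showed. It's no substitute for reading the concordance lines, but it shows why the verb column deserves a second look.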

Embracing quirkiness

So what does all this awkward quirkiness mean for the average corpus user? Well, it doesn’t mean you should give up using corpora or stop trusting corpus results. It just means you have to familiarize yourself with how your corpus tools deal with the odd stuff and be on the look-out for apparent anomalies, like mis-tagged parts of speech and unlikely collocates. And as you get more familiar with the tool you’re using, you’ll get to know its quirks and, more importantly, how to get around them or take them into account.

Labels: apostrophes, collocation, corpus insider, corpus research, part of speech, quirks, tagging

Wednesday, September 20, 2017

It’s not all about the apostrophes…

I saw a tweet recently that really made me smile:

When I meet new people, it’s always a bit of a challenge to explain what I do, mostly because I don’t do a single job. Once people get the general idea though, I find it leads to all kinds of assumptions about what a languagey sort of person must be interested in.

“Correct” grammar

As someone who spends a lot of time researching and writing about grammar norms, yes, non-standard grammar does, inevitably, jump out at me. I do automatically spot misplaced apostrophes, there/their/they’re mix-ups and sentences missing a main verb, but they don’t necessarily have me up in arms. For me, it’s all down to context. If it’s in a Facebook post or a quickie email, I really don’t care. If someone has gone to the trouble (and expense) of having something professionally printed without getting it proofread (a menu, a leaflet, a business website), then yes, it makes me sigh and roll my eyes.

Etymology

I admit that I love words. I find English vocabulary in all its wonderful variety fascinating. Am I bothered about the origins of a particular word or expression though? Not especially. Yes, understanding a bit about the roots of English can be useful, but for me, it’s functional rather than fascinating. I’m much more interested in how language is used now than where it came from. I have several unopened books on my shelves about the “stories behind words” bought as well-intentioned presents, but now collecting dust.

Trendy coinages

When I tell people I work in dictionaries, one of the common reactions is: “it must be all about finding new words”. Unsurprising perhaps, seeing as the only time dictionaries seem to be in the news is when they announce their “word of the year”: staycation or post-truth or sharenting. And yes, they’re fun, and I enjoy a new coinage as much as the next person, but they’re very much the fluffy, soundbite end of lexicography. As someone working in ELT, I’m much more involved in trying to explain the frequent, and yes even boring, everyday language that the average learner needs to master. Which, by the way, can be far more interesting and challenging.

The decline of English

At the same time as being excited by new coinages, people also expect me to be outraged by the apparent decline of the English language. I should be vehemently against verbing and appalled by the Americanization of English. I’m not. Language change happens; it always has (see etymology above). Of course, there are some changes that I personally embrace more than others, but asking whether I’m for or against language change seems a fairly nonsensical question to me. There isn’t some malign force out there forcing changes on us; it’s how we collectively choose to use our language that influences the direction of change.

I could go on (my spelling is rubbish, I’m not a literary type, I’ve never watched Countdown …), but I guess my real message is: I love language in my own ways.

Labels: apostrophes, etymology, grammar, language, new words, Twitter
