Lexicoblog

The occasional ramblings of a freelance lexicographer

Thursday, June 07, 2018

Learner corpus research: mixed methods and detective work


In a blog post at the end of last year, one of my resolutions for 2018 was to focus more on the areas that especially interest me and one of those was corpus research. So far, so good. I’ve spent the past couple of months researching Spanish learner errors for a writing project and next week, I’ll be presenting at my first corpus linguistics conference, IVACS in Valletta, Malta. Fittingly, I’m going to be talking about my work with the Cambridge Learner Corpus (CLC) researching errors by Spanish students to feed into ELT materials for the Spanish market.

Although this will be my first time speaking at a corpus linguistics conference, it’s far from the first time I’ve spoken about my work with the CLC. In fact, my first presentation at a major conference, at IATEFL in Dublin back in 2000, was about the work I’d been doing using the CLC to write common error notes for the Cambridge Learner’s Dictionary.

So what is the Cambridge Learner Corpus?

The CLC is a collection of exam scripts written by learners taking Cambridge exams, including everything from PET and KET, through First, CAE and Proficiency, IELTS and the Business English exams. It includes data from over 250,000 students from 173 countries. You can either search through the raw data or use a coded version of the corpus in which errors have been classified and corrections suggested, much as they would be by a teacher marking a student essay, which allows you to search for specific error types. The example below shows how a verb inflection error would be coded.

We are proud that you have <#IV> choosen | chosen </#IV> our company.

You can also search by CEFR level, by specific exams, by country and by L1.
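For anyone curious how coding like this can be handled by a script, here is a rough Python sketch. It assumes the markup always looks like the single example above (an opening code, the learner's form, a vertical bar, the correction, a closing code); the real CLC format and search tools may well differ, and the function names are my own.

```python
import re

# Pattern modelled on the one coded example in this post:
#   <#IV> choosen | chosen </#IV>
# Assumption: every coded error follows this exact shape.
ERROR_RE = re.compile(
    r"<#(?P<code>\w+)>\s*(?P<wrong>.*?)\s*\|\s*(?P<right>.*?)\s*</#(?P=code)>"
)


def extract_errors(sentence):
    """Return (error code, learner's form, suggested correction) for each coded error."""
    return [
        (m.group("code"), m.group("wrong"), m.group("right"))
        for m in ERROR_RE.finditer(sentence)
    ]


def corrected(sentence):
    """Return the sentence with each coded error replaced by its correction."""
    return ERROR_RE.sub(lambda m: m.group("right"), sentence)


example = "We are proud that you have <#IV> choosen | chosen </#IV> our company."
print(extract_errors(example))
print(corrected(example))
```

Keeping the learner's form and the correction as separate fields is what makes it possible to count errors by type or to look up every instance of one particular wrong form.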

Research for exams:

When I gave my talk in Dublin all those years ago, one of the first questions at the end came from the redoubtable Mario Rinvolucri, who was sitting right in the middle of the front row. He was concerned that the corpus didn’t include any spoken data, so wasn’t really representative of student language in general. And he was right. One of the major drawbacks of the CLC is that it only reflects students’ written performance and then, only in an exam writing context. That means it doesn’t pick up issues that are specifically related to spoken language and the data is rather skewed by the topics and genres of exam writing tasks (largely emails and essays).

That does, however, make it perfect for informing exam practice materials. Over the years, I’ve carried out learner corpus research to feed into a whole range of exam materials. This has largely involved searching for the most frequent errors that match different parts of a coursebook syllabus in order to provide the writers with evidence and examples to help them better target particular problem areas. It also led to the Common Mistakes at … series, most of which I was involved in researching and two of which I wrote.


Mixed methods and detective work:

One of the things I enjoy most about working with the learner corpus, even after more than 18 years, is that I’m constantly finding new ways to get the best from the data. It’s easy to start out by searching for the most frequent errors by type, so the top ten spelling errors or the most commonly confused nouns (trip/travel, job/work, etc). But the stats don’t tell the whole story. 

Firstly, there’s the issue I’ve already mentioned about the skewing of the data by particular exam questions and topics. So, for example, one of the top noun confusion errors amongst Spanish learners in the corpus is to use ‘jail’ instead of ‘cage’; a lot of animals are locked up in jails. It is a legitimate false friend error (cage is jaula in Spanish), but it’s only so prominent in the data because of a classic FCE essay question about keeping animals in zoos. Does it merit highlighting in materials? Probably not compared to other errors that involve more high-frequency nouns and crop up across a range of contexts. There’s a balancing act to achieve between looking at the corpus stats, delving into the source of the errors (Were all the examples of an error prompted by a single exam question?), understanding the likely needs of students (Is it a high frequency word or likely to come up again in an exam?) and understanding what’ll work (and won’t work!) in published materials. I think it’s when it comes to these last two that, as a former teacher and full-time materials writer, I probably have the edge over a purely academic researcher.

Then there’s the tagging of errors to consider. Many learner corpus researchers are wary of error tagging because it can push the data into fixed categories that aren’t always appropriate. Any teacher who’s ever marked a student essay will know that some errors are very straightforward to mark (a spelling error or a wrong choice of preposition, for example), while others are messy. There can be several things going on in a single sentence that contribute to it not working and sometimes it’s difficult to pin down the root cause. Not to mention those chunks of text that are so garbled, you’re just not sure what the student intended and you don’t know how to go about correcting them. That means that while the coders who mark up the CLC data do a fantastic job, there are always instances that are open to interpretation. 

When I find an error that looks worth highlighting within a particular error category, I’ll often do a more general search for the word (or form or chunk) to see whether it crops up elsewhere with different tags. Sometimes, I’ll go back to the untagged data too to see how students are using the word more generally. This can help me pin down issues that stray across error categories. Then, if the error isn’t a straightforward one, I’ll flick over to a native-speaker corpus to check how the word or structure in question is typically used – after looking at too much learner data, you start to question your intuitions! – and check a few reference sources to help me pinpoint exactly where the mismatch lies and try to come up with a clear explanation for the students.
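The first two steps above (looking for a word under different tags, then in the untagged text) could be sketched roughly like this in Python. Everything here is illustrative: the sentences are invented, the codes AA and BB are placeholders rather than real CLC codes, and the markup shape is assumed from the single coded example earlier in this post.

```python
import re
from collections import Counter

# Same assumed markup shape as the coded example: <#CODE> wrong | right </#CODE>
ERROR_RE = re.compile(
    r"<#(?P<code>\w+)>\s*(?P<wrong>.*?)\s*\|\s*(?P<right>.*?)\s*</#(?P=code)>"
)


def profile_word(word, sentences):
    """Count a word's appearances inside coded errors (by code) and outside them."""
    tagged = Counter()
    untagged = 0
    for sentence in sentences:
        for m in ERROR_RE.finditer(sentence):
            if word in m.group("wrong").lower().split():
                tagged[m.group("code")] += 1
        # Blank out the coded spans, then count what is left of the word.
        plain = ERROR_RE.sub(" ", sentence).lower()
        untagged += re.findall(r"[a-z']+", plain).count(word)
    return tagged, untagged


# Invented sentences with placeholder codes, for illustration only.
sentences = [
    "Many animals are kept in a <#AA> jail | cage </#AA> all day.",
    "The birds were in <#BB> jail | a cage </#BB> for years.",
    "He was sent to jail for two years.",
]
tagged, untagged = profile_word("jail", sentences)
print(tagged, untagged)
```

Seeing the same word turn up under more than one code, alongside its unmarked uses, is a quick signal that the issue may stray across error categories.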

It’s this multi-layered detective work to understand where things are going wrong and figure out the best way to help learners understand and, hopefully, overcome language issues that I find so satisfying.

At the IVACS conference, I’ll be talking about delving into the issues specific to Spanish learners at 16.00 on Thursday, 14 June for anyone who’s going to be joining me in Valletta.


Wednesday, May 23, 2018

Entrepreneurial micro-business or ELT Deliveroo?


In my working life, I inhabit a number of different worlds.

There’s the general ELT world I mingle with at conferences, events and online that includes teachers, teacher trainers, publishing folk, academics and other freelancers. We talk about teaching, methodology, technology, language and yes, occasionally, the state of ELT.

Then there’s the ELT writing crowd, other freelance writers and editors who congregate face-to-face and online via groups like MaWSIG and ELT Freelancers, as well as through various Facebook groups. Like most colleagues, we largely enjoy a good moan … about our latest hassles and project nightmares, about the stresses of being a freelancer subject to the whims of the publishing industry and inevitably, about how things ‘ain’t like they used to be’. They are, however, also a very supportive bunch, always happy to offer encouragement, practical advice and, most importantly, a good laugh. And at times, it can be pretty inspiring to see the varied and exciting things we all get up to.

I also occasionally dip a toe into the world of local networking groups full of (largely female) entrepreneurs and small business owners who seem to spend a lot of time and energy (and money!) on branding and marketing and business plans and coaching and serious networking … the idea of an ‘elevator pitch’ or asking for ‘referrals’ at an ELT event would make most folk run a mile, but for these ladies, it’s all an essential part of the game. If that sounds a touch ‘sniffy’, it really isn’t meant to be. I don’t quite feel part of the ‘networking gang’ largely because my work doesn’t really fit their model. Most of them are customer-facing businesses (fitness instructors, therapists, consultants of some kind) who need to create a brand and market it to members of the public (and each other!). Some are small businesses with staff and premises and physical products to sell. And whilst a lot of the chat in these circles doesn’t really apply to me and my context, I do still meet some interesting people and I often pick up ideas that are tangentially useful or that I can adapt to be relevant.

How I see myself professionally varies enormously depending on who I’ve been hanging out with and how work’s going at any one time. I don’t quite feel like I’m a small business or an entrepreneur, but after a particularly inspiring networking event or talk, I lean towards the idea of being a successful, funky little micro-business. After a successful talk at an ELT conference, discussing language and pedagogy with all kinds of different people, I can see myself as a budding ‘expert’ in my field with things to say and stuff to contribute. A lot of the time though, I’m just a slightly frustrated and disillusioned hack writer churning out ‘content’ in less-than-ideal conditions and barely scraping together a living (for the record, I earn considerably less than the average UK salary and my average yearly income has barely risen over 20 years of freelancing).

Earlier this week, a radio programme – the Digital Human on Radio 4 – made me stop and think again about work and my relationship with it. The programme explored how, as a society, we’ve become intent on finding ways to use technology to make our lives easier, more ‘frictionless’. It asked where all the time we’ve supposedly saved goes and it looked into how our work and home lives have increasingly merged, especially those of us involved in the gig economy. One anecdote from anthropologist Jan English-Lueck really struck a chord with me:

“I remember talking to a woman who had a really bad problem with carpal tunnel and she’d given up camping, she’d given up reading books, she’d given up everything. And she held up her hands and said ‘I save these for my workplace’.” 


As many of you will know, I’ve been managing a chronic pain condition for nearly 20 years now. I’ve given up many things over the years, but in the past few months, my pains have been particularly troublesome and I’ve found myself giving up driving, giving up going to events that involve lots of standing around or sitting in one place (cinema, gigs, theatre) and increasingly, opting out of social events because at the end of the day, I’m so shattered, I just want to collapse into a fug of painkillers. Am I saving what strength and ability I have for work at the expense of other things in my life? Probably. Because I need to work to earn money and pay the bills. As a freelancer, my income is unstable: I can’t afford to turn down work or miss deadlines, and I don’t get sick pay or paid holidays.

The programme got me wondering whether I’m really an entrepreneurial micro-business with the freedom to choose what I work on and to fit my working hours around other things or whether it’s all just a kind of ELT Deliveroo without the perks of the reflective jacket?

I really don’t know the answer and I’ve flipped between the two poles – and all points in-between – just in the course of writing this post.  What’s your relationship with your working life? Do you see yourself as a business, a creative entrepreneur, an expert, as a gun for hire, a hack writer or a harmless drudge?


Wednesday, May 02, 2018

Corpus insider #2: Frequency & typicality


Corpora are really great for checking collocations: words that are typically used together. Collocation's a really important aspect of language and a vital part of language teaching if we want to help students avoid 'doing' obvious mistakes. As expert speakers, we generally have a feel for an individual word's most typical collocates, but when you're writing materials, it's easy to get a particular combination stuck in your head or to start doubting your intuitions - do we say get a bus or take a bus? The more you say it to yourself, the sillier each one starts to sound. A bit of outside evidence can be really helpful.
 
If you want to use a corpus to check out collocations though, it's important to understand a few basics about the statistics behind what the corpus tools are showing you and what type of collocations might be appropriate for the materials you're writing.

Frequent vs Typical

The most important distinction to get to grips with is the difference between frequent collocations and typical or significant or strong collocations. Most corpus tools will show you which words most commonly co-occur just based on raw frequency, but some tools will also have an option to rank collocates by strength of attraction, shown as a score. That is, the software will take into account not just how often two words occur together, but how likely that combination is based on the relative frequency of the two items. So the chances of two very frequent words occurring together are quite high and the combinations are therefore often fairly predictable and uninteresting. If you look, for example, at the raw frequencies for words which modify the noun car, you'll come across a whole load of very common adjectives - new car, old car, small car, first car, other cars, etc. That doesn't really tell you an awful lot about language. Most students could probably guess these combinations. But if you rearrange the collocates by significance, combinations like electric car, sports car, rental car and police car start rising to the top, along with some cars that aren't even cars, like cable car. They're clearly much more interesting from a linguistic perspective, much less predictable and much more what we think of when we talk about teaching collocation. See this Sketch Engine blog post for more about this and more examples (although, I kind of disagree with its conclusions re. language teaching!).

[Figure: ranked by frequency (the underlined number). Source: Sketch Engine, English Web 2013 corpus]
[Figure: ranked by score (the number on the right). Source: Sketch Engine, English Web 2013 corpus]
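To make the difference between the two rankings concrete, here is a small Python sketch. It assumes the score is logDice, which is the association measure Sketch Engine documents for its collocation scores, calculated as 14 + log2(2 × f(xy) / (f(x) + f(y))). The counts below are invented purely for illustration and are not real corpus figures.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice association score: 14 + log2(2 * f_xy / (f_x + f_y))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Invented counts, for illustration only (NOT real corpus figures).
NODE_FREQ = 1_000_000  # hypothetical overall frequency of "car"
collocates = {
    # collocate: (co-occurrences with "car", overall frequency of the collocate)
    "new": (40_000, 20_000_000),
    "old": (15_000, 6_000_000),
    "small": (10_000, 5_000_000),
    "electric": (9_000, 400_000),
    "rental": (8_000, 150_000),
}

# Ranking 1: raw co-occurrence frequency.
by_frequency = sorted(collocates, key=lambda w: collocates[w][0], reverse=True)

# Ranking 2: strength of attraction, which discounts very common words.
by_score = sorted(
    collocates,
    key=lambda w: log_dice(collocates[w][0], NODE_FREQ, collocates[w][1]),
    reverse=True,
)

print("By frequency:", by_frequency)
print("By score:    ", by_score)
```

With these made-up numbers, new tops the frequency list simply because it is so common everywhere, while rental and electric rise to the top by score because a much larger share of their uses are next to car.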



When you want typical

I started off using corpora as a lexicographer working on learner's dictionaries. In a dictionary, you want to show the range of a word and its usage, so looking at typical collocates is a great starting point for getting a feel for a word. It helps you to tease out different senses - like the AmE sense of car meaning carriage, as in rail car, train car, freight car, etc. - to identify possible compounds, phrases and idioms  - car park, car pool, get car sick - and to pick out some of the most significant collocates you might want to exemplify and perhaps highlight.

The less obvious but typical collocations are important in teaching materials too, especially when an unpredictable collocation is also very frequent, like catch a bus or board a plane, which score highly on both types of measure. The typical collocations aren't, however, always what we want to focus on.

When you want vanilla


Many dictionary entries, especially for more frequent words, will start with what's known as a 'vanilla' example. That is, a simple example that illustrates the basic meaning of the word in a context that's authentic but doesn't contain other elements that distract from the word being exemplified. Information about less obvious collocations, phrases or colligational patterns will come later. So the Cambridge Dictionaries entry for car has the following example sentences:

They don't have a car. (the 'vanilla' example - 'have' is actually one of the top collocating verbs by raw frequency, but it's unremarkable)
Where did you park the car? ('park' is a more interesting collocate)
It's quicker by car.
a car chase/accident/factory

The same principle holds for many other teaching contexts.

When you're introducing potentially new vocabulary items, you want students to focus on those new words. Of course, you want to present them in a realistic context with appropriate collocates, but you don't want to overwhelm the student with extra information and especially not with collocates that are well above the level of the original target word. So if I was, say, teaching car for the first time, I probably wouldn't throw in sports car or rental car, but it might be appropriate to add a bit of variety to the material with simple combinations like new car or small car. Only later when car was a familiar vocabulary item might I want to extend students' range to talk about other types of cars as appropriate contexts cropped up.

When frequent isn't necessarily obvious

A particularly tricky case in English is the set of 'delexical' verbs (make, do, take, get, have, put, give, etc.) which are all incredibly frequent, but for a learner of English, not at all obvious in terms of which to choose. If we go back to what we do with buses, by far the most frequent collocating verb is take. If you look at collocates by frequency, it's right at the top for most corpora. If you switch to order collocates by significance though, because it's a very common verb, it drops way down the order to be replaced by board, ride, catch, park and drive. Obviously, that doesn't mean that we don't need to teach take the bus because it'll be obvious to our students … because it won't!


Weighing up the numbers

So what does all this mean? Which statistics should we be looking at? Well, the answer is probably both. When I'm researching the collocates of a word, I'll flick between both types of ranking to get an overall picture of how the word works, then make my choices based on the teaching context.
  • If I'm looking for a natural example for a new vocab item, I'll probably look at raw frequencies to find a collocate that's common but not distracting.
  • If a collocate - like catch a bus - is high on both scores, it's probably worth teaching, and maybe highlighting, early on.
  • If I'm looking to extend students' range and get them to use familiar words in more varied ways, then I'll investigate the more interesting collocates that come up when ranked by score.
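The three rules of thumb above could be written down as a tiny Python helper. This is only a sketch: the function name, the cut-off of ten and the "low priority" fallback are my own assumptions, and in practice the choice depends on the teaching context rather than on a fixed threshold.

```python
def teaching_role(freq_rank, score_rank, top_n=10):
    """Suggest a use for a collocate from its rank on each list (1 = top).

    top_n is an arbitrary, hypothetical cut-off for what counts as "high".
    """
    high_freq = freq_rank <= top_n
    high_score = score_rank <= top_n
    if high_freq and high_score:
        return "teach early, maybe highlight"
    if high_freq:
        return "natural example for a new item"
    if high_score:
        return "extend range later"
    return "low priority"  # assumption: not covered by the rules above


# Invented ranks, for illustration only.
print(teaching_role(freq_rank=3, score_rank=2))
print(teaching_role(freq_rank=1, score_rank=50))
print(teaching_role(freq_rank=80, score_rank=4))
```

A helper like this is no substitute for flicking between the two rankings yourself, but it does show that the decision rests on both numbers together, not on either one alone.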

A note about data

Finally, as ever with corpora, it’s also important to know what data you’re looking at. As I mentioned in my last corpus insider post, most corpora are made up of predominantly written data and, of course, that’s going to affect the type of results you get back. So, going back to my query at the start of this post about get the bus vs. take the bus, most of the corpora I looked at listed take as a top collocate by frequency, but get, which felt more natural to me, was much further down the lists (both by score and raw frequency). When I looked at the Spoken BNC2014 (a corpus of contemporary spoken British English) though, suddenly get the bus rocketed to the top, suggesting it's something we say, but maybe write slightly less often.
