The occasional ramblings of a freelance lexicographer

Thursday, June 07, 2018

Learner corpus research: mixed methods and detective work

In a blog post at the end of last year, I said that one of my resolutions for 2018 was to focus more on the areas that especially interest me, and one of those was corpus research. So far, so good. I’ve spent the past couple of months researching Spanish learner errors for a writing project and next week, I’ll be presenting at my first corpus linguistics conference, IVACS in Valletta, Malta. Coincidentally, I’m going to be talking about my work with the Cambridge Learner Corpus (CLC) researching errors by Spanish students to feed into ELT materials for the Spanish market.

Although this will be my first time speaking at a corpus linguistics conference, it’s far from the first time I’ve spoken about my work with the CLC. In fact, my first presentation at a major conference, at IATEFL in Dublin back in 2000, was about the work I’d been doing using the CLC to write common error notes for the Cambridge Learner’s Dictionary.

So what is the Cambridge Learner Corpus?

The CLC is a collection of exam scripts written by learners taking Cambridge exams, including everything from PET and KET, through First, CAE and Proficiency, to IELTS and the Business English exams. It includes data from over 250 000 students from 173 countries. You can either search through the raw data or use a coded version of the corpus in which errors have been classified and corrections suggested, much as they would be by a teacher marking a student essay, which allows you to search for specific error types. The example below shows how a verb inflection error would be coded.

We are proud that you have <#IV> choosen | chosen </#IV> our company.

You can also search by CEFR level, by specific exams, by country and by L1.
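For anyone curious how a coded line like the one above might be handled programmatically, here is a minimal Python sketch. It assumes the tag shape is always `<#CODE> learner form | correction </#CODE>`, which is inferred only from that single example; the real corpus markup may well be richer. The function names are made up for illustration and are not part of any CLC tool.

```python
import re

# Assumed tag shape, inferred from one example only:
#   <#CODE> learner form | correction </#CODE>
ERROR_TAG = re.compile(
    r"<#(?P<code>\w+)>\s*(?P<wrong>.*?)\s*\|\s*(?P<fixed>.*?)\s*</#(?P=code)>"
)


def extract_errors(line):
    """Return a (code, learner form, correction) tuple for each tagged error."""
    return [
        (m.group("code"), m.group("wrong"), m.group("fixed"))
        for m in ERROR_TAG.finditer(line)
    ]


def learner_version(line):
    """Rebuild the sentence as the student wrote it."""
    return ERROR_TAG.sub(lambda m: m.group("wrong"), line)


def corrected_version(line):
    """Rebuild the sentence with the coder's corrections applied."""
    return ERROR_TAG.sub(lambda m: m.group("fixed"), line)


example = "We are proud that you have <#IV> choosen | chosen </#IV> our company."
print(extract_errors(example))
print(learner_version(example))
print(corrected_version(example))
```

Keeping the learner form and the correction side by side is what makes it possible to search by error type and still read the sentence either way.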

Research for exams:

When I gave my talk in Dublin all those years ago, one of the first questions at the end came from the redoubtable Mario Rinvolucri, who was sitting right in the middle of the front row. He was concerned that the corpus didn’t include any spoken data, so wasn’t really representative of student language in general. And he was right. One of the major drawbacks of the CLC is that it only reflects students’ written performance, and then only in an exam writing context. That means it doesn’t pick up issues that are specifically related to spoken language and the data is rather skewed by the topics and genres of exam writing tasks (largely emails and essays).

That does, however, make it perfect for informing exam practice materials. Over the years, I’ve carried out learner corpus research to feed into a whole range of exam materials. This has largely involved searching for the most frequent errors that match different parts of a coursebook syllabus in order to provide the writers with evidence and examples to help them better target particular problem areas. It also led to the Common Mistakes at … series, most of which I was involved in researching and two of which I wrote.

Mixed methods and detective work:

One of the things I enjoy most about working with the learner corpus, even after more than 18 years, is that I’m constantly finding new ways to get the best from the data. It’s easy to start out by searching for the most frequent errors by type, so the top ten spelling errors or the most commonly confused nouns (trip/travel, job/work, etc). But the stats don’t tell the whole story. 

Firstly, there’s the issue I’ve already mentioned about the skewing of the data by particular exam questions and topics. So, for example, one of the top noun confusion errors amongst Spanish learners in the corpus is to use ‘jail’ instead of ‘cage’; a lot of animals are locked up in jails. It is a legitimate false friend error (cage is jaula in Spanish), but it’s only so prominent in the data because of a classic FCE essay question about keeping animals in zoos. Does it merit highlighting in materials? Probably not compared to other errors that involve more high-frequency nouns and crop up across a range of contexts. There’s a balancing act to achieve between looking at the corpus stats, delving into the source of the errors (Were all the examples of an error prompted by a single exam question?), understanding the likely needs of students (Is it a high frequency word or likely to come up again in an exam?) and understanding what’ll work (and won’t work!) in published materials. I think it’s when it comes to these last two that, as a former teacher and full-time materials writer, I probably have the edge over a purely academic researcher.
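The question "Were all the examples of an error prompted by a single exam question?" can be roughed out in code. The Python sketch below is only an illustration: the records, the question IDs and the function name are all invented, and it assumes each error can be exported as a (learner word, intended word, exam question) triple, which may not match how the corpus is actually queried.

```python
from collections import Counter, defaultdict

# Invented records: (learner word, intended word, exam question ID).
records = [
    ("jail", "cage", "Q17"),
    ("jail", "cage", "Q17"),
    ("jail", "cage", "Q17"),
    ("jail", "cage", "Q17"),
    ("travel", "trip", "Q03"),
    ("travel", "trip", "Q17"),
    ("travel", "trip", "Q21"),
    ("travel", "trip", "Q08"),
]


def prompt_concentration(records):
    """For each error, return (total hits, number of distinct questions,
    share of hits that come from the single most common question)."""
    by_error = defaultdict(Counter)
    for wrong, intended, question in records:
        by_error[(wrong, intended)][question] += 1

    summary = {}
    for error, questions in by_error.items():
        total = sum(questions.values())
        top_count = questions.most_common(1)[0][1]
        summary[error] = (total, len(questions), top_count / total)
    return summary


for error, stats in prompt_concentration(records).items():
    print(error, stats)
```

In this made-up data both errors occur equally often, but every jail/cage hit comes from one question while travel/trip is spread across four. That kind of spread is one signal, alongside word frequency and classroom judgement, for deciding what merits a place in materials.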

Then there’s the tagging of errors to consider. Many learner corpus researchers are wary of error tagging because it can push the data into fixed categories that aren’t always appropriate. Any teacher who’s ever marked a student essay will know that some errors are very straightforward to mark (a spelling error or a wrong choice of preposition, for example), while others are messy. There can be several things going on in a single sentence that contribute to it not working and sometimes it’s difficult to pin down the root cause. Not to mention those chunks of text that are so garbled, you’re just not sure what the student intended and you don’t know how to go about correcting them. That means that while the coders who mark up the CLC data do a fantastic job, there are always instances that are open to interpretation. 

When I find an error that looks worth highlighting within a particular error category, I’ll often do a more general search for the word (or form or chunk) to see whether it crops up elsewhere with different tags. Sometimes, I’ll go back to the untagged data too to see how students are using the word more generally. This can help me pin down issues that stray across error categories. Then, if the error isn’t a straightforward one, I’ll flick over to a native-speaker corpus to check how the word or structure in question is typically used – after looking at too much learner data, you start to question your intuitions! – and check a few reference sources to help me pinpoint exactly where the mismatch lies and try to come up with a clear explanation for the students.
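The first step of that process, checking whether a word crops up under different tags and in untagged text, might look something like this in Python. Again, this is a sketch under assumptions: the tag shape is inferred from the single coded example earlier in the post, the sample lines are invented, and the codes `AA` and `BB` are placeholders rather than real CLC error codes.

```python
import re
from collections import Counter

# Assumed tag shape: <#CODE> learner form | correction </#CODE>
ERROR_TAG = re.compile(
    r"<#(?P<code>\w+)>\s*(?P<wrong>.*?)\s*\|\s*(?P<fixed>.*?)\s*</#(?P=code)>"
)


def tags_for_word(lines, word):
    """Count the error codes under which `word` appears as the learner's
    form, plus how often it appears outside any error tag."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    counts = Counter()
    for line in lines:
        for m in ERROR_TAG.finditer(line):
            if pattern.search(m.group("wrong")):
                counts[m.group("code")] += 1
        # Blank out tagged spans so only untagged uses are counted here.
        untagged = ERROR_TAG.sub(" ", line)
        counts["(untagged)"] += len(pattern.findall(untagged))
    return counts


# Invented lines; AA and BB are placeholder codes.
lines = [
    "I had a nice <#AA> travel | trip </#AA> to London.",
    "We <#BB> travel | travelled </#BB> by train last year.",
    "I like to travel in summer.",
]
print(tags_for_word(lines, "travel"))
```

A word that turns up under several codes is a hint that the underlying problem strays across error categories and is worth a closer manual look.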

It’s this multi-layered detective work, understanding where things are going wrong and figuring out the best way to help learners understand and, hopefully, overcome language issues, that I find so satisfying.

At the IVACS conference, I’ll be talking about delving into the issues specific to Spanish learners at 16.00 on Thursday, 14 June for anyone who’s going to be joining me in Valletta.
