Learner corpus research: mixed methods and detective work
In a blog post at the end of last year, one of my
resolutions for 2018 was to focus more on the areas that especially interest me,
and one of those was corpus research. So far, so good. I’ve spent the past
couple of months researching Spanish learner errors for a writing project and
next week, I’ll be presenting at my first corpus linguistics conference, IVACS
in Valletta, Malta. Coincidentally, I’m going to be talking about my work with
the Cambridge Learner Corpus (CLC) researching errors by Spanish students to
feed into ELT materials for the Spanish market.
Although this will be my first time speaking at a corpus
linguistics conference, it’s far from the first time I’ve spoken about my work
with the CLC. In fact, my first presentation at a major conference, at IATEFL
in Dublin back in 2000, was about the work I’d been doing using the CLC to
write common error notes for the Cambridge Learner’s Dictionary.
So what is the Cambridge Learner Corpus?
The CLC is a collection of exam scripts written by learners taking Cambridge
exams, including everything from PET and KET, through First, CAE and
Proficiency, to IELTS and the Business English exams. It includes data from over
250,000 students from 173 countries. You can either search the raw data or use
a coded version of the corpus in which errors have been classified and
corrections suggested, much as they would be by a teacher marking a student
essay, allowing you to search for specific error types. The example below shows how a verb inflection error would be coded.
We are proud that you have <#IV> choosen | chosen </#IV> our company.
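For anyone curious about how mark-up like this might be processed, here’s a minimal sketch in Python. The tag pattern is inferred from the single example above, and the extract_errors helper is my own illustrative name; the real CLC mark-up and search tools are richer than this, so treat it as a conceptual stand-in rather than the actual CLC format.

import re

# Tag pattern inferred from the example above:
# <#CODE> learner form | correction </#CODE>
# The real CLC mark-up is richer; this is an illustrative simplification.
ERROR_TAG = re.compile(r"<#(\w+)>\s*(.*?)\s*\|\s*(.*?)\s*</#\1>")

def extract_errors(text):
    """Yield (error_code, learner_form, correction) triples from a coded script."""
    for match in ERROR_TAG.finditer(text):
        yield match.groups()

sentence = "We are proud that you have <#IV> choosen | chosen </#IV> our company."
print(list(extract_errors(sentence)))  # [('IV', 'choosen', 'chosen')]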
You can also search by CEFR level, by specific exams, by
country and by L1.
Research for exams:
When I gave my talk in Dublin all those years ago, one of
the first questions at the end came from the redoubtable Mario Rinvolucri who
was sitting right in the middle of the front row. He was concerned that the
corpus didn’t include any spoken data, so wasn’t really representative of
student language in general. And he was right. One of the major drawbacks of
the CLC is that it only reflects students’ written performance and then, only
in an exam writing context. That means it doesn’t pick up issues that are
specifically related to spoken language and that the data is rather skewed by the topics
and genres of exam writing tasks (largely emails and essays).
That does, however, make it perfect for informing exam
practice materials. Over the years, I’ve carried out learner corpus research to
feed into a whole range of exam materials. This has largely involved searching
for the most frequent errors that match different parts of a coursebook
syllabus in order to provide the writers with evidence and examples to help
them better target particular problem areas. It also led to the Common Mistakes
at … series, most of which I was involved in researching and two of which I
wrote.
Mixed methods and detective work:
One of the things I enjoy most about working with the
learner corpus, even after more than 18 years, is that I’m constantly finding
new ways to get the best from the data. It’s easy to start out by searching for
the most frequent errors by type: say, the top ten spelling errors or the most
commonly confused nouns (trip/travel, job/work, etc.). But the stats don’t tell
the whole story.
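To give a sense of what that first pass looks like, here’s a sketch of a raw frequency count over scripts coded in the simplified format above. It reuses my illustrative ERROR_TAG pattern rather than the CLC’s own query tools, so it’s a conceptual stand-in for the kind of search the real interface runs.

import re
from collections import Counter

# Same illustrative tag pattern as in the earlier sketch.
ERROR_TAG = re.compile(r"<#(\w+)>\s*(.*?)\s*\|\s*(.*?)\s*</#\1>")

def top_errors(scripts, n=10):
    """Rank error codes by raw frequency across a collection of coded scripts."""
    counts = Counter(
        match.group(1)                 # the error code, e.g. 'IV'
        for script in scripts
        for match in ERROR_TAG.finditer(script)
    )
    return counts.most_common(n)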
Firstly, there’s the issue I’ve already mentioned about the
skewing of the data by particular exam questions and topics. So, for example,
one of the top noun confusion errors amongst Spanish learners in the corpus is
to use ‘jail’ instead of ‘cage’; a lot of animals are locked up in jails. It is
a legitimate false friend error (cage is jaula in Spanish), but it’s only so
prominent in the data because of a classic FCE essay question about keeping
animals in zoos. Does it merit highlighting in materials? Probably not compared
to other errors that involve more high-frequency nouns and crop up across a
range of contexts. There’s a balance to strike between looking at the
corpus stats, delving into the source of the errors (Were all the examples of
an error prompted by a single exam question?), understanding the likely needs
of students (Is it a high frequency word or likely to come up again in an
exam?) and understanding what’ll work (and won’t work!) in published materials.
I think it’s when it comes to these last two that, as a former teacher
and full-time materials writer, I probably have the edge over a purely academic
researcher.
Then there’s the tagging of errors to consider. Many learner
corpus researchers are wary of error tagging because it can push the data into
fixed categories that aren’t always appropriate. Any teacher who’s ever marked
a student essay will know that some errors are very straightforward to mark (a spelling
error or a wrong choice of preposition, for example), while others are messy.
There can be several things going on in a single sentence that contribute to it
not working and sometimes it’s difficult to pin down the root cause. Not to
mention those chunks of text that are so garbled, you’re just not sure what the
student intended and you don’t know how to go about correcting them. That means
that while the coders who mark up the CLC data do a fantastic job, there are
always instances that are open to interpretation.
When I find an error that looks worth highlighting within a
particular error category, I’ll often do a more general search for the word (or
form or chunk) to see whether it crops up elsewhere with different tags.
Sometimes, I’ll go back to the untagged data too to see how students are using
the word more generally. This can help me pin down issues that stray across
error categories. Then, if the error isn’t a straightforward one, I’ll flick
over to a native-speaker corpus to check how the word or structure in question
is typically used – after looking at too much learner data, you start to
question your intuitions! – and check a few reference sources to help me
pinpoint exactly where the mismatch lies and try to come up with a clear
explanation for the students.
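Conceptually, that cross-tag step looks something like the sketch below, again over my simplified format rather than the CLC’s actual interface: gather every tagged error involving a word, grouped by error code, so that patterns straying across categories become visible.

import re
from collections import defaultdict

# Same illustrative tag pattern as in the earlier sketches.
ERROR_TAG = re.compile(r"<#(\w+)>\s*(.*?)\s*\|\s*(.*?)\s*</#\1>")

def errors_for_word(scripts, word):
    """Group every tagged error involving `word` by its error code."""
    hits = defaultdict(list)
    for script in scripts:
        for code, wrong, right in ERROR_TAG.findall(script):
            if word in wrong.split() or word in right.split():
                hits[code].append((wrong, right))
    return hits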
It’s this multi-layered detective work, understanding where things are going
wrong and figuring out the best way to help learners understand and, hopefully,
overcome language issues, that I find so satisfying.
At the IVACS conference, I’ll be talking about delving into
the issues specific to Spanish learners at 16.00 on Thursday, 14 June for
anyone who’s going to be joining me in Valletta.