Lexicoblog

The occasional ramblings of a freelance lexicographer

Wednesday, June 20, 2018

#IVACS2018: learner corpus research & ELT materials for Spanish learners


Last week, I spoke at the IVACS (Inter-Varietal Applied Corpus Studies) conference in Malta about my work using the Cambridge Learner Corpus (CLC) to help develop ELT materials targeted at Spanish learners of English. So, following on from my last post about my learner corpus work more generally, here's a brief summary of my talk.

Photo from Niall Curry via Twitter
ELT: a global market

From the perspective of a large ELT publisher, if they're to invest in producing a major coursebook series - over several levels, each with multiple components - it makes economic sense to sell it to the widest possible global market. This one-size-fits-all approach, however, ignores the fact that different learners have different needs. Just one of the factors that differentiates learners is the influence of their first language, their L1. It's well established that friction between a learner's L1 and the target language, in this case English, can result in language transfer issues or interference, a factor not accounted for in materials for a global audience. In recent years, I've worked on a number of projects for CUP that have involved localizing materials to target them more effectively at Spanish learners. More specifically, I've used the CLC to investigate errors by Spanish learners to feed into English for Spanish Speakers (ESS) versions of a number of books.

For more about the CLC see my previous post.

Error types:
When you start looking at learner data for a specific L1 group, three broad error types emerge. There are global errors, that is, errors that are common across learners more or less regardless of L1. These can be described as developmental or intralingual errors, the result of the inherent quirks and irregularities in English that trip everyone up. Then there are interlingual errors, where the learner's L1 rubs up against English in a way that creates friction and interference. Some of these are common across a language group, such as errors frequent among all Romance language speakers learning English, while others are L1-specific, so peculiar to, say, Spanish speakers.

In my session, I took an example of each error type to show how I went about investigating the error and then incorporating activities to target the issue into classroom materials.

Global errors:
One classic example of a global, developmental error is with irregular verbs. Below is a list of the most common past simple/past participle verb inflection errors across the whole learner corpus. As you'd expect, there are some irregular verbs (pay, choose, rise, hear) and others where the spelling rules around whether or not to double the final consonant cause difficulties.

1 occured; 2 happend; 3 payed; 4 choosen; 5 prefered; 6 planed; 7 rised; 8 developped; 9 heared; 10 stoped

If we then look at the top tens for Spanish and French speakers for comparison, we see a lot of overlap.

Spanish: 1 choosen; 2 prefered; 3 payed; 4 teached; 5 refered; 6 planed; 7 occured; 8 heart; 9 writen; 10 tryed


French: 1 developped; 2 mentionned; 3 occured; 4 prefered; 5 choosen; 6 planed; 7 rised; 8 red; 9 enroled; 10 stoped

There are a few interesting differences though. The Spanish use of 'heart' as the past form of 'hear' doesn't seem to follow the pattern you'd expect - as with 'heared' in the global list. This can be put down to an issue of pronunciation; Spanish speakers tend not to pronounce voiced consonants at the end of words, so that a /d/ sound often becomes a /t/ (or is sometimes lost altogether), and this seems to spill over into the spelling. In the French list, we see the extra double letters in 'developped' and 'mentionned', this time because they're cognates in French (développer and mentionner), both spelt in French with a double consonant that then creeps into the English. So whilst all learners need help and reminders about these irregular verb inflections, there are local factors that might come into play too.

Language Group errors and the issue of 'below-level' mistakes:
The error I looked at here is around students adding an unnecessary 's' inflection onto adjectives to agree with a plural noun, so "differents reasons", "two news friends", "interestings questions", etc. Of course, many languages have adjective inflections that agree with the noun they modify for number, and these kinds of errors are particularly simple to search for using the coded version of the corpus (where errors are tagged by type). Interestingly though, the corpus data suggests that this particular error is especially prevalent amongst Romance language speakers (Spanish, French, Italian, Portuguese).

What's perhaps more interesting here from a materials writer's perspective is that these errors crop up across levels, with examples right up to proficiency level in the data, even though students will likely learn the basic rules about adjectives in English in their beginner class. So these aren't 'errors' in the usual sense of gaps in knowledge - the learners clearly know the rules around adjectives in English. Instead, they're mistakes, inadvertent slips. Looking at learner corpus data reveals a lot of these, and it shows that the pattern of these mistakes can often be described as something of a bell curve. Learners make few errors when they first learn a new language form, partly just because they're cautious and don't use it very much. Then, as they progress, they start to make a lot more mistakes with the forms they learnt at previous levels as they experiment and become more adventurous. You could say that they take their eye off the ball with adjectives by B1 or B2 because they're more concerned about complex sentence constructions and whether or not to use a past perfect simple verb form, for example. Then eventually the mistakes start to tail off as learners become more proficient, their language skills become more automatic and they have the cognitive capacity to tidy up.


This presents a problem for me as a corpus researcher trying to feed into classroom materials. On the one hand, the data is telling me that these mistakes are significant at mid-levels and probably worth highlighting, but how do I convince editors, teachers and students that they need to focus on simple adjective forms at A2 or even B1 level without the materials seeming 'dumbed down' and 'below level'? The approach I took in one book, illustrated below (Empower A2, CUP 2016), was to:
  1. Make it clear that this is revision. The note starts with the word 'remember' to acknowledge that students probably already know this and the explanation is short and simple - they don't need the 'rules' explained in detail all over again.
  2. Combine several errors around adjectives. An activity just practising adjectives with singular and plural nouns would be pretty pointless at this level. Once the issue had been highlighted, students would find any follow-up activity mechanical and wouldn't engage with the point. By combining a number of issues, there's more to think about and you up the challenge. And a proof-reading activity of this kind is an authentic task type mirroring what students need to do with their own writing to reduce the number of mistakes that slip through.





Going beyond error codes:
The third point in the box above is also worth a bit more attention from a research perspective. The first two errors here jump out of the coded data (they're tagged as adjective inflection and word order errors), but the issue with the word 'colour' was less obvious. As I was looking through adjective examples, I started to notice various instances of awkward phrasing which had been tackled in the coded data in different ways.

I bought it in <#MD> | a </#MD> green colour .  (KET, A2)

It's blue and white <#MT> | in </#MT> colour . (KET, A2)

It only cost 20€ and <#DD> it | its </#DD> colours are red and black. (KET, A2)

<#UP> It's | Its </#UP> colour <#MV> | is </#MV> black. (KET, A2)

I like it because <#MA> | it </#MA> is very small and <#MA> | it </#MA> is <#UN> colour | </#UN> black. (KET, A2)

Anyone who's ever marked student writing will know that there's more than one way to go about trying to correct an oddly-worded sentence, and the suggestions in the coding above are all legitimate, but somehow they didn't quite ring true to me. It struck me that in each case, the best solution would actually just be to drop the word 'colour' altogether. You might have noticed that all the examples are very similar, and they were all indeed in response to the same question, which asked students to describe a new mobile phone, including what colour it was. Hmm, so was this just a case of the wording of the question skewing the data? Was it just task effect? It prompted me to search more widely and I found that although I had a lot of examples from this one question, the same issue was cropping up at other levels amongst the Spanish learner data in response to completely different tasks. And from what I understand (I should confess at this point that I'm not a Spanish speaker!), it’s possible to say something along the lines of "a dress of colour blue" in Spanish. It's not really a major error, but 'colour' is a high-frequency word and I think the point fits nicely here and, hopefully, gives students (and teachers) pause for thought over something they may not have considered before.
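For anyone curious about the nuts and bolts, pulling coded examples like these apart programmatically is fairly straightforward. Here's a minimal sketch in Python, assuming only the <#CODE> error | correction </#CODE> pattern visible in the examples above - the regex and function names are my own invention for illustration, not the real CLC tools.

import re

# Illustrative only: parse lines in the coded format shown above,
# i.e. <#CODE> learner's text | suggested correction </#CODE>.
ERROR_TAG = re.compile(r"<#(?P<code>\w+)>\s*(?P<error>.*?)\s*\|\s*(?P<fix>.*?)\s*</#\w+>")

def extract_errors(line):
    """Return (code, error, correction) triples from one coded line."""
    return [(m["code"], m["error"], m["fix"]) for m in ERROR_TAG.finditer(line)]

print(extract_errors("I bought it in <#MD> | a </#MD> green colour ."))
# [('MD', '', 'a')] - an omitted word, with 'a' as the suggested correction

Once you have those triples, questions like 'how is this word being corrected across different error codes?' become simple to ask.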

L1-specific errors and classic false friends:
Finally, some of the most satisfying errors are the ones you track down which are clearly examples of L1 interference. And perhaps the most fun are the simple 'false friends'; the English words which seem to be a near equivalent to something in Spanish, but turn out to mean something different. I note these down as I work through the learner data, then try to collect them together into thematic sets which I can tie in with the coursebook syllabus. Below are a few around the theme of 'information' that I was looking at recently for some B2 material, shown along with the Spanish 'false friend' in brackets.

I am writing to you to reply to your <#RN> announcement | advertisement </#RN> in the newspaper. (anuncio)

It is really complicated to talk about a <#RN> theme | subject </#RN> as controversial as the cruelty of keeping animals in zoos. (tema)

What <#UD> a | </#UD> great <#RN> notice | news </#RN>!  (noticia)

We would like to know if you will be able to come, and give a <#RN> conference | talk </#RN>. (conferencia)


In some of these, the meaning of the Spanish word simply doesn't match its English near equivalent - although they're often in the same semantic ballpark - announcement/advert, notice/news. Others are more about range of usage. So, the Spanish 'tema' seems more widely applicable than the English word 'theme' and gets used by students where 'subject' or 'topic' would fit better in English. And 'conferencia' in Spanish can describe both a conference and an individual talk or lecture. Activities for these are about raising students' awareness, drawing attention to the differences and, where relevant, provoking some discussion.

As a side note here, when looking for example sentences for practice activities, although the learner corpus is great for getting a feel for level, you have to be careful not to transfer subtly awkward phrasing and atypical constructions, ‘learnerese’ if you like, into materials. Especially at higher levels and with subtle differences, such as the theme/subject distinction, I’ll often have a browse through native-speaker (NS) corpus data for example sentences. That way I’m ensuring learners have an authentic model, and it’s also good to up the level of the language just a little to provide a sense of challenge and progress even when essentially revising.


Research into practice:
Hopefully, this handful of examples gives a taste of the work I've been doing and the way I make use of the learner corpus, both by using the error tags and by going beyond them to explore less obvious errors. I've also tried to show some of the issues that emerge in trying to translate the results of that analysis into materials: materials that fit in with the coursebook syllabus, that focus on significant but apparently below-level mistakes in a way that's appropriately challenging and engaging, and that draw learners' attention to language points that are especially relevant to them rather than just part of a generic global syllabus.


Thursday, June 07, 2018

Learner corpus research: mixed methods and detective work


In a blog post at the end of last year, one of my resolutions for 2018 was to focus more on the areas that especially interest me and one of those was corpus research. So far, so good. I’ve spent the past couple of months researching Spanish learner errors for a writing project and next week, I’ll be presenting at my first corpus linguistics conference, IVACS in Valletta, Malta. Coincidentally, I’m going to be talking about my work with the Cambridge Learner Corpus (CLC) researching errors by Spanish students to feed into ELT materials for the Spanish market.

Although this will be my first time speaking at a corpus linguistics conference, it’s far from the first time I’ve spoken about my work with the CLC. In fact, my first presentation at a major conference, at IATEFL in Dublin back in 2000, was about the work I’d been doing using the CLC to write common error notes for the Cambridge Learner’s Dictionary.

So what is the Cambridge Learner Corpus?

The CLC is a collection of exam scripts written by learners taking Cambridge exams, including everything from PET and KET, through First, CAE and Proficiency, to IELTS and the Business English exams. It includes data from over 250,000 students from 173 countries. You can either search through the raw data or use a coded version of the corpus in which errors have been classified and corrections suggested, much as they would be by a teacher marking a student essay, allowing you to search for specific error types. So, the example below shows how a verb inflection error would be coded.

We are proud that you have <#IV> choosen | chosen </#IV> our company.

You can also search by CEFR level, by specific exams, by country and by L1.
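To give a rough feel for what those searches amount to, here's a sketch in Python using an invented in-memory record type; the real corpus interface is different, but the search dimensions are the ones just described.

from dataclasses import dataclass

# Illustrative only: a made-up record type mirroring the CLC's search
# dimensions (CEFR level, exam, country, L1); not the real interface.
@dataclass
class Script:
    text: str      # the error-coded exam script
    level: str     # CEFR level, e.g. "B1"
    exam: str      # e.g. "KET", "FCE"
    country: str
    l1: str        # the learner's first language

def search(scripts, l1=None, level=None, exam=None):
    """Filter Script records on any combination of L1, level and exam."""
    return [s for s in scripts
            if (l1 is None or s.l1 == l1)
            and (level is None or s.level == level)
            and (exam is None or s.exam == exam)]

# e.g. search(scripts, l1="Spanish", exam="KET") for Spanish-L1 KET scripts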

Research for exams:

When I gave my talk in Dublin all those years ago, one of the first questions at the end came from the redoubtable Mario Rinvolucri, who was sitting right in the middle of the front row. He was concerned that the corpus didn’t include any spoken data, so wasn’t really representative of student language in general. And he was right. One of the major drawbacks of the CLC is that it only reflects students’ written performance and then, only in an exam writing context. That means it doesn’t pick up issues that are specifically related to spoken language, and the data is rather skewed by the topics and genres of exam writing tasks (largely emails and essays).

That does, however, make it perfect for informing exam practice materials. Over the years, I’ve carried out learner corpus research to feed into a whole range of exam materials. This has largely involved searching for the most frequent errors that match different parts of a coursebook syllabus in order to provide the writers with evidence and examples to help them better target particular problem areas. It also led to the Common Mistakes at … series, most of which I was involved in researching and two of which I wrote.


Mixed methods and detective work:

One of the things I enjoy most about working with the learner corpus, even after more than 18 years, is that I’m constantly finding new ways to get the best from the data. It’s easy to start out by searching for the most frequent errors by type, so the top ten spelling errors or the most commonly confused nouns (trip/travel, job/work, etc). But the stats don’t tell the whole story. 
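To make that first step concrete, a 'top ten errors by type' query might look something like the sketch below, over scripts in the coded format shown earlier. The 'IV' tag for a verb inflection error comes from the example above; everything else is invented for illustration.

import re
from collections import Counter

# Illustrative only: count the most frequent learner forms carrying a
# given error code, e.g. "IV" (verb inflection) as in the example above.
ERROR_TAG = re.compile(r"<#(\w+)>\s*(.*?)\s*\|\s*(.*?)\s*</#\w+>")

def top_errors(texts, code, n=10):
    """The n most frequent erroneous forms tagged with one error code."""
    counts = Counter(err.lower()
                     for text in texts
                     for tag, err, fix in ERROR_TAG.findall(text)
                     if tag == code and err)
    return counts.most_common(n)

# e.g. top_errors(spanish_texts, "IV") might surface choosen, prefered, payed...

Counts like these are where I start, though, not where I stop.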

Firstly, there’s the issue I’ve already mentioned about the skewing of the data by particular exam questions and topics. So, for example, one of the top noun confusion errors amongst Spanish learners in the corpus is to use ‘jail’ instead of ‘cage’; a lot of animals are locked up in jails. It is a legitimate false friend error (cage is jaula in Spanish), but it’s only so prominent in the data because of a classic FCE essay question about keeping animals in zoos. Does it merit highlighting in materials? Probably not compared to other errors that involve more high-frequency nouns and crop up across a range of contexts. There’s a balancing act to achieve between looking at the corpus stats, delving into the source of the errors (Were all the examples of an error prompted by a single exam question?), understanding the likely needs of students (Is it a high frequency word or likely to come up again in an exam?) and understanding what’ll work (and won’t work!) in published materials. I think it’s when it comes to these last two that, as a former teacher and full-time materials writer, I probably have the edge over a purely academic researcher.
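That 'single exam question' check lends itself to a quick, if crude, sanity test in code. Here's a sketch assuming each error hit can be traced back to the task it came from - a field I've invented for illustration, not something I'm claiming the CLC exposes this way.

from collections import Counter

# Illustrative only: how concentrated is an error across exam tasks?
# 'hits' is a list of (error_word, task_id) pairs; task_id is assumed.
def task_concentration(hits):
    """Return the task contributing most hits and its share of the total."""
    by_task = Counter(task for _, task in hits)
    top_task, top_count = by_task.most_common(1)[0]
    return top_task, top_count / len(hits)

# If one FCE zoo question accounts for, say, 90% of the jail-for-cage
# hits, the error's prominence is probably task effect.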

Then there’s the tagging of errors to consider. Many learner corpus researchers are wary of error tagging because it can push the data into fixed categories that aren’t always appropriate. Any teacher who’s ever marked a student essay will know that some errors are very straightforward to mark (a spelling error or a wrong choice of preposition, for example), while others are messy. There can be several things going on in a single sentence that contribute to it not working and sometimes it’s difficult to pin down the root cause. Not to mention those chunks of text that are so garbled, you’re just not sure what the student intended and you don’t know how to go about correcting them. That means that while the coders who mark up the CLC data do a fantastic job, there are always instances that are open to interpretation. 

When I find an error that looks worth highlighting within a particular error category, I’ll often do a more general search for the word (or form or chunk) to see whether it crops up elsewhere with different tags. Sometimes, I’ll go back to the untagged data too to see how students are using the word more generally. This can help me pin down issues that stray across error categories. Then, if the error isn’t a straightforward one, I’ll flick over to a native-speaker corpus to check how the word or structure in question is typically used – after looking at too much learner data, you start to question your intuitions! – and check a few reference sources to help me pinpoint exactly where the mismatch lies and try to come up with a clear explanation for the students.
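In code terms, that cross-tag step might look something like this - again just a sketch over the coded format shown earlier, with invented names.

import re

# Illustrative only: find every coded error involving a given word,
# whatever error code it happens to be filed under.
ERROR_TAG = re.compile(r"<#(\w+)>\s*(.*?)\s*\|\s*(.*?)\s*</#\w+>")

def word_across_tags(texts, word):
    """Collect (code, error, correction) wherever the word appears on either side."""
    hits = []
    for text in texts:
        for code, err, fix in ERROR_TAG.findall(text):
            if word in err.split() or word in fix.split():
                hits.append((code, err, fix))
    return hits

# e.g. word_across_tags(texts, "colour") gathers every error involving
# 'colour', however it was originally coded.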

It’s this multi-layered detective work - understanding where things are going wrong and figuring out the best way to help learners understand and, hopefully, overcome language issues - that I find so satisfying.

At the IVACS conference, I’ll be talking about delving into the issues specific to Spanish learners at 16.00 on Thursday, 14 June for anyone who’s going to be joining me in Valletta.
