Lexicoblog

The occasional ramblings of a freelance lexicographer

Monday, November 30, 2020

A jobbing corpus linguist

In a Facebook corpus linguistics group I follow, someone recently posted the following question:

I immediately wanted to put my hand up and shout "Me! Me!" I excitedly typed a reply in the comments, but soon realized I had more to explain than I could realistically fit in, so I promised the poster a follow-up blog post. 

Getting started: 

So, how did I become a corpus linguist? Well, after about 7 years as a full-time EFL teacher, I realized the teaching lifestyle wasn't for me and I did an MA at Birmingham University. I already had an idea that dictionaries might be my thing – which was why I chose Birmingham as the home of COBUILD – and I took options in lexicography and corpus linguistics. 

Lexicography: 

I finished my MA in late 1998 at a time when there was a bit of a boom in ELT dictionaries. I was actually lucky enough to have interviews for in-house lexicography roles at three big ELT dictionary publishers within the space of a few months. I took a job at CUP – mostly because the timing worked out best – and was lucky enough to get stuck in straight away on the new, from-scratch Cambridge Learner's Dictionary (an intermediate-level dictionary). I learnt loads from my fabulous in-house colleagues and when I later went freelance, worked for the next 5 years or so on dictionaries for most of the major publishers (CUP, Longman, Macmillan, OUP, Chambers and eventually, many years later, Collins COBUILD). 

Broadening out: 

I worked on back-to-back lexicography projects through to around 2005. A few things then happened to send me off in different directions. Having worked with the Cambridge Learner Corpus when I was in-house (on dictionary error notes), I was asked by CUP to do some learner corpus research into common learner errors for their new Common Mistakes series of books. While doing the research, I realized I'd quite like to take the next step and write the material too, so ended up authoring two of the books in the series. After a long stretch of lexicography, it was nice to branch out into other things and I started working on more general ELT writing, initially alongside lexicography projects. Over the next few years, my focus shifted more towards writing – a shift that happened to coincide with a gradual decline in dictionary projects as several of the big publishers scaled back their dictionary operations. 

 


A mixed portfolio: 

Since then the mix of general writing and corpus-related work I do has varied year-to-year. I've done bursts of mainly writing, but always come back to corpus work. That's continued to include dictionaries and other reference projects, like the Collins COBUILD Key Words series. I also do quite a lot of learner corpus research for CUP to feed into their ELT books. Sometimes that's just straightforward research investigating the errors made by a specific group of learners – mostly by level, but also by L1 – where I research to a brief and produce a report that goes to the authors. Frequently though, I do the research and also write the material, often in the form of notes and practice activities around specific learner issues. I had a look back over the past 3 years and my mix of work breaks down very roughly as below.


So, in answer to the original question, no, I don't have a job as such as a corpus linguist. I do, however, spend a large chunk of my working life using my corpus linguistic skills in some way or another. And even on the jobs I haven't classified as directly corpus research, I'm dipping in and out of corpora pretty much daily for almost everything I do.


Monday, December 02, 2019

Missing grammar: parallel structure


I've been researching learner language using the Cambridge Learner Corpus for 20 years now and there are certain issues that crop up again and again among learners at all levels. One that I pick up on regularly is illustrated in the examples below (made-up examples rather than real corpus data, but they illustrate the point):

At the weekend, he goes to the park and play football. (subject-verb agreement)
I like playing football and run. (verb + -ing form)
I'd love to visit Paris and seeing the Eiffel Tower. (verb + to do)
We went to the park and play football. (past simple verb form)
We can swim in the sea and playing volleyball on the beach. (modal + verb form)
I've tidied the kitchen and did the washing up. (present perfect/past participle form)
I was sitting on the train, chatted to my friend on the phone. (past continuous/-ing form)


Basically, students attempt to use a second verb form (usually) after a conjunction without repeating the subject, but they forget to match the verb form to the start of the sentence. In each of the examples above, the correct form would become clear(er) if we inserted the 'missing' subject (+verb/auxiliary/modal):

At the weekend, he goes to the park and [he] plays football.
I like playing football and [I like] running.
I'd love to visit Paris and [I'd love to] see the Eiffel Tower.
We went to the park and [we] played football.
We can swim in the sea and [we can] play volleyball on the beach.
I've tidied the kitchen and [I've] done the washing up.
I was sitting on the train [and I was] chatting to my friend on the phone.

It's something I've noted in countless corpus reports, but I've never been quite sure what to call it. Until last week when I came across it for the first time in an ELT coursebook referred to as parallel structure. It was in a B2 book in a section about academic writing style and covered a wider range of structures than those above (not just verb phrases, but nouns, adjectives and full clauses too), but it still made me cheer out loud at my desk. It's long puzzled me why these incredibly common structures aren't explicitly addressed in most ELT materials when they cause so many issues for students.

I rarely get the chance to choose the grammar points I cover in the materials I work on, because they're mostly supplementary materials and the syllabus is already fixed by the time I get started. So I've never had the opportunity to cover this explicitly myself. I have tried to include examples in practice exercises, but they usually end up getting cut by editors who want all the items to fit on a single line and don't like the longer examples these structures often involve (grrr!).

So I'm making a case for this to be included explicitly in more ELT materials. It's relevant at every level and with almost every kind of verb structure we teach. It doesn't have to be a separate grammar point and it doesn't even have to have the label parallel structure. I think it's a great thing to bring up when you're revising a particular verb form as a slight variation on the usual practice activities, just to raise students' awareness. You could have a simple intro as above showing/eliciting the 'missed out' words and the correct second verb forms. Then straight into some practice examples (as gap-fills or freer practice). It works perfectly for any kind of list: 

  • daily routines (She leaves the house at 8 and catches the bus at 8.15)
  • a dramatic narrative (He opened the box and looked inside)
  • background to a narrative (People were sitting in the café, eating and drinking)
  • things people like doing (I like watching TV and chatting to my friends online)
  • things people would like to do in the future (I'd like to go to university and study drama)
  • things ticked off on a list (We've booked a room for the party and set up a Facebook page)
  • things on a to-do list (I still need to confirm the hotel booking and renew my travel insurance)

I'm happy to be proved wrong with a flurry of comments about ELT materials that practise exactly this already ...


Wednesday, June 20, 2018

#IVACS2018: learner corpus research & ELT materials for Spanish learners


Last week, I spoke at the IVACS (Inter-Varietal Applied Corpus Studies) conference in Malta about my work using the Cambridge Learner Corpus (CLC) to help develop ELT materials targeted at Spanish learners of English. So, following on from my last post about my work generally using the learner corpus, here's a brief summary of my talk.

Photo from Niall Curry via Twitter

ELT: a global market

From the perspective of a large ELT publisher, if they're to invest in producing a major coursebook series - over several levels each with multiple components - it makes economic sense to sell it to the widest possible global market. This one-size-fits-all approach, however, ignores the fact that different learners have different needs. Just one of the factors that differentiates learners is the influence of their first language, their L1. It's well-established that friction between a learner's L1 and target language, in this case English, can result in language transfer issues or interference, a factor not accounted for in materials for a global audience. In recent years, I've worked on a number of projects for CUP that have involved localizing materials to target them more effectively at Spanish learners. More specifically, I've used the CLC to investigate errors by Spanish learners to feed into English for Spanish Speakers (ESS) versions of a number of books.

For more about the CLC see my previous post.

Error types:
When you start looking at learner data for a specific L1 group, three broad error types emerge. There are global errors, that is, errors that are common across learners more-or-less regardless of L1. These can be described as developmental or intralingual errors that are a result of the inherent quirks and irregularities in English that trip everyone up. Then there are interlingual errors where the learner's L1 rubs up against English in a way that creates friction and interference. Some of these are common across a language group, such as errors frequent among all Romance language speakers learning English, while others are L1 specific, so peculiar to, say, Spanish speakers.

In my session, I took an example of each error type to show how I went about investigating the error and then incorporating activities to target the issue into classroom materials.

Global errors:
One classic example of a global, developmental error is with irregular verbs. Below is a list of the most common past simple/past participle verb inflection errors across the whole learner corpus. As you'd expect, there are some irregular verbs (pay, choose, rise, hear) and others where the spelling rules around whether or not to double the final consonant cause difficulties.

1 occured; 2 happend; 3 payed; 4 choosen; 5 prefered; 6 planed; 7 rised; 8 developped; 9 heared; 10 stoped

If we then look at the top tens for Spanish and French speakers for comparison, we see a lot of overlap.

Spanish: 1 choosen; 2 prefered; 3 payed; 4 teached; 5 refered; 6 planed; 7 occured; 8 heart; 9 writen; 10 tryed


French: 1 developped; 2 mentionned; 3 occured; 4 prefered; 5 choosen; 6 planed; 7 rised; 8 red; 9 enroled; 10 stoped

There are a few interesting differences though. The Spanish use of 'heart' as the past form of 'hear' doesn't seem to follow the pattern you'd expect - as with 'heared' in the global list. This can be put down to an issue of pronunciation; Spanish speakers tend not to pronounce voiced consonants at the end of words, so that a /d/ sound often becomes a /t/ (or is sometimes lost altogether) and this seems to spill over into the spelling. In the French list, we see the extra double letters in 'developped' and 'mentionned', this time because both verbs have French cognates (développer and mentionner) that are spelt with a double consonant, which then creeps into the English. So whilst all learners need help and reminders about similar verb inflections, there are local factors that might come into play too.
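To make the overlap concrete, here's a minimal Python sketch that compares the three top-ten lists exactly as they're given above. The list contents come straight from the post; the variable and function names (`GLOBAL`, `shared`, `only_in`) are just illustrative, and this isn't how the lists were produced from the corpus in the first place.

```python
# Top-ten verb inflection errors, copied from the lists above
# (the misspellings are the learners' own).
GLOBAL = ["occured", "happend", "payed", "choosen", "prefered",
          "planed", "rised", "developped", "heared", "stoped"]
SPANISH = ["choosen", "prefered", "payed", "teached", "refered",
           "planed", "occured", "heart", "writen", "tryed"]
FRENCH = ["developped", "mentionned", "occured", "prefered", "choosen",
          "planed", "rised", "red", "enroled", "stoped"]


def shared(a, b):
    """Return the items of list a that also appear in list b, in a's order."""
    b_set = set(b)
    return [word for word in a if word in b_set]


def only_in(a, b):
    """Return the items of list a that do not appear in list b, in a's order."""
    b_set = set(b)
    return [word for word in a if word not in b_set]


for name, l1_list in [("Spanish", SPANISH), ("French", FRENCH)]:
    print(name, "shared with global list:", shared(l1_list, GLOBAL))
    print(name, "not in global list:", only_in(l1_list, GLOBAL))
```

On these lists, half of the Spanish top ten and seven of the French top ten also appear in the global list; the leftovers ('heart', 'mentionned' and so on) are the ones worth a closer look for L1 influence.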

Language Group errors and the issue of 'below-level' mistakes:
The error I looked at here is around students adding an unnecessary 's' inflection onto adjectives to agree with a plural noun, so "differents reasons", "two news friends", "interestings questions", etc. Of course, many languages have adjective inflections that agree with the noun they modify for number and these kinds of errors are particularly simple to search for using the coded version of the corpus (where errors are tagged by type). Interestingly though, the corpus data suggests that this particular error is especially prevalent amongst Romance language speakers (Spanish, French, Italian, Portuguese).

What's perhaps more interesting here from a materials writer's perspective is that these errors crop up across levels, with examples right up to proficiency level in the data, even though students will likely learn the basic rules about adjectives in English in their beginner class. So these aren't 'errors' in the sense that the learners clearly know the rules around adjectives in English. Instead, they're mistakes, inadvertent slips. Looking at learner corpus data reveals a lot of these and it shows that the pattern of these mistakes can often be described as something of a bell curve, whereby learners make few errors when they first learn a new language form, partly just because they're cautious and don't use it very much. Then as they progress, they start to make a lot more mistakes with the forms they learnt at previous levels as they experiment and become more adventurous. You could say that they take their eye off the ball with adjectives by B1 or B2 because they're more concerned about complex sentence constructions and whether or not to use a past perfect simple verb form, for example. Then eventually the mistakes start to tail off as learners become more proficient, their language skills more automatic and they have the cognitive capacity to tidy up.


This presents a problem for me as a corpus researcher trying to feed into classroom materials. On the one hand, the data is telling me that these mistakes are significant at mid-levels and probably worth highlighting, but how do I convince editors, teachers and students that they need to focus on simple adjective forms at A2 or even B1 level without the materials seeming 'dumbed down' and 'below level'? The approach I took in one book illustrated below (Empower, A2, CUP 2016) was to:
  1. Make it clear that this is revision. The note starts with the word 'remember' to acknowledge that students probably already know this and the explanation is short and simple - they don't need the 'rules' explained in detail all over again.
  2. Combine several errors around adjectives. An activity just practising adjectives with singular and plural nouns would be pretty pointless at this level. Once the issue had been highlighted, students would find any follow-up activity mechanical and wouldn't engage with the point. By combining a number of issues, there's more to think about and you up the challenge. And a proof-reading activity of this kind is an authentic task type mirroring what students need to do with their own writing to reduce the number of mistakes that slip through.





Going beyond error codes:
The third point in the box above is also worth a bit more attention from a research perspective. The first two errors here jump out of the coded data (they're tagged as adjective inflection and word order errors), but the issue with the word 'colour' was less obvious. As I was looking through adjective examples, I started to notice various instances of awkward phrasing which had been tackled in the coded data in different ways.

I bought it in <#MD> | a </#MD> green colour .  (KET, A2)

It's blue and white <#MT> | in </#MT> colour . (KET, A2)

It only cost 20€ and <#DD> it | its </#DD> colours are red and black. (KET, A2)

<#UP> It's | Its </#UP> colour <#MV> | is </#MV> black. (KET, A2)

I like it because <#MA> | it </#MA> is very small and <#MA> | it </#MA> is <#UN> colour | </#UN> black. (KET, A2)

Anyone who's ever marked student writing will know that there's more than one way to go about trying to correct an oddly-worded sentence and the suggestions in the coding above are all legitimate, but somehow didn't quite ring true to me. It struck me that in each case, the best solution would actually just be to drop the word 'colour' altogether. You might have noticed that all the examples are very similar and they were all indeed in response to the same question which asked students to describe a new mobile phone, including what colour it was. Hmm, so was this just a case of the wording of the question skewing the data? Was it just task effect? It prompted me to search more widely and I found that although I had a lot of examples from this one question, the same issue was cropping up at other levels amongst the Spanish learner data in response to completely different tasks. And from what I understand (I should confess at this point that I'm not a Spanish speaker!), it’s possible to say something along the lines of "a dress of colour blue" in Spanish. It's not a really major error, but it's a high-frequency word and I think the point fits nicely here and, hopefully, gives students (and teachers) pause for thought over something they may not have considered before.
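For anyone curious about working with error-coded lines like the ones above, here's a rough Python sketch. It assumes only what's visible in the examples: each annotation has the shape `<#CODE> learner's text | correction </#CODE>`, with either side allowed to be empty. The function names are hypothetical, it doesn't try to interpret what the individual codes mean, and it isn't a description of the actual CLC tools.

```python
import re

# One annotation: <#CODE> learner text | correction </#CODE>
# Either side of the bar may be empty (something missing or unnecessary).
ERROR_TAG = re.compile(r"<#(\w+)>\s*(.*?)\s*\|\s*(.*?)\s*</#\1>")


def extract_errors(text):
    """Return a list of (code, learner_text, correction) tuples."""
    return [(m.group(1), m.group(2), m.group(3))
            for m in ERROR_TAG.finditer(text)]


def _tidy(text):
    """Collapse runs of spaces and remove spaces before punctuation."""
    text = " ".join(text.split())
    return re.sub(r"\s+([.,!?;:])", r"\1", text)


def learner_version(text):
    """Rebuild the sentence as the learner wrote it."""
    return _tidy(ERROR_TAG.sub(lambda m: m.group(2), text))


def corrected_version(text):
    """Rebuild the sentence with the coders' corrections applied."""
    return _tidy(ERROR_TAG.sub(lambda m: m.group(3), text))


line = ("I like it because <#MA> | it </#MA> is very small and "
        "<#MA> | it </#MA> is <#UN> colour | </#UN> black.")
print(extract_errors(line))
print(learner_version(line))
print(corrected_version(line))
```

Pulling out the learner's wording and the correction side by side like this makes it easier to spot cases where several different codes are circling the same underlying problem, as with 'colour' here.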

L1-specific errors and classic false friends:
Finally, some of the most satisfying errors are the ones you track down which are clearly examples of L1 interference. And perhaps the most fun are the simple 'false friends'; the English words which seem to be a near equivalent to something in Spanish, but turn out to mean something different. I note these down as I work through the learner data, then try to collect them together into thematic sets which I can tie in with the coursebook syllabus. Below are a few around the theme of 'information' that I was looking at recently for some B2 material, shown along with the Spanish 'false friend' in brackets.

I am writing to you to reply to your <#RN> announcement | advertisement </#RN> in the newspaper. (anuncio)

It is really complicated to talk about a <#RN> theme | subject </#RN> as controversial as the cruelty of keeping animals in zoos. (tema)

What <#UD> a | </#UD> great <#RN> notice | news </#RN>!  (noticia)

We would like to know if you will be able to come, and give a <#RN> conference | talk </#RN>. (conferencia)


In some of these, the meaning of the Spanish word simply doesn't match its English near equivalent, although they're often in the same semantic ballpark: announcement/advert, notice/news. Others are more about range of usage. So, the Spanish 'tema' seems more widely applicable than the English word 'theme' and gets used by students where 'subject' or 'topic' would fit better in English. And 'conferencia' in Spanish can describe both a conference and an individual talk or lecture. Activities for these are about raising students' awareness, drawing attention to the differences and where relevant, provoking some discussion.
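Collecting pairs like these into a working list can be sketched in a few lines of Python. The sketch below uses only the four example lines above, each ending with the Spanish word in brackets, and pulls out the learner's word and the correction from the `RN` tag (the only code it looks at; I'm assuming from the examples that it marks a noun to be replaced). The names `EXAMPLES` and `false_friend_rows` are made up for illustration.

```python
import re

# The four example lines from above: an error-coded sentence followed by
# the Spanish 'false friend' in brackets.
EXAMPLES = [
    "I am writing to you to reply to your <#RN> announcement | advertisement </#RN> in the newspaper. (anuncio)",
    "It is really complicated to talk about a <#RN> theme | subject </#RN> as controversial as the cruelty of keeping animals in zoos. (tema)",
    "What <#UD> a | </#UD> great <#RN> notice | news </#RN>!  (noticia)",
    "We would like to know if you will be able to come, and give a <#RN> conference | talk </#RN>. (conferencia)",
]

# Learner's word and correction inside an RN tag.
RN_TAG = re.compile(r"<#RN>\s*(.*?)\s*\|\s*(.*?)\s*</#RN>")
# The bracketed Spanish word at the very end of the line.
TRAILING_GLOSS = re.compile(r"\(([^()]+)\)\s*$")


def false_friend_rows(lines):
    """Return (spanish_word, learner_word, correction) for each RN tag."""
    rows = []
    for line in lines:
        gloss = TRAILING_GLOSS.search(line)
        spanish = gloss.group(1) if gloss else None
        for learner_word, correction in RN_TAG.findall(line):
            rows.append((spanish, learner_word, correction))
    return rows


for spanish, learner_word, correction in false_friend_rows(EXAMPLES):
    print(f"{spanish}: {learner_word} -> {correction}")
```

A flat list of rows like this is easy to sort or group by theme afterwards, which is the step that ties the false friends in with a coursebook syllabus.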

As a side note here, when looking for example sentences for practice activities, although the learner corpus is great for getting a feel for level, you have to be careful not to transfer subtly awkward phrasing and atypical constructions, ‘learnerese’ if you like, into materials. Especially at higher levels and with subtle differences, such as the theme/subject distinction, I’ll often have a browse through native-speaker (NS) corpus data for example sentences. That way I’m ensuring learners have an authentic model and it’s also good to up the level of the language just a little to provide a sense of challenge and progress even when essentially revising.


Research into practice:
Hopefully, this handful of examples gives a taste of the work I've been doing and the way I make use of the learner corpus, both by using the error tags and going beyond the tags to explore less obvious errors. I've also tried to show just some of the issues that emerge in trying to translate the results of that analysis into materials that fit in with the coursebook syllabus, that focus on significant, but apparently below-level mistakes in a way that's appropriately challenging and engaging, and that draw learners' attention to language points that are especially relevant to them rather than just part of a generic global syllabus.
