The occasional ramblings of a freelance lexicographer

Monday, March 05, 2018

Corpus insider #1: Representativeness

As I was putting together my talk for the ELT Freelancers’ Awayday and the follow-up blog post, I realized that over 20 years of using corpora, there are a whole host of factors I’ve learnt to take into account. I touched very briefly on a few of them in my talk, but I thought some might be worth exploring further. So, this is the first in a series of posts about things you might need to bear in mind if you want to use corpus tools to inform your work on ELT materials.

When I explain to people what a corpus is, I usually start off by saying that it’s a large collection of language that we use to represent the way English is used as a whole. It seems a simple premise, but when you dig a bit deeper, it gets more complicated. As with any research, the validity of your results is dependent on the quality of your data. The data your chosen corpus contains will determine exactly what kind of language it can actually be said to represent and so how useful it is for your purpose. To take a simple example, if you were writing for a specifically British English market, using a corpus that contained only American data wouldn’t be very useful. Similarly, if you were working on speaking materials, looking at usage in a corpus of entirely written data wouldn’t really tell you much about how people normally speak. Understanding a bit about the corpus you plan to use, the data it contains, and what that might represent is absolutely essential before you start doing any corpus research.

Corpus types:
There are two main types of corpus: those which contain data drawn from one type of source or genre, and those which are said to be ‘balanced’ and contain data from a wide variety of different genres. The first type includes purely spoken corpora (like the Spoken BNC2014) and corpora of academic writing (either published texts, like the academic part of COCA, or student writing, like BAWE or MICUSP). Many corpora are also composed largely of journalism, because it’s one of the simplest sources of data to collect, especially for those corpora that rely on web-based content (e.g. Monco, NOW).

Large balanced corpora, containing written and spoken data from a wide range of sources, are much more difficult to put together. For this reason, they’re mainly owned and maintained by large publishers, especially those who produce dictionaries, and aren’t publicly available. The British National Corpus (BNC) is a balanced corpus that’s freely available, but it’s relatively small by modern standards and, perhaps more importantly, it’s becoming increasingly out of date (with data from the 1980s and 90s). The Corpus of Contemporary American English (COCA) sits in a mid-ground with data from spoken sources (although all radio transcripts rather than everyday conversation), fiction, popular magazines, newspapers and academic texts. It’s reasonably balanced, although all American as the name suggests.

The trouble with media hype:
Data from newspapers, magazines and blogs is very easy to collect and makes up a large proportion of many corpora. It can provide lots of interesting information about language used to talk about a wide range of topics, but it’s important to remember that journalism as a genre has its own quite marked features that don’t necessarily reflect the way that ordinary people use language day to day. It may seem obvious to say that journalists report news, but that means they’re generally writing about what’s new, surprising, shocking or problematic. They also want to draw their readers in and keep their attention with colourful language choices and hyperbole. For my recent talk, I demonstrated an example of a query about the language of social media and in particular, which verbs collocate with the noun ‘newsfeed’. I used the Monco corpus, because I was interested in up-to-date usage, and came up with the following verbs:

scroll through your newsfeed
pop up on your newsfeed
fill/flood/dominate/clog up your newsfeed

The first two feel like expressions you might use in conversation. The others, however, are clearly journalistic in style, bemoaning the way that a particular trend is overtaking our online lives. Searching a couple of other news-dominated corpora came up with similar results (enTenTen: spam/clutter/bombard/clog your newsfeed; NOW: scroll through/appear on/pop up on/tweak/flood/clog your newsfeed). They’re all interesting collocations, but they’re probably not the first ones you’d choose to teach an intermediate learner who wants to talk about the way they use social media themselves. That’s not to say you shouldn’t use these corpora when you’re researching ideas for ELT materials, but knowing a corpus contains only or largely data from journalistic sources means that you can be on the lookout for this type of language and be selective about what you use as appropriate for the learners you’re writing for.
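
If you’re comfortable with a little scripting, one way to be on the lookout is to tally collocates by the genre of the text each hit came from. The Python below is only a rough sketch of that idea: the hits and their genre labels are invented for illustration (they are not real corpus output), and the function names are my own rather than part of any corpus tool. In practice you would fill the list from whatever source information your corpus interface shows.

```python
from collections import Counter, defaultdict

# Invented, illustrative hits: (verb collocate of 'newsfeed', genre of source).
# These are NOT real corpus results; replace them with your own notes/export.
hits = [
    ("scroll through", "news"),
    ("scroll through", "conversation"),
    ("pop up on", "news"),
    ("pop up on", "conversation"),
    ("flood", "news"),
    ("clog up", "news"),
    ("dominate", "news"),
]


def genre_breakdown(hits):
    """Count, for each collocate, how many hits come from each genre."""
    table = defaultdict(Counter)
    for collocate, genre in hits:
        table[collocate][genre] += 1
    return table


def single_genre_collocates(hits, genre):
    """Return (sorted) the collocates whose hits all come from one genre."""
    table = genre_breakdown(hits)
    return sorted(c for c, counts in table.items() if set(counts) == {genre})


# Collocates attested only in news texts in this toy sample.
print(single_genre_collocates(hits, "news"))
```

A collocate that turns up only in news texts isn’t necessarily one to avoid, but it’s a prompt to go and read the examples before deciding whether it belongs in your materials.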

Professional and lay writers:
Unsurprisingly, the majority of written corpus data comes from published sources and, as such, it’s written by people who are professional writers: authors, journalists, copy-writers. As we saw with journalism, above, this can mean the language is more colourful and probably more varied than the average lay person typically tends to use. This came out very clearly in a recent study into academic vocabulary (Durrant, 2016*) which looked at how many of the words on the Academic Vocabulary List (based on a corpus of published academic writing) were actually used regularly by student writers (using a corpus of university-level student writing). It turned out that the student essays contained a vastly narrower range of vocabulary than the published academic texts, written by experienced (and edited!) academics. That’s not to say the student writing was in some way lacking – all the papers had got high marks – it’s just a different genre with different expectations. 
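
To make the idea of comparing vocabulary range a bit more concrete, here is a very simplified Python sketch of one possible measure: the share of words on a list that turn up at least once in a set of texts. This is my own illustration, not the method used in Durrant’s study, and the word list and sentences are invented toy data; the function name `coverage` is hypothetical too.

```python
def coverage(word_list, tokens):
    """Share of the words on word_list that occur at least once in tokens."""
    listed = {w.lower() for w in word_list}
    if not listed:
        return 0.0  # avoid dividing by zero for an empty list
    seen = {t.lower() for t in tokens}
    return len(listed & seen) / len(listed)


# Invented toy data, purely for illustration.
word_list = ["analyse", "concept", "derive", "framework"]
published = "we derive a framework to analyse the concept".split()
student = "this essay will analyse the concept".split()

print(coverage(word_list, published))  # 1.0
print(coverage(word_list, student))    # 0.5
```

A real comparison would need much more care (lemmatisation, corpus size, how often each word is used and so on), but even a rough count like this shows how the same word list can look very different depending on whose writing you check it against.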

When you’re using a corpus to search for ideas, it’s all too easy to pick out examples and patterns that are elegant or appealing, but I think it’s always important to ask yourself how typical they are of what the average person might say or write. Is it a writerly flourish? Is it helpful as a model for your target learners?

I’m not saying that as ELT writers and editors we should reject all corpus evidence as flawed and unhelpful. Far from it: I think corpus tools can be incredibly helpful in backing up our intuitions and uncovering patterns of usage we might not have thought of. But they are just that, tools, and should be used with an element of caution. It’s all too easy to be drawn in by a corpus that’s new or especially large or has a nice interface and nifty tools, but making sure you know what your corpus represents is vital. If a collocation or pattern feels unlikely or overly fancy, then ask yourself why. Don’t just accept the first results that pop up. Click through to the examples, scroll down to see where they come from and understand exactly what’s going on.

* There's a good summary of Durrant's study on ELT Research Bites.
