Corpus insider #1: Representativeness
As I was putting together my talk for the ELT Freelancers’ Awayday and the follow-up blog post, I realized that over 20 years
of using corpora, there are a whole host of factors I’ve learnt to take into
account. I touched very briefly on a few of them in my talk, but I thought some
might be worth exploring further. So, this is the first in a series of posts
about things you might need to bear in mind if you want to use corpus tools to
inform your work on ELT materials.
When I explain to people what a corpus is, I usually start
off by saying that it’s a large collection of language that we use to represent
the way English is used as a whole. It seems a simple premise, but when you dig
a bit deeper, it gets more complicated. As with any research, the validity of
your results is dependent on the quality of your data. The data your
chosen corpus contains will determine exactly what kind of language it can
actually be said to represent and so how useful it is for your purpose. To
take a simple example, if you were writing for a specifically British English
market, using a corpus that contained only American data wouldn’t be very
useful. Similarly, if you were working on speaking materials, looking at usage in
a corpus of entirely written data wouldn’t really tell you much about how
people normally speak. Understanding a bit about the corpus you plan to use,
the data it contains, and what that might represent is absolutely essential
before you start doing any corpus research.
Corpus types:
There are two main types of corpus: those which contain data
drawn from one type of source or genre and those which are said to be
‘balanced’ and contain data from a wide variety of different genres. The first
type includes purely spoken corpora (like the Spoken BNC2014) and corpora of
academic writing (either published texts, like the academic part of COCA, or student writing, like BAWE or MICUSP). Many
corpora are also composed largely of journalism, because it’s one of the simplest
sources of data to collect, especially for those corpora that rely on web-based
content (e.g. Monco, NOW, etc.).
Large balanced corpora, containing written and spoken data from a wide range of sources, are much more difficult to put together. For this reason, they’re mainly owned and maintained by large publishers, especially those who produce dictionaries, and aren’t publicly available. The British National Corpus (BNC) is a balanced corpus that’s freely available, but it’s relatively small by modern standards and, perhaps more importantly, it’s becoming increasingly out of date (with data from the 1980s and 90s). The Corpus of Contemporary American English (COCA) sits in a mid-ground with data from spoken sources (although all radio transcripts rather than everyday conversation), fiction, popular magazines, newspapers and academic texts. It’s reasonably balanced, although, as the name suggests, it’s all American.
The trouble with media hype:
Data from newspapers, magazines and blogs is very easy to
collect and makes up a large proportion of many corpora. It can provide lots of
interesting information about language used to talk about a wide range of
topics, but it’s important to remember that journalism as a genre has its own
quite marked features that don’t necessarily reflect the way that ordinary
people use language day to day. It may seem obvious to say that journalists report news, but
that means they’re generally writing about what’s new, surprising,
shocking or problematic. They also want to draw their readers in and keep their
attention with colourful language choices and hyperbole. For my recent talk, I
demonstrated an example of a query about the language of social media and,
in particular, which verbs collocate with the noun ‘newsfeed’. I used the Monco corpus,
because I was interested in up-to-date usage, and came up with the following
verbs:
The first two feel like expressions you might use in
conversation. The others, however, are clearly journalistic in style, bemoaning
the way that a particular trend is overtaking our online lives. Searching a couple of other news-dominated corpora came up with similar results (enTenTen: spam/clutter/bombard/clog your newsfeed; NOW: scroll through/appear on/pop up on/tweak/flood/clog your newsfeed). They’re all
interesting collocations, but they’re probably not the first ones you’d choose to
teach an intermediate learner who wants to talk about the way they use social
media themselves. That’s not to say you shouldn’t use these corpora when you're researching ideas for ELT materials, but knowing
a corpus contains only or largely data from journalistic sources means that you can be on the lookout for this type of
language and be selective about what you use as appropriate for the learners
you’re writing for.
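If you can export your search results along with their source genre, one way to stay on the lookout is to count collocates separately for each genre. The snippet below is only a rough sketch of that idea in Python: the sample sentences and genre labels are invented for illustration (they are not real corpus data), the function name is my own, and it simply counts the word two places before ‘newsfeed’, which assumes a ‘verb + your/my + newsfeed’ pattern.

```python
from collections import Counter, defaultdict

# Invented toy data standing in for exported corpus hits: (genre, sentence).
SAMPLE_HITS = [
    ("news", "Adverts clog your newsfeed every morning"),
    ("news", "Brands flood your newsfeed with offers"),
    ("news", "Fake stories clog your newsfeed again"),
    ("conversation", "I check my newsfeed on the bus"),
    ("conversation", "I check my newsfeed before bed"),
]


def collocates_by_genre(hits, node="newsfeed", offset=2):
    """Count the word `offset` tokens before `node`, grouped by genre.

    Assumption: hits follow a 'verb + determiner + node' pattern, so the
    verb sits two tokens to the left of the node word.
    """
    counts = defaultdict(Counter)
    for genre, sentence in hits:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == node and i >= offset:
                counts[genre][tokens[i - offset]] += 1
    return counts


if __name__ == "__main__":
    for genre, counter in collocates_by_genre(SAMPLE_HITS).items():
        print(genre, counter.most_common())
```

With real data you would want proper tokenising and part-of-speech tagging, but even a crude split by genre makes it easier to see which collocations come mainly from journalism.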
Professional and lay writers:
Unsurprisingly, the majority of written corpus data comes
from published sources and, as such, it’s written by people who are
professional writers: authors, journalists, copywriters. As we saw with
journalism, above, this can mean the language is more colourful and probably
more varied than the average lay person tends to use. This came out
very clearly in a recent study into academic vocabulary (Durrant, 2016*) which
looked at how many of the words on the Academic Vocabulary List (based on a
corpus of published academic writing) were actually used regularly by student
writers (using a corpus of university-level student writing). It turned out
that the student essays contained a vastly narrower range of vocabulary than the
published academic texts, written by experienced (and edited!) academics.
That’s not to say the student writing was in some way lacking – all the papers
had got high marks – it’s just a different genre with different expectations.
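To make the general idea of that kind of comparison concrete, here is a very simplified Python sketch that works out what share of a word list turns up in at least a couple of texts in a corpus. It is not Durrant's actual method: the function name, the threshold and the tiny word list and ‘corpora’ are all made up for illustration.

```python
def share_of_list_used(word_list, texts, min_texts=2):
    """Return the fraction of `word_list` found in at least `min_texts` texts."""
    used = 0
    for word in word_list:
        # Count how many texts contain the word as a whole token.
        n_texts = sum(1 for text in texts if word in text.lower().split())
        if n_texts >= min_texts:
            used += 1
    return used / len(word_list)


# Invented toy data, standing in for a word list and two small corpora.
WORD_LIST = ["analyse", "data", "hypothesis", "paradigm"]
STUDENT_TEXTS = [
    "we analyse the data",
    "the data support the hypothesis",
    "we analyse results",
]
PUBLISHED_TEXTS = [
    "this paradigm shapes the data we analyse",
    "the hypothesis fits the paradigm",
    "we analyse data under this hypothesis",
]

if __name__ == "__main__":
    print("student:", share_of_list_used(WORD_LIST, STUDENT_TEXTS))
    print("published:", share_of_list_used(WORD_LIST, PUBLISHED_TEXTS))
```

A lower share for one corpus doesn't mean its writers are doing anything wrong. It just shows that a list built from one genre may not match what's typical in another.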
When you’re using a corpus to search for ideas, it’s all too
easy to pick out examples and patterns that are elegant or appealing, but I
think it’s always important to ask yourself how typical they are of what the average
person might say or write. Is it a writerly flourish? Is it helpful as a model for your target
learners?
I’m not saying that as ELT writers and editors we should reject all corpus evidence as flawed and
unhelpful. Far from it, I think corpus tools can be incredibly helpful in
backing up our intuitions and uncovering patterns of usage we might not have
thought of, but they are just that, tools, and should be used with an element
of caution. It's all too easy to be drawn in by a corpus that's new or especially large or has a nice interface and nifty tools, but making sure you know what your corpus represents is vital. If a collocation or pattern feels unlikely or overly fancy, then ask yourself why.
Don’t just accept the first results that pop up: click through to the examples,
scroll down to see where they come from and understand exactly what’s going on.
* There's a good summary of Durrant's study on ELT Research Bites.
Labels: corpus insider, corpus research, ELT Freelancers Awayday, ELT materials, genres, materials writing, representativeness
4 Comments:
Could you please resend the links for Durant's research? It seems that the original one has expired. Thank you so much.
Hi, I'm sorry, but unfortunately, I think the Research Bites blog was discontinued. I'm afraid I don't have any way to find out what the link was to Durrant's research. As a freelancer, I'm not able to access academic articles so I was only able to read the summary on the blog.
Thank you for your kind reply.
Could you please provide any relevant information about Durrant's research? I am able to access academic articles but I need some search words to get it.
TBH, I think you have more access to the facilities necessary to find this than I do. If you just use a bit of common sense and search using the information given in the blog post you'll find it. I just googled "Durrant 2016 academic vocabulary" (the exact words from my post) and it came up immediately. If you're doing academic study, you really should be able to work that out for yourself.