Four Favourite Corpora
Recently, I gave a 10-minute talk at the ELT Freelancers’ Awayday in Oxford about “Simple corpus hacks for ELT editors”. I only had time
to look at one corpus and a handful of searches, but I promised to share some
of my other favourites in a blog post. So here goes…
1 Monco: In my
talk, I looked at the Monco corpus. I chose it because it’s a monitor corpus,
so it monitors current usage, updating with new data daily, and as such I find
it useful for answering language questions that haven’t yet made it into
conventional reference sources like dictionaries. For example, in my talk, we
typed well-being|wellbeing into the search box to see how wellbeing (spelled as a single word) may be catching up with the more
traditional hyphenated form (well-being) that you’ll find in most dictionaries.
The split was 35% – 65% in Monco compared with 17% – 83% in the British
National Corpus (with data from the 1980s and 90s). We also turned up some
potentially useful verb collocates for newsfeed, including scroll through and
pop up, which won’t yet have made it into a collocations dictionary. One of my
favourite features of Monco, especially for the corpus novice, is its
user-friendly search screen and its nice graphics for results.
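If you're curious about the arithmetic behind a split like the well-being/wellbeing one, here is a minimal sketch in Python. It is not how Monco itself works: the function name variant_split and the sample sentences are my own inventions for illustration, and the numbers it produces are not corpus data. It simply counts how often each spelling turns up in a handful of texts and converts the counts to percentages.

```python
import re

def variant_split(texts, variants):
    """Return each variant's share (in %) of all matches found in texts.

    `variants` should be lowercase spellings, e.g. ["well-being", "wellbeing"].
    """
    # Same idea as typing well-being|wellbeing into a search box:
    # one pattern that matches any of the alternatives as whole words.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    counts = {v: 0 for v in variants}
    for text in texts:
        for match in pattern.findall(text):
            counts[match.lower()] += 1
    total = sum(counts.values())
    if total == 0:
        # No hits at all, so there is no split to report.
        return {v: 0.0 for v in variants}
    return {v: round(100 * counts[v] / total, 1) for v in variants}

# Invented sample sentences, NOT taken from any corpus.
samples = [
    "Staff wellbeing is a priority.",
    "The report looks at student well-being.",
    "A well-being survey was sent out.",
    "Well-being matters.",
]
split = variant_split(samples, ["well-being", "wellbeing"])
print(split)
```

With real data you would feed in exported hits rather than made-up sentences, but the percentage calculation stays the same.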
On the downside, Monco’s data is drawn entirely from online
news sources, which means that it’s really only reflective of journalism, rather
than language usage in general. And although it includes sources from the UK,
US, Canada and Australia, it isn’t balanced, so there’s significantly more data
from some sources than others – a factor that can skew the results and is worth
bearing in mind.
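To see why an unbalanced corpus can mislead, it helps to compare raw hit counts with frequencies per million words. The sketch below uses entirely hypothetical figures (the source names, hit counts and corpus sizes are made up, not Monco's) and a helper I've called per_million.

```python
def per_million(hits, corpus_size):
    """Normalise a raw hit count to a frequency per million words."""
    return hits * 1_000_000 / corpus_size

# Hypothetical sub-corpora: (raw hits for a search term, total words).
# These figures are invented purely to illustrate the point.
sources = {
    "source_a": (900, 90_000_000),
    "source_b": (200, 10_000_000),
}

rates = {name: per_million(hits, size) for name, (hits, size) in sources.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} per million words")
```

On raw counts, source_a looks far ahead (900 hits against 200), but it is also nine times the size. Per million words, the item is actually twice as frequent in source_b.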
2 Brigham Young University: Not strictly a single corpus, but a collection of different
corpora available via the same site and the go-to source for lots of queries. Personally,
I tend to use COCA (the Corpus of Contemporary American English) for checking
US usage. It’s a large corpus containing a nice variety of contemporary sources
(1990 – present), including radio & TV transcripts, fiction, newspapers,
magazines and academic data. Through BYU, you can also find a host of specialized
corpora including a corpus of Wikipedia entries and even, slightly weirdly, the
Hansard corpus of British parliamentary proceedings, should that happen to fit
your purpose!
My main grumble with BYU is that I find the interface clunky
and frustrating to use, especially with its rather distracting colour-coding.
3 BAWE and BASE: The British Academic Written English corpus (BAWE) and the British Academic Spoken English corpus (BASE) are composed of written and spoken data collected from university
students at a number of British universities. The written corpus contains
essays and other coursework which received a good pass mark and the spoken data
includes lectures and seminars. I particularly like these corpora because they’re
an example of language as it might be used by the peers of the students we’re
aiming at, rather than text produced by professional writers, journalists,
academics, etc. which doesn’t necessarily provide an appropriate model for the
average ELT student. This is obviously university-level language, so is
especially relevant for EAP, but I think BAWE could be useful for any advanced
students who need to write formal essays (IELTS, CAE, Proficiency). And if
you’re looking for US academic equivalents, you could also check out MICUSP and
MICASE.
BAWE and BASE are actually available via several sources,
but I wanted the excuse to get you to experience Sketch Engine, which is, for me, the
gold standard when it comes to corpus tools and the interface used by all the
major dictionary publishers for their large corpora.
4 Spoken BNC2014:
I admit this is the corpus on my list that I’ve probably used least so far, but
I’m including it because it’s one I’m quite excited about finding uses for.
Slightly contrary to its name, it was only released in 2017 and is the result
of a massive project to collect data about current spoken English used in
everyday contexts. If you’re working on speaking materials, looking at evidence
from written English is not going to tell you anything terribly useful, because
we just don’t speak how we write. So I think this could become the go-to corpus
for anyone who wants to know how people actually say things.
Unfortunately, the Spoken BNC2014 doesn’t have the most
user-friendly interface and getting access involves a bit of a faffy sign-up
process which could be off-putting for the casual user. If spoken language is
your thing though, I think it’s worth investing the time and effort to check it
out, not least because some of the content is just really funny!
A note about corpora and
copyright: It’s important to remember that, in general, the data that
appears in a corpus is liable to all the usual copyright restrictions. That
means you can’t just pull a big chunk of language from the corpus and use it in
your activity, especially not if it’s for commercial publication. Occasionally,
of course, you come across very short, ‘vanilla’ examples which could have come
from almost anywhere (A young woman
opened the door. The traffic was particularly bad.), but to be honest,
these are few and far between. Generally, when I search for a particular
language item, I’ll scan through the examples and jot down a ‘frame’:
I/you scroll
through my/your (Facebook) newsfeed to see/searching for/on the
train …
Then I’ll use my notes as the basis for an example that keeps the
feel and pattern of the ones I’ve looked at, but fits my teaching purpose … and
doesn’t infringe copyright.
There are lots of different corpora out there and corpus
fans will have their personal favourites. If you’re new to corpora though, I’d
say pick one or two to check out, play around with a few simple searches, use
the help to get you started, and see what’s most useful for you. Be warned, though:
it can be addictive!