The occasional ramblings of a freelance lexicographer

Monday, February 05, 2018

Four Favourite Corpora

Recently, I gave a 10-minute talk at the ELT Freelancers’ Awayday in Oxford about “Simple corpus hacks for ELT editors”. I only had time to look at one corpus and a handful of searches, but I promised to share some of my other favourites in a blog post. So here goes…

1 Monco: In my talk, I looked at the Monco corpus. I chose it because it’s a monitor corpus: it monitors current usage, updating with new data daily, and as such, I find it useful for answering language questions that haven’t yet made it into conventional reference sources like dictionaries. For example, in my talk, we looked at how wellbeing (spelled as a single word) may be catching up with the more traditional hyphenated form (well-being) that you’ll find in most dictionaries (simply by typing well-being|wellbeing into the search box). The split was 35% – 65% in Monco compared with 17% – 83% in the British National Corpus (with data from the 1980s and 90s). We also turned up some potentially useful verb collocates for newsfeed, including scroll through and pop up, which won’t yet have made it into a collocations dictionary. One of my favourite features of Monco, especially for the corpus novice, is its user-friendly search screen and its nice graphics for results.

On the downside, Monco’s data is drawn entirely from online news sources, which means it really only reflects journalism rather than language usage in general. And although it includes sources from the UK, US, Canada and Australia, it isn’t balanced, so there’s significantly more data from some sources than others – a factor worth bearing in mind, as it can skew the results.
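If you ever find yourself with a batch of concordance lines saved as plain text, you can run the same kind of variant tally yourself. Here’s a quick illustrative sketch in Python – the sample sentences are invented, not real Monco data:

```python
import re

# A handful of invented example lines standing in for corpus concordance data.
lines = [
    "Exercise is good for your wellbeing.",
    "The report focuses on employee well-being.",
    "Wellbeing programmes are increasingly popular.",
    "Her sense of wellbeing improved.",
    "Schools now teach lessons on well-being.",
]

# Count each spelling variant, case-insensitively.
# Note: "wellbeing" (no hyphen) never matches inside "well-being",
# so the two tallies stay separate.
counts = {"wellbeing": 0, "well-being": 0}
for line in lines:
    for variant in counts:
        counts[variant] += len(re.findall(re.escape(variant), line, flags=re.IGNORECASE))

total = sum(counts.values())
for variant, n in counts.items():
    print(f"{variant}: {n} ({100 * n / total:.0f}%)")
# prints:
# wellbeing: 3 (60%)
# well-being: 2 (40%)
```

Of course, the corpus interface does this for you in one search – the point is just that the underlying comparison is a simple frequency count of each variant.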

2 Brigham Young University: Not strictly a single corpus, but a collection of different corpora available via the same site and the go-to source for lots of queries. Personally, I tend to use COCA (the Corpus of Contemporary American English) for checking US usage. It’s a large corpus containing a nice variety of contemporary sources (1990 – present), including radio & TV transcripts, fiction, newspapers, magazines and academic data. Through BYU, you can also find a host of specialized corpora, including a corpus of Wikipedia entries and even, slightly weirdly, the Hansard corpus of British parliamentary proceedings, should that happen to fit your purpose!

My main grumble with BYU is that I find the interface clunky and frustrating to use, especially with its rather distracting colour-coding.

3 BAWE and BASE: The British Academic Written English corpus (BAWE) and the British Academic Spoken English corpus (BASE) are composed of written and spoken data collected from university students at a number of British universities. The written corpus contains essays and other coursework which received a good pass mark, and the spoken data includes lectures and seminars. I particularly like these corpora because they’re an example of language as it might be used by the peers of the students we’re aiming at, rather than text produced by professional writers, journalists, academics, etc., which doesn’t necessarily provide an appropriate model for the average ELT student. This is obviously university-level language, so it’s especially relevant for EAP, but I think BAWE could be useful for any advanced students who need to write formal essays (IELTS, CAE, Proficiency). And if you’re looking for US academic equivalents, you could also check out MICUSP and MICASE.

BAWE and BASE are actually available via several sources, but I wanted an excuse to get you to experience Sketch Engine – for me, the gold standard when it comes to corpus tools, and the interface used by all the major dictionary publishers for their large corpora.

4 Spoken BNC2014: I admit this is the corpus on my list that I’ve probably used least so far, but I’m including it because it’s one I’m quite excited about finding uses for. Slightly contrary to its name, it was only released in 2017 and is the result of a massive project to collect data about current spoken English used in everyday contexts. If you’re working on speaking materials, looking at evidence from written English won’t tell you anything terribly useful, because we just don’t speak the way we write. So I think this could become the go-to corpus for anyone who wants to know how people actually say things.

Unfortunately, the Spoken BNC2014 doesn’t have the most user-friendly interface and getting access involves a bit of a faffy sign-up process which could be off-putting for the casual user. If spoken language is your thing though, I think it’s worth investing the time and effort to check it out, not least because some of the content is just really funny!

A note about corpora and copyright: It’s important to remember that, in general, the data that appears in a corpus is subject to all the usual copyright restrictions. That means you can’t just pull a big chunk of language from the corpus and use it in your activity, especially not if it’s for commercial publication. Occasionally, of course, you come across very short, ‘vanilla’ examples which could have come from almost anywhere (A young woman opened the door. The traffic was particularly bad.), but to be honest, these are few and far between. Generally, when I search for a particular language item, I’ll scan through the examples and jot down a ‘frame’:
I/you scroll through my/your (Facebook) newsfeed to see/searching for/on the train …
Then I’ll use my notes as the basis for an example that keeps the feel and pattern of the ones I’ve looked at, but fits my teaching purpose … and doesn’t infringe copyright.
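If you’re working from concordance lines in a text file, you can even rough out a frame semi-automatically by grabbing the words either side of your keyword. A minimal sketch, with invented example lines (the keyword newsfeed and the three-word window are just illustrative choices):

```python
import re

# Invented, concordance-style lines standing in for real corpus examples.
lines = [
    "I scroll through my Facebook newsfeed on the train.",
    "You can scroll through your newsfeed to see what friends are up to.",
    "An ad popped up in her newsfeed this morning.",
]

# Capture up to three words either side of the keyword to form a rough frame.
pattern = re.compile(r"((?:\S+\s+){0,3})(newsfeed)((?:\s+\S+){0,3})", re.IGNORECASE)

frames = []
for line in lines:
    m = pattern.search(line)
    if m:
        frames.append(f"... {m.group(1).strip()} [{m.group(2)}] {m.group(3).strip()} ...")

for frame in frames:
    print(frame)
```

That gives you the raw material – scroll through my/your … newsfeed, pop up in her newsfeed – from which to write your own original example, which is really just a quicker version of jotting the frame down by hand.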

There are lots of different corpora out there and corpus fans will have their personal favourites. If you’re new to corpora though, I’d say pick one or two to check out, play around with a few simple searches, use the help to get you started, and see what’s most useful for you. Be warned though, it can be addictive!


