Lexicoblog

The occasional ramblings of a freelance lexicographer

Monday, February 05, 2018

Four Favourite Corpora



Recently, I gave a 10-minute talk at the ELT Freelancers’ Awayday in Oxford about “Simple corpus hacks for ELT editors”. I only had time to look at one corpus and a handful of searches, but I promised to share some of my other favourites in a blog post. So here goes…

1 Monco: In my talk, I looked at the Monco corpus. I chose it because it’s a monitor corpus, so it monitors current usage, updating with new data daily. As such, I find it useful for answering language questions that haven’t yet made it into conventional reference sources like dictionaries. For example, in my talk, we looked at how wellbeing (spelled as a single word) may be catching up with the more traditional hyphenated form (well-being) that you’ll find in most dictionaries (simply by typing well-being|wellbeing into the search box). The split was 35% – 65% in Monco compared with 17% – 83% in the British National Corpus (with data from the 1980s and 90s). We also turned up some potentially useful verb collocates for newsfeed, including scroll through and pop up, which won’t yet have made it into a collocations dictionary. One of my favourite features of Monco, especially for the corpus novice, is its user-friendly search screen and its nice graphics for results.
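As a rough illustration of how a split like the well-being/wellbeing one can be worked out, here is a minimal Python sketch. It is not how Monco itself computes anything: the function name variant_split and the sample sentences are invented for the example, and a real check would run over corpus hits rather than a made-up string.

```python
import re
from collections import Counter


def variant_split(text, variants):
    """Count each spelling variant as a whole word and return its share in %."""
    # Build one alternation, much like typing well-being|wellbeing in a search box.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    # Lower-case the hits so "Well-being" and "well-being" are counted together.
    counts = Counter(m.group(1).lower() for m in pattern.finditer(text))
    total = sum(counts.values())
    # Guard against division by zero when no variant occurs at all.
    return {
        v: (round(100 * counts[v.lower()] / total) if total else 0)
        for v in variants
    }


# Invented sample text, NOT real corpus data.
sample = (
    "Staff wellbeing matters. Well-being is hard to measure. "
    "Our well-being survey is out. Student well-being improved."
)
print(variant_split(sample, ["well-being", "wellbeing"]))
# {'well-being': 75, 'wellbeing': 25}
```

The shares are only as meaningful as the text they are counted from, which is why the make-up of the corpus matters so much.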


On the downside, Monco’s data is drawn entirely from online news sources, which means that it’s really only reflective of journalism, rather than language usage in general. And although it includes sources from the UK, US, Canada and Australia, it isn’t balanced, so there’s significantly more data from some sources than others. That can skew the results, so it’s a factor to bear in mind.

2 Brigham Young University: Not strictly a single corpus, but a collection of different corpora available via the same site and the go-to source for lots of queries. Personally, I tend to use COCA (the Corpus of Contemporary American English) for checking US usage. It’s a large corpus containing a nice variety of contemporary sources (1990 – present), including radio & TV transcripts, fiction, newspapers, magazines and academic data. Through BYU, you can also find a host of specialized corpora including a corpus of Wikipedia entries and even, slightly weirdly, the Hansard corpus of British parliamentary proceedings, should that happen to fit your purpose!

My main grumble with BYU is that I find the interface clunky and frustrating to use, especially with its rather distracting colour-coding.


3 BAWE and BASE: The British Academic Written English corpus (BAWE) and the British Academic Spoken English corpus (BASE) are composed of written and spoken data collected from university students at a number of British universities. The written corpus contains essays and other coursework which received a good pass mark and the spoken data includes lectures and seminars. I particularly like these corpora because they’re an example of language as it might be used by the peers of the students we’re aiming at, rather than text produced by professional writers, journalists, academics, etc., which doesn’t necessarily provide an appropriate model for the average ELT student. This is obviously university-level language, so is especially relevant for EAP, but I think BAWE could be useful for any advanced students who need to write formal essays (IELTS, CAE, Proficiency). And if you’re looking for US academic equivalents, you could also check out MICUSP and MICASE.


BAWE and BASE are actually available via several sources, but I wanted the excuse to get you to experience Sketch Engine, which for me is the gold standard when it comes to corpus tools and is the interface used by all the major dictionary publishers for their large corpora.

4 Spoken BNC2014: I admit this is the corpus on my list that I’ve probably used least so far, but I’m including it because it’s one I’m quite excited about finding uses for. Slightly contrary to its name, it was only released in 2017 and is the result of a massive project to collect data about current spoken English used in everyday contexts. If you’re working on speaking materials, looking at evidence from written English is not going to tell you anything terribly useful, because we just don’t speak how we write. So I think this could become the go-to corpus for anyone who wants to know how people actually say things.

Unfortunately, the Spoken BNC2014 doesn’t have the most user-friendly interface and getting access involves a bit of a faffy sign-up process which could be off-putting for the casual user. If spoken language is your thing though, I think it’s worth investing the time and effort to check it out, not least because some of the content is just really funny!


A note about corpora and copyright: It’s important to remember that, in general, the data that appears in a corpus is liable to all the usual copyright restrictions. That means you can’t just pull a big chunk of language from the corpus and use it in your activity, especially not if it’s for commercial publication. Occasionally, of course, you come across very short, ‘vanilla’ examples which could have come from almost anywhere (A young woman opened the door. The traffic was particularly bad.), but to be honest, these are few and far between. Generally, when I search for a particular language item, I’ll scan through the examples and jot down a ‘frame’:
I/you scroll through my/your (Facebook) newsfeed to see/searching for/on the train …
Then I’ll use my notes as the basis for an example that keeps the feel and pattern of the ones I’ve looked at, but fits my teaching purpose … and doesn’t infringe copyright.

There are lots of different corpora out there and corpus fans will have their personal favourites. If you’re new to corpora though, I’d say pick one or two to check out, play around with a few simple searches, use the help to get you started, and see what’s most useful for you. Be warned though, it can be addictive!


Thursday, January 04, 2018

Using a corpus to fish for inspiration



When I think about using corpus tools to help in writing ELT materials, I tend to think of checking details. So I’ll often use a corpus to check the most common form of a word or phrase, or a typical collocation or colligation pattern. An example that cropped up on Facebook yesterday was whether we say “in winter” or “in the winter” (the answer, by the way, seems to be we use both, sometimes interchangeably and sometimes in different contexts). Today though, I’ve been using a corpus in a slightly different way for much more general inspiration.

I’m currently working on some grammar practice materials and one of the grammar points I need to cover is “compound future tenses” (will have done, will be doing, will have been doing). They’re supplementary materials and part of the brief is to choose different topics and contexts from those used in the student’s book. Of course, the SB author has already nabbed perhaps the most obvious context: predictions about life in the future (By 2050, we’ll all be travelling in driverless cars, etc.). I was casting about for an alternative angle and drawing a blank, so I turned to a corpus*.

Corpora aren’t always ideal when it comes to grammar because it’s difficult to be specific in your searches. Yes, you can use grammar tags to search for particular word forms, but many common forms have so many different uses that what comes up is often too broad to be useful in an ELT context (imagine how many different uses you’d find if you searched for all present continuous verb forms, for example). When you can narrow things down to more specific words or combinations though, you can uncover some more useful results. So here, I ran a quick series of searches:

will have + past participle
will be + present participle
will have been + present participle
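
For anyone curious what searches like these amount to under the bonnet, here is a minimal Python sketch. It assumes text that has already been tagged for part of speech in a word_TAG format with Penn Treebank-style tags (VBN for past participles, VBG for -ing forms). The tagged sample is hand-made for illustration, and the names TAGGED, PATTERNS and find_forms are invented here. It is not Monco’s query syntax.

```python
import re

# Hand-tagged toy sentences in word_TAG format (an assumption for illustration,
# not corpus output). MD = modal, VB = base form, VBN = past participle,
# VBG = -ing form.
TAGGED = (
    "the_DT showers_NNS will_MD have_VB passed_VBN ._. "
    "the_DT winds_NNS will_MD be_VB dying_VBG down_RP ._. "
    "the_DT staff_NN will_MD have_VB been_VBN preparing_VBG the_DT players_NNS ._."
)

PATTERNS = {
    # The negative lookahead stops "will have been preparing" being counted
    # here as well, since "been" is itself tagged as a past participle.
    "will have + past participle": r"will_MD have_VB (\w+)_VBN(?! \w+_VBG)",
    "will be + present participle": r"will_MD be_VB (\w+)_VBG",
    "will have been + present participle": r"will_MD have_VB been_VBN (\w+)_VBG",
}


def find_forms(tagged_text):
    """Return, for each pattern, the list of main verbs it matches."""
    return {name: re.findall(rx, tagged_text) for name, rx in PATTERNS.items()}


for name, verbs in find_forms(TAGGED).items():
    print(name, "->", verbs)
```

Tying each search to the fixed words will, have and be is what keeps the results narrow enough to be useful, in contrast to searching on a grammar tag alone.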

Up came a whole load of contexts which I’d probably never have thought of off the top of my head. And interestingly, a lot of them actually referred to the near future rather than distant futuristic predictions. A couple of the recurrent themes I spotted were:


Weather forecasts:
By 6 o’clock, the showers will have passed.
By Wednesday morning, the winds will be dying down.
The storm will have reached the coast of Cuba by early next week.

Sports reporting:
The team will have played nine games in four weeks.
She’ll be competing in three events at the upcoming Winter Olympics.
The coaching staff will have been preparing the players all winter.

I may not end up using the exact examples turned up by the corpus, but they’ve provided some much-needed inspiration and sent me off down some potentially useful paths.


*When you’re just fishing for inspiration, I don’t think it matters quite so much which corpus you use. For these searches, I used the Monco corpus, just because it’s what I’d been using recently and, as a continuously updated news-based corpus, it throws up a range of current topics.
