Corpus: gospel or guide?
A response to a webinar:
Rather unexpectedly, I got particularly caught up with the reactions of the participants as they appeared in the little text box on the side of the screen. Unfortunately, Michael ran out of time, so didn’t get to address any of the comments or questions that popped up. I, however, was itching to respond to them! So I thought I’d tackle some of the points here which I’ve been mulling over since. And in fact, I’ve got so much to say, I’m going to split this into two posts.
The part of the webinar that interested me most in terms of participant feedback was when Michael was talking about how we use frequency information from corpora to highlight the most common and so "useful" words in a dictionary. Below are some of the comments and questions and my reactions:
“Are there standard lists with the top 250 words?”
“But where can we find the list of words (to know if they are frequent or not)”
I always find interest in wordlists from teachers and students a little bit worrying. It seems to suggest that language learners are rather like computers and if we can just input the right list of words, then they’ll output English at a given level! Whilst I think frequency lists can have a role to play in helping prioritise what to focus on, my feeling is that generally checking frequency should be something that comes after you encounter new vocabulary. You look up a new word you’ve come across in the dictionary and you might use the information about frequency to decide whether it’s worth putting in your vocabulary notebook or whether it’s a word that you can naturally drop into conversation or not. Language is a wonderfully messy, organic, personal sort of a thing and what vocabulary you choose to teach or learn should be governed by all sorts of different factors - interests, needs, context, personality - not some (inevitably very dull) standard list of frequent words.
“Are the top 3000 words in Oxford the same top 3000 words in Macmillan?”
I haven’t researched the answer to this one, but I think I can fairly confidently say “more or less” if we’re just talking about frequency (more on that below). Each of the major dictionary publishers uses a different corpus – or rather a different collection of corpora, some of which overlap (like the BNC). In the early days, with relatively small corpora, you would have expected some variation, with different corpora slightly skewed towards particular types of language. Nowadays though, with all the big publishers using really huge and diverse collections of corpora, I think you’d probably expect a straightforward frequency list (at least at the most frequent end) to come out more or less the same, with only minor variations.
Having said that, each dictionary publisher has its own criteria for how it shows frequency information – where it sets its limits and how it puts words into frequency bands. The Oxford 3000™, for example, isn’t just based on frequency, but was put together using three criteria: frequency, range and familiarity (if you're interested, you can read more about it here). Does this variation matter? Personally, I don’t think so. How many students, or even teachers, ever read the blurb in the front (or back) of a dictionary that explains the frequency information? My feeling is that most students either don’t even notice it, or if they do, they just get a general sense that a word is highlighted or has stars next to it, and therefore must be useful to learn. Of course, there’ll be times when some teachers (esp. vocab nerds like myself!) will make a point in class about frequent and more marked synonyms by pointing to the frequency information (and often register labels) in the dictionary. But my feeling is that’s the exception, not the rule. And that’s fine. It’s still worth the information being there as one more tool in the language learning toolbox.
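For anyone who'd like to test the "more or less the same" hunch on their own texts, here is a minimal sketch in Python of how two top-N frequency lists could be compared. The two mini "corpora" are made up purely for illustration, the function names `top_words` and `overlap` are my own, and the tokenisation is deliberately crude. It is not how any dictionary publisher actually builds its lists, and it looks at raw frequency only, not range or familiarity.

```python
from collections import Counter
import re


def top_words(text, n):
    """Return the set of the n most frequent word forms in a text."""
    # Crude tokenisation: lowercase, keep runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return {word for word, _ in Counter(tokens).most_common(n)}


def overlap(list_a, list_b):
    """Proportion of words shared by two lists (shared / all distinct words)."""
    return len(list_a & list_b) / len(list_a | list_b)


# Two tiny invented "corpora", standing in for two publishers' collections.
corpus_a = "the cat sat on the mat and the dog sat on the rug"
corpus_b = "the dog ran to the park and the cat ran to the house"

top_a = top_words(corpus_a, 3)
top_b = top_words(corpus_b, 3)

print("Shared words:", sorted(top_a & top_b))
print("Overlap:", round(overlap(top_a, top_b), 2))
```

With texts this small the two lists barely agree, which is the point made above about early, smaller corpora. Fed with two large and varied collections, you would expect the overlap at the most frequent end to be much higher.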
Coming back to the title of this post, corpus information has been incredibly useful over the past couple of decades in understanding how language is actually used and in making teaching materials more natural, but it's still only a guide. Despite much of my work being in the area of corpus research, I'm still very wary about taking corpus data as gospel, partly for some of the reasons I'll talk about in my next post ...