I’ve been mulling over a post about text analysis tools for ages but kept putting it off because I felt I should thoroughly research all the different tools out there first. A recent post by Pete Clements has forced my hand though, so here are my thoughts on the tools I have seen and tried. I should also say that I’m just focusing on the vocab aspect of the tools, not any other analysis features they have, such as readability scores and the like.
So, what is a text analyser? Basically, it’s an online tool that allows you to cut and paste a text that you’d like to use with students into a box, you hit a button and it comes back with stats about the text. In particular, what most ELT materials writers are interested in is the level of the vocab. We’re usually looking for a breakdown by CEFR level to tell us whether the text is suitable for a particular class/level and which words might be “above level”.
READ THIS BIT FIRST
Before you use any kind of text analysis tool though, here are some basics to bear in mind:
WHICH WORD LIST?
It’s really important to understand how the tool you use is making those judgements about level. Most tools use some kind of word list that’s been developed to peg individual words to CEFR levels. It goes without saying that this in itself is fraught with problems – my blog post here looks at some of them. But if we’re accepting the basic premise of using a word list, then you need to know which one. If you can’t find out which list a tool is using, then I’d probably say, don’t use it because you can’t know what it’s showing you.
A number of tools use Cambridge’s English Vocabulary Profile (EVP) list. The key thing to understand about EVP is that it ranks words (largely) by productive level, so the level at which you might typically expect a student to be using a word themselves. Given the way we acquire vocab, that might be a level (or two) after students recognize and can understand the same word receptively, i.e. if they read it. The Oxford 3000, on the other hand, ranks vocab more by receptive level, so the point at which students will typically be able to read and understand a word.
Text analysers use clever algorithms to analyse the text you input, but these have a number of shortcomings it’s really important to be aware of:
PART OF SPEECH
Starting at the most basic level, the tools don’t always correctly identify the part of speech of a word, especially words that have the same forms across parts of speech. So, weather is most frequently a noun, but can also be a verb, national is mostly an adjective, but can be a noun (a foreign national). Most tools will opt for the most common form and label its level accordingly. I put in the sentence: Some people who contract this virus can feel very poorly for three to four weeks. And most of the tools identified contract here as a noun and labelled it as around B1, when in fact it’s a verb and EVP pegs it at C2.
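If it helps to see the problem in miniature, here’s a rough Python sketch of what a lookup that ignores part of speech is doing. The tiny word list and the function names are made up for illustration (only the two levels for contract come from the example above). I’m also assuming the tool simply falls back on the most common part of speech, which is what the results suggest rather than anything I know about how these tools are built.

```python
# Toy word list keyed by (word, part of speech). The two entries reflect the
# levels mentioned above for "contract"; the list itself is hypothetical and
# is not taken from any real tool.
WORD_LIST = {
    ("contract", "noun"): "B1",
    ("contract", "verb"): "C2",
}

# Assumption: the tool falls back on the most common part of speech.
MOST_COMMON_POS = {"contract": "noun"}


def naive_level(word):
    """Look up a word using its most common part of speech only."""
    pos = MOST_COMMON_POS.get(word)
    return WORD_LIST.get((word, pos))


def level_with_pos(word, pos):
    """Look up a word using the part of speech it has in this sentence."""
    return WORD_LIST.get((word, pos))


print(naive_level("contract"))             # B1
print(level_with_pos("contract", "verb"))  # C2
```

The second function only gives the right answer if something (a tagger, or you) supplies the right part of speech, which is exactly the step that tends to go wrong.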
Text Inspector: Out of all of the tools, the only one I’ve found to really deal with this issue is Text Inspector which allows you to click on any word in a text that looks like it’s been tagged incorrectly and choose the correct use (meaning and part of speech) from a drop-down menu. Of course, that means you have to spot the incorrectly tagged words, but it’s better than most.
Oxford Text Checker: If you hover over a word in your text with more than one possible part of speech, the Oxford Text Checker shows a box giving the CEFR level of each one, e.g. v = A2, n = B1, although it only shows the level for the most basic meaning of the word (see multi-sense words below).
Many tools also fail to label certain words, especially function words. So, in the sentence - Others end up in hospital needing oxygen. – many tools left others without a level because they just weren’t sure what to do with it grammatically. Contractions (they’ve, she’d, who’s) also tend to go unlabelled, but are rarely a big issue.
MULTI-SENSE WORDS
English is a highly polysemous language; lots of words have multiple meanings which students are likely to come across and recognize at different levels. Most published word lists take this into account and assign different level labels to different meanings. Most text analysers though just opt for the most frequent (and usually lowest level) sense. We’ve already seen that to an extent with contract above, but even without the part of speech issue, if you put in a sentence like - There are links in the table below. – table will be shown as A1 (the piece of furniture) rather than A2/B1 (a graphic).
Text Inspector: As we saw above, Text Inspector gets around this by offering drop-downs for any words you suspect may be used in a less obvious sense.
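One way to picture the multi-sense problem, and a possible way of working around it, is to flag every word whose senses sit at different levels so you know which ones to check by hand. The Python sketch below is only an illustration under my own assumptions: the sense labels and function names are made up, and I’ve used B1 as a placeholder for the graphic sense of table, which is given as A2/B1 above.

```python
# Toy sense inventory. The levels for "table" follow the example above
# (B1 stands in for "A2/B1"); the sense labels are made up for illustration.
SENSES = {
    "table": {"furniture": "A1", "graphic": "B1"},
    "weather": {"conditions": "A1"},
}

CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]


def lowest_level(word):
    """Mimic a tool that always reports the lowest-level sense."""
    levels = SENSES.get(word, {}).values()
    return min(levels, key=CEFR_ORDER.index, default=None)


def needs_checking(words):
    """Return the words whose senses span more than one level."""
    return [w for w in words if len(set(SENSES.get(w, {}).values())) > 1]


print(lowest_level("table"))                         # A1
print(needs_checking(["links", "table", "weather"]))  # ['table']
```

A list like the one from needs_checking doesn’t tell you which sense is in play; it just narrows down where to look.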
MULTI-WORD ITEMS
For me, the biggest issue to look out for with text analysers is that they mostly treat words individually and ignore the fact that a large proportion of most texts (30-50% by some estimates) is made up of chunks: phrasal verbs (end up, carry on), mundane phrases (of course, as usual, a lot of) and idioms (under the weather, have no idea). And of course, a phrase is often going to have a very different level from the sum of its parts.
Unlike with the more glaring mis-tags of part of speech and meaning, I think multi-word items are far more difficult to spot because as expert speakers, we tend to read through them without noticing, but for students, an unknown phrasal verb can be a real stumbling block. It takes a keen eye to spot every phrase and phrasal verb in a text when it hasn’t been tagged.
Text Inspector: Again, the only tool to rate a bit better here is Text Inspector. It does at least manage to identify some phrases. In the sample text that I’ve been using for this post, it correctly picked out the phrasal verbs end up and carry on and the phrase have no idea.
It didn’t recognize pass it on (presumably because of the object in-between), but you can click on pass and choose the phrasal verb sense from the drop-down. Similarly, it didn’t pick up under the weather, but again, you can click on weather and select the idiom and it changes weather from A1 to C2. It doesn’t allow you to neatly link up the whole phrase (I don’t think), but it’s a reasonable compromise.
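To show why word-by-word lookup misses phrases, here’s a minimal Python sketch that compares it with a lookup that tries the longest known phrase first. This is my own illustration, not how Text Inspector or any other tool actually works. The lists and function names are hypothetical, and only the levels for weather and under the weather come from the example above. Words that aren’t in the toy lists come back unlabelled.

```python
# Toy lists. Levels for "weather" and "under the weather" follow the
# example above; the lists themselves are hypothetical, and words that are
# not in them simply come back unlabelled (None).
WORD_LEVELS = {"weather": "A1"}
PHRASE_LEVELS = {("under", "the", "weather"): "C2"}
MAX_PHRASE_LEN = max(len(p) for p in PHRASE_LEVELS)


def label_word_by_word(tokens):
    """Label each token on its own, ignoring any phrases."""
    return [(t, WORD_LEVELS.get(t)) for t in tokens]


def label_with_phrases(tokens):
    """Try the longest known phrase at each position before single words."""
    labelled = []
    i = 0
    while i < len(tokens):
        for n in range(MAX_PHRASE_LEN, 1, -1):
            chunk = tuple(tokens[i:i + n])
            if chunk in PHRASE_LEVELS:
                labelled.append((" ".join(chunk), PHRASE_LEVELS[chunk]))
                i += len(chunk)
                break
        else:
            # No phrase starts here, so fall back to the single word.
            labelled.append((tokens[i], WORD_LEVELS.get(tokens[i])))
            i += 1
    return labelled


tokens = "she felt under the weather".split()
print(label_word_by_word(tokens))
print(label_with_phrases(tokens))
```

Note that a simple matcher like this only catches phrases whose words sit next to each other, so it would still miss something like pass it on, where the object comes in-between.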
You’ll probably have gathered by this point that Text Inspector is very clearly out in front when it comes to analysing vocab from an ELT perspective. I subscribe to the paid version which gives you full functionality. You’ll find a link to a free version in Pete’s post which does much the same, but I’m not going to reshare it because, well, I think we should be paying for the good stuff and it’s a very small amount to invest for a really useful resource.
Here’s a brief overview of some of what’s out there though:
Text Inspector
Free version has limited functionality and doesn’t give CEFR analysis. Sign up for the paid version to get everything via: https://textinspector.com/
Word lists: The paid version allows you to analyse the vocab in a text in terms of EVP, AWL (the Academic Word List), BNC and COCA. These last two are corpora and it shows you the frequency of words as they appear in each corpus, which is useful if you’re into corpora.
Comments: By far the best I’ve seen in terms of at least trying to take into account the factors above.
Oxford Text Checker
Free via: https://www.oxfordlearnersdictionaries.com/text-checker/ (If you get to the main dictionary home page, click on Resources to find it)
Word lists: Based on the Oxford 3000 & 5000.
Comments: Easy to use and colour codes words by CEFR level. However, it always opts for the most common form/meaning of a word and doesn’t recognize phrases. If you hover over a word, it does at least show different CEFR options for different parts of speech, e.g. hovering over feel, you get a box showing v=A1 n=B2. You can also double-click on any of the words in your text to go direct to the dictionary entry which is useful for quickly checking the CEFR label against different meanings. It also has options to create word lists and activities from texts, but given the shortcomings, I wouldn’t be inclined to use them without heavy editing.
VocabKitchen
Free via: https://www.vocabkitchen.com/profile
Word lists: It shows words on the AWL and NAWL (New Academic Word List). It also claims to show words by CEFR level, but I can’t find out what word list it’s using which for me is a bit of a red flag.
Comments: It’s intuitive and easy to use, but again doesn’t account for different meanings or phrases. I believe it has more options if you register and sign in which I haven’t tried out.
Papyrus (EDIA)
Free but you need to register via: https://papyrus.edia.nl/
Word lists: This site is based on a mix of experts’/teachers’ assessments of the level of texts and AI.
Comments: Quite a nice interface, but it seems to skip quite a few words in your input text, and not just function words: with my sample text, it completely ignored the phrasal verb end up. And as above, it doesn’t deal with different meanings or phrases.
Lextutor VocabProfile
Free via: https://www.lextutor.ca/vp/
Word lists: Originally designed for corpus geeks, the main focus for this tool is around corpus frequencies and the AWL. It does now include a CEFR option, but reading through the blurb, the CEFR levels seem to be based on some very old (1990) word lists published by Cambridge way back before this became a properly researched area, so I’m not sure how useful they are.
Comments: A horrible user interface, still really for geeks only. It's so messy, I couldn't even get a meaningful screenshot.
Pearson/GSE Text Analyzer
Word lists: based on Pearson’s own Global Scale of English (GSE) lists
Comments: I hesitated to even include this as it’s just plain weird – unless I’ve missed something. It calculates an overall level for your text but doesn’t show the level of individual words. It does highlight words that it judges to be ‘above level’, but the choices seem to be a bit random. It pegged my sample text at B1+, then picked out poorly and passing as above level, ignoring asymptomatic.