Lexicoblog

The occasional ramblings of a freelance lexicographer

Tuesday, May 04, 2021

Text checkers: an overview

I’ve been mulling over a post about text analysis tools for ages, but kept putting it off because I felt I should thoroughly research all the different tools out there first. A recent post by Pete Clements has forced my hand though, so here are my thoughts on the tools I’ve seen and tried. I should also say that I’m focusing just on the vocab aspect of the tools, not on any other analysis features they offer, such as readability scores and the like.

So, what is a text analyser? Basically, it’s an online tool that allows you to copy and paste a text that you’d like to use with students into a box; you hit a button and it comes back with stats about the text. In particular, what most ELT materials writers are interested in is the level of the vocab. We’re usually looking for a breakdown by CEFR level to tell us whether the text is suitable for a particular class/level and which words might be “above level”.
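
To make the mechanics concrete, here’s a minimal sketch of what most of these tools are doing under the hood. It’s a toy Python example with a made-up three-word list – real tools use full, professionally compiled lists and much smarter matching – but it shows the basic look-up-and-count approach:

# A toy CEFR text profiler. The word list here is hypothetical and tiny;
# real tools use full tagged lists (EVP, Oxford 3000, etc.).
WORD_LIST = {
    "people": "A1",
    "virus": "B1",
    # A flat look-up can only store one level per headword, so the common
    # noun sense of 'contract' wins; the C2 verb sense is invisible.
    "contract": "B1",
}

def profile(text):
    """Map each word in the text to a CEFR level, or 'unlisted'."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return {w: WORD_LIST.get(w, "unlisted") for w in words if w}

print(profile("Some people who contract this virus can feel very poorly."))
# {'some': 'unlisted', 'people': 'A1', ..., 'contract': 'B1', ...}

That one-level-per-headword limitation is exactly where most of the problems discussed below come from.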

READ THIS BIT FIRST
Before you use any kind of text analysis tool though, here are some basics to bear in mind:

WHICH WORD LIST?

It’s really important to understand how the tool you use is making those judgements about level. Most tools use some kind of word list that’s been developed to peg individual words to CEFR levels. It goes without saying that this in itself is fraught with problems – my blog post here looks at some of them. But if we’re accepting the basic premise of using a word list, then you need to know which one. If you can’t find out which list a tool is using, then I’d probably say, don’t use it because you can’t know what it’s showing you.

A number of tools use Cambridge’s English Vocabulary Profile (EVP) list. The key thing to understand about EVP is that it ranks words (largely) by productive level – the level at which you might typically expect a student to be using a word themselves. Given the way we acquire vocab, that might be a level (or two) beyond the point at which students recognize and can understand the same word receptively, i.e. if they read it. The Oxford 3000, on the other hand, ranks vocab more by receptive level, so the point at which students will typically be able to read and understand a word.

Text analysers use clever algorithms to analyse the text you input, but these have a number of shortcomings it’s really important to be aware of:

PART OF SPEECH

Starting at the most basic level, the tools don’t always correctly identify the part of speech of a word, especially for words that have the same form across parts of speech. So, weather is most frequently a noun but can also be a verb; national is mostly an adjective but can be a noun (a foreign national). Most tools will opt for the most common form and label its level accordingly. I put in the sentence: Some people who contract this virus can feel very poorly for three to four weeks. Most of the tools identified contract here as a noun and labelled it as around B1, when in fact it’s a verb and EVP pegs it at C2.
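
For the curious, here’s a sketch of how a POS-aware look-up might work, keying the word list on lemma plus part of speech. It uses spaCy for the tagging (this assumes you’ve installed spaCy and its small English model); the noun level and the weather verb level are illustrative, though EVP’s C2 for the verb contract is as reported above:

# Sketch: keying the word list on (lemma, part of speech).
# Assumed setup: pip install spacy && python -m spacy download en_core_web_sm
import spacy

WORD_LIST = {
    ("contract", "NOUN"): "B1",  # the level most tools reported above
    ("contract", "VERB"): "C2",  # EVP's level for the verb, per the post
    ("weather", "NOUN"): "A1",
    ("weather", "VERB"): "C2",   # hypothetical level for the verb sense
}

nlp = spacy.load("en_core_web_sm")

def profile(text):
    return [(t.text, t.pos_, WORD_LIST.get((t.lemma_.lower(), t.pos_), "unlisted"))
            for t in nlp(text) if t.is_alpha]

print(profile("Some people who contract this virus can feel very poorly."))
# If the tagger identifies 'contract' as a VERB, it now maps to C2 rather
# than B1 - though taggers mis-tag too, which is why Text Inspector's
# manual override (below) is so useful.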

Text Inspector: Out of all the tools, the only one I’ve found that really deals with this issue is Text Inspector, which allows you to click on any word in a text that looks like it’s been tagged incorrectly and choose the correct use (meaning and part of speech) from a drop-down menu. Of course, that means you have to spot the incorrectly tagged words yourself, but it’s better than most.



Oxford Text Checker: If you hover over a word in your text with more than one possible part of speech, the Oxford Text Checker shows a box giving the CEFR level of each one, e.g. v = A2, n = B1 – although it only shows the level for the most basic meaning of the word (see multi-sense words below).

Many tools also fail to label certain words, especially function words. So, in the sentence Others end up in hospital needing oxygen, many tools left others without a level because they just weren’t sure what to do with it grammatically. Contractions (they’ve, she’d, who’s) also tend to go unlabelled, but these are rarely a big issue.

MULTI-SENSE WORDS

English is a highly polysemous language; lots of words have multiple meanings which students are likely to come across and recognize at different levels. Most published word lists take this into account and assign different level labels to different meanings. Most text analysers, though, just opt for the most frequent (and usually lowest-level) sense. We’ve already seen that to an extent with contract above, but even without the part-of-speech issue, if you put in a sentence like There are links in the table below, then table will be shown as A1 (the piece of furniture) rather than A2/B1 (a graphic).
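
In data terms, the fix is a word list that stores several (sense, level) pairs per headword – and the failure mode is a tool that always takes the first one. A quick illustrative sketch (levels as in the table example above):

# Sketch of the multi-sense problem: one headword, several levels.
SENSES = {
    # senses ordered by frequency, which is how a naive analyser sees them
    "table": [("piece of furniture", "A1"), ("graphic/grid of data", "B1")],
}

def naive_level(word):
    senses = SENSES.get(word.lower())
    # a naive tool always takes the first (most frequent, lowest) sense
    return senses[0][1] if senses else "unlisted"

print(naive_level("table"))  # A1, even in 'There are links in the table below.'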

Text Inspector: As we saw above, Text Inspector gets around this by offering drop-downs for any words you suspect may be used in a less obvious sense.

MULTI-WORD EXPRESSIONS

For me, the biggest issue to look out for with text analysers is that they mostly treat words individually and ignore the fact that a large proportion of most texts (30–50% by some estimates) is made up of chunks: phrasal verbs (end up, carry on), mundane phrases (of course, as usual, a lot of) and idioms (under the weather, have no idea). And of course, a phrase is often going to have a very different level from the sum of its parts.
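
One common way to handle this programmatically is a greedy longest-match scan: at each word, try the longest listed chunk first and only fall back to a single-word look-up if nothing matches. A toy sketch (the levels for end up and have no idea are illustrative; under the weather at C2 follows the Text Inspector example below):

# Sketch: greedy longest-match look-up for multi-word expressions.
# Toy lists only; real MWE detection also has to cope with split
# phrasal verbs like 'pass it on', which this simple scan would miss.
PHRASES = {
    ("end", "up"): "B1",                # illustrative level
    ("have", "no", "idea"): "B1",       # illustrative level
    ("under", "the", "weather"): "C2",  # per the Text Inspector example below
}
WORDS = {"i": "A1", "was": "A1", "feeling": "A1", "under": "A1",
         "the": "A1", "weather": "A1"}

def analyse(tokens):
    out, i = [], 0
    while i < len(tokens):
        for n in (3, 2):  # try longer chunks first
            chunk = tuple(tokens[i:i + n])
            if chunk in PHRASES:
                out.append((" ".join(chunk), PHRASES[chunk]))
                i += n
                break
        else:  # no phrase matched: fall back to a single-word look-up
            out.append((tokens[i], WORDS.get(tokens[i], "unlisted")))
            i += 1
    return out

print(analyse("i was feeling under the weather".split()))
# ends with ('under the weather', 'C2') rather than three separate A1 words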

Unlike with the more glaring mis-tags of part of speech and meaning, I think multi-word items are far more difficult to spot because as expert speakers, we tend to read through them without noticing, but for students, an unknown phrasal verb can be a real stumbling block. It takes a keen eye to spot every phrase and phrasal verb in a text when it hasn’t been tagged.

Text Inspector: Again, the only tool that fares a bit better here is Text Inspector. It does at least manage to identify some phrases. In the sample text that I’ve been using for this post, it correctly picked out the phrasal verbs end up and carry on and the phrase have no idea.


It didn’t recognize pass it on (presumably because of the object in between), but you can click on pass and choose the phrasal verb sense from the drop-down. Similarly, it didn’t pick up under the weather, but again, you can click on weather and select the idiom, and it changes weather from A1 to C2. It doesn’t allow you to neatly link up the whole phrase (I don’t think), but it’s a reasonable compromise.


WHICH TOOL?

You’ll probably have gathered by this point that Text Inspector is very clearly out in front when it comes to analysing vocab from an ELT perspective. I subscribe to the paid version, which gives you full functionality. You’ll find a link to a free version in Pete’s post which does much the same, but I’m not going to reshare it because, well, I think we should be paying for the good stuff and it’s a very small amount to invest for a really useful resource.

Here’s a brief overview of some of what’s out there though:

Text Inspector

Free version has limited functionality and doesn’t give CEFR analysis. Sign up for the paid version to get everything via: https://textinspector.com/

Word lists: The paid version allows you to analyse the vocab in a text in terms of EVP, the AWL (Academic Word List), and the BNC and COCA – these last two are corpora, and it shows you the frequency of words as they appear in each one – useful if you’re into corpora.

Comments: By far the best I’ve seen in terms of at least trying to take into account the factors above.


Oxford Text Checker

Free via: https://www.oxfordlearnersdictionaries.com/text-checker/ (If you get to the main dictionary home page, click on Resources to find it)

Word lists: Based on the Oxford 3000 & 5000.

Comments: Easy to use and colour-codes words by CEFR level. However, it always opts for the most common form/meaning of a word and doesn’t recognize phrases. If you hover over a word, it does at least show different CEFR options for different parts of speech, e.g. hovering over feel, you get a box showing v = A1, n = B2. You can also double-click on any of the words in your text to go directly to the dictionary entry, which is useful for quickly checking the CEFR label against different meanings. It also has options to create word lists and activities from texts, but given the shortcomings, I wouldn’t be inclined to use them without heavy editing.


VocabKitchen

Free via: https://www.vocabkitchen.com/profile

Word lists: It shows words on the AWL and the NAWL (New Academic Word List). It also claims to show words by CEFR level, but I can’t find out which word list it’s using, which for me is a bit of a red flag.

Comments: It’s intuitive and easy to use, but again doesn’t account for different meanings or phrases. I believe it has more options if you register and sign in, which I haven’t tried out.


EDIA Papyrus

Free but you need to register via: https://papyrus.edia.nl/

Word lists: This site is based on a mix of AI and experts’/teachers’ assessments of the level of texts.

Comments: Quite a nice interface, but it seems to skip quite a few words in your input text – not just function words; in my sample text, it completely ignored the phrasal verb end up. And, as above, it doesn’t deal with different meanings or phrases.


LexTutor

Free via: https://www.lextutor.ca/vp/

Word lists: Originally designed for corpus geeks, the main focus of this tool is corpus frequencies and the AWL. It does now include a CEFR option but, reading through the blurb, the CEFR levels seem to be based on some very old (1990) word lists published by Cambridge, way back before this became a properly researched area, so I’m not sure how useful they are.

Comments: A horrible user interface, still really for geeks only. It's so messy, I couldn't even get a meaningful screenshot.

Pearson/GSE Text Analyzer

Free via: https://www.english.com/gse/teacher-toolkit/user/textanalyzer

Word lists: Based on Pearson’s own Global Scale of English (GSE) lists.

Comments: I hesitated to even include this one as it’s just plain weird – unless I’ve missed something. It calculates an overall level for your text but doesn’t show the level of individual words. It does highlight words that it judges to be ‘above level’, but the choices seem a bit random. It pegged my sample text at B1+, then picked out poorly and passing as above level while ignoring asymptomatic.



Friday, February 12, 2021

Writing rhythms

On most ELT writing projects, the work (and your life for the duration of the project!) gets divided up into units. For a students' book, that might be 10-15 quite large units, but for many of the sort of self-study, language practice type materials I work on, there can be anywhere between 20 and 50 short units which may only be 2-4 pages each.

At the start of a new project, you spend a bit of time getting to grips with the brief and playing around with the first unit or two to establish how they're going to work. Often the format's already quite fixed in the brief; sometimes you have a bit of leeway to play with. Then once everyone's happy, you get your head down and start ploughing through unit by unit.

What interests me is how different people go about tackling each unit. Do they sketch out the whole thing then go back and fill in the details? Do they do it on paper or straight into a Word doc? Do they start from the beginning and work through each activity in turn? Or do they start with a core component, such as a reading text, then work outwards from it? A lot, of course, depends on the type and scope of the material, but even within that there's quite a bit of room for variation.

For the past couple of months, I've been working on some self-study vocab practice materials. There are 50 units altogether (across two linked projects), which is kind of daunting, but also quite nice as it means I've settled into a rhythm of roughly a unit a day. For each unit, I already have a (more or less) predetermined set of vocab items to practise across a number of activities. It's heavily corpus-informed, so I'm researching the vocab items to pick out features to highlight (typical usage and context, collocations, typical colligational patterns, etc.) and also using and adapting corpus examples in the activities. For the first few units, my approach was to research each vocab item in depth up front and then draft the exercises one by one.

The major downside of this was that I found myself running the same corpus searches numerous times. So, I'd explore vocab item A extensively in the initial research stage, then I'd find myself searching for it again several times to source examples for each exercise. I revised my approach after a few units so that I still did my research stage as before, but then sketched out a rough plan of the different exercises, e.g. exercise 1 focus on noun collocations, exercise 2 focus on following prepositions, etc. Then I ran a corpus search for each vocab item and added examples to several of the exercises at the same time. This seemed more efficient and I settled into it as a way of working for the first 15 units or so.

As is so often the case though, totting up my hours regularly as I went along, I realized I was spending much longer on the work than I'd budgeted for up-front. That meant that because the project is for a fixed fee, my hourly rate was nose-diving. It also meant I was getting behind schedule. After a bit of a review and discussion, it turned out that a lot of the extra work was just down to there being more involved in the project than I'd originally bargained for – isn't it always the case?! With no more budget available though, I had to try and rein in my hours regardless. So I came up with a new way of working.

On the plus side, it is much quicker because I'm only researching each vocab item once, then just reshuffling the results to create the exercises. On the downside, I'm not able to wait until I've researched all the items to see how the unit's going to shape up. So, if you like, the whole process is slightly less data-led. In some units, it works out fine and the examples I've selected shuffle neatly into nice, coherent exercises. Other times, I find that a feature or exercise type starts to suggest itself towards the end of the vocab list and I realize I haven't noted relevant examples for some of the earlier items. Then I either have to squeeze the material I have into exercises which aren't a great fit, or I have to go back and look for better examples for some items. For some units, I use up most of the examples I've collected; for others, I'm left with a whole page of unused material at the end.

So often, ELT writing is a balancing act between how you'd like to work and what the time and budget allows. In this case, the hurry-up initially felt a bit uncomfortable, but as I go on, I think I'm settling into my rhythm again and making it work.


Monday, December 02, 2019

Missing grammar: parallel structure


I've been researching learner language using the Cambridge Learner Corpus for 20 years now, and there are certain issues that crop up again and again among learners at all levels. One that I pick up on regularly is illustrated in the examples below (made-up examples rather than real corpus data, but they illustrate the point):

At the weekend, he goes to the park and play football. (subject-verb agreement)
I like playing football and run. (verb + -ing form)
I'd love to visit Paris and seeing the Eiffel Tower. (verb + to do)
We went to the park and play football. (past simple verb form)
We can swim in the sea and playing volleyball on the beach. (modal + verb form)
I've tidied the kitchen and did the washing up. (present perfect/past participle form)
I was sitting on the train, chatted to my friend on the phone. (past continuous/-ing form)


Basically, students attempt to use a second verb form (usually after a conjunction) without repeating the subject, but they forget to match the verb form to the start of the sentence. In each of the examples above, the correct form becomes clear(er) if we insert the 'missing' subject (+ verb/auxiliary/modal):

At the weekend, he goes to the park and [he] plays football.
I like playing football and [I like] running.
I'd love to visit Paris and [I'd love to] see the Eiffel Tower.
We went to the park and [we] played football.
We can swim in the sea and [we can] play volleyball on the beach.
I've tidied the kitchen and [I've] done the washing up.
I was sitting on the train [and I was] chatting to my friend on the phone.

It's something I've noted in countless corpus reports, but I've never been quite sure what to call it. Until last week, that is, when I came across it for the first time in an ELT coursebook, referred to as parallel structure. It was in a B2 book, in a section about academic writing style, and covered a wider range of structures than those above (not just verb phrases, but nouns, adjectives and full clauses too), but it still made me cheer out loud at my desk. It's long puzzled me why these incredibly common structures aren't explicitly addressed in most ELT materials when they cause so many issues for students.

I rarely get the chance to choose the grammar points I cover in the materials I work on, because they're mostly supplementary materials and the syllabus is already fixed by the time I get started. So I've never had the opportunity to cover this explicitly myself. I have tried to include examples in practice exercises, but they usually end up getting cut by editors who want all the items to fit on a single line and don't like the longer examples these structures often involve (grrr!).

So I'm making a case for this to be included explicitly in more ELT materials. It's relevant at every level and with almost every kind of verb structure we teach. It doesn't have to be a separate grammar point and it doesn't even have to have the label parallel structure. I think it's a great thing to bring up when you're revising a particular verb form as a slight variation on the usual practice activities, just to raise students' awareness. You could have a simple intro as above showing/eliciting the 'missed out' words and the correct second verb forms. Then straight into some practice examples (as gap-fills or freer practice). It works perfectly for any kind of list: 

  • daily routines (She leaves the house at 8 and catches the bus at 8.15)
  • a dramatic narrative (He opened the box and looked inside)
  • background to a narrative (People were sitting in the café, eating and drinking)
  • things people like doing (I like watching TV and chatting to my friends online)
  • things people would like to do in the future (I'd like to go to university and study drama)
  • things ticked off on a list (We've booked a room for the party and set up a Facebook page)
  • things on a to-do list (I still need to confirm the hotel booking and renew my travel insurance)

I'm happy to be proved wrong with a flurry of comments about ELT materials that practise exactly this already ...


Tuesday, July 24, 2018

Word Booster update


Last year, I wrote a review of Word Booster, an online tool that allows you to create an ELT lesson from an online text. It creates a (fully credited) pdf of the text that you can print out for students, along with definitions for key words and a follow-up vocab quiz. At the time, I was disappointed that an idea which seemed so promising fell short on a number of important details.

As soon as I’d posted the blog, the creator of Word Booster got in touch. He was really positive about my feedback and keen to improve the tool as quickly as time, manpower and finances would allow. I was really impressed by his commitment and even more impressed when he got in touch again recently about the latest updates to Word Booster.

  • The latest version of the tool uses a learner’s dictionary (the Cambridge Advanced Learner’s Dictionary) which makes the definitions appropriate and accessible to the average learner.
  • Whilst the tool makes suggestions about which words in a text to focus on, the user/teacher is now free to accept or reject these suggestions and to choose whichever words or phrases they feel are most appropriate for their learners or for the aims of the lesson.
  • The tool suggests an appropriate definition for each word but allows the user to check manually that it’s the correct sense and change it if necessary. This is a massive improvement, as automated sense selection can be a bit hit and miss. In the example below, using one of my own blog posts, when I click on 'folk' in the text, I'm able to select the appropriate sense for the context (the tool had automatically selected the more frequent, 'music' sense).

  • There are also options to (de)select example sentences and to adjust the quiz activities slightly – although not to make edits beyond shuffling around which words appear in which activity type.

All of these changes make the tool far more usable. It still has a few minor technical glitches that I’ve passed back to the Word Booster team, but overall, it’s something I’d now happily recommend for teachers to try out.

I do, however, still have a few reservations.

Dictionary definitions in vocab activities

As a lexicographer myself, I’m a big fan of learner’s dictionaries, but I’m still slightly wary about the use of dictionary definitions in vocab practice activities. Research seems to show that using dictionary look-ups or referring to glosses while reading a text helps students’ incidental* vocabulary learning (Laufer & Hill, 2000). By actively looking up a word, focusing on the form and meaning, and relating it to the context, students are more likely to remember it later. And this is exactly what learner’s dictionaries are intended for. Definitions are written in the expectation that the students will come across a word and look it up to check the meaning – they’re intended for decoding. That’s what the first part of the Word Booster tool caters to perfectly.

Where I feel we get onto shakier ground is in the ‘reverse engineering’, if you like, where students are essentially given a definition and asked to guess the word. This is potentially a much more challenging task and isn’t something that dictionary definitions are designed for.  Without the target word alongside, a dictionary definition can seem vague, rather abstract and certainly very difficult to tell apart from definitions for similar words. 

That’s not a criticism of dictionary definitions; it’s just the nature of the beast. Definitions have to be concise, so there can’t be lots of detailed explanation to differentiate between similar words**. They have to be written within a defining vocabulary (a limited set of words that avoids the definitions being more difficult than the words they define), so they necessarily can’t be as subtle and nuanced as those in a dictionary for native speakers. They also have to cover all the possible uses of a word, which can make them a bit vague and sometimes slightly awkward. As a lexicographer, you split out clearly different senses, but you can’t just keep on splitting endlessly; you have to lump similar uses together at some point (e.g. this two-part definition from Cambridge Dictionaries – “option: one thing that can be chosen from a set of possibilities, or the freedom to make a choice”).

As a materials writer, if I want to create a practice activity around definitions (which, by the way, I’d do sparingly anyway), whilst I might start off by looking at a dictionary entry, I’d invariably edit the definition. I might change the wording, for example, losing slightly formal passives (e.g. “one thing that you can choose”). I might make it a bit more specific to the context at hand – so I’d choose just the relevant part of the above two-parter. And, if I was dealing with near synonyms, I’d probably add a bit more detail to help make the distinctions clearer.  

At the moment, many of the quiz questions generated by Word Booster are at best very tricky and at worst downright confusing, just because of the nature of the dictionary definitions. Being able to edit the defs in the quiz questions would undoubtedly help, but at the same time, it would add to the time required to create the material (which kind of negates one of the key selling points of the tool) and, I guess, the ‘authority’ of the definitions would be lost somewhat. It seems to me that the key here is to use the activities sparingly and to choose items carefully, keeping an eye out for odd and confusing defs or combinations and deselecting them. Which brings me onto my main takeaway about this tool …

… it’s how you use it.

Just like any other tool, the success of what’s produced comes down not only to the features of the tool itself, but to how it’s used. As a novice teacher many years ago, I didn’t have fancy online tools like this, but I certainly fell into the trap of choosing a news article that I thought was interesting, photocopying it, picking out some random vocabulary and quite often writing out dictionary definitions for students to match to words from the text. The result was a bit of a confusing mess of a lesson, in which we’d invariably end up decoding the text as a class line by line because it was too hard for the students to manage. I’d have to give extra explanation of the definitions of above-level vocab, and I’d often struggle to remember the correct answers to questions that had seemed obvious when I wrote them but which, in the middle of a lesson with a load of confused students, suddenly didn’t make sense any more.

To use a tool like Word Booster effectively, the teacher needs to consider:

  1. The choice of text – is it at the right level for the students both linguistically and cognitively? A few above-level words might provide challenge and interest, but too many will be confusing and demotivating. Is it the right length for the lesson?
  2. The choice of vocab – in my last post, I wrote about the importance of choosing reasonable vocab sets to work with and about having a clear aim for vocab activities (Are you focusing on receptive or productive vocab? Do you want students just to decode the text or are these words useful to learn?). How many words is it reasonable to highlight and practise?
  3. The choice of definitions – the option for the user to pick the correct definition is really useful, but it requires some skill. How is the word being used here and which def fits best? Is the word being used metaphorically? Is it actually part of a phrase, a phrasal verb or an idiom? Is the correct sense available in this learner's dictionary at all?
  4. Activity selection – personally, I think one short-ish definition-based activity per text, using carefully-selected definitions that don’t cause confusion, is probably enough. I might stretch to a second that uses example sentences, but for me, any more than that and it’s becoming a bit mechanical and repetitive. I’d then want to supplement the material generated by Word Booster with some of my own content. Just mining a text for vocabulary doesn’t amount to a successful, engaging lesson. At a minimum, I’d want to add some kind of comprehension questions – whether those were traditional written questions about the text or looser points for discussion. I’d then want some kind of follow-up – a response to the content of the text, perhaps in the form of group discussions, maybe a writing task.

Overall, I’m really impressed with the improvements that Word Booster has made over the past year, and I know the team have more upgrades in the pipeline to continue refining their algorithms and adding more features. I’d certainly say it’s worth trying out. Whilst creating a usable lesson involves a bit of work in terms of choosing the vocab, checking definitions and selecting appropriate quiz questions, I think it does save time in creating a basis for a lesson that you can then build around.

*The term incidental vocabulary learning doesn’t mean words that students just come across by accident. Incidental learning can be quite planned and intentional; it just isn’t the main focus of the activity. So in a reading lesson, the main focus is on understanding the text, maybe for discussion or to answer some comprehension questions, but there can be a conscious focus on vocab too – this would be incidental learning.
**When I was working on the Oxford Learner’s Thesaurus, we often needed to add whole extra sentences to help differentiate between synonyms.

Reference:
Laufer, B. & Hill, M. (2000) ‘What lexical information do L2 learners select in a CALL dictionary and how does it affect word retention?’, Language Learning & Technology, 3(2), 58–76.
