Lexicoblog

The occasional ramblings of a freelance lexicographer

Tuesday, May 04, 2021

Text checkers: an overview

I’ve been mulling over a post about text analysis tools for ages but kept putting it off because I felt like I should research all the different tools out there thoroughly first. A recent post by Pete Clements though has forced my hand, so here are my thoughts on the tools I have seen and tried. I should also say that I’m just focusing on the vocab aspect of the tools, not any other analysis features they have such as readability scores and the like.

So, what is a text analyser? Basically, it’s an online tool that allows you to cut and paste a text that you’d like to use with students into a box, you hit a button and it comes back with stats about the text. In particular, what most ELT materials writers are interested in is the level of the vocab. We’re usually looking for a breakdown by CEFR level to tell us whether the text is suitable for a particular class/level and which words might be “above level”.

READ THIS BIT FIRST
Before you use any kind of text analysis tool though, here are some basics to bear in mind:

WHICH WORD LIST?

It’s really important to understand how the tool you use is making those judgements about level. Most tools use some kind of word list that’s been developed to peg individual words to CEFR levels. It goes without saying that this in itself is fraught with problems – my blog post here looks at some of them. But if we’re accepting the basic premise of using a word list, then you need to know which one. If you can’t find out which list a tool is using, then I’d probably say, don’t use it because you can’t know what it’s showing you.

 A number of tools use Cambridge’s English Vocabulary Profile (EVP) list – the key thing to understand about EVP is it ranks words (largely) by productive level – so the level at which you might typically expect a student to be using a word themselves. Given the way we acquire vocab that might be a level (or two) after students recognize and can understand the same word receptively, i.e. if they read it. The Oxford 3000, on the other hand, ranks vocab more by receptive level, so the point at which students will typically be able to read and understand a word.

Text analysers use clever algorithms to analyse the text you input, but these have a number of shortcomings it’s really important to be aware of:

PART OF SPEECH

Starting at the most basic level, the tools don’t always correctly identify the part of speech of a word, especially words that have the same forms across parts of speech. So, weather is most frequently a noun, but can also be a verb, national is mostly an adjective, but can be a noun (a foreign national). Most tools will opt for the most common form and label its level accordingly. I put in the sentence:  Some people who contract this virus can feel very poorly for three to four weeks. And most of the tools identified contract here as a noun and labelled it as around B1, when in fact it’s a verb and EVP pegs it at C2.

Text Inspector: Out of all of the tools, the only one I’ve found to really deal with this issue is Text Inspector which allows you to click on any word in a text that looks like it’s been tagged incorrectly and choose the correct use (meaning and part of speech) from a drop-down menu.  Of course, that means you have to spot the incorrectly tagged words, but it’s better than most. 



Oxford Text Checker: If you hover over a word in your text with more than one possible part of speech, the Oxford Text Checker shows a box giving the CEFR level of each one, e.g. v = A2, n = B1. Although it only shows the level for the most basic meaning of the word (see multi-sense words below).

Many tools also fail to label certain words, especially function words. So, in the sentence - Others end up in hospital needing oxygen. – many tools left others without a level because they just weren’t sure what to do with it grammatically.  Contractions (they’ve, she’d, who’s) also tend to go unlabelled, but are rarely a big issue.
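To make the part-of-speech problem concrete, here is a minimal sketch in Python of how a naive checker might behave. It is not how any of the tools mentioned here actually work internally. The function names and the tiny word list are hypothetical, and the list is assumed to store the most common part of speech first. The levels for contract and feel are the ones quoted in this post.

```python
import re

# Toy word list: word -> list of (part of speech, CEFR level).
# ASSUMPTION: entries are ordered with the most common part of speech first.
# The levels for "contract" and "feel" are the ones quoted in this post;
# everything else about this structure is illustrative only.
WORD_LIST = {
    "contract": [("noun", "B1"), ("verb", "C2")],
    "feel": [("verb", "A1"), ("noun", "B2")],
}


def naive_level(word):
    """Return the level of the most common form, ignoring the context."""
    entries = WORD_LIST.get(word.lower())
    if not entries:
        return None  # word is left unlabelled
    _pos, level = entries[0]
    return level


def check_text(text):
    """Label every token in the text word by word."""
    tokens = re.findall(r"[A-Za-z']+", text)
    return [(token, naive_level(token)) for token in tokens]


sentence = ("Some people who contract this virus can feel very poorly "
            "for three to four weeks.")
labels = dict(check_text(sentence))
print(labels["contract"])  # B1, although here it is a verb (C2)
```

Because the lookup never looks at the surrounding words, contract gets the noun level, and any word missing from the list simply comes back unlabelled.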

MULTI-SENSE WORDS

English is a highly polysemous language; lots of words have multiple meanings which students are likely to come across and recognize at different levels. Most published word lists take this into account and assign different level labels to different meanings. Most text analysers though just opt for the most frequent (and usually lowest level) sense. We’ve already seen that to an extent with contract above, but even without the part of speech issue, if you put in a sentence like - There are links in the table below. - table will be shown as A1 (the piece of furniture) rather than A2/B1 (a graphic).

Text Inspector: As we saw above, Text Inspector gets around this by offering drop-downs for any words you suspect may be used in a less obvious sense.

MULTI-WORD EXPRESSIONS

For me, the biggest issue to look out for with text analysers is that they mostly treat words individually and ignore the fact that a large proportion of most texts (30-50% by some estimates) is made up of chunks; phrasal verbs (end up, carry on), mundane phrases (of course, as usual, a lot of) and idioms (under the weather, have no idea). And of course, a phrase is often going to have a very different level from the sum of its parts.

Unlike with the more glaring mis-tags of part of speech and meaning, I think multi-word items are far more difficult to spot because as expert speakers, we tend to read through them without noticing, but for students, an unknown phrasal verb can be a real stumbling block. It takes a keen eye to spot every phrase and phrasal verb in a text when it hasn’t been tagged.
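The difference between word-by-word tagging and phrase-aware tagging can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the two tiny lists are hypothetical, and the longest-match approach is one simple technique, not a description of what any particular tool does. The A1 and C2 levels for weather and under the weather are the ones mentioned in this post.

```python
# Toy lists, for illustration only. The levels for "weather" and
# "under the weather" are the ones mentioned in this post.
SINGLE_WORDS = {"weather": "A1"}
PHRASES = {("under", "the", "weather"): "C2"}


def tag_word_by_word(tokens):
    """Label each token on its own, as most text analysers do."""
    return [(t.lower(), SINGLE_WORDS.get(t.lower())) for t in tokens]


def tag_with_phrases(tokens, max_len=4):
    """Greedy longest match: try multi-word chunks before single words."""
    tokens = [t.lower() for t in tokens]
    tagged = []
    i = 0
    while i < len(tokens):
        # Try the longest possible chunk first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            chunk = tuple(tokens[i:i + n])
            if chunk in PHRASES:
                tagged.append((" ".join(chunk), PHRASES[chunk]))
                i += n
                break
        else:
            # No phrase starts here, so fall back to the single word.
            tagged.append((tokens[i], SINGLE_WORDS.get(tokens[i])))
            i += 1
    return tagged


tokens = "feeling under the weather today".split()
print(tag_word_by_word(tokens))
print(tag_with_phrases(tokens))
```

Word by word, weather comes out as A1; with the phrase list, under the weather is tagged as a single C2 item. Note that a matcher like this only finds contiguous chunks, so a phrasal verb with an object in the middle would still slip through.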

Text Inspector: Again, the only tool to rate a bit better here is Text Inspector. It does at least manage to identify some phrases. In the sample text that I’ve been using for this post, it correctly picked out the phrasal verbs end up and carry on and the phrase have no idea.


It didn’t recognize pass it on (presumably because of the object in-between), but you can click on pass and choose the phrasal verb sense from the drop-down. Similarly, it didn’t pick up under the weather, but again, you can click on weather and select the idiom and it changes weather from A1 to C2. It doesn’t allow you to neatly link up the whole phrase (I don’t think), but it’s a reasonable compromise.

 

WHICH TOOL?

You’ll probably have gathered by this point that Text Inspector is very clearly out in front when it comes to analysing vocab from an ELT perspective. I subscribe to the paid version which gives you full functionality. You’ll find a link to a free version in Pete’s post which does much the same, but I’m not going to reshare it because, well, I think we should be paying for the good stuff and it’s a very small amount to invest for a really useful resource.

Here’s a brief overview of some of what’s out there though:

Text Inspector

Free version has limited functionality and doesn’t give CEFR analysis. Sign up for the paid version to get everything via: https://textinspector.com/

Word lists: The paid version allows you to analyse the vocab in a text in terms of EVP, AWL (the Academic Word List), BNC and COCA – these last two are corpora and it shows you the frequency of words as they appear in each corpus – useful if you’re into corpora.

Comments: By far the best I’ve seen in terms of at least trying to take into account the factors above.


Oxford Text Checker

Free via: https://www.oxfordlearnersdictionaries.com/text-checker/ (If you get to the main dictionary home page, click on Resources to find it)

Word lists: Based on the Oxford 3000 & 5000.

Comments: Easy to use and colour codes words by CEFR level. However, it always opts for the most common form/meaning of a word and doesn’t recognize phrases. If you hover over a word, it does at least show different CEFR options for different parts of speech, e.g. hovering over feel, you get a box showing v=A1 n=B2. You can also double-click on any of the words in your text to go direct to the dictionary entry which is useful for quickly checking the CEFR label against different meanings. It also has options to create word lists and activities from texts, but given the shortcomings, I wouldn’t be inclined to use them without heavy editing.

 

VocabKitchen

Free via: https://www.vocabkitchen.com/profile

Word lists: It shows words on the AWL and NAWL (New Academic Word List). It also claims to show words by CEFR level, but I can’t find out what word list it’s using which for me is a bit of a red flag.

Comments: It’s intuitive and easy to use, but again doesn’t account for different meanings or phrases. I believe it has more options if you register and sign in which I haven’t tried out.


EDIA Papyrus

Free but you need to register via: https://papyrus.edia.nl/

Word lists: This site is based on a mix of experts’/teachers’ assessments of the level of texts and AI.

Comments: Quite a nice interface, but it seems to skip quite a few words in your input text - not just function words, but you can see below it completely ignores the phrasal verb end up. And as above, doesn’t deal with different meanings or phrases.

 

LexTutor

Free via: https://www.lextutor.ca/vp/

Word Lists: Originally designed for corpus geeks, the main focus for this tool is around corpus frequencies and the AWL. It does now include a CEFR option, but reading through the blurb, the CEFR levels seem to be based on some very old (1990) word lists published by Cambridge way back before this became a properly researched area, so I’m not sure how useful they are.

Comments: A horrible user interface, still really for geeks only. It's so messy, I couldn't even get a meaningful screenshot.

Pearson/GSE Text Analyzer

Free via: https://www.english.com/gse/teacher-toolkit/user/textanalyzer

Word lists: based on Pearson’s own Global Scale of English (GSE) lists

Comments: I hesitated to even include this as it’s just plain weird – unless I’ve missed something. It calculates an overall level for your text but doesn’t show the level of individual words. It does highlight words that it judges to be ‘above level’, but the choices seem to be a bit random. It pegged my sample text at B1+, then picked out poorly and passing as above level, ignoring asymptomatic.


 

 



Tuesday, March 23, 2021

Coronaversaries: rollouts and re-entry

It's a year ago this week that the UK went into its first coronavirus lockdown and I've spotted quite a few #coronaversary (coronavirus + anniversary) posts across social media as people share what they were doing a year ago and reflect on the past twelve months. So, it seemed like a good time to reflect on the language – or coronavocab – that's developed to describe life in an unprecedented year.

Looking back at my coronavocab posts from last summer, much of the language I highlighted has remained with us and become an all-too-familiar part of our everyday vocabularies; face masks, social distancing, hand sanitizer, lockdown, homeschooling. Some of the more light-hearted coinages also still float about in articles and blog posts; coronacoaster, covidiots, quarantinis, isobaking. But how has our language changed to reflect developments so far in 2021?

For a start, what we call the virus has gradually changed. It started off as coronavirus, but then got renamed (in Feb 2020, for the sake of accuracy) to Covid-19 and has, over time, just come to be known as Covid. Looking at some stats from the Coronavirus Corpus (which collects texts about the pandemic from across the internet), Covid on its own still seems to lag behind, but that's probably down to the fact that it's a corpus of written texts including a number of sources that likely prefer the full form. If you were able to look at spoken usage, I suspect Covid would shoot up the rankings.


Probably the most significant event to influence the way we're talking about the pandemic in recent months though has been the vaccine rollout. Much like lockdown/lock down and the other phrasal verbs in one of my earlier posts, rollout (noun) and its accompanying phrasal verb, roll out, are not completely new words, but they have increased massively in frequency in a very short time and shifted slightly in usage. Previously, rollouts were predominantly business-related and to do with new products being launched (the rollout of the new iPhone). However, since Covid vaccines started being approved for use late in 2020, governments around the world have been setting up vaccine program(me)s to roll out the vaccine and offer as many people as possible a Covid jab (especially in the UK) or a Covid shot (more in the US).

The language around vaccines includes the everyday language we all use to talk about getting our jabs (in red), the language relating to getting vaccines out to people (in green) as well as still some discussion about their development and production (in blue). It will be interesting to see how the collocations in the red group shift and get added to over the coming months as the implications and effects of more people being vaccinated play out.


Looking forward, in the UK at least, there's starting to be lots of talk of easing (of restrictions) and a flurry of re- words such as re-entry and readjustment both from a practical and a psychological perspective:


What are you looking forward to re-entering or concerned about readjusting to in the coming months?


Sunday, March 14, 2021

A Time-out A-Z

What does a lexicographer do on her day off?

That's not the start of a corny joke, but a real question I found myself asking last week. After working flat-out on back-to-back projects for the past 5 months without a day off, I suddenly realized I was absolutely shattered and really needed a proper break from my desk. With some form of lockdown here in the UK since last autumn, it hasn't seemed worth taking time off because, well, what would I do with it? You can't go anywhere, nothing's open and you're only allowed to go out for exercise locally, which for me means as far as I can walk from home.

I enjoy walking and with my mileage dropping off over the past few weeks, I resolved to spend a lot of my week off getting some fresh air and exercise. But I've been stomping round the same handful of routes for the past year. On Monday, I did a long loop of Ashton Court Park, one of the few accessible large green spaces on this side of Bristol. After a slightly grey start, the clouds cleared and it was a lovely, sunny, 6-mile amble with a longer than usual (takeout) coffee stop and really nice not to be thinking about getting back to my desk:

On Tuesday, I had a lovely socially-distanced walk-and-talk with a friend around Bristol harbour in the morning, then with rubbish weather forecast for the rest of the week, went out for a wander around Leigh Woods - the other accessible green space - with my partner in the afternoon:

But what about the rest of the week? With all my usual walking routes already ticked off, I had to get creative ... and what better for an off-duty lexicographer than a city A-Z. I planned out a route that would take me along streets starting with each letter of the alphabet in turn, photographing each street sign as I went. I started out with two streets I've previously lived on and spent the next few hours squiggling my way around Bristol in a mix of sun, rain, wind and a massive hailstorm! This was the result ... and yes, I know I cheated a bit on J and X, but hey, my game, my rules!

 
 


 



 
















The full A-Z was exactly 9 miles and lots of fun despite the weather. Many of the streets were familiar, but I went down a few I'd never explored before and just generally enjoyed doing something completely frivolous for nothing more than the satisfaction of completing it. And well, you gotta love an A-Z!

Total walking distance for my week off: just under 40 miles (64 km)!





Monday, March 01, 2021

RSI Day 2021: pain in a pandemic

Yesterday, 28 February, was RSI Awareness Day. This year, even for those of us used to working from home, our work routines have been thrown up in the air and healthy working habits have gone a bit awry.  It's also been a fairly reflective sort of year, so I thought it might be time to talk about some of my pain-related ups and downs. To explain the past year though, I’m going to have to take you back a bit …apologies to those who’ve heard some bits of this story before.

1989:
I broke my right collarbone in a car accident. I was told it'd healed and was sent off to live fairly unbothered by it for the next 10 years or so.

1999:
After spending my 20s teaching abroad, I’d just switched to a desk-based job as a lexicographer when I suddenly started getting severe pains in my right hand, arm, shoulder and neck. I was initially diagnosed with RSI and after lots of appointments, discovered that my collarbone had never fixed properly but was wobbling around causing a generally unstable wonky top right corner and putting all kinds of stresses and strains on the nerves, tendons and muscles around it.

2000 onwards:
Having had lots of doctors more-or-less shrug their shoulders, I spent the following 20 years doing my best to live with increasingly debilitating chronic pain that affected my whole upper body. It limited my professional life significantly. Having gone freelance early-on to give me the flexibility to work how and when I could, I worked part-time hours, was careful not to take on too much and avoided jobs that would be too fiddly and computer-heavy. I tried various workstation set-ups, took lots of regular breaks, tried various forms of exercise and therapy.


Late twenty-teens:
By about 2018 though, things seemed to have hit a real low-point. The pain was getting worse and dominating my life more and more. I was taking bigger chunks of time off work between projects to recover and my personal life was getting narrower as I avoided more and more everyday situations that would cause me pain.

June 2019:
A chance comment on a Facebook thread about mindfulness apps led to a suggestion from Rachael Roberts that I take a look at Curable, an app aimed specifically at chronic pain sufferers. The results were pretty dramatic. It feels a bit silly to say that an app managed to ‘cure’ 20 years of pain in just a couple of weeks, but I think it was just the right thing at the right time and brought together a lot of ideas I’d been aware of for a while but hadn’t known how to act on. I won't go into the details, because we’d be here all day, but it basically centred around mindset and my attitude to pain. It didn’t fix my wonky shoulder, but I learnt how to turn the volume down on the pain that had started bouncing round my brain’s wiring out-of-control. I went from taking strong painkillers pretty much daily to maybe 3 or 4 times in 18 months.

Coronatimes: 
Despite everything going on in the world, 2020 on the whole was actually okay in terms of both my physical and mental health. After a fairly busy few months in the spring, work dropped off a cliff through the summer and I had 4 months with pretty much no work at all. Of course, it was all a bit worrying, but thankfully, I got government grants that kept me going financially and the weather was fabulous! My partner was out of work and being cooped up at home together wasn’t great, but with the good weather, we could use the garden as an extra room, there was lots of walking and gardening and we rubbed along fine.

Come the autumn, my work picked up again and I’ve been more-or-less flat-out since October – which is great, but maybe not so healthy. As the weather got worse, the days got shorter and my partner got more bored and despondent, I found myself spending longer stretches at my desk, avoiding leaving my office for my usual regular breaks because I didn’t want to be disturbed. By mid-December, I was getting tweaks in my shoulder. I partly put it down to the cold damp weather, but I knew that too much desk-time and increasing tension (mental tension leading to physical tension) were to blame too. By the end of the year, I was exhausted and at the end of my tether with no reserves of energy to draw on to do the clever, pain-subduing mind trick.

2021:
So far this year has been a tough slog; ploughing on with work, going out for fewer walks because I’m really feeling the cold in my joints, and feeling generally resentful and low. Thankfully, I know that I’ve always struggled with winter and I also know that I usually start perking up in March, so I’m hopeful that the advent of spring, along with the gradual easing of lockdown here in the UK will signal an upturn. I’m also just coming to the end of one work project and it looks like the next project I have pencilled in might be a bit delayed. So, I’m planning a much-needed week off. Of course, I won’t be able to go anywhere or do very much, but a bit more walking, perhaps a bit of pottering in the garden. If I can relax and recharge just a bit, then I think I can get my priorities back in perspective - even in these weirdly out-of-perspective times - and get my health back on track.


Tuesday, February 23, 2021

Like searching for an idiom in the proverbial haystack

Recently, I've been doing quite a bit of research into idioms. It's lots of fun, just because idioms are the fun end of language, but it's also quite challenging from a corpus perspective, because idioms are slippery suckers!

In general, idioms pose two key problems for a corpus researcher:

1 Separating the figurative from the literal: so, for example, trying to get stats on how common the idiom 'an own goal' is – as in The PM scored a bit of a political own goal yesterday – you realize you also have a whole load of cites from football reporting about actual own goals. There's no real way of doing this apart from trawling through a sample of corpus lines to make a rough judgement about the percentage of figurative vs literal uses.

2 Dealing with variation: while a few idioms are completely fixed, most allow for a bit of variation and some are so variable as to be almost impossible to pin down.  For example, you might start off with "frighten the life out of someone" … then you realize that the verb scare is common too and actually there are some examples of terrify … then you look some more and find examples for frighten/ scare the (living/ absolute) shit /crap /hell /fuck /heck /daylights /piss /bejesus* out of someone! (*various spellings) All of which I only uncovered by trying out different search patterns, allowing for alternative verbs and gaps for things that get scared out of you.

[lemma="frighten|scare|terrify"][word="the"][]{1,2}[word="out"][word="of"]

Of course though, the more flexible you make your search, the more 'noise' you get – i.e. examples that aren't of the target idiom – so it's a bit of a balancing act with lots of trial and error.
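For anyone without access to a lemma-tagged corpus, the same balancing act can be tried out on plain text. The following is a rough Python sketch, not the corpus query above: it swaps the lemma search for an ordinary regular expression that spells out the verb forms by hand, and the function name find_candidates is just a hypothetical helper. It keeps the same shape as the search pattern: one of three verbs, then "the", then a gap of one or two words, then "out of".

```python
import re

# Plain-text approximation of the corpus search pattern.
# ASSUMPTION: verb inflections are listed by hand because there is no
# lemma tagging here; a real corpus tool handles this for you.
PATTERN = re.compile(
    r"\b(?:frighten(?:s|ed|ing)?|scar(?:e|es|ed|ing)|terrif(?:y|ies|ied|ying))"
    r"\s+the"
    r"(?:\s+\w+){1,2}"   # the gap: one or two words
    r"\s+out\s+of\b",
    re.IGNORECASE,
)


def find_candidates(text):
    """Return every stretch of text that fits the flexible pattern."""
    return [m.group(0) for m in PATTERN.finditer(text)]


print(find_candidates("It scared the living daylights out of me."))
# ['scared the living daylights out of']

# The flexibility also lets in noise: a literal use fits the pattern too.
print(find_candidates("They scared the birds out of the tree."))
# ['scared the birds out of']
```

The second example is exactly the kind of noise that has to be weeded out by eye: the pattern fits, but nothing idiomatic is going on.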

Then yesterday, a chance comment in a TV programme threw up a whole new issue that I'd never considered – the use of the term 'the proverbial' which is kind of an idiom within an idiom! I scurried off to a corpus to check it out and found that:

It's mostly used before or within a complete idiom (often before a key noun). And notice it doesn't have to be what we'd typically think of as a proverb, it can go with any fixed, idiomatic expression, I think as a way of the speaker acknowledging that what they're saying is a bit of a cliché.

 

Perhaps more interestingly though, it can also be used to replace a key word within an idiom. This often seems to be a way for the speaker to avoid a taboo word (shown in red) – and so be polite – but not always (words in green):

 

It's a fabulous linguistic quirk and lots of fun to play around with, but wow, how the proverbial do you go about explaining that one to a poor learner?!


Friday, February 12, 2021

Writing rhythms

On most ELT writing projects, the work (and your life for the duration of the project!) gets divided up into units. For a students' book, that might be 10-15 quite large units, but for many of the sort of self-study, language practice type materials I work on, there can be anywhere between 20 and 50 short units which may only be 2-4 pages each.

At the start of a new project, you spend a bit of time getting to grips with the brief and playing around with the first unit or two to establish how they're going to work. Often, the format's already quite fixed in the brief, sometimes you have a bit of leeway to play with. Then once everyone's happy, you get your head down and start ploughing through unit-by-unit.

What interests me is how different people go about tackling each unit. Do they sketch out the whole thing then go back and fill in the details? Do they do it on paper or straight into a Word doc? Do they start from the beginning and work through each activity in turn? Or do they start with a core component, such as a reading text, then work outwards from it? A lot, of course, depends on the type and scope of the material, but even within that there's quite a bit of room for variation.

For the past couple of months, I've been working on some self-study vocab practice materials. There are 50 units altogether (across two linked projects) which is kind of daunting, but also quite nice as it means I've settled into a rhythm of roughly a unit a day. For each unit, I already have a (more-or-less) predetermined set of vocab items to practise across a number of activities. It's heavily corpus-informed, so I'm researching the vocab items to pick out features to highlight (typical usage and context, collocations, typical colligational patterns, etc.) and also using and adapting corpus examples in the activities. For the first few units, this was my approach:

 


The major downside of this was that I found myself running the same corpus searches numerous times. So, I'd explore vocab item A extensively in the initial research stage, then I'd find myself searching for it again several times to source examples for each exercise. I revised my approach after a few units so that I still did my research stage as before, but then sketched out a rough plan of the different exercises, e.g. exercise 1 focus on noun collocations, exercise 2 focus on following prepositions, etc. Then I ran a corpus search for each vocab item and added examples to several of the exercises at the same time. This seemed more efficient and I settled into it as a way of working for the first 15 units or so.

As is so often the case though, totting up my hours regularly as I went along, I realized I was spending much longer on the work than I'd budgeted for up-front. That meant that because the project is for a fixed fee, my hourly rate was nose-diving. It also meant I was getting behind schedule. After a bit of a review and discussion, it turned out that a lot of the extra work was just down to there being more involved in the project than I'd originally bargained for – isn't it always the case?! With no more budget available though, I had to try and rein in my hours regardless. So I came up with a new way of working.

On the plus side, it is much quicker because I'm only researching each vocab item once, then just reshuffling the results to create the exercises. On the downside, I'm not able to wait until I've researched all the items to see how the unit's going to shape up. So, if you like, the whole process is slightly less data led. In some units, it works out fine and the examples I've selected shuffle neatly into nice, coherent exercises. Other times, I find that a feature or exercise type starts to suggest itself towards the end of the vocab list and I realize I haven't noted relevant examples for some of the earlier items. Then I either have to squeeze the material I have into exercises which aren't a great fit or I have to go back and look for better examples for some items. For some units, I use up most of the examples I've collected, for others I'm left with a whole page of unused material at the end.

So often, ELT writing is a balancing act between how you'd like to work and what the time and budget allow. In this case, the hurry-up initially felt a bit uncomfortable, but as I go on, I think I'm settling into my rhythm again and making it work.


Friday, February 05, 2021

Motivation, mindset and guessing vocabulary from context

When it comes to ELT vocabulary, the idea of guessing the meaning of unknown words from context seems intuitively a useful strategy. However, its effectiveness both in terms of reading comprehension and how well it helps learners retain new vocabulary has been questioned – see this blogpost from Philip Kerr for a summary of some of the arguments. I just went back to reread it after something that happened yesterday.

Since my partner recently took Swiss citizenship (via his father), we've been receiving regular piles of paperwork from the canton in which he's registered. It's mostly to do with voting, either in referenda or local elections and is all in French. We both speak some French, but far from fluently.

A pile arrived yesterday and I was flicking through it, mostly just to practise my French as I waited for the kettle to boil on a tea break. There were three referenda questions, two of which I understood quite easily, the third I hesitated over. 

 

Image of red booklet with referendum question

It read: "l'interdiction de se dissimuler le visage." Which I read as "Prohibition/Ban on [reflexive verb which I don't recognize] the face." My first thought was it might be something to do with banning facial recognition software or something similar – it was about banning something to do with people's faces and based on my current world knowledge, that seemed like a logical guess. I read the first paragraph and it initially seemed to fit – it talked about the ban applying in public places such as in the street, on public transport, in sports stadiums, etc.

I still wasn't quite sure though, so I scanned through a bit more of the text. Then I came across a section about the arguments in favour of the ban and it said that "[the noun from the unknown verb] of the face in public spaces symbolises the oppression of women and is against the liberal spirit of living together/community cohesion". Aha! It was at that point that I realized that dissimuler means to conceal or hide or cover and that the question was about face-coverings – presumably in the sense of a niqab rather than a medical face mask (the irony of the timing wasn't lost on me!).

That aha moment was incredibly satisfying – perhaps an under-rated motivator in language learning? Or is that just me? It often strikes me that teachers and linguists, who are inherently fascinated by language for its own sake, may not be the best people to judge what works and what doesn't for the average language learner for whom learning a language may just be a means to an end. Does the average learner get that same sense of achievement from working out meaning? Would they have bothered to form a hypothesis then read on to check it in the way that I did or would they have just given up? It's hard to say and I'm sure it would differ enormously from student to student.

And of course, now I'm not going to be able to fairly judge whether my experience of guessing from context is going to help me retain the new vocab item either. Chances are I will just because I've done the diligent language-learner thing of processing and working with my new word by writing a blog post about it!
