Lexicoblog

The occasional ramblings of a freelance lexicographer

Monday, June 15, 2020

Ludwig Guru: a review


Recently, a fellow ELT writer posted in a Facebook group about a new language tool they'd discovered. I hadn't come across it before, so couldn't resist checking it out.

It's called Ludwig Guru and it describes itself as:

"the first sentence search engine that helps you write better English by giving you contextualized examples taken from reliable sources.

It's aimed at learners/non-expert users of English and the idea is you type in your best guess at an English sentence, or part of a sentence, and it comes back with examples of similar sentences from 'reliable sources'. Then you can see how well the examples match your own attempt. Presumably, if you find lots that are exactly the same, you know you're on the right track and if they're a bit different, you can adjust yours to sound more natural.

The post that had led me to it was from an ELT writer looking for ideas for how a slightly obscure tense (future continuous passive, yes, it's a thing!) is typically used. My first reaction was "Why not use a 'proper' corpus?" … but I am aware that corpus tools can be off-putting until you get used to them and this looked like a potentially more user-friendly alternative. I decided to test it out to see whether it might be a useful tool for ELT writers for checking intuitions or searching for ideas for authentic examples/contexts, as well as a tool to recommend to students.

As with many similar services, there's a free version with limited functionality and a premium version that gives you the full experience. I registered for the free version just to try it out. It's very restrictive! You only get 6 searches per day – and that's a 24-hour period, so if you hit your limit in the afternoon, you can't log back in the next morning – which made it very difficult to test out in any meaningful way. You also only get 15 results per search, which again made it difficult to know whether what I was seeing was a representative sample of what you'd get from a wider search. You can sign up for a free 15-day trial of the premium version, but that requires you to enter your credit card details, which I wasn't prepared to do. So, to be honest, I didn't get as far as I'd have liked before I just gave up! But here's what I did find.

The data:
My first question was about what constituted 'reliable sources'. The site draws on 22 sources: 8 news media sites (the BBC, the Guardian, the New York Times, etc.), 5 academic science sources (mostly scientific journals), a couple of wikis, a couple of encyclopedias, and a collection of other sources that it describes as 'Formal & Business' but which are a bit of a mixed bunch, including documents from UNICEF and the European Parliament. With the premium version, you can choose to filter your results by selecting which sources you want to include.

My first thought is that it's actually not a bad spread. Many corpora depend heavily on news media sources because they're readily available and reasonably wide-ranging in terms of topics (and so spread of language/vocab). The encyclopedias and wikis will also provide a nice spread of topic vocab. The language of journalism though (and of reference materials too, I suspect) is quite a distinct genre, so isn't necessarily an ideal model for other contexts.

The academic content is made up only of science journals, so obviously doesn't help with other academic disciplines. I also noticed that a lot of the unexpected results – examples that felt awkward (and in some cases positively incorrect) – came from this section of the data. When I clicked through (as you can) to the original sources, they were papers that appeared to have been written (at least judging very crudely by the names of the authors) by non-native speakers of English. That's unsurprising, seeing as many academic papers in English-language journals have a very international mix of authors. Whether something that has managed to pass through the reviewing process for a journal (and in this case, a small range of journals mostly from the same publisher) represents a good language model is up for debate.

The overall British/American split is difficult to determine, but the news media sources are 50/50 – which is just something to bear in mind, as some searches will throw up clear differences between the two varieties. For example, write me (with the person as the direct object) is standard in American English but sounds distinctly odd to a British English speaker (who'd use write to me).

For learners:
The tool has been designed for non-expert users of English to search for specific phrases, so this is where I started. The results were kind of mixed, and the main thing I took away was that they needed quite a degree of language awareness and analysis to be useful. Here are just a few of the searches I tried and the issues they threw up:

Just as a side note, I actually took some of the examples that sounded awkward/unlikely to me from interviews with the creators of the app itself … and interestingly, those exact examples often came up as the first result!

One obvious search is to check collocations, so I looked at a couple that seemed slightly odd to me: firmly think and obtain my goal. My intuition tells me (as do reference sources and other corpus evidence) that we'd be more likely to say firmly believe and attain my goal. Ludwig came back with the following results:

[Screenshots: Ludwig's results for firmly think and obtain my goal – only a small number of exact matches for each]




As a seasoned corpus researcher, I know that in a large body of data, you'll probably find some examples of almost any combination of words. Mostly though, with very small numbers like these, you'd discount them as untypical and unhelpful for a learner. (As I mentioned above, many of the results for obtain a goal, in fact, come from a handful of academic papers likely written by non-native speakers.) Corpus research is all about identifying frequent and typical patterns, not individual quirks of usage. For the student using a tool like this, I guess the question is how they make that judgment. Will they see that there are actually only a relatively small number of matches and instead click through to see the similar patterns? Or will they just see a first screen full of examples that appear to match their own, possibly slightly awkward, wording and stick with it?
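As a rough illustration of what that frequency-based judgment involves, here's a toy Python sketch – nothing to do with Ludwig's actual internals, and 'corpus.txt' is a hypothetical plain-text file – that counts which verbs turn up before goal(s):

    # Toy sketch (not Ludwig's internals): count verb + goal collocations
    # in a plain-text corpus. 'corpus.txt' is a hypothetical file.
    import re
    from collections import Counter

    with open("corpus.txt", encoding="utf-8") as f:
        text = f.read().lower()

    # crude verb stems + determiner/possessive + goal(s);
    # group(1) captures just the stem, e.g. 'achiev'
    pattern = re.compile(
        r"\b(obtain|attain|achiev|realis|realiz)\w*"
        r"\s+(?:my|a|the|his|her|their|your|our)\s+goals?\b"
    )
    counts = Counter(match.group(1) for match in pattern.finditer(text))
    print(counts.most_common())  # most frequent, i.e. most typical, first

Run it over any reasonably sized collection of text and you'd expect the achiev- and attain- counts to dwarf obtain – exactly the frequency signal a learner looking at a single screen of apparently matching examples doesn't get.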

If students do discount patterns with fewer hits, then the other tools available can be really helpful. The search for obtain a goal above shows suggestions for achieve/realize/attain a goal – all good, solid collocations. Another search for the slightly awkward a large part of them only turned up 17 exact matches, but Ludwig allows a search for synonyms (by putting an underscore before the word you want synonyms of), which offers some good alternatives shown in frequency order: a large percentage/proportion of them.
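(As an aside, you can approximate that kind of synonym suggestion yourself in a few lines of Python – here's a rough sketch using NLTK's WordNet, though unlike Ludwig it won't rank the alternatives by corpus frequency:)

    # Rough sketch: synonym candidates for 'part' via WordNet.
    # Assumes NLTK is installed and nltk.download('wordnet') has been run once.
    from nltk.corpus import wordnet as wn

    candidates = set()
    for synset in wn.synsets("part", pos=wn.NOUN):
        for name in synset.lemma_names():
            candidates.add(name.replace("_", " "))
    print(sorted(candidates))  # should include 'percentage', 'portion', 'share', among others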


The other major issue, of course, is that learners need to feel that a construction might not be right in order to decide to check it in the first place. One review of the app, which explicitly highlighted that it had been written by a non-native English speaker using the app, still contained a few clear language errors. That's not a criticism of the writer – or even of Ludwig, to be honest – but it goes to show that it's impossible to be conscious of all your own errors.

For language research:
Both the basic searches and some of the other tools available do have an appeal for the ELT writer wanting to check out typical usage or just search for ideas, but I think the limitations probably outweigh the benefits.

I was initially unsure whether the searches were lemmatised or not … by that I mean, if you search for take, do you just get results for that exact form, or do you also get takes/taking/took/taken? It's difficult to be certain with so few search results returned – many of my searches just seemed to come up with the exact form I'd typed in, but then some less frequent ones, like obtain my goal above, did seem to show other forms (obtaining) as 'similar' results. It seems though that exact matches always come up first and they are just that: exact. That's not very helpful for researching most language patterns, where you want to allow at least some variation. Even if you were searching for a particular tense, say the present perfect, you'd want to allow for both has done and have done. Certainly, if you were looking to compare collocations using the comparison tool, e.g. [take get] a bus, you wouldn't want to just look at the base form of the verb, you'd want to compare across all verb forms.
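To show what I mean by lemmatised, here's a little Python sketch using spaCy – purely an illustration of the concept, nothing to do with how Ludwig works under the hood. An exact-form search only matches the literal string you typed; a lemmatised search matches any inflected form:

    # Exact-form vs lemmatised matching, illustrated with spaCy.
    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")

    sentences = [
        "She takes the bus to work.",
        "They took a later bus.",
        "He has taken the airport bus.",
    ]

    for sent in sentences:
        doc = nlp(sent)
        exact = any(tok.text == "take" for tok in doc)    # matches none of these
        lemma = any(tok.lemma_ == "take" for tok in doc)  # matches all of them
        print(f"{sent!r:38} exact={exact} lemma={lemma}")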

Another issue when it comes to searching for language patterns is allowing for variation. So taking that same example of take/get + bus, you want to see not just take/get a bus, but also take the bus, take the airport bus, take the next bus, etc. Similarly with verb patterns, you want to allow for negatives (have not done) and possible adverbs (have already done). It may sound a bit silly, but by searching for exact matches, you only find what you were searching for … when actually what's often more useful, typical or interesting are the variations you hadn't thought of.
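Just to make that concrete, here's a quick regex sketch over a made-up scrap of text – a proper corpus tool gives you a far more robust query language for this, but the principle is the same: build the variation into the query rather than guessing one exact wording.

    # Sketch: building variation into a search for take/get + bus.
    import re

    text = ("You can take a bus or take the bus; some people take the airport bus, "
            "others get the next bus, and a few have not taken a bus in years.")

    pattern = re.compile(
        r"\b(?:take|takes|taking|took|taken|get|gets|getting|got)"  # any verb form
        r"(?:\s+(?:a|an|the))?"   # optional article
        r"(?:\s+\w+){0,2}?"       # up to two words in between ('airport', 'next', ...)
        r"\s+bus\b"
    )
    print(pattern.findall(text))
    # ['take a bus', 'take the bus', 'take the airport bus',
    #  'get the next bus', 'taken a bus']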

While investigating, I did come up against a number of unexpected results. So, for example, I searched for have * been which should have shown me the most common words that occur between have and been. A standard corpus search uncovers plenty of examples of have already/now/also/just/long/not, etc. been, so it was slightly surprising that Ludwig returned no matches at all. Oddly though, one of the suggested similar searches, the much less frequent, have * participated, came up with 5 matches (have also/never/already/not/consistently participated). This just planted a seed of doubt in my mind about consistency and reliability, but wasn't something I could really explore further within my limited searches.
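For what it's worth, the wildcard idea itself is simple enough – here's a toy Python version of the have * been search, run over a made-up scrap of text, that collects and counts whatever fills the slot:

    # Toy version of a 'have * been' wildcard search: count the slot-fillers.
    import re
    from collections import Counter

    text = ("Prices have already been cut. The files have not been checked. "
            "Some have just been added, others have also been removed, "
            "and two more have already been sold.")

    slot = Counter(re.findall(r"\bhave\s+(\w+)\s+been\b", text.lower()))
    print(slot.most_common())
    # [('already', 2), ('not', 1), ('just', 1), ('also', 1)]

Returning nothing for the far more frequent have * been while finding matches for have * participated does look more like a glitch than a gap in the data.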



Conclusions:
Overall, I think the idea behind the project is a good one and the app has some really nice features … but for me, the limitations of the free version make it fairly unusable, and the limitations of the whole thing make it not worth paying for premium. Certainly in the case of an ELT writer, you'd be much better off investing a bit of time and your subscription money in learning to use a standard corpus tool, which will give you much more flexibility and functionality.


7 Comments:

Blogger Szymon said...

You've mentioned a "standard corpus tool" - which one would you recommend?

7:51 am  
Blogger The Toblerone Twins said...

I use Sketch Engine mostly: https://www.sketchengine.eu/ I have a subscription which gives me access to a wide range of corpora, but there's a free version that has fewer corpora but the same functionality. And if you're attached to a university in Europe, there's also a scheme which gives free institutional access.

10:09 am  
Blogger Szymon said...

Thank you, I'll check it out.

6:03 pm  
Blogger Data scientist said...

Thank you so much. I really appreciate your advice. I just got a subscription to Sketch Engine via my university access. It is amazing.

12:24 pm  
Blogger Beatriz said...

Thank you!! Could you please explain the errors present in the review written by the non-native English speaker?

7:25 pm  
Blogger The Toblerone Twins said...

Hi Beatriz,
Firstly, apologies that it took me a while to spot your comment.

It's more than a year since I wrote this post, so I don't remember exactly what I spotted in the review initially, but looking back at it now, here are a few of the issues I'd pick out. It's tricky to highlight the errors and corrections with the limited formatting here, but hopefully you can see the corrections I've made.

- The title of the post is a bit odd. "Online linguistic search engine Ludwig helps get your English on" - my first thought is that there's a word missing at the end of the sentence because it doesn't really make sense - unless it's a very clever play on a slang expression which I don't think it is. But I'm guessing maybe the writer is trying to use the phrasal verb "get on" in the sense of "make progress". However, that's an intransitive use - so someone gets on, you can't get sth on.
- "unless The New York Times, *BBC* > *the BBC* ... " ... "such as The New York Times, PLOS ONE, *BBC* > *the BBC* and [a number of] scientific publications."
- "To use Ludwig, people *should* > *need to/have to* type into the Ludwig bar not the sentence they want to translate ..."
- [Within a quote] “Wittgenstein came to a conclusion: *the meaning* > *meaning* is determined by context" [no article]
- "*a large part of them* > *a large proportion of them* (44 percent) *enrolled* > *were enrolled* in a STEM program"
- "an *ads-free* > *ad-free* desktop app"
- "the company would like to *sign* > *establish/develop/form* partnerships with reliable sources"
- "As for leaving Sicily for *the Silicon Valley* > *Silicon Valley*" [no article]

Like I said in my post, most of the errors are quite minor - apart from the headline, which is pretty confusing. Several of them are incorrect uses of articles, which is something the writer probably didn't notice and so didn't think to check - and that's one of the drawbacks of this kind of app: you only tend to check the things you're not sure about, and the minor slips go unnoticed.

I hope that answers your question.
Julie

9:56 am  
Blogger bootgums said...

I've only looked at Ludwig when it's come up in a Google search for a word or phrase, and my impression from that limited use is that it's very poor. (I'm a copy-editor, not a learner or an unconfident user of English, but sometimes I want to check something an author has written to confirm or otherwise my feeling that it's 'wrong', or so uncommon that it's best treated as wrong.) Today I searched for 'frustration at' ('about', 'with'). Some related queries popped up in the Google search, including one about 'frustration borne [out] of', which had a Ludwig answer. Of course, Ludwig, or anything sensible, should have said that it's not 'borne of', it's 'born of'. But it said 'borne of' was fine and gave some examples of its use.
This review tells me why: it's a corpus and contains unassessed text, which will contain errors. So I think it should come with a strong health warning, and not appear near the top of Google searches. I had formed the impression that it was an extremely amateurish product; perhaps it's just posing as something it isn't.

3:28 pm  
