Lexicoblog

The occasional ramblings of a freelance lexicographer

Sunday, October 20, 2024

Euralex 2024, Croatia

After 25 years of working in dictionaries, I recently attended my first ever lexicography conference, the Euralex Conference held in Cavtat, Croatia (8-12 Oct). That might seem a bit surprising, but it comes down to a number of factors about the nature of lexicography and lexicography conferences.

Why was it my first lexicography conference?

Primarily, it's because the kind of lexicography I do, working as a freelancer for UK commercial dictionary publishers, is a bit of an anomaly in the world of lexicography. In many countries and for many languages other than English, lexicography is something carried out largely by academics attached to universities and language institutes, and funded by governments and grants from places like the EU keen to support those countries' linguistic and cultural heritage. So, many of the conference presentations were much less about the kind of practical, jobbing lexicography that I do and instead were papers reporting on academic research, often in very niche, theoretical areas.

At the reception on the first evening, I found myself chatting to Dirk Geeraerts, a very eminent name in the field and one of the plenary speakers - whose name, thankfully, I did recognize despite my incredibly patchy knowledge of academic lexicography! I was explaining my background and commenting that I didn't even understand many of the talk titles in the programme. He responded that it was probably all stuff I did know about but I just didn't recognize the terminology. He turned out to be spot on.

Not a bad spot for a coffee break!

So, why was I there?

Unlike the ELT conferences I typically go to, there was no chance of me being sponsored by a publisher to speak and there were almost no publishing contacts there for me to network with and potentially pick up new work. All reasons why, as a freelance lexicographer working on commercial dictionaries, I'd never been to a lexicography conference before. My "in" came, instead, via my role with the AS Hornby Dictionary Research Awards (ASHDRA), which are, as the name suggests, directly involved in dictionary research. The Hornby Trust was one of the conference sponsors: they sponsored the Hornby Lecture, given this year by the fabulous Kory Stamper, and the current ASHDRA awardees presented their research (remotely online) in a slot at the event which, this year, I chaired.

Kory Stamper giving the Hornby Lecture

Was it useful?

In a very general sense, it was useful to give me a feel for the wider field. I'd been kind of aware of the differences I mentioned above, but meeting lexicographers and researchers from other countries and languages has crystallized just how different their worlds are from mine.

With my ASHDRA hat on, it's given me a better sense of what dictionary research looks like, and the norms and expectations of the field. As Dirk predicted, I had quite a few aha moments where I realized that some concept or theory or framework that I'd never heard of and sounded incredibly fancy was actually something I already knew about and use pretty much daily but without knowing the relevant label! And of course, it was a chance to spread the ASHDRA message and publicize the awards.

On a more personal level, I met lots of interesting people and it was fun getting into some incredibly nerdy, niche conversations about the intricacies of dictionary compilation, corpus tools, and, inevitably, the impact of AI and LLMs on the field. Most of the new contacts I've made are unlikely to lead to future work, just because they work in such very different arenas, but one or two could potentially result in the odd offshoot which could be interesting.

Good to meet the team behind Sketch Engine, the corpus software I use.

As most of my trip was self-funded, it was an incredibly costly week. I decided to go at the start of the year when work was more stable and my finances were less precarious. More recently, it's felt like an outlay I could ill afford (and one which came out of my personal savings), but seeing as it was all booked, there was no point in feeling resentful and I tried to draw the positives out of it. Not least of those was the opportunity to visit beautiful Croatia. As is often the case with conferences, I spent much of my time inside windowless conference rooms, but I did have a free day at the start of the week to visit Dubrovnik and I grabbed a free afternoon to swim in the warm, crystal-clear waters of the Adriatic, so I really mustn't grumble!

Dubrovnik Old Town

The sparkling, crystal-clear Adriatic


Cavtat

The Adam Kilgarriff Memorial hike

Labels: ASHDRA, conferences, Euralex 2024, lexicography

Monday, September 25, 2023

Language data and permissions: AI vs corpora

Like everyone else lately, I haven’t been able to avoid hearing about generative AI, including in a session at last week’s Freelancers’ AwayDay. One point in the debates around it, though, has jumped out at me in particular. The language data collected to train the large language models that power the likes of ChatGPT is scraped from the internet with no attempt to get permission from the original creators of those texts. This has raised alarm bells with writers and with groups such as the US Authors Guild, which is taking action on the issue.

As a lexicographer, I work with very large collections of language data every day: corpora. Working with publishers’ corpora on materials for publication, I’m very aware that permission and copyright are issues we absolutely don’t ignore. The multi-billion-word corpora collected and held by dictionary publishers contain material that has been added with the permission of the copyright holders. These are generally the publishers of the texts rather than the individual writers, which allows for the collection of large quantities of data, such as all the newspapers published by a particular media group or all the academic journals from a certain academic publisher.

That permission also comes with restrictions on how the data can be used. Specific agreements vary between corpora, but they usually include limitations on the length of excerpts that can be used, not generally a big problem when you need short dictionary examples. We are also generally required to edit examples drawn from a corpus so that they’re not obviously identifiable. Very ‘vanilla’ examples that could have come from anywhere – She left and closed the door behind her. – can safely be copied as they are, but any references to real people, places, events, or organizations will usually be removed, and often replaced with the minister, a company spokesperson, in the region, etc. Incidentally, this also has the advantage that the examples are less likely to date and will be accessible to a wider audience because they rely less on culturally-specific references.

As lexicographers, we do occasionally turn to other sources to research language, especially when we’re looking at new or niche uses for which we may have scant corpus evidence. In these cases, our editors are even more insistent about the need for caution. Ideally, we’ll refer to online sources to confirm how a word is being used, then try to make use of the few corpus examples we do have, informed by what we’ve seen elsewhere, to come up with appropriate example sentences.

I should note that I’m specifically talking about publishers’ corpora here. There are, of course, plenty of corpora out there, including numerous web corpora, where the issues around permission and copyright are very different. Many of these were originally collected for academic purposes, i.e. principally to research language usage rather than to publish commercial materials. They should also come along with notes about how they can be used – although I wonder how many users actually read the small print.

As writers ourselves, in whatever form, I think we should be especially aware of how the content we create is being used, and potentially abused, with and without our permission, and crucially, in turn, have respect for how we use the intellectual property of others.

Labels: AI, corpora, lexicography

Wednesday, July 19, 2023

Mundane language change

When I talk about language change and the need to keep dictionaries up-to-date with current language usage, people tend to immediately start talking about new words, and especially trending new coinages that they may have come across in the media, the likes of wokefishing or quiet quitting. But actually a huge amount of language change is much subtler and much more mundane.

Yesterday, I was looking into the noun form, in the sense of an application form or an entry form for a competition. Looking at different learner's dictionary definitions, I found a split between those which still describe it as a piece of paper to be written on, what we might now call a hard copy (a retronym) and those that have shifted to the more neutral description of a document which implies that it could be a piece of paper or in a digital format, maybe online. And of course, the word document itself has shifted and stretched in the same way.

You probably hadn't even noticed that the concept of a form, which always used to be a piece of paper, has slowly morphed to encompass digital and online formats too without us feeling the need for a new word - unlike, for example, the way that we distinguish between a letter and an email.

A lot of language change is similarly undramatic. Words slowly shift from one usage to something slightly different or stretch seamlessly to encompass new concepts. As lexicographers, we have to be alert to these shifts, to gently tweak definitions to keep them current, and edit examples to reflect contemporary usage - in this case, likely showing examples that refer to both paper forms and digital ones.

Labels: dictionaries, language change, lexicography

Thursday, July 06, 2023

Lexicography FAQs: messy entries

Last week, I was speaking at the BAAL Vocab SIG conference about the process of compiling an entry for a learner's dictionary. I talked about some of the questions that you end up asking as you carry out your corpus research, and the variety of challenges and choices you're faced with: from how many variant forms of a word to show, to what constitutes a separate part of speech, to how finely to split out different senses of a word, and what uses and patterns to exemplify.

I mentioned how entries can range in length from very simple, single-sense words to the mammoth entry for run, the longest entry in most contemporary learner's dictionaries, running to 120 numbered senses in the Oxford Advanced Learner's Dictionary (see what I did there?!).

This week, I've been thinking about how some entries are really simple and straightforward to compile, while others turn out to be messy and entangled. A couple of medical-related entries I've dealt with recently exemplify that nicely. The entry for cyanosis, despite being a fairly specialized medical term, turned out to be a really simple one to compile. It only has a single, clearly-defined meaning and it's one that can be explained easily within a defining vocabulary.

CCU, on the other hand, turned out to be a complicated mess. Abbreviations can be tricky for a number of reasons. Firstly, they're hard to search for in the corpus because the same abbreviation often gets used to refer to lots of different things, some of them things you wouldn't put in the dictionary, like names of companies or products or local sports clubs, etc., but also sometimes more than one generally-used concept that's relatively high frequency and that learners might reasonably look up. Then there's the question of whether to have full entries for both the abbreviation and full form or maybe just a cross-reference at the abbreviation pointing to the full form. In the days of print dictionaries when space was at a premium, x-refs would be widely used, but online, it seems unnecessary to send a user round in circles when you could just give a full definition at both. Different publishers and projects will have detailed policies for these kinds of things set out in the styleguide, but sometimes decisions are still left, in part, to the discretion of the lexicographer, considering things such as overall frequency of the term and the relative frequencies of the abbreviation and full form. CCU, as you can see below, led me down a whole rabbit hole of different questions and choices both about the abbreviation itself and other possible variants and inclusions!

So, it seems that CCU can be an abbreviation for coronary care unit or cardiac care unit, which are both the same thing. However, such units are also sometimes called just coronary units or cardiac units - in which case, the abbreviation wouldn't be CCU. CCU can also refer to a critical care unit, which is something different, but mostly synonymous with intensive care unit, for which the abbreviation is ICU ... are you still following?!

And as I mentioned in my session last week, all those decisions about what to show, where and how have to be filtered through the lens of what will be most helpful for the user. You're always balancing wanting a learner to find the meaning or form of the word (or abbreviation) they've come across, which leans towards "include everything", but at the same time, you know that they also want simple, concise answers rather than a confusing mess of too much information. Because, TL;DR!

Labels: abbreviations, corpus research, lexicography
