Lexicoblog: Language data and permissions: AI vs corpora

Like everyone else lately, I haven’t been able to avoid hearing about generative AI, including in a session at last week’s Freelancers’ AwayDay. One point in the debates around it has particularly jumped out at me, though. The language data collected to train the large language models that power the likes of ChatGPT is scraped from the internet with no attempt to get permission from the original creators of those texts. This has raised alarm bells with writers and with groups such as the US Authors Guild, which is taking action on the issue.

As a lexicographer, I work with very large collections of language data every day: corpora. Working with publishers’ corpora on materials for publication, I’m very aware that permission and copyright are issues we absolutely don’t ignore. The multi-billion-word corpora collected and held by dictionary publishers contain material that has been added with the permission of the copyright holders. That permission generally comes from the publishers of the texts rather than the individual writers, which allows for the collection of large quantities of data, such as all the newspapers published by a particular media group or all the academic journals from a certain academic publisher.

That permission also comes with restrictions on how the data can be used. Specific agreements vary between corpora, but they usually include limits on the length of excerpts that can be used, which is not generally a big problem when you need short dictionary examples. We are also generally required to edit examples drawn from a corpus so that they’re not obviously identifiable. Very ‘vanilla’ examples that could have come from anywhere – ‘She left and closed the door behind her.’ – can safely be copied as they are, but any references to real people, places, events, or organizations will usually be removed, and often replaced with ‘the minister’, ‘a company spokesperson’, ‘in the region’, etc. Incidentally, this also has the advantage that the examples are less likely to date and will be accessible to a wider audience because they rely less on culturally specific references.

As lexicographers, we do occasionally turn to other sources to research language, especially when we’re looking at new or niche uses for which we may have scant corpus evidence. In these cases, our editors are even more insistent about the need for caution. Ideally, we’ll refer to online sources to confirm how a word is being used, then try to make use of the few corpus examples we do have, informed by what we’ve seen elsewhere, to come up with appropriate example sentences.

I should note that I’m specifically talking about publishers’ corpora here. There are, of course, plenty of other corpora out there, including numerous web corpora, where the issues around permission and copyright are very different. Many of these were originally collected for academic purposes, i.e. principally to research language usage rather than to publish commercial materials. They should also come with notes about how they can be used, although I wonder how many users actually read the small print.

I think that, as writers ourselves in whatever form, we should be especially aware of how the content we create is being used, and potentially abused, with and without our permission and, crucially, should in turn be respectful in how we use the intellectual property of others.

Labels: AI, corpora, lexicography

Monday, September 25, 2023
