Professional Documents
Culture Documents
Copyright Liability - AI Inputs and Outputs
Copyright Liability - AI Inputs and Outputs
https://doi.org/10.1093/grurint/ikad140
Advance Access Publication Date: 13 January 2024
Article
ANDRES GUADAMUZ*
limitations in the jurisdictions studied. However, there sources such as Wikipedia.11 While OpenAI does not
will inevitably be case law that probes the boundaries of specify it, some of these sources were collected by
such limitations. Regarding outputs, the resolution will OpenAI itself, but in other instances they used data-
probably be more dependent on individual cases, but sets created by others, such as the case of 16% of the
some of these works might also fall under certain excep- training data coming from independent sources, par-
tions and limitations. ticularly the book data, named Books1 and Books2.12
There has been some online speculation13 as to which
are the sources for these two datasets, one comprising
II. Inputs and outputs 12 billion tokens and the other 55 billion tokens respec-
1. Gathering inputs tively. It is almost certain that Books1 is made up of
public domain books made available through Project
The explosion in the sophistication of AI tools that we
Guttenberg,14 while the content of Books2 remains a
have experienced in recent years has come about because
mystery.15
of two important developments, firstly the improvement
Another popular dataset used in training language
and variety of machine-learning models, but most impor-
models is The Pile,16 which contains 22 other smaller
tantly the availability of large training datasets. While the
likelihood that the image may be ‘unsafe’, as well as as machine learning, where algorithms enable computers
providing a similarity score.21 to learn from data and even improve themselves without
However, some companies working on AI are not as being explicitly programmed.27 The system improves as it
transparent with their datasets. Midjourney, one of the is presented with more data.
early successful companies in the space of image genera- The biggest conflict at present is in what is referred
tion, has not disclosed which datasets it uses.22 Similarly, to as generative AI.28 While the concept has become a
OpenAI does not always disclose the datasets used in its bit of a buzzword, generative AI is generally understood
products. as content that is generated from scratch using some
One thing that is evident from the above is that the form of machine learning algorithm.29 A generative AI
nature of collected data and the method of collection will therefore can take a prompt and produce an entirely
be as varied as the number of datasets. Some datasets con- new work.30 How is this possible? There is a common
tain full works, such as images, video, text, and music. misconception that a generative operation is akin to
Some contain metadata, while some contain links to data. putting together a collage of pre-existing images,31 but
That means the nature of collection will be varied. Some this couldn’t be further from the truth. Generative algo-
will require a copy of a work to be made and kept in the rithms are a class of algorithms used in machine learning
models.37 These algorithms can be used for a wide range So, AI tools use a combination of a diffusion model
of tasks, such as image generation, data augmentation, trained on reconstructing images and natural language
and anomaly detection.38 models that understand words used to describe an
The most successful recent examples of generative AI in image.44 If I type the word ‘cat’ into Stable Diffusion,
images, such as Imagen, DALL·E 2, Stable Diffusion, and it understands because it was trained on thousands
Midjourney, use the diffusion model, which reportedly of reference images of cats and their accompanying
produces superior results.39 Diffusion works by taking an descriptions, and it can build a cat from scratch as it
input, for example a photograph, and then corrupting it ‘remembers’ what a cat looks like from having dif-
by adding noise to it; the training takes place by teaching fused it. This combination of language and image mod-
a neural network to put it back together by reversing the els allow me to type ‘a llama by Gustav Klimt’ with
corruption process.40 DALL-E, and it will be able to comply and produce an
output.45
There is another important element involved in gener-
ative models, and this is called latent space.46 In order to
train a model with millions – and sometimes billions – of
In the broadest sense, data are comprised of individual the exclusive rights of the author, such as reproduction,
facts, numbers, statistics, or items of information that distribution, adaptation, communication to the public,
can be collected.57 Some of these items can be works etc.63 So, copying a video without authorisation, such
that are capable of copyright protection, such as text, as WebVid, would be an infringement. If this is the case,
photographs, sound recordings, music, artworks, etc. developers and researchers will have to rely on an excep-
But facts, numbers, and information may not always be tion to copyright, something we will deal with later.
protected.58 But it must be stressed here that the collection must
For copyright to subsist, the work must be protected be performing one of the exclusive rights of the author.
subject matter and needs to be original, that is, the The copying is an easy one, as it infringes the right of
author’s own intellectual creation.59 It is evident that reproduction. But what if the work is not being copied?
this covers individual works such as books, music and This is what happens with many of the datasets described
art that could be included in a database, but this may in previous sections, particularly metadata ones where a
not be the case with just raw information. Large data work is not actually copied but rather some information
gathering cannot be said to meet this requirement as it about a work is extracted – like, for example, the musical
is often done automatically. This means some data will attributes of songs by Billie Eilish,64 or the song data from
As explained in a previous section, a copy of the data which is then used to produce outputs. In the UK, Sec.
must be available in some form for preparation and data 28A of the CDPA73 states that copyright is not infringed
extraction. So, there could be infringement if this repro- in the making of a temporary copy which is transient or
duction is unauthorised. incidental, if this copy is an integral part of a techno-
After the model has been trained, the copied data are logical process, and the sole purpose is to either enable
not needed anymore, unless one is training other mod- a transmission of the work, or for a lawful use, and the
els on the same data. A trained machine learning model temporary copy does not have independent economic
contains the set of parameters that have been learned significance.
from training data. These parameters are used to make It could be argued that the training of an AI model
predictions on new data, and the quality of the predic- does not fulfil all these requirements, but it is likely to be
tions is determined by how well the model has learned the something that will be argued in court by future defen-
relationship between the input data and the correspond- dants in copyright litigation. Making a copy only for
ing output labels.70 In other words, the resulting trained training is transient, but it could be said that it is not inci-
model does not contain copies of the dataset, but highly dental, it is required to extract information, and without
compressed information in a latent space. these copies there is no trained model. The copy is also
with a claim of copies being transient, it is possible that learning training, i.e. there is copying of large amounts of
the courts will have to look in detail at the technical fea- works to produce something different. Commentators are
tures of training a model, and it will be interesting to see divided on whether data mining falls under fair use in the
if future litigation argues this point. US, and while some argue that it is fair use unequivocally,86
The other main exception applicable to the training others are less sure,87 and see both scenarios as problem-
of a machine learning model is the existence of a lim- atic.88 We may have to wait for future litigation regarding
itation that deals specifically with the collection of the machine learning specifically to get a final answer on this.
data itself. The earliest source of an exception for train- At the time of writing, there are several ongoing cases that
ing an AI can be found in the United States in the shape could eventually lead to a more complete answer.89
of Google Books.78 The case originated in 2004, when In other jurisdictions there are applicable exceptions
Google announced a new service called Google Print for text and data mining (TDM), and a survey published
(later renamed Google Book Search). Google entered into in the journal Science lists several countries that contain
an agreement with several libraries in the US and the UK an exception of some form.90 One of the first countries to
which allowed them to digitise out-of-print books and include a TDM exception in its law was the UK, as a result
make them available to anyone searching for that title. of a specific recommendation made by the Hargreaves
dataset. There are dozens of research projects that col- nothing; improve the licensing environment to include
lect music, data or images, and all those collections TDM; expand the TDM exception to include commercial
would fall under this exception.97 If the exception uses; expand the exception for all purposes, with an opt-
applies to the gathering phase, it also makes sense that out for authors; or expand the exception for all purposes,
it covers the training phase too, as these technical stages not allowing for an opt-out.
are not contemplated in the legislation. What matters is The responses were varied, but in the end the UK
that this operation could involve the copying of data to Intellectual Property Office (UKIPO) decided to opt for
generate a model. the last option, which would make it the most open TDM
There are a couple of outstanding questions with exception in the world.104 One of the objectives for the
regards to the use of this exception. Firstly, there is no adoption of an expanded exception was to put the UK in
standard definition of what research is: many tech com- a relative competitive advantage when it comes to arti-
panies have large research bodies, but it would be fair to ficial intelligence research. The UK government saw the
assume that these would be excluded, as Sec. 29A spe- adoption of an AI-friendly policy as helping to generate
cifically refers to ‘non-commercial research’. Interestingly, income and attract investment, making ‘the most of the
there has never been a judicial definition of what consti- greater flexibilities following Brexit’.105
also because as things stand there is a great uncertainty in may be already fair or fall under existing exceptions such
an area that is starting to generate litigation. It is disap- as temporary copies,118 there is still a need for an excep-
pointing that the push towards reform has been stopped, tion because some uses could be infringing, and therefore
at least for now. the EU’s ‘competitive position as a research area will suffer,
One of the reasons for the UK’s regulatory action in unless steps are taken to address the legal uncertainty con-
this area has been precisely to respond to developments cerning text and data mining.’119 The commercial exception
in Europe.112 The EU adopted its own TDM exceptions is being implemented because several research institutions
in 2019 as part of the Digital Single Market (DSM) have commercial arms, as well as public-private partner-
Directive.113 The DSM Directive deals with a wide range ships and collaborations with the private sector, which
of digital copyright issues, and it implements two different would benefit from a wider exception that allows for such
exceptions with regards to data mining that are directly uses.120 The opt-out option is given to rightsholders to pro-
relevant to AI inputs. vide balance between their interests and those of the user,
In Art. 3, the DSM Directive sets out a new exception as it is clear that the Directive is attempting to pass the
for copyright for ‘reproductions and extractions made exceptions in compliance with the three-step test.121
by research organisations and cultural heritage institu- There are two interesting and relevant aspects covered
to chart any changes to these AI-friendly policies, because a conversion of a work into something that is different
there could be a backlash to allowing the indiscriminate to the original work, something we will deal with while
training of AI using copyright works. The existing excep- covering the third requirement.
tions were drafted with very specific types of data mining When works are different, we may be faced with
in place, with the fight against disease and the develop- an adaptation (in other jurisdictions this is called
ment of new medicines being cited repeatedly to justify a derivative work).135 In the UK, an adaptation has
these exceptions. Now, though, policymakers will have a specific meaning, namely the exclusive right of the
to contend with angry rightsholders that see their works author to allow for the translation of a literary work,
used in machine learning without equitable remunera- or to make a non-dramatic work into a dramatic one
tion, and it will be interesting to see how this possible (and vice versa), or to turn the literary work into a
conflict plays out. work with images,136 amongst other actions.137 For the
sake of brevity, we will treat both reproductions and
IV. Copyright infringement in outputs adaptations interchangeably here, although a repro-
duction could actually be an adaptation under some
While the input picture is complicated, the output ques-
circumstances.
tion may be slightly easier to analyse. It is tempting to stop
a study144 that looked at whether a model could recite 350,000 of the most replicated samples in the dataset.154
the next line of a book and found that popular books For each image, they generated 500 samples, producing a
that were probably prevalent in the input data could dis- total of 175 million results. The replication rate was low,
play such memorisation. Another recent study found that however, producing a total of 50 memorised images.155
language models could be tricked into spewing random So, it would be fair to say that there is certainly some
memorised text.145 form of memorisation across various models, but the
There have also been examples of memorisation of rate tends to be low. It is also important to point out that
inputs in a couple of studies conducted with image AI some of these studies are setting out to find replication
tools. Specifically with regards to machine learning, any as an attack vector to generate measures to reduce the
possible reproduction of images found in the training data replication. While this does not invalidate the results, it is
can happen due to what is known as overfitting.146 This is relevant to point out that memorisation, while possible, is
understood as the memorisation of individual items in the relatively rare and can potentially be ameliorated.
training data, particularly those which are repeated often. It is also relevant to point out that memorisation as
With regards to images, a study by Somepalli et al147 described in these papers may not be the same as a repro-
found some memorisation in image datasets, and was duction; after all, the term ‘memorisation’ is not legal,
at least a reverse search did not produce any hits. So Williams,172 where the claimant had created a textile
a model ‘learns’ that some images have watermarks flower design and brought an infringement suit against
and will place it sometimes, if prompted to do so, but the defendant alleging that copying had taken place. The
the output itself does not come from Getty Images: it original flower design had impressionistic red lines and
is a generative creation built by the model taking data coloured flowers scattered across, while the alleged copy
points from latent space, which it then adds to the logo also had lines and flowers, though in a different arrange-
upon request. ment. A causal connection had been established, so the
Another potential problem in establishing causal con- question rested on how substantial the copying was. The
nection is the existence of large amounts of material in trial judge found infringement, the appeal made it to the
the training data that resembles a work but does not orig- House of Lords, where it was held that there was indeed
inate from the owner of that work. This is the case of substantial infringement. What is relevant to the present
so-called fan fiction, or fan art.168 One of the challenges analysis is that when considering whether copying had
for living creators, but also for others whose work may been substantial, the analysis should be qualitative and
still be under copyright such as Picasso or Jean-Michel not quantitative, so if a small part of a work had been
Basquiat, is that an output may be the result of training copied, but that part was very important to the whole,
that ‘The Unknowable’ is actually a work of pastiche, World Wide Web, the days of filesharing and peer-to-
becoming one of the first applications of this exception peer, or the birth of user-generated content206 all came
in European courts. The court usefully defines pastiche with a few copyright infringement cases that sometimes
as requiring ‘an evaluative reference to an original […]. defined the technology. Sometimes, those changes were
In contrast to illegal plagiarism, the older work must be unforeseen: Napster lost its case, but filesharing would
used in such a way that it appears in a modified form. To continue unabated until legal downloads and music
do this, it is sufficient to add other elements to the work streaming became the norm. The early web infringe-
or to integrate the work into a new design,’204 ment cases led to changes in the law so that intermedi-
The court usefully gave examples of possible pastiche aries could operate without getting sued all the time for
uses, by stating that ‘the pastiche in particular allows cer- the actions of their users.
tain user-generated content (UGC) to be legally permitted We are certainly living through a technological revolu-
[…]. Quoting, imitating and learning cultural techniques tion in the shape of generative AI, and we do not know
are defining elements of intertextuality and contemporary what shape it will take, but the lawsuits have already
cultural creation and communication in the ‘Social Web’. started flying in. We can expect a few years of strife, and
In this context, practices such as remixes, memes, GIFs, perhaps at some point the law will become settled, either
206 For example, United States v LaMacchia 871 F.Supp. 535; MGM
Studios v Grokster 545 U.S. 913; and A&M Records, Inc. v Napster, Inc.
239 F.3d 1004 (9th. Cir., 2001).
204 LG Berlin 2 November 2021 at 40. 207 <https://haveibeentrained.com/> accessed 25 November 2023.
205 ibid at 39. 208 Vyas (n 170).