An IP Attorney’s Reading of the LLaMA and ChatGPT Class Action Lawsuits

Matthew Butterick and the Joseph Saveri Law Firm are continuing to make the rounds amongst generative AI companies, following up on their lawsuits related to Copilot, Codex, Stable Diffusion, and Midjourney with two more class actions related to ChatGPT and LLaMA, respectively. Notably, the lawsuit related to LLaMA actually predates Meta’s release of LLaMA under a commercial license. The comedian Sarah Silverman is one of three named plaintiffs in both of these new cases, alleging various claims related to her book, The Bedwetter. These cases make two interesting new assertions about generative AI that do not appear in any of the prior cases.

New Theories of Liability

The first new assertion is that the models themselves are derivative works of the training data. The Copilot and Codex case doesn’t include any claims related to copyright infringement. The Stable Diffusion and Midjourney case does include copyright infringement claims, but the allegations under Count I (Direct Copyright Infringement) don’t entirely make it clear whether the illegally prepared “derivative works” refer to the model output or the model itself. Section VII (Class Allegations) seems to indicate that the direct copyright infringement is tied to downloading and storing the copyrighted works and then training the model with those works; the vicarious copyright infringement is tied to the output, which was used by third parties to create fakes of original artists’ work. While the complaint mentions that the “AI Image Products contain copies of every image in the Training Images,” the impression from the complaint is that it’s the copying of the images for the purposes of training that’s at issue, and perhaps that a copy still somehow resides “in” the model. But copying is a copyright monopoly right separate and distinct from the right to prepare derivative works, and this is the first time this team of attorneys has explicitly claimed that the model itself is a derivative work of the training data.

The second new assertion is that “every output of the […] language models is an infringing derivative work” with respect to each of the copyrighted works of the plaintiffs. In other words, the assertion is that no matter what the model outputs, it is necessarily a derivative work of Silverman’s book. In the complaint related to Stable Diffusion and Midjourney, the Factual Allegations section does state that “the resulting image is necessarily a derivative work,” but it doesn’t say whose work it is derivative of – is it a derivative of one copyright holder’s work or of all of the copyright holders’ works at once? Further, the actual section describing the nature of the copyright infringement claim (Count I Direct Copyright Infringement) doesn’t quite go so far as to say that every output of the model necessarily infringes the copyrighted work of a single copyright holder. The complaint puts it more obliquely, arguing that the outputs “are derived exclusively from the Training Images…” It’s not clear that this usage of the word “derived” is meant to mean “derivative work” under copyright law, and again, it’s not clear if they mean that every output is derivative of some works or of all works. The same section merely accuses the defendants of having “prepared Derivative Works based upon one or more of the Works.” From my perspective, this is a new and different theory of copyright infringement now being put forth by this team of attorneys.

The Implications of the New Theories of Liability

Both of these new assertions have interesting implications for the plaintiffs’ cases. There is a scenario, for example, where the courts decide that the copying necessary to train a model is either fair use or outside the ambit of copyright law. That sort of incidental copying is fairly universal in order to allow both individuals and various bots to “read” the Internet, after all, and merely reading or viewing a copyrighted work is not a right protected by copyright law. A court may further decide that it won’t hold a model creator/distributor responsible for a model’s output because the model has substantial non-infringing uses, and will instead hold individual users accountable. However, if the model itself is deemed to be a derivative work of the training data, and if fair use and other defenses do not apply, the defendants would still be liable for making the derivative work and distributing it (to the extent there is distribution) even if the courts rule as described above with respect to model training and model output.
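To make the incidental-copying point concrete, here is a minimal sketch in Python (the URL is just an illustrative placeholder, not anything from the complaints) of why even the simplest act of “reading” a web page entails making a copy:

from urllib.request import urlopen

# Any HTTP client (a browser, a search-engine crawler, or a
# training-data scraper) must copy a page's bytes into local memory
# before anyone or anything can "read" it.
with urlopen("https://example.com") as response:
    page_bytes = response.read()  # a literal copy of the page now exists in RAM

print(len(page_bytes), "bytes copied just to read the page")

This kind of copying isn’t unique to model training; it’s simply how the transfer of information over the Internet works at all, which is why courts may be reluctant to treat it as actionable.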

This puts into play a third prong of attack for the plaintiffs that didn’t exist before. Strategically, I think this was a good move for the plaintiffs’ case. I happen to agree that the model probably does constitute a derivative work of at least some of the training data, and the logic of “the model can summarize the plaintiffs’ works, therefore it must have copied and stored them” is simple and appealing to a non-technical audience in a way that the refutations of this statement will not be. However, I also think there is a strong argument that the fair use defense applies and/or that certain models can be big and complex enough that even though they might contain some training data, its use is de minimis. But there’s no predicting how courts will weigh such a defense, and this allegation gives the plaintiffs a third roll of the dice.

The allegation that all output is necessarily a derivative work of any given piece of training data is more of a double-edged sword for the plaintiffs. Technically speaking, any time the model creates output, it is doing so on the basis of the model in its entirety (a model’s decision not to respond to input in a certain way is as much informed by the training data as its affirmative decision to respond to input in a certain way). All output is, in a sense, a reflection of everything the model has gleaned from the training data. But this is “derivation” in the colloquial sense of the word, not in the sense designated under copyright law. Under copyright law, a “derivative work” is one that still retains copyrightable elements of the original work.
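As a rough illustration of that technical point, here is a minimal, hypothetical sketch of the final projection step of a language model (this is not any actual model’s code; the dimensions and variable names are invented for illustration):

import numpy as np

# Hypothetical sketch: the next token's probability distribution is
# computed from the model's full learned projection matrix, so every
# parameter (shaped by all of the training data) influences every
# output, including the tokens the model does NOT emit.
rng = np.random.default_rng(0)
vocab_size, hidden_dim = 1000, 64
W_out = rng.normal(size=(vocab_size, hidden_dim))  # stands in for learned weights

def next_token_probs(hidden_state):
    logits = W_out @ hidden_state          # every row of W_out participates
    exps = np.exp(logits - logits.max())   # numerically stable softmax
    return exps / exps.sum()

probs = next_token_probs(rng.normal(size=hidden_dim))
# A token's low probability (the model "declining" to say it) depends on
# the weights just as much as the highest-probability token does.
print(probs.argmax(), float(probs.max()))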

Clearly, ChatGPT is more than capable of producing output that no reasonable person could in any way connect with Silverman’s book, so this seems like a strong overreach. Such an interpretation would certainly benefit the plaintiffs, since the complaints don’t actually allege any instances of output that infringes the plaintiffs’ works other than output summarizing those works (and, contrary to the plaintiffs’ assertions in these complaints, mere summarization does not constitute copyright infringement). But I think this is a bridge too far, and an already confused court would not look kindly upon such an incendiary and misleading claim. This feels like a placeholder until (if?) the plaintiffs actually get the models to reproduce a real derivative work of one of their works – and so far, they don’t seem to have succeeded at that.

The inability to produce any damning output probably makes these the weakest of all the generative AI cases this group has filed so far, especially with respect to ChatGPT: in the absence of derivative output and in the absence of any physical distribution of ChatGPT itself (the plaintiffs allege none), there seems to be very little to hang the DMCA claims on. The plaintiffs would basically have to argue under 17 U.S. Code § 1202(b)(1):

(b) Removal or Alteration of Copyright Management Information.—No person shall, without the authority of the copyright owner or the law—
(1) intentionally remove or alter any copyright management information,
(2) distribute or import for distribution copyright management information knowing that the copyright management information has been removed or altered without authority of the copyright owner or the law, or
(3) distribute, import for distribution, or publicly perform works, copies of works, or phonorecords, knowing that copyright management information has been removed or altered without authority of the copyright owner or the law,
knowing, or, with respect to civil remedies under section 1203, having reasonable grounds to know, that it will induce, enable, facilitate, or conceal an infringement of any right under this title.

In essence, the argument would be that the mere creation of generative AI models (even in the absence of distribution) is prohibited by the DMCA because every act of creating a model removes copyright management information (CMI), and that the information was stripped specifically to “induce, enable, facilitate, or conceal an infringement.” That seems like an unpersuasive argument on many fronts:

  • the models don’t always strip CMI;
  • the models aren’t necessarily storing enough of the training data for it to retain copyright protection anyway (and some training data is ignored by the model entirely or “forgotten” later);
  • the primary purpose of model creation isn’t the concealment of an infringement;
  • the stripping of CMI is more a byproduct of model creation than something done by the models “by design”;
  • it’s not clear that any infringement is happening here at all; and
  • it’s definitely not clear that the defendants actually believe there is any infringement here as a matter of law.

More generally, it’s hard to argue that a law passed in 1998 specifically to streamline innovation and generally to “get with the times” was intended to wholesale ban an incredibly exciting technology that wouldn’t exist for another 20+ years.

Other Background

The Claims

The claims in these cases are basically the same as the claims related to Stable Diffusion and Midjourney minus the claims related to rights of publicity and breach of contract (which was specific to DeviantArt’s alleged violations of their own Terms of Service and Privacy Policy):

  • Direct copyright infringement related to copying, making derivative works, publicly displaying copies, and distributing copies of copyrighted works and derivatives thereof
  • Vicarious copyright infringement arising from the allegation that “every output of the […] language models is an infringing derivative work”
  • Removal of copyright management information under the DMCA
  • Unfair competition based on the DMCA violation
  • Unjust enrichment, vaguely for all of the above
  • Negligence, extremely vaguely for all of the above

There is also an additional claim against Meta for false assertion of copyright related to the fact that when the LLaMA model was leaked, Meta sent GitHub a takedown notice in which they asserted sole copyright ownership in the model.

Of note is that there are no claims here related to personally identifiable data, which did appear in the Copilot and Codex-related complaint. Given that I’ve personally read an entirely fictitious biography of myself generated by ChatGPT, there would probably be a lot for a court to chew on with respect to those sorts of claims, but this isn’t the right formulation of the class to bring them.

The Class

The class in both cases is basically everyone with a copyright in any of the training data. It’s a strangely broad choice given that the named plaintiffs are specifically book authors, and in each case the complaint alleges that the plaintiffs’ books were part of a dataset originating from “illegal shadow libraries” that made copyrighted books available in bulk via torrent systems. Why not limit the class to other book authors whose works were part of the same dataset? By naming the class so broadly, the plaintiffs make it harder to prove typicality, adequacy, or commonality and predominance, because the LLMs in question were also trained on the broader Internet: on many different types of works under many different licenses, on works in the public domain, and on works whose copyright was never registered. Does the author of a book actually have a lot in common with a Redditor, or with someone publishing data on local erosion patterns under a public domain dedication? These copyright holders look to have different interests, to be differently situated, and to present different questions of law and fact.

Like the classes in all the other generative AI cases brought by this group of attorneys, the classes here don’t condition participation on injury. Just because a work was part of the training data doesn’t mean the work 1) is actually part of the model, 2) is part of the model in sufficient detail to still be subject to copyright, 3) is actually outputted by the model (or a derivative of it is), or 4) is outputted by the model in sufficient detail to still be subject to copyright.

Conclusion

Although the ChatGPT and LLaMA-related cases make some new, rather startling allegations, these cases feel weaker than the other ones the group has filed so far. The inability to prompt either model into actually outputting a derivative work of the training data will require the plaintiffs, when they go to trial, to focus on the actual act of training the model and on the details of what a model is and how it works, making the claims and allegations here somewhat academic and theoretical (if not entirely impenetrable) from the viewpoint of a potential jury. There’s no smoking gun. There’s no clear narrative about how Sarah Silverman is losing out on book sales because ChatGPT is stealing her jokes, etc. At bottom, the plaintiffs will have to convince the jury that even though these LLMs aren’t actually stealing anyone’s work, and to the contrary seem to be providing helpful fact-based information (the book summaries), it should nevertheless handsomely reward the authors for losing out on an entirely new revenue stream that basically only exists thanks to a dense and obscure tangle of laws and technical details. To me, the LLMs seem like the AIs most vulnerable to legal attack, yet these cases as currently presented strike me as the least worrisome ones.