An IP Attorney’s Reading of the Stable Diffusion Class Action Lawsuit

The image above was created via Stable Diffusion with the prompt “lawyers in suits fighting robots with lasers in a futuristic, superhero style.”

Special thanks to Yisong Yue, professor of machine learning at Caltech, for providing me with valuable technical feedback on this post!

Looks like Matthew Butterick and the Joseph Saveri Law Firm are going to have a busy year! The same folks who filed the class action against GitHub and Microsoft related to Copilot and Codex a couple of months ago have filed another one against Stability AI, DeviantArt, and Midjourney related to Stable Diffusion. The crux of the complaint is around Stability AI and their Stable Diffusion product, but Midjourney and DeviantArt enter the picture because they have generative AI products that incorporate Stable Diffusion. DeviantArt also has some claims lobbed directly at them via a subclass because they allowed the nonprofit Large-Scale Artificial Intelligence Open Network (LAION) to incorporate the artwork submitted to their service into a large public dataset of 400 million images and captions. According to the complaint, at the time this was the largest freely available dataset of its kind and Stable Diffusion was trained on it. Like the Copilot case, this one includes claims for:

  • Violation of the Digital Millennium Copyright Act’s (DMCA) sections 1201-1205 related to stripping images of copyright-related information 
  • Unfair competition, in this case stemming from copyright law and DMCA violations 
  • Breach of contract, in this case as related to DeviantArt’s alleged violation of personal data-related provisions in their Terms of Service and Privacy Policy

Unlike the Copilot case, this one includes additional claims for:

  • Direct copyright infringement for training Stable Diffusion on the class’s images, for including those images in the Stable Diffusion model, and for reproducing and distributing derivative works of those images
  • Vicarious copyright infringement for allowing users to create and sell fake works of well-known artists (essentially impersonating the artists)
  • Violation of the statutory and common law rights of publicity related to Stable Diffusion’s ability to request art in the style of a specific artist

The Class

Like the Copilot case, this one has similar potential flaws in defining the class. The definition of the injunctive relief class and the damages class in the complaint doesn’t actually condition participation on injury. The class is defined as all persons or entities with a copyright interest in any work that was used to train Stable Diffusion. But, simply having work that is part of the training set doesn’t mean the work 1) is actually part of the model, 2) is actually outputted by the model (or a derivative of it is), or 3) is outputted by the model in sufficient detail to still be subject to copyright. As in the Copilot case, the complaint explicitly states that it’s difficult or impossible for the plaintiffs to identify their work in Stable Diffusion’s output.

Since this case involves copyright infringement claims, it’s also surprising that the class is not limited to people with registered copyrights but extends to anyone “with a copyright interest,” since people who don’t have registered copyrights cannot enforce their copyrights in court. Additionally, litigator Noorjahan Rahman has pointed out that some courts extend the registration requirement to DMCA enforcement as well, further weakening the plaintiffs’ chances of succeeding in defining the class this way and/or bringing either copyright or DMCA claims as a class action.

It’s also worth pointing out that only a fraction of the dataset allegedly used by Stable Diffusion originated from DeviantArt. With respect to images that came from DeviantArt, the question of copyright infringement must revolve around the license granted to DeviantArt in its Terms of Service. But, the rest of the dataset came from other sources. Some of those works are likely just copyrighted and not subject to any other license, some are likely dedicated to the public domain, and the rest are under various licenses, including various Creative Commons licenses, commercial licenses, other Terms of Use, etc. Authors of such works of course still retain a copyright interest in them, but the question of whether or not use of the works constituted copyright infringement is going to depend on the underlying license. Since there is likely a cornucopia of underlying licenses in that dataset, it’s difficult to argue that all members of the class actually share the same questions of law and fact.

Remember that in the Copilot case, the relevant class was tied to the 13 open source licenses available in GitHub’s drop-down menu for self-selection. That was a calculated choice since of course you can find any license under the sun on GitHub, not just the 13 GitHub suggests. It’s not difficult to argue that questions of law and fact are likely to vary across 13 different license agreements. However, those 13 licenses at least all had the common requirement that the copyright holder be attributed as a condition of using the license, which is likely why the plaintiffs settled on just those 13 (those 13 probably also cover 80% or more of what’s on GitHub, so that helps). That common thread might be enough in that case, particularly since it doesn’t involve a copyright claim. Here, however, it’s impossible to draw any common thread because the number of underlying licenses is likely in the hundreds or thousands, maybe more. Every single newspaper, magazine, archive, image sharing site, and museum has its own Terms of Use with fairly variable language on how images on their sites can be used, and of course many sites allow image contributors to provide their own license for the work, whether that’s something standard like a Creative Commons license, or something totally custom.

The Copyright Claims

The complaint includes a section attempting to explain how Stable Diffusion works. It argues that the Stable Diffusion model is basically just a giant archive of compressed images (similar to MP3 compression, for example) and that when Stable Diffusion is given a text prompt, it “interpolates” or combines the images in its archives to provide its output. The complaint literally calls Stable Diffusion nothing more than a “collage tool” throughout the document. It suggests that the output is just a mash-up of the training data.

This is a fascinating take because certainly the exact way the interpolation is done and how Stable Diffusion responds to the text prompts seem to be parameters that can range widely. The text prompt interpretation could potentially be very responsive to natural human speech with all its nuances, or it could be awful. And interpolation, especially of 3 or 12 or 100 or 1000 images, can be done in an unlimited number of combinations, some better, some worse. But, Stable Diffusion doesn’t interpolate a subset of training images to create its output: it interpolates ALL the images in the training data to create its output. Carefully calibrating the interpolation parameters to lead to useful, realistic, aesthetic, non-disturbing images is itself an art form, as the Internet abounds with both excellent and horrible generative AI output images. It’s relatively easy to see improvements in Stable Diffusion’s output across their releases, and those improvements are the result of tweaking the model, not adding more images to the training data. Even the various companies using Stable Diffusion as a library and training it with the LAION dataset appear to be producing results with markedly different qualities. So, the claim that the output is nothing more than the input is deeply specious.[1]
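
To make the technical point concrete, here is a minimal, purely illustrative sketch of what generating an image with Stable Diffusion actually involves, using the open source diffusers library. This is a simplified example of my own, not code from the complaint or from any of the defendants, and the model name and parameters below are just common defaults. The relevant point is that generation starts from random noise and is iteratively denoised by fixed, already-trained weights under the guidance of the text prompt, rather than by retrieving or pasting together stored training images.

import torch
from diffusers import StableDiffusionPipeline

# Load the trained weights -- a checkpoint a few gigabytes in size, far smaller
# than the training images themselves, which is hard to square with the idea
# that the model is simply a compressed archive of those images.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Generation begins from pure Gaussian noise and is denoised step by step,
# steered by the prompt embedding. No training image is looked up here.
image = pipe(
    "lawyers in suits fighting robots with lasers, futuristic superhero style",
    num_inference_steps=50,  # number of denoising iterations
    guidance_scale=7.5,      # how strongly the prompt steers the denoiser
    generator=torch.Generator("cuda").manual_seed(42),  # new seed, new image
).images[0]
image.save("output.png")

Same weights, different prompt or seed, different image: that variability is exactly the space in which the model’s own parameters, and the user’s prompt, do their work.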

I think this is a little bit like arguing that there is no difference between, say, listening to a randomly created mashup of  the Beatles’ White Album with Jay-Z’s Black Album (creating abominable noise), and listening to Danger Mouse’s powerful and evocative Grey Album, which is a creative mashup of the two. Even if the output is a “mere” mashup, the exact manner of mashup still matters a lot and makes the difference between something humans recognize with joy as art, and something humans view as nothing more than noise. Danger Mouse may have made a mashup, but he also made a significant contribution of his own artistry to the album, creating an entirely different tone, sound, style, and message from the original artists, worth listening to on its own merits and not simply to catch pieces of the original works.

This question of how much of the output should be credited to the training data versus to the model’s processing of the training data should be at the heart of the debate over whether Stable Diffusion’s use of the various images as training data is truly transformative and thus eligible for copyright’s fair use defense (or perhaps even the question of whether the output is eligible for copyright protection at all). I think it’s going to be easy for the defense to present alternative analogies and narratives to those presented by the plaintiffs here. The output represents the model’s understanding of what is useful, aesthetic, pleasing, etc., and that, together with the data filtering and cleaning that general image generating AI companies do,[2] is what the companies consider most valuable, not the training data.[3] Except in corner cases where the output is very tightly constrained (like “show me dogs in the style of Picasso”), it may well be argued that the “use” of any one image from the training data is de minimis and/or not substantial enough to call the output a derivative work of any one image. Of course, none of this even begins to touch on the user’s contribution to the output via the specificity of the text prompt.[4] There is some sense in which it’s true that there is no Stable Diffusion without the training data, but there is equally some sense in which there is no Stable Diffusion without users pouring their own creative energy into its prompts.

Conclusion

Stability AI has already announced that it is removing users’ ability to request images in a particular artist’s style and, further, that future releases of Stable Diffusion will comply with any artist’s requests to remove their images from the training dataset. With that removal, the most outrage-inducing and troublesome output examples disappear from this case, leaving a much more complex and muddled set of facts for the jury to wade through. The publicity claims and vicarious copyright infringement claims, at least as stated in this complaint, also fall away.[5] It’s not clear if the lawsuit that remains is one the plaintiffs still want to litigate, particularly since the class is likely to be narrowed as well.

__________

  1. This doesn’t cover the fact that Stability AI didn’t use the dataset in raw form. They have said they removed illegal content and have otherwise filtered it. Depending on the extent of this manipulation, they might be eligible for a thin copyright on the compilation that resulted, which would also erode the argument that the output is 100% work copyrighted by the plaintiffs.
  2. Note that this is highly context-specific. Images for general-scope image generating AIs are widely available and a giant subset of them is in the public domain. In other contexts, the data can be extremely valuable if it’s difficult to collect, requires human annotation or interpretation, etc. I think it’s really worthwhile to distinguish generative AIs that essentially draw on all of humanity’s knowledge in a certain domain (which we could also call our culture) from generative AIs that draw on more narrow sources of data which cannot be said to belong to all of us in the same way.
  3. Despite their self-proclaimed penchant for “open,” Stability AI didn’t hesitate to insist on a take-down when one of their partners made the model public. 
  4. The Copyright Office currently does not allow works created with the assistance of generative AI to receive copyright registration. However, top IP attorney Van Lindberg is working on a case to reverse this position and a considerable number of IP experts believe that he may ultimately succeed. The idea that no part of a work can be copyrighted just because some of the work was created with the help of generative AI tools doesn’t seem like it will stand the test of time, especially as such tools make their way into the standard set of tools artists are already using, like Photoshop. Such a victory would cast further doubt on the plaintiffs’ broader claims that every Stable Diffusion output is merely the sum of its inputs, and therefore the images used here were merely “stolen” (i.e. copyright infringement) rather than “transformed” (i.e. fair use).
  5. There is of course nothing prohibiting the plaintiffs from suing over various claims arising from prior releases of Stable Diffusion. But, because this is a class action and the lawyers are likely getting paid a portion of the damages awarded to the plaintiffs, the damages for the prior claims alone may not justify the legal expense of proceeding with the case.

Is Open Source Attribution Dead?

Virtually every common open source license such as the BSD, MIT, Apache, and GPL requires, at a minimum, that the copyright owners of the code be attributed when their code is redistributed. Licenses that merely require attribution and passing on the license are termed “permissive” licenses and licenses that additionally require providing source code are termed “copyleft” licenses. A few licenses out there are very specific about the attribution and require it in prominent places like a splash screen or even in advertising, but none of the most common licenses do so and most open source foundations as well as corporate entities have policies against using those licenses. They have largely been abandoned. Today, there is industry-wide consensus that open source attribution is sufficient if provided in documentation accompanying a product. 

The nature of that documentation has certainly changed over time, though. With the advent of package managers and the explosion of libraries that get pulled into most projects, significant products now have open source attribution files stretching into tens of thousands of pages (Here’s just a part of VMware’s vSphere product. It’s 15,385 pages and I crashed my MacBook Air twice trying to output it as PDF so I could count the pages for you!). Because licenses don’t get too specific about the readability of these attributions and the attributions are often provided in .txt or other fairly universal formats so that they are accessible by all, the documents aren’t easy to navigate. And because the attribution requirements in common OSS licenses are quite light, the attribution files typically don’t include descriptions of particular OSS packages, how they’re used in the product, what part of the product they’re used for, or how integral they are to the product. Commonly, packages are listed in alphabetical order, but a tiny JavaScript function will get just as much prominence as an entire framework. Attribution files include copyright information, but they don’t include things like project/company logos or contact information.
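
To make the mechanics concrete, here is a hypothetical, highly simplified sketch of how such an attribution file might get assembled for a Python project. The file name NOTICES.txt and the reliance on installed-package metadata are illustrative assumptions on my part; real compliance tooling does much more (copying full license and notice texts, walking transitive dependencies, handling subcomponents), but even this toy version shows why the result is one long, undifferentiated alphabetical list.

# Hypothetical sketch: build a bare-bones attribution list from the metadata of
# whatever Python packages are installed in the current environment.
from importlib import metadata

entries = []
for dist in metadata.distributions():
    name = dist.metadata["Name"] or "UNKNOWN"
    license_ = dist.metadata.get("License") or "UNKNOWN"
    entries.append(f"{name} {dist.version} -- {license_}")
    # A real tool would also copy each package's LICENSE/NOTICE text here;
    # the metadata "License" field alone is frequently missing or inconsistent.

with open("NOTICES.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sorted(entries, key=str.lower)))

Run against any non-trivial environment, the output is exactly the kind of flat, context-free listing described above: every package gets one line, regardless of how large or how central it is.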

Long gone are the days when a product only used a handful of open source packages and attribution could easily fit into a product’s About screen which could reasonably be expected to be seen by most product users. The lists of packages are getting longer because of transitive dependencies, and additionally, individual OSS packages are now more likely to include other OSS subcomponents, with their own license and copyright information. The bloat occurs on multiple axes.

Compliance is now more challenging because not only does it involve digging up copyright and license information for the overall project, but it also involves digging up this info for the project’s subcomponents. Some projects do a good job of providing this information in one document, but most don’t. Most projects have secondary licenses scattered throughout their source code that are not acknowledged in a centralized location like a LICENSE, NOTICE or COPYING file. Projects from the Apache Foundation, for example, may note that certain subcomponents are under a certain license, but they won’t reprint that copyright and license information in their NOTICE file – they satisfy the requirement of passing on that information by passing on the subcomponent’s source code when they make their own source code available. Finding this information either requires a tedious manual process or a complex scanning tool that often still requires manual checks and corrections, and neither process is 100% accurate.* For this reason, many companies choose to only provide a project’s main license and decide to just risk it when it comes to secondary licenses.
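
As a rough illustration of why that digging is hard, here is a hypothetical sketch of the kind of naive scan a compliance engineer might start with: walking a vendored source tree and looking for SPDX identifiers and copyright lines. The directory name is an assumption, and real scanners (ScanCode, for example) recognize vastly more license texts and notations than a couple of regular expressions ever could.

# Hypothetical sketch: naive scan of a source tree for SPDX identifiers and
# copyright notices. Anything phrased in a way the patterns don't anticipate
# is silently missed.
import re
from pathlib import Path

SPDX = re.compile(r"SPDX-License-Identifier:\s*(\S+)")
COPYRIGHT = re.compile(r"Copyright\s+(?:\(c\)\s*)?\d{4}[^\n]*", re.IGNORECASE)

for path in Path("third_party/some_project").rglob("*"):
    if not path.is_file():
        continue
    try:
        text = path.read_text(errors="ignore")
    except OSError:
        continue
    spdx_ids = sorted(set(SPDX.findall(text)))
    notices = sorted(set(COPYRIGHT.findall(text)))
    if spdx_ids or notices:
        print(path, spdx_ids, f"{len(notices)} copyright line(s)")

And this only finds what is explicitly marked; secondary licenses that exist solely as prose in a README, or in a license file buried three directories deep, still have to be reviewed by a human.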

Downstream OSS consumers could likewise pass on an entire project’s source code in order to satisfy the attribution requirements of OSS licenses instead of putting together tedious attribution documents.  Some do, particularly in the container context, where it is much simpler and cleaner to just pass on all of a product’s OSS source code in an accompanying container rather than to try to locate this info and re-print it in one document. But it’s harder to bundle a lot of source code with a product in contexts where the size of the payload still matters (like on mobile devices) and in contexts where the source code is difficult to navigate and use (like on tiny screens built into refrigerators). Companies frequently provide source code on request or on websites separate from the products that use it in order to fulfill copyleft requirements around the provision of source code upon distribution. But, separating attributions from the product would technically violate a lot of licenses and most companies still try to avoid this.

However, the fact that perfect compliance can only really be accomplished by passing on the source code would seem to defeat the purpose of having permissive licenses in the first place. The license is no longer really permissive if it can only practicably be complied with by passing on the source code. Many developers choose permissive licenses for their work because they want it to be used as widely as possible and they specifically do not want to obligate users to additional source-code related conditions. A lot of them don’t think too hard about whether to pick a permissive license or whether to dedicate their work to the public domain because permissive licenses are very common and have been used for a long time. In contrast, public domain dedications were a tricky proposition before the CC0 1.0 Universal was released by Creative Commons in 2009, giving developers the right toolkit to properly make a public domain dedication and to safeguard that intention even in countries that do not recognize the concept of the public domain. If more developers understood that permissive licenses now function much more like copyleft licenses, it’s likely that many would opt for putting their work out under public domain rather than under a permissive license because the attribution is buried deep in documentation no one reads anyway and essentially just adds to the downstream user’s overhead.

Historically, permissive licenses haven’t seen much in the way of legal enforcement. Legal enforcement has really been focused on the GPL family of licenses, although some of those enforcement claims have obviously been tied to the fact that redistributors have failed to properly attribute the GPL code owners. In large part, that’s because people who chose permissive licenses in the first place were more concerned with spreading their work far and wide than they were with ensuring that downstream users kept the code “open” (because in that case they would have chosen a copyleft license instead). There has long been speculation that this might change and we might see some “attribution trolling,” wherein copyright holders start enforcing permissive licenses as well. That hasn’t happened yet, except perhaps in the context of a claim related to Copilot’s failure to attribute OSS owners when providing output to its users. 

On the one hand, some could see the attribution requirement as good and useful leverage as well as overhead that corporations should have to face, especially if they want to put a stick in Copilot’s wheel. But, I think others are ready to concede that that isn’t really what they intended or want for the industry, especially because this overhead tax is also legally required of individuals and non-profit OSS maintainers (no one is exempt from following third party OSS licenses). With a bigger push on the federal level for products to maintain a proper bill of materials for security purposes, we are already seeing more companies turn to upstream project maintainers and asking for better and more easily digestible information about code provenance. They, too, are likely to struggle with attributions.

The desire to get something in return for putting out quality code into the ether is understandable. Many developers care a lot about their own reputation and they contribute to open source at least in part to signal to other developers and potential employers that they have marketable skills. Of course larger projects run under the auspices of a non-profit foundation or even a corporation also want to burnish their brands. But, it’s hard to say anymore that traditional OSS attribution requirements are doing any of that for developers. Signal about great OSS projects is now coming from stars on GitHub, number of forks, tech blogs, Hacker News, etc. I’ve never heard of a developer checking out a new OSS project on the basis of merely finding its name among a sea of other OSS projects in someone’s attribution file.

There are much better ways for developers to build their reputations and brands. Merely including their contact info in their projects (perhaps in their license.txt or in header comments) may well be sufficient since few people have any interest or incentive in proactively stripping such information from the code they’re using. It’s important to note that downstream users want to be able to easily find new useful OSS; while putting together attribution files is expensive and time-consuming, they still want to know the origin of their software – they want to see what else the same developers have written, they want updates to the code, they want to know who to call if there are security issues. And they want to know who to hire, or in the case of other companies, who to buy. Certain companies have also emerged to track OSS usage globally and report on what’s commonly used and by whom, helping to give developers credit and pointing downstream users to useful projects. Package managers could also track this data automatically and make it public. In any case, it’s long past time for this problem to be solved via non-legal means.

* The process isn’t accurate for two reasons. The first is that locating all of a project’s license information isn’t easy. Some licensing information is in the header of a file, but occasionally info is hidden away in the middle of a file that’s the length of a book. The second is that people don’t all agree on exactly what needs to be reproduced. Do we have to reproduce the exact same copyright info and licensing info if the only difference between two such blocks is the copyright year? What if there’s an additional author? What if the text is substantially similar but uses slightly different wording? Plus, some projects put copyright info in every single file while others put it all in one place. Scanners and humans have a much harder time locating unique copyright notices and license information if each and every file in a project is marked rather than just the files with licensing information that differs from the main license of the project.

An Open Source Lawyer’s View on the Copilot Class Action Lawsuit

Seems like the other shoe finally dropped and a formal complaint has been filed with the Northern District of California regarding Copilot and Codex. The complaint is fascinating because the one thing it doesn’t allege is copyright infringement. The complaint explicitly anticipates a fair use defense on that front and attempts to sidestep that entire question principally by bringing claims under the Digital Millennium Copyright Act instead, centered on Section 1202, which forbids stripping copyrighted works of various copyright-related information. The complaint also includes other claims related to:

  • Breach of contract as related to the open source licenses in individual GitHub repos (again, not a copyright claim)
  • Tortious interference in a contractual relationship (by failing to give Copilot users proper licensing info that they could comply with)
  • Fraud (relating to GitHub’s alleged lies in their Terms of Service and Privacy Policy about how code on GitHub would not be used outside of GitHub)
  • Reverse passing off under the Lanham Act (for allegedly leading Copilot users to believe that output generated by Copilot belonged to Copilot)
  • Unjust enrichment (vaguely for all of the above)
  • Unfair competition (vaguely for all of the above)
  • Breach of contract as related to GitHub’s alleged violation of personal data-related provisions in their Terms of Service and Privacy Policy
  • Violation of the California Consumer Privacy Act (CCPA) as related to GitHub’s alleged violation of personal data-related provisions in their Terms of Service and Privacy Policy
  • Negligence – negligent handling of personal data
  • Civil conspiracy (vaguely for all of the above)

Assessment of the Claims

The lack of a copyright claim here is very interesting. The first thought that comes to mind is that most people who have code on GitHub don’t bother to formally register their copyrights with the Copyright Office, which means that under the Copyright Act, though they have a copyright, they don’t have a right to enforce their copyright in court. Because this is a class action lawsuit, at least with respect to a copyright infringement claim, the plaintiffs’ attorneys would have had a difficult time identifying plaintiffs with registered copyrights and the pool of plaintiffs in the class would be significantly diminished – likely by about 99%. There are other reasons to not want to litigate a fair use defense, though. Such litigation is extremely fact-intensive, for one. It’s worth noting that while a firm driven by the monetary incentives that come with class actions may not want to pursue a copyright infringement claim, that certainly doesn’t foreclose people with other motivations from bringing such a claim. Without the copyright claim, any holding in this lawsuit will certainly not be the final touchstone lawyers look to when assessing the legal risks around machine learning (ML).

The other piece that looks odd here is that the complaint seems to misread the GitHub Terms of Service (ToS). The ToS, like all well-drafted terms of service out there, specifically identifies “GitHub” to include all of its affiliates (like Microsoft), and users of GitHub grant GitHub the right to use their content to perform and improve the “Service.” Diligent product counsel will not be surprised to learn that “Service” is defined as any services provided by “GitHub,” i.e. including all of GitHub’s affiliates. While lay people might be surprised to know that posting code on GitHub actually allows a giant web of companies to use their code for known and unknown purposes, legally, the ToS is clear on this point. A more persuasive claim of fraud would have centered on GitHub’s marketing materials (if any) around GitHub’s use of code only “for GitHub.”

Nearly every claim in this complaint hinges on the idea that the only license users of GitHub granted to GitHub is the open source license under which they posted their code and there’s no mention anywhere of the license that GitHub users grant to GitHub in the ToS. Since a not insignificant number of GitHub repos don’t contain any licensing information at all, is the plaintiffs’ position that in the absence of an OSS license, there is neither a license in the ToS nor any implicit license for GitHub to host the code? That would be a strange position to take, particularly since GitHub has only recently started prompting users to add licensing information to their repos – it’s certainly never been a required field – and basically every commercial website out there takes a license to user content via their terms of service in more or less the same language that GitHub does.  It would be particularly strange to argue that a user could put any license provisions at all in their repos and GitHub would have to draw the entirety of their right to host and otherwise use the code from a license sight unseen. That rings a bit like the Facebook memes of yesteryear promising users that if they just copy and paste these magical sentences onto their timelines, then Facebook won’t be able to do something or other with their data or accounts. 

Nearly every claim also hinges on the idea that if the underlying code on GitHub is under, say, the MIT license, then Copilot’s output of any part of that code without attribution is also a violation of the MIT license. But, copyright doesn’t work that way. Sufficiently short lines of code don’t constitute any protectable expression and longer blocks of code may not be copyrightable if they are purely functional – i.e. there’s no other way to do this particular thing in this particular language. This is an issue for the plaintiffs on two fronts. First, the alleged DMCA violation here only applies to copyrighted (and copyrightable) works. So, if the code either in the Copilot model or in the Copilot output is not copyrightable, there’s no DMCA violation. Second, this affects the way they define (or should define) their class. 
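
For a concrete, hypothetical example of what this means in practice: a suggestion like the snippet below is purely functional and appears in essentially identical form in countless repositories, so it is very unlikely to embody protectable expression no matter whose repository Copilot learned it from.

# A short, purely functional snippet of the kind that is unlikely to be
# copyrightable: there are only so many idiomatic ways to write it.
def total_price(prices):
    return sum(prices)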

The definition of the class in the complaint doesn’t actually condition participation on injury. The class is limited to GitHub users with a copyright in any work posted on GitHub after January 1, 2015 under one of GitHub’s suggested licenses (from their drop-down menu). But there’s nothing in here narrowing that class to people whose work is 1) actually part of the model, 2) actually outputted by the model (or a derivative of it is), or 3) outputted by the model in sufficient detail to still be subject to copyright. In fact, in several parts of the complaint the plaintiffs explicitly state that it’s difficult or impossible for the plaintiffs to identify their work in Copilot’s output. GitHub is likely going to have a lot to say about a class of plaintiffs that can only circumstantially and perhaps speculatively allege any harm.

The claims related to personal data don’t identify what personal data they’re referring to. But, it seems like there’s a non-zero chance that the plaintiffs are both arguing that Copilot doesn’t display enough copyright management information per the DMCA and that what little it does show actually violates the CCPA. However, the CCPA does not prohibit data-related activities necessitated by federal law, which may protect the defendants if the scope of the personal data is copyright-related. If the scope of the personal data is broader, plaintiffs may have a good point in arguing that Copilot doesn’t seem to have a way for anyone to query whether or not Copilot holds any of their personally identifiable data, nor a way to delete it if requested. Given how the ML model works, a full fix for this problem may be incredibly difficult if not impossible.

Impact

Overall, it’s not clear what the plaintiffs (the actual class, not the lawyers) would actually gain from forcing Copilot to display licensing information for all of its copyrightable suggestions. Imagining a world where this is possible and easy, does any copyright owner feel better knowing that some commercial product has their name attached to it in a one-million page long attribution file? Attributions that are thousands of pages long are already common even without the use of Copilot on nearly every file. Of course, this sort of information isn’t actually easy to provide. In practice, for any given suggestion, there is a high likelihood that it comes from a number of different sources. The plaintiffs themselves describe Copilot as basing its suggestions on the most commonly seen approaches. Who gets the credit if thousands of people have written this particular function this particular way (even if we assume it is sufficiently detailed to be copyrightable)? Is crediting all of them useful or practical? Who decides whether a suggestion is actually novel or derivative of other code and what metrics are to be used to decide that on a scale of millions of suggestions a day? The law doesn’t provide clear answers to these questions; experts in the Copyright Office often mull these questions over for months for just one copyrighted work and even that decision often gets overturned in court. In practice, even if GitHub wanted to provide all the relevant licensing information for any given suggestion, doing so is likely impossible in most cases.

If GitHub is to be believed, Copilot only regurgitates exact code snippets from the training data 1% of the time. Some portion of that 1% is certainly non-copyrightable snippets. So the plaintiffs are essentially asking for attribution for less than 1% of Copilot suggestions. The complaint itself repeatedly anticipates that any winning claims will arise from that 1%. So sure, the affected copyright owners have rights, but this isn’t exactly “high impact litigation.” It looks a lot more like trolling if an otherwise incredibly useful and productivity-boosting technology is being stymied by people who want to receive payouts for a lack of meaningless attributions.

Most of the complaint is related to improper OSS attribution, but curiously only one line is devoted to the idea that the Copilot model itself is actually subject to perhaps some of the licenses in the underlying code and should actually be open sourced. Now that’s an actual substantive complaint and far from troll-y. If the goal of the complaint was to make a significant impact on the future of AI and ML, then this would really be the crux of the complaint because it would be an argument that ML models are copyrightable (this is a highly contentious matter), that ML models are derivative works of the training data (this is really fact-specific based on how the model actually works and perhaps also a big philosophical quagmire), and that ML model output is copyrightable (also highly contentious because the Copyright Office won’t register copyrights to non-humans today, per their interpretation of the Copyright Act). The practical effect would likely be that at least in the software space, the world would see at least one copyleft-licensed ML model (which might not benefit anyone in any way if the model itself is always hosted and never distributed and therefore the model owners are under no obligation to share its source code). 

But outside the software space, where open source licenses don’t proliferate and training data may not be subject to copyright at all (like training data that is purely data or which is in the public domain), this may create a precedent that ML/AI models should receive copyright protection, and the owners of one model could potentially block the development of a similar one, effectively walling off the knowledge such a model might yield on an entire domain from everyone but the very first people to create a model for that domain. Or worse, the recognition of copyright in the model leads to recognition of copyright in the output and now humans can be sued for violating copyrights related to AI-generated content, which can be generated on a massive scale in very little time with no human effort at all. Under the banner of “open,” this lawsuit and others like it actually help to pave the way for more recognition of proprietary rights in a broader category of works, not less.

Conclusion

Class action lawsuits often experience significant roadblocks based on how a class is defined before any of the substantive claims are even reached. Onlookers can expect a lot of back and forth on this subject before the courts get around to providing the industry with any useful guidance with respect to Copilot or ML models. In the meantime, more lawsuits are possible. The Software Freedom Conservancy has been mulling its own suit for many months now and the lack of a copyright claim in this one would be an enticing reason to bring a separate lawsuit with respect to Copilot. Attorneys bringing a class action probably won’t spend the time and resources to push the issue of whether or not the ML model needs to be open sourced if it used copyleft training data – certainly their complaint only mentions this in passing. But, it’s likely only a matter of time until an individual or small group of people presses that issue for ideological rather than monetary reasons. 

Co-Chairing PLI’s Annual “Open Source Software – From Compliance to Cooperation” Program

This September I co-chaired the Practising Law Institute’s annual “Open Source Software – From Compliance to Cooperation” program with Heather Meeker. The program is a day-long continuing legal education event with a variety of open source licensing and compliance experts covering both introductory and advanced topics as well as recent developments in OSS licensing. We had an all-star lineup this year, so it’s a terrific way for attorneys to pick up 6 CLE credits, including 1 elimination of bias credit.

As part of the program, Aaron Williamson of Williamson Legal (former counsel for the Software Freedom Law Center and general counsel to the Fintech Open Source Foundation) and I did a session titled “OSS in Transactions, Licensing and M&A” where we did a deep dive on contractual provisions related to open source software and provided some advice on where and how they should be implemented. Our slides are available for download here. Our presentation was loosely based on a white paper we co-authored titled “IoT and the Special Challenges of Open Source Software Licensing,” which will be published shortly in the ABA’s Landslide magazine. I’ll update this blog with a link once it’s online.

OSS in Procurement

I gave a presentation today during the Open Source Initiative’s “Practical Open Source Information” event on OSS in procurement. This presentation specifically focuses on what OSS-related provisions you may need to add when you are purchasing licenses to third party software, either from consultants/contractors or off-the-shelf. You can download my slides, including a number of sample provisions, here.