Seems like the other shoe finally dropped and a formal complaint has been filed with the Northern District of California regarding Copilot and Codex. The complaint is fascinating because the one thing it doesn’t allege is copyright infringement. The complaint explicitly anticipates a fair use defense on that front and attempts to sidestep that entire question principally by bringing claims under the Digital Millennium Copyright Act (DMCA) instead, centered on Section 1202, which forbids removing or altering copyright management information (things like author names, copyright notices, and license terms) from copyrighted works. The complaint also includes other claims related to:
- breach of contract as related to the open source licenses in individual GitHub repos (again, not a copyright claim)
- tortious interference in a contractual relationship (by failing to give Copilot users proper licensing info that they could comply with)
- reverse passing off under the Lanham Act (for allegedly leading Copilot users to believe that output generated by Copilot belonged to Copilot)
- unjust enrichment (vaguely for all of the above)
- unfair competition (vaguely for all of the above)
- negligence (for allegedly mishandling personal data)
- civil conspiracy (vaguely for all of the above)
Assessment of the Claims
The lack of a copyright claim here is very interesting. The first thought that comes to mind is that most people who have code on GitHub don’t bother to formally register their copyrights with the Copyright Office, which means that under the Copyright Act, though they have a copyright, they don’t have a right to enforce it in court. Because this is a class action lawsuit, the plaintiffs’ attorneys would have had a difficult time identifying class members with registered copyrights for an infringement claim, and the pool of plaintiffs in the class would have been significantly diminished – likely by about 99%. There are other reasons not to want to litigate a fair use defense, though. Such litigation is extremely fact-intensive, for one. It’s worth noting that while a firm driven by the monetary incentives that come with class actions may not want to pursue a copyright infringement claim, that certainly doesn’t foreclose people with other motivations from bringing such a claim. Without the copyright claim, any holding in this lawsuit will certainly not be the final touchstone lawyers look to when assessing the legal risks around machine learning (ML).
The other piece that looks odd here is that the complaint seems to misread the GitHub Terms of Service (ToS). The ToS, like all well-drafted terms of service out there, specifically defines “GitHub” to include all of its affiliates (like Microsoft), and users of GitHub grant GitHub the right to use their content to perform and improve the “Service.” Diligent product counsel will not be surprised to learn that “Service” is defined as any services provided by “GitHub,” i.e., including all of GitHub’s affiliates. While lay people might be surprised to learn that posting code on GitHub actually allows a giant web of companies to use their code for known and unknown purposes, legally, the ToS is clear on this point. A more persuasive claim of fraud would have centered on GitHub’s marketing materials (if any) suggesting that GitHub would use the code only “for GitHub.”
Nearly every claim in this complaint hinges on the idea that the only license users of GitHub granted to GitHub is the open source license under which they posted their code; the complaint nowhere mentions the license that GitHub users grant to GitHub in the ToS. Since a not insignificant number of GitHub repos don’t contain any licensing information at all, is the plaintiffs’ position that in the absence of an OSS license, there is neither a license in the ToS nor any implicit license for GitHub to host the code? That would be a strange position to take, particularly since GitHub has only recently started prompting users to add licensing information to their repos – it’s certainly never been a required field – and basically every commercial website out there takes a license to user content via its terms of service in more or less the same language that GitHub does. It would be particularly strange to argue that a user could put any license provisions at all in their repos and GitHub would have to draw the entirety of its right to host and otherwise use the code from a license sight unseen. That rings a bit like the Facebook memes of yesteryear promising users that if they just copy and paste these magical sentences onto their timelines, then Facebook won’t be able to do something or other with their data or accounts.
Nearly every claim also hinges on the idea that if the underlying code on GitHub is under, say, the MIT license, then Copilot’s output of any part of that code without attribution is also a violation of the MIT license. But copyright doesn’t work that way. Sufficiently short lines of code don’t constitute any protectable expression, and longer blocks of code may not be copyrightable if they are purely functional – i.e., there’s no other way to do this particular thing in this particular language. This is an issue for the plaintiffs on two fronts. First, the alleged DMCA violation here only applies to copyrighted (and copyrightable) works. So, if the code either in the Copilot model or in the Copilot output is not copyrightable, there’s no DMCA violation. Second, this affects the way they define (or should define) their class.
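To make the copyrightability point concrete, consider a hypothetical snippet of the kind Copilot might suggest (this example is mine, not drawn from the complaint or from any particular repo):

```python
# Hypothetical example of a merely functional snippet: a "clamp" helper.
# Thousands of repos contain an essentially identical function, and there
# are few meaningfully different ways to express it in Python -- the kind
# of output that arguably falls below the threshold of copyrightability.
def clamp(value, low, high):
    """Constrain value to the inclusive range [low, high]."""
    return max(low, min(value, high))
```

Whether a court would agree that any particular suggestion is unprotectable is, of course, exactly the fact-intensive question the plaintiffs would rather avoid litigating.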
The definition of the class in the complaint doesn’t actually condition participation on injury. The class is limited to GitHub users with a copyright in any work posted on GitHub after January 1, 2015 under one of GitHub’s suggested licenses (from their drop-down menu). But there’s nothing in the definition narrowing that class to people whose work was 1) actually made part of the model, 2) actually output by the model (either verbatim or as a derivative), or 3) output by the model in sufficient detail to still be subject to copyright. In fact, in several parts of the complaint the plaintiffs explicitly state that it’s difficult or impossible for them to identify their work in Copilot’s output. GitHub is likely going to have a lot to say about a class of plaintiffs that can only circumstantially and perhaps speculatively allege any harm.
The claims related to personal data don’t identify what personal data they’re referring to. But it seems like there’s a non-zero chance that the plaintiffs are simultaneously arguing that Copilot doesn’t display enough copyright management information per the DMCA and that what little it does show actually violates the CCPA. However, the CCPA does not prohibit data-related activities necessitated by federal law, which may protect the defendants if the scope of the personal data is copyright-related. If the scope of the personal data is broader, the plaintiffs may have a good point in arguing that Copilot doesn’t seem to have a way for anyone to query whether Copilot holds any of their personally identifiable data, nor a way to delete it if requested. Given how the ML model works, a full fix for this problem may be incredibly difficult if not impossible.
Overall, it’s not clear what the plaintiffs (the actual class, not the lawyers) would actually gain from forcing Copilot to display licensing information for all of its copyrightable suggestions. Imagining a world where this is possible and easy, does any copyright owner feel better knowing that some commercial product has their name attached to it in a one-million-page-long attribution file? Attribution files that are thousands of pages long are already common even without Copilot being used on nearly every file. Of course, this sort of information isn’t actually easy to provide. In practice, for any given suggestion, there is a high likelihood that it comes from a number of different sources. The plaintiffs themselves describe Copilot as basing its suggestions on the most commonly seen approaches. Who gets the credit if thousands of people have written this particular function this particular way (even if we assume it is sufficiently detailed to be copyrightable)? Is crediting all of them useful or practical? Who decides whether a suggestion is actually novel or derivative of other code, and what metrics are to be used to decide that at a scale of millions of suggestions a day? The law doesn’t provide clear answers to these questions; experts in the Copyright Office often mull these questions over for months for just one copyrighted work, and even then the decision often gets overturned in court. In practice, even if GitHub wanted to provide all the relevant licensing information for any given suggestion, doing so is likely impossible in most cases.
If GitHub is to be believed, Copilot only regurgitates exact code snippets from the training data about 1% of the time. Some portion of that 1% is certainly non-copyrightable snippets. So the plaintiffs are essentially asking for attribution for less than 1% of Copilot suggestions. The complaint itself repeatedly anticipates that any winning claims will arise from that 1%. So sure, the affected copyright owners have rights, but then this isn’t exactly “high impact litigation.” It looks a lot more like trolling if an otherwise incredibly useful and productivity-boosting technology is being stymied by people who want to receive payouts for a lack of meaningless attributions.
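A rough back-of-envelope makes the scale of that point visible. Only the ~1% regurgitation rate below comes from GitHub’s claim; the daily suggestion volume and the copyrightable share are assumed placeholders, not figures from the complaint:

```python
# Back-of-envelope arithmetic. ASSUMPTIONS: suggestions_per_day and
# copyrightable_share are illustrative placeholders; verbatim_rate is
# GitHub's claimed ~1% regurgitation figure.
suggestions_per_day = 1_000_000   # assumed placeholder volume
verbatim_rate = 0.01              # GitHub's claimed regurgitation rate
copyrightable_share = 0.5         # assumed share of verbatim snippets that are protectable

at_issue = suggestions_per_day * verbatim_rate * copyrightable_share
share_of_all = verbatim_rate * copyrightable_share

print(f"Suggestions plausibly needing attribution: {at_issue:,.0f} per day")  # 5,000
print(f"As a share of all suggestions: {share_of_all:.2%}")                   # 0.50%
```

Under any plausible choice of those placeholders, the claims concern a sliver of Copilot’s output.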
Most of the complaint is related to improper OSS attribution, but curiously only one line is devoted to the idea that the Copilot model itself is perhaps subject to some of the licenses in the underlying code and should actually be open sourced. Now that’s an actual substantive claim and far from troll-y. If the goal of the complaint was to make a significant impact on the future of AI and ML, then this would really be the crux of the complaint, because it would be an argument that ML models are copyrightable (a highly contentious matter), that ML models are derivative works of the training data (really fact-specific based on how the model actually works, and perhaps also a big philosophical quagmire), and that ML model output is copyrightable (also highly contentious because the Copyright Office won’t register copyrights to non-humans today, per its interpretation of the Copyright Act). The practical effect would likely be that, at least in the software space, the world would see at least one copyleft-licensed ML model (which might not benefit anyone in any way if the model itself is always hosted and never distributed, leaving the model owners under no obligation to share its source code).
But outside the software space, where open source licenses don’t proliferate and training data may not be subject to copyright at all (like training data that is purely factual or in the public domain), this may create a precedent that ML/AI models should receive copyright protection, and the owners of one model could potentially block the development of a similar one, effectively walling off the knowledge such a model might yield about an entire domain from everyone but the very first people to create a model for that domain. Or worse, the recognition of copyright in the model could lead to recognition of copyright in the output, and then humans could be sued for violating copyrights related to AI-generated content, which can be generated on a massive scale in very little time with no human effort at all. Under the banner of “open,” this lawsuit and others like it actually help to pave the way for more recognition of proprietary rights in a broader category of works, not less.
Class action lawsuits often experience significant roadblocks based on how a class is defined before any of the substantive claims are even reached. Onlookers can expect a lot of back and forth on this subject before the courts get around to providing the industry with any useful guidance with respect to Copilot or ML models. In the meantime, more lawsuits are possible. The Software Freedom Conservancy has been mulling its own suit for many months now and the lack of a copyright claim in this one would be an enticing reason to bring a separate lawsuit with respect to Copilot. Attorneys bringing a class action probably won’t spend the time and resources to push the issue of whether or not the ML model needs to be open sourced if it used copyleft training data – certainly their complaint only mentions this in passing. But, it’s likely only a matter of time until an individual or small group of people presses that issue for ideological rather than monetary reasons.