An Open Source Lawyer’s View on the Copilot Class Action Lawsuit

Seems like the other shoe finally dropped and a formal complaint has been filed with the Northern District of California regarding Copilot and Codex. The complaint is fascinating because the one thing it doesn’t allege is copyright infringement. The complaint explicitly anticipates a fair use defense on that front and attempts to sidestep that entire question principally by bringing claims under the Digital Millennium Copyright Act instead, centered on Section 1202, which forbids stripping copyrighted works of various copyright-related information. The complaint also includes other claims related to:

  • breach of contract as related to the open source licenses in individual GitHub repos (again, not a copyright claim)
  • tortious interference in a contractual relationship (by failing to give Copilot users proper licensing info that they could comply with)
  • fraud (relating to GitHub’s alleged lies in their Terms of Service and Privacy Policy about how code on GitHub would not be used outside of GitHub)
  • reverse passing off under the Lanham Act (for allegedly leading Copilot users to believe that output generated by Copilot belonged to Copilot)
  • unjust enrichment (vaguely for all of the above)
  • unfair competition (vaguely for all of the above)
  • breach of contract as related to GitHub’s alleged violation of personal data-related provisions in their Terms of Service and Privacy Policy
  • violation of the California Consumer Privacy Act (CCPA) as related to GitHub’s alleged violation of personal data-related provisions in their Terms of Service and Privacy Policy
  • negligence (negligent handling of personal data)
  • civil conspiracy (vaguely for all of the above)

Assessment of the Claims

The lack of a copyright claim here is very interesting. The first thought that comes to mind is that most people who have code on GitHub don’t bother to formally register their copyrights with the Copyright Office, which means that under the Copyright Act, though they have a copyright, they don’t have a right to enforce their copyright in court. Because this is a class action lawsuit, at least with respect to a copyright infringement claim, the plaintiffs’ attorneys would have had a difficult time identifying plaintiffs with registered copyrights and the pool of plaintiffs in the class would be significantly diminished – likely by about 99%. There are other reasons to not want to litigate a fair use defense, though. Such litigation is extremely fact-intensive, for one. It’s worth noting that while a firm driven by the monetary incentives that come with class actions may not want to pursue a copyright infringement claim, that certainly doesn’t foreclose people with other motivations from bringing such a claim. Without the copyright claim, any holding in this lawsuit will certainly not be the final touchstone lawyers look to when assessing the legal risks around machine learning (ML).

The other piece that looks odd here is that the complaint seems to misread the GitHub Terms of Service (ToS). The ToS, like all well-drafted terms of service out there, specifically identifies “GitHub” to include all of its affiliates (like Microsoft), and users of GitHub grant GitHub the right to use their content to perform and improve the “Service.” Diligent product counsel will not be surprised to learn that “Service” is defined as any services provided by “GitHub,” i.e. including all of GitHub’s affiliates. While lay people might be surprised to know that posting code on GitHub actually allows a giant web of companies to use their code for known and unknown purposes, legally, the ToS is clear on this point. A more persuasive claim of fraud would have centered on GitHub’s marketing materials (if any) around GitHub’s use of code only “for GitHub.”

Nearly every claim in this complaint hinges on the idea that the only license users of GitHub granted to GitHub is the open source license under which they posted their code and there’s no mention anywhere of the license that GitHub users grant to GitHub in the ToS. Since a not insignificant number of GitHub repos don’t contain any licensing information at all, is the plaintiffs’ position that in the absence of an OSS license, there is neither a license in the ToS nor any implicit license for GitHub to host the code? That would be a strange position to take, particularly since GitHub has only recently started prompting users to add licensing information to their repos – it’s certainly never been a required field – and basically every commercial website out there takes a license to user content via their terms of service in more or less the same language that GitHub does.  It would be particularly strange to argue that a user could put any license provisions at all in their repos and GitHub would have to draw the entirety of their right to host and otherwise use the code from a license sight unseen. That rings a bit like the Facebook memes of yesteryear promising users that if they just copy and paste these magical sentences onto their timelines, then Facebook won’t be able to do something or other with their data or accounts. 

Nearly every claim also hinges on the idea that if the underlying code on GitHub is under, say, the MIT license, then Copilot’s output of any part of that code without attribution is also a violation of the MIT license. But, copyright doesn’t work that way. Sufficiently short lines of code don’t constitute any protectable expression and longer blocks of code may not be copyrightable if they are purely functional – i.e. there’s no other way to do this particular thing in this particular language. This is an issue for the plaintiffs on two fronts. First, the alleged DMCA violation here only applies to copyrighted (and copyrightable) works. So, if the code either in the Copilot model or in the Copilot output is not copyrightable, there’s no DMCA violation. Second, this affects the way they define (or should define) their class. 

The definition of the class in the complaint doesn’t actually condition participation on injury. The class is limited to GitHub users with a copyright in any work posted on GitHub after January 1, 2015 under one of GitHub’s suggested licenses (from their drop down menu). But there’s nothing in here narrowing that class to people whose work is 1) actually part of the model, 2) actually outputted by the model (or a derivative of it is), or 3) outputted by the model in sufficient detail to still be subject to copyright. In fact, in several parts of the complaint the plaintiffs explicitly state that it’s difficult or impossible for the plaintiffs to identify their work in Copilot’s output. GitHub is likely going to have a lot to say about a class of plaintiffs that can only circumstantially and perhaps speculatively allege any harm. 

The claims related to personal data don’t identify what personal data they’re referring to. But, it seems like there’s a non-zero chance that the plaintiffs are both arguing that Copilot doesn’t display enough copyright management information per the DMCA and that what little it does show, actually violates the CCPA. However, the CCPA does not prohibit data-related activities necessitated by federal law, which may protect the defendants if the scope of the personal data is copyright-related. If the scope of the personal data is broader, plaintiffs may have a good point in arguing that Copilot doesn’t seem to have a way for anyone to query whether or not Copilot holds any of their personally identifiable data nor a way to delete it if requested. Given how the ML model works, a full fix for this problem may be incredibly difficult if not impossible. 

Impact

Overall, it’s not clear what the plaintiffs (the actual class, not the lawyers) would actually gain from forcing Copilot to display licensing information for all of its copyrightable suggestions. Imagining a world where this is possible and easy, does any copyright owner feel better knowing that some commercial product has their name attached to it in a one-million page long attribution file? Attributions that are thousands of pages long are already common even without the use of Copilot on nearly every file. Of course, this sort of information isn’t actually easy to provide. In practice, for any given suggestion, there is a high likelihood that it comes from a number of different sources. The plaintiffs themselves describe Copilot as basing its suggestions on the most commonly seen approaches. Who gets the credit if thousands of people have written this particular function this particular way (even if we assume it is sufficiently detailed to be copyrightable)? Is crediting all of them useful or practical? Who decides whether a suggestion is actually novel or derivative of other code and what metrics are to be used to decide that on a scale of millions of suggestions a day? The law doesn’t provide clear answers to these questions; experts in the Copyright Office often mull these questions over for months for just one copyrighted work and even that decision often gets overturned in court. In practice, even if GitHub wanted to provide all the relevant licensing information for any given suggestion, doing so is likely impossible in most cases.

If GitHub is to be believed, Copilot only regurgitates exact code snippets from the training data 1% of the time. Some portion of that 1% is certainly non-copyrightable snippets. So the plaintiffs are essentially asking for attribution for less than 1% of Copilot suggestions. The complaint repeatedly anticipates any winning claims to arise from that 1%. So sure, the affected copyright owners have rights but then this isn’t exactly “high impact litigation.” It looks a lot more like trolling if an otherwise incredibly useful and productivity-boosting technology is being stymied by people who want to receive payouts for a lack of meaningless attributions. 

Most of the complaint is related to improper OSS attribution, but curiously only one line is devoted to the idea that the Copilot model itself is actually subject to perhaps some of the licenses in the underlying code and should actually be open sourced. Now that’s an actual substantive complaint and far from troll-y. If the goal of the complaint was to make a significant impact on the future of AI and ML, then this would really be the crux of the complaint because it would be an argument that ML models are copyrightable (this is a highly contentious matter), that ML models are derivative works of the training data (this is really fact-specific based on how the model actually works and perhaps also a big philosophical quagmire), and that ML model output is copyrightable (also highly contentious because the Copyright Office won’t register copyrights to non-humans today, per their interpretation of the Copyright Act). The practical effect would likely be that at least in the software space, the world would see at least one copyleft-licensed ML model (which might not benefit anyone in any way if the model itself is always hosted and never distributed and therefore the model owners are under no obligation to share its source code). 

But outside the software space, where open source licenses don’t proliferate and training data may not be subject to copyright at all (like training data that is purely data or which is in the public domain), this may create a precedent that ML/AI models should receive copyright protection, and the owners of one model could potentially block the development of a similar one, effectively walling off the knowledge such a model might yield on an entire domain from everyone but the very first people to create a model for that domain. Or worse, the recognition of copyright in the model leads to recognition of copyright in the output and now humans can be sued for violating copyrights related to AI-generated content, which can be generated on a massive scale in very little time with no human effort at all. Under the banner of “open,” this lawsuit and others like it actually help to pave the way for more recognition of proprietary rights in a broader category of works, not less.

Conclusion

Class action lawsuits often experience significant roadblocks based on how a class is defined before any of the substantive claims are even reached. Onlookers can expect a lot of back and forth on this subject before the courts get around to providing the industry with any useful guidance with respect to Copilot or ML models. In the meantime, more lawsuits are possible. The Software Freedom Conservancy has been mulling its own suit for many months now and the lack of a copyright claim in this one would be an enticing reason to bring a separate lawsuit with respect to Copilot. Attorneys bringing a class action probably won’t spend the time and resources to push the issue of whether or not the ML model needs to be open sourced if it used copyleft training data – certainly their complaint only mentions this in passing. But, it’s likely only a matter of time until an individual or small group of people presses that issue for ideological rather than monetary reasons. 

Co-Chairing PLI’s Annual “Open Source Software – From Compliance to Cooperation” Program

This September I co-chaired the Practising Law Institute’s annual “Open Source Software – From Compliance to Cooperation” program with Heather Meeker. The program is a day-long continuing legal education event with a variety of open source licensing and compliance experts covering both introductory and advanced topics as well as recent developments in OSS licensing. We had an all-star lineup this year, so it’s a terrific way for attorneys to pick up 6 CLE credits, including 1 elimination of bias credit.

As part of the program, Aaron Williamson of Williamson Legal (former counsel for the Software Freedom Law Center and general counsel to the Fintech Open Source Foundation) and I did a session titled “OSS in Transactions, Licensing and M&A” where we did a deep dive on contractual provisions related to open source software and provided some advice on where and how they should be implemented. Our slides are available for download here. Our presentation was loosely based on a white paper we co-authored titled “IoT and the Special Challenges of Open Source Software Licensing,” which will be published shortly in the ABA’s Landslide magazine. I’ll update this blog with a link once it’s online.

OSS in Procurement

I gave a presentation today during the Open Source Initiative’s “Practical Open Source Information” event on OSS in procurement. This presentation specifically focuses on what OSS-related provisions you may need to add when you are purchasing licenses to third party software, either from consultants/contractors or off-the-shelf. You can download my slides, including a number of sample provisions here.

Considerations When Open Sourcing a Project

One of the most frequent questions I receive, not only from engineers but also from other lawyers, is “we are thinking of making a project available under an open source license – what do we need to know?” Since I get this question a lot, I’ll share with you the list of information I usually collect from clients to help them assess whether an open source license is right for them, which one they should use, how they should communicate licensing info to others, and whether they should consider creating codes of conduct, contributor guidelines, or contributor agreements (or DCOs) for their projects:

Outbound Open Source Checklist

Background Info

  • What is the name of your project and the first version that will be publicly released?
  • Please describe what your project does and the format in which it is made available. (Will you be providing just source or binaries, too? Will it be packaged as a virtual machine, container, etc.?)
  • Where do you plan to make the open source project available? (ex.: GitHub, NPM, rubygems.org, etc.)
  • Does the project have any use outside the context of using it in conjunction with other Company products or services?
  • Do you have any reason to believe that there are Company patents on the technology you want to open source?

Technical Info

  • Does Security see any issues in open sourcing the proposed project? Have you ensured there are no keys or credentials in the source code?
  • Does your project contain any code licensed from a vendor (whether for free or for fee)?
  • Does your project contain any integrations with any third-party products or services?
  • Does the project have any “phone home” features? If so, please identify with specificity the data collected and why.
  • Please submit an open source code scan of the project and its dependencies (including transitive dependencies) for review.
    • The scan needs to be done before a main license can be chosen for the project because the main license of the project should be compatible with the licenses of the third-party code the project uses or pulls in. For example, if you want to license something out as GPL 2, you can’t include Apache 2.0 dependencies and vice versa.
    • You will need to remove dependencies/components under commercial licenses or under licenses that are incompatible with the chosen license for the project as a whole.
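
To make the scan review step above a bit more concrete, here is a minimal sketch in Python of how scan results might be checked against a chosen outbound license. Everything in it is an assumption for illustration: the package names are hypothetical, the license identifiers are SPDX-style strings I picked, and the compatibility table is deliberately tiny and incomplete. It is not legal advice and is no substitute for a real scanning tool or a lawyer's review.

```python
# Hypothetical, deliberately incomplete table: for each outbound project
# license, the dependency licenses assumed acceptable for this sketch.
# Real compatibility analysis is more nuanced (versions, exceptions, linking).
COMPATIBLE_INBOUND = {
    "GPL-2.0-only": {"GPL-2.0-only", "MIT", "BSD-3-Clause"},
    "Apache-2.0": {"Apache-2.0", "MIT", "BSD-3-Clause"},
}


def find_incompatible(project_license, dependencies):
    """Return (name, license) pairs that need review or removal.

    `dependencies` maps package name -> license identifier, as a scan
    might report it (including transitive dependencies).
    """
    allowed = COMPATIBLE_INBOUND.get(project_license)
    if allowed is None:
        raise ValueError(f"No compatibility data for {project_license!r}")
    return [
        (name, lic)
        for name, lic in sorted(dependencies.items())
        if lic not in allowed
    ]


# Hypothetical scan output for a project we would like to release as GPL 2.
scan = {
    "libfoo": "MIT",
    "libbar": "Apache-2.0",
    "vendor-sdk": "Proprietary",
}

for name, lic in find_incompatible("GPL-2.0-only", scan):
    print(f"{name}: {lic} needs review before release")
```

With this made-up scan, the Apache 2.0 package and the commercially licensed SDK are flagged, which matches the two situations described in the bullets above. Anything the table doesn't list is flagged too, which seems like the safer default for a sketch like this.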

Goals

  • Explain why you want to open source this project. What are you trying to accomplish? Why is this preferable over a proprietary license or a proprietary license that gives access to source code?

License

  • Have you thought about the sort of license you’d like to release this under? If so, what is it and why?
  • If the project is used by a competitor or provided as a service by a big cloud services provider, does that pose issues for the business?
  • Can you foresee any reasons why you might later regret open sourcing this project?
  • Do you expect proprietary Company products to incorporate this code in the future?
  • Do you expect that this project might grow into something you might want to commercially license in the future?

Developer/User Expectations

  • Do you intend to grow a community around this project or are you just looking for an easy way to get code into the hands of existing customers and partners? If you want to grow a community:
    • Who will be responsible for managing the community?
    • Are you committing to responding to pull requests and issues on a prompt basis?
  • Are you going to provide any formal support for users of the project?

Once this information is collected, I can provide clients with more specific guidance with respect to marking their source code, providing licensing info in their chosen distribution channels, how to set up a contributor agreement intake process (if necessary), how to set user/developer expectations appropriately, and how to attribute third party open source packages used in the project. However, sometimes we all discover that the necessary ingredients for creating an open source project aren’t there and we can walk through alternatives or the steps necessary to be ready in the future. I hope you find this useful!

“Ask Me Anything” Open Source Webinar with Tidelift

I recently sat down with Luis Villa, the co-founder of Tidelift, to record an “Ask Me Anything” (AMA) webinar on open source licenses. Tidelift is a startup that provides companies with a well-curated catalog of proactively maintained open source projects, and in turn provides maintainers with a plethora of resources and a revenue stream. As an open source expert himself (notably, he is the author of the Mozilla Public License 2.0), Luis has run a series of open source AMAs and I encourage developers and young open source projects and companies to check them out. We touched on containers, API licenses (including Google v. Oracle), and compliance strategies. A recap of the AMA is available here, and the full audio recording can be downloaded here.