Blog

Did GitHub Really Just Offer to Indemnify You for Copilot’s Suggestions?

Short answer: Strictly theoretically, yes, but only if you execute the Corporate Terms of Service (not the one for individuals), potentially litigate a lot of unclear language in the contract in order to enforce GitHub’s obligation, and GitHub doesn’t spend all its money on other lawsuits first.

Long answer:

If, like me, you’ve been poking around GitHub’s website to get the scoop on all of Copilot’s new features, you might have stumbled upon an FAQ with the following information:

You can reach this FAQ via https://github.com/features/copilot/#faq-privacy-copilot-for-business and https://github.com/features/copilot/#faq as of May 10, 2023. Much ink has been spilled by lawyers and non-lawyers alike arguing about whether or not Copilot’s suggestions result in copyright infringement, so a promise by GitHub to defend its users in court against claims of copyright infringement related to Copilot output would be notable, and would likely persuade many potential customers to use the service.

If you click the link and go to the GitHub Copilot Product Specific Terms, though, you’ll find this:

To be clear, GitHub ISN’T offering an indemnity here. They’re saying that if your agreement with them happens to have one, then the following things are EXCLUDED from it. However, if you go to the GitHub default Terms of Service, you might be surprised to find that there’s no indemnity there for customers/users at all. The wording of the FAQ, and the fact that the information about the indemnity appears under the “Privacy – General” section and not under the “Privacy – Copilot for Business” section would lead a casual reader to expect this statement to be generally true of Copilot, not just for a subset of customers.

The Corporate Terms of Service do offer to indemnify you “against any claim brought by an unaffiliated third party to the extent it alleges Customer’s authorized use of the Service infringes a copyright, patent, or trademark or misappropriates a trade secret of an unaffiliated third party.” Let’s break that down, though. Does the “use of the Service” also include use of suggestions in your hosted service? That’s unclear. Does “use of the Service” also include your redistribution of suggestions in your own downloadable products? That’s even less clear since you’re not just including suggestions in your offering, you’re taking a separate step to ship it to third parties who will run it completely separately and apart from the Copilot services that initially built it. Certainly, the suggestions themselves could have been expressly referenced in this provision, but they are not and the definition of “Service” just says “GitHub’s hosted service and any applicable Documentation.”

Now let’s look at those exclusions. That language excludes claims based on code that differs from a suggestion provided by Copilot. Does a very slight difference void the indemnity? Potentially. It doesn’t say “materially” or “significantly” differs. And, perhaps more saliently, how is anyone supposed to tell 1. what code was written by Copilot and 2. if it was modified? I’m not aware of Copilot maintaining this kind of editing history as of today’s date (May 10, 2023). That means that at best one might be able to show that a particular suggestion was created by Copilot only if one managed to prompt Copilot to output the suggestion again (and that’s not necessarily evidence that the code at issue was output by Copilot back when the claim arose, only that it might have been output that way back then). The ability to do that is dubious since the underlying model and fine tuning are subject to change over time and since the suggestions are context-sensitive – they depend on the other info in the file that’s being worked on and even related files. So, those other artifacts may need to be re-presented to Copilot in that original state in order to elicit the exact same suggestion.

Moreover, it’s important to remember that the current case against GitHub does not actually allege copyright infringement. I discussed this before here, but it’s worth noting that both the DMCA Section 1202 claims and the claims related to violations of the California Consumer Privacy Act are claims that could potentially be brought against Copilot customers, not just Copilot. DMCA Section 1202 prohibits distribution of copyright management information that one knows has been altered or removed (and maybe customers “know” this because Copilot has already been sued for this? Certainly they’d “know” after the plaintiffs won on those claims?). Parts of the CCPA apply to a data holder even if they didn’t collect the data themselves and claims under other data protection laws in other states or other countries remain possible.

It’s also instructive to remember that the claims against GitHub and the claims against Stability AI related to Stable Diffusion (which does allege copyright infringement) are brought under a very “circumstantial” argument: the models are trained on copyrighted data, therefore all output from the models is a derivative work of the training data that infringers the authors copyrighted works, and all the other claims more or less flow from there. The plaintiffs specifically state in both complaints that it’s impossible to pinpoint particular instances of the output that might be infringing, so they are choosing to bring their theory of the claims without resting them on any specific pieces of output. Whether or not such a strategy will be approved by the courts is pure speculation, but it suggests the possibility of generative AI lawsuits where no specific code is at issue – the accusation could more or less begin and end with whether you’ve used Copilot to build your products. In that case, it’s not clear if or how GitHub’s indemnity exclusion might apply since, to put it literally, the claim is based on “code” (presumably the customer’s product as a whole) that “differs from the suggestions,” as it must because 1. many suggestions are likely modified, and 2. the product contains customer-written code that doesn’t contain and therefore differs from, the suggestions. (Tech transactions lawyers have likely noticed the “interesting” way this exclusion is drafted, which sidesteps the clearer and more traditional indemnity exclusion for “code/software/materials modified by a party other than the vendor” preferably paired with a “where such claim would not have arisen but for such modification”)

Lastly, in the wake of Johnson & Johnson’s baby powder class actions, it’s worth pointing out that the more serious the legal actions against Copilot’s customers or Copilot itself get, the less likely GitHub is to actually honor its indemnity obligation. If, for example, a court in any major market deems the entire enterprise of creating and/or using these models to be copyright infringement (or contributory copyright infringement, etc.), it’s possible that the first few people (or classes) to win litigation get a big payout from GitHub and then there’s no money left over for indemnifying subsequent customers. And even if there is money left over, GitHub could file for bankruptcy , which would likely mean customers are stuck defending themselves upfront and then trying to recoup just a portion of those expenses in bankruptcy court.

In conclusion, the indemnity on offer is not very exciting. Given the vagueness of the provision and its exclusions and the potential need to litigate in order to sort them out, I’d advise customers to ask for clarifying language, or at least a “prevailing party provision” that states that the loser in a lawsuit, or other dispute resolution settlement, must pay all or part of the winner’s legal costs – and make sure those fees are carved out of any limitation of liability provision. It would be unfortunate to lose the protection of the indemnity if the upfront costs of enforcing it against GitHub are just too high for a business to swallow.

OpenAI’s Massive Data Grab

I spent some time this week going through the whole suite of OpenAI agreements and policies and was quite surprised at what I found in their Terms of Use (TOU). It turns out that OpenAI’s confidentiality provision is unilateral: it includes confidentiality protection solely for OpenAI’s information. That means that neither the inputs provided to OpenAI nor the output it produces are treated as confidential by OpenAI. This is somewhat unusual in the context of SaaS vendors, who usually at least acknowledge that the data provided to them is confidential, even if they try to limit their liability with respect to keeping such data confidential over the course of the contract. Many companies are likely to be caught off guard by this provision. 

Companies the world over are looking to integrate OpenAI technologies, particularly ChatGPT, into their products and services. Microsoft has famously integrated a version of ChatGPT into its Bing search engine, with plans to integrate it with products and services throughout the rest of the Microsoft ecosystem. However, many companies will have to think twice about those integrations if OpenAI doesn’t change its tune. That’s because nearly every SaaS company has at least a subset of customers whose data they promise to treat as confidential, requiring them to pass along those confidentiality requirements to any third party who gets access to such customer data. If a vendor like OpenAI won’t treat any of a company’s inputs as confidential, then they are also not treating any input provided by a company’s customer or about a company’s customer as confidential, putting companies in violation of their own terms of use and/or non-disclosure agreements with their customers. Likewise, it means that companies cannot use OpenAI technologies to solicit or analyze certain categories of data about their staff, which the companies are bound to keep confidential either by law or by their agreements with their staff. That severely curtails the available use cases for OpenAI’s technologies. 

The unilateral nature of the confidentiality provision is easy to miss for non-lawyers. The TOU has a provision that says no input sent to or received from its APIs will be used to improve its services or train its models and further says that companies can opt out of the same for input or output sent in other ways. Many people reading the TOU are concerned about precisely this scenario because they don’t want third parties to get their hands on their data, so it’s easy to tune out the rest of the TOU after reading that, but what the TOU is also saying via the confidentiality provision is that every other use of input data and output data is still on the table for OpenAI, whether that means publishing the inputs or outputs, privately sharing them with third parties, aggregating them, analyzing them, etc. Lawyers can also draw distinctions here (some more dubious than others), like claiming that analyzing the input in order to better market the services, understand customers, or prioritize partnerships is not the same as analyzing the data to “improve the Services.” Such distinctions aren’t unusual in privacy policies and that kind of specificity is often lauded by regulators. Keep in mind that like the GitHub TOU discussed with respect to Copilot, the OpenAI TOU includes all of OpenAI’s affiliates, current and future, so there is no telling what the input data could potentially be used for in the future and as we’ve seen with Copilot, people can be unpleasantly surprised by unexpected uses down the line. 

The other element making the confidentiality provision easy to overlook is the fact that the TOU mentions the applicability of a Data Processing Addendum (DPA), principally used to help companies comply with the GDPR (Europe’s main set of data privacy regulations). Many reading the TOU will assume that the DPA protects their input and output data. However, the DPA only protects data that is also personally identifiable information. Again, provided that OpenAI strips the input of any personally identifiable information (which they explicitly state is a practice they perform), the input can be used by OpenAI without any confidentiality obligations. That should be particularly disconcerting for companies who make it their business to collect, analyze, and sell certain types of data. 

So, if you’re worried about what OpenAI might learn about your company, your customers or your employees, and who they might share that information with, think twice about accepting the OpenAI TOU as-is. Call them up and negotiate. 

US Copyright Office Rejects Notion that Midjourney Output Can Be Copyrighted

Back in 2022, Kristina Kashtanova applied for copyright registration with the US Copyright Office (USCO) for her comic book, Zarya of the Dawn. The USCO initially granted the copyright registration but when it realized that the book’s images were created using Midjourney, the USCO stripped the author of her copyright. IP attorney Van Linberg then stepped in to argue the matter with the USCO and on February 21, 2023 the US Copyright Office responded, allowing for copyright on the text of the comic book as well as the selection, coordination and arrangement of the comic book’s textual and visual elements, but expressly disallowing any copyright protection with respect to any of the comic book’s images. 

The crux of the USCO’s refusal to recognize any copyright interest in the images rests on the idea that Midjourney’s output is unpredictable and that the prompts users provide to it are mere suggestions, with too much “distance between what a user may direct Midjourney to create and the visual material Midjourney actually produces” such that “users lack sufficient control over generated images to be treated as the “mastermind” behind them.” Repeatedly, the USCO seems to argue that the final result has to reflect the artist’s “own original conception,” even going so far as to argue that the “process is not controlled by the user because it is not possible to predict what Mijourney will create ahead of time.”

This reasoning seems flawed, though, because there are a lot of examples of copyrightable work where it’s fair to say that the author could not predict the outcome at the beginning of the process: Bob Ross was famous for paintings with “happy accidents,” Jackson Pollock and Edwin Parker Twombly, Jr.’s splatter paintings, nature documentaries based on hundreds of hours of hidden camera footage, and the many photographs in which the events ultimately caught on film were complete surprises to the people behind the camera. And this is to say nothing of an entire genre of music based precisely on not knowing what will result: jazz. Historically, the USCO has recognized copyright in a number of works where the author did not have 100% control of the final output. 

The reasoning seems overly complicated. It would probably have been simpler for the USCO to just write that the “expression” is all produced by a computer and therefore isn’t authored by Kashtanova. That would have drawn a bright line around generative AI output. The USCO makes a point that if Kashtanova had commissioned an artist to draw these images, then the artist would have been the author since the artist chose how to express the ideas. However, that line of reasoning would seem to fall apart when the “prompt” becomes sufficiently long and detailed enough to warrant its own copyright. Surely, at some point, sufficient levels of input and management would move Kashtanova into the co-author slot? Even when commissioning a work, it’s not uncommon for the commissioned artist/creator to be given certain copyrightable pieces to work with and/or incorporate into the final output.

It seems to me that at least under current legal precedents, there should be some theoretical level of prompt sophistication that should garner the user co-authorship over the output. The USCO suggests that sufficiently creative alterations to the images would have yielded Kashtanova some copyright interest in the images. However, it’s not clear to me why using a paintbrush in Photoshop to change the color of a shirt is all that different from simply writing a command to the computer to change the color of the shirt via a prompt. In both cases, the user is simply using a tool to express their vision, one is just more complicated than the other. In some ways, the Midjourney prompts can almost be seen as a higher level programming language – following user commands and abstracting away ever more technical details in the process. While iterations in Midjourney prompts may produce more variable results than other image manipulation tools, the fact that using Midjourney to refine a picture is oftentimes slower than doing the same refinement in Photoshop, doesn’t really change the fact that the desired refinement is still possible through infinite iteration. While the output for any one prompt is “unpredictable” as the USCO letter says, many iterations can, in fact, lead to a specific desired result.

On the other hand, it’s not obvious that Kashtanova needs any more copyright protection than what the USCO granted, for purposes of incentivizing further creative output. The copyright protection she has would prevent someone else from publishing her book in its entirety. It would also prevent someone else from taking the text in her book and adding their own images or re-using her characters in other books. The only thing anyone could freely re-use would be the images. The images alone aren’t a substitute for the book. Lack of copyright protection on the images might eat into her ability to merchandise some of the images if others are selling objects with those images on them, but perhaps monetarily that’s a fair trade for not having painted them herself in the first place?

This seems like a fine individual outcome, but the reasoning behind it matters a great deal for cases going forward. The Copyright Office is incorrect in stating that users lack control, and should choose a more accurate and legible line.

An IP Attorney’s Reading of the Stable Diffusion Class Action Lawsuit

The image above was created via Stable Diffusion with the prompt “lawyers in suits fighting robots with lasers in a futuristic, superhero style.”

Special thanks to Yisong Yue, professor of machine learning at Caltech for providing me with valuable technical feedback on this post!

Looks like Matthew Butterick and the Joseph Saveri Law Firm are going to have a busy year! The same folks who filed the class action against GitHub and Microsoft related to Copilot and Codex a couple of months ago, have filed another one against Stability AI, DeviantArt, and Midjourney related to Stable Diffusion. The crux of the complaint is around Stability AI and their Stable Diffusion product, but Midjourney and DeviantArt enter the picture because they have generative AI products that incorporate Stable Diffusion. DeviantArt also has some claims lobbed directly at them via a subclass because they allowed the nonprofit, Large-Scale Artificial Intelligence Open Network’s (LAION), to incorporate the art work submitted to their service into a large public dataset of 400 million images and captions. According to the complaint, at the time this was the largest freely available dataset of its kind and Stable Diffusion was trained on it. Like the Copilot case, this one includes claims for:

  • Violation of the Digital Millennium Copyright Act’s (DMCA) sections 1201-1205 related to stripping images of copyright-related information 
  • Unfair competition, in this case stemming from copyright law and DMCA violations 
  • breach of contract, in this case as related to DeviantArt’s alleged violation of personal data-related provisions in their Terms of Service and Privacy Policy

Unlike the Copilot case, this one includes additional claims for:

  • Direct copyright infringement for training Stable Diffusion on the class’s images, including the images in the Stable Diffusion model, and reproducing and distributing derivative works of those images
  • Vicarious copyright infringement for allowing users to create and sell fake works of well-known artists (essentially impersonating the artists)
  • Violation of the statutory and common law rights of publicity related to Stable Diffusion’s ability to request art in the style of a specific artist

The Class

Like the Copilot case, this one has similar potential flaws in defining the class. The definition of the injunctive relief class and the damages class in the complaint doesn’t actually condition participation on injury. The class is defined as all persons or entities with a copyright interest in any work that was used to train Stable Diffusion. But, simply having work that is part of the training set doesn’t mean the work is 1) is actually part of the model, 2) actually outputted by the model (or a derivative of it is), or 3) outputted by the model in sufficient detail to still be subject to copyright. As in the Copilot case, the complaint explicitly states that it’s difficult or impossible for the plaintiffs to identify their work in Stable Diffusion’s output. 

Since this case involves copyright infringement claims, it’s also surprising that the class is not limited to people with registered copyrights, but only to people “with a copyright interest” since people who don’t have registered copyrights cannot enforce their copyrights in court. Additionally, litigator Noorjahan Rahman has pointed out that some courts extend the registration requirement to DMCA enforcement as well, further weakening the plaintiffs’ chances of succeeding in defining the class this way and/or bringing either copyright or DMCA claims as a class action. 

It’s also worth pointing out that only a fraction of the dataset allegedly used by Stable Diffusion originated from DeviantArt. With respect to images that came from DeviantArt, the question of copyright infringement must revolve around the license granted to DeviantArt in its Terms of Services. But, the rest of the dataset came from other sources. Some of them are likely to be just copyrighted and not subject to any other license, some of them are likely to be dedicated to the public domain, but the rest of them are under various licenses, including various Creative Commons licenses, commercial licenses, other Terms of Use, etc. Authors of such works of course still retain a copyright interest in them, but the question of whether or not use of the works constituted copyright infringement is going to depend on the underlying license. Since there is likely a cornucopia of underlying licenses in that dataset, it’s difficult to argue that all members of the class actually share the same questions of law and facts. 

Remember that in the Copilot case, the relevant class was tied to the 13 open source licenses available in GitHub’s drop-down menu for self-selection. That was a calculated choice since of course you can find any license under the sun on GitHub, not just the 13 GitHub suggests. It’s not difficult to argue that questions of law and fact are likely to vary across 13 different license agreements. However, those 13 licenses at least all had the common requirement that the copyright holder be attributed as a condition of using the license, which is likely why the plaintiffs settled on just those 13 (that 13 probably also covers 80% or more of what’s on GitHub, so that helps). That common thread might be enough in that case, particularly since it doesn’t involve a copyright claim. Here, however, it’s impossible to draw any common thread because the number of underlying licenses is likely in the hundreds or thousands, maybe more. Every single newspaper, magazine, archive, image sharing site, museum, etc., each have their own Terms of Use with fairly variable language on how images on their sites can be used, and of course many sites allow image contributors to provide their own license for the work, whether that’s something standard like a Creative Commons license, or something totally custom. 

The Copyright Claims

The complaint includes a section attempting to explain how Stable Diffusion works. It argues that the Stable Diffusion model is basically just a giant archive of compressed images (similar to MP3 compression, for example) and that when Stable Diffusion is given a text prompt, it “interpolates” or combines the images in its archives to provide its output. The complaint literally calls Stable Diffusion nothing more than a “collage tool” throughout the document. It suggests that the output is just a mash-up of the training data.

This is a fascinating take because certainly the exact way the interpolation is done and how Stable Diffusion responds to the text prompts seem to be parameters that can range widely. The text prompt interpretation could potentially be very responsive to natural human speech with all its nuances or it could be awful. And interpolation, especially of 3 or 12 or 100 or 1000 images can be done in an unlimited number of combinations, some better, some worse. But, Stable Diffusion doesn’t interpolate a subset of training images to create its output: it interpolates ALL the images in the training data to create its output. Carefully calibrating the interpolation parameters to lead to useful, realistic, aesthetic, non-disturbing images is itself an art form as the Internet abounds with both excellent and horrible generative AI output images. It’s relatively easy to see improvements in Stable Diffusion’s output across their releases and those improvements are the result of tweaking the model, not adding more images to the training data. Even the various companies using Stable Diffusion as a library and training it with the LAION dataset, appear to be producing results with markedly different qualities. So, the claim that the output is nothing more than the input is deeply specious.1

I think this is a little bit like arguing that there is no difference between, say, listening to a randomly created mashup of  the Beatles’ White Album with Jay-Z’s Black Album (creating abominable noise), and listening to Danger Mouse’s powerful and evocative Grey Album, which is a creative mashup of the two. Even if the output is a “mere” mashup, the exact manner of mashup still matters a lot and makes the difference between something humans recognize with joy as art, and something humans view as nothing more than noise. Danger Mouse may have made a mashup, but he also made a significant contribution of his own artistry to the album, creating an entirely different tone, sound, style, and message from the original artists, worth listening to on its own merits and not simply to catch pieces of the original works.

This question of how much of the output should be credited to the training data versus to the model’s processing of the training data should be at the heart of the debate over whether Stable Diffusion’s use of the various images as training data is truly transformative and thus eligible for copyright’s fair use defense (or perhaps even to the question of whether the output is eligible for copyright protection at all). I think it’s going to be easy for the defense to present alternative analogies and narratives to those presented by the plaintiffs here. The output represents the model’s understanding of what is useful, aesthetic, pleasing, etc. and that, together with data filtering and cleaning that general image generating AI companies do,2 is what the companies consider most valuable, not the training data.3 Except in corner cases where the output is very tightly constrained (like “show me dogs in the style of Picasso”) it may well be argued that the the “use” of any one image from the training data is de minimis and/or not substantial enough to call the output a derivative work of any one image. Of course, none of this even begins to touch on the user’s contribution to the output via the specificity of the text prompt.4 There is some sense in which it’s true that there is no Stable Diffusion without the training data, but there is equally some sense in which there is no Stable Diffusion without users pouring their own creative energy into its prompts.

Conclusion

Stability AI has already announced that it is removing users’ ability to request images in a particular artist’s style and further, that future releases of Stable Diffusion will comply with any artist’s requests to remove their images from the training dataset. With that removal, the most outrage-inducing and troublesome output examples disappear from this case, leaving a much more complex and muddled set of facts for the jury to wade through. The publicity claims and vicarious copyright infringement claims, at least as stated in this complaint, also fall away.5 It’s not clear if the lawsuit that remains is one the plaintiffs still want to litigate, particularly since the class is likely to be narrowed as well.

__________

  1. This doesn’t cover the fact that Stability AI didn’t use the dataset in raw form. They have said they removed illegal content and have otherwise filtered it. Depending on the extent of this manipulation, they might be eligible for a thin copyright on the compilation that resulted, which would also erode the argument that the output is 100% work copyrighted by the plaintiffs.
  2. Note that this is highly context-specific. Images for general-scope image generating AIs are widely available and a giant subset of them is in the public domain. In other contexts, the data can be extremely valuable if it’s difficult to collect, requires human annotation or interpretation, etc. I think it’s really worthwhile to distinguish generative AIs that essentially draw on all of humanity’s knowledge in a certain domain (which we could also call our culture) from generative AIs that draw on more narrow sources of data which cannot be said to belong to all of us in the same way.
  3. Despite their self-proclaimed penchant for “open,” Stability AI didn’t hesitate to insist on a take-down when one of their partners made the model public. 
  4. The Copyright Office currently does not allow works created with the assistance of generative AI to receive copyright registration. However, top IP attorney Van Lindberg is working on a case to reverse this position and a considerable number of IP experts believe that he may ultimately succeed. The idea that no part of a work can be copyrighted just because some of the work was created with the help of generative AI tools doesn’t seem like it will stand the test of time, especially as such tools make their way into the standard set of tools artists are already using, like Photoshop. Such a victory would cast further doubt on the plaintiffs’ broader claims that every Stable Diffusion output is merely the sum of its inputs, and therefore the images used here were merely “stolen” (i.e. copyright infringement) rather than “transformed” (i.e. fair use).
  5. There is of course nothing prohibiting the plaintiffs for suing for various claims that occurred in prior releases of Stable Diffusion. But, because this is a class action and the lawyers are likely getting paid a portion of the damages awarded to the plaintiffs, the damages merely for the prior claims may not justify the legal expense of proceeding with the case.

Is Open Source Attribution Dead?

Virtually every common open source license such as the BSD, MIT, Apache, and GPL requires, at a minimum, that the copyright owners of the code be attributed when their code is redistributed. Licenses that merely require attribution and passing on the license are termed “permissive” licenses and licenses that additionally require providing source code are termed “copyleft” licenses. A few licenses out there are very specific about the attribution and require it in prominent places like a splash screen or even in advertising, but none of the most common licenses do so and most open source foundations as well as corporate entities have policies against using those licenses. They have largely been abandoned. Today, there is industry-wide consensus that open source attribution is sufficient if provided in documentation accompanying a product. 

The nature of that documentation has certainly changed over time, though. With the advent of package managers and the explosion of libraries that get pulled into most projects, significant products now have open source attribution files stretching into tens of thousands of pages (Here’s just a part of VMware’s vSphere product. It’s 15,385 pages, and I crashed my MacBook Air twice trying to output it as a PDF so I could count the pages for you!). Because licenses don’t get too specific about the readability of these attributions, and because the attributions are often provided in .txt or other fairly universal formats so that they are accessible to all, the documents aren’t easy to navigate. And because the attribution requirements in common OSS licenses are quite light, the attribution files typically don’t include descriptions of particular OSS packages, how they’re used in the product, what part of the product they’re used for, or how integral they are to the product. Commonly, packages are listed in alphabetical order, so a tiny JavaScript function gets just as much prominence as an entire framework. Attribution files include copyright information, but they don’t include things like project/company logos or contact information. 

Long gone are the days when a product only used a handful of open source packages and attribution could easily fit into a product’s About screen which could reasonably be expected to be seen by most product users. The lists of packages are getting longer because of transitive dependencies, and additionally, individual OSS packages are now more likely to include other OSS subcomponents, with their own license and copyright information. The bloat occurs on multiple axes.

Compliance is now more challenging because not only does it involve digging up copyright and license information for the overall project, but it also involves digging up this info for the project’s subcomponents. Some projects do a good job of providing this information in one document, but most don’t. Most projects have secondary licenses scattered throughout their source code that are not acknowledged in a centralized location like a LICENSE, NOTICE or COPYING file. Projects from the Apache Foundation, for example, may note that certain subcomponents are under a certain license, but they won’t reprint that copyright and license information in their NOTICE file – they satisfy the requirement of passing on that information by passing on the subcomponent’s source code when they make their own source code available. Finding this information requires either a tedious manual process or a complex scanning tool that often still requires manual checks and corrections, and neither process is 100% accurate.* For this reason, many companies choose to only provide a project’s main license and decide to just risk it when it comes to secondary licenses. 
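To make the scanning problem concrete, here’s a naive sketch (purely illustrative, nothing like a production compliance scanner) that walks a source tree and groups files by the SPDX license identifiers found in their headers. Note that it would miss exactly the cases described above: license text buried mid-file, secondary licenses with no SPDX tag, and anything that isn’t machine-readably marked.

```python
import os
import re

# Naive illustration: collect SPDX short-form identifiers from file headers.
# Matches expressions like "MIT" or "Apache-2.0 OR GPL-2.0-only".
SPDX_RE = re.compile(
    r"SPDX-License-Identifier:\s*([\w.+-]+(?:\s+(?:OR|AND|WITH)\s+[\w.+-]+)*)"
)

def scan_licenses(root):
    """Map each SPDX license expression found to the files that declare it."""
    found = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    head = f.read(4096)  # only inspect the file header
            except OSError:
                continue
            m = SPDX_RE.search(head)
            if m:
                found.setdefault(m.group(1), []).append(path)
    return found
```

Even this toy version illustrates why real scanners need manual review: a file with no SPDX tag simply vanishes from the report, which is precisely the failure mode that makes companies “just risk it” on secondary licenses.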

Downstream OSS consumers could likewise pass on an entire project’s source code in order to satisfy the attribution requirements of OSS licenses instead of putting together tedious attribution documents. Some do, particularly in the container context, where it is much simpler and cleaner to just pass on all of a product’s OSS source code in an accompanying container rather than to try to locate this info and re-print it in one document. But it’s harder to bundle a lot of source code with a product in contexts where the size of the payload still matters (like on mobile devices) and in contexts where the source code is difficult to navigate and use (like on tiny screens built into refrigerators). Companies frequently provide source code on request or on websites separate from the products that use it in order to fulfill copyleft requirements around the provision of source code upon distribution. But separating attributions from the product would technically violate a lot of licenses, and most companies still try to avoid this.

However, the fact that perfect compliance can only really be accomplished by passing on the source code would seem to defeat the purpose of having permissive licenses in the first place. The license is no longer really permissive if it can only practicably be complied with by passing on the source code. Many developers choose permissive licenses for their work because they want it to be used as widely as possible and they specifically do not want to obligate users to additional source-code-related conditions. A lot of them don’t think too hard about whether to pick a permissive license or whether to dedicate their work to the public domain, because permissive licenses are very common and have been used for a long time. In contrast, public domain dedications were a tricky proposition before CC0 1.0 Universal was released by Creative Commons in 2009, giving developers the right toolkit to properly make a public domain dedication and to safeguard that intention even in countries that do not recognize the concept of the public domain. If more developers understood that permissive licenses now function much more like copyleft licenses, it’s likely that many would opt for releasing their work into the public domain rather than under a permissive license, because the attribution is buried deep in documentation no one reads anyway and essentially just adds to the downstream user’s overhead.

Historically, permissive licenses haven’t seen much in the way of legal enforcement. Legal enforcement has really been focused on the GPL family of licenses, although some of those enforcement claims have obviously been tied to the fact that redistributors have failed to properly attribute the GPL code owners. In large part, that’s because people who chose permissive licenses in the first place were more concerned with spreading their work far and wide than they were with ensuring that downstream users kept the code “open” (because in that case they would have chosen a copyleft license instead). There has long been speculation that this might change and we might see some “attribution trolling,” wherein copyright holders start enforcing permissive licenses as well. That hasn’t happened yet, except perhaps in the context of a claim related to Copilot’s failure to attribute OSS owners when providing output to its users. 

On the one hand, some could see the attribution requirement as good and useful leverage, as well as overhead that corporations should have to face, especially if they want to put a spoke in Copilot’s wheel. But I think others are ready to concede that that isn’t really what they intended or want for the industry, especially because this overhead tax is also legally required of individuals and non-profit OSS maintainers (no one is exempt from following third party OSS licenses). With a bigger push on the federal level for products to maintain a proper software bill of materials (SBOM) for security purposes, we are already seeing more companies turn to upstream project maintainers and ask for better and more easily digestible information about code provenance. They, too, are likely to struggle with attributions.
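For context, the SBOM formats that the federal push centers on (CycloneDX and SPDX are the usual candidates) already carry per-component license and provenance fields, which is exactly the “easily digestible” information attribution files lack. A minimal CycloneDX-style fragment might look roughly like this (the component name and version are made up for illustration):

```json
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.4",
  "components": [
    {
      "type": "library",
      "name": "example-lib",
      "version": "2.1.0",
      "purl": "pkg:npm/example-lib@2.1.0",
      "licenses": [{ "license": { "id": "MIT" } }]
    }
  ]
}
```

Unlike a 15,000-page attribution file, a record like this is machine-readable, so tooling can answer questions like “which components are MIT-licensed?” automatically.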

The desire to get something in return for putting out quality code into the ether is understandable. Many developers care a lot about their own reputation and they contribute to open source at least in part to signal to other developers and potential employers that they have marketable skills. Of course larger projects run under the auspices of a non-profit foundation or even a corporation also want to burnish their brands. But, it’s hard to say anymore that traditional OSS attribution requirements are doing any of that for developers. Signal about great OSS projects is now coming from stars on GitHub, number of forks, tech blogs, Hacker News, etc. I’ve never heard of a developer checking out a new OSS project on the basis of merely finding its name among a sea of other OSS projects in someone’s attribution file.

There are much better ways for developers to build their reputations and brands. Merely including their contact info in their projects (perhaps in their license.txt or in header comments) may well be sufficient since few people have any interest or incentive in proactively stripping such information from the code they’re using. It’s important to note that downstream users want to be able to easily find new useful OSS; while putting together attribution files is expensive and time-consuming, they still want to know the origin of their software – they want to see what else the same developers have written, they want updates to the code, they want to know who to call if there are security issues. And they want to know who to hire, or in the case of other companies, who to buy. Certain companies have also emerged to track OSS usage globally and report on what’s commonly used and by whom, helping to give developers credit and pointing downstream users to useful projects. Package managers could also track this data automatically and make it public. In any case, it’s long past time for this problem to be solved via non-legal means.

* The process isn’t accurate for two reasons. The first is that locating all of a project’s license information isn’t easy. Some licensing information is in the header of a file, but occasionally info is hidden away in the middle of a file that’s the length of a book. The second is that people don’t all agree on exactly what needs to be reproduced. Do we have to reproduce the exact same copyright info and licensing info if the only difference between two such blocks is the copyright year? What if there’s an additional author? What if the text is substantially similar but uses slightly different wording? Plus, some projects put copyright info in every single file while others put it all in one place. Scanners and humans have a much harder time locating unique copyright notices and license information if each and every file in a project is marked rather than just the files with licensing information that differs from the main license of the project. 
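The year-only-difference question in the footnote is also a tooling question. Here is a hedged sketch of how a pipeline might collapse copyright notices that differ only in their years, keeping one representative per holder; whether that abbreviated form satisfies any given license is a legal judgment, not a technical one.

```python
import re

# Sketch only: normalize notices by stripping years and whitespace/case,
# then keep the first original form seen for each normalized key.
YEAR_RE = re.compile(r"\b(19|20)\d{2}(\s*[-,]\s*(19|20)\d{2})*\b")

def dedupe_notices(notices):
    """Collapse notices that differ only in years, preserving first spelling."""
    seen = {}
    for notice in notices:
        key = YEAR_RE.sub("", notice)                    # drop year ranges
        key = re.sub(r"\s+", " ", key).strip().lower()   # normalize the rest
        seen.setdefault(key, notice)
    return list(seen.values())
```

Note the footnote’s harder cases (extra authors, slightly different wording) defeat this kind of normalization entirely, which is why scanners still need manual review.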
