Yes, GitHub Finally Offered to Indemnify for Copilot Suggestions, But…

Back in September, Microsoft made a big splash announcing that it would be offering a copyright indemnity to all paying customers of its various Copilot services, including GitHub Copilot. But, Microsoft didn’t update any of its contracts to reflect this new copyright indemnity. Lawyers everywhere were mystified (ok, maybe just my friends…). It was very strange for such a public official announcement, coming straight from the Chief Legal Officer himself, to not also be accompanied by new contracts which would instantiate the commitment being offered. Claiming to offer an indemnity is one thing, but indemnities can be written broadly or narrowly and can include exceptions big and small. Given that GitHub already had a history of publishing misleading information about its legal protections, which I detailed here, I was curious to see what Microsoft came up with.

Close to a month later, GitHub finally published some new indemnity-related language. Section 4 of the GitHub Copilot Product Specific Terms was updated from this:

To this:

The General Terms remain the same:

As before, the IP indemnity only applies to paying customers, but now it explicitly covers not just use of GitHub Copilot, but also any IP claims related to its Suggestions. But, this update of Section 4 seems a bit hasty. When GitHub says the Suggestions are “included,” does that also mean that the Suggestions are subject to the “unmodified as provided by GitHub and not combined with anything else” carveout? As discussed before, in the context of how developers use GitHub Copilot, those exceptions are so large they threaten to swallow the indemnity whole. It’s entirely up for debate whether those exceptions are meant to apply to Suggestions as well. Further, with respect to GitHub’s mitigation measures, is GitHub also offering to replace the Suggestions with a functional equivalent? Again, I think it’s entirely unclear.

But, let’s say for the sake of argument that GitHub is being generous and we should read these ambiguities as being resolved in favor of the customer. The big elephant in the room is that a lawsuit against a customer may or may not specify exact Suggestions that infringe the copyright holder’s IP rights. The plaintiffs in many of the existing class action lawsuits against various generative AI companies don’t allege any specific infringing output; they allege infringement generally, solely on the basis of their works being used to train the models. One alleges that the model itself ends up being a derivative work of the training data and goes so far as to say that all output infringes the copyright of the author of each piece of training data. 

Lawsuits with that posture are much harder to bring against Copilot customers since the customers didn’t train the model and didn’t handle any of the training data, but there is some uncertainty around whether the courts will accept that a model is a derivative work of the training data (and therefore, to the extent models are offered for physical distribution, making copies of the model also infringes the copyrights of the training data authors) or that all model output is a de facto infringement of the training data’s copyrights. So, if the plaintiff doesn’t specify infringing output and the customer has no way to track what was and what wasn’t a Suggestion, what would it mean for the customer to stop infringing if they lose the case or GitHub settles it on their behalf? Would the customer just have to stop shipping its product entirely and rewrite it so that there’s certainty it doesn’t include any Copilot Suggestions? The likelihood of such a claim against a customer or its ultimate success is probably low, but it’s not zero, and the cost of such an outcome is extremely high. It’s likely higher than the related copyright statutory damages. 

The issue here is that an injunction or agreement to stop infringing is an equitable remedy; it’s not monetary damages. That means that any revenue losses resulting from a customer needing to discontinue a product aren’t damages covered by the indemnity and are losses they would have to deal with alone. Such damages would be consequential damages for which Microsoft fully disclaims any liability. That potentially puts GitHub in a position where they make a settlement on a customer’s behalf that effectively ends their business, with no or limited financial repercussions for GitHub. Worse, GitHub would still be able to tell reporters that they “fulfilled their obligations to defend their customers against IP claims related to Copilot” and unless the customer is well-known, the customer’s ultimate immiseration may never become publicly known, especially if the settlement’s terms include confidentiality. 

It’s also worth noting that the new indemnity provision is strictly for IP claims and does not cover other types of claims, like those that might arise from security vulnerabilities introduced by the Suggestions. It also doesn’t cover some of the claims already brought against GitHub such as those under the DMCA’s Section 1202 related to deleting copyright management information or related to violations of the California Consumer Privacy Act. Either of those could potentially be brought against GitHub customers as well. 

Conclusion

Microsoft and GitHub’s new indemnity offer is an improvement over their previous offer, but the drafting leaves a lot of open questions about how it would apply in practice. Is that ambiguity intentional or just the result of drafting quickly under pressure? Overall, the specter of lawsuits against customers is likely overblown, but the worst case scenario I described here is quite bad. One remedy is obviously to ask GitHub to give the customer final approval over any settlement, or at least over any settlement that includes conditions other than monetary damages. However, if that fails, customers might also consider something unusual: ban GitHub from making settlements subject to confidentiality obligations without the customer’s written consent so that GitHub will have its reputation to consider if it chooses to throw a customer under the bus in the name of a quick and cheap settlement.

An IP Attorney’s Reading of the LLaMA and ChatGPT Class Action Lawsuits

Matthew Butterick and the Joseph Saveri Law Firm are continuing to make the rounds amongst generative AI companies, following up on their lawsuits related to Copilot, Codex, Stable Diffusion, and Midjourney with two more class actions related to ChatGPT and LLaMA, respectively. Notably, the lawsuit related to LLaMA actually predates Meta’s release of LLaMA under a commercial license and the comedian Sarah Silverman is one of three named plaintiffs in both of these new cases, alleging various claims related to her book, The Bedwetter. These cases make two new interesting assertions about generative AI that do not appear in any of the prior cases.

New Theories of Liability

The first new assertion is that the models themselves are derivative works of the training data. The Copilot and Codex case doesn’t include any claims related to copyright infringement. The Stable Diffusion and Midjourney case does include copyright infringement claims, but the allegations under Count I (Direct Copyright Infringement) don’t entirely make it clear whether the illegally prepared “derivative works” refer to the model output or the model itself. Section VII (Class Allegations) seems to indicate that the direct copyright infringement is tied to downloading and storing the copyrighted works and then training the model with those works; the vicarious copyright infringement is tied to the output which was used by third parties to create fakes of original artists’ work. While the complaint mentions that the “AI Image Products contain copies of every image in the Training Images,” the impression from the complaint is that it’s the copying of the images for the purposes of the training that’s at issue, and perhaps that a copy still somehow resides “in” the model. But copying is a separate and distinct copyright monopoly right from creating derivative works, and this is the first time that this team of attorneys has explicitly claimed that the model itself is a derivative work of the training data.

The second new assertion is that “every output of the […] language models is an infringing derivative work” with respect to each of the copyrighted works of the plaintiffs. In other words, the assertion is that no matter what the model outputs, it is necessarily a derivative work of Silverman’s book. In the complaint related to Stable Diffusion and Midjourney, the Factual Allegations section does state that “the resulting image is necessarily a derivative work” but it doesn’t say whose work it is derivative of – is it a derivative of one copyright holder’s work or all of the copyright holders’ works at once? Further, the actual section describing the nature of the copyright infringement claim (Count I Direct Copyright Infringement) doesn’t quite go so far as to say every output of the model necessarily infringes the copyrighted work of a single copyright holder. The complaint puts it more obliquely, arguing that the outputs “are derived exclusively from the Training Images…” It’s not clear that this usage of the word “derived” is meant to mean “derivative work” under copyright law, and again, it’s not clear if they mean that every output is derivative of some works or all works. The same section merely accuses the defendants of having “prepared Derivative Works based upon one or more of the Works.” From my perspective, this is a new and different theory of copyright infringement now being put forth by this team of attorneys.

The Implications of the New Theories of Liability

Both of these new assertions have interesting implications for the plaintiffs’ cases. There is a scenario, for example, where the courts decide that the copying necessary to train a model is either fair use or outside the ambit of copyright law. That sort of incidental copying is fairly universal in order to allow both individuals and various bots to “read” the Internet, after all, and merely reading or viewing a copyrighted work is not a right protected by copyright law. A court may further decide that it won’t hold a model creator/distributor responsible for its output because there is substantial non-infringing use for the model(s) and will instead hold individual users accountable. However, if the model itself is deemed to be a derivative work of the training data and for whatever reason fair use or other defenses did not apply, the defendants would still be liable for making the derivative work and distributing it (to the extent there is distribution) even if the courts rule as described above with respect to model training and the model output.

This puts into play a third prong of attack for the plaintiffs that didn’t exist before. Strategically, I think this was a good move for the plaintiffs’ case.1 I happen to agree that the model probably does constitute a derivative work of at least some of the training data, and the logic of “the model can summarize the plaintiffs’ works, therefore it must have copied and stored them” is simple and really appealing to a non-technical audience in a way that the refutations of this statement will not be. However, I also think there is a strong argument that either the fair use defense applies and/or that certain models can be big and complex enough to the point where even though they might contain some training data, its use is de minimis. But, there’s no predicting how courts will weigh such a defense and this allegation gives the plaintiffs a third roll of the dice.

The allegation that all output is necessarily a derivative work of any given piece of training data is more of a double-edged sword for the plaintiffs. Technically speaking, any time the model creates output, it is doing so on the basis of the model in its entirety (a model’s decision not to respond to input in a certain way is as much informed by the training data as its affirmative decision to respond to input in a certain way). All output is, in a sense, a reflection of everything the model has gleaned from the training data. But this is “derivation” in the colloquial sense of the word, not in the sense designated under copyright law. Under copyright law, a “derivative work” is one that still retains copyrightable elements of the original work.
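
For the technically curious, here is a minimal sketch of that point, using a toy, made-up one-layer model rather than anything resembling ChatGPT or LLaMA: every output probability is computed from the full set of learned parameters, so nudging any single weight shifts the entire output distribution.

```python
# A toy illustration (hypothetical weights, not any real model): every output
# probability depends on the entire parameter set learned from training data.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 8, 4

embeddings = rng.normal(size=(vocab_size, embed_dim))      # "learned" parameters
output_weights = rng.normal(size=(embed_dim, vocab_size))  # "learned" parameters

def next_token_probs(token_id: int, W: np.ndarray) -> np.ndarray:
    """Softmax over next-token logits for a single previous token."""
    logits = embeddings[token_id] @ W
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

baseline = next_token_probs(3, output_weights)

# Nudge one weight: the whole distribution moves, because the output is a
# function of all the parameters at once -- "derivation" in the colloquial
# sense, which is not the same as a "derivative work" under copyright law.
perturbed = output_weights.copy()
perturbed[0, 0] += 1.0
print(np.abs(next_token_probs(3, perturbed) - baseline).max() > 0)  # True
```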

Clearly, ChatGPT is more than capable of producing output that no reasonable person could in any way connect with Silverman’s book, so this seems like a strong overreach. Such an interpretation would certainly benefit the plaintiffs since the complaints don’t actually allege any instances of output that infringes the plaintiffs’ works other than output summarizing those works (and of course mere summarization does not constitute copyright infringement, contrary to the plaintiffs’ assertions in these complaints), but I think this is a bridge too far and I think an already confused court would not look kindly upon such an incendiary and misleading claim. This feels like a placeholder until (if?) the plaintiffs actually get the models to reproduce a real derivative work of one of their works – and so far, they don’t seem to have succeeded at that.

The inability to produce any damning output probably makes these the weakest of all the generative AI cases this group has filed so far, especially with respect to ChatGPT because in the absence of derivative output and in the absence of physical distribution of ChatGPT itself (the plaintiffs don’t allege any physical distribution of ChatGPT), it would seem like there’s very little to hang the DMCA claims on. The plaintiffs would basically have to argue under 17 U.S. Code § 1202(b)(1)

(b) Removal or Alteration of Copyright Management Information.—No person shall, without the authority of the copyright owner or the law—
(1) intentionally remove or alter any copyright management information,
(2) distribute or import for distribution copyright management information knowing that the copyright management information has been removed or altered without authority of the copyright owner or the law, or
(3) distribute, import for distribution, or publicly perform works, copies of works, or phonorecords, knowing that copyright management information has been removed or altered without authority of the copyright owner or the law,
knowing, or, with respect to civil remedies under section 1203, having reasonable grounds to know, that it will induce, enable, facilitate, or conceal an infringement of any right under this title.

that mere creation (even in the absence of distribution) of generative AIs is prohibited by the DMCA because every act of creating a model removes copyright management information (CMI) and that the information was stripped specifically to “induce, enable, facilitate, or conceal an infringement.” That seems like an unpersuasive argument on many fronts: the models don’t always strip CMI, the models aren’t necessarily storing enough of the training data for it to retain copyright protection anyway (and some training data is ignored by the model entirely or “forgotten” later), the primary purpose of model creation isn’t the concealment of an infringement, the stripping of the CMI is more a byproduct of model creation rather than something done by the models “by design,” it’s not clear that any infringement is happening here at all, it’s definitely not clear that the defendants actually believe there is any infringement here as a matter of law, and more generally, it’s hard to argue that a law passed in 1998 specifically to streamline innovation and generally to “get with the times” was intended to wholesale ban an incredibly exciting technology that wouldn’t exist for another 20+ years.

Other Background

The Claims

The claims in these cases are basically the same as the claims related to Stable Diffusion and Midjourney minus the claims related to rights of publicity and breach of contract (which was specific to DeviantArt’s alleged violations of their own Terms of Service and Privacy Policy):

  • Direct copyright infringement related to copying, making derivative works, publicly displaying copies and distributing copies of copyrighted works and derivatives thereof.
  • Vicarious copyright infringement arising from the allegation that “every output of the […] language models is an infringing derivative work”
  • Removal of copyright management information under the DMCA
  • Unfair competition based on the DMCA violation
  • Unjust enrichment, vaguely for all of the above
  • Negligence, extremely vaguely for all of the above

There is also an additional claim against Meta for false assertion of copyright related to the fact that when the LLaMA model was leaked, Meta sent GitHub a takedown notice in which they asserted sole copyright ownership in the model.

Of note is that there are no claims here related to personally identifiable data, which did appear in the Copilot and Codex-related complaint. Given that I’ve personally read an entirely fictitious biography of myself generated by ChatGPT, there would probably be a lot for a court to chew on with respect to those sorts of claims, but this isn’t the right formulation of the class to bring such claims.

The Class

The class in both cases is basically everyone with a copyright in any of the training data. It’s a strangely broad choice given that the named plaintiffs are specifically book authors and in each case, the complaint alleges that the plaintiffs’ books were part of a dataset originating from “illegal shadow libraries” that made copyrighted books available in bulk via torrent systems. Why not limit the class to other book authors whose works were part of the same dataset? By naming the class so broadly, the plaintiffs make it harder to prove typicality, adequacy, or commonality and predominance because the LLMs in question were also trained on the broader Internet on many different types of works under many different licenses, works in the public domain, and works whose copyright was never registered. Does the author of a book actually have a lot in common with a Redditor or someone publishing data on local erosion patterns under a public domain dedication? These copyright holders look to have different interests, be differently situated, and to have different questions of law and facts.

Like the classes in all the other generative AI cases brought by this group of attorneys, the classes here don’t condition participation on injury. Just because a work was part of the training data doesn’t mean the work 1) is actually part of the model, 2) is part of the model in sufficient detail to still be subject to copyright, 3) is actually outputted by the model (or a derivative of it is), or 4) is outputted by the model in sufficient detail to still be subject to copyright.

Conclusion

Although the ChatGPT and LLaMA-related cases make some new, rather startling allegations, these cases feel weaker than the other ones the group has filed so far. The inability to prompt either model into actually outputting a derivative work from the training data will require the plaintiffs to focus on the actual act of training the model and the details of what a model is and how it works when they go to trial, making the claims and allegations here somewhat academic and theoretical (if not entirely impenetrable) from the viewpoint of a potential jury. There’s no smoking gun. There’s no clear narrative about how Sarah Silverman is losing out on book sales because ChatGPT is stealing her jokes, etc. At bottom, the plaintiffs will have to convince the jury that even though these LLMs aren’t actually stealing anyone’s work, and to the contrary seem to be providing helpful fact-based information (the book summaries), they should nevertheless handsomely reward the authors for losing out on an entirely new revenue stream that basically only exists thanks to a dense and obscure tangle of laws and technical details. To me, the LLMs seem like the AIs most vulnerable to legal attack, yet these cases as currently presented strike me as the least worrisome ones.

Evaluating Generative AI Licensing Terms

Last week I presented to the IP Section of the California Lawyers Association on the topic of “Evaluating Generative AI Licensing Terms.” This was aimed at lawyers and procurement professionals familiar with tech transactions looking to license AI/ML technologies on behalf of their clients. It contained a brief introduction to AI/ML technologies, an exploration of special risks related to these technologies, commonly negotiated license provisions, and risk mitigation measures. Many of you asked for a copy of the slides afterwards – feel free to get them here.

AI Licensing Can’t Balance “Open” with “Responsible” 

AI researchers looking to publicly share their AI/ML models while limiting the possibility that the models may be used inappropriately drove the creation of “Responsible AI licenses” by the RAIL group (“RAIL licenses”).1 Today these licenses are used by a number of popular models, including Stable Diffusion, the BLOOM Large Language Model, and StarCoder. RAIL licenses were inspired by open source licensing and they are very similar to the Apache 2.0 license, with the addition of an attachment outlining use restrictions for the model. The “inappropriateness” RAIL licenses are targeting is wide-ranging.

Background on RAIL Licenses

Technically, these licenses are not open source licenses per the definitions of open source promulgated by the Open Source Initiative or the Free Software Foundation. The use restrictions in these licenses’ Attachment A, such as a ban on using the models to provide medical advice or medical results or using them for law enforcement, discriminate against particular fields of endeavor in contravention of fundamental open source principles. Because the licenses allow licensees to pass on the models under their own licenses of choice, provided they flow down the use restrictions, the licenses only promise a very limited amount of “openness” or “freedom” in the traditional OSS-specific sense. Unlike open source licenses, downstream users of RAIL-licensed models are not required to receive the same rights to use, modify, or distribute the models as the original licensee. 

The use restrictions are worth reading in full if you haven’t already. Here are the ones in BigCode’s StarCoder license (they vary a bit from one RAIL license to another):

You agree not to Use the Model or Modifications of the Model:

(a) In any way that violates any applicable national, federal, state, local or international law or regulation;

(b) For the purpose of exploiting, Harming or attempting to exploit or harm minors in any way;

(c) To generate and/or disseminate malware (including – but not limited to – ransomware) or any other content to be used for the purpose of Harming electronic systems;

(d) To generate or disseminate verifiably false information and/or content with the purpose of Harming others;

(e) To generate or disseminate personal identifiable information with the purpose of Harming others;

(f) To generate or disseminate information (including – but not limited to – images, code, posts, articles), and place the information in any public context (including – but not limited to – bot generating tweets) without expressly and intelligibly disclaiming that the information and/or content is machine generated;

(g) To intentionally defame, disparage or otherwise harass others;

(h) To impersonate or attempt to impersonate human beings for purposes of deception;

(i) For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation without expressly and intelligibly disclaiming that the creation or modification of the obligation is machine generated;

(j) For any Use intended to discriminate against or Harm individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics;

(k) To intentionally exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;

(l) For any Use intended to discriminate against individuals or groups based on legally protected characteristics or categories;

(m) To provide medical advice or medical results interpretation that is intended to be a substitute for professional medical advice, diagnosis, or treatment;

(n) For fully automated decision making in administration of justice, law enforcement, immigration or asylum processes.

The scope of the concerns here is remarkable. Some are rather practical (let users know they’re talking to an AI) and others are common in licensing generally (don’t use the technology to break the law). But, many of these read as attempts to prevent the technology from enabling totalitarianism and police states, generally. Licenses for general-purpose technologies have long contained restrictions related to using certain technologies in extremely dangerous, mission-critical contexts (like to manage nuclear power plants or conduct remote surgeries); those sorts of disclaimers or limitations arose because the software was simply not hardened to meet the security and performance standards those uses would require. It’s only in the last 15 years or so that such licenses have attempted to limit usage based on ethical considerations.

A spate of so-called “ethical licenses” emerged in the last 15 years, with aims that included preventing companies from using the software if they worked with ICE, if they were Chinese companies that required long, grueling working hours, and if they were harming underprivileged groups. These licenses have not enjoyed widespread use or support; very few projects have adopted them and they have been banned2 at virtually every tech company sophisticated enough to have an open source licensing policy. But the RAIL licenses are probably the most far-reaching licenses for publicly available technology that have ever gained any notoriety, pursuing not just one particular type of harm, but contractually attempting to require adherence to what amounts to an expression of a wide variety of Western, liberal principles.

As you can see, many of these restrictions are nebulous even at face value. You’ll notice that none of the uses of the word “discriminate” are tied to any legal understanding of what discrimination might be, leaving open the interpretation that simply sorting people by a protected or unprotected characteristic is prohibited “discrimination” under this license.3 “Harm” is defined broadly: it “includes but is not limited to physical, mental, psychological, financial and reputational damage, pain, or loss.” Note that “Harm” isn’t just harm in violation of the law or undue or unjust harm. It would include harm like sending a rightly convicted person to jail or hurting someone’s reputation by publishing true things about them.

On the face of it, (g) would prevent a reporter from using an AI to assist in writing an article on a company dumping chemicals in the water, since that would be disparaging, (e) would prohibit someone from asking an AI about the names and addresses of people on the public sex offender registry, since that would be a use of personally identifiable information likely to hurt someone’s feelings, and both (j) and (l) would prevent someone from using an AI to assist in targeting men for boxer brief advertisements, since that would be discrimination based on a predicted personal characteristic and based on a protected category. An entire essay could be written about the possible meaning of nearly any one of these provisions. The vagueness here would likely make some of these restrictions unenforceable in a court of law and leave the others open to unpredictable case-by-case determinations. 

Like open source licenses, the RAIL licenses are styled as copyright licenses. The ability to enforce them rests on the model owner’s ability to prove that they have copyright ownership in the model and that a violation of the RAIL license is therefore copyright infringement. Given that AI models are mostly just numbers representing parameter weights, there is deep skepticism in the IP world that the models are even copyrightable. If not, the license is fully unenforceable. It would need to be significantly rewritten solely as a contract in order to be enforceable, which is difficult to do because every country (and even the states therein) has very different laws that apply to contract interpretation, whereas copyright law was harmonized almost worldwide by the Berne Convention.
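
To make the “mostly just numbers” point concrete, here is a minimal sketch using hypothetical toy weights (not any real model): once training is done, the artifact that gets shared is essentially a set of named arrays of floating-point parameters written to disk.

```python
# Hypothetical toy "model" checkpoint: nothing but named arrays of floats.
import numpy as np

rng = np.random.default_rng(42)
weights = {
    "layer_0_attention": rng.normal(size=(4, 4)),
    "layer_0_mlp": rng.normal(size=(4, 16)),
    "output_head": rng.normal(size=(16, 8)),
}

# Saving the model is just serializing those numbers; the file contains no
# human-readable prose, images, or other conventionally expressive content.
np.savez("toy_checkpoint.npz", **weights)

restored = np.load("toy_checkpoint.npz")
print({name: restored[name].shape for name in restored.files})
# {'layer_0_attention': (4, 4), 'layer_0_mlp': (4, 16), 'output_head': (16, 8)}
```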

There is also a strong possibility that the use restrictions in RAIL licenses could constitute copyright misuse in the United States (and a few other countries). This doctrine prevents copyright enforcement where the copyright owner has conditioned a copyright license on restrictions or limitations well outside the scope of the monopoly rights granted by the Copyright Act. Examples of copyright misuse include licenses that require non-competition for 100 years and licenses that prevent the licensee from sharing any data stored in the software with others. Given the vagueness of the restrictions, I don’t think it’s hard to imagine a use case that might be read to violate those restrictions, but in which a court would be hard-pressed to enforce them (such as my example of a journalist writing about clean water, above). 

For purposes of analyzing and assessing the various enforcement mechanisms possible in RAIL licenses, I’m going to assume that RAIL licenses can surmount the potential legal challenges mentioned above and that it is possible to make better and clearer restrictions that are worth trying to enact.

RAIL Licensed Models Incorporated in Other Products and Services

RAIL licenses were drafted in the context of AI researchers sharing naked models (without the software to run them or the training data), often foundational models, in public spaces like the Hugging Face platform. Thus, it’s not entirely clear how the licenses are supposed to apply to entities or individuals incorporating a model into their own products and services. On the one hand, the RAIL licenses allow the initial licensee to re-license the model under their own license provided that they flow down the use restrictions in the RAIL license (this is “sublicensing” and the downstream users are “sublicensees”). But, on the other hand, the RAIL licenses still require the initial licensee to provide downstream licensees with a copy of the RAIL license. It’s not clear why downstream licensees need to see a license that they may not actually be subject to. 

Although the RAIL licenses require flowing down the use restrictions to downstream licensees, the license does not further require that the initial licensee enforce those restrictions4 against downstream licensees (including by terminating access rights for downstream licensees) nor does the license terminate for the initial licensee if downstream licensees violate those restrictions. As such, the owners of the model can really only enforce the RAIL license against the initial licensee and only if the licensee itself has violated the license, not if one of its downstream licensees has violated the RAIL flowdowns. Since there is no privity of contract between the model owners and the downstream licensees (i.e. there is no contract between them) when the initial licensee sublicenses to the downstream licensee, model owners have no standing to enforce any of the use restrictions against them (or their subsequent downstream licensees).5 The use restrictions are not very effective since everyone who receives the model from someone other than the licensor (the model owner) is beyond the reach of legal enforcement from the model owners.

Even if the initial licensee wanted to sue the downstream licensee, they’d be limited to contract claims as only exclusive licensees have standing to sue for copyright infringement, and that significantly limits the appeal of enforcement since the initial licensee would need to prove that they were personally, financially damaged by the violations of the downstream licensee. That’s probably not true in most instances of violations of the RAIL use restrictions.

Alternative Approaches to License Enforcement

The RAIL licenses could potentially be modified in two different ways to make sure that no users of a RAIL licensed model are beyond reach for those wishing to enforce the license. The first approach is to make the RAIL licenses look more like commercial licenses, with more restrictions and a clear framework for how upstream licensees take responsibility for downstream users. The second approach is to hew more closely to the open source licensing model, by prohibiting sublicensing, and forming a direct contractual relationship between the model owner and each user. 

The Commercial Approach

The companies named here are provided solely as an example and this diagram does not necessarily reflect any real-world relationships.

Commercial contracts that require that the initial licensee flow certain terms down to downstream licensees (but not the entire agreement) also reserve the right to terminate the license of the initial licensee if downstream licensees fail to comply with those flowdowns. They often require that the initial licensee report any downstream violations. They can further ask the initial licensee to indemnify the licensor against any claims or losses related to the actions of the downstream licensees. 

Companies that care a lot about downstream uses will even go so far as to require the initial licensee to make the company (the licensor) a third party beneficiary to any agreements between the initial licensee and the downstream licensees so that they can enforce those flowdowns even when the initial licensee has no interest in doing so. This can be particularly appealing to the licensor if the initial licensee is likely to be smaller than the customers (downstream licensees) it serves since that power dynamic is unlikely to yield effective enforcement of the licensor’s rights and the initial licensee doesn’t have deep enough pockets to make suing them for their customers’ actions profitable. Designating a third party beneficiary in a tech transaction is fairly unusual, but it’s worth discussing at length here because the initial licensees in this particular context are unlikely to engage in much license enforcement on their own.6

Generally, downstream licensees will chafe at the addition of a third party beneficiary because:

  • They don’t like dealing with a third party they don’t know or have any relationship with – trust and relationships are important. Some third parties are more litigious than others
  • It involves allowing a third party to have access to their contracts or potentially their confidential information, but the agreement is solely between the initial licensee and the downstream licensee and the downstream licensee therefore cannot bind the third party beneficiary to any confidentiality or data privacy obligations
  • When a third party beneficiary includes “…and their affiliates,” this could potentially mean hundreds of plaintiffs the downstream licensee has never even heard of before and which can change as companies are bought and sold. That can significantly increase litigation risk and the risk of exposing confidential information to unknown entities without restriction
  • It leads to more complex and expensive litigation

Commercial agreements also commonly include provisions that subject initial licensees to audits to make sure they’re complying with the agreement. They may also include provisions requiring the initial licensee to create and maintain certain records so that they may be available for an audit or legal discovery, and such records might need to contain information about downstream licensees.  Where the parties anticipate difficulty in bringing copyright claims or proving damages resulting from contract breach, the parties may agree to a liquidated damages provision, specifying up-front what a breach will cost the licensee.

All the additional terms discussed above could be flowed down to downstream licensees of downstream licensees, and so on.

The Open Source Approach

The companies named here are provided solely as an example and this diagram does not necessarily reflect any real-world relationships.

On the other hand, open source licenses do not rely on flowdowns to downstream users. Open source licenses follow a direct licensing model, wherein every user is in a direct license with the copyright holders of the software (i.e. the project contributors). Another way of putting this is that there are no downstream licensees or sublicensees because everyone receives their license directly from the copyright holder, even if they might receive the actual software and a copy of the license language from someone else (you might see references to downstream users, though). If a company incorporates open source licensed software into its products, the open source software remains under the open source license, regardless of how the company chooses to license the other elements of those products, and its customers receive a bundle of licenses – one for the company’s own software, and one for each OSS component in the product.

Any violations of the open source license by such customers (or their own customers) can be enforced by the copyright holders. Downstream users never receive fewer rights than upstream users and the software itself thus remains “open” and “free.”

Evaluating the Commercial Approach for the AI Space

The commercial approach is useful when the subject of the license is a proprietary piece of technology. In other words, there is no interest in the technology being widely and freely available to others and there is no interest in seeing research and development related to the technology from anyone but the licensor (the initial creator/owner of the technology) or the licensor’s chosen partners or contractors. That’s because the initial licensee of such technology must comply with the licensor’s restrictions in order to maintain the license and avoid potential legal claims. If they don’t, the licensor could get an injunction to completely prevent the initial licensee from selling their products until the licensor’s technology has been removed from those products. Missing out on several months of revenue would be detrimental to most businesses and public lawsuits would deter new customers from signing on. And, of course, the licensee would be on the hook for damages related to copyright infringement and contract breach. The licensee might even owe money to its other customers if they granted them an IP warranty or breached their contracts with them by ceasing to make their product available.

If the initial licensee is responsible for downstream use, then it’s in their best interest to limit uses of the technology to a set of pre-approved use cases and prevent users from using the technology in an unforeseen way. This has traditionally translated into initial licensees forbidding downstream licensees from modifying the technology, distributing the technology, incorporating the technology into their own products, or reverse engineering the technology (sometimes at the behest of the licensor directly, but sometimes just as a risk mitigation measure). This often goes hand in hand with not providing downstream licensees with source code to the technology and with implementing control systems like DRM, licensee keys, etc. Even if downstream licensees are allowed to redistribute the technology in certain circumstances, many of the above controls and measures will remain in play. In sum, these controls and measures mean that the technology at issue is not “open” or “free.” Additional research and development related to the technology will typically begin and end with the licensor.

In the AI space, preventing downstream licensees from violating the use restrictions in the RAIL licenses will likewise mean lockdown measures. Sensible companies will limit the ways in which their customers can interact with the AI model as well as their ability to fine-tune it or do prompt-tuning. The fear of a customer circumventing some of the controls a company may put in place to comply with the RAIL license means that in the absence of regulations requiring otherwise, the exact nature of those controls will be kept secret and so will the code implementing those controls, limiting the transparency that AI researchers and the media could otherwise provide on everything from the performance abilities of the model to the model’s safety. Since the training data itself can offer information useful for circumvention, disclosing its contents could also become risky and undesirable. 

In the B2B software context, knowledge of how downstream licensees are using a particular technology is generally rooted in either monitoring license key usage and/or logins or via physical audits of documents or computers in search of information related to payments, inventory, number of users, number of downstream licensees, and product descriptions or documentation, etc. That’s because the salient questions up until now have revolved around whether a product has been distributed when it shouldn’t have or whether too many users are using it, etc. In some cases certain uses can be identified just by seeing what a downstream licensee is doing and saying in public. Technology companies that sell to other companies (not to individuals) mostly rely on metadata and performance/utilization data to monitor customer accounts (initial licensees) for potential license violations, not the contents of customer data (which is often not even the customer’s own data, but data of the customer’s customers, i.e. downstream licensees – for example, AWS hosts a data analysis platform; that platform in turn ingests data from its own customers, who are retailers; and the data those retailers provide may actually belong to individual consumers or employees).
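
As a concrete illustration of the difference, here is a minimal, hypothetical sketch of metadata-only account monitoring of the kind described above; it records how much a customer account is using a service without ever parsing what the customer actually sent.

```python
# Hypothetical sketch: per-account utilization telemetry that never inspects
# the contents of customer data.
import time
from dataclasses import dataclass, field

@dataclass
class AccountUsage:
    request_count: int = 0
    total_payload_bytes: int = 0
    last_seen: float = field(default_factory=time.time)

usage: dict[str, AccountUsage] = {}

def record_request(account_id: str, payload: bytes) -> None:
    """Track how much an account uses the service -- not what it sends."""
    stats = usage.setdefault(account_id, AccountUsage())
    stats.request_count += 1
    stats.total_payload_bytes += len(payload)  # size only; the body stays opaque
    stats.last_seen = time.time()

record_request("acct-123", b"<opaque, possibly encrypted customer payload>")
print(usage["acct-123"].request_count)  # 1

# Detecting a RAIL-style violation (say, "providing medical advice") would
# require reading and interpreting the payload itself, which this kind of
# monitoring deliberately never does.
```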

Best practices in the software world include only giving a small subset of employees direct access to customer accounts, and that subset generally only has access to resolve specific security issues and support requests; their access is carefully logged and monitored, and abuse of such access quickly leads to termination. Most B2B software companies enter into data protection addendums with their customers, promising compliance with these practices. While companies often reserve the right to access customer accounts for things like enforcement of their agreement with the customer, per the above, in practice most such enforcement doesn’t require access to customer data directly. Many companies that might use AI models as part of their services also do not receive any customer data in the clear – processing that needs to happen with unencrypted data is done in customer-controlled environments before being sent to the company; they only receive encrypted data that only their downstream licensees can decrypt. And of course, good old-fashioned on-prem software providers receive no customer data to monitor at all.

Making sure that downstream licensees aren’t using AI models to provide medical advice, for example, would involve a level of surveillance that most B2B tech companies do not normally engage in, even if they do process customer data in the clear as part of a SaaS offering. Assessing a violation like that requires constant monitoring of the contents of that data, who a customer’s customers are, and more people viewing customer data and personally identifiable information for purposes other than providing people with the services they’re paying for. Suddenly, compliance, legal, and engineering staff are exchanging, citing, and discussing a heap of sensitive health-related information to distinguish what’s advice and what’s just a factual statement. Given the grand scope and ambiguity of the RAIL use restrictions, deciding whether or not any customer complies with all of them would require each company to form something like Facebook’s Oversight Board, except with more expertise in a wider array of areas and without many of the legal protections that allow companies to eschew proactive surveillance and enforcement in favor of waiting for complaints of violations from third parties.7 The desire to make sure downstream licensees are using AI responsibly is in direct tension with their privacy and data protection rights, which is ironic because the RAIL licenses themselves prohibit AI models from abusing personally identifiable data.

Most technology companies also experience instances where customers are in violation of their agreements, but they choose not to terminate those customers. That can happen when an alternative deal is struck with the customer that everyone is happy with (which may or may not get documented in a contract somewhere), when the violation is fairly minor, when the customer is also an important partner and the partnership is much more lucrative than the customer arrangement, or when the customer serves as such an important signal of the company’s value (it’s a “big logo”) that it’s worthwhile to keep them on even in the face of major violations. In the B2B space, companies make money by licensing products and services and they lose money when they terminate customers or file lawsuits against them. The reality is that individuals and corporations act on the things that most deeply affect them personally and they don’t, can’t, and shouldn’t expend limitless resources to surveil and monitor their customers or cripple the privacy and security of their products in order to act as a substitute for traditional government authority. Accepting a license that comes with such a responsibility would probably be a violation of the company’s fiduciary duties to its shareholders.

In the worst case scenario, if companies do not think they can put technical controls in place to comply with the RAIL use restrictions, if they believe that enforcing these restrictions increases their data privacy-related risks (particularly if they interpret these provisions as preventing them from encrypting customer data), or if they think it is too expensive and time-consuming to chase down every potential or actual violation, companies will simply ask customers to download AI models themselves and offer API integrations between their products and these models.8 That arrangement would fully absolve the companies of putting into place any controls, monitoring, or limitations that may be necessary to comply with RAIL use restrictions, putting those obligations squarely on the shoulders of their customers.
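
That arrangement might look something like the following sketch, in which the vendor’s product calls an inference endpoint that the customer has downloaded and runs entirely in its own environment (the URL and function names are hypothetical).

```python
# Hypothetical sketch: the vendor never hosts or controls the model; its
# product just calls an endpoint the customer operates in its own environment.
import json
import urllib.request

CUSTOMER_MODEL_URL = "https://models.customer.example/v1/generate"  # hypothetical

def summarize_ticket(ticket_text: str) -> str:
    """Vendor-side feature that delegates generation to the customer's model."""
    payload = json.dumps({"prompt": f"Summarize this support ticket: {ticket_text}"})
    req = urllib.request.Request(
        CUSTOMER_MODEL_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]

# Any controls, filters, or use-restriction compliance now live (or don't)
# behind the customer's endpoint, outside the vendor's reach.
```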

The problem with this is that the customers of Microsoft, for example, include veterinary offices and hair salons – in other words, lots of customers who do not have the capacity or interest in making AI models safer or policing their own employees’ and customers’ use. While various regulatory agencies may have the capacity to police several thousand tech companies and force them to develop a set of best practices, there is very little they can do when everyone is running their own instance of a foundational AI model to which no one’s added any compliance or safety mechanisms and very importantly, no one is incentivized to add such mechanisms because the people capable of doing the work no longer bear any potential liability and the people who bear liability will, in practice, face a low chance of enforcement. At that point, regulators can either give up on AI safety or they can ban publicly available AI models. Such a move would again run counter to the desire for transparency around how models are trained and how they work, pushing AI research and development back into the hands of a handful of corporations, and limiting AI research and development by hampering people’s ability to iterate on each other’s work. 

Changing the RAIL licenses to work more like commercial agreements may further ensure “responsible” AI use in one particular sense, but it would significantly cut against openness and the benefits related to transparency and innovation that openness brings. That “responsibility” would be of a limited nature – it would be easy to see wild license violations and corporations would likely act on such violations if committed by their downstream licensees, but it would no longer be possible for outsiders to closely study modified or fine-tuned models for bias or emergent behaviors, for example. It would be impossible for the public to know who all the users of a particular model are or what they’re using it for. Corporations would play a greater role in deciding what does and doesn’t violate the license, and other researchers, journalists, and regulators would have less opportunity to weigh in. If the cost of violating the license is high, as it is in commercial agreements, and as it would be if a company came to depend on a model under a RAIL license, then there is also a strong incentive to hide any violations that are found, rather than letting affected individuals or entities know about them. As in the data breach context, the only way to counter the incentives around hiding violations is to pass laws mandating their reporting and creating an even higher penalty for failure to disclose than for violating the license.9 But that doesn’t work in the AI context, where the entire point of the RAIL licenses is to substitute for regulations that don’t exist yet. 

Evaluating the Open Source Approach for the AI Space

The open source approach is, at first glance, unsatisfying for a model owner who wants to go after just one company and not all of its customers. In other words, for a model owner who wants to essentially delegate responsibility for license enforcement down the supply chain. But, the goals of the open source movement haven’t been hampered by the direct licensing model. On the contrary, that model is core to the movement’s success and open source software’s ubiquity.

The open source movement, as it turns out, has been wildly successful without very much need to sue anyone at all. Just a handful of enforcement actions has created legal precedent and legal awareness around the importance of open source license compliance. The software industry has dramatically improved its open source license compliance practices in the last 30 years. Partially this is because the potential for enforcement and the expense related to it has been clarified – not just to individual companies, but to their potential customers, investors and acquirers. Partially it is because the open source movement has many adherents and beneficiaries; companies that don’t comply have a harder time recruiting good engineers who care about open source and they have a harder time forming partnerships with other companies who do not want to harm their reputations. 

But perhaps the biggest reason for the movement’s success is simply that good software was shipped under open source licenses that could never be stripped from the software. That meant that not only could lots of people use the software freely, but lots of people could improve it, making many pieces of open source software the de facto gold standard in that domain. And once the software became a gold standard, no one could avoid using or avoid eventually becoming aware of open source licenses. 

If the AI movement wanted to replicate the OSS movement’s success, allowing AI models and the related software (under RAIL software-specific licenses) to be sublicensed under other licenses was a fatal flaw.10 It undercuts the ability of others to iterate on AI models and improve them, weakening the strength of openly developed AI models and in turn the ubiquity and importance of the RAIL licenses themselves. In seeking maximum “responsibility,” there’s a good chance the RAIL licenses become, for all intents and purposes, entirely irrelevant because the best models aren’t going to be released under RAIL licenses at all. 

There are also a lot of differences between the motivations of licensors in the open source space and the licensors in the AI model space. The first wave of people who licensed their work under an open source license did so because they fundamentally believed in a set of fairly succinct open source principles of openness and freedom related to code.11 They cared not just about getting attention and accolades for their work, but specifically about the ability of others to use and improve upon their work freely. This is where I believe there is a schism between the open source world and the AI world: many people licensing their code under RAIL licenses don’t have a distinct and articulated (or articulable) desire for how they want their models to be used. And they don’t necessarily even agree with each other about what constitutes a good or bad use. They may want to be seen as generally “responsible,” but being responsible isn’t a movement; it’s not a set of guiding principles; it’s not a standard someone can clearly meet or fail. The general sort of responsibility imbued in RAIL licenses is the same sort of “don’t be evil” ethos that’s always been floating in the ether and not a new, better way of doing things.

OSS licensors are also affected by license violations far more directly than AI licensors, and they have more resources available to them if they choose to enforce their license. In the open source world, someone abusing the open source license means that the copyright holder is not given back contributions to the project (by licensees making them publicly available as they should), is not attributed, or is robbed of revenues from alternative commercial arrangements with people who can’t or don’t want to comply with the open source license. Although lawsuits are fairly rare in the open source space, there is a culture and history of enforcement actions where copyright holders merely ask for and receive compliance in order to redress a violation. That culture is in part predicated on the fact that getting into compliance with an OSS license or getting a commercial license for the technology when it’s available is generally just one of the costs of doing business; it is rarely a matter that could make or break a company unless the company decides to literally go for broke and fight obvious violations in court. There are also foundations and open source compliance-focused organizations that help to educate the public about OSS licenses, provide resources for copyright holders trying to enforce their license, and sometimes act as counsel for a copyright holder who wants to enforce their license.

AI licensors are likely to be affected only very indirectly, if at all, by violations of RAIL licenses. Are individual PhD candidates at Berkeley really going to try to enforce against companies using their models for facial recognition technology in Indonesia? Is a Meta employee fine-tuning LLaMA in her personal time going to sue a pharmaceutical company in Norway for using her work to suggest various drugs to people? Today, there are no private organizations funded and staffed to enforce RAIL licenses. The resources or desire for people to enforce any of the RAIL restrictions are very limited. Whether it’s model developers to whom enforcement falls, as in the OSS direct licensing approach, or it’s anyone upstream of a downstream violation, as in the commercial sublicensing approach, no one individual or company has the desire and funds to enforce all of the license restrictions everywhere in the world at all times, no matter how egregious and harmful they may be. Even occasional enforcement is likely to be extremely rare in part because of resource constraints, in part because of a lack of interest, but also in part because the people who care the most about enforcement wouldn’t hand out their models to unknown entities on public platforms.

Unlike in the OSS space, some violations of the RAIL licenses may also be impossible for certain companies to remedy. Defense contractors using AI models to help process PII in order to identify, locate, and arrest Russian soldiers in Ukraine, for example, can’t simultaneously stay in business and come into compliance with the license’s prohibition on using PII to “harm” others. That means a potential license violation could, in fact, make or break the business and the same culture of “please just come into compliance, no need for nasty lawsuits” is unlikely to take hold in the AI space. As with the commercial approach, this incentivizes people to hide the fact that they’re using RAIL-licensed models and their violations. And in particular, if you know the violations applicable to your business or your customers cannot be remedied, why look for them at all?

The success of the OSS movement is enviable and impossible to overstate. But the many elements that have led to its success simply can’t be replicated in the AI space. The requirements of OSS licenses are much narrower and less open to interpretation than the requirements of the RAIL licenses. The people applying the licenses to their technology are much more directly affected by license violations than those in the AI space, they have far more resources available to them, and the types of violations they can allege are relatively clear cut. The scope and expense of proving in court that someone didn’t make a source code offer is orders of magnitude smaller than convincing a court in, say, Myanmar, that a model has been used to discriminate against a religious minority (one the government has an active policy of discriminating against) and that the corporation running it should be prevented from selling its products until the model is removed. Not to mention that enforcing many of the RAIL provisions constitutes not just an arcane commercial licensing dispute, but a provocative political statement that may lead to threats, harassment, and even torture or death for the people making it.  

What Should Be Done with RAIL Licenses?

RAIL licenses were drafted because there was (and remains) little AI-related regulation and little AI-related government or corporate expertise available. The licenses are an attempt to offer a band-aid or substitute for this lack of awareness and regulation. They are also social and political statements about the role of AI researchers in the world, chastened, perhaps, by the regrets of some of the scientists who worked on the Manhattan Project, or by the general history of scientists driven by curiosity, excitement, and hubris who did not give a thought as to how the results of their work would actually be used in the world. 

The use restrictions in the RAIL licenses would seem to indicate that the people licensing their models this way believe that their technology is profoundly dangerous and can be used by corporations and governments alike to effect systemic bias, voter manipulation, and totalitarian control in a manner and to a degree not previously possible. I don’t know how true that is; I’m just a lawyer, but I’m willing to accept these assertions at face value from the people who know the technology best. If true, today’s RAIL licenses amount to little more than a warning sticker on canisters of purified uranium being jettisoned into the sky with t-shirt guns. They provide no real control over how any of the technologies licensed under them are used. And unfortunately, neither the commercial nor the open source approaches discussed above create a good balance between openness and safety. 

As previously stated, it’s not clear the models are copyrightable and subject to any license on that basis at all or that the licenses would survive misuse challenges. Even if the licenses are valid, like any legal instrument, they’re only useful if the entities subject to them reside in countries with operational legal systems open to foreign plaintiffs and whose own laws and politics are aligned with the goals and values espoused in the licenses. To put this another way, no license on earth is going to stop the Chinese Communist Party from using a model from Hugging Face to maintain biometric identification on its citizens, if that’s what it wants to do. If a terrorist organization wants to use the models to plan future attacks against Americans, maybe there’s no local court that cares to stop them from doing so. 

What exactly is the sort of model that’s ok for North Korea to use however it wants, but whose use will require extraordinary levels of surveillance in the Western world? Are the licensors essentially arming hostile governments and terrorist organizations and only retaining control over law-abiding people? Do the licensors think it doesn’t matter if existing totalitarian regimes entrench their power over their helpless citizens even further so long as the licenses allow them to keep such practices at bay in the Western world? Is their technology not actually dangerous enough to warrant all these use restrictions? 

This discussion is so fraught and complex because there actually is no model for how to regulate a dangerous technology that developed countries fail to recognize as dangerous. Certainly, there’s no model for allowing people to trade this technology freely and publicly while also ensuring safety and transparency about who is using it, for what, and how. Except perhaps in failed states, things that are truly dangerous, like military weapons, nuclear technologies, viruses created for scientific research, etc., simply can’t be left in the public square for anyone to come pick up and play with. No doubt all of these technologies would benefit from open and collaborative world-wide development of the sort made possible by the open source software movement, but openness and even progress don’t always trump safety and responsibility. 

Licensing of any variety can’t substitute for meaningful, international AI regulation. The ultimate answer as to how to balance openness with responsibility, probably like most of this blog post, is deeply unsatisfying and uncomfortable: if you really think your technology can be misused in such profoundly harmful ways, don’t make it publicly available on the Internet. Only provide it to people you trust who are going to be honest with you about how they’re using it, under legally enforceable agreements. All technology can be used for both good and evil, but it’s still our collective responsibility to prevent evil where we can.


  1. This post had additional information related to the history of these licenses and the types of concerns they attempt to address.
  2. You can read more about the reasons for this here.
  3. This is particularly likely since the restrictions already require compliance with all applicable law. Basic principles of contract interpretation would require additional restrictions on “discrimination” to mean something other than the sort banned by applicable law, since otherwise there would have been no need to write a separate provision addressing the same matter.
  4. Requiring an initial licensee to “enforce” the flowdowns against downstream licensees is not a meaningful requirement. In the US, legal “enforcement” can mean issuing a cease and desist, filing a legal complaint, settling the lawsuit for a lot or for a little or for nothing at all, going to trial and getting a verdict from a judge or jury, appealing to an appeals court, or appealing to a court of final judgment. Without more specificity, it’s impossible to know exactly how far a plaintiff is supposed to take a matter to satisfy the licensor. Even if the licensor were to specify something like “must pursue all violations to final judgment from a final court of appeals,” it would still be up to the plaintiff to decide exactly what claims to file and what sorts of remedies to ask for, and those would have to depend on the exact nature of the violation and the evidence available to the plaintiff. This would be an extremely difficult commitment for any company to make. Taking a case all the way to the Supreme Court or a state supreme court would likely mean millions of dollars in legal fees and could open the company up to various counterclaims. And if a plaintiff persists in bringing a case that a judge has deemed frivolous or counter to public policy, the requirements of the license notwithstanding, the judge may issue penalties against the plaintiff or require it to pay the defendant’s legal fees. For these reasons, in my entire legal career, I have never seen “enforcement” against downstream licensees as a legal requirement in any license. However, it is not uncommon for licensors to require that initial licensees report license violations to them.
  5.  In the case where there is a license between the licensor and the user of the software (the licensee), the Ninth Circuit has held that restrictions that are not directly tied to the monopoly rights of copyright, like those commonly seen in Acceptable Use Policies, can only be enforced as breach of contract claims and not as copyright infringement claims. What may or may not be tied directly to a monopoly right is still an open question, but I think it’s extremely difficult to argue that any of the RAIL restrictions would meet this standard.

    When there is no license between the licensor and the user, the licensor’s claims of copyright infringement must be proven by demonstrating that the user has violated one of the exclusive rights guaranteed to copyright holders. See 17 U.S.C. § 106(3) and Sony Corp. of Am. v. Universal City Studios, Inc., 464 U.S. 417, 433, 104 S.Ct. 774, 78 L.Ed.2d 574 (1984). The use restrictions in the RAIL license are rather far removed from those rights. 

    That leaves the licensor solely with breach of contract claims, which can only be brought against entities one is in privity with – i.e. the entities one actually has a valid contract with. 
  6. Apple and Oracle are the only notable companies that come to mind who have done this for publicly available code, and they’ve moved away from this tactic in many of their licenses in recent years.
  7. Even if the use restrictions allowed licensees to wait for complaints and did not involve active monitoring in the style of DMCA violations, it would be unclear what sort of response to such a complaint would absolve the company of responsibility for the violation. Under the DMCA, if a service provider receives a take-down request, they take down the offending materials and send the request along to the party allegedly in violation. That party can then send the service provider a counter notice to get their materials back up. If after 14 days the complainant hasn’t sued the alleged violator, the material goes back up and the service provider has no liability for anything.

    In the case of a RAIL license violation, it’s less obvious what this procedure looks like, since the conversation isn’t just about one copyrighted work or even multiple copyrighted works, but about terminating an entire customer relationship. It’s an entirely different paradigm from the business-to-consumer (B2C) space where the DMCA is most commonly used. Unlike in the B2B space, in the B2C space users often aren’t paying for the service and the service provider’s contractual obligations to the user are very limited. The service provider’s limitation of liability is often set to a nominal amount. The DMCA also specifically shields service providers from any liability for taking down materials, even if they ultimately go back up. But in the B2B space, terminating a customer in violation of the agreement with the customer may mean not just loss of revenue from the customer, but also substantial damages that no law can shield the service provider from (and which no sane customer would waive their right to). 

    So, would the service provider have to terminate immediately upon receiving a complaint, without any investigation? If they do investigate and find no wrongdoing, is a simple “nothing here” response sufficient? Or is the service provider now expected to divulge detailed confidential customer information to anyone who feels like writing a complaint? How does the complainant know that an investigation took place or that it was conducted properly? A single license can’t recreate the entire DMCA in an entirely new domain that has little in common with that of the DMCA.
  8. In the open source world, this approach is not available for technology under strong copyleft licenses. There is consensus that if a product requires, for example, GPL-licensed code in order to function, then merely requiring a customer to download the GPL component themselves is not a sufficient course of action to avoid the implications of the GPL. In that case, even though the company isn’t distributing the GPL’ed code directly, the product would still be viewed as a derivative work of the GPL’ed code, and if it’s not under the GPL itself, then that would constitute a violation of the GPL and in turn, copyright (and perhaps patent) infringement. In the AI model context, it’s particularly difficult to have a conversation about what might be a derivative work of a model if the model itself likely isn’t copyrightable. I further address the possibility of applying copyleft-style requirements to RAIL licenses in Footnote 10.
  9. Even if the license made disclosure of violations a requirement, an additional condition on the license probably wouldn’t yield any additional damages, at least not in the US. Copyright holders in the US can ask either for actual damages related to their copyrighted works plus the infringer’s profits from the infringement, or for statutory damages. It’s not clear what actual damages might be to a copyright owner whose morals have been offended but who can’t show any economic losses. And calculating the infringer’s profits from the infringement will be the same no matter how many ways the work was actually infringed. If seeking statutory damages instead, each statutorily prescribed damage award is meant to cover one copyrighted work, not every infringement of that one work. This means that a disclosure requirement is really only effective if it comes from a government law or regulation with its own prescribed financial or criminal penalties that exceed those available under copyright law. 
  10. There is also the question of whether or not RAIL licenses should include copyleft elements, requiring any derivatives of the models or software to also be publicly released. In the OSS world, there are probably a lot of mixed feelings about how important copyleft mechanisms have actually been to the success of open source. After all, many permissively licensed OSS projects (without copyleft requirements) have seen great success and much collaboration, too. But in the AI space, a copyleft requirement is quite complicated. Assuming the models are even copyrightable, on the one hand, it might increase transparency, allowing more people to study and investigate fine-tuned and otherwise modified models. But on the other hand, such a requirement could undermine the ability of individuals or entities who have a valid reason to further train the models on proprietary or confidential information (including PII) to keep that data secure. It could also hinder developers who limit or restrict a model for safety purposes, by essentially handing people who wish to circumvent those limits the blueprints to do so. 
  11.  In practice, open source compliance comes down to attributing copyright holders and providing source code for the OSS used when required by the license. These are not difficult concepts to understand or measure except in really nuanced GPL-related analyses. But even there, the ultimate question is narrow: what needs to be open sourced?

Copilot and Snippet Scanning

By Kate Downing, with Input from Aaron Williamson

In the wake of Copilot’s release, I’ve seen an uptick in questions related to snippet scanning and whether or not that may be desirable for open source compliance purposes. I believe that the answer is still “no.”

First, GitHub has created filters that prevent Copilot from making suggestions that exactly match any public code on GitHub. I’m not aware of any open source scanning tool capable of identifying a non-exact snippet match, so I’m not sure what sort of snippet matches one might receive if these filters are activated – chances are the matches won’t be coming from Copilot suggestions. These filters are not hidden away or difficult to enable, and they must be turned on in order for an organization to be eligible for Copilot’s indemnity offer. They can be turned on for the entire organization.

It’s good practice for engineers to look skeptically at any lengthy Copilot suggestions, as the chances that a suggestion is copyrightable (and hence the possibility of copyright infringement related to using the suggestion) increase with its length. When receiving a lengthy suggestion, it’s also worth considering whether it may be better to get the same functionality by adding an open source dependency instead. That’s because a piece of distinct open source code from an actively managed project will be updated and patched by someone else, whereas an unidentifiable suggestion from Copilot will not be. Likewise, distinct OSS can trigger security alerts from various open source monitoring tools, which are less likely to identify a vulnerability in a file that just looks like company code rather than third-party code. GitHub has announced that it is also working on a feature that provides references to OSS projects for certain suggestions, making it even easier to add OSS dependencies when extensive functionality is desired. If engineers are looking closely at lengthy suggestions and the filters are also turned on, the chances of code that’s actually copyrightable ending up in a company’s product are quite low.

Second, even when a snippet scanner turns up an exact match, it might mean very little.  The snippet may not be copyrightable, or may reflect a common code pattern used by many projects. Remember that Copilot is basically autocomplete for code and it biases toward producing code that appears in the training data most often. Open source scanners might identify the code as coming from a particular project, but they’re incapable of listing ALL the projects the same code appears in. That means that even if you attribute the project identified by the scanner, that project may not even be the originator of that code. Some other project could have written it first and the attribution made by the scanner may be incorrect, or the authors of multiple projects may have written the same code independently. I’ve personally seen code scanners attribute snippets to very large, very popular projects, when the snippet is actually found in a subcomponent owned by someone else entirely, written long before the popular project came into existence. And of course, the more often the code has appeared in various projects, the more likely it is that the code is purely functional (and not copyrightable) and it appears in multiple projects because that’s just how something is done in a particular language.
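To make this concrete, here is a hypothetical illustration (invented for this post, not taken from any real scanner report) of the kind of purely functional snippet that appears, essentially verbatim, in countless unrelated C codebases. If a scanner flags a match like this and attributes it to one particular project, that attribution says very little about where the code actually originated:

```c
/* Hypothetical example of a "snippet match": a bounded string copy helper.
 * Many unrelated C projects contain a function essentially identical to
 * this one, because this is simply how the task is done in the language. */
#include <stdio.h>
#include <stddef.h>

static size_t copy_bounded(char *dst, const char *src, size_t size)
{
    size_t i = 0;
    if (size == 0)
        return 0;
    while (i < size - 1 && src[i] != '\0') {
        dst[i] = src[i];
        i++;
    }
    dst[i] = '\0';
    return i;  /* characters copied, excluding the terminator */
}

int main(void)
{
    char buf[8];
    copy_bounded(buf, "hello world", sizeof buf);
    printf("%s\n", buf);  /* prints "hello w" */
    return 0;
}
```

A scanner that matched the loop above against one popular project couldn’t tell you whether that project wrote it first, copied it from somewhere else, or simply converged on the same obvious implementation.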

Third, if the concern is patents rather than copyrights, I’d argue that it’s extremely difficult to embody an entire patent in just a snippet of code.

Fourth, one has to look practically at the possibility of actual legal enforcement in this context. I’m not aware of any litigation based merely on snippets. Every piece of open source-related litigation I’m aware of involved taking substantial portions of libraries, drivers, even operating systems without proper attribution or source code offers. Even if one were in the business of trolling, trolling merely on the basis of snippets and nothing more is just not profitable. There are so many companies out there not doing even basic compliance for entire Linux distributions that there’s really no reason to spend time and money arguing about much grayer cases like snippets, which the plaintiff is less likely to win and which will be more costly because the plaintiff will need to bring in evidence and experts to defend the copyrightability of the snippet. There is far less dispute about the copyrightability of entire libraries and operating systems.

Infringing snippets are also hard to find, particularly if they’re embedded in a SaaS product or software that is distributed only in executable (as opposed to source code) form. Techniques for finding open source software in binary software distributions are limited. Often, enforcement efforts are based on the inclusion of complete open source components, where the components can be identified by their filenames, or by the output when they are run. Open source components may also be identified by strings (quoted text) that are unique to that component, because when source code is compiled into binary form, those strings can still be found in the binary. But a short snippet compiled into another piece of software is unlikely to be identified by either technique.
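As a minimal sketch of why that is, consider the hypothetical C file below (the library name and banner text are invented for illustration). After compiling it, running the standard strings utility on the binary (for example, strings a.out | grep libexample) would reveal the banner, which is how whole components are often spotted in binary-only releases. The short helper function, by contrast, compiles down to a handful of anonymous instructions and leaves no comparable fingerprint.

```c
/* Hypothetical illustration: distinctive string literals survive compilation
 * and can be found with the standard `strings` utility, which is how whole
 * open source components are often identified in binary-only distributions. */
#include <limits.h>
#include <stdio.h>

/* An invented version banner: after compilation, `strings` run on the
 * binary will still show this text. */
static const char banner[] = "libexample 2.4.1 (c) Example Project contributors";

/* A short, purely functional snippet: once compiled, it is just a few
 * anonymous instructions with nothing tying it back to its source. */
static int clamp_add(int a, int b)
{
    if (b > 0 && a > INT_MAX - b)
        return INT_MAX;
    if (b < 0 && a < INT_MIN - b)
        return INT_MIN;
    return a + b;
}

int main(void)
{
    printf("%s\n", banner);
    printf("%d\n", clamp_add(INT_MAX, 5));  /* prints 2147483647 */
    return 0;
}
```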

In order to have standing to enforce a copyright license, a copyright holder has to register their copyrights. Most open source developers do not do this. Even many corporations do not do this. Back in 2018, there was a study about how many people actually complied with Stack Overflow’s Creative Commons ShareAlike 3.0 license. Stack Overflow is probably the single most common source of snippets picked up by open source scanners. But, the answer is that basically nobody complies with these licensing terms. In no small part, it’s because the people posting on Stack Overflow don’t bother to register their copyrights in their snippets. They also generally have no particular interest in enforcing those licensing terms. Expensive enforcement litigation makes sense for non-profits dedicated to enforcement, large corporations, and serial trolls, not everyday contributors, much less coders answering questions on public forums.

Fifth, snippet scanning is almost always a distraction from higher-priority compliance issues. For example, most organizations still don’t properly do open source compliance for virtualized or containerized images, failing to provide attribution or offer source code for entire containers, applications, and operating systems. So, spending time chasing down snippets while still not having figured out containerization is bad risk-management. And in my experience, the tools focused on the far less risky subject of snippets are also much worse at dealing with containerization.

Sixth, snippet scanning is not industry-standard. There are many open source scanning tools out there, but only a handful do snippet scanning, and only a subset of their customers actually chase those matches down. The entire tech industry has embraced Copilot – there are really only a few notable exceptions to my knowledge. Which means that in some ways we are back to where we started from – deeper pockets are at higher risk of enforcement and smaller companies continue to fly under the radar. The number of entities in a position to do OSS enforcement hasn’t changed, and the total budget for that enforcement, whatever it is, remains the same. I don’t think Copilot is going to induce more people to enter the trolling business for the reasons laid out above (lawsuits against GitHub itself notwithstanding). So given that the actual risk here is the same, it does not make sense to reallocate company compliance budgets to spend time and money on the less risky issue of snippets, in lieu of other, more substantive potential violations.

Conclusion

When selecting tools, it doesn’t make sense to prioritize great snippet identification over things like a better ability to identify secondary licenses buried in source code, automated customer-facing attribution files that actually reproduce the copyright notices and licenses found in the source code, identification of transitive dependencies, the ability to work with more computer languages and build systems, or good container handling (especially separation of the application layer from the operating system layer). For me, snippet identification is absolutely the least important feature of a software scanning tool.