Matthew Butterick and the Joseph Saveri Law Firm are continuing to make the rounds amongst generative AI companies, following up on their lawsuits related to Copilot, Codex, Stable Diffusion, and Midjourney with two more class actions related to ChatGPT and LLaMA, respectively. Notably, the lawsuit related to LLaMA predates Meta’s release of LLaMA under a commercial license. The comedian Sarah Silverman is one of three named plaintiffs in both of these new cases, alleging various claims related to her book, The Bedwetter. These cases make two interesting new assertions about generative AI that do not appear in any of the prior cases.
New Theories of Liability
The first new assertion is that the models themselves are derivative works of the training data. The Copilot and Codex case doesn’t include any claims of copyright infringement. The Stable Diffusion and Midjourney case does, but the allegations under Count I (Direct Copyright Infringement) don’t make entirely clear whether the illegally prepared “derivative works” refer to the model output or the model itself. Section VII (Class Allegations) seems to indicate that the direct copyright infringement is tied to downloading and storing the copyrighted works and then training the model with those works, while the vicarious copyright infringement is tied to the output, which third parties used to create fakes of original artists’ work. While the complaint mentions that the “AI Image Products contain copies of every image in the Training Images,” the impression it leaves is that the copying of the images for the purposes of training is what’s at issue, and perhaps that a copy still somehow resides “in” the model. But copying is a monopoly right separate and distinct from the right to prepare derivative works, and this is the first time this team of attorneys has explicitly claimed that the model is a derivative work of the training data.
The second new assertion is that “every output of the […] language models is an infringing derivative work” with respect to each of the plaintiffs’ copyrighted works. In other words, no matter what the model outputs, it is necessarily a derivative work of Silverman’s book. In the complaint related to Stable Diffusion and Midjourney, the Factual Allegations section does state that “the resulting image is necessarily a derivative work,” but it doesn’t say whose work it is derivative of – is it a derivative of one copyright holder’s work or of all of the copyright holders’ works at once? Further, the section actually describing the nature of the copyright infringement claim (Count I Direct Copyright Infringement) doesn’t quite go so far as to say that every output of the model necessarily infringes the copyrighted work of a single copyright holder. The complaint puts it more obliquely, arguing that the outputs “are derived exclusively from the Training Images…” It’s not clear that this use of the word “derived” is meant to mean “derivative work” under copyright law, and again, it’s not clear whether they mean that every output is derivative of some works or of all works. The same section merely accuses the defendants of having “prepared Derivative Works based upon one or more of the Works.” From my perspective, this is a new and different theory of copyright infringement now being put forth by this team of attorneys.
The Implications of the New Theories of Liability
Both of these new assertions have interesting implications for the plaintiffs’ cases. There is a scenario, for example, where the courts decide that the copying necessary to train a model is either fair use or outside the ambit of copyright law. That sort of incidental copying is fairly universal in order to allow both individuals and various bots to “read” the Internet, after all, and merely reading or viewing a copyrighted work is not a right protected by copyright law. A court may further decide that it won’t hold a model creator/distributor responsible for its output because there is substantial non-infringing use for the model(s) and will instead hold individual users accountable. However, if the model itself is deemed to be a derivative work of the training data, and fair use or other defenses for whatever reason do not apply, the defendants would still be liable for making the derivative work and distributing it (to the extent there is distribution), even if the courts rule as described above with respect to model training and model output.
This puts into play a third prong of attack for the plaintiffs that didn’t exist before. Strategically, I think this was a good move for the plaintiffs’ case.1 I happen to agree that the model probably does constitute a derivative work of at least some of the training data, and the logic of “the model can summarize the plaintiffs’ works, therefore it must have copied and stored them” is simple and appealing to a non-technical audience in a way that refutations of this statement will not be. However, I also think there is a strong argument that the fair use defense applies and/or that certain models can be big and complex enough that even though they might contain some training data, its use is de minimis. But there’s no predicting how courts will weigh such a defense, and this allegation gives the plaintiffs a third roll of the dice.
The allegation that all output is necessarily a derivative work of any given piece of training data is more of a double-edged sword for the plaintiffs. Technically speaking, any time the model creates output, it is doing so on the basis of the model in its entirety (a model’s decision not to respond to input in a certain way is as much informed by the training data as its affirmative decision to respond in a certain way). All output is, in a sense, a reflection of everything the model has gleaned from the training data. But this is “derivation” in the colloquial sense of the word, not in the sense designated under copyright law. Under copyright law, a “derivative work” is one that still retains copyrightable elements of the original work.
Clearly, ChatGPT is more than capable of producing output that no reasonable person could in any way connect with Silverman’s book, so this seems like a serious overreach. Such an interpretation would certainly benefit the plaintiffs, since the complaints don’t actually allege any instances of output that infringes the plaintiffs’ works other than output summarizing those works (and mere summarization does not constitute copyright infringement, contrary to the plaintiffs’ assertions in these complaints). But I think this is a bridge too far, and an already confused court would not look kindly upon such an incendiary and misleading claim. This feels like a placeholder until (if?) the plaintiffs actually get the models to reproduce a real derivative work of one of their works – and so far, they don’t seem to have succeeded at that.
The inability to produce any damning output probably makes these the weakest of all the generative AI cases this group has filed so far, especially with respect to ChatGPT: in the absence of derivative output and in the absence of physical distribution of ChatGPT itself (the plaintiffs don’t allege any), there’s very little to hang the DMCA claims on. The plaintiffs would basically have to argue under 17 U.S. Code § 1202(b)(1)
(b) Removal or Alteration of Copyright Management Information.—No person shall, without the authority of the copyright owner or the law—
(1) intentionally remove or alter any copyright management information,
(2) distribute or import for distribution copyright management information knowing that the copyright management information has been removed or altered without authority of the copyright owner or the law, or
(3) distribute, import for distribution, or publicly perform works, copies of works, or phonorecords, knowing that copyright management information has been removed or altered without authority of the copyright owner or the law,
knowing, or, with respect to civil remedies under section 1203, having reasonable grounds to know, that it will induce, enable, facilitate, or conceal an infringement of any right under this title.
that mere creation (even in the absence of distribution) of generative AIs is prohibited by the DMCA because every act of creating a model removes copyright management information (CMI) and that the information was stripped specifically to “induce, enable, facilitate, or conceal an infringement.” That seems like an unpersuasive argument on many fronts:
The models don’t always strip CMI.
The models aren’t necessarily storing enough of the training data for it to retain copyright protection anyway (and some training data is ignored by the model entirely or “forgotten” later).
The primary purpose of model creation isn’t the concealment of an infringement.
The stripping of the CMI is a byproduct of model creation rather than something the models do “by design.”
It’s not clear that any infringement is happening here at all, and it’s definitely not clear that the defendants actually believe there is any infringement here as a matter of law.
More generally, it’s hard to argue that a law passed in 1998 specifically to streamline innovation and generally to “get with the times” was intended to wholesale ban an incredibly exciting technology that wouldn’t exist for another 20+ years.
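To make the “byproduct” point concrete, here is a minimal, hypothetical sketch of a training-data pipeline (all names and fields are invented, not drawn from any actual system). Nothing in it is designed to remove CMI; the metadata simply never makes it into the training corpus:

```python
# A hypothetical preprocessing step of the sort used to build text corpora.
# Only the body text is kept for tokenization; metadata fields that would
# qualify as copyright management information (CMI) are never carried along.
from dataclasses import dataclass

@dataclass
class ScrapedDocument:
    body: str               # the text the model will train on
    author: str             # CMI
    copyright_notice: str   # CMI
    source_url: str         # arguably CMI as well

def to_training_example(doc: ScrapedDocument) -> str:
    # Nothing here "removes" CMI by design; the pipeline just never uses it.
    return doc.body

docs = [ScrapedDocument(
    body="Once upon a time...",
    author="Jane Author",
    copyright_notice="(c) 2020 Jane Author",
    source_url="https://example.com/book",
)]
corpus = [to_training_example(d) for d in docs]  # CMI is gone as a side effect
```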
Other Background
The Claims
The claims in these cases are basically the same as the claims related to Stable Diffusion and Midjourney, minus the claims related to rights of publicity and breach of contract (the latter of which was specific to DeviantArt’s alleged violations of its own Terms of Service and Privacy Policy):
Direct copyright infringement related to copying, making derivative works, publicly displaying copies and distributing copies of copyrighted works and derivatives thereof.
Vicarious copyright infringement arising from the allegation that “every output of the […] language models is an infringing derivative work”
Removal of copyright management information under the DMCA
Unfair competition based on the DMCA violation
Unjust enrichment, vaguely for all of the above
Negligence, extremely vaguely for all of the above
There is also an additional claim against Meta for false assertion of copyright related to the fact that when the LLaMA model was leaked, Meta sent GitHub a takedown notice in which they asserted sole copyright ownership in the model.
Of note is that there are no claims here related to personally identifiable data, which did appear in the Copilot and Codex-related complaint. Given that I’ve personally read an entirely fictitious biography of myself generated by ChatGPT, there would probably be a lot for a court to chew on with respect to those sorts of claims, but this isn’t the right formulation of the class to bring such claims.
The Class
The class in both cases is basically everyone with a copyright in any of the training data. It’s a strangely broad choice given that the named plaintiffs are specifically book authors, and in each case the complaint alleges that the plaintiffs’ books were part of a dataset originating from “illegal shadow libraries” that made copyrighted books available in bulk via torrent systems. Why not limit the class to other book authors whose works were part of the same dataset? By naming the class so broadly, the plaintiffs make it harder to prove typicality, adequacy, or commonality and predominance, because the LLMs in question were also trained on the broader Internet: on many different types of works under many different licenses, works in the public domain, and works whose copyright was never registered. Does the author of a book actually have a lot in common with a Redditor, or with someone publishing data on local erosion patterns under a public domain dedication? These copyright holders look to have different interests, to be differently situated, and to face different questions of law and fact.
Like the classes in all the other generative AI cases brought by this group of attorneys, the classes here don’t condition participation on injury. Just because a work was part of the training data doesn’t mean the work 1) is actually part of the model, 2) is part of the model in sufficient detail to still be subject to copyright, 3) is actually outputted by the model (or a derivative of it is), or 4) is outputted by the model in sufficient detail to still be subject to copyright.
Conclusion
Although the ChatGPT and LLaMA-related cases make some new, rather startling allegations, these cases feel weaker than the other ones the group has filed so far. The inability to prompt either model into actually outputting a derivative work of the training data will force the plaintiffs to focus at trial on the actual act of training the model and on the details of what a model is and how it works, making the claims and allegations here somewhat academic and theoretical (if not entirely impenetrable) from the viewpoint of a potential jury. There’s no smoking gun. There’s no clear narrative about how Sarah Silverman is losing out on book sales because ChatGPT is stealing her jokes, etc. At bottom, the plaintiffs will have to convince the jury that even though these LLMs aren’t actually stealing anyone’s work, and to the contrary seem to be providing helpful fact-based information (the book summaries), they should nevertheless handsomely reward the authors for losing out on an entirely new revenue stream that basically only exists thanks to a dense and obscure tangle of laws and technical details. To me, the LLMs seem like the AIs most vulnerable to legal attack, yet these cases as currently presented strike me as the least worrisome ones.
Last week I presented to the IP Section of the California Lawyers Association on the topic of “Evaluating Generative AI Licensing Terms.” This was aimed at lawyers and procurement professionals familiar with tech transactions looking to license AI/ML technologies on behalf of their clients. It contained a brief introduction to AI/ML technologies, an exploration of special risks related to these technologies, commonly negotiated license provisions, and risk mitigation measures. Many of you asked for a copy of the slides afterwards – feel free to get them here.
The desire of AI researchers to publicly share their AI/ML models while limiting the possibility that the models may be used inappropriately drove the creation of “Responsible AI licenses” by the RAIL group (“RAIL licenses”).1 Today these licenses are used by a number of popular models, including Stable Diffusion, the BLOOM Large Language Model, and StarCoder. RAIL licenses were inspired by open source licensing and are very similar to the Apache 2.0 license, with the addition of an attachment outlining use restrictions for the model. The “inappropriateness” the RAIL licenses target is wide-ranging.
Background on RAIL Licenses
Technically, these licenses are not open source licenses per the definitions of open source promulgated by the Open Source Initiative or the Free Software Foundation. The use restrictions in these licenses’ Attachment A, such as a ban on using the models to provide medical advice or medical results or a ban on using them for law enforcement, discriminate against particular fields of endeavor in contravention of fundamental open source principles. And because the licenses allow licensees to pass on the models under their own licenses of choice, provided they flow down the use restrictions, the licenses only promise a very limited amount of “openness” or “freedom” in the traditional OSS-specific sense. Unlike under open source licenses, downstream users of RAIL-licensed models are not required to receive the same rights to use, modify, or distribute the models as the original licensee.
The use restrictions are worth reading in full if you haven’t already. Here are the ones in BigCode’s StarCoder license (they vary a bit from one RAIL license to another):
You agree not to Use the Model or Modifications of the Model:
(a) In any way that violates any applicable national, federal, state, local or international law or regulation;
(b) For the purpose of exploiting, Harming or attempting to exploit or harm minors in any way;
(c) To generate and/or disseminate malware (including – but not limited to – ransomware) or any other content to be used for the purpose of Harming electronic systems;
(d) To generate or disseminate verifiably false information and/or content with the purpose of Harming others;
(e) To generate or disseminate personal identifiable information with the purpose of Harming others;
(f) To generate or disseminate information (including – but not limited to – images, code, posts, articles), and place the information in any public context (including – but not limited to – bot generating tweets) without expressly and intelligibly disclaiming that the information and/or content is machine generated;
(g) To intentionally defame, disparage or otherwise harass others;
(h) To impersonate or attempt to impersonate human beings for purposes of deception;
(i) For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation without expressly and intelligibly disclaiming that the creation or modification of the obligation is machine generated;
(j) For any Use intended to discriminate against or Harm individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics;
(k) To intentionally exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;
(l) For any Use intended to discriminate against individuals or groups based on legally protected characteristics or categories;
(m) To provide medical advice or medical results interpretation that is intended to be a substitute for professional medical advice, diagnosis, or treatment;
(n) For fully automated decision making in administration of justice, law enforcement, immigration or asylum processes.
The scope of the concerns here is remarkable. Some are rather practical (let users know they’re talking to an AI) and others are common in licensing generally (don’t use the technology to break the law). But many of these read as attempts to prevent the technology from enabling totalitarianism and police states generally. Licenses for general-purpose technologies have long contained restrictions related to using certain technologies in extremely dangerous, mission-critical contexts (like managing nuclear power plants or conducting remote surgeries); those sorts of disclaimers or limitations arose because the software was simply not hardened to meet the security and performance standards those uses would require. It’s only in the last 15 years or so that such licenses have attempted to limit usage based on ethical considerations.
A spate of so-called “ethical licenses” emerged in the last 15 years, with aims that included preventing companies from using the software if they worked with ICE, if they were Chinese companies that required long, grueling working hours, and if they were harming underprivileged groups. These licenses have not enjoyed widespread use or support; very few projects have adopted them and they have been banned2 at virtually every tech company sophisticated enough to have an open source licensing policy. But the RAIL licenses are probably the most far-reaching licenses for publicly available technology that have ever gained any notoriety, pursuing not just one particular type of harm, but contractually attempting to require adherence to what amounts to an expression of a wide variety of Western, liberal principles.
As you can see, many of these restrictions are nebulous even at face value. You’ll notice that none of the uses of the word “discriminate” are tied to any legal understanding of what discrimination might be, leaving open the interpretation that simply sorting people by a protected or unprotected characteristic is prohibited “discrimination” under this license.3 “Harm” is defined broadly: it “includes but is not limited to physical, mental, psychological, financial and reputational damage, pain, or loss.” Note that “Harm” isn’t just harm in violation of the law or undue or unjust harm. It would include harm like sending a rightly convicted person to jail or hurting someone’s reputation by publishing true things about them.
On the face of it, (g) would prevent a reporter from using an AI to assist in writing an article on a company dumping chemicals in the water, since that would be disparaging, (e) would prohibit someone from asking an AI about the names and addresses of people on the public sex offender registry, since that would be a use of personally identifiable information likely to hurt someone’s feelings, and both (j) and (l) would prevent someone from using an AI to assist in targeting men for boxer brief advertisements, since that would be discrimination based on a predicted personal characteristic and based on a protected category. An entire essay could be written about the possible meaning of nearly any one of these provisions. The vagueness here would likely make some of these restrictions unenforceable in a court of law and leave the others open to unpredictable case-by-case determinations.
Like open source licenses, the RAIL licenses are styled as copyright licenses. The ability to enforce them rests on the model owner’s ability to prove that they have copyright ownership in the model and that a violation of the RAIL license is therefore copyright infringement. Given that AI models are mostly just numbers representing parameter weights, there is deep skepticism in the IP world that the models are even copyrightable. If not, the license is fully unenforceable. It would need to be significantly rewritten solely as a contract in order to be enforceable, which is difficult to do because every country (and even the states therein) has very different laws that apply to contract interpretation, whereas copyright law was harmonized almost worldwide by the Berne Convention.
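To illustrate why, here is a minimal sketch (assuming a PyTorch-style toy model; the library calls are real, but the “model” is trivial and purely illustrative) of what a model actually is once serialized: a dictionary of named arrays of floating-point numbers.

```python
# A toy model's entire "expression" is a handful of floats.
import torch.nn as nn

model = nn.Linear(4, 2)  # 8 weights + 2 biases = 10 parameters
for name, tensor in model.state_dict().items():
    print(name, tensor.flatten().tolist())
# e.g.:
# weight [0.12, -0.47, 0.33, ...]
# bias [0.03, -0.21]
# Production models differ only in scale: billions of such numbers.
```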
There is also a strong possibility that the use restrictions in RAIL licenses could constitute copyright misuse in the United States (and a few other countries). This doctrine prevents copyright enforcement where the copyright owner has conditioned a copyright license on restrictions or limitations well outside the scope of the monopoly rights granted by the Copyright Act. Examples of copyright misuse include licenses that require non-competition for 100 years and licenses that prevent the licensee from sharing any data stored in the software with others. Given the vagueness of the restrictions, I don’t think it’s hard to imagine a use case that might be read to violate those restrictions, but in which a court would be hard-pressed to enforce them (such as my example of a journalist writing about clean water, above).
For purposes of analyzing and assessing the various enforcement mechanisms possible in RAIL licenses, I’m going to assume that RAIL licenses can surmount the potential legal challenges mentioned above and that it is possible to make better and clearer restrictions that are worth trying to enact.
RAIL Licensed Models Incorporated in Other Products and Services
RAIL licenses were drafted in the context of AI researchers sharing naked models (without the software to run them or the training data), often foundational models, in public spaces like the Hugging Face platform. Thus, it’s not entirely clear how the licenses are supposed to apply to entities or individuals incorporating a model into their own products and services. On the one hand, the RAIL licenses allow the initial licensee to re-license the model under their own license provided that they flow down the use restrictions in the RAIL license (this is “sublicensing” and the downstream users are “sublicensees”). But, on the other hand, the RAIL licenses still require the initial licensee to provide downstream licensees with a copy of the RAIL license. It’s not clear why downstream licensees need to see a license that they may not actually be subject to.
Although the RAIL licenses require flowing down the use restrictions to downstream licensees, the license does not further require that the initial licensee enforce those restrictions4 against downstream licensees (including by terminating access rights for downstream licensees), nor does the license terminate for the initial licensee if downstream licensees violate those restrictions. As such, the owners of the model can really only enforce the RAIL license against the initial licensee, and only if the licensee itself has violated the license, not if one of its downstream licensees has violated the RAIL flowdowns. Since there is no privity of contract between the model owners and the downstream licensees (i.e., there is no contract between them) when the initial licensee sublicenses to the downstream licensee, model owners have no standing to enforce any of the use restrictions against them (or their subsequent downstream licensees).5 The use restrictions are thus not very effective: everyone who receives the model from someone other than the licensor (the model owner) is beyond the reach of legal enforcement from the model owners.
Even if the initial licensee wanted to sue the downstream licensee, they’d be limited to contract claims as only exclusive licensees have standing to sue for copyright infringement, and that significantly limits the appeal of enforcement since the initial licensee would need to prove that they were personally, financially damaged by the violations of the downstream licensee. That’s probably not true in most instances of violations of the RAIL use restrictions.
Alternative Approaches to License Enforcement
The RAIL licenses could potentially be modified in two different ways to make sure that no users of a RAIL licensed model are beyond reach for those wishing to enforce the license. The first approach is to make the RAIL licenses look more like commercial licenses, with more restrictions and a clear framework for how upstream licensees take responsibility for downstream users. The second approach is to hew more closely to the open source licensing model, by prohibiting sublicensing, and forming a direct contractual relationship between the model owner and each user.
The Commercial Approach
[Diagram: an example sublicensing chain under the commercial approach. The companies named are provided solely as an example; the diagram does not necessarily reflect any real-world relationships.]
Commercial contracts that require that the initial licensee flow certain terms down to downstream licensees (but not the entire agreement) also reserve the right to terminate the license of the initial licensee if downstream licensees fail to comply with those flowdowns. They often require that the initial licensee report any downstream violations. They can further ask the initial licensee to indemnify the licensor against any claims or losses related to the actions of the downstream licensees.
Companies that care a lot about downstream uses will even go so far as to require the initial licensee to make the company (the licensor) a third party beneficiary to any agreements between the initial licensee and the downstream licensees so that they can enforce those flowdowns even when the initial licensee has no interest in doing so. This can be particularly appealing to the licensor if the initial licensee is likely to be smaller than the customers (downstream licensees) it serves since that power dynamic is unlikely to yield effective enforcement of the licensor’s rights and the initial licensee doesn’t have deep enough pockets to make suing them for their customers’ actions profitable. Designating a third party beneficiary in a tech transaction is fairly unusual, but it’s worth discussing at length here because the initial licensees in this particular context are unlikely to engage in much license enforcement on their own.6
Generally, downstream licensees will chafe at the addition of a third party beneficiary because:
They don’t like dealing with a third party they don’t know or have any relationship with – trust and relationships are important. Some third parties are more litigious than others
It involves allowing a third party to have access to their contracts or potentially their confidential information, but the agreement is solely between the initial licensee and the downstream licensee and the downstream licensee therefore cannot bind the third party beneficiary to any confidentiality or data privacy obligations
When a third party beneficiary includes “…and their affiliates,” this could potentially mean hundreds of plaintiffs the downstream licensee has never even heard of before and which can change as companies are bought and sold. That can significantly increase litigation risk and the risk of exposing confidential information to unknown entities without restriction
It leads to more complex and expensive litigation
Commercial agreements also commonly include provisions that subject initial licensees to audits to make sure they’re complying with the agreement. They may also include provisions requiring the initial licensee to create and maintain certain records so that they may be available for an audit or legal discovery, and such records might need to contain information about downstream licensees. Where the parties anticipate difficulty in bringing copyright claims or proving damages resulting from contract breach, the parties may agree to a liquidated damages provision, specifying up-front what a breach will cost the licensee.
All the additional terms discussed above could be flowed down to downstream licensees of downstream licensees, and so on.
The Open Source Approach
[Diagram: an example direct licensing chain under the open source approach. The companies named are provided solely as an example; the diagram does not necessarily reflect any real-world relationships.]
On the other hand, open source licenses do not rely on flowdowns to downstream users. Open source licenses follow a direct licensing model, wherein every user is in a direct license with the copyright holders of the software (i.e., the project contributors). Another way of putting this is that there are no downstream licensees or sublicensees, because everyone receives their license directly from the copyright holder, even if they might receive the actual software and a copy of the license language from someone else (you might still see references to downstream users, though). If a company incorporates open source licensed software into its products, the open source software remains under the open source license, regardless of how the company chooses to license other elements of its products, and the customer receives a bundle of licenses – one for the company’s own software, and one for each OSS component in the product:
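For illustration, here is a hypothetical third-party notices file of the kind that ships with such a product (all product and component names are invented):

```
ExampleCorp Analytics Suite - Third-Party Notices
-------------------------------------------------
ExampleCorp Analytics Suite:  ExampleCorp Commercial License
libfastparse 2.3.1:           MIT License, (c) the libfastparse contributors
datagrid-ui 0.9.0:            Apache License 2.0, (c) the datagrid-ui authors
cryptcore 1.1.4:              GNU LGPL v2.1, (c) the cryptcore project
```

Each OSS component remains under its own license, granted to the customer directly by that component’s copyright holders.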
Any violations of the open source license by such customers (or their own customers) can be enforced by the copyright holders. Downstream users never receive fewer rights than upstream users and the software itself thus remains “open” and “free.”
Evaluating the Commercial Approach for the AI Space
The commercial approach is useful when the subject of the license is a proprietary piece of technology. In other words, there is no interest in the technology being widely and freely available to others and there is no interest in seeing research and development related to the technology from anyone but the licensor (the initial creator/owner of the technology) or the licensor’s chosen partners or contractors. That’s because the initial licensee of such technology must comply with the licensor’s restrictions in order to maintain the license and avoid potential legal claims. If they don’t, the licensor could get an injunction to completely prevent the initial licensee from selling their products until the licensor’s technology has been removed from those products. Missing out on several months of revenue would be detrimental to most businesses and public lawsuits would deter new customers from signing on. And, of course, the licensee would be on the hook for damages related to copyright infringement and contract breach. The licensee might even owe money to its other customers if they granted them an IP warranty or breached their contracts with them by ceasing to make their product available.
If the initial licensee is responsible for downstream use, then it’s in their best interest to limit uses of the technology to a set of pre-approved use cases and prevent users from using the technology in unforeseen ways. This has traditionally translated into initial licensees forbidding downstream licensees from modifying the technology, distributing the technology, incorporating the technology into their own products, or reverse engineering the technology (sometimes at the behest of the licensor directly, but sometimes just as a risk mitigation measure). This often goes hand in hand with not providing downstream licensees with source code to the technology and with implementing control systems like DRM, license keys, etc. Even if downstream licensees are allowed to redistribute the technology in certain circumstances, many of the above controls and measures will remain in play. In sum, these controls and measures mean that the technology at issue is not “open” or “free.” Additional research and development related to the technology will typically begin and end with the licensor.
In the AI space, preventing downstream licensees from violating the use restrictions in the RAIL licenses will likewise mean lockdown measures. Sensible companies will limit the ways in which their customers can interact with the AI model as well as their ability to fine-tune it or do prompt-tuning. The fear of a customer circumventing some of the controls a company may put in place to comply with the RAIL license means that in the absence of regulations requiring otherwise, the exact nature of those controls will be kept secret and so will the code implementing those controls, limiting the transparency that AI researchers and the media could otherwise provide on everything from the performance abilities of the model to the model’s safety. Since the training data itself can offer information useful for circumvention, disclosing its contents could also become risky and undesirable.
In the B2B software context, knowledge of how downstream licensees are using a particular technology is generally rooted in monitoring license key usage and/or logins, or in physical audits of documents or computers in search of information related to payments, inventory, number of users, number of downstream licensees, product descriptions, documentation, etc. That’s because the salient questions up until now have revolved around whether a product has been distributed when it shouldn’t have been or whether too many users are using it. In some cases, certain uses can be identified just by seeing what a downstream licensee is doing and saying in public. Technology companies that sell to other companies (not to individuals) mostly rely on metadata and performance/utilization data to monitor customer accounts (initial licensees) for potential license violations, not the contents of customer data, which is often not even the customer’s data but the data of the customer’s own customers (e.g., AWS hosts a data analysis platform; that platform in turn ingests data from various customers who are retailers; and the data provided by the retailers might actually be the data of individual consumers or employees).
Best practices in the software world include giving only a small subset of employees direct access to customer accounts, and that subset generally only has access in order to resolve specific security issues and support requests; their access is carefully logged and monitored, and abuse of such access quickly leads to termination. Most B2B software companies enter into data protection addendums with their customers, promising compliance with these practices. While companies often reserve the right to access customer accounts for things like enforcement of their agreement with the customer, per the above, in practice most such enforcement doesn’t require direct access to customer data. Many companies that might use AI models as part of their services also do not receive any customer data in the clear – processing that needs to happen on unencrypted data is done in customer-controlled environments before being sent to the company, which receives only encrypted data that only its downstream licensees can decrypt. And of course, good old-fashioned on-prem software providers receive no customer data to monitor at all.
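As a minimal sketch of that last pattern (using Python’s cryptography library; the scenario and data are invented), client-side encryption looks something like this: the customer keeps the key, so the vendor holds only ciphertext it cannot inspect for license violations.

```python
# The customer encrypts locally and keeps the key; the vendor only ever
# sees ciphertext, so it cannot monitor the contents even if it wanted to.
from cryptography.fernet import Fernet

# Runs in the customer-controlled environment.
customer_key = Fernet.generate_key()  # never leaves the customer
cipher = Fernet(customer_key)

record = b"Patient presents with chest pain; recommend..."
ciphertext = cipher.encrypt(record)

# Only the ciphertext is sent to the vendor for storage or processing.
vendor_receives = ciphertext

# Back in the customer environment, the data round-trips intact.
assert cipher.decrypt(vendor_receives) == record
```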
Making sure that downstream licensees aren’t using AI models to provide medical advice, for example, would involve a level of surveillance that most B2B tech companies do not normally engage in, even if they do process customer data in the clear as part of a SaaS offering. Assessing a violation like that requires constant monitoring of the contents of that data, knowing who a customer’s customers are, and more people viewing customer data and personally identifiable information for purposes other than providing people with the services they’re paying for. Suddenly, compliance, legal, and engineering staff are exchanging, citing, and discussing a heap of sensitive health-related information to distinguish what’s advice from what’s just a factual statement. Given the grand scope and ambiguity of the RAIL use restrictions, deciding whether any customer complies with all of them would require each company to form something like Facebook’s Oversight Board, except with more expertise in a wider array of areas and without many of the legal protections that allow companies to eschew proactive surveillance and enforcement in favor of waiting for complaints of violations from third parties.7 The desire to make sure downstream licensees are using AI responsibly is in direct tension with their privacy and data protection rights, which is ironic because the RAIL licenses themselves prohibit AI models from abusing personally identifiable data.
Most technology companies also experience instances where customers are in violation of their agreements, but they choose not to terminate those customers. That can happen when an alternative deal is struck with the customer that everyone is happy with (which may or may not get documented in a contract somewhere), when the violation is fairly minor, when the customer is also an important partner and the partnership is much more lucrative than the customer arrangement, or the customer serves as such an important signal of the company’s value (it’s a “big logo”), that it’s worthwhile to keep them on even in the face of major violations. In the B2B space, companies make money by licensing products and services and they lose money when they terminate customers or file lawsuits against them. The reality is that individuals and corporations act on the things that most deeply affect them personally and they don’t, can’t, and shouldn’t expend limitless resources to surveil and monitor their customers or cripple the privacy and security of their products in order to act as a substitute for traditional government authority. Accepting a license that comes with such a responsibility would probably be a violation of the company’s fiduciary duties to its shareholders.
In the worst case scenario, if companies do not think they can put technical controls in place to comply with the RAIL use restrictions, if they believe that enforcing these restrictions increases their data privacy-related risks (particularly if they interpret these provisions as preventing them from encrypting customer data), or if they think it is too expensive and time-consuming to chase down every potential or actual violation, companies will simply ask customers to download AI models themselves and offer API integrations between their products and these models.8 That arrangement would fully absolve the companies of putting into place any controls, monitoring, or limitations that may be necessary to comply with RAIL use restrictions, putting those obligations squarely on the shoulders of their customers.
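Here is a minimal, hypothetical sketch of that arrangement (the endpoint URL and JSON shape are invented for illustration): the vendor’s product never touches the RAIL-licensed model itself; it only calls an inference endpoint the customer runs.

```python
# The customer downloads and hosts the model; the vendor's product merely
# integrates with it over an API. Any RAIL-compliance controls are now
# the customer's responsibility.
import requests

CUSTOMER_MODEL_ENDPOINT = "https://models.customer.example/v1/generate"

def summarize(document: str) -> str:
    resp = requests.post(
        CUSTOMER_MODEL_ENDPOINT,
        json={"prompt": f"Summarize:\n{document}", "max_tokens": 256},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["text"]
```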
The problem with this is that the customers of Microsoft, for example, include veterinary offices and hair salons – in other words, lots of customers who have neither the capacity for nor the interest in making AI models safer or policing their own employees’ and customers’ use. While various regulatory agencies may have the capacity to police several thousand tech companies and force them to develop a set of best practices, there is very little they can do when everyone is running their own instance of a foundational AI model to which no one has added any compliance or safety mechanisms. Very importantly, no one is incentivized to add such mechanisms, because the people capable of doing the work no longer bear any potential liability and the people who bear liability will, in practice, face a low chance of enforcement. At that point, regulators can either give up on AI safety or ban publicly available AI models. Such a move would again run counter to the desire for transparency around how models are trained and how they work, pushing AI research and development back into the hands of a handful of corporations and limiting it by hampering people’s ability to iterate on each other’s work.
Changing the RAIL licenses to work more like commercial agreements may better ensure “responsible” AI use in one particular sense, but it would significantly cut against openness and the benefits related to transparency and innovation that openness brings. That “responsibility” would be of a limited nature: it would be easy to see wild license violations, and corporations would likely act on such violations if committed by their downstream licensees, but it would no longer be possible for outsiders to closely study modified or fine-tuned models for bias or emergent behaviors, for example. It would be impossible for the public to know who all the users of a particular model are or what they’re using it for. Corporations would play a greater role in deciding what does and doesn’t violate the license, and other researchers, journalists, and regulators would have less opportunity to weigh in. If the cost of violating the license is high, as it is in commercial agreements, and as it would be if a company came to depend on a model under a RAIL license, then there is also a strong incentive to hide any violations that are found, rather than letting affected individuals or entities know about them. As in the data breach context, the only way to counter the incentives around hiding violations is to pass laws mandating their reporting and creating an even higher penalty for failure to disclose than for violating the license.9 But that doesn’t work in the AI context, where the entire point of the RAIL licenses is to substitute for regulations that don’t exist yet.
Evaluating the Open Source Approach for the AI Space
The open source approach is, at first glance, unsatisfying for a model owner who wants to go after just one company and not all of its customers. In other words, for a model owner who wants to essentially delegate responsibility for license enforcement down the supply chain. But, the goals of the open source movement haven’t been hampered by the direct licensing model. On the contrary, that model is core to the movement’s success and open source software’s ubiquity.
The open source movement, as it turns out, has been wildly successful without very much need to sue anyone at all. Just a handful of enforcement actions has created legal precedent and legal awareness around the importance of open source license compliance. The software industry has dramatically improved its open source license compliance practices in the last 30 years. Partially this is because the potential for enforcement and the expense related to it have been clarified, not just for individual companies, but for their potential customers, investors, and acquirers. Partially it is because the open source movement has many adherents and beneficiaries: companies that don’t comply have a harder time recruiting good engineers who care about open source, and a harder time forming partnerships with other companies that do not want to harm their reputations.
But perhaps the biggest reason for the movement’s success is simply that good software was shipped under open source licenses that could never be stripped from the software. That meant that not only could lots of people use the software freely, but lots of people could improve it, making many pieces of open source software the de facto gold standard in that domain. And once the software became a gold standard, no one could avoid using or avoid eventually becoming aware of open source licenses.
If the AI movement wanted to replicate the OSS movement’s success, allowing AI models and the related software (under RAIL software-specific licenses) to be sublicensed under other licenses was a fatal flaw.10 It undercuts the ability of others to iterate on AI models and improve them, weakening the strength of openly developed AI models and in turn the ubiquity and importance of the RAIL licenses themselves. In seeking maximum “responsibility,” there’s a good chance the RAIL licenses become, for all intents and purposes, entirely irrelevant because the best models aren’t going to be released under RAIL licenses at all.
There are also a lot of differences between the motivations of licensors in the open source space and licensors in the AI model space. The first wave of people who licensed their work under an open source license did so because they fundamentally believed in a set of fairly succinct open source principles of openness and freedom for code.11 They cared not just about getting attention and accolades for their work, but specifically about the ability of others to use and improve upon their work freely. This is where I believe there is a schism between the open source world and the AI world: many people licensing their code under RAIL licenses don’t have a distinct and articulated (or articulable) desire for how they want their models to be used. And they don’t necessarily even agree with each other about what constitutes a good or bad use. They may want to be seen as generally “responsible,” but being responsible isn’t a movement; it’s not a set of guiding principles; it’s not a standard someone can clearly meet or fail. The general sort of responsibility imbued in RAIL licenses is the same sort of “don’t be evil” ethos that’s always been floating in the ether, not a new, better way of doing things.
OSS licensors are also affected by license violations far more directly than AI licensors, and they have more resources available to them if they choose to enforce their license. In the open source world, someone abusing the open source license means that the copyright holder is not given back contributions to the project (by making them publicly available as licensees should), they’re not attributed, or they are robbed of revenues from alternative commercial arrangements from people who can’t or don’t want to comply with the open source license. Although lawsuits are fairly rare in the open source space, there is a culture and history of enforcement actions where copyright holders merely ask for and receive compliance in order to redress a violation. That culture is in part predicated on the fact that getting into compliance with an OSS license or getting a commercial license for the technology when it’s available is generally just one of the costs of doing business; it is rarely a matter that could make or break a company unless the company decides to literally go for broke and fight obvious violations in court. There are also foundations and open source compliance-focused organizations that help to educate the public about OSS licenses, provide resources for copyright holders trying to enforce their license, and sometimes act as counsel for a copyright holder who wants to enforce their license.
AI licensors are likely to be affected only very indirectly, if at all, by violations of RAIL licenses. Are individual PhD candidates at Berkeley really going to try to enforce against companies using their models for facial recognition technology in Indonesia? Is a Meta employee fine-tuning LLaMA in her personal time going to sue a pharmaceutical company in Norway for using her work to suggest various drugs to people? Today, there are no private organizations funded and staffed to enforce RAIL licenses. The resources and desire to enforce any of the RAIL restrictions are very limited. Whether it’s model developers to whom enforcement falls, as in the OSS direct licensing approach, or anyone upstream of a downstream violation, as in the commercial sublicensing approach, no individual or company has the desire and funds to enforce all of the license restrictions everywhere in the world at all times, no matter how egregious and harmful the violations may be. Even occasional enforcement is likely to be extremely rare, in part because of resource constraints, in part because of a lack of interest, but also in part because the people who care the most about enforcement wouldn’t hand out their models to unknown entities on public platforms.
Unlike in the OSS space, some violations of the RAIL licenses may also be impossible for certain companies to remedy. Defense contractors using AI models to help process PII in order to identify, locate, and arrest Russian soldiers in Ukraine, for example, can’t simultaneously stay in business and come into compliance with the license’s prohibition on using PII to “harm” others. That means a potential license violation could, in fact, make or break the business, and the same culture of “please just come into compliance, no need for nasty lawsuits” is unlikely to take hold in the AI space. As with the commercial approach, this incentivizes people to hide the fact that they’re using RAIL-licensed models and to hide their violations. And in particular, if you know the violations applicable to your business or your customers cannot be remedied, why look for them at all?
The success of the OSS movement is enviable and impossible to overstate. But, the many elements which have led to its success simply can’t be replicated in the AI space. The requirements of OSS licenses are much narrower and less open to interpretation than the requirements of the RAIL licenses. The people applying the licenses to their technology are much more directly affected by license violations than those in the AI space, they have far more resources available to them, and the types of violations they can allege are relatively clear cut. The scope and expense of proving that someone didn’t make a source code offer in court is orders of magnitude smaller than convincing a court in say, Myanmar, that a model has been used to discriminate against a religious minority (that the government has an active policy of discriminating against) and that the corporation running it should be prevented from selling its products until the model is removed. Not to mention the fact that enforcing many of the RAIL provisions constitutes not just an arcane commercial licensing dispute, but a provocative political statement that may lead to threats, harassment, and even torture or death for the people making them.
What Should Be Done with RAIL Licenses?
RAIL licenses were drafted because there was (and remains) little AI-related regulation and little AI-related government or corporate expertise available. The licenses are an attempt to offer a band-aid or substitute for this lack of awareness and regulation. They are also social and political statements about the role of AI researchers in the world, chastened, perhaps, by the regrets of some of the scientists who worked on the Manhattan Project, or by the general history of scientists driven by curiosity, excitement, and hubris who did not give a thought as to how the results of their work would actually be used in the world.
The use restrictions in the RAIL licenses would seem to indicate that the people licensing their models this way believe that their technology is profoundly dangerous and can be used by corporations and governments alike to effect systemic bias, voter manipulation, and totalitarian control in a manner and to a degree not previously possible. I don’t know how true that is; I’m just a lawyer, but I’m willing to accept these assertions at face value from the people who know the technology best. If true, today’s RAIL licenses amount to little more than a warning sticker on canisters of purified uranium being jettisoned into the sky with t-shirt guns. They provide no real control over how any of the technologies licensed under them are used. And unfortunately, neither the commercial nor the open source approaches discussed above create a good balance between openness and safety.
As previously stated, it’s not clear the models are copyrightable and subject to any license on that basis at all or that the licenses would survive misuse challenges. Even if the licenses are valid, like any legal instrument, they’re only useful if the entities subject to them reside in countries with operational legal systems open to foreign plaintiffs and whose own laws and politics are aligned with the goals and values espoused in the licenses. To put this another way, no license on earth is going to stop the Chinese Communist Party from using a model from Hugging Face to maintain biometric identification on its citizens, if that’s what it wants to do. If a terrorist organization wants to use the models to plan future attacks against Americans, maybe there’s no local court that cares to stop them from doing so.
What exactly is the sort of model that’s ok for North Korea to use however it wants, but whose use will require extraordinary levels of surveillance in the Western world? Are the licensors essentially arming hostile governments and terrorist organizations and only retaining control over law-abiding people? Do the licensors think it doesn’t matter if existing totalitarian regimes entrench their power over their helpless citizens even further so long as the licenses allow them to keep such practices at bay in the Western world? Is their technology not actually dangerous enough to warrant all these use restrictions?
This discussion is so fraught and complex because there actually is no model for how to regulate a dangerous technology that developed countries fail to recognize as dangerous. Certainly, there’s no model for further allowing people to trade this technology freely and publicly while also ensuring safety and transparency about who is using it, for what and how. Except perhaps in failed states, things that are truly dangerous, like military weapons, nuclear technologies, viruses created for scientific research, etc. simply can’t be left in the public square for anyone to come pick up and play with. No doubt all of these technologies would benefit from open and collaborative world-wide development of the sort made possible by the open source software movement, but openness and even progress don’t always trump safety and responsibility.
Licensing of any variety can’t substitute for meaningful, international AI regulation. The ultimate answer as to how to balance openness with responsibility, probably like most of this blog post, is deeply unsatisfying and uncomfortable: if you really think your technology can be misused in such profoundly harmful ways, don’t make it publicly available on the Internet. Only provide it to people you trust who are going to be honest with you about how they’re using it, under legally enforceable agreements. All technology can be used for both good and evil, but it’s still our collective responsibility to prevent evil where we can.
This post had additional information related to the history of these licenses and the types of concerns they attempt to address. ↩︎
You can read more about the reasons for this here. ↩︎
This is particularly likely since the restrictions already require compliance with all applicable law. Basic principles of contract interpretation would require additional restrictions on “discrimination” to mean something other than the sort banned by applicable law, since otherwise there would have been no need to write a separate provision addressing the same matter. ↩︎
Requiring an initial licensee to “enforce” the flowdowns against downstream licensees is not a meaningful requirement. In the US, legal “enforcement” can mean issuing a cease and desist, filing a legal complaint, settling the lawsuit for a lot or for a little or for nothing at all, going to trial and getting a verdict from a judge or jury, appealing to an appeals court, or appealing to a court of final judgment. Without more specificity, it’s impossible to know exactly how far a plaintiff is supposed to take a matter to satisfy the licensor. Even if the licensor were to specify something like “must pursue all violations to final judgment from a final court of appeals,” it would still be up to the plaintiff to decide exactly what claims to file and what sorts of remedies to ask for, and those would have to depend on the exact nature of the violation and the evidence available to the plaintiff. This would be an extremely difficult commitment for any company to make. Taking a case all the way to the Supreme Court or a state supreme court would likely mean millions of dollars in legal fees, and it could expose the company to various counterclaims, as well as to the possibility that a judge issues penalties against the plaintiff, or requires it to pay the defendant’s legal fees, for persisting in a case the judge has deemed frivolous or counter to public policy, the requirements of the license notwithstanding. In my entire legal career, I have never seen “enforcement” against downstream licensees as a legal requirement in any license, for these reasons. However, it is not uncommon for licensors to require that initial licensees report license violations to them. ↩︎
In the case where there is a license between the licensor and the user of the software (the licensee), the Ninth Circuit has held that restrictions that are not directly tied to the monopoly rights of copyright, like those commonly seen in Acceptable Use Policies, can only be enforced as breach of contract claims and not as copyright infringement claims. What may or may not be tied directly to a monopoly right is still an open question, but I think it’s extremely difficult to argue that any of the RAIL restrictions would meet this standard.
When there is no license between the licensor and the user, the licensor’s claims of copyright infringement must be proven by demonstrating that the user has violated one of the exclusive rights guaranteed to copyright holders. See 17 U.S.C. § 106(3) and Sony Corp. of Am. v. Universal City Studios, Inc., 464 U.S. 417, 433, 104 S.Ct. 774, 78 L.Ed.2d 574 (1984). The use restrictions in the RAIL license are rather far removed from those rights.
That leaves the licensor solely with breach of contract claims, which can only be brought against entities one is in privity with – i.e. the entities one actually has a valid contract with. ↩︎
Apple and Oracle are the only notable companies that come to mind who have done this for publicly available code, and they’ve moved away from this tactic in many of their licenses in recent years. ↩︎
Even if the use restrictions allowed licensees to wait for complaints, DMCA-style, rather than requiring active monitoring for violations, it would be unclear what sort of response to such a complaint would absolve the company of responsibility for the violation. Under the DMCA, if a service provider receives a take-down request, they take down the offending materials and send the request along to the party allegedly in violation. That party can then send the service provider a counter notice to get their materials back up. If after 14 days the complainant hasn’t sued the alleged violator, the material goes back up and the service provider has no liability for anything.
In the case of a RAIL license violation, it’s less obvious what this procedure looks like, since the conversation isn’t just about one copyrighted work or even multiple copyrighted works, but about terminating an entire customer relationship. It’s an entirely different paradigm from the business-to-consumer (B2C) space where the DMCA is most commonly used. Unlike in the business-to-business (B2B) space, in the B2C space users often aren’t paying for the service, and the service provider’s contractual obligations to the user are very limited. The service provider’s limitation of liability is often set to a nominal amount. The DMCA also specifically shields service providers from any liability for taking down materials, even if they ultimately go back up. But in the B2B space, terminating a customer in violation of the agreement with the customer may mean not just loss of revenue from the customer, but also substantial damages that no law can shield them from (and which no sane customer would waive their right to).
So, would the service provider have to terminate immediately upon receiving a complaint, without any investigation? If they do investigate and find no wrongdoing, is a simple “nothing here” response sufficient? Or is the service provider now expected to divulge detailed confidential customer information to anyone who feels like writing a complaint? How does the complainant know that an investigation took place or that it was conducted properly? A single license can’t recreate the entire DMCA in an entirely new domain that has little in common with that of the DMCA. ↩︎
In the open source world, this approach is not available for technology under strong copyleft licenses. There is consensus that if a product requires, for example, GPL-licensed code in order to function, then merely requiring a customer to download the GPL component themselves is not a sufficient course of action to avoid the implications of the GPL. In that case, even though the company isn’t distributing the GPL’ed code directly, the product would still be viewed as a derivative work of the GPL’ed code, and if it’s not under the GPL itself, then that would constitute a violation of the GPL and in turn, copyright (and perhaps patent) infringement. In the AI model context, it’s particularly difficult to have a conversation about what might be a derivative work of a model if the model itself likely isn’t copyrightable. I further address the possibility of applying copyleft-style requirements to RAIL licenses in Footnote 10. ↩︎
Even if the license made disclosure of violations a requirement, an additional condition on the license probably wouldn’t yield any additional damages, at least not in the US. Copyright holders in the US can ask either for actual damages related to their copyrighted works plus the infringer’s profits from the infringement, or for statutory damages. It’s not clear what actual damages might accrue to a copyright owner whose morals have been offended but who can’t show any economic losses. And calculating the infringer’s profits from the infringement will be the same no matter how many ways the work was actually infringed. If seeking statutory damages instead, each statutorily prescribed damage award is meant to cover one copyrighted work, not every infringement of the one work. This means that the disclosure requirement is really only effective if it’s coming from a government law or regulation with its own prescribed financial or criminal penalties that exceed those available under copyright law. ↩︎
There is also the question of whether or not RAIL licenses should include copyleft elements, requiring any derivatives of the models or software to also be publicly released. In the OSS world, there are probably a lot of mixed feelings about how important copyleft mechanisms have actually been to the success of open source. After all, many permissively licensed OSS projects (without copyleft requirements) have seen great success and much collaboration, too. But in the AI space, a copyleft requirement is quite complicated. Assuming the models are even copyrightable, on the one hand, it might increase transparency, allowing more people to study and investigate fine-tuned and otherwise modified models. On the other hand, a copyleft requirement could undermine the ability of individuals or entities who have a valid reason to further train the models on proprietary or confidential information (including PII) to keep that data secure. It could also hinder developers’ ability to limit or restrict a model for safety purposes, since publicly released derivatives would essentially hand people who wish to circumvent those limits the blueprints for doing so. ↩︎
In practice, open source compliance comes down to attributing copyright holders and providing source code for the OSS used when required by the license. These are not difficult concepts to understand or measure except in really nuanced GPL-related analyses. But even there, the ultimate question is narrow: what needs to be open sourced?
In the wake of Copilot’s release, I’ve seen an uptick in questions related to snippet scanning and whether or not that may be desirable for open source compliance purposes. I believe that the answer is still “no.”
First, GitHub has created filters that prevent Copilot from making suggestions that exactly match any public code on GitHub. I’m not aware of any open source scanning tool capable of identifying a non-exact snippet match, so I’m not sure what sort of snippet matches one might receive if these filters are activated – chances are the matches won’t be coming from Copilot suggestions. These filters are not hidden away or difficult to enable, and they must be turned on in order for an organization to be eligible for Copilot’s indemnity offer. They can be turned on for the entire organization.
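To build some intuition for how exact-match suppression could work mechanically, here is a minimal Python sketch. To be clear: GitHub has not published Copilot’s filter implementation, so this is purely a hypothetical illustration of the general technique (normalize a suggestion, then check it against an index of known public code), and every name and string in it is made up.

```python
import hashlib

# Hypothetical sketch of exact-match suppression: normalize a suggestion's
# whitespace, hash it, and drop it if the hash appears in a precomputed
# index of public code. This is NOT Copilot's actual implementation.

PUBLIC_CODE_HASHES = {
    # stand-in for a large precomputed index of normalized public code
    hashlib.sha256(b"def add(a, b): return a + b").hexdigest(),
}

def normalize(code: str) -> bytes:
    # collapse whitespace so trivial reformatting doesn't evade the check
    return " ".join(code.split()).encode()

def filter_suggestion(suggestion: str) -> str | None:
    """Return the suggestion, or None if it exactly matches public code."""
    digest = hashlib.sha256(normalize(suggestion)).hexdigest()
    return None if digest in PUBLIC_CODE_HASHES else suggestion

print(filter_suggestion("def add(a, b):  return a + b"))  # suppressed -> None
print(filter_suggestion("def sub(a, b): return a - b"))   # allowed through
```

Even this toy version only catches exact (post-normalization) matches, which is consistent with the point above that near-miss snippet matching is a different and much harder problem.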
It’s good practice for engineers to look skeptically at any lengthy Copilot suggestion, as the chances that a suggestion is copyrightable (and hence the possibility of copyright infringement related to using it) increase with the suggestion’s length. When receiving a lengthy suggestion, it’s also worth considering whether it may be better to get the same functionality by adding an open source dependency instead. A distinct piece of open source code from an actively managed project will be updated and patched by someone else, whereas an unidentifiable suggestion from Copilot will not be. Likewise, distinct OSS can trigger security alerts from various open source monitoring tools, but those tools are less likely to identify a vulnerability in a file that just looks like company code rather than third-party code. GitHub has announced that it is also working on a feature that provides references to OSS projects for certain suggestions, making it even easier to add OSS dependencies when extensive functionality is desired. If engineers are looking closely at lengthy suggestions and the filters are also turned on, the chances of code that’s actually copyrightable ending up in a company’s product are quite low.
Second, even when a snippet scanner turns up an exact match, it might mean very little. The snippet may not be copyrightable, or may reflect a common code pattern used by many projects. Remember that Copilot is basically autocomplete for code and it biases toward producing code that appears in the training data most often. Open source scanners might identify the code as coming from a particular project, but they’re incapable of listing ALL the projects the same code appears in. That means that even if you attribute the project identified by the scanner, that project may not even be the originator of that code. Some other project could have written it first and the attribution made by the scanner may be incorrect, or the authors of multiple projects may have written the same code independently. I’ve personally seen code scanners attribute snippets to very large, very popular projects, when the snippet is actually found in a subcomponent owned by someone else entirely, written long before the popular project came into existence. And of course, the more often the code has appeared in various projects, the more likely it is that the code is purely functional (and not copyrightable) and it appears in multiple projects because that’s just how something is done in a particular language.
Third, if the concern is patents rather than copyrights, I’d argue that it’s extremely difficult to embody an entire patent in just a snippet of code.
Fourth, one has to look practically at the possibility of actual legal enforcement in this context. I’m not aware of any litigation based merely on snippets. Every open source-related litigation I’m aware of involved taking substantial portions of libraries, drivers, even operating systems without proper attribution or source code offers. Even if one were in the business of trolling, trolling merely on the basis of snippets and nothing more is just not profitable. There are so many companies out there not doing even basic compliance with entire Linux distributions that there’s really no reason to spend time and money arguing about much grayer cases like snippets, which the plaintiff is less likely to win and which will be more costly because the plaintiff will need to bring in evidence and experts to defend the copyrightability of the snippet. There is far less dispute about the copyrightability of entire libraries and operating systems.
Infringing snippets are also hard to find, particularly if they’re embedded in a SaaS product or software that is distributed only in executable (as opposed to source code) form. Techniques for finding open source software in binary software distributions are limited. Often, enforcement efforts are based on the inclusion of complete open source components, where the components can be identified by their filenames, or by the output when they are run. Open source components may also be identified by strings (quoted text) that are unique to that component, because when source code is compiled into binary form, those strings can still be found in the binary. But a short snippet compiled into another piece of software is unlikely to be identified by either technique.
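For the curious, here is a minimal Python sketch of the string-based identification technique just described, in the spirit of the Unix `strings` utility. The component names and marker strings are invented for illustration; real binary-analysis tooling is far more sophisticated.

```python
import re
import sys

# Sketch of string-based component detection: long runs of printable
# characters survive compilation, so distinctive quoted text from a known
# OSS component can reveal its presence in a binary. All component names
# and marker strings below are hypothetical.

KNOWN_MARKERS = {
    "examplelib": b"examplelib 3.1 Copyright (c) Example Authors",
    "otherlib": b"otherlib: unrecoverable internal parser error",
}

def printable_runs(data: bytes, min_len: int = 8):
    """Yield runs of printable ASCII bytes at least min_len long."""
    for match in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data):
        yield match.group()

def scan_binary(path: str) -> None:
    with open(path, "rb") as f:
        runs = list(printable_runs(f.read()))
    for component, marker in KNOWN_MARKERS.items():
        # A whole component ships many long, distinctive strings; a short
        # snippet compiled into someone else's code usually contributes
        # none, which is why snippets evade this kind of detection.
        if any(marker in run for run in runs):
            print(f"possible {component} code found in {path}")

if __name__ == "__main__":
    scan_binary(sys.argv[1])
```

The detection hinges on the component carrying identifiable text; a ten-line snippet typically carries none, which is exactly the limitation described above.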
In order to bring suit to enforce a copyright license in the US, a copyright holder first has to register their copyrights. Most open source developers do not do this. Even many corporations do not do this. Back in 2018, there was a study about how many people actually complied with Stack Overflow’s Creative Commons ShareAlike 3.0 license. Stack Overflow is probably the single most common source of snippets picked up by open source scanners. The answer was that basically nobody complies with these licensing terms. In no small part, that’s because the people posting on Stack Overflow don’t bother to register their copyrights in their snippets. They also generally have no particular interest in enforcing those licensing terms. Expensive enforcement litigation makes sense for non-profits dedicated to enforcement, large corporations, and serial trolls, not everyday contributors, much less coders answering questions on public forums.
Fifth, snippet scanning is almost always a distraction from higher-priority compliance issues. For example, most organizations still don’t properly do open source compliance for virtualized or containerized images, failing to provide attribution or offer source code for entire containers, applications, and operating systems. So, spending time chasing down snippets while still not having figured out containerization is bad risk-management. And in my experience, the tools focused on the far less risky subject of snippets are also much worse at dealing with containerization.
Sixth, snippet scanning is not industry-standard. There are many open source scanning tools out there, but only a handful do snippet scanning, and only a subset of those tools’ customers actually chase snippet matches down. The entire tech industry has embraced Copilot – there are really only a few notable exceptions to my knowledge. That means that in some ways we are back to where we started: deeper pockets are at higher risk of enforcement and smaller companies continue to fly under the radar. The number of entities in a position to do OSS enforcement hasn’t changed, and the total budget for that enforcement, whatever it is, remains the same. I don’t think Copilot is going to induce more people to enter the trolling business for the reasons laid out above (lawsuits against GitHub itself notwithstanding). So given that the actual risk here is the same, it does not make sense to reallocate company compliance budgets to spend time and money on the less risky issue of snippets, in lieu of other, more substantive potential violations.
Conclusion
When selecting tools, it doesn’t make sense to prioritize great snippet identification over things like a better ability to identify secondary licenses buried in source code, automated customer-facing attribution files that actually reproduce copyright notices and licenses from the source code, identification of transitive dependencies, the ability to work with more computer languages and build systems, or good container handling (especially separation of the application layer from the operating system layer). For me, snippet identification is absolutely the least important feature of a software scanning tool.
Because IP attorneys are often arguing cases in front of non-technical judges and juries, they frequently rely on analogies to help explain how a particular technology works. Choosing the right analogy can make the difference between winning and losing a case. Lawyers have used several different analogies to explain how ML/AI models (I’ll often just call them “models” here) are trained and how they function, in service of arguments that the training process does or doesn’t infringe US copyright law. This post will only examine whether training a model is copyright infringement. Whether the output is infringing is out of scope for this discussion.
The Suggested Analogies
In the class action accusing Stable Diffusion and Midjourney of copyright infringement, the plaintiffs argue that the generative AI models in question are essentially sophisticated collage tools, with the output representing nothing more than a mash-up of the training data, which is itself stored in the models as compressed copies.
Among those taking issue with this analogy is my colleague Van Lindberg, a distinguished IP attorney who recently published a law review article in the Rutgers Business Law Review titled “Building and Using Generative Models Under US Copyright Law,” which provides a great introduction to generative AI for a non-technical audience. Lindberg argues that current generative AI models do not store compressed copies of the training data, but that “the model training process records facts about the work.” He continues: “Think of the analogy of the art inspector taking every measurement possible – brushstrokes per square inch, correlations between colors six inches apart, and the number of syllables in the artist’s name.” On the strength of this analogy, Lindberg argues that model training either isn’t governed by copyright law or constitutes fair use.
Lindberg is right that the model is not literally storing copies of copyrighted works like an archive or a database. It’s not possible just to click around and open up various works that were part of the training data. That implication by the plaintiffs is at best misleading and in any case literally and technically untrue. The model contains a set of numbers and equations that have a relationship to the training data and enable the model to produce responsive output appropriate to the input.
But by focusing on the technical inaccuracy of the plaintiffs’ characterization of the training process, Lindberg misses an opportunity. If we engage with it as an analogy, the plaintiffs’ characterization ultimately undermines their arguments for infringement and against fair use.
Evaluating the Suggested Analogies
Generative AI models are often described as “black boxes,” because even AI experts don’t really know why a model produces a specific output for a given input; they do not have insight into a model’s decision-making process or all the factors it takes into consideration when producing a particular output. We can say how the model is designed to analyze training data, and how the results of that analysis are encoded (in this case, as numerical parameters – weights and biases), but not what information the model ultimately records about the training data.
This process is so opaque that researchers are using one AI model to guess what data another model uses in its decision-making and how it weighs one piece of data versus another in producing its output. One researcher spent months reverse-engineering a tiny model just to figure out how it does addition, and this was headline-grabbing news in the AI world.
So, it’s important to keep in mind that neither side of this argument has a good understanding of what’s actually in any particular model; when Lindberg analogizes what Stable Diffusion does to an art inspector tracking the number of syllables in an artist’s name, we don’t know that Stable Diffusion is actually doing anything like that. To some extent, we’re all guessing, and will be until AI researchers better understand what they’ve built. I will graciously eat my hat if some of my arguments become outdated or debunked in this quickly moving field.
Lindberg asserts that the model records facts about a work but not the work itself.1 But a certain level of fact gathering is nearly indistinguishable from reproducing the work, or creating a derivative work of it.2 If a summary of a book is as long as the book itself, there’s a strong argument that the summary is really a derivative work of the book. Sheet music is a very detailed factual account of a recorded song, but we recognize both as eligible for copyright. Likewise, a script with stage directions is a very detailed factual account of a movie, but we recognize both as eligible for copyright. Images don’t have a non-digital analog that describes the copyrighted work the way movies and recorded songs do, but they do have digital analogs like PNG and JPEG.
If we create new ways of conveying substantially similar creative expression in different formats, why should that creative expression lose protection simply because it is conveyed via a new format? Particularly where, as with ML/AI models, not only can the work be conveyed via the new format, but it can be effortlessly and instantaneously converted back into the original format (or a very similar derivative work) during the output phase – no need to send anyone to the sound stage again! The courts have long recognized that merely changing formats creates a derivative work of the original.
Lindberg argues that “there is no way in which an ML model could be mistaken for any of its training inputs. The mass of statistical probabilities that make up a generative ML model are so different from the training material that there is no question that it is ‘different in purpose, character, expression, meaning, and message’ from any (or all) of the works that were used as input.” But this oversimplifies. After all, images and song recordings don’t lose their copyright protection just because we express them via 1’s and 0’s in JPEGs and MP3s. Simply digitizing a copyrighted work, or in this case encoding it in one digital manner versus another (like switching from WAV to MP3), doesn’t strip the work of copyright protection.
At present, we cannot say for certain whether an ML/AI model does so much “fact gathering” that it’s a derivative work of the training data. We can only deduce certain things about the model by looking at the output. But when a model’s output is identical (or very similar) to the training data, that’s a strong indication that the model has captured enough information about that training data to be considered a derivative work.
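To make the output-side inference concrete, here is a toy Python sketch (my own illustration, not anyone’s actual methodology) of the kind of comparison a researcher or litigant might run: score a model’s output against known training passages and flag near-verbatim matches. The sample passages and the 0.9 threshold are arbitrary assumptions, not a legal or technical standard.

```python
from difflib import SequenceMatcher

# Toy output-side memorization check: compare generated text against known
# training passages and flag near-verbatim matches. The passages and the
# 0.9 threshold are arbitrary illustrative choices.

TRAINING_PASSAGES = [
    "it was the best of times, it was the worst of times",
    "call me ishmael",
]

def similarity(output: str, passage: str) -> float:
    """Crude character-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, output.lower(), passage.lower()).ratio()

def flag_near_copies(output: str, threshold: float = 0.9) -> list[str]:
    return [p for p in TRAINING_PASSAGES if similarity(output, p) >= threshold]

# A near-verbatim output gets flagged; a loose paraphrase generally would not.
print(flag_near_copies("It was the best of times, it was the worst of times!"))
```

Of course, a high similarity score is evidence of memorization, not proof of infringement; the legal conclusion still turns on the substantial similarity analysis discussed later in this post.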
Lindberg focuses on one study where researchers were able to extract only a few identical copies of training data from image-generating diffusion models. From these results, he argues that this is a very rare occurrence that mostly happens when the training data isn’t properly deduplicated or the model is overtrained – in other words, when someone messes up.
But Lindberg overlooks the many AI outputs which, while not identical to the training data, may be similar enough to be derivative works. Granted, we can’t accurately quantify how much of a model’s potential output might be derivative of its training data, because the question is subjective and case-specific, but that doesn’t mean we can just avoid the question. Courts make those judgments even if researchers can’t, and our legal system is designed to protect copyright holders from unauthorized derivatives.
Developers can set up certain guard rails to prevent output that is too derivative (and therefore perhaps avoid liability for the output), but the fact that even a competently trained model is capable of producing obviously derivative output strongly suggests that the model is fairly closely encoding the training data. And while Lindberg argues that his reasoning extends to all types of generative models, there is evidence that the performance of large language models is improved when the models memorize more (regardless of overfitting). In other words, memorization is a feature, not a bug.
Some may argue that because a trained model encodes millions or billions of pieces of training data together, and not just one, it can’t really be said to be an encoding of any one particular work; it’s impossible to actually pinpoint data about any one piece of training data within the model because the model generalizes from the training data and “forgets” some of the training data in the process. But while this may be true under a particular set of circumstances, it is not universally true of all AI/ML models (a toy sketch at the end of this discussion makes the first point concrete) because:
When models are trained on a relatively small set of training data, they are more likely to produce output that is identical or very similar to the training data. They simply don’t generalize.
Models whose training data contain many duplicates are less likely to generalize and more likely to create verbatim or very similar output to the training data.
Models that are overtrained are more likely to create verbatim or very similar output to the training data.
The fact that the encoding for any particular piece of training data is difficult to extract doesn’t mean it’s not there. If the training data can be output by the model, then it is in some way encoded in the model during training. In a mile-wide photo collage, any one picture may be difficult to locate, but it’s still there, and the collage is still a compilation requiring permission from the photo’s copyright holder.
While current AI/ML model architecture makes it difficult to pinpoint where a model has encoded information about a piece of training data, that is likely to change in the future. Other architectures may prove more useful and researchers are actively working on making AI models more transparent, making it easier for people to understand what AI models learn from training data and how they use that information to make decisions. Specifically, it is desirable for a model to be able to indicate what specific training data it relied on to make a decision, including by showing that training data when asked.
In certain contexts, it’s extremely desirable for models to have perfect memory of certain portions of their training data or of data received after training. The arguments discussed above all relate to generative AI, which is designed to produce new work based on fairly limited prompts. However, other functions related to problem-solving, analysis, planning, and education will require the model to maintain perfect memory of a portion of its training data. For example, no one wants an AI scientific research assistant with imperfect memory of the periodic table of elements or of which combinations of medicines can kill people. One of the biggest benefits of AI assistance is that it is, in fact, capable of perfect memory in a way humans are not. Many models can also broaden their functionality by receiving and processing data after training has been completed as part of a specific user query (e.g., “Alexa, read this book you’ve never seen before and tell me if it’s suitable for my 5-year-old.”), and it may not be useful for the model to generalize this data.
To a certain extent, the level of generalization and forgetting a model does is driven not by the need to optimize the model’s results, but by computational limitations. Those limitations are diminishing every day.
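To make the first point in the list above concrete, here is a deliberately crude toy: a word-level Markov chain “trained” on a tiny corpus. It is nothing like a modern neural network, but it illustrates the dynamic: with very little training data, there are few plausible continuations at each step, so “generation” collapses into long verbatim stretches of the training text rather than generalization.

```python
from collections import defaultdict
import random

# Toy illustration (not a real generative architecture): a word-level
# Markov chain built from a tiny corpus. With so little data, the learned
# statistics admit few continuations at each step, so output replays long
# verbatim stretches of the training text: memorization, not generalization.

TRAINING_TEXT = (
    "the quick brown fox jumps over the lazy dog "
    "and the lazy dog sleeps in the warm sun"
)

def train(text: str) -> dict:
    model = defaultdict(list)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        model[prev].append(nxt)  # record every observed continuation
    return model

def generate(model: dict, start: str, length: int = 15) -> str:
    word, out = start, [start]
    for _ in range(length):
        choices = model.get(word)
        if not choices:
            break
        word = random.choice(choices)
        out.append(word)
    return " ".join(out)

print(generate(train(TRAINING_TEXT), "the"))
```

Scale the corpus up by a few orders of magnitude and the same mechanism starts to “generalize” (and to “forget” any single source), which is the sense in which the points above describe particular circumstances rather than universal truths.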
Alternative Analogies
Arguments and analogies that assume models don’t memorize aren’t just technically inaccurate; they’re also not future-proof: they link the question of fair use solely to the specific state of the art available today. If the fair use defense depends on this assumption, it will fail whenever any amount of memorization can be shown. Fortunately, there’s no need to build on such a fragile foundation: there is plenty of precedent that a use can be transformative even if it depends upon copying and retaining entire copyrighted works.
Argument 1: ML/AI Models Can Be Transformative in the Same Way as the Google Books Project
To create a searchable digital book archive, Google copied and compiled thousands of books verbatim. Even so, the Second Circuit approved of the archive as fair use because its purpose (allowing users to discover facts about specific books and about literature more broadly) was transformative, specifically furthering the Copyright Act’s goal of spurring the growth and spread of knowledge, and because the way in which the archive was ultimately utilized (the technical features of the search function limited the scope of the results it would provide) did not create a substitute for the original works in the market.
Applying that same analysis in the model training context, it’s clear that the analysis would look different depending on the model. “Foundational models” like Stable Diffusion and ChatGPT can do a broad range of things within one or more modalities (image, text, audio, etc.). For example, ChatGPT and other large language models (LLMs) are essentially designed to generate text that’s responsive to a prompt, and they can be given prompts on almost any topic. These models can be further tuned to become even more useful within a particular domain of knowledge or a particular task, and they might receive additional data beyond the training data made available to the foundational model. Other models are not foundational models at all; they are trained on a much narrower variety of training data and only perform limited tasks for specific use cases.
With this lens, foundational models, particularly those that haven’t been heavily fine-tuned, look more transformative because they have broader applications than non-foundational or heavily fine-tuned models with specific use cases (like “we want to display copyrighted artwork that best matches people’s moods”).
Even when a model has an ostensibly transformative purpose (and/or still retains it after fine-tuning), the way a model ultimately operates may still vary. For example, one image-generating model may take great pains to avoid outputting verbatim copies of copyrighted works (Stable Diffusion can no longer be prompted by specific artist names, for example), but another might freely allow it or even take steps to make such output easier to prompt. Even though the purpose of both models may be to help people create original art, courts may view them differently based on their implementation.
For this reason, it’s not realistic to classify the training of all ML/AI models as either fair use or infringing. Each instance must be evaluated in the context of the model’s intended use case and whether it is, in practice, offering substitutes in the market for the kinds of works that comprise its training data.
Today, at least, it is unlikely that courts could provide a comprehensive rubric to distinguish fair from infringing models. That would require a breadth of technical knowledge unlikely to be presented in any single trial. And in general, legal rubrics like that tend to emerge only after enough cases have been decided that a court can create a cohesive analysis of all of them.
Argument 2: The Model Is More Like a Means of Reproduction Than a Copy Itself and the Means of Reproduction Are Legal So Long as There Is Some Legitimate Downstream Use
A blanket decision that training models is legal would be desirable and advantageous for the tech sector given the way many models are currently created and shared. Today, it’s not uncommon for entities to create foundational models, make those models publicly available for download for free, and for other entities to download those models and further fine-tune them for their specific use cases or otherwise experiment with them. Most notably, Meta’s release (first accidental, then intentional) of LLaMA, a foundational language model, spurred research, experimentation, and improvements, to it and to language models in general, from a wide variety of individuals and entities, including universities and start-ups who could not afford to train their own foundational models but who obviously had a lot to contribute if they had something to work with. It’s likely that some downstream uses of such models don’t raise any eyebrows, while other uses draw a legal firing squad. So, it would be really convenient, and arguably good for the sake of technological progress and invention, for the creators of the models to get legal absolution for creating the model without regard for how downstream users actually use it.
If the goal is specifically to absolve the creators of foundational models, there is an argument that even if the model is a way of encoding the training data, because the only practical way to access such an “archive” is by having the model create output, it should be treated more like a means to create a copy rather than as a copy in and of itself. I think that AI/ML models can be likened to something like a Star Trek replicator. Since anyone can ask the replicator to create any physical object, be it a bayonet from the 1700s or a James Bond-style martini, the replicator has obviously ingested more or less all of humanity’s collective knowledge. But, this knowledge is entirely useless and inaccessible except to the extent that the replicator actually produces something. Therefore the replicator is treated as nothing more than a means of reproduction (literally, a “replicator” even though it can create things that have never existed before like a hot pink bayonet or a martini with tapioca pearls), rather than an archive or library. It only has value to the extent it can actually produce output. Like a photocopier, tape recorder, or VCR, the replicator’s existence, without actual use by anyone, doesn’t rob any IP holder of an opportunity to monetize their IP; that only comes into play with respect to certain types of output and certain uses of that output, which are controlled solely by the user requesting the output.
Treating the AI/ML model as merely a means to create a copy for all practical purposes would align the fact pattern here with the one in the Supreme Court ruling related to VCRs. Since the user of the VCR, and not the VCR manufacturer (Sony), was the one producing the tape recordings, direct copyright infringement was off the table, leaving the plaintiffs with only contributory and vicarious copyright infringement claims against Sony. In that case, the court closely examined how downstream users actually used VCRs in determining whether Sony’s sales of them gave rise to either contributory or vicarious liability. The court did not find contributory or vicarious liability because it could reasonably identify at least one downstream use case that met the fair use standard: at-home time-shifting of broadcast TV. The court repeatedly emphasized that “the sale of copying equipment, like the sale of other articles of commerce, does not constitute contributory infringement if the product is widely used for legitimate, unobjectionable purposes. Indeed, it need merely be capable of substantial noninfringing use [italics added].” I’m certain that there are many use cases out there of ML/AI output that have legitimate purposes (like quoting certain sources for educational purposes, creating or editing original art, analyzing data from medical trials, etc.). And unlike the photocopier or the VCR, the vast majority of the output is not, in fact, a copy of the original, meaning that it’s even easier to find noninfringing uses with respect to ML/AI models than with any other piece of technology designed to create near-exact copies.
Argument 3: The Model Includes the Training Data, But There’s No Substantial Similarity Between the Two
Another way to get to the same place is to argue that at a certain point the model has so much training data, and that data is so complexly bundled up and interwoven within the model, that there is no copyright violation because the model and any one piece of training data no longer have substantial similarity. After all, it is literally impossible to pull any one piece of training data from the model itself without creating output. If you really want a specific copyrighted work in its entirety, it is at least an order of magnitude easier to just go get it from the website it was scraped from, or to go photocopy a library book, than to try to get a properly trained foundational model to faithfully reproduce it. The portion of data in the model about any specific piece of training data is infinitesimally small compared to the totality of the data in the model.
This argument could be aided by the fact that the model doesn’t just contain information about the training data (or the training data itself); it also contains information about what sort of output is pleasing to humans. In the image context, the model contains data not just about the training images, but about how to interpolate them to create something responsive to the prompt that’s also appealing and not visually confusing or disturbing. A model that only contained information about the training data, but not how to evaluate and transform it for the purposes of creating output, would not be very useful.
The use of any particular piece of training data could be deemed de minimis3 in such a case, but there would still be potential claims of contributory and vicarious copyright infringement as well as claims related to the output, since the output could still have substantial similarity to the training data.
This may be subtle, but I think there’s a big difference between arguing that the model doesn’t contain the training data at all (which I find unconvincing, not least because you are still asking a court to declare that a certain small percentage of copying can be overlooked for reasons that aren’t entirely clear – I’m not aware of any model guaranteeing a 0% copying rate), and arguing that the particular type of encoding the training data has undergone transforms the model into something very different from merely the sum of all of its training data.
The Limits of Analogies
While slam-dunk analogies are incredibly powerful in a courtroom setting, the courts are not the only way to get clarity on the issue of model training; there is also a statutory pathway wherein Congress creates a specific copyright exception for model training based on certain policy considerations. The relatively open practice of model development and refinement is good for technological progress – there is no doubt that making these models publicly available has sparked a Cambrian explosion of technological progress in the field. And models that can be transparent about what training data is influencing their decision-making are socially desirable even if that transparency comes at the cost of allowing models to memorize the training data.
There’s certainly precedent for Congress creating copyright exemptions based on policy goals tied to specific new technologies (like the DMCA’s safe harbor for those who host third party content) and other countries have created copyright exceptions specifically for model training for these and many other reasons. If US legislators are to be encouraged to draft a clear exception for model training, they must decide that the appropriate outcome is not likely to be reached by the courts any time soon, i.e. that the laws currently on the books do not adequately address this topic and new ones must be written. They won’t feel that way, though, if the question of fair use with respect to AI/ML training is popularly treated as a foregone conclusion.
To that end, while I believe that some of the alternative arguments I’ve offered are closer to the technical truth at issue, I don’t have any reason to believe that they are easier for a jury or judge to understand than the ones proffered by others. In general, this is a deeply technical discussion about which even AI experts hold a broad range of opinions, which makes this a subject better suited for regulatory experts or legislators in consultation with various advisory committees and federal agencies than for 12 people chosen at random or a single judge.
Conclusion
I agree with Lindberg’s general notion that the legal system should find a way to remove copyright liability from at least some ML model developers for a number of policy reasons. But tying the legality of AI model training to the fact that its memory is imperfect, or to the fact that it’s currently challenging to get a model to reproduce training data as output, may be a short-sighted approach. To the extent that it’s desirable to clear a legal path forward for ML/AI model training, it doesn’t make sense to rely on arguments that lead courts to create rulings that can only protect somewhat frivolous AIs like image generators from liability, but whose logic cannot be extended to a future set of AIs used in scientific and mission-critical contexts, and whose logic, in fact, creates precedent for finding those other types of models illegal.
As I have previously written, I don’t think the model just stores information about the training data. I think the model also stores information about what sort of output humans find appealing. That data is what sets models apart even when they’re trained on the same training data corpus. This seems logical to me since models can be improved without training them on new training data.
Note that while it’s arguable whether or not 17 U.S.C. 117(a)’s incidental copying exception would apply to model training, that section’s exception doesn’t extend to creating derivative works so I’m going to leave a 117(a) discussion for a separate post.
While the Ninth Circuit has famously said that the de minimis doctrine is about what percentage of a copyrighted work is copied in Bell v. Wilmott Storage Services, LLC, other courts have looked to the newer work to assess what portion of the newer work is made up of the older work and they have taken into account how observable the copy really is in the new work, even if it was copied in full. Aaron Moss writes: “For example, in a case involving the 1995 Brad Pitt crime thriller “Seven”, the Second Circuit found that that [sic] the use of copyrighted photos that appeared fleetingly and out of focus for 35 seconds of the film was de minimis. Another court reached the same result when a pinball machine appeared in the background of a scene from “What Women Want.” And last year, the court in Solid Oak Sketches v. 2K Games held that the use of NBA players’ tattoos in the popular “NBA 2K” video games was de minimis, pointing out that the tattoos appeared only fleetingly, and comprised only 0.000286% to 0.000431% of the total game data.”