Micropayments Would Benefit Everyone

This is the second part of a four-part series called “Protecting Digital Content in the AI Age: A Lawyer’s Guide.” Part 1, “AI and the Digital Content Provider’s Dilemma,” examined the effects of AI on digital content providers, including both a drop in web traffic and significantly higher demands on web infrastructure, and called for new business models like micropayments to maintain the open web. Part 3 will examine AI crawlers, robots.txt, the technology behind the web, and new tools for digital content owners. Part 4 will analyze US and EU laws related to text and data mining opt-outs and technologies that may help digital content owners sustain their businesses.

As discussed in Part 1, the open web is already shrinking in response to AI, with more digital content providers moving their content behind logins, removing it from the internet entirely, or attempting to block access to it via both legal and technical mechanisms. AI companies are losing access to data, while the rest of us are losing an open web. Bespoke deals between AI companies and certain content providers are just a band-aid; AI companies can only negotiate so many deals. If nothing changes, both the AIs and the internet as a whole will continue to lose “long tail” niche content, putting content providers out of business, and AI models will exhibit more bias because they will lack data from smaller cultural communities and non-English-speaking countries. Micropayments – an automated, frictionless way for AI companies to pay for the content they use – are the right solution for content providers, AI companies, and society as a whole.

A. Micropayments Can Quickly Force AI Companies to Pay for Content to Remain on Even Footing with Their Competitors

An automated means by which content providers could offer free or cheaper licenses to non-profits, small businesses, companies that respect content providers’ preferences, or companies that release open source AI models (or some combination thereof, under whatever definitions content providers prefer), while signaling that they expect the megacorp AI players to pay, is the single most effective way to pressure AI companies into paying for the content they use. Such a regime completely changes the AI companies’ calculus. Today, the AI companies weigh the costs and benefits of certain licenses or the possibility of certain lawsuits, and they can incrementally choose how much risk they are willing to accept so long as all of their competitors are doing the same math. But a competitor who gets access to content for much less, or for free, poses a starker choice to the megacorp: pay for the same data or lose to entities whose AIs can provide higher-quality information. The question would become existential in a way that it is not today.

Content providers who refuse to privilege certain entities over others, whether out of principle or in search of maximum profits, are disarming themselves of the single most powerful tool they have to protect their businesses. In particular, entering into deals with AI companies while leaving non-profits and small businesses with no rights at all (including by blocking all entities from training on their material in robots.txt) only helps to secure an oligopsony of AI companies and makes the content providers more and more vulnerable over time.

Many of the major AI companies view their mission as reaching artificial general intelligence or artificial superintelligence (human levels of intelligence or higher), and many in the industry, including Dario Amodei (CEO of Anthropic), Demis Hassabis (CEO of Google DeepMind), and Elon Musk, have explicitly stated or otherwise implied that they view such an achievement as a winner-take-all proposition. That is their fundamental argument for why the US should do everything in its power to beat China in the AI race. Content providers need to understand this dynamic in order to fully appreciate why price discrimination is such a powerful tool in this circumstance and why AI companies fear one another more than they fear billion-dollar lawsuits.

B. Micropayments Keep Power in the Hands of Creators Instead of Intermediaries

Intermediaries can provide indispensable services (especially when content must be provided in physical form), and in some cases they can use their leverage to get individuals better deals, especially when they take the form of regulated non-profits, as in the EU. But if AI companies continue to make bespoke deals with large content providers and intermediaries while ignoring everyone else, the natural result will be intermediary growth and then consolidation, giving intermediaries ever more leverage over AI companies, content providers, and the rest of society.

The companies that ignore or neglect the individual content creator in favor of feeding the intermediary invariably discover, again and again, that they’ve been throwing meat to a baby dragon, not a household pet. This is the story of how Amazon fed publishers (instead of authors) who then created their own e-book and audiobook platforms, how Netflix fed movie studios (instead of writers and directors) who then pulled their catalogs to their own streaming services, how YouTube fed multi-channel networks (instead of individuals) that were able to move the best content creators to their own video services, and how Apple’s iTunes Store fed the music labels (instead of individual artists and even independent labels) who then backed Spotify and other streaming services that decimated iTunes. Each and every time, these companies eventually offered individuals more, but only after allowing the dragons to fly away with a limb or two.

The flip side, of course, is that these intermediaries also amass more power over creators, especially the ones they already serve, who may or may not be able to renegotiate for a cut of the money. Such a scenario is already playing out in the Anthropic settlement, where, to Judge Alsup’s surprise and dismay, a number of publishers are pulling chairs up to the table instead of authors because many authors have already signed away some or all of the rights to their infringed books. Consolidation means that just a few entities wield massive control not just over AI companies, but over what the rest of us have access to (especially in authoritarian countries), the price for such access, who can make a living making content, and on what terms. Publishers like HarperCollins already don’t blush when it comes to telling their authors they’ll be taking 50% of AI training revenues, take it or leave it.

Intermediary consolidation is no less of a threat to content providers than AI. Elsevier is the wildest dream of every corporate intermediary, exploiting consolidation in scientific publishing to its brutal, logical conclusion. In large part, their business model is to take people’s research for free, have others do peer review for free, and then sell the research back to the very same institutions that funded it at exorbitant rates. They require copyright assignment from authors. They bundle journal subscriptions, such that institutions pay even for things they don’t want. Historically, they’ve lobbied against the open access movement, and more recently they’ve co-opted it, charging authors $2k-$10k to put their papers up on an open-access website. Their profit margins rival those of oil companies.

Their high fees have forced smaller scientific publishers out of the market because libraries don’t have the budget for much beyond the Elsevier subscription, leaving authors and institutions who’d actually like to be compensated for their work with few, if any, other options. Authors now largely choose between supporting Elsevier and having their works widely read in well-respected journals, or publishing them freely online in total obscurity. Either way, compensation is off the table and scientists have to make money via institutional paychecks; they cannot work as individuals and make money from their articles instead. Consolidation doesn’t just change prices; it often forecloses individual autonomy and reinforces other exploitative systems.

C. Micropayments Will Help Modernize the Open Web and Make AIs Better

Companies receiving compensation from AI companies will have an incentive to make their websites AI-friendly, making changes that allow AI agents to quickly and accurately find information, make purchases, and select multimedia with appropriate use rights. That would potentially allow AI companies to segregate human-created works from AI-generated works to home in on high-quality data, lower the costs of inference involving retrieval-augmented generation (RAG), and cut the cost of cleansing training data.

In particular, knowing that a certain portion of data is coming from trustworthy sources means AI companies working to avoid data poisoning can focus their efforts and enhance security. There has also been work on protocols like IndexNow that allow a website or content delivery network (CDN) to signal to a crawler whether the website has changed since the last time a crawler visited, potentially reducing unnecessary crawls by 53%, saving energy, and helping AIs prioritize fresh information. It’s not difficult to imagine a regime where AI companies pay AI-friendly sites more. At the same time, people using open source AIs, including researchers and non-profits, will also benefit from higher-quality information. 
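
To make the IndexNow idea concrete, here is a minimal sketch of a change notification in Python against the shared api.indexnow.org endpoint; the domain and key value are hypothetical placeholders, and the protocol assumes the site already hosts the key file at its root.

```python
# Minimal sketch of an IndexNow "these pages changed" ping.
# Assumptions: example.com and the key value are hypothetical; per the protocol,
# the site must already serve the key at https://example.com/<key>.txt.
import json
import urllib.request

ENDPOINT = "https://api.indexnow.org/indexnow"
HOST = "example.com"
KEY = "0123456789abcdef0123456789abcdef"  # hypothetical site key

def notify_changed(urls: list[str]) -> int:
    """Tell participating crawlers which URLs changed, so they can skip everything else."""
    payload = {
        "host": HOST,
        "key": KEY,
        "keyLocation": f"https://{HOST}/{KEY}.txt",
        "urlList": urls,
    }
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # 200/202 indicate the ping was accepted

if __name__ == "__main__":
    print(notify_changed([f"https://{HOST}/blog/new-post"]))
```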

D. Micropayments Will Ensure Healthy Competition and Technological Progress

Without micropayments, the competitive landscape in AI would stagnate. Startups trying to find their footing in this space simply can’t ink thousands of content deals before they sign a single customer. Payments to the content provider might be back-ended (“we’ll pay you at the end of the year”), but payments to lawyers and salespeople are not. 

In the absence of competition, the incumbent AI companies will do what oligopolies always do: stall disruptive innovation, engage in regulatory capture, keep prices high, exert outsized influence on companies and industries upstream and downstream of them, and proceed with the enshittification of their services, as Cory Doctorow puts it: slowly shifting value from the user to the AI company because the user now has nowhere else to go (everybody wave at Comcast!).

It’s nice to be an oligopolist, I imagine, but this position wouldn’t necessarily translate into more profits for the AI companies, because squelching innovation – both for their competitors and their users – means that at some point their share of the pie might grow, but the pie itself won’t. Satya Nadella’s goal of raising worldwide GDP growth to 10% per year through AI is explicitly contingent on AI itself becoming a commodity, which isn’t possible with billions of dollars’ worth of barriers to entry. He’s manifestly correct on this front. The success of the Silicon Valley titans over the decades has hinged on technologies where the aggregate value of the technology to its users is many orders of magnitude higher than the amount that can be charged for it, and the right play isn’t charging more for it; it’s spreading it far and wide and finding a way to harvest some of that value downstream.1

A simple example of this is Apple apps. Apple could have decided that the only apps on its phone would be Apple apps, so it would make 100% of the money off all software on its phones. Great, it would have had a monopoly there. But it’s clear that Apple has made more money off the App Store from third-party apps than it ever would have from its own apps. Third-party apps had huge knock-on effects. Here’s a non-exhaustive list: the phone now had appeal to both consumers and businesses; other companies became more valuable because their products could be accessed on the go (so they hired more people, who needed more phones, who wrote more apps, who invested in new companies…); an entire industry of mobile-first apps like Uber and Shazam sprung up; and everyone online was a bit better off for the added security of two-factor authentication, enabled by third-party apps. All of this made the phone that much more valuable, as well as all the other products and services in the Apple ecosystem. By allowing its share of the app pie to shrink, Apple expanded the entire universe of its potential customers.

Stay Tuned!

Stay tuned for Part 3 of this series, which will do a deep dive into the technologies that can enable micropayments and will provide a refresher on many of the fundamental web protocols that can be used and extended for AI.

  1.  This is the rationale for much of commercial open source and for free products like Gmail or Bing. Sometimes the company practices vertical integration and becomes a user itself in a particular domain to capture some of those downstream benefits. GitHub Copilot is an example. ↩︎

AI and the Digital Content Provider’s Dilemma

This is the first part of a four-part series called “Protecting Digital Content in the AI Age: A Lawyer’s Guide.” Part 2 will discuss the advantages of automated payments for AI companies and the public. Part 3 will examine AI crawlers, robots.txt, the technology behind the web, and new tools for digital content owners. Part 4 will analyze US and EU laws related to text and data mining opt-outs and technologies that may help digital content owners sustain their businesses.

A. The Web That Could Have Been and the Web We Got

So often I find a writer I want to follow, but I don’t want to add another monthly subscription to my budget for a newspaper or magazine that I’m otherwise not interested in. Sometimes I read an incredibly influential piece that got millions of views, but the writer got no compensation at all because they don’t put out enough content to warrant a subscription. Every now and then I’ll stumble upon a hobbyist site, clearly run by someone who drinks way too much coffee, that fulfills a one-time need for very useful information, like one that dutifully cataloged the specs of an untold number of computer monitors. What if it were as easy to compensate them as throwing coins into a street musician’s hat? Visit a site, automatically toss them some money.

The idea of web micropayments is an old one.[1] For techno-optimists, it sits in the same wistful part of the brain as federated social media platforms and ubiquitous mesh networks (and bringing Google Reader back from the dead, if we’re going to be honest). Micropayments never took off because credit card companies weren’t interested, and, unlike with the record companies and the movie studios, there was no forcing function like Napster or Netflix to drag them kicking and screaming into the next millennium. There was simply nowhere else to go if someone wanted digital payments. For context, the first website was launched in 1991, and PayPal (for example) didn’t launch its services until 2000.
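
Footnote 1 below notes that the 1997 HTTP spec already reserved status code 402, “Payment Required,” for exactly this scenario. As a purely hypothetical sketch (the X-Payment-* header names and the token check are invented for illustration and are not part of any standard), a site could quote a price on unpaid requests and serve the article once a payment token shows up:

```python
# Hypothetical sketch of an HTTP 402 micropayment handshake.
# The "X-Payment-*" headers and the token check are invented for illustration;
# no standard micropayment scheme is implied.
from flask import Flask, request

app = Flask(__name__)
PRICE_USD = "0.002"  # hypothetical per-article price

def token_is_paid(token):
    # Stand-in for a real check against a payment processor.
    return token == "demo-paid-token"

@app.route("/articles/<slug>")
def get_article(slug):
    token = request.headers.get("X-Payment-Token")
    if not token_is_paid(token):
        # 402 Payment Required: quote a price and say where to pay.
        return ("Payment required", 402, {
            "X-Payment-Amount": PRICE_USD,
            "X-Payment-Currency": "USD",
            "X-Payment-Endpoint": "https://pay.example.com/quote",
        })
    return f"<h1>{slug}</h1><p>Full article text...</p>"

if __name__ == "__main__":
    app.run(port=8080)
```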

In the meantime, digital content providers increasingly relied on ad revenue, whether their content was copyrightable or not. Google Search launched in 1998, initially putting ads in the search results screen as part of Google AdWords in 2000. The web caused a near extinction event for a number of businesses, most notably local journalism.[2] The remaining digital content owners and Google settled into a happy symbiosis: content owners acquiesced[3] to their content being copied and indexed for listing on Google Search in exchange for Google sending viewers to their sites. In 2003 that symbiosis was enriched by Google AdSense and what would eventually become the Google Display Network, which allowed content owners to sign up for Google to serve ads on their site for a share of the ad revenue. Google eventually drove so much business on the web that they pried content providers out from behind portals like Yahoo! and largely decimated the closed web. In 2007 the New York Times actually scrapped its subscription service because they realized they could get more money just from ads and didn’t return to the concept for another 4 years.

Google Search and the ad-supported open web did not just move analog content online. Nearly every organization with a web presence became a content creator, and many began to offer deep libraries of free, expert content to lure visitors to their sites and build trust in their brands. For example, law firms like Quinn Emanuel publish several blog posts a day about recent cases and new laws; hobby retailers like Goulet Pens have a giant trove of articles and videos about their goods; and software security firms publish vital security-related information, like Dark Visitors’ widely cited list of all known bots on the internet. It’s not just that the world saw an explosion of content; the type of content being produced was new, too: previously either not available at all, limited to customers of the firm, or only published in a book at least a year after it would have been useful.

But the ad-based web has eaten itself. Websites are now optimized for engagement and views, not quality, so outrage and conspiracies have become the coin of the realm, and the engagement (addiction) strategies of social media algorithms make the casino playbooks look absolutely Stone Age. Hucksters have learned to jump their way up the Google Search results so effectively that nearly every search, no matter how specific, yields links of the “10 very basic facts about the domain you were asking about” variety – quotation marks in your search be damned! Things have gotten so bad that even the Wall Street Journal noticed, back in 2023, that people were adding the word “reddit” to their searches so they could find genuine information, because Google now struggles to surface things like personal blogs and other non-commercial material.

B. The Rise of AI and the Decline of Human Traffic and Ad Revenue on the Web

A major factor in AI’s popularity is the decline of Google Search. Casual AI users appreciate getting a quick, straight answer without having to wade through the muck of Google Search results, and people doing a deeper dive are rejoicing because some of the models are capable of finding relevant sources that Google Search simply can’t surface anymore. Unfortunately, most queries are of the casual variety, and the AI answers often eliminate the need to visit the sources, even if they’re listed. That translates into less ad revenue for site owners who host content and fewer transactions for e-commerce sites.

Data from SimilarWeb shows that 44 out of the top 50 news websites in the US saw declining traffic in the last year. Search referrals to top U.S. travel and tourism sites tumbled 20% year over year last month, 9% for e-commerce, and 17% for news and media sites. Cloudflare’s data indicates that with OpenAI, it’s 750 times more difficult to get traffic (referrals) than it was with the Google of old (before its answer box and its AI overviews) and with Anthropic, it’s 30,000 times more difficult.[4]

Perplexity AI has become a recurring character in the AI v. digital content provider saga. While the practices of many AI companies have been the subject of a healthy debate, some of Perplexity’s behavior constitutes incontrovertible AI-powered plagiarism. Perplexity is valued at about $20 billion, and it primarily runs a search engine that summarizes the sources it recommends. During the summer of 2024, they were excoriated for publishing an AI-generated article that essentially recycled investigative journalism from Forbes, lifting sections of the Forbes article and even reproducing some of its images, without even mentioning Forbes or the authors of the original piece. They even created an AI-generated podcast related to the piece that irritated Forbes to no end because it outranked all other Forbes content in a Google search for the topic.

That same summer, WIRED used all the firepower it could muster in calling Perplexity out for producing output claiming that WIRED reported things it didn’t, closely paraphrasing WIRED articles without attribution, and reproducing WIRED’s photographs while showing attribution only if the image is clicked – all while ignoring WIRED’s robots.txt. They published an article entitled “Perplexity Is a Bullshit Machine,” among others. That article was shortly followed by another, entitled “Perplexity Plagiarized Our Story About How Perplexity Is a Bullshit Machine.”

In a September 2025 interview with Stratechery, Cloudflare’s CEO (another recurring character discussed further in Part 3) described similar problematic behavior:

“[I]f they’re blocked from getting the content of an article, they’ll actually, they’ll query against services like Trade Desk, which is an ad serving service and Trade Desk will provide them the headline of the article and they’ll provide them a rough description of what the article is about. They will take those two things and they will then make up the content of the article and publish it as if it was fact…”

Cloudflare also exposed that Perplexity crawlers were bypassing website permissions by disguising their identity. Perplexity really seems to have gotten Cloudflare’s goat, and driven some of Cloudflare’s innovation, or at the very least, its rhetoric. Cloudflare and Perplexity have since engaged in several public brawls that have focused discourse on AI crawling, all while avoiding punching the actual AI heavyweights in the room.

It may be tempting to argue that declining viewership is caused by just a few bad apples rather than the fundamental nature of the technology. After all, some of those referral numbers are significantly better than others and historically, there have been many cycles of technological disruption, copyright holder backlash, and then gradual realization by copyright holders that the technology can actually improve their businesses. However, even when AI companies appear to be following all the best practices and creating output that properly references sources, websites are still losing traffic. Google’s AI Overviews (powered by Gemini), which follows a lot of good practices and has the strongest incentive to send people to actual webpages because they contain Google ads, appears to be choking off web traffic by about 40%:

C. AI Crawlers Are Overwhelming Web Infrastructure and Driving Up the Costs of Maintaining a Website

Apart from the question of data governance, AI crawlers are taxing site infrastructure, significantly degrading site access for other users, and regularly taking down sites altogether. An estimated 30% to 39% of global web traffic now comes from bots. Unlike older search crawlers, AI crawlers ignore website permissions (robots.txt), crawl-delay instructions, and bandwidth-saving guidelines, causing traffic spikes that can be 10-20x the normal level[5] within just a few minutes. Many sysadmins report that crawlers are running random user agents from tens of thousands of residential IP addresses, each one making just one HTTP request and therefore masquerading as a normal user. This activity amounts to a distributed denial-of-service (DDoS) attack, which can’t be thwarted by mechanisms like IP blocking, device fingerprinting, or even CAPTCHAs.[6] The crawler activity affects not just the websites being crawled, but other websites as well if they’re on a shared server.[7] Even the largest sites experience performance issues when crawled by AI-related crawlers.
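
For contrast, here is a minimal sketch of the etiquette a well-behaved crawler is supposed to follow before fetching anything, using Python’s standard library robots.txt parser; the bot name and site are hypothetical.

```python
# Minimal sketch of the robots.txt etiquette AI crawlers are accused of skipping:
# fetch /robots.txt once, honor Disallow rules for your user agent, and respect Crawl-delay.
import time
import urllib.robotparser

USER_AGENT = "ExampleAIBot"  # hypothetical crawler name

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

def polite_fetch_allowed(url: str) -> bool:
    if not rp.can_fetch(USER_AGENT, url):
        return False  # the site has opted this bot (or everyone) out of this path
    delay = rp.crawl_delay(USER_AGENT)
    if delay:
        time.sleep(delay)  # honor the site's requested pacing between requests
    return True

print(polite_fetch_allowed("https://example.com/some/page"))
```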

Sites hosting open source software seem to be particularly juicy targets. Anthropic was in the headlines last summer for visiting certain sites more than a million times a day. A GNOME sysadmin estimated that in a 2.5-hour sample, 97% of attempted visitors were crawlers. Both Fedora and LWN, a Linux/FOSS news site, have reported that only a small portion of their traffic now consists of humans and that they’re struggling just to keep their sites up – Fedora has been down for weeks at a time. These do not appear to be examples of crawler bugs – some sites report a regular pattern of being scraped every six hours.[8]

Other kinds of websites are also being attacked. An April 2025 survey by the Confederation of Open Access Repositories, representing libraries, universities, research institutions, and similar organizations around the world, indicated that 80% of surveyed members had encountered service disruptions as a result of aggressive bots; a staggering one in five reported a service outage that lasted several days. Wikimedia has seen a 50% increase in multimedia downloads on Wikipedia. The explosive growth in the number of crawlers and the scale of their activities drove Wikimedia to the point that it started creating structured Wikipedia datasets for AI companies to download just to keep them off the live site. Even small, niche sites have been under distress. A website hosting pictures of board games reported that it was crawled 50,000 times by OpenAI’s crawler in a single month, consuming 30 terabytes of bandwidth.

The kicker? Websites with slow-loading pages are ranked lower in Google Search results!

The amount of pressure even the more scrupulous AI companies are putting on site infrastructure is vastly disproportionate to the amount of human traffic they send to those same sites:

One group of academics, policy-makers, and advocates has suggested that the digital commons is currently subsidizing AI development by bearing these additional infrastructure costs and involuntarily contributing to the environmental footprint associated with AI. Indeed, although the EU AI Act requires companies to disclose the energy used in training and inference, they are not required to disclose an estimate of the energy used by third parties in responding to their crawlers or the energy used to block them.[9]

D. Conclusion

The question of paying content providers is fundamentally about preserving the open web, not necessarily punishing AI companies for doing something wrong. I might even be persuaded that certain activities or services are fully within the bounds of the law. But even perfectly legal, well-intentioned activities can create negative externalities that should nevertheless be addressed. The current state of affairs is not sustainable. People won’t keep posting freely available content, at increasing expense, just to sate the AIs if users don’t even associate them with their work, never mind compensation. They’ll move their content behind paywalls, join walled gardens, or simply stop creating content. Individuals will be further isolated in their information bubbles. Per a recent study, “Consent in Crisis: The Rapid Decline of the AI Data Commons,” 20-33% of all tokens from the highest-quality and most frequently used websites for training data became restricted in 2024, up from 3% the previous year. Another compensation model is needed, and it looks like the technology to power it might be right around the corner.


[1] The 1997 HTTP spec optimistically stubbed out a micropayments status code: 402 (Payment Required).

[2] https://en.wikipedia.org/wiki/Decline_of_newspapers

[3] Not without throwing a legal tantrum first, of course. Google faced numerous lawsuits related to indexing websites as well as creating thumbnail images for image search. See Perfect 10, Inc. v. Amazon.com, Inc. & Google Inc. (2007) and Field v. Google, Inc. (2006).

[4] Cloudflare has made a lot of data available about the activity of AI crawlers. An explanation of their metrics and links to live dashboards are here.

[5] https://www.theregister.com/2025/08/29/ai_web_crawlers_are_destroying/

[6] Headless browsers, discussed in Part 3, allow AIs to interact with websites like a human would.

[7] https://www.inmotionhosting.com/blog/ai-crawlers-slowing-down-your-website/

[8] https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/

[9] See the Model Documentation Form published alongside the Transparency chapter of the EU AI Act Code of Practice.

The Enforceability of AI Training Opt-Outs

Creative Commons and the NYU Stern Fubon Center for Technology, Business and Innovation recently hosted a workshop in NYC, inviting participants with expertise in IP and various “open” movements and communities to give feedback on their AI-related proposals. This article was prompted by my participation in that workshop.

Creative Commons has been working on creating a set of “preference signals” for copyright holders to indicate how they would like their works to be treated by AI developers considering using their works for AI training. Currently, these preference signals are meant to be applied at the data set level, not to each individual work.1 Creative Commons has said that it is not treating these preference signals as legally enforceable at the moment, presumably because it believes that using copyrighted works to train AIs is likely to be considered “fair use” under US copyright law. Where use of a copyrighted work is deemed a “fair use,” a license attempting to prevent or limit such use is unenforceable.2 Wikimedia, the largest and most famous licensor to employ Creative Commons licenses, agrees that the fair use defense is likely to prevail.3

 I think this approach is premature. 

EU Copyright Law Rules the World

EU AI Act Brings EU Copyright Law to the World

Many jurisdictions do not have a concept of “fair use,” but instead have statutory exemptions from copyright liability. In the EU, the Directive on Copyright and Related Rights in the Digital Single Market (the “CDSM Directive”) allows commercial model developers4 to copy and extract content from copyrighted works for purposes of text and data mining (TDM), provided that the works are lawfully accessible and that the model developer abides by copyright holder opt-outs. The EU AI Act’s Article 53(1)(c) takes the unusual step of importing EU copyright law and the obligations in the CDSM Directive into the EU AI Act and applying them to all general-purpose AI model providers subject to the EU AI Act, even if they would not otherwise be subject to the CDSM Directive or European copyright law. That means that model developers still have to abide by EU TDM opt-outs, even if AI training is protected by fair use in the US or elsewhere.

1. Providers of general-purpose AI models shall:

(c) put in place a policy to comply with Union law on copyright and related rights, and in particular to identify and comply with, including through state-of-the-art technologies, a reservation of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790;

The EU AI Act’s scope is surprisingly broad in three ways. First, Recital 106 states that all AI developers subject to the EU AI Act must follow EU copyright law and respect opt-outs, even if they conduct model training outside the EU.5 This is unusual because, generally, the copyright laws applicable to any copyright-related acts are the laws of the jurisdiction where the acts are committed. Here, the EU specifically did not want to see the sale of models in the EU that would have been illegal to train in the EU. Second, it’s actually not clear if the intention is for model providers to respect opt-outs just for works governed by EU copyright law, or for all works from all over the world.6 The language is pretty ambiguous on this front. Even if this language only applies to works subject to EU copyright law, though, it would be impossible to identify such works on a mass scale with any degree of certainty.7 Therefore, in practice, companies will abide by opt-outs broadly to the extent standards for expressing them emerge.8

Third, the scope of entities to whom the EU AI Act applies is broader than even the scope of Europe’s main privacy law, the GDPR. The EU AI Act is not limited to companies operating in the EU or selling products or services into the EU from third countries. Its scope actually extends to any model provider whose output “is used in the Union.”9 Potentially, that means that a book with AI-generated images created in the US and sold in the EU is within the scope of the EU AI Act, and the model developers must then comply with EU copyright law. In other words, it’s almost impossible to escape EU copyright law with any certainty, since model providers have limited control over their users and users might have limited control over where their outputs end up.10

Creative Commons’ Protocol Proposal

The upshot for Creative Commons is that the TDM opt-out structure can be a vehicle for making preference signals legally enforceable against the vast majority of commercial AI model providers worldwide. The latest draft of the EU AI Act’s “General-Purpose AI Code of Practice, Copyright Section” specifies in Measure I.2.3 that model providers should follow the robots.txt protocol specified by the Internet Engineering Task Force and “make best efforts to identify and comply with other appropriate machine-readable protocols…” While some of the protocol proposals are binary (“ok to train” v. “don’t train”), a number of organizations have put forward proposals that include additional licensing terms or permissions. The question of which protocols will be legally accepted seems to depend on which ones gain popular adoption and public recognition. In practice, if a few major organizations like Common Crawl and EleutherAI get on board, that’s likely to be sufficient. Creative Commons’ stature certainly positions it well for meeting this criterion.
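
As a rough illustration of what complying with a machine-readable opt-out could look like on the crawler side, here is a sketch that checks for the response headers proposed by the W3C TDM Reservation Protocol community draft (tdm-reservation and tdm-policy); the header handling and the treatment of missing signals are my assumptions, not requirements drawn from the Code of Practice.

```python
# Sketch of a crawler-side check for a text-and-data-mining (TDM) opt-out signal.
# Assumes the site uses the W3C TDMRep community proposal's response headers
# ("tdm-reservation" and "tdm-policy"); treating an absent header as "no
# reservation" is an assumption, not settled law.
import urllib.request

def tdm_reservation(url: str):
    """Return (rights_reserved, policy_url) for a given page."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        reserved = resp.headers.get("tdm-reservation", "0").strip() == "1"
        policy = resp.headers.get("tdm-policy")  # where licensing terms live, if any
    return reserved, policy

if __name__ == "__main__":
    reserved, policy = tdm_reservation("https://example.com/article")  # hypothetical URL
    if reserved and policy is None:
        print("TDM rights reserved with no policy offered; do not mine.")
    elif reserved:
        print(f"Mining allowed only under the terms at {policy}.")
    else:
        print("No reservation signaled.")
```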

Enforcement of TDM Opt-Outs

Should CC preference signals become legally recognized by the EU, the applicable enforcement mechanisms will look different from those applicable to CC licenses. EU authors could bring copyright infringement claims against non-compliant companies that conduct training in the EU, but probably not against those that conduct training outside the EU. Such plaintiffs would need to look towards the EU AI Act instead. The Act cannot be enforced by private action, but a complaint could be filed with the relevant national regulatory agency for investigation. Corporate AI customers could potentially terminate their agreements and sue for breach of contract in the event an AI provider doesn’t respect CC preference signals, since most contracts require the vendor to comply with all applicable laws. In the meantime, such customers can specifically require compliance with CC preference signals in their contracts, and they can also make that a formal procurement requirement when selecting vendors in the first place. Since the EU AI Act carries hefty penalties like the GDPR,11 the lack of a private right of action will not deter companies from complying with the Act.

Fair Use in the US is Not a Foregone Conclusion

There are a lot of excellent papers out there by IP experts making well-reasoned arguments for a finding of fair use with respect to AI training. But it’s important to remember that these papers are meant to persuade individual judges about how they should rule; they are not nationwide forecasts of judicial rulings. Courts are not always perfectly logical, and many struggle to understand the technology they are asked to rule on. Think about the felony convictions doled out under the Computer Fraud and Abuse Act for security research and mere terms of service violations. Bad facts can lead to bad law: where the defendant is so reprehensible in the eyes of the court, the court is inclined to find a way to rule against them in the interest of fairness, without regard for the good-faith defendants that might follow later. Our system of law lurches forward slowly and unevenly, revealing only certain legal insights over certain types of technology in certain jurisdictions over many years. Keep in mind that it took over a decade to resolve the relatively straightforward question of whether copying APIs is copyright infringement in just a single dispute (Google v. Oracle). Even iron-clad logic is not a guarantee of any specific legal outcome, not even on a very long timeline.

Unpredictable Application

In the US, fair use is not an exception to copyright law; it’s a defense against copyright infringement that involves arguing that a complicated set of very fact-specific factors favors the defendant. So even though a court may hold that the defense is valid in one case, there is no guarantee that it will be valid in similar cases. Courts regularly make surprising or novel distinctions between similar cases, particularly where the underlying facts paint the defendant in a negative light.

One need look no further than the Supreme Court’s acceptance of the fair use defense with respect to the VCR (Sony Corp. of America v. Universal City Studios, Inc.)12 and its subsequent rejection of it with respect to peer-to-peer (P2P) networks (MGM Studios, Inc. v. Grokster)13 to see such distinctions. In both cases, the underlying technology can facilitate non-infringing copying and distribution of copyrighted work: VCRs enable time-shifting, allowing people to view shows at a time more convenient to them, and P2P networks were commonly used by universities for internal exchange of research and for distributed computation projects like Folding@home, which used P2P networks to simulate protein folding. In both cases, the technology could also be used in an infringing manner, and the purveyors of the technology publicly advertised uses that would clearly constitute copyright infringement: creating a personal home library of shows and movies on VHS tapes in the case of Sony, and downloading copyrighted music in the case of Grokster. On the face of it, the cases presented similar facts, and many IP experts predicted a win for Grokster. But, undoubtedly, Sony’s well-known and highly respected brand, combined with the justices’ own usage of VCRs, swayed them in one direction, while Grokster’s motley crew of anarchists and its association with the “dark web” swayed them in a different direction.

Inability to Make Blanket Fair Use Rules

With respect to AI in particular, distinctions may be drawn between different types of AI models (e.g., generative v. predictive models), different modalities (e.g., images v. text), the various domains where the models are used, and the purpose of the use. The Copyright Alliance gets this point right: “unauthorized use of copyrighted material to train AI systems cannot be handwaved by a broad fair use exception… Neither the Copyright Act nor case law… would support such a broad fair use exception for AI.” Each copyright infringement claim must be evaluated in the context of the model’s intended use case and whether it is, in practice, offering substitutes in the market for the kinds of works that comprise its training data.14

Likelihood of Inconsistency Between Circuits

The current crop of AI cases can easily be distinguished from one another, should a court wish to do so, because of the diversity of the plaintiffs (some dramatically more sympathetic than others) and the modalities involved (as well as many other factors). The cases are spread amongst various circuits. The specific issues the parties might choose to appeal to the US Courts of Appeals or to the Supreme Court, and the postures of the cases when they arrive, are unpredictable. The US is likely to have a patchwork of AI-related precedent throughout the various circuits that does not gel into a cohesive, consistent whole for many years to come (if ever), in the same way that the fair use doctrine itself took many years to come together. AI companies may end up with guidance on code-generating AIs in only one circuit, national guidance on training predictive models specifically on training data behind a paywall, and a single district court opinion in another circuit on image generation that expresses a lot of outrage over output but fails to specifically address just the training step.

Arbitrary Rulings

It’s also possible for a court to throw a complete curveball, such as in Thomson Reuters Enterprise Centre GmbH v. Ross Intelligence Inc. In that case, Ross was accused of copyright infringement for using Thomson Reuters’ case summaries to train its AI-powered case search engine, which suggested the names of cases when queried with specific legal questions. The judge inexplicably rejected Ross’s fair use defense because all the fair use cases raised by Ross related to the copying of code rather than text. This argument of course ignores major fair use cases that don’t relate to the copying of code (which were referred to in the cases that Ross cited), including those related to Google’s mass scanning of books to enable search within books, as well as Amazon’s and Google’s copying of images from all over the web to enable image search. Here, the judge distinguished the case before him from precedent by simply ignoring much of the precedent.

Conclusion

Given the uncertainty in where the law might go on the issue of fair use and AI training, a presumption that CC preference signals are unenforceable is premature. Declaring the signals unenforceable right out of the gate robs legal counsel of any gravitas they might bring to a compliance request. Companies don’t spend money complying with voluntary frameworks unless and until they get (or avoid) something tangible in return, and in this case, those benefits can’t manifest until there is sufficient adoption of the signals. Even in the world of open source software, where the benefits of the software are very tangible and the licenses are clearly enforceable, a huge portion of companies still don’t put in the time and effort necessary to do compliance at that scale. It would be much more effective to begin with the notion that the signals are enforceable, particularly in the EU, to drive adoption and compliance. Even if they turn out not to be enforceable in any jurisdiction decades from now, by then they may continue to function on the basis of norms.

  1.  A separate set of “signals” might be developed later for individual works. ↩︎
  2. Creative Commons is particularly attuned to this concept. Most of their licenses specifically include language like this: “For the avoidance of doubt, where Exceptions and Limitations apply to Your use, this Public License does not apply, and You do not need to comply with its terms and conditions.” ↩︎
  3. In fact, it has proactively put together a dataset of Wikipedia content for AI developers to use, in no small part to ease the burden of crawlers on its infrastructure. ↩︎
  4.  Article 3 also provides an exception for research organizations and cultural heritage institutions which carry out TDM for the purposes of scientific research. ↩︎
  5. Recital 106: “Any provider placing a general-purpose AI model on the Union market should comply with this obligation, regardless of the jurisdiction in which the copyright-relevant acts underpinning the training of those general-purpose AI models take place.” ↩︎
  6. Lawyers at Freshfields also raise this possibility. ↩︎
  7. The EU and the US do not require copyright registration for a copyright to be validly held, so there is no complete source, even at the national level, tracking who has what copyrights in what jurisdictions. Many copyrighted materials do not come with sufficient information to determine the author(s). Even if there is a name, without some sort of identification number, that name might match multiple people in multiple countries. Certain materials may have multiple authors from several different countries and it may not be clear exactly which copyright laws might apply to any given portion. Materials emanating from organizations are even more complicated because they might be coming from affiliates worldwide, and copyright ownership is governed by private corporate family agreements. In some cases, copyright ownership may actually sit with a contractor, customer, or partner despite public statements to the contrary due to private copyright assignments (or lack thereof). Suffice it to say, this is a complex determination that is resource intensive and prone to error because much of the relevant information is not publicly available. ↩︎
  8. Companies are also likely to respect these opt-outs even in the US because at the very least, they signal risk of litigation and they might be relevant for a fair use analysis in court. ↩︎
  9. Article 2(1)(c) of the EU AI Act. ↩︎
  10. That is certainly the case for any current model provider large enough to actually threaten the data commons in a meaningful way and therefore be a subject of interest in Creative Commons’ application, and possible enforcement, of preference signals. It’s worth noting that there might be challenges to the vast scope of this law precisely because of its attempt at extraterritorial application of copyright law in this manner. But, that’s mere speculation on a matter that may not be decided for many years to come. In the meantime, AI companies are likely to attempt compliance. The third draft of the “General Purpose AI Code of Practice,” further specifying requirements in the EU AI Act, does not give any additional insight into the matter. ↩︎
  11. The GDPR has a right of private action but damages are limited to actual damages suffered by the plaintiff(s). Because litigation is so expensive, it is exceedingly rare for such litigation to be worthwhile for individuals. In practice, these cases are brought as class actions. Nevertheless, the vast majority of enforcement action is via data protection authorities and not litigation. ↩︎
  12. Sony Corp. of America v. Universal City Studios, Inc., 464 U.S. 417 (1984). ↩︎
  13. MGM Studios, Inc. v. Grokster, Ltd., 545 U.S. 913 (2005). ↩︎
  14. See my article, “Battle of the AI Analogies” for a lengthy discussion of the various facts that might make a fair use argument more or less likely to succeed. ↩︎

Choose Your Own Adventure: The EU AI Act and Openish AI

NOTE: This post was published based on a draft of the EU AI Act. The official version of the EU AI Act uses slightly different section numbers than the leaked draft discussed below, and it has been edited to remove errors and omissions. Make sure that you cite to the correct section numbers available here, rather than those used in this post. Additionally, the general-purpose AI Code of Practice and several guidelines and templates have since been published which dramatically reshape how some of the provisions in the Act should be interpreted, altering some of the conclusions below.

A copy of the EU AI Act leaked on January 22, 2024.1 The Act has since been unanimously approved by the ambassadors of each EU member country and is likely to officially go into effect in April. The Act exempts certain freely available AI/ML models2 from some of its obligations if they are under “free and open source licenses.” The Act only governs models put into real-world use and does not apply to AI models used or shared purely for scientific research or development. It therefore does not affect anyone’s ability to merely post models on public repositories. This post will examine the Act’s potential effects on providers of AI models, with a focus on “openish” AI models. Skip below to Part 2 if you don’t provide openish AI models, but want to better understand model providers’ obligations under the Act.

Part 1: Is My Model Under a “Free and Open Source License” Under the Act?

The TL;DR here is that the Act doesn’t actually bother to define exactly what it means for models to be under “free and open source licenses” and the Act is using this term in an idiosyncratic way. As you read the Act, you can silently replace every instance of “Free and Open Source License” with “Mystery License” in your head and you will have lost nothing by doing so. As best as I can tell just from the language in the Act, a model is under the Mystery License if the provider:

  • Doesn’t monetize it, including by charging for hosting or for support. Putting it up on HuggingFace or similar is fine, though
  • Releases it under a license* that allows for access, usage, study**, modification, and distribution** 
  • Makes its weights available
  • Provides information on the model architecture and model usage

* It’s unclear whether or not such licenses can have “field of use” restrictions that prohibit using the model for specific uses (like development of nuclear weapons or biometric identification).

** It’s possible that more AI-related artifacts (like training methodologies) may be explicitly required in the future. More on this below.

Read on if you’d like some colorful commentary around this term in the Act. Otherwise, just skip ahead to Part 2: What Does the EU AI Act Require of My Model?.

Review of the Plain Meaning and Course of Usage of the Term “Free and Open Source License”

Is There a “License”?

It’s not obvious that models, by themselves, are copyrightable. There is no major precedent or legislation in the EU (or in any other major market, as far as I know) that says one way or the other. My personal take is that the models are, more or less, just numbers, and contain no copyrightable human expression. The training protocols might qualify as patentable processes and the software used for training might be eligible for both patent and copyright protection, but the models are mere computer output. If the models aren’t copyrightable, then the legal documents attached to them aren’t licenses at all – they’re contracts.3 All of which is to say that it’s not clear there are ANY AI model “licenses” out there today, and it’s also unclear whether there will be any in the future.

Yet it’s strange to imagine that whether models are copyrightable, or whether they are dedicated to the public domain, should have any bearing on what transparency and safety steps providers of openish models need to take. My conclusion is that this is unlikely to be a dispositive term one way or another.

Is There “Source”?

Nope. Models aren’t code and therefore don’t contain any source code, which is generally what people are referring to when they refer to “source” in the context of either “open source” or software. Taken more broadly, the term “open source” might refer to the concept of publicly making available the underlying technology or artifacts that are used to build a final product (like design schematics for hardware), but there is no industry-wide consensus on what that might include with respect to AI models. In fact, the Open Source Initiative is still working on defining what “open” might mean in the AI domain and what would constitute the equivalent of source code for AI/ML models.

Is the License “Free and Open”?

Most people in the open source community would guess that this phrase refers to the definitions of free and open source software promulgated by the FSF and the OSI, respectively, or to a license approved by one or both of the organizations. However, none of the licenses currently approved by the FSF or OSI were intended for use with AI/ML models and they aren’t suited for the purpose. They’re open source software (OSS) licenses.

A number of openish AI-specific licenses have emerged, but none of the notable ones would meet the definitions of free or open source software (notwithstanding that models aren’t software) because they contain field of use restrictions which prohibit the models from being used for certain purposes (such as for the creation of biological weapons) or prohibit certain types of users (such as the military). Other licenses, like that for Llama 2, are really just free commercial licenses (“shareware,” if you’ll take a stroll down memory lane with me) and not “open” or “free” as traditionally understood by the open source community for a multitude of reasons. 

To the extent that OSI and FSF continue to categorically reject field of use restrictions, plenty of people are going to choose non-approved licenses anyway because the field of use restrictions are important to their ethics and/or because limiting downstream use can also limit their own liability with respect to the models,4 and protect their reputations. In the software realm, OSS licenses generally contain a disclaimer of warranties and a limitation of liability provision that applies to anyone exercising any of the rights granted in the license. When the software fails, the potential harm is generally borne by the licensee using the software,5 so those disclaimers and limitations are generally enough to immunize an OSS developer from liability. However, in the AI realm, AIs can cause serious harm to people who are not users or licensees of the AI provider – people who are denied loans based on AI-enhanced assessments by banks, for example. Since those individuals are not licensees, no disclaimers of warranty or limitations of liability apply to them. The only way an AI provider can attempt to limit liability with respect to those individuals is by prohibiting licensees from applying their AI models to risky uses in the first place.

AI developers also have a desire to release models under something like a “beta” or “eval” license so that others can test them before they are forced to decide if the model can be made available under a broader license or if they need to go back to the drawing board; that desire is more acute with AI than with software because the potential harms are so much greater and less predictable (no accounting software, for example, has accidentally tried to convince a user to divorce his wife). So it’s not clear to me that, even if the OSI and FSF managed to define what “free and open” might mean in the AI domain in the near future, they would be seen as the vanguard for this definition. Few AI providers will be inclined to take on global liability for human deaths (and any number of lesser harms) just to suit the principles of these organizations.

Is It Desirable to Use a License Already Approved by the OSI or FSF?

If a model provider has reason to put a model out under a license instead of dedicating it to the public domain, it’s a gamble whether or not any of the licenses currently approved by the FSF or OSI6 are likely to help them achieve their goals. They would be better off using an AI-specific license to give users clear restrictions and obligations, particularly if they wanted to add AI-specific transparency obligations. Further, from a policy perspective, I strongly suspect that the EU would prefer to see models licensed under something like the BigScience RAIL License than under Apache 2.0.

OSS licenses make reference to terminology that is not applicable to models (like “source code,” “binaries,” “build instructions,” “linking,” “macros,” etc.). Perhaps most importantly, copyleft open source licenses require that “modifications” (as in the Mozilla Public License 2.0) or “derivative works” (as in the GNU General Public License 2.0) of the OSS code also be licensed under the same or similar license, but it’s anyone’s guess how a court might interpret these terms in the AI model context. Recognizing this is extremely important for any model developer who is drawn to copyleft licensing because fostering collaboration is important to them, or who really wants to ensure that anyone using their models only does so in conjunction with products and services that are provided under similar terms.

“Modifications” often mean something like “additions or deletions to code.” That’s not a definition that works for AI models. “Derivative work” is a term that has a specific meaning in copyright law,7 is fundamentally inapplicable to a work that is not copyrightable, and in the software domain, depends on exactly how one piece of code interacts with another piece of code. That analysis takes for granted that the copylefted work and the larger or other work at issue are both pieces of software. The generally recognized consensus in the open source community is that if a software product uses or incorporates the output of a copyleft OSS package (output that is not code), but not the OSS package itself, the copyleft license of the OSS package will not extend to the product and the product is not considered a derivative work of the OSS package. If it were otherwise, it would be difficult to sustain copyleft text editors, for example. 

In other words, there is something of a blood-brain barrier between output and software when discussing the reach of a strong copyleft license. Applied to the AI model context, it would mean that copyleft training software doesn’t necessarily yield a copyleft model (though perhaps copyleft training data might) and that software receiving output from an AI model that is under a traditional copyleft license would not necessarily be affected by the license of the model either. These outcomes probably run counter to the goals that AI developers may have when placing their models under copyleft licenses.

If/when it is determined that models are copyrightable, the above-mentioned consensus in the open source community may or may not be relevant to any particular judge or jury, and, in any case, that consensus only answers some of the questions that may arise when deciding what is and isn’t a derivative work of a model under a traditional open source software license:

  • Are fine-tuned models derivative works of the models they were tuned from? (Starting with a softball!) 
  • What if you just publish a set of weights to swap out of the original model (a diff) but you don’t publish any of the weights in the original model? (See the sketch after this list.) 
  • Does a product or service that uses a particular model constitute a derivative work of the model? Does it matter how important the model is to the product (and if so, where is that bar and who sets it)? 
  • What if the product has, say, three features, and uses a different model for each feature and each model is under a different license? 
  • What if multiple models under different licenses are all used for just one feature? 
  • What about models trained on the output of another model? 
  • A model trained by another model? 
  • A model trained using the same training methodology as another model? 
  • A model trained using the same training methodology and the same training data as another model? 
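
To make the “diff of weights” question above concrete, here is a minimal, hypothetical sketch of what publishing only a diff might look like (real-world approaches vary; think adapter weights or patch files). The published artifact contains none of the base model’s numbers, yet it is useless without the base model, which is exactly why its status under a copyleft license is so unclear:

```python
# Hypothetical illustration of publishing only a "diff" of model weights.
# None of the base model's numbers appear in the published diff, but the diff
# is meaningless without access to the base model.

import numpy as np

def make_weight_diff(base: dict, finetuned: dict) -> dict:
    """Compute per-parameter deltas between a base model and a fine-tuned model."""
    return {name: finetuned[name] - base[name] for name in base}

def apply_weight_diff(base: dict, diff: dict) -> dict:
    """Reconstruct the fine-tuned weights; impossible without the base weights."""
    return {name: base[name] + diff[name] for name in base}

# Toy "models": two tiny layers standing in for billions of parameters.
base = {"layer.0": np.array([0.10, -0.20]), "layer.1": np.array([0.30, 0.40])}
finetuned = {"layer.0": np.array([0.12, -0.25]), "layer.1": np.array([0.30, 0.38])}

diff = make_weight_diff(base, finetuned)   # this diff is all that gets published
rebuilt = apply_weight_diff(base, diff)    # anyone holding the base model can rebuild
assert all(np.allclose(rebuilt[k], finetuned[k]) for k in base)
```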

Takeaway

The phrase “free and open source license,” and the constituent words in this phrase, come from the software world and have no established meaning when applied to AI. If the phrase was supposed to reference licenses specifically approved by the FSF and OSI, it’s a strange reference, since those organizations haven’t approved any AI-specific licenses or published any definitions related to “open” AI, and it doesn’t make sense to push people to use the existing approved licenses for models. Many, if not most, of the most popular freely available models out there aren’t licensed under true free and open source software licenses,8 yet there’s every indication that this language was intended to cover them. One can only conclude that the EU’s understanding of what constitutes a “free and open source license” is unique to the EU legislators drafting this Act.

Review of the Term “Free and Open Source Licenses” Solely Within the Context of the Act

Article 2, which critically addresses the scope of the Act, simply refers to models under “free and open source licenses.” Article 52c of Title VIII, addressing the need for authorized representatives in the EU, refers to models with:

 “…a free and open source licence that allows for the access, usage, modification, and distribution of the model, and whose parameters, including the weights, the information on the model architecture, and the information on model usage, are made publicly available.” 

Recital 60(f)9 similarly refers to AI models that:

“…are released under a free and open source license, and whose parameters, including the weights, the information on the model architecture, and the information on model usage, are made publicly available…” 

Recital 60i adds that:

“The licence should be considered free and open-source also [emphasis mine] when it allows users to run, copy, distribute, study, change and improve software and data, including models under the condition that the original provider of the model is credited, the identical or comparable terms of distribution are respected.”10

So far, this term isn’t too convoluted. It’s more or less asking for licenses that grant broad use rights (like all OSI-approved licenses) and for providers to make model parameters, weights, model architectures, and model usage info publicly available. 

There is some ambiguity here, though, about what it really means to be able to modify or study a model if you are only provided with the items specified in the Act (and don’t have information on the training methodologies, for example). A coalition of entities with an interest in openish AI wrote:

“…to understand how a given AI system works in practice, it is necessary to understand how it was trained. This stands in contrast to traditional software systems, where access to the source code is typically sufficient to explain and reproduce behaviors. The training dataset, training algorithm, code used to train the model, and evaluation datasets used to validate the development choices and quantify the model performance all impact how the system operates and are not immediately apparent from inspecting the final AI system.” 

They identify critical artifacts necessary to study and modify a model that are not specified in the Act. Does the right to modify and study imply that model providers actually need to provide more than that which is specifically listed in the Act to be eligible for the “free and open source” exceptions? Some legal experts both inside and outside of OSI have voiced this position. To my mind, a standard of openness that requires providing everything necessary to rebuild a model mirrors the GPL’s requirement that those who receive GPL code also receive everything necessary to modify the code and put it back into use. But, I also think “openness” in the AI context should have gradations (just like open source has a variety of licenses), and that this broad approach is just one conceivable and valid interpretation of “open.” 

There is also ambiguity as to whether the license must be free of field of use restrictions in order to qualify as a “free and open source license” under the Act. On the one hand, Recital 60i opens the door to licenses with certain conditions, and nothing in the Act explicitly forbids field of use restrictions. It would be rather strange for an Act focused on mitigating harm from AIs to disincentivize people from licensing their AIs in ways that prevent them from being used for risky or dangerous purposes. Many providers are keen to limit their liability with respect to openish models (see more discussion of this above) and would not want to make their models publicly available if the only way to do so required accepting unlimited liability for harms suffered by individuals impacted by an AI’s activities or results. In the long run, an interpretation that disqualifies any license with field of use restrictions probably puts a nail in the coffin for the possibility of an “open” AI in the image of open source software.11 But on the other hand, at least some lawmakers and regulators may want exactly that: it may be their intent to significantly shrink the “open” AI ecosystem.

 Recital (60i+1)12 is where it gets really confusing:

“Free and open-source AI components covers the software and data, including models and general purpose AI models, tools, services or processes of an AI system. Free and open-source AI components can be provided through different channels, including their development on open repositories. For the purpose of this Regulation, AI components that are provided against a price or otherwise monetised, including through the provision of technical support or other services, including through a software platform, related to the AI component, or the use of personal data for reasons other than exclusively for improving the security, compatibility or interoperability of the software, with the exception of transactions between micro enterprises, should not benefit from the exceptions provided to free and open source AI components. The fact of making AI components available through open repositories should not, in itself, constitute a monetisation.”

This set of requirements isn’t supported by any definitions or practices in the open source software domain. Open source licenses do not actually forbid charging for the software; what they guarantee is that recipients receive broad rights to use, modify, and redistribute it. And open source software doesn’t cease being open source simply because someone offers support services or hosting services for it. The point of open source isn’t to foreclose private profits; it’s to ensure an end user’s rights to use and modify the code and to exchange that code and those modifications with others. Somebody’s offer of support or hosting services doesn’t hinder user rights at all. In the AI domain, offering such services also would not take away from the transparency, safety, and innovation benefits that the Act otherwise foresees from open AI models. 

The bit about personal data looks inscrutable to me – it reads like a drafting error. Do they mean to deny the exception to otherwise “open” models if they’re trained on data that includes personal data? If they accept input that might include personal data? Either reading would bar every well-known openish AI model out there from taking advantage of the free and open source exception. I have no idea what this is supposed to mean.

Part 2: What Does the EU AI Act Require of My Model?

All models need to comply with the transparency requirements in Title IV, Article 52, to the extent that they are applicable. Here’s a summary:

  • To the extent that human users interact with the AI system, they need to know they’re interacting with an AI
  • Outputs (if any) generated or modified by AI must be marked as such (there are some nuances here; a purely illustrative sketch follows this list)
  • Deployers of emotion recognition systems or biometric categorisation systems that aren’t prohibited by the Act have to notify people that they are using such systems
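
On the output-marking bullet above: the Act doesn’t prescribe a technical mechanism for marking AI-generated or AI-modified content, and real deployments may rely on provenance metadata standards or watermarking instead. Purely as an illustrative sketch (the field names below are hypothetical, not anything the Act requires), a machine-readable disclosure might look something like this:

```python
# Purely illustrative sketch of attaching a machine-readable "AI-generated"
# marker to model output. The Act does not prescribe this format; the field
# names are hypothetical stand-ins.

import json
from datetime import datetime, timezone

def mark_as_ai_generated(text: str, model_name: str) -> str:
    """Wrap generated text with a machine-readable disclosure record."""
    record = {
        "content": text,
        "ai_generated": True,                 # explicit, machine-readable flag
        "generator": model_name,              # which system produced the content
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, indent=2)

print(mark_as_ai_generated("Summary of today's news...", "example-model-v1"))
```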

Keep reading to see what additional obligations may apply. All the obligations discussed below are additive.

Is Your Model Designed for a Prohibited Use or Are You (Personally) Using a General Purpose Model for a Prohibited Use?

The EU AI Act prohibits uses of artificial intelligence that fall under Title II, Article 5, regardless of the type of model in question (“general purpose AI” or not, “free and open source” or not). Such uses may occur only for the purpose of scientific research and development. Here is a brief summary of the unacceptable uses:

  • Deploying subliminal techniques or purposefully manipulative or deceptive techniques
  • Exploiting vulnerabilities of people due to age, disability or specific social or economic situation in a way that causes harm
  • Biometric categorisation to deduce race, political opinion, and other sensitive characteristics
  • Social scoring
  • Real-time biometric identification in public spaces (with a bunch of caveats)
  • Predicting future crimes of individuals
  • Creating or expanding a facial recognition database by scraping images from the internet or CCTV footage
  • Inferring emotions in a workplace or educational institution, except for safety reasons

Beyond this inquiry, the Act bifurcates between “general purpose AI models” and models designed for specific use cases. The bifurcation exists because the Act was first drafted before general purpose AIs made a big splash in the tech world, at a time when the dominant idea of an AI was one trained for one or a handful of narrow tasks. In 2023, numerous politicians (particularly from Germany and France) demanded that providers of general purpose AI models (aka foundation models) be exempted from the Act entirely because they didn’t want to stifle the growth of domestic AI companies. The bifurcation emerged as a compromise, lowering the number and scope of obligations applicable to such providers. In particular, general purpose AI models do not need to be approved by regulators before they are put on the market, even very powerful ones, unless the provider itself puts the model toward a high-risk use on its own behalf or that of a customer. In that case, the provider would be subject to all the regulations attendant to both general purpose AI models and “high-risk” uses. This is true for models that are and aren’t “free and open source licensed.”

Is Your Model Designed for a High-Risk Use or Are You Personally Using a General Purpose Model for a High-Risk Use?

With certain exceptions and nuances as expressed in Article 6, Annex III of the Act lists a number of AI uses categorized as “high-risk.” Generally speaking, these are uses of AI where the decisions it makes or helps others make have a significant impact on the course of an individual’s life, including decisions regarding access to employment, education, asylum, essential private and public services, law enforcement, etc. High-risk use cases also include ones that pose physical risks, like the use of AI for critical infrastructure or as part of a safety system for a physical product or system. “Free and open source licensed” models designed for a “high-risk” use have to comply with all the same requirements as other models if providers want to put them on the market (commercially) or use them for their own benefit or that of a customer (including by using a general purpose AI model for a high-risk use13); in that case, this category of AI models carries the most onerous obligations under the Act: 

  • The model must go through a “conformity assessment” and receive approval before it can be put on the market
  • Affix a marking to approved systems to indicate they have passed the conformity assessment
  • Maintain a risk management system, including testing
  • Implement a data governance policy
  • Maintain technical documentation – small companies can use a simplified form. Must include level of accuracy, robustness and security; possibility of misuse and possible risks; explainability info; human oversight measures, etc.
  • Provide info to deployers necessary for them to use the system and comply with the Act
  • Provide human oversight
  • Maintain a post-market monitoring system
    • High-risk AI systems that continue to learn after being placed on the market or put into service must be developed in such a way as to eliminate or reduce, as far as possible, the risk of possibly biased outputs influencing input for future operations (‘feedback loops’) 
  • The model must be registered in an EU database
  • Meet accessibility requirements
  • Maintain a “quality management system” – a policy for complying with all obligations in the Act
  • Keep all documentation for 10 years
  • System must automatically log events (a minimal sketch of what such logging might look like follows this list)
  • Provider must report incidents of noncompliance and corrective action taken to regulators
  • Must appoint a representative to perform tasks under the Act if company is outside the Union
  • Create written agreements with vendors to ensure the provider can meet its obligations under the Act
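
On the automatic-logging item above: the Act leaves the technical details of event logging to standards and codes of practice, so the following is only a minimal sketch, with hypothetical function and field names. The idea is simply that every use of the system leaves a timestamped, append-only record that can be traced later.

```python
# Minimal, hypothetical sketch of automatic event logging for a high-risk AI
# system: each call leaves an append-only, timestamped record. Real systems
# would log far more (operator, deployment context, confidence scores, etc.).

import json
import hashlib
from datetime import datetime, timezone

LOG_PATH = "high_risk_system_events.log"

def run_model(user_input: str) -> str:
    """Stand-in for the actual high-risk AI system."""
    return f"decision for: {user_input}"

def log_event(model_version: str, user_input: str, output: str) -> None:
    """Append one structured record per inference so use of the system is traceable."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        # Hash rather than store raw text, to limit retention of personal data.
        "input_sha256": hashlib.sha256(user_input.encode()).hexdigest(),
        "output_sha256": hashlib.sha256(output.encode()).hexdigest(),
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

def predict_with_logging(model_version: str, user_input: str) -> str:
    output = run_model(user_input)
    log_event(model_version, user_input, output)
    return output

print(predict_with_logging("example-model-v1", "loan application #1234"))
```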

Additional obligations will apply if your model is both used for a “high-risk” activity and is a general purpose AI model. However, I believe that the drafters of the Act imagined that it would be rare for a general purpose AI model provider to also be using their own model for a high-risk use. That might be true today, but it might not always be true. In particular, a general purpose AI doesn’t necessarily cease being a general purpose AI just because it is fine-tuned to perform better in a certain domain, so it seems possible that an AI provider may offer different flavors of their models, with some flavors specifically designed to perform high-risk activities. In the interest of completeness and because I enjoy being technically correct, I will frame this as a possibility.

Is Your Model a “General Purpose AI Model”?

All general purpose AI models need to comply with a subset of the transparency requirements in Article 52c, summarized below:

  • Put in place a copyright compliance policy
  • Provide a detailed summary of the content used for training (whether copyrighted or not), using a template to be provided by the AI Office

Is Your Model Under a “Free and Open-Source License”?

See Part 1.

Does Your General Purpose AI Model Pose “Systemic Risk”?

General purpose AI models are treated differently under the Act depending on whether or not they pose “systemic risk” due to their high impact capabilities. Models trained using a cumulative amount of compute greater than 10^25 floating point operations (FLOPs) are deemed by default to be models with “systemic risk” under the Act, unless the provider can demonstrate otherwise (a rough way to estimate whether a training run crosses this threshold is sketched after the lists below). The Act also allows regulators to add alternative criteria for determining whether a general purpose AI might pose “systemic risk.” Models in the “systemic risk” category are subject to a number of additional requirements:

  • Article 52d:
    • Perform model evaluations
    • Assess and mitigate systemic risks
    • Report serious incidents to the AI Office
    • Ensure cybersecurity
  • Appoint an authorized representative in the EU to coordinate/correspond with AI Office, etc. if your organization isn’t established in the Union

Further, the rest of the obligations under Article 52c related to transparency would also apply:

  • Create and keep up-to-date very detailed technical documentation, including training and testing processes and results of evaluation.
    • Notably this needs to include the model’s energy consumption
  • Create and keep up-to-date info and documentation for deployers of such AI systems to use

The obligations above related to general purpose AI models with “systemic risk” will be further spelled out in “codes of practice” to be drawn up within 9 months of the Act going into effect, via collaboration among the AI Office, the Advisory Board, and the providers of such models. That’s where a lot of the real action will take place.
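
Returning to the 10^25 FLOP threshold mentioned above: the Act does not say how training compute should be counted. Assuming the common back-of-the-envelope heuristic of roughly 6 × parameters × training tokens FLOPs for dense transformer training (my assumption, not anything the Act endorses), a rough self-check might look like this:

```python
# Rough back-of-the-envelope check against the Act's 10^25 FLOP default
# threshold for presumed "systemic risk." Assumes the common
# ~6 * parameters * training-tokens heuristic for dense transformer training
# compute; the Act does not specify a counting method, so this is illustrative only.

SYSTEMIC_RISK_THRESHOLD_FLOPS = 1e25

def estimated_training_flops(n_parameters: float, n_training_tokens: float) -> float:
    return 6 * n_parameters * n_training_tokens

def presumed_systemic_risk(n_parameters: float, n_training_tokens: float) -> bool:
    return estimated_training_flops(n_parameters, n_training_tokens) > SYSTEMIC_RISK_THRESHOLD_FLOPS

# Example: a hypothetical 70-billion-parameter model trained on 15 trillion tokens.
flops = estimated_training_flops(70e9, 15e12)                # ~6.3e24 FLOPs
print(f"{flops:.2e}", presumed_systemic_risk(70e9, 15e12))   # below the 1e25 default threshold
```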

You’re Still Here?

Congratulations, you have made it through your chosen adventure. As you can see, while the concessions granted to general purpose AI models relative to other models designed for “high-risk” uses are fairly wide-ranging, the exceptions for openish models are relatively slim in comparison, especially because of the requirement that openish models not be monetized in any way. The lighter regulatory load for “free and open source licensed” models is likely to be enjoyed only by researchers at universities and non-profits, who truly don’t monetize the models in any way, and to a lesser extent by individuals. Companies that want to utilize openish models as part of their business strategy are unlikely to gain any regulatory leeway by doing so. 



The New York Times Launches a Very Strong Case Against Microsoft and OpenAI

It seems that The New York Times Company (“The Times”) got fed up with the pace of its negotiations with Microsoft and OpenAI over their use of The Times’ content for training and running their LLMs. So much so that The Times filed a post-Christmas complaint against the two, likely knowing full well they’d lay waste to the winter vacations of hundreds of people working for OpenAI and Microsoft. It might be the most well-known AI-related case to date because the case isn’t a class action and the plaintiff is globally recognized.

The complaint alleges:

  • Copyright infringement against all defendants (related to handling of the datasets containing content from The Times, handling of models allegedly derivative of the datasets, and the ultimate output)
  • Vicarious copyright infringement (the idea that Microsoft and various OpenAI affiliates directed, controlled and profited from infringement committed by OpenAI OpCo LLC and OpenAI, LLC)
  • Contributory copyright infringement by all defendants (the idea that the defendants contribute to any infringement perpetrated by end users of the models)
  • DMCA Section 1202 violations by all defendants regarding removal of copyright management information from items in the datasets
  • Common law unfair competition by misappropriation by all defendants (related to training AI models on The Times’ content and offering AI services that reproduce The Times’ content in identical or substantially similar form (and without citing The Times or linking to the underlying content))
  • Trademark dilution by all defendants (arguing that the AIs dilute the quality associated with The Times’ trademarks by falsely claiming certain content originates from The Times)

Unlike other complaints, this one doesn’t spend too much time explaining how AI models work or teeing up the analogies The Times plans to use in court. Instead, the complaint includes multiple extremely clear-cut examples of the LLMs spitting out The Times’ content nearly verbatim or stating bald-faced lies about The Times’ content. Many of the other complaints admitted they weren’t able to find clear-cut examples of infringing output, nebulously resting their claims on the idea that all output is, by definition, infringing. Here, Microsoft and OpenAI haven’t just used The Times’ content to teach the AI how to communicate; they’ve launched news-specific services and features that ingest both archived content and brand new articles from The Times. The other plaintiffs also weren’t able to argue that their specific content, out of the trillions of pieces of training data in the datasets, was particularly important for creating quality AIs. Here, The Times convincingly argues that its content was extremely valuable for training the AIs, both because of the quantity involved and because the training process involved instructing the AI to prioritize The Times’ content.

This is probably the strongest AI-related complaint out there. I think a jury or judge angry at Microsoft and OpenAI for offering services that compete with and undercut The Times is more likely to also find that the training activities constituted copyright infringement and that the model itself is a derivative work of the training data, without thinking too hard about scenarios where the resulting model doesn’t supplant the business or livelihood of the copyright holders in the training data. It’s definitely a case where “bad facts invite bad law.”

This case is also notable for the fact that it explicitly goes after the defendants for their AIs’ hallucinations. An AI summarizing a news event based on one or more news articles opens a Pandora’s box of debate about the line between uncopyrightable facts and copyrightable expression, as well as how (or whether) those same standards should apply to a computer “reading” the news. But the hallucinations aren’t facts; they’re lies. And even if the defendants prevail in arguing that the AIs are mostly just providing people with unprotectable facts, there’s very little to shield them from liability for the lies, not only with respect to the trademark dilution claims, but also with respect to potential libel or privacy-related claims that might be brought by other individuals. Copyright law can forgive a certain amount of infringement under certain circumstances, but these other areas of law are far less flexible.

The other really interesting thing about this complaint is the extent to which it describes the business of The Times – how much work the journalists put in to create the articles, the physical risks they take during reporting, the value of good journalism in general, and The Times’ struggle to adjust to an online world. The complaint paints a picture of an honorable industry repeatedly pants-ed by a tech industry that historically has come to heel only under enormous public pressure, and of the Herculean efforts The Times has made to survive. It’s interesting because US copyright law decisively rejects the idea that copyright protection is due for what is commonly referred to as “sweat of the brow.” In other words, the fact that it takes great effort or resources to compile certain information (like a phonebook) doesn’t entitle that work to any copyright protection – others may use it freely. And where there is copyrightable expression, the difficulty in creating it is irrelevant. So, is all this background aimed solely at supporting the unfair competition claim? Is it a quiet way of asking the court to ignore the “sweat of the brow” precedent, to the extent that it’s ultimately argued by the defendants, in favor of protecting the more sympathetic party? Maybe they’re truly concerned that the courts no longer recognize the value of journalism and need a history lesson? No other AI-related complaint has worked so hard to justify the very existence, needs, and frustrations of its plaintiffs.

Unless Microsoft and OpenAI hustle to strike a deal with The Times, this is definitely going to be the case to watch over the next year or two. Not only does it embody some of the strongest legal arguments related to copyright, it is likely to become a lightning rod for many interests who will use it to wage proxy wars of their own. The case, and especially the media coverage of the case, will likely embitter the public and politicians even further against big tech, framing big tech’s success as a zero-sum game vis-à-vis journalists and creators more broadly. It’s the kind of case that ultimately results in federal legislation, either codifying a judgment or statutorily reversing it.