Blog

Micropayments Would Benefit Everyone

This is the second part of a four-part series called “Protecting Digital Content in the AI Age: A Lawyer’s Guide.” Part 1, “AI and the Digital Content Provider’s Dilemma,” examined the effects of AI on digital content providers, including both a drop in web traffic and significantly higher demands on web infrastructure, and called for new business models like micropayments to maintain the open web. Part 3 will examine AI crawlers, robots.txt, the technology behind the web, and new tools for digital content owners. Part 4 will analyze US and EU laws related to text and data mining opt-outs and technologies that may help digital content owners sustain their businesses.

As discussed in Part 1, the open web is already shrinking in response to AI, with more digital content providers moving their content behind logins, removing it from the internet entirely, or attempting to block access to it via both legal and technical mechanisms. AI companies are losing access to data, while the rest of us are losing an open web. Bespoke deals between AI companies and certain content providers are just a bandaid. AI companies can only negotiate so many deals. If nothing changes, both the AIs and the internet as a whole will continue to lose “long tail” niche content, putting content providers out of business, and AI models will exhibit more bias because they will lack data from smaller cultural communities and non-English speaking countries. Micropayments – an automated, frictionless way for AI companies to pay for the content they use – are the right solution for content providers, AI companies, and society as a whole. 

A. Micropayments Can Quickly Force AI Companies to Pay for Content to Remain on Even Footing with Their Competitors

The single most effective way to pressure AI companies into paying for the content they use is an automated means by which content providers can offer free or cheaper licenses to non-profits, small businesses, companies that respect content providers’ preferences, or companies that release open source AI models (or some combination thereof, under whatever definitions content providers prefer), while signaling that they expect the megacorp AI players to pay. Such a regime completely changes the AI companies’ calculus. Today, the AI companies weigh the costs and benefits of certain licenses or the possibility of certain lawsuits, and they can incrementally choose how much risk they are willing to accept so long as all of their competitors are doing the same math. But a competitor who gets access to content for much less or for free poses a starker choice to the megacorp: pay for the same data or lose to entities whose AIs can provide higher-quality information. The question would become existential in a way that it is not today. 

When content providers refuse to privilege certain entities over others, whether out of principle or in search of maximum profits, they disarm themselves of the single most powerful tool they have to protect their businesses. In particular, entering into deals with AI companies while leaving non-profits and small businesses with no rights at all (including by blocking all entities from training on their material in robots.txt) only helps to secure an oligopsony of AI companies and makes the content providers more and more vulnerable over time. 

Many of the major AI companies view their mission as reaching artificial general intelligence or artificial superintelligence (human levels of intelligence or higher), and many in the industry, including Dario Amodei (Anthropic CEO), Demis Hassabis (Google DeepMind CEO), and Elon Musk, have explicitly stated or otherwise implied that they view such an achievement as a winner-take-all proposition. That is their fundamental argument for why the US should do everything in its power to beat China in the AI race. Content providers need to understand this dynamic in order to fully appreciate why price discrimination is such a powerful tool in this circumstance and why AI companies fear one another more than they fear billion-dollar lawsuits. 

B. Micropayments Keep Power in the Hands of Creators Instead of Intermediaries

Intermediaries can provide indispensable services (especially when content must be provided in physical form) and in some cases intermediaries can use their leverage to get individuals better deals, especially when they take the form of regulated non-profits, as in the EU. But, if AI companies continue to make bespoke deals with large content providers and intermediaries while ignoring everyone else, the natural result will be intermediary growth and then consolidation, giving them more leverage over time over the AI companies, content providers, and the rest of society. 

The companies that ignore or neglect the individual content creator in favor of feeding the intermediary invariably discover, again and again, that they’ve been throwing meat to a baby dragon, not a household pet. This is the story of how Amazon fed publishers (instead of authors) who then created their own e-book and audiobook platforms, Netflix fed movie studios (instead of writers and directors) who then pulled their catalogs to their own streaming services, YouTube fed multi-channel networks (instead of individuals) that were able to move the best content creators to their own video services, and Apple’s iTunes Store fed the music labels (instead of individual artists and even independent labels) who then backed Spotify and other streaming services that decimated iTunes. Each and every time, these companies eventually offered individuals more, but only after allowing the dragons to fly away with a limb or two. 

The flip side, of course, is that these intermediaries also amass more power over creators, especially the ones they already serve, who may or may not be able to renegotiate for a cut of the money. Such a scenario is already playing out in the Anthropic settlement, where to Judge Alsup’s surprise and dismay, a number of publishers are pulling chairs up to the table instead of authors because many authors have already signed away some or all of the rights to their infringed books. Consolidation means that just a few entities wield massive control not just over AI companies, but over what the rest of us have access to (especially in authoritarian countries), the price for such access, who can make a living making content, and on what terms. Publishers like HarperCollins already don’t blush when it comes to telling their authors they’ll be taking 50% of AI training revenues, take it or leave it. 

Intermediary consolidation is no less of a threat to content providers than AI. Elsevier is the wildest dream of every corporate intermediary, exploiting consolidation in scientific publishing to its most brutal, logical conclusion. In large part, their business model is to take people’s research for free, have others do peer review for free, and then sell the research back to the very same institutions that funded it at exorbitant rates. They require copyright assignment from authors. They bundle journal subscriptions such that institutions pay even for things they don’t want. Historically, they’ve lobbied against the open access movement, and more recently they’ve co-opted it, charging authors $2k-$10k to put their papers up on an open-access website. Their profit margins rival those of oil companies. 

Their high fees have forced smaller scientific publishers out of the market because libraries don’t have the budget for much beyond the Elsevier subscription, leaving authors and institutions who’d actually like to be compensated for their work with few, if any, other options. Authors now largely choose between supporting Elsevier and having their works widely read in well-respected journals, or publishing them freely online in total obscurity. Either way, compensation is off the table and scientists have to make money via institutional paychecks; they cannot work as individuals and make money from their articles instead. Consolidation doesn’t just change prices, it often forecloses individual autonomy and reinforces other exploitative systems.

C. Micropayments Will Help Modernize the Open Web and Make AIs Better

Companies receiving compensation from AI companies will have an incentive to make their websites AI-friendly, making changes that allow AI agents to quickly and accurately find information, make purchases, and select multimedia with appropriate use rights. That would potentially allow AI companies to segregate human-created works from AI-generated works to home in on high-quality data, lower the costs of inference involving retrieval-augmented generation (RAG), and cut the cost of cleansing training data. 

In particular, knowing that a certain portion of data is coming from trustworthy sources means AI companies working to avoid data poisoning can focus their efforts and enhance security. There has also been work on protocols like IndexNow that allow a website or content delivery network (CDN) to signal to a crawler whether the website has changed since the last time a crawler visited, potentially reducing unnecessary crawls by 53%, saving energy, and helping AIs prioritize fresh information. It’s not difficult to imagine a regime where AI companies pay AI-friendly sites more. At the same time, people using open source AIs, including researchers and non-profits, will also benefit from higher-quality information. 
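To give a sense of how lightweight this kind of change-signaling is, here is a sketch of an IndexNow-style ping in Python. The endpoint is IndexNow’s public one, but the site URL and key below are made-up placeholders; under the protocol, a real key must also be hosted as a text file on the site doing the pinging.

```python
import urllib.parse

# Placeholder values for illustration only.
changed_url = "https://example.com/blog/updated-post"
api_key = "0123456789abcdef"  # a real key would also live at https://example.com/0123456789abcdef.txt

# An IndexNow ping is a simple GET request meaning "this URL changed, re-crawl it."
# Crawlers that support the protocol can then skip blind, scheduled re-crawls.
ping = "https://api.indexnow.org/indexnow?" + urllib.parse.urlencode(
    {"url": changed_url, "key": api_key}
)
print(ping)
# A real notifier would now issue the request, e.g. urllib.request.urlopen(ping)
```

That one request, fired whenever a page changes, is the entire protocol; the savings come from crawlers trusting the signal instead of re-fetching every page on a schedule.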

D. Micropayments Will Ensure Healthy Competition and Technological Progress

Without micropayments, the competitive landscape in AI would stagnate. Startups trying to find their footing in this space simply can’t ink thousands of content deals before they sign a single customer. Payments to the content provider might be back-ended (“we’ll pay you at the end of the year”), but payments to lawyers and salespeople are not. 

In the absence of competition, the incumbent AI companies will do what oligopolies always do: stall disruptive innovation, pursue regulatory capture, keep prices high, exert outsized influence on companies and industries upstream and downstream of them, and proceed with the enshittification of their services, as Cory Doctorow puts it: slowly shifting value from the user to the AI company because the user has nowhere else to go (everybody wave at Comcast!). 

It’s nice to be an oligopolist, I imagine, but this position wouldn’t necessarily translate to more profits for the AI companies, because squelching innovation – both for their competitors and their users – means that at some point their share of the pie might grow, but the pie itself won’t. Satya Nadella’s goal of increasing worldwide GDP growth to 10% per year through AI is explicitly contingent on AI itself becoming a commodity, which isn’t possible with billions of dollars worth of barriers to entry. He’s manifestly correct on this front. The success of the Silicon Valley titans over the decades has hinged on technologies where the aggregate value of the technology to its users is many orders of magnitude higher than the amount that can be charged for it, and the right play isn’t charging more for it, it’s spreading it far and wide and finding a way to harvest some of that value downstream.1

A simple example of this is Apple apps. Apple could have decided that the only apps on its phones would be Apple apps, keeping 100% of the money from all software on its phones. Great, it would have a monopoly there. But it’s clear that Apple has made more money from third-party apps in the App Store than it ever would have from its own apps. Third-party apps had huge knock-on effects. Here’s a non-exhaustive list: the phone now had appeal both to consumers and businesses; other companies became more valuable because their products could be accessed on the go (so they hired more people, who needed more phones, who wrote more apps, who invested in new companies…); an entire industry of mobile-first apps like Uber and Shazam sprang up; and everyone online got slightly richer for the added security of two-factor authentication, enabled by third-party apps. All of this made the phone that much more valuable, as well as all the other products and services in the Apple ecosystem. By allowing its share of the app pie to shrink, Apple expanded the entire universe of potential Apple customers.

Stay Tuned!

Stay tuned for Part 3 of this series, which will do a deep dive into the technologies which can enable micropayments, and will provide a refresher of many fundamental web protocols that can be used and extended for AI.

  1.  This is the rationale for much of commercial open source and for free products like Gmail or Bing. Sometimes the company practices vertical integration and becomes a user itself in a particular domain to capture some of those downstream benefits. GitHub Copilot is an example.

AI and The Digital Content Provider’s Dilemma

This is the first part of a four-part series called “Protecting Digital Content in the AI Age: A Lawyer’s Guide.” Part 2 will discuss the advantages of automated payments for AI companies and the public. Part 3 will examine AI crawlers, robots.txt, the technology behind the web, and new tools for digital content owners. Part 4 will analyze US and EU laws related to text and data mining opt-outs and technologies that may help digital content owners sustain their businesses.

A. The Web That Could Have Been and the Web We Got

So often I find a writer I want to follow, but I don’t want to add another monthly subscription to my budget for a newspaper or magazine that I’m otherwise not interested in. Sometimes I read an incredibly influential piece that got millions of views, but the writer got no compensation at all because they don’t put out enough content to warrant a subscription. Every now and then I’ll stumble upon a hobbyist site clearly run by someone who drinks way too much coffee that fulfills a one-time need for very useful information, like one that dutifully cataloged specs for an untold number of computer monitors. What if it were as easy to compensate them as throwing coins into a street musician’s hat? Visit a site, automatically toss them some money.

The idea of web micropayments is an old one.[1] For techno-optimists, it sits in the same wistful part of the brain as federated social media platforms and ubiquitous mesh networks (and bringing Google Reader back from the dead, if we’re going to be honest). Micropayments never took off because credit card companies weren’t interested, and unlike the record companies and the movie studios, there wasn’t a forcing function to drag them kicking and screaming into the next millennium, like Napster or Netflix. There was simply nowhere else to go if someone wanted digital payments. For context, the first website was launched in 1991 and PayPal (for example) didn’t launch its services until 2000.

In the meantime, digital content providers increasingly relied on ad revenue, whether their content was copyrightable or not. Google Search launched in 1998, initially putting ads on the search results screen as part of Google AdWords in 2000. The web caused a near extinction event for a number of businesses, most notably local journalism.[2] The remaining digital content owners and Google settled into a happy symbiosis: content owners acquiesced[3] to their content being copied and indexed for listing on Google Search in exchange for Google sending viewers to their sites. In 2003 that symbiosis was enriched by Google AdSense and what would eventually become the Google Display Network, which allowed content owners to sign up for Google to serve ads on their sites for a share of the ad revenue. Google eventually drove so much business on the web that it pried content providers out from behind portals like Yahoo! and largely decimated the closed web. In 2007 the New York Times actually scrapped its subscription service because it realized it could get more money from ads alone, and it didn’t return to the concept for another four years.

Google Search and the ad-supported open web did not just move analog content online. Nearly every organization with a web presence became a content creator, and many began to offer deep libraries of free, expert content to lure visitors to their sites and build trust in their brands. For example, law firms like Quinn Emanuel publish several blog posts a day about recent cases and new laws; hobby retailers like Goulet Pens maintain a giant trove of articles and videos about their goods; and software security firms publish vital security-related info, like Dark Visitors’ widely cited list of all known bots on the internet. It’s not just that the world saw an explosion of content, it’s that the type of content being produced was new, too: previously it was either not available at all, limited to customers of the firm, or only published in a book at least a year after it would have been useful.

But the ad-based web has eaten itself. Websites are now optimized for engagement and views, not quality, so outrage and conspiracies have become the coin of the realm, and the engagement (addiction) strategies of social media algorithms make the casino playbooks look absolutely Stone Age. Hucksters have learned to jump their way up the Google Search results so effectively that nearly every search, no matter how specific, yields links of the “10 very basic facts about the domain you were asking about” variety – quotation marks in your search be damned! Things have gotten so bad that even the Wall Street Journal noticed that people were adding the word “reddit” to their searches back in 2023 so they could find genuine information, because Google now struggles to surface things like personal blogs and other non-commercial material.

B. The Rise of AI and the Decline of Human Traffic and Ad Revenue on the Web

A major factor in AI’s popularity is the decline of Google Search. Casual AI users appreciate getting a quick, straight answer without having to wade through the muck of Google Search results and people doing a deeper dive are rejoicing because some of the models are capable of finding relevant sources that Google Search simply can’t anymore. Unfortunately, most queries are of the casual variety and the AI answers often eliminate the need to go visit the sources, even if they’re listed. That translates into less ad revenue for site owners who host content and fewer transactions for e-commerce sites.

Data from SimilarWeb shows that 44 out of the top 50 news websites in the US saw declining traffic in the last year. Search referrals to top U.S. travel and tourism sites tumbled 20% year over year last month, 9% for e-commerce, and 17% for news and media sites. Cloudflare’s data indicates that with OpenAI, it’s 750 times more difficult to get traffic (referrals) than it was with the Google of old (before its answer box and its AI overviews) and with Anthropic, it’s 30,000 times more difficult.[4]

Perplexity AI has become a recurring character in the AI v. digital content provider saga. While the practices of many AI companies have been the subject of a healthy debate, some of Perplexity’s behavior constitutes incontrovertible AI-powered plagiarism. Perplexity is valued at about $20 billion and primarily runs a search engine that summarizes its recommendations. During the summer of 2024, they were excoriated for publishing an AI-generated article that essentially recycled investigative journalism from Forbes, lifting sections of the Forbes article and even reproducing some of its images, without even mentioning Forbes or the authors of the original piece. They even created an AI-generated podcast related to the piece that irritated Forbes to no end because it outranked all other Forbes content in a Google search for the topic.

That same summer, WIRED used all the firepower it could muster in calling out Perplexity for output claiming that WIRED reported something it didn’t, output closely paraphrasing WIRED articles without attributing them, and reproduction of WIRED photographs with attribution shown only if the image is clicked – all while ignoring WIRED’s robots.txt. WIRED published an article entitled “Perplexity Is a Bullshit Machine,” among others. That article was shortly followed by another, entitled “Perplexity Plagiarized Our Story About How Perplexity Is a Bullshit Machine.”

In a September 2025 interview with Stratechery, Cloudflare’s CEO (another recurring character discussed further in Part 3) described similar problematic behavior:

“[I]f they’re blocked from getting the content of an article, they’ll actually, they’ll query against services like Trade Desk, which is an ad serving service and Trade Desk will provide them the headline of the article and they’ll provide them a rough description of what the article is about. They will take those two things and they will then make up the content of the article and publish it as if it was fact…”

Cloudflare also exposed that Perplexity crawlers were bypassing website permissions by disguising their identity. Perplexity really seems to have gotten Cloudflare’s goat and driven some of Cloudflare’s innovation, or at the very least, their rhetoric. Cloudflare and Perplexity have since engaged in several public brawls that have focused discourse on AI crawling, all while avoiding punching the actual AI heavyweights in the room. 

It may be tempting to argue that declining viewership is caused by just a few bad apples rather than the fundamental nature of the technology. After all, some of those referral numbers are significantly better than others and historically, there have been many cycles of technological disruption, copyright holder backlash, and then gradual realization by copyright holders that the technology can actually improve their businesses. However, even when AI companies appear to be following all the best practices and creating output that properly references sources, websites are still losing traffic. Google’s AI Overviews (powered by Gemini), which follows a lot of good practices and has the strongest incentive to send people to actual webpages because they contain Google ads, appears to be choking off web traffic by about 40%.

C. AI Crawlers Are Overwhelming Web Infrastructure and Driving Up the Costs of Maintaining a Website

Apart from the question of data governance, AI crawlers are taxing site infrastructure, significantly degrading site access for other users, and regularly taking down sites altogether. An estimated 30% to 39% of global web traffic now comes from bots. Unlike older search crawlers, AI crawlers ignore website permissions (robots.txt), crawl delay instructions, and bandwidth-saving guidelines, causing traffic spikes that can be 10-20x the normal level[5] within just a few minutes. Many sysadmins report that crawlers are running random user agents from tens of thousands of residential IP addresses, each one making just one HTTP request and therefore masquerading as a normal user. This activity amounts to a distributed denial-of-service (DDoS) attack, which can’t be thwarted by mechanisms like IP blocking, device fingerprinting, or even CAPTCHAs.[6] The crawler activity affects not just the websites being crawled, but other websites as well if they’re on a shared server.[7] Even the largest sites experience performance issues when crawled by AI-related crawlers.
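For context, the permissions being ignored are plain-text, purely advisory directives. A site hoping to limit AI crawling might serve a robots.txt along these lines (the paths are illustrative; GPTBot is OpenAI’s published crawler name, and Crawl-delay is a non-standard extension that many crawlers never honored even before AI):

```
# https://example.com/robots.txt (illustrative)
User-agent: GPTBot        # OpenAI's training crawler
Disallow: /               # please do not crawl anything

User-agent: *             # all other crawlers
Crawl-delay: 10           # non-standard: wait 10 seconds between requests
Disallow: /private/       # keep out of this directory
```

Nothing enforces any of this; compliance is entirely voluntary, which is precisely why the crawler behavior described above is so hard to stop.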

Sites hosting open source software seem to be particularly juicy targets. Anthropic was in the headlines last summer for visiting certain sites more than a million times a day.  A GNOME sysadmin estimated that in a 2.5 hour sample, 97% of attempted visitors were crawlers. Both Fedora and LWN, a Linux/FOSS news site, have reported that only a small portion of their traffic now consists of humans and that they’re struggling just to keep their sites up – Fedora has been down for weeks at a time. It does not appear to be the case that these are examples of crawler bugs – some report a regular pattern of being scraped every six hours.[8]

Other kinds of websites are also being attacked. An April 2025 survey by the Confederation of Open Access Repositories, representing libraries, universities, research institutions, and similar organizations around the world, indicated that 80% of surveyed members had encountered service disruptions as a result of aggressive bots; a staggering one in five reported a service outage that lasted several days. Wikimedia has seen a 50% increase in multimedia downloads on Wikipedia. The explosive growth in the number of crawlers and the scale of their activities drove Wikimedia to start creating structured Wikipedia datasets for AI companies to download, just to keep them off the live site. Even small, niche sites have been under distress. A website hosting pictures of board games reported that it was crawled 50,000 times by OpenAI’s crawler in a single month, drawing 30 terabytes of bandwidth.

The kicker? Websites with slow loading pages are ranked lower in Google Search results!

The amount of pressure even the more scrupulous AI companies are putting on site infrastructure is vastly disproportionate to the amount of human traffic they send to those same sites.

One group of academics, policy-makers and advocates has suggested that the digital commons is currently subsidizing AI development by bearing these additional infrastructure costs and involuntarily contributing to the environmental footprint associated with AI. Indeed, although the EU AI Act requires companies to disclose energy used in training and inference, they are not required to disclose an estimate of the energy used by third parties in responding to their crawlers or the energy used to block them.[9]

D. Conclusion

The question of paying content providers is fundamentally about preserving the open web, not necessarily punishing AI companies for doing something wrong. I might even be persuaded that certain activities or services are fully within the bounds of the law. But even perfectly legal, well-intentioned activities can create negative externalities that should nevertheless be addressed. The current state of affairs is not sustainable. People won’t keep posting freely available content, at increasing expense, just to sate the AIs if users don’t even associate them with their work, never mind compensation. They’ll move their content behind paywalls, join walled gardens, or simply stop creating content. Individuals will be further isolated in their information bubbles. Per a recent study, “Consent in Crisis: The Rapid Decline of the AI Data Commons,” 20-33% of all tokens from the highest-quality and most frequently used websites for training data became restricted in 2024, up from 3% the previous year. Another compensation model is needed, and it looks like the technology to power it might be right around the corner.


[1] The 1997 HTTP spec optimistically reserved a response status code for micropayments: 402 Payment Required.

[2] https://en.wikipedia.org/wiki/Decline_of_newspapers

[3] Not without throwing a legal tantrum first, of course. Google faced numerous lawsuits related to indexing websites as well as creating thumbnail images for image search. See Perfect 10, Inc. v. Amazon.com, Inc. & Google Inc. (2007) and Field v. Google, Inc. (2006).

[4] Cloudflare has made a lot of data available about the activity of AI crawlers. An explanation of their metrics and links to live dashboards are here.

[5] https://www.theregister.com/2025/08/29/ai_web_crawlers_are_destroying/

[6] Headless browsers, discussed in Part 3, allow AIs to interact with websites like a human would.

[7] https://www.inmotionhosting.com/blog/ai-crawlers-slowing-down-your-website/

[8] https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/

[9] See the Model Documentation Form published alongside the Transparency chapter of the EU AI Act Code of Practice.

The Enforceability of AI Training Opt-Outs

Creative Commons and the NYU Stern Fubon Center for Technology, Business and Innovation recently hosted a workshop in NYC, inviting participants with expertise in IP and various “open” movements and communities to give feedback on their AI-related proposals. This article was prompted by my participation in that workshop.

Creative Commons has been working on creating a set of “preference signals” for copyright holders to indicate how they would like their works to be treated by AI developers considering using their works for AI training. Currently, these preference signals are meant to be applied at the data set level, not to each individual work.1 Creative Commons has said that it is not treating these preference signals as legally enforceable at the moment, presumably because it believes that using copyrighted works to train AIs is likely to be considered “fair use” under US copyright law. Where use of a copyrighted work is deemed a “fair use,” a license attempting to prevent or limit such use is unenforceable.2 Wikimedia, the largest and most famous licensor to employ Creative Commons licenses, agrees that the fair use defense is likely to prevail.3

 I think this approach is premature. 

EU Copyright Law Rules the World

EU AI Act Brings EU Copyright Law to the World

Many jurisdictions do not have a concept of “fair use,” but instead have statutory exemptions from copyright liability. In the EU, the Directive on Copyright and Related Rights in the Digital Single Market (the “CDSM Directive”), allows commercial model developers4 to copy and extract content from copyrighted works for purposes of text and data mining (TDM) provided that they are lawfully accessible and that the model developer abides by copyright holder opt-outs. The EU AI Act’s Article 53(1)(c) takes the unusual step of importing EU copyright law and the obligations in the CDSM into the EU AI Act and applying them to all general-purpose AI model providers subject to the EU AI Act, even if they would not otherwise be subject to the CDSM or European copyright law. That means that model developers still have to abide by EU AI Act training opt-outs, even if AI training is protected by fair use in the US or elsewhere. 

1. Providers of general-purpose AI models shall:

(c) put in place a policy to comply with Union law on copyright and related rights, and in particular to identify and comply with, including through state-of-the-art technologies, a reservation of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790;

The EU AI Act’s scope is surprisingly broad in three ways. First, Recital 106 states that all AI developers subject to the EU AI Act must follow EU copyright law and respect opt-outs, even if they conduct model training outside the EU.5 This is unusual because generally the copyright laws applicable to any copyright-related acts are the laws of the jurisdiction where the acts are committed. Here, the EU specifically did not want to see the sale of models in the EU that would have been illegal to train in the EU. Second, it’s actually not clear whether the intention is for model providers to respect opt-outs just for works governed by EU copyright law, or for all works from all over the world.6 The language is pretty ambiguous on this front. Even if this language only applies to works subject to EU copyright law, though, it would be impossible to identify such works on a mass scale with any degree of certainty.7 Therefore, in practice, companies will abide by opt-outs broadly to the extent standards for expressing them emerge.8 

Third, the scope of entities to whom the EU AI Act applies is broader than even the scope of Europe’s main privacy law, the GDPR. The EU AI Act’s scope is not limited to companies operating in, or selling products or services in the EU from third countries. The scope actually extends to any model provider whose output “is used in the Union.”9 Potentially, that means that a book with AI-generated images created in the US and sold in the EU is within the scope of the EU AI Act, and the model developers must then comply with EU copyright law. In other words, it’s almost impossible to escape EU copyright law with any certainty since model providers have limited control over their users and users might have limited control over where their outputs end up.10  

Creative Commons’ Protocol Proposal

The upshot for Creative Commons is that the TDM opt-out structure can be a vehicle for making preference signals legally enforceable against the vast majority of commercial AI model providers worldwide. The latest draft of the EU AI Act’s “General-Purpose AI Code of Practice, Copyright Section” specifies in Measure I.2.3 that model providers should follow the robots.txt protocol specified by the Internet Engineering Task Force and “make best efforts to identify and comply with other appropriate machine-readable protocols…” While some of the protocol proposals are binary (“ok to train” v. “don’t train”), a number of organizations have put forward proposals that include additional licensing terms or permissions. Which protocols will be legally accepted seems to depend on which ones achieve popular adoption and public recognition. In practice, if a few major organizations like Common Crawl and EleutherAI get on board, that’s likely to be sufficient. Creative Commons’ stature certainly positions it well for meeting these criteria.
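To make the difference between binary and more expressive protocols concrete, here is a minimal sketch of the two approaches. The robots.txt directives use crawler user-agent names (GPTBot, CCBot) that some AI companies have published; the TDM Reservation Protocol (TDMRep) lines follow the W3C community group draft, and the policy URL shown is illustrative, not a real endpoint:

```
# robots.txt — binary signal: allow or disallow named AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /private/
Allow: /

# TDMRep (W3C draft) — a more expressive approach, typically sent as
# HTTP response headers: reserve TDM rights and point to
# machine-readable license terms (policy URL is illustrative):
#   tdm-reservation: 1
#   tdm-policy: https://example.com/tdm-policy.json
```

A binary signal can only grant or withhold permission, while a pointer to a policy document is what would let a content provider attach licensing terms, such as the free-versus-paid tiers discussed above.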

Enforcement of TDM Opt-Outs

Should CC preference signals become legally recognized in the EU, the applicable enforcement mechanisms will look different from those applicable to CC licenses. EU authors could bring copyright infringement claims against non-compliant companies that conduct training in the EU, but probably not against those that conduct training outside the EU. Such plaintiffs would need to look to the EU AI Act instead. The Act cannot be enforced through a private right of action, but a complaint could be filed with the relevant national regulatory agency for investigation. Corporate AI customers could potentially terminate their agreements and sue for breach of contract if an AI provider doesn’t respect CC preference signals, since most contracts require the vendor to comply with all applicable laws. In the meantime, such customers can specifically require compliance with CC preference signals in their contracts, and they can also make that a formal procurement requirement when selecting vendors in the first place. Since the EU AI Act carries hefty, GDPR-style penalties,11 the absence of a private right of action is unlikely to keep companies from complying with the Act.

Fair Use in the US is Not a Foregone Conclusion

There are a lot of excellent papers by IP experts making well-reasoned arguments for a finding of fair use with respect to AI training. But it’s important to remember that these papers are meant to persuade individual judges about how they should rule; they are not nationwide forecasts of judicial rulings. Courts are not always perfectly logical, and many struggle to understand the technology they are asked to rule on. Think of the felony convictions doled out under the Computer Fraud and Abuse Act for security research and mere terms of service violations. Bad facts can lead to bad law: where the defendant is reprehensible in the eyes of the court, the court is inclined to find a way to rule against them in the interest of fairness, without regard for the good-faith defendants that might follow later. Our system of law lurches forward slowly and unevenly, yielding only certain legal insights about certain types of technology in certain jurisdictions over many years. Keep in mind that it took over a decade to resolve the relatively straightforward question of whether copying APIs is copyright infringement in just a single dispute (Google v. Oracle). Even iron-clad logic is no guarantee of any specific legal outcome, not even on a very long timeline.

Unpredictable Application

In the US, fair use is not an exception to copyright law; it’s a defense against copyright infringement that involves arguing that a complicated set of very fact-specific factors is favorable to the defendant. So even though a court may hold that the defense is valid in one case, there is no guarantee it will be valid in similar cases. Courts regularly make surprising or novel distinctions between similar cases, particularly where the underlying facts paint the defendant in a negative light.

One need look no further than the Supreme Court’s acceptance of the fair use defense with respect to the VCR (Sony Corp. of America v. Universal City Studios, Inc.)12 and its subsequent rejection of it with respect to peer-to-peer (P2P) networks (MGM Studios, Inc. v. Grokster)13 to see such distinctions. In both cases, the underlying technology could facilitate non-infringing copying and distribution of copyrighted works: VCRs enabled time-shifting, allowing people to view shows at a more convenient time, and P2P networks were commonly used by universities for the internal exchange of research and for distributed computation projects like Folding@home, which used P2P networks to simulate protein folding. In both cases, the technology could also be used in an infringing manner, and the purveyors of the technology publicly advertised uses that would clearly constitute copyright infringement: creating a personal home library of shows and movies on VHS tapes in the case of Sony, and downloading copyrighted music in the case of Grokster. On their face, the cases presented similar facts, and many IP experts predicted a win for Grokster. But, undoubtedly, Sony’s well-known and highly respected brand, combined with the justices’ own use of VCRs, swayed them in one direction, while Grokster’s motley crew of anarchists and its association with the “dark web” swayed them in another.

Inability to Make Blanket Fair Use Rules

With respect to AI in particular, distinctions may be drawn between different types of AI models (e.g., generative v. predictive models), different modalities (e.g., images v. text), the various domains where the models are used, and the purpose of the use. The Copyright Alliance gets this point right: “unauthorized use of copyrighted material to train AI systems cannot be handwaved by a broad fair use exception… Neither the Copyright Act nor case law… would support such a broad fair use exception for AI.” Each copyright infringement claim must be evaluated in the context of the model’s intended use case and whether it is, in practice, offering substitutes in the market for the kinds of works that comprise its training data.14

Likelihood of Inconsistency Between Circuits

The current crop of AI cases can easily be distinguished from one another, should a court wish to, because of the diversity of the plaintiffs (some dramatically more sympathetic than others) and the modalities involved (as well as many other factors). The cases are spread among various circuits. The specific issues the parties might choose to appeal to a U.S. Court of Appeals or to the Supreme Court, and the postures of the cases when they arrive, are unpredictable. The US is likely to have a patchwork of AI-related precedent throughout the various circuits that does not gel into a cohesive, consistent whole for many years to come (if ever), in the same way that the fair use doctrine itself took many years to come together. AI companies may end up with guidance on code-generating AIs in only one circuit, national guidance on training predictive models specifically on training data behind a paywall, and a single district court opinion in another circuit on image generation that expresses a lot of outrage over output but fails to specifically address just the training step.

Arbitrary Rulings

It’s also possible for a court to throw a complete curveball, as in Thomson Reuters Enterprise Centre GmbH v. Ross Intelligence Inc. In that case, Ross was accused of copyright infringement for using Thomson Reuters’ case summaries to train its AI-powered case search engine, which suggested the names of cases when queried with specific legal questions. The judge inexplicably rejected Ross’s fair use defense because all the fair use cases raised by Ross related to the copying of code rather than text. This reasoning, of course, ignores major fair use cases that don’t relate to the copying of code (which were referred to in the cases that Ross cited), including those related to Google’s mass scanning of books to enable search within books, as well as Amazon’s and Google’s copying of images from all over the web to enable image search. Here, the judge distinguished the case before him from precedent by simply ignoring much of the precedent.

Conclusion

Given the uncertainty about where the law might go on fair use and AI training, a presumption that CC preference signals are unenforceable is premature. Declaring the signals unenforceable right out of the gate robs legal counsel of any gravitas they might bring to a compliance request. Companies don’t spend money complying with voluntary frameworks unless and until they get (or avoid) something tangible in return, and in this case, those benefits can’t manifest until there is sufficient adoption of the signals. Even in the world of open source software, where the benefits of the software are very tangible and the licenses are clearly enforceable, a huge portion of companies still don’t put in the time and effort necessary to do compliance at that scale. It would be much more effective to begin with the notion that the signals are enforceable, particularly in the EU, to drive adoption and compliance. Even if they turn out not to be enforceable in some jurisdictions decades from now, by then they may continue to function on the basis of norms.

  1. A separate set of “signals” might be developed later for individual works. ↩︎
  2. Creative Commons is particularly attuned to this concept. Most of their licenses specifically include language like this: “For the avoidance of doubt, where Exceptions and Limitations apply to Your use, this Public License does not apply, and You do not need to comply with its terms and conditions.” ↩︎
  3. In fact, it has proactively put together a dataset of Wikipedia content for AI developers to use, in no small part to ease the burden of crawlers on its infrastructure. ↩︎
  4. Article 3 also provides an exception for research organizations and cultural heritage institutions which carry out TDM for the purposes of scientific research. ↩︎
  5. Recital 106: “Any provider placing a general-purpose AI model on the Union market should comply with this obligation, regardless of the jurisdiction in which the copyright-relevant acts underpinning the training of those general-purpose AI models take place.” ↩︎
  6. Lawyers at Freshfields also raise this possibility. ↩︎
  7. The EU and the US do not require copyright registration for a copyright to be validly held, so there is no complete source, even at the national level, tracking who has what copyrights in what jurisdictions. Many copyrighted materials do not come with sufficient information to determine the author(s). Even if there is a name, without some sort of identification number, that name might match multiple people in multiple countries. Certain materials may have multiple authors from several different countries, and it may not be clear exactly which copyright laws might apply to any given portion. Materials emanating from organizations are even more complicated because they might be coming from affiliates worldwide, and copyright ownership is governed by private corporate family agreements. In some cases, copyright ownership may actually sit with a contractor, customer, or partner despite public statements to the contrary, due to private copyright assignments (or lack thereof). Suffice it to say, this is a complex determination that is resource intensive and prone to error because much of the relevant information is not publicly available. ↩︎
  8. Companies are also likely to respect these opt-outs even in the US because at the very least, they signal risk of litigation and they might be relevant for a fair use analysis in court. ↩︎
  9. Article 2(1)(c) of the EU AI Act. ↩︎
  10. That is certainly the case for any current model provider large enough to actually threaten the data commons in a meaningful way and therefore be a subject of interest in Creative Commons’ application, and possible enforcement, of preference signals. It’s worth noting that there might be challenges to the vast scope of this law precisely because of its attempt at extraterritorial application of copyright law in this manner. But, that’s mere speculation on a matter that may not be decided for many years to come. In the meantime, AI companies are likely to attempt compliance. The third draft of the “General Purpose AI Code of Practice,” further specifying requirements in the EU AI Act, does not give any additional insight into the matter. ↩︎
  11. The GDPR has a right of private action, but damages are limited to actual damages suffered by the plaintiff(s). Because litigation is so expensive, it is exceedingly rare for such litigation to be worthwhile for individuals. In practice, these cases are brought as class actions. Nevertheless, the vast majority of enforcement action is via data protection authorities and not litigation. ↩︎
  12. Sony Corp. of America v. Universal City Studios, Inc., 464 U.S. 417 (1984). ↩︎
  13. MGM Studios, Inc. v. Grokster, Ltd., 545 U.S. 913 (2005). ↩︎
  14. See my article, “Battle of the AI Analogies” for a lengthy discussion of the various facts that might make a fair use argument more or less likely to succeed. ↩︎

PLI CLE: Open Source and Machine Learning – Practical Licensing Issues

Heather Meeker and I co-chaired this year’s OSS program for PLI once again: “Open Source Software 2024 – From Compliance to Cooperation.” Check it out for a smattering of open source-related topics, both beginner and expert. I also presented a new segment with Yusuf Safari, “Open Source and Artificial Intelligence – Practical Licensing Issues” (aka “Open Source and Machine Learning – Practical Licensing Issues”). It covers the AI licensing landscape, “ethical” licenses, the data licensing landscape, the EU AI Act’s treatment of “open” AI, and AI corporate policies. Slides for the presentation along with speaker notes are available here.

Keynote at the International Conference on Learning Representations (ICLR)

With ICLR’s two principal organizers, Been Kim (Research Scientist at DeepMind) and Yisong Yue (Machine Learning Professor at Caltech), at the Organizing Committee reception.

This past spring I was invited to present a keynote at the 12th International Conference on Learning Representations (ICLR) in Vienna: “Copyright Fundamentals for AI Researchers.” If you were not one of the lucky 2,000 people in the audience – not to worry – ICLR has posted a video of my talk and the slides here. The keynote explored the current state of copyright law with respect to AI in the U.S., potential claims and defenses, as well as practical tips for minimizing legal risk.