
Apple Faces Lawsuit over Alleged AI Training with Pirated Books

The complaint alleges that Apple never attempted to pay authors for their intellectual property, despite using it for what they describe as “a potentially lucrative venture.”

Sep 9, 2025

When two prominent authors decided to take on one of the world’s most valuable companies, they probably knew they were in for a fight. By their own account, they found more than they bargained for. What Grady Hendrix and Jennifer Roberson discovered about Apple’s approach to AI development has sparked a legal battle that could reshape how tech giants source their training data.

The federal lawsuit, filed in Northern California, centers on a stark allegation: Apple systematically used pirated versions of copyrighted books to train its AI systems without asking permission or offering compensation. The authors say Apple never attempted to pay for their intellectual property, despite using it for what they describe as “a potentially lucrative venture.”

This is not just two writers versus Big Tech. It is a challenge to how the AI industry has been assembling its most valuable assets: the massive datasets that power everything from Apple Intelligence to competing systems across Silicon Valley.

The shadow library controversy: How Apple allegedly sourced training data

According to the complaint, Apple’s Applebot can access shadow libraries containing vast numbers of pirated, copyrighted books that were never licensed for use. These shadow libraries are not obscure corners of the web; they are sprawling repositories holding millions of works that authors and publishers never authorized for AI training.

The authors claim that Apple scraped data with Applebot for nearly nine years before disclosing any plan to use that scraped material for AI training. In short, this looks less like a sudden pivot to AI and more like a long game.
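There is, at least going forward, a documented opt-out lever. Apple has said that web publishers can disallow the separate “Applebot-Extended” user agent in robots.txt to keep their content out of AI training while still being indexed for search by the regular Applebot. As a minimal sketch, using only Python’s standard library and a hypothetical example.com placeholder (not any site named in the lawsuit), here is how a publisher could check what a site’s published rules currently allow:

```python
# Minimal sketch: check whether a site's robots.txt blocks Apple's crawlers.
# "Applebot" handles search indexing; "Applebot-Extended" is the agent Apple
# documents for opting content out of AI training. example.com is a
# hypothetical placeholder, not a site from the lawsuit.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://example.com/robots.txt")
parser.read()  # fetch and parse the live robots.txt file

for agent in ("Applebot", "Applebot-Extended"):
    allowed = parser.can_fetch(agent, "https://example.com/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Of course, a robots.txt directive only governs future, well-behaved crawls. It says nothing about content already scraped, or content obtained secondhand through shadow libraries, which is precisely the gap the complaint targets.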

The complaint goes further. It alleges this was not an accident. The authors argue that Apple intentionally evaded payment by using books already compiled in pirated datasets. Instead of licensing content or partnering with publishers, Apple allegedly tapped content that was already illegally distributed, a shortcut that enabled rapid dataset assembly while bypassing the messy and expensive process of securing proper licenses.

Apple Intelligence under scrutiny: The commercial stakes

The heart of this lawsuit is not only about copyright infringement; it is about how Apple allegedly used stolen content to build a core competitive advantage. Apple Intelligence is the company’s banner AI strategy, integrated across iPhones, MacBooks, iPads, and other devices, and it is the kind of feature set Apple is betting will define the next wave of personal computing.

The authors argue that Apple copied copyrighted works to train AI models whose outputs compete with, and dilute the market for, the very works that trained them and give Apple Intelligence its appeal. Imagine your novel helping train a system that can mimic your voice. Tough pill to swallow.

The lawsuit singles out Apple’s OpenELM large language model, described as a crucial part of the Apple Intelligence suite. OpenELM is not a lab toy; it is a foundation for Apple’s AI capabilities across its ecosystem, which makes the alleged theft central to Apple’s competitive posture.

Scale matters too. The authors emphasize that Apple, despite being one of the most profitable companies globally, chose not to offer compensation for their intellectual property. This was not a cash-strapped startup cutting corners; it was a company with vast resources allegedly building an AI future on stolen content rather than licensed partnerships.

Apple is not alone. AI companies increasingly face legal woes for using data and content without the required permissions, forcing a long-overdue reckoning over how these systems are built.

The financial stakes are no longer hypothetical. Anthropic recently agreed to pay $1.5 billion to settle a class action brought by authors who said the company stole their work to train its Claude chatbot. The deal, covering approximately 500,000 works at about $3,000 per work (which multiplies out to roughly the $1.5 billion total), has been described as the largest publicly reported copyright recovery in history, and it gives courts and companies a concrete yardstick for the value of systematically harvested creative works.

The pressure stretches across other giants. Microsoft was sued in June over the alleged use of books to train its Megatron model, and OpenAI has faced claims from The New York Times and other media organizations that say their journalism was used without permission.

The Anthropic settlement lands with extra weight. It establishes concrete financial liability for using pirated content, regardless of whether the training itself might later be defended as fair use. Legal experts have called it “industry guiding,” and one attorney labeled it “a landmark event, the first major settlement in a case against a generative AI company.” The message feels clear: the free ride is ending.

What this means for the future of AI development

The outcome of Apple’s case could determine whether the industry can keep its current approach to data sourcing. If the plaintiffs win class certification, many more authors may join, and a wave of similar claims could force Apple to change how it develops AI.

The tension is simple to state and hard to resolve. Tech companies say large scale data collection is needed to make models better, while authors and publishers argue that unauthorized use undermines their earnings and the health of creative fields.

The core assumption is under attack. The industry has long acted as if publicly available content could be freely harvested for machine learning. These cases argue otherwise: “publicly available” does not mean “free to use for commercial AI training,” especially when the material was illegally uploaded in the first place, which can create liability that a fair use defense does not reach.

If courts keep finding that companies must pay for copyrighted training data, either through upfront licensing or settlements after the fact, the economics of AI will shift. Companies may need to negotiate broad licenses with publishers, share revenue with authors, or rethink training pipelines so they are not built on copyrighted material.

Looking ahead, firms may be forced to rethink how they source training material and to make formal agreements with rights holders before development begins. The Anthropic deal suggests the cost of using copyrighted content without permission is catching up with the industry, with penalties large enough to reshape budgets and strategy. For Apple, which has spent years cultivating a premium and ethical brand, the allegations are not only a legal problem but a reputational one.

The stakes could not be higher, not just for Apple, but for an AI ecosystem built on the assumption that the internet was an all‑you‑can‑eat buffet of training data. As these fights play out, they will help decide whether the next generation of AI is built on stolen content or on legitimate partnerships with the creators who make artificial intelligence possible.
