Novelists Sue OpenAI for Copyright Infringement Over Books Used as Training Data

admin June 30, 2023


A number of visual artists are suing over the use of their images as training data for text-to-image generators. Now, two well-known novelists have filed a class action lawsuit against OpenAI, accusing the company whose models power ChatGPT and Bing Chat of copyright infringement for using their books as training data. This appears to be the first lawsuit filed over the use of text (rather than images or code) as training data.

In the lawsuit, filed in the United States District Court for the Northern District of California, plaintiffs Paul Tremblay and Mona Awad allege that OpenAI and its subsidiaries have committed copyright infringement, violated the Digital Millennium Copyright Act, and violated California and common law restrictions on unfair competition. The authors are represented by the Joseph Saveri Law Firm and Matthew Butterick, the same team behind the recent lawsuits filed against Stability AI and GitHub (over GitHub Copilot).

According to the complaint, Tremblay’s novel The Cabin at the End of the World and two of Awad’s novels, 13 Ways of Looking at a Fat Girl and Bunny, were used as training data for GPT-3.5 and GPT-4. OpenAI has not disclosed that copyrighted novels were included in its training data (which is kept confidential), but the plaintiffs conclude that they must have been: ChatGPT provided detailed synopses of the books and answered questions about them, which the plaintiffs say would require access to the full text.

The complaint argues that because OpenAI’s language models cannot function without the expressive information extracted from the plaintiffs’ works (and others) and retained within them, the models are themselves infringing derivative works, made without the plaintiffs’ permission and in violation of their exclusive rights under copyright law.

All three books also contain copyright management information (CMI), such as ISBNs and copyright registration numbers. The Digital Millennium Copyright Act (DMCA) makes it illegal to remove or alter CMI, and because ChatGPT’s output doesn’t include that information, the plaintiffs allege that OpenAI violates the DMCA in addition to committing ordinary copyright infringement.

There are only two plaintiffs in the lawsuit so far, but the attorneys are seeking class-action status so they can pursue damages on behalf of other authors whose works have been used by OpenAI. The attorneys are seeking monetary damages, legal costs, and an injunction to force OpenAI to change its software and business practices concerning copyrighted material.

We reached out to Mr. Butterick for comment on this lawsuit, and he referred us to his website, LLM Litigation, which details the plaintiffs’ position and grounds for action.

On the site, the attorneys say they have filed a class action lawsuit against OpenAI challenging ChatGPT and its underlying large language models, GPT-3.5 and GPT-4.

They also criticize the concept of generative AI, writing that “‘generative artificial intelligence’ is just human intelligence repackaged and disconnected from its creator.”

Like Saveri and Butterick’s lawsuit against Stability AI for using copyrighted images as training data, this lawsuit hinges on the belief that taking text from the open internet to power an LLM is not fair use. That is a question that has not yet been answered in court.

In a 2006 case, Field v. Google, a writer sued the search engine for caching his work and making the cached version available to searchers. However, a U.S. District Court dismissed the case on the grounds that Google’s caching of data was fair use. Judge Robert C. Jones wrote that keeping documents in cache was a transformative use (one of the four factors used to determine fair use) and did not harm the potential market for the work (another factor). Therefore, simply storing copyrighted data on its servers in the form of a cache did not make Google liable.

However, using copyrighted creations as training data is very different from indexing content for search. Some would argue that if an LLM could repeat key details in a book, it would harm the market for those works and not be truly transformative. On the other hand, if a human wrote the synopsis for the book, it would not normally violate copyright law. Ultimately, these issues will be decided by lawsuits like this one.

OpenAI is not the only company that uses copyrighted material for training and output. Google’s new search experience, SGE, often plagiarizes sentences and paragraphs verbatim from copyrighted articles. What happens in this lawsuit could have wider implications for the generative AI industry.