Data Revolts Break Out Against A.I.

admin July 15, 2023


For more than 20 years, Kit Lofstadt has written fan fiction that explores alternate universes for Star Wars heroes and Buffy the Vampire Slayer villains, and has shared her stories online for free.

But in May, Lofstadt stopped posting her work after learning that a data company had copied her stories and fed them into the artificial intelligence technology underlying ChatGPT, the viral chatbot. Dismayed, she hid her writing behind a locked account.

Lofstadt also helped organize a rebellion against AI systems last month. Along with dozens of other fan fiction writers, she published a flood of irreverent stories online to overwhelm and confuse the data-gathering services that feed writers' work into AI technology.

“Each of us has to do whatever we can to show them that the fruits of our creativity cannot be harvested at the whim of machines,” said Lofstadt, a 42-year-old voice actor from South Yorkshire, England.

Fan fiction writers are just one group now waging a rebellion against AI systems as a frenzy over the technology sweeps through Silicon Valley and the world. In recent months, social media companies such as Reddit and Twitter, news outlets such as The New York Times and NBC News, authors such as Paul Tremblay and the actress Sarah Silverman have all taken a stand against the siphoning of their data by AI.

Their protests take many forms. Writers and artists are locking their files to protect their work or boycotting certain websites that publish AI-generated content, while companies like Reddit want to charge for access to their data. At least 10 lawsuits have been filed this year against AI companies, accusing them of training their systems on artists' creative work without consent. Last week, Silverman and the authors Christopher Golden and Richard Kadrey sued OpenAI, the maker of ChatGPT, and others over AI's use of copyrighted material.

At the heart of the rebellion is a new understanding that online information (stories, artwork, news articles, message board posts, photos) can have significant untapped value.

A new wave of AI, known as “generative AI” for the text, images, and other content it generates, is built on complex systems such as large language models, which can produce human-like prose. These models are trained on vast accumulations of all sorts of data, so they can answer people's questions, mimic writing styles, and churn out comedy and poetry.

This has set off a hunt by tech companies for even more data to feed their AI systems. Google, Meta, and OpenAI have essentially used information from across the internet, including large databases of fan fiction, troves of news articles, and collections of books, much of which was freely available online. In tech industry parlance, this is known as “scraping” the internet.

GPT-3, an AI system that OpenAI released in 2020, spans 500 billion “tokens,” each representing parts of words found mostly online. Some AI models span more than 1 trillion tokens.

Internet scraping has been around for years and was largely disclosed by the companies and nonprofits that did it. However, it was not well understood or considered particularly problematic by the companies that owned the data. That changed after ChatGPT debuted in November and the public learned more about the underlying AI models that power chatbots.

“What is happening here is a fundamental recalibration of the value of data,” said Brandon Duderstadt, founder and CEO of AI company Nomic. “Previously, the thought was that you got value out of data by making it open to everyone and running ads. Now, the thought is that you lock the data up.”

Data protests may have little effect in the long run. Deep-pocketed tech giants like Google and Microsoft already sit on mountains of proprietary information and have the resources to license more. But as the era of easy-to-scrape content comes to a close, smaller AI startups and nonprofits that hoped to compete with the big companies may not be able to obtain enough content to train their systems.

In a statement, OpenAI said ChatGPT was trained on “licensed content, publicly available content, and content created by human AI trainers.” “We respect the rights of creators and authors and look forward to continuing to work together to protect their interests,” the company added.

In a statement, Google said it is participating in discussions about how publishers can control their content in the future. The company said it “believes that everyone can benefit from a vibrant content ecosystem.” Microsoft did not respond to a request for comment.

The data revolts erupted last year after ChatGPT became a global phenomenon. In November, a group of programmers filed a class action lawsuit against Microsoft and OpenAI, claiming the companies had infringed their copyright after their code was used to train an AI-powered programming assistant.

Stock photo and video provider Getty Images sued Stability AI, an AI company that creates images from text descriptions, in January, accusing the company of using copyrighted photos to train its system.

Then, in June, the Los Angeles law firm Clarkson filed a 151-page class action lawsuit against OpenAI and Microsoft, describing how OpenAI had collected data from minors and claiming that web scraping violated copyright law and constituted “theft.” The firm filed a similar lawsuit against Google on Tuesday.

“The data rebellion we are seeing across the country is society's way of pushing back against the idea that big tech companies have the right to take any information from any source and make it their own,” said Ryan Clarkson, the founder of Clarkson.

Santa Clara University Law School professor Eric Goldman said the lawsuit’s allegations are broad and unlikely to be accepted in court. But the wave of lawsuits is just beginning, he said, with “second and third waves” coming to define the future of AI.

Big companies are also pushing back against AI scrapers. Reddit announced in April that it wanted to charge for access to its application programming interface (API), the method through which third parties can download and analyze the social network's vast database of person-to-person conversations.

Reddit CEO Steve Huffman said at the time that the company “doesn’t have to give all of its value to the world’s biggest companies for free.”

That same month, Stack Overflow, a question-and-answer site for computer programmers, also announced it would require AI companies to pay for its data. The site has approximately 60 million questions and answers. The move was previously reported by Wired.

News outlets are also resisting AI systems. In an internal memo on the use of generative AI in June, The Times said AI companies should “respect intellectual property.” A Times spokeswoman declined to provide details.

For individual artists and writers, fighting AI systems means rethinking where they publish.

Nicholas Cole, a 35-year-old illustrator from Vancouver, British Columbia, was alarmed by how his unique art style could be replicated by an AI system, and he suspected the technology had scraped his work. He intends to continue posting his work on Instagram, Twitter, and other social media sites to attract clients, but he has stopped posting on sites such as ArtStation, which publish AI-generated content alongside human-generated content.

“It feels like wanton theft from me and other artists,” Cole said. “It puts a pit of existential dread in my stomach.”

At the Archive of Our Own, a fanfiction database of more than 11 million stories, authors are increasing pressure on the site to ban data scraping and AI-generated stories.

In May, dozens of writers were up in arms after some Twitter accounts shared examples of ChatGPT mimicking the style of popular fan fiction posted on the Archive of Our Own. They blocked their stories and wrote subversive content to mislead AI scrapers. They also pressured the leaders of the Archive of Our Own to stop allowing AI-generated content.

Betsy Rosenblatt, who provides legal advice to the Archive of Our Own and is a professor of law at the University of Tulsa, said the site has a policy of “maximum inclusiveness” and did not want to be in the position of determining which stories were written with AI.

For Lofstadt, the fan fiction writer, the battle with AI came about while she was writing a story about Horizon Zero Dawn, a video game in which humans battle AI-powered robots in a post-apocalyptic world. In the game, she said, some of the robots were good and others were bad.

But in the real world, she says, “they’re twisted into doing bad things, thanks to hubris and corporate greed.”