SAN FRANCISCO — ChatGPT maker OpenAI will pay to use Associated Press news stories to train its artificial intelligence algorithms, the first major deal of its kind amid a growing debate over whether tech companies should pay the creators of content they scrape from the web and use to build AI tools.
OpenAI will get access to the AP’s archive of text stories going back to 1985, according to a statement from the news organization. On top of licensing fees, the AP will also get access to OpenAI’s technology to experiment with ways it might improve its journalism.
The news organization has used automation to produce some local sports reporting and financial earnings reports for years. The AP does not use generative tech — chatbots such as ChatGPT — to write stories, according to the news organization.
OpenAI, Google and other AI companies have used billions of sentences pulled off the open internet to build the large language models that power their chatbots. News stories, Wikipedia articles, social media comments and blog posts have all gone into the models without permission from their owners, with the tech companies generally arguing they are free to use the public data.
A Washington Post analysis of a database of websites that was used to train one of OpenAI’s older AI models showed that the AP’s main news website was the 68th-most cited website in the database.
A rising group of authors, musicians, news organizations and social media companies has been pushing back, arguing that the use of their content to train AI is a massive shift in the way the internet works, especially because some of the AI tools being trained on human-made content are already being used to replace human workers. A wave of lawsuits has washed over the industry in the past two weeks alleging improper data use, including class-action suits against OpenAI and Google, and lawsuits against OpenAI from the comedian Sarah Silverman and two prominent fiction authors.
On Thursday, The Washington Post reported that the Federal Trade Commission opened an investigation into how OpenAI used consumers’ data to train its models.
“The data sets include a lot of content that is copyrighted,” said Andres Sawicki, a law professor at the University of Miami who studies intellectual property. “The owners of the copyrights are not consenting to these uses.” It is possible to imagine tech companies and content creators striking more deals like the AP one to create a “clean database,” he said.
“The problem is that the size of the data sets that are required to train the models are so large that I think it’s going to be really hard to get agreement from enough owners to make it technologically viable,” Sawicki said.
Chatbots such as ChatGPT are trained on a fixed set of information and cannot be continuously updated without retraining from scratch, meaning they are less useful for providing recent news and fresh information. Tech companies have tried to solve that problem by allowing the chatbots to search the web themselves or query a separate, continually updated database. The AP deal gives OpenAI access only to its archive, but the archive is regularly updated with recent news stories.
Tech companies have paid directly for news content in the past. Google and Facebook both pay news sites in some countries for direct access to their content to display on their platforms. In Australia, the government passed a law requiring the practice, and a similar law is about to go into effect in Canada.