
Friday, February 9, 2024

Who owns LLM algorithms?

In describing the trajectory of scientific progress, Isaac Newton is supposed to have said that "we stand on the shoulders of giants". This universal and time-invariant truth is often forgotten amidst the hype surrounding modern technological advances, especially in the digital sector.

This is especially ironic since there's a compelling case that these technologies depend more on what came before them than the technologies of earlier eras did. As is well documented by now, the iPhone is the best example. Another is the rapidly emerging class of AI models.

LLM-based solutions like ChatGPT and firms like OpenAI rely on public and private documents, videos, photos, and so on, most often used without any consent and in complete disregard of copyright protections. Their business models fail to compensate the providers or producers of that content. Consider this:

Many large language models are trained on vast databases of texts that have been scraped from the internet, without compensating creators. That could be a problem for unionized teachers concerned about fair labor compensation. There are also concerns that some A.I. companies may use the materials that educators input, or the comments that students make, for their own business purposes, such as improving their chatbots.

This raises the question: why shouldn't these technology companies pay the content creators whose work was used to develop, train, and refine their AI models? And if they should, how should the compensation be structured?

In a test case, The New York Times has sued OpenAI and Microsoft for "copyright infringement" and "unauthorised use of published work to train artificial intelligence technologies".

The Times is the first major American media organization to sue the companies, the creators of ChatGPT and other popular A.I. platforms, over copyright issues associated with its written works. The lawsuit, filed in Federal District Court in Manhattan, contends that millions of articles published by The Times were used to train automated chatbots that now compete with the news outlet as a source of reliable information. The suit does not include an exact monetary demand. But it says the defendants should be held responsible for “billions of dollars in statutory and actual damages” related to the “unlawful copying and use of The Times’s uniquely valuable works.” It also calls for the companies to destroy any chatbot models and training data that use copyrighted material from The Times... “Defendants seek to free-ride on The Times’s massive investment in its journalism,” the complaint says, accusing OpenAI and Microsoft of “using The Times’s content without payment to create products that substitute for The Times and steal audiences away from it”... 

The lawsuit could test the emerging legal contours of generative A.I. technologies — so called for the text, images and other content they can create after learning from large data sets — and could carry major implications for the news industry. The Times is among a small number of outlets that have built successful business models from online journalism, but dozens of newspapers and magazines have been hobbled by readers’ migration to the internet... Concerns about the uncompensated use of intellectual property by A.I. systems have coursed through creative industries, given the technology’s ability to mimic natural language and generate sophisticated written responses to virtually any prompt.

The Times's lawsuit goes beyond the mere use of its content, alleging unfair direct competition with, and copying from, its own business.

Besides seeking to protect intellectual property, the lawsuit by The Times casts ChatGPT and other A.I. systems as potential competitors in the news business. When chatbots are asked about current events or other newsworthy topics, they can generate answers that rely on journalism by The Times. The newspaper expresses concern that readers will be satisfied with a response from a chatbot and decline to visit The Times’s website, thus reducing web traffic that can be translated into advertising and subscription revenue. The complaint cites several examples when a chatbot provided users with near-verbatim excerpts from Times articles that would otherwise require a paid subscription to view. It asserts that OpenAI and Microsoft placed particular emphasis on the use of Times journalism in training their A.I. programs because of the perceived reliability and accuracy of the material...

In one example of how A.I. systems use The Times’s material, the suit showed that Browse With Bing, a Microsoft search feature powered by ChatGPT, reproduced almost verbatim results from Wirecutter, The Times’s product review site. The text results from Bing, however, did not link to the Wirecutter article, and they stripped away the referral links in the text that Wirecutter uses to generate commissions from sales based on its recommendations. “Decreased traffic to Wirecutter articles and, in turn, decreased traffic to affiliate links subsequently lead to a loss of revenue for Wirecutter,” the complaint states. The lawsuit also highlights the potential damage to The Times’s brand through so-called A.I. “hallucinations,” a phenomenon in which chatbots insert false information that is then wrongly attributed to a source. The complaint cites several cases in which Microsoft’s Bing Chat provided incorrect information that was said to have come from The Times, including results for “the 15 most heart-healthy foods,” 12 of which were not mentioned in an article by the paper.

And The Times is not alone in pursuing a share of the emerging economy of AI companies:

The actress Sarah Silverman joined a pair of lawsuits in July that accused Meta and OpenAI of having “ingested” her memoir as a training text for A.I. programs. Novelists expressed alarm when it was revealed that A.I. systems had absorbed tens of thousands of books, leading to a lawsuit by authors including Jonathan Franzen and John Grisham. Getty Images, the photography syndicate, sued one A.I. company that generates images based on written prompts, saying the platform relies on unauthorized use of Getty’s copyrighted visual materials.

Ironically, even as the tech mavens celebrate themselves and their technologies, they also unwittingly acknowledge their dependence on what has gone before:

In October, Andreessen Horowitz, a venture capital firm and early backer of OpenAI, wrote in comments to the U.S. Copyright Office that exposing A.I. companies to copyright liability would “either kill or significantly hamper their development.” “The result will be far less competition, far less innovation and very likely the loss of the United States’ position as the leader in global A.I. development,” the investment firm said in its statement.

There are interesting parallels with other digital technologies. Take the example of search engines and social media platforms. The former use existing internet content to generate targeted search results that are in turn offered as a free service. Social media too uses the private information of individuals to forge network connections, but does not charge its users. Instead, both draw their revenues from advertisements, though there are legitimate claims here too that the profits should be shared with the original content owners and users. In fact, in 2021 the Australian government promulgated a law that mandates that platforms like Google and Facebook negotiate content supply deals with media outlets.

After all, targeting (and thereby advertising) is a non-starter without people sharing their preferences and networks, and without the patterns that emerge from their transactions. By any reasoning, all the digital trails generated by an individual are the legitimate private property of that individual. But the owners do not get any share of the revenues from advertisements arising from the exploitation of their private property. What if the owners of these digital trails demand that the rules be rewritten to allow them to decide the terms of their use by companies? What if they demand a fair share of the profits?

For sure, the companies will say that people are allowed to use the platforms for free. But that does not entitle them to use individuals' digital trails as they please. Of course, they could start to charge. In that case, usage would surely crash, and with it the network effects that drive advertising revenues. It would also most likely trigger demands from users for a fair share of the profits from the use of their data. The big success of these companies was to give people free access to their products and thereby keep commercial considerations out of the discussion.

Another way of looking at this: people value the entertainment and enjoyment they get from these platforms, but since access is currently free, we do not know its true value to users. If a platform started to charge, its users' willingness to pay would give us a measure of that value. But we also know that if the platform starts to charge, usage will fall sharply, and the loss of network effects would in turn reduce the platform's value further.
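To make this feedback loop concrete, here is a minimal sketch in Python. Every number and functional form in it is a hypothetical assumption chosen for illustration (a linear demand curve and a Metcalfe-style network effect), not an estimate of any real platform:

```python
# Illustrative model of the charge -> exit -> lower value -> exit loop.
# All parameters here are assumptions made for this sketch.

N0 = 1_000_000   # assumed potential user base when the platform is free
K = 1e-5         # assumed network effect: value to each user = K * n_users

def network_value(n_users):
    """Metcalfe-style assumption: the platform is worth more to each
    user as the total user base grows."""
    return K * n_users

def equilibrium_users(price, rounds=100):
    """Iterate the loop: only users who value the network above the
    price stay; a smaller network is worth less, so more leave next round."""
    n = N0
    for _ in range(rounds):
        value = network_value(n)
        staying_share = max(0.0, 1 - price / value) if value > 0 else 0.0
        n = N0 * staying_share
    return n

for price in [0.0, 1.0, 2.0, 2.4, 2.6, 3.0]:
    print(f"price {price:3.1f} -> equilibrium users {equilibrium_users(price):>11,.0f}")
```

Under these assumed parameters, usage shrinks gradually as the price rises and then collapses to zero past a tipping point (here around 2.5), which is the sense in which charging can destroy the very network effects that the revenue depends on.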

Any LLM would perforce require large volumes of content. The more refined versions, the ones that offer valuable services, would invariably require high-quality content, most of which would have been commercially developed and would be under copyright protection. The service or product involving the LLM would not have been possible without access to such content. The question that needs to be asked is how much of the value addition comes from the underlying content and how much is determined by the quality of the algorithm. This is not easy to answer.
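One standard way to formalize that attribution question is the Shapley value from cooperative game theory, which splits a jointly produced surplus according to each input's average marginal contribution. The short Python sketch below applies it to two complementary inputs, the training content and the algorithm; the coalition revenues are entirely hypothetical numbers chosen for illustration:

```python
from itertools import permutations

# Hypothetical revenues (in $ millions) for each "coalition" of inputs.
# These numbers are assumptions for the sketch, not estimates.
v = {
    frozenset(): 0,
    frozenset({"algorithm"}): 5,      # model trained only on free, unprotected text
    frozenset({"content"}): 20,       # the content already earns revenue on its own
    frozenset({"algorithm", "content"}): 100,  # the two complements combined
}

def shapley_shares(players, value):
    """Average each player's marginal contribution over all arrival orders."""
    shares = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            shares[p] += value[coalition | {p}] - value[coalition]
            coalition = coalition | {p}
    return {p: s / len(orders) for p, s in shares.items()}

print(shapley_shares(["algorithm", "content"], v))
# Under these assumed numbers: {'algorithm': 42.5, 'content': 57.5}
```

The point of the sketch is not the specific split but that any fair answer turns on the counterfactuals: what the algorithm could earn without the copyrighted content, and vice versa, which is precisely what is hard to measure.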

The LLM firms will argue that the development of all such technologies depends on large volumes of data and content, and that restrictions would leave these innovations stillborn. The response to this argument is that the objective is not to restrict access to data, but to ensure a fair share of the profits generated from these technologies. Even more than fair sharing with the individual owners of the data, there's a need to pay taxes that reflect the public ownership of the feedstock that went into creating these models. This logic applies to all digital firms whose revenue model critically depends on processing the digital trails left behind by the individual users of their services. Further, there's nothing in history to suggest that businesses require such super-normal profits to keep innovating.

As a parallel, the LLMs of today look like the student who has acquired an entire school and college education by dodging the fees, even as all his classmates have paid theirs!

Whichever way we look at this, we cannot gloss over the need for equitable sharing of the returns from digital technology innovations. Business models in digital markets must be made to acknowledge this and allow for such sharing.
