In recent years, artificial intelligence (AI) has become increasingly prevalent in many industries, including literature. However, a recent discovery has raised concerns among authors: their books have been used to train AI without their consent. Large language model (LLM) technology has made great progress in recent years, but it is shadowed by copyright disputes. Technology giants use massive amounts of text data to train LLMs. This inevitably involves copyrighted works, which has led to strong protests from authors and media organizations.
Recently, Meta faced a class-action lawsuit from several authors, including comedian Sarah Silverman and author Richard Kadrey. The class action accuses Meta of using the “Books3” dataset, which contains a large number of pirated books, to train its Llama 1 and Llama 2 models. Meta did admit to using the Books3 dataset. However, it refused to pay compensation to the authors.
What is Books3
Books3 is a text dataset containing 195,000 books with a total size of nearly 37GB. It was created by AI researcher Shawn Presser in 2020 to provide a better data source for improving machine learning algorithms. However, the Books3 dataset is a collection of pirated ebooks, most of which were published in the past 20 years. It was part of a larger project called The Pile, which aimed to provide open-source data for language models. The dataset was made accessible to the public, and it was used by various companies.
Several authors have reported that their books were included in the Books3 dataset without their permission. Notable authors who have spoken out about this issue include Conor Kostick, Bart King, Lauren Groff, Bianca Turetsky, and T. Greenwood. These authors have expressed their disapproval of their work being used for AI training, and some have even threatened to take legal action against those responsible.
Meta has since admitted that it used Books3 to train its own Llama models. This is why the company is now in court. The authors are seeking compensation for the use of their work in AI training. Books3 contains a large number of copyrighted works scraped from the piracy website Bibliotik. This puts Meta’s actions at legal risk.
Meta’s response
Though Meta admitted to using Books3, it denies any intentional infringement of copyrighted books. The company claims that its use of the Books3 dataset falls within the scope of fair use. Meta also said that the use of these books does not require permission, attribution, or compensation. In addition, Meta is challenging the legality of the lawsuit as a class action and is refusing to provide any form of financial “compensation” to the writers who filed the lawsuit or to others involved in the Books3 controversy.
It is worth noting that some of the content in the Books3 dataset comes from the piracy website Bibliotik. The Danish anti-piracy organization Rights Alliance requested the dataset's removal last year, and it currently faces a digital archive ban.
Meta’s case is not unique
Meta’s approach is not unique. Other companies do the same. Previously, the New York Times filed a lawsuit against OpenAI and Microsoft for using its articles to train the chatbot ChatGPT. OpenAI argued that training an AI model without using copyrighted material is “almost impossible”. The company eventually asked the court to dismiss the lawsuit.
Recall that in November 2022, generative AI came upon us suddenly with the advent of ChatGPT. At the time, there was almost no law guiding the use of generative AI. Many people did not know how the AI got its data. They also did not know how AI models were able to produce fairly decent results. However, as time went on, the public came to understand the training data that these models require. Since then, there have been multiple lawsuits against different AI companies over data use.
The Impact on AI Models
The use of pirated books in AI training has raised concerns about the quality and reliability of the AI models built from this data. It has also raised the question of who gets to decide how such data is used. One commenter on Hacker News suggested that the authors of the books should not have the right to make a blanket decision about whether their work can be used to train models, as they do not have insight into how that data will advance AI technology.
In response to the controversy surrounding the Books3 dataset, an anti-piracy group has had the dataset taken down. This decision highlights the importance of respecting authors’ rights and ensuring that the data used to train AI models is legally obtained.
Conclusion
The data needed to train Llama models is so enormous that it is almost impossible to get the consent of all authors. Meta did not deny using Books3 to train its Llama models. However, it denies any intentional infringement of copyrighted books. Meta also claims that its use of the Books3 dataset falls within the scope of fair use. The unauthorized use of the Books3 dataset for AI training has raised significant concerns among authors. This has led to several legal actions. The shutdown of the Books3 dataset serves as a reminder of the importance of respecting authors’ rights and of ensuring that AI models are built on legally obtained data. As AI technology advances, it is crucial to maintain a balance between innovation and respect for intellectual property rights.
Author Bio
Efe Udin is a seasoned tech writer with over seven years of experience. He covers a wide range of topics in the tech industry from industry politics to mobile phone performance. From mobile phones to tablets, Efe has also kept a keen eye on the latest advancements and trends. He provides insightful analysis and reviews to inform and educate readers. Efe is very passionate about tech and covers interesting stories as well as offers solutions where possible.