
Class Action Lawsuit Against Copilot, GitHub’s AI-Powered Code Generator, Shines Light on Copyright Issues in AI Training Models

January 15, 2023

Recently, there has been considerable discussion in the legal tech community about potential copyright issues related to the datasets used to train advanced artificial intelligence (AI) and machine learning (ML) systems. A class action lawsuit against GitHub and others over its AI-powered code generator, Copilot, has brought these issues to a head. The lawsuit begins to shine a light on the problems that can arise when copyrighted materials are used in AI training models.

 

What is the Copilot class action lawsuit about, who is involved, and what does it mean for the use of copyrighted materials in training datasets?

In November 2022, counsel for a group of unnamed plaintiffs filed a class action lawsuit against GitHub, OpenAI, Microsoft, and others in relation to GitHub’s Copilot product. The claims center on the defendants’ alleged misuse of publicly available code snippets that software developers posted on GitHub and that are potentially subject to copyright and certain licensing rights.

Plaintiffs’ counsel allege that Copilot uses the copyrighted works of plaintiffs in the datasets used to train the underlying AI, and that this has caused actual or potential monetary damages to plaintiffs under various theories. In particular, plaintiffs’ counsel claim the use violates the Digital Millennium Copyright Act (DMCA). While the case is pending in the U.S. District Court for the Northern District of California and its outcome is not yet known, its implications for the wider AI and ML industry are significant. It serves as a cautionary example of how organizations should proceed when using copyrighted materials to train ML and AI models.

 

How did Copilot come to be, and how does it use AI/ML to generate code snippets for developers on GitHub?

Copilot is a tool created to reduce the time it takes to code certain elements of software applications. By leveraging the power of AI and ML, Copilot can generate code snippets for use in software projects. Through comprehensive semantic analysis, Copilot attempts to understand the exact nature of a developer’s problem and provides relevant solutions that are directly applicable. The code Copilot produces can save developers dozens, if not hundreds, of hours in the software development process. According to GitHub’s own product description, Copilot is able to do this because its underlying AI was trained by being fed “billions of lines of open source code.”
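
To make that workflow concrete, the short sketch below shows the kind of interaction described above: a developer writes a natural-language comment and a Copilot-style assistant suggests a completion. This is a hypothetical illustration in Python, not actual Copilot output; the function name and its logic are invented for the example.

# Hypothetical illustration of a Copilot-style suggestion (not actual Copilot output).
# The developer writes the comment and function signature; the assistant proposes the body.

# Prompt written by the developer:
# "Return True if `text` reads the same forwards and backwards,
#  ignoring punctuation, spaces, and letter case."
def is_palindrome(text: str) -> bool:
    # Suggested completion: keep only alphanumeric characters, lowercase them,
    # and compare the result with its reverse.
    normalized = "".join(ch.lower() for ch in text if ch.isalnum())
    return normalized == normalized[::-1]


if __name__ == "__main__":
    print(is_palindrome("A man, a plan, a canal: Panama"))  # prints True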

 

What are some of the potential implications of this lawsuit if Copilot is found to have violated copyright law by using copyrighted material in its training models without permission from the original authors or creators of that code?

If the lawsuit against Copilot is found to have merit, there could be far-reaching consequences for companies that use copyrighted material in their training models. Even if Copilot was trained only on code that was open source and publicly available, issues could still arise, because open source does not mean free to use without restriction. Many open source applications and codebases are offered under particular open source licenses, which can carry significant restrictions, such as limits on commercial use.

As software developers and technology companies look to build better AI/ML systems into their products, the potential legal ramifications of using existing datasets and codebases without permission from the original authors or creators should not be taken lightly. Not only could doing so result in costly copyright infringement claims, but if a court were to enjoin further use of those copyrighted materials, causing an AI or ML system to “un-learn” them from existing, ever-improving models would present a significant technological hurdle. Furthermore, this lawsuit may establish an important legal precedent governing the use of copyrighted material, requiring companies to conduct more thorough licensing research before using third-party datasets and codebases in their training models.

 

What do experts think about this case, and what could it mean for the future of works produced by AI/ML systems trained on unlicensed copyrighted works?

Legal experts appear to be divided on the case, with some anticipating a ruling that could drastically shape the future of AI- and ML-generated works. Several key legal questions remain unanswered, including the extent to which an AI/ML system can be said to have copied an original work and the ramifications of using unlicensed works as part of an AI/ML system’s training data.

Copyright holders who oppose the unpermitted use of their works in these training datasets believe they stand to gain greater recognition and protection for their works. They argue that the law guarantees them the ability to monetize their creations in the manner they deem fit, subject only to a few limited exceptions, such as fair use, which they contend would not apply in this case.

On the other hand, those who support the use of such copyrighted materials in training data, whether under fair use or another exception, generally view the AI/ML systems as simply learning from the materials rather than “copying” them in any meaningful way. Under this argument, Copilot learns the same way you or I would: by reading code snippets on GitHub and then using that knowledge to write our own software. Or, for a non-technical example, no one would claim that reading an article on how to do pushups and then using the knowledge gained from it to do pushups infringes the copyright of the article’s author.

Beyond those who are simply for or against on the initial question of copyright infringement, others view this case through the lens of a new creative artistic tradition that is already emerging. Rather than a system that merely produces derivatives of pre-existing works, they see a class of knowledge workers being trained as conductors: their orchestras are AI and ML systems trained to play various instruments, and their scores are the prompts they compose. While the AI/ML orchestra produces the wonderful result, it is the composer of the prompt who determines the quality of that result. With so much at stake and still uncertain, this is an important case that is likely to reverberate well beyond Copilot, not just through the software world but through the worlds of art, literature, and more.

Indeed, assuming it is not quietly resolved through settlement, this case will likely have far-reaching implications for the use of copyrighted materials in training datasets for ML models and could set a precedent for future cases involving AI-generated works. If the defendants, via Copilot, are found to have violated copyright law, the decision could pave the way for more lawsuits against companies that use copyrighted material in their training datasets without permission from the original authors or creators. However, if Copilot is found to have been within its rights to use copyrighted material in its training dataset in the manner it did, the decision could open up new opportunities for AI-generated code and other creative works produced by machines. Regardless of the outcome, this case is sure to have a significant impact on the future of AI and ML.