Training AI Models – Just Because It’s ‘Your’ Data Doesn’t Mean You Can Use It


By James Gatto

Many companies are sitting on a trove of customer data and are realizing that this data can be valuable for training AI models. However, what some companies have not thought through is whether they can actually use that data for this purpose. Often the data was collected over many years, long before the company thought to use it for training AI. The potential problem is that the privacy policies in effect when the data was collected may not have contemplated or disclosed this use. Using customer data in a manner that exceeds, or otherwise is not permitted by, the privacy policy in effect at the time the data was collected can be problematic. It has led to class action lawsuits and/or enforcement by the FTC. In some cases, the FTC has imposed a penalty known as “algorithmic disgorgement” on companies that used data to train AI models without proper authorization. This penalty is severe: it requires deletion of the data, the models, and the algorithms built with it, which can be an incredibly costly result.

For example, the FTC filed an administrative complaint against Everalbum, Inc. Everalbum provided a photo album and storage application but used its customers’ photos and videos for other purposes. Without user permission, Everalbum created new datasets from that content and used them to train facial recognition technology for a different application. It also failed to delete photos and videos from users who deactivated their accounts. The FTC settled with Everalbum over these AI/privacy violations, and as part of the settlement Everalbum had to destroy various data, algorithms, and models.

This is an example of “algorithmic disgorgement.” It requires a party to destroy ill-gotten or improperly used data along with the models and algorithms built with it. Some have analogized this to the concept of the “fruit of the poisonous tree.” This is a significant penalty as training AI models can cost tens or hundreds of millions of dollars. Destroying the models wipes out this investment.

The scope of algorithmic disgorgement can be broad. The FTC has defined it to include any models or algorithms developed in whole or in part using data or other content that was improperly collected or used. That is a sweeping definition.

It is important to note that this can cover data that was improperly collected as well as data that was properly collected but used for a purpose beyond what was disclosed to and/or agreed to by the users from whom it was collected. This is clear from the Everalbum matter: the FTC acknowledged that Everalbum did not improperly obtain the photos and videos. Users voluntarily uploaded them for storage and to generate albums, and Everalbum properly obtained consent for that purpose. The problem was that Everalbum used that content to train AI models without consent and retained it after assuring users it would be deleted upon account deactivation.

This leads to a key takeaway: even if you properly obtained data or content, that alone does not necessarily mean you can use it to train AI models. You must accurately represent to users how you will use their data or content and obtain consent for that scope of use.

Another interesting component of the FTC settlement with Everalbum was the nature of the disclosure required before collected data may be used to train AI models. The settlement did not permit Everalbum simply to disclose the use in a “privacy policy,” “terms of use,” or other similar document. Rather, before using any data to train, develop, or alter any face recognition model or algorithm, Everalbum must clearly and conspicuously disclose to the user from whom it collected the data, separate and apart from any “privacy policy,” “terms of use” page, or other similar document, all purposes for which Everalbum will use, and to the extent applicable, share, the data, and must obtain that user’s affirmative express consent.

It is not clear from this alone that a separate disclosure of such use is always required, but it may be safer to provide one. Thus, it may be beneficial to include the disclosure in the privacy policy and also present a separate pop-up disclosure to which the user must affirmatively consent.
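To make the consent point concrete, here is a minimal sketch, in Python with hypothetical field names, of how a data pipeline might gate records on affirmative, training-specific consent before they ever reach a training set. Nothing in the FTC settlement prescribes this mechanism; it simply illustrates the kind of record-keeping (which disclosure the user actually saw, whether the account is still active) that supports the disclosure and consent obligations described above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional

@dataclass
class UserRecord:
    """One user's content plus the consent metadata captured when it was collected."""
    user_id: str
    content_uri: str
    training_consent: bool        # affirmative opt-in to AI training, not just ToS acceptance
    consent_version: str          # which disclosure text the user actually agreed to
    consented_at: Optional[datetime]
    account_deactivated: bool

def eligible_for_training(records: Iterable[UserRecord],
                          required_consent_version: str) -> List[UserRecord]:
    """Keep only records whose owners gave affirmative, training-specific consent
    under the current disclosure, and whose accounts are still active."""
    return [
        r for r in records
        if r.training_consent
        and r.consent_version == required_consent_version
        and r.consented_at is not None
        and not r.account_deactivated    # honor deletion promises made at deactivation
    ]
```

Keeping consent metadata alongside the content also makes it straightforward to re-run eligibility checks if the disclosure text changes and consent must be obtained again.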

This is not the only case in which algorithmic disgorgement has been applied. In 2019, the FTC settled with a data analytics and consulting company engaged in the deceptive practice of harvesting personal information from social media sites and required deletion of the information and of any algorithms or equations that originated, in whole or in part, from that information. In March 2022, the FTC settled with the operator of a weight loss app used by children and required deletion of the data and of any models and/or algorithms developed in whole or in part using the personal information collected from children.

The rise of generative AI has inspired many companies to leverage the data and content they have amassed over the years to train AI models. It is important that these companies ensure they have the right to use this data and content for that purpose.

The lessons from Everalbum are worth heeding. However, the FTC is not the only threat to companies training AI models. Class action attorneys are circling the waters and smell blood. At least one recent class action suit has been filed based on the use of images uploaded by users to train AI models, arguably without the proper consent to do so.

For example, in Flora et al. v. Prisma Labs, Inc. (February 2023), the plaintiffs allege that Prisma’s app allows users to upload their “selfies” for editing and retouching and that Prisma: (1) collects the photo subject’s biometric data (facial geometry) in a non-anonymized fashion; (2) offers a confusing and false disclosure of its collection practices; (3) retains the subject’s biometric data in a non-anonymized fashion; (4) retains that data indefinitely for uses wholly unrelated to the user’s purpose for using Lensa; (5) profits from the biometrics; and (6) has no public written policy for the deletion of that data. Allegedly, the Privacy Policy fails to disclose the use of the biometric data and other information Prisma collects from its users and from the images uploaded through the app, in violation of various sections of the Illinois Biometric Information Privacy Act.

The plaintiffs are seeking money damages and “equitable, injunctive and declaratory relief.” The complaint does not specifically request algorithmic disgorgement, and if the plaintiffs prevail, it is not clear whether the court will impose it.

The foregoing cases primarily address situations where companies used data they already had to train AI models, at least arguably without consent to do so. Many companies are newly collecting data and content from various sources to build databases upon which they can train AI models. In these cases, it is important to ensure that data is properly acquired and that its use to train models is permitted. This too has led to lawsuits and more will likely be filed.

The issues in cases of newly collected data are fact dependent: they turn on the type of data, how it is collected, and from where it is collected. For example, sometimes the content is copyright protected (e.g., images), and copying it to train a model can constitute infringement. In two suits against Stability AI, the plaintiffs allege that the method used to train models on their images constitutes copyright infringement, and the defendant contends that it is not infringement and/or that it is fair use under U.S. copyright law.

Fair use is a concept under U.S. copyright law and does not apply in many other jurisdictions. This has led some companies that train AI models to “forum shop” for a legally favorable venue in which to do so. For example, Japan has recently declared that using datasets to train AI models does not violate its copyright law. This presumably means that model trainers can gather publicly available data without having to license it or secure permission from the data owners.

Another type of content used to train AI models (e.g., for AI-based code generators) is source code. Typically, the code is obtained from open source repositories under an open source license. These licenses typically permit use of the code for any purpose subject to certain license conditions (e.g., giving attribution, maintaining copyright notice and/or providing the license terms). Because broad use is permitted, training AI models likely is not infringement. But failure to comply with the conditions can breach the license.
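For teams assembling code corpora, tracking license obligations at ingestion time can reduce the risk of the breach scenario described above. The following is a minimal, hypothetical Python sketch; the allow-list and field names are illustrative only and are not a statement of which licenses are safe to train on. It simply shows recording license identifiers and copyright notices so that attribution and notice conditions can actually be satisfied.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical allow-list: licenses whose conditions the pipeline is prepared to satisfy
# (e.g., by preserving copyright notices and license text alongside the corpus).
SUPPORTED_LICENSES = {"MIT", "Apache-2.0", "BSD-3-Clause"}

@dataclass
class SourceFile:
    path: str
    repo_url: str
    license_id: str            # SPDX identifier detected for the file or repository
    copyright_notice: str      # notice text to be preserved for attribution

@dataclass
class CorpusManifest:
    included: List[SourceFile] = field(default_factory=list)
    excluded: Dict[str, str] = field(default_factory=dict)   # path -> reason for exclusion

def ingest(files: List[SourceFile]) -> CorpusManifest:
    """Add files to the training corpus only when their license is on the allow-list,
    recording the notices needed to meet attribution and notice conditions."""
    manifest = CorpusManifest()
    for f in files:
        if f.license_id in SUPPORTED_LICENSES and f.copyright_notice:
            manifest.included.append(f)
        else:
            manifest.excluded[f.path] = f"license {f.license_id!r} not cleared or notice missing"
    return manifest
```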

This scenario is at issue in Doe 1 et al. v. GitHub, Inc., where the AI models are trained on code available under open source licenses. The case does not allege infringement. Rather, it alleges violation of the Digital Millennium Copyright Act (DMCA), because the model outputs do not include the copyright management information, and breach of the open source licenses, for failure to comply with their conditions.

Some AI code generators have tools to manage open source compliance issues. See Solving Open Source Problems with AI Code Generators – Legal Issues and Solutions. Various other tools exist and are being developed to help mitigate legal risk with generative AI. Some generative AI applications offer different versions for individual use and enterprise use. Many companies that use generative AI and develop policies for such use mandate that employees use the enterprise version, which includes these tools.

Another category of content used to train AI models includes images licensed under Creative Commons or similar licenses. Many perceive these licenses as broadly permitting any use, much like open source licenses. What many fail to realize is that there are six different Creative Commons licenses. Three of them prohibit commercial use. Two prohibit making derivatives. All require attribution. Thus, it is important to understand which Creative Commons license applies to any content you want to use to train AI, and to consider any restrictions (no commercial use, no derivatives) and compliance obligations (attribution).
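As a simple illustration, a content pipeline could encode those restrictions and filter out items whose Creative Commons license does not permit the intended use. The mapping below is a sketch in Python; whether training a model creates a “derivative” is itself an unsettled legal question, so this example treats the no-derivatives licenses conservatively and rejects unknown licenses rather than assuming they are permissive.

```python
# The six Creative Commons licenses and the restrictions discussed above:
# NC variants bar commercial use, ND variants bar derivatives, all require attribution.
CC_LICENSES = {
    "CC BY":       {"commercial": True,  "derivatives": True},
    "CC BY-SA":    {"commercial": True,  "derivatives": True},
    "CC BY-NC":    {"commercial": False, "derivatives": True},
    "CC BY-NC-SA": {"commercial": False, "derivatives": True},
    "CC BY-ND":    {"commercial": True,  "derivatives": False},
    "CC BY-NC-ND": {"commercial": False, "derivatives": False},
}

def usable_for_commercial_training(license_name: str) -> bool:
    """Return True only if the named CC license permits both commercial use and
    derivative works -- a conservative proxy for training a commercial model.
    Unknown licenses are rejected rather than assumed permissive."""
    terms = CC_LICENSES.get(license_name)
    return bool(terms and terms["commercial"] and terms["derivatives"])

# Example: CC BY images pass the check; CC BY-NC-ND images are filtered out.
assert usable_for_commercial_training("CC BY") is True
assert usable_for_commercial_training("CC BY-NC-ND") is False
```

Attribution would still need to be handled separately for everything that passes the filter, since all six licenses require it.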

Conclusion

The rapid growth of generative AI has led to a flurry of activity, including the training of AI models on various types of content. Whether you are training models on content you already possess or on content you are newly acquiring, it is important to ensure you have the right to use that content for the intended purposes. This includes clearly disclosing such use and obtaining consent to it.

The issues in each situation are fact dependent, including the nature of the content, how it was obtained, any agreements or policies relevant to such use, and what the AI tool is used for. Sometimes, as with AI-based medical tools, other regulatory issues may be relevant; see, for example, ChatGPT And Healthcare Privacy Risks. Training AI models for use in other regulated industries or uses may implicate other considerations.

Training AI models is just one area in which legal landmines can arise in connection with generative AI. Various issues arise with training AI, user inputs, and the outputs. Companies entering this space or using these tools would be well served to develop a policy on employee use of generative AI. For examples of what these policies should include and why you need them, see AI Technology – Governance and Risk Management: Why Your Employee Policies and Third-Party Contracts Should be Updated.

As companies enter the generative AI space, in-house counsel are scrambling to get up to speed on these and other legal issues and to develop policies to mitigate the associated legal risk. Many companies have found it helpful to have knowledgeable counsel conduct an in-house presentation on the legal issues with generative AI, to assist in understanding the growing number of issues and how to develop company-specific policies.

This article appeared first on Sheppard Mullin’s Law of the Ledger blog.