Data Consent and AI Ethics
Why Traditional Consent Fails for AI
The standard model of data consent, inherited from medical research ethics and adapted for digital privacy, follows a pattern: an organization tells you what data it collects, how it will be used, and you choose whether to agree. This model assumes that the organization knows in advance what it will do with the data, that the individual understands what they are agreeing to, and that the individual's choice is genuinely free. All three assumptions collapse in the AI context.
Large language models are trained on datasets comprising hundreds of billions to trillions of text tokens, much of it scraped from the internet. Common Crawl, a standard training data source, contains petabytes of text from billions of web pages. The creators of these models did not obtain consent from the people whose writing, forum posts, social media comments, blog entries, and other content appear in the training data. In most cases, the individuals whose data was used had no knowledge that their content would be included, no opportunity to consent or refuse, and no mechanism to request removal after the fact. The scale of data collection makes individual consent logistically impossible: you cannot ask billions of people for permission.
Purpose limitation, a core principle of data protection law, requires that data collected for one purpose not be used for a fundamentally different purpose without new consent. A person who posted a restaurant review on Yelp consented to that review being visible on Yelp's platform. They did not consent to it being used to train a language model that generates text. A person who uploaded a photograph to a social media platform consented to it being displayed on that platform. They did not consent to their face being incorporated into a facial recognition training dataset. The chasm between the purpose for which data was originally shared and the purpose for which AI systems use it represents a systemic consent violation that affects billions of people.
The inference problem adds another dimension. When someone shares their shopping history, location data, social media likes, or typing patterns, they are sharing data that appears mundane. They are not sharing their political views, sexual orientation, health conditions, or pregnancy status. But machine learning models can often infer these attributes from the mundane data, with accuracy that varies by attribute and dataset, extracting sensitive information that the person never chose to disclose. No consent framework covers this because the person never decided to share the inferred information. The consent they gave for sharing shopping data does not extend to the health conditions that shopping data reveals when processed by a sufficiently capable algorithm.
The Right to Be Forgotten Meets Model Weights
The EU's General Data Protection Regulation establishes a "right to erasure" in Article 17 (commonly called the right to be forgotten): individuals can, under specified conditions, request that organizations delete their personal data. For a traditional database, this is straightforward: find the record, delete it, confirm deletion. For an AI model, it is profoundly difficult. Personal data used in training does not exist as a discrete record in the model's parameters. It is diffused across billions of weights through the training process, influencing the model's behavior in ways that cannot be cleanly isolated or removed.
Machine unlearning is an active research area attempting to solve this problem. The goal is to modify a trained model so that it behaves as if a specific piece of training data had never been included, without retraining the entire model from scratch (which would be prohibitively expensive for large models). Current approaches include influence function methods that estimate each training example's contribution to the model's parameters and subtract it, gradient-based methods that fine-tune the model to "forget" specific data, and partitioning methods that train separate sub-models on isolated data shards so that only the shard containing the deleted data needs to be retrained. The first two produce approximate unlearning, reducing but not provably eliminating the influence of specific training examples, and their effectiveness remains debated. Partitioning can remove a record's influence exactly, but only if the model was built that way from the start, and it may cost accuracy because each sub-model sees only part of the data.
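The partitioning idea can be sketched in a few lines. This is a toy illustration under loud assumptions, not a real unlearning system: each "sub-model" is just the mean of the values in its shard, the class and method names (ShardedModel, forget) are hypothetical, and records are assigned to shards by a checksum of their ID. The point it shows is that deleting a record requires retraining only one shard, after which the ensemble is identical to one that never saw the record.

```python
import zlib
from statistics import mean


class ShardedModel:
    """Toy ensemble: one sub-model per data shard (here, a simple mean)."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]  # record_id -> value
        self.sub_models = [None] * num_shards          # one "model" per shard
        self.retrain_count = 0                         # how many shard retrains ran

    def _shard_index(self, record_id):
        # Deterministic assignment, so a record can be located again later.
        return zlib.crc32(record_id.encode("utf-8")) % len(self.shards)

    def _train_shard(self, index):
        # Stand-in for real training: refit the sub-model on its shard only.
        values = list(self.shards[index].values())
        self.sub_models[index] = mean(values) if values else None
        self.retrain_count += 1

    def add(self, record_id, value):
        index = self._shard_index(record_id)
        self.shards[index][record_id] = value
        self._train_shard(index)

    def forget(self, record_id):
        """Delete a record and retrain only the shard that contained it."""
        index = self._shard_index(record_id)
        if record_id in self.shards[index]:
            del self.shards[index][record_id]
            self._train_shard(index)

    def predict(self):
        # Aggregate the sub-models; empty shards contribute nothing.
        trained = [m for m in self.sub_models if m is not None]
        return mean(trained) if trained else None


model = ShardedModel(num_shards=4)
for rid, value in [("alice", 1.0), ("bob", 2.0), ("carol", 6.0), ("dave", 3.0)]:
    model.add(rid, value)

before = model.retrain_count
model.forget("bob")
print("shards retrained for one deletion:", model.retrain_count - before)
```

The design trade-off is visible even in the toy: deletion is cheap and exact, but every sub-model is fit on a fraction of the data, which for real models can reduce accuracy.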
The practical implication is that, for models not built for unlearning from the start, current AI technology cannot fully honor deletion requests in the way that data protection law intends. Once your data has been used to train such a model, extracting its influence is, at best, approximate and, at worst, impossible. This creates a temporal asymmetry in consent: the decision to include your data in training is practically irreversible, while the opportunity to object arrives only after the fact. Meaningful consent requires the ability to withdraw consent, but withdrawal is meaningless if it cannot be put into effect.
Consent Models for the AI Era
Several alternative consent frameworks have been proposed to replace or supplement the failing notice-and-choice model. Broad consent allows individuals to agree to a general category of research or development use without enumerating every application. This model, borrowed from biobank ethics where biological samples are used for research not yet conceived, acknowledges that data may be used in ways that cannot be predicted at the time of collection. The challenge is that broad consent can be so vague as to be meaningless: agreeing that your data may be used "for AI development" covers everything from beneficial medical research to invasive surveillance.
Dynamic consent provides individuals with ongoing control over how their data is used, with a platform or dashboard where they can see what their data is being used for and adjust their preferences in real time. This model respects individual autonomy and accommodates changing circumstances, but it places significant burden on individuals to monitor and manage their data across potentially hundreds of organizations. It also assumes that organizations are transparent about how they use data, which is often not the case.
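As a rough illustration of what sits behind such a dashboard, the sketch below models a per-person consent ledger with purpose-specific grants, withdrawal, and an audit history. Everything here is an assumption made for illustration: the names ConsentLedger, grant, withdraw, and is_permitted are hypothetical, the purpose labels are made up, and a real system would also need authentication, persistence, and a way to propagate withdrawals to every downstream user of the data.

```python
from dataclasses import dataclass, field


@dataclass
class ConsentRecord:
    allowed_purposes: set = field(default_factory=set)
    history: list = field(default_factory=list)  # audit trail of (action, purpose)


class ConsentLedger:
    """Tracks, per data subject, which purposes are currently permitted."""

    def __init__(self):
        self._records = {}

    def _record(self, subject_id):
        return self._records.setdefault(subject_id, ConsentRecord())

    def grant(self, subject_id, purpose):
        record = self._record(subject_id)
        record.allowed_purposes.add(purpose)
        record.history.append(("grant", purpose))

    def withdraw(self, subject_id, purpose):
        record = self._record(subject_id)
        record.allowed_purposes.discard(purpose)
        record.history.append(("withdraw", purpose))

    def is_permitted(self, subject_id, purpose):
        # Default deny: a purpose never granted is not permitted.
        record = self._records.get(subject_id)
        return record is not None and purpose in record.allowed_purposes

    def history_for(self, subject_id):
        # What a dashboard could show the data subject.
        record = self._records.get(subject_id)
        return list(record.history) if record else []


ledger = ConsentLedger()
ledger.grant("user-42", "display_review")

# Consent to display a review does not imply consent to train on it.
print(ledger.is_permitted("user-42", "display_review"))
print(ledger.is_permitted("user-42", "train_language_model"))

ledger.withdraw("user-42", "display_review")
print(ledger.history_for("user-42"))
```

The default-deny check also encodes purpose limitation: a new use of the data is refused unless the person has granted that specific purpose.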
Data trusts and data cooperatives represent collective approaches to consent. Instead of individuals negotiating with organizations one by one, a trusted intermediary manages data on behalf of a group, negotiating terms of use, monitoring compliance, and distributing benefits. The trust makes decisions based on the collective interest of its members, similar to how labor unions negotiate on behalf of workers. This model addresses the power imbalance between individuals and large technology companies and provides a mechanism for ongoing governance rather than one-time consent.
Opt-out at scale, exemplified by robots.txt for web crawling and the proposed "Do Not Train" flags for AI training data, provides a mechanism for content creators to signal that their data should not be used for AI training. The Spawning AI project created a tool that allows artists and content creators to opt out of AI training datasets. The practical effectiveness of these mechanisms depends entirely on whether AI developers honor them, which is currently voluntary. Requiring developers to honor opt-outs, with enforcement behind the requirement, would represent a significant shift in the consent landscape, but it faces resistance from AI developers who argue that requiring consent for publicly available data would make large-scale AI development impossible.
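To make the robots.txt mechanism concrete, here is a minimal sketch of the check a well-behaved crawler would run before fetching a page, using Python's standard urllib.robotparser module. The robots.txt content, the example.com URL, and the helper name may_crawl are illustrative assumptions; GPTBot appears only as an example of a crawler user-agent token that a site might choose to block.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt, user_agent, url):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "https://example.com/blog/my-post"
for agent in ("GPTBot", "ExampleBrowser"):
    print(agent, "allowed:", may_crawl(ROBOTS_TXT, agent, page))
```

Note where the check runs: on the crawler's side. The file expresses a preference, and nothing in the protocol prevents a crawler from skipping the check, which is why the mechanism is only as strong as the willingness of developers to honor it.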
The Collective Dimension of Consent
Individual consent frameworks miss a critical dimension of AI data use: the information that AI can derive about people who never consented. If most members of a community share their data, AI can infer information about the members who did not. Location data shared by 80% of residents in a neighborhood reveals traffic patterns, social gatherings, and daily routines that affect the privacy of the 20% who opted out. Genetic data shared by relatives can reveal information about family members who never consented to any testing. Social media data shared by friends reveals information about individuals who are not on social media at all.
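A toy example shows how little is needed for this kind of inference. The sketch below guesses an undisclosed attribute for a person who shared nothing, using a majority vote over what their contacts disclosed. The names, the contact graph, and the attribute values are invented for illustration, and the function infer_from_contacts is hypothetical; real systems use statistical models rather than a simple vote, and how well any of them work depends on how strongly the attribute correlates among connected people.

```python
from collections import Counter


def infer_from_contacts(person, contacts, shared_attributes):
    """Guess a person's attribute from values their contacts chose to share.

    Returns (guessed_value, share_of_votes), or None if no contact shared data.
    """
    votes = Counter(
        shared_attributes[c]
        for c in contacts.get(person, ())
        if c in shared_attributes
    )
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    return value, count / sum(votes.values())


# Made-up data: "dana" never shared anything, but three contacts did.
contacts = {"dana": ["ali", "bea", "cy"]}
shared_attributes = {"ali": "party_a", "bea": "party_a", "cy": "party_b"}

print(infer_from_contacts("dana", contacts, shared_attributes))
```

Nothing in the function touches data that "dana" provided, which is the point: no consent decision by the person being profiled appears anywhere in the computation.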
This collective dimension means that individual consent is insufficient as a privacy protection. Even a person who never shares any personal data, never creates a social media account, and never agrees to any data collection can have sensitive information inferred about them from the data that other people share. Privacy in the AI era is a collective property, not an individual one, and protecting it requires collective mechanisms like regulation, data governance frameworks, and technical protections that operate at the population level rather than the individual level.
The consent landscape for AI is evolving rapidly, driven by lawsuits (multiple class-action suits have been filed against AI companies for using copyrighted and personal data without consent), regulation (the EU AI Act and GDPR impose data governance requirements on AI developers), and public pressure (growing awareness of AI training data practices is generating demand for stronger protections). The resolution will likely involve a combination of regulatory mandates, technical mechanisms for expressing and enforcing consent preferences, collective governance structures, and a fundamental rethinking of what consent means when data can be used in ways that neither the data subject nor the data collector can fully anticipate.
Traditional notice-and-consent frameworks fail for AI because training data is collected at scales that preclude individual permission, AI infers sensitive information that was never explicitly shared, and once data is embedded in model weights it cannot be fully removed. New consent models, including data trusts, dynamic consent, and collective governance, are emerging but none has fully solved the problem.