📚 Training Data Transparency in AI: Tools, Trends, and Policy Recommendations 🗳️
TL;DR
Machine Learning (ML) technology has become ubiquitous in recent years, from the myriad purpose-built models supporting automated decisions across sectors to the near-overnight success of products like ChatGPT, marketed as intelligent “general purpose” systems. This new paradigm poses challenges that call for new regulatory frameworks, and it has spurred renewed regulatory efforts on AI around the world. However, despite the importance of training data in shaping the technology, transparency requirements in recent proposals have remained limited in scope. This hinders the ability of regulatory safeguards to remain relevant as training methods evolve, of individuals to ensure that their rights are respected, and of open science and development to play their role in enabling democratic governance of new technologies. At the same time, we are seeing a trend toward decreasing data transparency from developers, especially for models developed for commercial applications of ML.
To support more accountability along the AI value chain and facilitate technology development that respects established rights, we need minimum meaningful public transparency standards to support effective AI regulation. These should be detailed enough to ensure that researchers and civil society have sufficient access to the relevant aspects of AI systems’ training datasets to support their informed governance, and to strike a more sustainable balance between the needs of developers and the ability to provide recourse for potential AI harms. Additionally, recognizing the essential role of open research in providing a sufficient shared understanding of the technology to support discussions between different stakeholders, these requirements should be accompanied by support for the development and sharing of open large-scale ML training datasets, in the form of further clarity and operational guidance on the legal regimes that govern the use of publicly accessible data in research and development - such as the opt-out requirements of the EU CDSM Text and Data Mining exception.
TOC:
- Introduction
- Data Transparency in Focus: What is Needed?
- Models of Data Transparency
- Trends in Data Transparency
Introduction
Most current AI systems are built within the paradigm of Machine Learning (ML), where a model “learns” primarily by being exposed to a large number of training data points and updating its weights based on signals from these data. In short, AI systems are first and foremost a representation of their training datasets, which makes understanding what is in these datasets crucial to governing the models. The various uses of data in this setting also raise questions about data subjects’ property, privacy, and user rights; answering these questions will require a minimum level of transparency as to how and where the data is used and managed. In this context, more data transparency supports better governance and fosters technology development that more reliably respects people’s rights. In practice, however, model developers provide varying levels of information about the data they use, from providing direct access along with tooling to support non-technical stakeholders to withholding any information about their training datasets.
Developers on the more conservative end of this spectrum may see the composition of their training data as a competitive advantage, fear legal exposure for data uses of uncertain legality, or simply choose to deprioritize the work involved in sharing and documenting datasets – particularly considering that developing tools that meaningfully describe terabytes of data in an accessible fashion remains an open research area. While these decisions may make sense for the companies themselves in the absence of legal transparency requirements, they create an accountability gap for the technology as a whole, one that threatens to widen further if more developers follow suit. As it is, journalists and scholars working to outline industry-wide issues have to fall back on analyzing the datasets shared by more open actors as a necessary but insufficient approximation of those used for systems built with less transparency (e.g., the Washington Post analysis of the C4 dataset in lieu of the actual ChatGPT corpus).
Regulation has a role to play in helping the field of AI strike a more sustainable balance by supporting the open sharing of large datasets for research and development purposes and by setting standards for minimum meaningful transparency any time a developer uses data involving external rights holders. Sharing entire training datasets may not always be feasible or desirable, but extensive research in ML data governance, documentation, and visualization in recent years has supported the development of a range of tools for providing sufficiently meaningful information about large corpora without a full release. The present memo reviews how developers of recent large ML models have chosen to leverage these tools (or not) to provide different levels of insight into their training data, so as to help determine what may constitute a minimum standard of transparency in various settings.
Data Transparency in Focus: What is Needed?
In order to scope what constitutes minimal meaningful transparency, we can start by examining how enforcing some existing and proposed regulations in the context of AI technology would require specific information about the composition of training datasets. For example:
- Respecting the right to be forgotten: the GDPR formalizes the right of EU citizens to have their personal data or information about them removed or corrected. While editing information that is stochastically and contextually encoded in a trained model remains an open research problem, requesting that information be deleted from current and future versions of a training dataset provides a more reliable pathway toward honoring this right in future models or future versions of a model. However, to make such a request, data subjects need to know what relevant information about them was gathered by a developer while curating a training dataset.
- Respecting TDM exemption opt-outs: the EU directive on Copyright in the Digital Single Market outlines a Text and Data Mining regime that allows developers to easily use publicly accessible media, including media subject to copyright, as long as they respect opt-outs expressed in an appropriate machine-readable format (see the sketch after this list for one such signal). However, a lack of visibility into whether and how model developers respect these opt-outs disincentivizes content creators from investing in the technical tools and new approaches needed to develop such machine-readable formats.
- Evaluating social biases at the dataset level to understand liability: ML models integrated in automated decision systems can exacerbate discrimination, in violation of laws guaranteeing non-discrimination and equal treatment. Given the nature of current AI systems, and particularly the opacity of how results are produced in large ML models, proposals like the AI Liability Directive would make it easier to assign liability to the developer or deployer of an AI system when they do not adequately comply with a duty of care. For social biases that shape a system’s likelihood of reproducing or exacerbating discriminatory outcomes, evaluating this duty of care requires assessing choices made from the dataset curation stage through to the deployment of an AI product.
- Assessing the reliability of evaluations: recent regulatory efforts have aimed to make AI systems safer and more reliable. In particular, users of AI systems need to be able to evaluate their performance on various tasks to assess whether they can safely be applied in their context. While developers do typically provide limited performance evaluation outside of a deployment context in the form of benchmark numbers, recent studies have shown that some of the numbers provided by developers were inflated by issues of “data contamination”, where the benchmark over-estimates a model’s performance because the chosen evaluation setting is too close to the training data. This overlap needs to be examined for every new evaluation of a model’s capacity.
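As an illustration of the machine-readable opt-out signals mentioned in the TDM item above, the sketch below checks a single page for a TDM reservation expressed as an HTTP response header, in the spirit of the W3C TDM Reservation Protocol (TDMRep) community draft. The URL is a placeholder and the header-only check is a simplifying assumption; real opt-outs may also be expressed through robots.txt rules, site-level metadata files, or in-page metadata.

```python
import requests

# Illustrative check for a machine-readable TDM opt-out on a single page.
# The header name follows the W3C TDMRep community draft; a real crawler
# would also consult robots.txt, /.well-known/ metadata files, and meta tags.
response = requests.head("https://example.com/some-article", timeout=10)

if response.headers.get("tdm-reservation") == "1":
    print("TDM rights reserved: skip this document or seek a license.")
else:
    print("No TDM reservation signaled via this header.")
```

Transparency about training data is what would allow rights holders to verify, after the fact, that signals like this one were actually honored.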
Models of Data Transparency
Sufficient data transparency to address all of the requirements outlined above can be achieved through a range of tools and methods. Here, we focus on development choices that foster reproducibility and direct access to training datasets and provide documentation and visualization to represent insights about their composition.
Reproducibility and direct access
Direct access to a Machine Learning dataset remains essential for understanding major characteristics of AI systems and for supporting investigations by third-party researchers, journalists, and other external parties – including work on the social biases introduced by scale, biases introduced by common quality and toxicity filtering approaches, and journalistic investigations outlining potential privacy and intellectual property concerns. Public access and reproducible datasets are particularly valuable because they enable broad collaboration on questions that are often beyond the scope of what a single team can investigate, and because they let external stakeholders with different perspectives (and often different priorities) than the developers’ frame these questions in a way that is more relevant to them.
Reproducibility and access can take different forms. Providing a code repository containing all of the processing steps and tools that were used to compile a dataset may be sufficient to allow well-resourced external actors to obtain a close match to the original training dataset. This was the original release method for Google’s C4 and mC4 datasets, web-crawled datasets containing terabytes of text that were notably used to train Google’s T5 models. By providing a script rather than a ready-to-use dataset, developers give others enough information to study the data without having to redistribute it themselves; however, reconstituting a dataset often requires significant computational resources that may not be accessible to all relevant stakeholders. Alternatively, hosting processed versions of the dataset removes this barrier to entry but can require more elaborate governance. The Pile dataset by the nonprofit organization Eleuther.AI is an example of a hosted dataset that has supported much of the recent research on Large Language Models. In practice, most open web-scale datasets (especially multimodal datasets) fall somewhere between the two, hosting part of the data and metadata directly and providing code or methodology to obtain the rest. For example, the LAION multimodal dataset, which has been used to train Stable Diffusion models, provides text data aligned with the URLs of images - leaving it up to potential dataset users to retrieve the actual images.
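As a concrete illustration of what hosted access looks like in practice, the minimal sketch below streams a few documents from the copy of C4 hosted by AllenAI on the Hugging Face Hub. It assumes the datasets library and that particular hosted mirror; the point is simply that inspection does not require downloading the full terabyte-scale corpus.

```python
from datasets import load_dataset

# Stream the English portion of C4 from its hosted copy rather than
# rebuilding it from Common Crawl; streaming avoids downloading the
# full corpus just to inspect a few documents.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for example in c4.take(3):
    print(example["url"])
    print(example["text"][:200])
    print("---")
```

The same pattern works for other hosted corpora, which is part of what makes hosted releases so valuable for third-party investigation.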
To best support regulatory and investigative efforts, datasets should be accessible to any stakeholder with relevant expertise (especially about an AI system’s deployment and social context). While releasing the dataset publicly under an open license such as a Creative Commons license is often the most straightforward way to achieve this goal, developers can also adopt more targeted models of governance for their datasets – for example, the full ROOTS corpus is available on demand depending on specific research needs, and The Stack dataset requires users to keep their version up-to-date to propagate opt-out requests from data subjects.
Documentation and visualization
As outlined above, direct access is most relevant for stakeholders who are in a position to conduct new research on AI systems and their datasets, and to rights holders seeking recourse for suspected misuse of their data. For a broader audience, insights about an ML dataset may also be provided in a more directly accessible format to inform users and regulators through documentation and visualization tools.
Documentation of ML datasets in formats such as data statements, datasheets, data nutrition labels, dataset cards, and dedicated research papers all provide opportunities for dataset curators to communicate the “essential characteristics” of training datasets that are necessary to understand the behavior of the AI systems they support, and such documentation has been shown to help developers grapple with ethical questions. Common requirements for such documentation include the origin and composition of the data, the demographics of the people represented in the dataset, descriptive statistics such as the number or size of individual data items, the original purpose of the dataset, and a high-level description of the processing steps followed to create the dataset. Sufficient documentation can serve as a broadly accessible first introduction to a dataset, or as a way to help deployers assess whether a system is fit for their purpose – and when it is not.
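For datasets hosted on the Hugging Face Hub, this kind of documentation can also be read programmatically. The sketch below is a minimal example assuming the huggingface_hub library; the repository id is illustrative (here, the hosted C4 copy), and any documented dataset with a card can be inspected the same way.

```python
from huggingface_hub import DatasetCard

# Load the dataset card (structured documentation) published alongside a
# hosted corpus; the repository id below is only an example.
card = DatasetCard.load("allenai/c4")

print(card.data.to_dict())  # structured metadata: languages, license, size categories, ...
print(card.text[:1000])     # free-text sections: curation rationale, source data, known limitations, ...
```

Programmatic access to documentation makes it easier to audit many datasets at once, for example to check which ones declare a license or describe their curation process.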
Datasheets are an example of a commonly adopted standard, accompanying the announcements or releases of, e.g., DeepMind’s Gopher and Chinchilla models, Google’s first PaLM model, and TII UAE’s Falcon models. However, while these documents are a welcome effort by developers to meet a minimum standard of transparency, it should be noted that a single document’s ability to present meaningful and actionable information about a corpus containing millions to trillions of documents is inherently limited. Understanding how to get the most value out of dataset documentation under these scale constraints will require further investment in the growing field of ML data measurement – and access to open datasets along with trained models to support this research. The following examples illustrate the range of information provided through static documentation for recent web-scale datasets. This list is meant to be illustrative and non-exhaustive.
- Dataset papers: papers focused on describing the process and result of a dataset creation effort tend to have extensive information about important processing steps and analysis of the full dataset. They are written either by the original dataset curators or by other researchers studying a dataset after its release.
- The Pile paper: dataset used for training GPT-NeoX, the Pythia models, etc.
- ROOTS paper: dataset used for training the BLOOM and BLOOMz models
- RefinedWeb paper: dataset used for training the Falcon models
- LAION paper: dataset used for training Stable Diffusion models
- C4 analysis: dataset used to train T5, FlanT5, etc.
- BooksCorpus analysis: the training dataset of the first GPT model
- Dataset analysis in model papers: research papers describing new models may also provide informative statistics about their training data. These include, for example: the top domain names in a web-crawled dataset, topics represented in the dataset, length statistics, bias analyses (such as through gender pronoun counts), etc.
- Meta Llama dataset description: Section 2.1
- DeepMind Gopher dataset analysis: Appendix A
- IBM Granite dataset and data governance description: Sections II and III
- Google PaLM dataset analysis: Appendix C
- Standardized formats: datasheets, data statements, dataset cards, and data nutrition labels focus on providing important information about ML datasets in a more structured and standardized manner:
- The Pile datasheet
- Notably, The Pile fills out one datasheet for each of its major components rather than a single one for the entire corpus, thus providing significantly more detailed information
- LAION (draft) Data Nutrition Label
- German Sign Language corpus data statement
- OSCAR multilingual web corpus dataset card
Interactive visualization of a very large training dataset can complement static documentation and help bridge the gap in scale between a static document and the artifact it describes. Many of the most pressing questions that need to be asked of training datasets are highly contextual and require additional processing to make assessments that are relevant to a particular use case. By creating broadly accessible interfaces that allow users controlled interactions with training datasets, developers can provide stakeholders with information that is relevant to their particular needs without necessarily releasing the full underlying data. The following examples showcase how such visualization and exploration interfaces have been leveraged for recent large-scale ML datasets:
- The Hugging Face Data Measurements Tool provides access to an extensive catalog of statistics for popular datasets, including the C4 web corpus. In particular, the nPMI section helps surface social biases in the training data based on user-provided anchor terms, providing a much more complete picture than a single static table would (a minimal sketch of this statistic appears after this list).
- The dataset maps (or Atlas) developed by Nomic.ai leverage data embeddings computed by ML systems to help users navigate very large datasets, providing both a high-level view of topics covered and specific illustrative examples. For example, the OBELICS dataset is a web-scale multimodal dataset of aligned text and images that can be explored through such a dataset map.
- Hosting a search index over a training dataset can provide valuable insights to users who need to inquire about the presence of specific text or media in a dataset and support a wide range of research questions on AI systems trained on the dataset. The ROOTS corpus search tool shows users relevant snippets from the dataset with potentially sensitive information redacted. The GAEA explorer extends this search to The Pile, the C4, and the text in the LAION dataset. The LAION dataset was also released with an image index that lets users find all images in the corpus that match a description.
- Membership tests are a particular category of tools that can support governance and compliance. For example, the data portraits of the Stack code dataset help users identify which parts of an LLM-generated code string are present in the training dataset (a rough sketch of this kind of test also appears after this list). Developers can also leverage metadata to help users check whether their work was included, and thus support opt-out requests from rights holders.
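To make the bias-measurement idea behind the nPMI statistic more concrete, here is a minimal, self-contained sketch of normalized pointwise mutual information computed over document-level co-occurrences. The function name and the document-level counting are simplifying assumptions for illustration, not the Data Measurements Tool's actual implementation.

```python
import math
from collections import Counter

def npmi_with_anchor(documents, anchor="she"):
    """Normalized PMI between an anchor term and every co-occurring word,
    using document-level co-occurrence counts. Scores range from -1 to 1;
    high scores flag words that appear disproportionately with the anchor."""
    doc_sets = [set(doc.lower().split()) for doc in documents]
    n_docs = len(doc_sets)
    word_counts = Counter(word for words in doc_sets for word in words)
    joint_counts = Counter(
        word for words in doc_sets if anchor in words for word in words
    )

    scores = {}
    p_anchor = word_counts[anchor] / n_docs
    for word, joint in joint_counts.items():
        if word == anchor:
            continue
        p_word = word_counts[word] / n_docs
        p_joint = joint / n_docs
        if p_joint == 1.0:  # co-occurs in every document: uninformative
            continue
        pmi = math.log(p_joint / (p_word * p_anchor))
        scores[word] = pmi / (-math.log(p_joint))
    return scores
```

Comparing the score tables obtained for paired anchors (e.g. “she” vs. “he”) is one simple way to surface gendered associations in a corpus.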
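And as a rough sketch of the membership-testing idea, the snippet below hashes fixed-length token windows from a corpus into a set, then reports which windows of a newly generated string appear in it. Tools like Data Portraits rely on much more compact probabilistic sketches (such as Bloom filters) to make this tractable at web scale, so this illustrates the principle rather than any tool's implementation; the corpus and window length are placeholders.

```python
import hashlib

WINDOW = 8  # length in tokens of the windows used for membership tests

def window_hashes(text, n=WINDOW):
    """Hash every n-token window of a document."""
    tokens = text.split()
    return {
        hashlib.sha1(" ".join(tokens[i:i + n]).encode("utf-8")).hexdigest()
        for i in range(max(len(tokens) - n + 1, 1))
    }

# Build the membership index once over the training corpus (a plain set here;
# a Bloom filter or similar compact sketch would be used at web scale).
training_documents = ["def add(a, b): return a + b", "print('hello world')"]
index = set()
for document in training_documents:
    index |= window_hashes(document)

def overlapping_windows(generated_text, n=WINDOW):
    """Return the windows of a generated string that appear in the index."""
    tokens = generated_text.split()
    hits = []
    for i in range(max(len(tokens) - n + 1, 1)):
        window = " ".join(tokens[i:i + n])
        if hashlib.sha1(window.encode("utf-8")).hexdigest() in index:
            hits.append((i, window))
    return hits
```

The same overlap logic is also the starting point for data-contamination checks between benchmarks and training corpora.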
Trends in Data Transparency
The previous paragraphs illustrate the range of methods available to developers who want to provide meaningful transparency into the data behind the technology they create, under various constraints. Understanding how to best characterize and describe datasets consisting of billions to trillions of examples still constitutes a nascent research field, but work in the last few years since the arrival of Large Language Models and other AI systems of similar scale has already produced transparency tools that will be invaluable in supporting the application of existing and proposed regulation. For example, they enable analysis of social biases along the development chain to understand liability for discriminatory outcomes under the proposed AILD, can help meet the copyright disclosure requirements in the latest version of the EU AI Act, offer means of enacting and verifying opt-outs by data subjects under the CDSM TDM exemption regime, and support GDPR compliance and enforcement, among others.
Data transparency faces more than technical challenges, however, and the promise of increasingly ingenious documentation and visualization tools is counterbalanced by worrying trends in the release choices made by many prominent AI developers. Google/DeepMind model releases went from the fully reproducible C4 and mC4 datasets for the T5 models (2019), to datasheets and some high-level data documentation in the papers describing the DeepMind Gopher (2021) and first Google PaLM (2022) systems, to a single sentence for the PaLM v2 announcement (2023). OpenAI has followed a similar trend throughout the releases of GPT through GPT-4 and Dall-E through Dall-E 3, withholding all information about pre-training data for the latest systems in both series. The newer company Anthropic has provided no public information about the training data for its Claude LLMs, and even Meta limited the information disclosed about Llama-2 to a single-paragraph description and one additional page of safety and bias analysis – after their use of the books3 dataset to train the first Llama model was brought up in a copyright lawsuit.
This trend among larger actors in the field stands in stark contrast to work done by some smaller companies and nonprofit organizations building alternative models in a more open setting. Foundation models released by the BigScience and BigCode projects and the nonprofit organization Eleuther.AI leverage the full range of tools described above to support extensive data transparency and governance. MosaicML’s MPT models, TII UAE’s Falcon LLM series, and Hugging Face’s IDEFICS model (a reproduction of DeepMind’s Flamingo) also use publicly accessible and documented datasets and provide visualization tools.
Supporting openness and transparency across modes of development will be essential to fostering sustainable governance of AI systems. Minimum legal transparency requirements to enable data subjects to leverage their rights and clarification of the legal regimes that govern the use of publicly available data in ML – e.g. through operational guidance on the CDSM TDM exemption regime – both have a role to play to that end.
Cite as:
@inproceedings{jernite2023training,
  author = {Yacine Jernite},
  title = {Training Data Transparency in AI: Tools, Trends, and Policy Recommendations},
  booktitle = {Hugging Face Blog},
  year = {2023}
}