Large-scale Artificial Intelligence Open Network — 5.85 Billion.
October 2022. In December 2023, it was temporarily taken down after a Stanford study found child sexual abuse material in the dataset. In August 2024, LAION released Re-LAION-5B.
About 240 TB, including 5.85 billion image-text pairs.
LAION, a German non-profit which makes open-source artificial intelligence models and datasets.
HuggingFace, Doodlebot, and Stability; data was hosted on the-eye.eu.
Images from Common Crawl, with captions taken from the alt-text of their associated HTML tags.
Not annotated/labeled; alt-text is used as the caption.
🕳️ copyright issues were not considered/mentioned in their data pipeline.
“Strongly advocate academic use only, do not recommend using it for creating ready-to-go industrial products, as the basic research about general properties and safety of such large-scale models is still in progress.”
Laion-5b's three subsets (laion2B-en, laion-high-resolution, and laion-aesthetics v2 5+) are used to train Stable Diffusion from Stability AI.
↪Laion 5B's Filtering Heuristics
↪Laion 5B website
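The alt-text harvesting step described above can be sketched as follows. This is a minimal illustration using Python's standard-library HTML parser, not LAION's actual pipeline (which processes Common Crawl's WAT metadata at scale); the sample HTML is invented:

```python
from html.parser import HTMLParser

class AltTextExtractor(HTMLParser):
    """Collect (image URL, alt-text) pairs from <img> tags."""
    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            # Keep only images carrying a non-empty alt attribute,
            # since the alt-text becomes the training caption.
            if a.get("src") and a.get("alt", "").strip():
                self.pairs.append((a["src"], a["alt"].strip()))

html = '<p><img src="cat.jpg" alt="a cat on a sofa"><img src="spacer.gif" alt=""></p>'
parser = AltTextExtractor()
parser.feed(html)
print(parser.pairs)  # [('cat.jpg', 'a cat on a sofa')]
```

Note how the empty-alt spacer image is silently dropped: anything without usable alt-text never enters the dataset, which is why alt-text quality so strongly shapes these corpora.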
Common Crawl.
From 2008, ongoing; in recent years a new crawl ("snapshot"/"dump") has been released every 3–5 weeks.
Each "dump" varies in size, ranging from 200 to 400 TB of uncompressed content.
Common Crawl, data obtained by Common Crawl's CCBot crawler.
Common Crawl is a nonprofit organization founded by Gil Elbaz. Sponsors include Andreessen Horowitz, Nutch, Apache Tika, DuckDuckGo, HPLT Project, the Linux Foundation, MLCommons, NVIDIA, and OpenWebSearch.eu (as listed on the Common Crawl website). Amazon Web Services began hosting Common Crawl’s archive through its Public Data Sets program in 2012.
Web crawling.
Not annotated/labeled.
No data processing after collection.
For anyone to analyze and research on open web data.
A very large portion of commercial large language models and generative models are trained in part on differently filtered versions of Common Crawl, for example OpenAI’s GPT-3, Meta’s LLaMA, and Google’s BERT.
↪Common Crawl has no Filtering Heuristics
↪Common Crawl website
BookCorpus.
2015. It was originally hosted on the authors’ website but was removed around May 2019.
1.18 GB, including 11,038 books (around 74M sentences and 1B words) across 16 sub-genres (e.g., Romance, Historical, Adventure, etc.).
Zhu et al. from the University of Toronto and the Massachusetts Institute of Technology.
The Natural Sciences and Engineering Research Council (NSERC), the Canadian Institute for Advanced Research (CIFAR), Samsung, Google, and the Office of Naval Research (ONR).
"Free" books from smashwords.com, “written by yet unpublished authors.”
Not annotated/labeled.
🕳
🕳 The paper tried to frame these books as copyright-free, but the dataset contains multiple copyright statements from authors.
BookCorpus is part of the training data for OpenAI’s GPT-1, GPT-3, and later GPT-N models, and for Google’s BERT model. The Pile made a new version of BookCorpus following its method.
↪BookCorpus's Filtering Heuristics
↪BookCorpus website
Conceptual Captions 3.3 million.
2018.
1.85GB, consisting of ~3.3 million image-caption pairs.
Sharma et al. at Google AI.
Google.
Crawling Internet webpages (not specified where).
Not annotated/labeled; the alt-text HTML attribute associated with web images is used as the image caption.
🕳️ copyright issues were not considered/mentioned in their data pipeline.
"For fine-tuning vision-and-language models."
cc3m is part of the training data for BLIP, BLIP-2, and MiniGPT-4.
↪cc3m's Filtering Heuristics
↪cc3m website
Colossal Clean Crawled Corpus.
2020.
About 750GB.
Raffel et al. from Google.
Google.
Based on Common Crawl (web crawling).
Not annotated/labeled.
🕳️ copyright issues were not considered/mentioned in their data pipeline.
🕳
C4 is used to train Google’s T5 (Text-to-Text Transfer Transformer) and Meta’s Large Language Model Meta AI (LLaMA).
↪c4's Filtering Heuristics
↪c4 data page
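The linked filtering heuristics amount to rule-based line and page cleaning; a simplified sketch of a few of the published C4 rules (keep only punctuation-terminated prose lines, drop code-like or boilerplate pages, require a minimum sentence count). This is an illustrative reimplementation, not Google's code, and it omits the cross-page deduplication and bad-words steps:

```python
def clean_page(text: str, min_sentences: int = 5, min_words_per_line: int = 3):
    """Apply a few C4-style heuristics to one page of extracted text.

    Returns the cleaned text, or None if the whole page is dropped.
    """
    if "lorem ipsum" in text.lower() or "{" in text:
        return None  # placeholder boilerplate or source code: drop the page
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only lines that look like prose: long enough and
        # ending in terminal punctuation.
        if len(line.split()) >= min_words_per_line and line.endswith((".", "!", "?", '"')):
            kept.append(line)
    cleaned = "\n".join(kept)
    # Rough sentence count via terminal punctuation marks.
    if cleaned.count(".") + cleaned.count("!") + cleaned.count("?") < min_sentences:
        return None  # too few sentences to be useful text
    return cleaned
```

The same skeleton, with different thresholds and language-ID steps, recurs in most of the Common Crawl derivatives catalogued here.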
Chinese Language Understanding Evaluation Corpus 2020.
2020.
About 100GB.
CLUE Organization, an open-source organization dedicated to Chinese natural language processing, whose founder is also the founder of ChatYuan (元語智能), an AI startup in China.
Specific details were not provided; the paper only mentions that cloud computing power was supported by Cloud TPUs from Google’s TensorFlow Research Cloud.
It uses Common Crawl snapshots from July to December 2019.
Not annotated/labeled.
🕳️ copyright issues were not considered/mentioned in their data pipeline.
"For language modeling, pre-training, and generating tasks."
Most likely, this dataset is part of the training data for PromptClue, an AI chatbot from ChatYuan, and for Huawei’s PANGU-Σ model.
↪CLUE corpus' Filtering Heuristics
↪CLUE corpus github page
The Pile.
December 2020. In July 2023, The Eye, where The Pile was hosted, took down its copies of The Pile due to copyright issues.
886.03 GB, composed of 22 smaller datasets.
Gao et al. from EleutherAI. EleutherAI grew out of a Discord server, and The Pile dataset was a Discord collaboration.
This dataset was created by individuals working in their own time, without funding. The dataset was hosted on The-Eye.eu.
The Pile is composed of 22 smaller datasets, ranging from original scrapes, to text data made available by the data owners, to third-party scrapes available online.
Among the 22 subsets, some have no annotations while others use captions or alt-text as annotations.
No step regarding copyrighted material is mentioned in their data processing pipeline. In July 2023, The Pile was taken down because it includes the shadow-library dataset Books3.
"For training large-scale language models."
The Pile is used to train EleutherAI’s GPT-Neo models and Pythia, a suite of models; Microsoft’s Megatron-Turing Natural Language Generation; Meta AI’s Open Pre-trained Transformers, LLaMA, and Galactica; Stanford University’s BioMedLM 2.7B; the Beijing Academy of Artificial Intelligence’s Chinese-Transformer-XL; Yandex’s YaLM 100B; and Stability AI’s Stable LM suite.
↪the Pile's Filtering Heuristics
↪the Pile website
WebText.
2017.
About 40 GB of text.
OpenAI created the WebText dataset but did not publish it.
OpenAI.
By extracting non-Wikipedia webpages linked from Reddit posts with at least 3 karma, up until December 2017.
Not annotated/labeled.
🕳️
🕳
WebText was made for training OpenAI’s GPT-2.
↪WebText's unknown Heuristics
↪🕳 The website is not available anymore; the paper can be viewed on the Wayback Machine here
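The collection method above, as described in the GPT-2 paper, uses Reddit karma as a crowd-sourced quality filter on outbound links. Since OpenAI's actual code is unreleased, the following is a purely hypothetical sketch with invented record fields:

```python
# Hypothetical records of Reddit submissions; field names are invented.
submissions = [
    {"url": "https://example.com/essay", "karma": 12},
    {"url": "https://en.wikipedia.org/wiki/Dog", "karma": 50},
    {"url": "https://example.com/spam", "karma": 1},
]

def webtext_links(posts, min_karma=3):
    """Keep outbound links with at least min_karma, excluding Wikipedia
    (WebText dropped Wikipedia pages because they appear in common
    evaluation datasets), deduplicated in first-seen order."""
    seen, keep = set(), []
    for p in posts:
        url = p["url"]
        if p["karma"] >= min_karma and "wikipedia.org" not in url and url not in seen:
            seen.add(url)
            keep.append(url)
    return keep

print(webtext_links(submissions))  # ['https://example.com/essay']
```

The two OpenWebText reproductions below follow the same recipe, differing mainly in the Reddit date range and the source of the submission dumps.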
OpenWebTextCorpus.
2019.
13.51 GB of text.
OpenWebTextCorpus is an open-source effort by Aaron Gokaslan and Vanya Cohen of Brown University to reproduce OpenAI’s WebText dataset.
Without funding.
By extracting webpages that were linked to from Reddit posts with at least 3 karma from July 2008 to January 2013.
Not annotated/labeled.
🕳️
🕳
OpenWebTextCorpus is used to train OpenGPT-2 and Huggingface’s Distilled-GPT2.
↪OpenWebTextCorpus' Filtering Heuristics
↪OpenWebTextCorpus website
OpenWebText2.
2021.
67.40GB of text.
EleutherAI, as a part of The Pile dataset.
Without funding.
By extracting webpages linked from Reddit posts with at least 3 karma, from 2005 until April 2020, using the Pushshift datasets.
Not annotated/labeled.
🕳
🕳
OpenWebText2 is part of The Pile. The Pile is used to train EleutherAI’s GPT-Neo models and Pythia, a suite of models; Microsoft’s Megatron-Turing Natural Language Generation; Meta AI’s Open Pre-trained Transformers, LLaMA, and Galactica; Stanford University’s BioMedLM 2.7B; the Beijing Academy of Artificial Intelligence’s Chinese-Transformer-XL; Yandex’s YaLM 100B; and Stability AI’s Stable LM suite.
↪OpenWebText2's Filtering Heuristics
↪OpenWebText2 website
Wikipedia-based Image Text.
2021.
About 25GB, including 37.6 million image-text examples with 11.5 million images across 108 Wikipedia languages.
Srinivasan et al. from Google Research.
Google.
By crawling Wikipedia.
Not annotated/labeled; description text is used as the image caption, and crowd-sourced human annotators are used for validation.
It only retained images that have a “research-permissive license such as Creative Commons (the text of Wikipedia is licensed under a CC-BY-SA license).”
🕳
🕳
↪WIT's Filtering Heuristics
↪WIT website
WuDao Corpora.
2021.
3TB.
Beijing Academy of Artificial Intelligence (北京智源人工智能研究院), researchers listed on the paper are also from Tsinghua University and Recurrent AI.
Not specified; the Beijing Academy of Artificial Intelligence (BAAI) is supported by the Ministry of Science and Technology and the Beijing Municipal Government. BAAI relies on prominent institutions and companies in the artificial intelligence field, such as Peking University, Tsinghua University, the Chinese Academy of Sciences, Baidu, Xiaomi, ByteDance, Meituan, and Megvii Technology, for joint construction.
Crawling from 822 million web pages.
Not annotated/labeled.
🕳️ copyright issues were not considered/mentioned in their data pipeline.
"For Chinese language model pre-training, word embedding, etc."
WuDao is part of the training data of Huawei’s PANGU-Σ model and the International Digital Economy Academy’s FengshenbangLM. WuDao is also part of BigScience’s ROOTS corpus used to train BLOOM.
↪WuDao's Filtering Heuristics
↪WuDao family website
Large-scale Artificial Intelligence Open Network — 400 Million.
August 2021. In December 2023, it was temporarily taken down after a Stanford study found child sexual abuse material in the dataset.
About 10 TB, distributed as a webdataset with 256×256 images, captions, and metadata.
LAION, a German non-profit which makes open-source artificial intelligence models and datasets, with contributions from the data hoarders Reddit community, the-eye community, and unknown contributors.
doodlebot.ai and Gentec Data; data was hosted on the-eye.eu.
The image-text pairs were extracted from the Common Crawl web data dump and come from random web pages crawled between 2014 and 2021.
Not annotated/labeled, alt-text as caption.
🕳️ copyright issues were not considered/mentioned in their data pipeline.
"For research purposes to enable testing model training on larger scale for broad researcher and other interested communities, and is not meant for any real-world production or application."
Google’s Imagen.
↪Laion-400m's Filtering Heuristics
↪Laion-400m website
Large-scale Artificial Intelligence Open Network — Aesthetics.
August 2022. As a subset of LAION-5B, it has not been available for download since December 2023. Only part of the aesthetics subset can still be browsed under the first author Christoph Schuhmann's personal domain.
Size not stated. The dataset has 52,068,913 rows, around 0.9% of the larger 5.85B-pair dataset; the subset with predicted aesthetics scores of 6.5 or higher contains 625K image-text pairs from LAION-5B.
Laion.
🕳
From LAION-5B, which are extracted from the Common Crawl.
LAION trained a LAION-Aesthetics Predictor V2 model to sort images by predicted aesthetic score. The predictor is trained on the Simulacra Aesthetic Captions (SAC), LAION-Logos, and Aesthetic Visual Analysis (AVA) datasets.
🕳️ copyright issues were not considered/mentioned in their data pipeline.
🕳
The subset with aesthetic score 5+ is used to train Stable Diffusion V1 model.
↪Laion-Aesthetics' Heuristics
↪Laion-Aesthetics website
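Once every image carries a predicted score, subset construction reduces to thresholding. A minimal sketch; the record layout here is invented (the real metadata ships as parquet files alongside URLs), but the cutoffs match the subsets named above:

```python
# Hypothetical rows: (image URL, caption, predicted aesthetic score).
rows = [
    ("a.jpg", "oil painting of a harbor", 6.8),
    ("b.jpg", "screenshot of a spreadsheet", 3.9),
    ("c.jpg", "mountain lake at sunrise", 5.4),
]

def aesthetics_subset(rows, min_score):
    """Select pairs whose predicted aesthetic score meets the cutoff,
    e.g. 5.0 for the Stable Diffusion training subset or 6.5 for the
    625K high-aesthetics subset."""
    return [(url, cap) for url, cap, score in rows if score >= min_score]

print(aesthetics_subset(rows, 5.0))  # two pairs survive
print(aesthetics_subset(rows, 6.5))  # [('a.jpg', 'oil painting of a harbor')]
```

The design consequence is worth noting: whatever visual preferences the predictor learned from SAC, LAION-Logos, and AVA are baked into every model trained on the resulting subset.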
Responsible Open-science Open-collaboration Text Sources corpus.
2022.
1.6TB, covering 59 languages, 46 natural languages and 13 programming languages.
Researchers in the BigScience 1-year workshop.
BigScience is an open-science project; the research workshop is an open collaboration bootstrapped by HuggingFace, GENCI, and IDRIS, and supported by HuggingFace, the Centre national de la recherche scientifique (CNRS), the Institut national de recherche en sciences et technologies du numérique (Inria), Naver Labs, Snorkel, reciTAL, LightOn, and Salesforce Research.
62% of the text comes from a community-selected and documented list of language data sources, including existing NLP datasets, pseudo-crawled data corresponding to the target domain names from 18 snapshots of Common Crawl in 2020 and 2021, and GitHub code.
Not annotated/labeled.
🕳️ copyright issues were not considered/mentioned in their data pipeline. But some subsets of ROOTS, like the GitHub code on Google’s BigQuery, were built as archives and used as datasets.
“The efforts to put the corpus together were value-driven... We hope this paves the way toward a more reflected use of the data that makes its way into large language models.”
The BigScience’s BLOOM🌸 model.
↪ROOTS' Filtering Heuristics
↪ROOTS Huggingface page
A Multilingual And Document-Level Large Audited Dataset.
2023.
Around 3TB, covering 419 languages.
Kudugunta et al. from Google DeepMind and Caswell et al. from Google Research.
Google.
From all available snapshots of Common Crawl as of August 20, 2022.
Not annotated/labeled; the team members performed a manual self-audit/quality review.
The authors voiced concern that “not all languages have available tools for removing…copyrighted content”, but their pipeline mentions no specific steps to address this.
"We urge practitioners to carefully consider their target usecase before using MADLAD-400.”
Google's multilingual machine translation (MT) model.
↪MADLAD-400's Filtering Heuristics
↪MADLAD-400 github page
RefinedWeb.
2023.
1.68TB, with 5 trillion tokens.
The Falcon LLM team in the Technology Innovation Institute.
The creation of the dataset was privately funded by the Technology Innovation Institute, which is funded by the Abu Dhabi government.
From Common Crawl dumps. One part of the paper states they took all the CC dumps up to the 2023-06 one; another part states they used CC dumps up to the January/February 2023 one. It could be updated with additional dumps as they are released.
Not annotated/labeled.
🕳️ copyright issues were not considered/mentioned in their data pipeline.
"RefinedWeb was created to serve as a large-scale dataset for the pretraining of large language models. It may be used on its own, or augmented with curated sources (e.g., Wikipedia, StackOverflow).”
Falcon-40B model.
↪RefinedWeb's Filtering Heuristics
↪RefinedWeb Huggingface page
MassiveText.
December 2021 (not released to the public).
2.35 billion documents, or about 10.5 TB of text.
Rae et al. from Google DeepMind.
Google 🕳️
The whole dataset consists of 6 subsets: MassiveWeb (web crawling), Books (source not specified; contains books from 1500 to 2008), C4 (from Common Crawl), News, GitHub, and Wikipedia.
Not annotated/labeled.
🕳️ MassiveText uses 2.1 TB of book data spanning 1500 to 2008 and 3.1 TB of GitHub code, but the authors only mention licensing for GitHub: they include code under Apache License 2.0, the MIT License, the 3-clause BSD License, the 2-clause BSD License, the Unlicense, CC0, the ISC License, and Artistic License 2.0.
The dataset was created for pre-training language models.
Google DeepMind’s Gopher and Chinchilla.
↪MassiveText's Filtering Heuristics
MultiModal MassiveWeb.
November 2022 (not released to the public).
43.3M instances (documents) in total, with a total of 185M images and 182 GB of text.
Alayrac et al. from Google DeepMind.
Google 🕳️
Scraped from 43 million webpages, plus the ALIGN (image + alt-text) dataset.
Not annotated/labeled; text nearby in the HTML DOM tree is used for image and video captions.
🕳️ copyright issues were not considered/mentioned in their data pipeline.
The dataset was created for pre-training vision-language models.
Google DeepMind’s visual language model Flamingo.
↪m3w's Filtering Heuristics
Multimodal C4.
June 2023. The dataset was accidentally deleted in February 2025 and only partially recovered from local downloads.
43 billion English tokens with 101.2 million documents and 571 million images.
Zhu, Hessel and other researchers in the Allen Institute for AI and LAION.
DARPA MCS program through NIWC Pacific, the NSF AI Institute for Foundations of Machine Learning, Open Philanthropy, Google, and the Allen Institute for AI (AI2). Stability AI provided computational resources.
It retrieved the original webpage for each document in the c4-en dataset from Common Crawl version 2019-18 and downloaded the images from there.
Not annotated/labeled. Text sentences on the same webpage are used as image captions, with the pairing evaluated by CLIP.
🕳️ copyright issues were not considered/mentioned in their data pipeline.
🕳
OpenFlamingo, an open-source reproduction of DeepMind's Flamingo model.
↪mmc4's Filtering Heuristics
↪mmc4 github page
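The CLIP-based pairing above assigns each image to the sentence it is most similar to in embedding space. A simplified sketch with placeholder vectors; the real pipeline uses CLIP features and a bipartite matching that limits how many images a sentence can absorb, neither of which is reproduced here:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def assign_images_to_sentences(img_embs, sent_embs):
    """Pair each image with the sentence whose embedding is most
    similar, mimicking mmc4's CLIP-similarity assignment step."""
    return [max(range(len(sent_embs)), key=lambda j: cosine(e, sent_embs[j]))
            for e in img_embs]

# Placeholder 4-d vectors standing in for CLIP image/sentence features.
images = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
sentences = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
print(assign_images_to_sentences(images, sentences))  # [0, 1]
```

Because the "caption" is whichever co-occurring sentence CLIP prefers, rather than human-written alt-text, caption quality here depends entirely on how well CLIP's similarity reflects actual relevance.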
Data for Open Language Models’ Appetite (Dolma).
August 2023.
11 TB, including 4 billion documents, 3 trillion tokens.
Soldaini et al. from AI2 (Allen Institute for AI).
AI2.
It consists of several subsets: Common Crawl (web scraping), The Stack (GitHub code), Reddit, C4, peS2o (Semantic Scholar), Project Gutenberg, and Wikipedia.
Not annotated/labeled.
Dolma uses “open access” papers, “open source” repositories from The Stack (which includes GitHub repositories by default unless users proactively submit issues to opt out), and “permissively licensed” sources, e.g. Creative Commons.
For training language models, or studying interaction between pretraining corpora and models trained on them.
The open language model OLMo from AI2.
↪Dolma's Filtering Heuristics
↪Dolma website
Open Bimodal Examples from Large fIltered Commoncrawl Snapshots (OBELICS).
June 2023.
141 million documents, with 115 billion text tokens and 353 million images.
Laurençon et al. from Hugging Face, Sorbonne Université, and Stanford University.
Hugging Face, computing resources was supported by Institut du Développement et des Ressources en Informatique Scientifique (IDRIS) of the Centre National de la Recherche Scientifique (CNRS).
It used Common Crawl (web crawling) dumps from February 2020 to February 2023.
Not annotated/labeled. Web texts as image captions.
OBELICS uses the Spawning API to remove all images whose creators explicitly opted out of AI model training. The authors state that, should the dataset contain copyright infringement, they assume no liability for such violations.
🕳
IDEFICS, an open-access version of DeepMind's visual language model Flamingo.
↪OBELICS's Filtering Heuristics
↪OBELICS Huggingface page
Common Corpus.
November 2024.
🕳️ size not stated; it contains 1,998,647,168,282 tokens (about 2 trillion).
Langlais et al. at PleIAs, a French private AI lab.
It’s supported by the AI Alliance, the French Ministry of Culture, and DINUM. Storage and processing power were supported by the AI Alliance, Jean Zay (Eviden, IDRIS), Tracto AI, and Mozilla.
Web crawling of “open of use” data, mainly under the following licenses: Public Domain, CC-By, MIT, CC-By-SA, Apache-2.0, BSD-3-Clause, Open license, BSD-2-Clause, CC-BY-4.0, CC0-1.0.
Not annotated/labeled.
Common Corpus filled its data sources mostly with CC-By, Public Domain/CC0, and CC-By-SA content, plus code from “Open Code”, i.e. The Stack (which includes GitHub repositories by default unless users proactively submit issues to opt out).
For “pre-training of fully open and auditable LLMs”; it “works as an open science infrastructure dedicated to the entire lifecycle of language models.” 🕳️
PleIAs 1.0, the Barcelona Supercomputing Center's Salamandra series, Lucie, and Nvidia's NeKo; Anthropic used Common Corpus in feature-visualization experiments.
↪Common Corpus's Filtering Heuristics
↪Common Corpus's huggingface page
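License-based inclusion, as used by Common Corpus (and by MassiveText's GitHub subset above), reduces to an allow-list check over per-document license metadata. A minimal sketch; the document records and their "license" field are invented for illustration, and the allow-list abbreviates the one stated above:

```python
# Licenses Common Corpus reports accepting (abbreviated list).
PERMITTED = {"Public Domain", "CC-By", "MIT", "CC-By-SA", "Apache-2.0",
             "BSD-3-Clause", "BSD-2-Clause", "CC-BY-4.0", "CC0-1.0"}

# Hypothetical document records; field names are an assumption.
docs = [
    {"id": "doc-1", "license": "CC-By"},
    {"id": "doc-2", "license": "All rights reserved"},
    {"id": "doc-3", "license": "CC0-1.0"},
]

def license_filter(documents, allowed=PERMITTED):
    """Keep only documents whose declared license is on the allow-list;
    anything unknown or restrictive is excluded by default."""
    return [d["id"] for d in documents if d.get("license") in allowed]

print(license_filter(docs))  # ['doc-1', 'doc-3']
```

Unlike the post-hoc opt-out schemes of OBELICS or The Stack, this filter is exclude-by-default: a document with missing or unrecognized license metadata never enters the corpus.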