Common Crawl
A non-profit organization that archives and freely provides a massive dataset of web crawl data, which was instrumental in the initial training of models such as OpenAI's GPT-3.
Created At
8/2/2025, 6:25:22 AM
Last Updated
8/2/2025, 6:38:33 AM
Research Retrieved
8/2/2025, 6:28:57 AM
Summary
Common Crawl is a non-profit 501(c)(3) organization that has been crawling the web and making its extensive archives and datasets publicly available since 2008. Founded by Gil Elbaz, with advisors including Peter Norvig and Joi Ito, the organization collects petabytes of data through monthly crawls, adhering to nofollow and robots.txt policies. Its open-source code for data processing is also publicly accessible. The Common Crawl dataset, which includes copyrighted material, is distributed from the US under fair use claims, influencing researchers in other countries to adapt methods for copyright compliance. English is the predominant language in its datasets, accounting for 46% of documents in the March 2023 version. The archived data is mirrored and accessible via the Wayback Machine, and notably, AI models from OpenAI have been built using datasets like Common Crawl.
Referenced in 1 Document
Research Data
Extracted Attributes
Type
Non-profit organization
Founder
Gil Elbaz
Mission
To provide small startups and individuals with access to high-quality crawl data that was previously available only to large search engine corporations
Data Size
Petabytes of data
Data Hosting
Amazon Web Services (AWS) Public Data Sets, multiple academic cloud platforms
Legal Status
501(c)(3)
Crawler Policy
Adheres to nofollow and robots.txt policies
Crawl Frequency
Approximately once a month
Data File Format
WARC files (since November 2013)
Code Availability
Open-source code for data processing is publicly available
Number of Employees
Handful
Crawler Software (Main)
Apache Nutch (since 2013)
Crawler Software (News)
StormCrawler
Data Distribution Basis
Fair use claims
Data Distribution Location
United States
Primary Language (March 2023)
English (46% of documents)
Other Significant Languages (March 2023)
German, Russian, Japanese, French, Spanish, Chinese (each less than 6%)
Timeline
- 2007-01-01: Common Crawl was founded. (Source: Common Crawl Website)
- 2008-01-01: Began collecting petabytes of web data for its archives. (Source: Summary, Wikipedia, Wikidata, Web Search)
- 2013-01-01: Began using the Apache Software Foundation's Nutch webcrawler instead of a custom crawler. (Source: Web Search)
- 2013-11-01: Switched from using .arc files to .warc files for its crawls. (Source: Web Search)
- 2016-01-01: Began daily releases of WARC files for its news dataset. (Source: Web Search)
- 2018-01-01: Common Search, an independent non-profit search engine project that inspired Common Crawl's domain-ranking approach, was discontinued. (Source: Web Search)
- 2019-01-01: Google's Colossal Clean Crawled Corpus (C4), based on Common Crawl, was constructed for training the T5 language model series. (Source: Web Search)
- 2020-01-01: A filtered version of Common Crawl was used to train OpenAI's GPT-3 language model. (Source: Web Search)
- 2023-03-01: English accounted for 46% of documents in the Common Crawl dataset. (Source: Summary, Wikipedia)
- 2025-03-01: The truncation threshold for archived content was increased from 1 MiB to 5 MiB. (Source: Web Search)
Wikipedia
View on Wikipedia
Common Crawl
Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl's web archive consists of petabytes of data collected since 2008. It completes crawls approximately once a month. Common Crawl was founded by Gil Elbaz. Advisors to the non-profit include Peter Norvig and Joi Ito. The organization's crawlers respect nofollow and robots.txt policies. Open source code for processing Common Crawl's data set is publicly available. The Common Crawl dataset includes copyrighted work and is distributed from the US under fair use claims. Researchers in other countries have made use of techniques such as shuffling sentences or referencing the Common Crawl dataset to work around copyright law in other legal jurisdictions. Contents archived by Common Crawl are mirrored and made available online in Wayback Machine. English is the primary language for 46% of documents in the March 2023 version of the Common Crawl dataset. The next most common primary languages are German, Russian, Japanese, French, Spanish and Chinese, each with less than 6% of documents.
Web Search Results
- Common Crawl - Wikipedia
Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl's web archive consists of petabytes of data collected since 2008. It completes crawls approximately once a month. [...] In 2013, Common Crawl began using the Apache Software Foundation's Nutch webcrawler instead of a custom crawler. Common Crawl switched from using .arc files to .warc files with its November 2013 crawl. A filtered version of Common Crawl was used to train OpenAI's GPT-3 language model, announced in 2020. Timeline of Common Crawl data: the following data have been collected from the official Common Crawl Blog and Common Crawl's API. [...] Google's version of the Common Crawl is called the Colossal Clean Crawled Corpus, or C4 for short. It was constructed for the training of the T5 language model series in 2019. There are some concerns over copyrighted content in the C4.
- Training Data for the Price of a Sandwich - Mozilla Foundation
Common Crawl (henceforth also referred to as CC) is an organization that has been essential to the technological advancements of generative AI, but is largely unknown to the broader public. This California nonprofit with only a handful of employees has crawled billions of web pages since 2008 and it makes this data available without charge via Amazon Web Services (AWS). Because of the enormous size and diversity (in terms of sources and formats) of the data, it has been pivotal as a source for [...] Inspired by an independent nonprofit search engine project called Common Search (discontinued in 2018), Common Crawl developed an approach that builds on ranking web domains by calculating their harmonic centrality (Interview CC Crawl Engineer). Harmonic centrality measures the importance of a node in a network based on the distance this node has to all the other nodes, with shorter distances contributing more to the score than longer ones. In other words, the more often a domain is directly or [...] Common Crawl as well. Common Crawl’s stated mission is to provide access to “high quality crawl data that was _previously only available to large search engine corporations_” to “small startups or even individuals” (Common Crawl Foundation, n.d. emphasis added). In other words, Common Crawl's purpose is to make web data available that otherwise only a Big Tech company would have access to.
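To make the harmonic-centrality idea in the excerpt above concrete, here is a minimal sketch using networkx on a made-up toy domain graph. The graph, the library choice, and the code are illustrative assumptions, not Common Crawl's actual ranking pipeline.

```python
# Illustrative sketch of harmonic centrality on a toy domain graph.
# The graph below is invented for demonstration; it is not Common Crawl's
# real web graph or ranking code.
import networkx as nx

# Directed graph: an edge A -> B means domain A links to domain B.
domain_graph = nx.DiGraph([
    ("blog.example", "news.example"),
    ("shop.example", "news.example"),
    ("news.example", "reference.example"),
    ("blog.example", "reference.example"),
])

# Harmonic centrality of node u sums 1/d(v, u) over all other nodes v:
# unreachable nodes contribute 0, and close nodes contribute the most.
scores = nx.harmonic_centrality(domain_graph)
for domain, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{domain:20s} {score:.2f}")
```

In this toy graph, reference.example ranks highest because every other domain reaches it in one or two hops, which is the intuition the excerpt describes.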
- Common Crawl - Open Repository of Web Crawl Data
Common Crawl maintains a free, open repository of web crawl data that can be used by anyone. Common Crawl is a 501(c)(3) non-profit founded in 2007. [...] We make wholesale extraction, transformation and analysis of open web data accessible to researchers. Overview: over 250 billion pages spanning 18 years; free and open corpus since 2007. [...]
- Overview - Common Crawl
The corpus contains raw web page data, metadata extracts, and text extracts. Common Crawl data is stored on Amazon Web Services' Public Data Sets and on multiple academic cloud platforms across the world. Learn how to Get Started. Access to the corpus hosted by Amazon is free. [...] The Common Crawl corpus contains petabytes of data, regularly collected since 2008; crawls are labelled by year and week (e.g. CC-MAIN-2015-11 through CC-MAIN-2025-30). [...] You may use Amazon's cloud platform to run analysis jobs directly against it or you can download it, whole or in part. You can search for pages in our corpus using the Common Crawl URL Index. Check out the Example Projects, view Use Cases, or Statistics for our crawls.
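As a sketch of the access path described above (a URL Index lookup followed by fetching a single record over HTTPS), the following Python assumes the public CDX index API at index.commoncrawl.org, the data.commoncrawl.org endpoint, and the requests and warcio packages; the crawl label is taken from the listing above and may need updating.

```python
# Sketch: look up a URL in the Common Crawl URL Index and fetch the matching
# WARC record over HTTPS. Assumes the public CDX API, the data.commoncrawl.org
# endpoint, and the `requests` and `warcio` packages; adjust CRAWL as needed.
import io
import json

import requests
from warcio.archiveiterator import ArchiveIterator

CRAWL = "CC-MAIN-2025-30"  # crawl label from the listing above (assumption)
INDEX_API = f"https://index.commoncrawl.org/{CRAWL}-index"

# 1. Query the URL index for capture records of a page.
resp = requests.get(INDEX_API, params={"url": "commoncrawl.org/", "output": "json"})
resp.raise_for_status()
records = [json.loads(line) for line in resp.text.splitlines()]
rec = records[0]  # fields include filename, offset, length, status, mime, ...

# 2. Fetch just that record from the WARC file with an HTTP range request.
start = int(rec["offset"])
end = start + int(rec["length"]) - 1
warc_bytes = requests.get(
    f"https://data.commoncrawl.org/{rec['filename']}",
    headers={"Range": f"bytes={start}-{end}"},
).content

# 3. Parse the gzipped WARC record and print the target URI and status code.
for warc_rec in ArchiveIterator(io.BytesIO(warc_bytes)):
    if warc_rec.rec_type == "response":
        print(warc_rec.rec_headers.get_header("WARC-Target-URI"))
        print(warc_rec.http_headers.get_statuscode())
```

Using a byte-range request here retrieves only the single record of interest rather than the multi-hundred-megabyte WARC file it lives in, which matters given the petabyte scale of the corpus.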
- Blog - News Dataset Available - Common Crawl
The data is available on AWS S3 in the commoncrawl bucket at crawl-data/CC-NEWS/. WARC files are released on a daily basis, identifiable by file name prefix which includes year and month. We provide lists of the published WARC files, organized by year and month from 2016 to date. Alternatively, authenticated AWS users can get listings using the AWS Command Line Interface and the command: [...] Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB. For more details, see our truncation analysis notebook. [...] While the main dataset is produced using Apache Nutch, the news crawler is based on StormCrawler, an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. Using StormCrawler allows us to test and evaluate a different crawler architecture towards the following long-term objectives:
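The exact CLI command is elided in the excerpt above; as a rough equivalent, a boto3 sketch for listing one month of CC-NEWS WARC files might look like this. The bucket name and crawl-data/CC-NEWS/ prefix come from the excerpt; the year/month key layout and the credential setup are assumptions.

```python
# Sketch: list CC-NEWS WARC files for one month from the commoncrawl S3 bucket.
# Bucket and prefix are taken from the excerpt above; the year/month key layout
# is an assumption, and boto3 needs configured AWS credentials.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# WARC keys are assumed to be grouped by year and month under the prefix.
pages = paginator.paginate(Bucket="commoncrawl", Prefix="crawl-data/CC-NEWS/2016/09/")

for page in pages:
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
```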
Wikidata
View on Wikidata
Founder
Gil Elbaz
Instance Of
Non-profit organization
Notable Work
Inception Date
1/1/2008
DBPedia
View on DBPedia
Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl's web archive consists of petabytes of data collected since 2011. It completes crawls generally every month. Common Crawl was founded by Gil Elbaz. Advisors to the non-profit include Peter Norvig and Joi Ito. The organization's crawlers respect nofollow and robots.txt policies. Open source code for processing Common Crawl's data set is publicly available. The Common Crawl dataset includes copyrighted work and is distributed from the US under fair use claims. Researchers in other countries have made use of techniques such as shuffling sentences or referencing the Common Crawl dataset to work around copyright law in other legal jurisdictions.