"Where can I find huge data sets?" is a question facing anyone who wants to develop, test or study big data analytics tools. Here we provide a list of data sets available in the public domain.
More than 1 TB
1000 Genomes
This project makes 260 TB of human genome data available to users.
Internet Archive
More than 80 TB of data is available for research on web crawling.
ClueWeb09
The TREC conference made this data set available to the public. You will need to sign an agreement and pay a fee (up to $610) to cover the sneakernet data transfer. A total of 5 TB of data is available.
http://lemurproject.org/clueweb09.php/
ClueWeb12
ClueWeb12 is available with Freebase annotations.
CNetS
Indiana University makes a 2.5 TB click dataset available.
ICWSM
ICWSM has made the blog post data set from its 2011 conference available to the public. You will have to fill in an actual registration form, but the data is free. It is 2.1 TB in compressed format.
Proteome Commons
Proteome Commons has made several large datasets available. The largest, the Personal Genome Project, is 1.1 TB in size. There are several others over 100 GB in size.
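To see why a data set such as ClueWeb09 is shipped on physical disks rather than downloaded, a rough transfer-time estimate helps. The sketch below is only illustrative: the sizes are the figures quoted above (read as decimal units, 1 TB = 10**12 bytes), while the 100 Mbit/s link speed and the `transfer_days` helper are assumptions made for this example, not values published by any of the providers.

```python
# Rough download-time estimate for the data sets of 1 TB and larger.
# Sizes are the figures quoted in the list (decimal units: 1 TB = 10**12 bytes).
# The Internet Archive figure is a lower bound ("more than 80 TB").

SIZES_TB = {
    "1000 Genomes": 260,
    "Internet Archive": 80,
    "ClueWeb09": 5,
    "CNetS": 2.5,
    "ICWSM": 2.1,
    "Proteome Commons (Personal Genome Project)": 1.1,
}

# Assumed sustained link speed; purely hypothetical, adjust to your network.
LINK_MBIT_PER_S = 100


def transfer_days(size_tb, link_mbit_per_s=LINK_MBIT_PER_S):
    """Return the days needed to move size_tb terabytes over the given link."""
    bits = size_tb * 10**12 * 8                    # terabytes -> bits
    seconds = bits / (link_mbit_per_s * 10**6)     # bits / (bits per second)
    return seconds / 86400                         # seconds -> days


if __name__ == "__main__":
    for name, size_tb in SIZES_TB.items():
        print(f"{name}: about {transfer_days(size_tb):.1f} days")
```

Under these assumptions the 5 TB ClueWeb09 collection would take roughly 4.6 days of uninterrupted transfer, and the 260 TB 1000 Genomes data about 240 days, which is why physical shipment or in-cloud processing is often the practical choice at this scale.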
More than 1 GB
Reference Energy Disaggregation Data Set
This dataset has data on home energy usage. It is about 500 GB compressed.
Tiny Images
This dataset has 227 GB of image data and 57 GB of metadata.
ImageNet
The dataset is pretty big.
MOBIO
The dataset is about 135 GB of video and audio data.
Yahoo! Webscope
The program makes several 1 GB+ datasets available to academic researchers, including an 83 GB data set of Flickr image features and the dataset used for the 2011 KDD Cup, from Yahoo! Music, which is a bit over 1 GB.
Google
Google made a dataset mapping words to Wikipedia URLs (i.e., concepts) [15]. The dataset is about 10 GB compressed.
Yandex
Yandex has recently made a very large web search click dataset available [1]. You'll have to register online for the contest to download it. It's about 5.6 GB compressed.
Freebase
Freebase provides data dumps on a regular basis. The largest is their Quad dump, which is about 3.6 GB compressed.
Open American National Corpus
This is about 4.8 GB uncompressed.
Wikipedia
Wikipedia made a dataset containing information about edits available for a recent Kaggle competition. The training dataset is about 2.0 GB uncompressed.
Research and Innovative Technology Administration (RITA)
RITA has made available a dataset about the on-time performance of domestic flights operated by large carriers. The ASA compressed this dataset and makes it available for download.
Wiki-links
The Wiki-links data made available by Google is about 1.75 GB total [20].
Amazon Public datasets
Amazon provides a centralized repository of public data sets on the Amazon cloud, available to integrate and download.