Public Big Data Sets Available for Download

Where can I find huge data sets? That question faces anyone who wants to develop, test, or study big data analytics tools. Here is a list of data sets available in the public domain.

More than 1 TB

1000 Genomes
This project makes 260 TB of human genome data publicly available.

Internet Archive
More than 80 TB of web crawl data is available for research.

ClueWeb09 
The TREC conference provided this data set to the public. You will need to sign an agreement and pay a fee (up to $610) to cover the sneakernet data transfer. A total of 5 TB of data is available.
http://lemurproject.org/clueweb09.php/

ClueWeb12  
Freebase annotations are available for ClueWeb12.

CNetS 
Indiana University makes a 2.5 TB click dataset available.

ICWSM 
ICWSM has made the blog posts data set from its 2011 conference available to the public. You will have to register by submitting an actual form, but the data is free. It is 2.1 TB in compressed format.

Proteome Commons 
Proteome Commons has made several large datasets available. The largest, the Personal Genome Project, is 1.1 TB in size. There are several others over 100 GB in size.
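
At these sizes, it helps to estimate how long a download would take before starting one, which also shows why a data set like ClueWeb09 is moved by sneakernet. The following is a minimal Python sketch of that arithmetic, not part of any data provider's tooling. The function name transfer_hours and the 100 Mbps link speed are assumptions chosen for illustration, and the sizes are the ones quoted above, treated as decimal terabytes.

```python
def transfer_hours(size_tb: float, mbps: float) -> float:
    """Hours needed to move size_tb terabytes at a sustained mbps megabits per second."""
    bits = size_tb * 1e12 * 8        # decimal terabytes -> bits
    seconds = bits / (mbps * 1e6)    # megabits per second -> bits per second
    return seconds / 3600


if __name__ == "__main__":
    # Sizes as quoted in the list above; the 100 Mbps link is an assumed example.
    quoted_sizes_tb = [
        ("ClueWeb09", 5),
        ("Internet Archive", 80),
        ("1000 Genomes", 260),
    ]
    for name, size_tb in quoted_sizes_tb:
        days = transfer_hours(size_tb, mbps=100) / 24
        print(f"{name}: {size_tb} TB at 100 Mbps is about {days:.1f} days")
```

Under these assumptions, 5 TB at a sustained 100 Mbps takes roughly 4.6 days, and real links rarely sustain their nominal speed.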



More than 1 GB


Reference Energy Disaggregation Data Set 
This data set has data on home energy usage. It is about 500 GB compressed.

Tiny Images 
This dataset has 227 GB of image data and 57 GB of metadata.

ImageNet 
The dataset is pretty big.

MOBIO 
The dataset is about 135 GB of video and audio data.

Yahoo! Webscope
The Yahoo! Webscope program makes several 1 GB+ datasets available to academic researchers, including an 83 GB data set of Flickr image features and the dataset used for the 2011 KDD Cup, from Yahoo! Music, which is a bit over 1 GB.

Google 
Google made a dataset mapping words to Wikipedia URLs (i.e., concepts) [15]. The dataset is about 10 GB compressed.

Yandex 
Yandex has recently made a very large web search click dataset available [1]. You will have to register online for the contest to download it. It is about 5.6 GB compressed.

Freebase 
Freebase provides data dumps on a regular basis. The largest is their Quad dump, which is about 3.6 GB compressed.

Open American National Corpus
This corpus is about 4.8 GB uncompressed.

Wikipedia 
Wikipedia made a dataset containing information about edits available for a recent Kaggle competition. The training dataset is about 2.0 GB uncompressed.

Research and Innovative Technology Administration (RITA)
RITA has made available a dataset about the on-time performance of domestic flights operated by large carriers. The ASA has compressed this dataset and makes it available for download.

Wiki-links 
The Wiki-links data made available by Google is about 1.75 GB in total [20].

Amazon Public datasets
Amazon provides a centralized repository of public data sets, available on the Amazon cloud to integrate and download.

