Big Data for Machine Learning
There is a large number of publicly available data sets on the Internet. Whether you want to learn data mining or practice with machine learning algorithms, these data are worth investigating. Some of them are real-world data in terms of both size and quality; in fact, it is the lack of quality that makes them “real-world” data.
In this post, I will list some data sets that I find interesting. Please observe and respect the licenses for these data, as stated on their respective websites.
The Ensembl project
The Ensembl project produces genome databases for vertebrates and other eukaryotic species, and makes this information freely available online. In particular, the human (Homo sapiens) site provides a data set based on the February 2009 Homo sapiens high coverage assembly GRCh37 from the Genome Reference Consortium. This assembly is used by UCSC to create their hg19 database. The data set consists of gene models built from the genewise alignments of the human proteome as well as from alignments of human cDNAs using the cDNA2genome model of exonerate.
You can download the data in various ways, and you can even download the entire database as a MySQL dump. You can find these data on AWS as well; if you use the AWS infrastructure that should save you some bandwidth!
The Google Books N-gram Viewer Data
If you don’t know what the Google books n-gram viewer is, you can have a look here. It basically shows you a graph that displays how your input phrases have occurred in a corpus of books over a specified period of time. There are 11 corpora of books that cover a variety of languages, including Chinese.
Each of the numbered links will directly download a fragment of the given corpus. You can obtain the 1-gram (i.e., individual words) counts, the 2-gram counts, and so on, up to the 5-gram counts. In addition, for each corpus, you can get the file total counts, which records the total number of n-grams contained in the books that make up the corpus. This file is useful to compute the relative frequencies of n-grams. Obviously, there are more fragments for 2-grams than there are for 1-grams, more for 3-grams than for 2-grams, and so on.
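As a rough sketch of that relative-frequency computation, assume you have already parsed the counts of one n-gram and the corpus totals into Python dictionaries keyed by year. The function name, the dictionary layout, and the sample numbers below are made up for illustration; they are not part of the Google data or any official tool. The relative frequency is simply the n-gram count divided by the total count for the same year:

```python
from typing import Dict


def relative_frequencies(
    ngram_counts: Dict[int, int], total_counts: Dict[int, int]
) -> Dict[int, float]:
    """Divide each yearly n-gram count by the corpus total for that year.

    Both arguments are assumed to map a year to a count. Years that are
    missing from total_counts (or have a zero total) are skipped.
    """
    freqs: Dict[int, float] = {}
    for year, count in ngram_counts.items():
        total = total_counts.get(year, 0)
        if total > 0:
            freqs[year] = count / total
    return freqs


# Hypothetical numbers, for illustration only.
example_ngram = {1999: 50, 2000: 120, 2001: 7}
example_totals = {1999: 1_000_000, 2000: 2_000_000}

result = relative_frequencies(example_ngram, example_totals)
for year in sorted(result):
    print(year, result[year])
```

The year 2001 is dropped here because the made-up totals have no entry for it; with real data you would decide whether a missing total is an error or something to skip.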
The corpus construction is described in the Science article written by Jean-Baptiste Michel et al. Note that only the n-grams that appeared over 40 times in the whole corpus are reported. Therefore, the sum of the 1-gram occurrences in any given corpus will be smaller than the number given in the total counts file.
Government Sponsored Data
The United States Census Bureau offers a wide variety of data sets that span many decades. While you are perusing that site, make sure that you check out the DataFerret tool. You can also find US census data on AWS; for example, the data from the 2000 US Census can be found here.
You can find data from the Japan population census at the following link: http://www.stat.go.jp/english/data/kokusei/2010/summary.htm
You can obtain international census data from the following US Census Bureau website: http://www.census.gov/population/international/data/idb/informationGateway.php
The UK government hosts a website (http://data.gov.uk/) that offers over 5,400 data sets from all central government departments and a number of other public sector bodies and local authorities.
There is also a wealth of data offered by the World Bank’s Open Data initiative on the following website: http://data.worldbank.org/data-catalog
Freebase
Freebase is an open graph database with more than 23 million entities; an entity is a single person, place, or thing. Full data dumps of every fact and assertion in Freebase are available in a variety of formats and are updated every week. The various data dumps can be found here: http://download.freebase.com/datadumps/
Public Data Sets on AWS
Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications. The following two data sets stand out:
1000 Genomes Project: The 1000 Genomes Project, initiated in 2008, is an international public-private consortium that aims to build the most detailed map of human genetic variation available. Aside from the scientific importance of this data set, its sheer size (~200TB) makes it interesting for machine learning at scale.
Common Crawl Corpus: A corpus of web crawl data composed of 5 billion web pages. This data set is freely available on Amazon S3 and formatted in the ARC (.arc) file format. Although smaller than the 1000 Genomes Project data set, these web crawl data are approximately 60TB in size.
The complete list of data sets that are offered on AWS can be found on the following page: http://aws.amazon.com/datasets?_encoding=UTF8
Banking Statistics
The Bank for International Settlements offers a consolidated banking statistics website at http://www.bis.org/statistics/consstats.htm
The mission of the Bank for International Settlements (BIS) is to serve central banks in their pursuit of monetary and financial stability, to foster international cooperation in those areas and to act as a bank for central banks. Currently, central banks in 30 countries report their aggregate national consolidated data to the BIS, which uses them as the basis for calculating and publishing global data. The data are published as part of the BIS Quarterly Review. Preliminary data, including a commentary, are released a few weeks before the publication of the Quarterly Review. Users should be aware of the limitations of the preliminary data.









