I've been interested in 'Big Data' for a while now, but it remains a surprisingly untapped area. Granted, the technology has only recently reached the milestones that make tackling Big Data practical, chief among them the falling cost of storage space. Databases have also improved greatly through collaborative and open source efforts, many of them driven by some of the largest internet companies (e.g., Facebook), with projects such as Project Voldemort, BigTable, Cassandra, HBase, CouchDB, Dynamo, etc. On top of that, services like Amazon's S3 (Simple Storage Service) are making data storage economical.
What exactly is Big Data? It's exactly what you think it means: immense data sets ranging into the petabytes (one petabyte is 1,000 terabytes). This can cover governmental statistics, scientific measurements, financial data, sports statistics, consumer spending trends, astronomical observations and any number of other large data projects. So, with an idea as simple as lots of numbers and data, where exactly is the difficulty?
Two main issues surface: availability of data and sufficient organization and modeling of that data.
Surprisingly, people still think that "you can find anything on the internet." That saying has never been true, and it is still far from the truth. Much data is still proprietary, kept in information silos ranging from publications to corporations to scientific research teams. Even something like finding the price of a hip replacement surgery across various hospitals is nearly impossible, though some companies have been trying to make such information more available, and the insurance companies are notorious for hiding such numbers. Thankfully, even the government is beginning to show awareness of the need to make data free and available, with its 2009 release of the Data.gov website.
But even a service like Factual, with over $25M in funding, has barely been able to scratch the surface of what data is out there. Its user interface is appealing, but the amount of data it has available is remarkably limited despite the company having been active since 2007 with significant financial backing. Most of its data sets can be obtained through other means; its geospatial data, for example, which seems to be its largest repository, is available from geonames.org. I have been actively collecting similar large data sets (books, beer, wine, movies, games, locations, restaurants, etc.), and I determined long ago that Factual's data was not sufficient, so I have gone my own way. That makes me wonder just what all that money has done for Factual. Of course, I admit that I am hesitant to ever pay for data that I believe I can get on my own, so Factual's business model fails to attract me.
Moving on, what if the raw data were freely available to anyone who wanted to access it? Then we hit the second and more difficult problem: organizing the data so you can find what is most relevant, and then modeling the data to draw meaningful conclusions and results. The first aspect of this problem is being faced by InfoChimp, in that the data they have available is a chore to filter through. The same problem occurs on the Data.gov website. Sure, there is plenty of information there, but finding the right data set in the right format can become a project in itself. These data repositories need to take a lesson from e-commerce sites by organizing their sets as products, with categories, tags, and reviews/usage statistics/recommendations.
Then, the crunching and analysis of the data is even more challenging, since it requires far more experience and talent from the people involved. Not enough people are trained to be data specialists or experts who can think spatially about the different points of data, draw lines linking relevant information, and formulate the necessary queries or scripts to pull a final message from the fray. This is why large corporations are scrambling to attract the best and brightest data specialists and taxonomists to their teams so they can get value from their huge stores of data: Facebook with its social graph, user browsing and app usage habits, and Google with its most recent effort to link social networks to filter web search results. Even when I was working with SAP, I realized the importance of seeing the larger picture at the very beginning of setting up data, because something as simple as a planned numbering scheme for a chart of accounts or products can make a huge difference later on.
Services like Factual and InfoChimp tend to rely on a limited business model: freemium. They offer smaller data dumps, usually created by clients or random users, for free, and make their money by selling their premium data sets at significant prices.
Many items (objects) in our world do not have any standardized numbering. Products tend to have a UPC/EAN/GTIN code and books have an ISBN, but locations, companies, websites, animals, digital data, etc. don't have any set numbering scheme. This makes standardizing across data sets very difficult, as seen in Amazon implementing its own ASIN numbering system for products.
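To make concrete what a standardized scheme gives you, here is a minimal Python sketch of the check-digit rule shared by EAN-13 and ISBN-13 codes: the first twelve digits are weighted alternately by 1 and 3, and the thirteenth digit brings the total to a multiple of 10. The function name is my own choice for illustration, not part of any library.

```python
def is_valid_ean13(code: str) -> bool:
    """Return True if `code` is a 13-digit EAN-13/ISBN-13 with a correct check digit."""
    # Ignore the hyphens and spaces that often appear in printed ISBNs.
    digits = [c for c in code if c not in "- "]
    if len(digits) != 13 or not all(c.isdigit() for c in digits):
        return False
    numbers = [int(c) for c in digits]
    # Weights alternate 1, 3, 1, 3, ... across the first twelve digits.
    total = sum(n * (3 if i % 2 else 1) for i, n in enumerate(numbers[:12]))
    expected_check = (10 - total % 10) % 10
    return numbers[12] == expected_check
```

Anyone holding such a code can catch most typing errors without consulting a central database, which is exactly the kind of convenience that locations, companies and websites lack.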
Factual and InfoChimp will face many difficulties trying to get access to various stores of information. They already claim that they do not use crawlers or scrapers to parse websites. I do commend their adherence to good practices of the web, but it will be a slow process to get data.
Even though Factual's UI is more appealing at this time, there is very little data available to clients at this point. Granted, Factual tends to focus on data that is as complete and accurate as possible, while InfoChimp took the approach of throwing up as many random data sets as possible. A happy medium needs to be sought.
As stated, all these data repository services, and even the Data.gov site, need to organize their data better to make it more appealing to clients. They should use the standard practices of e-commerce sites: categories, search filters, sorting by popularity/date, reviews of the data, samples of the data, a recommendation engine, tags, etc.
I'm all about free data in the end, so is there a way for these services to figure out a different business model that doesn't rely solely on selling premium data sets? Maybe not, but I think it would be better if they offered not only raw data but also analytical services to help clients crunch through the available data sets, since many clients won't have the necessary resources to draw meaningful results themselves.
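As a rough sketch of what treating data sets as products could look like, the Python below models a catalog entry with a category, tags, download counts and ratings, plus a browse function that filters and sorts by popularity. Every name here (DataSet, browse, the fields) is hypothetical and only meant to illustrate the idea, not how any of these services actually work.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    """A catalog entry, modeled the way a shop models a product (hypothetical)."""
    name: str
    category: str
    tags: set = field(default_factory=set)
    downloads: int = 0
    ratings: list = field(default_factory=list)

    @property
    def average_rating(self) -> float:
        # Unrated sets score 0.0 rather than raising a division error.
        return sum(self.ratings) / len(self.ratings) if self.ratings else 0.0


def browse(catalog, category=None, tag=None):
    """Filter by optional category and tag, most downloaded first."""
    matches = [
        d for d in catalog
        if (category is None or d.category == category)
        and (tag is None or tag in d.tags)
    ]
    return sorted(matches, key=lambda d: d.downloads, reverse=True)
```

With something like this in place, a call such as browse(catalog, category="geography", tag="cities") would return only the geography sets tagged "cities", with the most used ones at the top.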
There are plenty of risks when it comes to Big Data. I'm sure people got defensive when I mentioned how many things don't have numbers, since it calls up the idea of Big Brother, and even the Holocaust, with everyone branded with a number. Frankly, although numbers are the easiest method, as seen in India's current push to issue ID numbers to its population, there are other means of identification available to us. A simple example would be a web-standard identification name. Currently that tends to be people's email address, but it could be made more standardized than that: you register your dominant email address with some organization as your primary web ID. Then, using a hash of that ID, we can have a global identifier based on individual choice as opposed to assignment.
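A minimal sketch of that last step, assuming SHA-256 as the hash and simple lowercasing as the normalization (both are my assumptions; the function name web_id is hypothetical as well):

```python
import hashlib


def web_id(email: str) -> str:
    """Derive a stable identifier from a registered primary email address."""
    # Normalize first so that " Alice@Example.com " and "alice@example.com"
    # map to the same identifier (assumed normalization rule).
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

One caveat with this sketch: a plain hash of an email address can be recomputed by anyone who already knows the address, so it standardizes the identifier but does not hide who is behind it.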
A bigger risk of Big Data is that people tend to fall under the false assumption that more data is better. Honestly, most data being collected is junk. We need to be able to recognize what data is meaningful and when it is worthwhile to try to sort through data. A good example here is that day-traders love data crunching software that returns the latest high/low trends of a stock with historical performance and maybe a comparison to the values of competitors. But a value investor should not be letting the daily fluctuations dictate his investment decisions, for it is more important to assess strategy, leadership, deals in the pipeline, and other long-term factors that don't require large data sets.
Technology has improved but can still go further. I don't have any in-depth experience with NoSQL solutions and how they deal with large data, so I cannot say too much about them. But I would imagine that time will make the documentation and adoption of these solutions more mainstream. Relational databases are great for smaller data sets that are more static, but with the demands of Big Data, the new suite of solutions will hold the key.
Big Data is still in its infancy, and I would love to see more people apply themselves in this direction. But as stated in the last entry, people and projects follow the money, and we are still more concerned with making a quick buck from siloed niche information than with tackling how to standardize and organize the scary mass of data out there.
Just consider: Google is not known as a content producer or provider, but rather as a searcher, organizer and aggregator. Even then, its data still has much room for improvement.