There's data, there's data, and then there's data
We had a saying on my former team at comScore: there's data, data, (pause) and then there's data.
As you've likely read too often, more data is being created and collected today than ever before. New data streams are being generated by modern businesses, governments, hobbyists, devices, etc… Data is often pitched as a panacea for solving problems and a go-to way to unlock new opportunities. The abundance of new data is correlated with, if not caused by, the massive reduction in storage cost on both an absolute and relative basis. Moreover, storage and access can change dynamically with both planned and unplanned usage. Early stage technology companies often have a majority of their expenditure allotted to salary (labor with a capital L) rather than hardware (capital with a capital K). Large companies can have major data related costs in hardware, licenses, and people, though hopefully they have reasons to invest given market success. For both large and small companies, there are affordable options to capture and store all the data within reach, even before there is a known way to make sense of it or make money from it. Storing everything you can is treated like a no-brainer.
Fine, granted, so what can we do with it? Well, there's data, data, (pause) and then there's data. Spoiler, there is a fourth level of data too.
Data level 1: Existence
What to capture?
How to implement measurement and storage?
Data level 2: Availability
How to access?
How to visualize?
Data level 3: Trust
How to make decisions based on data?
How to become a data driven culture?
Data level 4: Decision
How to automate decision making?
How to create data driven products?
I've heard some people map this to information -> data -> insight -> knowledge. This is bite sized, vague, and devoid of practical implication; it's an I-know-it-when-I-see-it segmentation. Here is an attempt to walk through how to think about the specific decisions you'll encounter as you try to infuse data into your company's operation, product, and psyche.
This will focus on broader operational decisions and motivation rather than specific technical systems. It also largely ignores the specific engineering challenges related to long term storage and serving data at web scale.
In the technology sector, it seems that only the upper echelon of large technology companies and promising, engineering focused upstarts can attract and afford the diverse technical, operational, and scientific talent necessary to answer the above questions.
I've ignored many related topics, so just take this for what you will. If you're passionate and can lead me to ways to make this easier, please send me a note. I think data and its applications will be the raw materials of the next great businesses and, for all of our advancements, we're still clunking our way through the basics.
____
The first level of data is merely creating and capturing it correctly. You've chosen to log something, you've implemented the tracking, you've piped it to homegrown (lots of work) or third party analytic systems, and you've verified it's both consistent and complete. None of these tasks are trivial; they often require multiple people, iteration, debugging, etc… In the web space, you can go a long way by instrumenting client side code (what the user actually sees in the browser) using excellent off the shelf tools like Mixpanel, Kissmetrics, Google Analytics, Chartbeat, etc… This approach still requires front end modifications, testing, diligence to maintain, up front time for learning their systems, and some medium term lock in to their policies. You still need to do validation and common sense sniff testing to ensure you're getting the results you'd expect, especially if you do anything even slightly custom.
Measuring server side code or other backend systems (the fun part where your application performs) is often done by logging: your code writes out flat lines of text or formatted key-value pairs (JSON, XML, etc.). Most business executives think this type of logging should come "for free" when systems are built, but there is often complex work to create and capture relevant data.
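To make "formatted key-value pairs" concrete, here is a minimal sketch of a server side log line written as JSON, using only the Python standard library. The event name, field names, and the format_event helper are all made up for illustration; they are not from any particular system.

```python
import json
import time


def format_event(event, ts=None, **fields):
    """Serialize one event as a single JSON line (hypothetical schema)."""
    record = dict(fields)
    # Timestamp and event name are written last so callers cannot overwrite them.
    record["ts"] = time.time() if ts is None else ts
    record["event"] = event
    return json.dumps(record, sort_keys=True)


# One line per event, so downstream tools can parse the log line by line.
line = format_event("address_resolved", ts=1.0, place_id="p-123", confidence=0.92)
print(line)
```

One JSON object per line keeps the log greppable by humans and parseable by machines, which matters once someone other than the author needs to query it.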
This requires a level of precision not typically found in day to day non-engineering life. Even highly skilled analysts, big data distributed algorithm gurus, and Ph.D. data scientists are rarely equipped to design what to log and how to log it. It's a bit of science and art. There are tradeoffs for performance, data size, correctness, etc… Many of these tradeoffs require thinking through how the data will be queried, which means predicting usage by others (not an easy task). Engineers may approach this from a code coverage perspective: "Are we logging errors? Is code executing in a performant way? How early can we tell there is a functional problem? Are the machines operating normally?" While critical and necessary, this is not sufficient. We also need to measure whether the system is achieving its goal: "Are the current outputs or behaviors correct and/or normal? Are users using the system in a successful way? Is the data being generated useful?" These questions require much more thoughtful definition and are inherently subjective. You need to interpret how code impacts people, businesses, and other systems. You can successfully log errors, keep the machines healthy, maintain your uptime, beat your SLA requirements, and still produce garbage data. Do you want to know that ASAP, or when people start complaining downstream?
Concrete example: At Hyperpublic, we indexed lots of data about local businesses, places, and venues from publicly available sources. We were building, and the team continues to build at Groupon, scalable systems for collecting this data, normalizing it, and determining the most likely accurate information. Let's imagine we're interested in the physical address of the business: it's not enough to instrument application code to alert on errors, we need to monitor how well the system is determining a place's likely street address. Yeah, the machines are on and the processes are running, but is it producing good results? Should people using this data be happy?
One helpful way to get started is to pick one, and only one, measure of a data stream's validity. Picking one metric that means something to the team can be the driving motivation to follow the project through from beginning to end (the Kissmetrics blog has a nice detailed post on some ideas per business type). It's best, though not required, if this measure can (1) be computed independently from other systems, (2) operate on one record at a time, (3) be stateless, (4) be computationally trivial, (5) improve the accuracy of the test over time, and (6) have tangible meaning to non-engineers and non-stats people: i.e. normal people. This means that your metric can be computed on streams of incoming data, can be distributed over multiple threads, processes, and/or machines, doesn't impact application performance, creates new understanding of expected results over time, and can be described to everyone in the company. If it fails any of these needs, you may have to overcome more complex engineering and insight challenges, but that's where the creativity and fun come in.
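As a rough sketch of what such a measure could look like for the street address example, the check below looks at one record at a time, keeps no state, and boils down to a single number anyone can read: the share of records with a plausible address. The regular expression and the record layout are assumptions for illustration only, not how Hyperpublic actually did it, and the sketch does not attempt property (5).

```python
import re

# Hypothetical rule: a plausible street address starts with a house number
# followed by at least one more token. Real address validation is far messier.
_ADDRESS_RE = re.compile(r"^\d+\s+\S+")


def is_plausible_address(record):
    """Stateless, per-record check; `record` is assumed to be a dict."""
    address = (record.get("address") or "").strip()
    return bool(_ADDRESS_RE.match(address))


def valid_share(records):
    """Fraction of records that pass the check, computed in one pass."""
    total = 0
    valid = 0
    for record in records:
        total += 1
        valid += is_plausible_address(record)
    return valid / total if total else 0.0


sample = [
    {"name": "Cafe A", "address": "123 Main St"},
    {"name": "Cafe B", "address": ""},
    {"name": "Cafe C", "address": "Main St"},
    {"name": "Cafe D", "address": "45 W 18th St"},
]
print(valid_share(sample))  # 0.5
```

Because each record is judged on its own, the same function can run inside a stream processor or be spread across machines, with only the two counters needing to be summed at the end.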
Designing these metrics is the fascinating intersection of engineering, statistics, and business insight. Are you looking for a massive abrupt shift, a slow change over time, or an individual data anomaly? Being aware of how you'll evaluate the summary statistics will impact your choice in how you measure and design the system.
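To illustrate why the question matters, here is a small sketch of two different checks over the same daily metric: one looks for an abrupt shift against a trailing baseline, the other for a slow change between an early window and a recent window. The window sizes and thresholds are arbitrary assumptions, and real monitoring would want something more statistically careful.

```python
from statistics import mean


def abrupt_shift(series, window=7, threshold=0.10):
    """True if the latest value is far from the mean of the prior `window` values."""
    if len(series) <= window:
        return False
    baseline = mean(series[-window - 1:-1])
    return abs(series[-1] - baseline) > threshold


def slow_drift(series, window=7, threshold=0.05):
    """True if the mean of the last `window` values moved away from the first `window`."""
    if len(series) < 2 * window:
        return False
    return abs(mean(series[-window:]) - mean(series[:window])) > threshold


# Hypothetical daily values of a validity metric (share of good records).
sudden = [0.9] * 10 + [0.7]
gradual = [0.90 - 0.01 * i for i in range(14)]

print(abrupt_shift(sudden), slow_drift(sudden))
print(abrupt_shift(gradual), slow_drift(gradual))
```

The sudden drop trips only the first check and the gradual decline trips only the second, which is the point: the summary statistic you plan to watch determines what you need to store and how far back.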
That's a lot of work, and we're still at data level one. If your data isn't well structured, you're in a whole other boat.
Data level 2.
Ok. So now you've chosen what to measure, next up is convincing the team, or yourself, that it's important to integrate the tracking. This can become a major endeavor that can rabbit hole its way into a long term project with no clear early wins. Do not bite off too much at first. Pick one thing, get it working, get it on a graph, get that graph on the wall, and high five. Yes, you will make some short term decisions that will likely require rewriting down the line. Yes, you will build some short term systems that are single purpose, don't scale to new sources, don't allow for arbitrary analytic access, but this is the only way I know how to get anything done. I don't always do a good job with it because it can be so damn fun to plan and draw data flow diagrams (seriously one of my favorite things to do). Resist the urge. Use the open source tools, steal the front-end view from your neighbor, and just get that metric on the wall. You'll feel better.
Data level 3. If you've made it this far, the data is flowing… Can you query it? Can you, with confidence, make decisions based on it? Can you ask someone else at the company what the state of that monitor is? Is it in your daily workflow? Does everyone trust it? Can the CEO read it? Is there any ambiguity? This is the softer, and harder, side of data. It always has caveats and coverage issues, and it is susceptible to misinterpretation.
Your instinct might be to create wikis, though they have a very low likelihood of improving the situation. This step does not come overnight. You need to create a trusting relationship between the data and you, your team, and your company. Like all good relationships, it takes time, attention, frequent interaction, retrospection, and occasionally intervention. Once you reach a healthy level of trust, data will become invaluable. Everyone will want access, teams that didn't see the value earlier will want to participate, and data will have a voice when decisions are made. Now you might need to rebuild your systems to handle the scale of new use cases, uptime requirements, data volume, storage needs, privacy requirements, access controls, retention policies, ad hoc vs. production schedules, concurrency, asynchronous collection, better viewing tools, alarms, sessionization, long term reporting, back up, etc… This just became a real asset and requires commitment.
Data level 4. This is the stage that's talked about the most: data as a feature. It's what powers product recommendations, search engines, news feeds, ad targeting, etc… There is statistical, testing, and production engineering work focused on servicing data needs in this area, but we'll leave that for another post. This is the promised land, but it is a ways off from where most companies are…
This was a very high level and likely too wishy washy view, but one that I hope highlights the realities of creating a data driven culture that are overlooked in the typical "how big is your big data" discussion. If you're good at this stuff, please let me know.












