Talk to enough startups, VCs, business school students and engineers and you’re bound to hear the term “big data.” It’s been a buzzword for at least five years, maybe ten. It’s a perennial, like “cyber” and “social media,” just vague enough to be all-encompassing while niche enough to feel specific. Big data is hot, it’s exploding, it is definitely the future, but people don’t really agree on what exactly it is. Does it simply mean many observations? Observations with many features? Is it high-velocity data? Is it structured or unstructured? Is machine learning implied? What is machine learning? Ad nauseam…
Here’s the truth. Most problems are not big data problems. Ah, doesn’t that feel better? You don’t have to worry so much about big data issues when you let that sink in. Only a small subset of problems meets big data requirements. First, gathering and storing the data must cost less than the analysis pays off. Otherwise, the return on your invested time and effort is not worth it. There are many situations where a heuristic is more expedient and doesn’t give up much accuracy. In my “Python for Data Science” class, one group’s final presentation ran a cluster analysis on Yelp reviews to show on a map where the best food in each cluster was. It wasn’t shocking that good Indian food could be found in Murray Hill; a heuristic would have said the same.
Ok, so rule out all problems that you wouldn’t even open a spreadsheet to solve. Still a lot of things we can crack with “big data,” right? Well, maybe. Perhaps you want to help people make business decisions? Imagine the potential of such an algorithm for decision science. Or what if you wanted to know which way a binary choice would turn out, like an election or merger? Binary choices that get made once in a specific context are not great candidates for big data. If you can make a bet or invest resources thousands of times, even a slight ability to predict an outcome can pay off across a portfolio. A one-off decision gets no such averaging. Look at last year’s election. The statisticians weren’t wrong when they said Donald Trump had a 28.6% chance of winning. It’s just that if you win, you get to be 100% president.
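To make the portfolio point concrete, here is a minimal sketch. It assumes even-money bets of equal size and a hypothetical 52% chance of winning each one (both are assumptions for illustration, not figures from any real market), and it computes the chance of coming out ahead after one bet versus ten thousand.

```python
import math


def prob_profitable(n_bets: int, p_win: float) -> float:
    """Probability of winning more than half of n_bets independent even-money bets."""
    log_p, log_q = math.log(p_win), math.log(1.0 - p_win)
    total = 0.0
    for wins in range(n_bets // 2 + 1, n_bets + 1):
        # Binomial probability of exactly `wins` wins, computed in log space
        # so that large n_bets does not underflow to zero.
        log_pmf = (
            math.lgamma(n_bets + 1)
            - math.lgamma(wins + 1)
            - math.lgamma(n_bets - wins + 1)
            + wins * log_p
            + (n_bets - wins) * log_q
        )
        total += math.exp(log_pmf)
    return total


EDGE = 0.52  # hypothetical win probability, chosen only for illustration

single_bet = prob_profitable(1, EDGE)
portfolio = prob_profitable(10_000, EDGE)
print(f"one bet: {single_bet:.4f}   10,000 bets: {portfolio:.4f}")
```

With a single bet, the chance of ending up ahead is just the edge itself. Repeated ten thousand times, the same small edge makes a profitable portfolio close to certain, which is why one-off binary choices are such poor candidates.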
Then there are the problems that are solved by combining some theory about the process that creates the data with a small amount of data. For example, if you are trying to solve a queuing problem, you could fit a Poisson distribution to a few data points rather than gather six months of data. This is also good for decision science, since you will know the full distribution and not simply an arbitrary cutoff, such as the 95% confidence interval that is conventional in science. The confidence interval poses all sorts of problems, from overfitting your data and arbitrary thresholds to sampling bias.
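The queuing example can be sketched in a few lines. The arrival counts below are made up for illustration, and the capacity of 8 customers per hour is likewise a hypothetical. The sketch assumes arrivals really are Poisson, estimates the rate as the sample mean, and then reads off the whole distribution rather than a single cutoff.

```python
import math

# Hypothetical data: customers arriving in each of five observed hours.
observed_arrivals = [4, 6, 5, 3, 7]

# For a Poisson model, the maximum-likelihood estimate of the rate
# is simply the sample mean.
rate = sum(observed_arrivals) / len(observed_arrivals)


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k arrivals in an hour at average rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def prob_exceeds(capacity: int, lam: float) -> float:
    """Probability that arrivals in an hour exceed the given capacity."""
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(capacity + 1))


CAPACITY = 8  # hypothetical number of customers the staff can serve per hour

print(f"estimated rate: {rate:.1f} arrivals per hour")
for k in range(11):
    print(f"P(arrivals = {k:2d}) = {poisson_pmf(k, rate):.3f}")
print(f"P(arrivals > {CAPACITY}) = {prob_exceeds(CAPACITY, rate):.3f}")
```

Five data points are thin, so the estimated rate is itself uncertain. The point is only that the theory supplies the shape of the distribution, which leaves the data to pin down a single number.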
So what is big data good for? Problems that meet three conditions:
- No useful heuristic
- Cheap data acquisition
- Making the same bet many times
The advances in information technology have tricked us into believing that these conditions hold for most problems. In fact, most data sets are expensive to gather or nonexistent. Computer network data may pile up faster than we can use it, but learning what people really believe about a presidential candidate is expensive and complicated. Deciding whether to buy a company has many more unknowns than deciding where to put a “buy now” button on a website. I don’t mean to slander big data. It has some amazing potential for genetics, automated driving, cyber security, advertising and many other fields.
What then should IT departments, managers, consultants and data scientists be selling if not big data? First, decide if you’re in the business of selling confidence or information. If it’s the former, your product doesn’t matter. It is enough that your clients like you and the assurances you give them. Your method is not important.
If you are in the business of selling information, then the only thing that matters is your ability to get results with limited resources. This lends itself to small data approaches. The approach could be a simulation, it will probably be pseudo-Bayesian, and it could be supervised or unsupervised. The important thing is to do more with less. That is, you need fewer data points and get richer information out of the analysis. This is exactly what small data can do.
Instead of providing a recommendation for a go/no-go decision, you can provide the full distribution of outcomes with relatively cheap inputs. This is why it’s important to determine if you’re selling confidence or information. The full distribution will not help your clients confidently choose a path or provide them cover when things go south. It is, however, the whole story. Providing your client with complete information is not only possible with small data, but the right thing to do.
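As one illustration of handing over a distribution instead of a verdict, here is a minimal Bayesian sketch. Everything in it is an assumption made for the example: a pilot with 7 successes in 10 trials, a uniform prior on the success rate, and a Beta posterior sampled with Python's standard library.

```python
import random
import statistics

# Hypothetical small data set: 7 successes observed in 10 trials.
successes, trials = 7, 10

# A uniform Beta(1, 1) prior updated with the data gives a Beta posterior.
alpha = 1 + successes
beta = 1 + (trials - successes)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
draws = [rng.betavariate(alpha, beta) for _ in range(20_000)]

# statistics.quantiles with n=20 returns 19 cut points at 5%, 10%, ..., 95%.
cuts = statistics.quantiles(draws, n=20)
low, median, high = cuts[0], cuts[9], cuts[18]

# Share of the posterior in which the success rate beats a coin flip.
p_above_half = sum(d > 0.5 for d in draws) / len(draws)

print(f"5th percentile:  {low:.2f}")
print(f"median:          {median:.2f}")
print(f"95th percentile: {high:.2f}")
print(f"P(rate > 0.5):   {p_above_half:.2f}")
```

Ten observations yield a wide range of plausible success rates rather than a yes or a no, and that range is the information the client is paying for.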