If you have enough data and sufficiently sophisticated correlation metrics, the answer is most likely yes.

In *The End of Theory: The Data Deluge Makes the Scientific Method Obsolete*, an article in Wired, Chris Anderson discusses how big data is changing the scientific method. The scientific method is based on testing hypotheses and designing experiments to prove or disprove them. With massive amounts of data available, do scientists still need to follow this process?

You still need some idea of what sorts of questions you want to answer, but the design of experiments may be giving way to mining massive amounts of data to see what we can learn. Anderson argues:

“Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot. … With enough data, the numbers speak for themselves.

In the article *What’s to be Done about Big Data?*, Gil Press discusses the book *Big Data: A Revolution that Will Transform How We Live, Work, and Think* by Viktor Mayer-Schönberger and Kenneth Cukier. He argues that correlations are enough in many situations:

The authors correctly say, “For many everyday needs, knowing *what*, not *why*, is good enough.” The book is full of such examples, from making better diagnostic decisions when caring for premature babies to which flavor of Pop-Tarts to stock at the front of the Walmart store before a hurricane. Big data can help answer these questions, but they never required “knowing why.” This indeed is one of the key themes in the book, that “society will need to shed some of its obsession for causality in exchange for simple correlations: not knowing *why* but only *what*. This overturns centuries of established practices and challenges our most basic understanding of how to make decisions and comprehend reality.”

The trick is to have sophisticated correlation metrics. The simple linear metrics offered by most “analytics” packages (Pearson’s correlation coefficient, variance, and covariance) aren’t useful for scientific research, since most physical and biological systems are not linear.

In the diagram below, the distributions in the bottom row all have zero linear correlation, so linear correlation metrics will not identify any relationship in the data, even though the distributions are clearly not random.
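To make this concrete, here is a minimal Python sketch (using NumPy; the quadratic relationship and the data are illustrative assumptions, not examples from the article) showing that a variable completely determined by another can still have a Pearson correlation of essentially zero:

```python
# Minimal sketch: perfect non-linear dependence, near-zero linear correlation.
# The quadratic relationship below is purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=10_000)   # symmetric inputs
y = x ** 2                                 # y is completely determined by x

r = np.corrcoef(x, y)[0, 1]                # Pearson's linear correlation coefficient
print(f"Pearson r = {r:+.3f}")             # prints a value very close to 0
```

The linear metric reports no relationship only because the dependence is symmetric and non-linear, which is exactly the situation in the bottom row of the diagram.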

P-values are the most commonly used measure of statistical significance in scientific research, but they say nothing about the form of a relationship. Non-linear measures of correlation, such as mutual information, are necessary if we want to discover complex relationships in the data, and these are generally only available in statistical software packages.
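Mutual information itself is straightforward to sketch. The histogram-based plug-in estimator below is our own illustrative example (not code from any package mentioned here, and the bin count is an arbitrary assumption); applied to the same quadratic data, it reports a clear dependence where Pearson’s r reports none:

```python
# Rough plug-in estimate of mutual information I(X; Y), in nats, from a 2-D histogram.
# Illustrative sketch only; bin count and data are assumptions for the example.
import numpy as np

def mutual_information(x, y, bins=30):
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probabilities p(x, y)
    px = pxy.sum(axis=1, keepdims=True)       # marginal p(x), shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal p(y), shape (1, bins)
    nz = pxy > 0                              # skip empty cells to avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=10_000)
y = x ** 2                                    # same non-linear relationship as above

print(f"Pearson r          = {np.corrcoef(x, y)[0, 1]:+.3f}")       # ~0
print(f"mutual information = {mutual_information(x, y):.3f} nats")  # clearly > 0
```

Independent variables would give a mutual information near zero, so a value well above zero flags a relationship even when the linear metrics see nothing.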

Research that lets the data do the talking has been incredibly expensive because the available statistical packages are based on 30-year-old code, developed at a time when massively parallel processing wasn’t possible. Doing scientific analysis with these packages requires supercomputers or clusters with thousands of nodes.

Statistical correlation metrics can drive innovations in every industry once they are freed from the constraints of multi-million-dollar investments, the need for data scientists, reductionist approaches, and analysis times that can stretch to weeks. As Chris Anderson says, the opportunity is great:

Learning to use a “computer” of this scale may be challenging. But the opportunity is great: The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.

Don’t have a supercomputer for your research? Simularity can let your data do the talking without the multi-million-dollar investment.

## Partha Pratim Ghosh

/ April 27, 2013

It is tricky to apply standard correlation and get a meaningful association. It depends on the distribution of the input data. If the input data distribution is, say, uniform, you will get a meaningful association, i.e. rho = 1 implies the two random variables (X, Y) are correlated. If you switch to a Gamma distribution, rho(X, Y) = 0 does not necessarily mean they are independent. So, interpret rho(X, Y) carefully!

## James

/ April 30, 2013

Technically speaking, one cannot prove a hypothesis, only fail to reject it, i.e. “not guilty” as opposed to “innocent.” Also, be wary of outliers and collinearity.

## lkafle

/ May 1, 2013

Reblogged this on lava kafle kathmandu nepal.

## Andy

/ May 16, 2013

The thought that we can actually replace (versus simply complement) designed business experiments with correlation analysis of observational data is laughable. I hope dearly that this author was quoted out of context or something, for his own sake.

Big data methodology and better computing power certainly unlock much more value from the observational data we have, but there are hundreds of business applications where it really does matter “why” and not just “what,” never mind all the businesses that are involved in manufacturing as well.

## Clark Abrahams

/ May 17, 2013

As Ben Franklin put it, “There is no substitute for common sense.” No matter the method used to formulate a model or solve a problem (the scientific method, data mining, dart throwing, …), we must consider the validity and representativeness of the inputs for the purpose at hand, as well as the properties of the solution and whether or not it makes sense to adopt it. A balanced approach that includes data, science, and judgment is the best one, based on my four decades of practical experience. An approach that focuses purely on correlations and does not take causality and reasonableness into account will turn out to be sub-optimal at best and may prove disastrous in some cases. The mere existence of correlations does not mean they have relevance for the problem at hand or, for that matter, anything else.