Friday, May 4, 2012

What's the point of big data?

The point of big data and data science is to:
  • Understand why things happen the way they do
  • Make predictions about the future

Seems like an innocent statement?

Oh no. No it isn't.

This matters.

What's the problem?

Two different groups of people each believe in one of these two things. Both bullet points can't be right at the same time, can they?

Origins of the problem

Technology emerged that made it possible to make predictions about the future without any understanding.

Leo Breiman, towards the end of his life, saw this and then just let it rip. He wrote "Statistical Modeling: The Two Cultures" for Statistical Science, 2001, 16(3), 199-231. To summarize:
  • It recently became possible to make accurate predictions about what was going to happen next without any human understanding of why.
  • He suggested that experts generate worse outcomes by trying to explain something.
  • Look at all these methods you statisticians are ignoring.

In his own words:

"The goals in statistics are to use data to predict and get information about the underlying data mechanism. Nowhere is it written on a stone tablet what kind of model should be used to solve problems involving data.... I am not against data models per se. In some situations they are the most appropriate way to solve the problem. But the emphasis needs to be on the problem and on the data." (p. 214)

Recent Dialogue

The Machine Learning people have had wins, but so have the Domain Experts.

The Data Science Debate at the Strata Conference was really about these two cultures. And, at Strata, the Breiman point of view won the debate. You can read more of the specifics at Driscoll's Data Utopian blog.

Yesterday, Technology Review wrote up a longer version of Fader's point of view on big data. In that piece, Fader says:

"Even with infinite knowledge of past behavior, we often won't have enough information to make meaningful predictions about the future. In fact, the more data we have, the more false confidence we will have."

In the same article, he rightly calls chartists 'quacks' and argues for more science in data science.

I respect Fader a lot. He's one of the few people expressing skepticism out there, and that's great. He's adding to the conversation.

I don't want to misrepresent what Fader intended to say. I don't believe he's backing away from the scientific method as applied to big data. Rather, it goes back to the value of making realistic assumptions and using science to predict the future. That spirit is nicely embodied in a piece by Tsang (2009). You can read the summary here.


In sum, shots have been fired on both sides. The ML people are shouting "less domain knowledge!" and the domain experts are shouting "more science!". Big wins.

The Centrist Approach

If I can't explain why a machine is making a prediction, I can't convince people to trust the machine. The machine culture says "screw 'em, people are the problem". Yet machines are remarkably effective at making predictions, so they should be used as a tool. That perspective alienates the traditional statisticians.

I don't understand the root cause of the polarization. I don't understand why statisticians can't use these techniques to simulate the world and make predictions. I don't understand why machine learners are so resistant to hypothesis testing and well-informed theories, to the point where big data itself is trivialized. It should all be real science that advances our understanding of the world and enables us to make predictions about the future.

What's the point?
  • Understand why things happen the way they do
  • Make predictions about the future
It doesn't have to be polarized.

We can have both if we know what we're doing.

***

I'm Christopher Berry.
I tweet about analytics @cjpberry
I write at christopherberry.ca

2 comments:

Jim Novo said...

My favorite machine learning story is those folks spending 6 months on a project with end output "customer flag = deceased is highly correlated to customer defection".

Um, I could have "theorized" that.

By the way, Highly? Not 100%?

My read of the Fader piece was not so much that he was taking one side or the other. He was simply stating a common problem in the field, which (in my words) is that people assume more data = more business value created, and that is often not the case. Big Data is coming to mean "all the data" and we certainly know from web analytics there's a ton of data created that has limited business value.

Christopher Berry said...

@JimNovo

He's right, we're all right, that more data doesn't equal more intelligence.

A dashboard doesn't make people better drivers. A GPS makes some people safer drivers I suppose. Not everybody.

You're both right. It's all in using the data to do something.