Friday, February 24, 2012

The Old New Old New New New Frontiers of Analytics


It's very easy to be cynical about new ideas, especially when they've been previously hyped and previously failed.

Ideas fail. Statistically, failure is the norm.
 
I've been asking myself the question:


"What's different today that might make yesterday's fad become sustainable?"

There are three broad analytical areas that are prime for re-discovery and a fresh round of hype:

Reasons for skepticism:

  • I don't want my refrigerator to tweet when it's empty.
  • I don't want to give brands yet another channel to spam me with coupons.
  • I find the Internet hard enough to use; I don't need my favourite sites changing all the time.

What's different now:

  • I want to make things that are harder to steal, easier to retain and that are pirate-resistant.
  • A playlist that adjusts according to my browsing pattern would be valuable.

All three analytical frontiers mandate an excellent blend of design and science.

How awesome is that?

What do you think? Time for hype?

***

I'm Christopher Berry.
I tweet about analytics @cjpberry
I write at christopherberry.ca

Thursday, February 23, 2012

Business Intelligence is not Data Science

Business Intelligence is not Data Science.

There are a lot of 'yeah but' statements emanating from some in the BI community.

TL;DR summary:
  • Yeah but, it's all about driving business insights from the data!
  • Yeah but, Data Science still uses all the same BI tools we use!
  • Yeah but, Data Science is really just what BI was years ago!


A perspective:
  • No. BI is about using asymmetrical information advantage to extract surplus from customers. Data Science is discovering pareto optima between the customer and the business.
  • No. Data Science is not religious about toolsets.
  • No. Data Scientists have seen what went wrong with BI. Meeting the same fate would be a failure.
What I stand for as a Data Scientist:
  • The sun rises and sets with the customer.
  • If we do right by the customer, we'll do right by the business.
  • Pareto Optimality, not Nash Equilibrium.
  • The right tool for the right job.
  • Iterate violently.

In other words:
  • It's about the Customer, not the Business.
  • It's about the Experience, not the Data.
  • It's about the Outcomes, not the Tools.

Perhaps if more BI consultants thought like Data Scientists, they'd be Data Scientists.


It's okay. We can be neighbors. We can work together.

I'm not saying that BI isn't important. It's totally important.

We think we're different because we're thinking differently.


Wednesday, February 22, 2012

What does overfitting mean?

Have you ever heard anybody use the sentence:

"The problem with that model is that it over fits the data."

Ever wonder what that means?

The purpose of science is to use knowledge to make good predictions about the future. To do so, you use theories which inform models. Models are deliberate simplifications of the world which make explicit statements about the direction of the arrow of causality, and are judged to be useful only if the assumptions are actually good.

A good model makes accurate predictions about the future. That supposes that the assumptions which underpin the model are actual best-proxies for how nature actually works.

[Data scientists: If you have a problem with what I wrote here, leave a comment or email me.]


Models can be calibrated to over fit one data set, and in so doing, fail to make accurate predictions about the future.

There's a convention, at least in Marketing Science, that you have actual empirical data to support your model. That is to say, you have a dataset, and in order to explain something in nature, you're creating a piece of math whose purpose it is to replicate those results.


There's a way that scientists break up a single data source into multiple sets. One part is used to tune a model, and another part is held 'out-of-sample' to validate the model. It's a way of testing the predictions of the model before it goes out to be replicated by others, in other contexts.

Sometimes a model predicts a single data set extremely well. Say, a given model predicts the eBay auction price of an American League baseball card extremely well for the years 2003 through 2005, however, when given data for 2006 to 2011, it ceases to make accurate predictions.

When a model is too specific, too fine tuned to a given data set, we say that it is overfitting the data.
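Here's a minimal sketch of that tune-and-hold-out idea. Everything in it is an assumption for illustration: the data is synthetic (a straight line plus noise), and the "memorizer" is a deliberately over-tuned model that simply repeats the closest observation it was tuned on. It isn't the baseball card model.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_memorizer(xs, ys):
    """Deliberately overfit: answer with the y of the closest tuning x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared error of a model's predictions."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (an assumption): the true relationship is y = 2x plus noise.
rng = random.Random(0)
xs = [rng.uniform(0, 100) for _ in range(400)]
ys = [2 * x + rng.gauss(0, 5) for x in xs]

# One part tunes the model, the other part is held out-of-sample.
train_x, train_y = xs[:200], ys[:200]
hold_x, hold_y = xs[200:], ys[200:]

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

for name, model in [("line", line), ("memorizer", memorizer)]:
    print(name,
          "in-sample:", round(mse(model, train_x, train_y), 1),
          "out-of-sample:", round(mse(model, hold_x, hold_y), 1))
```

The memorizer is perfect on the data it was tuned on, because it has swallowed the noise along with the signal. On the held-out part it should do noticeably worse than the plain line. That gap is the thing to look for.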

Overfitting technically makes a paper appear very good, you know, hooray, your model worked - high five, drinks time. However, it may not be of any actual use to anybody else. Overfitting can fool people into believing they actually found something useful.

That's what overfitting means.


Tuesday, February 21, 2012

Thinking about Predictive Analytics Thinking

Yesterday I concluded that "Existing theoretical frameworks assume too much, and demand too much cognition by the end user."

The opposite of asking you to think about linear regression or support vector machines is Netflix.

Netflix uses a machine algorithm to suggest movies that you might like. They do this using a few sources:
  • When you first sign up, Netflix asks a few questions about you. 
  • They have a prior viewing history of all their subscribers before you, who also answered a few questions about themselves.
  • You tell them what you like by watching various movies and shows. 
  • You tell them more by rating them on a five star rating system.
By comparing your tastes to other people like you, in other parts of their library graph, they create a more relevant experience.
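To make "comparing your tastes to other people like you" concrete, here's a toy sketch. It is not Netflix's algorithm, which I haven't seen. The names, titles, star ratings, and the `similarity` and `recommend` functions are all made up for illustration. It finds the subscriber whose ratings look most like yours and suggests what they rated highly that you haven't watched.

```python
from math import sqrt

def similarity(a, b):
    """Cosine similarity over the titles both people have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = sqrt(sum(a[t] ** 2 for t in shared))
    norm_b = sqrt(sum(b[t] ** 2 for t in shared))
    return dot / (norm_a * norm_b)

def recommend(user, ratings, min_stars=4):
    """Suggest unseen titles that the most similar other person rated highly."""
    mine = ratings[user]
    others = {name: r for name, r in ratings.items() if name != user}
    nearest = max(others, key=lambda name: similarity(mine, others[name]))
    return sorted(title for title, stars in others[nearest].items()
                  if title not in mine and stars >= min_stars)

# Hypothetical five-star ratings.
ratings = {
    "you": {"A": 5, "B": 1},
    "ann": {"A": 5, "B": 1, "C": 5},
    "bob": {"A": 1, "B": 5, "D": 5},
}

print(recommend("you", ratings))
```

In this made-up data your ratings line up with ann's, not bob's, so the suggestion comes from ann's shelf. Nobody had to think.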

This is machine decision making at scale.

You benefit in three ways:
  • You don't have to think by searching through a massive library of content (at least in the US).
  • You don't have to think much about giving a movie 1 star or 5 stars.
  • You don't really have to think, since starting a movie is a no-cost, low-risk procedure.

Netflix also benefits:
  • They don't have to get into committee fights with their content providers about who gets prominence in the interface.
  • Humans don't have to decide which groups see what, saving considerable resources.
  • The algorithm learns over time, increasing your satisfaction and generating lock-in.

There are huge benefits in delivering utility to people by not making them think.

People like easy. Netflix uses predictive analytics to make things easy.

Companies like Netflix, Google, and Amazon have big-n/little-thinking problems. They have a massive amount of content and need to figure out how to route the most relevant pieces to the right people. Predictive analytics at scale works very well for these companies.

Most companies have small-n/little-thinking problems. They don't have much content. They're just fighting for scarce attention. Predictive analytics has a different application in this space. Same question - how do you reduce the amount of thinking that is required?

The routinization of business rule architecture is fairly well established in call-center and direct mail. That's all done. If anything, the blanket application of rigid business rules has done more to de-humanize and destroy customer relationships than anything else could have. This has been a colossal step backwards.

Customer centric predictive analytics systems are differentiated from their business centric cousins on that point.

In many ways, this is a solution hunting for a problem. It's a different problem set. How do we turn this around on itself?


Monday, February 20, 2012

Choice, Expectations and Predictive Analytics

James March explains that making a decision involves understanding alternatives, forming expectations about what's likely to happen, thinking about your preferences in terms of your wants, fears, hopes, dreams in relation to those expectations, and then making a choice.

That explanation really resonates. So we're going to use it here.

There's an assumption that choice amongst alternatives is cut and dried. It isn't.

Choice is a form of knowledge - specifically:
  • There are choices that you know you know.
  • There are choices that you know you don't know about. 
  • And there are choices you don't know you don't know.


Choices themselves aren't even really binary. There's significant ambiguity as to what a choice really means. How many times have you heard a statement in the form: "I thought we agreed on x, that we also agreed on y and z." People having varying understandings of a choice is standard.

Discovering Alternatives

Analytics is a bit hamstrung on what we do and do not say. For instance, we wouldn't say that just because a Featured Content Area didn't generate a high clickthrough, that the entire unit should be replaced by a navigation bar. We could. It's just that you need evidence for the alternative choice. 

Analysts make very factual statements about what is, and really fight the good fight on explaining why something happened.

Analytics really hasn't turned attention to the discovery of alternative choices. I believe that analysts are really well equipped to methodologically engage in alternative discovery, if they could keep themselves from forming expectations so prematurely.

Expectations

Expectations make alternative discovery look like a trivial problem.

The most obvious predictive analytical method is the plotting of a regression line and projecting it into the future.

To unpack that:
  • We build a dataset containing observations about the past and past performance.
  • We use a statistical or machine learning process to identify the line that best fits the past performance.
  • We take that equation for the line and project it into the future.

That method of forecasting the future assumes that past performance will continue. That trends continue. That history repeats.
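The three steps above can be sketched in a few lines. The numbers are made up and perfectly linear so the arithmetic is easy to follow, and `fit_trend` is a name I'm inventing here. It's plain least squares over time, nothing fancier.

```python
def fit_trend(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ..."""
    n = len(values)
    ts = range(n)
    mean_t = sum(ts) / n
    mean_y = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return slope, intercept

# Step 1: observations about past performance (hypothetical figures).
past = [10, 12, 14, 16, 18]

# Step 2: the line that best fits the past.
slope, intercept = fit_trend(past)

# Step 3: project that line three periods into the future.
forecast = [slope * t + intercept for t in range(len(past), len(past) + 3)]
print(forecast)
```

Notice what the code never asks: whether anything about the world will change in the next three periods. It can't. It only knows the past.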

Forecasting is not destiny.

Predictive Analytics

The bucket we call predictive analytics has many applications. In the context of supporting decisions, the field has a long way to go in terms of helping people understand their expectations in relation to choice and alternatives.

Existing theoretical frameworks assume too much, and demand too much cognition by the end user.

It shouldn't just be waved away.



Friday, February 17, 2012

Groups of people don't make decisions the way analysts assume they do

Ask most analysts and they'll have a very straightforward theory about how decisions are made.

  • They dig up numbers.
  • They put them into context.
  • Decision makers make a decision based off those numbers and context.

Only that they don't.

Enter March and Olsen. In 1972 they programmed a simulation in Fortran. It's called a Garbage Can Model. Their idea was a solid 40 years ahead of its time.

To summarize the Garbage Can Model:
  • Institutions are organized anarchies.
  • Problems, solutions, participants and energy go into a Garbage Can and are shaken all around.
  • Solutions really search for problems.

When you mix it with Arrow's Impossibility Theorem, you get a much more complete picture of why groups of people don't behave the way you think they do.

To summarize Arrow's Impossibility Theorem:

  • Individuals may have very straightforward preferences (Beef>Chicken>Pork), but when combined all together, the group's preferences frequently become circular. (Beef>Chicken>Pork>Beef>Chicken....)
  • That circularity forces decision making algorithms to be sub-optimal.
  • There is no perfect voting system.
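The circularity is easy to reproduce. The sketch below is the classic three-voter example usually called the Condorcet paradox, which is the intuition behind Arrow's result rather than a proof of it. The three ballots are made up, and `majority_prefers` is a hypothetical helper.

```python
# Each ballot ranks the options from most to least preferred.
ballots = [
    ["Beef", "Chicken", "Pork"],
    ["Chicken", "Pork", "Beef"],
    ["Pork", "Beef", "Chicken"],
]

def majority_prefers(a, b, ballots):
    """True if more than half of the ballots rank a above b."""
    votes_for_a = sum(1 for ballot in ballots
                      if ballot.index(a) < ballot.index(b))
    return votes_for_a > len(ballots) / 2

for a, b in [("Beef", "Chicken"), ("Chicken", "Pork"), ("Pork", "Beef")]:
    print(a, "beats", b, ":", majority_prefers(a, b, ballots))
```

Every individual ballot is perfectly orderly. The group, by majority vote, prefers Beef to Chicken, Chicken to Pork, and Pork to Beef.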

To synthesize:
  • An algorithm that makes group decision making both rational and representative of all the facts, opinions, and preferences cannot exist.
  • People cause organized anarchies.
  • Choice, itself, is ambiguous.
The first bullet point absolutely gutted me in my sophomore year. The second bullet saved my senior year. The third bullet point is putting a lot more wind into my sails.

There's a dangerous underlying assumption that if organized anarchies (institutions, companies, departments) had better information, and made sense of it, they would instantly make better decisions.

Is it really true that better informed people make better decisions?

I have anecdotes. (I make no claim that this is evidence.)
  • I make better decisions in the gym when I have my notebook with a complete history of what I lifted, on what.
  • I make far better eating decisions when I put in the effort to track it. 
  • I make better financial decisions because I track spending.

 Those are individual decisions. I don't consult anybody in making them. It's a dictatorship of one.

Is it really true that better informed groups of people make better decisions?

I'm not so sure that that is always the case.

Given what you now know about group decision making - what do you think?



Thursday, February 16, 2012

Simulation as Learning

You may hear marketing scientists or data scientists talk about simulation.


Simulation, at its core, means the imitation of something real.

The purpose of a simulation is to understand a system or a model. A simulation enables the analyst to take something very complex, program it, and run it again and again hundreds, thousands, or even tens of billions of times.

A good simulation takes in many independent variables, and produces a single number that is meaningful to a human. (Strong recommendation to web analysts: resist the urge to produce multiple dependent variables.)

A simulation:
  • Can be fed figures that are observed in nature.
  • Can imitate a system.
  • Can produce figures that can be compared to figures that are observed in nature.
If a simulation produces predictions that are generally in agreement with what is observed in nature, we can start to really pick apart the assumptions contained within the model. This process of aggressive inquiry and adjustment is learning.
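Here's a small sketch of what that looks like in practice. The scenario is hypothetical: a site with a fixed number of daily visits, where each visit converts with some probability. The function names, the inputs, and the figures are all assumptions. Several inputs go in, the day is imitated many times, and one number comes out.

```python
import random

def simulate_day(visits, conversion_rate, order_value, rng):
    """Imitate one day: each visit either converts or it doesn't."""
    orders = sum(1 for _ in range(visits) if rng.random() < conversion_rate)
    return orders * order_value

def run_simulation(runs, visits, conversion_rate, order_value, seed=0):
    """Run the imitation many times and boil it down to a single number."""
    rng = random.Random(seed)
    totals = [simulate_day(visits, conversion_rate, order_value, rng)
              for _ in range(runs)]
    return sum(totals) / runs

# Hypothetical inputs; in real work these would be observed figures.
expected_revenue = run_simulation(runs=500, visits=200,
                                  conversion_rate=0.05, order_value=40.0)
print(round(expected_revenue, 2))
```

The output can be held up against what's actually observed. If they disagree, one of the assumptions baked into `simulate_day` is wrong, and finding out which one is the learning.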

"But a simulation isn't real!"

You're right! A simulation isn't real.

But neither is a negative number. The fact that it isn't real doesn't mean it can't be useful.
 
What's the point?

This question represents a very major fault line.
 
You have to make up your own mind. Are simulations useful for a sub-set of problems that you'll confront?

I find them useful in a handful of circumstances. It's a hammer. And not every problem is a nail.

It's my hope that we'll have a rational discussion in analytics about the applicability of simulation.


Wednesday, February 15, 2012

Making sense of Geographic Data

Last week, Statistics Canada (StatsCan) gave us the initial population counts, broken down by area, from the 2011 census. It was a great day for analysts, and I congratulate them for delivering. A huge shoutout to The Canadian Press for a pretty sweet interactive map of the results. It’s a great mirror. It’s a great reflection of ourselves.

Where people live is a pretty big indicator (not explicitly a predictor) of many other things. Like people clump together.

  • There are areas associated with low income. 
  • There are areas associated with rapidly rising income. 
  • There are areas associated with established wealth. 
As a result, education, income, and wealth are associated with where you live. If I have your name and postal code, I can predict a few things about you.

What if I know where you are at various points in the day? Impossible? Consider, then, the signal emitted by a cellphone. You ought to be aware of the iPhone keeping a record of everywhere you go. Where you work relative to where you sleep yields a key insight into what you are.
 

I know that you’re a commuter if there’s a huge distance between the two coordinates. If I know that transit is terrible in that area, I know to annoy you with radio advertising on the drive into work. If transit is good in that area, I know to break up my spend into transit. If there’s a very short distance, I’m going to need outdoor placement to reinforce my message. Moreover, I also know a lot about the core attitudinal drivers of those who live in cars versus those who live outside. They’re different.
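The distance between two coordinates is a one-function job. The sketch below uses the haversine formula for great-circle distance. The 20 km commuter threshold and the `looks_like_commuter` name are assumptions I'm making up for illustration, and the coordinates are hypothetical.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def looks_like_commuter(home, work, threshold_km=20.0):
    """Hypothetical rule: a long home-to-work distance suggests a commuter."""
    return haversine_km(*home, *work) > threshold_km

# Hypothetical coordinates for where a phone sleeps and where it works.
home = (43.26, -79.87)
work = (43.65, -79.38)
print(round(haversine_km(*home, *work), 1), looks_like_commuter(home, work))
```

Two points per person is all it takes. That's how little data is needed to start making guesses about you.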

Where you exist has a lot to do with what you are.

  • Marketers really want to understand what you are.
  • Public policy analysts want to really understand what you are.
  • Traffic planners really, really, really want to understand what you are, at what points, where, and when.
Do you think you can learn more about yourself by learning more about what others are? 

Making sense of all that geographic data represents an awesome challenge and an awesome opportunity.

You should be aware of what, and why, you're telling who through your apps and through your phones. 

Tuesday, February 14, 2012

Leading words shape perception

Consider:
  • Only 97% of analysts use Excel at some point in their careers.
  • Only 3% of web analysts have yet to use R.
  • A whopping 0.3% of web analysts have downloaded PANDAS since Monday.

Now consider:
  • A whopping 97% of analysts use Excel at some point in their careers.
  • A whopping 3% have used R.
  • Only 0.3% of web analysts have downloaded PANDAS since Monday.

Leading words shape perception. Perception shapes both what is asked and biases within what is asked next.

For Instance: 
  • Who the hell are the 3% who haven't used Excel?
  • Why is Excel such a dominant tool at 97%?

And Next:
  • Why aren't way more web analysts using R?
  • Wow, what do those 3% of web analysts know that we don't?

And finally:
  • PANDAS was just announced over the weekend - how did so many find out about it?
  • Why didn't PANDAS weekend announcement generate more downloads?


I'm not saying that it's right or wrong.

I'm saying that it is.

At least be aware of what you may be doing and make up your own mind about how you shape perception.


P.S. All figures are made up and illusory.


Monday, February 13, 2012

The unexpected consequences of the rise of GIS enabled Apps

Scott Hanselman wrote an excellent piece on App geo-location data. If there's a Nobel Prize for writing blog titles, he would win it.

The piece is entitled:

It's 2012 and your kids have an iPhone - Do you know where they are? I do.



Admiration aside, yes, you're living through one of the greatest rises of applied Geographic Information Systems (GIS), ever.

It's bigger than the launching of the first weather satellite. Or LandSat. This time, it's millions of people equipped with sensors. And they're doing the sensing.

Many apps use geo-location data as a function of what they do, of varying utility, for the user:
  • There are traffic congestion apps that rely on applied GIS - to crowdsource intelligence around traffic and accidents. 
  • Starwalk uses GIS to put the stars in the right context.
  • FourSquare is a GIS check-in game that's monetizing through coupons.
People are doing interesting things by combining GIS with Social. And it's really just the beginning.

And, to be sure, you should be aware that you're trading your location to the App developer (and most likely third parties, since GIS is a niche), in exchange for a utility.


Can a 13-year-old possibly consent? Parental locks and controls are absolutely essential, especially as the technology ramps up and developers become more inventive.


All disruptive technologies generate externalities, especially when they meet an apathetic or uninformed population. Don't be either.

Friday, February 10, 2012

Who's Downvoting You On Reddit (Part 5)

This is the fifth in a series of five posts about Reddit and Analytics.


Previously - we covered the nature of the dataset, read histograms, generated segments, and understood that the most frequent users of Reddit are the ones who are doing the most downvoting by an astounding margin.


But wait, there's more.

Recall, however, that there were over 7 million votes cast. 1.8 million were downvotes, and 5.5 million were upvotes. Read the statistics table below to verify that.


Takeaways:
  • Upvotes outnumber downvotes.
  • The interface of Reddit itself causes upvotes to accumulate.
  • Reddit itself is a cause of a bias - probably by design.
The histogram below is by links - the content getting upvoted or downvoted. There were just over 2 million links submitted. On average, each link received 3.62 votes. Given everything you know about long tails, think about just how deceptive that 3.62 mean figure is. Note how you can't even see the bumps in the tail. And be in awe of the efficiency of the collective Reddit behavior that causes popular content to be disproportionately promoted while even 'good' or 'average' content gets relentlessly shifted to the left - all by a very small group of people.

Takeaways:
  • The long tail is long and powerful.
  • This small group of Power-Paulines is far more likely to downvote because of a much higher frequency of use.
I'm thanking Reddit for making so many APIs publicly exposed and enabling this sort of analysis and exploration. Thank you.



Thursday, February 9, 2012

Who's Downvoting You On Reddit (Part 4)

This is the fourth in a series of five posts about Reddit and Analytics. The complete thread will be posted at the end of the week.

Previously - we covered the nature of the dataset, read histograms, generated segments, and examined them. 

Putting a bow on it

The chart below summarizes the relationship between segment and their average vote. You can see a clear negative direction. The more one uses Reddit, the more one downvotes - even if the mean is exaggerated in the Power Pauline segment.



To really hammer the point home about the origin of downvotes, take a look at the table below. It's broken out by the segments you understand. It also contains two new variables - upvotes and downvotes. That is the total count of the number of upvotes and downvotes made by each segment.


Takeaways:
  • One-Time Olivers as a group were responsible for 175 of all the downvotes cast.
  • Vanity-Vanessa's as a group were responsible for 1781 of all the downvotes cast.
  • Average-Andy's as a group were responsible for 13,258 of all the downvotes cast.
  • Frequent-Fred as a group were responsible for 120,758 of all the downvotes cast.
  • Power-Paulines as a group were responsible for 1,672,368 of all the OBSERVED downvotes cast - but are probably responsible for a lot more in aggregate across all of Reddit. (This sample contains a bias, but bias doesn't mean I can't say anything at all about anything.)
Note the differences in order of magnitude between each group. 1781 is roughly 10 times greater than 175. And so on, a bit imperfectly, on the way up to Frequent-Fred's. There's an order of magnitude difference here in terms of the amount of weight each group casts.
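If you'd rather check that arithmetic than take my word for it, here's a quick sketch using only the five downvote counts listed above. The variable names are mine.

```python
# Downvotes cast by each segment, copied from the takeaways above.
downvotes = {
    "One-Time-Oliver": 175,
    "Vanity-Vanessa": 1781,
    "Average-Andy": 13258,
    "Frequent-Fred": 120758,
    "Power-Pauline": 1672368,
}

total = sum(downvotes.values())
shares = {name: count / total for name, count in downvotes.items()}

# How many times bigger is each segment's count than the one before it?
names = list(downvotes)
for previous, current in zip(names, names[1:]):
    ratio = downvotes[current] / downvotes[previous]
    print(f"{current} vs {previous}: {ratio:.1f}x")

for name, share in shares.items():
    print(f"{name}: {share:.2%} of observed downvotes")
```

The step-ups run from roughly 7x to roughly 14x, and the Power-Pauline count alone works out to a little over 92% of the observed downvotes.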

The greatest power users of Reddit are the ones who are downvoting you - and it's an exponential power.

I'll be posting some implications and the 'so-what' tomorrow.


Wednesday, February 8, 2012

Who's Downvoting You On Reddit (Part 3)

This is the third in a series of five posts about Reddit and Analytics. The complete thread will be posted at the end of the week.

Previously - we covered the nature of the dataset, read histograms, and generated our segments. Now we're going to examine each segment individually.

One-Time-Olivers

There are very efficient ways that statisticians quickly summarize and understand the relationship among variables. The aim here isn't to be efficient - but to be clear. In that spirit, I give you the histogram below.

Takeaways:
  • All 4877 One-Time-Olivers voted exactly one time.
You should lol. It makes sense though, right? And, the segment name should make a lot more sense.
The histogram below summarizes how, on average, One-Time-Olivers voted - positive or negative. Since they only voted one time, it's either an upvote, or a downvote. A +1 or -1 average.


 Takeaways:
  • One-Time-Oliver's tend to upvote once, and are never heard from again.
  • In answering the question - "Who's downvoting you on Reddit", it isn't One-Time-Olivers.
Vanity Vanessa

Vanity accounts frequently enter Reddit, they flicker, and they go out. They get discouraged. They never really commit to the bit. That's what happens to them. The histogram below takes on that familiar long-tail curve.

 
Takeaways:
  • There are a lot of Vanity-Vanessa's, some 7,527 of them.
  • Most of them posted only 2, 3, or 4 times.
So, how did they vote?

The histogram below summarizes the story:

 
Takeaways:
  • Vanity-Vanessa's upvoted nearly everything they saw, with very few exceptions.
  • Very few persistently downvoted everything they saw.
  • They're not the ones downvoting you on Reddit.

Average-Andy's

Recall that the average username votes 234 times, and yet I still labeled the segment ranging between 9 and 48 votes as Average-Andy.

That's because the mean number of votes that Average-Andy's cast is 22.25 - which is close to the median of 20 for the entire set.

This mixing and abstraction of median, mean, and segmentation isn't something that I expect most people to consider or think about, but I can foresee some getting hung up on it. When you think about an equal segmentation though, it makes sense that the mean of your middle category should be close to the median of the entire set.

For everybody else - just know that you're looking at the "average joe redditor" here.

Takeaways:
  • Average number of votes is 22.25, close to the median of 20 for the whole set.
  • Familiar long tail.
How do they vote?
 
Takeaways:
  • A majority of Average Andy's liked everything they saw - they upvoted everything.
  • They downvote more often than Vanity-Vanessa's or One-Time-Oliver's, but not massively.
  • They aren't downvoting heavily enough to say that these are the ones downvoting you on Reddit.

Frequent Fred

By now you're pretty much a pro at reading these histograms. Frequent Fred's vote frequently. Look at the histogram below.


Takeaways:
  • Classic long-tail continues.
  • Averaging 139.3 votes.
  • The unusual bump at the beginning of the series is just magnified by the scale from the previous vote frequency histogram. (It's fine).
How do they vote?
 
Takeaways:
  • Far fewer of them are likely to upvote absolutely everything they see.
  • There's significant flattening of the long tail - the average is .74.
  • More of them, on average, are disposed to downvoting.
Power Paulines

Power Paulines are the most difficult group to analyze, but the easiest to summarize and understand. Take a look at the histogram below.

Takeaways:
  • The long tail is holding - there's significant clustering at 1000 and 2000.
  • The cause is related to rate limiting within the Reddit API.
  • The longest part of the long tail - those power users with thousands and thousands of votes, are all bundled and clustered together at 2000.
  • There are around 500 such power users, representing some 1.5% of the total usernames.
So how do they vote?

 Takeaways:
  • The bump at 0 is caused by 1000 upvotes getting averaged out by 1000 downvotes.
  • 0's aside, which are tugging on the mean, Pauline's are on average more prone to downvoting.
  • Power Paulines are downvoting you on Reddit.
Tomorrow we're going to put a bow on it and bring it all together.


Tuesday, February 7, 2012

Who's Downvoting You On Reddit (Part 2)

This is the second in a series of five posts about Reddit and Analytics. The complete thread will be posted at the end of the week.

Recall that the average of all the votes a username made is called 'averagevote'. If somebody was persistently downvoting links, they'd have a negative number. If they upvoted everything they saw, they'd have an averagevote of +1.

Read the histogram below.   
The three takeaways are:
  • Negativity follows a reverse long tail. (It really happens - see how the figures fall away to the left)
  • On average, usernames upvoted what they saw (average 0.79).
  • There are bumps at 0 (related to a methodological note) and at -1.
By now, two of my good friends in London are screaming at the screen. Means are a horrible way to explain long tail distributions. You can see that now too. Means are giving us a pretty skewed view of the world.
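A tiny example shows why. The nine vote counts below are made up to have the same long-tail shape, where most usernames vote a handful of times and one votes thousands of times. They are not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical number of votes cast by nine usernames (long tail).
votes_per_username = [1, 2, 2, 5, 9, 20, 48, 325, 2000]

print("mean:", mean(votes_per_username))
print("median:", median(votes_per_username))
```

One heavy user drags the mean up to 268, a figure that describes nobody in the list. The median of 9 sits where the usernames actually are.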

The table below is a byproduct of our Frequency table. It's aptly labeled 'Statistics', and compares these two variables, numberofvotes and averagevote, side by side. I've thrown a yellow box around 'percentiles'. Recall the cumulative column from the previous frequency table.
  • 22.8% of all usernames voted 2 times or less.
  • 40.8% voted 9 times or less.
The program I'm using is giving me 'break points' for those percentiles.


Two takeaways:
  • The median gives a better summary of what's going on here - half of the usernames voted 20 times or less, and, another set of usernames always upvoted what they saw.
  • If I know that roughly 80% of all usernames posted 325 times or less, then I know that 20% of the usernames in my sample posted 325 times or more.
We're going to use those percentile cutoff points to inform a segmentation, next.

Segmentation

A segmentation is a grouping of records, usually people, into categories. There is no prescription for how to do this. If you talk to a modeller, they'll tell you about their clustering algorithms. If you talk to a machine learning scientist, they'll tell you about bump-hunting or unsupervised machine learning clustering. Those are all very good algorithms. I use them myself.

I'm going for simplicity here. I have these four percentile cut-off points that evenly cut people into five categories. And, for further simplicity, instead of referring to a group of people who posted between 9 and 48 times as 'those who posted between 9 and 48 times', I'm going to call them Average-Andy's. And I'll just keep on calling them that.

At this point, I don't know if they're male or female. (And we won't in this thread). And it's controversial to use alliteration. But it's done.

So, mapping the percentiles against a segmentation, based on how many times a username voted, we have:
  • 1 time: One-Time-Oliver
  • 2 to 9 times: Vanity-Vanessa
  • 9 to 48 times: Average-Andy
  • 48 to 325 times: Frequent-Fred
  • More than 325 times: Power-Pauline
Take a look at the result below - a variable I'm calling 'equalseg' - short for 'equal segmentation'.
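The mapping above is simple enough to write down as a function. One assumption to flag loudly: the ranges in the list share their endpoints (9, 48, 325), so in this sketch I'm sending each boundary value to the lower segment, in keeping with the "or less" percentile wording. The function is named after the 'equalseg' variable, but it's a reconstruction, not the program I actually used, and the sample counts are made up.

```python
from collections import Counter

def equalseg(numberofvotes):
    """Map a username's vote count to a segment.

    Assumption: boundary values (9, 48, 325) fall into the lower segment.
    """
    if numberofvotes <= 1:
        return "One-Time-Oliver"
    if numberofvotes <= 9:
        return "Vanity-Vanessa"
    if numberofvotes <= 48:
        return "Average-Andy"
    if numberofvotes <= 325:
        return "Frequent-Fred"
    return "Power-Pauline"

# Hypothetical vote counts for a handful of usernames.
sample = [1, 1, 3, 9, 26, 48, 140, 325, 2000]
print(Counter(equalseg(n) for n in sample))
```

Run that over every username and count the labels, and you get a frequency table by segment.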


Takeaways:
  • There are 4877 One-Time-Olivers, representing 15.5% of the usernames in the sample.
  • Vanity-Vanessa's represent 23.9% of the usernames.
  • The last three segments are pretty equally divided - the first two are more lopsided.
Even though I aimed to have five groups of people with equal numbers in each, you can see the division between One-Time-Olivers and Vanity-Vanessa's are off. This happens very often when segmenting a long tail into equal groups. And, while not ideal, it's okay for our purposes.

Next, we're going to examine each segment individually.

Tomorrow we'll look at the voting characteristics of each segment.


Monday, February 6, 2012

Who's Downvoting You On Reddit (Part 1)

This is the first in a series of five posts about Reddit and Analytics. The complete thread will be posted at the end of the week.

So who keeps on downvoting you on Reddit? We'll find out.

But first - three notes:
  • You may be familiar with Reddit. If you're not - you can read this explanation about what Reddit is.
  • To answer that question, I downloaded a dataset that was built in early 2011 or very late 2010. The dataset is a 29MB gzip-compressed file and contains 7,405,561 votes from 31,927 users over 2,046,401 links. You can read about the methodology here.
  • The file contains three columns - a vote, a userid, and a link. Only people who had their privacy settings set to open had that data read by an API. There is no meta-data about who these people are in real life (IRL) or even what was the nature of the content they were upvoting and downvoting.
So, who's downvoting you on reddit?

To find out, I took that huge file and transformed it into another one - boiling it down into a single username, how many times that username voted (numberofvotes), and the average of all their votes.
You can see below that _mike voted 26 times, and, if you take the average of all his votes, +1 for an upvote and -1 for a downvote, it turns out to be -.92. Basically, _mike didn't like a lot of what he saw. In fact, _mike upvoted once (+1) and downvoted 25 times (-25). So (- 25) + (+1) is -24, and -24/26 is -.92.
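The boiling-down step can be sketched in a few lines of Python. This is not the exact process I ran, just an illustration under some assumptions: the rows are (vote, userid, link) tuples with +1 for an upvote and -1 for a downvote, and the link ids are invented placeholders.

```python
from collections import defaultdict


def summarize_votes(rows):
    """Collapse (vote, userid, link) rows into one summary per username.

    Returns {userid: (numberofvotes, average_vote)}.
    """
    totals = defaultdict(lambda: [0, 0])  # userid -> [number of votes, sum of votes]
    for vote, userid, _link in rows:
        totals[userid][0] += 1
        totals[userid][1] += vote
    return {user: (count, total / count) for user, (count, total) in totals.items()}


# Rebuild the _mike example: 1 upvote and 25 downvotes.
# The link ids are hypothetical placeholders.
rows = [(1, "_mike", "link_0")]
rows += [(-1, "_mike", f"link_{i}") for i in range(1, 26)]

summary = summarize_votes(rows)
numberofvotes, average = summary["_mike"]
print(numberofvotes, round(average, 2))
```

For _mike this gives 26 votes and an average of -24/26, which rounds to -.92.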


There are over 30,000 usernames here - and that's a lot of data. It's really important to visualize the data before you really get into any analysis. One way to do that is to run a histogram.

To read the histogram below, remember:
  • Frequency means 'the number of usernames that fall into this category or range'.
  • Numberofvotes means 'the number of times a username voted.'
  • Mean is another word for average.


There are three takeaways from the histogram above:
  • The average number of votes by a username was 234.
  • A large number of usernames didn't vote very many times at all.
  • There are bumps at 1000 and 2000 votes. (If you're interested as to why - see the Methodological notes. Incidentally - this is why you should always visualize your data.)
A histogram is built from a Frequency Table, which we'll see below.

The way to read a frequency table is:
  • The 'Valid' column means 'how many times a username voted'.
  • Frequency means 'the number of usernames that fall into this category'.
  • Percent means 'the percentage of all the usernames that those in this category represent'.


There are three takeaways from the Frequency Table above:
  • 4877 of the usernames only voted one time (It's likely they submitted a single link and never returned).
  • Note how both the percentages and number of usernames in each category decrease.
  • 50.1% of all the usernames voted 20 times or less. (Look at the cumulative percent column and make sure that makes sense to you. We're going to use this column later.)
You may have heard the term 'long tail' many times before. This is a demonstration of what that means. The bars on the histogram fall away to the right.
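If you'd like to build a frequency table like this yourself, here is a minimal sketch in Python. The sample vote counts are made up for illustration and the function name is my own - the point is only to show how the frequency, percent and cumulative percent columns relate to each other.

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (value, frequency, percent, cumulative percent)."""
    counts = Counter(values)
    total = len(values)
    rows = []
    running = 0
    for value in sorted(counts):
        frequency = counts[value]
        running += frequency  # usernames seen so far, smallest values first
        rows.append((value, frequency, 100 * frequency / total, 100 * running / total))
    return rows


# Made-up vote counts with a long tail: many small values, few large ones.
sample = [1, 1, 1, 1, 2, 2, 3, 5, 8, 40]
table = frequency_table(sample)

for value, frequency, percent, cumulative_percent in table:
    print(f"{value:>5} {frequency:>5} {percent:>6.1f} {cumulative_percent:>6.1f}")
```

The cumulative percent column is just a running total of the percent column, so it always ends at 100.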

Tomorrow we'll look at the distribution of votes and do a segmentation.

***

I'm Christopher Berry.
I tweet about analytics @cjpberry
I write at christopherberry.ca

Friday, February 3, 2012

Building A Data Science Team: Kurt Schrader's perspective

Kurt wrote an excellent post about building a data science team. It's worth reading.

To expand off his points:
  • The first 90 days provide fuel for the subsequent 180.
  • The 180 days after are far muddier, because what was scaling in very unsophisticated interfaces requires a lot more work to become elegant solutions.
  • Data scientists should evangelize evidence and do what they can to develop interfaces that democratize the data. The math is a means to the end.

Own reflections:

  • I'm extremely thankful for my years of experience with Information Architects and Designers - as now - when I go into a room and they're not around, I actively think about that end state.
  • I'm glad I've spent as many years with developers as I have. I have a newfound appreciation for simplexity and complicity.
  • Data is inertia. Foresight pays.

***

I'm Christopher Berry.
I tweet about analytics @cjpberry
I write at christopherberry.ca

Thursday, February 2, 2012

The Usability of Histograms

A fellow data scientist and I were debating how to answer a very specific question that is asked all the time by others. How would we answer it? I grabbed a piece of paper and drew a histogram.

A histogram:
  • Plots a single variable along the X-axis.
  • Plots the occurrence, or frequency, of a given variable along the Y-axis.
  • Is used by statisticians and analysts to understand the frequency distribution of a given variable.

I said: "This is how I would want to see the data. This is how I answer the question today. This is what I would want to compare." Then I paused. Reflected. And added, "I am not the end user."

The end user isn't a statistician, marketing scientist, or an analyst. Histograms aren't encountered in everybody's everyday life. Uncertain if I was being elitist, I turned around and asked Twitter what they thought of it. And I got a mixed response.

  • Some thought that anybody would be able to understand it. 
  • Some thought that it required too much mental processing. 
  • Some thought that everybody should know how to read a histogram.

So, instead, we went about asking about the easiest way to communicate a distillation of what a histogram says. What is it that we see in a histogram that answers the question? How do we subtract all the thinking?

I believe that people should be able to read a histogram. I believe that people would be better off if they had a subset of data expressed to them in the format of a histogram. I believe people would make better decisions.

But those are opinions and unbridled optimism. They're very normative statements.

I don't think that histograms are usable from a mass usability perspective because they cause too much thought.

Don't make me think.

Don't make them think.

***

I'm Christopher Berry.
I tweet about analytics @cjpberry
I write at christopherberry.ca

Wednesday, February 1, 2012

Don't make me think (and other lessons for web design)

"Don't Make Me Think" by Steve Krug is one of my favourite books. I strongly recommend it to web analysts and data scientists.


In that spirit - here are a few of my favourite interfaces:








Commonalities:
  • Real choices about what to put in and leave out were made - in other words - they are designed. They were not assembled.
  • Not every surface is crammed with stuff. Just because nature abhors a vacuum doesn't mean you need to cram something into every pixel.
  • It's obvious what everything does.

Simple can be functional.

What are your nominations?

***

I'm Christopher Berry.
I tweet about analytics @cjpberry
I write at christopherberry.ca

Tuesday, January 31, 2012

Time: The neglected variable in Marketing Return On Investment questions

What's the Return On Investment on Marketing?

Depends on how soon you want your return. Time is frequently a neglected variable.

Recall that marketing had a schism right around 1920:

  • One man went on to found the branding agency, and found salvation through broadcast radio, and later, TV. 
  • One man founded the first direct advertising agency, and continued to find salvation through direct response and cataloging. 
  • The schism only really came to a head when digital forced it to come to a head.
Implications:
  • Evidence for a direct causal inference between marketing treatment and marketing conversion is greatest at the point of sale / point of conversion.
  • Any evidence of causality is severely diluted at the branding / awareness level at the earliest portions of the customer funnel / fish / cycle.
  • It follows that direct mail people overestimate their impact, and underestimate the impact of branding.

Remember that:

  • There is a lag between initial treatment, customer acquisition, and return.
  • The longer the lag, the more opportunities for noise and collinearity to creep into your models.
  • Skepticism expands as the time lag expands.
Marketing is a system. Time is a factor in that system. The biggest conflict in brand modelling is how long that time horizon is.

That doesn't mean that anybody is wrong.

Just be aware that it's a factor. And that it's a good factor.

***

I'm Christopher Berry.
I tweet about analytics @cjpberry
I write at christopherberry.ca

Monday, January 30, 2012

Apple, Animals, Analytics


You may have read a lot about Foxconn last week.

TL;DR summary:
  • Foxconn is the subcontractor that makes the iPhone and iPad.
  • Foxconn's CEO called his workers animals.

Here's the key tl;dr quote, from a current Apple executive:


  • “We don’t have an obligation to solve America’s problems. Our only obligation is making the best product possible.”

That's a lot of focus. That's laser-like precision on a given mission. Because it follows that if Apple produces the best product possible, people won't care about anything else.

Indeed, isn't s/he right?


Don't the free market and price competition make hypocrites of us all? (Two-Buck-Chuck anybody?)

What if it didn't have to?

Implications for Analytics Practitioners:

  • NRCan runs the Energuide program which assigns a single, real number to most major appliances expressing how much electricity it uses over the course of a year, and in so doing, it makes the previously invisible, visible, to the consumer.
  • Is there some mechanism by which we could assign the unseen human cost of a given product and make it visible to the consumer?
  • How could the invisible be made visible, so that at the very least, our collective market can generate preferences that are not based on market price alone?

I make no normative judgements about industrialization in other countries, nor any statement about the working conditions that people escape when they transition from subsistence agriculture into industrialization. I'm only asking if an analytical solution can drive an incremental disruptive dimension of consumer preference. ILO standards already exist, and have existed for decades. It's a problem-solution space that I know analytics practitioners have unique insight on.

***

I'm Christopher Berry.
I tweet about analytics @cjpberry
I write at christopherberry.ca

Sent from my iPad


Friday, January 27, 2012

It's possible to discern exactly what film someone is watching by analysing the power consumption of their TV

You may be familiar with smart grids, open data, and Pachube.

I was reading a piece on smart data, when suddenly a wild quote appears:

“...a group of hackers who demonstrated in early 2012 that it is possible to discern exactly what film someone is watching by analysing the power consumption of their TV via their smart meter, as every film has a unique ‘fingerprint’ of electricity usage.”

Oh yes. Confirmed. It happened at a hacking for privacy event.

Reactions and Questions:
  • What an unintended consequence of the technology!
  • What other hidden signals might there be in other sources of data?
  • What good might come of re-purposing seemingly noisy/garbage data?

Just as William H. Perkin discovered purple dye in waste coal tar, who else might discover useful relationships in wasted data?

Another reason why it pays to be curious and creative.

(Your cable company knows more about what you watch - so the specific privacy risk - while interesting, is really quite theoretical in nature.)

***
I'm Christopher Berry.
I tweet about analytics @cjpberry
I write at christopherberry.ca

Thursday, January 26, 2012

The Productivity Trilemma, Average is Over, and Analytics

Thomas L. Friedman wrote a fairly good piece for the New York Times. The theme is linked to something that has kept public policy makers awake for a very long time - the Productivity Trilemma. These two themes explain part of the reason for the rise of Data Science and how Web Analytics must evolve.

To summarize Friedman:

  • The era of average people relying on doing an average job for average pay is over.
  • Technology is more efficient than ever at destroying average jobs.
  • Everybody has to get smarter.

To summarize the Productivity Trilemma:

  • Productivity growth causes growth in GDP, producing negative employment effects.
  • Real interest rates outpace real growth rate of GDP, causing regressive redistribution effects, leading to the impoverishment of debtors and the enrichment of creditors.
  • Governments attempt to keep employment high through deficit spending to compensate for employment effects, enriching creditors and ultimately impoverishing all debtors.
Policy theorists have known of this problem since the late 1990's (cited). I recall a paper from Europe dating to the 1970's though. We know this problem exists.

The implication for web analytics and data science:

  • Automation technology is destroying manual data entry and dashboarding positions.
  • Get smarter, get creative, and get into experimentation-as-a-value-add. Do it now.
  • Data Science is a creative outgrowth of BI and Web Analytics, maneuvering directly to be the destroyer of not just manual entry, but any thinking at all.
I can't solve the Productivity Trilemma. That's something for all of society to decide. In the meantime, the best defense against these forces is a very aggressive offense.

Wednesday, January 25, 2012

SOPA protest was a 'watershed event'

According to Chris Dodd, the response against SOPA was unlike anything he's seen in his thirty years in politics. He called it a 'watershed event'.

Possibly.

Proponents of SOPA argue:


Opponents of SOPA argue:


Proponents want to believe that somehow Google made me oppose SOPA. What should be of even more concern to Chris Dodd was that Google had very little incremental effect. Their contribution to the movement was weak compared to what the real grassroots did.

There was no astroturf:

  • I learned of SOPA from one of the image boards.
  • It led to a slow-moving Reddit bloom a few days later.
  • We all gathered our pitchforks and went after GoDaddy.
  • Another Reddit bloom encouraging blackouts ensued.
  • The cost of incremental votes lost outweighed the benefits of incremental lobby dollars.
The pattern was obvious to anybody watching the situation unfold.

The real test of whether this is a watershed event is vigilance. Chris Dodd will try again. Hopefully they'll have a plan for absorbing their own costs for their own enforcement, but with fewer externalities and 99% less rent seeking. They probably won't.

Piracy has to be addressed. The entire business model needs revamping. There's no doubt. How it gets addressed, with as little damage to other sectors as possible, is the real question.

I hope that this has awoken Gen Y. The pattern of ignorance, I hope, has been broken now that something real and concrete in their lives has been attacked.


***

I'm Christopher Berry.
I tweet about analytics @cjpberry
I write at christopherberry.ca

Tuesday, January 24, 2012

Changes to Google's Privacy Policy

Google has announced changes to its privacy policy.

TL;DR:

  • They're rewriting it into human language.
  • Nothing about your Google Analytics account data changes.
  • "The only change for Google Analytics users under the new privacy policy is that now, information about how you interact with the Google Analytics interface may be shared with our other products."

Implications:

  • If the product is free, you are the product.
  • Software-As-A-Service (SaaS) analytics on SaaS users (meta-meta) is a major input in the product development lifecycle, so you can expect Google products to get better.
  • This paves the way towards a single Google Center of Excellence for internal SaaS Analytics.

Predictions:

  • Initial uproar.
  • Diminishing interest.
  • Business-as-usual.

Carry on.

Differentiation through negativity

There's a big difference between skepticism and blind negativity. It's through negativity that many experts attempt to differentiate themselves from a herd. Expertise is often some sort of competition - a game by which some people are more expert than others. Over time, that negativity can accumulate in a community, causing stasis and then retreat.

Skepticism:
  • The sample size involved seems awfully low. We need more evidence that this relationship holds up before declaring that this is a natural law of marketing.
  • The author didn't consider a few factors from prior work in this field - probably a genuine oversight on their part - so I'd like to see this report replicated with those factors to see the link.
  • If I accept the author's assumptions, then yes, the conclusions are logically sound. However, I have a problem with the realism of one particular assumption. As a result, the model might not be predictive of events in the following sets of circumstances.

Negativity:
  • The study is stupid because correlation isn't causality - you can't ever say that these factors cause this to happen. It's impossible to prove, so we shouldn't try. And you're an idiot for trying.
  • I disagree with one point, so the entire thing just falls apart.
  • I disagree, therefore, the entire study is invalid.


It's good to be skeptical.

I've watched tremendous harm done because of negativity. I've watched it over and over again for the past fifteen years. It's always the same pattern.

People frequently choose to disassociate themselves from a spinning black hole of negative energy. What happens when experts don't have an audience to shout at?


***

I'm Christopher Berry.
I tweet about analytics @cjpberry
I write at christopherberry.ca