Friday, February 10, 2012

Who's Downvoting You On Reddit (Part 5)

This is the fifth in a series of five posts about Reddit and Analytics.


Previously - we covered the nature of the dataset, read histograms, generated segments, and understood that the most frequent users of Reddit are the ones who are doing the most downvoting by an astounding margin.


But wait, there's more.

Recall, however, that there were over 7 million votes cast. 1.8 million were downvotes, and 5.5 million were upvotes. Read the statistics table below to verify that.


Takeaways:
  • Upvotes outnumber downvotes.
  • The interface of Reddit itself causes upvotes to accumulate.
  • Reddit itself is a cause of a bias - probably by design.
The histogram below is by links - the content getting upvoted or downvoted. There were just over 2 million links submitted. On average, each link received 3.62 votes. Given everything you know about long tails, think about just how deceptive that 3.62 mean figure is. Note how you can't even see the bumps in the tail. And be in awe of the efficiency of the collective Reddit behavior that causes popular content to be disproportionately promoted while even 'good' or 'average' content gets relentlessly shifted to the left - all by a very small group of people.
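The 3.62 figure can be reproduced from the dataset totals quoted in Part 1 (7,405,561 votes over 2,046,401 links). A quick check in Python, which suggests the mean counts every vote a link received, up or down:

```python
# Dataset totals as quoted in Part 1 of this series.
total_votes = 7_405_561
total_links = 2_046_401

# Mean number of votes per link across the whole sample.
mean_votes_per_link = total_votes / total_links
print(round(mean_votes_per_link, 2))  # 3.62
```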

Takeaways:
  • The long tail is long and powerful.
  • This small group of Power-Paulines is far more likely to downvote because of a much higher frequency of use.
I'm thanking Reddit for making so many APIs publicly exposed and enabling this sort of analysis and exploration. Thank you.


***

I'm Christopher Berry.
I tweet about analytics @cjpberry
I write at christopherberry.ca

Thursday, February 9, 2012

Who's Downvoting You On Reddit (Part 4)

This is the fourth in a series of five posts about Reddit and Analytics. The complete thread will be posted at the end of the week.

Previously - we covered the nature of the dataset, read histograms, generated segments, and examined them. 

Putting a bow on it

The chart below summarizes the relationship between each segment and its average vote. You can see a clear negative direction. The more one uses Reddit, the more one downvotes - even if the mean is exaggerated in the Power Pauline segment.



To really hammer the point home about the origin of downvotes, take a look at the table below. It's broken out by the segments you understand. It also contains two new variables - upvotes and downvotes. That is the total count of the number of upvotes and downvotes made by each segment.


Takeaways:
  • One-Time-Olivers as a group were responsible for 175 of all the downvotes cast.
  • Vanity-Vanessa's as a group were responsible for 1781 of all the downvotes cast.
  • Average-Andy's as a group were responsible for 13,258 of all the downvotes cast.
  • Frequent-Freds as a group were responsible for 120,758 of all the downvotes cast.
  • Power-Paulines as a group were responsible for 1,672,368 of all the OBSERVED downvotes cast - but are probably responsible for a lot more in aggregate across all of Reddit. (This sample contains a bias, but bias doesn't mean I can't say anything at all about anything.)
Note the differences in order of magnitude between each group. 1781 is roughly 10 times greater than 175. And so on, a bit imperfectly, on the way up through Frequent-Freds to Power-Paulines. There's roughly an order of magnitude difference at each step in terms of the amount of weight each group casts.
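To check the order-of-magnitude claim, here is a short Python sketch that takes the five downvote counts listed above and computes the step-to-step ratios and the Power-Pauline share. Only the counts come from the post. The variable names are mine.

```python
# Downvote counts per segment, copied from the list above.
downvotes = {
    "One-Time-Oliver": 175,
    "Vanity-Vanessa": 1_781,
    "Average-Andy": 13_258,
    "Frequent-Fred": 120_758,
    "Power-Pauline": 1_672_368,
}

total = sum(downvotes.values())
names = list(downvotes)

# Ratio of each segment's downvotes to the segment just below it.
ratios = {cur: downvotes[cur] / downvotes[prev] for prev, cur in zip(names, names[1:])}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the previous segment")

# Share of all observed downvotes cast by the heaviest users.
power_share = downvotes["Power-Pauline"] / total
print(f"Power-Pauline share of observed downvotes: {power_share:.1%}")
```

The ratios hover around ten without being exactly ten, which is the "a bit imperfectly" part.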

The greatest power users of Reddit are the ones who are downvoting you - and it's an exponential power.

I'll be posting some implications and the 'so-what' tomorrow.


Wednesday, February 8, 2012

Who's Downvoting You On Reddit (Part 3)

This is the third in a series of five posts about Reddit and Analytics. The complete thread will be posted at the end of the week.

Previously - we covered the nature of the dataset, read histograms, and generated our segments. Now we're going to examine each segment individually.

One-Time-Olivers

There are very efficient ways that statisticians quickly summarize and understand the relationship among variables. The aim here isn't to be efficient - but to be clear. In that spirit, I give you the histogram below.

Takeaways:
  • All 4877 One-Time-Olivers voted exactly one time.
You should lol. It makes sense though, right? And, the segment name should make a lot more sense.
The histogram below summarizes how, on average, One-Time-Olivers voted - positive or negative. Since they only voted one time, it's either an upvote, or a downvote. A +1 or -1 average.


 Takeaways:
  • One-Time-Olivers tend to upvote once, and are never heard from again.
  • In answering the question - "Who's downvoting you on Reddit", it isn't One-Time-Olivers.
Vanity Vanessa

Vanity accounts frequently enter Reddit, they flicker, and they go out. They get discouraged. They never really commit to the bit. That's what happens to them. The histogram below takes on that familiar long-tail curve.

 
Takeaways:
  • There are a lot of Vanity-Vanessa's, some 7,527 of them.
  • Most of them voted only 2, 3, or 4 times.
So, how did they vote?

The histogram below summarizes the story:

 
Takeaways:
  • Vanity-Vanessa's upvoted nearly everything they saw, with very few exceptions.
  • Very few persistently downvoted everything they saw.
  • They're not the ones downvoting you on Reddit.

Average-Andy's

Recall that the average username voted 234 times, and yet I still labeled the segment ranging between 9 and 48 votes as Average-Andy.

That's because the mean number of votes that Average-Andy's cast is 22.25 - which is close to the median of 20 for the entire set.

This mixing and abstraction of median, mean, and segmentation isn't something that I expect most people to consider or think about, but I can foresee some getting hung up on it. When you think about an equal segmentation though, it makes sense that the mean of your middle category should be close to the median of the entire set.

For everybody else - just know that you're looking at the "average joe redditor" here.

Takeaways:
  • Average number of votes is 22.25, close to the median of 20 for the whole set.
  • Familiar long tail.
How do they vote?
 
Takeaways:
  • A majority of Average-Andy's liked everything they saw - they upvoted everything.
  • They downvote more often than Vanity-Vanessa's or One-Time-Olivers, but not massively.
  • They aren't downvoting heavily enough to say that these are the ones downvoting you on Reddit.

Frequent Fred

By now you're pretty much a pro at reading these histograms. Frequent Fred's vote frequently. Look at the histogram below.


Takeaways:
  • Classic long-tail continues.
  • Averaging 139.3 votes.
  • The unusual bump at the beginning of the series is just magnified by the scale from the previous vote frequency histogram. (It's fine).
How do they vote?
 
Takeaways:
  • Far fewer of them are likely to upvote absolutely everything they see.
  • There's significant flattening of the long tail - the average is .74.
  • More of them, on average, are disposed to downvoting.
Power Paulines

Power Paulines are the most difficult group to analyze, but the easiest to summarize and understand. Take a look at the histogram below.

Takeaways:
  • The long tail is holding - there's significant clustering at 1000 and 2000.
  • The cause is related to rate limiting within the Reddit API.
  • The longest part of the long tail - those power users with thousands and thousands of votes, are all bundled and clustered together at 2000.
  • There are around 500 of such power users, representing some 1.5% of the total usernames.
So how do they vote?

 Takeaways:
  • The bump at 0 is caused by 1000 upvotes getting averaged out by 1000 downvotes.
  • 0's aside, which are tugging on the mean, Pauline's are on average more prone to downvoting.
  • Power Paulines are downvoting you on Reddit.
Tomorrow we're going to put a bow on it and bring it all together.


Tuesday, February 7, 2012

Who's Downvoting You On Reddit (Part 2)

This is the second in a series of five posts about Reddit and Analytics. The complete thread will be posted at the end of the week.

Recall that the average of all the votes a username made is called 'averagevote'. If somebody was persistently downvoting links, they'd have a negative number. If they upvoted everything they saw, they'd have an averagevote of +1.

Read the histogram below.   
The three takeaways are:
  • Negativity follows a reverse long tail. (It really happens - see how the figures fall away to the left)
  • On average, usernames upvoted what they saw (average 0.79).
  • There are bumps at 0 (related to a methodological note) and at -1.
By now, two of my good friends in London are screaming at the screen. Means are a horrible way to explain long tail distributions. You can see that now too. Means are giving us a pretty skewed view of the world.
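To see why a mean misleads on a long-tailed variable, here is a small Python sketch. The numbers are made up for illustration and are not drawn from the Reddit dataset.

```python
from statistics import mean, median

# Made-up long-tailed vote counts: many light users, a couple of very heavy ones.
numberofvotes = [1, 1, 2, 3, 5, 8, 20, 40, 1000, 2000]

print(mean(numberofvotes))    # 308
print(median(numberofvotes))  # 6.5
```

Two heavy users drag the mean far above what a typical username in this toy list actually did. The median barely notices them.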

The table below is a byproduct of our Frequency table. It's aptly labeled 'Statistics', and compares these two variables, numberofvotes and averagevote, side by side. I've thrown a yellow box around 'percentiles'. Recall the cumulative column from the previous frequency table.
  • 22.8% of all usernames voted 2 times or less.
  • 40.8% voted 9 times or less.
The program I'm using is giving me 'break points' for those percentiles.


Two takeaways:
  • The median gives a better summary of what's going on here - half of the usernames voted 20 times or less, and, another set of usernames always upvoted what they saw.
  • If I know that roughly 80% of all usernames voted 325 times or less, then I know that roughly 20% of the usernames in my sample voted more than 325 times.
We're going to use those percentile cutoff points to inform a segmentation, next.
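For readers who want to produce break points like these themselves, here is a minimal Python sketch using the standard library. The data is made up, and the interpolation rule may differ from the one used by the statistics program in this post, so the cut points from real data could come out slightly different.

```python
from statistics import quantiles

# Made-up vote counts per username, not the real dataset.
numberofvotes = [1, 1, 1, 2, 3, 4, 6, 9, 12, 20, 25, 48, 60, 150, 325, 900, 1000, 2000]

# n=5 asks for the four cut points at the 20th, 40th, 60th and 80th percentiles.
cut_points = quantiles(numberofvotes, n=5, method="inclusive")
print(cut_points)
```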

Segmentation

A segmentation is a grouping of records, usually people, into categories. There is no prescription for how to do this. If you talk to a modeller, they'll tell you about their clustering algorithms. If you talk to a machine learning scientist, they'll tell you about bump-hunting or unsupervised machine learning clustering. Those are all very good algorithms. I use them myself.

I'm going for simplicity here. I have these four percentile cut-off points that evenly cut people into five categories. And, for further simplicity, instead of referring to a group of people who voted between 9 and 48 times as 'those who voted between 9 and 48 times', I'm going to call them Average-Andy's. And I'll just keep on calling them that.

At this point, I don't know if they're male or female. (And we won't in this thread). And it's controversial to use alliteration. But it's done.

So, mapping the percentiles against a segmentation, based on how many times a username voted, we have:
  • 1 time: One-Time-Oliver
  • 2 to 9 times: Vanity-Vanessa
  • 9 to 48 times: Average-Andy
  • 48 to 325 times: Frequent-Fred
  • More than 325 times: Power-Pauline
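The mapping above can be sketched in a few lines of Python. The ranges as listed share their end points (9 and 48 each appear twice), so this sketch assumes each upper bound is inclusive, in line with the "9 times or less" and "325 times or less" wording earlier. The function name is borrowed from the 'equalseg' variable. Everything else is illustrative.

```python
from bisect import bisect_left

# Assumption: each cut point is the inclusive upper bound of its segment.
UPPER_BOUNDS = [1, 9, 48, 325]
SEGMENTS = [
    "One-Time-Oliver",
    "Vanity-Vanessa",
    "Average-Andy",
    "Frequent-Fred",
    "Power-Pauline",
]

def equalseg(numberofvotes):
    """Return the segment name for a username's vote count."""
    return SEGMENTS[bisect_left(UPPER_BOUNDS, numberofvotes)]

print(equalseg(1))    # One-Time-Oliver
print(equalseg(26))   # Average-Andy
print(equalseg(326))  # Power-Pauline
```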
Take a look at the result below - a variable I'm calling 'equalseg' - short for 'equal segmentation'.


Takeaways:
  • There are 4877 One-Time-Olivers, representing 15.5% of the usernames in the sample.
  • Vanity-Vanessa's represent 23.9% of the usernames.
  • The last three segments are pretty equally divided - the first two are more lopsided.
Even though I aimed to have five groups of people with equal numbers in each, you can see the division between One-Time-Olivers and Vanity-Vanessa's is off. This happens very often when segmenting a long tail into equal groups. And, while not ideal, it's okay for our purposes.

Tomorrow we'll examine each segment individually and look at its voting characteristics.


Monday, February 6, 2012

Who's Downvoting You On Reddit (Part 1)

This is the first in a series of five posts about Reddit and Analytics. The complete thread will be posted at the end of the week.

So who keeps on downvoting you on Reddit? We'll find out.

But first - three notes:
  • You may be familiar with Reddit. If you're not - you can read this explanation about what Reddit is.
  • To answer that question, I downloaded a dataset that was built in early 2011 or very late 2010. The dataset is a 29MB gzip-compressed file and contains 7,405,561 votes from 31,927 users over 2,046,401 links. You can read about the methodology here.
  • The file contains three columns - a vote, a userid, and a link. Only people who had their privacy settings set to open had that data read by an API. There is no meta-data about who these people are in real life (IRL) or even what was the nature of the content they were upvoting and downvoting.
So, who's downvoting you on reddit?

To find out, I took that huge file and transformed it into another one - boiling it down into a single username, how many times that username voted (numberofvotes), and the average of all their votes (averagevote).
You can see below that _mike voted 26 times, and, if you take the average of all his votes, +1 for an upvote and -1 for a downvote, it turns out to be -.92. Basically, _mike didn't like a lot of what he saw. In fact, _mike upvoted once (+1) and downvoted 25 times (-25). So (- 25) + (+1) is -24, and -24/26 is -.92.
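Here is a minimal Python sketch of that boiling-down step. It assumes rows in the (vote, userid, link) layout described above. The link ids are invented, and only _mike's 1 upvote and 25 downvotes come from the post.

```python
from collections import defaultdict

# Hypothetical rows: one upvote and 25 downvotes by _mike, with made-up link ids.
rows = [(+1, "_mike", "link0")] + [(-1, "_mike", f"link{i}") for i in range(1, 26)]

def summarize(rows):
    """Return {userid: (numberofvotes, averagevote)} from (vote, userid, link) rows."""
    totals = defaultdict(lambda: [0, 0])  # userid -> [vote count, sum of votes]
    for vote, userid, _link in rows:
        totals[userid][0] += 1
        totals[userid][1] += vote
    return {user: (count, total / count) for user, (count, total) in totals.items()}

summary = summarize(rows)
numberofvotes, averagevote = summary["_mike"]
print(numberofvotes, round(averagevote, 2))  # 26 -0.92
```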


There are over 30,000 usernames here - and that's a lot of data. It's really important to visualize the data before you really get into any analysis. One way to do that is to run a histogram.

To read the histogram below, remember:
  • Frequency means 'the number of usernames that fall into this category or range'.
  • Numberofvotes means 'the number of times a username voted.'
  • Mean is another word for average.


There are three takeaways from the histogram above:
  • The average number of votes by a username was 234.
  • A large number of usernames didn't vote very many times at all.
  • There are bumps at 1000 and 2000 votes. (If you're interested as to why - see the Methodological notes. Incidentally - this is why you should always visualize your data.)
A histogram is built from a Frequency Table, which we'll see below.

The way to read a frequency table is:
  • The 'Valid' column means 'how many times a username voted'.
  • Frequency means 'the number of usernames that fall into this category'.
  • Percent means 'the percentage of all the usernames that those in this category represent'.


There are three takeaways from the Frequency Table above:
  • 4877 of the usernames only voted one time (It's likely they submitted a single link and never returned).
  • Note how both the percentages and number of usernames in each category decrease.
  • 50.1% of all the usernames voted 20 times or less. (Look at the cumulative percent column and make sure that makes sense to you. We're going to use this column later.)
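If you'd like to build a frequency table like this one yourself, here is a small Python sketch. The vote counts are made up. The columns mirror the ones described above, plus the cumulative percent.

```python
from collections import Counter

def frequency_table(numberofvotes):
    """Rows of (value, frequency, percent, cumulative percent), sorted by value."""
    counts = Counter(numberofvotes)
    total = len(numberofvotes)
    table, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        table.append((value, counts[value], 100 * counts[value] / total, 100 * running / total))
    return table

# Made-up vote counts for ten usernames, not the real dataset.
sample = [1, 1, 1, 2, 2, 3, 5, 8, 20, 400]
table = frequency_table(sample)
for value, freq, pct, cum in table:
    print(f"{value:>5} {freq:>3} {pct:5.1f} {cum:6.1f}")
```

The last column is the running total of the percent column, which is why it ends at 100.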
You may have heard the term 'long tail' many times before. This is a demonstration of what that means. The bars on the histogram fall away to the right.

Tomorrow we'll look at the distribution of votes and do a segmentation.


Friday, February 3, 2012

Building A Data Science Team: Kurt Schrader's perspective

Kurt wrote an excellent post about building a data science team. It's worth reading.

To expand on his points:
  • The first 90 days provide fuel for the subsequent 180.
  • The 180 days after are far muddier, because what was scaling in very unsophisticated interfaces requires a lot more work to become elegant solutions.
  • Data scientists should evangelize evidence and do what they can to develop interfaces that democratize the data. The math is a means to the end.

Own reflections:

  • I'm extremely thankful for my years of experience with Information Architects and Designers - as now - when I go into a room and they're not around, I actively think about that end state.
  • I'm glad I've spent as many years with developers as I have. I have newfound appreciation for simplexity and complicity.
  • Data is inertia. Foresight pays.


Thursday, February 2, 2012

The Usability of Histograms

A fellow data scientist and I were debating how to answer a very specific question that is asked all the time by others. How would we answer it? I grabbed a piece of paper and drew a histogram.

A histogram:
  • Plots a single variable along the X-axis.
  • Plots the occurrence, or frequency of a given variable along the Y-axis.
  • Is used by statisticians and analysts to understand the frequency distribution of a given variable.
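As a rough illustration of those three points, here is a small Python sketch that bins a single variable and prints one bar per bin. The values and bin width are made up, and the function name is mine.

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Map each bin's lower edge to the number of values falling in that bin."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))

# Made-up values of a single variable.
values = [1, 2, 2, 3, 7, 8, 12, 15, 31]
width = 10

histogram = text_histogram(values, width)
for start, freq in histogram.items():
    # The bin range stands in for the X-axis. The bar length is the frequency.
    print(f"{start:>3}-{start + width - 1:<3} {'#' * freq}")
```

Empty bins are simply skipped here. A charting tool would normally draw them as gaps.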

I said: "This is how I would want to see the data. This is how I answer the question today. This is what I would want to compare." Then paused. Reflected. And added, "I am not the end user."

The end user isn't a statistician, marketing scientist, or an analyst. Histograms aren't encountered in everybody's everyday life. Uncertain if I was being elitist, I turned around and asked Twitter what they thought of it. And I got a mixed response.

  • Some thought that anybody would be able to understand it. 
  • Some thought that it required too much mental processing. 
  • Some thought that everybody should know how to read a histogram.

So, instead, we went about asking about the easiest way to communicate a distillation of what a histogram says. What is it that we see in a histogram that answers the question? How do we subtract all the thinking?

I believe that people should be able to read a histogram. I believe that people would be better off if they had a subset of data expressed to them in the format of a histogram. I believe people would make better decisions.

But those are opinions and unbridled optimism. They're very normative statements.

I don't think that histograms are usable from a mass usability perspective because they cause too much thought.

Don't make me think.

Don't make them think.


Wednesday, February 1, 2012

Don't make me think (and other lessons for web design)

"Don't Make Me Think" by Steve Krug is one of my favourite books. I strongly recommend it to web analysts and data scientists.


In that spirit - here are a few of my favourite interfaces:








Commonalities:
  • Real choices about what to put in and leave out were made - in other words - they are designed. They were not assembled.
  • Not every surface is crammed with stuff. Just because nature abhors a vacuum doesn't mean you need to cram something into every pixel.
  • It's obvious what everything does.

Simple can be functional.

What are your nominations?


Tuesday, January 31, 2012

Time: The neglected variable in Marketing Return On Investment questions

What's the Return On Investment on Marketing?

Depends on how soon you want your return. Time is frequently a neglected variable.

Recall that marketing had a schism right around 1920:

  • One man went on to found the branding agency, and found salvation through broadcast radio, and later, TV. 
  • One man founded the first direct advertising agency, and continued to find salvation through direct response and cataloging. 
  • The schism only really came to a head when digital forced it to come to a head.
Implications:
  • Evidence for a direct causal inference between marketing treatment and marketing conversion is greatest at the point of sale / point of conversion.
  • Any evidence of causality is severely diluted at the branding / awareness level at the earliest portions of the customer funnel / fish / cycle.
  • It follows that direct mail people overestimate their impact, and underestimate the impact of branding.

Remember that:

  • There is a lag between initial treatment, customer acquisition, and return.
  • The longer the lag, the more opportunities for noise and collinearity to creep into your models.
  • Skepticism expands as the time lag expands.
Marketing is a system. Time is a factor in that system. The biggest conflict in brand modelling is how long that time horizon is.

That doesn't mean that anybody is wrong.

Just be aware that it's a factor. And that it's a good factor.


Monday, January 30, 2012

Apple, Animals, Analytics


You may have read a lot about Foxconn last week.

TL;DR summary:
  • Foxconn is the subcontractor that makes the iPhone and iPad.
  • Foxconn's CEO called his workers animals.

Here's the key tl;dr quote, from a current Apple executive:


  • “We don’t have an obligation to solve America’s problems. Our only obligation is making the best product possible.”

That's a lot of focus. That's laser-like precision on a given mission. Because it follows that if Apple produces the best product possible, people won't care about anything else.

Indeed, isn't s/he right?


Don't the free market and price competition make hypocrites of us all? (Two-Buck-Chuck anybody?)

What if it didn't have to?

Implications for Analytics Practitioners:

  • NRCan runs the Energuide program which assigns a single, real number to most major appliances expressing how much electricity it uses over the course of a year, and in so doing, it makes the previously invisible, visible, to the consumer.
  • Is there some mechanism by which we could assign the unseen human cost of a given product and make it visible to the consumer?
  • How could the invisible be made visible, so that at the very least, our collective market can generate preferences that are not based on market price alone?

I make no normative judgements about industrialization in other countries, nor, a statement about the working conditions that people escape when they transition from subsistence agriculture into industrialization. I'm only asking if an analytical solution can drive an incremental disruptive dimension of consumer preference. ILO standards already exist, and have existed for decades. It's a problem-solution space that I know analytics practitioners have unique insight on.



Friday, January 27, 2012

It's possible to discern exactly what film someone is watching by analysing the power consumption of their TV

You may be familiar with smart grids, open data, and Pachube.

I was reading a piece on smart data, when suddenly a wild quote appears:

“...a group of hackers who demonstrated in early 2012 that it is possible to discern exactly what film someone is watching by analysing the power consumption of their TV via their smart meter, as every film has a unique ‘fingerprint’ of electricity usage.”

Oh yes. Confirmed. It happened at a hacking for privacy event.

Reactions and Questions:
  • What an unintended consequence of the technology!
  • What other hidden signals might there be in other sources of data?
  • What good might come of re-purposing seemingly noisy/garbage data?

Just as William H. Perkin discovered purple dye in waste coal tar, who else might discover useful relationships in wasted data?

Another reason why it pays to be curious and creative.

(Your cable company knows more about what you watch - so the specific privacy risk - while interesting, is really quite theoretical in nature.)


Thursday, January 26, 2012

The Productivity Trilemma, Average is Over, and Analytics

Thomas L. Friedman wrote a fairly good piece for the New York Times. The theme is linked to something that has kept public policy makers awake for a very long time - the Productivity Trilemma. These two themes explain part of the reason for the rise of Data Science and how Web Analytics must evolve.

To summarize Friedman:

  • The era of average people relying on doing an average job for average pay is over.
  • Technology is more efficient than ever at destroying average jobs.
  • Everybody has to get smarter.

To summarize the Productivity Trilemma:

  • Productivity growth causes growth in GDP, producing negative employment effects.
  • Real interest rates outpace real growth rate of GDP, causing regressive redistribution effects, leading to the impoverishment of debtors and the enrichment of creditors.
  • Governments attempt to keep employment high through deficit spending to compensate for employment effects, enriching creditors and ultimately impoverishing all debtors.
Policy theorists have known of this problem since the late 1990s (cited). I recall a paper from Europe dating to the 1970s though. We know this problem exists.

The implication for web analytics and data science:

  • Automation technology is destroying manual data entry and dashboarding positions.
  • Get smarter, get creative, and get into experimentation-as-a-value-add. Do it now.
  • Data Science is a creative outgrowth of BI and Web Analytics, maneuvering directly to be the destroyer of not just manual entry, but any thinking at all.
I can't solve the Productivity Trilemma. That's something for all of society to decide. In the meantime, the best defense against these forces is a very aggressive offense.

Wednesday, January 25, 2012

SOPA protest was a 'watershed event'

According to Chris Dodd, the response against SOPA was unlike anything he's seen in his thirty years in politics. He called it a 'watershed event'.

Possibly.

Proponents of SOPA argue:


Opponents of SOPA argue:


Proponents want to believe that somehow Google made me oppose SOPA. What should be of even more concern to Chris Dodd was that Google had very little incremental effect. Their contribution to the movement was weak compared to what the real grassroots did.

There was no astroturf:

  • I learned of SOPA from one of the image boards.
  • It led to a slow-moving Reddit bloom a few days later.
  • We all gathered our pitchforks and went after GoDaddy.
  • Another Reddit bloom encouraging blackouts ensued.
  • The cost of incremental votes lost outweighed the benefits of incremental lobby dollars.
The pattern was obvious to anybody watching the situation unfold.

The real test of whether this is a watershed event is vigilance. Chris Dodd will try again. Hopefully they'll have a plan for absorbing their own costs for their own enforcement, but with fewer externalities and 99% less rent seeking. They probably won't.

Piracy has to be addressed. The entire business model needs revamping. There's no doubt. How it gets addressed, with as little damage to other sectors as possible, is the real question.

I hope that this has awoken Gen Y. The pattern of ignorance, I hope, has been broken now that something real and concrete in their lives has been attacked.



Tuesday, January 24, 2012

Changes to Google's Privacy Policy

Google has announced changes to its privacy policy.

TL;DR:

  • They're rewriting it into human language.
  • Nothing about your Google Analytics account data changes.
  • "The only change for Google Analytics users under the new privacy policy is that now, information about how you interact with the Google Analytics interface may be shared with our other products."

Implications:

  • If the product is free, you are the product.
  • Software-As-A-Service (SaaS) analytics on SaaS users (meta-meta) is a major input in the product development lifecycle, so you can expect Google products to get better.
  • This paves the way towards a single Google Center of Excellence for internal SaaS Analytics.

Predictions:

  • Initial uproar.
  • Diminishing interest.
  • Business-as-usual.

Carry on.

Differentiation through negativity

There's a big difference between skepticism and blind negativity. It's through negativity that many experts attempt to differentiate themselves from a herd. Expertise is often some sort of competition - a game by which some people are more expert than others. Over time, that negativity can accumulate in a community, causing stasis and then retreat.

Skepticism:
  • The sample size involved seems awfully low. We need more evidence that this relationship holds up before declaring that this is a natural law of marketing.
  • The author didn't consider a few factors from prior work in this field - probably a genuine oversight on their part - so I'd like to see this report replicated with those factors to see the link.
  • If I accept the author's assumptions, then yes, the conclusions are logically sound. However, I have a problem with the realism of one particular assumption. As a result, the model might not be predictive of events in the following sets of circumstances.

Negativity:
  • The study is stupid because correlation isn't causality - you can't ever say that these factors cause this to happen. It's impossible to prove, so we shouldn't try. And you're an idiot for trying.
  • I disagree with one point, so the entire thing just falls apart.
  • I disagree, therefore, the entire study is invalid.


It's good to be skeptical.

I've watched tremendous harm done because of negativity. I've watched it over and over again for the past fifteen years. It's always the same pattern.

People frequently choose to disassociate themselves from a spinning black hole of negative energy. What happens when experts don't have an audience to shout at?



Monday, January 23, 2012

Effective analytics is disruptive

Effective analytics is disruptive because being smarter causes smarter actions.

Organizations do not, and probably can not, change as rapidly as the intelligence suggests. This alone can be a massive source of frustration, both for the analytics professionals, and for other areas of management within the organization.

Three key questions to consider:

  • Small failure is likely and common within most organizations - are you comfortable with those getting surfaced?
  • Small success is likely and common within most organizations - are you more concerned with sharing the resulting insights instead of investing in assigning credit?
  • Do you have a system for change management and updating strategy?

Analytics shines a harsh light on previously dark corners. And yet, knowing what you don't know, and what you would do if you did know, is a healthy attitude towards the benefits of analytics.

Friday, January 20, 2012

Building an analytics team or a center of excellence

No fewer than four companies in Toronto are looking to build analytics departments. I'm excited for them.

A few points of advice as they move forward:

  • If you don't like the truth, you're not going to like analytics.
  • Effective analytics is disruptive and prompts change.
  • If you're not open to changing, then there's no point in being smarter. You're better off being dumb.
  • You'll hit a trough of disillusionment, usually because too many of the wrong people in the organization are looking for too many of the wrong numbers, getting too frustrated that nothing is telling them anything (that they want to hear or see), and that there isn't a transformation.
  • Some organizations never get out of that trough and give up. Plan for that and don't give up.

I wish them all the best of luck as they try.

Thursday, January 19, 2012

Why Marketing Science Isn't Physics

Marketing Science isn't Physics.

One of the great things about the Marketing Science community is the sane approach to assumptions. Unlike economics, marketing science aims to make reliable predictions about the world, just like the other grown up sciences.


Consider:
  • Physics is just called physics.
  • Chemistry is just called chemistry.
  • Biology is just called biology.
None of these three fields use the description 'science' to affirm that they're science. They just are 'science'.

Next, consider:
  • Marketing.
  • Politics.
These are two sciences which are very young - especially as compared to physics. Marketing Science is really only 50 years old.

Marketing Science has special problems that are created by the subject matter itself.

Consider:
  • The same laws of motion that put a rocket into space in 1962 applied in 2012.
  • The same commercial that caused mass awareness in 1962 would not be nearly as effective in 2012.
  • The same commercial that caused mass awareness about Taco Bell, in America, in 1962, would not be effective in introducing Taco Bell to Ecuador or Bolivia, in 2012.

Marketing science is particularly difficult because the phenomenon it studies doesn't stay put.

It fidgets. It evolves. It's highly susceptible to technological change.

We'd be a lot further along if we had a stable phenomenon and centuries to study it.

It just means that the types of laws and understandings we're all working to derive are that much more fluid and more dynamic. This feature of the object of study is the defining difference between physics and marketing science.

It's not any difference in the type of math or methods employed. (Marketing science is just as experimental and observational as physics.)



Wednesday, January 18, 2012

A Classification for Web Analytics Metrics

I believe this classification was first enunciated by Alex Langshur at eMetrics Toronto (2008). It's worth expanding upon.


Consider the following classification for web analytics metrics:

  • Pre-Click Metrics
  • On-Site Metrics
  • Post-Click Metrics

To unpack that:

  • Pre-Click Metrics refer to all activities that led to a visit to an owned digital property. (E.g., paid search keywords, referring domain, any traditional spend.)
  • On-Site Metrics refer to all the activities that can be observed on the site. (E.g., visits to specific pages, graph analysis / path analysis, time spent on site, and a host of very specific things like the nebulous world of engagement.)
  • Post-Click Metrics refer to all the activities that occur after the visit. (E.g., money getting transferred to your bank account, an email address getting opted-in by the user, and even the nebulous world of recommending the service to a friend.)

The advantages of this structure include:
  • A clear, simple, linear chain of causality from beginning to end.
  • Enables the separation of the On-Site metrics as its own intermediate variable.
  • Enables clean segmentation among visitor attributes: returning visitor, returning customer, geography, and so on.

The disadvantages of this structure include:

  • Is too linear, since it does not explicitly call out the impact of repeat visitation.
  • Is too linear, as it does not highlight the importance of loyalty and retention.
  • Is too linear, as it does not highlight the interplay between bricks and clicks.
If you draw an arrow from Post-Click back to Pre-Click, you can get a nice cycle going. There's no reason not to. It just complicates the classification.

Useful?



Tuesday, January 17, 2012

How are we behaving in a multi-medium media world?

I can tell you right now how I'm behaving in a multi-medium media world.


I have several tabs open as I write this:
  • Facebook - which contains a newsticker stream of activity.
  • RDIO - which is playing music, piped right into my ear.
  • Blogger - where I'm typing this.
  • A half dozen tabs I haven't visited for a few hours, or, I have yet to actively click upon.
I also have an Adobe Air App loaded:
  • Tweetdeck.
That's just one device - my MacBook.

I attended an INFORMS conference in 2008 where a particularly bright professor presented findings from a new type of diary. His findings suggested that multi-medium media consumption was the norm, i.e., many people reported listening to the radio while having the TV on while reading a magazine.

He argued for medium planning. I thought he was right and well ahead of the curve.

And this was pre-tablet.

My behavior can't be unique, but it certainly can't be common - yet. Or is it?

The challenge to this generation of marketing scientists and marketers is obvious.

Monday, January 16, 2012

Why seeing the distribution of data is important

SPSS, R, and Python (matplotlib) have very functional visualization libraries because seeing the data is vital, even when armed with statistical methods.

The chart below, called Anscombe's Quartet, illustrates why:

All four data sets return the same summary statistics:
  • The mean of x is 9 in every set (and the mean of y is about 7.5).
  • The correlation between x and y is 0.816 in every set.
  • They can be described by the best fit linear regression equation y = 3 + 0.5x.
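
The summary statistics listed above can be checked with a short script. What follows is a minimal sketch in plain Python rather than SPSS or R: the data values are Anscombe's published quartet, while the `summarize` helper and its field names are my own, written only for this illustration.

```python
from math import sqrt

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the means, Pearson correlation, and least-squares line for x, y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


for name, (x, y) in quartet.items():
    s = summarize(x, y)
    print(f"Set {name}: mean_x={s['mean_x']:.2f} mean_y={s['mean_y']:.2f} "
          f"r={s['r']:.3f} fit: y = {s['intercept']:.2f} + {s['slope']:.3f}x")
```

The four printed rows agree to about two decimal places, which is exactly the problem: nothing in them hints that the four scatterplots look completely different.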


It's important to visualize the data, even when relatively powerful summary statistics are available, because:
  • Outliers are common in most data, deserve special attention, and can cause very large skews.
  • You may need something a bit heavier than linear regression to predict the relationship between x and y.
  • Summary statistics sacrifice specificity for simplicity, and as such, are not substitutes for understanding.




Friday, January 13, 2012

Conversion is an anomaly

Depending on who you believe and the context, average site eCommerce conversion rates vary between 0% and 12%. That's not very helpful. In my own experience, defining conversion as number of completed checkouts divided by total number of site visitors, that rate varies between 0.20% and 2.00%.

That fact has important implications for analysis, bias, and making causal statements about what causes conversion.

Specifically:
  • When doing an experiment, the lower the conversion rate, the greater the number of visitors that are required to make a truthful causal statement that something causes conversion.
  • As a consequence, poorly converting sites that could benefit from experimentation the most are the most disadvantaged.
  • Methods that are more common in the machine learning community may actually be more appropriate than what we'd call 'traditional statistical analysis'.
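
To make the first point concrete, here is a rough sketch of how the required traffic grows as the conversion rate falls. It uses the standard normal-approximation sample size formula for comparing two proportions. The function name, the 20% relative lift, the 5% significance level, and the 80% power are all assumptions chosen for illustration, not figures from any particular site.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH arm of a two-arm test.

    Normal approximation for a two-sided test of two proportions.
    All defaults are illustrative assumptions.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Compare the two ends of the 0.20% to 2.00% range quoted above.
for rate in (0.002, 0.02):
    n = visitors_per_arm(rate, relative_lift=0.20)
    print(f"{rate:.2%} baseline: about {n:,} visitors per arm")
```

Under these assumptions, the site converting at 0.20% needs roughly ten times as many visitors per arm as the site converting at 2.00% to detect the same relative lift.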

As A Result:
  • If the traffic to a given site is low, it is even more important to test big things that matter, than it is to fiddle with something likely to be trivial. Take big risks.
  • It is preferable to increase the efficiency of the site by converting visitors into customers than it is to incur high incremental costs from driving more unqualified traffic.
  • We may have more success if we treat conversion as an anomaly detection problem as opposed to a regression problem.

Thursday, January 12, 2012

Data Science is not defined by the tools it uses

Yesterday, I wrote:

"Many [Data Scientists] will find some of their peers co-opted by tools, as it's far easier to be religious about the merits of a tool over another one than it is to exert any sort of real leadership or independent thought."

To expand on that point:

  • Data science is results oriented - the tool is the means to the end - it isn't the end itself.
  • Arguing the merits of Cognos against SAS is akin to the chefs spending an entire episode of Bravo's Top Chef arguing whether a boning knife or a bird's beak knife should be used to cut a duck. (It doesn't make for good TV and it doesn't matter.)
  • The central tenet of good web design, and by extension product development, is 'don't make me think'. Great tools don't make you think. As a result, it's far easier to be a mindless proponent of something and to look no further.
I don't view the dumbing down of Data Science as inevitable or even a good thing.

Do you?



Wednesday, January 11, 2012

Baby Faced Stratans and Graybeard BI's

Steve Miller wrote of Data Science Maturity yesterday. It's a very good post.

To summarize:
  • He attended both Strata, a Data Science (DS) event, and Enzee, a Business Intelligence (BI) event, and noted just how young all the DS kids are, and how old all the BI adults are.
  • The DS kids come out of university armed with open source tools, the BI graybeards are all settled on enterprise tools.
  • He predicts that DS will merge with BI, largely as BI analytic data structures are unified under the BI banner and come to dominate organizations.

Editorial:

  • BI defines itself by tools whereas DS defines itself by methods and ends.
  • Many DS'ers will find some of their peers co-opted by tools, as it's far easier to be religious about the merits of a tool over another one than it is to exert any sort of real leadership or independent thought.
  • DS stands a chance of not becoming co-opted by tools because its practitioners are aware of what happened to BI and Web Analytics in previous generations - the real test is still to come.


Tuesday, January 10, 2012

The Three Reasons People Ask For Data, and The Three Questions Analysts Should Ask

There are three broad categories of reasons why people ask for figures.

  • They know what they know and need evidence to support what they know. (Convenient Reasoning)
  • They know what they don't know, and genuinely need objective evidence in support of or against somebody or something. (Decision Support)
  • They don't know what they don't know, and are looking for somebody to tell them what they should know, or what they should do. (Exploration)

There are three questions an analyst should ask whenever they get an incomplete request for data:

  • What problem are you trying to solve?
  • Who are you trying to convince?
  • What would you do differently if you had the evidence?

Editorial:

  • Not everybody who engages in convenient reasoning is evil - they're just building a business case - and conflicting evidence frequently isn't welcomed or viewed as helpful.
  • Not everybody who engages in decision support is indecisive or engaging in analysis paralysis - they have a hypothesis about the world, they have an inkling that something is likely to be true, but they're keeping an open mind.
  • Not everybody who engages in exploration is a pain in the ass. They're trying to understand the world better and might not even have a firm idea yet of the core problem they're trying to solve.

I've done my best here to be mutually exclusive and collectively exhaustive.

What do you think?
