Monday, February 28, 2011

Data Market

Have you heard about Data Market?

It is one of the largest (and free) curated repositories of public data.

Benefits:

  • It has internal search that doesn't suck, so you can find what you're looking for and get out.
  • It offers the ability to preview the data in tables and charts before you export.
  • It offers the ability to export in popular formats.
  • It's freemium (API and LIVE data have a cost).

Why am I excited about this?

These data sets are very clean, and some of the data has direct uses for analysts in their social-professional lives.

They're there, and you should register and check them out.

Monday, February 21, 2011

On Error

There are at least five types of error related to analytics: instrumentation error, algorithm error, statistical error, transposition error, and interpretation error.

1. Instrumentation Error

Instrumentation error occurs when the instrument measures a phenomenon incorrectly. This is not to be confused with a human misunderstanding what an instrument actually measures. Rather, this is when the instrument itself is recording only half of something, or not measuring something at all. It's akin to saying that the thermometer is broken.

Instrumentation has varying degrees of accuracy. For instance, the unique HTTP cookie is subject to error as a result of a deteriorating cookie retention curve. The instrument continues to work just fine - it's just that user behavior has changed, affecting its accuracy. To be clear, that's not the fault of most instruments that use HTTP cookies. The underlying behavior is making the instrument less useful.

2. Algorithm Error

You get an algorithm error when a calculation just goes wrong. Take, for instance, rate of change: the intent is ((present-past)/past)*100, but the terms sometimes get mixed up. Sometimes, in certain complex algorithms, unexpected (unintended) cases slip through, and they, in turn, are a source of error. It's akin to saying 'getting a rule wrong'.
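As a small illustration of how that formula gets mixed up, here is a sketch in Python. The function names and the sample figures (100 and 150) are made up for the example; the slip shown, dividing by the present value instead of the past one, is just one plausible way the rule goes wrong.

```python
def rate_of_change(present: float, past: float) -> float:
    """Intended rule: ((present - past) / past) * 100."""
    if past == 0:
        raise ValueError("rate of change is undefined when the past value is zero")
    return (present - past) / past * 100


def rate_of_change_mixed_up(present: float, past: float) -> float:
    """Hypothetical slip: the denominator is the present value, not the past."""
    return (present - past) / present * 100


# Illustrative figures only: 100 last period, 150 this period.
correct = rate_of_change(150, 100)
wrong = rate_of_change_mixed_up(150, 100)
print(f"intended: {correct:.1f}%  mixed up: {wrong:.1f}%")
```

Both versions run without complaint and both return a plausible-looking percentage, which is why this kind of error tends to survive until someone checks the rule by hand.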

3. Statistical Error

Statistical error is caused by the necessity to sample the world instead of absorbing it all. These types of error are imposed by the universe and by fairly bad random table generators. It's somewhat akin to saying that you're being fooled by randomness.
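A quick simulation makes the point concrete. This is a sketch under made-up assumptions: a population of 100,000 visitors with a true conversion rate of 2%, sampled 1,000 at a time. None of these numbers come from a real data set.

```python
import random


def sample_rate(population, n, rng):
    """Estimate the conversion rate from a simple random sample of size n."""
    sample = rng.sample(population, n)
    return sum(sample) / n


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 1 = converted, 0 = did not convert.
population = [1] * 2000 + [0] * 98000
true_rate = sum(population) / len(population)

# Five independent samples of the same population.
estimates = [sample_rate(population, 1000, rng) for _ in range(5)]

print(f"true rate: {true_rate:.3%}")
for estimate in estimates:
    print(f"sample estimate: {estimate:.3%}")
```

Nothing is broken in the instrument or the arithmetic here. The estimates differ from the true rate, and from one another, only because each sample happens to contain a different handful of converters.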

4. Transposition Error

Transposition error happens when a human, usually an analyst, has to manually move a set of figures from one medium or location to another. That is to say, from a tool into Excel, or from one area of Excel to another.

5. Interpretation Error

Interpretation error happens when a human, usually an analyst or a reader, believes a number to be something that it is not. For instance, confusing unique cookies for people.


The five types of error are omnipresent in analytics. It's best to acknowledge them and manage them as best you can.

Monday, February 14, 2011

The Productivity Curve and the 3/2 Rule

An excellent analysis done by Allan Engelhardt, back in 2006 I suppose, talks about the 3/2 rule of employee productivity. The Coles notes version is that when you triple the number of employees, you cut their productivity in half. Check out the diagram below.




Pretty scary, right? Naturally, the story is much more complex than portrayed. Some sectors have mild slopes, like technology companies. Arguably, they're using technology to flatten out the productivity slope. But it's still slightly negative.

Naturally, larger companies scale, so they still make more profit overall.
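To see how both statements can hold at once, here is a rough sketch. It assumes the 3/2 rule behaves like a smooth power law, which is my simplification rather than anything taken from Engelhardt's analysis, and it uses total output as a stand-in for profit.

```python
import math

# If tripling headcount halves per-employee productivity, then under a
# power-law assumption productivity scales as headcount ** EXPONENT.
EXPONENT = -math.log(2) / math.log(3)  # roughly -0.63


def relative_productivity(headcount_ratio: float) -> float:
    """Per-employee productivity relative to the starting size."""
    return headcount_ratio ** EXPONENT


def relative_total_output(headcount_ratio: float) -> float:
    """Total output relative to the starting size: people times productivity."""
    return headcount_ratio * relative_productivity(headcount_ratio)


for ratio in (1, 3, 9, 27):
    print(
        f"{ratio:>2}x the people: "
        f"{relative_productivity(ratio):.2f}x productivity each, "
        f"{relative_total_output(ratio):.2f}x total output"
    )
```

Under these assumptions each tripling halves what a single employee produces, yet the company as a whole still produces about one and a half times as much.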

Small companies are very good at doing many things. They become less good as they become large. And then ultimately, they stop being really, really good at anything at all.

There are many reasons for this. The ones I'm most familiar with have to do with networks and hierarchies.

A group of 8 people is efficient. A group of 16 people is less efficient. We grapple with the phenomenon of the "8000 dollar meeting" (that of feeling like there are far too many people in the room) by imposing a hierarchy on the 16 - anointing 2 to 4 people to meet, and then communicating out to the 12. In so doing, we lose detail, and importantly, understanding. Bottlenecks form. Inevitably, effective people form networks throughout the 16 in an active effort to regain efficiency. And so on it goes, up to 32, 64, and 128.
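One way to see why the doubling hurts is to count the possible one-to-one conversations in a group. The sketch below uses the standard pairwise-links formula, n(n-1)/2; treating every pair as a potential communication channel is my assumption, not a claim about how any particular team actually talks.

```python
def pairwise_links(people: int) -> int:
    """Number of distinct pairs in a group: n * (n - 1) / 2."""
    return people * (people - 1) // 2


for size in (8, 16, 32, 64, 128):
    print(f"{size:>3} people: {pairwise_links(size):>5} possible one-to-one links")
```

Doubling the group from 8 to 16 more than quadruples the number of pairs, from 28 to 120, which is the pressure that pushes groups toward hierarchy in the first place.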

Productivity is defined as the amount of output for a given amount of cost. Something is efficient if it costs little but brings in high revenue. So clearly, larger groups of people are less efficient on a purely communication basis - each added person carries an incremental communication cost. If the definition of planning is "preparation of the mind", then that loss of understanding is even worse. It takes time for people to talk through a plan, internalize it, and then go out and make the day to day decisions when the unexpected inevitably arises. That loss of understanding severely impacts both efficiency and effectiveness because it generates paralysis. Worse, it opens up degrees of freedom which can lead to misunderstandings and asymmetrical efforts.

A leader should be judged effective in part on their ability to reduce and minimize asymmetries - not exploit them.

It's definitely a social law that I'm cognizant of, and need to actively resist. I'm optimistic though, because fundamentally it's a human problem. And most human problems can be solved by humans.

Friday, February 4, 2011

Konrad von Finckenstein and subsidies

Konrad von Finckenstein, chairman of the CRTC, went before committee yesterday and made the remark:

"The vast majority of Internet users should not be asked to subsidize a small minority of heavy users."

I take issue with Finckenstein's statement.

For one, the vast majority of Internet users already subsidize a number of minorities. Urban customers, who are comparatively cheaper to connect with bandwidth, pay more to subsidize rural customers, who are comparatively more expensive to connect.

Didn't the CRTC pass fees last year, forcing the vast majority of us to subsidize the viewing habits of the small minority of people who watch the CBC, Flashpoint, and DeGrassi?

Isn't the CRTC mandating the subsidization of something called "Canadian New Media"?

Seems to me that Finckenstein is just fine subsidizing those groups.

(And I'm trolling in part. I happen to agree with rural subsidies for broadband access, even though I don't live or own in rural Canada.)

He seems to believe that heavy Internet users incur direct marginal costs to the major Internet Service Providers. And that these costs are HUGE! HUGE! Figures put out by Netflix suggest that this cost is 1 cent per gigabyte. I don't believe that it's 5 dollars/GB over a cap, as the actual marginal pricing the industry is trying to charge suggests. I don't believe that a 25 GB cap is justified on a cost basis. This isn't about cost.

It's about profit maximization and defense.
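To put the two per-gigabyte figures quoted above side by side, here is a small sketch. The 1 cent and 5 dollar figures are the ones cited in this post; the 100 GB of overage is a hypothetical month chosen only to make the arithmetic visible.

```python
# Amounts are kept in integer cents to avoid floating-point rounding.
COST_PER_GB_CENTS = 1        # per-gigabyte cost suggested by the Netflix figures
OVERAGE_PER_GB_CENTS = 500   # 5 dollars per gigabyte over the cap


def overage_bill_cents(gb_over_cap: int, price_per_gb_cents: int) -> int:
    """Bill, in cents, for usage beyond the cap at a given per-GB price."""
    return gb_over_cap * price_per_gb_cents


markup = OVERAGE_PER_GB_CENTS // COST_PER_GB_CENTS

gb_over = 100  # hypothetical overage for one month
billed = overage_bill_cents(gb_over, OVERAGE_PER_GB_CENTS)
at_cost = overage_bill_cents(gb_over, COST_PER_GB_CENTS)

print(f"overage price is {markup}x the suggested cost")
print(f"{gb_over} GB over the cap: billed ${billed / 100:.2f}, cost ${at_cost / 100:.2f}")
```

If both figures are taken at face value, the overage price is 500 times the suggested cost: a month that costs a dollar to carry would be billed at five hundred.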

I would normally applaud profit maximization. This isn't a normal situation. We're talking about duopolies and natural monopolies.

Information workers - those who are developers, analysts, and scientists - use a lot of bandwidth. They use a lot of bandwidth at home - part of being first movers and innovators. They use a lot of bandwidth when they bootstrap and start up. They do so because the era of cloud computing, telecommuting, and big data is built upon cheap memory and cheap bandwidth - especially residential bandwidth. These are high value industries to Canada.

I'm all for us paying an internationally competitive cost for the bandwidth we use. I'm not willing to be gouged and have my industry fail in a misguided attempt by the CRTC to retain traditional cable revenues. Nor am I willing to fund other subsidies that the CRTC supports while simultaneously denying that it supports them.

Konrad von Finckenstein vilifies innovators as unfair subsidy seekers. I don't see him vilifying himself when he hands out generous subsidies to other interest groups.

As a Canadian innovator, early adopter, and cloud data scientist, competing globally, I'm asking for non-punitive rates at the very least. This isn't an irrational position.