# Hunting for Simpson’s Paradox, part 1

Let’s say we happen to know the batting averages of two baseball players. Overall, player 1 has a higher average than player 2. However, if we consider only how each player hits against left-handed pitchers, we find that player 2 actually has a better average than player 1. In this hypothetical scenario, it turns out that player 2 also has a better average against right-handed pitchers. How is that possible?

Doesn’t it make intuitive sense that if player 2 is better than player 1 against left- and right-handed pitchers separately, he must be better than player 1 against all pitchers? While that may be what our intuition tells us, it turns out that it’s not necessarily true.

Consider the following table. Note that batting average is simply the ratio of a player’s hits to his at-bats.

This table presents exactly the hypothetical scenario described above. Separately, player 2 has a higher average than player 1 against both left- and right-handed pitchers, but overall player 1 has a higher average than player 2.

This phenomenon is commonly known in statistics as Simpson’s paradox.  It demonstrates how our intuition can get us into trouble.  Briefly stated, Simpson’s paradox occurs when a correlation or trend that is present in groups is reversed when the groups are combined.
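To see the arithmetic behind the reversal, here is a minimal Python sketch. The hits and at-bats below are numbers I invented purely for illustration (they are not taken from the table in this post or from any real player), and the names `splits`, `average` and `overall` are just my own labels.

```python
from fractions import Fraction

# Hypothetical (hits, at-bats) splits, invented for illustration only.
splits = {
    "player 1": {"vs LHP": (10, 50), "vs RHP": (90, 250)},
    "player 2": {"vs LHP": (55, 250), "vs RHP": (20, 50)},
}


def average(hits, at_bats):
    """Batting average as an exact fraction: hits divided by at-bats."""
    return Fraction(hits, at_bats)


def overall(player_splits):
    """Pool hits and at-bats across all splits, then take the average."""
    hits = sum(h for h, _ in player_splits.values())
    at_bats = sum(ab for _, ab in player_splits.values())
    return average(hits, at_bats)


for name, player_splits in splits.items():
    parts = ", ".join(
        f"{label}: {float(average(*counts)):.3f}"
        for label, counts in player_splits.items()
    )
    print(f"{name}: {parts}, overall: {float(overall(player_splits)):.3f}")
```

With these made-up numbers, player 2 is ahead in both splits (.220 vs .200 and .400 vs .360), yet player 1 is ahead overall (.333 vs .250). The overall average is a weighted average of the splits, and the weights differ: player 1 takes most of his at-bats in his stronger split, while player 2 takes most of his in his weaker one.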

I was recently reminded of Simpson’s paradox when @Math_Bits posted a link on Twitter to an article, “Instances of Simpson’s Paradox” by Thomas R. Knapp. It got me thinking. Sure, I can manufacture an example, and I’ve seen a few examples in papers, texts and even Wikipedia. But I want to find my own examples. And of course, manufactured examples like the table above don’t count. I need real data.

I figured the best place to start looking for real data that’s easy to find is sports, say baseball, while we’re thinking of it. I started simple and pulled up the first split data set I found: team statistics for the full 2010 season on Yahoo Sports (see http://yhoo.it/hKAiLn). I pulled the data into Excel and began (manually, ugh…) hunting for an example of Simpson’s paradox. Of course, I would start the hardest way possible. I looked at batting averages and made three lists, one for each category: overall, home and road. Then I sorted the teams from highest to lowest and began looking one by one for pairs of teams where one had a higher average overall but a lower average both at home and on the road.

… to no avail.  I even reversed the process by sorting from lowest to highest.

I’m beginning to believe that Knapp was right when he claimed that examples of Simpson’s paradox are extremely rare.

The next step was to automate this process. As a programmer, I began devising a simple program that takes an overall list and a list for each group and identifies all the pairs that exhibit Simpson’s paradox.
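As a rough idea of what such a search could look like, here is a short Python sketch. This is not the Matlab code from the next post. The function name `find_reversals`, the data layout (dictionaries keyed by team) and the three teams with their averages are all assumptions made up for illustration.

```python
from itertools import permutations


def find_reversals(overall_avg, group_avgs):
    """Return (a, b) pairs where a beats b overall but b beats a in every group.

    overall_avg: dict mapping team -> overall average
    group_avgs:  dict mapping group name -> dict of team -> average
    """
    pairs = []
    # Check every ordered pair of distinct teams.
    for a, b in permutations(overall_avg, 2):
        a_wins_overall = overall_avg[a] > overall_avg[b]
        b_wins_every_group = all(g[b] > g[a] for g in group_avgs.values())
        if a_wins_overall and b_wins_every_group:
            pairs.append((a, b))
    return pairs


# Hypothetical averages for three made-up teams, not real data.
overall_avg = {"A": 0.333, "B": 0.250, "C": 0.300}
group_avgs = {
    "home": {"A": 0.200, "B": 0.220, "C": 0.310},
    "road": {"A": 0.360, "B": 0.400, "C": 0.290},
}

print(find_reversals(overall_avg, group_avgs))  # → [('A', 'B')]
```

The comparison is brute force over all ordered pairs, which is fine for a league-sized list. It uses strict inequalities, so ties in any group are not counted as a reversal.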

In the next post, I’ll walk through the progress I made using Matlab to do my dirty work.