Launch, grow, and unlock your career in data

March 23, 2012

An Invaluable Tool – The Tableau Reshaper

This blog post is mostly a duplication of the Reshaping Data Made Easy” blog post from Tableau Software.  It’s recreated with the permission of Ross Perez.

You can download the reshaper here.

If you have been using the 4.1 version, there’s really not much difference.  The changes are mainly conveniences.  Andy Cotgreave did an excellent job upgrading the tool.

For more information on preparing Excel files for analysis, please click here.

Reshaping Data Made Easy

Submitted by Ross Perez on March 16, 2012 - 4:53pm

Today we are releasing a new version of our data reshaper add-in for Excel 2007-2010 (32- and 64-bit). For those who are not familiar with this tool, it’s a free add-in that helps you prepare your Excel files for analysis.

There are three main changes.

The biggest change is the addition of the “Open in Tableau” button. Press this and, hey presto, Tableau opens a new workbook connected to your current workbook.

Visually we have updated the UI so that it now has an Excel ribbon style interface making it easier to get to the buttons and generally making things a little easier on the eye.

The final change is the addition of 2 keyboard shortcuts to help speed up your process. The shortcuts available are as follows:
  • Ctrl+Alt+T: Open in Tableau
  • Ctrl+Alt+R: Reshape data
We hope that this new version of the tool will speed up your analytics using Tableau and Excel files.

Important: This tool is a free, unsupported, undocumented add-in from Tableau. Although it has been tested to work on 32-bit and 64-bit versions of Excel you should back up your data before using it.

March 16, 2012

Fixing Nielsen’s bubbles (and tips for effectively organizing and displaying data)


The chart below comes from a recently released study by Nielsen.  With some simple fixes, this data can communicate much more effectively.

I find this chart hard to read and interpret because:

  1. The data is not aligned vertically, making comparisons across categories for the same country difficult.  For example, your eyes are constantly pinging left to right to left trying to compare the UK values.  It kind of makes me feel like I’m in a tennis match.
  2. The bubbles are not sized according to their percentage, making comparing bubble sizes meaningless and inaccurate.  You can’t tell me that the orange Italy bubble for Downloaded Music is 1/3 the size of the US bubble.
  3. There doesn’t seem to be any logic to the order of the categories.  At first I thought they were sized by the US percentages, but that’s not it.  Maybe they’re ordered by the total?  Nope.  I have no idea!

There are two better alternatives for presenting this data.  First, if you like the bubbles, then a viz like this works. 


With this viz, it’s so much easier to compare values both across and down.  It’s easier to compare bubble sizes, you don’t have to lookup the colors since they’re organized in columns, and the bubble sizes are relative to each other.  Look at Downloaded Music in Italy now: 20% now looks like it’s a bit less than 1/3 the size of the US bubble (62%).

Note that I would normally have ordered the categories alphabetically, but I sorted them in the same order as the Nielsen viz so that you could compare the mine and their’s more easily.

A second alternative would be a simple bar chart like this.


This chart also addresses the comparison problems.  The gridlines make it especially easy to compare categories within the same country, though your eyes do have to skip over three other bars before getting to the next one.

Bar lengths are much, much easier to compare than bubble sizes, but the bar chart feels a bit more cluttered to me than the bubble chart.  In this situation, I would use the bubble chart I created. 

This goes to show that there’s more than one way to skin a cat.

March 10, 2012

Baseball Predictions: How good is the Marcel the Monkey Forecasting System?

No comments

If you haven’t heard of the Marcels, then you need to read up on it here. I can’t possibly summarize what the Marcels are better than their website:

The Marcel the Monkey Forecasting System (or the Marcels for short) is the most advanced forecasting system ever conceived.


Actually, it is the most basic forecasting system you can have, that uses as little intelligence as possible. So, that's the allusion to the monkey. It uses 3 years of MLB data, with the most recent data weighted heavier. It regresses towards the mean. And it has an age factor.

Yes, that’s it.  Don’t read too much into it.  The whole point of the system is its simplicity.  But you know what?  It does a decent job.

I took the Marcels forecast data (both batting and pitching) from 2001-2012 and matched it up with the “actual” data from Lahman for the same time period.  I was only interested to see how the Marcels performed, so I only looked at data where the year and player existed in both systems.  Get the complete set here.

I Tableaud the data to built this interactive viz (download the workbook here).  There are two tabs, one for batting and one for pitching.  You use them exactly the same way.

  1. Start by picking the stat you would like to view
  2. The viz at the top compares the Marcels prediction and the actual stat for all year/player records. 
    • Hover over a point to see the details (i.e., player, year, data, etc.)
    • Click on a point and the chart at the bottom will update with records for only that player
    • The points are color-coded by the prediction error (i.e., (Actual-Marcels)/Marcels)
  3. The charts at the bottom summarize all of the data for the stat chosen
    • The lines show the Actual and Marcels data across the years
    • The bars show the % error for the total year (red = under forecast, black = over forecast)
  4. If you want to analyze a specific player without having to hunt and peck in the scatter plot, simply pick him from the list
  5. Rinse and repeat with the pitching data

One of the features that I like best is that you can enter a stat minimum.  For example, you first pick HR for the stat, but you are only interested in seeing players that hit 30+ HRs in a season.  Enter 30 in the box and hit enter.  Voila!  The charts update.

Two overall themes stuck out to me:

  1. The Marcel over forecast nearly all of the “raw” batting and pitching stats…things like PA, R, H, HR, RBI, SB, W, ER, etc.
  2. The Marcels tend to forecast “calculated stats” very well, e.g., BA, HR %, OBP, SLG, OPS, ISO, ERA, WHIP

Finally, before you complain about the axes not being to the correct decimals for things like BA, OBP, etc. know that I’m using a parameter in Tableau from which I’ve built a calculation.  The best you can do, that I know of, to get close to the correct number formatting is to leave it set to automatic.  I don’t know of a way to force the format of the field to update based on the parameter selection.

March 9, 2012

Displaying Labels on Filled Maps Only When a Particular Criterion Has Been Met

1 comment
Andy Piper from Norfolk Southern, and long-time member and frequent presenter at ATUG, has created this fabulous tutorial.  From Andy:
The ability to display labels onto map locations (and more recently filled maps) has always been present -- simply drag and drop a field you wish to use as the label onto the Label shelf. However, in many circumstances there may simply be too many labels that appear on the map to obtain useful information.
This document will demonstrate how to display labels on a filled map only when a particular criterion has been met (without manually modifying your marks). A measure field called Demand is used in this example, though other measures such a sales, weight, population, etc. can easily be used.  This same process can also be useful when conducting sensitivity analysis.
Download the tutorial here.

March 4, 2012

Are teams benefiting from relievers pitching less? A visual analysis.


If you love baseball and particularly if you love baseball stats, you need to follow FanGraphs.  The depth of the analysis is simple incredible, but one of the things I find lacking is visual analysis.  There are often tables and some rudimentary charts, but I think the writing could be enhanced by adding some viz to the terrific explanations of the numbers.

Recently they wrote about the use of relief pitchers in Major League Baseball and whether adding depth to the bullpen resulted in a strong ROI.  In this post, I’m going to quote directly from the article, but all of the charts and graphs that supplement the words were created by me.

All of the data that I used can be found here and the Tableau workbook I used to created the charts can be found here.

Batters Faced per Game
“The change in bullpen usage is the biggest difference in the sport now compared to 30 years ago.”


“Despite the fact that modern bullpen roles have been well established for quite a while, the dwindling rate of batters faced per appearance shows no signs of slowing down. While the drop from 1982-1991 was the most extreme, the last two decades have each seen the league shed an additional half a batter per reliever appearance, and given that we’ve seen teams now expand to carrying 13 pitchers at times, there seems to be no end in sight to this trend.”

The article only provides a table and if the writer did not include the analysis in words, there’s no way anyone would have ever been able to identify this trend by scanning their eyes across a list of number. 

The chart above in broken down by decade by year and includes three methods for analyzing batters faced per game (BF/G). 

  1. BF/G (top lines) – This is simply a trend of batters faced per game over the last 30 years.  As the writer points out, the drop in the first decade is the most extreme (1.6 BF/G decline), but the last two decades have each declined more than a half batter (0.6 and 0.5 BF/G respectively).
  2. BF/G vs. 1982 (middle lines) – I wanted to understand how drastically the number of batters faced per relief appearance has really changed from 30 yrs ago.  The numbers and trend are truly staggering.  19.8% decline by 1991 and additional 12.3% decline by 2001 and another 7.5% decline through 2011.  That all adds up to an almost 40% decline. 
  3. BF/G vs. Start of Decade (bottom lines) – This is similar to #2 except the calculation “resets” each decade.  The idea here is to measure how much the BF/G rate has changed within the decade.  If the –11% trend from the last two decades continues, you can expect relief pitchers to be facing less than four batters per appearance by 2021.  So basically, every pitcher would be treated like a closer.

Wow!  Bullpen strategy sure has changed!

Walk and Home Run Rates
“With pitchers facing fewer batters, you’d expect them to be able to throw harder and exploit platoon advantages for better results overall. The trade-off should be more quality for less quantity.”


“Looking at the numbers, we don’t really see much evidence that the modern bullpen has helped relievers perform better at all.”

  1. “Over the last thirty years, walk rates by relievers are essentially unchanged. They went up a bit when the home run barrage took over the late-1990s, but have gone back down as home runs have become less common.” (top two lines)
  2. “The ratio of walks to home runs is pretty steady and consistent over the last thirty years, and there’s certainly no evidence that the modern day bullpen has helped pitchers avoid the base on balls.” (bottom lines)

Strikeout Rates
“On the other hand, strikeout rate has skyrocketed, increasing by 40% since 1982.  This would seem to support the idea that relievers can be more effective in shorter stints, and that playing the match-ups can help prevent run scoring.”


I have broken down the strikeout rates similarly to the BF/G rates.  

  1. K% (top lines) – This is simply a trend of strikeout percentage over the last 30 years.  K% has been on a steady increase of about 2-3% over each of the last three decades.
  2. K% vs. 1982 (middle lines) – As the writer noted, the strikeout rate has increased 40% since 1982, with the biggest increase from 1992-2001 of 18.4%.  But he also goes on to explain this:

    “While (starting pitchers’) strikeout rate has been raising at the same time that the modern bullpen has been evolving, this seems to be a case where correlation is not causation. If starters are seeing the same rise in strikeout rate, that points to a more fundamental shift among hitters – more sluggers swinging for the fences, the rise in acceptance of the strikeout as just another out among organizations – rather than a specific benefit being given to relievers from their new roles.”
  3. K% vs. Start of Decade (bottom lines) – Again, this is similar to #2 except the calculation “resets” each decade.  This provides stronger evidence of the “swinging for the fences” effect of the late-1990s; strikeout rates increased 19% from 1992-2001.

BF/G vs. K%
We’ve seen the write discuss BF/G and K% rates, but do these two have a relationship?  When I look at relationships between two stats, I like to look at them to ways: (1) a dual axis line chart and (2) a scatter plot.


The strikeout rate for relievers is clearly correlated to batters faced per appearance.  As BF/G goes down, K% goes up.  This is clear and easy to understand in both of these charts.  This would have been a nice nugget for the writer to include.

BABIP and HR Rates
“Likewise, it doesn’t appear that relievers are really generating much of a benefit when hitters do put the bat on the ball.”

The write makes a few notes about the stats, but I don’t agree completely.

  1. “Home run rates have risen at a similar rate as what starting pitchers have experienced.”  Ok, nothing to argue with here.  I have to take his word for it since I don’t have the data for starting pitchers.
  2. “Batting average on balls in play has increased significantly over the years.”  I subtly disagree here.  BABIP has only increased 11 points or 4%.  Is that significant?  I don’t think so.


Let’s extend the analysis a bit farther.  Let’s look again at the relationship between the two stats to find correlations.

When looking at the two lines together, there isn’t a clear correlation, like the obvious inverse relationship between BF/G and K%.  But what is interesting is when you plot them on a scatter plot.  I added the averages across all 30 years to each axis to make a nice quadrant chart.  The R-Squared is only 0.688, but what sticks out to me is how, for the most part, the years within each decade cluster together nicely (for the most part).

  1. Nine of ten years from 1982-1991 were below average for both BABIP and HR/9, with the tenth year also below the average BABIP.
  2. Eight of ten years from 1992-2001 were above the average BABIP with seven of those years also above the average HR/9 (remember, there was a significant increase in HRs in the late-1990s).  Note how much higher the BABIP and HR/9 rates are for the years above average.
  3. From 2002-2011, nine of ten years were above average for both BABIP and HR/9, but not nearly as far above average as 1992-2011.  Notably, the HR/9 rate fell from 1.03 in 2006 to 0.85 in 2011, a 17.5% decrease (this can be seen in the line chart).

ERA- and FIP-
“If you look at (ERA- and FIP-), there’s just no evidence that bullpens are preventing runs at a better rate now than they were before the current roster construction norms came along. Any improvements in quality of performance by the elite relievers have been offset by the fact that more innings are now being given to inferior arms, so the trade-off has essentially resulted in a change of no real benefit.”


If you truly trust the reader, then you’re only choice is to take him at face value here.  Me though, I like to “see” the data.  I’ve done a couple things here to quantify the data, but first, two notes.

  1. For ERA- and FIP-, values below 100% are better than the league average.  The lower the number, the better.  Think of them like an index.  If the ERA- is 95%, then that means it’s 5% better than the league average.
  2. Notice that the axes range from 80-120%.  This was done to emphasize the lack of significant year to year variances.

For this particular chart I have:

  1. Added a reference line at 100% to remind us that this is the average
  2. Synchronized the axes so that you can see how ERA- and FIP- compare to each other
  3. Added color bands below and above the average to indicate levels of goodness and badness.  That is, the darker the green, the better and the dark the tan, the worse.

Now, after having “seen” the data, I agree with the writer that “there’s just no evidence that bullpens are preventing runs at a better rate now than they were before the current roster construction norms came along.”

So what do you think?  Does these charts and graphs make it easier to interpret the stats?  Do they help tell the story more effectively?

March 2, 2012

NBA Franchise Values Visualized. How much are they really worth?


Every time Sports Chart of the Day creates a chart that ranks something it looks like this:

What bothers me over and over again is that they are trying to communicate a rank, yet they rank the data low to high.  Why?

In addition, they always use column charts instead of a bar charts, which forces me to turn my head sideways.  Ok, one more thing.  The source article is centered on the Lakers, so COTD used the Lakers colors.  Cute?  I think not.  I’d say chart junk.

I’d go with a simple bar chart like this.  Yes, it’s a bit taller, but you can at least read it without needing to go to the chiropractor afterwards.


But why stop there?  There’s more data available then just franchise values.  Check out this interactive viz I built with Tableau.

There are two vizzes to check out.  The NBA Franchise Values viz allows you to:

  1. Choose the data you want to view for two charts
  2. Provides the ability to sort these two chart by any data point
  3. Creates a scatter plot that compares the two data points you chose in step 1
  4. Allows you to pick a data point to size the teams by in the scatter plot

The Value & Debt Rankings viz is a heat map of the rank of each team across the different data points.  The article and data only provide the rank based on current value, but I was interested in the rank across all of the data points.  I used the INDEX() function in Tableau to create a rank for each field, then used the Advanced calculation to specify the field to rank.  Thus I can have a rank field for each data point.

While the article stresses the rise in franchise values, what they neglect to address is the heavy debt load on many teams.  For example, the New Orleans Hornets are currently valued at $285M, yet they carry $299M in debt.  Compare that to the Knicks and Clippers who carry no debt.  Now THAT  is the way to run a franchise.