December 28, 2011
December 24, 2011
Heat Map: Manchester Derby Results Since 1907
The Soccer by the Numbers blog continues to provide me with great inspiration. After Manchester City destroyed Manchester United 6-1 on October 23rd (at ManU), they blogged about the results and provided a bit of statistical analysis. City has long been the “noisy neighbors” to United, but they now are doing their best to buy a title, rather than grow their own talent. Soccer by the Numbers posed this fundamental question:
We now know that the outcome of the match was truly unusual. But how unusual?
They created the table below, which is a frequency distribution of score lines since 1907 for matches at Manchester United. To use the chart, you simply identify the score line by going across City’s scores, then down ManU’s scores. So the 6-1 score line has occurred 2.99% of the time since 1907. Clearly this was an unusual result.
But I think this table could be improved. I made these changes:
- Changed the numbers to percentages and rounded to one decimal. Two decimals is unnecessary precision.
- Removed most of the gridlines so that the lines separate the data from the categories
- Formatted the results as a heat map. I chose a red-white two-color scheme since red is ManU’s color. This makes the most frequent results very obvious. For example, you can now easily see, without having to scan across all of the data points, that 1-1 is the most common score line…boring result!
- Formatted the totals as a second heat map. I chose a brown-white scheme for these. The totals show you how often each team has scored a given number of goals. ManU has scored one or two goals 64.2% of the time while City has scored one or two goals 58.2% of the time.
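The numbers behind a table like this are simple to compute. Here is a minimal Python sketch, assuming the results are available as (home goals, away goals) pairs. The `results` list is made up for illustration and is not the real derby data, and the function names are my own.

```python
from collections import Counter

# Hypothetical results as (home goals, away goals); NOT the real derby data.
results = [(1, 1), (1, 1), (2, 0), (1, 6), (0, 0), (2, 1), (1, 1), (3, 1)]

def scoreline_percentages(matches):
    """Return {(home, away): percent of matches}, rounded to one decimal."""
    counts = Counter(matches)
    total = len(matches)
    return {line: round(100 * n / total, 1) for line, n in counts.items()}

def goal_totals(matches, side):
    """Percent of matches in which one side (0 = home, 1 = away) scored each goal count."""
    counts = Counter(match[side] for match in matches)
    total = len(matches)
    return {goals: round(100 * n / total, 1) for goals, n in sorted(counts.items())}

table = scoreline_percentages(results)
most_common = max(table, key=table.get)  # the cell the heat map would shade darkest
print(most_common, table[most_common])
print(goal_totals(results, 0))
```

The heat map itself is just conditional formatting on top of `table`: the larger the percentage, the darker the cell.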
Which format do you like best? Does one make the story easier to interpret than the other?
December 22, 2011
Using a non-zero-based axis: I don’t understand why “experts” can’t get it right
Nielsen is widely regarded as providing exceptional analysis of consumer data. In fact, many of the best analysts I work with spent many years at Nielsen and they have been very interested in learning data visualization best practices because they now understand the benefits.
I agree that Nielsen’s insights are often fantastic, yet I don’t understand why they can’t present their analysis more appropriately. I guess my larger concern is that when a company with Nielsen’s influence on analysts presents data visualizations, even simple ones, so poorly, it makes the abuse more and more pervasive. I’m seriously considering sending a link to this post to the author of the presentations I’m about to review.
I received two annual reports the other day, both authored by a high-level employee who was supplemented by many other resources, so I can’t lay the blame on any one person, but more on Nielsen as a whole. Here is a summary of the charts they presented:
Presentation #1
Zero-based axis = 6
Missing or non-zero-based axis = 88
Pie chart with no color to differentiate the slices = 2
Charts well done = 6/96 = 6.3%
Presentation #2
Zero-based axis = 37
Missing or non-zero-based axis = 36
Smoothed line charts = 7
Charts well done = 37/80 = 46.3%
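For anyone who wants to check the arithmetic, here is a small Python sketch of the tally. The counts are the ones listed above; the dictionary keys and the function name are my own labels. I use `Decimal` with half-up rounding because Python's built-in `round()` rounds ties to even, which would give 6.2 instead of 6.3.

```python
from decimal import Decimal, ROUND_HALF_UP

# Chart counts copied from the two summaries above.
presentations = {
    "Presentation #1": {"zero_based": 6, "missing_or_non_zero": 88, "other_problem": 2},
    "Presentation #2": {"zero_based": 37, "missing_or_non_zero": 36, "other_problem": 7},
}

def share_well_done(counts):
    """Return (total charts, percent with a zero-based axis to one decimal)."""
    total = sum(counts.values())
    percent = Decimal(100 * counts["zero_based"]) / Decimal(total)
    # Round half up; the built-in round() would turn 6.25 into 6.2.
    return total, percent.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

for name, counts in presentations.items():
    total, percent = share_well_done(counts)
    print(f"{name}: {counts['zero_based']}/{total} = {percent}%")
```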
Hopefully this means someone told them presentation #1 had a lot of chart junk and they made an effort to improve presentation #2, but I doubt that’s actually the case. It’s easy to see that most of the charts were created in Excel, which will automatically set the axis to start somewhere other than zero if the numbers in the chart are large. I don’t know the specific business rules that Excel uses, but they should be changed.
Here is a representative example of the charts they created which did not have a zero-based axis. I’m holding out hope that this wasn’t done to intentionally deceive the reader, but to emphasize the subtle differences between the data points.
I recreated the chart as a dual-axis chart with the primary axis starting at zero and the secondary axis set to Excel’s default. Also note that I created a line chart since this is time-series data, which typically means you want to see the overall pattern.
Clearly these imply a very different story.
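To put a number on how different, here is a rough Python sketch that compares the drawn heights of two values when the axis starts at zero and when it starts somewhere else. The values and the axis start are hypothetical, not taken from the Nielsen charts, and `visual_ratio` is a name I made up.

```python
def visual_ratio(a, b, axis_start=0.0):
    """Ratio of the drawn heights of two values when the axis begins at axis_start."""
    if min(a, b) <= axis_start:
        raise ValueError("both values must sit above the axis start")
    return (a - axis_start) / (b - axis_start)

# Hypothetical values with a real difference of 5%.
low, high = 100.0, 105.0

honest = visual_ratio(high, low)           # axis starts at zero
truncated = visual_ratio(high, low, 95.0)  # axis starts at 95

print(f"zero-based axis: the taller mark is {honest:.2f}x the shorter one")
print(f"axis starting at 95: the taller mark is {truncated:.2f}x the shorter one")
```

With these numbers, a 5% difference is drawn as if one value were double the other.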
Ok, we see what’s wrong, but how could Nielsen have presented the data more effectively? You have two primary options.
Dot plot
- You can replace bar charts with dot plots so that the sequence over time is de-emphasized
- Dot plots don’t require a zero-based scale
- Dot plots force the reader to refer to the scale before comparing two values
- Sizing and coloring the bubbles by change over prior year would speed up the reader’s analysis of variances between variables
Non-zero-based line chart with special alerts
This is the most effective method if you insist on NOT using a zero-based scale. Stephen Few sums it up best in Show Me the Numbers (page 169):
You should generally avoid starting your graph with a value greater than zero, but when you need to provide a close look at small differences between large variables, it is appropriate to do so. Make sure you alert your readers that the graph does not give an accurate visual representation of the values so that your readers can adjust their interpretation of the data accordingly.
Nielsen can learn a few lessons by reading some of the great data visualization books by Few and Tufte, or they could hire resources that know what they’re doing and allow those resources the freedom to make the best practices viral. The alternative isn’t good for any of us.
December 21, 2011
When you use a smoothed line chart, your data is not affected, it’s misrepresented!
This past week, I was watching a presentation on Q3 performance and up popped a bunch of charts that were clearly created in Excel with smoothed lines. I hadn’t seen smoothed line charts in quite a while, so I was taken aback. I almost, but thankfully didn’t, stand up and call out the junk.
It was incredibly clear to me that the smoothed lines were distorting the data, not much, but distorting it nonetheless. And I have a problem with THAT!
Let’s first take a look at some examples to see how badly the data can be distorted. The first chart is obviously the smoothed lines. Nice and pretty, I agree. It makes me feel like I’m going up a chairlift then skiing down the slopes of Keystone, Colorado.
I added the gridlines, though I would never do this if I were presenting this for real, so that you can see where the points truly intersect. It should be abundantly clear now that the line is trying to connect points that don’t exist. Look between July & August 2009 or between August & September 2008. In both of these instances, and many more across the chart, the lines go beyond where they should in an attempt to make the chart nice and smooth.
If I were to look at this quickly, I might think that my sales increased from July to August 2009, but in fact, there was a slight decrease. In order for the line to connect smoothly to September 2009, the line has to go around August 2009. Think about all of the people that don’t use zero-based axes. Imagine how distorted their data could look.
Contrast the smoothed line chart to this standard line chart.
You now easily see that sales decreased from July to August 2009. It’d be tough to interpret anything from this chart between the months because the lines clearly connect month to month. The smoothed lines lead you to believe that there is more data being connected.
Now, let’s look at how the smoothed and straight lines look on the same chart. For illustrative purposes, we’re only looking at 2008. Now that dip after August really stands out.
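You can reproduce the effect without Excel. The sketch below is only an illustration: I don’t know which smoothing algorithm Excel uses, so I’m assuming a uniform Catmull-Rom spline as a stand-in, and the four monthly sales values are invented to mimic a slight July-to-August decrease.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Point at t in [0, 1] on the uniform Catmull-Rom segment from p1 to p2."""
    return 0.5 * (
        2 * p1
        + (-p0 + p2) * t
        + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
        + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3
    )

# Hypothetical monthly sales: June, July, August, September.
jun, jul, aug, sep = 100.0, 120.0, 118.0, 140.0

# Sample the smoothed curve between the July and August data points.
curve = [catmull_rom(jun, jul, aug, sep, i / 20) for i in range(21)]

print("highest point on curve:", round(max(curve), 2), "vs July actual:", jul)
print("lowest point on curve:", round(min(curve), 2), "vs August actual:", aug)
```

Sales fell from July to August, yet the smoothed curve first climbs above the July value and then dips below the August value before heading up to September. A straight line between the two points would do neither.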
Jon Peltier of the Peltier Tech Blog sums it up best in his post about the charts to choose and avoid in Excel 2010:
Smoothed lines are abused. If you are plotting measured data, the only valid connecting curve between points is a straight line (or a line which is fitted to a function that comes from a physical model of the data). A smoothed curve implies that the data goes places where it has not been measured. Smoothed lines without points are even worse, because the person trying to interpret the chart doesn’t even know what points on the smoothed curve belong there.
My advice? NEVER use smoothed lines. The ONLY possible outcome is misinterpretation.
Let me wrap up with what I find to be a bit of a funny line from Microsoft’s help for creating smoothed lines:
When you use this procedure to soften the jagged edges of a line chart, your data is not affected.
This is very true. Your data is not affected, it’s merely misrepresented. Semantics?
December 16, 2011
Is it possible to share 101.4% of Facebook? Chart of the Day thinks so!
There's a bad stomach bug going around this part of town and I think I might know part of the reason why. Today, my good friends over at Chart of the Day published this pinwheel pie chart and I think the filling might be bad, because the pie sure looks ugly.
Here are some of the problems with this chart:
- IT'S A PIE CHART!
- Colors are re-used, or maybe they are so similar it's hard to tell they're different
- The slices are not in order, making it even harder to look up the values (notice how Microsoft is listed ahead of Peter Thiel and some others)
- The dollar amounts are based on their portion of $100B, yet they total up to $101,350,000,000.
- Correspondingly, the percentages add up to 101.4%. How can you have more than 100% of a total?
To highlight the differences, I created the following charts with Tableau.
What I attempted to do here was show the Stated % Share (gray bar) from the pie chart compared to the "Restated % Share" (black bar). I calculated the Restated % Share with the following formula:
SUM([Stated $ Value])/TOTAL(SUM([Stated $ Value]))
NOTE: A special thank you to Marc Rueter (@tableaujedi) for enlightening the ATUG crowd today with some Jedi magic and for showing how to use the TOTAL function. I had never used it before (and didn't know about it either), but I find it totally awesome! It'll be so useful!
Basically, I'm taking the value stated on the pie chart and dividing it by the total value of the pie chart. This gives you the Restated % Share. The label is the difference between the Restated % Share and the Stated % Share.
The chart on the right represents the % variance number (as identified by the label on the left) multiplied by $100B (the estimated total value of Facebook).
If I were one of these shareholders, I'd be a bit concerned about the math. This isn't chump change! In the end, Chart of the Day may have made a $1.35B miscalculation. Oops!
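If you don't have Tableau handy, the same restatement can be sketched in a few lines of Python. The three holders and their individual amounts below are invented for illustration; only the $100B base and the $101,350,000,000 sum come from the chart. Summing every stated value plays the role of TOTAL(SUM([Stated $ Value])).

```python
ASSUMED_TOTAL = 100_000_000_000  # the $100B valuation the pie chart was based on

# Hypothetical split; only the $101,350,000,000 sum comes from the post.
stated_value = {
    "Holder A": 60_000_000_000,
    "Holder B": 30_000_000_000,
    "Holder C": 11_350_000_000,
}

grand_total = sum(stated_value.values())  # plays the role of TOTAL(SUM(...))

stated_share = {h: v / ASSUMED_TOTAL for h, v in stated_value.items()}
restated_share = {h: v / grand_total for h, v in stated_value.items()}

for holder in stated_value:
    difference = restated_share[holder] - stated_share[holder]
    print(f"{holder}: stated {stated_share[holder]:.1%}, "
          f"restated {restated_share[holder]:.1%}, "
          f"difference ${difference * ASSUMED_TOTAL:,.0f}")

print(f"Overstatement: ${grand_total - ASSUMED_TOTAL:,}")
```

The restated shares sum to 100% by construction, which is exactly what the stated shares fail to do.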
Download the Tableau workbook here and you will see the original and restated data like this:
December 14, 2011
What does it take to survive in the English Premier League?
If you love soccer, then it’s likely that you follow the EPL. My favorite team? ARSENAL! Did you see the incredible goal by RVP Saturday against Everton? You may never see better technique and now he’s only one goal behind Thierry Henry’s team record for goals in a calendar year. Please Lord, keep RVP healthy for a full season!
And if you love soccer and you love stats, then check out Soccer by the Numbers. Chris Anderson writes many quality posts and recently he blogged about points and relegation. I wanted to take Chris’ ideas a step farther. I needed a richer dataset than what Chris was able to provide, so I downloaded the final tables (i.e., standings) from the EPL back to the 2001-02 season from ESPNSoccernet. You can download the full dataset here.
I borrowed (or is it stole?) Steve Wexler’s technique for providing instructions (hover over the EPL logo to see what I mean). There’s lots of interactivity in the viz, so first check out the instructions, then start clicking around.
Answer this: How many teams have qualified for the Champions League with a negative goal differential? Who were they? What else can you tell me about the team(s)? Post your answer in the comments.
December 13, 2011
ATUG Webinar – Jedi Tricks and Brilliant Dashboarding (Thursday Dec 15, 1-4pm ET)
While we prefer that you join us in person, this month we are offering the Atlanta Tableau User Group meeting as a webinar. Anyone is welcome to join.
RSVP (if you plan to attend in person) – http://www.tableausoftware.com/usergroups/atlanta-dec15-11
Please join 5 minutes early as we will start promptly at 1pm.
Agenda:
- Jedi Tricks – Marc Rueter, Tableau Software
- In this session, Marc will show ten powerful tricks that will make you truly the master of your analytics.
- Hands-on Training: Tips & Tricks – Andy Kriebel, Coca-Cola
- Authoring Brilliant Dashboards
- Ten powerful tricks that will speed up your work
- Information - January & February Meetings
-- This will be a hands-on session --
Webinar Login Information
URL: http://bit.ly/drY3e5
Toll free: (877) 906-9811
Conference code: 2561415367
December 8, 2011
Join us at the next ATUG meeting – December 15th 1-4p @ Coca-Cola
The next ATUG meeting will be December 15 @ 1PM ET
Where – Coca-Cola - 1 Coca Cola Plaza, Atlanta, GA 30313, Nicholson Room – Central Reception Building
Important – You must check-in with security and tell them you are there for the Atlanta Tableau User Group meeting.
Plan to arrive at least 15 minutes early!!
RSVP – http://www.tableausoftware.com/usergroups/atlanta-dec15-11
Agenda:
- Jedi Tricks – Marc Rueter, Tableau Software
- In this session, Marc will show ten powerful tricks that will make you truly the master of your analytics.
- Hands-on Training: Tips & Tricks – Andy Kriebel, Coca-Cola
- Ten powerful tricks that will speed up your work
- Authoring Brilliant Dashboards
- Information - January & February Meetings
-- This will be a hands-on session - Bring your laptop and Tableau with you --
December 7, 2011
Sports Chart of the Day: When you want to emphasize rank, sort appropriately…please!
Dear Sports Chart of the Day,
I’ve been patient. I’ve added comments (which never seem to get posted/approved). And I’m frustrated. I love your blog, but you really need to make some simple changes to your charts.
You often post charts like the one below (from this blog post). You often say things like you said in this post:
Here are the top 17 teams in the NFL based on points scored and points allowed and how many wins those teams would be expected to have based on those numbers (actual record in parentheses)
As a reader, it’s very clear that you are emphasizing top-to-bottom rank. Those of us that live in the western part of the world have learned to read left to right. You want us to look at the “top 17 teams”, yet the chart reads right to left. Please, please start sorting your ranking charts in the appropriate order, descending in this case. Trust me, this will improve your message.
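If the data passes through code before it is charted, a descending sort is a one-line fix. A minimal Python sketch, with made-up team names and expected-win numbers rather than the figures from your chart:

```python
# Hypothetical teams and expected wins; NOT the figures from the chart.
expected_wins = {"Team A": 7.9, "Team B": 10.4, "Team C": 9.1}

# Highest value first, so the top-ranked team is the first one read.
ranked = sorted(expected_wins.items(), key=lambda item: item[1], reverse=True)

for rank, (team, wins) in enumerate(ranked, start=1):
    print(rank, team, wins)
```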
Respectfully,
Andy
December 6, 2011
When income grows, who gains? Find out for yourself.
Anyone who follows this blog knows how much I despise pie charts, but there are times when I tip my cap to someone that does them well. As much as it pains me to say it, pie charts are not ALWAYS evil.
The viz below is a great example of how to use a pie chart well (from the State of Working America blog):
- There are a maximum of four slices to this pie chart
- You can very quickly see how dominant one slice is versus the others
- The colors contrast well enough to not have to constantly refer to the legend
Click on the image to interact (you will be taken to the source site).
I guess what caught me most off-guard about this chart is the summary text when you choose 2002-2008. All income growth went to the top 10%. I had no idea! A great chart can indeed tell a great story, or better yet, let the reader discover the story for themselves.
December 4, 2011
Choosing a good chart type – A Cheat Sheet
Charles Schaefer’s session at TCC11 “Understanding and Working with Chart Types” included this decent flow diagram for choosing a chart type. As the title says, it’s “A Thought-Starter”. In other words, use it as a way to get you going down the right path. This absolutely shouldn’t be taken as gospel; use your common sense. For example:
- Don’t ever create 3D area charts. In fact, don’t ever create a 3D chart of ANY kind.
- Never create a circular area chart
- When creating a stacked column chart, don’t connect the same-colored bars with lines.
You can download the materials from Charles’ session here. Included are a JPG and PDF version of the flow chart.
The flowchart originates from a blog post back in 2006 on The Extreme Presentation Method blog. This blog, in turn, links to an interactive version of the chart chooser on Juice Analytics. Although Tableau does a lot of this thinking for you, definitely check out the interactive tool to get a feel for some best practices and download the sample Excel workbooks to quickly recreate these charts yourself.