Launch, grow, and unlock your career in data

August 19, 2012

Tableau Tip: Create hub & spoke diagrams using the Path shelf


I’m in constant learning mode with Tableau.  I don’t think a day has gone by when I haven’t learned something.  I love the Tableau community because everyone freely shares ideas from which they hope others borrow.  This technique of from-to patterns is no exception; it’s definitely not a new topic I’m covering here.  But I do like documenting the processes I follow, which is more or less why I started this blog in the first place.

Paths are not a feature I use often, so in my inevitable quest for all things Tableau, I decided to find some data that would provide a good learning experience.  The data for this post comes from the Bureau of Transportation Statistics.

Here’s a sample of the data. 


Each origin airport/destination airport combination is represented as one row in the spreadsheet. This is great, but in order to use Tableau’s Path functionality, you need to have two records for each origin/destination combination.  The first for the starting point, and the second for the ending point. 

You can accomplish this quite easily through Custom SQL.  The example below contains all the columns I need from the spreadsheet.

  • Those before the red box are inherent in the spreadsheet.
  • The SQL inside the red box is for the fields I have to add to make the paths work.
    1. You must specify the path.  In this case, 1 is the starting point, 2 the ending point.
    2. I added additional latitude and longitude fields to capture the starting and ending points.  That is, I use the origin lat/long when my path = 1 and the destination lat/long when my path = 2.
    3. I included a concatenated string for the route, which I will use on the Level of Detail shelf.
    4. Union the SQL for path 1 to itself, but change the path to 2 and change the lat/long to the destination lat/long. Notice that the Route calculation stays to same.


In Tableau, all you have to do is:

  1. Double-click on both Latitude and Longitude to draw the map
  2. Add Route to the Level of Detail shelf
  3. Change the Mark Type to line
  4. Add the Path Order dimension to the Path shelf


You now have a giant mess of lines all over the place that represent every single route in the data.  That’s all you need to do. 

The trick is that you need to write the custom SQL in order to have your data in a usable format.

I wanted to take the viz a bit farther though, so I created this interactive version on Tableau Public.

I included:

  • Filters options for Origin Airport, Destination Airport, and Airline (I chose ATL as my initial view since it’s the world’s busiest airport)
  • Colors to indicate the length of the route
  • Clicking on a color highlights the routes in the selected flight range and also highlights the destination airports on the bottom map
  • A map of the number of passengers that flew to each destination, indicated by the size of the bubble
  • Custom tooltips, which include the great-circle distance calculation

This is a pretty fun viz to interact with.  Choose all origin airports and you see all of the routes.  Choose the airport you most frequently fly from.  Choose your preferred airline.  The three New York airports are particularly interesting (EWR, LGA, JFK).

There are lots of other data elements that could be included, but I haven’t gotten to.  Think about adding options to color the bubbles and lines by things like frequency of delays, average ticket price, etc.

Wouldn’t this be really cool to do with your friend networks on Facebook?  Oh wait, we did that in a recent hack-a-thon.

August 16, 2012

Hours in R vs. five clicks in Tableau


Matt Katsaros built this really cool map in R of Spotify data.  From Matt: “This is a map of SF showing the % of Spotify listens of [I’m leaving the song unnamed] within each ZIP code.”

When I saw this, I left a comment and told Matt that I could show him how to do this in three clicks with Tableau.  We met today over lunch.

I connected to the data in Tableau, in this case a CSV file, and was presented with this window:


Step 1: Double-click on the “zip” dimension and you get this map (NOTE: I zoomed in on the continental US for purposes of this write up):


With one click we know where all of the listeners are located.  That’s right, one click!

Step 2: Double-click on the “listens” measure and you get this map:


Two clicks and I have an excellent map that shows the concentration of listeners.

Step 3: Drag “listens” onto the color shelf and you get this map:


We’re up to three clicks now and now we’ve emphasized the concentration of listeners with both size and color.

Step 4: Matt then wanted to see the map he built of San Francisco, so I lassoed those point and chose the Keep Only option.  I now have this map.  Notice that the colors have adjusted to reflect only the data in the view.


Step 5: To change it to a filled map, all I have to do is select the Filled Map option on the Marks card and remove “listens” from the Size shelf.


Five mouse clicks to get Matt’s map.  Yes, I know I told him three clicks, but I wanted to show him what Tableau does when you simply click the dimensions and measures.  You could do it in less steps using the Show Me menu.

To say Matt was excited is a huge understatement.  I believe we have a new convert to Tableau.

LeBron James is underpaid by nearly $7M. But you can’t pay him with pies!

No comments

LeBron James is pretty awesome; we all know that.  Sports Chart of the Day has interesting content, but creates horrible charts; we all know that too.  This time, Sports Chart of the day is trying to represent deviations with pie charts.  Really, seriously, I’m not lying.

Here’s the evidence.

If the headline of Cork Gaines’ article didn’t tell you what the point of the story was, do you think you’d actually be able to deduce anything from his pie charts?  Can you tell me who is the largest % underpaid? 

This has to be in the top 2 or 3 worst ways to represent change.  But then again, Sports Chart of the Day (and their cohorts over at Chart of the Day) create quite a bit of chart junk.

There’s some pretty basic information available and there are also some simple ways to portray the data that will tell the story better.  Let’s go through a few examples (in no particular order).

Example 1: Bar in bar chart, labeled by the amount under/overpaid


With this chart design, you can compare the salaries at a glance, both for the player and across the team.

Example 2: Bar in bar chart, but the bar lengths represent the % of the team payroll for each player.


This chart is the identical shape as example 1, but the percentages provide a bit more perspective than the raw numbers.  You cannot easily discern from example 1 that LeBron should account for 30% of the team payroll.

In both example 1 and example 2, you’re able to quickly rank the players based on their current salary (you may not have even noticed that they’re not in order on the pie charts).  So I can clearly see that Mike Miller is the 5th highest paid player and that he’s basically making 2x what he deserves.

It’d be even better if you could select your sort criteria. Fortunately I can build and share in seconds using Tableau. Click here to see it live.

Example 3: Bar chart that shows the amount each player is underpaid or overpaid.  Color is used as an additional visual cue.


This chart represents the math between the two bars in chart 1 so that the focus is directly on the amount LeBron is underpaid (and the amount Chris Bosh is overpaid).  I prefer this over example 1 because the message is much simpler and clearer.

Example 4: Bar chart that represents that % that a player is underpaid or overpaid.  In other words, who is really getting screwed.


Interesting…when you compare this chart to example 3, it’s a very different story.  Yes, LeBron is the most underpaid, but that’s because he makes so much money.  This chart here tells us that he’s actually getting screwed the LEAST of those that are underpaid.  Can you imagine being underpaid by 44% and being able to prove it?  James Jones’ agent needs to get to work!

This chart also shows us how ludicrously overpaid Wade, Bosh and Miller are.  Good luck restructuring those contracts to help out their teammates.

I’ve provided four alternatives, but many more exist.  If I had to pick one that tell the story the best, I would go with example 4 because it makes the salaries all relative.

August 15, 2012

Data viz exercise: Find all possible ways to visualize a ludicrously small data set of two numbers

No comments

I was teaching a data viz class at work yesterday and I tried an exercise that I’ve never done before.  The idea came from this blog post by Santiago Ortiz on

To set that stage, we covered attentive vs. preattentive processing, three forms of preattentive processing (I skipped motion), and the Gestalt principles of visual perception.  Nothing ground breaking there, but a necessary base/toolkit that everyone should have.

After the initial training, I had them grab some markers, head to the white board, and think of as many possible ways to visualize two simple numbers: 75 and 37. 

At the end, we picked the best visualization.

Here’s a sample of what they came up with (there was plenty more to the left and right):


Like the author of the blog post, I had no idea how this would go.  It could have been a total flop, but it was the exact opposite.  We spent a good hour discussing two simple numbers!  And it was incredibly rich discussion.  We went through each and every sample, discussed the pros and cons, and compared them all to determine which one we like best.

What I found very interesting was that everyone made an assumption that the base was 100, but the directions never said that.  All I gave them was two numbers.  This led to an awesome discussion about needing context when presenting data.

Give it a shot!  Grab a few fellow data viz nerds and do this at lunch.  Or better yet, over dinner and beers.

August 11, 2012

Why sort order is important in bar charts (Nielsen needs help again!)


Nielsen is at it again.  It’s not a spiral chart this time, but the simplest of bar charts that they couldn’t get right.

It all begins with this headline: “Telecom Grows Global Ad Spend, Durables & Services Decline”

Followed by this chart:

Wait…where’s telecom?  Oh there it is, all the way at the bottom.  If it’s so important, why isn’t it at the top?  Because they sorted the chart alphabetically.  When emphasizing rank, never sort your categories alphabetically.

Next question, which sector was the 4th fastest grower?  I bet it’ll take you longer than it should to find out.

A few very small tweaks, that take all of 10 seconds, would have made their chart so much more effective.

  1. Sort the sectors descending by growth rate, i.e., winners at the top, losers at the bottom
  2. Color code the bars to complement the length of the bars, i.e., double emphasize the point
  3. Put the mark labels on the ends of the bars
  4. Update the title – They didn’t mention that distribution channels were the leaders.  Why not?  I don’t get it.


Which sector was the 6th fastest grower?  I bet it took you less than 2 seconds, certainly far less time than when the sectors are sorted alphabetically.

Now their story makes more sense.  If only they would give their work a bit of thought and put more pride into it.  Nielsen consistently looks like they’re slapping a bunch of stuff together so that they can publish quickly. I think it hurts their brand to produce the junk they do, especially given how little time it takes to do it well.

August 10, 2012

Displaying time-series data: Stacked bars, area charts or lines…you decide!

Matt Stiles of The Daily Viz presented this chart to “see the trend in this quick column chart” in one of his recent blog posts.

First, let me say that this is a tremendous improvement over those produced by the U.S. Bureau of Alcohol, Tobacco, Firearms and Explosives (a.k.a. the ATF).  Don’t bother reading the ATF report, unless you love 3D bar charts and 3D pie charts created in Excel.

A stacked bar chart is basically a pie chart unrolled to make a stick.  And more often than not, when plotted as a time series, they do a poor job at showing the overall trends.  Stacked bars are good up to three bars, no more.  Why? Because it’s difficult to compare the heights of any of the bars except for the bottom bar, rifles in this case. 

Let’s go through several alternative displays.  If you’re interested in playing with the data, Matt published it here for me.  Thank you Matt!

All of the charts below were built with Tableau.  You can view an interactive version of all of these charts here and download the workbook here.

Let’s start with a redesigned stacked bar chart that uses Tableau’s built-in color blind palette.


Can you see the trends for each of the weapons?  Maybe an area chart would be better.


Well, ok.  Now the trends are easier to see, right?  Area charts certainly improve the ability to see trends over time, but there are only two trends that give an accurate reading:
  1. The line at the top of the bottom area, i.e., rifles.
  2. The top of the top chart, which represents the total.

We still don’t have the ability to see the trends for any weapon except for rifles. 

Before you read on, take out a piece of paper and sketch what you think the trend is for shotguns (light blue) based on the area chart above.

Ok.  Now let’s compare the area chart above with the area chart for shotguns.


Did you come close?  I doubt you did.  Why?  Because the tops of each color are influenced by the size of the colors below it, therefore making gauging the true size of each individual color extremely difficult.

Here’s another way to prove it.  I know this isn’t a good way to represent the data, but bear with me, I’m trying to prove a point.  If I overlay lines for each weapon over the area chart, look how different the shapes of the lines become.


Like most time-series data, your best way to represent the data is nearly always going to be a line chart.


Using a line chart we can quickly make some observations:
  1. There was a three-year spike in the early 90s for pistols made and there’s been a similar, but longer, surge since 2006.  What was the cause of the big decline in 1995?  Was there a change in handgun laws in 2005 or 2006? 
  2. Revolvers were on a steady 20-year decline until 2005-2006.  Is this merely coincidental with the pistols?  Possibly so, possibly not. 
  3. Rifles have increased recently, but shotguns have decreased.  Are people buying rifles instead of shotguns? Their rate of variance since 1994 has grown consistently and the gap continues to get wider.

Using a line chart, you’re immediately asking questions of your data.  Rapid-fire analysis!

When analyzing time-series data across several categories, consider not only looking at the raw numbers like above, but also review how each category contributes to the total.  Let’s go through the same series of charts.


We’re off to a good start with the stacked bar chart.  It looks like measuring the contribution of each weapon to the total may tell us something.  Let’s try it as an area chart.


Not much better, other than it looks smoother.  How about a line chart?


Ok, now we’re onto something.  You might think that this is the same as the line chart for the raw numbers, and I can see how you might make that conclusion at a quick glance.  But let’s look at them side-by-side.


The charts look very similar up until 1997, but then look at how many more rifles started to be made compared to the rest.  And look at the drop off in percentage of shotguns produced since 2004.

Hopefully you’ve learned two main lessons:
  1. Don’t display time-series data as stacked bars (or pies unrolled onto on a stick if you prefer).  The best medium for time-series data is a line chart.
  2. Consider looking at both the raw numbers and their contribution to the total.  It’s always a good idea to look at your data in more than one way.  You may get some additional and/or different insights.

Let me wrap with two charts that disturbed me a bit as I was playing with the data for this blog post.  I’m not disturbed by their visual display, but by what they reveal.


The chart on the left is the running total of guns made by gun type since 1986.  The chart on the right summarizes the chart on the left.

These charts tell us that the US has manufactured over 99 million guns since 1986.  Seriously!  99 million!  According to the US Census Bureau, there were ~238M Americans over 18.  That means that approximately one of every five Americans 18 or older owns a gun. 

That terrifies me!

Perhaps political interests (and lobbyists) have played a part?? For more information on how to use the US Census Bureau data, check out this guide.


UPDATE – Source CNN: This certainly explains the drop that started in 1994 and the subsequent increase in 2005.
The Clinton administration imposed a ban on several types of military-style semi-automatic rifles and high-capacity magazines in 1994, but that ban was allowed to lapse in 2004. Obama has proposed restoring the ban, requiring background checks for buyers at gun shows, and other "common-sense measures."

August 6, 2012

Tableau Tip: Automatically excluding or keeping only the current week (without manual intervention)


Situation: You have daily sales data, but you don’t want to include the current week until it closes.  You also don’t want to manually filter the most recent week out of the report each week because that’s an unnecessary nuisance.

Solution: Create two date calculations and one filter calculation

Step 1 – Create a calculated field to get the current year & week


Note that I’m creating a string that includes the year.  This is necessary if you have more than one year in your reports.  If the formula only returns the week number, then that week number would be excluded for all years.

Step 2 – Create a calculated field to get the year & week for each date of your date field


Step 3 – Create a calculated field to act as the filter


Step 4 – Drag the Exclude Current Week field to the Filters shelf and choose the Exclude option on the bottom right when the result of the calculation is “EXCLUDE”


An alternative solution would be to build the two calculated field from step 1 & 2 into the calculation built in step 3, but I chose to break them out separately for this example so that you can more easily see how the calculations work together.

The calculation would look like this:


This method would work if you wanted to ONLY KEEP the current week.  In this case, you would follow step 4, but you would NOT select the Exclude checkbox on the bottom right.