Launch, grow, and unlock your career in data

October 19, 2017

Workout Wednesday: Market Basket Analysis

For week 42, Emma set us a task to conduct some market basket analysis. Essentially she asked us to figure out which product sub-categories cross-sell the best. This is a very, very common question, especially in the retail sector.

Emma indicated that you would need to use a PC and I'm on a Mac. This, however, didn't pose a challenge.

STEP 1: Join the data to itself

The idea here to join the two Orders tables by Order ID. This self-join allows us to see the cross selling of each line item in each Order ID with all other line items with the same Order ID.

STEP 2: Filter out sub-categories that are the the same in both data sources

The goal of this filter is to remove any product sub-category that joins to a product sub-category with the same name. To do this, I created a simple boolean calculation, then added it to the Filters shelf and selected False.

STEP 3: Create a Total Sales calculation

This is pretty straight forward. Simply take the two sales measures and add them together.

STEP 4: Combine the sub-categories to get a list of the products that sold together

You could do this with a combined field, I merely prefer to write the calculation.

STEP 5: Create the Top N parameter and show the parameter control

STEP 6: Build a bar chart of the sub-categories purchased together sorted by Total Sales

Notice that I now have every combination of product sub-categories purchased together duplicated. I'll need a couple table calcs to fix this.

STEP 7: Check if the current row matches the previous row

With this calculation, I'm checking for duplicates. If I get "True" back as a result, then that means it's a duplicate. Notice here that I'm rounding the sales. I had to do this because Tableau wasn't matching the sales for some reason.

By excluding the "True" results with a table calc filter, I now have removed the duplicates.

STEP 8: Filter to the top N sub-categories

Again, I need a table calc filter. I need to rank the total sales, but since I have twice as many records (the previous table calc filter essentially just "hides" the other rows), I need to take twice as many rows as I indicate in the parameter.

Drag this calc to the Filters shelf and choose True and I have the results I want.

STEP 9: Build the final dashboard


  1. Nice job! Maybe put it in a dashboard with heat map matrix at top with highlight actions between both charts?

    1. Go for it! That wasn’t the challenge posed though.

  2. Hi Andy,

    Great article, may be you can also try all records in x-axis from one data-set and others in y-axis from other data-set. I tried it a couple of months ago and it give a great approach to data. I am unable to upload any JPEG here :( otherwise shared the snapshot

  3. To get rid of the duplicates, I used INDEX()<=2*TopN rather than RANK(). It works, but I'm always curious about application outside of this specific example. Are there any issues with using INDEX rather than RANK?

    1. Index is a counter. It’s the same as Rank_Dense not Rank.

  4. Andy,

    How would you suggest going about finding counts for more than two subcategories? For example, how many orders were there for Paper and Tables and Copiers and Chairs?

    1. Not sure, see if you can figure it out.

    2. I've been looking into this for longer than I care to admit.. So far my lines of though are along creating two separate Sets (1.Paper & Tables - 2.Copiers & Chairs) and then making a set of those two sets. The 'IN' of that set should give the count of order id's that have all four of the items.

      Thanks for your response - reading your blog has taught me so much.