1.13. Application: Association Rules, i.e. Market Basket Analysis¶
Market Basket Analyses are a common application of association rules. One goal of a market basket analysis is to understand the association between items purchases. The relationship between items purchased at a grocery store will be considered in this handout.
An association rule highlights the fact that some items are more (or less) indicative of the purchase of others. For example, purchasing cereal increases the likelihood of purchasing milk. These types of analyses may also reveals that liquor and milk are rarely purchased together.
Association Rules are used to uncover associations or relationships that exist between items. Often these rules are constructed to identify relationships between items purchased, i.e. Market Basket Analysis.
Procedural Steps
- Determine how often items are purchased
- Determine how often items are purchased in conjunction with other items
- Identify which purchased items are indicative of others being purchased
Data Technologies
- Filtering/Subsets
- Creating Tables
- Applications of Summaries
Consider the following subset of data from a collection of transactions from a grocery store.
| Transaction ID | Items Purchased | 
|---|---|
| 1 | {Bread, Milk} | 
| 2 | {Eggs, Ham} | 
| 3 | {Bread, Fruit, Milk} | 
| 4 | {Beer, Bread, Butter, Fruit, Soda} | 
| 5 | {Bread, Fruit, Milk, Soda} | 
Association rules are developed under the following guiding principles.
Association Rule Principles
- Items should be purchased somewhat (Support)
- Reliability, i.e. the degree to which one set of items predicts the purchase of another set of items (Confidence)
Consider the following association rule – the purchase of Bread indicates the purchase of Milk.
Rule 1
Compute the support and confidence for this rule.
Consider a second association rule for the purchase of Milk.
Rule 2
Compute the support and confidence for this rule.
Question
- Why might Rule #1 be considered “better” than Rule #2 when interest lies in the purchase of Milk?
Consider a third association rule for the purchase of Milk.
Rule 3
Compute the support and confidence for this rule.
Lift is another measure often considered when evaluating rules of association.
For our example, realize that the support for Milk is fairly large. i.e, Milk was purchased in 60% of the transactions. This provides a baseline value for confidence. That is, rules that exceed this value indicate gains when considering the association provided by the rule. When the lift of a rule is near 1, then the rule provides little information to understanding the purchase of the item.
- \(Lift > 1\) implies positive association between items
- \(Lift \approx 1\) implies no association between items
- \(Lift < 1\) implies negative association between items
| Rule | Support | Confidence | Lift | 
|---|---|---|---|
| 
\[\left\{ \text{Bread} \right\}\  \rightarrow \left\{ \text{Milk} \right\}\] | 
\[\frac{3}{5}\] | 
\[\frac{\frac{3}{5}}{\frac{4}{5}} = \frac{3}{4}\] | 
\[\frac{\frac{3}{4}}{\frac{3}{5}} = 1.25\] | 
| 
\[\left\{ \text{Fruit} \right\}\  \rightarrow \left\{ \text{Milk} \right\}\] | 
\[\frac{2}{5}\] | 
\[\frac{\frac{2}{5}}{\frac{3}{5}} = \frac{2}{3}\] | 
\[\frac{\frac{2}{3}}{\frac{3}{5}} = 1.11\] | 
| 
\[\left\{ Bread,\ Fruit \right\}\  \rightarrow \left\{ \text{Milk} \right\}\] | 
\[\frac{2}{5}\] | 
\[\frac{\frac{2}{5}}{\frac{3}{5}} = \frac{2}{3}\] | 
\[\frac{\frac{2}{3}}{\frac{3}{5}} = 1.11\] | 
1.13.1. Common Data Structure¶
Next, consider the complete grocery dataset. This dataset contains 9835 transactions and 169 unique items. This dataset can be downloaded from the Workshop website.
Open the Groceries dataset in Excel. The binary representation of this market basket dataset is provided in this Excel file. A snippet is shown here.
Spreadsheets consist of rows and columns. Datasets also consist of rows and columns as well, but also contain information that is not data, e.g. variable names. Excel does not differentiate the header row from actual data unless you convert the collection of rows and columns into a Table.
| Putting Data into the Table structure in Excel Give your table a name for easy referencing | Short-cuts for cursor movement in Excel | 
The following snippet shows the Groceries dataset specified as a table.
The drop-down arrows provided for each variable (or field) are called Filters. Filters in Excel allow you to subset rows.
| Filter on Whole Milk | Select Whole Milk = 1 to identify transactions that purchased whole milk | 
After a Filter is applied, certain rows are hidden from view. Excel indicates this fact with changing the row label color to blue.
The status bar in Excel, the bar across the bottom of the Excel file, provides simple summaries for columns of the table. For example, if the Whole Milk column (column Y) is highlighted, the following summaries are shown.
Applying filters to columns Whole Milk and Butter allows one to easy compute the support and confidence for the rule \(\left\{ \text{Butter} \right\}\ \rightarrow \left\{ \text{Milk} \right\}\).
The =COUNT() function in Excel can used to count the number of nonblank
rows in a column. Excel functions also work with tables and variable
names. The following will provide a count of the number of transactions
in the Groceries dataset, i.e. 9835. The use of the table and variable
names is preferred as this avoids the need to highlight an exact range
of cells in Excel.
The =COUNTIF() function in Excel provides a count of only the cells that
satisfy some condition. The following can be used to compute the support
for butter.
=COUNTIF( Groceries[butter] , 1 )
If more than one condition is needed, the =COUNTIFS() function can be
used. COUNTIFS is necessary to compute Support(Butter AND Whole Milk).
=COUNTIFS( Groceries[butter] , 1 , Groceries[whole milk] , 1 )
A brief description of the COUNTIFS function in Excel is provide here.
Move to far right of the Groceries table in Excel. You can use Ctrl → to move quickly to the far right edge. Enter the following function in Excel to compute the counts necessary for measuring support for the rule \(\left\{ \text{Butter} \right\}\ \rightarrow \left\{ \text{Milk} \right\}\).
Use the value computed above to compute the Confidence and Lift for this rule.
Verify that these formulas are correct by comparing them to the output provided below.
1.13.2. Evaluating Several Rules¶
The procedure provided above lack efficiencies and does not scale well
when several rules need to be evaluated. For example, to evaluate the
rule
\(\left\{ \text{Yogurt} \right\}\  \rightarrow \left\{ \text{Milk} \right\}\),
the formulas for support will need to be changed. The =INDIRCT()
function in Excel will help increase the efficiency in computing the
support, confidence, and lift for several rules.
| INDIRECT()Function | ||||
|---|---|---|---|---|
| 
 | 
The =INDIRECT() function can be used in the following manner to
automatically update the variable names when computing the support for
several rules.
=COUNTIF( INDIRECT ( “ Groceries[ “ & B2 & “ ] “ ), 1)
The following setup is used to evaluate six different association rules for Milk.
This formula can be copied down in Excel to evaluate the support for the remaining rules. The confidence and lift are computed for these rules as well.
Questions
- The Lift for
\(\left\{ \text{Cereal} \right\}\  \rightarrow \left\{ \text{Milk} \right\}\)
is about 2.5 which is fairly high. Thus, given that the transaction
includes cereal, there is 2.5 fold increase in the likelihood of milk
being purchased.- Compute Support(Cereal AND Milk).
- This value is fairly low. Why does a low support value negate the usefulness of a rule?
 
- The Lift value for the rule \(\left\{ \text{Canned\ Beer} \right\}\ \rightarrow \left\{ \text{Milk} \right\}\) is lowest on this list. What can be said about the purchase of Canned Beer AND Milk?
- Which of these rules is least useful in the prediction of Milk? Explain how you made this determination.
Task
Use Excel to obtain the Confidence and Lift for all association rules for Whole Milk where only single items are considered on the left.
- Copy all variable names and paste them into a single column. This can be done using Paste Special – specify Values and Transpose when pasting.
- The =CONCATENATE()function can be used to create the Rule column, i.e.=CONCATENATE(“{“,B2,”} -> {Milk}”).
Specify Values and Transpose under Paste Special
Output for Rules

























Some Comments