Predicting Response at BookBinders: Decision Trees


Kenan-Flagler Business School

The University of North Carolina

Professor Charlotte Mason prepared this case to provide material for class discussion rather than to illustrate either

effective or ineffective handling of a business situation. Names and data may have been disguised to assure

confidentiality. The assistance of the Direct Marketing Educational Foundation in supplying data is gratefully

acknowledged.

Copyright © 2001 by Charlotte Mason.


Recursive partitioning algorithms (or decision trees) are a versatile tool for uncovering patterns

or relationships in data. They are especially useful when there is a large set of potential

predictors and when you are not sure which are most important or what the relationships

between the predictors and the target (dependent) variable are. In the case of a binary target

variable, decision tree algorithms iteratively search through the data to find which predictors

best separate the two categories of the target variable.

So far, we have used RFM segmentation and logistic regression to predict the response to the

mailing offer for “The Art History of Florence.” Now we will see how decision trees compare as

an alternative.

Tree using Exhaustive CHAID

We’ll start with a tree using exhaustive CHAID, which is one of the algorithms in SPSS’s

AnswerTree software package. Our target variable is BUYER (whether or not they bought The

Art History of Florence) and all other variables will be potential predictors. To see how the tree

grows, let’s take it one step at a time, beginning with the ‘root node’. Because decision trees

are prone to ‘overfitting’, we will split the dataset in two: two-thirds of the observations will be

used to develop the model (the training sample) and the remaining one-third will be used to test

the model (the validation or test sample) to see how well the model performs on ‘new’

observations.
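The split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure SPSS or AnswerTree uses: the customer IDs, the function name `split_train_test`, and the seed are hypothetical, and each record is assigned independently with probability two-thirds, so the training sample comes out close to, but not exactly, two-thirds of the records.

```python
import random

def split_train_test(record_ids, train_fraction=2 / 3, seed=42):
    """Randomly assign each record to the training or the test sample."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for rid in record_ids:
        # Each record lands in the training sample with probability train_fraction
        (train if rng.random() < train_fraction else test).append(rid)
    return train, test

# Hypothetical customer IDs 0..49,999 standing in for the customer records
train_ids, test_ids = split_train_test(range(50_000))
print(len(train_ids), len(test_ids))
```

The tree is then grown on the training IDs only; the test IDs are held back until the tree is final.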


Exhibit 1 “Root” node for CHAID Tree (Training Sample)

The root node contains all the observations in the training sample. We see that 29,981, or

90.88%, are not buyers and the remaining 3,008, or 9.12%, are buyers. Next, we’ll grow the tree

one level. After specifying this, the AnswerTree software searches through the potential

predictor variables to see which one ‘best’ separates the buyers from the non-buyers, and we

see that gender is selected. These results are shown in Exhibit 2.

Exhibit 2 One Level CHAID Tree (Training Sample)

To see how well this one-level tree classifies buyers and non-buyers we can look at the

classification table and ‘risk estimate’. In AnswerTree, the risk estimate is the percent of

customers incorrectly classified. Exhibit 3 shows the misclassification for this one-level tree.


Exhibit 3 Misclassification Matrix for One Level CHAID Tree (Training Sample)

Note that the tree predicts a total of zero buyers – meaning, so far, there are no nodes in the

tree where the buyers outnumber non-buyers. Because our sample is so dominated by non-buyers,

it’s not surprising that the computer predicts non-buyers for many or even all nodes.

Although the classification is correct for 91% of the cases, its predictions are not very useful for

distinguishing good prospects from bad ones. To identify segments worth targeting (i.e., those

with a relatively high probability of responding), we will need to look at the specific response

rates for the nodes. First, though, let’s grow the tree another level. Once again, the

AnswerTree software searches the available predictor variables and selects which variable(s) to

branch on for this level.

Exhibit 4 Two Level CHAID Tree (Training Sample)

Misclassification Matrix (Training Sample)

                          Actual Category
                          No        Yes       Total
Predicted Category  No    29981     3008      32989
                    Yes   0         0         0
                    Total 29981     3008      32989

Risk Estimate             0.091182
SE of Risk Estimate       0.001585
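The risk estimate reported with the misclassification matrix can be reproduced from its four cell counts: it is the share of cases that fall off the diagonal. The sketch below also computes a standard error using the usual binomial formula for a proportion. That formula is an assumption on my part about how AnswerTree arrives at its figure, although it does match the reported value to six decimals.

```python
import math

# Cell counts from the misclassification matrix, keyed by (predicted, actual)
matrix = {
    ("No", "No"): 29981,
    ("No", "Yes"): 3008,
    ("Yes", "No"): 0,
    ("Yes", "Yes"): 0,
}

n = sum(matrix.values())
# Misclassified cases are those where the prediction differs from the actual category
misclassified = sum(count for (pred, actual), count in matrix.items() if pred != actual)
risk = misclassified / n

# Assumption: binomial standard error of a proportion, sqrt(p * (1 - p) / n)
se_risk = math.sqrt(risk * (1 - risk) / n)

print(f"risk = {risk:.6f}, SE = {se_risk:.6f}")
```

Because the tree predicts "No" for every case, the risk estimate here is simply the overall share of buyers, 3008 / 32989.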


Exhibit 5 CHAID Tree, based on the two-thirds of customer records used for the ‘training’ sample (one terminal node is labeled ‘A’)


Exhibit 6 CHAID Tree, based on the one-third of customer records used for the ‘test’ sample (one terminal node is labeled ‘A’)


Note that for females the tree branches on the number of reference books purchased, but for

males the tree branches on the number of art books purchased. Females who have purchased

2 or more reference books had more than twice the response rate of females who had purchased

one or no reference books (14.44% compared with 6.89%). Also, with CHAID

algorithms there can be binary or multi-way splits (e.g. three or more branches from a single

node) as seen above. The tree divides males into three categories or nodes based on the

number of previous art books purchased. The response rate increases substantially with the

number of prior art book purchases.

Exhibit 5 shows a complete tree after an iterative process of growing and pruning branches. So

far, the analysis has used a random sample of 32,989 (about 66%) of the 50,000 customers

in the dataset. These 32,989 records comprise the training sample. The remaining one-third, or

17,011 records, form the validation or test sample.

The results using the test sample are shown in Exhibit 6. The tree in Exhibit 6 uses the same

branching rules as the tree we just developed using the training sample – and was used solely

to classify the 17,011 customers in the test sample. Each customer in the test sample is ‘put

through’ the tree starting at the root node and branching until a terminal node is reached. For

example, if the first customer is a female with no past purchases of reference or art books, she

will end up in the node labeled ‘A’ in Exhibit 6. We can see that there were a total of 8,078

customers who fit that profile (females with no prior reference or art book purchases) and that

7,783 of them did not purchase ‘The Art History of Florence’.

Exhibit 7 shows a summary of the number of customers, the number of buyers and the

response rate in each ‘leaf’ or terminal node for both the training sample and the test sample.

The nodes in Exhibit 7 are sorted by response rate from highest to lowest.

Exhibit 7 Response Rates by Node

            Training Sample                            Test Sample
# Customers   # Buyers   Response Rate   # Customers   # Buyers   Response Rate
       1016        350          34.45%           513        168          32.75%
       1788        438          24.50%           244         62          25.41%
        503        119          23.66%           898        208          23.16%
       5197        871          16.76%          2749        431          15.68%
       2075        298          14.36%          1106        165          14.92%
        803         54           6.72%           239         19           7.95%
        428         26           6.07%           895         50           5.59%
       1689         90           5.33%           401         22           5.49%
        501         26           5.19%          1657         85           5.13%
       3190        160           5.02%           231          9           3.90%
      15799        576           3.65%          8078        295           3.65%

For example, node “A” in Exhibit 5 includes 15,799 customers from the training sample. Of the

15,799 customers in this node, 576 belong to the target category of Yes (i.e. they are buyers),

which is a response rate of 3.65%.


The terminal nodes of the tree and the summary statistics in Exhibit 7 are used to identify which

segments to target and which to avoid. An important decision is how ‘deep’ in the customer

base to go. This decision may be based on the number of prospects wanted, a desired

response rate, the proportion of potential buyers you want to contact, or profitability.

For the BookBinders’ mailing offering The Art History of Florence, we know the following about

profits for the two groups:

· Non-responder: -$0.50 for the cost of mailing

· Responder: $5.50 ($18 revenue less $9 COGS, $3 shipping and $0.50 for the mailing)

We can use this information to determine which nodes are profitable to target. An equivalent

approach is to target customers in nodes with a response rate greater than or equal to the

breakeven response rate.
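The breakeven response rate follows directly from the two profit figures above: it is the response rate p at which the expected profit per mailing, p × $5.50 − (1 − p) × $0.50, equals zero. A minimal sketch of that arithmetic is below; the constant and function names are illustrative choices, not part of the case.

```python
# Per-customer profit figures given in the case
PROFIT_RESPONDER = 5.50       # $18 revenue - $9 COGS - $3 shipping - $0.50 mailing
PROFIT_NON_RESPONDER = -0.50  # cost of the mailing only

def expected_profit_per_mailing(response_rate):
    """Expected profit from mailing one customer in a node with this response rate."""
    return (response_rate * PROFIT_RESPONDER
            + (1 - response_rate) * PROFIT_NON_RESPONDER)

# Solve p * 5.50 - (1 - p) * 0.50 = 0 for p:  p = 0.50 / (5.50 + 0.50)
breakeven_rate = -PROFIT_NON_RESPONDER / (PROFIT_RESPONDER - PROFIT_NON_RESPONDER)

print(f"breakeven response rate = {breakeven_rate:.2%}")  # 8.33%
```

A node whose response rate is at or above this threshold has a non-negative expected profit per mailing; a node below it loses money on average.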

Validating the Model

Decision trees are prone to overfitting – meaning that the tree is overly tailored or customized to

the dataset used to create the tree. If this is the case, then the tree will do a substantially poorer

job of predicting or classifying on a new set of data. To assess the performance of the decision

tree on ‘new’ data, we use the one-third of the dataset that forms the validation or test sample.

We expect the model to perform slightly worse on the test sample compared with the training

sample – although the difference should be slight. A large discrepancy between the two

suggests that the tree has been overfit and needs to be pruned back.
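One way to make the training-versus-test comparison concrete is to compute cumulative gains for both samples from the node counts in Exhibit 7: rank the terminal nodes by response rate, then track what share of all buyers is captured as a growing share of customers is mailed. The sketch below is an illustration only. The function name is hypothetical, and each sample is ranked by its own response rates, which is one reasonable convention among several.

```python
# (customers, buyers) per terminal node, in the order listed in Exhibit 7
TRAINING = [(1016, 350), (1788, 438), (503, 119), (5197, 871), (2075, 298),
            (803, 54), (428, 26), (1689, 90), (501, 26), (3190, 160), (15799, 576)]
TEST = [(513, 168), (244, 62), (898, 208), (2749, 431), (1106, 165),
        (239, 19), (895, 50), (401, 22), (1657, 85), (231, 9), (8078, 295)]

def cumulative_gains(nodes):
    """Return (cumulative share of customers, cumulative share of buyers) per node."""
    # Rank nodes from highest to lowest response rate
    ranked = sorted(nodes, key=lambda node: node[1] / node[0], reverse=True)
    total_customers = sum(customers for customers, _ in ranked)
    total_buyers = sum(buyers for _, buyers in ranked)
    points, cum_customers, cum_buyers = [], 0, 0
    for customers, buyers in ranked:
        cum_customers += customers
        cum_buyers += buyers
        points.append((cum_customers / total_customers, cum_buyers / total_buyers))
    return points

for name, nodes in (("training", TRAINING), ("test", TEST)):
    print(name)
    for share_customers, share_buyers in cumulative_gains(nodes):
        print(f"  {share_customers:6.1%} of customers -> {share_buyers:6.1%} of buyers")
```

If the two sets of points trace nearly the same curve, the tree generalizes well; a test curve that sags well below the training curve would point to overfitting.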

Case Questions:

1. Using the information in Exhibits 5 and 6, summarize – for the Director of Marketing –

which customer groups should be targeted with this mailing.

2. Use the information in Exhibit 7 to make a cumulative gains chart for both the training

and test samples. Does the tree appear ‘overfit’? Why or why not?

3. Using the same costs as before ($18 selling price, $9 wholesale price, $3 shipping and

$0.50 mailing costs), estimate what the gross profit (in dollars and as a % of gross sales)

as well as the return on marketing would be if the offer for ‘The Art History of Florence’ were

only mailed to those predicted by the CHAID tree results to be good prospects for this

offer.

4. Compare and contrast the results and insights from using RFM, logistic regression and

CHAID decision tree analysis for targeting buyers for BookBinders’ offers.

SPSS functions refresher

To conduct the step-by-step decision tree analysis, you only need a few SPSS functions:

1. To select the appropriate cases for a given node examination, go to Data>Select cases>If condition is satisfied and define the condition (e.g. age = 1 or age = 2; (female = 0) and (age = 1 or age = 2)).

2. To create dummy variables, go to Transform>Create dummy variables, under main effects pick a root (e.g. ag for age and inc for income) and a macro name (e.g. m1), and SPSS generates the dummy variables.

3. If you want to combine two categories, say age = 1 and age = 2, you can simply go to Transform>Compute variable and sum the dummies for these categories (age12 = ag_1 + ag_2, assuming that’s how you named your dummy variables).

4. For cross-tabs, go to Analyze>Descriptive Statistics>Crosstabs and do not forget to click on Chi-square under Statistics. To see more decimals in the output table, activate the table by double-clicking it and then right-click to format the cell.
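For readers who want to see what the Chi-square option in step 4 computes, the Pearson chi-square statistic for a table of counts can be sketched in plain Python. This is an illustration of the textbook formula, not SPSS's implementation, and the function name is hypothetical. CHAID also merges categories and adjusts p-values, which this sketch does not attempt. For data, it cross-tabulates the eleven training-sample terminal nodes from Exhibit 7 against buyer status; in the step-by-step analysis the rows would instead be the categories of a candidate predictor within the node being examined.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts.

    `table` is a list of rows; each row is a list of counts, one per column.
    Assumes every row and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# (customers, buyers) per training-sample terminal node, from Exhibit 7
nodes = [(1016, 350), (1788, 438), (503, 119), (5197, 871), (2075, 298),
         (803, 54), (428, 26), (1689, 90), (501, 26), (3190, 160), (15799, 576)]
# Rows: terminal nodes; columns: [non-buyers, buyers]
table = [[customers - buyers, buyers] for customers, buyers in nodes]

stat, dof = pearson_chi_square(table)
print(f"chi-square = {stat:.1f}, df = {dof}")
```

A larger statistic for the same degrees of freedom indicates a stronger association between the row variable and buying, which is the sense in which a predictor "best" separates buyers from non-buyers.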