How Do You Know If Your Predictive Scoring Test Works?
Most marketing and sales leaders think of lift measurement when comparing the performance of two prediction models. While lift is ideal for segmenting good leads from bad, it is not the best approach for comparing two scoring models.
Ideally, you want to focus on success criteria specific to the goal at hand. For instance, if you are trying to sell more to existing customers, base the test's success metric on an increase in revenue rather than an increase in lead volume. After all, a single customer could generate a significant increase in revenue; your focus should be on identifying the leads with that potential.
On the other hand, if you want to acquire new customers, specific revenues are irrelevant. In that case, you need to understand how many leads are required to land the desired number of new customers, so the focus should be on lead conversion rate.
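To make the conversion-rate framing concrete, here is a minimal sketch of working backwards from a new-customer target to a required lead volume. The target and conversion rate below are hypothetical, not figures from the article.

```python
# Sketch: how many leads do you need to land a target number of new customers,
# given an assumed (hypothetical) lead conversion rate?
import math

def leads_needed(target_customers: int, conversion_rate: float) -> int:
    """Round up, since you cannot work a fractional lead."""
    return math.ceil(target_customers / conversion_rate)

# Hypothetical example: 50 new customers at a 2% lead conversion rate.
print(leads_needed(50, 0.02))  # → 2500
```

This is the quantity a scoring model should improve: a better model raises the conversion rate among the leads you pass along, lowering the volume required.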
Once you have decided on your success criteria, it’s important to measure the testing results the same way for each vendor. Keep in mind that prediction models can use a number of different machine learning technologies, including neural networks, decision trees and logistic regression to name a few. However, as long as the models are trained and tested on the same data sets and are built to predict the same event, you can use the same criteria to measure their performance.
The right approach for measuring model performance
While lift is a standard measure for determining a lead cutoff, it is not very useful for comparing the performance of lead scoring models. Consider the following table comparing lift for two models.
If you want to identify the top 100 leads, you'd choose Model 1 because it yielded the highest lift. However, the ranking flips as you move down the list: for the top 300 leads, Model 1 outperforms Model 2, but for the top 500 leads, Model 2 outperforms Model 1. In other words, looking at lift individually for each segment doesn't give you an accurate picture of which model is better overall.
With that in mind, a better way to compare the models is to compare cumulative conversions (columns D and F), rather than the lift of individual segments, as shown in the table below.
Using this approach, you can clearly see which model is better for your needs. If your goal is to pass 200 leads to the sales team, Model 1 is the better choice because it yields more conversions. If, on the other hand, your goal is to pass 500 leads, you would choose Model 2.
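The cumulative-conversion comparison can be sketched in a few lines. The per-segment conversion counts below are hypothetical stand-ins for the article's table, chosen only so the example reproduces the same pattern (Model 1 wins at 200 leads, Model 2 at 500).

```python
# Sketch: compare two scoring models by cumulative conversions at a cutoff,
# rather than by per-segment lift. Counts per 100-lead segment are hypothetical.
from itertools import accumulate

model_1 = [40, 25, 18, 10, 7]   # conversions in each block of 100 leads
model_2 = [35, 28, 22, 15, 12]

cum_1 = list(accumulate(model_1))  # cumulative conversions by cutoff
cum_2 = list(accumulate(model_2))

def better_model_at(cutoff_index: int) -> str:
    """Pick the model with more cumulative conversions at a given cutoff."""
    if cum_1[cutoff_index] == cum_2[cutoff_index]:
        return "tie"
    return "Model 1" if cum_1[cutoff_index] > cum_2[cutoff_index] else "Model 2"

print(better_model_at(1))  # top 200 leads → Model 1
print(better_model_at(4))  # top 500 leads → Model 2
```

The point of the sketch: the winner depends entirely on where you set the cutoff, which is why the cutoff goal must be fixed before the comparison.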
When comparing multiple models:
Set a cutoff goal prior to evaluating predictive models. Keep vendors informed of the goals, as many can tweak their models accordingly to yield the best possible performance.
Do not use lift to evaluate performance of prediction models. Instead, compare the total conversion expected from a set of leads determined by the cutoff goal.
Compare Differences in Model Outcomes
Once the vendors provide output from their predictive models, you will need to determine if the difference in performance is statistically significant.
Let’s suppose a test set of 250,000 leads — with 5,000 converted leads — was created to compare the performance of two models. The performance of each model is shown below. Looking at the first three rows, one might conclude that Model 1 is better than Model 2 because Model 1 has more converted leads in the top 25,000 and 50,000.
However, keep in mind that the same model may produce different results each time it is used to score leads. This run-to-run variation can be estimated with well-known statistical sampling methods. If you're not a statistician, a simple rule of thumb is that the standard deviation of a conversion count is roughly the square root of that count.
For example, the top 25,000 leads scored by Model 1 resulted in 1,500 converted leads (first row). The expected range for this segment can be estimated by taking the square root of 1,500, which is 39 (38.72 to be precise). So, if the top 25,000 leads were selected from another batch of 250,000 scored by Model 1, it is plausible that between 1,461 and 1,539 (1,500 +/- 39) would convert. This expected range is shown in the table below for each segment. With that in mind, you can see that Model 1 is better for scoring the top 25,000 and 50,000 leads, but not for scoring the top 75,000. Because there is significant overlap between the expected conversion ranges of the two models, there is no clear winner for the 75,000-lead tier. Interestingly, Model 2 performs better for scoring the top 100,000 leads and up through the top 225,000 leads.
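The square-root rule of thumb and the overlap check can be sketched as follows. This treats each conversion count as approximately Poisson-distributed (a common assumption for event counts, not one the article states explicitly), so the standard deviation of a count n is about sqrt(n). The second comparison count below is hypothetical.

```python
# Sketch: declare a winner only when the expected conversion ranges
# (count +/- sqrt(count)) of the two models do not overlap.
import math

def conversion_range(converted: int) -> tuple[float, float]:
    """Expected range for a conversion count under a Poisson approximation."""
    sd = math.sqrt(converted)
    return (converted - sd, converted + sd)

def clear_winner(conv_1: int, conv_2: int):
    """Return the better model only if the ranges do not overlap."""
    lo_1, hi_1 = conversion_range(conv_1)
    lo_2, hi_2 = conversion_range(conv_2)
    if lo_1 > hi_2:
        return "Model 1"
    if lo_2 > hi_1:
        return "Model 2"
    return None  # ranges overlap: no statistically clear winner

# Article's figure: 1,500 conversions in Model 1's top 25,000 leads.
print(conversion_range(1500))    # roughly (1461.3, 1538.7)
print(clear_winner(1500, 1400))  # hypothetical Model 2 count; no overlap
print(clear_winner(1500, 1460))  # hypothetical count whose range overlaps
```

When the ranges overlap, as in the last call, the honest conclusion is "no clear winner" for that tier, exactly as in the 75,000-lead case above.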
These examples illustrate that it is important to take the business context into account when evaluating different scoring solutions. Lift alone may be an inadequate measure. Focus instead on defining the ultimate success criteria first, and then evaluate the models based on that outcome. Learn more about comparing vendors in our latest ebook.
Image Credit(s): Doran