Dear Dr. Jay,
After the 2016 election, how will I ever be able to trust predictive models again?
Whether we’re talking about political polling or market research, to build good models we need good inputs. Or, as the old saying goes: “garbage in, garbage out.” Let’s look at the main sources of error in the data itself:
- First, we make it too easy for respondents to say “yes” or “no,” and they try to help us by guessing what answer we want to hear. For example, when we ask for purchase intent for a new product idea, respondents often overstate their true likelihood of buying the product.
- Second, we give respondents perfect information. We create 100% awareness when we show the respondent a new product concept. In reality, we will never achieve 100% awareness in the market: some folks live under a rock, and, of course, the client will never spend enough money on advertising to even get close.
- Third, the sample frame may not be truly representative of the population we hope to project to. This is one of the key issues in political polling, because the population comprises those who actually voted, not registered voters. For a model to be correct, we need to predict which voters will actually show up at the polls and how they will vote. The good news in market research is that the population is usually not a moving target.
Now, let’s consider the sources of error in building predictive models. The first step in building a predictive model is to specify the model. If you’re a purist, you begin with a hypothesis, collect the data, test the hypothesis, and draw conclusions. If we fail to reject the null hypothesis, we should formulate a new hypothesis and collect new data. What do we actually do? We mine the data until we get significant results. Why? Because data collection is expensive. One possible outcome of continuing to mine the data in search of a better model is overfitting: a model that is very good at predicting the data you have but not very accurate when predicting results from new inputs.
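To see what that data-mining trap looks like in practice, here’s a minimal sketch (hypothetical data, NumPy assumed): we fit both a simple model and a heavily "mined" high-degree polynomial to the same noisy observations, then check each against held-out data standing in for new inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a noisy linear relationship (think ad spend vs. sales).
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(scale=0.3, size=x.size)

# Hold out half the observations to stand in for "new inputs".
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coefs = np.polyfit(x_train, y_train, degree)
    mse = lambda xs, ys: float(np.mean((np.polyval(coefs, xs) - ys) ** 2))
    return mse(x_train, y_train), mse(x_test, y_test)

simple_train, simple_test = fit_and_score(1)  # one straight line
mined_train, mined_test = fit_and_score(7)    # "mined" until it fits

# The heavily mined model always looks better on the data we already have...
print(f"train MSE: simple={simple_train:.3f}, mined={mined_train:.3f}")
# ...but it typically does worse on data it has never seen.
print(f"test  MSE: simple={simple_test:.3f}, mined={mined_test:.3f}")
```

The mined model wins on the training data by construction; whether it wins on the held-out data is a different question entirely, and that gap is what bites when the model meets the real market.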
It is up to the analyst to decide what is statistically meaningful versus what is managerially meaningful. There are a number of websites where you can find “interesting” relationships in data. Some examples of spurious correlations include:
- Divorce rate in Maine and the per capita consumption of margarine
- Number of people who die by becoming entangled in their bedsheets and the total revenue of US ski resorts
- Per capita consumption of mozzarella cheese (US) and the number of civil engineering doctorates awarded (US)
In short, you can build a model that’s accurate but still wouldn’t be of any use (or make any sense) to your client. And the fact is, there’s always a certain amount of error in any model we build—we could be wrong, just by chance. Ultimately, it’s up to the analyst to understand not only the tools and inputs they’re using but the business (or political) context.
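Those coincidental correlations are easy to manufacture. As a rough sketch (purely random data, NumPy assumed), if you mine enough unrelated series, a "strong" correlation with your metric of interest will appear by chance alone:

```python
import numpy as np

rng = np.random.default_rng(42)

# A hypothetical "metric of interest": ten yearly observations of pure noise.
target = rng.normal(size=10)

# Mine 1,000 equally meaningless candidate series for the best correlate.
candidates = rng.normal(size=(1000, 10))
corrs = np.array([np.corrcoef(target, c)[0, 1] for c in candidates])
best = float(np.abs(corrs).max())

# With this many tries, an impressive-looking correlation shows up by
# chance alone -- yet it predicts nothing about next year's numbers.
print(f"strongest |r| found among random series: {best:.2f}")
```

None of these series has anything to do with the target, but the winner of the mining contest will look publication-worthy, which is exactly how margarine ends up "explaining" divorce rates in Maine.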
Dr. Jay loves designing really big, complex choice models. With over 20 years of DCM experience, he’s never met a design challenge he couldn’t solve.
PS – Have you registered for our webinar yet!? Join Dr. Erica Carranza as she explains why, to change what consumers think of your brand, you must change their image of the people who use it.
What: The Key to Consumer-Centricity: Your Brand User Image
When: February 1, 2017 @ 1PM EST