The posts here represent the opinions of CMB employees and guests—not necessarily the company as a whole. 

A User's Guide to the “Perfect” Segmentation

Posted by Jay Weiner, PhD

Mon, Jul 22, 2019


A really good segmentation benefits many users. The product development team needs to design products and services for key target segments. The marketing team needs to develop targeted communications. The data scientists need to score the database for targeting current customers. The salesforce needs to develop personalized pitches. Last, but not least, the finance department uses segmentation to help allocate the resources of the firm. With so many interested parties, it’s easy to see why getting buy-in up front is critical to the success of any segmentation.

A "perfect" segmentation solution would offer insights for each user to help them execute the strategic plan. What does this mean from an analytical perspective? It means we have differentiation on needs for the product development folks, attitudes for the marketing folks, and a predictive scoring model for the internal database team. That sounds easy enough, but in practice it is difficult. Attitudes are not always predictive of behaviors. For example, I’m concerned about the environment. I have solar panels on my roof. You’d think I would drive a zero-emissions vehicle (ZEV), and yet I drive a 400HP, high-octane-burning, V8 gas-powered car. I don’t feel too bad about that since I don’t really drive much. That said, my next car could be the Volkswagen I.D. Buzz, an all-electric nostalgic take on the original VW van, but I digress.

Segmentation is not a property of the market; it is an activity. It’s usually helpful to evaluate several potential segmentation schemes to see how well they deliver on the key objectives. We do this by prioritizing the objectives. Getting nice differentiation on attitudes to help create more effective marketing campaigns might be more important than achieving high accuracy in scoring the database.

My colleague Brant Cruz recently listed leveraging existing data sources as one of the keys to successful segmentation. This is often one of the biggest challenges we face in segmentation. How well can we classify the customer database? What’s in the database? Most often it’s behavioral data like monthly spend, products purchased, and points redeemed. These data are the most accurate representation of what happened and when it happened. What they don’t help explain is why it happened and, in some cases, who did it. For example, many families subscribe to streaming music and video services. If you don’t remember to log in, then the behavior is correct for the family, but not necessarily attributable to a specific user.

Appending demographic and attitudinal data to the database can help provide the links. When such data are available, we have to verify the source of those data. Many companies offer the ability to append demographic and, potentially, attitudinal data. If this is the source of the append, is it an actual value for the specific customer, or a proxy for that customer based on nearest-neighbor values? In either case, we would still need to determine the age of the appended data. How often do these values get updated? Are some values missing? For example, if a customer has recently signed up for an account, their 90-day behavioral data elements won’t be populated for some period of time. This means that I would need to either remove these respondents from my file or build a separate model for new customers. How accurately we can predict the segments depends in part on how accurate our data are.
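As a minimal sketch of that split into new vs. established customers (the field names, values, and the 90-day cutoff here are all invented for illustration):

```python
# Hypothetical customer file; field names and values are invented.
customers = [
    {"id": 1, "tenure_days": 400, "spend_90d": 120.0},
    {"id": 2, "tenure_days": 30,  "spend_90d": None},  # too new for 90-day data
    {"id": 3, "tenure_days": 800, "spend_90d": 75.5},
]

# Established customers can be scored with the full model; brand-new ones
# either get dropped from the file or routed to a separate model.
established = [c for c in customers if c["tenure_days"] >= 90]
new_customers = [c for c in customers if c["tenure_days"] < 90]
```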

The most accurate solution would be to segment using only information in the database. But if our ultimate goal is to help the client with prospecting for new business, a segmentation of current customers alone is not likely to be too helpful. This means that I need to collect primary data and ask surrogate questions for the values in the database. A concurrent sample of customers would help with any need to calibrate the survey responses for over/under statement.

When we start to mix database values with primary survey data, we typically do two things. First, we dilute the differences in attitudes and needs. Second, we reduce the accuracy of scoring the database. There are ways to improve the scoring accuracy. We can provide a list of attributes that could be appended to the database to increase the rate of correct classification. Sometimes the data scientists may be able to identify additional variables in the database that were not provided up front. Other times, it’s simply a matter of figuring out how to collect these values and have them appended to the database.

One part of the evaluation is to determine how many segments to have. Just because you have a segment doesn’t mean you have to target that segment. You should have at least one more segment than you intend to target. Why? This lets you identify an opportunity that you have left in the market for your competitors. Just because there are segments of folks interested in zero-emission vehicles or self-driving cars doesn’t mean you need to make them. Most companies can only afford to target a small number of segments. Database segmentations with targeted digital campaigns are often easy to execute with a larger number of segments.

How long can you expect your solution to last? Typically, segmentation schemes last as long as there are no major changes in the market. Changes can come from technological innovations: ZEVs and self-driving cars have changed the auto industry. Shifts in the size of the segments over time are just one indication that the segmentation could use refreshing.

Dr. Jay is CMB’s Chief Methodologist and VP of Advanced Analytics and is always up for tackling your most pressing questions. Submit yours and he could answer it in his next blog!

Ask a Question!

Topics: advanced analytics, market strategy and segmentation

How Advanced Analytics Saved My Commute

Posted by Laura Dulude

Wed, Aug 22, 2018


I don’t like commuting. Most people don’t. If you analyzed the emotions that commuting evokes, you’d probably hear commuters say it made them frustrated, tired, and bored. To be fair, my commute experience isn’t as bad as it could be: I take a ~20-minute ride into Boston on the Orange Line, plus some walking before and after.

Still, wanting to minimize my discomfort during my time on the train and because I am who I am, I tracked my morning commute for about 10 months. I logged the number of other people waiting on the platform, number of minutes until the next train, time spent on the train, delays announced, the weather, and several other factors I thought might be related to a negative experience.

Ultimately, I decided the most frustrating part about my commute is how crowded the train is—the less crowded I am, the happier I feel. So, I decided to predict my subjective crowd rating for each day using other variables in my commuting dataset.

In this example, I’ve used a TreeNet analysis. TreeNet is the type of driver modeling we do most often at CMB because it’s flexible, allows you to include categorical predictors without creating dummy variables, handles missing data without much pre-processing, resists outliers, and does better with correlated independent variables than other techniques do.

TreeNet scores are shown in comparison to each other. The most important input will always be 100, and every other independent variable is scaled relative to that top variable. So, as you see in Figure 1, the time I board the train and the day of the week are about half as important as the number of people on the platform when I board. That means that as it turns out, I probably can’t do all that much to affect my commute, but I can at least know when it’ll be particularly unpleasant.

Figure 1: Importance to Crowding_commuter
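TreeNet itself is proprietary software, so the sketch below only illustrates the relative-importance scaling just described; the raw scores and variable names are invented for illustration, not taken from the actual model run.

```python
# Hypothetical raw importance scores from a driver model -- the actual
# numbers from a TreeNet run would differ.
raw_importance = {
    "people_on_platform": 0.46,
    "boarding_time": 0.24,
    "weekday": 0.22,
    "weather": 0.08,
}

# Scale every score relative to the largest, so the top driver is 100.
top = max(raw_importance.values())
relative = {k: round(100 * v / top) for k, v in raw_importance.items()}

for name, score in sorted(relative.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {score}")
```

With these made-up inputs, boarding time and weekday land near 50, i.e., about half as important as the platform crowd.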

What this importance chart doesn’t tell you is the relationship each item has to the dependent variable. For example, which weekdays have lower vs. higher crowding? Per-variable charts give us more information:

Figure 2: Weekday and Crowding_commuter

Figure 2 indicates that crowding lessens as the week goes on. Perhaps people are switching to ride-sharing services or working from home those days.

For continuous variables, like boarding time, we can explore the relationships through line charts:

Figure 3: Boarding Time and Crowding_commuter

Looks like I should get up on the earlier side if I want to have the best commuting experience! Need to tackle a thornier issue than your morning commute? Our Advanced Analytics team is the best in the business—contact us and let’s talk about how we can help!

 Laura Dulude is a data nerd and a grumpy commuter who just wants to get to work.

Topics: advanced analytics, EMPACT, emotional measurement, data visualization

Predicting Olympic Gold

Posted by Jen Golden

Wed, Feb 21, 2018


From dangerous winds and curling scandals to wardrobe malfunctions, there’s been no shortage of attention-grabbing headlines at the 2018 Winter Olympics.

And for ardent supporters of Team USA, the big story is America’s lagging medal count. We’re over halfway through the games, and currently the US sits in fifth place behind Norway, Germany, Canada, and the Netherlands.

Based on last week’s performance (and Mikaela Shiffrin’s recent withdrawal from the women’s downhill event), it’s hard to know for sure how America will place. However, we can use predictive analytics to determine the main predictors of medal count to anticipate which countries will generally be on the podium.

We’ll use TreeNet modeling to identify the main drivers of medal count based on previous Winter Olympics outcomes. For the sake of simplicity, we’ll focus on the 2014 Sochi winter games (excluding all Russia data, which would skew the model!). From there, we can infer similarities between medal drivers for Sochi and PyeongChang.

Please note all these results are hypothetical, and not reflective of actual data!

To successfully run a TreeNet analysis, you need both a dependent variable (i.e., the outcome you are trying to predict) and independent variables (i.e., the inputs that could be possible predictors of the dependent variable).

In this case…

Dependent variable: Total 2014 Sochi Winter Games medal count
Independent variables (including data both directly related to the Olympics and otherwise):

  • Medal count at the Vancouver Olympic games
  • Medal count at previous Winter Games (all time)
  • Number of athletes participating
  • Number of events participating in
  • Number of outdoor events participating in (e.g., downhill skiing, bobsled)
  • Number of indoor events participating in (e.g., figure skating, curling)
  • Average country temperature
  • Average country yearly snowfall
  • Country population
  • Country GDP per capita
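The setup above can be sketched as a simple table. Every value below is invented for illustration; only the idea of the columns comes from the list above:

```python
# Each row is one country; every value here is invented for illustration.
rows = [
    {"country": "A", "vancouver_medals": 20, "all_time_medals": 250,
     "athletes": 130, "events": 80, "population_m": 5.3,
     "gdp_per_capita": 75_000, "sochi_medals": 24},
    {"country": "B", "vancouver_medals": 12, "all_time_medals": 140,
     "athletes": 210, "events": 88, "population_m": 36.0,
     "gdp_per_capita": 50_000, "sochi_medals": 18},
    {"country": "C", "vancouver_medals": 4, "all_time_medals": 30,
     "athletes": 60, "events": 40, "population_m": 60.1,
     "gdp_per_capita": 38_000, "sochi_medals": 5},
]

# Dependent variable: the outcome we want to predict.
y = [row["sochi_medals"] for row in rows]

# Independent variables: the candidate predictors (a subset of the list above).
features = ["vancouver_medals", "all_time_medals", "athletes",
            "events", "population_m", "gdp_per_capita"]
X = [[row[f] for f in features] for row in rows]
```

Separating the outcome (`y`) from the candidate predictors (`X`) is the whole setup; the model then estimates how much each predictor drives the outcome.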

The Results!

Our model shows the relative importance of each variable calibrated to a 100-point scale. The most important variable is assigned a score of 100 while all other variables are scaled relative to that:

Figure: Olympic Medal Predictors

Meaning, in this sample output, previous medal history is the top predictor of Olympic medal outcome with a score of 100, while the number of outdoor and indoor events participated in are the least predictive.

This is a fun and simple example of how we could use TreeNet to forecast the Winter Olympic medal count. But we also leverage this same technique to help clients predict the outcomes of some of their most complex and challenging questions. We can help predict things like consideration, satisfaction, or purchase intent, for example, and use the model to point to which levers can be pulled to help improve the outcome.

Jen is a Sr. Project Manager at CMB who was a spectator at the Sochi winter games and wishes she was in PyeongChang right now.

Topics: advanced analytics, predictive analytics

CMB's Advanced Analytics Team Receives Children's Trust Partnership Award

Posted by Megan McManaman

Wed, Nov 01, 2017


We're proud to announce that CMB’s VP of Advanced Analytics, Dr. Jay Weiner, and Senior Analyst Liz White were honored with the Children’s Trust’s Partnership Award. Presented annually, the award recognizes the organizations and people whose work directly impacts the organization's mission: stopping child abuse.

Jay and Liz were recognized for their work helping the Children’s Trust identify the messaging that resonated with potential donors and program users. Through two studies leveraging CMB’s emotional impact analysis (EMPACT), MaxDiff scaling, concept testing, a self-explicated conjoint, and a highlighter exercise, the CMB team helped the Children's Trust identify the most appealing and compelling messaging.

“There is no one more deserving of this award than the team at CMB,” said Children’s Trust’s Executive Director, Suzin Bartley. “The messaging guidance CMB provided has been invaluable in helping us realize our mission to prevent child abuse in Massachusetts.”

Giving back to our community is part of CMB’s DNA, and we’re honored to support the Children’s Trust’s mission to stop child abuse in Massachusetts. Click here to learn more about how the Children’s Trust provides families with programs and services to help them build the skills and confidence they need to make sure kids have safe and healthy childhoods.

From partnering with the Children’s Trust and volunteering at Boston’s St. Francis House to participating in the Leukemia & Lymphoma Society’s annual Light the Night walk, we have a longstanding commitment to serving our community. Learn more about CMB in the community here.



Topics: advanced analytics, predictive analytics, Community Involvement

Does your metric have a home(plate)?

Posted by Youme Yai

Thu, Sep 28, 2017


Last month I attended a Red Sox/Yankees matchup at Fenway Park. By the seventh inning, the Sox had already cycled through seven pitchers. Fans were starting to lose patience and one guy even jumped on the field for entertainment. While others were losing interest, I stayed engaged in the game—not because of the action that was (not) unfolding, but because of the game statistics.

Statistics have been at the heart of baseball for as long as the sport’s been around. Few other sports track individual and team stats with such precision and detail (I suggest reading Michael Lewis’ Moneyball if you haven’t already). As a spectator, you know exactly what’s happening at all times, and this is one of my favorite things about baseball. As much as I enjoy watching the hits, runs, steals, strikes, etc., unfold on the field, it’s equally fun to watch those plays translate into statistics—witnessing the rise and fall of individual players and teams.

Traditionally, batting average (# of hits divided by # of at-bats) and earned run average (# of earned runs allowed by a pitcher per nine innings) have dominated the statistical world of baseball, but there are SO many others recorded. There’s RBI (runs batted in), OPS (on-base plus slugging), ISO (isolated power: the raw power of a hitter, counting only extra-base hits and the type of hit), FIP (fielding independent pitching: similar to ERA, but it focuses solely on pitching and removes results on balls hit into the field of play), and even xFIP (expected fielding independent pitching; or in layman’s terms: how a pitcher performs independent of how his teammates perform once the ball is in play, while also accounting for home runs given up vs. the league’s home run average). And that's just the tip of the iceberg.
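A few of these formulas are simple enough to compute directly. The stat lines below are made up; only the formulas follow the standard definitions:

```python
# A made-up season stat line for one hitter.
hits, at_bats, doubles, triples, home_runs = 150, 500, 30, 5, 25

# Batting average: hits divided by at-bats.
avg = hits / at_bats

# Total bases weight extra-base hits; singles are the remainder of the hits.
singles = hits - doubles - triples - home_runs
total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
slg = total_bases / at_bats  # slugging percentage

# Isolated power: raw power, counting only the extra bases.
iso = slg - avg

# A made-up line for one pitcher.
earned_runs, innings_pitched = 70, 200
# Earned run average: earned runs allowed per nine innings.
era = 9 * earned_runs / innings_pitched
```

For this invented hitter, avg works out to .300 with an ISO of .230, and the pitcher’s ERA is 3.15.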

With all this data, sabermetrics can yield some unwieldy metrics that have little applicability or predictive power. And sometimes we see this happen in market research. There are times when we are asked to collect hard-to-justify variables in our studies. While it seems sensible to gather as much information as possible, there’s such a thing as “too much” where it starts to dilute the goal and clarity of the project.  

So, I’ll take off my baseball cap and put on my researcher’s hat for this: as you develop your questionnaire, evaluate whether a metric is a “nice to have” or a “need to have.” Here are some things to keep in mind as you evaluate your metrics:

  1. Determine the overall business objective: What is the business question I am looking to answer based on this research? Keep reminding yourself of this objective.
  2. Identify the hypothesis (or hypotheses) that make up the objective: What are the preconceived notions that will lead to an informed business decision?
  3. Establish the pieces of information to prove or disprove the hypothesis: What data do I need to verify the assumption, or invalidate it?
  4. Assess if your metrics align to the information necessary to prove or disprove one or more of your identified hypotheses.

If your metric doesn’t have a home (plate) in one of the hypotheses, then discard it or turn it into one that does. Following this list can make the difference between accumulating a lot of data that produces no actionable results and a study that meets your initial business goal.

Combing through unnecessary data points is cumbersome and costly, so be judicious with your red pen in striking out useless questions. Don’t get bogged down with information if it isn’t directly helping achieve your business goal. Here at CMB, we partner with clients to minimize this effect and help meet study objectives starting well before the data collection stage.

Youme Yai is a Project Manager at CMB who believes a summer evening at the ballpark is second to none.


Topics: advanced analytics, data collection, predictive analytics