Spring into Data Cleaning

Posted by Nicole Battaglia on Tue, Apr 04, 2017

When someone hears “spring cleaning,” they probably think of organizing their garage, purging clothes from their closet, and decluttering their workspace. For many, spring is a chance to refresh and rejuvenate after a long winter (fortunately ours in Boston was pretty mild).

This may be my inner market researcher talking, but when I think of spring cleaning, the first thing that comes to mind is data cleaning. Like cleaning and organizing your home, data cleaning is a detailed and lengthy process, and it matters to both researchers and their clients.

Data cleaning is an arduous task. Each completed questionnaire must be checked to ensure that it's been answered correctly, clearly, truthfully, and consistently. Here’s what we typically clean:

  • We’ll look at each open-ended response in a survey to make sure respondents’ answers are coherent and appropriate. Sometimes respondents will curse; other times they'll write outrageously irrelevant answers, like what they’re having for dinner, so we monitor these closely. We do the same for open-ended numeric responses: there’s always that one respondent who enters ‘50’ when asked how many siblings they have.
  • We also check for outliers in open-ended numeric responses. Whether it’s false data or an exceptional respondent (e.g. Bill Gates), outliers can skew our data and lead us to draw the wrong conclusions and make poor recommendations to clients. For example, I worked on a survey that asked respondents how many cars they own. Anyone who provided a number more than three standard deviations above the mean was flagged as an outlier because their answers would’ve significantly distorted our interpretation of average car ownership: the reality is the average household owns two cars, not six.
  • Straightliners are respondents who answer a battery of questions on the same scale with the same response. Because of this, sometimes we’ll see someone who strongly agrees or disagrees with two completely opposing statements, which makes it difficult to trust that these answers reflect the respondent’s real opinion.
  • We often insert a Red Herring Fail into our questionnaires to help identify and weed out distracted respondents. A Red Herring Fail is a 10-point scale question usually placed around the halfway mark of a questionnaire that simply asks respondents to select the number “3” on the scale. If they select a number other than “3”, we flag them for removal.
  • If there’s an incentive to participate in a questionnaire, someone may feel inclined to participate more than once. So to ensure our completed surveys are from unique individuals, we check for duplicate IP addresses and respondent IDs.
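
For readers who like to see checks like these written down, here is a rough Python sketch of four of them: the three-standard-deviation outlier rule, straightlining, the red herring question, and duplicate detection. It is only an illustration, not CMB's actual tooling. The function names, the sample numbers, and the choice of the sample standard deviation are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean, stdev


def flag_high_outliers(values, n_sd=3.0):
    """Return the positions of values more than n_sd sample standard
    deviations above the mean (hypothetical helper; assumes 2+ values)."""
    cutoff = mean(values) + n_sd * stdev(values)
    return [i for i, v in enumerate(values) if v > cutoff]


def is_straightliner(grid_answers):
    """True when every answer in a battery of same-scale questions is identical."""
    return len(grid_answers) > 1 and len(set(grid_answers)) == 1


def fails_red_herring(answer, expected=3):
    """True when the respondent did not pick the number they were told to pick."""
    return answer != expected


def duplicated(values):
    """Return the set of values (e.g. IP addresses or respondent IDs)
    that appear more than once."""
    counts = Counter(values)
    return {value for value, count in counts.items() if count > 1}


# Made-up answers to "How many cars do you own?" (20 respondents).
cars_owned = [1, 2, 2, 3, 2, 1, 2, 2, 3, 2, 1, 2, 2, 3, 2, 1, 2, 2, 3, 40]

print(flag_high_outliers(cars_owned))        # positions flagged for review
print(is_straightliner([5, 5, 5, 5, 5]))     # same answer all the way down
print(fails_red_herring(7))                  # picked 7 instead of 3
print(duplicated(["10.0.0.1", "10.0.0.2", "10.0.0.1"]))
```

Two caveats on the sketch. With very small samples no single answer can ever sit three standard deviations above the mean, so the outlier rule needs a reasonable number of respondents to be useful. And a repeated IP address is only a signal to look closer, since people in the same household or office can legitimately share one.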

There are a lot of variables that can skew our data, so our cleaning process is thorough and thoughtful. And while the process may be cumbersome, here’s why we clean data: 

  • Impression on the client: Following a detailed data cleaning process helps show that your team is cautious, thoughtful, and able to accurately dissect and digest large amounts of data. This demonstration of thoroughness and competency goes a long way toward building trust in the researcher/client relationship because the client will see their researchers are working to present the best data possible.
  • Helps tell a better story: We pride ourselves on storytelling (using insights from data and turning them into strong deliverables) to help our clients make strategic business decisions. If we didn’t have accurate and clean data, we wouldn’t be able to tell a good story!
  • Overall, ensures high-quality and precise data: At CMB, typically two or more researchers work on the same data file to mitigate the chance of error. The data undergoes this scrutiny so that any issues or mistakes can be noted and rectified, ensuring the integrity of the report.

The time it takes to clean our data is a small price compared with the risks of skipping it. Data cleaning keeps false or unrepresentative information from influencing our analyses or recommendations to a client and helps ensure our sample accurately reflects the population of interest.

So this spring, while you’re finally putting away those holiday decorations, remember that data cleaning is an essential step in maintaining the integrity of your work.

Nicole Battaglia is an Associate Researcher at CMB who prefers cleaning data over cleaning her bedroom.

Topics: data collection, quantitative research