Take Time to Clean Your Response Data

You might think that if you have skillfully created an online survey, all you need do is wait for the data to shower upon you so that you can start the fun: analysis, visualization, and interpretation. I hate to rain on your parade, but you will spend a major chunk of your project time cleaning the data before you can revel in the good stuff. No matter how carefully you design the survey, you must accept that the instrument, not to mention the lawyers, paralegals, legal-service vendors, outside counsel, and clients who take it, will confound you.

Let me draw on a recent project to sketch a few of the ways you can identify problems in the response data, figure out how to rectify them, and keep records so that you know you have completed your scrub accurately.

  • Typos or mistaken entries. One question in a law firm compensation survey asked for base salary. When I ran a quick statistical summary of the data, one entry of $11,000 screamed mistake. A zero had gone AWOL, but there was no way to know where (the true figure might have been $110,000 or $101,000). I had to write the respondent, explain the situation, and keep track of whether he responded. He did, so I inserted the correct figure in the correct place. Had he not responded, I would have inserted a zero where it brought the value closest to the median of the other base salaries for that level and then explained my action in a methodology footnote. Because I make such corrections in an R script, at least the audit trail was clear. (I sketch this kind of outlier screen in R after the list.)

  • Tangled combination selections. In the same survey, we asked about educational level. Our mistake was to allow people to check more than one level if they had earned multiple degrees. I always run a quick frequency table for factors to see how many responses fall into each level. When the table showed one “Four-year college” combined with “Business degree” (“Four-year college,Business degree”), I realized that we could not tell whether that combination indicated an MBA or an undergraduate degree with a business major. Again, this data gremlin triggered a multistep process of emailing, tracking, and correcting. Having spotted the ambiguity, I promptly changed the instructions on the question so that the problem would not recur, and we changed “Business degree” to “MBA.” (A sketch of the frequency-table check appears after the list.)

  • Fake or partial entries. At times you suspect the veracity or good faith of a response. One respondent I noticed while eyeballing the downloaded data had entered a rating of 2 for all 12 items in a list. That pattern had no plausibility, so it made me suspicious of whether the person was submitting genuine answers elsewhere (unknowable without speaking to the person). Perhaps their answers to other questions were legitimate, and they simply didn’t want to spend the time on the more elaborate question. My point is that you need to look at the data; if you catch an oddity or anomaly, you should think about it, settle on a resolution, and probably write to the person. (A sketch of a straight-line check appears after the list.)

  • Key missing data. Six of the first 16 submissions to a compensation survey gave no compensation data! I noticed the glaring omission because I typically glance at the spreadsheet I export from my hosting software. To resolve the problem, I added a text warning on the survey to the effect of “no comp, no report” and then laboriously wrote each of the miscreants. It is not always easy to decide which questions form the backbone of your survey; we had not tagged this one as required, which invited the omissions. (A sketch of a missing-value check appears after the list.)

  • Non-standard titles. We asked for the titles of the participants and gave them a list of the six most common, plus “Other.” Five respondents checked “Other” and filled in the text box. Because we wanted to analyze compensation by title, I had to figure out how to assign those unusual titles to our standard set. Fortunately, keywords such as “Director” or “Manager” in the bespoke titles matched the salary bands of others who held those standard titles. We might have sidestepped the vexation by writing, “Please select the title that most closely corresponds to yours,” and thereby spared ourselves the irritation (but lost some detail, so a trade-off). (A sketch of the keyword matching appears after the list.)

  • Mis-ordered entries. Our survey design used a matrix table to collect compensation information. Unfortunately, we failed to require the person completing the survey to put themselves in the first row, as the lead of the group. That order was a prerequisite for the analytic algorithm. Upon discovering that several people had not done so, I had to figure out, in the R programming language I use, how to swap the rows into the proper order. You can do the same in Excel, but you must remember not to overwrite the corrected rows. (A reordering sketch appears after the list.)

  • Reversal of ranking or rating. Instructions are particularly important for ranking and rating questions because you want to avoid someone carelessly reversing the order or scale. Your instruction might say, “This question uses 1 to rate the least desirable attribute of client satisfaction; it uses 10 to rate the most desirable attribute. Be sure to follow that rating order in your answers.” We have discussed other measures to encourage compliance in “Add Instructions.”
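
A minimal R sketch of the outlier screen described in the typo bullet above. The data frame, column names, and figures are hypothetical stand-ins for the survey export, and the half-of-median threshold is only one plausible cutoff.

    # Hypothetical export from the survey host; column names are assumptions
    resp <- data.frame(
      id          = 1:5,
      level       = c("Manager", "Manager", "Director", "Manager", "Director"),
      base_salary = c(110000, 125000, 180000, 11000, 175000)
    )

    summary(resp$base_salary)   # a five-number summary exposes the $11,000 entry

    # Flag values far below the median base salary for the respondent's level
    med_by_level <- ave(resp$base_salary, resp$level, FUN = median)
    resp$flag <- resp$base_salary < 0.5 * med_by_level
    resp[resp$flag, ]           # rows to take up with the respondents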
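
A minimal R sketch of the quick frequency table described in the combination-selections bullet. The education answers are invented, and the comma separator mimics how many survey hosts join multi-select answers.

    # Hypothetical check-all-that-apply answers, joined with commas by the survey host
    educ <- c("Four-year college", "MBA", "Four-year college,Business degree",
              "Four-year college", "JD")

    table(educ)                           # counts per level expose the odd combination

    # Pull out the multi-select answers for follow-up
    educ[grepl(",", educ, fixed = TRUE)]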
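
A minimal R sketch of a straight-line check for the fake-or-partial-entries bullet: it flags any respondent who gave an identical rating to every item. The column names and values are hypothetical, and a real survey would have all 12 rating columns.

    # Hypothetical ratings; imagine twelve such columns rather than three
    ratings <- data.frame(
      id = 1:3,
      q1 = c(4, 2, 5),
      q2 = c(3, 2, 4),
      q3 = c(5, 2, 4)
    )

    item_cols <- c("q1", "q2", "q3")
    ratings$straight_lined <- apply(ratings[item_cols], 1,
                                    function(x) length(unique(x)) == 1)
    ratings[ratings$straight_lined, ]   # candidates for a follow-up email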
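
A minimal R sketch of the missing-data check from the key-missing-data bullet; the data frame and its total_comp column are assumptions about what the exported spreadsheet might contain.

    # Hypothetical submissions; NA marks a skipped compensation question
    subs <- data.frame(
      id         = 1:4,
      title      = c("Manager", "Director", "Manager", "Analyst"),
      total_comp = c(150000, NA, 210000, NA)
    )

    subs[is.na(subs$total_comp), ]   # the respondents to write back to: no comp, no report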
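
A minimal R sketch of the keyword matching from the non-standard-titles bullet. The free-text titles are invented, and anything the keywords miss still needs a human decision.

    # Hypothetical free-text entries from the "Other" text box
    other_titles <- c("Sr. Director of Legal Ops", "Contracts Manager",
                      "Head of eBilling")

    std <- rep(NA_character_, length(other_titles))
    std[grepl("Director", other_titles, ignore.case = TRUE)] <- "Director"
    std[grepl("Manager",  other_titles, ignore.case = TRUE)] <- "Manager"
    data.frame(other_titles, std)    # rows left NA call for a judgment or an email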
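
A minimal R sketch of the row reordering from the mis-ordered-entries bullet: it moves each group's lead to the first row of that group. The data frame and its is_lead flag are hypothetical; however you identify the lead, the same ordering trick applies.

    # Hypothetical matrix-table rows; is_lead marks the group lead, however identified
    grp <- data.frame(
      group   = c("A", "A", "B", "B"),
      name    = c("Associate 1", "Lead A", "Lead B", "Associate 2"),
      is_lead = c(FALSE, TRUE, TRUE, FALSE)
    )

    # order() sorts FALSE before TRUE, so negating is_lead puts each lead first in its group
    grp <- grp[order(grp$group, !grp$is_lead), ]
    grp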

None of these speed bumps will stop you, but they all slow you down. Eyeballing the data, deciding how to untangle issues, and correcting your cleaned version all take time and can spawn new errors if you are not mindful.