Data quality has long been an important part of the research process, but recently sampling techniques have drawn criticism for being vulnerable to automated bots completing surveys. Whilst there are many reasons for this, and a wide-ranging debate about how and why it has become such an issue, there are plenty of steps you can take to stay in control of your sample quality.
Below, we outline the steps we take during our quality assurance process, and we recommend all research users follow the same approach to spot the bots in your survey and remove them from your analysis. Here’s what to look out for:
‘Speeders’ are respondents who have completed your survey too quickly. From your survey design, you should have a good idea of how long you’d expect it to take someone to read each question and its answer options and provide a considered response.
It is true that some people will complete the survey more quickly than others, especially your ‘survey professionals’, but you can very quickly filter out the bots in your survey by setting a minimum completion time: usually you’d look to exclude anyone who has completed in less than 25% of the median completion time. The reasoning is that such a respondent cannot have read each question fully and thought about their answer.
Some survey providers can also give you data on the average time spent per question. This is helpful when you know certain questions require careful consideration and a respondent’s time on them is extremely low.
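As a rough sketch of how this speeder check might be applied to a data file, assuming a pandas DataFrame with one row per respondent and a hypothetical duration_seconds column:

```python
import pandas as pd

# Hypothetical data file: one row per respondent, completion time in seconds.
df = pd.DataFrame({
    "respondent_id": [1, 2, 3, 4, 5],
    "duration_seconds": [520, 610, 95, 480, 40],
})

# Exclude anyone who completed in less than 25% of the median completion time.
threshold = 0.25 * df["duration_seconds"].median()
df["flag_speeder"] = df["duration_seconds"] < threshold

print(df[df["flag_speeder"]])  # candidates for removal
```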
‘Flatliners’ are respondents who have selected the same answer to most, or in some cases every, question of a given type. I’ve seen this happen most often with scale questions, where the respondent picks the same point on the scale every time, quite often the mid-point. But it also occurs with other question types such as single and multiple choice, where the respondent has just clicked the first few options available in the question.
A key thing to note, however, is that someone who flags as a flatliner may actually be a genuine respondent with indifferent opinions. Take care with your survey design; it’s always good to include some opposing statements which should prompt the respondent to give a different answer. If they don’t, and still flag as a flatliner, you can be more confident about the need to remove them.
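A minimal sketch of a flatliner check, assuming a set of 1-to-5 scale questions whose column names (q1_scale and so on) are purely illustrative; the 90% cut-off is an assumption you can tune:

```python
import pandas as pd

# Assumed 1-5 scale questions; column names are illustrative only.
scale_cols = ["q1_scale", "q2_scale", "q3_scale", "q4_scale", "q5_scale"]

df = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "q1_scale": [3, 1, 4],
    "q2_scale": [3, 5, 2],
    "q3_scale": [3, 2, 5],
    "q4_scale": [3, 4, 1],
    "q5_scale": [3, 3, 3],
})

# Share of scale questions answered with the respondent's most common answer.
def flatline_share(row):
    return row[scale_cols].value_counts(normalize=True).max()

df["flatline_share"] = df.apply(flatline_share, axis=1)
# Flag respondents who gave the same answer to (nearly) every scale question.
df["flag_flatliner"] = df["flatline_share"] >= 0.9
```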
With this check, you’re looking for respondents providing extreme answers to your questions, with scales the main question type culprit. Similar to flatliners, bots tend to pick the same point on these questions; if that choice is the extreme end of the scale, your survey results will be inaccurate and biased. If a respondent gave an answer at the top or bottom of the scale for 80% or more of the questions, you should look to remove them from your results.
Again, do be careful with your survey design here; sensitive questions can prompt emotional responses, so structuring your question wording carefully is important to avoid leading genuine responders to select the extreme ends of the scale.
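A similar sketch for the extreme-response check, again with illustrative 1-to-5 scale columns, applying the 80% threshold mentioned above:

```python
import pandas as pd

scale_cols = ["q1_scale", "q2_scale", "q3_scale", "q4_scale", "q5_scale"]  # assumed names
SCALE_MIN, SCALE_MAX = 1, 5  # assumed 1-5 scale

df = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "q1_scale": [5, 2, 1],
    "q2_scale": [5, 3, 1],
    "q3_scale": [1, 4, 2],
    "q4_scale": [5, 3, 1],
    "q5_scale": [5, 2, 1],
})

# Share of scale answers sitting at either end of the scale.
extreme = df[scale_cols].isin([SCALE_MIN, SCALE_MAX])
df["extreme_share"] = extreme.mean(axis=1)
# Flag respondents answering at the extremes for 80% or more of the questions.
df["flag_extreme"] = df["extreme_share"] >= 0.8
```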
In addition to the above, there are further checks you can apply to your data to spot bots. These are a bit trickier to incorporate into survey tools as standard checks, but they can easily be applied to your data file and are worth doing to ensure the best data quality possible.
You tend to see bots trying to cheat the system by following a pattern of responses. For example, on grids or scales with more than one sub-question, the response may follow a pattern across the rows: selecting the first answer option for sub-question one, the second answer option for sub-question two, and so on. Scanning across your data file can easily flag these so you can consider how genuine the entry is.
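One way to sketch this pattern check is a simple ‘staircase’ test, assuming a grid of sub-questions that all share the same ordered option list (column names are hypothetical); it only catches one common pattern, so treat it as a starting point:

```python
import pandas as pd

# Assumed grid: five sub-questions answered from the same ordered option list.
grid_cols = ["grid_a", "grid_b", "grid_c", "grid_d", "grid_e"]

df = pd.DataFrame({
    "respondent_id": [1, 2],
    "grid_a": [1, 2],
    "grid_b": [2, 2],
    "grid_c": [3, 4],
    "grid_d": [4, 3],
    "grid_e": [5, 2],
})

# Does each answer step up (or down) by exactly one across the grid,
# e.g. option 1, 2, 3, 4, 5 in order?
diffs = df[grid_cols].diff(axis=1).iloc[:, 1:]
df["flag_pattern"] = (diffs == 1).all(axis=1) | (diffs == -1).all(axis=1)
```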
If you’ve flagged an entry as failing the check for common patterns, the likelihood is that its answers also won’t make much sense when read together. With this check, you may see a response indicating a liking for something at one stage of your survey and then a dislike of the same thing later on.
This type of check does require you to look at the entries line by line, which can certainly be time-consuming. But a considered survey design can make it quite simple by incorporating opposing statements. If a response fails this check, it is highly unlikely that a genuine responder simply changed their mind halfway through; more likely, you have just pinpointed a bot.
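A minimal sketch of a consistency check on one assumed pair of opposing statements, both rated on a 1-to-5 agreement scale; the column names and the agreement cut-off of 4 are illustrative:

```python
import pandas as pd

# Assumed opposing statements, e.g. "I enjoy cooking" vs "I dislike cooking".
df = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "likes_cooking": [5, 4, 2],
    "dislikes_cooking": [5, 2, 4],
})

# Agreeing strongly with both opposing statements is a sign of a non-genuine entry.
df["flag_inconsistent"] = (df["likes_cooking"] >= 4) & (df["dislikes_cooking"] >= 4)
```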
Checking responses to your open-ended questions is a key check on your data file. With poor-quality data, you may see a random sequence of letters typed out, odd symbols or punctuation, lots of ‘don’t knows’ or ‘n/a’s, or simply nonsensical answers; for example, you ask which brands someone is aware of and receive a random answer of ‘cheese’, which, believe me, has happened!
However, determining whether a flagged entry is a bot can prove tricky here. Some survey responders simply don’t like answering open-ends; these tend to be the people who have entered lots of ‘don’t know’ and ‘n/a’ answers. If these entries look fine on your other checks and the open-ends don’t form a significant part of your survey design, chances are you will want to keep them.
But if the open-ends just don’t make any sense and/or fail on multiple other checks, the likelihood is that they are the bots you want to remove from your data.
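A rough sketch of how you might flag suspect open-ends, using a crude heuristic of my own (non-answers, no letters at all, or very few vowels); it won’t catch contextually nonsensical answers like ‘cheese’, which still need a human eye:

```python
import re
import pandas as pd

df = pd.DataFrame({
    "respondent_id": [1, 2, 3, 4],
    "open_brand_awareness": ["Nike and Adidas", "asdkjhqwe", "don't know", "!!!???"],
})

def looks_suspect(text: str) -> bool:
    text = (text or "").strip().lower()
    if text in {"", "n/a", "na", "don't know", "dont know"}:
        return True                              # non-answers
    if not re.search(r"[a-z]", text):
        return True                              # symbols or punctuation only
    vowels = sum(ch in "aeiou" for ch in text)
    return vowels / len(text) < 0.25             # crude gibberish heuristic

df["flag_open_end"] = df["open_brand_awareness"].apply(looks_suspect)
```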
A general rule of thumb is that, with a good survey design, no more than 10% of your data should be flagged as poor quality. If you are worried about how many responses you’re removing from your survey, take a close look at the entries you’re removing and consider the source from which you obtained the sample.
If entries are questionable on the above checks but only failed one of them, you may consider keeping them if the data looks good in other areas. Generally speaking, entries that fail more than one of the above checks indicate the presence of a bot, but once that presence is highlighted, you can root them out and restore the quality of your data set in no time.
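Pulling it together, here is a sketch of how the individual flags might be combined into a removal decision, assuming each earlier check has written a boolean flag_* column to the data file:

```python
import pandas as pd

# Assumes the earlier checks have each added a boolean flag_* column.
flag_cols = ["flag_speeder", "flag_flatliner", "flag_extreme",
             "flag_pattern", "flag_inconsistent", "flag_open_end"]

df = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "flag_speeder": [True, False, False],
    "flag_flatliner": [True, False, False],
    "flag_extreme": [False, False, True],
    "flag_pattern": [False, False, False],
    "flag_inconsistent": [True, False, False],
    "flag_open_end": [False, False, False],
})

df["checks_failed"] = df[flag_cols].sum(axis=1)
# Remove entries failing more than one check; review single-flag entries by hand.
to_remove = df[df["checks_failed"] > 1]
to_review = df[df["checks_failed"] == 1]

# Sanity check against the ~10% rule of thumb for a well-designed survey.
print(f"Removing {len(to_remove) / len(df):.0%} of entries")
```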