How to write good open text survey questions for a computer to interpret

Good measurement is always done with at least some regard to how you intend to analyse the resulting data. Text data generated by open questions in surveys is no different.

If you intend to do some form of computerised analysis of the data, it’s important to write questions that encourage people to answer in a way that computers will find easy to understand.

In this article I explain how computers code text, some of the typical issues that occur and how to write questions to improve the accuracy of your coding.

How computers code text answers

Computer algorithms that aim to classify survey comments tend to fall into three main categories:

Unsupervised approaches

These are clustering-based algorithms that look for patterns within the text. There are several key approaches here and some work better than others.

Unsupervised algorithms require a distance metric: some way of determining how ‘close’ two text items are. One challenge with survey texts is that they tend to be short (we see median answer lengths of about 25 words), so an algorithm looking for co-occurrence between words may struggle. Text answers will often provide too sparse a data set.

The second issue is that these algorithms will typically find clusters that are mathematically similar but practically useless. This is not to say that unsupervised algorithms are not useful - we use them as part of our process, especially when building an inductive coding model - but in our experience all unsupervised approaches require a human to provide interpretation and remove non-informative clusters.
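
As a rough illustration (not our production process), the sketch below clusters a handful of invented short answers using scikit-learn; a human still has to decide whether the resulting clusters mean anything.

```python
# Sketch: clustering a handful of short, invented answers with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

answers = [
    "more pay would be nice",
    "salary is below market rate",
    "my manager never gives feedback",
    "better communication from leadership",
    "good",  # very short answers make the matrix extremely sparse
]

# TF-IDF turns each answer into a sparse vector of word weights;
# distances between these vectors act as the 'distance metric'.
vectors = TfidfVectorizer(stop_words="english").fit_transform(answers)

# Two clusters chosen purely for illustration; the right number isn't known upfront.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for label, answer in zip(labels, answers):
    print(label, answer)
# A human still needs to inspect each cluster and discard the non-informative ones.
```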

Rule-based approaches

Most of the established systems on the market are currently rule-based. This is largely because most of these systems were developed before machine-learning approaches were able to compete on quality.

At their simplest, rule-based approaches may be little more than keyword searches, e.g. {wage, pay, salary} all being classified as ‘pay’. Most will, however, look at combinations of words. For example, ‘pay’ on its own will be classified as ‘pay’ but ‘pay more attention’ will not.
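
A toy sketch of this kind of rule (the keyword set and the ‘pay more attention’ exception are illustrative only, not rules from any real system) might look like this:

```python
# Toy rule-based classifier: keyword sets plus an exception phrase.
# These rules are illustrative; real systems contain thousands of them.
KEYWORDS = {"pay": {"wage", "pay", "salary"}}
EXCEPTIONS = {"pay": ["pay more attention"]}  # phrases that should NOT trigger the code

def classify(answer):
    text = answer.lower()
    words = text.split()
    codes = set()
    for code, keywords in KEYWORDS.items():
        if any(word in keywords for word in words):
            if not any(phrase in text for phrase in EXCEPTIONS.get(code, [])):
                codes.add(code)
    return codes

print(classify("The pay is too low"))         # {'pay'}
print(classify("Please pay more attention"))  # set()
```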

Over time, companies developing these types of approaches will have created thousands, and in some instances millions, of classification rules. A negative consequence we commonly see is that firms which have invested heavily in building large rule sets are often reluctant to abandon them as newer machine-learning approaches appear and, in many instances, outperform the rule-based ones. Most rule-based systems on the market are likely to have hit a ceiling in terms of accuracy.

Machine-learning approaches

Machine-learning approaches use algorithms to identify patterns in example answers that have already been coded and then apply these patterns to new, previously unseen answers. In some ways these patterns are similar to the rules in rule-based systems, except that they are developed by an algorithm instead of by a skilled human.

There are advantages and disadvantages to this approach. On the positive side, machine-learning algorithms can learn rules much more quickly and at lower cost than humans can create them. On the negative side, they have no way of identifying whether a pattern is there because it indicates knowledge or whether it is there purely by chance. Unusual combinations of words, or even punctuation, can cause a model to wrongly classify data.

At some stage, when the amount of example data has reached a certain level, machine-learning approaches will outperform rule-based approaches. Machine-learned rules are likely to be more complex than those written by a human, so with larger numbers of examples they will start to capture more correctly classified answers.

Depending on the machine-learning approach used, they can also handle things such as spelling mistakes more easily than rule-based approaches. Some of our approaches do not look at the words present, but instead at sequences of characters, which makes them less likely to be confused by misspellings. It would be possible to build a rule-based system this way, but it would dramatically increase the difficulty and cost of doing so.
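
As a rough sketch of what a character-sequence approach can look like (using scikit-learn and a handful of invented training examples - this is an illustration, not our production model):

```python
# Sketch of a character n-gram classifier; invented training examples and labels.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["my salary is too low", "wages have not kept up",
               "my manager never listens", "more support from my manager please"]
train_labels = ["pay", "pay", "management", "management"]

model = make_pipeline(
    # 'char_wb' builds features from 3-5 character sequences within word boundaries,
    # so a misspelling like 'salery' still shares most of its features with 'salary'.
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

print(model.predict(["my salery is poor"]))  # likely 'pay', despite the misspelling
```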

Challenges for computerised survey text classification

If we are writing a question where we want a computer system to perform some or all of the analysis, it is therefore important to ask the question in a manner that maximises the chance that the algorithm will be successful.

Answers need to be understood in the context of the question that was asked

When asked a question, especially in surveys, people write answers assuming that the reader will know what question was asked. Furthermore, they often don’t write in full sentences. As a result, the short-form texts of many open answers are quite ambiguous when read on their own.

The easiest way of countering this is to understand what question was asked. This is almost certainly how a human would address the issue, and doing so dramatically increases understanding of what the writer meant.

Computers face the same problem. Two potential solutions come to mind. The first is to build an approach that takes the input question and the answer and links the two together. If the employee is asked about innovation, it can then assume that a demonstrative pronoun such as ‘this’ or ‘those’ refers to innovation.

The second approach is to build a model for each question where the training data has labels that are specific to the question that is being asked. This is the approach that we take.
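
In practice this can be as simple as keeping a separate classifier per question, each trained only on answers to that question. A minimal sketch, with invented question IDs, answers and labels:

```python
# One model per question: each classifier is trained only on answers to that
# question, so the question context is effectively built into the model.
# Question IDs, answers and labels below are invented.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def build_classifier():
    return make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
                         LogisticRegression(max_iter=1000))

training_data = {
    "q_improve": (["pay us more", "reduce the bureaucracy"], ["pay", "processes"]),
    "q_innovation": (["it is not encouraged here", "no time is given for it"],
                     ["culture", "workload"]),
}

# question_id -> classifier trained only on that question's labelled answers
models = {qid: build_classifier().fit(texts, labels)
          for qid, (texts, labels) in training_data.items()}

# At scoring time an answer is routed to the model for the question it came from.
print(models["q_innovation"].predict(["this is never taken seriously"]))
```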

What should be clear is that a general model without reference to the question being asked will always be suboptimal.

Survey user interfaces and previous questions can determine how people provide data

Whilst your analysis probably treats each question as independent of the others, survey takers will undoubtedly treat the survey as one set of questions.

Therefore it is worth considering how you design the user interface - and in practical terms this probably refers most to question order, labels and sections.

We’ve seen instances where a survey interface change has resulted in markedly different ways that people answer questions. It is for this reason that it is often suggested that questions, or question groups, are randomised in surveys.

If you don’t separate open questions sufficiently in the design, people will refer to the previous questions, often explicitly. If the survey taker thinks (or is told) that there is a theme to the survey, then their answers will relate to that theme, even if you hadn’t intended one.

Whilst this will also be an issue if the data is analysed manually, it might be worth accounting for in any model. It’s most likely to be an issue if you want to compare the results with data captured at a different time, or in a survey where the set of questions is different. Differences in this instance might be as much a result of the interface as of changing priorities.

Sentiment analysis isn’t good enough to reliably be used to classify text answers

I’ve written about the issues with sentiment analysis previously. At present the accuracy of sentiment analysis isn’t good enough to be used to filter comments into positive and negative (it might be good enough for trends).

An easy way of testing this for yourself is to take the answers to a negative question like “What didn’t you like about working for COMPANY?”. Running any sentiment analysis tool over these answers is likely to show mainly positively coded answers.
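
If you want to try this yourself, a quick sketch along these lines will do (here using the open-source VADER tool as one example; the answers are invented and results will vary by tool):

```python
# Quick check of off-the-shelf sentiment scoring on answers to a negative question
# ("What didn't you like about working for COMPANY?"). Requires the vaderSentiment
# package; the answers below are invented.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

answers_to_negative_question = [
    "Would like more flexible working",
    "Better career development please",
    "Support from my manager",
]

analyzer = SentimentIntensityAnalyzer()
for answer in answers_to_negative_question:
    scores = analyzer.polarity_scores(answer)
    # Although every answer is a complaint, the compound score often comes out
    # neutral or positive because the wording itself contains positive words.
    print(f"{scores['compound']:+.2f}  {answer}")
```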

Out-of-dictionary examples can be problematic

Most text analytics algorithms will have problems with so-called ‘out-of-dictionary’ examples. These are rare words, or rare spellings of words, that haven’t been included in the examples used to train the model.

One class of these words is misspellings. Fortunately many people make the same mistakes when misspelling words, so it’s relatively easy to counter these in a training dataset (though human-programmed rule-based systems will often struggle unless the coder realises the most common misspellings and codes them explicitly). Humans will often ignore spelling mistakes in a way that computers can’t.

Another example is when answers contain slang, emojis or abbreviations. Unless the training data includes these forms, the model won’t be able to interpret them.

There are several ways of dealing with these issues. One is to use a continuous bag-of-letters approach, where patterns are identified not by words but by sequences of characters, often bridging several words. Another is to use a very large training set, or a model that uses a vast language source to learn grammatical structure, on top of which the classification model is built. Both of these approaches imply the use of machine learning.

Finally, it is possible to integrate a spell-correction algorithm into the text processing pipeline, though it should be understood that this is itself a probabilistic model and will sometimes introduce further errors. This approach is also unlikely to deal effectively with words that are misspelt but, in being so, match the correct spelling of a different word. In our experience a significant number of people complain about the ‘moral’ in their team when they likely mean ‘morale’.
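
As an illustration of that limitation, here is a minimal sketch using the pyspellchecker package (one of several such libraries; other correctors will differ in detail):

```python
# Why spell correction helps with some errors but not others.
# Uses the pyspellchecker package as one example of a probabilistic corrector.
from spellchecker import SpellChecker

spell = SpellChecker()

# A genuinely out-of-dictionary word can usually be corrected...
print(spell.correction("managment"))        # likely 'management'

# ...but 'moral' is a valid English word, so it is never flagged as a mistake,
# even when the writer almost certainly meant 'morale'.
print(spell.unknown(["moral", "morale"]))   # likely an empty set
```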

Where was the model trained?

For any model to be accurate it needs to be trained on a dataset that is similar to the data it needs to classify. Unfortunately, in many instances this isn’t the case - training data is typically based on a common dataset or on the text available to the team who originally built the model.

A common example of such a problem is when a training dataset has been developed using British or American English and is then applied to answers written in Indian English. A British person would likely be able to interpret this text using common sense. An algorithm doesn’t have such common sense.

The way of countering this problem is quite simple: ensure that the model was trained incorporating your local language, or that the provider has a method to fine-tune the model on the local language.

Synonyms

According to Wikipedia, “A synonym is a word or phrase that means exactly or nearly the same as another word or phrase in the same language”. They cause a significant issue for text classification algorithms.

For a human coder, the context of the question and the organization goes a long way to determining which meaning of the word the writer was referring to.

Therefore one successful strategy for dealing with synonyms is to train the model on question- and organization-specific text. Alternatively, some more recent models trained on very large text sources can help by using the context of the synonym within the text to provide disambiguation. Whether this works as well as domain-specific models, especially on survey data where the text is short and has less context, is open to debate.

It is worth adding that homophones - words that have the same pronunciation but different spellings - can often be treated in practical circumstances as synonyms given the relatively high proportion of people who will misspell the words.

Large organizations develop their own languages

Large organizations typically will develop their own language over time. Projects and systems will be given names, a large number of acronyms will be created and even, in some organizations, misspellings or misappropriations of words will become dominant. 

Industries, too, develop their own languages, with terminology and jargon becoming prevalent. All of these instances need to be incorporated into the model for it to be accurate.

In the consumer space, generic but industry-specific models are relatively common from the big text analysis suppliers. I have yet to see industry-trained models for employees, and it’s likely that industry-specific consumer models will show similar issues when applied to employee text as general consumer models do.

Advice on how to construct good questions

Most of the issues above are versions of the same problem: computers (and, to a lesser degree, humans) find it difficult to deal with ambiguous statements. Therefore the common thread in the recommendations below is that they work by encouraging the writer to provide as little ambiguity as possible.

Keep the questions as open as possible

There is obviously a need, when asking a question, to provide some context so that people write about what you want them to write about. However, you need to make sure that you’re not feeding ideas to the person typing the answer.

Writing a good survey question is in many ways like asking a good interview question - you shouldn’t lead the person whose answer you want.

…But don’t be too open

We probably see the opposite of this more often, and it’s at least as damaging. These are questions of the type “Is there anything else that we haven’t asked that you think is important?”. The text responses we’ve received from this question tend to be more ambiguous than for other question phrasings, and the question also has a tendency to encourage sarcastic and flippant responses (which computers really struggle with). We also frequently see that this type of question encourages the ‘moaners’ and that engaged employees either don’t answer or provide a variant of ‘all is good’.

Be cautious about pre-classification questions

Several of our clients have survey designs where they ask employees to pre-classify their text answers. The reason for doing this is mostly historical: in the days before reliable computerised text classification it was a simple way of performing analysis - the text itself wasn’t analysed, but the classification the respondent selected could be.

Unfortunately these questions encourage ambiguous answers, with a significant issue relating to demonstrative pronouns. Given that the person has been asked in the user interface to state specifically what they want to write about, they then assume that the analyst will understand this choice when interpreting the answer. The best strategy is, wherever possible, to build a model for each question.

The other issue we’ve seen with these questions is that they encourage the responder to provide a non-informative answer, far more so than if a totally open question had been asked. Thinking that they need to provide some text, people will often select a choice and then write something along the lines of ‘this is an issue’. Whilst in a traditional survey these people might not have responded, in these designs they’ll often provide an answer, but an uninformative one. These answers therefore place an extra burden on the text coding.

Ask for a positive and negative response

As noted above sentiment analysis hasn’t reached a level of quality where it can be reliably used to filter positive and negative comments. Therefore the challenge for the survey analyst is how to create this filtering.

The easiest and most reliable way of separating the comments by sentiment is to ask two questions: one asking for the positive comments and another asking for the negative. Then, for the most part, your model can assume that if the employee mentions ‘pay’ in the ‘what do you like…’ question, the employee likes their pay.
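
In model terms this means attaching sentiment to the question rather than inferring it from the text. A minimal sketch, where the question IDs and the theme classifier are placeholders:

```python
# Sentiment is taken from which question the answer was given to, not from the
# text itself. 'classify_theme' is a placeholder for any theme classifier.
QUESTION_SENTIMENT = {
    "q_like": "positive",     # "What do you like about working here?"
    "q_improve": "negative",  # "What would you like us to improve?"
}

def code_answer(question_id, answer, classify_theme):
    return {
        "themes": classify_theme(answer),
        "sentiment": QUESTION_SENTIMENT[question_id],
    }

# e.g. 'pay' mentioned in the 'like' question counts as a positive mention of pay
print(code_answer("q_like", "the pay and the people", lambda a: ["pay", "colleagues"]))
```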

Another advantage of this approach is that it is likely to produce better quality data. Many people, though they have ideas about what they would like to improve, can also identify things that are positive. In the same way that you’d encourage managers to provide both a positive and an improvement part in an employee review, you’ll get richer data if you ask for both explicitly.

Finally, our experience of data where the firm has only asked one side (usually the ‘what would you like to improve’ part) is that a sizeable number of employees will provide the positive aspects as well as the areas to improve, increasing the ambiguity of the answers and reducing the effectiveness of the model. If you don’t give employees the opportunity to write what they want, they’ll find a different place to tell you.

Don’t ask for ‘one thing’

Lots of firms ask a ‘what is one thing’ type question. Employees rarely follow this instruction, and if they do it might encourage them not to communicate everything that is relevant to them.

Given this, these types of question don’t typically increase ambiguity, but they can reduce the richness of the content. You’ll also see a lot of ‘how can I mention just one thing?’ statements.

Don’t ask for 3 (or another arbitrary number of) words

This question format was again popularised to ease analysis when analysis wasn’t as sophisticated as it is today.

We’ve seen datasets where fewer than 20% of people answered the question with 3 words, and where over 20% of those answers were hard even for a human to understand, such was their level of ambiguity.

Depending on your culture, these questions will increase the number of flippant remarks. With one client, when asking employees about a development programme, we saw a number of answers of the ‘complete’, ‘waste’, ‘time’ variety.

Don’t ask too many questions

We’ve found the optimal number of open questions on a survey is below 5, and in most instances below 3. Any more than that and the number of words that people write will decrease, along with the richness of the data.

The worst type of design we’ve seen is one where there is an optional ‘comments’ box on each quantitative question. With this design we see a large number of ambiguous, flippant or uninformative answers. People providing these poor answers next to a quantitative question are less likely to provide a detailed answer in any open question elsewhere.

Slightly less bad is a comment box on each survey section. One way we like to think about this is that an employee has decided to allocate 15 minutes to answering a survey. If the questions are too numerous or take too long, the answers suffer, either losing richness or increasing in ambiguity.

Make answering optional

“everything is good but I need to click something”

Making text questions mandatory doesn’t provide you with better data. It might increase the amount of data you have, but it will reduce the data quality. It is almost always harder to clean poor data than to analyse sparser data.

What is harder to measure in many instances is how poor survey design increases the number of people who abandon the survey, but anecdotal evidence suggests that if people are not given the opportunity to respond how they like, some will close the survey.

Show the data is valuable

“I know nobody is going to read this but….”

“Thanks for giving the opportunity to have free-text (which was given once but never realized that this was read by someone and thought through for potentially implementing).”

Many employees expect that free text survey data won’t be read or analysed, and therefore see little point in providing answers.

There are two ways of addressing this that we’ve seen work successfully. The first is an explicit commitment about how you’ll deal with the text and how you’ll ensure that every comment is valued. This can be made in the survey itself or in surrounding communications.

The second is to report some of the findings from the text in a public manner. Text can make a great source for visualisation and, with large enough volumes, can be both interesting and anonymous. It can also demonstrate an unexpected competence from an HR department, where the expectation of good, advanced analysis is rare.

Make text data the main event

Whilst we believe that a good employee survey will have a mixture of qualitative and quantitative questions, the best surveys are about listening, and listening is best done with open questions and an open mind.

With analysis important not only to deal with the volume of data - we often describe a survey’s responses in terms of the number of bibles’ worth of text provided - but also to ensure that bias is removed, it makes sense to design your surveys so that you collect data a computer will be able to analyse.

In most instances the recommendations here will also make the data easier for a human to interpret. Either way, asking questions in a way that maximises the quality of the analysis is one of the easiest ways of increasing the value of your survey.