How to gain insight from employee text feedback


One of the biggest lessons we’ve learnt during the last 5 years of analysing and reporting insight from employee-sourced text feedback is that the real value starts after you’ve classified or coded your text.

The first part of analysing feedback data is to categorise the comments against a set of themes. When you’ve done this you effectively have structured count data. At this stage you might also want to add other metadata such as sentiment or emotion scores, though as I’ve explained before there are usually better ways of doing this.

As soon as you’ve converted the unstructured responses into structured data there are a variety of ways of making sense of what is going on.
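As a minimal sketch of this step (all column and theme names here are hypothetical), coded comments can be converted into a binary theme-by-respondent matrix in Python:

```python
# A minimal sketch: each response has already been coded with zero or
# more themes (all names here are hypothetical).
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

coded = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "themes": [["pay", "communication"], ["communication"], []],
})

# One binary column per theme: the structured count data described above.
mlb = MultiLabelBinarizer()
theme_matrix = pd.DataFrame(
    mlb.fit_transform(coded["themes"]),
    columns=mlb.classes_,
    index=coded["respondent_id"],
)
print(theme_matrix.sum())  # mentions per theme
```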

What did they say?

The simplest approach, and the one at which many analysts stop, is simply to plot the frequency with which each theme is mentioned.

There are two types of chart that are typically used here. The first is a simple bar chart. The second is a theme cloud - very similar to the familiar word cloud but with themes instead of the raw words.
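As a minimal sketch, reusing the theme_matrix built above, the bar chart version might be:

```python
# Plot how often each theme is mentioned, reusing theme_matrix from the
# earlier sketch.
import matplotlib.pyplot as plt

theme_counts = theme_matrix.sum().sort_values()
theme_counts.plot.barh()  # horizontal bars keep theme labels readable
plt.xlabel("Number of mentions")
plt.tight_layout()
plt.show()
```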

Both have weaknesses. The key one is that each theme has a different breadth. For example, different components of a benefits package might be identified separately, whereas different types of training might all be grouped together. The more tightly-defined themes will therefore appear smaller, and the viewer may down-play them as a result. Text analysis is inherently at least partially subjective. Some analyses try to counter this issue with a hierarchical coding scheme, but hierarchies also require human choices to create, and there is no guarantee that all viewers will agree with those choices.

Theme clouds share an issue with word clouds: the length of a word can make longer words seem more important than they should be. This issue can be reduced by using bubbles instead of words.

The easiest way of adding value to a simple bar chart is to add a comparison, for example comparing the themes mentioned in the finance department with those mentioned in the rest of the organisation. This can be a very good use of a slopegraph - I think slopegraphs are better, but they do seem to need an audience more familiar with data - an example of which is shown below.
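Slopegraphs aren’t a built-in chart type in most plotting libraries, but they are easy to hand-roll. A sketch, again reusing theme_matrix and assuming a hypothetical department label per respondent:

```python
# Hand-rolled slopegraph: one line per theme between two group percentages.
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical department labels, aligned with theme_matrix's index.
departments = pd.Series(["Finance", "Sales", "Finance"],
                        index=theme_matrix.index)
finance = theme_matrix[departments == "Finance"].mean() * 100
rest = theme_matrix[departments != "Finance"].mean() * 100

fig, ax = plt.subplots()
for theme in theme_matrix.columns:
    ax.plot([0, 1], [finance[theme], rest[theme]], marker="o")
    ax.text(-0.05, finance[theme], theme, ha="right", va="center")
ax.set_xticks([0, 1])
ax.set_xticklabels(["Finance", "Rest of organisation"])
ax.set_ylabel("% of respondents mentioning theme")
plt.show()
```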

As an aside, I think the best use of word clouds is as an exploratory data analysis tool, where their purpose is to highlight patterns to the analyst rather than to report the data accurately.

Thinking about the data

So, with a few simple charts it’s relatively easy to identify the most common themes, but this doesn’t equate to great insight by itself.

The addition of a comparison will make a big difference; however, it’s worth thinking about how the data is captured. Text feedback differs from the scale questions of a survey in that it doesn’t measure the strength of an individual’s feelings across a full set of themes: the feedback provider simply offers what is top-of-mind. For larger themes this isn’t much of a problem, but it can make differences in smaller themes stand out more than they should.

We can think about this problem via an analogy with supermarket shopping baskets. If my local supermarket were to analyse my weekly shopping they’d find that I buy milk every week. Sometimes I might buy 2 litres and other times 3 litres, but it’s likely to be there every week. It’s safe to assume that I, or my family, like milk.

If, however, we think about how frequently I purchase vanilla pods it would be easy to jump to the wrong conclusion. I love vanilla, but use it relatively infrequently, so it’s not in my shopping basket every week.

Over a year, where there is much more data, it’s probably reasonable to compare my consumption of vanilla - e.g. did I buy more or less vanilla in 2019 than in 2018 - but at the week level it’s not, whereas it would be for milk.

The same is true for text. Maybe a decent number of people in your survey mention communication as needing improvement, whereas fewer mention travel budgets. Comparing communication across groups might be a sensible approach, but comparing the travel budget issue might simply reveal noise arising from the small amount of data. This is why we tend to adjust scores for the strength of the evidence in our reporting.
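We won’t describe our exact adjustment here, but as one illustrative possibility (an assumption, not our production method), the lower bound of a Wilson score interval penalises mention rates that rest on little evidence:

```python
# Evidence-strength adjustment via the Wilson score interval's lower bound.
import math

def wilson_lower_bound(mentions: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson interval for a mention proportion."""
    if n == 0:
        return 0.0
    p = mentions / n
    denominator = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - spread) / denominator

# 3 of 10 mentions is weaker evidence than 30 of 100, despite equal rates.
print(wilson_lower_bound(3, 10))    # ~0.11
print(wilson_lower_bound(30, 100))  # ~0.22
```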

Another issue which we frequently see in text arises from the influence of other questions in the survey bringing topics to ‘top-of-mind’. If your survey includes a section on corporate social responsibility then you can assume that more people will discuss this topic than would have if you hadn’t just asked them about it. Survey and user-interface design make a big difference to text responses.

How are the themes linked?

Individuals frequently mention more than one topic or theme in their answers, and therefore it’s possible to look at the relationship between different themes.

There are two ways that we’ve found useful:

  • Using a heatmap to show the co-occurrence of the themes

  • Creating a graph or network of how the themes are linked.

In most instances we use the latter approach. Instead of plotting how often the themes are mentioned together, we calculate a probability that the relationship is stronger than expected and use this probability as the edge weight.

Weighted co-occurrence graph. Edges show the likelihood that the relationship between the themes is unusually strong

With either approach it’s possible to calculate a ‘distance’ between the themes and use this to cluster them together. We’ve found this really valuable as a way of making sense of what groups of employees are saying.
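A sketch of both steps - the edge weights and the clustering - using a one-sided Fisher’s exact test as one plausible way of asking whether a pair of themes co-occurs more often than chance (the specific test and clustering algorithm are assumptions, not necessarily our production choices):

```python
# Build a weighted co-occurrence graph from theme_matrix (see the first
# sketch), then cluster the themes.
from itertools import combinations

import networkx as nx
from networkx.algorithms import community
from scipy.stats import fisher_exact

G = nx.Graph()
n_respondents = len(theme_matrix)
for a, b in combinations(theme_matrix.columns, 2):
    both = int((theme_matrix[a] & theme_matrix[b]).sum())
    only_a = int(theme_matrix[a].sum()) - both
    only_b = int(theme_matrix[b].sum()) - both
    neither = n_respondents - both - only_a - only_b
    # One-sided test: are co-mentions more common than chance?
    _, p = fisher_exact([[both, only_a], [only_b, neither]],
                        alternative="greater")
    G.add_edge(a, b, weight=1 - p)

# Group themes into clusters using the edge weights.
clusters = community.greedy_modularity_communities(G, weight="weight")
print([sorted(c) for c in clusters])
```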

Who said what?

If you want to understand where to direct action it’s important to understand who said what.

There are a few ways that we’ve found useful. All use machine learning techniques on the results to identify relationships between the topics mentioned and other variables:

  • heatmaps can show numerous groups together with the topics each group is unusually likely to mention

  • models can highlight which groups are most likely to mention each theme (e.g. a woman is twice as likely as a man to mention theme X). An example of this is shown above, and a sketch follows below.

  • we often do this using the clusters or segments found in the co-occurrence analysis described in the section above

  • it’s worth doing this using demographic data, event-based data if you have it and the answers to other questions

  • don’t go too small with your groups. Text data isn’t the same as quantitative data: most notably, the absence of a topic doesn’t mean an individual doesn’t believe it matters, just that it wasn’t top-of-mind when they answered. The data is in this sense noisy. Increasing group size and / or using a statistical model increases the amount of insight you can draw from the data.

Heatmap showing likelihood of themes mentioned by group, blue being very likely, red being very unlikely.
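As a minimal sketch of the modelling bullet above (the data and column names are made up), a per-theme logistic regression produces exactly the kind of odds ratios quoted:

```python
# For one theme: model the probability of mentioning it from demographics.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "mentioned": rng.integers(0, 2, 500),       # 1 = mentioned the theme
    "gender": rng.choice(["f", "m"], 500),
    "department": rng.choice(["Finance", "Sales", "Ops"], 500),
})

model = smf.logit("mentioned ~ C(gender) + C(department)", data=df).fit()
# Exponentiated coefficients read as odds ratios: an OR of ~2 for gender
# would mean one gender is twice as likely to mention the theme.
print(np.exp(model.params))
```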

How is this changing?

Another key way you should be analysing your data is by looking at change over time.

There are broadly two different ways of looking at change over time:

  • At the population, or sub-population, level. This is the easiest to do but has some disadvantages - group-size issues need consideration, and it doesn’t clarify whether it’s the same people or different people within the groups who are participating

  • At the individual level, using so-called ‘panel data’. This addresses the second of the issues above by comparing results at the individual level - e.g. what did people who mentioned X last time mention this time? There are, of course, issues with this approach, most importantly that only a proportion of your respondents will answer both of two sequential surveys. You might also want to add ‘didn’t answer’ as an ‘option’ to see whether there is any pattern between mentions of a topic last time and non-participation in the next survey - a sketch of this follows below.

Both are important ways of looking at the data, and in an ideal world you’d do both depending on what you were trying to achieve.
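A sketch of the panel-data approach with toy data: pivot to one row per person, treat non-participation as its own category, and cross-tabulate the two waves:

```python
# Panel comparison: what did people who mentioned a theme last time do
# this time, including dropping out? (All data here is illustrative.)
import pandas as pd

df_long = pd.DataFrame({
    "person": [1, 1, 2, 3, 3],
    "wave": ["t1", "t2", "t1", "t1", "t2"],
    "mentioned": [1, 0, 1, 0, 1],
})

panel = df_long.pivot(index="person", columns="wave", values="mentioned")
panel = panel.fillna("didn't answer")  # non-participation as an 'option'
print(pd.crosstab(panel["t1"], panel["t2"]))
```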

One of the visualisation approaches that we use, especially for the former, is to use a slopegraph to show differences between the time periods. An example of this is provided below.

Slopegraph showing changes between two time periods


Identifying relationships between text statements and another variable

Most surveys used in organisations will have one or more target variables - for example an employee engagement score, satisfaction score or eNPS rating. A common form of analysis once the text has been coded is therefore to try to understand the level or movement of these scores through their relationship with the comments.

In the plot below we show each theme mentioned in a two-question survey (what is good / what could be improved). The bubble size represents the number of people mentioning the theme. The x axis indicates the balance between positive and negative use, and the y axis shows the average ‘score’ on a rating question given by the people mentioning the theme.

Bubble plot showing themes by frequency and sentiment
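A minimal matplotlib sketch of such a bubble plot, with made-up per-theme summary values:

```python
# Bubble plot: positive/negative balance (x), average rating (y),
# number of mentions (bubble size). Values are illustrative only.
import matplotlib.pyplot as plt
import pandas as pd

themes = pd.DataFrame({
    "theme": ["pay", "teamwork", "communication"],
    "mentions": [120, 80, 200],
    "balance": [-0.6, 0.7, -0.1],  # -1 = only negative, +1 = only positive
    "avg_rating": [5.2, 7.8, 6.4],
})

plt.scatter(themes["balance"], themes["avg_rating"],
            s=themes["mentions"], alpha=0.5)
for _, row in themes.iterrows():
    plt.annotate(row["theme"], (row["balance"], row["avg_rating"]))
plt.xlabel("Balance of positive vs negative use")
plt.ylabel("Average rating of people mentioning the theme")
plt.show()
```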

These types of visualisation help managers understand which topics are linked to positive or negative feelings. For example we can see that the ‘hard’ parts of the employee offer - pay, various parts of the benefits package - are linked to negative ‘sentiment’. We can understand this further by reading a few example responses:

“What is good about working for this company?”

“I get paid on time.”

“It’s not far from home.”

As well as visualisations it’s possible to build statistical models that highlight each theme’s importance in the overall rating. We’ve seen, and used, the importance scores from these models as variables to be visualised. For the right audience this can be useful, but it’s worth thinking about who you’re presenting the data to and whether more complicated models will improve their understanding or just be one more thing to confuse them.
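As one simple illustration of such a model (not necessarily the one we use), an ordinary least squares regression of the rating on the binary theme flags gives a per-theme association with the score:

```python
# Regress the overall rating on theme flags; coefficients indicate each
# theme's association with the score. Synthetic data for illustration.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.integers(0, 2, (300, 3)),
                 columns=["pay", "teamwork", "communication"])
ratings = 6 + X["teamwork"] - X["pay"] + rng.normal(0, 1, 300)

model = sm.OLS(ratings, sm.add_constant(X)).fit()
print(model.params.sort_values())  # e.g. pay ~ -1, teamwork ~ +1
```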

Moving towards causality

Taking this further, we can use the panel-data approach mentioned in the previous section to identify topics associated with a rise or drop in the target variable - for example, which topics are mentioned by people whose engagement score or eNPS drops.

We’ve found two interesting ways of looking at this:

  • Looking at topics mentioned by people who subsequently show a drop / jump in the target variable (mentions as a possible leading indicator - a sketch follows below)

  • Looking at the topics mentioned by people who have already shown a change in the target variable (mentions as a possible response to that change).

Our findings with these types of analysis make intuitive sense and, together with other analyses and research papers, suggest that these topics could be seen as causal. Reading the relevant comments further supports that respondents are, in many instances, themselves linking these topics with the change in the target variable.
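A sketch of the first comparison using synthetic data: compare mention rates between people whose target score subsequently dropped and everyone else:

```python
# Which themes are over-represented among people whose score later fell?
# All data here is synthetic and for illustration only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 400
themes = pd.DataFrame(rng.integers(0, 2, (n, 2)),
                      columns=["workload", "recognition"])
# Hypothetical change in eNPS between two waves; workload mentions hurt.
delta_enps = rng.normal(0, 2, n) - themes["workload"]
dropped = delta_enps < 0

for theme in themes.columns:
    diff = (themes.loc[dropped, theme].mean()
            - themes.loc[~dropped, theme].mean())
    print(f"{theme}: mention-rate difference = {diff:+.3f}")
```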

Opening the black box

One of the ‘issues’ with modern text-analysis approaches, like the ones that we employ, is that they are ‘black boxes’ - i.e. to some degree, how the algorithm codes each sentence can be opaque.

Although we provide examples of each theme to the algorithm as training data, how it interprets these is not explicitly defined by the analyst. Machine learning algorithms look for patterns in the data; they don’t follow rules.

What is more, these training examples are always inherently ‘noisy’. Numerous studies have shown that even the best qualitative analysts don’t agree 100% of the time; we’ve seen studies which show that inter-rater agreement is typically about 80%.

For these reasons, one of the things we like to do is create analyses that highlight what has been coded within each theme. These might show sample phrases, linkages and co-occurrence between themes, and which groups or variables are most likely to be linked with the theme. We also use an algorithmic approach to highlight a few sentences that illustrate the range of statements coded with the theme.

Exploring words in sentences coded ‘team and teamwork’. The position of each word shows the average rating given by the people using it and its frequency of use.

Doing so provides both a quick summary and more detail describing the theme and its content. Though the final stage of review will always be to read the statements, we want to summarise as much as possible so that individual statements are just examples of what the viewer already understands.
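One way such representative sentences might be surfaced (an illustrative approach, not necessarily our algorithm) is to cluster the sentences coded with a theme and take the sentence nearest each cluster centre:

```python
# Pick sentences that span the range of a theme: cluster tf-idf vectors
# and report the sentence closest to each cluster centre.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances_argmin_min

sentences = [  # toy examples of sentences coded 'team and teamwork'
    "My team is really supportive.",
    "Teamwork across departments could be better.",
    "We collaborate well within the team.",
    "Cross-team communication needs work.",
]

X = TfidfVectorizer().fit_transform(sentences)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)
for i in closest:
    print(sentences[i])
```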

Including text codes in other models

Structured text codes, especially if they’re granular enough, can be very useful features to add to predictive models. When doing things like attrition modelling we’ve typically found them to be among the most useful variables in the model.

The challenge for the analyst when using these variables in models will likely be to do so whilst preserving the confidentiality agreements that were made when the data was collected.

Whilst it might, for example, be possible to protect against this by setting the minimum leaf size in a decision tree to the minimum reporting size, much of the problem lies in the action, not the analysis.
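A sketch of that safeguard in scikit-learn, with synthetic data standing in for the real theme features and attrition labels:

```python
# Cap how small any leaf (i.e. any reportable group) can get by setting
# the tree's minimum leaf size to the minimum reporting size.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.integers(0, 2, (200, 4))  # binary theme flags as features
y = rng.integers(0, 2, 200)       # hypothetical attrition labels

MIN_REPORTING_SIZE = 10  # assumption: your confidentiality threshold
tree = DecisionTreeClassifier(min_samples_leaf=MIN_REPORTING_SIZE).fit(X, y)
# No leaf can now contain fewer than 10 people.
```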

If you are planning to take action at a group level (e.g. groups where more than X people mention theme Y) then you can direct action more precisely than you could at the population level, and depending on the action this might be appropriate.

Alternatively, for many surveys it is possible for an intermediary (a third party or technology solution) to ask the comment provider for consent to be identified so that action can be taken. For example, if someone raised comments about bullying you might want to investigate them, and to do so you would need to know who was concerned.

What is unique?

Finally, it’s worth reviewing those statements which haven’t been coded. Some of these will be natural model errors. Our coding tends to have a very low false-positive rate (we’d prefer to say ‘don’t know’ than get it wrong), but a consequence of this is that some statements could have been coded if we accepted a higher error rate. (Depending on the client’s needs we can fine-tune this via more active learning.)

Other statements that aren’t coded are likely to be ones far from any of the coded examples - that is, they are genuinely unique. It’s worth reviewing these, as we’ve frequently found some of the most valuable feedback in these unique contributions.
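One illustrative way of ranking uncoded statements by how ‘unique’ they are (again an assumption, not our exact approach) is their distance to the nearest coded example:

```python
# Rank uncoded statements by cosine distance to the nearest coded example:
# the further away, the more likely they are genuinely unique.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

coded = ["Pay could be better.", "My manager supports me."]
uncoded = ["The office dog improves my day.", "Pay is low."]

vec = TfidfVectorizer().fit(coded + uncoded)
nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(vec.transform(coded))
dist, _ = nn.kneighbors(vec.transform(uncoded))
for text, d in sorted(zip(uncoded, dist.ravel()), key=lambda t: -t[1]):
    print(f"{d:.2f}  {text}")
```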

Summary

Understanding the distribution and frequency of topics is important when analysing text comments but it should only be seen as a starting point. In most instances what you do with the structured themed data is more valuable than just the themes themselves.

Much of what you choose to do will depend on the questions that you and your audience have about the data. As always this will shape how you conduct and present your analysis.

I mentioned at the beginning that text is unstructured. That is not entirely true. Text responses are structured by their grammar, how the topics co-occur, the institutional context and even the data-collection approach (the survey design). They will always be more valuable if the analyst can provide context from outside the dataset, for example by linking the comments with events occurring in the organisation.

As analysts we believe that open-ended text comments are best used in an exploratory manner: they are there to suggest patterns and prompt hypotheses. Text is a wonderful conversation starter - much more so than other types of survey response. Depending on the question you’re addressing, and the other data you have at hand, you might have enough confidence from the text responses alone to make an effective decision.

As statisticians we will highlight that there are better ways than text data of confirming or rejecting these hypotheses. However, using text responses as a first stage will allow you to ask better questions of your statistical analyses and experiments. The pattern-recognition approaches described above might also help you direct these experiments more effectively.

In this article I haven’t aimed to provide a fully comprehensive set of analyses or visualisations that you might perform once you have coded data. I hope, however, that it has illustrated why what you do once you have coded data is probably more important than the coding itself.