OrganizationView, the firm that I run, is based in the world-famous Swiss resort of St. Moritz. It's a great place from which to run a technology business: the digital infrastructure is superb, and it's the ideal location to work hard and play hard if you're interested in the great outdoors. Skiing or mountain biking before work is possible (and we encourage it!).
The primary local industry is tourism, and whilst having drinks with some hotel directors a few months ago we discussed whether the open-question analysis technology behind Workometry could be applied to the hotel sector.
How we code text
Workometry's approach to text is somewhat different to most other text-coding services on the market. Instead of applying one complex, extensive coding model to all data, we have a process that we can scale quickly to build custom coding models for different datasets. In effect we narrow each problem down to its simplest state, believing that narrow problems, rather than general ones, are the easiest to solve with today's AI. We build question- and organization-specific models instead of using general models.
At a high level, the way we code text is almost exactly how a skilled human researcher would tackle the problem. We start by creating an unsupervised model to identify clusters of answers with similar semantic meaning, and then use a supervised approach to code as many statements as possible with these topics. We believe the algorithms should do as much as possible, but that a domain expert can still contribute, provided you give the human as little work as possible and learn from each answer they provide.
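To make the two-step idea concrete, here's a minimal sketch using scikit-learn. The answers, the cluster count and the model choices are all invented for illustration; this is the general shape of an unsupervised-then-supervised pipeline, not Workometry's actual implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy open-question answers (invented for illustration).
answers = [
    "the gym equipment is old",
    "gym machines need replacing",
    "more flexible working hours please",
    "let us work from home more often",
]

# Unsupervised step: embed the answers and cluster by similarity.
vec = TfidfVectorizer()
X = vec.fit_transform(answers)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# A domain expert would review and name the clusters; here we simply
# take the cluster assignments as the confirmed topic labels.
labels = clusters

# Supervised step: train a classifier on the confirmed labels so that
# new, unseen answers can be coded with the same topics automatically.
clf = LogisticRegression().fit(X, labels)
new = vec.transform(["the weights room feels dated"])
predicted_topic = clf.predict(new)
```

In practice the representation would be a proper semantic embedding rather than TF-IDF, but the division of labour is the same: the clustering proposes topics, the human confirms them, and the classifier scales the coding to the rest of the data.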
The benefit of this approach is that we are exceedingly good at understanding the narrow data provided. I suspect the models that we created for our local hotels wouldn’t be the best fit if we were analysing the hotels in Zurich or Bern as we’d assume the purpose of those visits would be different.
Understanding the Engadin reviews
We started with a dataset of about 8,000 hotel reviews for our local area, covering about 60 hotels and totalling just under 1 million words. Our first realisation was that the text in reviews is of significantly lower quality than we'd see in employee answers to specific questions.
When writing reviews, hotel guests produce varying amounts of text, ranging from a single sentence to a sizeable essay. They talk about the hotel, the resort, how they got there, who they were travelling with, why they were travelling, the weather, even their favourite films. We wanted the insight we produced to be something the hotel directors could use to make improvements; obviously only a subset of the review content was actionable.
When we ask an employee a question such as 'how could we improve the customer experience in our stores?' we can assume that most comments are about ways of improving the customer experience. In a public hotel review we first have to identify what is actually about the hotel. Instead of building very specific models we need to build much more general models and then remove the irrelevant categories.
Depending on your use-case, sentiment analysis isn’t yet good enough
The other big issue compared to how we ask for feedback is that reviews can be positive or negative. With employee feedback we narrow the text as much as possible by asking questions such as ‘what is the best thing about working for COMPANY?’ or ‘how could we improve the performance management process?’ — we use the question to narrow the sentiment.
With reviews we wanted to do topic-level sentiment. It’s typical to see sentences such as “My pillow was too hard but room service changed it really quickly.” What we felt was important here was not the overall sentence level sentiment (probably positive) but that the pillow was negative and the room service was positive.
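The pillow example can be sketched in a few lines. This is a deliberately naive illustration: the lexicon and the clause-splitting rule are invented, and production systems use trained aspect-based sentiment models, but it shows why clause-level output is more useful than one score per sentence.

```python
# Toy lexicons, invented purely for this example.
POSITIVE = {"quickly", "great", "friendly", "clean"}
NEGATIVE = {"hard", "slow", "dirty", "noisy"}

def clause_sentiment(sentence):
    """Split on contrastive 'but' and score each clause separately."""
    results = {}
    for clause in sentence.lower().strip(".?!").split(" but "):
        words = clause.split()
        score = (sum(w in POSITIVE for w in words)
                 - sum(w in NEGATIVE for w in words))
        results[clause.strip()] = ("positive" if score > 0
                                   else "negative" if score < 0
                                   else "neutral")
    return results

clause_sentiment("My pillow was too hard but room service changed it really quickly.")
# {'my pillow was too hard': 'negative',
#  'room service changed it really quickly': 'positive'}
```

A sentence-level scorer would collapse those two opposing signals into a single (probably positive) label, losing exactly the detail the hotel needs.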
A few years ago I heard Mark Cieliebak of ZHAW presenting a paper on the accuracy of best-in-class sentiment analysis tools. I remember him mentioning that accuracy was, on average, about 60%, and that ensembling multiple systems was recommended as a way to increase it. That paper was from about 2013. Given our experience, I'd suggest the accuracy of current best-in-class tools on real-world data is about 70%. Again, using multiple libraries with an ensemble method makes sense.
As Mark noted all those years ago, whether this is good enough depends on what you're trying to do. If you're looking at averages or trends across large volumes of text, it's certainly useful. If you're trying to filter the positive or negative references to, say, bed quality, it becomes a frustrating activity and the user will probably lose confidence in the technology.
If you get the chance in a survey, ask a positive and a negative question, or take sentiment from a well-designed scale question. Leave algorithmic sentiment analysis to the use-cases where asking for better data isn't possible.
Frequency is important, but unusual frequency is what you want to know
Many of the tools available to hotels will provide analysis of what guests are mentioning in their reviews. They do this by looking at topic frequency.
What we’ve found through our employee feedback work is that absolute frequency of certain topics is relatively meaningless on its own — for understanding where action is needed you also need to look at the relative frequency compared to a peer group.
What our analysis does is answer the question ‘which of these topics is unusually frequent given your peers’. We use the size of the review set and the distribution of topics across all groups to flag areas for attention.
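One standard way to flag 'unusually frequent given your peers' is a two-proportion z-test comparing a hotel's topic mention rate against the peer-group rate. The sketch below uses invented counts and is my illustration of the general idea, not Workometry's actual method.

```python
import math

def topic_z_score(hotel_mentions, hotel_reviews, peer_mentions, peer_reviews):
    """Two-proportion z-test: how unusual is this hotel's topic rate
    relative to its peer group?"""
    p_hotel = hotel_mentions / hotel_reviews
    p_peer = peer_mentions / peer_reviews
    # Pooled proportion under the null hypothesis of no difference.
    p_pool = (hotel_mentions + peer_mentions) / (hotel_reviews + peer_reviews)
    se = math.sqrt(p_pool * (1 - p_pool)
                   * (1 / hotel_reviews + 1 / peer_reviews))
    return (p_hotel - p_peer) / se

# Invented counts: 30 of 100 reviews at this hotel mention "breakfast",
# versus 150 of 2,000 reviews across the peer group.
z = topic_z_score(30, 100, 150, 2000)
# A z-score well above ~2 flags the topic as unusually frequent here.
```

Note how the review-set size enters through the standard error: the same 30% mention rate is far less remarkable for a hotel with only ten reviews than for one with a hundred.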
We also found that the comparison group was very important when comparing hotels. Simply comparing the 5-star hotels with the rest of the valley produced results which, whilst not wrong, weren't very informative. Compared with all hotel guests, five-star hotel guests had unusually positive views about the luxurious spas, but this is neither helpful nor surprising as few other hotels have them. What was more informative for this group of hotels was to compare them with their peers.
Where should we focus?
The differentiated topics were helpful, but what really matters is outcome-type variables. Unfortunately, since we were looking at external data, we weren't able to use data points such as whether the customer returns or their spend during the stay. We were left with how many stars the reviewer gave.
This was enough, at the valley level, to build a reasonable model explaining what hotels needed to focus on if they wanted to improve their review rating. It's not just a case of focusing on factors where they're lagging; they need to understand which factors actually make the biggest difference to the rating score.
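A simple version of such a driver analysis is to regress star ratings on binary topic indicators and compare the coefficients. Everything below is simulated for illustration (the topic names, effect sizes and use of plain least squares are my assumptions, not the actual model we built).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data: for each review, whether it mentions a topic
# (columns: cleanliness, breakfast, spa) and the star rating given.
topics = rng.integers(0, 2, size=(n, 3))
stars = (3.0
         + 1.2 * topics[:, 0]   # cleanliness moves ratings a lot
         + 0.4 * topics[:, 1]   # breakfast matters somewhat
         + 0.05 * topics[:, 2]  # spa barely moves the rating
         + rng.normal(0, 0.3, n))

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), topics])
coef, *_ = np.linalg.lstsq(X, stars, rcond=None)
# coef[1:] estimates each topic's effect on the star rating; the
# largest coefficients indicate where improvement pays off most.
```

The point of the exercise is exactly the one in the text: a hotel might lag badly on a topic (say, the spa) that has almost no effect on its rating, while a smaller gap on a high-coefficient topic is where the effort should go.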
Public reviews aren’t great for continual improvement
Reviews are really important to hotels as they're one of the key ways that individuals decide which hotel to book. Most hotel reviews are hosted by organizations that make money by selling hotel rooms. All the hotels we spoke to use one of a handful of review-management tools, which scour the web for reviews and help hotels respond in an efficient manner.
What we learned is that hotel reviews are quite a poor way for hotels to learn what they could do better. At the same time, the commercial pressure around reviews incentivises hotels to collect them for marketing purposes rather than as feedback for continual improvement.
To get better data you need better questions
In traditional surveys a lot of time and effort is spent ensuring that the questions are well written and valid. Whilst we don't believe that level of effort is needed with open questions (there is much less abstraction in them), it is certainly true that open questions need to be well written to get the best data.
There is a balance to be struck. To solicit the widest feedback you want the question to be as open as possible; however, to encourage the respondent to be specific you need to frame it so they don't write about irrelevant matters. For hotel reviews there was plenty outside the hotel's control (e.g. delays on the train).
What is clear is that looking at reviews alone is insufficient for understanding and improving the customer experience.