Why great training data is the key to text analysis success in People Analytics

Most People Analytics teams decide at some stage that they need to analyse text data. Most will have large volumes of text, and they will often realise that it’s the richest data type they have: text can capture information that is difficult to capture efficiently elsewhere.

To the non-text-specialist data scientist, and certainly to the HR teams who might manage or work with them, this seems like a technology problem. We’re also living in an age where many of the most powerful ‘AI’ models are open-source and increasingly available in a few lines of code. Wouldn’t it simply be a matter of choosing your foundation model and completing your task?

Classification models

Most use cases of text analytics in HR are a type of classification model. To train a classification model you need a set of data examples and the ‘labels’ that apply to them. So for a text model you might pair the sentence ‘I would like a pay raise’ with the label ‘salary increase’.
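
To make that concrete, here is a minimal sketch of what such training data looks like and how a generic classifier consumes it. The themes, sentences and the choice of a TF-IDF plus logistic regression model are illustrative assumptions, not the approach described in this article.

```python
# A minimal sketch of text classification from labelled examples.
# The themes, sentences and model choice are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each training example is a (sentence, label) pair created by a human coder.
examples = [
    ("I would like a pay raise", "salary increase"),
    ("My salary has not moved in three years", "salary increase"),
    ("My manager never gives feedback", "management"),
    ("I'd like clearer feedback from my line manager", "management"),
]
texts, labels = zip(*examples)

# Bag-of-words features plus a linear classifier are enough to illustrate the idea.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["any chance of a raise?"]))
```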

In many instances when you’re building classification models in HR you already have labelled data. For an attrition model you need a set of data about individuals and a label - whether or not they’ve left at a particular point in time. You probably have this information already, which is why building attrition models is such a popular place to start as a People Analytics use case.

With classification of text documents you don’t have an easily accessible set of labelled data. You have to create it. There is no getting away from the fact that this will require human effort.

Zipf’s law

Zipf’s law - named after the American linguist George Kingsley Zipf - is the observation that the frequencies (f) of certain events - in our case words in a text source - are inversely proportional to their rank (r). In English the relationship is approximately f(r) = 0.1 / r. So the most frequently occurring word (in English ‘the’) occurs about every 10th word and the second most common word (in English ‘of’) about every 20th word. What Zipf’s law describes - certainly for the most frequent words in any set of related documents - is a power-law distribution: a few very common items and a very long tail of rare ones.
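
To put rough numbers on that relationship, here is a quick sketch of what f(r) = 0.1 / r implies for a corpus; the corpus size is an arbitrary choice for illustration.

```python
# Rough counts implied by Zipf's law, f(r) = 0.1 / r, for a 100,000-word corpus.
# The corpus size is an arbitrary illustration, not a figure from the article.
corpus_size = 100_000
example_words = {1: "the", 2: "of"}  # the two ranks named in the text

for rank in range(1, 6):
    share = 0.1 / rank                   # expected proportion of all words at this rank
    count = share * corpus_size
    label = example_words.get(rank, f"rank-{rank} word")
    print(f"rank {rank} ({label}): ~{share:.1%} of words, ~{count:,.0f} occurrences")
```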

We see something akin to this in every one of our theme frequency plots.

A typical distribution of themes in employee text

Themes aren’t words - there are numerous words to describe ‘salary’, and some sentences about salary don’t even mention the word - e.g. ‘give me a raise!’ - but a similarly heavy-tailed distribution is likely to be present.

What are the implications of this?

Let’s imagine an experiment. Say you have a set of 1000 survey responses. To simplify, assume that each answer is one sentence and each sentence carries one theme (this certainly isn’t realistic).

You’ve been asked to sort the sentences into themes. You take the first sentence and as no theme currently exists you create a new theme.

You then take the next sentence. Does it belong to the existing theme? If yes, allocate it to that theme; if not, create a new theme.

What do you see when you’ve finished the task? Your biggest theme probably covers between 7% and 15% of the sentences, and you’ll have a long tail of themes with only one example each.

It would be reasonable to say that to count as a theme you need ‘n’ examples, say 5. You can then group all the themes below this threshold into an ‘other’ bucket.

How would you find 5 examples of a rare theme - one of those with a single example - if you needed 5 sentences? You’d probably need a dataset roughly 5 times the size. However, if you now ran the experiment on 5000 answers, many of your original themes might clear the threshold of 5, but you’d also have uncovered a new set of themes below it. There will always be a long tail of themes with a frequency of 1.
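
A quick simulation illustrates the point. The Zipf-style distribution, the number of candidate themes and the exponent below are assumptions chosen for illustration (roughly matching the 7 - 15% biggest-theme figure above), not measurements from real data.

```python
# Simulate the thought experiment: each answer gets one theme, drawn from a
# Zipf-like distribution. The 5,000 candidate themes and the exponent of 1.0 are
# assumptions chosen so the biggest theme lands near the 7-15% range above.
import numpy as np

rng = np.random.default_rng(42)

def run_experiment(n_answers, n_candidate_themes=5_000, exponent=1.0, threshold=5):
    ranks = np.arange(1, n_candidate_themes + 1)
    probs = 1.0 / ranks**exponent
    probs /= probs.sum()
    draws = rng.choice(ranks, size=n_answers, p=probs)
    counts = np.bincount(draws, minlength=n_candidate_themes + 1)[1:]
    seen = counts[counts > 0]
    print(
        f"{n_answers} answers: {len(seen)} themes seen, "
        f"biggest theme = {seen.max() / n_answers:.0%}, "
        f"{(seen < threshold).sum()} themes below {threshold} examples, "
        f"{(seen == 1).sum()} themes with a single example"
    )

for n in (1_000, 5_000):
    run_experiment(n)
```

However large you make the sample, the final figure in each run shows a stubborn tail of single-example themes.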

Building labels for text classification

The example above is a simple approach to creating the labels and examples that you need so that a machine learning (AI) algorithm can learn how to classify text. How many examples do you need for a text classification model to start predicting a label? A very rough guide from our experience is that you need a base of around 100 examples to start getting good results. (We are more comfortable with 500 examples of each theme as a starting point; 1000 is probably better.)

If we think that the biggest theme occurs 10% of the time, then reviewing our first 1000 answers should provide enough examples for that biggest theme alone. But you don’t want just the biggest theme; your internal clients will certainly expect more than that, so you’ll need to review and label more data.
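
As a back-of-the-envelope sketch, assume theme shares roughly follow the 0.1 / rank pattern discussed above (an assumption for illustration, not a measured figure); the review volume needed to collect 100 examples grows quickly as you move down the theme ranking.

```python
# Back-of-the-envelope: if a theme's share of answers is roughly 0.1 / rank,
# how many answers must you review to collect 100 examples of it?
# The 0.1 / rank assumption mirrors the Zipf-style argument above; it is illustrative.
target_examples = 100

for rank in (1, 2, 5, 10, 20):
    share = 0.1 / rank                    # assumed share of answers carrying this theme
    answers_needed = target_examples / share
    print(f"theme rank {rank:>2}: ~{share:.1%} of answers -> review ~{answers_needed:,.0f} answers")
```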

One estimate of the time required to label text data from surveys is that labelling 1000 answers manually takes approximately 27 hours. And in real life you don’t have simple, one-theme sentences.

The taxonomies will be task-related

Ask any qualitative researcher and one clear message is that a high-performing coding model depends not only on the question that was asked but also on the purpose - the question your audience wants answered.

Think of a simple example: compare a broad, open question to employees with a question asked specifically about DEI, and consider which themes would be relevant in each case.

In the broad model it’s probably right to highlight answers that mention DEI. In the DEI model you are probably more interested in what, specifically, employees are communicating about DEI.

Even for a single question, different audiences will likely have different needs. For your general employee survey question, what interests the HR team is likely to differ from what interests the IT teams or business strategy groups.

Information value

Of course there’s a catch (isn’t there always!). Not every labelled example is equally useful for training the AI system. If you have one sentence saying ‘pay me more’, adding a second won’t add much new information. In fact, large volumes of identical or near-identical examples will probably degrade model performance, as your system will learn that it should look for exactly that sentence.

What you need is both volume and variety of answers.

Here’s where our old friend the heavy-tailed distribution comes back. The most common phrasings of each theme are likely to occur many, many times more often than the least frequent. Finding 100 examples certainly doesn’t provide you with 100 unique, or even nearly unique, phrasings. To get diversity what do you need? Yes, more examples!
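
One common way to check and enforce that variety - a generic technique, not something specific to our pipeline - is to filter out near-duplicate examples before training, for instance with cosine similarity over TF-IDF vectors. The similarity threshold below is an illustrative assumption to tune on your own data.

```python
# A simple near-duplicate filter for labelled examples, using TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

examples = [
    "pay me more",
    "Pay me more!",
    "please pay me more",
    "my salary is not competitive with the market",
    "compensation here lags our competitors",
]

vectors = TfidfVectorizer().fit_transform(examples)
similarity = cosine_similarity(vectors)

kept = []
for i in range(len(examples)):
    # Keep an example only if it is not too similar to anything already kept.
    if all(similarity[i, j] < 0.8 for j in kept):
        kept.append(i)

print([examples[i] for i in kept])
```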

What are the edge cases?

The most information-rich answers to include in your training data are the ones at the decision boundary. These are the ones that the model has the highest probability of miscoding, or of simply not being confident in its result.

Each model will have different examples that it finds difficult. Hence, in a single-model world, you’ll want to find the examples that the model finds difficult to allocate to a theme, not the ones you find difficult. The two might not be the same.

If you’ve progressed to using an ensemble of models, the different models might be unsure about different cases. You can use this information either to prioritise which examples to label next or to apply some sort of voting rule to minimise the uncertainty.
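
As a sketch of that idea - the models, sentences and disagreement rule below are illustrative assumptions - you can rank unlabelled sentences by how much the ensemble disagrees and send the most contested ones to your coders first.

```python
# Sketch: rank unlabelled sentences by how much an ensemble of models disagrees.
# The predictions and the disagreement measure are illustrative assumptions.
from collections import Counter

# Imagine three models have each predicted a theme for every unlabelled sentence.
predictions = {
    "any chance of a raise?": ["salary", "salary", "salary"],
    "my manager blocks my promotion": ["management", "career", "career"],
    "pay and progression both feel stuck": ["salary", "career", "management"],
}

def disagreement(votes):
    # 0.0 when all models agree; approaches 1.0 as the vote splits evenly.
    top_share = Counter(votes).most_common(1)[0][1] / len(votes)
    return 1.0 - top_share

# Sentences the ensemble is most divided on are the best candidates for labelling.
for sentence in sorted(predictions, key=lambda s: disagreement(predictions[s]), reverse=True):
    print(f"{disagreement(predictions[sentence]):.2f}  {sentence}")
```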

Being smart about which data to label

When reading the last two sections you might have wondered whether you need to label everything, whether a random-sample approach would be good enough, or whether you should instead use machine learning models to identify which sentences or answers to label. The answer is probably obvious.

The use of machine learning models to guide human coders towards the examples with the highest information value is called ‘active learning’. It’s an approach that we, and almost all experienced teams, use. Historically it’s been under-researched in the ML community, though this is starting to change, driven by those with applied problems.
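
A minimal pool-based active learning loop, using uncertainty sampling with an off-the-shelf classifier, looks something like the sketch below. It’s a generic illustration, not our system; the sentences, themes and model choice are assumptions.

```python
# A minimal pool-based active learning loop using uncertainty sampling.
# This is a generic sketch, not the system described in the article.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labelled_texts = ["I would like a pay raise", "my manager never listens"]
labelled_themes = ["salary", "management"]
unlabelled_pool = [
    "bonuses were cut again this year",
    "the office coffee is terrible",
    "my team lead ignores my development goals",
]

vectoriser = TfidfVectorizer()

for round_number in range(3):
    if not unlabelled_pool:
        break

    # Retrain on everything labelled so far.
    X_labelled = vectoriser.fit_transform(labelled_texts)
    X_pool = vectoriser.transform(unlabelled_pool)
    model = LogisticRegression().fit(X_labelled, labelled_themes)

    # Uncertainty = 1 - confidence in the most likely theme; query the least confident.
    uncertainty = 1 - model.predict_proba(X_pool).max(axis=1)
    sentence = unlabelled_pool.pop(int(np.argmax(uncertainty)))

    # In practice a human coder supplies the label; here we just ask on the console.
    theme = input(f"Round {round_number}: what theme is '{sentence}'? ")
    labelled_texts.append(sentence)
    labelled_themes.append(theme)
```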

Who is your data-labelling team?

Ask anyone in an applied ML team what the most difficult challenge they face is and they will tell you it is getting robust training data, provided by coders with domain knowledge. For most corporates, outsourcing this to a platform like Mechanical Turk is a non-starter. A legal application probably needs lawyers to do the coding. A finance system will need people with a finance background. We need people who can really understand employee survey answers.

Most companies default to their data science teams. This is not a long-term solution. Labelling is a skilled role, and using data scientists to label (who often don’t have the domain knowledge) takes those people away from other data science tasks.

We’re increasingly seeing corporates building data-labelling teams. Others have chosen to outsource this activity. We have yet to see internal HR teams build dedicated coding teams, but given the volumes of training data needed as companies explore more sophisticated - or higher-performing - use cases, this seems inevitable.

A few things to consider

When building good quality training data you quite quickly get to a stage where you need to formalise the process.

  • You will need to build coding documentation to communicate to the wider team how to make certain decisions

  • You will need to build some form of governance process to handle the inevitable disagreements, especially around ambiguous comments and edge cases (one way to surface those disagreements is sketched after this list)

  • You need to consider training data maintenance and updating. We see new themes continually emerging. I would estimate that of our new theme categories a third are genuinely new themes as language use evolves, a third are linked to new questions - or a shift in emphasis - and a third result from splitting older, broad themes into sub-themes.
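
One common way to surface those disagreements before they reach a governance discussion - standard practice in annotation work rather than anything specific to our process - is to have two coders label the same sample and measure their agreement, for example with Cohen’s kappa.

```python
# Measure agreement between two coders on the same sample of sentences.
# The labels are illustrative; scikit-learn's cohen_kappa_score does the work.
from sklearn.metrics import cohen_kappa_score

coder_a = ["salary", "management", "workload", "salary", "other", "workload"]
coder_b = ["salary", "career", "workload", "salary", "other", "management"]

# Rule of thumb: values above roughly 0.8 are often read as strong agreement.
print(f"Cohen's kappa: {cohen_kappa_score(coder_a, coder_b):.2f}")

# Items where the coders disagree feed the governance process and the coding documentation.
disagreements = [i for i, (a, b) in enumerate(zip(coder_a, coder_b)) if a != b]
print("items needing review:", disagreements)
```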

Do you need to acquire labelling tech for the coders to use?

We, like most teams, started out with a collection of scripts and spreadsheets to create and refine our coding models. As soon as this approach became stable we built internal systems that not only replicated or improved what was being done in those spreadsheets but also allowed us to distribute and measure coding activity. We’ve been refining our system for about 4 years, and now all the end-to-end effort of building, fine-tuning and maintaining our text models can be done by our domain specialists rather than data scientists.

Our internal tech is very specific to our use case and to the coding approach that we’ve developed. We built it partly because there were few commercial or open-source systems available that offered what we wanted.

This is no longer the case, and there is now a host of coding platforms available, such as Labelbox and LightTag. These tools have much broader use cases than our internal tools, but I’d argue they are less focussed on the specific tasks we need to perform. If we were starting now we’d probably use one; as it is, I don’t see a need to switch.

Summary: the most important driver of text analysis success is probably not what you think it is

In this article I’ve tried to show how building great text analysis is less a technical challenge and more a matter of the painstaking, time-consuming work of building large volumes of high-quality training data.

If someone asked me how to improve their text analysis models my first recommendation would be to improve their training data. This data-centric AI approach that we’ve adopted over the last 7 years is where we’ve seen the greatest performance increases.

The teams with the highest-performing text models today are likely to be the teams with the greatest investment in training data. We’re currently running at over 2 million hand-coded example sentences, which we use as a starting point for any new piece of client work. Our labelling technology enables us to rapidly refine these models for new client questions.

Whilst many people think of text analysis performance as dependent on which models you’re using, we see our competitive advantage as two-fold:

  • A large volume of high-quality domain-specific training data

  • An industrialised process (with supporting systems) for rapidly evolving and building this training data to meet client needs.

If you’re tempted to build this internally, it is possible, but it will require a lot of resource. A good text model is mostly dependent on hard work.