Thinking about training data


This article is an extension of a presentation that I gave at SwissText 2019. Whilst I cover the topics from our perspective, I hope it is of interest to anyone doing applied machine learning in a commercial environment.

For the last few years our business has been to help clients make sense of large volumes of text provided by employees, typically in surveys or various HR systems. Our clients come to us because we are recognised as a global leader in this area - we’re typically working for the People Analytics teams of some of the most advanced firms.

We firmly believe that client- and question-specific models quickly outperform the generic models provided by our competitors. We established ourselves providing fully inductive models, where our job is to first identify what people are talking about and then classify each statement into one or more themes. Typically we have to turn around a new dataset within a day or two of receiving it. The key advantage of this approach is that we can understand the data from any question in the context of what was asked.

More recently we’ve developed a second-tier service where we start with a generic model built on a similar question and then run a layer of fine-tuning for each client.

In both instances we’ve realised that the most valuable part of the modelling process is having a good training set. For the fully inductive approach we have to develop this training set in a very short period of time. In the other we need to add to or update an existing model.

Optimising for two variables

In most machine learning literature the objective of the modeller is to improve the overall accuracy of the model. This is the case for competitions and for academic examples. It’s how machine learning practitioners are taught.

For us this is, of course, really important. We attach a greater cost to false positives than to false negatives, as our experience is that users are much more willing to accept a model saying ‘I can’t classify this’ than to see it get something wrong. However this still results in an objective of maximising ‘quality’.
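
As a rough illustration of that cost asymmetry, the sketch below (a hypothetical helper in Python, not our production logic) only returns a label when the model’s confidence clears a threshold and otherwise abstains; the threshold value is illustrative and would in practice be tuned against the relative cost of a false positive.

```python
import numpy as np

def classify_with_abstention(probabilities, labels, threshold=0.8):
    """Return a label only when the model is confident enough;
    otherwise abstain, i.e. say 'I can't classify this'.

    probabilities: array of shape (n_examples, n_labels).
    labels: label name for each column.
    threshold: illustrative cut-off, tuned in practice against the
               cost of a false positive.
    """
    results = []
    for row in np.asarray(probabilities):
        best = int(np.argmax(row))
        # Prefer saying nothing over risking a false positive.
        results.append(labels[best] if row[best] >= threshold else None)
    return results
```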

However there is another, equally important objective: to minimise the effort needed to create a training data set. Ours is openly a ‘human in the loop’ approach, and we need to optimise the value those humans deliver.

How training is often done

In many machine learning approaches the creation of training data is done in a simple manner: a human, often someone who is low-cost, is shown an example and asked to assign a classification. Machine learning practitioners might use services like Amazon’s Mechanical Turk to create and run such tasks in a quick, cost-effective manner. In many instances this works well.

Our situation is different though. All of our data is highly confidential. In many instances it needs somebody with good domain knowledge to make sense of it, especially when the text is ambiguous (these are the examples ML algorithms have most difficulty classifying). Dealing with ambiguous text is one of the main challenges of text classification and a core reason why our approach of building question- and organisation-specific models outperforms generic approaches. The context of what the person was asked is integral to understanding ambiguous answers.

Compared to the first approach, instead of assuming coding is cheap we need to assume it’s expensive. We therefore need to be as smart as possible about how we use that expensive resource.

Optimising the value of training examples

If we look at a typical training set, not all examples are equally valuable. If we have 20 examples of employees saying ‘pay me more’, the 21st such example will have very little effect on the overall classification.

Our models are semantically based, so the algorithms attempt to look at the meaning of the sentences, not just the words used. We want to differentiate ‘pay me more’ from ‘pay attention’, or ‘better management’ from ‘financial management’. Keyword searches produce very crude classifications that are poor at dealing with novel ways of expressing the same topic. They also tend to fall apart when words are new, which is often the case with spelling mistakes.

If we think about our training data, we therefore need to identify the examples which will provide the greatest improvement in coding quality for the lowest cost in coders’ time. Removing duplication of effort, or duplication of near-identical effort, is one way of doing that.
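
As a hypothetical illustration of removing near-duplicate effort, the sketch below uses sentence embeddings to skip candidate examples that are near-identical to ones already labelled. The sentence-transformers package, the model name and the similarity threshold are assumptions for illustration, not a description of our stack.

```python
from sentence_transformers import SentenceTransformer  # assumption: embeddings via this package
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def select_novel_examples(candidates, already_labelled, threshold=0.9):
    """Drop candidates that are near-duplicates (cosine similarity above
    `threshold`) of examples already in the training set."""
    if not already_labelled:
        return list(candidates)
    labelled_emb = model.encode(already_labelled, normalize_embeddings=True)
    selected = []
    for text in candidates:
        emb = model.encode([text], normalize_embeddings=True)[0]
        # With normalised embeddings the dot product is the cosine similarity.
        if np.max(labelled_emb @ emb) >= threshold:
            continue  # e.g. the 21st 'pay me more' adds very little
        selected.append(text)
    return selected
```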

Active learning

In the machine learning literature the approach that we use is called ‘active learning’ (see this video for an easy-to-understand introduction). It is based on the assumption that creating training data is an expensive activity. Not only does it consume a large share of a typical data science team’s resources, it’s also a thankless, boring task (especially if you see the same or very similar examples frequently) and therefore tends to be prone to human error.

I like to explain this as how we teach a junior employee (or child) to perform a task. As teachers it’s most valuable if we can provide a few clear examples which the learner can then apply. We will likely be comfortable providing more assistance when new cases arise that the learner hasn’t seen previously, but we would expect them to handle examples close to those they have already been taught. Hopefully over time the edge cases will become rarer and rarer as the learner gains experience.

Active learning therefore assumes that some examples will be more valuable for model building than others, and that the model handles all but the edge cases reasonably quickly. To develop an active learning approach we need algorithms that know when to ask for help, i.e. that can recognise an edge case. We need algorithms that not only provide accurate classifications, ideally with as few examples as possible, but that can also flag the examples which they think will provide the greatest value if a human provides assistance.

Active learning in practice

So what have we learned from doing this?

The first is that we need to identify which examples will be most valuable to pass to a human. There are different strategies for doing this. We could pick examples with a probability as close as possible to 0.5 - i.e. where the model is convinced of neither a positive nor a negative classification - however we have yet to find this to be the best approach, mainly because models, like humans, are rarely accurate in estimating their own uncertainty.
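
A minimal sketch of that ‘closest to 0.5’ strategy is below, assuming a fitted binary classifier with predict_proba and a fitted vectoriser in the scikit-learn style; the helper and its argument names are illustrative.

```python
import numpy as np

def pick_most_uncertain(model, vectorizer, texts, n=20):
    """Rank unlabelled texts by how close the positive-class probability is
    to 0.5, i.e. where the model is convinced of neither label."""
    X = vectorizer.transform(texts)
    positive_prob = model.predict_proba(X)[:, 1]
    distance_from_even = np.abs(positive_prob - 0.5)  # 0 = completely unsure
    most_uncertain = np.argsort(distance_from_even)[:n]
    return [texts[i] for i in most_uncertain]
```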

We could instead use an ensemble approach, where we run multiple different models and, where they don’t agree, pass the example to a human for the final say. This is typically a bit more valuable than the simplest strategy and performs reasonably well. We can even develop a model to ‘weight’ the various underlying models and use the uncertainty of that model to determine what to pass to a human.
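
The sketch below illustrates the ensemble idea in its simplest query-by-committee form: score each unlabelled example by how much a set of models disagrees about it and pass the most contested ones to a human. The helper and its arguments are assumptions for illustration, not our production code.

```python
import numpy as np

def pick_by_disagreement(models, X, texts, n=20):
    """Score each unlabelled example by how many distinct labels the
    ensemble predicts for it, and return the texts with the most
    disagreement for human review.

    `models` is any list of fitted classifiers exposing predict().
    """
    votes = np.stack([m.predict(X) for m in models])           # (n_models, n_examples)
    disagreement = np.array([len(set(votes[:, i])) for i in range(votes.shape[1])])
    ranked = np.argsort(disagreement)[::-1][:n]                 # most contested first
    return [texts[i] for i in ranked]
```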

In practice we use a hybrid approach and one that continues to evolve. Of course we don’t want to spend more time developing approaches to identify valuable examples than it would take just to code a less-optimal set of examples. The cost of developing algorithms is also a cost to be optimised.

Measure, measure and analyse

One thing that we learnt early on was to record every iteration of both the training data and the resulting predictions from each model. Doing so has been a great help in diagnosing where value is created and in developing approaches to identify which examples to pass to a human.
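
A minimal sketch of this kind of record-keeping is below; the file layout and field names are illustrative assumptions rather than a description of our actual store.

```python
import json
import time
from pathlib import Path

def record_iteration(run_dir, iteration, training_examples, predictions, metrics):
    """Persist a snapshot of the training data, the resulting predictions and
    summary metrics for one modelling iteration, so later models can be
    compared against earlier ones."""
    snapshot = {
        "iteration": iteration,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "training_examples": training_examples,  # e.g. a list of {"text", "labels"} dicts
        "predictions": predictions,              # e.g. a list of {"text", "predicted"} dicts
        "metrics": metrics,                      # e.g. {"precision": 0.91, "recall": 0.84}
    }
    path = Path(run_dir) / f"iteration_{iteration:04d}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(snapshot, indent=2))
    return path
```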

We also visualise these histories. To some degree some of our classification algorithms are ‘black boxes’, at least in the practical sense of being too complex for a human to comprehend easily. However model performance can help shed some light on what is happening.

It’s worth noting at this stage that just analysing which sentences or examples contribute the most value to the overall model (maybe by iterating and excluding examples) isn’t a perfect approach, as the value of the next sentence depends on which sentences have come before. It’s better to think in terms of chains of sentences, where the next most valuable example is a function of all the examples before it.

Having great infrastructure

There are a few tools available that provide a user interface for doing active learning. Many are designed to be used by data scientists as part of their process. We reviewed what was available but decided to build our own tools.

The first ‘gap’ was our need to control, in a fine-grained manner, who could see which data. We wanted to be able to permission reviewers to help with certain data sets where their domain knowledge was applicable and where they could meet the relevant data security requirements. We wanted this system to be cloud-based so they could do this wherever they were located and on whichever device was most convenient (it was built from the outset to facilitate developing training data on a mobile phone).

We also needed to be able to do this without a data scientist setting it up. If we were to scale the service, it was important that we weren’t constrained by needing our data scientists to be involved in each data set. There are a few stages to our process and we wanted the user to be able to choose which part to do next, albeit with the interface signalling where it believes the most value is available.

Finally we wanted to be in control of both the underlying models and the user interfaces in a flexible manner. The former enabled us to add functionality where needed, to change which examples were passed to the coder and to add visualisation to guide the coder across the dataset. We also recognised that the UI was likely to be an area that would help optimise human performance as the service improved.

Cleaning training data

When we think about our training data, an extension of identifying which examples to add is identifying which examples to remove. Any training set will contain some dirty data. Humans make mistakes and training data is highly likely to include a proportion of these. We also see instances where people disagree on a classification; there is always some degree of subjectivity in text classification.

The question we asked ourselves was how to deal with training examples which contributed to badly coded data. Would it be possible, when a sentence is identified as badly coded, to identify the training example or examples most responsible for that classification? If so, would it be valuable to pass these to a human for attention?
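
One hypothetical way to attack that question is sketched below: for a sentence flagged as badly coded, surface the training examples under the same label that sit closest to it in embedding space, since those are the likeliest culprits. The embedding model and the helper itself are assumptions for illustration, not a description of our pipeline.

```python
from sentence_transformers import SentenceTransformer  # assumption, as in the earlier sketch
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def suspect_training_examples(bad_sentence, training_texts, training_labels,
                              bad_label, top_k=5):
    """Return the training examples carrying `bad_label` that sit closest to
    the misclassified sentence in embedding space, as candidates for review."""
    same_label = [t for t, l in zip(training_texts, training_labels) if l == bad_label]
    if not same_label:
        return []
    emb = model.encode([bad_sentence] + same_label, normalize_embeddings=True)
    similarities = emb[1:] @ emb[0]              # cosine similarity to the bad sentence
    ranked = np.argsort(similarities)[::-1][:top_k]
    return [(same_label[i], float(similarities[i])) for i in ranked]
```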

Developing a training set in the real world will be as much about pruning the data to improve the signal as it will be about adding new training examples.

Implications and challenges

As noted, our challenge is to optimise both the quality of our outputs and the resources needed to create a model. The key resource requirement is human input to help build a training set.

Text analytics is evolving quickly yet the challenge of needing good training data remains constant. Focussing effort on building good tools has enabled us to take the data scientist out of the process to a large degree. Our data science resource has therefore been able to concentrate on research and development, creating, piloting and implementing new algorithmic approaches much more quickly than they would typically have been able to do.

Our clients have started to use us to help develop training data so they can build their own classification models. This is a universal challenge with all classification modelling, and text is no different. Our active learning approach has enabled us to be cost-effective in developing these training sets, especially for teams where there is a constraint on the availability of data scientists.

New machine learning approaches will definitely help. In some settings, techniques such as transfer learning, which promise to achieve good predictions with fewer training examples, will help. However in an applied setting I’m not convinced they are a complete solution.

Where does this leave us? I think fundamentally there is a disconnect between (at least) three different groups, each with different incentive systems.

The academic machine learning community is incentivised to develop approaches that increase model accuracy at all costs. There is a smallish number of standard benchmarks / datasets that these are measured against. Hence there has been relatively little focus on the low-cost development of training data.

AI-based tech startups are incentivised via the funding system to produce general models that can be widely applied to a large number of clients, hence producing economics similar to traditional software. At the moment, at least in my domain, specific models can be developed relatively quickly that outperform general models. (This article is a very good analysis that matches our experience.)

Which leaves applied data science, i.e. the bits that get done in industry. In an applied setting it’s far more likely that firms will be developing company-specific models that mirror unique business strategies. However it’s also likely that time, or resource need, becomes an important (maybe the important) factor to optimise. Creating and maintaining training data is a huge part of that resource requirement and therefore an area to optimise. I suspect that most firms, and especially management, don’t know their ‘spend’ in this area and therefore aren’t optimising it effectively.