GPT for People Analytics: Four concepts you need to know


Using GPT or other Large Language Models (LLMs) for People Analytics projects is an attractive option, especially for prototyping solutions. However, to get the most out of LLMs of any type it’s worth understanding a few key concepts.

What are LLMs?

Most Large Language Models have been created by training a next-word-prediction model over a very large, edited text dataset. They work by assigning probabilities to each possible next word given the previous words. Many of the properties they demonstrate are emergent - i.e. they are effectively side effects of learning from a vast set of texts. For example, knowledge recall comes from learning the most likely next words, not from being explicitly taught the knowledge. This probably means that for many specific use cases they can be outperformed by a dedicated task-specific model, but such models typically do only one thing well. An LLM demonstrates good performance over a broad range of tasks.

The 4 key concepts

In our experience the best way to use LLMs within the context of a typical People Analytics problem is to decompose the problem into small, discrete components and then chain the models together. This way, instead of trying to solve the entire problem in one grand step, you are solving very focussed, smaller problems.

To use LLMs, as with most Machine Learning models, it’s worth understanding the basics of how the model works. With LLMs there are 4 concepts that you need to understand:

  • Prompting

  • Context Windows

  • Fine Tuning

  • Embeddings.

Prompting

Prompting is the way that you interact with LLMs. It is the art and the craft of creating explanations of what you want the model to do, and what you don’t want it to do.

What is so appealing about LLMs is that you create these prompts using natural language. Whilst this makes LLMs dramatically more approachable, it can provide a false sense of ease - you get a decent result very quickly. With time, practice and learning, however, you can improve your prompts considerably.

We think of prompts in terms of ‘n shots’. For example the question-answer prompt we have all probably used as our first prompt to an LLM doesn’t have any examples; we call this ‘zero-shot prompting’. From here we move to ‘one-shot prompting’, where one example is provided, and then ‘few-shot prompting’. Theoretically there is no limit to the number of examples that you provide, though we see decreasing marginal value as more examples are added. [It’s worth having a look at my earlier article on training sets as I believe the same criteria apply.]
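
To make the idea concrete, here is a minimal sketch of how a few-shot classification prompt might be assembled in Python. The topic labels, examples and helper function are illustrative assumptions, not taken from a real model:

    # Illustrative few-shot prompt for classifying employee feedback.
    # The topic labels and examples below are placeholders.
    examples = [
        ("My manager never gives me feedback on my work.", "Manager behaviour"),
        ("The office is too noisy to concentrate.", "Work environment"),
        ("I don't see a path to promotion here.", "Career development"),
    ]

    def build_prompt(new_comment: str) -> str:
        """Assemble the prompt: instruction, labelled examples, then the new item."""
        lines = ["Classify each employee comment into one of: Manager behaviour, "
                 "Work environment, Career development, Other.", ""]
        for comment, label in examples:
            lines += [f"Comment: {comment}", f"Topic: {label}", ""]
        lines += [f"Comment: {new_comment}", "Topic:"]
        return "\n".join(lines)

    print(build_prompt("My team lead cancels our one-to-ones every week."))

With an empty examples list this collapses back to zero-shot prompting; one example gives one-shot, a handful gives few-shot.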

It’s worth thinking about prompts as communicating with a computer, not with a fellow human. The difference is that in human communication we often leave out key pieces of information because we presume they are obvious. Alternatively we use ambiguous terms, especially in within-domain communication, because we know the listener will understand what we mean (think of HR people talking about Employee Engagement - often we don’t define what we mean, which is arguably part of the problem).

Here is an example of the sort of challenge you need to think about with prompting. We were using an LLM to prototype a model to identify whether feedback already classified as being about managers was describing a manager behaviour. The original prompt asked if the sentence was about something a manager was doing. The model ‘missed’ the answers where employees were talking about something the manager wasn’t doing (but which they expected them to do), as this wasn’t explicitly about something the manager was doing.
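
To make that concrete, the sketch below contrasts an illustrative version of the narrow wording with a broader one that covers both action and inaction. The exact wording is hypothetical, not our production prompt:

    # Hypothetical wording only: the first prompt misses 'not doing' cases,
    # the second covers both acting and failing to act.
    original_prompt = (
        "Does this sentence describe something a manager is doing? Answer Yes or No."
    )
    revised_prompt = (
        "Does this sentence describe a manager's behaviour, including things the "
        "manager is doing or things the employee expected but the manager is not "
        "doing? Answer Yes or No."
    )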

There are a few good practices that we’ve learnt for creating good prompts:

  • prompts should be concise

  • prompts should be specific and well defined

  • break down (decompose) complex prompts into a series of simple ‘one-task’ prompts

  • ask for a classification rather than open-ended generation where possible

  • add examples to the prompt of what you want returned.

Ultimately the best way of finding a good prompt is by experimentation.
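
One practical way to experiment is to score a few prompt variants against a small labelled sample you trust. The sketch below assumes the openai Python package (the pre-1.0 ChatCompletion interface available in mid-2023); the model name, sample and prompt variants are placeholders:

    import openai

    # Hypothetical labelled sample: (comment, expected answer) pairs you trust.
    labelled_sample = [
        ("My manager never listens to my ideas.", "Yes"),
        ("The new expenses tool is confusing.", "No"),
    ]

    prompt_variants = [
        "Is this comment about a manager's behaviour? Answer Yes or No.\nComment: {comment}",
        "Does this comment describe something a manager is doing, or failing to do, "
        "for the employee? Answer Yes or No.\nComment: {comment}",
    ]

    def accuracy(template: str) -> float:
        """Score one prompt variant against the labelled sample."""
        correct = 0
        for comment, expected in labelled_sample:
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": template.format(comment=comment)}],
                temperature=0,
            )
            answer = response["choices"][0]["message"]["content"].strip()
            correct += int(answer.startswith(expected))
        return correct / len(labelled_sample)

    for template in prompt_variants:
        print(f"{accuracy(template):.2f}  {template[:50]}...")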

Context Windows

For many People Analytics tasks you’re going to want to create a prompt template where parts of the prompt are replaced with data from your dataset. The length of your prompt (often a key factor in how the API provider charges you) will therefore need to accommodate the text of the request plus the length of the data.

As you can imagine, once you combine a long prompt (well defined, with a few examples) with a set of data that you want to ‘query’, the prompt length quickly becomes large.
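
You can keep an eye on this by counting tokens before you send anything. A minimal sketch, assuming OpenAI’s tiktoken library (other providers have their own tokenisers):

    import tiktoken

    # Tokeniser for an OpenAI chat model; swap for your provider's tokeniser.
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

    prompt_template = (
        "Classify the following employee comment into one of: "
        "Manager behaviour, Work environment, Other.\n\nComment: {comment}\nTopic:"
    )
    comment = "My manager never gives me feedback on my work."

    full_prompt = prompt_template.format(comment=comment)
    print(len(encoding.encode(full_prompt)), "tokens for this single request")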

At the moment (June 2023) one of the key areas of competition for companies providing LLMs is the length of prompt that each model can accept. A common limit at the moment is around 2,000 tokens, but models are available that can handle 100,000 tokens.

There are a number of techniques for reducing the size of the context window you need, and it’s worth experimenting with a few to see how your models perform. Of course the big providers, who generate revenue from prompt length, have an incentive to encourage longer prompts. Some approaches might actually increase the total number of tokens but spread them over multiple calls; this might increase quality, at greater expense. Others might lower the number of tokens. A combination of experimentation and monitoring API usage is required.
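
As one example of that trade-off, you can batch your data into chunks that each fit comfortably inside the context window, spreading one large request over several smaller calls. A minimal sketch, assuming you have already counted tokens per item (for example with tiktoken as above):

    def chunk_items(items, token_counts, max_tokens_per_call=1500):
        """Group items into batches whose combined token count stays under a budget."""
        batches, current, current_tokens = [], [], 0
        for item, n_tokens in zip(items, token_counts):
            if current and current_tokens + n_tokens > max_tokens_per_call:
                batches.append(current)
                current, current_tokens = [], 0
            current.append(item)
            current_tokens += n_tokens
        if current:
            batches.append(current)
        return batches

    # Each batch can then be embedded in its own prompt and sent as a separate call.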

Fine Tuning

Fine tuning is available on some models and shares similarities with providing examples, as discussed in the section on prompt design. The key difference between incorporating the examples via a prompt and using them to fine-tune the model is largely one of cost.

Suppose you have a good set of human-labelled examples which you know are of high quality. For example, in a model to understand and classify employee feedback you have a pre-defined set of potential topics and, for each topic, a set of labelled examples. The total number of examples is:

#topics * #examples per topic

We believe that a good minimum number of examples for each topic is 100. In our current ‘base’ model we have 260 topics, which means at least 260 × 100 = 26,000 labelled examples. You can see how providing all of this in each prompt (you probably want to call the API once for each answer to be classified) results in a large set of very large prompts - i.e. it will be very expensive to call potentially thousands of times. (You might also hit context window limits.)

The idea of fine tuning is that you provide the examples once, create a fine-tuned model which incorporates the information in those examples, and then, instead of providing them with each prompt, you only need to provide the new information - for example the feedback item from the employee that you want classified.
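
The mechanics differ by provider, but for OpenAI’s fine-tunable models the labelled examples are typically supplied once as a JSONL file of prompt/completion pairs. A hedged sketch of preparing such a file (the field names follow OpenAI’s older prompt-completion format; check your provider’s current documentation):

    import json

    # Hypothetical labelled examples: (employee comment, topic label) pairs.
    labelled_examples = [
        ("My manager never gives me feedback.", "Manager behaviour"),
        ("The office is too noisy.", "Work environment"),
    ]

    with open("fine_tune_data.jsonl", "w") as f:
        for comment, topic in labelled_examples:
            record = {"prompt": f"Comment: {comment}\nTopic:", "completion": f" {topic}"}
            f.write(json.dumps(record) + "\n")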

The decision to fine tune is therefore probably going to be most influenced by cost. There is a cost, and certainly a time investment, in fine-tuning a model, but if you need to run inference over a large number of examples it’s probably worth it. If you only need to classify a few examples, however, it might be better to provide the examples in the prompts.

Finally, a complication: it’s currently not possible to fine-tune every model. For example, it is possible to fine-tune OpenAI’s GPT-3.5 model but not their ChatGPT or GPT-4 models. Typically we’re seeing that fine-tunable models tend to be lagging-edge. Depending on your use case this may or may not be important.

Embeddings

Since we started focusing on text analysis in People Analytics about 8 years ago, most of our models and approaches have been based on embeddings. The early models (Word2Vec, GloVe, fastText…) created word (or sub-word) embeddings. As text models have improved - from early transformer models through to LLMs - the accuracy of the embeddings has increased, and the embeddings have come to incorporate longer and longer contexts.

What is an embedding? Put simply, embeddings are multi-dimensional map coordinates which ‘place’ each embedded item (a word, a sentence, a full text…) at a specific position. They have some useful properties: semantically similar words (or longer texts) have embedding ‘co-ordinates’ that put them close together, and the direction that takes you from word A to word B carries meaning.

LLMs provide embeddings that tend to be more accurate, and to cover longer contexts, than those from earlier models. We use embeddings for similarity more than anything else. Find an answer that you believe is important in your feedback and, via embeddings, you can find the most similar other examples. Ask a question and, via embeddings, you can narrow down the examples that might provide the answer.
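
A minimal sketch of that similarity search, assuming you already have embedding vectors for your feedback items (for example from your provider’s embeddings endpoint) and few enough of them to hold in memory:

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity: close to 1.0 means the vectors point the same way."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def most_similar(query_vector, item_vectors, items, top_n=5):
        """Return the top_n items whose embeddings are closest to the query."""
        scores = [cosine_similarity(query_vector, v) for v in item_vectors]
        ranked = sorted(zip(scores, items), key=lambda pair: pair[0], reverse=True)
        return ranked[:top_n]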

Whereas the earlier embeddings had about 300 dimensions, LLM embeddings are often an order of magnitude larger. If you’re dealing with a small number of embeddings you may be able to find nearest neighbours in memory. As you start dealing with larger datasets you’ll probably want to use databases designed for storing embeddings and finding nearest neighbours - so-called vector databases.
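
Before reaching for a dedicated vector database, an intermediate step for tens of thousands of embeddings is an in-memory index. A sketch assuming scikit-learn and a matrix of precomputed embeddings (the random data is a stand-in):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    # One row per feedback item; 1536 dimensions matches a common OpenAI
    # embedding model, versus roughly 300 for Word2Vec-era embeddings.
    embeddings = np.random.rand(10_000, 1536)   # stand-in for real embeddings
    query = np.random.rand(1, 1536)

    index = NearestNeighbors(n_neighbors=5, metric="cosine").fit(embeddings)
    distances, neighbour_ids = index.kneighbors(query)
    print(neighbour_ids[0])  # row indices of the five most similar items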

Summary

LLMs provide a very convenient and approachable way of rapidly building solutions to a variety of People Analytics tasks. By understanding and experimenting with the 4 concepts discussed in this article you can design solutions for a wide range of tasks, arguably with less need for classical ML skills.

As mentioned earlier in the article, the best approach to using LLMs is to decompose your problem into smaller, discrete tasks and then chain a set of models together. You can chain LLM tasks, or a mixture of LLM and non-LLM models, depending on your use case.
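
As a simple illustration of chaining, the sketch below wires two of the earlier ideas together: a first prompt filters comments that are about managers, a second classifies the behaviour. The llm() helper and the category list are stand-ins for whichever API and taxonomy you use:

    def llm(prompt: str) -> str:
        """Stand-in for a call to your chosen LLM API; returns the model's text reply."""
        raise NotImplementedError("wire this up to your provider")

    def is_about_manager(comment: str) -> bool:
        answer = llm("Is this comment about a manager's behaviour? Answer Yes or No.\n"
                     f"Comment: {comment}")
        return answer.strip().lower().startswith("yes")

    def classify_behaviour(comment: str) -> str:
        return llm("Classify the manager behaviour in this comment as one of: "
                   f"communication, recognition, workload, other.\nComment: {comment}").strip()

    def process(comments):
        """Chain the two focussed prompts: filter first, classify second."""
        return {c: classify_behaviour(c) for c in comments if is_about_manager(c)}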

LLMs provide a very easy entry point into text analysis. Depending on a range of factors you might or might not get a model that performs to your needs. To twist a famous expression: “if all you have is an LLM, every problem looks like an LLM task.” The reality is that the text analysis community hasn’t stopped working on non-LLM solutions. At heart, today’s LLMs are generic models, and task-specific models can still outperform them.