Surveys in People Analytics models

A number of commentators have suggested that survey data isn’t useful to the People Analyst, especially when developing predictive models. That view runs counter to our experience.

In a recent attrition project with a major client, we were able to increase the accuracy of our models by around 10% at a relatively early stage by including survey data. Anyone doing machine learning knows that a 10% improvement isn’t easy to achieve.

A NY Times article written by two experienced data scientists from Google & Facebook shows how these firms combine survey and system data to reveal what is really going on. As the authors note:

“Big data in the form of behaviors and small data in the form of surveys complement each other and produce insights rather than simple metrics.”

This is certainly our experience. However, how to use survey data can be non-obvious at first.

Most modelling in HR is exploratory, at least initially

I would argue that the majority of modelling in HR has exploratory data analysis as at least one of its objectives. Most analysts default to tools such as tree models for understanding rules or something like naive Bayes for understanding variable importance.

HR models will almost certainly be used to inform, rather than automate, a decision. Our loss or utility functions tend to be complex enough that, for many scenarios, data becomes another voice at the table.

For individuals to make decisions based on data, they need some understanding of how the recommendation was made. Actions tend to be aimed at altering one or more of the important attributes associated with the model, so interpretation is important. And if we’re maintaining the confidentiality promises made when employees took a survey, for example by setting a minimum size for any leaf in a tree, do we need to use the survey data for ongoing forecasting and / or prediction?

Survey data is brilliant at informing feature selection

The longest part of any machine-learning-centric project is feature selection or building. We take a few fields of data and create the variables which we expect to influence the model.

Using survey data in the early stages of a machine-learning project can significantly contribute to your choices on which variables to create. For example, in one recent attrition project we found a strong relationship between attrition and employees self-reporting that they didn’t have the necessary time to do their jobs to the best of their abilities. We then took this ‘lesson’ and created a set of variables that could represent it more accurately than infrequent, self-reported data. We use the survey data to guide our feature engineering.
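A first check for this kind of relationship can be as simple as comparing attrition rates across answer groups. The sketch below is only an illustration: the records are invented, and the field names (`enough_time`, `left`) are assumptions rather than anything from the project described above.

```python
from collections import defaultdict

# Invented records: answer to a hypothetical "I have enough time to do my
# job well" item, and whether the employee later left. Not real data.
records = [
    {"enough_time": "disagree", "left": True},
    {"enough_time": "disagree", "left": True},
    {"enough_time": "disagree", "left": False},
    {"enough_time": "agree", "left": False},
    {"enough_time": "agree", "left": False},
    {"enough_time": "agree", "left": True},
    {"enough_time": "agree", "left": False},
]


def attrition_rate_by_answer(records, question):
    """Share of leavers within each answer group for one survey question."""
    counts = defaultdict(lambda: [0, 0])  # answer -> [leavers, respondents]
    for record in records:
        group = counts[record[question]]
        group[0] += record["left"]  # True counts as 1, False as 0
        group[1] += 1
    return {answer: leavers / total for answer, (leavers, total) in counts.items()}


rates = attrition_rate_by_answer(records, "enough_time")
```

A large gap between the groups is a prompt to look for system data that captures the same thing more frequently and more objectively than the survey item does.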

As the analysis continues, we find that the survey’s influence on predictive accuracy decreases as we replace survey findings with features that can be used on an ongoing basis for forecasting or prediction, without any confidentiality concerns and regardless of the population size.

How to use survey data in model building

Survey data can be used at multiple levels within a modelling project, and the level depends to some degree on the question being asked. The fundamental need is to have an employee-level unique identifier with each row of data. If all you have is team data you can still answer some questions, but the potential value is lower.

If we think of the organization as a tree where individual a reports into individual b, who reports into individual c, then via graph traversal it’s simple to aggregate responses to various levels: what was this individual’s response, what was the response of their team (all at a’s level who report to b), and what was the response of the broader team (everyone who reports to c, or to a manager who reports to c). If you have a project or matrix organization, it’s possible to apply more complex rules in the same way.
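The traversal can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `manager_of` mapping, the employee identifiers and the survey scores are all made up, and a strict tree (one manager per employee) is assumed.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical reporting lines: employee -> manager (None marks the top).
manager_of = {"c": None, "b": "c", "a": "b", "a2": "b", "d": "c", "e": "d"}

# Hypothetical survey scores (1-5) keyed by employee id; "d" did not respond.
responses = {"a": 2, "a2": 4, "b": 3, "e": 5, "c": 4}

# Invert the mapping so we can walk down the tree from any manager.
reports_of = defaultdict(list)
for employee, manager in manager_of.items():
    if manager is not None:
        reports_of[manager].append(employee)


def all_reports(manager):
    """Everyone below `manager`, found by an iterative depth-first traversal."""
    found, stack = [], list(reports_of[manager])
    while stack:
        person = stack.pop()
        found.append(person)
        stack.extend(reports_of[person])
    return found


def mean_response(people):
    """Average score of those who answered, or None if nobody did."""
    scores = [responses[p] for p in people if p in responses]
    return mean(scores) if scores else None


individual = responses.get("a")                    # a's own answer
team = mean_response(reports_of[manager_of["a"]])  # a's level: direct reports of b
broader_team = mean_response(all_reports("c"))     # everyone below c
```

A matrix or project organization would need a general graph rather than a tree, for example a list of managers per employee plus a visited set in the traversal to avoid counting anyone twice.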

Within the feature engineering part of the project we can thus combine survey answers at any level with the data from your HR system. Some answers will add to the prediction at an individual level; others, for example those about manager effectiveness, might be more sensible to use as an aggregate for all people who report into that manager.
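As a rough sketch of that combination, the snippet below attaches both an individual answer and a manager-level average to each HR record. The field names (`employee_id`, `manager_id`, `tenure_years`, `enough_time`, `manager_effectiveness`) and the `min_group_size` threshold are illustrative assumptions, not a description of any particular HR system.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical HR system extract: one row per employee.
hr_rows = [
    {"employee_id": "e1", "manager_id": "m1", "tenure_years": 2.0},
    {"employee_id": "e2", "manager_id": "m1", "tenure_years": 5.5},
    {"employee_id": "e3", "manager_id": "m1", "tenure_years": 1.0},
    {"employee_id": "e4", "manager_id": "m2", "tenure_years": 7.0},
]

# Hypothetical survey answers (1-5 scale) keyed by employee id.
survey = {
    "e1": {"enough_time": 2, "manager_effectiveness": 4},
    "e2": {"enough_time": 4, "manager_effectiveness": 3},
    "e3": {"enough_time": 3, "manager_effectiveness": 5},
    "e4": {"enough_time": 5, "manager_effectiveness": 1},
}


def build_features(hr_rows, survey, min_group_size=3):
    """Join survey answers to HR rows: one item individually, one per manager."""
    # Collect the manager-effectiveness answers of each manager's direct reports.
    by_manager = defaultdict(list)
    for row in hr_rows:
        answers = survey.get(row["employee_id"])
        if answers is not None:
            by_manager[row["manager_id"]].append(answers["manager_effectiveness"])

    features = []
    for row in hr_rows:
        answers = survey.get(row["employee_id"], {})
        group = by_manager.get(row["manager_id"], [])
        features.append({
            **row,
            # Individual-level feature: the employee's own answer (None if missing).
            "enough_time": answers.get("enough_time"),
            # Aggregate feature: only exposed when enough people answered.
            "mgr_effectiveness_avg": mean(group) if len(group) >= min_group_size else None,
        })
    return features


features = build_features(hr_rows, survey)
```

Suppressing aggregates for small groups is one possible way to honour the confidentiality promises made to respondents; the right threshold depends on what employees were told when they took the survey.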

As noted, the models you create might inform what other data you add or collect. This is similar in many ways to including domain experts in your modelling team, who can use their knowledge of earlier research to guide the analysts on where to look. The difference with survey data is that it is company-specific.

In conclusion, we believe survey data to be vitally important to most people analytics projects. Because analysis is an iterative process, it is a rare project in which the later stages, where we try to capture new data to improve an existing model, don’t include some form of survey or questionnaire.

Insight depends not only on knowing what has happened but on illuminating why it is happening as well. Surveys can help provide the second part.