A recent blog post by John D. Cook brought me down to earth. Whilst big data is getting all the attention these days, much of the time, especially if you’re working with what is in your HR systems, you’ll actually find yourself drilling down into what would typically be described as small data. Arguably many big data sets are composed of large numbers of small data sets.
One aspect that is forever present in HR data analysis is that of sparsity. Many of the things we want to predict happen infrequently and hence we have to build our models on small numbers of instances. Our data sets, when constructed in the way we will need them to be for much analysis, will be largely full of nulls or zeros.
How we deal with this is something I put to Flying Binary’s Yodit Stanton this week. Yodit is one of the new breed of data scientists – she was developing equity trading systems for Lehmans at the crash and has since redirected her interest ‘to the data that we all produce every day’. As well as her client work she’s embarking on a PhD in deep learning, an up-and-coming part of machine learning incorporating lessons from neuroscience.
“You don’t need the terabytes of data [that everyone is talking about] it’s about picking the algorithm wisely. There are algorithms that are useful with sparse data – anomaly detection techniques would be a good example”.
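Yodit didn’t name a specific technique, so the following is only my own minimal sketch of what anomaly detection on sparse counts might look like. It assumes that monthly resignations roughly follow a Poisson distribution and flags any month whose count would be very unlikely given the average of the other months. The figures, the function names and the 1% threshold are all hypothetical.

```python
import math

def poisson_tail(k, lam):
    """Probability of seeing k or more events when the expected count is lam."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

def flag_anomalies(counts, threshold=0.01):
    """Return the indices of periods whose count is improbably high.

    Each period is compared against the mean of all the *other* periods,
    so one extreme month does not inflate its own baseline.
    """
    if len(counts) < 2:
        return []
    flagged = []
    for i, k in enumerate(counts):
        others = counts[:i] + counts[i + 1:]
        lam = sum(others) / len(others)
        # Skip the degenerate case where every other period is zero.
        if lam > 0 and poisson_tail(k, lam) < threshold:
            flagged.append(i)
    return flagged

# Hypothetical monthly resignation counts for one small team (mostly zeros).
monthly_resignations = [0, 1, 0, 0, 2, 0, 1, 0, 6, 0, 1, 0]
flagged = flag_anomalies(monthly_resignations)
print(flagged)
```

Note that nothing here needs terabytes: twelve numbers, most of them zero, are enough to ask whether one month stands out.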
On a more fundamental level there’s a basic problem, one which I see during my usability testing of reporting. People don’t just read data and take it as it’s shown; they extrapolate results into the future whether the data justifies it or not. It’s this tendency that has resulted in the ‘past performance isn’t an indicator of future performance’ statements on the bottom of financial products marketing. We know we shouldn’t do it, but most of us do.
Because of the sparsity, and because we’re interested in drilling down into greater levels of detail as we increase our data-sophistication, we start hitting small data sets. Of course, as the data sets become smaller we can be distracted by effects which are purely a function of the sample size. Has the French office really got a turnover problem compared to the German headquarters, or are we looking at something which is related to the small number of instances?
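One way to make the French versus German question concrete is to put an interval around each turnover rate rather than comparing the raw percentages. The sketch below uses the Wilson score interval, which behaves reasonably for small samples. The headcounts and leaver numbers are invented purely for illustration.

```python
import math

def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = events / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Hypothetical figures: 4 leavers out of 20 in France (20% turnover),
# 60 leavers out of 500 in Germany (12% turnover).
france = wilson_interval(4, 20)
germany = wilson_interval(60, 500)

# If the two intervals overlap, the headline gap may be nothing more
# than small-sample noise.
overlap = france[0] <= germany[1] and germany[0] <= france[1]

print(f"France:  {france[0]:.1%} to {france[1]:.1%}")
print(f"Germany: {germany[0]:.1%} to {germany[1]:.1%}")
print("Intervals overlap:", overlap)
```

With these made-up numbers the French interval is far wider than the German one and the two overlap, so the headline difference on its own wouldn’t justify talking about a turnover problem.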
We need to push for visualizations not only to report the data, but to provide visual cues that guide reasonable interpretations. The next level would be for our reports not to compare country to country, or department to department, but to compare actual performance to a natural level of turnover based on the make-up of the underlying employee population.
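As a rough sketch of what comparing against a natural level of turnover could mean, the example below applies company-wide leaver rates for each employee segment to an office’s own headcount mix, then compares actual leavers with that expected figure. The segments, rates and headcounts are hypothetical, and segmenting by tenure alone is my simplifying assumption; a real model would use more of the population’s make-up.

```python
# Hypothetical company-wide annual leaver rates by tenure band.
company_rates = {
    "under_2_years": 0.30,
    "2_to_5_years": 0.15,
    "over_5_years": 0.05,
}

def expected_leavers(headcount_by_segment, rates):
    """Leavers we would expect if each segment left at the company-wide rate."""
    return sum(
        headcount * rates[segment]
        for segment, headcount in headcount_by_segment.items()
    )

def turnover_ratio(actual_leavers, headcount_by_segment, rates):
    """Actual divided by expected leavers; above 1 means worse than the mix suggests."""
    return actual_leavers / expected_leavers(headcount_by_segment, rates)

# Hypothetical office with a young workforce: 20 staff, 4 leavers this year.
french_office = {"under_2_years": 12, "2_to_5_years": 6, "over_5_years": 2}

expected = expected_leavers(french_office, company_rates)
ratio = turnover_ratio(4, french_office, company_rates)

print(f"Expected leavers: {expected:.1f}")
print(f"Actual / expected: {ratio:.2f}")
```

In this invented case a 20% raw turnover rate looks high, yet the ratio comes out below 1 because the office is dominated by short-tenure staff who leave more often everywhere in the company.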
Sparsity is something we see as well with social media and employees. The sorts of analysis that an FMCG company can do to understand its brands just aren’t currently possible if you want to understand what current or potential employees are saying. My advice, given this, is to piggyback on the research that marketing are conducting. HR-specific studies currently aren’t likely to deliver enough insight.
So whilst Big Data might be the hot topic of 2012, let’s not forget that we haven’t effectively dealt with small data.