Professional Documents
Culture Documents
The Importance of Regression in People Analytics
The Importance of Regression in People Analytics
The Importance of Regression in People Analytics
in People Analytics
In the 19th century, when Francis Galton first used the term ‘regression’ to
describe a statistical phenomenon (see Chapter 4), little did he know how
important that term would be today. Many of the most powerful tools of
statistical inference that we now have at our disposal can be traced back to the
types of early analysis that Galton and his contemporaries were engaged in. The
sheer number of different regression-related methodologies and variants that are
available to researchers and practitioners today is mind-boggling, and there are
still rich veins of ongoing research that are focused on defining and refining new
forms of regression to tackle new problems.
Neither could Galton have imagined the advent of the age of data we now live in.
Those of us (like me) who entered the world of work even as recently as 20 years
ago remember a time when most problems could not be expected to be solved
using a data-driven approach, because there simply was no data. Things are very
different now, with data being collected and processed all around us and available
to use as direct or indirect measures of the phenomena we are interested in.
Along with the growth in data that we have seen in recent years, we have also
seen a rapid growth in the availability of statistical tools—open source and free to
use—that fundamentally change how we go about analytics. Gone are the clunky,
complex, repeated steps on calculators or spreadsheets. In their place are lean
statistical programming languages that can implement a regression analysis in
milliseconds with a single line of code, allowing us to easily run and reproduce
multivariate analysis at scale.
It is my firm belief that all people analytics professionals should have a strong
understanding of regression models and how to implement and interpret them in
practice, and my aim with this book is to provide those who need it with help in
getting there. In this chapter we will set the scene for the technical learning in the
remainder of the book by outlining the relevance of regression models in people
analytics practice. We also touch on some general inferential modeling theory to
set a context for later chapters, and we provide a preview of the contents,
structure and learning objectives of this book.
The current reality in the field of people analytics is that inferential models are
more required than predictive models. There are two reasons for this. First, data
sets in people analytics are rarely large enough to facilitate satisfactory prediction
accuracy, and so attention is usually shifted to inference for this reason alone.
Second, in the field of people analytics, decisions often have a real impact on
individuals. Therefore, even in the rare situations where accurate predictive
modeling is attainable, stakeholders are unlikely to trust the output and bear the
consequences of predictive models without some sort of elementary
understanding of how the predictions are generated. This requires the analyst to
consider inference power as well as predictive accuracy in selecting their modeling
approach. Again, many regression models come to the fore because they are
commonly able to provide both inferential and predictive value.
least one of the constructs in C . We denote the set of data that measure O on
our sample set S as y. An upper-case X is used because the expectation is that
there will be several columns of data measuring our constructs, and a lower-case
y is used because the expectation is that the outcome is a single column.
y = f (X) + ϵ
f can take the form of a predetermined function with a formula defined on X, like
a linear function for example. In this case we can call our model a parametric
model. In a parametric model, the modeled value of y is known as soon as we
know the values of X by simply applying the formula. In a non-parametric model,
there is no predetermined formula that defines the modeled value of y purely in
terms of X. Non-parametric models need further information in addition to X in
order to determine the modeled value of y—for example the value of y in other
observations with similar X values.
are each of the three scores from the first three years, and y is the score in the
fourth year. We test f to be a linear relationship, and we establish that such a
relationship can be generalized to the entire population P with a substantial level
of statistical confidence1.
Almost all our work in this book will refer to the variables X as input variables and
the variable y as the outcome variable. There are many other common terms for
these which you may find in other sources—for example X are often known as
independent variables or covariates while y is often known as a dependent or
response variable.
This book is primarily focused on steps 7–10 of this process2. That is not to say
that steps 1–6 are not important. Indeed these steps are critical and often loaded
with analytic traps. Defining the problem, collecting reliable measures and
cleaning and organizing data are still the source of much pain and angst for
analysts, but these topics are for another day.
In most chapters, time is spent on the underlying mathematics. Not to the degree
of an academic theorist, but enough to ensure that the reader can associate some
mathematical meaning to the outputs of models. While it may be tempting to skip
the math, I strongly recommend against it if you intend to be a high performer in
your field. The best analysts are those who can genuinely understand what the
numbers are telling them.
The statistical programming language R is used for most of the practical
demonstration in each chapter. Because R is open source and particularly well
geared to inferential statistics, it is an excellent choice for those whose work
involves a lot of inferential analysis. In later chapters, we show implementations of
all of the available methodologies in Python and Julia, which are also powerful
open source tools for this sort of work.
Each chapter involves a walkthrough example to illustrate the specific method and
Handbook of Regression to allow the reader to replicate the analysis for themselves. The exercises at the On this page
Modeling in People end of each chapter are designed so that the reader can try the same method on 1 The Importance of
Analytics: With Examples in R, a different data set, or a different problem on the same data set, to test their Regression in People
Python and Julia learning and understanding. All in all, eleven different data sets are used as Analytics
walkthrough or exercise examples, and all of these data sets are fictitious 1.1 Why is regression
Search
constructions unless otherwise indicated. Despite the fiction, they are deliberately modeling so
designed to present the reader with something resembling how the data might important in people
Table of contents
look in practice, albeit cleaner and more organized. analytics?
Welcome
1.2 What do we mean
Foreword by Alexis Fink The chapters of this book are arranged as follows: by ‘modeling’?
Introduction 1.2.1 The theory of
Chapter 2 covers the basics of the R programming language for those who
1 The Importance of Regression in inferential modeling
want to attempt to jump straight in to the work in subsequent chapters but
People Analytics 1.2.2 The process of
have very little R experience. Experienced R programmers can skip this
2 The Basics of the R Programming inferential modeling
chapter.
Language 1.3 The structure,
Chapter 3 covers the essential statistical concepts needed to understand system and
3 Statistics Foundations
multivariate regression models. It also serves as a tutorial in univariate and organization of this
4 Linear Regression for Continuous
bivariate statistics illustrated with real data. If you need help developing a book
Outcomes
decent understanding of descriptive statistics, random distribution and
5 Binomial Logistic Regression for View source
hypothesis testing, this is an important chapter to study.
Binary Outcomes
Edit this page
6 Multinomial Logistic Regression Chapter 4 covers linear regression and in the course of that introduces many
for Nominal Category Outcomes other foundational concepts. The walkthrough example involves modeling
7 Proportional Odds Logistic academic results from prior results. The exercises involve modeling income
Regression for Ordered Category levels based on various work and demographic factors.
Outcomes
Chapter 5 covers binomial logistic regression. The walkthrough example
8 Modeling Explicit and Latent involves modeling promotion likelihood based on performance metrics. The
Hierarchy in Data
exercises involve modeling charitable donation likelihood based on prior
9 Survival Analysis for Modeling donation behavior and demographics.
Singular Events Over Time
Chapter 6 covers multinomial regression. The walkthrough example and
10 Alternative Technical Approaches
exercise involves modeling the choice of three health insurance products by
in R, Python and Julia
company employees based on demographic and position data.
11 Power Analysis to Estimate
Required Sample Sizes for Modeling Chapter 7 covers ordinal regression. The walkthrough example involves
Solutions to Exercises, Slide modeling in-game disciplinary action against soccer players based on prior
Presentations, Videos and Other discipline and other factors. The exercises involve modeling manager
Learning Resources performance based on varied data.
References
Chapter 8 covers modeling options for data with explicit or latent hierarchy.
The first part covers mixed modeling and uses a model of speed dating
View book source
decisions as a walkthrough and example. The second part covers structural
equation modeling and uses a survey for a political party as a walkthrough
example. The exercises involve modeling latent variables in an employee
engagement survey.
"Handbook of Regression Modeling in People Analytics: With This book was built by the bookdown R package.
Examples in R, Python and Julia" was written by Keith McNulty.