Human Resource Analytics


Degrees of freedom: The term degrees of freedom refers to the number of independent observations for a
source of variation minus the number of independent parameters estimated in computing the variation. For a
single sample, the degrees of freedom equal the size of the data sample minus one:

The degrees of freedom formula is n independent observations minus one independent parameter being
estimated (the sample mean), giving n - 1.
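As a minimal sketch (with made-up numbers), the n - 1 degrees of freedom appear as the divisor in the sample variance, because one parameter, the sample mean, is estimated from the data:

```python
# Sketch: the n - 1 in the sample variance is the degrees of freedom
# (n observations minus 1 estimated parameter, the sample mean).
data = [4.0, 7.0, 6.0, 5.0, 8.0]  # hypothetical observations

n = len(data)
mean = sum(data) / n
df = n - 1  # degrees of freedom

# The sample variance divides the sum of squared deviations by df, not n.
sample_variance = sum((x - mean) ** 2 for x in data) / df
print(df, sample_variance)  # 4 2.5
```

Dividing by n - 1 rather than n makes the sample variance an unbiased estimator of the population variance.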

Type I error and Type II error:

Because the hypothesis testing process uses sample statistics calculated from random data to reach
conclusions about population parameters, it is possible to make an incorrect decision about the null
hypothesis. Two types of errors can be made in testing hypotheses:

Type I error and Type II error.

• A Type I error is committed by rejecting a true null hypothesis. With a Type I error, the null
hypothesis is true, but the business researcher decides that it is not. For example, if a manager
fires an employee because some evidence indicates that she is stealing from the company and if she
really is not stealing from the company, then the manager has committed a Type I error.
• Conceptually and graphically:

✓ Statistical outcomes that result in the rejection of the null hypothesis lie in what is termed the
rejection region.
✓ Statistical outcomes that fail to result in the rejection of the null hypothesis lie in what is termed
the nonrejection region.

• The rejection region represents the possibility of committing a Type I error. “Means” that fall
beyond the critical values will be considered so extreme that the business researcher chooses to
reject the null hypothesis. However, if the null hypothesis is true, any mean that falls in a rejection
region will result in a decision that produces a Type I error.
• The probability of committing a Type I error is called alpha (α) or level of significance. Alpha equals
the area under the curve that is in the rejection region beyond the critical value(s). The value of
alpha is always set before the experiment or study is undertaken. Common values of alpha are .05,
.01, .10, and .001.

A Type II error is committed when a business researcher fails to reject a false null hypothesis. In this
case, the null hypothesis is false, but a decision is made to not reject it.

Ex: Suppose in the business world an employee is stealing from the company. A manager sees some
evidence that the stealing is occurring but lacks enough evidence to conclude that the employee is stealing
from the company. The manager decides not to fire the employee based on theft. The manager has
committed a Type II error.
✓ The probability of committing a Type II error is beta (β). Beta is not usually stated at the
beginning of the hypothesis testing procedure because beta occurs only when the null hypothesis
is not true.

Because a Type I error can be committed only when the null hypothesis is rejected, and a Type II error
only when the null hypothesis is not rejected, a business researcher cannot commit both a Type I error
and a Type II error at the same time on the same hypothesis test.

• Alpha and beta are inversely related: if alpha is reduced, then beta is increased, and vice versa.

• One way to reduce both errors is to increase the sample size. If a larger sample is taken, it is more
likely that the sample is representative of the population, which translates into a better chance that a
business researcher will make the correct choice.

Power, which is equal to 1 - β, is the probability of a statistical test rejecting the null hypothesis when
the null hypothesis is false.
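A small Monte Carlo sketch (all parameters here are made-up assumptions, not from the text) shows that when the null hypothesis is true, a two-tailed z-test at alpha = .05 rejects it about 5% of the time, which is exactly the Type I error rate:

```python
import random
import statistics

# Illustrative Monte Carlo sketch: estimate the Type I error rate of a
# two-tailed z-test when the null hypothesis is actually true.
# The values of mu, sigma, and n are hypothetical.
random.seed(0)

z_critical = 1.96              # critical value for alpha = .05, two-tailed
n, mu, sigma = 30, 100.0, 15.0
trials = 20_000

rejections = 0
for _ in range(trials):
    # Sample from the null population, so H0 (mean = 100) is true.
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    z = (statistics.mean(sample) - mu) / (sigma / n ** 0.5)
    if abs(z) > z_critical:    # sample mean landed in the rejection region
        rejections += 1

# The observed rejection rate approximates alpha (about .05).
print(round(rejections / trials, 3))
```

Repeating the simulation with a false null hypothesis would instead estimate power (1 - beta).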

Skewness:

• Skewness is present when a distribution is asymmetrical or lacks symmetry.


• The skewed portion is the long, thin part of the curve.
• Skewed distribution is used to denote that the data are sparse at one end of the distribution and
piled up at the other end.
• The concept of skewness helps to understand the relationship of the mean, median, and
mode.
• In a unimodal distribution (distribution with a single peak or mode) that is skewed, the mode is
the apex (high point) of the curve and the median is the middle value. The mean tends to be
located toward the tail of the distribution, because the mean is particularly affected by the
extreme values.
• A bell-shaped or normal distribution with the mean, median, and mode all at the center of
the distribution has no skewness.
Kurtosis:

Kurtosis describes the amount of peakedness of a distribution.

• Distributions that are high and thin are referred to as leptokurtic distributions.
• Distributions that are flat and spread out are referred to as platykurtic distributions.
• Between these two types are distributions that are more “normal” in shape, referred to as
mesokurtic distributions.

BOX PLOT

A box-and-whisker plot (or box plot) is used to describe a distribution of data.

It is a diagram that utilizes the upper and lower quartiles along with the median and the two
most extreme values to depict a distribution graphically.

The box extends outward from the median along a continuum to the lower and upper quartiles,
enclosing not only the median but also the middle 50% of the data. From the lower and upper quartiles, lines
referred to as whiskers are extended out from the box toward the outermost data values. The box-and-
whisker plot is determined from five specific numbers.

• The median (Q2)


• The lower quartile (Q1)
• The upper quartile (Q3)
• The smallest value in the distribution
• The largest value in the distribution

✓ A box is drawn around the median with the lower and upper quartiles (Q1 and Q3) as the box
endpoints.
✓ These box endpoints (Q1 and Q3) are referred to as the hinges of the box.
✓ Next the value of the interquartile range (IQR) is computed by Q3 - Q1.
✓ The interquartile range includes the middle 50% of the data and should equal the length of the box.
However, here the interquartile range is used outside of the box also. At a distance of 1.5*IQR
outward from the lower and upper quartiles are what are referred to as inner fences.

The inner fences are established as follows.

• Q1 - 1.5*IQR
• Q3 + 1.5*IQR

If data fall beyond the inner fences, then outer fences can be constructed:

• Q1 - 3*IQR
• Q3 + 3*IQR
✓ A whisker, a line segment, is drawn from the lower hinge of the box outward to the smallest
data value.
✓ A second whisker is drawn from the upper hinge of the box outward to the largest data value.

Outliers are the more extreme values of a data set. However, sometimes outliers occur due to
measurement or recording errors.

Values in the data distribution that are outside the inner fences but within the outer fences are
referred to as mild outliers. Values that are outside the outer fences are called extreme outliers.

One of the main uses of a box-and-whisker plot is to identify outliers.

Another use of box-and-whisker plots is to determine whether a distribution is skewed.

✓ If the median is located on the right side of the box, then the middle 50% are skewed to the left.
✓ If the median is located on the left side of the box, then the middle 50% are skewed to the right.
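The five-number summary, fences, and outlier classification described above can be sketched in a few lines (hypothetical data; note that different software packages compute quartiles with slightly different methods):

```python
import statistics

# Sketch: five-number summary, fences, and outlier classification for a
# box-and-whisker plot. The data values are hypothetical.
data = [62, 64, 66, 67, 68, 69, 70, 71, 72, 74, 75, 90]

q1, q2, q3 = statistics.quantiles(data, n=4)  # lower quartile, median, upper quartile
iqr = q3 - q1                                 # interquartile range

# Inner fences at 1.5 * IQR, outer fences at 3 * IQR beyond the hinges.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

# Mild outliers lie between the inner and outer fences;
# extreme outliers lie beyond the outer fences.
mild = [x for x in data
        if (outer_low <= x < inner_low) or (inner_high < x <= outer_high)]
extreme = [x for x in data if x < outer_low or x > outer_high]

print(q1, q2, q3, iqr, mild, extreme)
```

With this data the value 90 falls outside the inner fence but inside the outer fence, so it is flagged as a mild outlier.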

NORMAL DISTRIBUTION

The normal distribution is sometimes referred to as the Gaussian distribution or the normal curve of
error.

The normal distribution exhibits the following characteristics.

• It is a continuous distribution.
• It is a symmetrical distribution about its mean - the right portion of
the distribution mirrors the left portion. Because the distribution is
symmetric, the area of the distribution on each side of the mean is 0.5.
• It is unimodal - It is unimodal because values peak in only one portion of the graph—the centre of
the curve.
• It is a family of curves - The normal distribution actually is a family of curves. Every unique value of
the mean and every unique value of the standard deviation result in a different normal curve.
• Area under the curve is 1 - The area under the curve yields the probabilities, so the total of all
probabilities for a normal distribution is 1.

The mean, median, and mode are equal and are located at the centre of the distribution. The curve is
symmetric about the mean.

EMPIRICAL RULE:

A standard deviation (or σ) is a measure of how dispersed the data is in relation to the mean. Low
standard deviation means data are clustered around the mean, and high standard deviation indicates
data are more spread out.

The empirical rule is an important rule of thumb that is used to state the approximate percentage of
values that lie within a given number of standard deviations from the mean of a set of data if the data
are normally distributed.

The empirical rule is used only for three numbers of standard deviations: 1σ, 2σ, and 3σ.

Distance from the Mean    Values within the distance
μ ± 1σ                    68%
μ ± 2σ                    95%
μ ± 3σ                    99.7%

If a set of data is normally distributed, or bell shaped, approximately 68% of the data values are within one
standard deviation of the mean, 95% are within two standard deviations, and 99.7% (almost all) are within
three standard deviations.
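These percentages can be checked numerically: for a normal distribution, the probability of falling within k standard deviations of the mean is erf(k/√2), where erf is the error function (a quick sketch, not part of the original notes):

```python
import math

# Sketch: derive the empirical-rule percentages from the normal CDF,
# using the identity P(|Z| <= k) = erf(k / sqrt(2)).
for k in (1, 2, 3):
    within = math.erf(k / math.sqrt(2))
    print(k, round(within * 100, 1))  # prints 68.3, 95.4, 99.7 percent
```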

INTERQUARTILE RANGE

✓ Interquartile range is a measure of variability.


✓ The interquartile range is the range of values between the first and third quartile.
✓ Essentially, it is the range of the middle 50% of the data and is determined by computing the value of
Q3 - Q1. The interquartile range is especially useful in situations where data users are more
interested in values toward the middle and less interested in extremes.

DESCRIPTIVE AND INFERENTIAL STATISTICS

Descriptive statistics allow you to describe a data set, while inferential statistics allow you to
make inferences based on a data set.

➢ A population is a collection of persons, objects, or items of interest.


➢ A sample is a portion of the whole and, if properly taken, is representative of the whole.

DESCRIPTIVE:

If data gathered on a group is used to describe or reach conclusions about that same group, the statistics are
called descriptive statistics.

There are 3 main types of descriptive statistics:

• The distribution concerns the frequency of each value - A data set is made up of a distribution of
values, or scores. Tables or graphs can summarize the frequency of every possible value of a
variable in numbers or percentages.
• The central tendency concerns the averages of the values - Measures of central tendency estimate the
centre, or average, of a data set. The mean, median and mode are 3 ways of finding the average.
• The variability or dispersion concerns how spread out the values are - Measures of variability give
you a sense of how spread out the response values are. The range, standard deviation and
variance each reflect different aspects of spread.
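The three types of descriptive statistics can be computed directly with Python's standard statistics module (the scores below are hypothetical):

```python
import statistics

# Sketch: descriptive statistics for a small hypothetical set of scores.
scores = [70, 75, 75, 80, 85, 90, 95]

# Central tendency: three ways of finding the "average".
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Variability: different aspects of spread.
spread = max(scores) - min(scores)      # range
stdev = statistics.stdev(scores)        # sample standard deviation
variance = statistics.variance(scores)  # sample variance

print(mean, median, mode, spread)
```

A frequency distribution (the first type) could be added with `collections.Counter(scores)`.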
Inferential Statistics

If a researcher gathers data from a sample and uses the statistics generated to reach conclusions about the
population from which the sample was taken, the statistics are inferential statistics. The data gathered from
the sample are used to infer something about a larger group

Inferential statistics have two main uses:

• making estimates about populations (for example, the mean SAT score of all 11th graders in the US).
• testing hypotheses to draw conclusions about populations (for example, the relationship between
SAT scores and family income).

✓ You randomly select a sample of 11th graders in your state and collect data on their SAT scores and
other characteristics.
✓ You can use inferential statistics to make estimates and test hypotheses about the whole population
of 11th graders in the state based on your sample data.

The characteristics of samples and populations are described by numbers called statistics and parameters:

✓ A statistic is a measure that describes the sample (e.g., sample mean).


✓ A parameter is a measure that describes the whole population (e.g., population mean).

There are two important types of estimates you can make about the population: point estimates and interval
estimates.

✓ A point estimate is a single value estimate of a parameter. For instance, a sample mean is a point
estimate of a population mean.
✓ An interval estimate gives you a range of values where the parameter is expected to lie. A confidence
interval is the most common type of interval estimate.

Common inferential procedures include correlation tests, regression tests, hypothesis tests, comparison
tests (such as the t-test and ANOVA), and confidence intervals.
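As a sketch (hypothetical sample data), a point estimate and a 95% confidence interval for a population mean, using the z value 1.96:

```python
import statistics

# Sketch: point and interval estimates of a population mean from a
# hypothetical sample, using z = 1.96 for 95% confidence.
sample = [102, 98, 110, 95, 104, 99, 107, 101, 96, 108]

n = len(sample)
point_estimate = statistics.mean(sample)  # point estimate of the population mean
se = statistics.stdev(sample) / n ** 0.5  # standard error of the mean
interval = (point_estimate - 1.96 * se, point_estimate + 1.96 * se)

print(point_estimate, tuple(round(v, 1) for v in interval))
```

For a sample this small, a t-based interval with n - 1 degrees of freedom would be more appropriate; the z interval is used here only to keep the sketch simple.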

P-Value

The p-value is a number, calculated from a statistical test, that describes how likely you are to have found a
particular set of observations if the null hypothesis were true. P-values are used in hypothesis testing to help
decide whether to reject the null hypothesis. The smaller the p-value, the more likely you are to reject the
null hypothesis.
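For a z statistic, the two-sided p-value is the area in both tails beyond ±z, which can be written with the complementary error function (a minimal sketch with illustrative numbers):

```python
import math

# Sketch: two-sided p-value for a z statistic via the standard normal CDF,
# using the identity P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
def two_sided_p(z: float) -> float:
    return math.erfc(abs(z) / math.sqrt(2))

print(round(two_sided_p(1.96), 4))  # close to .05
print(round(two_sided_p(2.58), 4))  # close to .01
```

A p-value below the chosen alpha (say .05) leads to rejecting the null hypothesis.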
Correlation:

Correlation means association - more precisely it is a measure of the extent to which two variables are
related. There are three possible results of a correlational study: a positive correlation, a negative correlation,
and no correlation.

The correlation coefficient:

1. Lies between -1 and +1.
2. Does not imply causation.

As a rule of thumb, for the absolute value of the coefficient:

• < 0.3 – weak relationship
• 0.3 – 0.7 – moderate relationship
• > 0.7 – strong relationship
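A sketch of the Pearson correlation coefficient computed from its definition (hypothetical paired data, e.g., engagement scores versus performance scores):

```python
import math

# Sketch: Pearson correlation coefficient from its definition.
# x and y are hypothetical paired measurements.
x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [3.0, 5.0, 4.0, 8.0, 9.0]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Sum of cross products of deviations, and the two deviation norms.
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))

r = cov / (sd_x * sd_y)  # always lies between -1 and +1
print(round(r, 3))
```

For this made-up data r is above 0.9, a strong positive relationship; by itself that still says nothing about causation.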

Linear Regression

Simple linear regression is a parametric test, meaning that it makes certain assumptions about the data.
These assumptions are:

✓ Homogeneity of variance: the size of the error in our prediction doesn’t change significantly across
the values of the independent variable.
✓ Independence of observations: the observations in the dataset were collected using statistically valid
sampling methods, and there are no hidden relationships among observations.
✓ Normality: The data follows a normal distribution.
✓ The relationship between the independent and dependent variable is linear: the line of best fit
through the data points is a straight line (rather than a curve or some sort of grouping factor).

y = α + βx + e

• y is the predicted value of the dependent variable for any given value of the independent variable
(x).
• α is the intercept, the predicted value of y when x is 0.
• β is the regression coefficient – how much we expect y to change as x increases.
• x is the independent variable (the variable we expect is influencing y).
• e is the error of the estimate, or how much variation there is in our estimate of the regression
coefficient.

Linear regression finds the line of best fit through your data by searching for the regression coefficient
(β) that minimizes the total error (e) of the model.
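The least-squares solution for α and β can be sketched directly from the closed-form formulas (hypothetical data; a real analysis would typically use a statistics library):

```python
# Sketch: ordinary least squares for simple linear regression,
# which minimizes the total squared error. x and y are hypothetical.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# beta = sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2); alpha = ȳ - beta * x̄
beta = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        / sum((a - mean_x) ** 2 for a in x))
alpha = mean_y - beta * mean_x

# Predicted value for a new x: alpha + beta * x
print(round(alpha, 2), round(beta, 2))
```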

HR ANALYTICS

• HR analytics is a methodology for creating insights on how investments in human capital assets
contribute to the success of four principal outcomes: (a) generating revenue, (b) minimizing
expenses, (c) mitigating risks, and (d) executing strategic plans. This is done by applying statistical
methods to integrated HR, talent management, financial, and operational data.
• Human resources is a people-oriented function and is so perceived by most people. When used
strategically, analytics can transform how HR operates, giving the team insights and allowing it to
actively and meaningfully contribute to the organization’s bottom line.
PROS AND CONS OF HRA:

Here are the pros and cons of implementing HR analytics:

Pros:

• More accurate decision-making results from a data-driven approach, which reduces the need for
organizations to rely on intuition or guesswork.
• Strategies to improve retention can be developed thanks to a deeper understanding of the reasons
employees leave or stay with an organization.
• Employee engagement can be improved by analyzing data about employee behavior, such as how
they work with co-workers and customers, and determining how processes and environment can be
fine-tuned.
• Recruitment and hiring can be better tailored to the organization’s actual skillset needs by analyzing
and comparing the data of current employees and potential candidates.
• Trends and patterns in HR data lend themselves to forecasting via predictive analytics, enabling
organizations to be proactive in maintaining a productive workforce.

Cons:

• Many HR departments lack the statistical and analytical skillset to work with large datasets.
• Different management and reporting systems within the organization can make it difficult to
aggregate and compare data.
• Access to quality data can be an issue for some organizations that do not have up-to-date systems.
• Organizations need access to good quality analytical and reporting software that can utilize the data
collected.
• Monitoring and collecting a greater amount of data with new technologies (e.g., cloud-based systems,
wearable devices), as well as basing predictions on data, can create ethical issues.

Descriptive Analytics: This is the first level of true analysis. It looks for and describes relationships among
data without giving meaning to the patterns. It is exploratory rather than predictive. From it, we begin to see
trends from the past; yet, it is risky to extrapolate from the past into the future, considering the volatile,
rapidly changing markets of today and likely tomorrow.

Predictive Analytics analyzes historical data in order to forecast the future. The differentiator is the way
data is used. In standard HR analytics, data is collected and analyzed to report on what is working and what
needs improvement. In predictive analytics, data is also collected but is used to make future predictions
about employees or HR initiatives. This can include anything from predicting which candidates would be
more successful in the organization, to who is at risk of quitting within a year.

Advanced statistical techniques are used to create algorithmic models capable of identifying trends and
future behaviours. These future trends can describe possible risks or opportunities that organizations can
leverage in long-term decision-making.
For ex: Historical data can pinpoint reasons for poor performance, but predictive analytics can make
predictions about what initiatives are most likely to improve performance. If engagement levels are
identified as being correlated with performance, then organizations can implement specific initiatives that
boost employee engagement.
