Characteristics of Action Research [7 Marks]
1. A collaborative and adaptive research design that lends itself to use in work or
community situations.
2. Design focuses on pragmatic [practical or logical] and solution-driven research rather
than testing theories.
3. When practitioners use action research it has the potential to increase the amount they
learn consciously from their experience. The action research cycle can also be regarded
as a learning cycle.
4. Action research studies often have direct and obvious relevance to practice.
5. There are no hidden controls or pre-emption of direction by the researcher.
The term “data science” was coined in 2001, attempting to describe a new field. Some argue
that it’s nothing more than the natural evolution of statistics, and shouldn’t be called a new
field at all. But others argue that it’s more interdisciplinary [The Data Science Design Manual (2017), Steven Skiena].
Data science lies at the intersection of computer science, statistics, and substantive
application domains. From computer science comes machine learning and high-performance
computing technologies for dealing with scale. From statistics comes a long tradition of
exploratory data analysis, significance testing, and visualization. From application domains in
business and the sciences come challenges worthy of battle, and evaluation standards to assess when they have been adequately conquered.
The Data Science Venn Diagram depicts the various fields that come together to form what we call “data science.”
Regardless of whether data science is just a part of statistics, and regardless of the domain to
which we’re applying data science, the goal is the same: to turn data into actionable
value. The professional society INFORMS defines the related field of analytics as “the
scientific process of transforming data into insight for making better decisions.”
Turning data into actionable value usually involves answering questions using data. Here’s a
typical workflow for how that plays out in practice.
1. Obtain data that you hope will help answer the question.
2. Explore the data to understand it.
3. Clean and prepare the data for analysis.
4. Perform analysis, model building, testing, etc.
(The analysis is the step most people think of as data science, but it’s just one step!
Notice how much more there is that surrounds it.)
5. Draw conclusions from your work.
6. Report those conclusions to the relevant stakeholders.
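As a rough illustration of this workflow, here is a minimal Python/pandas sketch; the file name, column names, and the question being answered are all hypothetical:

    import pandas as pd

    # 1. Obtain data (hypothetical CSV of survey responses)
    df = pd.read_csv("survey.csv")

    # 2. Explore the data to understand it
    print(df.head())
    print(df.describe())

    # 3. Clean and prepare: drop rows missing the column of interest
    df = df.dropna(subset=["satisfaction"])

    # 4. Perform analysis, e.g., compare mean satisfaction across groups
    result = df.groupby("group")["satisfaction"].mean()

    # 5-6. Draw conclusions and report them to stakeholders
    print(result)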
QUESTION[4]: How to formulate a research problem? [6 Marks]
A clear statement that defines all objectives can help you conduct and develop effective and meaningful research. The objectives should be manageable enough to bring you success, and a few focused goals will help you keep your study relevant. This statement also helps professors evaluate the questions your research project answers and the different methods that you use to address them.
For example, if you select cyber security as your broad study area, then dissect it into subareas such as network security, web security, and database security related to cybercrime.
(iii) Mark-up your Interest
It is almost impossible to study all subareas, so you must identify your area of interest. Select issues that you are passionate about; your interest should be the most important determinant of your research study. Once you have selected your research area of interest, set aside the other subareas in which you do not feel interested. Keep in mind that if you lose interest in your research study, it will eventually fail to bring results.
(iv) Study Research Questions
In this step of formulating a research problem, you point out the research questions under the area of interest that you decided on in the previous stage. If you select unemployment as your study area, your questions might be: “How does unemployment impact an individual’s social status?” “How does it affect social stability?” “How does it create frustration in individuals?” Define what research problem or question you are going to study. The more thoroughly you study the research problem, the more relevant and fruitful solving it will be.
(v) Set Out Objectives
Clearly set out your main research objectives and sub-objectives. Research objectives essentially come from research questions. If you study “Impact of unemployment on individual social status” as your research problem or research question, then set out what you would like to explore to address it. For example, your main objective might be to examine the unemployment status in a particular society or state, and your sub-objectives would be its effects on individuals’ social lives. Setting out specific main and sub-objectives is crucial.
(vi) Assess your Objectives
Now, evaluate your objectives to make sure they can actually be attained through your research study. Assess your objectives in terms of the time, budget, resources, and technical expertise at your hand. You should also assess your research questions in light of reality. Determine what outcomes your study will bring. If you can accurately assess the purpose of the research study, it will bring significant results in the long run. In fact, the research objectives determine the value of the study you are going to carry out.
In today’s fast-paced world, statistics is playing a major role in the field of research;
that helps in the collection, analysis and presentation of data in a measurable form. It is
quite hard to identify whether the research relies on descriptive statistics or inferential statistics, as people usually lack knowledge about these two branches of statistics. As
the name suggests, descriptive statistics is one which describes the population.
On the other end, Inferential statistics is used to make the generalisation about the
population based on the samples. So, there is a big difference between descriptive and
inferential statistics, i.e. what you do with your data. Let’s take a glance at this article
to get some more details on the two topics.
Comparison Chart
Basis for comparison: What it does?
Descriptive statistics: Organizes, analyzes, and presents data in a meaningful way.
Inferential statistics: Compares, tests, and predicts data.
Basis for comparison: Function
Descriptive statistics: Explains the data, which is already known, to summarize a sample.
Inferential statistics: Attempts to reach conclusions about a population that extend beyond the data available.
The researcher summarises the data in a useful way, with the help of numerical and graphical tools such as charts, tables, and graphs, so that the data is represented accurately. Moreover, text is presented in support of the diagrams, to explain what they represent.
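As a rough illustration, this kind of numerical summary can be produced with pandas; the data values here are invented:

    import pandas as pd

    ages = pd.Series([23, 25, 31, 35, 41, 28, 33])

    # Numerical summary: count, mean, std, min, quartiles, max
    print(ages.describe())

    # Tabular summary: frequency of each distinct value
    print(ages.value_counts())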
Inferential Statistics is all about generalising from the sample to the population, i.e. the results of the analysis of the sample can be extended to the larger population, from which the sample
is taken. It is a convenient way to draw conclusions about the population when it is not possible
to query each and every member of the universe. The sample chosen is a representative of the
entire population; therefore, it should contain important features of the population.
Inferential Statistics is used to determine the probability of properties of the population on the
basis of the properties of the sample, by employing probability theory. The major inferential statistics are based on statistical models such as analysis of variance (ANOVA), the chi-square test, Student’s t distribution, and regression analysis. Methods of inferential statistics:
Estimation of parameters
Testing of hypothesis
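A hedged sketch of both methods with scipy.stats; the sample values and the hypothesized population mean are invented:

    from scipy import stats

    sample = [12.1, 11.8, 12.5, 12.0, 12.3, 11.9, 12.2]

    # Estimation of parameters: the sample mean estimates the population mean
    mean_estimate = sum(sample) / len(sample)

    # Testing of hypothesis: is the population mean different from 12.0?
    t_stat, p_value = stats.ttest_1samp(sample, popmean=12.0)
    print(mean_estimate, t_stat, p_value)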
The difference between descriptive and inferential statistics can be drawn clearly on the
following grounds:
1. Descriptive Statistics is a discipline which is concerned with describing the population
under study. Inferential Statistics is a type of statistics; that focuses on drawing
conclusions about the population, on the basis of sample analysis and observation.
2. Descriptive Statistics collects, organises, analyses and presents data in a meaningful way. On the contrary, Inferential Statistics compares data, tests hypotheses and makes predictions about future outcomes.
3. In descriptive statistics, the final result is given a diagrammatic or tabular representation, whereas in inferential statistics the final result is displayed in the form of a probability.
4. Descriptive statistics describes a situation while inferential statistics explains the
likelihood of the occurrence of an event.
5. Descriptive statistics explains the data, which is already known, to summarise sample.
Conversely, inferential statistics attempts to reach the conclusion to learn about the
population; that extends beyond the data available.
QUESTION[11]: Explain the top, middle and lower levels of management. [10 Marks]
Levels of Management
The term Levels of Management refers to the line of division that exists between various
managerial positions in an organization. As the size of the company and workforce increases,
the number of levels in management increases along with it, and vice versa. The
different Levels of Management can determine the chain of command within an organization,
as well as the amount of authority and typically decision-making influence accrued by all
managerial positions.
Levels of Management can be generally classified into three principal categories, all of which
direct managers to perform different functions.
Top Level of Management
This level of management consists of an organization’s board of directors and the chief executive or managing director. It is the ultimate source of power and authority, since it oversees the goals, policies, and procedures of a company. Its main priority is the strategic planning and execution of overall business success.
The roles and responsibilities of the top level of management can be summarized as follows:
Laying down the objectives and broad policies of the business enterprise.
Issuing necessary instructions for the preparation of department-specific budgets,
schedules, procedures, etc.
Preparing strategic plans and policies for the organization.
Appointing the executives for middle-level management, i.e. departmental managers.
Establishing controls of all organizational departments.
Since it consists of the Board of Directors, the top management level is also responsible for communicating with the outside world and is held accountable to an organization’s shareholders for the performance of the enterprise.
Providing overall guidance, direction, and encouraging harmony and collaboration.
Executive or Middle Level of Management
The branch and departmental managers form this middle management level. These people
are directly accountable to top management for the functioning of their respective departments,
devoting more time to organizational and directional functions. For smaller organizations, there
is often only one layer of middle management, but larger enterprises can see senior and
junior levels within this middle section.
The roles and responsibilities of the middle level of management can be summarized as
follows:
Executing the plans of the organization in accordance with the policies and directives laid
out by the top management level.
Forming plans for the sub-units of the organization that they supervise.
Participating in the hiring and training processes of lower-level management.
Interpreting and explaining the policies from top-level management to lower-level
management.
Sending reports and data to top management in a timely and efficient manner.
Evaluating the performance of junior managers.
Inspiring lower level managers towards improving their performance.
QUESTION[13]: What are variables? Explain moderating, control, extraneous, intervening and quantitative variables. [7 Marks]
Variables are things you measure, manipulate and control in statistics and research. All studies
analyze a variable, which can describe a person, place, thing or idea. A variable's value can
change between groups or over time. For example, if the variable in an experiment is a person's
eye color, its value can change from brown to blue to green from person to person.
Independent variables
An independent variable is a singular characteristic that the other variables in your experiment
cannot change. Age is an example of an independent variable. Where someone lives, what they
eat or how much they exercise are not going to change their age. Independent variables can,
however, change other variables. In studies, researchers often try to find out whether an
independent variable causes other variables to change and in what way.
Dependent variables
A dependent variable relies on and can be changed by other components. A grade on an exam
is an example of a dependent variable because it depends on factors such as how much sleep
you got and how long you studied. Independent variables can influence dependent variables,
but dependent variables cannot influence independent variables. For example, the time you spent studying (independent) can affect the grade on your test (dependent), but the grade on your test does not affect the time you spent studying.
When analyzing relationships between study objects, researchers often try to determine what
makes the dependent variable change and how.
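As a small illustration of that idea, using the study-time and exam-grade example from above (invented numbers, scipy assumed):

    from scipy import stats

    hours_studied = [1, 2, 3, 4, 5, 6]       # independent variable
    exam_grade = [55, 60, 68, 70, 78, 85]    # dependent variable

    # Fit a line to see how the dependent variable changes with the independent one
    result = stats.linregress(hours_studied, exam_grade)
    print(result.slope, result.intercept, result.rvalue)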
Intervening variables
Moderating variables
Control variables
Control or controlling variables are characteristics that are constant and do not change during
a study. They have no effect on other variables. Researchers might intentionally keep a control
variable the same throughout an experiment to prevent bias. For example, in an experiment
about plant development, control variables might include the amounts of fertilizer and water
each plant gets. These amounts are always the same so that they do not affect the plants' growth.
Extraneous variables
Extraneous variables are factors that affect the dependent variable but that the researcher did
not originally consider when designing the experiment. These unwanted variables can
unintentionally change a study's results or how a researcher interprets those results. Take, for
example, a study assessing whether private tutoring or online courses are more effective at
improving students' Spanish test scores. Extraneous variables that might unintentionally
influence the outcome include parental support, prior knowledge of a foreign language or
socioeconomic status.
Quantitative variables
Quantitative variables are any data sets that involve numbers or amounts. Examples might
include height, distance or number of items. Researchers can further categorize quantitative
variables into two types:
Discrete: Any numerical variables you can realistically count, such as the coins in
your wallet or the money in your savings account.
Continuous: Numerical variables that you could never finish counting, such as time.
Qualitative (categorical) variables, by contrast, fall into types such as:
Binary: Variables with only two categories, such as male or female, red or blue.
Nominal: Variables you can organize in more than two categories that do not follow
a particular order. Take, for example, housing types: Single-family home,
condominium, tiny home.
Ordinal: Variables you can organize in more than two categories that follow a
particular order. Take, for example, level of satisfaction: Unsatisfied, neutral,
satisfied.
Common ways to control for confounding variables include the following (a small sketch of randomization follows this list):
Adjustment: Adjust study parameters to account for the confounding variable and
minimize its effects.
Matching: Compare study groups with the same degree of confounding variables.
Multivariate analysis: Use when analyzing multiple variables at once.
Randomization: Spread confounding variables evenly between study groups.
Restriction: Remove subjects or samples that have confounding factors.
Stratification: Create study subgroups in which the confounding variable does not
vary or vary much.
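As promised above, here is a tiny sketch of randomization in Python; the subject list is hypothetical:

    import random

    subjects = [f"subject_{i}" for i in range(20)]
    random.shuffle(subjects)  # spreads confounding variables evenly, on average

    # Split the shuffled subjects into two equal study groups
    treatment, control = subjects[:10], subjects[10:]
    print(treatment)
    print(control)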
QUESTION[13]: Explain the HR scorecard. [20 Marks]
HR Scorecard
The HR scorecard, or Human Resource Scorecard, is a well-known HR tool. In this article, we
will explain what the HR scorecard is, the difference between the HR scorecard and the
balanced scorecard, modern-day critique, and show an example template of the HR scorecard.
The HR scorecard is a strategic HR measurement system that helps to measure, manage, and
improve the strategic role of the HR department.
Cost per hire can be time-consuming to work out. There used to be a huge variation in how
companies calculated this metric until The Society of Human Resource Management and the
American National Standards Institute agreed on a standard formula.
CPH can be calculated by adding together internal recruiting costs, and external recruiting
costs, divided by total number of hires. The costs and number of hires will both reflect a
selected measurement period – such as monthly, or annually.
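A minimal sketch of the CPH formula in Python; the cost figures and hire count are invented:

    # Cost per hire = (internal costs + external costs) / total number of hires
    internal_costs = 50_000  # e.g., recruiter salaries for the period
    external_costs = 30_000  # e.g., job boards and agency fees
    total_hires = 40

    cost_per_hire = (internal_costs + external_costs) / total_hires
    print(cost_per_hire)  # 2000.0 per hire for the selected period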
4. Time since last promotion (average time in months since last internal promotion)
This rather straightforward metric is useful in explaining why your high potentials leave.
This metric also relates to the employee utilization rate, which refers to the amount of working time an employee spends on billable tasks [working on a client's project].
8. Engagement rating
An engaged workforce is a productive workforce. Engagement might be the most important
‘soft’ HR outcome. People who like their job and who are proud of their company are generally
more engaged, even if the work environment is challenging and pressure can be high. Engaged
employees perform better and are more likely to perceive challenges as positive and interesting.
Additionally, team engagement is an important metric for a team manager’s success.
Preferably you would like to see low performers leave and high performers stay. This metric
also provides HR business partners with a great amount of information about the departments
and functions in which employees feel at home, and where in the organization they do not want
to work. Turnover is very useful data to know when shaping recruitment
strategies. Additionally, attrition could be a key metric in measuring a manager’s success.
14. Absenteeism
Like turnover, absenteeism is also a strong indicator of dissatisfaction and a predictor of
turnover. Absenteeism rate can give information to prevent this kind of leave, as long-term
absence can be very costly. Again, differences between individual managers and departments
are very interesting indicators of (potential) problems and bottlenecks.
HR metrics examples in key areas of business
Human Resource metrics are measurements that help you to track data across the HR
department and the organization. The most important areas are listed below. In this list of HR
metrics, we included the key HR metrics examples associated with those areas.
Organizational performance
Turnover percentages
% of regretted loss
Statistics on why personnel is leaving
Absence percentages and behavior
Recruitment (time to fill, number of applicants, recruitment cost)
HR operations
The sample mean represents a measure of the centre of the data. Any population's mean is estimated using the sample mean. In many situations, we are required to estimate what the whole population is doing, or which factors are at work throughout the population, without surveying everyone in it. In such cases the sample mean is useful. The average value found in a sample is termed the sample mean. The sample mean so calculated is used to find the variance and thereby the standard deviation. Let us see the sample mean formula and its applications in the upcoming sections.
Example 1: Five friends have heights of 110 units, 115 units, 109 units, 112 units, and 114 units respectively. Find their sample mean height.
Solution: Sample mean = (110 + 115 + 109 + 112 + 114) / 5 = 560 / 5 = 112 units.
Variance is a measure of the distance of each variable from the average value, or mean, of its data set. It is used to calculate deviation within the set, and it’s a valuable tool for investors and finance professionals. Below, we define variance, explain how to calculate it, and cover the advantages and disadvantages of using variance.
What is Variance? :Variance is a calculation that considers random variables in terms of their
relationship to the mean of its data set. Variance can be used to determine how far each variable
is from the mean and, in turn, how far each variable is from one another. It is also used in
statistical inferences, hypothesis testing, Monte Carlo methods (random sampling) and
goodness-of-fit analyses.
How to calculate variance: To calculate variance, you need to square each deviation between a given variable (X) and the mean.
In a sample set of data, you would subtract every value from the mean individually, then square
the value, like this: (μ - X)². Then, you would add all the squared deviations and divide them
by the total number of values to reach an average. This number is the variance.
To find the standard deviation, you could simply take the square root of the variance.
The formula Var[X] = E[X²] − (E[X])² shows that the variance of X is equal to the average of the square of X minus the square of its mean; each average here is obtained by summing over the values and dividing by the number of values in the set, N.
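A minimal sketch of that calculation in plain Python, treating the five invented values as the full set (so dividing by N):

    values = [4, 8, 6, 5, 7]
    n = len(values)
    mean = sum(values) / n  # 6.0

    # Sum of squared deviations from the mean, divided by N
    variance = sum((mean - x) ** 2 for x in values) / n
    std_dev = variance ** 0.5
    print(variance, std_dev)  # 2.0 and about 1.414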
What are the advantages of using variance?: The biggest advantage to using variance is to
gain information about a set of data. Whether you are an investor looking to mitigate risk or a
statistician who needs to understand the spread of a sample, the variance is information that
people can use to draw quick inferences.
Sample Standard Deviation Formula
Before learning the sample standard deviation formula, let us see when we use it. In a practical situation, when the population size N is large it becomes difficult to obtain a value xi for every observation in the population, and hence it becomes difficult to calculate the standard deviation (or variance) for the population. In such cases, we can estimate the standard deviation by calculating it on a sample of size n taken from the population of size N. This estimate is called the sample standard deviation (S). Since the sample standard deviation is a statistic calculated from only a few individuals in the reference population, the sample has greater variability, and thus the standard deviation of the sample is almost always greater than that of the population. Let us explore the sample standard deviation formula below.
There are two types of standard deviations, population standard deviation, and sample standard
deviation. While calculating the population standard deviation, we divide by n, the number of
data values. For calculating the sample standard deviation, we divide by n -1 i.e., one less than
the number of data values. Given a sample of data (observations) for the random variable x, its
sample standard deviation formula is given by:
S = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}

Here,
\bar{x} = sample mean
x_i = individual values in the sample
n = number of values in the sample
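A small sketch contrasting the two divisors, using Python's statistics module on invented data:

    import statistics

    data = [4, 8, 6, 5, 7]

    # Population standard deviation: divides by n
    print(statistics.pstdev(data))  # about 1.414

    # Sample standard deviation: divides by n - 1, so it is slightly larger
    print(statistics.stdev(data))   # about 1.581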
Non-Sampling Errors:
It is a well-known fact that precise measurement of any magnitude is not possible. If some
individuals, for example, are asked to measure the length of a particular piece of cloth
independently up to, say, two decimal points; we can be quite sure that their answers will not
be the same. In fact, the measuring instrument itself may not have the same degree of accuracy.
Error in recording
This type of error may arise at the stage when the investigator records the answers or even at
the tabulation stage. A major reason for such error is the carelessness on the part of the
investigator
Sampling Error
By now it should be clear that in the sampling method also, non-sampling error may be
committed. It is almost impossible to make the data absolutely free of such errors. However,
since the number of respondents in a sample survey is much smaller than in census, the non-
sampling error is generally less pronounced in the sampling method. Besides the non-sampling
errors, there is sampling error in a sample survey. Sampling error is the absolute difference
between the parameter and the corresponding statistic, that is, |T − θ|.
Sampling error is not due to any lapse on the part of the respondent or the investigator or some such reason. It arises because of the very nature of the procedure. It can never be completely eliminated. However, we have well-developed sampling theories with the help of which the effect of sampling error can be minimised.
Confidence Intervals:
QUESTION[17]:Explain confidence interval with example [6 Marks]
An interval that contains the unknown parameter (such as the population mean µ) with a certain degree of confidence.
Example: Consider the distribution of serum cholesterol levels for all males in the US who are
hypertensive and who smoke. This distribution has an unknown mean µ and a standard
deviation of 46 mg/100ml. Suppose we draw a random sample of 12 individuals from this population and find that the mean cholesterol level is x̄ = 217 mg/100ml.
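The example stops before the interval itself. Assuming a 95% confidence level and the normal approximation, the computation would run roughly as follows:

    import math

    sigma = 46    # population standard deviation (mg/100ml)
    n = 12        # sample size
    x_bar = 217   # sample mean (mg/100ml)
    z = 1.96      # critical value for 95% confidence

    margin = z * sigma / math.sqrt(n)
    print(x_bar - margin, x_bar + margin)  # roughly (191, 243)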
The degrees of freedom (DF) in statistics indicate the number of independent values that can
vary in an analysis without breaking any constraints. It is an essential idea that appears in many
contexts throughout statistics including hypothesis tests, probability distributions, and
linear regression. Learn how this fundamental concept affects the power and precision of your
analysis!
In this post, I bring this concept to life in an intuitive manner. You’ll learn the degrees of
freedom definition and know how to find degrees of freedom for various analyses, such as
linear regression, t-tests, and chi-square. I’ll start by defining degrees of freedom and providing
the formula. However, I’ll quickly move on to practical examples in the context of various
statistical analyses because they make this concept easier to understand.
UNIT 2
Data screening
QUESTION[1]:Explain Data screening? [10 Marks]
1. Data screening should be conducted prior to data recoding and data analysis, to help
ensure the integrity of the data.
2. It is only necessary to screen the data for the variables and cases used for the analyses
presented in the lab report.
3. Data screening means checking data for errors and fixing or removing these errors.
The goal is to maximise "signal" and minimise "noise" by identifying and fixing or
removing errors.
4. Keep a record of data screening steps undertaken and any changes made to the data.
This should be summarised in a one to two paragraph section called "Data screening"
at the beginning of the Results.
5. Fixing or removing incorrect data:
1. Erroneous data can be changed to missing data. Alternatively, if a correct
value can be presumed, then this can be entered.
2. To change data in a cell, open the data file and using the data view, left-click
on the cell. Delete the data to make it missing data. Or change the value in the
cell by typing in the new value. Repeat for each problematic value.
3. It is probably best to make erroneous data missing unless the correct value is
obvious (e.g., if 77 was entered, it might reasonably be deduced that 7 was
intended) in which case the incorrect value can be replaced with a best guess
correct value.
4. For cases with a lot of erroneous data, it is probably best to remove the entire
case (i.e., delete the whole row).
6. Out-of-range values:
1. Out of range values are either below the minimum or above the maximum
possible value.
2. To know what the in-range values are, check:
1. the survey, or
2. the SPSS Value Labels in Variable View (after downloading the data).
3. Identify out-of-range values by obtaining descriptive statistics (in SPSS, use
Analyze - Descriptive Statistics - Descriptives) to examine the minimum and
maximum values for all variables of interest. In the SPSS Data View, sort
variables with out-of-range values in ascending or descending order to help
identify the case(s) which has(have) the out-of-range values. Alternatively, use
search and find to identify the case(s) with out-of-range values.
4. Decide whether to accept, replace, or remove out-of-range values.
7. Unusual cases:
1. Unusual cases occur when a case's responses are very different from the
pattern of responses by most other respondents.
2. In SPSS, Data - Identify Unusual cases - Enter several to many variables (e.g.,
all the Time Management Skill variables).
3. The results will flag the top 10 anomalous cases.
4. Look carefully at each of these cases' responses to the target variables - do they appear to be legitimate? (e.g., are there out-of-range values, or could the data have been fabricated? Often responses to reverse-scored items are not fabricated in the expected direction.) If not, consider removing the case.
8. Duplicate cases:
1. Duplicate cases occur when two or more cases have identical or near-identical
data
2. In SPSS, Data - Identify Duplicate cases - Enter several to many variables.
3. Consider whether to remove both cases (e.g., the integrity of the data may be in doubt because it may have been fabricated and then duplicated) or to retain one copy of each case and delete the duplicates.
9. Manual check for other anomalies
1. Check carefully through the data file (case by case and variable by variable)
looking for and addressing any oddities.
2. Empty cases: e.g., cases with no or little data could be removed
3. Cases with responses which lack meaningful variation: (e.g., 5 5 5 5 5 5 5 5
5) or which exhibit obvious arbitrary patterns (e.g., zig-zag - 1 2 3 4 5 4 3 2 1
2 3 4 5) - such responses are unlikely to be valid and probably should be
deleted.
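Although the steps above are described for SPSS, a rough pandas equivalent for a few of them (out-of-range values, duplicates, empty cases) might look like this; the file name, column name, and the 1-5 valid range are hypothetical:

    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical data file

    # Out-of-range values: suppose valid responses lie between 1 and 5
    bad = (df["q1"] < 1) | (df["q1"] > 5)
    print(df[bad])

    # Make erroneous values missing rather than guessing a correction
    df.loc[bad, "q1"] = float("nan")

    # Duplicate cases: flag rows whose responses are identical
    print(df[df.duplicated(keep="first")])

    # Empty cases: drop rows with fewer than 3 non-missing values
    df = df.dropna(thresh=3)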
Email Spam
The goal is to predict whether an email is spam and should be delivered to the Junk folder. There is more than one method of identifying a mail as spam; a simple method is discussed here. The raw data comprises only the text part and ignores all images. The text is a simple sequence of words, which is the input (X). The goal is to predict the binary response Y: spam or not.
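As a concrete sketch of one simple method (not necessarily the one these notes have in mind), here is a bag-of-words classifier built with scikit-learn; the four training emails and their labels are invented:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    emails = ["win money now", "meeting at noon", "cheap money offer", "lunch tomorrow?"]
    labels = [1, 0, 1, 0]  # Y: 1 = spam, 0 = not spam

    # X: turn each email's sequence of words into word-count features
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(emails)

    model = MultinomialNB().fit(X, labels)
    print(model.predict(vectorizer.transform(["free money"])))  # likely [1]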
QUESTION[2]:Explain Handwritten Digit Recognition? [5 Marks]
The raw data comprises images that are scaled segments from five-digit ZIP codes. In the
diagram below every green box is one image. The original images are very small, containing
only 16 × 16 pixels. For convenience the images below are enlarged, hence the pixelation or
'boxiness' of the numbers.
Every image is to be identified as 0 or 1 or 2 ... or 9. Since the numbers are handwritten, the
task is not trivial. For instance, a '5' sometimes can very much look like a '6', and '7' is
sometimes confused with '1'.
Image segmentation
QUESTION[3]:Explain image segmentation? [5 Marks]
Here is a more complex example of an image processing problem. The satellite images are to be classified into man-made or natural regions. For instance, in the aerial images shown below,
buildings are labeled as man-made, and the vegetation areas are labeled as natural.
These grayscale images are much larger than those in the previous example: they are 512 × 512 pixels, and again, because they are grayscale, we can represent pixel intensity with numbers from 0 to 255.
Speech Recognition
QUESTION[4]:Explain Speech Recognition? [5 Marks]
Another interesting example of data mining deals with speech recognition. For instance,
if you call the University Park Airport, the system might ask you your flight number, or
your origin and destination cities. The system does a very good job recognizing city
names. This is a classification problem, in which each city name is a class. The number
of classes is very big but finite.
The raw data involves voice amplitude sampled at discrete time points (a time sequence), which may be represented as waveforms. In speech recognition, a very popular
method is the Hidden Markov Model.
At every time point, one or more features, such as frequencies, are computed. The speech signal
essentially becomes a sequence of frequency vectors. This sequence is assumed to be an
instance of a hidden Markov model (HMM). An HMM can be estimated using multiple sample
sequences under the same class (e.g., city name).
Microarray Gene Expression
For each sample taken from a tissue of a particular disease type, the expression levels of a very
large collection of genes are measured. The input data goes through a data cleaning process.
Data cleaning may include but is certainly not limited to, normalization, elimination of noise
and perhaps log-scale transformations. A large volume of literature exists on the topic of
cleaning microarray data.
DNA Sequencing
Each genome is made up of DNA sequences and each DNA segment has specific biological
functions. However there are DNA segments which are non-coding, i.e. they do not have any
biological function (or their functionalities are not yet known). One problem in DNA
sequencing is to label the sampled segments as coding or non-coding (with a biological
function or without).
The raw DNA data comprises sequences of letters, e.g., A, C, G, T for each of the DNA
sequences. One method of classification assumes the sequences to be realizations of random
processes. Different random processes are assumed for different classes of sequences.
Data science combines multiple fields, including statistics, scientific methods, artificial
intelligence (AI), and data analysis, to extract value from data. Those who practice data science
are called data scientists, and they combine a range of skills to analyze data collected from the
web, smartphones, customers, sensors, and other sources to derive actionable insights.
Data science encompasses preparing data for analysis, including cleansing, aggregating, and
manipulating the data to perform advanced data analysis. Analytic applications and data
scientists can then review the results to uncover patterns and enable business leaders to draw
informed insights.
QUESTION[7]: What is quality control, and what are the different types of quality control? [5 Marks]
A necessary but often tedious task data scientists must perform as part of any project is quality
assurance (QA). In a data science context, QA is the task of ensuring that the data to be analyzed
and modeled is suitable for whatever the use case happens to be.
QUESTION[9]: What are some examples of quality control and data quality, and what are the different types of quality inspections? [8 Marks]
Stratification. ...
Scatter Diagram.
SPSS is short for Statistical Package for the Social Sciences, and it's used by various kinds of
researchers for complex statistical data analysis. The SPSS software package was created for
the management and statistical analysis of social science data.
Using SPSS features, users can extract every piece of information from files for the execution of descriptive, inferential, and multivariate statistical procedures.
Thanks to SPSS’ Data Mining Manager, its users can conduct smart searches, extract hidden information with the help of decision trees, design artificial-intelligence neural networks, and perform market segmentation.
It can be used to solve algebraic, arithmetic, and trigonometric operations.
SPSS’ Report Generator feature lets you prepare attractive reports of investigations. It
incorporates text, tables, graphs, and statistical results of the report in the same file.
SPSS offers data documentation too: it enables researchers to store a metadata dictionary, which acts as a centralized information repository in relation to the data, such as its relationships with other data, its meaning, origin, format, and usage.
There are two SPSS views: Variable View and Data View.
Variable View
Name: It is a column field that accepts a unique ID that helps in sorting the data. Some of the
parameters for sorting data are name, gender, sex, educational qualification, designation, etc.
Label: It gives the label and allows you to add special characters.
Decimals: It defines the number of digits required after the decimal point.
Measure: It records the level of measurement of the data being entered in the tool, such as scale, ordinal, and nominal.
Data View
The data view is displayed as rows and columns. You can import a file or add data manually.
SPSS Statistics is one of the most commonly used statistical analysis tools in the business
world. Thanks to its powerful features and robustness, its users can manage and analyze data
and represent them in visually attractive graphical forms. It supports a graphical user interface
and command-line, thereby making the software more intuitive.
SPSS makes the processing of complex data pretty simple. Working with such data by hand is not easy, and it is a time-consuming process.
1. Market Research
Businesses want actionable insights using which they can make tough and effective business
decisions. There are tonnes of data generated by businesses, and scanning them manually is
not the right way to analyze them. For market researchers who are looking for a reliable
solution that will help them understand their data, analyze trends, forecast, plan, and arrive at
conclusions, SPSS is the best tool out there.
By using sophisticated statistical analyses, SPSS helps market researchers get actionable insights from their customer data. Thanks to its powerful survey data analysis technology, it is
possible to get accurate information about market trends. Perceptual mapping, preference
scaling, predictive analysis, statistical learning, and a bunch of other advanced tools such as
stratified, clustered, and multistage sampling help with the decision-making process.
2. Education
Educational institutions have to bear the pressure of enrolling students and retaining them each
year. Not to mention the fact that they need to attract new students every year. This is where
SPSS comes in. More than 80% of all US colleges are currently using SPSS software.
SPSS software’s ability to focus on patterns lets them identify the chances of a student’s future
success. It uses a combination of factors that tells them about students who are at risk.
The institution’s faculty can use SPSS software to analyze a plethora of complex data sets to
uncover hidden patterns.
3. Healthcare
Applying SPSS’ statistical analysis to healthcare delivery has a number of use cases. We need to solve a lot of issues to provide great healthcare: outdated practices in patient delivery and misaligned incentives for caregivers are some of the biggest issues. This is where analytics can be a life-saver, literally at that.
When it comes to the healthcare sector, the data of patients is sacrosanct. Wrong data can result in terrible outcomes, and the data is also sensitive and must be handled in a timely manner.
With the help of SPSS, healthcare organizations can implement a data-driven patient delivery program; it will not only drive better patient outcomes but also reduce the costs involved.
For data sets that have complex relationships, univariate and multivariate modeling techniques
can be used.
4. Retail
The retail industry relies heavily on analytics for everything from initial stock planning to
forecasting future trends. Customers have a lot of leverage when it comes to retail products,
thanks to the advent of social media, forums, and review sites.
Customers make their decisions based on a brand’s reviews online. So it is imperative
that retail businesses give the best that can be offered. Thankfully, statistical analysis is a savior
for the retail industry. Retail businesses generate a lot of data and it needs to be collected,
analyzed, and converted into actionable insights. By using the data effectively, businesses will
end up providing excellent experiences for their customers.
SPSS analysis lets retailers understand their customers, provide them with the right solutions
and deliver them using the perfect channels. From understanding how different segments of
customers behave to why they make certain buying decisions, everything can be found with
the help of SPSS analysis.
Using the previous spending and behavior patterns, SPSS statistics will profile customers. By
leveraging this data, it will come up with customer preferences and give them an analysis of
what makes customers turn from casual browsers into shoppers.
QUESTION[20]: Explain five ways SPSS predictive analysis benefits industries. [10 Marks]
4. Saves money
By using SPSS analysis, businesses can save a lot of money. For example, customers in the
banking and insurance industries saved more than $2.4 million as they thwarted a motor
insurance fraud syndicate within four months of using the tool.
In SPSS, users are not forced to work with syntax, even though syntax files can be saved and
modified as needed. When there are saved syntax files, it helps immensely with documentation
and also gives an idea of how the new variables were calculated and how values that were
missing were handled.
Data cleaning (cleansing) is the process of removing errors and resolving inconsistencies in
source data before loading them into a common repository. The aim of data cleaning, which is
especially required when integrating heterogeneous data sources, is improving data quality.
A frequency distribution table is a chart that summarizes all the data under two columns
- variables/categories, and their frequency. It has two or three columns. Usually, the first
column lists all the outcomes as individual values or in the form of class intervals, depending
upon the size of the data set.
Extreme scores are the lowest and highest possible scores for persons on items, or for
items by persons. They include zero and perfect scores. They are shown in the Tables as
MINIMUM ESTIMATE MEASURE and MAXIMUM ESTIMATE MEASURE.
What is an extreme score called in statistics?
The extreme values which are also known as outliers are the values that are too far from the
other observations of the given data. And their presence tends to have a very bad
(disproportionate) effect on the statistical analysis, which can lead to ambiguous
understandings.
A bar chart plots numeric values for levels of a categorical feature as bars. Levels are
plotted on one chart axis, and values are plotted on the other axis. Each categorical value claims
one bar, and the length of each bar corresponds to the bar's value.
Question[30] What is your perception of your own body? Do you feel that you are overweight, underweight, or about right? [8 Marks]
A random sample of 1,200 college students was asked this question as part of a larger survey.
The following table shows part of the responses:
Both of these questions will be easily answered once we summarize and look at
the distribution of the variable Body Image (i.e., once we summarize how often each of the
categories occurs).
Category Count Percent
About right 855 (855/1200)*100 = 71.3%
Overweight 235 (235/1200)*100 = 19.6%
Underweight 110 (110/1200)*100 = 9.2%
Total n=1200 100
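Counts and percentages like those in this table can be computed directly; a minimal pandas sketch with a few invented responses:

    import pandas as pd

    responses = pd.Series(["About right", "Overweight", "About right",
                           "Underweight", "About right", "Overweight"])

    counts = responses.value_counts()
    percents = (responses.value_counts(normalize=True) * 100).round(1)
    print(pd.DataFrame({"Count": counts, "Percent": percents}))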
Question[31] Explain outliers with an example. [8 Marks]
Recall that when we first looked at the histogram of ages of Best Actress Oscar winners, there
were three observations that looked like possible outliers:
We can now use the 1.5(IQR) criterion to check whether the three highest ages should indeed
be classified as potential outliers:
For this example, we found Q1 = 32 and Q3 = 41.5 which give an IQR = 9.5
Q1 – 1.5 (IQR) = 32 – (1.5)(9.5) = 17.75
Q3 + 1.5 (IQR) = 41.5 + (1.5)(9.5) = 55.75
The 1.5(IQR) criterion tells us that any observation with an age that is below 17.75 or above
55.75 is considered a suspected outlier.
We therefore conclude that the observations with ages of 61, 74 and 80 should be flagged as
suspected outliers in the distribution of ages. Note that since the smallest observation is 21,
there are no suspected low outliers in this distribution.
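The same 1.5(IQR) check is easy to script. A small Python sketch using the quartiles from this example (the list of ages is invented apart from the flagged values 61, 74, and 80 and the minimum of 21):

    # Quartiles from the Best Actress age example
    q1, q3 = 32, 41.5
    iqr = q3 - q1  # 9.5

    lower_fence = q1 - 1.5 * iqr  # 17.75
    upper_fence = q3 + 1.5 * iqr  # 55.75

    ages = [21, 28, 33, 38, 45, 61, 74, 80]
    outliers = [a for a in ages if a < lower_fence or a > upper_fence]
    print(outliers)  # [61, 74, 80]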
Correlation Analysis
Correlation is a statistical measure that indicates the extent to which two or more variables
fluctuate together. A positive correlation indicates the extent to which those variables increase
or decrease in parallel; a negative correlation indicates the extent to which one variable
increases as the other decreases.
When the fluctuation of one variable reliably predicts a similar fluctuation in another variable,
there’s often a tendency to think that means that the change in one causes the change in the
other. However, correlation does not imply causation. There may be an unknown factor that
influences both variables similarly.
Correlation is a statistical technique that can show whether, and how strongly, pairs of variables are related. Although some correlations are fairly obvious, your data may contain unsuspected correlations. You may also suspect there are correlations, but not know which are the strongest. An intelligent correlation analysis can lead to a greater understanding of your data.
Correlation is positive, or direct, when the values increase together, and negative when one value decreases as the other increases; the latter is also called inverse or contrary correlation.
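As a short sketch, Pearson's correlation coefficient can be computed with scipy.stats; the paired values are invented and show a broadly positive (direct) relationship:

    from scipy import stats

    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 6]

    # Pearson's r: +1 is perfect direct, -1 perfect inverse, 0 none
    r, p_value = stats.pearsonr(x, y)
    print(r, p_value)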