Characteristics of Action Research [7 Marks]
1. A collaborative and adaptive research design that lends itself to use in work or
community situations.
2. Design focuses on pragmatic [practical or logical] and solution-driven research rather
than testing theories.
3. When practitioners use action research it has the potential to increase the amount they
learn consciously from their experience. The action research cycle can also be regarded
as a learning cycle.
4. Action research studies often have direct and obvious relevance to practice.
5. There are no hidden controls or pre-emption of direction by the researcher.
The term “data science” was coined in 2001, attempting to describe a new field. Some argue
that it’s nothing more than the natural evolution of statistics, and shouldn’t be called a new
field at all. But others argue that it’s more interdisciplinary [The Data Science Design Manual (2017), Steven Skiena].
Data science lies at the intersection of computer science, statistics, and substantive
application domains. From computer science comes machine learning and high-performance
computing technologies for dealing with scale. From statistics comes a long tradition of
exploratory data analysis, significance testing, and visualization. From application domains in
business and the sciences come challenges worthy of battle, and evaluation standards to assess when they have been adequately conquered.
The Data Science Venn Diagram depicts the various fields that come together to form what we call “data science.”
Regardless of whether data science is just a part of statistics, and regardless of the domain to
which we’re applying data science, the goal is the same: to turn data into actionable
value. The professional society INFORMS defines the related field of analytics as “the
scientific process of transforming data into insight for making better decisions.”
Turning data into actionable value usually involves answering questions using data. Here’s a
typical workflow for how that plays out in practice.
1. Obtain data that you hope will help answer the question.
2. Explore the data to understand it.
3. Clean and prepare the data for analysis.
4. Perform analysis, model building, testing, etc.
(The analysis is the step most people think of as data science, but it’s just one step!
Notice how much more there is that surrounds it.)
5. Draw conclusions from your work.
6. Report those conclusions to the relevant stakeholders.
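As a rough illustration of this workflow, here is a minimal Python/pandas sketch; the file name, column names, and the question being answered are all hypothetical:

    import pandas as pd

    # 1. Obtain data (hypothetical CSV of survey responses)
    df = pd.read_csv("survey.csv")

    # 2. Explore the data to understand it
    print(df.head())
    print(df.describe())

    # 3. Clean and prepare: drop rows missing the column of interest
    df = df.dropna(subset=["satisfaction"])

    # 4. Perform analysis, e.g., compare mean satisfaction across groups
    result = df.groupby("group")["satisfaction"].mean()

    # 5-6. Draw conclusions and report them to stakeholders
    print(result)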
QUESTION[4]: How to formulate a research problem? [6 Marks]
A clear statement that defines all objectives can help you conduct and develop effective and meaningful research. The objectives should be manageable enough to bring you success, and a few focused goals will help you keep your study relevant. This statement also helps professors evaluate the questions your research project answers and the different methods that you use to address them.
For example, if you select cyber security as your broad study area, then dissect it into subareas such as network security, web security, and database security related to cybercrime.
(iii) Mark-up your Interest
It is almost impossible to study all subareas, so you must identify your area of interest. Select issues that you are passionate about; your interest should be the most important determinant of your research study. Once you have selected your research area of interest, set aside the other subareas in which you do not feel interested. Keep in mind that if you lose interest in your research study, it will eventually fail to bring results.
(iv) Study Research Questions
In this step of formulating a research problem, you point out the research questions under the area of interest that you decided on in the previous stage. If you select unemployment as your study area, your questions might be: “How does unemployment impact an individual’s social status?” “How does it affect social stability?” “How does it create frustration in individuals?” Define what research problem or question you are going to study. The more thoroughly you study the research problem, the more relevant and fruitful solving it will be.
(v) Set Out Objectives
Clearly set out your main research objectives and sub-objectives. Research objectives essentially come from research questions. If you study “Impact of unemployment on individual social status” as your research problem or research question, then set out what you would like to explore to address it. For example, your main objective might be to examine the unemployment status in a particular society or state, and your sub-objectives would be its effects on individuals’ social lives. Setting out specific main and sub-objectives is crucial.
(vi) Assess your Objectives
Now, evaluate your objectives to make sure they can actually be attained through your research study. Assess your objectives in terms of the time, budget, resources, and technical expertise at your hand. You should also assess your research questions in light of reality. Determine what outcomes your study will bring. If you can accurately assess the purpose of the research study, it will bring significant results in the long run. In fact, the research objectives determine the value of the study you are going to carry out.
In today’s fast-paced world, statistics is playing a major role in the field of research;
that helps in the collection, analysis and presentation of data in a measurable form. It is
quite hard to identify whether the research relies on descriptive statistics or inferential statistics, as people usually lack knowledge about these two branches of statistics. As
the name suggests, descriptive statistics is one which describes the population.
On the other end, Inferential statistics is used to make the generalisation about the
population based on the samples. So, there is a big difference between descriptive and
inferential statistics, i.e. what you do with your data. Let’s take a glance at this article
to get some more details on the two topics.
Comparison Chart
Basis for comparison: What it does?
Descriptive statistics: Organizes, analyzes, and presents data in a meaningful way.
Inferential statistics: Compares, tests, and predicts data.
Basis for comparison: Function
Descriptive statistics: Explains the data, which is already known, to summarize a sample.
Inferential statistics: Attempts to reach conclusions about a population that extend beyond the data available.
The researcher summarises the data in a useful way, with the help of numerical and graphical tools such as charts, tables, and graphs, so that the data is represented accurately. Moreover, text is presented in support of the diagrams, to explain what they represent.
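As a rough illustration, this kind of numerical summary can be produced with pandas; the data values here are invented:

    import pandas as pd

    ages = pd.Series([23, 25, 31, 35, 41, 28, 33])

    # Numerical summary: count, mean, std, min, quartiles, max
    print(ages.describe())

    # Tabular summary: frequency of each distinct value
    print(ages.value_counts())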
Inferential Statistics is all about generalising from the sample to the population, i.e. the results of the analysis of the sample can be extended to the larger population, from which the sample
is taken. It is a convenient way to draw conclusions about the population when it is not possible
to query each and every member of the universe. The sample chosen is a representative of the
entire population; therefore, it should contain important features of the population.
Inferential Statistics is used to determine the probability of properties of the population on the
basis of the properties of the sample, by employing probability theory. The major inferential statistics are based on statistical models such as analysis of variance (ANOVA), the chi-square test, Student’s t distribution, and regression analysis. Methods of inferential statistics:
Estimation of parameters
Testing of hypothesis
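A hedged sketch of both methods with scipy.stats; the sample values and the hypothesized population mean are invented:

    from scipy import stats

    sample = [12.1, 11.8, 12.5, 12.0, 12.3, 11.9, 12.2]

    # Estimation of parameters: the sample mean estimates the population mean
    mean_estimate = sum(sample) / len(sample)

    # Testing of hypothesis: is the population mean different from 12.0?
    t_stat, p_value = stats.ttest_1samp(sample, popmean=12.0)
    print(mean_estimate, t_stat, p_value)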
The difference between descriptive and inferential statistics can be drawn clearly on the
following grounds:
1. Descriptive Statistics is a discipline which is concerned with describing the population
under study. Inferential Statistics is a type of statistics; that focuses on drawing
conclusions about the population, on the basis of sample analysis and observation.
2. Descriptive Statistics collects, organises, analyses and presents data in a meaningful way. On the contrary, Inferential Statistics compares data, tests hypotheses and makes predictions about future outcomes.
3. In descriptive statistics, the final result is given a diagrammatic or tabular representation, whereas in inferential statistics the final result is displayed in the form of a probability.
4. Descriptive statistics describes a situation while inferential statistics explains the
likelihood of the occurrence of an event.
5. Descriptive statistics explains the data, which is already known, to summarise sample.
Conversely, inferential statistics attempts to reach the conclusion to learn about the
population; that extends beyond the data available.
QUESTION[11]: Explain the top, middle and lower levels of management. [10 Marks]
Levels of Management
The term Levels of Management refers to the line of division that exists between various
managerial positions in an organization. As the size of the company and workforce increases,
the number of levels in management increases along with it, and vice versa. The
different Levels of Management can determine the chain of command within an organization,
as well as the amount of authority and typically decision-making influence accrued by all
managerial positions.
Levels of Management can be generally classified into three principal categories, all of which
direct managers to perform different functions.
Top Level of Management
This level of management consists of an organization’s board of directors and the chief executive or managing director. It is the ultimate source of power and authority, since it oversees the goals, policies, and procedures of a company. Its main priority is the strategic planning and execution of overall business success.
The roles and responsibilities of the top level of management can be summarized as follows:
Laying down the objectives and broad policies of the business enterprise.
Issuing necessary instructions for the preparation of department-specific budgets,
schedules, procedures, etc.
Preparing strategic plans and policies for the organization.
Appointing the executives for middle-level management, i.e. departmental managers.
Establishing controls of all organizational departments.
Since it consists of the Board of Directors, the top management level is also responsible for communicating with the outside world and is held accountable to an organization’s shareholders for the performance of the enterprise.
Providing overall guidance, direction, and encouraging harmony and collaboration.
Executive or Middle Level of Management
The branch and departmental managers form this middle management level. These people
are directly accountable to top management for the functioning of their respective departments,
devoting more time to organizational and directional functions. For smaller organizations, there
is often only one layer of middle management, but larger enterprises can see senior and
junior levels within this middle section.
The roles and responsibilities of the middle level of management can be summarized as
follows:
Executing the plans of the organization in accordance with the policies and directives laid
out by the top management level.
Forming plans for the sub-units of the organization that they supervise.
Participating in the hiring and training processes of lower-level management.
Interpreting and explaining the policies from top-level management to lower-level
management.
Sending reports and data to top management in a timely and efficient manner.
Evaluating the performance of junior managers.
Inspiring lower level managers towards improving their performance.
QUESTION[13]: What are variables? Explain moderating, control, extraneous, intervening and quantitative variables. [7 Marks]
Variables are things you measure, manipulate and control in statistics and research. All studies
analyze a variable, which can describe a person, place, thing or idea. A variable's value can
change between groups or over time. For example, if the variable in an experiment is a person's
eye color, its value can change from brown to blue to green from person to person.
Independent variables
An independent variable is a singular characteristic that the other variables in your experiment
cannot change. Age is an example of an independent variable. Where someone lives, what they
eat or how much they exercise are not going to change their age. Independent variables can,
however, change other variables. In studies, researchers often try to find out whether an
independent variable causes other variables to change and in what way.
Dependent variables
A dependent variable relies on and can be changed by other components. A grade on an exam
is an example of a dependent variable because it depends on factors such as how much sleep
you got and how long you studied. Independent variables can influence dependent variables,
but dependent variables cannot influence independent variables. For example, the time you spent studying (independent) can affect the grade on your test (dependent), but the grade on your test does not affect the time you spent studying.
When analyzing relationships between study objects, researchers often try to determine what
makes the dependent variable change and how.
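As a small illustration of that idea, using the study-time and exam-grade example from above (invented numbers, scipy assumed):

    from scipy import stats

    hours_studied = [1, 2, 3, 4, 5, 6]       # independent variable
    exam_grade = [55, 60, 68, 70, 78, 85]    # dependent variable

    # Fit a line to see how the dependent variable changes with the independent one
    result = stats.linregress(hours_studied, exam_grade)
    print(result.slope, result.intercept, result.rvalue)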
Intervening variables
Moderating variables
Control variables
Control or controlling variables are characteristics that are constant and do not change during
a study. They have no effect on other variables. Researchers might intentionally keep a control
variable the same throughout an experiment to prevent bias. For example, in an experiment
about plant development, control variables might include the amounts of fertilizer and water
each plant gets. These amounts are always the same so that they do not affect the plants' growth.
Extraneous variables
Extraneous variables are factors that affect the dependent variable but that the researcher did
not originally consider when designing the experiment. These unwanted variables can
unintentionally change a study's results or how a researcher interprets those results. Take, for
example, a study assessing whether private tutoring or online courses are more effective at
improving students' Spanish test scores. Extraneous variables that might unintentionally
influence the outcome include parental support, prior knowledge of a foreign language or
socioeconomic status.
Quantitative variables
Quantitative variables are any data sets that involve numbers or amounts. Examples might
include height, distance or number of items. Researchers can further categorize quantitative
variables into two types:
Discrete: Any numerical variables you can realistically count, such as the coins in
your wallet or the money in your savings account.
Continuous: Numerical variables that you could never finish counting, such as time.
Qualitative (categorical) variables, by contrast, fall into types such as:
Binary: Variables with only two categories, such as male or female, red or blue.
Nominal: Variables you can organize in more than two categories that do not follow
a particular order. Take, for example, housing types: Single-family home,
condominium, tiny home.
Ordinal: Variables you can organize in more than two categories that follow a
particular order. Take, for example, level of satisfaction: Unsatisfied, neutral,
satisfied.
Common ways to control for confounding variables include the following (a small sketch of randomization follows this list):
Adjustment: Adjust study parameters to account for the confounding variable and
minimize its effects.
Matching: Compare study groups with the same degree of confounding variables.
Multivariate analysis: Use when analyzing multiple variables at once.
Randomization: Spread confounding variables evenly between study groups.
Restriction: Remove subjects or samples that have confounding factors.
Stratification: Create study subgroups in which the confounding variable does not
vary or vary much.
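As promised above, here is a tiny sketch of randomization in Python; the subject list is hypothetical:

    import random

    subjects = [f"subject_{i}" for i in range(20)]
    random.shuffle(subjects)  # spreads confounding variables evenly, on average

    # Split the shuffled subjects into two equal study groups
    treatment, control = subjects[:10], subjects[10:]
    print(treatment)
    print(control)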
QUESTION[13]: Explain the HR scorecard. [20 Marks]
HR Scorecard
The HR scorecard, or Human Resource Scorecard, is a well-known HR tool. In this article, we
will explain what the HR scorecard is, the difference between the HR scorecard and the
balanced scorecard, modern-day critique, and show an example template of the HR scorecard.
The HR scorecard is a strategic HR measurement system that helps to measure, manage, and
improve the strategic role of the HR department.
Cost per hire can be time-consuming to work out. There used to be a huge variation in how
companies calculated this metric until The Society of Human Resource Management and the
American National Standards Institute agreed on a standard formula.
CPH can be calculated by adding together internal recruiting costs, and external recruiting
costs, divided by total number of hires. The costs and number of hires will both reflect a
selected measurement period – such as monthly, or annually.
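A minimal sketch of the CPH formula in Python; the cost figures and hire count are invented:

    # Cost per hire = (internal costs + external costs) / total number of hires
    internal_costs = 50_000  # e.g., recruiter salaries for the period
    external_costs = 30_000  # e.g., job boards and agency fees
    total_hires = 40

    cost_per_hire = (internal_costs + external_costs) / total_hires
    print(cost_per_hire)  # 2000.0 per hire for the selected period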
4. Time since last promotion (average time in months since last internal promotion)
This rather straightforward metric is useful in explaining why your high potentials leave.
This metric also relates to the employee utilization rate, which refers to the amount of working time an employee spends on billable tasks [working on a client's project].
8. Engagement rating
An engaged workforce is a productive workforce. Engagement might be the most important
‘soft’ HR outcome. People who like their job and who are proud of their company are generally
more engaged, even if the work environment is challenging and pressure can be high. Engaged
employees perform better and are more likely to perceive challenges as positive and interesting.
Additionally, team engagement is an important metric for a team manager’s success.
Preferably you would like to see low performers leave and high performers stay. This metric
also provides HR business partners with a great amount of information about the departments
and functions in which employees feel at home, and where in the organization they do not want
to work. Turnover is very useful data to know when shaping recruitment
strategies. Additionally, attrition could be a key metric in measuring a manager’s success.
14. Absenteeism
Like turnover, absenteeism is also a strong indicator of dissatisfaction and a predictor of
turnover. Absenteeism rate can give information to prevent this kind of leave, as long-term
absence can be very costly. Again, differences between individual managers and departments
are very interesting indicators of (potential) problems and bottlenecks.
HR metrics examples in key areas of business
Human Resource metrics are measurements that help you to track data across the HR
department and the organization. The most important areas are listed below. In this list of HR
metrics, we included the key HR metrics examples associated with those areas.
Organizational performance
Turnover percentages
% of regretted loss
Statistics on why personnel is leaving
Absence percentages and behavior
Recruitment (time to fill, number of applicants, recruitment cost)
HR operations
The sample mean represents a measure of the centre of the data. Any population's mean is estimated using the sample mean. In many situations, we are required to estimate what the whole population is doing, or which factors are at work throughout the population, without surveying everyone in it. In such cases the sample mean is useful. The average value found in a sample is termed the sample mean. The sample mean so calculated is used to find the variance and thereby the standard deviation. Let us see the sample mean formula and its applications in the upcoming sections.
Example 1: Five friends have heights of 110 units, 115 units, 109 units, 112 units, and 114 units respectively. Find their sample mean height.
Solution: Sample mean = (110 + 115 + 109 + 112 + 114) / 5 = 560 / 5 = 112 units.
Variance is a measure of the distance of each variable from the average value, or mean, of its data set. It is used to calculate deviation within the set, and it’s a valuable tool for investors and finance professionals. Below, we define variance, explain how to calculate it, and cover the advantages and disadvantages of using variance.
What is Variance? :Variance is a calculation that considers random variables in terms of their
relationship to the mean of its data set. Variance can be used to determine how far each variable
is from the mean and, in turn, how far each variable is from one another. It is also used in
statistical inferences, hypothesis testing, Monte Carlo methods (random sampling) and
goodness-of-fit analyses.
How to calculate variance: To calculate variance, you need to square each deviation between a given variable (X) and the mean.
In a sample set of data, you would subtract every value from the mean individually, then square
the value, like this: (μ - X)². Then, you would add all the squared deviations and divide them
by the total number of values to reach an average. This number is the variance.
To find the standard deviation, you could simply take the square root of the variance.
The formula Var[X] = E[X²] − (E[X])² shows that the variance of X is equal to the average of the square of X minus the square of its mean; each average here is obtained by summing over the values and dividing by the number of values in the set, N.
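A minimal sketch of that calculation in plain Python, treating the five invented values as the full set (so dividing by N):

    values = [4, 8, 6, 5, 7]
    n = len(values)
    mean = sum(values) / n  # 6.0

    # Sum of squared deviations from the mean, divided by N
    variance = sum((mean - x) ** 2 for x in values) / n
    std_dev = variance ** 0.5
    print(variance, std_dev)  # 2.0 and about 1.414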
What are the advantages of using variance?: The biggest advantage to using variance is to
gain information about a set of data. Whether you are an investor looking to mitigate risk or a
statistician who needs to understand the spread of a sample, the variance is information that
people can use to draw quick inferences.
Sample Standard Deviation Formula
Before learning the sample standard deviation formula, let us see when we use it. In a practical situation, when the population size N is large it becomes difficult to obtain a value xi for every observation in the population, and hence it becomes difficult to calculate the standard deviation (or variance) for the population. In such cases, we can estimate the standard deviation by calculating it on a sample of size n taken from the population of size N. This estimate is called the sample standard deviation (S). Since the sample standard deviation is a statistic calculated from only a few individuals in the reference population, the sample has greater variability, and thus the standard deviation of the sample is almost always greater than that of the population. Let us explore the sample standard deviation formula below.
There are two types of standard deviations, population standard deviation, and sample standard
deviation. While calculating the population standard deviation, we divide by n, the number of
data values. For calculating the sample standard deviation, we divide by n -1 i.e., one less than
the number of data values. Given a sample of data (observations) for the random variable x, its
sample standard deviation formula is given by:
S = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}

Here,
\bar{x} = sample mean
x_i = individual values in the sample
n = number of values in the sample
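A small sketch contrasting the two divisors, using Python's statistics module on invented data:

    import statistics

    data = [4, 8, 6, 5, 7]

    # Population standard deviation: divides by n
    print(statistics.pstdev(data))  # about 1.414

    # Sample standard deviation: divides by n - 1, so it is slightly larger
    print(statistics.stdev(data))   # about 1.581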
Non-Sampling Errors:
It is a well-known fact that precise measurement of any magnitude is not possible. If some
individuals, for example, are asked to measure the length of a particular piece of cloth
independently up to, say, two decimal points; we can be quite sure that their answers will not
be the same. In fact, the measuring instrument itself may not have the same degree of accuracy.
Error in recording
This type of error may arise at the stage when the investigator records the answers or even at
the tabulation stage. A major reason for such error is the carelessness on the part of the
investigator
Sampling Error
By now it should be clear that in the sampling method also, non-sampling error may be
committed. It is almost impossible to make the data absolutely free of such errors. However,
since the number of respondents in a sample survey is much smaller than in census, the non-
sampling error is generally less pronounced in the sampling method. Besides the non-sampling
errors, there is sampling error in a sample survey. Sampling error is the absolute difference
between the parameter and the corresponding statistic, that is, |T − θ|.
Sampling error is not due to any lapse on the part of the respondent or the investigator or some such reason. It arises because of the very nature of the procedure. It can never be completely eliminated. However, we have well-developed sampling theories with the help of which the effect of sampling error can be minimised.
Confidence Intervals:
QUESTION[17]:Explain confidence interval with example [6 Marks]
An interval that contains the unknown parameter (such as the population mean µ) with a certain degree of confidence.
Example: Consider the distribution of serum cholesterol levels for all males in the US who are
hypertensive and who smoke. This distribution has an unknown mean µ and a standard
deviation of 46 mg/100ml. Suppose we draw a random sample of 12 individuals from this population and find that the mean cholesterol level is x̄ = 217 mg/100ml.
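The example stops before the interval itself. Assuming a 95% confidence level and the normal approximation, the computation would run roughly as follows:

    import math

    sigma = 46    # population standard deviation (mg/100ml)
    n = 12        # sample size
    x_bar = 217   # sample mean (mg/100ml)
    z = 1.96      # critical value for 95% confidence

    margin = z * sigma / math.sqrt(n)
    print(x_bar - margin, x_bar + margin)  # roughly (191, 243)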
The degrees of freedom (DF) in statistics indicate the number of independent values that can
vary in an analysis without breaking any constraints. It is an essential idea that appears in many
contexts throughout statistics including hypothesis tests, probability distributions, and
linear regression. Learn how this fundamental concept affects the power and precision of your
analysis!
In this post, I bring this concept to life in an intuitive manner. You’ll learn the degrees of
freedom definition and know how to find degrees of freedom for various analyses, such as
linear regression, t-tests, and chi-square. I’ll start by defining degrees of freedom and providing
the formula. However, I’ll quickly move on to practical examples in the context of various
statistical analyses because they make this concept easier to understand.
UNIT 2
Data screening
QUESTION[1]:Explain Data screening? [10 Marks]
1. Data screening should be conducted prior to data recoding and data analysis, to help
ensure the integrity of the data.
2. It is only necessary to screen the data for the variables and cases used for the analyses
presented in the lab report.
3. Data screening means checking data for errors and fixing or removing these errors.
The goal is to maximise "signal" and minimise "noise" by identifying and fixing or
removing errors.
4. Keep a record of data screening steps undertaken and any changes made to the data.
This should be summarised in a one to two paragraph section called "Data screening"
at the beginning of the Results.
5. Fixing or removing incorrect data:
1. Erroneous data can be changed to missing data. Alternatively, if a correct
value can be presumed, then this can be entered.
2. To change data in a cell, open the data file and using the data view, left-click
on the cell. Delete the data to make it missing data. Or change the value in the
cell by typing in the new value. Repeat for each problematic value.
3. It is probably best to make erroneous data missing unless the correct value is
obvious (e.g., if 77 was entered, it might reasonably be deduced that 7 was
intended) in which case the incorrect value can be replaced with a best guess
correct value.
4. For cases with a lot of erroneous data, it is probably best to remove the entire
case (i.e., delete the whole row).
6. Out-of-range values:
1. Out of range values are either below the minimum or above the maximum
possible value.
2. To know what the in-range values are, check:
1. the survey, or
2. the SPSS Value Labels in Variable View (after downloading the data).
3. Identify out-of-range values by obtaining descriptive statistics (in SPSS, use
Analyze - Descriptive Statistics - Descriptives) to examine the minimum and
maximum values for all variables of interest. In the SPSS Data View, sort
variables with out-of-range values in ascending or descending order to help
identify the case(s) which has(have) the out-of-range values. Alternatively, use
search and find to identify the case(s) with out-of-range values.
4. Decide whether to accept, replace, or remove out-of-range values.
7. Unusual cases:
1. Unusual cases occur when a case's responses are very different from the
pattern of responses by most other respondents.
2. In SPSS, Data - Identify Unusual cases - Enter several to many variables (e.g.,
all the Time Management Skill variables).
3. The results will flag the top 10 anomalous cases.
4. Look carefully at each of these cases' responses to the target variables - do they appear to be legitimate? (e.g., are there out-of-range values, or could the data have been fabricated? Often responses to reverse-scored items are not fabricated in the expected direction.) If not, consider removing the case.
8. Duplicate cases:
1. Duplicate cases occur when two or more cases have identical or near-identical
data
2. In SPSS, Data - Identify Duplicate cases - Enter several to many variables.
3. Consider whether to remove both cases (e.g., the integrity of the data may be in doubt because it may have been fabricated and then duplicated) or to retain one copy of each case and delete the duplicates.
9. Manual check for other anomalies
1. Check carefully through the data file (case by case and variable by variable)
looking for and addressing any oddities.
2. Empty cases: e.g., cases with no or little data could be removed
3. Cases with responses which lack meaningful variation: (e.g., 5 5 5 5 5 5 5 5
5) or which exhibit obvious arbitrary patterns (e.g., zig-zag - 1 2 3 4 5 4 3 2 1
2 3 4 5) - such responses are unlikely to be valid and probably should be
deleted.
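Although the steps above are described for SPSS, a rough pandas equivalent for a few of them (out-of-range values, duplicates, empty cases) might look like this; the file name, column name, and the 1-5 valid range are hypothetical:

    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical data file

    # Out-of-range values: suppose valid responses lie between 1 and 5
    bad = (df["q1"] < 1) | (df["q1"] > 5)
    print(df[bad])

    # Make erroneous values missing rather than guessing a correction
    df.loc[bad, "q1"] = float("nan")

    # Duplicate cases: flag rows whose responses are identical
    print(df[df.duplicated(keep="first")])

    # Empty cases: drop rows with fewer than 3 non-missing values
    df = df.dropna(thresh=3)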
Email Spam
The goal is to predict whether an email is spam and should be delivered to the Junk folder. There is more than one method of identifying a mail as spam; a simple method is discussed here. The raw data comprises only the text part and ignores all images. The text is a simple sequence of words, which is the input (X). The goal is to predict the binary response Y: spam or not.
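As a concrete sketch of one simple method (not necessarily the one these notes have in mind), here is a bag-of-words classifier built with scikit-learn; the four training emails and their labels are invented:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    emails = ["win money now", "meeting at noon", "cheap money offer", "lunch tomorrow?"]
    labels = [1, 0, 1, 0]  # Y: 1 = spam, 0 = not spam

    # X: turn each email's sequence of words into word-count features
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(emails)

    model = MultinomialNB().fit(X, labels)
    print(model.predict(vectorizer.transform(["free money"])))  # likely [1]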
QUESTION[2]:Explain Handwritten Digit Recognition? [5 Marks]
The raw data comprises images that are scaled segments from five-digit ZIP codes. In the
diagram below every green box is one image. The original images are very small, containing
only 16 × 16 pixels. For convenience the images below are enlarged, hence the pixelation or
'boxiness' of the numbers.
Every image is to be identified as 0 or 1 or 2 ... or 9. Since the numbers are handwritten, the
task is not trivial. For instance, a '5' sometimes can very much look like a '6', and '7' is
sometimes confused with '1'.
Image segmentation
QUESTION[3]:Explain image segmentation? [5 Marks]
Here is a more complex example of an image processing problem. The satellite images are to be classified into man-made or natural regions. For instance, in the aerial images shown below,
buildings are labeled as man-made, and the vegetation areas are labeled as natural.
These grayscale images are much larger than those in the previous example: they are 512 × 512 pixels, and again, because they are grayscale, we can represent pixel intensity with numbers from 0 to 255.
Speech Recognition
QUESTION[4]:Explain Speech Recognition? [5 Marks]
Another interesting example of data mining deals with speech recognition. For instance,
if you call the University Park Airport, the system might ask you your flight number, or
your origin and destination cities. The system does a very good job recognizing city
names. This is a classification problem, in which each city name is a class. The number
of classes is very big but finite.
The raw data involves voice amplitude sampled at discrete time points (a time sequence), which may be represented as waveforms. In speech recognition, a very popular
method is the Hidden Markov Model.
At every time point, one or more features, such as frequencies, are computed. The speech signal
essentially becomes a sequence of frequency vectors. This sequence is assumed to be an
instance of a hidden Markov model (HMM). An HMM can be estimated using multiple sample
sequences under the same class (e.g., city name).
Microarray Gene Expression
For each sample taken from a tissue of a particular disease type, the expression levels of a very
large collection of genes are measured. The input data goes through a data cleaning process.
Data cleaning may include but is certainly not limited to, normalization, elimination of noise
and perhaps log-scale transformations. A large volume of literature exists on the topic of
cleaning microarray data.
DNA Sequencing
Each genome is made up of DNA sequences and each DNA segment has specific biological
functions. However there are DNA segments which are non-coding, i.e. they do not have any
biological function (or their functionalities are not yet known). One problem in DNA
sequencing is to label the sampled segments as coding or non-coding (with a biological
function or without).
The raw DNA data comprises sequences of letters, e.g., A, C, G, T for each of the DNA
sequences. One method of classification assumes the sequences to be realizations of random
processes. Different random processes are assumed for different classes of sequences.
Data science combines multiple fields, including statistics, scientific methods, artificial
intelligence (AI), and data analysis, to extract value from data. Those who practice data science
are called data scientists, and they combine a range of skills to analyze data collected from the
web, smartphones, customers, sensors, and other sources to derive actionable insights.
Data science encompasses preparing data for analysis, including cleansing, aggregating, and
manipulating the data to perform advanced data analysis. Analytic applications and data
scientists can then review the results to uncover patterns and enable business leaders to draw
informed insights.
QUESTION[7]: What is quality control, and what are the different types of quality control? [5 Marks]
A necessary but often tedious task data scientists must perform as part of any project is quality
assurance (QA). In a data science context, QA is the task of ensuring that the data to be analyzed
and modeled is suitable for whatever the use case happens to be.
QUESTION[9]: What are some examples of quality control and data quality, and what are the different types of quality inspections? [8 Marks]
Stratification. ...
Scatter Diagram.
SPSS is short for Statistical Package for the Social Sciences, and it's used by various kinds of
researchers for complex statistical data analysis. The SPSS software package was created for
the management and statistical analysis of social science data.
Using SPSS features, users can extract every piece of information from files for the execution of descriptive, inferential, and multivariate statistical procedures.
Thanks to SPSS’ Data Mining Manager, its users can conduct smart searches, extract hidden information with the help of decision trees, design artificial-intelligence neural networks, and perform market segmentation.
It can be used to solve algebraic, arithmetic, and trigonometric operations.
SPSS’ Report Generator feature lets you prepare attractive reports of investigations. It
incorporates text, tables, graphs, and statistical results of the report in the same file.
SPSS offers data documentation too: it enables researchers to store a metadata dictionary, which acts as a centralized information repository in relation to the data, such as its relationships with other data, its meaning, origin, format, and usage.
There are two SPSS views: Variable View and Data View.
Variable View
Name: It is a column field that accepts a unique ID that helps in sorting the data. Some of the
parameters for sorting data are name, gender, sex, educational qualification, designation, etc.
Label: It gives the label and allows you to add special characters.
Decimals: It defines the number of digits required after the decimal point.
Measure: It records the level of measurement of the data being entered in the tool, such as scale, ordinal, and nominal.
Data View
The data view is displayed as rows and columns. You can import a file or add data manually.
SPSS Statistics is one of the most commonly used statistical analysis tools in the business
world. Thanks to its powerful features and robustness, its users can manage and analyze data
and represent them in visually attractive graphical forms. It supports a graphical user interface
and command-line, thereby making the software more intuitive.
SPSS makes the processing of complex data pretty simple. Working with such data by hand is not easy, and it is a time-consuming process.
1. Market Research
Businesses want actionable insights using which they can make tough and effective business
decisions. There are tonnes of data generated by businesses, and scanning them manually is
not the right way to analyze them. For market researchers who are looking for a reliable
solution that will help them understand their data, analyze trends, forecast, plan, and arrive at
conclusions, SPSS is the best tool out there.
By using sophisticated statistical analyses, SPSS helps market researchers get actionable insights from their customer data. Thanks to its powerful survey data analysis technology, it is
possible to get accurate information about market trends. Perceptual mapping, preference
scaling, predictive analysis, statistical learning, and a bunch of other advanced tools such as
stratified, clustered, and multistage sampling help with the decision-making process.
2. Education
Educational institutions have to bear the pressure of enrolling students and retaining them each
year. Not to mention the fact that they need to attract new students every year. This is where
SPSS comes in. More than 80% of all US colleges are currently using SPSS software.
SPSS software’s ability to focus on patterns lets them identify the chances of a student’s future
success. It uses a combination of factors that tells them about students who are at risk.
The institution’s faculty can use SPSS software to analyze a plethora of complex data sets to
uncover hidden patterns.
3. Healthcare
Applying SPSS’ statistical analysis to healthcare delivery has a number of use cases. We need to solve a lot of issues to provide great healthcare: outdated practices in patient delivery and misaligned incentives for caregivers are some of the biggest issues. This is where analytics can be a life-saver, literally at that.
When it comes to the healthcare sector, the data of patients is sacrosanct. Wrong data can result in terrible outcomes, and the data is also sensitive and must be handled in a timely manner.
With the help of SPSS, healthcare organizations can implement a data-driven patient delivery program; it will not only drive better patient outcomes but also reduce the costs involved.
For data sets that have complex relationships, univariate and multivariate modeling techniques
can be used.
4. Retail
The retail industry relies heavily on analytics for everything from initial stock planning to
forecasting future trends. Customers have a lot of leverage when it comes to retail products,
thanks to the advent of social media, forums, and review sites.
Customers make their decisions based on a brand’s reviews online. So it is imperative
that retail businesses give the best that can be offered. Thankfully, statistical analysis is a savior
for the retail industry. Retail businesses generate a lot of data and it needs to be collected,
analyzed, and converted into actionable insights. By using the data effectively, businesses will
end up providing excellent experiences for their customers.
SPSS analysis lets retailers understand their customers, provide them with the right solutions
and deliver them using the perfect channels. From understanding how different segments of
customers behave to why they make certain buying decisions, everything can be found with
the help of SPSS analysis.
Using the previous spending and behavior patterns, SPSS statistics will profile customers. By
leveraging this data, it will come up with customer preferences and give them an analysis of
what makes customers turn from casual browsers into shoppers.
QUESTION[20]: Explain five ways SPSS predictive analysis benefits industries. [10 Marks]
4. Saves money
By using SPSS analysis, businesses can save a lot of money. For example, customers in the
banking and insurance industries saved more than $2.4 million as they thwarted a motor
insurance fraud syndicate within four months of using the tool.
In SPSS, users are not forced to work with syntax, even though syntax files can be saved and
modified as needed. When there are saved syntax files, it helps immensely with documentation
and also gives an idea of how the new variables were calculated and how values that were
missing were handled.
Data cleaning (cleansing) is the process of removing errors and resolving inconsistencies in
source data before loading them into a common repository. The aim of data cleaning, which is
especially required when integrating heterogeneous data sources, is improving data quality.
A frequency distribution table is a chart that summarizes all the data under two columns
- variables/categories, and their frequency. It has two or three columns. Usually, the first
column lists all the outcomes as individual values or in the form of class intervals, depending
upon the size of the data set.
Extreme scores are the lowest and highest possible scores for persons on items, or for
items by persons. They include zero and perfect scores. They are shown in the Tables as
MINIMUM ESTIMATE MEASURE and MAXIMUM ESTIMATE MEASURE.
What is an extreme score called in statistics?
The extreme values which are also known as outliers are the values that are too far from the
other observations of the given data. And their presence tends to have a very bad
(disproportionate) effect on the statistical analysis, which can lead to ambiguous
understandings.
A bar chart plots numeric values for levels of a categorical feature as bars. Levels are
plotted on one chart axis, and values are plotted on the other axis. Each categorical value claims
one bar, and the length of each bar corresponds to the bar's value.
Question[30] What is your perception of your own body? Do you feel that you are overweight, underweight, or about right? [8 Marks]
A random sample of 1,200 college students was asked this question as part of a larger survey.
The following table shows part of the responses:
Both of these questions will be easily answered once we summarize and look at
the distribution of the variable Body Image (i.e., once we summarize how often each of the
categories occurs).
Category Count Percent
About right 855 (855/1200)*100 = 71.3%
Overweight 235 (235/1200)*100 = 19.6%
Underweight 110 (110/1200)*100 = 9.2%
Total n=1200 100
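Counts and percentages like those in this table can be computed directly; a minimal pandas sketch with a few invented responses:

    import pandas as pd

    responses = pd.Series(["About right", "Overweight", "About right",
                           "Underweight", "About right", "Overweight"])

    counts = responses.value_counts()
    percents = (responses.value_counts(normalize=True) * 100).round(1)
    print(pd.DataFrame({"Count": counts, "Percent": percents}))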
Question[31] Explain outliers with an example. [8 Marks]
Recall that when we first looked at the histogram of ages of Best Actress Oscar winners, there
were three observations that looked like possible outliers:
We can now use the 1.5(IQR) criterion to check whether the three highest ages should indeed
be classified as potential outliers:
For this example, we found Q1 = 32 and Q3 = 41.5 which give an IQR = 9.5
Q1 – 1.5 (IQR) = 32 – (1.5)(9.5) = 17.75
Q3 + 1.5 (IQR) = 41.5 + (1.5)(9.5) = 55.75
The 1.5(IQR) criterion tells us that any observation with an age that is below 17.75 or above
55.75 is considered a suspected outlier.
We therefore conclude that the observations with ages of 61, 74 and 80 should be flagged as
suspected outliers in the distribution of ages. Note that since the smallest observation is 21,
there are no suspected low outliers in this distribution.
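The same 1.5(IQR) check is easy to script. A small Python sketch using the quartiles from this example (the list of ages is invented apart from the flagged values 61, 74, and 80 and the minimum of 21):

    # Quartiles from the Best Actress age example
    q1, q3 = 32, 41.5
    iqr = q3 - q1  # 9.5

    lower_fence = q1 - 1.5 * iqr  # 17.75
    upper_fence = q3 + 1.5 * iqr  # 55.75

    ages = [21, 28, 33, 38, 45, 61, 74, 80]
    outliers = [a for a in ages if a < lower_fence or a > upper_fence]
    print(outliers)  # [61, 74, 80]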
Correlation Analysis
Correlation is a statistical measure that indicates the extent to which two or more variables
fluctuate together. A positive correlation indicates the extent to which those variables increase
or decrease in parallel; a negative correlation indicates the extent to which one variable
increases as the other decreases.
When the fluctuation of one variable reliably predicts a similar fluctuation in another variable,
there’s often a tendency to think that means that the change in one causes the change in the
other. However, correlation does not imply causation. There may be an unknown factor that
influences both variables similarly.
Correlation is a statistical technique that can show whether, and how strongly, pairs of variables are related. Although some correlations are fairly obvious, your data may contain unsuspected correlations. You may also suspect there are correlations, but not know which are the strongest. An intelligent correlation analysis can lead to a greater understanding of your data.
Correlation is positive, or direct, when the values increase together, and negative when one value decreases as the other increases; the latter is also called inverse or contrary correlation.
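As a short sketch, Pearson's correlation coefficient can be computed with scipy.stats; the paired values are invented and show a broadly positive (direct) relationship:

    from scipy import stats

    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 6]

    # Pearson's r: +1 is perfect direct, -1 perfect inverse, 0 none
    r, p_value = stats.pearsonr(x, y)
    print(r, p_value)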