
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 1

Analyze Student Survey Data Set by Applying Normal Distribution, Linear Regression and Hypothesis Testing

TALHA BIN SAEED, GHULAM MUSTAFA, LIAQAT KHAN, ALAM NASIR, SYED MUHAMMAD JAWAD, MOHSIN GUL, ASFANDYAR
Abstract— This paper analyzes a student survey data set by applying several statistical methods. The distribution of the data set is calculated using the normal distribution and linear regression, and is then tested with a hypothesis test.

Index Terms— Datasets, Histogram, Normal Distributions, Linear Regression, Hypothesis Test, Least Square Error.

I. INTRODUCTION

THIS paper presents data collected in a student survey that records 25 students and their weights. The data set describes each student's gender, whether the student is vegetarian, height, preferred food and weight. First, we draw histograms of the 25 students and their weights, and a graph of both random variables. Second, we draw the normal distribution of the weights. Third, we apply linear regression. Fourth, we apply a hypothesis test to the data set. Fifth, we apply least mean squares to the data. Finally, we plot the ROC graph and calculate the precision of the calculated data.

The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, heights, blood pressure, measurement error, and IQ scores approximately follow the normal distribution. It is also known as the Gaussian distribution and the bell curve.

II. PROCEDURE
A. Histogram
A histogram is a graphical display of data using bars of different heights. Each bar groups numbers into ranges, and taller bars show that more data falls in that range. A histogram displays the shape and spread of continuous sample data. The major difference from a bar chart is that a histogram is only used to plot the frequency of score occurrences in a continuous data set that has been divided into classes, called bins. Bar charts, on the other hand, can be used for many other types of variables, including ordinal and nominal data sets.

Results from MATLAB are given below. First, Fig. 1 shows the random weights of the students. Second, Fig. 2 shows the number of students. Fig. 3 shows the graph of the students' weights (x-axis) against the number of students (y-axis). Fig. 4 shows the MATLAB code for the histograms of the X and Y parameters.

Fig. 1. Histogram of data X
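The binning idea behind a histogram can be illustrated with a minimal Python sketch. The `histogram_counts` helper, the bin edges and the sample weights are assumptions made for illustration only; they are not the survey data and not the MATLAB code referred to in Fig. 4.

```python
from bisect import bisect_right


def histogram_counts(values, edges):
    """Count how many values fall in each bin [edges[i], edges[i+1]).

    The last bin also includes its right edge, which is the usual
    histogram convention.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # value lies outside the binned range
        i = bisect_right(edges, v) - 1
        if i == len(counts):  # v equals the last edge
            i -= 1
        counts[i] += 1
    return counts


# Hypothetical weights, for illustration only (not the survey data).
weights = [130, 134, 138, 141, 145, 147, 150, 150, 152, 155, 158, 163, 170]
edges = [130, 140, 150, 160, 170]

counts = histogram_counts(weights, edges)

# Text rendering: one row per bin, one '#' per observation.
for lo, hi, c in zip(edges, edges[1:], counts):
    print(f"{lo}-{hi}: {'#' * c} ({c})")
```

Each printed row corresponds to one bar of the histogram, and a longer row means that more values fall in that bin.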



Fig. 2. Histogram of data Y

Fig. 3. Graph between data X and data Y

Fig. 4. MATLAB code for the histograms of the X and Y parameters

B. Normal Distribution
The normal distribution is a probability distribution that is symmetric about the mean, showing that data near the mean occur more frequently than data far from the mean. In graph form, the normal distribution appears as a bell curve. Its probability density function is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

The parameter $\mu$ is the mean or expectation of the distribution and $\sigma$ is its standard deviation. The variance of the distribution is $\sigma^2$.

The arithmetic mean is the sum of the sampled values divided by the number of items in the sample.

Variance is the expectation of the squared deviation of a random variable from its mean.
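As a minimal sketch of these two definitions, the Python snippet below computes the arithmetic mean and the variance of a small sample. The sample values and the `ddof` parameter name are illustrative assumptions: `ddof=0` gives the variance as defined above, while `ddof=1` gives the unbiased sample variance that is often reported for survey data.

```python
import statistics


def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def variance(values, ddof=0):
    """Average squared deviation from the mean.

    ddof=0 divides by n (population variance);
    ddof=1 divides by n - 1 (unbiased sample variance).
    """
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - ddof)


# Hypothetical sample, for illustration only (not the survey data).
sample = [130, 140, 150, 160, 170]

sample_mean = mean(sample)
population_variance = variance(sample)
sample_variance = variance(sample, ddof=1)

# Cross-check against the Python standard library.
assert abs(population_variance - statistics.pvariance(sample)) < 1e-9
assert abs(sample_variance - statistics.variance(sample)) < 1e-9

print(sample_mean, population_variance, sample_variance)  # prints: 150.0 200.0 250.0
```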

In probability theory, a normal distribution, also known as the Gaussian distribution, is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is determined by two parameters: the mean (or expectation) $\mu$ and the standard deviation $\sigma$. A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate.

Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. Their importance is partly due to the central limit theorem, which states that, under some conditions, the average of many samples (observations) of a random variable with finite mean and variance is itself a random variable whose distribution converges to a normal distribution as the number of samples increases. Therefore, physical quantities that are expected to be the sum of many independent processes (such as measurement errors) often have distributions that are nearly normal.

Moreover, Gaussian distributions have some unique properties that are valuable in analytic studies. For instance, any linear combination of a fixed collection of independent normal deviates is a normal deviate. Many results and methods (such as propagation of uncertainty and least squares parameter fitting) can be derived analytically in explicit form when the relevant variables are normally distributed.

The normal distribution is the only distribution whose cumulants beyond the first two (i.e., other than the mean and variance) are zero. It is also the continuous distribution with the maximum entropy for a specified mean and variance. Assuming that the mean and variance are finite, the normal distribution is the only distribution where the mean and variance calculated from a set of independent draws are independent of each other.

The normal distribution is a subclass of the elliptical distributions. It is symmetric about its mean and is non-zero over the entire real line. As such, it may not be a suitable model for variables that are inherently positive or strongly skewed, such as the weight of a person or the price of a share. Such variables may be better described by other distributions, such as the log-normal distribution or the Pareto distribution.

The density of the normal distribution is practically zero when the value lies more than a few standard deviations away from the mean. Therefore, it may not be an appropriate model when one expects a significant fraction of outliers (values that lie many standard deviations away from the mean), and least squares and other statistical inference methods that are optimal for normally distributed variables often become highly unreliable when applied to such data. In those cases, a more heavy-tailed distribution should be assumed and the appropriate robust statistical inference methods applied.

The Gaussian distribution belongs to the family of stable distributions, which are the attractors of sums of independent, identically distributed distributions whether or not the mean or variance is finite. Except for the Gaussian, which is a limiting case, all stable distributions have heavy tails and infinite variance.

In our data set, the mean and standard deviation of the students' weights are 149.5 kg and 11.5 kg. The minimum weight is 130 kg and the maximum weight is 170 kg.

Fig. 5 gives the MATLAB result for the normal distribution. Fig. 6 shows the MATLAB code for the normal distribution. Fig. 7 gives the MATLAB output (mean, variance).

Fig. 5. Normal Distribution
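To make the reported parameters concrete, the following Python sketch evaluates a normal model with mean 149.5 and standard deviation 11.5, the values reported for the weights. It assumes that the weights are well described by this normal model, which the text does not verify. `normal_pdf` is an illustrative helper, and `statistics.NormalDist` is the normal distribution class from the Python standard library.

```python
import math
from statistics import NormalDist

# Mean and standard deviation reported for the students' weights.
MU, SIGMA = 149.5, 11.5


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


dist = NormalDist(mu=MU, sigma=SIGMA)

# Standardized distance of the reported minimum and maximum from the mean.
z_min = (130 - MU) / SIGMA
z_max = (170 - MU) / SIGMA

# Probability mass that the fitted model places between 130 and 170.
p_in_range = dist.cdf(170) - dist.cdf(130)

print(f"density at the mean: {normal_pdf(MU, MU, SIGMA):.5f}")
print(f"z-score of minimum:  {z_min:.3f}")
print(f"z-score of maximum:  {z_max:.3f}")
print(f"P(130 <= X <= 170):  {p_in_range:.4f}")
```

Under this model both extremes lie within two standard deviations of the mean, which is consistent with a sample of only 25 observations.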

Fig. 6. MATLAB code for the normal distribution

Fig. 7. MATLAB output (mean, variance)

C. Linear Regression
In statistics, linear regression is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.

The goal is to find the linear relation that best matches a set of data pairs, in the sense that it minimizes the sum of the squares of the discrepancies between the model and the data.

Linear regression is a methodology for building a model of the relation between two or more variables of interest on the basis of available data. An interesting feature of this methodology is that it may be explained and developed simply as a least squares approximation procedure, without any probabilistic assumptions. Yet, the linear regression formulas may also be interpreted in the context of various probabilistic frameworks.

The linear regression estimates can be viewed as maximum likelihood (ML) estimates within a suitable linear or normal context. In fact, they can be shown to be unbiased estimates in this context. Furthermore, the variances of the estimates can be calculated using convenient formulas and then used to construct confidence intervals.

Using the linear regression formulas, we find the equation of the straight line that best fits the data, together with the error between the data points and that line.

Fig. 8 shows the linear regression graph in MATLAB. Fig. 9 shows the linear regression code in MATLAB.
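A minimal Python sketch of simple linear regression by least squares follows. It uses the standard closed-form estimates for the intercept and slope, named `theta0` and `theta1` here as an assumption about notation. The data pairs are hypothetical, and the code is an illustration rather than the MATLAB implementation referred to in the figures.

```python
def fit_line(xs, ys):
    """Least squares estimates (theta0, theta1) for y ~ theta0 + theta1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sample covariance of x and y divided by sample variance of x.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    theta1 = s_xy / s_xx
    # Intercept: the fitted line passes through the point (x_bar, y_bar).
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1


def mean_squared_error(xs, ys, theta0, theta1):
    """Average squared residual of the line theta0 + theta1 * x."""
    residuals = [y - (theta0 + theta1 * x) for x, y in zip(xs, ys)]
    return sum(r * r for r in residuals) / len(residuals)


# Hypothetical data pairs, for illustration only (not the survey data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

theta0, theta1 = fit_line(xs, ys)
mse = mean_squared_error(xs, ys, theta0, theta1)

print(f"fitted line: y = {theta0:.2f} + {theta1:.2f} x")
print(f"mean squared error: {mse:.4f}")
```

Any other choice of intercept or slope gives a larger mean squared error on the same data, which is the sense in which the fitted line is the best linear match.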

Fig. 8. Linear regression

Fig. 8. Linear regression code in MATLAB

D. Hypothesis Test
Hypothesis testing is used to infer, from sample data, whether a hypothesis about a larger population holds. The test tells the analyst whether or not the primary hypothesis is supported by the data. Statistical analysts test a hypothesis by measuring and examining a random sample of the population being analyzed. Hypothesis testing is one of the main categories of inference problems (parameter estimation, hypothesis testing, and significance testing).

In hypothesis testing, the unknown parameter takes one of a finite number m of values, corresponding to competing hypotheses; we want to choose one of the hypotheses, aiming to achieve a small probability of error under any of the possible hypotheses. In parameter estimation, by contrast, we want to generate estimates that are close to the true values of the parameters in some probabilistic sense.

Hypothesis testing problems encountered in realistic settings do not always involve two well-specified alternatives. Classical hypothesis testing methods aim at small error probabilities, combined with simplicity and convenience of calculation. We have focused on tests that reject the null hypothesis when the observations fall within a simple type of rejection region. The likelihood ratio test is the primary approach for the case of two competing simple hypotheses, and derives strong theoretical support from the Neyman-Pearson Lemma. We also addressed significance testing, which applies when one (or both) of the competing hypotheses is composite. The main approach here involves a suitably chosen statistic that summarizes the observations, and a rejection region whose probability under the null hypothesis is set to a desired significance level.

We apply a hypothesis test to our data set. Suppose we claim that 10% of the students weigh 150 kg. In a sample of 25 students, 6 students weigh 150 kg. Let P be the proportion of students who weigh 150 kg.
Step 1: Null and alternative hypotheses
H0: P = 0.1
Ha: P ≠ 0.1
Step 2: Find Phat
Phat = 6/25 = 0.24
Step 3: Observed value

Z = (Phat - P0)/sqrt(P0(1 - P0)/n) = (0.24 - 0.1)/sqrt(0.1*0.9/25) = 0.14/0.06 = 2.3333, where P0 = 0.1 is the proportion under H0 and n = 25 is the sample size
Step 4: Set Alpha = 0.05, then find the P value (two-sided, because Ha is P ≠ 0.1)
P = 2*normcdf(-2.3333)
P = 0.0196
Step 5: Significance testing
P < Alpha
0.0196 < 0.05
So H0 is rejected, because the P value is smaller than Alpha.

Fig. 9 shows the hypothesis test code in MATLAB. Fig. 10 shows the output results in MATLAB.

Fig. 9. Hypothesis test code in MATLAB

Fig. 10. Output results in MATLAB

E. Least Mean Square Error
A least mean square error (LMSE) estimator is an estimation method that minimizes the mean square error (MSE), a common measure of estimator quality, of the fitted values of a dependent variable.

It works by making the total of the squares of the errors as small as possible (that is why it is called "least squares"). The fitted straight line minimizes the sum of squared errors: when we square each of the errors and add them all up, the total is as small as possible. The performance of a linear regression is measured with the MSE. When we start a regression analysis, we want to maximize its performance, so we create an objective function that captures that performance. Maximizing performance in linear regression means minimizing the MSE, so we try to find regression coefficients β that minimize the MSE.

As with linear regression, this methodology may be explained and developed simply as a least squares approximation procedure, without any probabilistic assumptions. Yet, the linear regression formulas may also be interpreted in the context of various probabilistic frameworks, which provide perspective and a quantitative analysis. The linear least squares approach aims at finding the best possible linear model, and involves an implicit hypothesis that a linear model is valid. In practice, there is often an additional phase where we examine whether the hypothesis of a linear model is supported by the data and try to validate the estimated model.

From the formulas for the linear regression estimates $\theta_0$ and $\theta_1$, we observe that once the data are given, the sum of the squared residuals is a quadratic function of $\theta_0$ and $\theta_1$. To perform the minimization, we set to zero the partial derivatives with respect to $\theta_0$ and $\theta_1$. We obtain two linear equations in $\theta_0$ and $\theta_1$, which can be solved explicitly.

III. CONCLUSION
This paper analyzes a student survey data set by applying several statistical methods. The distribution of the data set is calculated using the normal distribution, linear regression and hypothesis testing. Further work covers LMS error estimation, the ROC graph and the precision of the calculated data.

REFERENCES
[1] D. P. Bertsekas and J. N. Tsitsiklis, Introduction to Probability, 2nd ed., 2008.

[2] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, Eds., Markov Chain Monte Carlo in Practice. London, UK: Chapman and Hall, 1996.

[3] D. Lowd and J. Davis, "Learning Markov Network Structure with Decision Trees," IEEE International Conference on Data Mining, 2010.
