What Is A Data Set?
▶ A data set is an organized collection of data. Data sets are generally associated with a unique
body of work and typically cover one topic at a time.
▶ Data sets describe the values of each variable for quantities such as height, weight,
temperature, volume, etc., of an object, or the values of random numbers.
▶ A data set consists of data for one or more members, corresponding to each row.
▶ Data sets can be written as a list of numbers in a random order, as a table, or with curly
brackets around them. Data sets are normally labelled so you understand what the data
represents.
Numerical
■ A numerical data set is one in which all the data are numbers.
■ You can also refer to this type as a quantitative data set, as the
numerical values can be used in mathematical calculations when
necessary.
■ Many financial analysis processes also rely on numerical data sets, as
the values in the set can represent numbers in dollar amounts.
■ Ex: The number of cards in a deck.
Bivariate
o A data set with just two variables is a bivariate data set.
o In this type of data set, data scientists look at the relationship between the
two variables.
o Therefore, these data sets typically have two types of related data.
o For example, a data set containing the weight and running speed of a track
team represents two separate variables, where you can look for a
relationship between the two.
Multivariate
Unlike a bivariate data set, a multivariate data set contains more than two
variables.
For example, the height, width, length and weight of a package you ship through
the mail require more than two variable inputs to create a data set. Since each
value is unique, you can use a different variable to represent each one. For the
dimensions of the example package, the values for each measurement
represent the variables.
Categorical
Categorical data sets contain information relating to the characteristics of a person
or object.
Data scientists also refer to categorical data sets as qualitative data sets because
they contain information relating to the qualities of an object.
There are two types of categorical data sets: dichotomous and polytomous.
In a dichotomous data set, each variable can only have one of two values. For
example, a data set containing answers to true and false questions is dichotomous
because it only supplies one result or the other.
In a polytomous data set, there can be more than two possible values for each
variable. For example, a data set containing a person's eye color can give you
multiple results.
Correlation
▶ When there is a relationship between variables within a data set, it becomes a
correlation data set. This means that the values depend on one another to exhibit
change.
▶ Correlation can either be positive, negative or zero.
▶ In positive correlations, the related variables move in the same direction, whereas a
negative correlation shows variables moving in opposing directions. A zero
correlation shows no relationship.
Correlation coefficient
You can use the following equation to calculate correlation:
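One standard formula (assuming the Pearson correlation coefficient is intended here) is:

```latex
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}
```

Values of r near +1 indicate a positive correlation, values near −1 a negative correlation, and values near 0 no linear relationship.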
Having information stored in a data set often makes it easier to perform math
operations and analysis.
The mean of a dataset is the average of all the observations present in the table. It is the
ratio of the sum of observations to the total number of elements present in the data
set. The formula for the mean is x̄ = (Σxᵢ)/n.
The median of a dataset is the middle value of the collection of data when arranged in
ascending or descending order.
Range of a dataset is the difference between the maximum value and minimum value.
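The three measures above can be sketched with the Python standard library (the data values below are illustrative):

```python
# Mean, median and range of a small example data set,
# using only the Python standard library.
from statistics import mean, median

data = [4, 8, 15, 16, 23, 42]  # illustrative data set

data_mean = mean(data)               # sum of observations / number of observations
data_median = median(data)           # middle value after sorting
data_range = max(data) - min(data)   # maximum minus minimum

print(data_mean, data_median, data_range)  # mean 18, median 15.5, range 38
```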
Properties of Dataset
Before performing any statistical analysis, it is essential to understand the nature of
the data. We can use different Exploratory Data Analysis (EDA) techniques, which
help to identify the properties of the data, so that the appropriate statistical methods
can be applied to the data:
• Centre of data
• Skewness of data
• Spread among the data members
• Presence of outliers
• Correlation among the data
• Type of probability distribution that the data follows
P(E) = n(E)/n(S)
Here, n(E) is the number of outcomes favourable to event E and n(S) is the total
number of outcomes in the sample space S.
Dependent Events
• For events to be considered dependent, one must have an influence over how
probable another is.
• In other words, a dependent event can only occur if another event occurs first.
• The primary focus when analyzing dependent events is probability. The
occurrence of one event exerts an effect on the probability of another event.
• Thus, if whether one event occurs affects the probability that the other
event will occur, then the two events are said to be dependent.
Example: Drawing two cards from a deck without replacement — the probability of the
second card depends on which card was drawn first.
Independent Events
o An event is deemed independent when it is not connected to another event or to that
event’s probability of happening (or, conversely, of not happening).
o Independent events don’t influence one another or have any effect on how probable
another event is.
o If the probability of occurrence of an event A is not affected by the occurrence of
another event B, then A and B are said to be independent events.
Examples:
Tossing a coin
Here, Sample Space S = {H, T}, and both H and T are independent
events.
■ If the probabilities of events A and B are P(A) and P(B) respectively, then the
conditional probability of B given that A has already occurred is P(B|A). Given
P(A)>0,
we know that
P(A∩B) = P(B∩A) = P(B|A)P(A),
which implies
P(A|B) = P(B|A)P(A)/P(B)
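A quick way to sanity-check the identity above is brute-force enumeration; the two dice events below are illustrative choices:

```python
# Checking P(A|B) = P(B|A) * P(A) / P(B) by enumerating two fair dice.
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # sample space S, 36 outcomes
A = {o for o in outcomes if sum(o) == 8}          # event A: the dice sum to 8
B = {o for o in outcomes if o[0] == 3}            # event B: first die shows 3

P = lambda E: Fraction(len(E), len(outcomes))     # exact probability of an event

lhs = P(A & B) / P(B)                   # P(A|B) computed directly
rhs = (P(A & B) / P(A)) * P(A) / P(B)   # P(B|A) * P(A) / P(B)
print(lhs == rhs, lhs)                  # True 1/6
```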
Bayes’ Theorem
Bayes’ theorem describes the probability of occurrence of an event related to any condition. It is
also considered for the case of conditional probability. Bayes theorem is also known as the formula
for the probability of “causes”.
Let E1, E2,…, En be a set of events associated with a sample space S, where all the
events E1, E2,…, En have nonzero probability of occurrence and they form a partition of
S. Let A be any event associated with S; then according to Bayes’ theorem,
P(Ei|A) = P(A|Ei)P(Ei) / Σₖ P(Ek)P(A|Ek)
To prove the Bayes Theorem, we will use the total probability and conditional
probability formulas. The total probability of an event A is calculated when not enough
data is known about event A, then we use other events related to event A to determine
its probability. Conditional probability is the probability of event A given that other
related events have already occurred.
Let E1, E2, …, En be a partition of the sample space S, and let A be an event that has
occurred. Let us express A in terms of the Ei:
A = A ∩ S
= A ∩ (E1 ∪ E2 ∪ E3 ∪ ... ∪ En)
We know that when A and B are disjoint sets, then P(A ∪ B) = P(A) + P(B).
Thus here, P(A) = P(A ∩ E1) + P(A ∩ E2) + P(A ∩ E3) + ... + P(A ∩ En)
P(Ei|A) = P(A|Ei)P(Ei) / Σⁿₖ₌₁ P(Ek)P(A|Ek), i = 1, 2, 3, ..., n
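As a numeric sketch of Bayes’ theorem (the factory numbers below are hypothetical):

```python
# Two machines partition production: E1 makes 60% of items with a 1% defect
# rate, E2 makes 40% with a 3% defect rate. Given a defective item (event A),
# what is the probability it came from E1?
priors = {"E1": 0.60, "E2": 0.40}        # P(Ei), a partition of S
defect_rate = {"E1": 0.01, "E2": 0.03}   # P(A|Ei)

# Total probability: P(A) = sum over i of P(A|Ei) * P(Ei)
p_A = sum(defect_rate[e] * priors[e] for e in priors)

# Bayes' theorem: P(E1|A) = P(A|E1) * P(E1) / P(A)
p_E1_given_A = defect_rate["E1"] * priors["E1"] / p_A
print(round(p_E1_given_A, 3))  # 0.333
```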
Sample Space
A sample space is a collection or a set of possible outcomes of a random
experiment.
The sample space is represented using the symbol, “S”.
A subset of possible outcomes of an experiment is called an event.
A sample space may contain a number of outcomes that depends on the
experiment. If it contains a finite number of outcomes, then it is known as a
discrete or finite sample space.
The sample space for a random experiment is written within curly braces “ { }
“
Tossing a Coin
When flipping a coin, two outcomes are possible: head and tail. Therefore the
sample space for this experiment is given as S = {H, T}.
Discrete Probability:
If the sample space consists of a finite number of possible outcomes, then the
probability law is specified by the probabilities of the events that consist of a single
element. In particular, the probability of any event {s1, s2, ..., sn} is the sum of the
probabilities of its elements: P({s1, s2, ..., sn}) = P(s1) + P(s2) + ... + P(sn).
Continuous Probability:
Probabilistic models with continuous sample spaces differ from their discrete
counterparts in that the probabilities of the single-element events may not be
sufficient to characterize the probability law.
Random variables
■ A random variable is a rule that assigns a numerical value to each outcome in a
sample space
■ Random variables may be either discrete or continuous. A random variable is
said to be discrete if it assumes only specified values in an interval. Otherwise,
it is continuous.
■ We generally denote random variables with capital letters such as X and Y.
When X takes values 1, 2, 3, …, it is said to be a discrete random variable.
As discussed in the introduction, there are two types of random variables:
A discrete random variable is a variable that can take on a countable number of distinct
values. A probability distribution is used to determine what values a random variable
can take and how often it takes on these values.
Discrete: Can take on only a countable number of distinct values like 0, 1, 2, 3, 50, 100,
etc.
A random variable that can take on an infinite number of possible values is known as a
continuous random variable. Such a variable is defined over an interval of values
rather than a specific value.
Continuous: Can take on an infinite number of possible values like 0.03, 1.2374553,
etc.
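The discrete case can be sketched concretely (the two-coin-flip setup is an illustrative assumption):

```python
# A discrete random variable: X = number of heads in two coin flips.
# The rule maps each outcome in the sample space to a number.
from itertools import product
from fractions import Fraction
from collections import Counter

sample_space = list(product("HT", repeat=2))   # {HH, HT, TH, TT}
X = lambda outcome: outcome.count("H")         # the random variable

# Probability distribution of X over equally likely outcomes
counts = Counter(X(o) for o in sample_space)
pmf = {x: Fraction(c, len(sample_space)) for x, c in counts.items()}
print(pmf)  # X takes values 0, 1, 2 with probabilities 1/4, 1/2, 1/4
```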
What is Population?
▶ In statistics, population is the entire set of items from which you draw data for a
statistical study. It can be a group of individuals, a set of items, etc.
▶ Generally, population refers to the people who live in a particular area at a
specific time. But in statistics, population refers to data on your study of
interest. It can be a group of individuals, objects, events, organizations, etc.
What is a Sample?
A sample is a subset of the population: the group of individuals or items actually
selected from the population for a statistical study.
Hypothesis Testing
Hypothesis testing is a type of statistical analysis in which you put your assumptions
about a population parameter to the test. It is used to estimate the relationship
between two statistical variables.
▶ The Null Hypothesis is the assumption that the event will not occur. A null
hypothesis has no bearing on the study's outcome unless it is rejected. H0 is the
symbol for it, and it is pronounced H-naught.
▶ The Alternate Hypothesis is the logical opposite of the null hypothesis. The
acceptance of the alternative hypothesis follows the rejection of the null
hypothesis. H1 is the symbol for it.
The One-Tailed test, also called a directional test, considers a critical region of
data that would result in the null hypothesis being rejected if the test sample
falls into it, inevitably meaning the acceptance of the alternate hypothesis.
In a one-tailed test, the critical distribution area is one-sided, meaning the test
sample is either greater or lesser than a specific value.
In two tails, the test sample is checked to be greater or less than a range of
values in a Two-Tailed test, implying that the critical distribution area is
two-sided.
If the sample falls within this range, the alternate hypothesis will be accepted,
and the null hypothesis will be rejected.
Example:
Suppose the null hypothesis is H0: μ = 50 and the alternate hypothesis is H1: μ ≠ 50.
According to H1, the mean can be greater than or less than 50. This is an example
of a Two-tailed test.
P-Value
A p-value threshold of 0.05 is known as the level of significance (α). Usually it is
applied using the two rules of thumb given below:
If p-value > 0.05: the large p-value shows that the null hypothesis is not
rejected.
If p-value < 0.05: the small p-value shows that the null hypothesis needs to be
rejected, and the result is declared statistically significant.
Type 1 Error: A Type-I error occurs when the sample results reject the null
hypothesis despite it being true.
Type 2 Error: A Type-II error occurs when the null hypothesis is not rejected
when it is false, unlike a Type-I error.
Example:
Type I error will be the teacher failing the student [rejects H0] although the
student scored the passing marks [H0 was true].
Type II error will be the case where the teacher passes the student [do not reject
H0] although the student did not score the passing marks [H1 is true].
What Is P-Hacking?
▶ P-hacking is the act of misusing data analysis to show that patterns in data are
statistically significant, when in reality they are not.
▶ P-value hacking, also known as data dredging, data fishing, data snooping or data
butchery, is an exploitation of data analysis in order to discover patterns which
would be presented as statistically significant, when in reality, there is no underlying
effect.
▶ For example: stopping the collection of data once you get P < 0.05, analyzing
many outcomes but only reporting those with P < 0.05, using covariates, excluding
participants, etc.
What is Cost-function?
• A cost function is used to measure just how wrong the model is in finding a
relation between the input and output. It tells you how badly your model is
behaving/predicting.
o Gradient Descent is an algorithm that is used to optimize the cost function or
the error of the model. It is used to find the minimum value of error possible in
your model.
o Gradient Descent can be thought of as the direction you have to take to reach
the least possible error. The error in your model can be different at different
points, and you have to find the quickest way to minimize it, to prevent resource
wastage.
There are three popular types of gradient descent that mainly differ in the amount of
data they use:
• Batch gradient descent, also called vanilla gradient descent, calculates the
error for each example within the training dataset, but only after all training
examples have been evaluated does the model get updated.
• This whole process is like a cycle, and it’s called a training epoch.
o By contrast, stochastic gradient descent (SGD) does this for each training
example within the dataset, meaning it updates the parameters for each
training example one by one.
o Depending on the problem, this can make SGD faster than batch gradient
descent.
o One advantage is that the frequent updates allow us to have a pretty detailed
rate of improvement.
■ Mini-batch gradient descent is the go-to method since it’s a combination of the
concepts of Stochastic Gradient Descent and batch gradient descent.
■ It simply splits the training dataset into small batches and performs an update
for each of those batches.
■ This creates a balance between the robustness of stochastic gradient descent
and the efficiency of batch gradient descent.
Gradient Descent
Gradient Descent is an optimization algorithm used in machine/deep learning. The goal
of Gradient Descent is to minimize the objective convex function f(x) using iteration.
The best way to find the local minimum or local maximum of a function using
gradient descent is as follows:
To find the gradient of the function with respect to the x dimension, take the
derivative of the function with respect to x, then substitute the x-coordinate of the
point of interest for x in the derivative. Once the gradient of the function at
any point is calculated, the descent direction is obtained by multiplying the
gradient by -1. Here are the steps for finding the minimum of the function using gradient
descent:
• Calculate the gradient by taking the derivative of the function with respect to the specific
parameter. In case, there are multiple parameters, take the partial derivatives with respect
to different parameters.
• Calculate the descent value for different parameters by multiplying the value of the
derivatives by the learning or descent rate (step size) and by -1.
• Update the value of the parameter by adding the existing value of the parameter and the
descent value. The parameter thus moves in the direction opposite to the
gradient, while taking small steps.
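The steps above can be sketched for a single parameter; the function f(x) = (x − 3)² and the learning rate below are illustrative choices:

```python
# Minimal gradient descent on f(x) = (x - 3)^2, whose derivative is 2*(x - 3).
def gradient_descent(grad, x0, learning_rate=0.1, steps=200):
    x = x0
    for _ in range(steps):
        descent = -learning_rate * grad(x)  # multiply the gradient by -1 and the step size
        x = x + descent                     # update the parameter
    return x

grad_f = lambda x: 2 * (x - 3)   # derivative of (x - 3)^2
x_min = gradient_descent(grad_f, x0=0.0)
print(round(x_min, 4))  # 3.0, the minimum of f
```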
The central limit theorem states that whenever a random sample of size n is
taken from any distribution with mean μ and variance σ², the sample mean will be
approximately normally distributed with mean μ and variance σ²/n. The larger the
sample size, the better the approximation to the normal.
Formula
If x̄ is the sample mean, μ the population mean, σ the population standard deviation
and n the sample size, then
z = (x̄ − μ) / (σ/√n)
1) The information about the mean, population size, standard deviation, sample size and
a number associated with “greater than”, “less than”, or two numbers
associated with both values for a range of “between” is identified from the problem.
2) A graph with the mean as its centre is drawn.
3) The z-score is calculated using the formula z = (x̄ − μ)/(σ/√n).
4) The z-table is referred to for the ‘z’ value obtained in the previous step.
5) The last step is common to all three cases: convert the decimal obtained into
a percentage.
Example 1:
20 students are selected at random from a clinical psychology class; find the
probability that their mean GPA is more than 5, if the average GPA scored by the entire
batch is 4.91 and the standard deviation is 0.72.
Solution:
Here, μ = 4.91, σ = 0.72, n = 20, x̄ = 5.
Since the sample size is smaller than 30, use the t-score instead of the z-score, even
though the population standard deviation is known.
First, find the standard error:
σ/√n = 0.72/√20 = 0.161
Now, find the t-score:
t = (x̄ − μ)/(σ/√n) = (5 − 4.91)/0.161 = 0.559
Find the probability for the t value using the t-score table. The degrees of freedom here
would be:
Df = 20 – 1 = 19
P (t ≤ 0.559) = 0.7087
Therefore, P(x̄ > 5) = 1 − 0.7087 = 0.2913, i.e. about a 29.13% chance.
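The arithmetic in the solution can be checked in a few lines (the table value 0.7087 comes from a t-table and is not recomputed here):

```python
# Standard error and t-score for Example 1.
import math

mu, xbar, sigma, n = 4.91, 5.0, 0.72, 20
se = sigma / math.sqrt(n)   # standard error of the mean
t = (xbar - mu) / se        # t-score for the sample mean
print(round(se, 3), round(t, 3))  # 0.161 0.559
```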
1. Bernoulli’s Distribution
This is one of the simplest distributions that can be used as an initial point to
derive more complex distributions. Bernoulli’s distribution has possibly two
outcomes (success or failure) and a single trial.
• For multiple trials, each trial is independent of the others, i.e., the result
of one trial cannot influence the other trials.
• Each trial can have two possible outcomes, either success or failure,
with probabilities p and (1 − p).
• A total of n identical trials can be conducted, and the probability of
success and failure is the same for all trials.
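Combining n such independent trials gives the binomial formula P(X = k) = C(n, k)·pᵏ·(1 − p)ⁿ⁻ᵏ; a minimal sketch (the coin-flip numbers are illustrative):

```python
# Binomial probability built from n independent Bernoulli trials.
from math import comb

def binomial_pmf(n, k, p):
    # C(n, k) ways to place k successes, each with probability p^k * (1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Example: probability of exactly 2 heads in 3 fair coin flips
print(binomial_pmf(3, 2, 0.5))  # 0.375
```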
According to the formula, the distribution is said to be the standard normal if mean (μ) = 0 and
standard deviation (σ) = 1.
The graph of normal distribution is shown below which is symmetric about the
centre (mean).
The normal distribution has the following properties:
• The curve is symmetric about the mean, and the mean, median and mode are all equal.
• The total area under the curve is 1.
• About 68% of values lie within one standard deviation of the mean, about 95% within two, and about 99.7% within three.
4. Poisson Distribution
The probability of observing x events in a fixed period is P(X = x) = (𝝺ˣ e⁻𝝺) / x!
where 𝝺 represents the possible number of events taking place in a fixed period of
time, and X is the number of events in that time period.
• The events are independent of each other, i.e, if an event occurs, it doesn’t
affect the probability of another event occurring.
• An event could occur any number of times in a defined period of time.
• No two events can occur at exactly the same instant.
• The average rate of events to take place is constant.
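The Poisson formula above can be evaluated directly (the rate λ = 2 is an illustrative choice):

```python
# Poisson probability P(X = x) = (lambda^x * e^(-lambda)) / x!
from math import exp, factorial

def poisson_pmf(lam, x):
    return (lam**x) * exp(-lam) / factorial(x)

print(round(poisson_pmf(2, 0), 4))  # 0.1353  (no events in the period)
print(round(poisson_pmf(2, 2), 4))  # 0.2707  (exactly two events)
```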
5. Exponential Distribution
Like the Poisson distribution, exponential distribution has the time element; it gives
the probability of time duration before an event takes place.
Exponential distribution is used for survival analysis, for example, life of an air
conditioner, expected life of a machine, and length of time between metro arrivals.
A variable X is said to possess an exponential distribution when
f(x) = λe⁻λˣ, x ≥ 0
where λ stands for the rate and always has a value greater than zero.
The exponential distribution has the following characteristics;
• The higher the rate, the faster the curve drops, and the lower the
rate, the flatter the curve.
• In survival analysis, λ is termed as a failure rate of a machine at any time t with
the assumption that the machine will survive up to t time.
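A minimal sketch of the density and the survival probability P(T > t) = e⁻λᵗ (the rate value below is an assumption):

```python
# Exponential density f(t) = lambda * e^(-lambda * t) and survival function.
from math import exp

lam = 0.5  # failure rate per unit time (illustrative)

pdf = lambda t: lam * exp(-lam * t)
survival = lambda t: exp(-lam * t)   # probability the machine survives past time t

print(round(survival(2), 4))  # 0.3679, i.e. e^(-1)
```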
6. Multinomial Distribution
In a very popular Mendel experiment, two strains of peas (one with green and
wrinkled seeds, the other with yellow and smooth seeds) were hybridized,
producing four different strains of seeds: green and wrinkled, green and round,
yellow and round, and yellow and wrinkled. This resulted in a multinomial
distribution and led to the discovery of the basic principles of genetics.
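The probabilities in such an experiment follow the multinomial formula; as a sketch (the standard form, with k categories over n trials):

```latex
P(X_1 = n_1, \ldots, X_k = n_k)
  = \frac{n!}{n_1!\, n_2! \cdots n_k!}\; p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k},
\qquad \sum_{i} n_i = n,\ \sum_{i} p_i = 1
```

Here pᵢ is the probability of category i on a single trial and nᵢ is the observed count for that category.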