What Is A Data Set?


▶ A data set is an organized collection of data. It is generally associated with a unique
body of work and typically covers one topic at a time.
▶ Data sets record the values of each variable for quantities such as height, weight,
temperature, volume, etc., of an object, or the values of random numbers.
▶ A data set consists of the data of one or more members, one per row.
▶ Data sets can be written as a list of numbers, as a table, or within curly
brackets. Data sets are normally labelled so you understand what the data
represent.

What are the types of data sets?


There are several types of data sets. What determines the type of the data set is
the information within it.

Numerical

■ A numerical data set is one in which all the data are numbers.
■ You can also refer to this type as a quantitative data set, as the
numerical values can apply to mathematical calculations when
necessary.
■ Many financial analysis processes also rely on numerical data sets, as
the values in the set can represent numbers in dollar amounts.
■ Ex: The number of cards in a deck.

Bivariate
o A data set with just two variables is a bivariate data set.
o In this type of data set, data scientists look at the relationship between the
two variables.
o Therefore, these data sets typically have two types of related data.
o For example, a data set containing the weight and running speed of a track
team represents two separate variables, where you can look for a
relationship between the two.

Multivariate
Unlike a bivariate data set, a multivariate data set contains more than two
variables.
For example, the height, width, length and weight of a package you ship through
the mail requires more than two variable inputs to create a data set.Since each
value is unique, you can use different variables to represent each one. For the
dimensions of the example package, the values for each measurement
represent the variables.
Categorical
Categorical data sets contain information relating to the characteristics of a person
or object.
Data scientists also refer to categorical data sets as qualitative data sets because
they contain information relating to the qualities of an object.
There are two types of categorical data sets: dichotomous and polytomous.
In a dichotomous data set, each variable can only have one of two values. For
example, a data set containing answers to true and false questions is dichotomous
because it only supplies one result or the other.
In a polytomous data set, there can be more than two possible values for each
variable. For example, a data set containing a person's eye color can give you
multiple results.

Correlation
▶ When there is a relationship between variables within a data set, it becomes a
correlation data set. This means that the values depend on one another to exhibit
change.
▶ Correlation can either be positive, negative or zero.
▶ In positive correlations, the related variables move in the same direction, whereas a
negative correlation shows variables moving in opposing directions. A zero
correlation shows no relationship.

Correlation coefficient
You can use the following equation to calculate the correlation coefficient r:

r = ∑(xᵢ − x̄)(yᵢ − ȳ) / √( ∑(xᵢ − x̄)² · ∑(yᵢ − ȳ)² )

When calculating a correlation, keep in mind the following representations:

xᵢ = the i-th value of x

yᵢ = the i-th value of y

x̄ = the mean of the x-values

ȳ = the mean of the y-values
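As a rough illustration (not part of the original notes), the formula above can be computed directly in Python; the function name `pearson_r` and the sample lists are invented for the example:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly positive correlation
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # perfectly negative correlation
```

A result of +1 or −1 indicates a perfect positive or negative linear relationship, while a value near 0 indicates no linear relationship.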

What techniques can be used to represent data sets?

Having information stored in a data set often makes it easier to perform math
operations and analysis.
Mean of a dataset is the average of all the observations present in the table. It is the
ratio of the sum of observations to the total number of elements present in the data
set. The formula of mean is given by;

Mean = Sum of Observations / Total Number of Elements in Data Set

Median of a dataset is the middle value of the collection of data when arranged in
ascending or descending order.

Mode of a dataset is the value which is repeated the maximum number of times in
the set.

Range of a dataset is the difference between the maximum value and minimum value.

Range = Maximum Value – Minimum Value
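All four measures above can be computed with Python's standard `statistics` module; the sample data below is invented for illustration:

```python
import statistics

data = [4, 8, 15, 16, 23, 42, 8]

mean = statistics.mean(data)        # sum of observations / number of elements
median = statistics.median(data)    # middle value of the sorted data
mode = statistics.mode(data)        # most frequently repeated value
data_range = max(data) - min(data)  # maximum value - minimum value

print(mean, median, mode, data_range)
```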

Properties of Dataset
Before performing any statistical analysis, it is essential to understand the nature of
the data. We can use different Exploratory Data Analysis (EDA) techniques, which
help to identify the properties of the data, so that the appropriate statistical methods
can be applied to it:

• Centre of data
• Skewness of data
• Spread among the data members
• Presence of outliers
• Correlation among the data
• Type of probability distribution that the data follows

What are Events in Probability?

A probability event can be defined as a set of outcomes of an experiment. In other
words, an event in probability is a subset of the respective sample space.

P(E) = Number of Favourable Outcomes/Number of total outcomes

P(E) = n(E)/n(S)

Here,

n(E) = Number of outcomes favourable to event E

n(S) = Total number of outcomes
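A minimal sketch of the P(E) = n(E)/n(S) formula in Python, using a fair die as a hypothetical example:

```python
def probability(n_event, n_sample_space):
    """P(E) = n(E) / n(S): favourable outcomes over total outcomes."""
    return n_event / n_sample_space

# Rolling a fair six-sided die: the event "an even number comes up".
sample_space = {1, 2, 3, 4, 5, 6}
event = {outcome for outcome in sample_space if outcome % 2 == 0}

print(probability(len(event), len(sample_space)))  # 3/6 = 0.5
```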


Dependent Events

• For events to be considered dependent, one must have an influence over how
probable another is.
• In other words, a dependent event can only occur if another event occurs first.
• The primary focus when analyzing dependent events is probability. The
occurrence of one event exerts an effect on the probability of another event.
• Thus, if whether one event occurs does affect the probability that the other
event will occur, then the two events are said to be dependent.

Example :

A card is chosen at random from a standard deck of 52 playing cards. Without
replacing it, a second card is chosen.

P (king on the first pick) = 4/52 = 1/13
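The card example can be checked with exact fractions; the product P(first)·P(second | first) used below is the standard multiplication rule for dependent events:

```python
from fractions import Fraction

# Two cards drawn without replacement: the second draw depends on the first.
p_first_king = Fraction(4, 52)            # 4 kings among 52 cards
p_second_given_first = Fraction(3, 51)    # 3 kings left among 51 cards

# P(both kings) = P(first king) * P(second king | first king)
p_both_kings = p_first_king * p_second_given_first

print(p_first_king)    # 1/13
print(p_both_kings)    # 1/221
```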

Independent Events
o An event is deemed independent when it isn’t connected to another event or to
that event’s probability of happening (or not happening).
o Independent events don’t influence one another or have any effect on how probable
another event is.
o If the probability of occurrence of an event A is not affected by the occurrence of
another event B, then A and B are said to be independent events.

Examples:

Tossing a coin

Here, Sample Space S = {H, T}, and the outcome of each toss is independent of
the outcome of every other toss.

Conditional Probability Formula

■ If the probabilities of events A and B are P(A) and P(B) respectively, then the
conditional probability of B given that A has already occurred is P(B/A). Given
P(A)>0,

P(B/A) = P(A ∩ B)/P(A) = P(B ∩ A)/P(A)

According to the definition of conditional probability,

P(A|B) = P(A ∩ B)/P(B), P(B) ≠ 0, and

we know that

P(A ∩ B) = P(B ∩ A) = P(B|A)P(A),

which implies

P(A|B) = P(B|A)P(A)/P(B)
Bayes’ Theorem
Bayes’ theorem describes the probability of occurrence of an event related to any condition. It is
also considered for the case of conditional probability. Bayes theorem is also known as the formula
for the probability of “causes”.

Bayes Theorem Statement

Let E1, E2,…, En be a set of events associated with a sample space S, where all the
events E1, E2,…, En have nonzero probability of occurrence and they form a partition of
S. Let A be any event associated with S. Then, according to Bayes theorem,

P(Ek|A) = P(Ek)P(A|Ek) / [P(E1)P(A|E1) + P(E2)P(A|E2) + … + P(En)P(A|En)]

for any k = 1, 2, 3, …, n

Proof of Bayes Theorem

To prove the Bayes Theorem, we will use the total probability and conditional
probability formulas. The total probability of an event A is calculated when not enough
is known about event A directly; we then use other events related to event A to
determine its probability. Conditional probability is the probability of event A given
that other related events have already occurred.

Let (Ei) be a partition of the sample space S. Let A be an event that occurred. Let us
express A in terms of (Ei).

A = A ∩ S

= A ∩ (E1 ∪ E2 ∪ E3 ∪ ... ∪ En)

A = (A ∩ E1) ∪ (A ∩ E2) ∪ (A ∩ E3) ∪ ... ∪ (A ∩ En)

P(A) = P[(A ∩ E1) ∪ (A ∩ E2) ∪ (A ∩ E3) ∪ ... ∪ (A ∩ En)]

We know that when A and B are disjoint sets, then P(A ∪ B) = P(A) + P(B)

Thus here, P(A) = P(A ∩ E1) + P(A ∩ E2) + P(A ∩ E3) + ... + P(A ∩ En)

According to the multiplication theorem for dependent events, we have

P(A) = P(E1)·P(A/E1) + P(E2)·P(A/E2) + P(E3)·P(A/E3) + ... + P(En)·P(A/En)

Thus the total probability is P(A) = ∑ P(Ei)P(A|Ei), i = 1, 2, 3, ..., n --- (1)


Recalling the conditional probability, we get

P(Ei|A) = P(Ei ∩ A)/P(A), i = 1, 2, 3, ..., n --- (2)

Using the formula for the conditional probability of P(A|Ei), we have

P(Ei ∩ A) = P(A|Ei)P(Ei) --- (3)

Substituting equations (1) and (3) in equation (2), we get

P(Ei|A) = P(A|Ei)P(Ei) / ∑ P(Ek)P(A|Ek), where the sum runs over k = 1, 2, 3, ..., n

Hence, Bayes Theorem is proved.
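The final formula can be sketched numerically in Python; the two-urn priors and likelihoods below are hypothetical values chosen only for the example:

```python
def bayes_posteriors(priors, likelihoods):
    """P(Ei|A) = P(Ei)P(A|Ei) / sum over k of P(Ek)P(A|Ek)."""
    # Denominator: total probability P(A) over the partition.
    total = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / total for p, l in zip(priors, likelihoods)]

# Two urns picked with equal probability; P(red ball | urn) differs per urn.
priors = [0.5, 0.5]        # P(E1), P(E2)
likelihoods = [0.8, 0.4]   # P(A|E1), P(A|E2)

posteriors = bayes_posteriors(priors, likelihoods)
print(posteriors)          # P(E1|A) = 2/3, P(E2|A) = 1/3
```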

Sample Space
A sample space is a collection or a set of possible outcomes of a random
experiment.
The sample space is represented using the symbol, “S”.
A subset of the possible outcomes of an experiment is called an event.
A sample space may contain a number of outcomes that depends on the
experiment. If it contains a finite number of outcomes, then it is known as a
discrete or finite sample space.
The sample space for a random experiment is written within curly braces, “{ }”.

Tossing a Coin

When flipping a coin, two outcomes are possible, such as head and tail. Therefore the
sample space for this experiment is given as

Sample Space, S = { H, T } = { Head, Tail }

In a random experiment, the outcomes, also known as sample points, are
mutually exclusive (i.e., they cannot occur simultaneously).

Types of sample space


A sample space can be discrete or continuous. A sample space can be
countable or uncountable.

Discrete Probability:

If the sample space consists of a finite number of possible outcomes, then the
probability law is specified by the probabilities of the events that consist of a single
element. In particular, the probability of any event {s1, s2, ..., sn}
is the sum of the probabilities of its elements.

Continuous Probability:
Probabilistic models with continuous sample spaces differ from their discrete
counterparts in that the probabilities of the single-element events may not be
sufficient to characterize the probability law.

Random variables
■ A random variable is a rule that assigns a numerical value to each outcome in a
sample space
■ Random variables may be either discrete or continuous. A random variable is
said to be discrete if it assumes only specified values in an interval. Otherwise,
it is continuous.
■ We generally denote random variables with capital letters such as X and Y.
When X takes values 1, 2, 3, …, it is said to be a discrete random variable.

Types of Random Variable

As discussed in the introduction, there are two types of random variables:

• Discrete Random Variable


• Continuous Random Variable

Discrete Random Variable

A discrete random variable is a variable that can take on a countable number of distinct
values. A probability distribution is used to determine what values a random variable
can take and how often it takes on these values.

Discrete: Can take on only a countable number of distinct values like 0, 1, 2, 3, 50, 100,
etc.

Continuous Random Variable

A random variable that can take on an infinite number of possible values is known as a
continuous random variable. Such a variable is defined over an interval of values
rather than a specific value.

Continuous: Can take on an infinite number of possible values like 0.03, 1.2374553,
etc.
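A small simulation (invented for illustration) that contrasts the two kinds of random variables:

```python
import random

random.seed(42)  # fixed seed so the example is reproducible

# Discrete random variable: number of heads in 3 tosses (values 0, 1, 2, 3).
def heads_in_three_tosses():
    return sum(random.choice([0, 1]) for _ in range(3))

# Continuous random variable: a point drawn uniformly from the interval [0, 1).
def uniform_point():
    return random.random()

x = heads_in_three_tosses()
u = uniform_point()
print(x in {0, 1, 2, 3})   # discrete: only countably many possible values
print(0.0 <= u < 1.0)      # continuous: infinitely many values in the interval
```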

What is Population?

▶ In statistics, population is the entire set of items from which you draw data for a
statistical study. It can be a group of individuals, a set of items, etc.
▶ Generally, population refers to the people who live in a particular area at a
specific time. But in statistics, population refers to data on your study of
interest. It can be a group of individuals, objects, events, organizations, etc.
What is a Sample?

▶ A sample is defined as a smaller and more manageable representation of a
larger group.
▶ A subset of a larger population that contains characteristics of that population
▶ A sample is used in statistical testing when the population size is too large for
all members or observations to be included in the test.
▶ The sample is an unbiased subset of the population that best represents the
whole data.

▶ The process of selecting a sample is known as sampling. The number of elements
in the sample is the sample size.

What Is Hypothesis Testing in Statistics?

Hypothesis Testing is a type of statistical analysis in which you put your assumptions
about a population parameter to the test. It is used to estimate the relationship
between 2 statistical variables.

Null Hypothesis and Alternate Hypothesis

▶ The Null Hypothesis is the assumption that the event will not occur. A null
hypothesis has no bearing on the study's outcome unless it is rejected. H0 is the
symbol for it, and it is pronounced H-naught.
▶ The Alternate Hypothesis is the logical opposite of the null hypothesis. The
acceptance of the alternative hypothesis follows the rejection of the null
hypothesis. H1 is the symbol for it.

One-Tailed and Two-Tailed Hypothesis Testing

The One-Tailed test, also called a directional test, considers a critical region of
data that would result in the null hypothesis being rejected if the test sample
falls into it, inevitably meaning the acceptance of the alternate hypothesis.
In a one-tailed test, the critical distribution area is one-sided, meaning the test
sample is either greater or lesser than a specific value.
In a Two-Tailed test, the test sample is checked to be greater or less than a range of
values, implying that the critical distribution area is two-sided.
If the sample falls within this range, the alternate hypothesis will be accepted,
and the null hypothesis will be rejected.

Example:

Suppose H0: mean = 50 and H1: mean not equal to 50

According to the H1, the mean can be greater than or less than 50. This is an example
of a Two-tailed test.

In a similar manner, if H0: mean >= 50, then H1: mean < 50.

Here the alternate hypothesis considers only one direction (mean less than 50), so it
is called a One-tailed test.

P-Value

The p-value is the smallest level of significance at which the null hypothesis would
be rejected; it is used as an alternative way to determine the point of rejection.

It is expressed as a level of significance that lies between 0 and 1.

If the p-value is very small, it means the observed output is possible but is unlikely
under the null hypothesis conditions (H0).

A p-value threshold of 0.05 is commonly used as the level of significance (α). Usually,
it is interpreted using the two rules given below:

If p-value>0.05: The large p-value shows that the null hypothesis needs to be
accepted.
If p-value<0.05: The small p-value shows that the null hypothesis needs to be
rejected, and the result is declared statistically significant.
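As a hedged sketch (not from the original notes), a two-tailed one-sample z-test in plain Python shows how a p-value is computed and compared against α = 0.05; all the numbers are invented:

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_tailed_p_value(sample_mean, mu0, sigma, n):
    """p-value for H0: population mean = mu0 (two-tailed alternative)."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    return 2.0 * (1.0 - normal_cdf(abs(z)))

p = two_tailed_p_value(sample_mean=52.0, mu0=50.0, sigma=5.0, n=25)
print(round(p, 4))                                   # ~0.0455
print("reject H0" if p < 0.05 else "fail to reject H0")
```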

Type 1 and Type 2 Error

A hypothesis test can result in two types of errors.

Type 1 Error: A Type-I error occurs when the sample results reject the null
hypothesis despite it being true.

Type 2 Error: A Type-II error occurs when the null hypothesis is not rejected
when it is false, unlike a Type-I error.

Example:

Suppose a teacher evaluates the examination paper to decide whether a student
passes or fails.
H0: Student has passed

H1: Student has failed

Type I error will be the teacher failing the student [rejects H0] although the
student scored the passing marks [H0 was true].

Type II error will be the case where the teacher passes the student [does not reject
H0] although the student did not score the passing marks [H1 is true].

What Is P-Hacking

▶ P-hacking is the act of misusing data analysis to show that patterns in data are
statistically significant, when in reality they are not.
▶ P-value hacking, also known as data dredging, data fishing, data snooping or data
butchery, is an exploitation of data analysis in order to discover patterns which
would be presented as statistically significant, when in reality, there is no underlying
effect.
▶ For example: by stopping the collection of data once you get P<0.05, analyzing
many outcomes but only reporting those with P<0.05, using covariates, excluding
participants, etc.

What is Cost-function?

• A Cost Function is used to measure just how wrong the model is in finding a
relation between the input and output. It tells you how badly your model is
behaving/predicting.
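Mean squared error is one common cost function; this small sketch (with invented data) shows how it scores how wrong a model's predictions are:

```python
def mse(y_true, y_pred):
    """Mean squared error: average squared gap between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(mse([1, 2, 3], [1, 2, 3]))   # 0.0 -> predictions exactly match the targets
print(mse([1, 2, 3], [2, 3, 4]))   # 1.0 -> every prediction is off by one
```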

What Is Gradient Descent?


o One of the easiest optimization algorithms to understand and implement is the
gradient descent algorithm.
o Gradient Descent is an algorithm that is used to optimize the cost function or
the error of the model. It is used to find the minimum value of error possible in
your model.
o Gradient Descent can be thought of as the direction you have to take to reach
the least possible error. The error in your model can be different at different
points, and you have to find the quickest way to minimize it, to prevent resource
wastage.

Types of Gradient Descent

There are three popular types of gradient descent that mainly differ in the amount of
data they use:

Stochastic Gradient Descent

o Stochastic gradient descent (SGD) calculates the error and updates the
parameters for each training example within the dataset, one by one.
o Depending on the problem, this can make SGD faster than batch gradient
descent.
o One advantage is the frequent updates allow us to have a pretty detailed rate of
improvement.

Mini-Batch Gradient Descent

■ Mini-batch gradient descent is the go-to method since it’s a combination of the
concepts of Stochastic Gradient Descent and batch gradient descent.
■ It simply splits the training dataset into small batches and performs an update
for each of those batches.
■ This creates a balance between the robustness of stochastic gradient descent
and the efficiency of batch gradient descent.

Batch Gradient Descent

• Batch gradient descent, also called vanilla gradient descent, calculates the
error for each example within the training dataset, but only after all training
examples have been evaluated does the model get updated.
• This whole process is like a cycle and it’s called a training epoch.

Gradient Descent
Gradient Descent is an optimizing algorithm used in Machine/Deep Learning algorithms. The goal
of Gradient Descent is to iteratively minimize a convex objective function f(x).
The best way to define the local minimum or local maximum of a function using
gradient descent is as follows:

• If we move towards a negative gradient, or away from the gradient of the
function at the current point, we will approach the local minimum of that function.
• Whenever we move towards a positive gradient, or towards the gradient of the
function at the current point, we will approach the local maximum of that function.

How to calculate Gradient Descent?

In order to find the gradient of the function with respect to x dimension, take the
derivative of the function with respect to x , then substitute the x-coordinate of the
point of interest in for the x values in the derivative. Once gradient of the function at
any point is calculated, the gradient descent can be calculated by multiplying the
gradient with -1. Here are the steps of finding minimum of the function using gradient
descent:

• Calculate the gradient by taking the derivative of the function with respect to the specific
parameter. In case there are multiple parameters, take the partial derivatives with respect
to the different parameters.
• Calculate the descent value for different parameters by multiplying the value of derivatives
with learning or descent rate (step size) and -1.
• Update the value of the parameter by adding the descent value to its existing value.
This moves the parameter in the direction opposite to the gradient, in small steps.

• Some advantages of batch gradient descent are its computational efficiency: it
produces a stable error gradient and a stable convergence.
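The update rule described above (step against the gradient, scaled by the learning rate) can be sketched for a one-parameter function; f(x) = (x − 3)² is a made-up example whose minimum at x = 3 is known in advance:

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=200):
    """Repeatedly apply x <- x - learning_rate * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)  # move opposite to the gradient
    return x

# f(x) = (x - 3)^2 has gradient f'(x) = 2(x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(minimum, 6))   # converges to ~3.0
```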

Central Limit Theorem


Central limit theorem is a statistical theory which states that when the sample size is
large and the population has a finite variance, the sample means will be approximately
normally distributed, and the mean of the samples will be approximately equal to the
mean of the whole population. The central limit theorem (CLT) is at the heart of
hypothesis testing – a critical component of the data science lifecycle.
In other words, the central limit theorem states that for any population with mean μ
and standard deviation σ, the distribution of the sample mean for sample size n has
mean μ and standard deviation σ/√n.

Central Limit Theorem Statement

The central limit theorem states that whenever a random sample of size n is
taken from any distribution with mean μ and variance σ², the sample mean will be
approximately normally distributed with mean μ and variance σ²/n. The larger the
value of the sample size, the better the approximation to the normal.

Assumptions of Central Limit Theorem

• The sample should be drawn randomly following the condition of randomization.


• The samples drawn should be independent of each other. They should not influence the
other samples.
• When the sampling is done without replacement, the sample size shouldn’t exceed 10% of
the total population.
• The sample size should be sufficiently large.

Formula

The formula for the central limit theorem is given below:

z = (x̄ – μ) / (σ/√n)

The steps are as follows:

1) The information about the mean, population size, standard deviation, sample size and
a number that is associated with “greater than”, “less than”, or two numbers
associated with both values for a range of “between” is identified from the problem.

2) A graph with the mean as its centre is drawn.

3) The formula z = (x̄ – μ) / (σ/√n) is used to find the z-score.

4) The z-table is referred to find the ‘z’ value obtained in the previous step.

5) Case 1: Central limit theorem involving “>”.

Subtract the value obtained from the z-table from 0.5.

Case 2: Central limit theorem involving “<”.

Add 0.5 to the value obtained from the z-table.

Case 3: Central limit theorem involving “between”.

Step 3 is executed.

6) The z-value is found along with x bar.

The last step is common to all three cases, that is to convert the decimal obtained into
a percentage.

Examples on Central Limit Theorem

Example 1:

20 students are selected at random from a clinical psychology class; find the
probability that their mean GPA is more than 5, given that the average GPA scored by
the entire batch is 4.91 and the standard deviation is 0.72.

Solution:

Here,

Population mean = μ = 4.91

Population standard deviation= σ = 0.72

Sample size = n = 20 (which is less than 30)

Since the sample size is smaller than 30, use t-score instead of the z-score, even
though the population standard deviation is known.

Substituting the values, the standard error is:

σ/√n = 0.72/√20 = 0.161
Now, find the t-score. For our problem, the raw score is x̄ = 5:

t = (x̄ – μ)/(σ/√n) = (5 – 4.91)/0.161 = 0.559

Find the probability for t value using the t-score table. The degree of freedom here
would be:

Df = 20 – 1 = 19

P (t ≤ 0.559) = 0.7087

P (t > 0.559) = 1 – 0.7087 = 0.2913

Thus the probability that the score is more than 5 is 29.13 %.
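The arithmetic in this example can be verified in Python (the t-distribution lookup itself still needs a table; only the standard error and t-score are computed here):

```python
import math

mu = 4.91        # population mean
sigma = 0.72     # population standard deviation
n = 20           # sample size
x_bar = 5.0      # raw score of interest

standard_error = sigma / math.sqrt(n)   # sigma / sqrt(n)
t_score = (x_bar - mu) / standard_error

print(round(standard_error, 3))   # 0.161
print(round(t_score, 3))          # 0.559
```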

1. Bernoulli’s Distribution

This is one of the simplest distributions and can be used as an initial point to
derive more complex distributions. Bernoulli’s distribution has two possible
outcomes (success or failure) and a single trial.

For example, when tossing a coin, if the success probability of the outcome heads is
p, then the probability of having a tail as the outcome is (1-p). Bernoulli’s distribution is
the special case of the binomial distribution with a single trial.

The density function can be given as

f(x) = pˣ (1 − p)¹⁻ˣ, where x ∈ {0, 1}

[Graph omitted: Bernoulli’s distribution where the probability of success is less than
the probability of failure.]
The distribution has the following characteristics:

• The number of trials to be performed needs to be predefined for a single
experiment.
• Each trial has only two possible outcomes: success or failure.
• The probability of success must be the same for each event/experiment.
• Each event must be independent of the others.
2. Binomial Distribution

The binomial distribution is applied to binary-outcome events where the
probability of success remains the same across all the successive trials.
Examples include tossing a biased or unbiased coin a repeated number of times.

As input, the distribution considers two parameters, and is thus called a
bi-parametric distribution. The two parameters are:

• The number of trials, n, and
• The probability, p, assigned to one of the two classes

For n trials with success probability p, the probability of x successful events within
n trials can be determined by the following formula:

P(x) = C(n, x) pˣ (1 − p)ⁿ⁻ˣ

[Graph omitted: the binomial distribution when the probability of success equals the
probability of failure.]

The binomial distribution holds the following properties;

• For multiple trials provided, each trial is independent of the others, i.e., the result
of one trial cannot influence other trials.
• Each of the trials can have two possible outcomes, either success or failure,
with probabilities p, and (1-p).
• A total number of n identical trials can be conducted, and the probability of
success and failure is the same for all trials.
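The binomial formula above can be sketched with Python's `math.comb`; the fair-coin numbers are an invented illustration:

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Fair coin tossed 4 times: P(exactly 2 heads) = C(4, 2) * 0.5^2 * 0.5^2
print(binomial_pmf(2, 4, 0.5))                         # 0.375

# Sanity check: probabilities over all possible counts sum to 1.
print(sum(binomial_pmf(x, 4, 0.5) for x in range(5)))  # 1.0
```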

3. Normal (Gaussian) Distribution

Being a continuous distribution, the normal distribution is most commonly used
in data science. Many common quantities of our day-to-day life follow this
distribution: income distribution, average employee reports, the average weight of a
population, etc.

The formula for the normal distribution is:

f(x) = (1/(σ√(2π))) · e^(−(x − μ)² / (2σ²))

Where μ = mean value,
σ = standard deviation,
x = random variable

The distribution is called the standard normal distribution if mean (μ) = 0 and
standard deviation (σ) = 1.
The graph of the normal distribution is symmetric about the centre (mean).
Normal distribution has the following properties;

• Mean, mode and median coincide with each other.
• The distribution has a bell-shaped distribution curve.
• The distribution curve is symmetrical to the centre.
• The area under the curve is equal to 1.

4. Poisson Distribution

Being a part of the discrete probability distributions, the Poisson distribution outlines
the probability for a given number of events that take place in a fixed time period or
space, or in particularized intervals such as distance, area, or volume.
For example, conducting risk analysis by the insurance/banking industry,
anticipating the number of car accidents in a particular time interval and in a
specific area.
Poisson distribution considers the following assumptions:

• The probability of success in an interval of a given length is the same for all
intervals of that length (the rate is constant).
• The probability of a success tends to zero as the interval becomes smaller.
• A successful event can’t impact the result of another successful event.

A Poisson distribution can be modeled using the formula below:

P(X = x) = λˣ e^(−λ) / x!

Where λ represents the expected number of events that take place in a fixed period
of time, and X is the number of events in that time period.


Poisson distribution has the following characteristics;

• The events are independent of each other, i.e, if an event occurs, it doesn’t
affect the probability of another event occurring.
• An event could occur any number of times in a defined period of time.
• Any two events can’t be occurring at the same time.
• The average rate of events to take place is constant.
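A short sketch of the Poisson formula above; λ = 2 accidents per day is an invented rate for the example:

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) = lam^x * e^(-lam) / x!"""
    return lam**x * math.exp(-lam) / math.factorial(x)

# Suppose on average lam = 2 accidents occur per day.
for x in range(4):
    print(x, round(poisson_pmf(x, 2.0), 4))
```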

5. Exponential Distribution

Like the Poisson distribution, the exponential distribution has a time element; it gives
the probability of the time duration before an event takes place.
The exponential distribution is used for survival analysis, for example, the life of an air
conditioner, the expected life of a machine, and the length of time between metro arrivals.
A variable X is said to possess an exponential distribution when

f(x) = λe^(−λx), for x ≥ 0

where λ stands for the rate and always has a value greater than zero.
The exponential distribution has the following characteristics:

• The higher the rate, the faster the curve drops; the lower the rate, the flatter
the curve.
• In survival analysis, λ is termed the failure rate of a machine at any time t, with
the assumption that the machine has survived up to time t.

6. Multinomial Distribution

The multinomial distribution is used to measure the outcomes of experiments
that have two or more possible outcomes per trial. The binomial distribution is the
special case of the multinomial distribution in which there are exactly two possible
outcomes, such as true/false or success/failure.

The distribution is commonly used in biological, geological and financial
applications.

A very popular example is the Mendel experiment, where two strains of peas (one
with green and wrinkled seeds and the other with yellow and smooth seeds) were
hybridized, producing four different strains of seeds: green and wrinkled, green and
round, yellow and round, and yellow and wrinkled. This resulted in a multinomial
distribution and led to the discovery of the basic principles of genetics.

The density function for the multinomial distribution is

f(x1, ..., xk) = n!/(x1! · x2! · ... · xk!) · p1^x1 · p2^x2 · ... · pk^xk

Where n = the number of trials, and
pi = the probability of occurrence of the i-th outcome in a single trial.

The following are properties of multinomial distribution;

• An experiment can have a repeated number of trials, for example, rolling a
die multiple times.
• Each trial is independent of each other.
• The success probability of each outcome must be the same (constant) for all
trials of an experiment.
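A sketch of the multinomial density function in Python; the fair-die example is invented for illustration:

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """P(X1=x1, ..., Xk=xk) = n!/(x1! * ... * xk!) * p1^x1 * ... * pk^xk."""
    n = sum(counts)
    coefficient = factorial(n)
    for x in counts:
        coefficient //= factorial(x)   # exact integer division at each step
    prob = 1.0
    for x, p in zip(counts, probs):
        prob *= p ** x
    return coefficient * prob

# Rolling a fair die 6 times: P(every face appears exactly once) = 6!/6^6
print(multinomial_pmf([1] * 6, [1 / 6] * 6))
```

With two outcomes, the same function reduces to the binomial pmf, matching the special-case relationship noted above.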
