
Introduction to Inferential Statistics

Statistics plays a significant part in the field of data science. It helps us collect, analyse and represent data, either through visualisation or through numbers, in a generally understandable format. Statistics is usually divided into two main branches: Descriptive Statistics and Inferential Statistics. In this article, we will discuss inferential statistics in detail.

What is Inferential Statistics?

Descriptive statistics describe the important characteristics of data using measures such as the mean, median, mode and variance. They summarise the data through numbers and graphs.

In inferential statistics, we make an inference about the population from a sample. The main aim of inferential statistics is to draw conclusions from the sample and generalise them to the population. For example, suppose we have to find the average salary of a data analyst across India. There are two options.

1. The first option is to collect the salary data of every data analyst across India and take the average.
2. The second option is to take a sample of data analysts from the major IT cities in India, compute their average salary, and treat it as the average for the whole of India.
The first option is not practical, as it is very difficult to collect data on every data analyst across India; it is time-consuming as well as costly. To overcome this issue, we take the second option: collect a small sample of data analyst salaries and take their average as the estimate for India. This is inferential statistics, where we make an inference about the population from a sample.

In inferential statistics, we will discuss probability, distributions, and hypothesis testing.

Importance of Inferential Statistics

• Making conclusions about the population from a sample
• Concluding whether a selected sample is statistically significant for the whole population or not (a small hypothesis-test sketch follows this list)
• Comparing two models to find which one is statistically more significant than the other
• In feature selection, deciding whether adding or removing a variable helps in improving the model or not
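
As a rough illustration of the significance point above, here is a minimal sketch of a one-sample hypothesis test. It assumes SciPy is available and uses made-up salary figures; it is just one common way to check whether a sample mean differs from an assumed population value, not a method prescribed in this article.

import numpy as np
from scipy import stats

# Hypothetical sample of data analyst salaries (in lakhs per annum); figures are made up.
sample_salaries = np.array([6.2, 7.1, 5.8, 6.9, 7.4, 6.5, 5.9, 7.0, 6.3, 6.8])

# Null hypothesis: the population mean salary is 6.0 lakhs.
t_stat, p_value = stats.ttest_1samp(sample_salaries, popmean=6.0)

# A small p-value (e.g. < 0.05) suggests the sample mean differs significantly from 6.0.
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
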
Probability

Probability is a measure of the chance that a phenomenon occurs. We will now discuss some terms that are very important in probability (a small worked example follows the list):

• Random Experiment: A random experiment (or statistical experiment) is an experiment in which all the possible outcomes are known in advance. The experiment can be repeated numerous times under identical or similar conditions.
• Sample space: Sample space of a random experiment is the collection or set of all the
possible outcomes of a random experiment.
• Event: A subset of sample space is called an event.
• Trial: A trial is a special type of experiment with two possible outcomes, success or failure, each occurring with some probability of success.
• Random Variable: A variable whose value is subject to variation due to randomness is called a random variable. A random variable is of two types: discrete and continuous. Mathematically, a real-valued function X: S -> R is called a random variable, where S is the sample space and R is the set of real numbers.
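
To make these terms concrete, here is a minimal sketch in Python; the fair six-sided die is a hypothetical choice of random experiment, not one used elsewhere in this article.

import random

# Sample space of the random experiment "roll a fair six-sided die".
sample_space = {1, 2, 3, 4, 5, 6}

# Event: "the roll is even" is a subset of the sample space.
even_event = {2, 4, 6}

# Random variable: here X simply maps each outcome to itself (a real number).
def X(outcome):
    return float(outcome)

# One repetition of the experiment.
outcome = random.choice(sorted(sample_space))
print("Outcome:", outcome, "| X(outcome):", X(outcome), "| even event occurred:", outcome in even_event)
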
Conditional Probability

Conditional probability is the probability of a particular event Y, given that another event X has already occurred. The conditional probability P(Y|X) is defined as,

P(Y|X) = N(X∩Y) / N(X); provided N(X) > 0

N(X): total number of cases favourable to the event X

N(X∩Y): total number of cases favourable to X and Y occurring simultaneously

Or, we can write as:

P(Y|X) = P(X∩Y) / P(X); P(X) > 0
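
As a small numerical check of this definition, here is a sketch that computes P(Y|X) both from counts and from the ratio of probabilities; the fair die and the events chosen are hypothetical examples.

from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # outcomes of a fair die
X = {2, 4, 6}                        # event X: the roll is even
Y = {5, 6}                           # event Y: the roll is greater than 4

# Counting definition: P(Y|X) = N(X ∩ Y) / N(X)
p_y_given_x_counts = Fraction(len(X & Y), len(X))

# Probability definition: P(Y|X) = P(X ∩ Y) / P(X)
p_x = Fraction(len(X), len(sample_space))
p_xy = Fraction(len(X & Y), len(sample_space))
p_y_given_x_probs = p_xy / p_x

print(p_y_given_x_counts, p_y_given_x_probs)   # both print 1/3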

Probability Distribution and Distribution function

The mathematical function describing the randomness of a random variable is called its probability distribution. It is a depiction of all possible outcomes of a random variable and their associated probabilities.

For a random variable X, CDF (Cumulative Distribution function) is defined as:

F(x) = P({s ∈ S : X(s) ≤ x})

Or,

F(x) = P {X ≤ x}

E.g., if X is a discrete random variable taking the values 1, 2, 3, …, then
P(X > 7) = 1 - P(X ≤ 7) = 1 - {P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5) + P(X = 6) + P(X = 7)}
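
As an illustration of the complement rule above, here is a minimal sketch that computes P(X > 7) directly and via 1 - F(7); it assumes X is the sum of two fair dice, which is purely a hypothetical choice.

from fractions import Fraction
from itertools import product

# Distribution of X = sum of two fair dice.
outcomes = [a + b for a, b in product(range(1, 7), repeat=2)]
total = len(outcomes)                                  # 36 equally likely outcomes

def pmf(x):
    return Fraction(outcomes.count(x), total)          # P(X = x)

def cdf(x):
    return sum(pmf(k) for k in range(2, x + 1))        # F(x) = P(X <= x)

p_direct = sum(pmf(k) for k in range(8, 13))           # P(X > 7) computed directly
p_complement = 1 - cdf(7)                              # P(X > 7) = 1 - P(X <= 7)
print(p_direct, p_complement)                          # both print 5/12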

Sampling Distribution

The probability distribution of a statistic computed over a large number of samples drawn from the population is called a sampling distribution. As we increase the sample size, the sample mean becomes more normally distributed around the population mean, and the variability of the sample mean decreases.
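
A minimal simulation sketch of this behaviour; NumPy and the skewed exponential population are assumptions chosen for illustration, not part of the article.

import numpy as np

rng = np.random.default_rng(0)
population_mean = 10.0

for n in (5, 50, 500):
    # Draw 2000 samples of size n from a skewed (exponential) population.
    sample_means = rng.exponential(scale=population_mean, size=(2000, n)).mean(axis=1)
    # The spread of the sample means shrinks roughly like 1/sqrt(n).
    print(f"n={n:4d}  mean of sample means={sample_means.mean():6.2f}  std={sample_means.std():5.2f}")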

What is big data analytics?

Big data analytics describes the process of uncovering trends, patterns, and correlations in large
amounts of raw data to help make data-informed decisions. These processes use familiar
statistical analysis techniques—like clustering and regression—and apply them to more
extensive datasets with the help of newer tools. Big data has been a buzzword since the early
2000s, when software and hardware capabilities made it possible for organizations to handle
large amounts of unstructured data. Since then, new technologies—from Amazon to
smartphones—have contributed even more to the substantial amounts of data available to
organizations. With the explosion of data, early innovation projects like Hadoop, Spark, and
NoSQL databases were created for the storage and processing of big data. This field continues to
evolve as data engineers look for ways to integrate the vast amounts of complex information
created by sensors, networks, transactions, smart devices, web usage, and more. Even now, big
data analytics methods are being used with emerging technologies, like machine learning, to
discover and scale more complex insights.

How big data analytics works

Big data analytics refers to collecting, processing, cleaning, and analyzing large datasets to help
organizations operationalize their big data.

1. Collect Data

Data collection looks different for every organization. With today’s technology, organizations
can gather both structured and unstructured data from a variety of sources — from cloud storage
to mobile applications to in-store IoT sensors and beyond. Some data will be stored in data
warehouses where business intelligence tools and solutions can access it easily. Raw or
unstructured data that is too diverse or complex for a warehouse may be assigned metadata and
stored in a data lake.

2. Process Data

Once data is collected and stored, it must be organized properly to get accurate results on
analytical queries, especially when it’s large and unstructured. Available data is growing
exponentially, making data processing a challenge for organizations. One processing option
is batch processing, which looks at large data blocks over time. Batch processing is useful when
there is a longer turnaround time between collecting and analyzing data. Stream
processing looks at small batches of data at once, shortening the delay time between collection
and analysis for quicker decision-making. Stream processing is more complex and often more
expensive.
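
As a toy illustration of the difference (purely a sketch; real systems would rely on dedicated engines such as Spark or a streaming platform rather than hand-written loops), batch processing analyses one large block in a single pass, while stream processing updates its result as each record arrives.

records = [{"user": i, "amount": i * 1.5} for i in range(10)]   # stand-in for collected data

# Batch processing: accumulate a large block of data, then analyze it in one pass.
def process_batch(batch):
    return sum(r["amount"] for r in batch) / len(batch)

print("batch average:", process_batch(records))

# Stream processing: update the result incrementally as each record arrives.
def process_stream(stream):
    total, count = 0.0, 0
    for r in stream:                      # in practice this would be an unbounded source
        total, count = total + r["amount"], count + 1
        yield total / count               # running average available with low latency

for running_avg in process_stream(records):
    pass
print("streaming average:", running_avg)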

3. Clean Data

Data big or small requires scrubbing to improve data quality and get stronger results; all data
must be formatted correctly, and any duplicative or irrelevant data must be eliminated or
accounted for. Dirty data can obscure and mislead, creating flawed insights.
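
A minimal cleaning sketch, assuming pandas is available and using a small, made-up transactions table; it covers the formatting, de-duplication and missing-value steps mentioned above.

import pandas as pd

# Hypothetical raw transactions; column names and values are made up.
raw = pd.DataFrame({
    "customer": [" Alice ", "BOB", "Alice", "bob"],
    "amount":   ["100", "250", "100", "abc"],
    "date":     ["2023-01-05", "2023-01-06", "2023-01-05", "2023-01-07"],
})

clean = (
    raw.assign(
        customer=raw["customer"].str.strip().str.lower(),      # consistent text formatting
        amount=pd.to_numeric(raw["amount"], errors="coerce"),   # enforce a numeric type
        date=pd.to_datetime(raw["date"]),                       # enforce a proper date type
    )
    .dropna(subset=["amount"])      # account for unparseable or missing values
    .drop_duplicates()              # eliminate duplicate records
)
print(clean)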

4. Analyze Data

Getting big data into a usable state takes time. Once it’s ready, advanced analytics processes can
turn big data into big insights. Some of these big data analysis methods include:

• Data mining sorts through large datasets to identify patterns and relationships by detecting anomalies and creating data clusters (see the clustering sketch after this list).
• Predictive analytics uses an organization’s historical data to make predictions about the
future, identifying upcoming risks and opportunities.
• Deep learning imitates human learning patterns by using artificial intelligence and
machine learning to layer algorithms and find patterns in the most complex and abstract
data.
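
As an illustration of the data-mining step, here is a minimal clustering sketch; it assumes scikit-learn is installed and uses synthetic data generated around three made-up group centres rather than any real dataset.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Synthetic "customer" records scattered around three hidden group centres (made-up data).
centres = np.array([[20, 500], [45, 150], [60, 900]])
data = np.vstack([c + rng.normal(scale=[3, 50], size=(100, 2)) for c in centres])

# Group the records into three clusters and inspect the discovered centres.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print("cluster centres:\n", model.cluster_centers_)
print("records per cluster:", np.bincount(model.labels_))
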
Big Challenges with Big Data
The challenges in Big Data are the real implementation hurdles. They require immediate attention and need to be handled, because if they are not, the technology may fail and lead to unpleasant results. Big data challenges include storing and analyzing extremely large and fast-growing data.
Some of the Big Data challenges are:
1. Sharing and Accessing Data:
• Perhaps the most frequent challenge in big data efforts is the inaccessibility of data sets
from external sources.
• Sharing data can cause substantial challenges.
• This includes the need for inter- and intra-institutional legal documents.
• Accessing data from public repositories leads to multiple difficulties.
• Data must be available in an accurate, complete and timely manner, because accurate and timely decisions can only be made from a company's information system if the data is available in this form.
2. Privacy and Security:
• This is another very important challenge with Big Data. It involves sensitive, conceptual, technical as well as legal considerations.
• Most organizations are unable to maintain regular checks due to the large amounts of data generated. However, security checks and monitoring should be performed in real time, as this is most beneficial.
• Some information about a person, when combined with large external data sets, may reveal facts about that person which they may want to keep private.
• Some organizations collect information about people in order to add value to their business, by generating insights into their lives that those people are unaware of.
3. Analytical Challenges:
• There are some huge analytical challenges in big data, which raise questions such as: how do we deal with a problem when the data volume gets too large?
• Or how do we find out the important data points?
• Or how do we use the data to the best advantage?
• The large amounts of data on which this analysis is to be done can be structured (organized data), semi-structured (semi-organized data) or unstructured (unorganized data). There are two techniques through which decision making can be done:
• Either incorporate massive data volumes in the analysis.
• Or determine upfront which big data is relevant.
4. Technical Challenges:

• Quality of data:
• When a large amount of data is collected and stored, it comes at a cost. Big companies, business leaders and IT leaders always want large data storage.
• For better results and conclusions, big data focuses on storing quality data rather than irrelevant data.
• This further raises the questions of how it can be ensured that the data is relevant, how much data would be enough for decision making, and whether the stored data is accurate or not.
• Fault tolerance:
• Fault tolerance is another technical challenge, and fault-tolerant computing is extremely hard, involving intricate algorithms.
• New technologies like cloud computing and big data are designed so that whenever a failure occurs, the damage done stays within an acceptable threshold, that is, the whole task does not have to begin from scratch.
