Data Science Unit-I
Data science is the deep study of massive amounts of data. It is a field that draws
insights from structured and unstructured data using different scientific methods
and algorithms, and it helps in generating insights and making predictions. It uses
large amounts of data, together with statistics and computation, to produce
meaningful outputs for decision making.
The data used in data science is usually collected from different sources, such
as e-commerce sites, surveys, social media, and internet searches, using
advanced technologies. This data helps in making predictions and, in turn, in
generating profits for businesses.
With the help of data science, online food delivery companies can understand
the requirements of their customers.
Data science also helps in making future predictions. For example, airlines
can predict flight prices based on customers' previous booking history.
Some years ago, data was scarce and mostly available in a structured form, which
could be easily stored in Excel sheets and processed using BI tools. But in today's
world, data has become so vast, with approximately 2.5 quintillion bytes of data
generated every day, that it has led to a data explosion.
Handling such a huge amount of data is now a challenging task for every
organization. To handle, process, and analyze it, we need complex, powerful,
and efficient algorithms and technology, and that technology is data science.
The following are some main reasons for using data science technology:
With the help of data science technology, we can convert massive amounts of
raw and unstructured data into meaningful insights.
Data science can help with different predictions, such as surveys, elections,
flight ticket confirmation, etc.
Statistics: Statistics gives you the numbers in your data, so a good understanding of
statistics is very important for becoming a data scientist. You should be familiar
with statistical tests, distributions, and maximum likelihood estimators, and you
should also know probability theory and descriptive statistics. These concepts will
help you make better business decisions.
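As a small illustration of one of these concepts, here is maximum likelihood estimation for a normal sample. For normally distributed data, the MLE of the mean is the sample mean and the MLE of the variance is the population variance (dividing by n, not n - 1); the sample values below are invented:

```python
# Illustrative sketch: maximum likelihood estimation (MLE) for a normal
# distribution. For normal data, the MLE of the mean is the sample mean
# and the MLE of the variance divides by n (not n - 1).
def normal_mle(data):
    n = len(data)
    mu_hat = sum(data) / n                                # MLE of the mean
    var_hat = sum((x - mu_hat) ** 2 for x in data) / n    # MLE of the variance
    return mu_hat, var_hat

sample = [4.8, 5.1, 5.0, 4.9, 5.2]
mu_hat, var_hat = normal_mle(sample)
print(mu_hat, var_hat)  # mean close to 5.0, variance close to 0.02
```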
Data extraction and processing: Data extraction can be done from multiple data
sources, such as MySQL databases or MongoDB databases. Data extraction means
pulling data out of these sources and putting it into a structured format so that
you can analyze it.
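A minimal sketch of this extraction step, using an in-memory SQLite database to stand in for a real MySQL or MongoDB source; the table and column names are made up for the example:

```python
import sqlite3

# Illustrative sketch: extract rows from a database into a structured
# format (a list of dictionaries) ready for analysis. The in-memory
# SQLite database stands in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "laptop", 999.0), (2, "mouse", 25.0), (3, "laptop", 999.0)],
)

# Run a query and convert each row into a dict keyed by column name.
cursor = conn.execute("SELECT id, product, amount FROM orders WHERE amount > 100")
columns = [c[0] for c in cursor.description]
rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
print(rows)  # only the two orders above 100 survive the filter
conn.close()
```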
Data wrangling and exploration: This is one of the most difficult tasks in data
science. Data wrangling is about cleaning the data; after that, you explore it.
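Data wrangling can be sketched in plain Python. The records below (with invented field names) have inconsistent casing, missing values, and numbers stored as strings, and are cleaned before a first exploration step:

```python
from collections import Counter

# Illustrative sketch: cleaning raw records before exploration.
raw = [
    {"name": " Alice ", "age": "34", "city": "delhi"},
    {"name": "Bob", "age": "", "city": "MUMBAI"},
    {"name": "Carol", "age": "29", "city": "Delhi"},
]

def clean(record):
    return {
        "name": record["name"].strip(),
        # Treat an empty age as missing (None) instead of failing int().
        "age": int(record["age"]) if record["age"] else None,
        "city": record["city"].strip().title(),
    }

cleaned = [clean(r) for r in raw]

# A first exploration step: how many records per city?
city_counts = Counter(r["city"] for r in cleaned)
print(cleaned)
print(city_counts)
```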
Machine learning: If you have a huge amount of data in your organization, then you
need to know machine learning algorithms such as k-means, k-NN (k-nearest
neighbors), etc. All of these algorithms can be implemented using R or Python
libraries.
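As a rough sketch of one such algorithm, here is k-nearest neighbors (k-NN) in plain Python. The training points and labels are invented, and in practice a library such as scikit-learn would be used:

```python
import math
from collections import Counter

# Illustrative sketch of k-nearest neighbors (k-NN) classification.
# Each training item is ((x, y), label).
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((4.0, 4.0), "B"), ((4.2, 3.9), "B"), ((3.8, 4.1), "B"),
]

def knn_predict(point, train, k=3):
    # Sort training points by Euclidean distance to the query point,
    # then take a majority vote among the k closest labels.
    nearest = sorted(train, key=lambda item: math.dist(point, item[0]))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((1.1, 1.0), train))  # close to the "A" cluster
print(knn_predict((4.1, 4.0), train))  # close to the "B" cluster
```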
Big data processing frameworks: We have been generating a lot of data, much of it
in both structured and unstructured formats, and on such data we cannot use
traditional data processing systems. That is why you need to know frameworks
like Hadoop and Spark, which can be used to handle big data.
Big data can be defined as a concept used to describe a large volume of data, both
structured and unstructured, that grows day by day in any system or business.
Volume – Organizations and firms gather and pull together data from different
sources, including business transactions, social media, login data, information
from sensors, and machine-to-machine data. Earlier, storing this data would have
been an issue, but with the advent of new technologies for handling extensive
data, such as Apache Spark and Hadoop, the burden of enormous data has
decreased.
Variety – Data released from various systems comes in diverse types and formats.
These range from structured to unstructured: numeric data from traditional
databases, non-numeric or text documents, emails, audio and video, stock
ticker data, login data, blockchains' encrypted data, and even financial transactions.
Big data is not about how much data there is, but about how it can be used.
Data can be taken from various sources and analyzed to find answers that
enable:
Reduction in cost.
Time reductions.
When you merge big data with high-powered data analytics, it becomes possible
to accomplish a wide range of business-related tasks.
The data generated in bulk amounts at high velocity can be categorized as:
Black Box Data: Black box data is collected from private and government helicopters,
airplanes, and jets. It includes recordings of flight crew voices, with separate capture of the
microphones and earphones, etc.
Stock Exchange Data: Stock exchange data includes information about the 'buy' and 'sell'
decisions made by customers on shares of different companies.
Social Media Data: This type of data contains information about social media activities that include posts
submitted by millions of people worldwide.
Transport Data: Transport data includes vehicle models, capacity, distance (from source to destination),
and the availability of different vehicles.
Search Engine Data: Search engines retrieve a wide variety of unprocessed information stored in their databases.
BIG DATA TECHNOLOGIES:
This technology is significant for producing a more precise analysis, leading business analysts to highly
accurate decision-making and ensuring greater operational efficiency by reducing costs and trade risks.
To implement such analytics and hold such a wide variety of data, one needs an infrastructure that can
facilitate, manage, and process huge data volumes in real time. For this purpose, big data is classified into two
subcategories:
Operational Big Data: comprises systems such as MongoDB, Apache Cassandra, or CouchDB,
which offer real-time capabilities for operations on large data.
Analytical Big Data: comprises systems such as MapReduce, BigQuery, Apache Spark, or Massively
Parallel Processing (MPP) database, which offer analytical competence to process complex analysis on
large datasets.
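The MapReduce idea behind such analytical systems can be sketched as a toy word count in plain Python; real frameworks like Hadoop and Spark distribute these same steps across a cluster:

```python
from collections import defaultdict
from functools import reduce

# Illustrative sketch of the MapReduce pattern as a word count.
documents = [
    "big data needs big tools",
    "spark and hadoop process big data",
]

# Map step: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle step: group the pairs by key (the word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce step: sum the counts for each word.
word_counts = {word: reduce(lambda a, b: a + b, counts)
               for word, counts in grouped.items()}
print(word_counts)  # "big" appears three times across the documents
```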
CHALLENGES OF BIG DATA:
Rapid Data Growth: Growth at such a high velocity makes it difficult to extract insights from the
data. There is no 100% efficient way to filter out the relevant data.
Storage: The generation of such a massive amount of data needs space for storage, and organizations face
challenges to handle such extensive data without suitable tools and technologies.
Unreliable Data: It cannot be guaranteed that the big data collected and analyzed are totally (100%)
accurate. Redundant data, contradicting data, or incomplete data are challenges that remain within it.
Data Security: Firms and organizations storing such massive data (of users) can be a target of
cybercriminals, and there is a risk of data getting stolen. Hence, encrypting such colossal data is also a
challenge for firms and organizations.
DATAFICATION:
Datafication is a technological trend that turns many aspects of our life into data, which is subsequently
transformed into information realised as a new form of value. Kenneth Cukier and Viktor Mayer-Schönberger
introduced the term datafication to the broader lexicon in 2013. Up until then, datafication had been
associated with the analysis of representations of our lives captured through data, but not on the present scale.
This change was primarily due to the impact of big data and the computational opportunities it afforded
to predictive analytics.
IMPACT:
Human resources: Data obtained from mobile phones, apps, or social media usage is used to identify
potential employees and their specific characteristics, such as their risk-taking profile and personality.
Rather than relying on traditional personality tests or exams that measure analytical thinking, recruiters
can use the data obtained through datafication, which may replace existing test providers. New
personality measures can also be developed from this data.
Data is also used to understand an individual's risk profile and their likelihood of repaying a loan.
Various industries use datafication to understand their customers better and to create appropriate triggers
based on each customer's personality and behaviour. This data is obtained from the language and tone a
person uses in emails, phone calls, or social media.
Smart cities:
Through the data obtained from the sensors that are implemented into the smart city, issues that can arise
might be noticed and tackled in areas such as transportation, waste management, logistics, and energy. On
the basis of real-time data, commuters could change their routes when there is a traffic jam. With the
sensors that can measure air and water quality, cities can not only gain a more detailed understanding of
the pollution levels, but may also enact new environmental regulations based on real-time data.
STATISTICAL INFERENCE:
Using data analysis and statistics to make conclusions about a population is called statistical inference.
Its two main forms are:
Estimation
Hypothesis testing
Estimation:
Statistics from a sample are used to estimate population parameters, or to estimate the parameters of a
probability density function along with its confidence region. There is always uncertainty when estimating.
The uncertainty is often expressed as a confidence interval, defined by a likely lowest and highest value
for the parameter.
An example could be a confidence interval for the number of bicycles a Dutch person owns:
"The average number of bikes a Dutch person owns is between 3.5 and 6."
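As an illustration, here is how such a confidence interval could be computed in Python. The bicycle counts below are invented sample data (not real Dutch survey figures), and the interval uses the normal approximation with z = 1.96:

```python
import math
import statistics

# Illustrative sketch: a 95% confidence interval for a population mean,
# using the normal approximation (z = 1.96). The data is invented.
sample = [3, 5, 4, 6, 3, 5, 4, 4, 5, 6, 3, 4]

n = len(sample)
mean = statistics.fmean(sample)
sd = statistics.stdev(sample)        # sample standard deviation
margin = 1.96 * sd / math.sqrt(n)    # z times the standard error

low, high = mean - margin, mean + margin
print(f"95% CI for the mean: [{low:.2f}, {high:.2f}]")
```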
Hypothesis Testing
Hypothesis testing is a method to check whether a claim about a population is true. More precisely, it
checks how likely it is that a hypothesis is true, based on the sample data.
Some examples of claims or questions that can be checked with hypothesis testing:
For good statistical analysis, the sample needs to be as "similar" as possible to the population. If they are similar enough,
we say that the sample is representative of the population.
The sample is used to make conclusions about the whole population. If the sample is not similar enough to the whole
population, the conclusions could be useless.
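A minimal sketch of a hypothesis test, assuming a one-sample z-test with the normal approximation; the sample values and the claimed mean are invented for the example:

```python
import math
import statistics

# Illustrative sketch: a one-sample z-test of the (hypothetical) claim
# that the population mean is 50, using the normal approximation.
sample = [52, 55, 49, 58, 53, 56, 51, 54, 57, 50]
mu0 = 50  # the claimed population mean (null hypothesis)

n = len(sample)
mean = statistics.fmean(sample)
se = statistics.stdev(sample) / math.sqrt(n)  # standard error
z = (mean - mu0) / se

# Two-sided p-value from the standard normal CDF, via math.erf.
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(f"z = {z:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. below 0.05) means the sample is unlikely
# under the claim, so we would reject the null hypothesis.
```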
Statistical modeling:
Statistical modeling is the formalization of relationships between variables in the form of
mathematical equations.
This module introduces the basic concepts in probability and statistics that are necessary for
performing data analysis.
Probability Distributions:
Statistical modeling relies on probability calculations and probability distributions. Probability
distributions are functions that calculate the probabilities of the outcomes of random variables.
Normal distribution:
The normal distribution is described by the mean (μ) and the standard deviation (σ). Its probability
density function is:
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
Where
μ = Mean
σ = Standard deviation
x = Normal random variable
The normal distribution is often referred to as a 'bell curve' because of its shape:
The area under the curve of the normal distribution represents probabilities for the data.
For a normal distribution, the probabilities of falling within a given number of standard deviations (σ) of the mean are:
Roughly 68.3% of the data is within 1 standard deviation of the average (from μ-1σ to μ+1σ)
Roughly 95.5% of the data is within 2 standard deviations of the average (from μ-2σ to μ+2σ)
Roughly 99.7% of the data is within 3 standard deviations of the average (from μ-3σ to μ+3σ)
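These percentages can be checked numerically: for a normal distribution, the probability of landing within k standard deviations of the mean is erf(k/√2), which is available in Python's standard library:

```python
import math

# Illustrative check of the 68-95-99.7 rule. For a normal distribution,
# P(|X - mu| < k*sigma) = erf(k / sqrt(2)).
def within_k_sigma(k):
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {within_k_sigma(k):.1%}")
# prints the familiar 68 / 95 / 99.7 percentages
```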
Example:
Consider a histogram of the ages of Nobel Prize winners at the time they won the prize. A normal
distribution drawn on top of the histogram would be based on the population mean (μ) and standard
deviation (σ) of the real data.
Other real-world variables that tend to follow a normal distribution include:
Test scores
Height
Birth weight
T Distribution
The t-distribution, also known as Student's t-distribution, is a probability distribution used to estimate
population parameters when the sample size is small and the population variance is unknown.
If the sample is small, the t-distribution is wider; if the sample is big, it is narrower. The bigger the
sample size, the closer the t-distribution gets to the standard normal distribution. Its shape is controlled
by df (the degrees of freedom), which is usually the sample size minus one.
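This convergence can be illustrated from the t-distribution's density formula, pdf(t) = Γ((df+1)/2) / (√(df·π)·Γ(df/2)) · (1 + t²/df)^(−(df+1)/2), computed with Python's gamma function (note math.gamma overflows for very large df, so this sketch keeps df modest):

```python
import math

# Illustrative sketch: the t-distribution density from its formula,
# compared with the standard normal density. As df grows, they agree.
def t_pdf(t, df):
    num = math.gamma((df + 1) / 2)
    den = math.sqrt(df * math.pi) * math.gamma(df / 2)
    return (num / den) * (1 + t * t / df) ** (-(df + 1) / 2)

def normal_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# The t density has heavier tails for small df and narrows toward
# the normal density as df increases.
for df in (2, 10, 100):
    print(df, round(t_pdf(2.0, df), 4), round(normal_pdf(2.0), 4))
```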