
UNIT-I

INTRODUCTION TO DATA SCIENCE

WHAT IS DATA SCIENCE:

Data science is the deep study of massive amounts of data. It is a field that draws
insights from structured and unstructured data using different scientific methods
and algorithms, and it helps in generating insights and making predictions. It
uses large amounts of data, statistics, and computation to produce meaningful
outputs or insights for decision making.

In short, we can say that data science is all about:

o Asking the correct questions and analyzing the raw data.
o Modeling the data using various complex and efficient algorithms.
o Visualizing the data to get a better perspective.
o Understanding the data to make better decisions and find the final result.

The data used in data science is usually collected from different sources, such
as e-commerce sites, surveys, social media, and internet searches, using
advanced technologies. This data helps businesses make predictions and generate
profits accordingly.

Data science examples:

 With the help of data science, online food delivery companies understand
the requirements of their customers.

 Data science also helps in making future predictions. For example, airlines
can predict flight prices based on customers' previous booking history.

 Data science also helps in generating recommendations.

Need for Data Science:

Some years ago, data was scarce and mostly available in a structured form, which
could easily be stored in Excel sheets and processed using BI tools. But in today's
world, data has become so vast that approximately 2.5 quintillion bytes of data
are generated every day, which has led to a data explosion.

Now, handling such a huge amount of data is a challenging task for every
organization. To handle, process, and analyze this data, we require some
complex, powerful, and efficient algorithms and technology, and that technology
came into existence as data science. Following are some main reasons for using
data science technology:

With the help of data science technology, we can convert massive amounts of
raw and unstructured data into meaningful insights.

Data science technology is being adopted by various companies, whether big brands
or startups. Google, Amazon, Netflix, etc., which handle huge amounts of data,
use data science algorithms to provide a better customer experience.

Data science is used to automate transportation, such as creating self-driving
cars, which are the future of transportation.

Data science can help in different kinds of predictions, such as surveys, elections,
flight ticket confirmation, etc.

SKILL SETS NEEDED:


A data scientist uses data to understand and explain the phenomena around them
and helps organizations make better decisions. A data scientist is a master of
many trades: they should be proficient in mathematics, have good computer science
skills, and possess business knowledge.

Statistics: Statistics gives you the numbers in data, so a good understanding of
statistics is very important for becoming a data scientist. You have to be familiar
with statistical tests, probability distributions, and maximum likelihood estimators,
and you should know probability theory and descriptive statistics. These concepts
will help you make better business decisions.

Programming languages: You have to know a statistical programming language
like R or Python, and a data querying language like SQL.
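
As a minimal sketch (Python standard library only, with a hypothetical "orders" table), here is how a SQL query and simple descriptive statistics might be combined:

    import sqlite3
    import statistics

    # Create an in-memory SQLite database with a hypothetical "orders" table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(1, 250.0), (2, 120.5), (3, 310.0), (4, 95.0)])

    # Query the data with SQL ...
    amounts = [row[0] for row in conn.execute("SELECT amount FROM orders")]

    # ... and summarize it with descriptive statistics in Python.
    print("mean:", statistics.mean(amounts))
    print("stdev:", statistics.stdev(amounts))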

Data extraction and processing: Data extraction can be done from multiple data
sources, such as MySQL databases or MongoDB databases. Data extraction means
pulling data out of these databases and putting it in a structured format so that
you can analyze it.
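
A minimal extraction sketch, assuming pandas is installed and using an in-memory SQLite database as a stand-in for a real MySQL or MongoDB source:

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(1, 250.0), (2, 120.5), (3, 310.0)])

    # Extract the data into a structured, tabular format (a DataFrame).
    df = pd.read_sql_query("SELECT id, amount FROM orders", conn)
    print(df.head())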

Data wrangling and exploration: This is one of the most difficult tasks in data
science. Data wrangling is about cleaning the data. After that, you explore the data.
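
A minimal wrangling-and-exploration sketch, assuming pandas is installed and using a small made-up dataset with a missing value and a duplicate row:

    import pandas as pd

    # Illustrative data with a missing value and a duplicate row.
    df = pd.DataFrame({"age": [25, 30, None, 30], "income": [40, 55, 48, 55]})

    # Wrangling: drop duplicates and fill missing values with the column mean.
    df = df.drop_duplicates()
    df["age"] = df["age"].fillna(df["age"].mean())

    # Exploration: summary statistics give a first feel for the data.
    print(df.describe())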

Machine learning: If you have a huge amount of data in your organization, then you
need to know machine learning algorithms like k-means, kNN (k-nearest neighbors),
etc. All these algorithms can be implemented using R or Python libraries.
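
As a minimal sketch, assuming scikit-learn is installed, here is k-means clustering on a tiny made-up dataset:

    from sklearn.cluster import KMeans

    # Tiny illustrative dataset: two obvious groups of points.
    X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

    # Fit k-means with two clusters.
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(model.labels_)           # cluster assignment for each point
    print(model.cluster_centers_)  # coordinates of the two cluster centers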

Big data processing frameworks: We have been generating a lot of data, much of it
in structured and unstructured formats, so we cannot use traditional data
processing systems on such data. That is why you need to know frameworks like
Hadoop and Spark; these frameworks can be used to handle big data.
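
A minimal Spark sketch, assuming the pyspark package is installed and a local session suffices:

    from pyspark.sql import SparkSession

    # Start a local Spark session.
    spark = SparkSession.builder.appName("demo").getOrCreate()

    # A tiny illustrative DataFrame; real jobs would read from HDFS, S3, etc.
    df = spark.createDataFrame([("alice", 3), ("bob", 5), ("alice", 2)],
                               ["user", "clicks"])

    # Aggregate in parallel across the cluster (here, just local cores).
    df.groupBy("user").sum("clicks").show()

    spark.stop()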

Data visualization: It is always very important to present data in an understandable
and visually attractive format. Tools like Tableau and Power BI are among the most
popular visualization tools.
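
Visualization can also be done directly in code; a minimal sketch, assuming matplotlib is installed and using made-up sales figures:

    import matplotlib.pyplot as plt

    # Illustrative monthly sales figures.
    months = ["Jan", "Feb", "Mar", "Apr"]
    sales = [120, 150, 90, 180]

    plt.bar(months, sales)
    plt.xlabel("Month")
    plt.ylabel("Sales")
    plt.title("Monthly sales")
    plt.show()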

These are the skill sets needed for data science.


BIG DATA:

Big data can be defined as a concept used to describe a large volume of data, both
structured and unstructured, that grows day by day in any system or business.

Characteristics of big data:

Volume – Organizations and firms gather and pull together data from different
sources, including business transactions, data from social media, login data,
information from sensors, and machine-to-machine data. Earlier, storing this data
would have been an issue, but with the advent of new technologies for handling
extensive data, such as Apache Spark and Hadoop, the burden of enormous data
has decreased.

Variety – The data released from various systems comes in diverse types and
formats. They range from structured to unstructured: numeric data in traditional
databases, non-numeric or text documents, emails, audio and video, stock
ticker data, login data, blockchains' encrypted data, or even financial transactions.

Velocity – Data now streams at an exceptional speed and has to be dealt with
suitably. Sensors, smart metering, user data, and RFID tags are driving the need
to deal with an inundation of data in near real-time.

Veracity – how accurate and trustworthy the data set may be.

Value – the usefulness of the gathered data for the business.

IMPORTANCE OF BIG DATA:

The importance of big data lies not in how much data there is, but in how it can
be used. Data can be taken from various sources and analyzed to find answers
that enable:

Cost reduction.

Time reduction.

New product development with optimized offers.


Well-informed decision making.

When you merge big data with high-powered data analytics, it is possible to
achieve business-related tasks like:

Determining the root causes of failures, problems, or faults in real time.

Generating tokens and coupons based on the customer's buying behavior.

Calculating entire risk portfolios in minutes.

Detecting deceptive behavior before it has an impact.

TYPES OF DATA HANDLED BY BIG DATA:

The data generated in bulk amount with high velocity can be categorized as:

1. Structured Data: These are relational data.
2. Semi-structured Data: Examples: XML, JSON data.
3. Unstructured Data: Data of different formats: document files, multimedia files, images, backup files, etc.

EXAMPLES OF BIG DATA:

 Black Box Data: Black box data is collected from private and government helicopters, airplanes, and jets.
It includes recordings of flight crew voices and separate recordings from the microphones and earphones.
 Stock Exchange Data: Stock exchange data includes information about the 'buy' and 'sell' decisions made
on shares of different companies by customers.
 Social Media Data: This type of data contains information about social media activity, such as posts
submitted by millions of people worldwide.
 Transport Data: Transport data includes vehicle models, capacity, distance (from source to destination),
and the availability of different vehicles.
 Search Engine Data: Search engines retrieve a wide variety of unprocessed information stored in their databases.
BIG DATA TECHNOLOGIES:

This technology is significant for presenting a more precise analysis, which leads business analysts to highly
accurate decision-making and ensures greater operational efficiency by reducing costs and trade risks.
To implement such analytics and hold such a wide variety of data, one needs an infrastructure that can
manage and process huge data volumes in real time. Accordingly, big data is classified into two
subcategories:

 Operational Big Data: comprises systems such as MongoDB, Apache Cassandra, or CouchDB,
which offer real-time, operational capabilities for large data workloads.
 Analytical Big Data: comprises systems such as MapReduce, BigQuery, Apache Spark, or Massively
Parallel Processing (MPP) databases, which offer the analytical capability to run complex analysis on
large datasets.
CHALLENGES OF BIG DATA:
 Rapid Data Growth: Data growing at such a high velocity makes it hard to extract insights from it.
There is no 100% efficient way to filter out the relevant data.
 Storage: Generating such a massive amount of data requires storage space, and organizations face
challenges in handling such extensive data without suitable tools and technologies.
 Unreliable Data: It cannot be guaranteed that the big data collected and analyzed is totally (100%)
accurate. Redundant data, contradictory data, and incomplete data remain challenges.
 Data Security: Firms and organizations storing such massive amounts of (user) data can become targets
of cybercriminals, and there is a risk of data being stolen. Hence, encrypting such colossal data is also a
challenge for firms and organizations.
DATAFICATION:

Datafication is a technological trend turning many aspects of our lives into data, which is subsequently
transformed into information realized as a new form of value. Kenneth Cukier and Viktor Mayer-Schönberger
introduced the term datafication to the broader lexicon in 2013. Up until this time, datafication had been
associated with the analysis of representations of our lives captured through data, but not on the present scale.
This change was primarily due to the impact of big data and the computational opportunities afforded
to predictive analytics.

IMPACT :

Human resources: Data obtained from mobile phones, apps, or social media usage is used to identify
potential employees and their specific characteristics, such as risk-taking profile and personality. This data
may replace personality tests. Rather than using traditional personality tests or exams that measure
analytical thinking, providers will use the data obtained through datafication, which will change the existing
exam market. New personality measures will also be developed from this data.

Insurance and Banking

Data is used to understand an individual's risk profile and likelihood of repaying a loan.

Customer relationship management

Various industries are using datafication to understand their customers better and create appropriate triggers
based on each customer's personality and behaviour. This data is obtained from the language and tone a
person uses in emails, phone calls, or social media.

Smart city:

Through the data obtained from sensors embedded in the smart city, issues that arise in areas such as
transportation, waste management, logistics, and energy can be noticed and tackled. On the basis of
real-time data, commuters could change their routes when there is a traffic jam. With sensors that
measure air and water quality, cities can not only gain a more detailed understanding of pollution
levels, but may also enact new environmental regulations based on real-time data.

STATISTICAL INFERENCE:
Using data analysis and statistics to draw conclusions about a population is called statistical inference.

The main types of statistical inference are:

 Estimation
 Hypothesis testing

Estimation:
Statistics from a sample are used to estimate population parameters, i.e., to estimate the parameters of the
probability density function along with their confidence region. There is always uncertainty when estimating.
The uncertainty is often expressed as confidence intervals, defined by a likely lowest and highest value for
the parameter.

An example could be a confidence interval for the number of bicycles a Dutch person owns:

"The average number of bikes a Dutch person owns is between 3.5 and 6."

Hypothesis Testing
Hypothesis testing is a method to check whether a claim about a population is true. More precisely, it
checks how likely it is that a hypothesis is true, based on the sample data.

The steps of the test depend on:

Type of data (categorical or numerical)

If you are looking at:

A single group

Comparing one group to another

Comparing the same group before and after a change

Some examples of claims or questions that can be checked with hypothesis testing:

Is the average weight of dogs more than 40kg?

Do doctors make more money than lawyers?
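
A minimal sketch of testing the first claim, assuming SciPy 1.6+ is installed (for the alternative keyword) and using made-up dog weights:

    from scipy import stats

    # Illustrative sample of dog weights in kg.
    weights = [42, 45, 38, 50, 41, 44, 39, 47]

    # H0: mean weight = 40 kg; H1: mean weight > 40 kg.
    t_stat, p_value = stats.ttest_1samp(weights, 40, alternative="greater")
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

    # A small p-value (e.g., below 0.05) would support the claim.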

Populations and samples:

Population: Everything in the group that we want to learn about.

Sample: A part of the population.

Examples of populations and a sample from those populations:


Population                          Sample

All of the people in India          500 Indians

All of the customers of Netflix     300 Netflix customers

Every car manufacturer              Tesla, Toyota, BMW, Ford

For good statistical analysis, the sample needs to be as "similar" as possible to the population. If they are similar enough,
we say that the sample is representative of the population.

The sample is used to make conclusions about the whole population. If the sample is not similar enough to the whole
population, the conclusions could be useless.
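
To see what "representative" means in practice, here is a minimal sketch assuming NumPy is installed and using an illustrative population:

    import numpy as np

    rng = np.random.default_rng(seed=42)

    # Illustrative population: ages of 100,000 people.
    population = rng.integers(low=18, high=90, size=100_000)

    # A simple random sample of 500 people.
    sample = rng.choice(population, size=500, replace=False)

    # If the sample is representative, its mean should be close
    # to the population mean.
    print("population mean:", population.mean())
    print("sample mean:", sample.mean())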

Statistical modeling:

Statistical modeling is the formalization of relationships between variables in the form of
mathematical equations.

This module introduces the basic concepts in probability and statistics that are necessary for performing
data analysis.

Probability Distributions :
Statistical modeling relies on probability calculations and probability distributions. Probability distributions
are functions that calculate the probabilities of the outcomes of random variables.

The normal distribution is an important probability distribution used in statistics.

Normal distribution:
The normal distribution is described by the mean (μ) and the standard deviation (σ).

Normal Distribution Formula:

f(x) = (1 / (σ√(2π))) e^(-(x - μ)² / (2σ²))

Where
μ = Mean
σ = Standard deviation
x = Normal random variable

The normal distribution is often referred to as a 'bell curve' because of its shape:

 Most of the values are around the center (μ)
 The median and mean are equal
 It has only one mode
 It is symmetric, meaning it decreases the same amount on the left and the right of the center

The area under the curve of the normal distribution represents probabilities for the data.

The area under the whole curve is equal to 1, or 100%

For a normal distribution, the probabilities between standard deviations (σ) are:

 Roughly 68.3% of the data is within 1 standard deviation of the average (from μ-1σ to μ+1σ)
 Roughly 95.5% of the data is within 2 standard deviations of the average (from μ-2σ to μ+2σ)
 Roughly 99.7% of the data is within 3 standard deviations of the average (from μ-3σ to μ+3σ)
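
These percentages can be checked numerically; a minimal sketch, assuming SciPy is installed:

    from scipy.stats import norm

    # Probability mass within k standard deviations of the mean
    # for a standard normal distribution (mu = 0, sigma = 1).
    for k in (1, 2, 3):
        p = norm.cdf(k) - norm.cdf(-k)
        print(f"within {k} sigma: {p:.4f}")
    # Prints roughly 0.6827, 0.9545, 0.9973.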

Example:

Real-world data is often normally distributed. A classic example is a histogram of the ages of Nobel Prize
winners when they won the prize: a normal distribution drawn on top of the histogram, based on the
population mean (μ) and standard deviation (σ) of the real data, fits it closely.

Examples of real world variables that can be normally distributed:

 Test scores
 Height
 Birth weight

T Distribution
The t-distribution, also known as Student's t-distribution, is a probability distribution that is used to estimate
population parameters when the sample size is small and the population variance is unknown.

If the sample is small, the t-distribution is wider. If the sample is big, the t-distribution is narrower.

The bigger the sample size is, the closer the t-distribution gets to the standard normal distribution.

Plotting t-distributions with different degrees of freedom illustrates this effect.
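
A minimal numerical sketch of this convergence, assuming SciPy is installed; it compares the tail probability P(T > 2) for several degrees of freedom against the normal value:

    from scipy.stats import t, norm

    # The tail probability P(X > 2) shrinks toward the normal value
    # as the degrees of freedom grow.
    for df in (1, 5, 30, 1000):
        print(f"df={df:>4}: P(T > 2) = {t.sf(2, df):.4f}")
    print(f"normal : P(Z > 2) = {norm.sf(2):.4f}")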


Chi square distribution:
The chi-square distribution is used as a basis for verifying hypotheses.

It has one main parameter:

df – (degrees of freedom).

When drawing random samples from it (for example, with NumPy), you also specify size, the shape of the
returned array.

The shape of the chi-square distribution changes with the degrees of freedom (df).
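
A minimal sampling sketch, assuming NumPy is installed; df and size mirror the two values described above:

    import numpy as np

    rng = np.random.default_rng(seed=0)

    # Draw a 2x5 array of chi-square distributed samples with df=2.
    samples = rng.chisquare(df=2, size=(2, 5))
    print(samples)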
