
Fundamentals of Data Science

Introduction:
Data science has become one of the most in-demand jobs of the 21st century.
Every organization is looking for candidates with knowledge of data science.
Why did data science become so popular?
Evolution of technology
IoT (Internet of Things)
Social media
Other factors
What is data science?
Data science is the deep study of massive amounts of data.
It is a field that draws insights from structured and unstructured data using
different scientific methods and algorithms, and it helps in generating
insights and making predictions.
It uses large amounts of data to get meaningful outputs or insights, applying
statistics and computation for decision making.
The data used in data science is usually collected from different sources,
such as e-commerce sites, surveys, social media, and internet searches, using
advanced technologies.
This data helps in making predictions and, in turn, in generating profits for
businesses.
Data science examples
With the help of data science, online food delivery companies understand the
requirements of their customers.
Data science also helps in making future predictions. For example, airlines
can predict prices for flights according to customers' previous booking
history.
Data science also powers recommendations.
Who is a data scientist?
A data scientist uses data to understand and explain the phenomena around
them and helps organizations make better decisions.
A data scientist is a master of all trades: they should be proficient in
math, have good computer science skills, and have business knowledge.
Knowing math is a very important skill for a data scientist. Mathematics
matters because, in order to find solutions, you are going to build a lot of
predictive models, and these predictive models are based on hard math. So you
have to be able to understand all the underlying mechanisms of the predictive
models; most models and algorithms require mathematics.
Data science skill set
Statistics: statistics gives you the numbers behind the data, so a good
understanding of statistics is very important for becoming a data scientist.
You have to be familiar with statistical tests, common distributions, and
maximum likelihood estimators, and you should also know probability theory
and descriptive statistics.
These concepts will help you make better business decisions.
Programming languages: you have to know a statistical programming language
like R or Python, and you need to know a data querying language like SQL.
Data extraction and processing: data extraction can be done on multiple data
sources, such as MySQL or MongoDB databases. Data extraction is nothing but
pulling data from databases and putting it into a structured format so that
you can analyze it.
Data wrangling and exploration: this is one of the most difficult tasks in
data science. Data wrangling is about cleaning the data; after cleaning, you
explore the data.
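As a minimal sketch of what wrangling involves, the snippet below cleans a small, hypothetical set of survey records (the records, field names, and `clean` helper are invented for illustration; in practice a library such as pandas would be used) and then explores the result:

```python
# Minimal data-wrangling sketch: normalize messy records, drop
# incomplete rows, then compute a simple summary statistic.

records = [
    {"name": "Asha",  "age": "29", "city": "Hyderabad"},
    {"name": "Ravi",  "age": None, "city": "hyderabad "},  # missing age
    {"name": "Meena", "age": "34", "city": "Chennai"},
]

def clean(record):
    """Normalize one record: fix types, trim and case-fold text fields."""
    return {
        "name": record["name"].strip(),
        "age": int(record["age"]) if record["age"] is not None else None,
        "city": record["city"].strip().title(),
    }

cleaned = [clean(r) for r in records]

# Exploration step: keep only complete rows, then summarize.
complete = [r for r in cleaned if r["age"] is not None]
mean_age = sum(r["age"] for r in complete) / len(complete)
print(cleaned)
print("mean age of complete rows:", mean_age)  # (29 + 34) / 2 = 31.5
```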
Machine learning: if you have a huge amount of data in an organization, then
you need to know machine learning algorithms like k-means and k-NN
(k-nearest neighbors). All these algorithms are available in R or Python
libraries.
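k-NN is simple enough to sketch from scratch: classify a point by a majority vote among its k closest training points. The toy data and `knn_predict` function below are illustrative, not a library API; in practice you would use an implementation from an R or Python library:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """k-nearest-neighbors vote.

    train: list of (features, label) pairs; query: feature tuple.
    """
    # Sort training points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Majority vote among the k nearest labels.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D data: two well-separated clusters, "A" and "B".
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_predict(train, (2, 2)))  # "A" (nearest cluster)
print(knn_predict(train, (8, 7)))  # "B"
```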
Big data processing frameworks: we are generating a lot of data, and much of
it comes in structured and unstructured formats on which we cannot use
traditional data processing systems. That's why you need to know frameworks
like Hadoop and Spark; these frameworks can be used to handle big data.
Data visualization: it is always very important to present data in an
understandable and visually attractive format. Tools like Tableau and
Power BI are among the most popular visualization tools.
So these are the skill sets needed to become a data scientist:
Statistics
Programming
Data extraction and preprocessing
Data wrangling and exploration
Machine learning algorithms
Big data processing frameworks
Visualization
Data science job roles
Data scientist: has to understand the challenges of the business and offer
the best solution using data analysis and data processing. To become a data
scientist you have to be an expert in R, MATLAB, Python, and SQL.
Data analyst: a data analyst is responsible for a variety of tasks, including
visualization and the processing of massive amounts of data. They also have
to perform queries on databases, and they should know the optimizations that
can be used to access information from the biggest databases without
corrupting the data. They must know technologies like SQL, R, Python, etc.
Data architect: creates blueprints for data management so that databases can
be easily integrated, centralized, and protected with the best security
measures. To become an architect you should be an expert in data warehousing,
data modeling, extraction, and transformation.
Data engineer: the main responsibility of a data engineer is to build and
test scalable big data ecosystems. They are also needed to upgrade existing
systems with newer or upgraded versions, and they are responsible for
increasing the efficiency of databases. The technologies required are Hive,
NoSQL, R, Ruby, Java, C++, and MATLAB.
Statistician: they should know statistical theories and data organization,
and they also create different methodologies for engineers to apply. To
become a statistician you should be good at math and have good knowledge of
different databases and machine learning algorithms.
Database administrator: responsible for the proper functioning of all the
databases. They are also responsible for granting permissions and for data
backups and recoveries.
Business analyst: focuses on how data can be linked to actionable business
insights, so they mainly focus on business growth. The business analyst acts
as a link between the data engineers and the management executives. To become
a business analyst you have to understand business finances and business
intelligence.
Data & analytics manager: is responsible for data science operations and for
assigning duties to the team according to their skills.
What is big data?

Data that is very large in size is called big data.

The sources of big data are:
Social networking sites
E-commerce sites
Weather stations
Telecom companies
Share markets
Characteristics of big data
Volume – the quantity of data to be stored
Veracity – how accurate the data set may be
Variety – refers to all structured and unstructured data
Value – the usefulness of the gathered data for the business
Velocity – how quickly the data is generated and how quickly it moves
Datafication:
Taking all the aspects of life and turning them into data
Statistical inference
Statistics is a branch of mathematics. It is defined as the collection of
quantitative data.
The main purpose of statistics is to draw an accurate conclusion about a
larger population using a limited sample.
Statistical inference is the process of analyzing results and drawing
conclusions from data subject to random variation.
Population:
The population is the entire group that is taken for analysis or prediction.

Sample:
A sample is a subset of the population (i.e., taking random samples from the
population).
The size of the sample is always less than the size of the population.
Statistical modeling:
Statistical modeling is the formalization of relationships between variables
in the form of mathematical equations.
This module introduces the basic concepts in probability and statistics that
are necessary for performing data analysis.
Machine learning algorithms leverage probability distributions to model
uncertainty in predictions, enhancing their ability to make accurate
forecasts.
Probability Distributions:
Statistical modeling relies on probability calculations and probability
distributions. Probability distributions are functions that calculate the
probabilities of the outcomes of random variables.
We divide probability distributions based on whether the data is discrete or
continuous.
If a random variable takes discrete values, then it is called a discrete
random variable and the corresponding distribution is a discrete
distribution.
If a random variable takes values in a continuous range, then it is called a
continuous random variable and the corresponding distribution is a continuous
distribution.
Discrete
Bernoulli Distribution
Binomial Distribution
Poisson Distribution

Continuous:
Normal Distribution
Chi-squared (χ²) Distribution
Student-t Distribution
Log-Normal Distribution
Exponential Distribution
Bernoulli Distribution:
A Bernoulli distribution is a discrete distribution which has only 2 possible
outcomes, i.e., success (1) and failure (0).
Probability mass function: p(x) = p^x * (1-p)^(1-x), x in {0, 1}
Binomial distribution:
A random experiment consisting of n repeated Bernoulli trials.
Conditions:
each trial is independent
each trial results in success or failure
the probability of success in each trial is p
A random variable X that equals the number of trials that result in a success
has a binomial distribution with parameters n and p.
pmf: p(x) = nCx * p^x * (1-p)^(n-x)
Poisson distribution:
The Poisson distribution describes the probability of a given number of
events occurring in a fixed interval.
pmf: p(x) = (e^(-λ) * λ^x) / x!
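The three discrete PMFs above can be written out directly with Python's standard library. This is only a sketch; the parameter values in the checks are arbitrary illustrations:

```python
import math

def bernoulli_pmf(x, p):
    """P(X = x) = p^x * (1-p)^(1-x), for x in {0, 1}."""
    return p**x * (1 - p)**(1 - x)

def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p^x * (1-p)^(n-x)."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    """P(X = x) = e^(-lambda) * lambda^x / x!."""
    return math.exp(-lam) * lam**x / math.factorial(x)

print(bernoulli_pmf(1, 0.3))        # 0.3 (success with probability p)
print(binomial_pmf(2, 4, 0.5))      # C(4,2) * 0.5^4 = 6/16 = 0.375
print(round(poisson_pmf(0, 2), 4))  # e^-2 ≈ 0.1353
```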
Normal Distribution
The normal distribution is described by
the mean (μ) and the standard deviation (σ).
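As a quick sketch, the density formula can be coded directly from μ and σ (the `normal_pdf` name is just for illustration):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """f(x) = exp(-(x-mu)^2 / (2*sigma^2)) / (sigma * sqrt(2*pi))."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# The density peaks at the mean; for the standard normal that peak
# is 1 / sqrt(2*pi) ≈ 0.3989.
print(round(normal_pdf(0.0), 4))            # 0.3989
print(normal_pdf(1.0) == normal_pdf(-1.0))  # True: symmetric about the mean
```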
While taking samples from the population, there are 2 types of sampling:
1. Sampling with replacement
2. Sampling without replacement
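The difference between the two can be sketched with Python's standard library (the 1..10 population and the fixed seed are arbitrary choices for reproducibility):

```python
import random

population = list(range(1, 11))
random.seed(42)  # fixed seed so repeated runs give the same draws

# With replacement: the same unit may be drawn more than once.
with_replacement = random.choices(population, k=5)

# Without replacement: every drawn unit is distinct.
without_replacement = random.sample(population, k=5)

print(with_replacement)
print(without_replacement)
print(len(set(without_replacement)) == 5)  # True: all distinct
```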
Sampling
Sampling is the process of selecting a sample from the population.
For this, the population is divided into a number of parts called sampling
units.
Statistical analysis
• Descriptive statistics
i) graphical
ii) numerical
• Inferential statistics
i) estimation: estimates the parameters of the probability density function
along with its confidence region
ii) hypothesis testing: making judgments about f(x) and its parameters
e.g.: business, medical, social, engineering
Statistical modeling
Statistical modeling is divided into 2 parts:
1) random variables and probability density functions
2) estimation and hypothesis testing
Random phenomena
There are 2 types:
i) deterministic phenomena
ii) stochastic phenomena
Sources of errors:
i) due to lack of knowledge
ii) measurement errors

Types of random phenomena
i. discrete
ii. continuous
Discrete phenomena
Sample space: the set of all possible outcomes of a random phenomenon.
One coin toss: {H, T}

Event: a subset of the sample space.
Probability measure:
A probability measure is a function that assigns a real value to every
outcome of a random phenomenon.
It should satisfy the following statements:
i) 0 <= P(A) <= 1 (probabilities are non-negative and at most 1 for any
event A)
ii) P(S) = 1 (one of the outcomes should occur)
iii) for 2 mutually exclusive events A and B, P(A ∪ B) = P(A) + P(B)
• Conduct an experiment N times; if N_A is the number of times outcome A
occurs, then P(A) ≈ N_A / N.
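This frequency definition can be checked by simulation. The sketch below estimates P(heads) for a fair coin; N and the seed are arbitrary choices:

```python
import random

# Empirical (frequency) estimate of a probability: run the experiment
# N times and take P(A) ≈ N_A / N. Here A = "heads" for a fair coin,
# so the estimate should approach 0.5 as N grows.

random.seed(0)  # fixed seed for reproducibility
N = 100_000
n_heads = sum(1 for _ in range(N) if random.random() < 0.5)
estimate = n_heads / N
print(estimate)  # close to 0.5
```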
Types of events
• Independent events:
Two events are independent if the occurrence of one has no influence on the
occurrence of the other.
P(HH) = P(H in 1st toss) * P(H in 2nd toss)
= 0.5 * 0.5 = 0.25
• Mutually exclusive events:
Two events are mutually exclusive if the occurrence of one implies the other
does not occur.
In the two-coin-toss experiment, P(HH ∪ HT) = P(HH) + P(HT)
= 0.25 + 0.25
= 0.5
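Both worked examples above can be checked by simulating many two-coin tosses (N and the seed below are arbitrary):

```python
import random

# Estimate P(HH) (product of independent tosses) and P(HH or HT)
# (sum over mutually exclusive outcomes) by simulation.

random.seed(1)
N = 200_000
hh = ht = 0
for _ in range(N):
    first = random.random() < 0.5   # True = heads
    second = random.random() < 0.5
    if first and second:
        hh += 1
    elif first and not second:
        ht += 1

print(round(hh / N, 2))         # ≈ 0.25 = P(H) * P(H)
print(round((hh + ht) / N, 2))  # ≈ 0.50 = P(HH) + P(HT)
```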
Random variable
A random variable (RV) is a map from the sample space to the real line such
that there is a unique real number corresponding to every outcome of the
sample space.
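For example, with two coin tosses and X = number of heads, the map from sample space to real numbers can be written out explicitly:

```python
# A random variable assigns a real number to each outcome of the
# sample space. Here: two coin tosses, X = number of heads.

sample_space = ["HH", "HT", "TH", "TT"]
X = {outcome: outcome.count("H") for outcome in sample_space}
print(X)  # {'HH': 2, 'HT': 1, 'TH': 1, 'TT': 0}

# Each outcome is equally likely (probability 1/4), so the PMF of X is
# obtained by summing the probabilities of outcomes with the same value.
pmf = {}
for outcome, value in X.items():
    pmf[value] = pmf.get(value, 0) + 0.25
print(pmf)  # {2: 0.25, 1: 0.5, 0: 0.25}
```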
