DOC-20220428-WA0004. (1) (1) (1)
Introduction:
Data science has become one of the most in-demand jobs of
the 21st century.
Every organization is looking for candidates with
knowledge of data science.
Why data science became so popular
Evolution of technology
IoT (Internet of Things)
Social media
Other factors
What is data science
Data science is the in-depth study of massive amounts of data.
It is a field that uses scientific methods and algorithms
to extract insights from structured and unstructured
data and to make predictions.
It uses large amounts of data to produce meaningful
insights, applying statistics and computation to support
decision making.
The data used in data science is usually collected from
different sources, such as e-commerce sites, surveys,
social media, and internet searches, using advanced
technologies.
This data helps businesses make predictions and
increase their profits accordingly.
Data science examples
With the help of data science, online food delivery
companies understand the requirements of their
customers.
Data science also helps in making predictions.
For example, airlines can predict flight prices
from customers' previous booking history.
Data science also helps in generating recommendations.
Who is a data scientist
A data scientist uses data to understand and explain
the phenomena around them and helps organizations
make better decisions.
A data scientist is a master of all trades: they should
be proficient in maths and have good computer science
skills and business knowledge.
Knowing math is a very important skill for a data
scientist. To find solutions you are going to build many
predictive models, and these models are based on hard
math, so you have to be able to understand their
underlying mechanisms. Most models and algorithms
require mathematics.
Data science skill set
Statistics: statistics gives you the numbers in data,
so a good understanding of statistics is very important
for becoming a data scientist. You have to be familiar
with statistical tests, distributions, and maximum
likelihood estimators, and you should also know
probability theory and descriptive statistics.
These concepts will help you make better business
decisions.
Programming languages: you have to know a
statistical programming language like R or Python.
You also need to know a data querying language like SQL.
Data extraction and processing: data can be extracted
from multiple data sources, such as MySQL and MongoDB
databases. Data extraction is simply pulling data out
of databases and putting it in a structured format so
that you can analyze it.
Data wrangling and exploration: this is one of the
most difficult tasks in data science. Data wrangling is
about cleaning the data. After cleaning, you explore the
data.
Machine learning: if your organization has a huge amount
of data, you need to know machine learning algorithms
such as k-means and KNN (k-nearest neighbors).
These algorithms are implemented in R and Python
libraries.
Big data processing frameworks: we are generating
a lot of data, in both structured and unstructured
formats, and traditional data processing systems
cannot handle it. That is why you need to know
frameworks like Hadoop and Spark, which are built to
handle big data.
Data visualization: it is always important to
present data in an understandable and visually
attractive format. Tableau and Power BI are among the
most popular visualization tools.
These are the skills needed to become a data
scientist:
Statistics
Programming
Data extraction and preprocessing
Data wrangling and exploration
Machine learning algorithms
Big data processing frameworks
Data visualization
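The extraction, wrangling, and exploration steps listed above can be sketched in a few lines. This is only an illustration using Python's built-in sqlite3 module; the in-memory database, the orders table, and its rows are hypothetical stand-ins for a real source such as MySQL or MongoDB.

```python
import sqlite3
import statistics

# Hypothetical in-memory database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("asha", 250.0), ("ravi", None), ("meena", 410.0), ("ravi", 120.0)],
)

# Extraction: query the database into a structured list of rows.
rows = conn.execute("SELECT customer, amount FROM orders").fetchall()
conn.close()

# Wrangling: drop rows whose amount is missing.
clean = [(customer, amount) for customer, amount in rows if amount is not None]

# Exploration: a simple descriptive statistic on the cleaned data.
amounts = [amount for _, amount in clean]
mean_amount = statistics.mean(amounts)
print(len(rows), len(clean), mean_amount)
```

Dropping incomplete rows is only one possible cleaning rule; in practice the right choice depends on the data and the question being asked.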
Data science job roles
Data scientist: has to understand the challenges of the
business and offer the best solutions using
data analysis and data processing. To become a data
scientist you have to be an expert in R, MATLAB, Python,
and SQL.
Data analyst: a data analyst is responsible for a variety
of tasks, including visualization and processing of massive
amounts of data. They also have to run queries on
databases, and they should know the optimizations that can
be used to access information from the biggest databases
without corrupting the data. They must know
technologies like SQL, R, and Python.
Data architect: creates blueprints for data
management so that databases can be easily
integrated, centralized, and protected with the best
security measures. To become a data architect you should
be an expert in data warehousing, data modeling, and
extraction and transformation.
Data engineer: the main responsibility of a data engineer is
to build and test scalable big data ecosystems. They also
upgrade existing systems to newer versions and are
responsible for increasing the efficiency of databases.
The technologies required are Hive, NoSQL, R, Ruby, Java,
C++, and MATLAB.
Statistician: they should know statistical theories and
data organization. They also create methodologies
for engineers to apply. To become a statistician you should
be good at maths and have good knowledge of different
databases and machine learning algorithms.
Database administrator: they are responsible for the proper
functioning of all the databases, for granting
permissions, and for data backups and recoveries.
Business analyst: focuses on how data can be
linked to actionable business insights, so they mainly
focus on business growth. The business analyst acts as a
link between the data engineers and management
executives. To become a business analyst you have to
understand business finance and business
intelligence.
Data & analytics manager:
Is responsible for data science operations and for
assigning duties to the team according to their skills.
What is big data :
Sample:
A sample is a subset of the population (e.g. obtained by
taking random samples from the population).
The size of the sample is always less than the total size of
the population.
Statistical modeling:
Statistical modeling is the formalization of
relationships between variables in the form of
mathematical equations.
This module introduces the basic concepts in
probability and statistics that are necessary for
performing data analysis.
Machine learning algorithms leverage probability
distributions to model uncertainty in predictions,
enhancing their ability to make accurate forecasts.
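As one possible illustration of expressing a relationship between variables as an equation, the sketch below fits a straight line y = intercept + slope * x by ordinary least squares. The notes do not prescribe this particular model; the data values and the fit_line helper are hypothetical.

```python
# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]


def fit_line(xs, ys):
    """Least-squares estimates (slope, intercept) for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(x, y)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

The fitted equation is the "model": it summarizes how y changes with x and can be used to predict y for new values of x.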
Probability Distributions:
Statistical modeling relies on probability calculations and
probability distributions. Probability distributions are
functions that give the probabilities of the
outcomes of random variables.
We divide probability distributions based on
whether the data is discrete or continuous.
If a random variable takes discrete values, it is called a
discrete random variable, and the corresponding
distribution is a discrete probability distribution.
Continuous:
Normal Distribution
Chi-squared Distribution
Student-t Distribution
Log-Normal Distribution
Exponential Distribution
Bernoulli Distribution:
A Bernoulli distribution is a discrete distribution which
has only 2 possible outcomes, i.e. success (1) and failure (0).
Probability mass function: $P(X = x) = p^{x}(1-p)^{1-x}$, for $x \in \{0, 1\}$,
where $p$ is the probability of success.
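A minimal sketch of the Bernoulli probability mass function follows. The function name bernoulli_pmf and the value p = 0.3 are chosen here for illustration only.

```python
def bernoulli_pmf(x, p):
    """P(X = x) = p**x * (1 - p)**(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 (failure) or 1 (success)")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    return p ** x * (1 - p) ** (1 - x)


p = 0.3  # assumed probability of success
print(bernoulli_pmf(1, p), bernoulli_pmf(0, p))
```

The two probabilities always add up to 1, since success and failure are the only possible outcomes.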
Binomial distribution:
A random experiment consisting of n repeated Bernoulli
experiments.
Conditions:
each experiment is independent
each experiment results in success or failure
the probability of success in each experiment is p.
A random variable x that equals the number of experiments
that result in a success has a binomial distribution with
parameters n and p.
pmf: $P(X = x) = \binom{n}{x} p^{x}(1-p)^{n-x}$, for $x = 0, 1, \dots, n$
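The binomial pmf can be evaluated directly with the standard library. In this sketch the name binomial_pmf and the values n = 10, p = 0.5 are assumptions made for the example.

```python
import math


def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p**x * (1 - p)**(n - x)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 10, 0.5  # assumed: 10 independent trials, success probability 0.5
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(binomial_pmf(3, n, p))  # probability of exactly 3 successes
print(sum(probs))             # the probabilities over x = 0..n sum to 1
```

With n = 1 the formula reduces to the Bernoulli pmf, which is why the binomial distribution is described as n repeated Bernoulli experiments.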
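Poisson distribution:
The Poisson distribution describes the probability of a
given number of events occurring in a fixed interval.
pmf: $P(X = x) = \dfrac{e^{-\lambda}\lambda^{x}}{x!}$, for $x = 0, 1, 2, \dots$,
where $\lambda$ is the average number of events in the interval.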
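A short sketch of the Poisson pmf follows. The name poisson_pmf and the rate of 2 events per interval are assumed for illustration.

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) = exp(-lam) * lam**x / x! for x = 0, 1, 2, ..."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


lam = 2.0  # assumed average number of events per interval
for k in range(5):
    print(k, round(poisson_pmf(k, lam), 4))
```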
Normal Distribution
The normal distribution is described by
the mean (μ) and the standard deviation (σ).
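For reference, the normal density is $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^{2}/(2\sigma^{2})}$. The sketch below uses Python's statistics.NormalDist to work with it; the values μ = 50 and σ = 10 are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 50.0, 10.0  # assumed mean and standard deviation
dist = NormalDist(mu=mu, sigma=sigma)

# Density at the mean, and probability of falling within one
# standard deviation of the mean.
density_at_mean = dist.pdf(mu)
within_one_sigma = dist.cdf(mu + sigma) - dist.cdf(mu - sigma)
print(round(within_one_sigma, 4))  # about 0.6827
```

Because the distribution is fully described by μ and σ, changing these two numbers is all it takes to model a different normally distributed quantity.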
While taking samples from the population, there
are 2 types of sampling:
1. Sampling with replacement
2. Sampling without replacement
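The two sampling schemes can be sketched with Python's random module. The population of 20 numbered units, the sample size of 5, and the seed are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)          # fixed seed so the run is repeatable
population = list(range(1, 21))  # hypothetical population of 20 units

# With replacement: the same unit may be drawn more than once.
with_replacement = rng.choices(population, k=5)

# Without replacement: every drawn unit is distinct.
without_replacement = rng.sample(population, k=5)

print(with_replacement)
print(without_replacement)
```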
Sampling
Sampling is the process of selecting a sample from the
population.
For sampling, the population is divided into a number
of parts called sampling units.
Statistical analysis
Descriptive statistics
i) graphical
ii) numerical
Inferential statistics
i) estimation: estimates the parameters of the
probability density function along with their confidence
regions
ii) hypothesis testing: making judgments about
f(x) and its parameters
e.g. business, medical, social, engineering
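Estimation and hypothesis testing can be sketched on a small sample as follows. The data values and the hypothesized mean of 50 are invented, and the normal approximation is used to keep the example short; for a sample this small a t-based interval would usually be preferred.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical measurements, invented for illustration.
sample = [52.1, 48.3, 50.7, 49.9, 51.4, 47.8, 50.2, 49.5, 51.0, 50.6]

n = len(sample)
mean = statistics.mean(sample)
std_err = statistics.stdev(sample) / math.sqrt(n)

# Estimation: approximate 95% confidence interval for the mean.
z = NormalDist().inv_cdf(0.975)
ci = (mean - z * std_err, mean + z * std_err)

# Hypothesis testing: two-sided z-test of H0: population mean = mu0.
mu0 = 50.0
z_stat = (mean - mu0) / std_err
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(ci, p_value)
```

A small p-value would count as evidence against the hypothesized mean, while a large one means the sample is consistent with it.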
Statistical modeling
Statistical modeling is divided into 2 parts
1) random variables and probability density functions
2) estimation and hypothesis testing
Random phenomena
2 types
Deterministic phenomena
Stochastic phenomena
Sources of errors
i) due to lack of knowledge
ii) measurement errors
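One way to picture the difference between the two kinds of phenomena, and the effect of measurement error, is the sketch below. The linear rule, the noise level, and both function names are assumptions made purely for illustration.

```python
import random


def deterministic(x):
    """Same input always gives the same output."""
    return 2.0 * x + 1.0


def stochastic(x, rng, noise_sd=0.5):
    """Same rule plus random measurement error (assumed Gaussian)."""
    return deterministic(x) + rng.gauss(0.0, noise_sd)


rng = random.Random(0)  # fixed seed so the run is repeatable
same = [deterministic(3.0) for _ in range(3)]
noisy = [stochastic(3.0, rng) for _ in range(3)]
print(same)   # three identical values
print(noisy)  # three different values scattered around 7.0
```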