Data Science Unit 1
MODULE-II
STATISTICAL ANALYSIS: -
Introduction to statistics, statistical and non-statistical analysis, major categories
of statistics, population and sample, Measure of central tendency and dispersion,
Moments, Skewness and kurtosis, Correlation and regression, Theoretical
distributions – Binomial, Poisson, Normal
MODULE-III
INTRODUCTION TO MACHINE LEARNING: -
Machine learning, Types of learning, Properties of learning algorithms, Linear
regression and regularization, model selection and evaluation, classification: SVM,
kNN and decision tree, Ensemble methods: random forest, Naive Bayes and
logistic regression, Clustering: k-means, feature engineering and selection,
Dimensionality reduction: PCA
MODULE-IV
PYTHON SETUP FOR MATHEMATICAL AND SCIENTIFIC COMPUTING: -
Anaconda installation process, data types with python, basic operators and setup,
introduction to numpy, mathematical functions of numpy, introduction to scipy,
scipy packages, data frame and data operations, data visualisation using
matplotlib
Text Books:
1. N. G. Das, Statistical Methods (combined edition, Vol. I and Vol. II) – McGraw Hill
2. Roger D. Peng, Elizabeth Matsui, The Art of Data Science: A Guide for Anyone
Who Works with Data – Leanpub
3. Aurélien Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow –
O’Reilly
Reference Books:
1. Andriy Burkov, The Hundred-Page Machine Learning Book – Xpress Publishing
2. James, G., Witten, D., Hastie, T., Tibshirani, R. An introduction to statistical
learning with applications in R. Springer.
3. Murphy, K. Machine Learning: A Probabilistic Perspective. - MIT Press
4. Jan Erik Solem, Programming Computer Vision with Python – O’Reilly
As you can see from the above image, a Data Analyst usually explains what is
going on by processing the history of the data. A Data Scientist, on the other hand,
not only does exploratory analysis to discover insights from the data, but also uses
various advanced machine learning algorithms to predict the occurrence of a
particular event in the future. A Data Scientist will look at the data from many
angles, sometimes angles not known earlier.
So, Data Science is primarily used to make decisions and predictions by making use
of predictive causal analytics, prescriptive analytics (predictive plus decision
science) and machine learning.
• Predictive causal analytics – If you want a model that can predict the
possibilities of a particular event in the future, you need to apply predictive
causal analytics. Say, if you are providing money on credit, then the
probability of customers making future credit payments on time is a matter
of concern for you. Here, you can build a model that can perform predictive
Let’s see how the proportion of the above-described approaches differs between Data
Analysis and Data Science. As you can see in the image below, Data
Analysis includes descriptive analytics and prediction to a certain extent. On the
other hand, Data Science is more about Predictive Causal Analytics and Machine
Learning.
• Traditionally, the data that we had was mostly structured and small in size,
and could be analyzed using simple BI tools. Today, by contrast, most of the
data is unstructured or semi-structured. Let’s have a look at the data trends
in the image given below, which shows that by 2020 more than 80% of the
data will be unstructured.
This data is generated from different sources like financial logs, text files,
multimedia forms, sensors, and instruments. Simple BI tools are not
capable of processing this huge volume and variety of data. This is why we
need more complex and advanced analytical tools and algorithms for
processing, analyzing and drawing meaningful insights out of it.
This is not the only reason why Data Science has become so popular. Let’s dig
deeper and see how Data Science is being used in various domains.
Let’s have a look at the infographic below to see all the domains where Data
Science is making an impact.
There are several definitions of a Data Scientist. In simple words, a Data
Scientist is one who practices the art of Data Science. The term “Data Scientist”
was coined in recognition of the fact that a Data Scientist draws a lot of
information from scientific fields and applications, whether statistics or
mathematics.
Moving further, let’s now discuss BI. I am sure you have heard of Business
Intelligence (BI) too. Data Science is often confused with BI. I will state some
concise and clear contrasts between the two, which will help you get a better
understanding. Let’s have a look.
Features       Business Intelligence (BI)            Data Science
Data Sources   Structured (usually SQL, often a      Both structured and unstructured
               data warehouse)                       (logs, cloud data, SQL, NoSQL, text)
Tools          Pentaho, Microsoft BI, QlikView, R    RapidMiner, BigML, Weka, R
That covers what Data Science is. Now let’s understand the lifecycle of Data
Science.
A common mistake made in Data Science projects is rushing into data collection
and analysis, without understanding the requirements or even framing the
business problem properly. Therefore, it is very important for you to follow all the
phases throughout the lifecycle of Data Science to ensure the smooth functioning
of the project.
Here is a brief overview of the main phases of the Data Science Lifecycle:
You can use R for data cleaning, transformation, and visualization. This will help
you to spot the outliers and establish a relationship between the variables. Once
you have cleaned and prepared the data, it’s time to do exploratory analytics on
it. Let’s see how you can achieve that.
Although many tools are available in the market, R is the most commonly used
tool.
Now that you have gained insights into the nature of your data and have decided on the
algorithms to be used, in the next stage you will apply the algorithm and build
a model.
Now, I will take a case study to explain the various phases described above.
Step 1:
Attributes:
Step 2:
• Now, once we have the data, we need to clean and prepare the data for
data analysis.
• This data has a lot of inconsistencies like missing values, blank columns,
abrupt values and incorrect data format which need to be cleaned.
• Here, we have organized the data into a single table under different
attributes – making it look more structured.
• Let’s have a look at the sample data below.
• So, we will clean and preprocess this data by removing the outliers, filling
up the null values and normalizing the data type. If you remember, this is
our second phase which is data preprocessing.
• Finally, we get the clean data as shown below which can be used for
analysis.
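The cleaning steps listed above can be sketched in Python with pandas. This is a minimal illustration rather than the procedure used in the case study: the records, the "notes" column, and the plausible BMI range of 10 to 70 are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical raw records; column names follow the attributes named in the text.
raw = pd.DataFrame({
    "npreg": [6, 1, 8, 1, 0],
    "glucose": [148.0, 85.0, np.nan, 89.0, 137.0],  # one missing value
    "bmi": [33.6, 26.6, 23.3, 280.0, 43.1],         # 280.0 is an abrupt value
    "age": ["50", "31", "32", "21", "33"],          # numbers stored as text
    "notes": [np.nan] * 5,                          # a completely blank column
})


def clean(df):
    """Drop blank columns, fix types, remove outliers, fill null values."""
    df = df.dropna(axis="columns", how="all")             # remove blank columns
    df = df.assign(age=pd.to_numeric(df["age"]))          # normalize the data type
    df = df[df["bmi"].between(10, 70)]                    # assumed plausible range
    df = df.fillna({"glucose": df["glucose"].median()})   # fill remaining nulls
    return df.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Filling with the median is only one possible choice; dropping incomplete rows or using a domain-specific default may suit other data better.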
Step 3:
• First, we will load the data into the analytical sandbox and apply various
statistical functions on it. For example, R has functions like describe (available
in add-on packages such as Hmisc), which gives us the number of missing values
and unique values. We can also use the summary function, which will give us
statistical information like mean, median, range, min and max values.
• Then, we use visualization techniques like histograms, line graphs, box
plots to get a fair idea of the distribution of data.
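The text describes these checks in R. A roughly equivalent sketch in Python with pandas is shown here, assuming a small made-up table; the values are hypothetical and only the column names come from the case study.

```python
import pandas as pd

# Hypothetical sample; None marks a missing measurement.
df = pd.DataFrame({
    "npreg": [6, 1, 8, 1, 0, 5],
    "bmi": [33.6, 26.6, 23.3, None, 43.1, 25.6],
})

missing = df.isna().sum()           # number of missing values per column
unique = df.nunique()               # number of distinct non-null values per column
summary = df.describe()             # count, mean, std, min, quartiles, max
median = df.median()                # median per column
value_range = df.max() - df.min()   # range per column

print(missing, unique, summary, sep="\n\n")
```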
Step 4:
Now, based on the insights derived from the previous step, the best fit for this kind of
problem is the decision tree. Let’s see why.
• Since we already have the major attributes for analysis, like npreg, bmi, etc.,
we will use a supervised learning technique to build the model here.
• Further, we have chosen a decision tree in particular because it takes all
attributes into consideration in one go, both those which have a linear
relationship and those which have a non-linear relationship. In our
case, we have a linear relationship between npreg and age, and a
non-linear relationship between npreg and ped.
• Decision tree models are also very robust, as we can use different
combinations of attributes to build various trees and then finally implement
the one with the maximum efficiency.
Here, the most important parameter is the level of glucose, so it is our root node.
Now, the current node and its value determine the next important parameter to
be taken. It goes on until we get the result in terms of pos or neg. Pos means the
tendency of having diabetes is positive and neg means the tendency of having
diabetes is negative.
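To illustrate how one attribute can end up as the root node, here is a small sketch that scores every attribute by the Gini impurity of its best split and picks the lowest. The text does not say which splitting criterion was used, so Gini impurity is an assumption, and the six records are invented for the example.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels, feature):
    """Return (weighted impurity, threshold) of the best 'value <= threshold' split."""
    best = (float("inf"), None)
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        if not left or not right:
            continue  # a split must send records to both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[0]:
            best = (score, threshold)
    return best


# Hypothetical toy records; labels follow the pos/neg convention from the text.
rows = [
    {"glucose": 85, "bmi": 26.6, "age": 31},
    {"glucose": 89, "bmi": 28.1, "age": 21},
    {"glucose": 99, "bmi": 35.0, "age": 45},
    {"glucose": 148, "bmi": 33.6, "age": 50},
    {"glucose": 137, "bmi": 43.1, "age": 33},
    {"glucose": 183, "bmi": 23.3, "age": 32},
]
labels = ["neg", "neg", "neg", "pos", "pos", "pos"]

# Score every attribute and choose the one with the purest split as the root.
splits = {feature: best_split(rows, labels, feature) for feature in rows[0]}
root = min(splits, key=lambda feature: splits[feature][0])
print(root, splits[root])
```

A full tree repeats the same search inside each branch until the records in a node all share one label or no useful split remains.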
Step 5:
In this phase, we will run a small pilot project to check if our results are
appropriate. We will also look for performance constraints if any. If the results are
not accurate, then we need to replan and rebuild the model.
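A pilot check of this kind can be as simple as comparing predictions with known outcomes on a handful of held-out records. The sketch below is illustrative only: the glucose threshold of 120 stands in for the trained model, the pilot records are invented, and the 90% acceptance bar is an assumed requirement.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the known outcomes."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def predict(glucose, threshold=120):
    """Hypothetical one-rule model standing in for the trained decision tree."""
    return "pos" if glucose > threshold else "neg"


# Hypothetical pilot records: (glucose level, known outcome).
pilot = [(95, "neg"), (150, "pos"), (130, "neg"), (110, "neg"), (170, "pos")]

predictions = [predict(glucose) for glucose, _ in pilot]
score = accuracy(predictions, [outcome for _, outcome in pilot])

# Assumed acceptance bar; below it the model goes back for replanning.
needs_rebuild = score < 0.9
print(score, needs_rebuild)
```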
Step 6:
Once we have executed the project successfully, we will share the output for full
deployment.
Being a Data Scientist is easier said than done. So, let’s see what
you need to be a Data Scientist. A Data Scientist requires skills from
three major areas, as shown below.
As you can see in the above image, you need to acquire various hard skills and
soft skills. You need to be good at statistics and mathematics to analyze and
visualize data. Needless to say, Machine Learning forms the heart of Data Science
and requires you to be good at it. Also, you need to have a solid understanding of
the domain you are working in to understand the business problems clearly. Your
task does not end here. You should be capable of implementing various
algorithms which require good coding skills. Finally, once you have made certain
key decisions, it is important for you to deliver them to the stakeholders. So,
good communication will definitely add brownie points to your skills.
In the end, it won’t be wrong to say that the future belongs to Data Scientists.
It is predicted that by the end of the year 2018, there will be a need for around one
million Data Scientists. More and more data will provide opportunities to drive key
business decisions. It is soon going to change the way we look at the world
deluged with data around us. Therefore, a Data Scientist should be highly skilled
and motivated to solve the most complex problems.
Data science has been effective in tackling many real-world problems and is
being increasingly adopted across industries to power more intelligent and
better-informed decision-making. With the increased use of computers for
day-to-day business and personal operations, there is a demand for intelligent
machines that can learn human behavior and work patterns. This brings data
science and big data analytics to the forefront.
A study says that the global data science market is estimated to grow to USD 115
billion in 2023 with a CAGR of ~29%. A report by Deloitte Access Economics says
that a massive 76% of businesses plan to increase their spending
over the next two years on their data analytics capabilities. Almost all
industries can benefit from data science and analytics. However, below are some
industries that are better poised to make use of data science and analytics.
1. BFSI
2. Media & Entertainment
3. Healthcare
4. Retail
5. Telecommunications
6. Automotive
7. Digital Marketing
8. Professional Services
9. Cyber Security
1. BFSI
There are many ways data science and AI can help financial institutions to be more
efficient in providing services to their clients, some of which include –
• Fraud detection
• Lending and loan appraisal management
• Risk modeling
• Securing and managing customer data
• Lifetime value prediction
• Customer segmentation
• Algorithmic trading
• Underwriting and credit scoring
Top Employers
• JPMorgan Chase
• HDFC
• ICICI Bank
• HSBC
• Citi Group
• BNP Paribas
2. Media & Entertainment
The big players in the field of media and entertainment industry, including the
likes of YouTube, Netflix, Hotstar, etc. have started applying data science to
understand their customers and offer them the most relevant and customized
recommendations. Even the regular entertainment channels and gossip
newsfeeds are relying heavily on user data.
A recent report by PwC suggests that India’s OTT video market will grow at a 21.8%
CAGR from INR 4464Cr in 2018 to INR 11976Cr in 2023. Subscription video on
demand will increase at a 23.3% CAGR from INR 3756Cr in 2018 to INR 10708Cr in
2023.
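The growth rates quoted above can be checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. The sketch below assumes the 2018 and 2023 figures are five years apart; the function name cagr is chosen for illustration.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures in INR crore, taken from the paragraph above.
ott_video = cagr(4464, 11976, 5)
subscription_video = cagr(3756, 10708, 5)

print(f"OTT video market: {ott_video:.1%}")
print(f"Subscription video on demand: {subscription_video:.1%}")
```

Both results agree with the quoted 21.8% and 23.3% to one decimal place.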
Data science strategies, especially machine learning and artificial intelligence have
scaled up the media and entertainment industry through –
Fig – This graph shows consumer and advertising spending in digital and
traditional channels across media and entertainment sectors including TV, film,
music, gaming, books, magazines, and news.
Top Employers
• Dish Network
• Netflix
• Time Warner
• Fox
• BuzzFeed
• Viacom
• Hindustan Media
• NDTV
3. Healthcare
In the healthcare industry, where most data is unstructured and it isn’t easy to access
and analyze all the data, hospitals and healthcare centers seek data scientists who
can assemble fragmented heterogeneous data. From electronic medical records,
clinical trials, and genetic information to billing, wearable data, care management
• Prevention plans
• Diagnosis of diseases
• Delivering more precise prescriptions and customized care
• Post-Care Monitoring
• Hospital operations
Top Employers
• GSK
• GE Healthcare
• Sanofi
4. Retail
Even the onslaught of the global pandemic, store closures, and layoffs couldn’t
impact data scientists’ demand in the retail segment. The consumer-focused retail
industry thrives on increasing personalization and relevance, with one aim – to
understand the shopper’s behavior and patterns through data.
Top Recruiters
• Amazon
• Flipkart
• Walmart
• Aditya Birla Fashion & Retail Ltd.
• Future Enterprises Ltd.
• Reliance Retail Ltd.
• K. Raheja Group (Shoppers’ Stop)
• Landmark Group (Lifestyle)
• ITC
5. Telecommunications
Top Employers
6. Automotive
With automobiles getting more complex and capable of collecting more data, it
would not be possible to monitor wear and tear and report on mileage, fuel efficiency,
and routes without data science. Be ready for the future cars that will
communicate, collaborate, and navigate without human intervention! All thanks
to the power of data.
Top Employers
• General Motors
• Volkswagen
• Maruti Suzuki
Almost every industry is trying to leverage the power of data to thrive in the
market.
7. Digital Marketing
Search pages, social networks, web traffic display networks, videos, web pages,
CRMs, databases, etc. are now fetching huge volumes of data from their customers.
Analysis of such high volumes of data requires a high level of business intelligence
and this can only be achieved through the right usage of data science
methodologies. Given these reasons, digital marketing is now increasingly using
data science for analytical purposes. This data-driven information can help
marketing/brand managers obtain crucial information such as –
Top Recruiters
8. Professional Services
Data science jobs are abundantly available across the professional services
industry since analytics is generating buzz. Big conglomerates, as well as SMEs,
seek professionals who can collect company data, store it, guarantee its security
and treat it appropriately. The process of implementing data science strategies
is very different for different businesses, and the methodologies in data collection,
data processing, and data visualization vary based on the industry functions. The job
Top Employers
• Legal services
• Accounting and bookkeeping
• Marketing consultancy
• IT services
• Customer service
• Logistics
9. Cyber Security
Data science in cybersecurity has made the computing processes more actionable
and intelligent compared to conventional information handling. It creates an
environment for data collection from relevant cybersecurity sources and analysis
that complements data-driven patterns. This fusion of data science and
cybersecurity represents a partial paradigm shift from traditional security
solutions such as user authentication, access control, cryptography, and firewalls
to systematic data management.
• Accenture
• Cisco
• IBM
• Microsoft
• National Informatics Centre
• QuickHeal Technologies Limited
• McAfee
Python is one of the world's most popular programming languages, and there are
a few reasons why Python is so popular:
Python’s syntax, or the words and symbols used in order to make a computer
program work, is simple and intuitive. They're basically English words!
Python supports various paradigms, but most people would describe Python as
an object-oriented programming language. In an object-oriented programming
language, everything you create is an object, different objects have different
properties, and you can operate on different objects in different ways.
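As a small illustration of objects, properties, and operations, here is a sketch with a made-up Employee class; the class, its attributes, and the sample values are hypothetical.

```python
class Employee:
    """An object with properties (name, salary) and an operation (give_raise)."""

    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

    def give_raise(self, percent):
        """Increase the salary by the given percentage."""
        self.salary += self.salary * percent / 100


analyst = Employee("Asha", 50000)
analyst.give_raise(10)
print(analyst.name, analyst.salary)

# Built-in values are objects too, each with its own operations.
print(isinstance(42, object), "data".upper(), [3, 1, 2].count(1))
```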
The better question is what can't it be used for? Here are some key places where
you may see Python:
Web Development – Developers, engineers, and data scientists use Python for
web scraping or creating a mock-up of an app.
Automating Reports – Analysts or product managers who need to make the same
Excel report every single week can use Python to help create reports and save
time.
Finance and Business – Used for reporting, predictive models, and academic
research.
Why do you think Python has recently overtaken R in popularity among data
scientists?
There are a couple of reasons I think Python has taken off. Python is a general
purpose language, used by data scientists and developers, which makes it easy to
collaborate across your organization through its simple syntax. People choose to
use Python so that they can communicate with other people. The other reason is
rooted in academic research and statistical models. I would say that R has better
statistical packages than Python, but Python has deep learning, structured ways
to do machine learning, and can deal with larger amounts of data. As people shift
more to deep learning, the bias has been shifting toward Python.
Within any field, you have to get the fundamentals of Python down first before
you can move on to more interesting things. Here’s a list of fundamentals you can
get started with, in order:
Understand what data types are (integers, strings, floating point numbers) and
how all of those data types are different.
Learn loops and conditionals – Loops execute a block of code several times, and
conditionals decide whether a block of code runs at all, for example when a loop
should stop.
Learn how to manipulate data – Practice this by reading data into your Python
program and then doing some kind of computations on it, cleaning it up, and
maybe even writing it out to a CSV file. You'll want to understand exactly how you
can manipulate data because that is a huge part of a data scientist’s job.
Algorithms – use algorithms to build models and maybe even create your own
models.
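The first three fundamentals can be seen together in one short sketch: a few basic data types, a loop with a conditional, and reading, cleaning, and writing CSV data. The sample data is made up, and in-memory text buffers stand in for real files so the example runs on its own.

```python
import csv
import io

# Basic data types
count = 3          # integer
price = 19.99      # floating point number
label = "sensor"   # string

# Read CSV data (an in-memory buffer stands in for a file on disk).
raw = io.StringIO("name,reading\na,10\nb,\nc,30\n")
rows = list(csv.DictReader(raw))

# Loop over the rows; the conditional skips rows with a missing reading.
cleaned = []
for row in rows:
    if row["reading"] == "":
        continue
    cleaned.append({"name": row["name"], "reading": int(row["reading"]) * 2})

# Write the cleaned, transformed rows back out as CSV.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "reading"])
writer.writeheader()
writer.writerows(cleaned)
print(out.getvalue())
```

With real files, the two buffers would be replaced by open() calls on the input and output paths.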
Data Analysis Process consists of the following phases that are iterative in nature:
Data Collection
Data Processing
The data that is collected must be processed or organized for analysis. This
includes structuring the data as required for the relevant Analysis Tools. For
example, the data might have to be placed into rows and columns in a table within
a Spreadsheet or Statistical Application. A Data Model might have to be created.
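A minimal sketch of this structuring step with pandas, assuming raw records arrive as semicolon-separated text; the record layout and the column names date, region, and units are hypothetical.

```python
import pandas as pd

# Hypothetical raw records in the form "date;region;units".
raw_records = [
    "2023-01-05;north;120",
    "2023-01-05;south;95",
    "2023-01-06;north;130",
]

# Place the data into rows and columns.
table = pd.DataFrame(
    [record.split(";") for record in raw_records],
    columns=["date", "region", "units"],
)

# Give each column a type suited to analysis.
table["date"] = pd.to_datetime(table["date"])
table["units"] = table["units"].astype(int)

# Once structured, the table supports direct analysis.
units_by_region = table.groupby("region")["units"].sum()
print(table)
print(units_by_region)
```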
Data Cleaning
Data Analysis
Data that is processed, organized and cleaned would be ready for the analysis.
Various data analysis techniques are available to understand, interpret, and
derive conclusions based on the requirements. Data Visualization may also be
used to examine the data in graphical format, to obtain additional insight
regarding the messages within the data.
The process might require additional Data Cleaning or additional Data Collection,
and hence these activities are iterative in nature.
Communication
The results of the data analysis are to be reported in a format as required by the
users to support their decisions and further action. The feedback from the users
might result in additional analysis.
The data analysts can choose data visualization techniques, such as tables and
charts, which help in communicating the message clearly and efficiently to the
• Python3
import pandas as pd
import numpy as np

# Load the employee records from a CSV file into a DataFrame
df = pd.read_csv('employees.csv')
# Preview the first five rows
df.head()
Output:
EDA techniques are either graphical or quantitative. Each of these techniques is,
in turn, either univariate or multivariate (usually just bivariate). Quantitative
methods normally involve calculation of summary statistics. Graphical methods
summarize the data in a diagrammatic or visual way. Univariate methods look at
one variable (data column) at a time, while multivariate methods look at two or
more variables at a time to explore relationships. Usually, multivariate EDA will be
For graphical analysis of univariate quantitative data, histograms are typically used.
The histogram represents the frequency (count) or proportion (count/total count)
of cases for a range of values. Typically, between about 5 and 30 bins are chosen.
Histograms are one of the best ways to quickly learn a lot about your data,
including central tendency, spread, modality, shape and outliers. Stem and Leaf
plots could also be used for the same purpose. Boxplots can also be used to
present information about the central tendency, symmetry and skew, as well as
outliers. Quantile normal plots or QQ plots and other techniques could also be
used here.
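The numbers behind a histogram and a boxplot can be computed directly with NumPy. This is a sketch on invented values; the choice of 6 bins and the 1.5 × IQR rule for flagging outliers are common conventions assumed here, not something the text prescribes.

```python
import numpy as np

# Hypothetical univariate sample
values = np.array([21, 22, 25, 27, 30, 31, 33, 35, 41, 50, 52, 81])

# Histogram: count of cases falling in each of 6 equal-width bins.
counts, edges = np.histogram(values, bins=6)
proportions = counts / counts.sum()   # count / total count

# Boxplot ingredients: quartiles, interquartile range, and outlier fences.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(counts, edges)
print(q1, median, q3, outliers)
```

Plotting libraries such as matplotlib draw the same quantities with plt.hist and plt.boxplot.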
1. Bar Graph
MODULE-II