Data Science Unit 1

Data Science
Computer Science & Engineering

Code: CSC602 Data Science
MODULE-I
INTRODUCTION: -
Introduction to data science, Different sectors of using data science, Purpose and
components of Python, Data Analytics processes, Exploratory data analytics,
Quantitative technique and graphical technique, Data types for plotting.
MODULE-II
STATISTICAL ANALYSIS: -
Introduction to statistics, statistical and non-statistical analysis, major categories
of statistics, population and sample, Measure of central tendency and dispersion,
Moments, Skewness and kurtosis, Correlation and regression, Theoretical
distributions – Binomial, Poisson, Normal
MODULE-III
INTRODUCTION TO MACHINE LEARNING: -
Machine learning, Types of learning, Properties of learning algorithms, Linear
regression and regularization, model selection and evaluation, classification: SVM,
kNN and decision tree, Ensemble methods: random forest, Naive Bayes and
logistic regression, Clustering: k-means, feature engineering and selection,
Dimensionality reduction: PCA
MODULE-IV
PYTHON SETUP FOR MATHEMATICAL AND SCIENTIFIC COMPUTING: -
Anaconda installation process, data types with python, basic operators and setup,
introduction to numpy, mathematical functions of numpy, introduction to scipy,
scipy packages, data frame and data operations, data visualisation using
matplotlib
Text Books:
1. N. G. Das, Statistical Methods (combined edition Vol. I and Vol. II) – McGraw Hill
2. Roger D. Peng, Elizabeth Matsui, The Art of Data Science: A Guide for Anyone
Who Works with Data – Leanpub
3. Aurélien Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow –
O’Reilly
Reference Books:
1. Andriy Burkov, The Hundred-Page Machine Learning Book – Xpress Publishing
2. James, G., Witten, D., Hastie, T., Tibshirani, R. An introduction to statistical
learning with applications in R. Springer.
3. Murphy, K. Machine Learning: A Probabilistic Perspective. - MIT Press
4. Jan Erik Solem, Programming Computer Vision with Python – O’Reilly
1|Page Notes by Prof. Deepak Kumar

MODULE-I
INTRODUCTION: -
Introduction to data science, Different sectors of using data science, Purpose and
components of Python, Data Analytics processes, Exploratory data analytics,
Quantitative technique and graphical technique, Data types for plotting.
What is Data Science?
Data Science is a blend of various tools, algorithms, and machine learning
principles with the goal of discovering hidden patterns in raw data. But how is
this different from what statisticians have been doing for years?
The answer lies in the difference between explaining and predicting.
As you can see from the above image, a Data Analyst usually explains what is
going on by processing the history of the data. A Data Scientist, on the other
hand, not only does exploratory analysis to discover insights from the data, but
also uses various advanced machine learning algorithms to predict the occurrence
of a particular event in the future. A Data Scientist will look at the data from
many angles, sometimes angles not known earlier.
So, Data Science is primarily used to make decisions and predictions making use
of predictive causal analytics, prescriptive analytics (predictive plus decision
science) and machine learning.
• Predictive causal analytics – If you want a model that can predict the
possibilities of a particular event in the future, you need to apply predictive
causal analytics. Say you are providing money on credit; then the
probability of customers making future credit payments on time is a matter
of concern for you. Here, you can build a model that performs predictive
analytics on the payment history of the customer to predict whether future
payments will be on time.
• Prescriptive analytics – If you want a model that has the intelligence to
take its own decisions and the ability to modify them with dynamic
parameters, you need prescriptive analytics. This relatively
new field is all about providing advice. In other terms, it not only predicts
but also suggests a range of prescribed actions and associated outcomes.
The best example of this is Google’s self-driving car. The data gathered by
vehicles can be used to train self-driving cars. You can run algorithms on
this data to bring intelligence to it. This will enable the car to take
decisions like when to turn, which path to take, and when to slow down or
speed up.
• Machine learning for making predictions – If you have transactional
data of a finance company and need to build a model to determine the
future trend, then machine learning algorithms are the best bet. This falls
under the paradigm of supervised learning. It is called supervised because
you already have labelled data on which you can train your machines. For
example, a fraud detection model can be trained using a historical record
of fraudulent purchases.
• Machine learning for pattern discovery – If you don’t have the
parameters based on which you can make predictions, then you need to
find the hidden patterns within the dataset to be able to make
meaningful predictions. This is the unsupervised model, as you
don’t have any predefined labels for grouping. The most common
algorithm used for pattern discovery is clustering.
Let’s say you are working in a telephone company and you need to establish
a network by putting towers in a region. Then, you can use the clustering
technique to find those tower locations which will ensure that all the users
receive optimum signal strength.
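The tower-placement idea can be sketched with k-means, one common clustering algorithm. This is a minimal illustration only: the user coordinates and the starting tower positions are made up, and k-means minimises the squared distance from users to their tower, which is an assumed stand-in for "optimum signal strength".

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach every user to the nearest tower.
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each tower to the mean position of its users.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old position if a tower has no users
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, assignment

# Hypothetical user coordinates (in km) forming two neighbourhoods.
users = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
towers, owner = kmeans(users, centroids=[users[0], users[3]])
print(towers)  # one suggested tower position per cluster
print(owner)   # index of the tower serving each user
```

No labels are supplied anywhere: the grouping comes only from the positions of the users, which is what makes this an unsupervised technique.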
Let’s see how the proportion of the above-described approaches differs between
Data Analysis and Data Science. As you can see in the image below, Data
Analysis includes descriptive analytics and prediction to a certain extent. On the
other hand, Data Science is more about Predictive Causal Analytics and Machine
Learning.

Why Data Science?
• Traditionally, the data that we had was mostly structured and small in size,
and could be analyzed by using simple BI tools. Today, by contrast, most of
the data is unstructured or semi-structured. Let’s have a look at the data
trends in the image given below, which shows that by 2020 more than 80% of
the data will be unstructured.
This data is generated from different sources like financial logs, text files,
multimedia forms, sensors, and instruments. Simple BI tools are not
capable of processing this huge volume and variety of data. This is why we
need more complex and advanced analytical tools and algorithms for
processing, analyzing and drawing meaningful insights out of it.
This is not the only reason why Data Science has become so popular. Let’s dig
deeper and see how Data Science is being used in various domains.
• What if you could understand the precise requirements of your
customers from existing data like their past browsing history,
purchase history, age and income? No doubt you had all this data earlier
too, but now, with the vast amount and variety of data, you can train models
more effectively and recommend products to your customers with more
precision. Wouldn’t it be amazing, as it will bring more business to your
organization?
• Let’s take a different scenario to understand the role of Data Science
in decision making. What if your car had the intelligence to drive you
home? Self-driving cars collect live data from sensors, including radars,
cameras, and lasers, to create a map of their surroundings. Based on this
data, the car takes decisions like when to speed up, when to slow down, when
to overtake, and where to take a turn, making use of advanced machine
learning algorithms.
• Let’s see how Data Science can be used in predictive analytics. Let’s take
weather forecasting as an example. Data from ships, aircraft, radars,
satellites can be collected and analyzed to build models. These models will
not only forecast the weather but also help in predicting the occurrence of
any natural calamities. It will help you to take appropriate measures
beforehand and save many precious lives.
Let’s have a look at the below infographic to see all the domains where Data
Science is creating its impression.
Who is a Data Scientist?
There are several definitions available on Data Scientists. In simple words, a Data
Scientist is one who practices the art of Data Science. The term “Data Scientist”
has been coined after considering the fact that a Data Scientist draws a lot of
information from the scientific fields and applications whether it is statistics or
mathematics.
What does a Data Scientist do?

Data scientists are those who crack complex data problems with their strong
expertise in certain scientific disciplines. They work with several elements related
to mathematics, statistics, computer science, etc. (though they may not be
experts in all these fields). They make a lot of use of the latest technologies in
finding solutions and reaching conclusions that are crucial for an organization’s
growth and development. Data Scientists present the data in a much more useful
form as compared to the raw data available to them from structured as well as
unstructured forms.
Moving further, let’s now discuss BI. I am sure you might have heard of Business
Intelligence (BI) too. Often Data Science is confused with BI. I will state some
concise and clear contrasts between the two which will help you in getting a better
understanding. Let’s have a look.
Business Intelligence (BI) vs. Data Science
• Business Intelligence (BI) basically analyses previous data to find
hindsight and insight to describe business trends. Here BI enables you to
take data from external and internal sources, prepare it, run queries on it
and create dashboards to answer questions like quarterly revenue
analysis or business problems. BI can evaluate the impact of certain events
in the near future.
• Data Science is a more forward-looking approach, an exploratory way with
the focus on analysing past or current data and predicting future
outcomes with the aim of making informed decisions. It answers
open-ended questions as to “what” and “how” events occur.
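To make the contrast concrete, here is a small sketch under stated assumptions: the sales table, its rows and the straight-line trend are all invented for illustration. The first half answers a BI-style question (revenue per quarter) with a query; the second half takes the forward-looking step by extrapolating a least-squares line to the next quarter.

```python
import sqlite3

# Hypothetical sales table; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (quarter TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Q1", 120.0), ("Q1", 80.0), ("Q2", 150.0), ("Q2", 100.0), ("Q3", 300.0)],
)

# BI-style question about the past: what was the revenue in each quarter?
quarterly = conn.execute(
    "SELECT quarter, SUM(revenue) FROM sales GROUP BY quarter ORDER BY quarter"
).fetchall()
conn.close()
print(quarterly)

# Forward-looking step: fit a least-squares line through the quarterly
# totals and extrapolate it one quarter ahead (a deliberately naive model).
totals = [total for _, total in quarterly]
n = len(totals)
xs = range(1, n + 1)
mean_x = sum(xs) / n
mean_y = sum(totals) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, totals))
    / sum((x - mean_x) ** 2 for x in xs)
)
forecast_q4 = mean_y + slope * (n + 1 - mean_x)
print(forecast_q4)
```

A real forecast would need far more history and a validated model; the point here is only the difference between summarising what happened and estimating what comes next.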
Let’s have a look at some contrasting features.
Features     | Business Intelligence (BI)                     | Data Science
Data Sources | Structured (usually SQL, often Data Warehouse) | Both structured and unstructured (logs, cloud data, SQL, NoSQL, text)
Approach     | Statistics and Visualization                   | Statistics, Machine Learning, Graph Analysis, Natural Language Processing (NLP)
Focus        | Past and Present                               | Present and Future
Tools        | Pentaho, Microsoft BI, QlikView, R             | RapidMiner, BigML, Weka, R
That was all about what Data Science is. Now let’s understand the lifecycle of Data
Science.
A common mistake made in Data Science projects is rushing into data collection
and analysis, without understanding the requirements or even framing the
business problem properly. Therefore, it is very important for you to follow all the
phases throughout the lifecycle of Data Science to ensure the smooth functioning
of the project.
Lifecycle of Data Science
Here is a brief overview of the main phases of the Data Science Lifecycle:
Phase 1 – Discovery: Before you begin the project, it is important to
understand the various specifications, requirements, priorities and required
budget. You must possess the ability to ask the right questions. Here, you
assess if you have the required resources present in terms of people,
technology, time and data to support the project. In this phase, you also need
to frame the business problem and formulate initial hypotheses (IH) to test.
Phase 2 – Data preparation: In this phase, you require an analytical
sandbox in which you can perform analytics for the entire duration
of the project. You need to explore, preprocess and condition data
prior to modeling. Further, you will perform ETLT (extract, transform,
load and transform) to get data into the sandbox. Let’s have a look at the
Statistical Analysis flow below.
You can use R for data cleaning, transformation, and visualization. This will help
you to spot the outliers and establish a relationship between the variables. Once
you have cleaned and prepared the data, it’s time to do exploratory analytics on
it. Let’s see how you can achieve that.
Phase 3 – Model planning: Here, you will determine the methods
and techniques to draw the relationships between variables. These
relationships will set the base for the algorithms which you will
implement in the next phase. You will apply Exploratory Data
Analytics (EDA) using various statistical formulas and visualization
tools.
Let’s have a look at various model planning tools.
1. R has a complete set of modeling capabilities and provides a good
environment for building interpretive models.
2. SQL Analysis Services can perform in-database analytics using common
data mining functions and basic predictive models.
3. SAS/ACCESS can be used to access data from Hadoop and is used for
creating repeatable and reusable model flow diagrams.
Although many tools are present in the market, R is the most commonly used
tool.
Now that you have got insights into the nature of your data and have decided the
algorithms to be used, in the next stage you will apply the algorithm and build up
a model.

Phase 4 – Model building: In this phase, you will develop
datasets for training and testing purposes. Here you need
to consider whether your existing tools will suffice for running the
models or whether you will need a more robust environment (like fast and
parallel processing). You will analyze various learning techniques like
classification, association and clustering to build the model.
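Developing datasets for training and testing can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the data is hypothetical, the 25% test fraction is an arbitrary choice, and a one-nearest-neighbour classifier is swapped in simply because it is the shortest classifier to write by hand.

```python
import math
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split them into (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def predict_1nn(train, features):
    """Return the label of the training row closest to `features`."""
    nearest = min(train, key=lambda row: math.dist(row[0], features))
    return nearest[1]

def accuracy(train, test):
    """Fraction of held-out rows whose label is predicted correctly."""
    hits = sum(predict_1nn(train, x) == label for x, label in test)
    return hits / len(test)

# Hypothetical labelled data: (features, label), two well-separated groups.
data = (
    [((float(i), float(i)), "low") for i in range(8)]
    + [((float(i), float(i)), "high") for i in range(100, 108)]
)
train, test = train_test_split(data)
print(len(train), len(test), accuracy(train, test))
```

The model never sees the test rows while it is being built, so the accuracy on them is an honest estimate of how it behaves on new data.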
You can achieve model building through the following tools.
Phase 5 – Operationalize: In this phase, you deliver final reports,
briefings, code and technical documents. In addition, sometimes
a pilot project is also implemented in a real-time production
environment. This will provide you a clear picture of the
performance and other related constraints on a small scale before
full deployment.
Phase 6 – Communicate results: Now it is important to evaluate
whether you have been able to achieve the goal that you had planned
in the first phase. So, in the last phase, you identify all the key
findings, communicate them to the stakeholders and determine if the
results of the project are a success or a failure based on the criteria developed in
Phase 1.
Now, I will take a case study to explain the various phases described above.
Case Study: Diabetes Prevention
What if we could predict the occurrence of diabetes and take appropriate
measures beforehand to prevent it?
In this use case, we will predict the occurrence of diabetes making use of the entire
lifecycle that we discussed earlier. Let’s go through the various steps.
Step 1:

• First, we will collect the data based on the medical history of the patient as
discussed in Phase 1. You can refer to the sample data below.
• As you can see, we have the various attributes as mentioned below.
Attributes:
1. npreg – Number of times pregnant
2. glucose – Plasma glucose concentration
3. bp – Blood pressure
4. skin – Triceps skinfold thickness
5. bmi – Body mass index
6. ped – Diabetes pedigree function
7. age – Age
8. income – Income
Step 2:
• Now, once we have the data, we need to clean and prepare the data for
data analysis.
• This data has a lot of inconsistencies like missing values, blank columns,
abrupt values and incorrect data format which need to be cleaned.
• Here, we have organized the data into a single table under different
attributes – making it look more structured.
• Let’s have a look at the sample data below.
This data has a lot of inconsistencies.
1. In the column npreg, “one” is written in words, whereas it should be in
numeric form, like 1.
2. In the column bp, one of the values is 6600, which is impossible (at least for
humans) as bp cannot go up to such a huge value.
3. As you can see the Income column is blank and also makes no sense in
predicting diabetes. Therefore, it is redundant to have it here and should
be removed from the table.
• So, we will clean and preprocess this data by removing the outliers, filling
up the null values and normalizing the data type. If you remember, this is
our second phase which is data preprocessing.
• Finally, we get the clean data as shown below which can be used for
analysis.
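The cleaning steps described above can be sketched in plain Python. Everything specific here is an assumption made for illustration: the sample rows, the word-to-number map, the plausibility limit of 300 for bp, and the choice to replace bad values with the mean of the valid ones (dropping the affected rows would be an equally reasonable choice).

```python
# Hypothetical raw records showing the three problems described above.
raw = [
    {"npreg": "one", "glucose": 148,  "bp": 72,   "income": ""},
    {"npreg": 6,     "glucose": 85,   "bp": 6600, "income": ""},
    {"npreg": 2,     "glucose": None, "bp": 66,   "income": ""},
    {"npreg": 0,     "glucose": 137,  "bp": 40,   "income": ""},
]

WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3}  # assumed word-to-number map
BP_MAX = 300  # assumed plausibility limit for blood pressure

def mean_of_valid(rows, column, upper=None):
    """Mean of a column, ignoring missing values and values above `upper`."""
    values = [
        r[column] for r in rows
        if r[column] is not None and (upper is None or r[column] <= upper)
    ]
    return sum(values) / len(values)

def clean(rows):
    bp_fill = mean_of_valid(rows, "bp", upper=BP_MAX)
    glucose_fill = mean_of_valid(rows, "glucose")
    cleaned = []
    for r in rows:
        # Normalise the data type: counts written as words become integers.
        npreg = WORDS[r["npreg"]] if isinstance(r["npreg"], str) else r["npreg"]
        # Fill null values with the mean of the valid readings.
        glucose = glucose_fill if r["glucose"] is None else r["glucose"]
        # Replace the impossible outlier with the mean of plausible readings.
        bp = bp_fill if r["bp"] > BP_MAX else r["bp"]
        # The blank income column is dropped by simply not copying it.
        cleaned.append({"npreg": npreg, "glucose": glucose, "bp": bp})
    return cleaned

cleaned = clean(raw)
for row in cleaned:
    print(row)
```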

Step 3:
Now let’s do some analysis as discussed earlier in Phase 3.
• First, we will load the data into the analytical sandbox and apply various
statistical functions on it. For example, R has functions like describe which
gives us the number of missing values and unique values. We can also use
the summary function which will give us statistical information like mean,
median, range, min and max values.
• Then, we use visualization techniques like histograms, line graphs, box
plots to get a fair idea of the distribution of data.
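The same kind of summary can be produced in Python with the standard library. This is a sketch on hypothetical glucose readings, not the R functions mentioned above; it reports the missing and unique counts together with the mean, median, range, min and max.

```python
import statistics

# Hypothetical glucose readings; None marks a missing value.
glucose = [148, 85, None, 183, 89, 137, 116, 78, 115]
present = [v for v in glucose if v is not None]

summary = {
    "missing": len(glucose) - len(present),
    "unique": len(set(present)),
    "min": min(present),
    "max": max(present),
    "range": max(present) - min(present),
    "mean": statistics.mean(present),
    "median": statistics.median(present),
}
for name, value in summary.items():
    print(f"{name:>8}: {value}")
```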
Step 4:
Now, based on insights derived from the previous step, the best fit for this kind of
problem is the decision tree. Let’s see how?
• Since we already have the major attributes for analysis, like npreg, bmi, etc.,
we will use a supervised learning technique to build a model here.
• Further, we have particularly used a decision tree because it takes all
attributes into consideration in one go, both the ones which have a linear
relationship and those which have a non-linear relationship. In our
case, we have a linear relationship between npreg and age, whereas there is a
non-linear relationship between npreg and ped.
• Decision tree models are also very robust, as we can use different
combinations of attributes to make various trees and then finally implement
the one with the maximum efficiency.
Let’s have a look at our decision tree.

Here, the most important parameter is the level of glucose, so it is our root node.
Now, the current node and its value determine the next important parameter to
be taken. It goes on until we get the result in terms of pos or neg. Pos means the
tendency of having diabetes is positive and neg means the tendency of having
diabetes is negative.
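How a tree decides which attribute becomes the root can be sketched with the Gini impurity, one common splitting criterion. The notes do not say which criterion the tree above used, so treat this as an assumption; the patient rows are also invented, chosen so that glucose separates pos from neg cleanly.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, features):
    """Return (feature, threshold, weighted_gini) of the best binary split."""
    best = None
    for f in features:
        values = sorted({r[f] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r["type"] for r in rows if r[f] <= t]
            right = [r["type"] for r in rows if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[2]:
                best = (f, t, score)
    return best

# Hypothetical patients; "type" is the pos/neg outcome.
patients = [
    {"glucose": 85,  "bmi": 33.0, "age": 31, "type": "neg"},
    {"glucose": 89,  "bmi": 28.1, "age": 51, "type": "neg"},
    {"glucose": 110, "bmi": 37.6, "age": 30, "type": "neg"},
    {"glucose": 148, "bmi": 33.6, "age": 50, "type": "pos"},
    {"glucose": 166, "bmi": 25.8, "age": 26, "type": "pos"},
    {"glucose": 183, "bmi": 30.5, "age": 32, "type": "pos"},
]
root = best_split(patients, ["glucose", "bmi", "age"])
print(root)
```

A full tree repeats the same search inside each branch until the groups are pure or a stopping rule is reached.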
Step 5:
In this phase, we will run a small pilot project to check if our results are
appropriate. We will also look for performance constraints if any. If the results are
not accurate, then we need to replan and rebuild the model.
Step 6:
Once we have executed the project successfully, we will share the output for full
deployment.
Being a Data Scientist is easier said than done. So, let’s see what all
you need to be a Data Scientist. A Data Scientist requires skills basically from
three major areas as shown below.

As you can see in the above image, you need to acquire various hard skills and
soft skills. You need to be good at statistics and mathematics to analyze and
visualize data. Needless to say, Machine Learning forms the heart of Data Science
and requires you to be good at it. Also, you need to have a solid understanding of
the domain you are working in to understand the business problems clearly. Your
task does not end here. You should be capable of implementing various
algorithms which require good coding skills. Finally, once you have made certain
key decisions, it is important for you to deliver them to the stakeholders. So,
good communication will definitely add brownie points to your skills.
In the end, it won’t be wrong to say that the future belongs to the Data Scientists.
It is predicted that by the end of the year 2018, there will be a need of around one
million Data Scientists. More and more data will provide opportunities to drive key
business decisions. It is soon going to change the way we look at the world
deluged with data around us. Therefore, a Data Scientist should be highly skilled
and motivated to solve the most complex problems.
Different sectors of using data science
Data science has been effective in tackling many real-world problems and is
being increasingly adopted across industries to power more intelligent and
better-informed decision-making. With the increased use of computers for
day-to-day business and personal operations, there is a demand for intelligent
machines that can learn human behavior and work patterns. This brings data
science and big data analytics to the forefront.
A study says that the global data science market is estimated to grow to USD 115
billion in 2023 with a CAGR of ~29%. A report by Deloitte Access Economics says
that a massive 76 percent of businesses plan to increase their spending
over the next two years on their data analytics capabilities. Almost all
industries can benefit from data science and analytics. However, below are some
industries that are better poised to make use of data science and analytics.
1. BFSI
2. Media & Entertainment
3. Healthcare
4. Retail
5. Telecommunications
6. Automotive
7. Digital Marketing
8. Professional Services
9. Cyber Security
1. BFSI

Increased numbers of use cases in the Banking, Financial Services, and Insurance
(BFSI) sector have led to a massive increase in data to be analyzed and acted upon.
The segment has mainly been integrating data science in all decision-making
processes based on actionable insights from customer data.
There are many ways data science and AI can help financial institutions to be more
efficient in providing services to their clients, some of which include –
• Fraud detection
• Lending and loan appraisal management
• Risk modeling
• Securing and managing customer data
• Lifetime value prediction
• Customer segmentation
• Algorithmic trading
• Underwriting and credit scoring
Top Employers
• JPMorgan Chase
• HDFC
• ICICI Bank
• HSBC
• Citi Group
• BNP Paribas
2. Media & Entertainment
The big players in the media and entertainment industry, including the
likes of YouTube, Netflix, Hotstar, etc., have started applying data science to
understand their customers and offer them the most relevant and customized
recommendations. Even the regular entertainment channels and gossip
newsfeeds are relying heavily on user data.
A recent report by PwC suggests that India’s OTT video market will grow at a 21.8%
CAGR from INR 4464Cr in 2018 to INR 11976Cr in 2023. Subscription video on
demand will increase at a 23.3% CAGR from INR 3756Cr in 2018 to INR 10708Cr in
2023.
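The growth rates quoted above are compound annual growth rates: CAGR = (end value / start value)^(1 / years) - 1. As a quick check, the sketch below recomputes them from the start and end figures given in the text, assuming the period 2018 to 2023 counts as five years.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    return (end_value / start_value) ** (1 / years) - 1

# Figures quoted above, in INR crore, over the five years 2018 to 2023.
ott_total = cagr(4464, 11976, 5)
svod = cagr(3756, 10708, 5)
print(f"OTT video: {ott_total:.1%}")
print(f"Subscription video on demand: {svod:.1%}")
```

Both results round to the percentages stated in the report (21.8% and 23.3%).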

This new face of digital reality aims towards matching the personal preferences
of the users and is evoked in the concept of addressability, along with the ability
to interact with consumers as per their choices. In such a scenario where
everything is dependent on data, the media and entertainment industry seeks
data scientists who can collect, process, analyze, store, and provide
recommendations, and make a desirable impact on the business.
Data science strategies, especially machine learning and artificial intelligence have
scaled up the media and entertainment industry through –
• Customer sentiment analysis
• Hyper-Targeted Advertising
• Smart Recommendations & Personalized Content Experiences
• Real-time analytics
• Optimized Media Scheduling
• Programmatic Ad Buying
• Predictive Modelling for Targeted Content Generation
• Leveraging mobile and social media content
Fig – This graph shows consumer and advertising spending in digital and
traditional channels across media and entertainment sectors including TV, film,
music, gaming, books, magazines, and news.
Top Employers
• Dish Network
• Netflix
• Time Warner
• Fox
• BuzzFeed
• Viacom
• Hindustan Media
• NDTV
3. Healthcare
In the healthcare industry, where most data is unstructured and it isn’t easy to
access and analyze all the data, hospitals and healthcare centers seek data
scientists who can assemble fragmented, heterogeneous data. From electronic
medical records, clinical trials, and genetic information to billing, wearable
data, care management databases, scientific articles, etc., data science has made
it easier to manage all the information. The healthcare industry has surprisingly
emerged as the top creator of data science jobs in India in recent years. Data
science has also contributed to designing and evaluating healthcare strategies
that improve equity, opportunity, access, and health services quality. Some of the
areas with enormous scope for data science applicability include –
• Drug discovery
• Recognizing health risks and recommending prevention plans
• Diagnosis of diseases
• Delivering more precise prescriptions and customized care
• Post-Care Monitoring
• Hospital operations
While data scientists need to be proficient with programming languages and
statistics, this sector requires professionals with “softer” skills like storytelling and
excellent data communication skills to derive and communicate the desired
results.
Top Employers
• GSK
• GE Healthcare
• Sanofi
4. Retail
Even the onslaught of the global pandemic, store closures, and layoffs couldn’t
dent the demand for data scientists in the retail segment. The consumer-focused
retail industry thrives on increasing personalization and relevance, with one
aim – to understand the shopper’s behavior and patterns through data.
Data science has helped retail businesses to ensure better consumer
understanding. Data scientists are in high demand in retail as they bring a rare
mix of strong data knowledge, business acumen, technology skills, intuition, and
statistical expertise.
Data science in retail helps to –

• Analyze people’s past searches and purchases and help them find relevant
products
• Create a recommendation and personalization system
• Analyze customer behavior and market insights
• Improve customer experience through predictive analytics
Top Recruiters
• Amazon
• Flipkart
• Walmart
• Aditya Birla Fashion & Retail Ltd.
• Future Enterprises Ltd.
• Reliance Retail Ltd.
• K. Raheja Group (Shoppers’ Stop)
• Landmark Group (Lifestyle)
• ITC
5. Telecommunications
Now that subscribers consistently connect to telecommunications networks
through voice, text messages, social media, etc., telecom providers have access to
vast amounts of data. Other data sources, including website visits, past purchases,
search patterns, and customer demographics like address, age, gender, and
location, have proved to be crucial for the telecom businesses, and this is where
the role of data science comes in.
Useful classification and utilization of this humongous data have been
groundbreaking for telecom companies and have helped them cater to their
extensive range of consumers more accurately.
Data science enables telecom companies to –
• Make personalized offers to customers
• Allocate network resources
• Order predictive maintenance
• Ensure smarter network deployment
• Allow optimization and predictive maintenance of the networks
• Detect fraudulent activities
• Product innovation
• Contextualized location-based promotions

• Targeted campaigns
• Call Detail Record (CDR) analysis
• Optimized pricing
Top Employers
• Bharti Airtel Limited
• Reliance Jio
• BSNL
• Vodafone-IDEA
6. Automotive
Data Science has helped the automotive industry to remain competitive by
improving everything from research and design to manufacturing and marketing
processes. Moreover, the deployment of advanced analytics has led to the
development of autonomous automotive systems including sensors, cameras,
radar, Global Navigation Satellite System (GNSS), Inertial Navigation System
(INS), Light Detection and Ranging (LiDAR), and much more.
Among other things, data science helps the automotive industry to –
• Enhance vehicle safety with cognitive IoT
• Decrease repair costs
• Create and manage schedules more effectively
• Improve production line performance
• Identify defects in produced components using predictive maintenance
• Enable manufacturers to gain greater control over their supply chains,
including logistics and management
With automobiles getting more complex and capable of collecting more data, it
would not be possible to monitor wear and tear and report on mileage, fuel
efficiency, and routes without data science. Be ready for future cars that will
communicate, collaborate, and navigate without human intervention, all thanks
to the power of data.
Top Employers
• General Motors
• Volkswagen
• Maruti Suzuki

• Hyundai
• Honda
Almost every industry is trying to leverage the power of data to thrive in the
market.
7. Digital Marketing
Search pages, social networks, web traffic display networks, videos, web pages,
CRMs, databases, etc. are now fetching huge volumes of data from their customers.
Analysis of such high volumes of data requires a high level of business intelligence
and this can only be achieved through the right usage of data science
methodologies. Given these reasons, digital marketing is now increasingly using
data science for analytical purposes. This data-driven information can help
marketing/brand managers to –
• Predict user behaviors and make more informed business decisions
• Anticipate the user’s needs to send them highly personalized offers and
content they desire, thereby increasing the chances of converting into leads
• Spot patterns and trends that can help in product innovation
• Segment the markets and interact with the user more effectively
Top Recruiters
• Reliance Industries Limited
• Amazon
• Google
• Facebook
• Flipkart
• Walmart
• Aditya Birla Fashion & Retail Ltd.
8. Professional Services
Data science jobs are abundantly available across the professional services
industry since analytics is generating buzz. Big conglomerates, as well as SMEs,
seek professionals who can collect company data, store it, guarantee its security
and treat it appropriately. The process of implementing data science strategies
is very different for different businesses, and the methodologies in data collection,
data processing, and data visualization vary with the industry functions. The job
profile of data scientists in these professional services is mainly that of data
science consulting. With the help of data science, these services can improve their
analytical side, develop competencies, and understand the machinations of their
business, resulting in meeting the overall business objectives.
Top Employers
The professional services category mainly includes –
• Legal services
• Accounting and bookkeeping
• Marketing consultancy
• IT services
• Customer service
• Logistics
9. Cyber Security
Cybersecurity Ventures forecasts that global cybercrime costs will go up by 15 percent
annually, reaching $10.5 trillion USD annually by 2025, up from $3 trillion USD
in 2015. It is interesting to know that hackers these days use more sophisticated
artificial intelligence and deep learning techniques to carry out
cyber attacks.
To combat this increasing use of algorithms to perform malicious activities, the

cyber security industry too is now adopting data science and AI. Data scientists
use AI and machine learning techniques to understand the source of the malicious
activities by reading patterns of attack and devise solutions to tackle these attacks
and prevent their reoccurrence.
Data science in cybersecurity has made the computing processes more actionable
and intelligent compared to conventional information handling. It creates an
environment for data collection from relevant cybersecurity sources and analysis
that complements data-driven patterns. This fusion of data science and
cybersecurity represents a partial paradigm shift from traditional security
solutions such as user authentication, access control, cryptography, and firewalls
to systematic data management.

Top Recruiters
• Accenture
• Cisco
• IBM
• Microsoft
• National Informatics Centre
• QuickHeal Technologies Limited
• McAfee
Purpose and components of Python

What is Python?
Python is one of the world's most popular programming languages, and there are
a few reasons for that popularity:
Python’s syntax, that is, the words and symbols used to write a working program,
is simple and intuitive. Many of its keywords are plain English words!
Python supports several paradigms, but most people would describe it as an
object-oriented programming language. In an object-oriented programming
language, everything you create is an object, different objects have different
properties, and you can operate on different objects in different ways.
Python integrates well with other software components, making it a
general-purpose language that can be used to build a full end-to-end pipeline:
starting with data, cleaning it, building a model, and putting that model straight
into production.
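The idea that everything in Python is an object, with its own properties and operations, can be illustrated with a minimal sketch. The Employee class, its attributes, and the sample values are hypothetical and exist only for illustration.

```python
# Hypothetical example class; the name and attributes are illustrative only.
class Employee:
    def __init__(self, name, salary):
        self.name = name        # property of this particular object
        self.salary = salary

    def give_raise(self, percent):
        """Increase the salary by the given percentage."""
        self.salary += self.salary * percent / 100


alice = Employee("Alice", 50000)
alice.give_raise(10)            # an operation defined for Employee objects

# Built-in values are objects too, each type with its own methods.
word = "python"
numbers = [3, 1, 2]
numbers.sort()                  # lists can be sorted in place

print(alice.salary)
print(word.upper())             # strings can be upper-cased
print(numbers)
```

Strings, lists, and user-defined classes each expose different operations, which is what "operating on different objects in different ways" means in practice.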
What can Python be used for besides data science?
The better question is what can't it be used for? Here are some key places where
you may see Python:
Web Development – Developers, engineers, and data scientists use Python for
web scraping or for creating a mock-up of an app.
Automating Reports – Analysts or product managers who need to make the same
Excel report every single week can use Python to create those reports and save
time.
Finance and Business – Used for reporting, predictive models, and academic
research.
Simulations – When I was a postdoctoral fellow at Ohio State University, my
colleagues used Python to create simulations to study various behaviors with a
computer.
Why do you think Python has recently overtaken R in popularity among data
scientists?
There are a couple of reasons I think Python has taken off. First, Python is a
general-purpose language used by both data scientists and developers, and its
simple syntax makes it easy to collaborate across an organization. People choose
Python so that they can communicate with other people. The other reason is
rooted in academic research and statistical models. I would say that R has better
statistical packages than Python, but Python has deep learning libraries, more
structured ways to do machine learning, and can deal with larger amounts of data.
As people shift toward deep learning, the preference has been shifting toward
Python.
Python for Beginners
Python is an excellent first programming language for beginners because its
simple syntax lets you hit the ground running. Python is flexible in that you can
use it to do just about anything. It's also forgiving: many operations work the way
you would intuitively expect. Let's say we wanted to join two words like school
and house. In our minds, we would link these two words with the plus symbol
(school + house), which is exactly how you would do it in Python, with each word
written as a quoted string! Python is also a language that leaves plenty of room
for growth and for improving your code.
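The school + house example can be written out as a short sketch. The variable names are illustrative; the second half shows that Python does not guess when the types do not match.

```python
first = "school"
second = "house"

# The + operator joins (concatenates) two strings
combined = first + second
print(combined)

# Python does not guess across types: adding a string and a number is an error
try:
    "school" + 1
except TypeError as error:
    print("Cannot add a string and an integer:", error)
```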
Within any field, you have to get the fundamentals of Python down before you can
move on to more interesting things. Here is a list of fundamentals you can start
with, in order:
• Data types – Understand what data types are (integers, strings, floating point
numbers) and how they differ.
• Loops and conditionals – Loops execute a block of code several times, and
conditionals decide whether a block of code runs at all.
• Data manipulation – Practice this by reading data into your Python program,
doing some kind of computation on it, cleaning it up, and maybe even writing it
out to a CSV file. You'll want to understand exactly how you can manipulate data
because that is a huge part of a data scientist’s job.
• Algorithms – Use algorithms to build models, and maybe even create your own
models.
• Data visualization – This is my favorite part of data science! There are multiple
Python libraries or packages to help you do this.
• Communication – Begin communicating what you have learned in a way that
other people can understand; explaining it helps solidify that learning.
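Several of these fundamentals (data types, loops, conditionals, and reading and writing CSV data) fit into one small sketch. The sample records and column names are made up, and in-memory text stands in for real files so the example is self-contained.

```python
import csv
import io

# Hypothetical raw data; in practice this would be read from a file on disk.
raw = io.StringIO("name,age\nAsha,31\nRavi,\nMeena,27\n")

rows = []
for record in csv.DictReader(raw):      # loop: runs once per data row
    if record["age"] == "":             # conditional: skip incomplete rows
        continue
    age = int(record["age"])            # convert a string to an integer
    rows.append({"name": record["name"], "age": age, "age_in_months": age * 12})

# Write the cleaned rows back out in CSV form
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=["name", "age", "age_in_months"])
writer.writeheader()
writer.writerows(rows)
print(output.getvalue())
```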
Data Analytics processes

Data Analysis is the process of collecting, transforming, cleaning, and modeling
data with the goal of discovering the required information. The results are then
communicated, suggesting conclusions and supporting decision-making. Data
visualization is at times used to portray the data so that useful patterns are easier
to discover. The terms Data Modeling and Data Analysis are often used
interchangeably.
The Data Analysis Process consists of the following phases, which are iterative in
nature:
1. Data Requirements Specification
2. Data Collection
3. Data Processing
4. Data Cleaning
5. Data Analysis
6. Communication
Data Requirements Specification
The data required for analysis is based on a question or an experiment. Based on
the requirements of those directing the analysis, the data necessary as inputs to
the analysis is identified (e.g., population of people). Specific variables regarding a
population (e.g., age and income) may be specified and obtained. Data may be
numerical or categorical.
Data Collection
Data Collection is the process of gathering information on the targeted variables
identified as data requirements. The emphasis is on ensuring accurate and honest
collection of data. Accurate collection ensures that the decisions based on the
data are valid. Data Collection provides both a baseline to measure and a target to
improve.
Data is collected from various sources, ranging from organizational databases to
the information in web pages. The data thus obtained may not be structured and
may contain irrelevant information. Hence, the collected data must be subjected
to Data Processing and Data Cleaning.
Data Processing
The data that is collected must be processed or organized for analysis. This
includes structuring the data as required for the relevant Analysis Tools. For
example, the data might have to be placed into rows and columns in a table within
a Spreadsheet or Statistical Application. A Data Model might have to be created.
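As a rough illustration of placing collected data into rows and columns, the sketch below uses a pandas DataFrame. The records and column names are hypothetical.

```python
import pandas as pd

# Hypothetical records as they might arrive from the collection step
records = [
    {"name": "Asha", "age": 31, "income": 52000},
    {"name": "Ravi", "age": 45, "income": 61000},
    {"name": "Meena", "age": 27, "income": 48000},
]

# One row per record, one column per variable
table = pd.DataFrame(records, columns=["name", "age", "income"])

print(table)
print(table.dtypes)   # the data type pandas inferred for each column
```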
Data Cleaning
The processed and organized data may be incomplete, contain duplicates, or
contain errors. Data Cleaning is the process of preventing and correcting these
errors. There are several types of Data Cleaning, depending on the type of data.
For example, while cleaning financial data, certain totals might be compared
against reliable published numbers or defined thresholds. Likewise, quantitative
methods can be used to detect outliers, which may then be excluded from the
analysis.
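The cleaning steps described above can be sketched with pandas as follows. The data is made up, and the 1.5 × IQR rule used to flag outliers is one common convention rather than the only possible choice.

```python
import pandas as pd

# Hypothetical salary data with a duplicate row, a missing value, and an outlier
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", "Meena", "Kiran", "Divya", "Arjun"],
    "salary": [52000.0, 61000.0, 61000.0, None, 48000.0, 55000.0, 900000.0],
})

df = df.drop_duplicates()            # remove exact duplicate rows
df = df.dropna(subset=["salary"])    # drop rows with no salary recorded

# Flag outliers with the 1.5 * IQR rule (an assumed convention)
q1 = df["salary"].quantile(0.25)
q3 = df["salary"].quantile(0.75)
iqr = q3 - q1
within_range = df["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = df[within_range]

print(cleaned)
```

Whether an outlier should be dropped, corrected, or kept depends on the data and the question being asked; the sketch only shows the mechanics.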
Data Analysis
Data that is processed, organized, and cleaned is ready for analysis. Various data
analysis techniques are available to understand, interpret, and derive conclusions
based on the requirements. Data Visualization may also be used to examine the
data in graphical format, to obtain additional insight regarding the messages
within the data.
Statistical data models such as correlation and regression analysis can be used to
identify the relationships among the data variables. These models, which describe
the data, help simplify the analysis and communicate results.
The process might require additional Data Cleaning or additional Data Collection,
and hence these activities are iterative in nature.
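A minimal sketch of correlation and simple linear regression with NumPy is shown here. The paired observations are invented for illustration, and a straight-line fit is assumed to be appropriate.

```python
import numpy as np

# Hypothetical paired observations: years of experience and salary (thousands)
experience = np.array([1, 2, 3, 4, 5, 6], dtype=float)
salary = np.array([30, 35, 41, 44, 50, 55], dtype=float)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(experience, salary)[0, 1]

# Least-squares straight line: salary is modeled as slope * experience + intercept
slope, intercept = np.polyfit(experience, salary, deg=1)

print(f"correlation = {r:.3f}")
print(f"salary = {slope:.2f} * experience + {intercept:.2f}")
```

The correlation measures the strength of the linear relationship, while the fitted line gives a compact description that can be communicated or used for rough prediction.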
Communication
The results of the data analysis are reported in the format required by the users,
to support their decisions and further action. Feedback from the users might
result in additional analysis.
Data analysts can choose data visualization techniques, such as tables and charts,
that help communicate the message clearly and efficiently to the users. Analysis
tools provide facilities to highlight the required information with color codes and
formatting in tables and charts.
What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is an approach to analyzing data using visual
techniques. It is used to discover trends and patterns, or to check assumptions,
with the help of statistical summaries and graphical representations.
Dataset Used
To keep things simple, we will use a single dataset of employee data. It contains 8
columns, namely First Name, Gender, Start Date, Last Login, Salary, Bonus%,
Senior Management, and Team.
Let’s read the dataset using the Pandas module and print the first five rows with
the head() function.
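Example:
import pandas as pd
import numpy as np

# Load the employee dataset (assumes employees.csv is in the working directory)
df = pd.read_csv('employees.csv')

# Show the first five rows
print(df.head())
Output: a table of the first five rows of the dataset.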
Categorising EDA techniques
EDA techniques are either graphical or quantitative (non-graphical). Each of these
is, in turn, either univariate or multivariate. Quantitative methods normally involve
calculating summary statistics. Graphical methods summarize the data in a
diagrammatic or visual way. Univariate methods look at one variable (data column)
at a time, while multivariate methods look at two or more variables at a time to
explore relationships. Usually, multivariate EDA will be bivariate (looking at exactly
two variables). Thus, the four types of EDA techniques are: univariate
non-graphical, univariate graphical, multivariate non-graphical, and multivariate
graphical. Non-graphical and graphical methods complement each other.
Graphical methods can be seen as more qualitative, involving a degree of
subjective judgment, while quantitative methods are more objective.
If we are focusing on data from the observation of a single variable on n subjects,
i.e. a sample of size n, we also need to look graphically at the distribution of the
sample. Given a large enough sample size, the distribution is often assumed to be
approximately normal. There are exceptions to this idea (for example, the
distribution could evolve over time or be unknown), but in many cases the
normality assumption is a reasonable starting point.
Univariate non-graphical EDA
Univariate non-graphical EDA techniques are concerned with understanding the
underlying sample distribution and making observations about the population.
This also involves outlier detection. For univariate categorical data, we are
interested in the range of values and the frequency of each value. Univariate EDA
for quantitative data involves making preliminary assessments about the
population distribution of the variable using the data from the observed sample.
The characteristics of the population distribution inferred include center, spread,
modality, shape, and outliers. Measures of central tendency include the mean,
median, and mode. The most common measure of central tendency is the mean.
For skewed distributions, or when there is concern about outliers, the median may
be preferred. Measures of spread include variance, standard deviation, and
interquartile range. Spread is an indicator of how far away from the center we are
still likely to find data values. Univariate EDA also involves finding the skewness (a
measure of asymmetry) and kurtosis (a measure of peakedness relative to a
Gaussian shape).
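The summary statistics named above can be computed with pandas, as in the sketch below. The sample values are hypothetical and include one large value to show how the mean and median diverge for skewed data.

```python
import pandas as pd

# Hypothetical sample of salaries (in thousands); one large value skews it
salary = pd.Series([42, 45, 47, 50, 52, 55, 58, 60, 120], name="salary")

summary = {
    "mean": salary.mean(),
    "median": salary.median(),
    "std": salary.std(),                                   # sample standard deviation
    "iqr": salary.quantile(0.75) - salary.quantile(0.25),  # interquartile range
    "skewness": salary.skew(),          # positive: long tail to the right
    "excess_kurtosis": salary.kurt(),   # measured relative to a normal distribution
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Here the single large value pulls the mean above the median, which is the situation in which the median is the more representative measure of center.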
Univariate graphical EDA
For graphical analysis of univariate quantitative data, histograms are typically
used; for categorical data, a bar chart of category counts plays the same role. The
histogram represents the frequency (count) or proportion (count/total count) of
cases for a range of values. Typically, between about 5 and 30 bins are chosen.
Histograms are one of the best ways to quickly learn a lot about your data,
including central tendency, spread, modality, shape, and outliers. Stem-and-leaf
plots can be used for the same purpose. Boxplots can also be used to present
information about the central tendency, symmetry, and skew, as well as outliers.
Quantile-normal plots (QQ plots) and other techniques could also be used here.
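A histogram and a boxplot of the same variable can be drawn side by side with matplotlib, as sketched below. The data is simulated, and the bin count and output file name are arbitrary choices made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: 200 simulated salaries (thousands), seeded for repeatability
rng = np.random.default_rng(seed=0)
salary = rng.normal(loc=50, scale=10, size=200)

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(9, 4))

# Histogram: count of cases per bin (10 bins, within the usual 5 to 30 range)
counts, bin_edges, _ = ax_hist.hist(salary, bins=10, edgecolor="black")
ax_hist.set_xlabel("Salary (thousands)")
ax_hist.set_ylabel("Count")
ax_hist.set_title("Histogram")

# Boxplot: median, quartiles, and any points flagged as outliers
ax_box.boxplot(salary)
ax_box.set_ylabel("Salary (thousands)")
ax_box.set_title("Boxplot")

fig.tight_layout()
fig.savefig("univariate_eda.png")   # or plt.show() in an interactive session
```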
Multivariate non-graphical EDA
Multivariate non-graphical EDA techniques generally show the relationship
between two or more variables in the form of either cross-tabulation or statistics.
For the combination of one categorical variable (usually explanatory) and one
quantitative variable (usually outcome), we can compute a statistic for the
quantitative variable separately for each level of the categorical variable, and then
compare the statistics across levels of the categorical variable. Comparing the
means is an informal version of one-way ANOVA. Comparing medians is a robust
informal version of one-way ANOVA. For two quantitative variables, we can
calculate the covariance and/or correlation. When we have many quantitative
variables, we typically calculate the pairwise covariances and/or correlations and
assemble them into a matrix.
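Cross-tabulation, per-level statistics, and covariance and correlation matrices can all be produced with pandas. The following is a minimal sketch on made-up employee data; the column names are illustrative only.

```python
import pandas as pd

# Hypothetical employee data; column names are illustrative only
df = pd.DataFrame({
    "team": ["Sales", "Sales", "IT", "IT", "IT", "Sales"],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "salary": [40, 44, 60, 62, 58, 48],
    "bonus_pct": [5.0, 6.0, 10.0, 11.0, 9.0, 7.0],
})

# Two categorical variables: cross-tabulation of counts
counts = pd.crosstab(df["team"], df["gender"])

# One categorical and one quantitative variable: a statistic per level
mean_by_team = df.groupby("team")["salary"].mean()
median_by_team = df.groupby("team")["salary"].median()

# Several quantitative variables: pairwise covariance and correlation matrices
cov_matrix = df[["salary", "bonus_pct"]].cov()
corr_matrix = df[["salary", "bonus_pct"]].corr()

print(counts, mean_by_team, median_by_team, cov_matrix, corr_matrix, sep="\n\n")
```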
Multivariate graphical EDA
For two categorical variables, the most commonly used graphical technique is the
grouped barplot, with each group representing one level of one of the variables
and each bar within a group representing a level of the other variable. To compare
a quantitative variable across the levels of a categorical variable, we can use
side-by-side (parallel) boxplots, one per category. For two quantitative variables,
the basic graphical EDA technique is the scatterplot, which has one variable on the
x-axis, one on the y-axis, and a point for each case in the dataset. Typically, the
explanatory variable goes on the x-axis. Additional categorical variables can be
accommodated by the use of colour or symbols.
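A scatterplot of two quantitative variables, with colour encoding an additional categorical variable, might look like the sketch below. The data, colours, and output file name are assumptions made for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical data: experience (explanatory) vs salary (outcome), by team
experience = [1, 2, 3, 4, 5, 6, 7, 8]
salary = [30, 34, 41, 43, 50, 52, 60, 63]
team = ["IT", "Sales", "IT", "Sales", "IT", "Sales", "IT", "Sales"]
colours = {"IT": "tab:blue", "Sales": "tab:orange"}

fig, ax = plt.subplots()
for name, colour in colours.items():
    xs = [x for x, t in zip(experience, team) if t == name]
    ys = [y for y, t in zip(salary, team) if t == name]
    ax.scatter(xs, ys, color=colour, label=name)   # one point per case

ax.set_xlabel("Experience (years)")    # explanatory variable on the x-axis
ax.set_ylabel("Salary (thousands)")    # outcome variable on the y-axis
ax.legend(title="Team")
fig.savefig("scatter_by_team.png")     # or plt.show() in an interactive session
```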
Data types for plotting
12 Data Plot Types for Visualisation from Concept to Code
Introduction - When data is collected, it needs to be interpreted and analyzed to
provide insight. This insight can be about patterns, trends, or relationships
between variables. Data interpretation is the process of reviewing data through
well-defined methods, which help assign meaning to the data and arrive at a
relevant conclusion. Analysis is the process of ordering, categorizing, and
summarizing data to answer research questions. It should be done quickly and
effectively, and the results need to stand out clearly. Data plots are an important
means to this end. As the volume of data grows, so does this need, which is why
data plots have become so important.
However, there are many types of plots used in data visualization, and it is often
tricky to choose which type is best for your business or data. Each plot type has
strengths and weaknesses that make it better than others in some situations.
1. Bar Graph
   • Grouped Bar Graph
   • Stacked Bar Graph
   • Segmented Bar Graph
2. Line Graph
   • Simple Line Graph
   • Multiple Line Graph
   • Compound Line Graph
3. Pie Chart
   • Simple Pie Chart
   • Exploded Pie Chart
   • Donut Chart
   • Pie of Pie
   • Bar of Pie
   • 3D Pie Chart
4. Histogram
5. Area Chart
   • 3-D Area Chart
6. Dot Graph
7. Scatter Plot
   • Positive Correlation
   • Negative Correlation
   • No Correlation
8. Bubble Chart
9. Radar Chart
10. Pictogram Graph
11. Spline Chart
12. Box Plot
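Three of the plot types listed above (bar graph, line graph, and pie chart) can be drawn from the same data with matplotlib. This is a minimal sketch; the quarterly sales figures and the output file name are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical quarterly sales figures, reused for all three plots
quarters = ["Q1", "Q2", "Q3", "Q4"]
sales = [120, 150, 90, 180]

fig, (ax_bar, ax_line, ax_pie) = plt.subplots(1, 3, figsize=(12, 4))

ax_bar.bar(quarters, sales)                  # bar graph: compare categories
ax_bar.set_title("Bar Graph")

ax_line.plot(quarters, sales, marker="o")    # line graph: show a trend over time
ax_line.set_title("Line Graph")

ax_pie.pie(sales, labels=quarters, autopct="%1.0f%%")   # pie chart: share of total
ax_pie.set_title("Pie Chart")

fig.tight_layout()
fig.savefig("plot_types.png")                # or plt.show() in an interactive session
```

The same numbers read differently in each form: the bar graph favours comparison, the line graph favours trend, and the pie chart favours proportion.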
MODULE-II

Notes by Prof. Deepak Kumar
