
Data Science

Dr. Vinay Chopra


AP(CSE)
INTRODUCTION

 Data is created constantly, and at an ever-increasing rate. Mobile phones, social
media, imaging technologies to determine a medical diagnosis—all these and
more create new data that must be stored somewhere for some purpose.
 Devices and sensors automatically generate diagnostic information that needs to
be stored and processed in real time. Merely keeping up with this huge influx of
data is difficult, but substantially more challenging is analyzing vast amounts of
it, especially when it does not conform to traditional notions of data structure,
to identify meaningful patterns and extract useful information.
 Data Science has become the most in-demand job of the 21st century. Every
organization is looking for candidates with knowledge of data science.
 In this course, we will cover an introduction to data science, data science
job roles, tools for data science, components of data science,
applications, etc.
INTRODUCTION(Cont.)

Several industries have led the way in developing their ability to gather and exploit
data:
 Credit card companies monitor every purchase their customers make and can
identify fraudulent purchases with a high degree of accuracy using rules derived
by processing billions of transactions.
 Mobile phone companies analyze subscribers’ calling patterns to determine, for
example, whether a caller’s frequent contacts are on a rival network. If that rival
network is offering an attractive promotion that might cause the subscriber to
defect, the mobile phone company can proactively offer the subscriber an
incentive to remain in her contract.
 For companies such as LinkedIn and Facebook, data itself is their primary
product. The valuations of these companies are heavily derived from the data they
gather and host, which contains more and more intrinsic value as the data grows.
Attributes defining Big Data characteristics:

 Huge volume of data (Volume): Rather than thousands or millions of rows,
Big Data can be billions of rows and millions of columns.
 Complexity of data types and structures (Variety): Big Data reflects the
variety of new data sources, formats, and structures, including digital traces
being left on the web and other digital repositories for subsequent analysis.
 Speed of new data creation and growth (Velocity): Big Data can describe high
velocity data, with rapid data ingestion and near real time analysis.
Attributes defining Big Data characteristics:
 Although the volume of Big Data tends to attract the most attention,
generally the variety and velocity of the data provide a more apt definition of
Big Data.
 (Big Data is sometimes described as having 3 Vs: volume, variety, and
velocity.) Due to its size or structure, Big Data cannot be efficiently analyzed
using only traditional databases or methods.
 Big Data problems require new tools and technologies to store, manage, and
realize the business benefit. These new tools and technologies enable
creation, manipulation, and management of large datasets and the storage
environments that house them.
The Ascendance of Data

We live in a world that’s drowning in data.
 Websites track every user’s every click.
 Your smartphone is building up a record of your location and speed every second of
every day. “Quantified selfers” wear pedometers-on-steroids that are ever recording
their heart rates, movement habits, diet, and sleep patterns.
 Smart cars collect driving habits, smart homes collect living habits, and smart
marketers collect purchasing habits.
 The Internet itself represents a huge graph of knowledge that contains (among other
things) an enormous cross-referenced encyclopedia; domain-specific databases about
movies, music, sports results, pinball machines, memes, and cocktails; and too many
government statistics (some of them nearly true!) from too many governments to wrap
your head around.
 Buried in these data are answers to countless questions that no one’s ever thought to ask.
Who is a data scientist?

 There’s a joke that says a data scientist is someone who knows more statistics than a
computer scientist and more computer science than a statistician. (I didn’t say it was a good
joke.)
 In fact, some data scientists are — for all practical purposes — statisticians, while others are
pretty much indistinguishable from software engineers.
 Some are machine-learning experts, while others couldn’t machine-learn their way out of
kindergarten.
 Some are PhDs with impressive publication records, while others have never read an
academic paper (shame on them, though).
 In short, pretty much no matter how you define data science, you’ll find practitioners for
whom the definition is totally, absolutely wrong. Nonetheless, we won’t let that stop us from
trying.
 We’ll say that a data scientist is someone who extracts insights from messy data. Today’s
world is full of people trying to turn data into insight.
Definition: Big Data

 A definition of Big Data comes from the 2011 McKinsey Global Institute report:
 Big Data is data whose scale, distribution, diversity, and/or timeliness require
the use of new technical architectures and analytics to enable insights that
unlock new sources of business value.
 McKinsey’s definition of Big Data implies that organizations will need new
data architectures and analytic sandboxes, new tools, new analytical
methods, and an integration of multiple skills into the new role of the data
scientist.
Sources of Big data deluge

The rate of data creation is accelerating, driven by many of the items in Figure 1.1.
 Social media and genetic sequencing are among the fastest-growing sources of
Big Data and examples of untraditional sources of data being used for analysis.
 For example, in 2012 Facebook users posted 700 status updates per second
worldwide, which can be leveraged to deduce latent interests or political
views of users and show relevant ads. For instance, an update in which a
woman changes her relationship status from “single” to “engaged” would
trigger ads on bridal dresses, wedding planning, or name-changing services.
 Facebook can also construct social graphs to analyze which users are
connected to each other as an interconnected network. In March 2013,
Facebook released a new feature called “Graph Search,” enabling users and
developers to search social graphs for people with similar interests, hobbies,
and shared locations.
 Big data can come in multiple forms, including structured and non-structured
data such as financial data, text files, multimedia files, and genetic
mappings.
 Contrary to much of the traditional data analysis performed by organizations,
most of the Big Data is unstructured or semi-structured in nature, which
requires different techniques and tools to process and analyze.
 Distributed computing environments and massively parallel processing (MPP)
architectures that enable parallelized data ingest and analysis are the
preferred approach to process such complex data.
Different Data Structures
Here are examples of how each of the four main types of data structures may
look.
 Structured data: Data containing a defined data type, format, and structure
(that is, transaction data, online analytical processing [OLAP] data cubes,
traditional RDBMS, CSV files, and even simple spreadsheets). See Figure 1.4.
 Unstructured data: Data that has no inherent structure, which may include
text documents, PDFs, images, and video. See Figure 1.7.
Structured Data
Example of Semi Structured data
Semi-structured data: Textual data files
with a discernible pattern that enables
parsing (such as Extensible Markup
Language [XML] data files that are
self-describing and defined by an XML
schema).
Semi-structured data is a form of
structured data that does not obey the
tabular structure of data models associated
with relational databases or other forms of
data tables, but nonetheless contains tags
or other markers to separate semantic
elements and enforce hierarchies of
records and fields within the data.
Therefore, it is also known as a
self-describing structure.
Quasi-structured data

Quasi-structured data: Textual data with erratic
data formats that can be formatted with effort,
software system tools, and time. An example of
quasi-structured data is the data about webpages
a user visited and in what order.
Quasi-structured data (Example)

 Quasi-structured data is a common phenomenon that bears closer scrutiny.
 Consider the following example. A user attends the EMC World conference and subsequently runs a
Google search online to find information related to EMC and Data Science.
 This would produce a URL such as https://www.google.com/#q=EMC+data+science and a list of
results, such as in the first graphic of Figure 1.5.
 After doing this search, the user may choose the second link, to read more about the headline
“Data Scientist—EMC Education, Training, and Certification.” This brings the user to an emc.com
site focused on this topic and a new URL,
https://education.emc.com/guest/campaign/data_science.aspx, that displays the page shown as
(2) in Figure 1.6.
 Arriving at this site, the user may decide to click to learn more about the process of becoming
certified in data science. The user chooses a link toward the top of the page on Certifications,
bringing the user to a new URL:
 https://education.emc.com/guest/certification/framework/stf/data_science.aspx which is (3) in
Figure 1.6.
Quasi-structured data (Example)

 Visiting these three websites adds three URLs to the log files monitoring the user’s
computer or network use.
These three URLs are:
a) https://www.google.com/#q=EMC+data+science.
b) https://education.emc.com/guest/campaign/data_science.aspx
c) https://education.emc.com/guest/certification/framework/stf/data_science.aspx
This set of three URLs reflects the websites and actions taken to find
Data Science information related to EMC.
 Together, this comprises a clickstream that can be parsed and mined by data
scientists to discover usage patterns and uncover relationships among clicks and
areas of interest on a website or group of sites.
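To make this concrete, below is a minimal Python sketch (not from the source) of how such a clickstream might be parsed; the URL list is the one above, and the parsing choices are illustrative assumptions.

# Illustrative sketch: parse a small clickstream of URLs (assumed log extract).
from urllib.parse import urlparse, parse_qs

clickstream = [
    "https://www.google.com/#q=EMC+data+science",
    "https://education.emc.com/guest/campaign/data_science.aspx",
    "https://education.emc.com/guest/certification/framework/stf/data_science.aspx",
]

for url in clickstream:
    parts = urlparse(url)
    # Google places the query after '#' here, so read the fragment as a query string.
    query = parse_qs(parts.fragment).get("q", [""])[0] if parts.fragment else ""
    print(parts.netloc, parts.path, query)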
Unstructured data:

 This type of data doesn’t have an
information model and isn’t organized
in any specific format. Data that has
no inherent structure may include
text documents, PDFs, images, and
video. Approximately 90% of the digital
data generated these days is
non-structured, that is, semi-structured,
quasi-structured, or unstructured data.
Current Analytical Architecture

 Data Science projects need workspaces that are purpose-built for
experimenting with data, with flexible and agile data architectures.
 Most organizations still have data warehouses that provide excellent support
for traditional reporting and simple data analysis activities but unfortunately
have a more difficult time supporting more robust analyses.
 A typical analytical data architecture that may exist within an organization is
explained below.
Current Analytical Architecture(Cont.)

The figure shows a typical data architecture and
several of the challenges it presents to data
scientists and others trying to do advanced
analytics.
This section examines the data flow to the Data
Scientist and how this individual fits into the
process of getting data to analyze on projects.
Current Analytical Architecture(Cont.)

 For data sources to be loaded into the data warehouse, data needs to be well
understood, structured, and normalized with the appropriate data type
definitions.
 Although this kind of centralization enables security, backup, and failover of
highly critical data, it also means that data typically must go through
significant preprocessing and checkpoints before it can enter this sort of
controlled environment, which does not lend itself to data exploration and
iterative analytics.
Current Analytical Architecture(Cont.)

 As a result of this level of control on the EDW, additional local systems may
emerge in the form of departmental warehouses and local data marts that
business users create to accommodate their need for flexible analysis.
 These local data marts may not have the same constraints for security and
structure as the main EDW and allow users to do some level of more in-depth
analysis.
 However, these one-off systems reside in isolation, often are not
synchronized or integrated with other data stores, and may not be backed up.
Current Analytical Architecture(Cont.)

 Once in the data warehouse, data is read by additional applications across the
enterprise for BI and reporting purposes. These are high-priority operational
processes getting critical data feeds from the data warehouses and repositories.
 At the end of this workflow, analysts get data provisioned for their downstream
analytics. Because users generally are not allowed to run custom or intensive
analytics on production databases, analysts create data extracts from the EDW
to analyze data offline in R or other local analytical tools.
 Many times these tools are limited to in-memory analytics on desktops analyzing
samples of data, rather than the entire population of a dataset.
 Because these analyses are based on data extracts, they reside in a separate
location, and the results of the analysis—and any insights on the quality of the
data or anomalies—rarely are fed back into the main data repository.
Role of Data Scientists

 Data scientists play the most active roles in the four A’s of data:
 Data architecture: A data scientist would help the system architect by providing
input on how the data would need to be routed and organized to support the
analysis, visualization, and presentation of the data to the appropriate people.
 Data acquisition: Representing, transforming, grouping, and linking the data are
all tasks that need to occur before the data can be profitably analyzed, and
these are all tasks in which the data scientist is actively involved.
 Data analysis: The analysis phase is where data scientists are most heavily
involved. In this context we are using analysis to include summarization of the
data, using portions of data (samples) to make inferences about the larger
context, and visualization of the data by presenting it in tables, graphs, and
even animations.
 Data archiving: Finally, the data scientist must become involved in the
archiving of the data. Preservation of collected data in a form that makes it
highly reusable - what you might think of as "data curation" - is a difficult
challenge because it is so hard to anticipate all of the future uses of the data.
The data science process: six steps
The data science process

 1. Setting the research goal:
a) Data science is mostly done in the context of an organization. When the business asks
you to perform a data science project, you’ll first prepare a project charter.
b) This charter contains information such as what you’re going to research, how the
company benefits from that, what data and resources you need, a timetable, and
deliverables.
 2. Retrieving data
a) The second step is to collect data. You’ve stated in the project charter which data
you need and where you can find it. In this step you ensure that you can use the data
in your program, which means checking the existence, quality, and access to the data.
b) Data can also be delivered by third-party companies and take many forms, ranging
from Excel spreadsheets to different types of databases.
The data science process

 Data cleansing: Data collection is an error-prone process; in this phase you
enhance the quality of the data and prepare it for use in subsequent steps.
 This phase consists of three subphases:
a) data cleansing removes false values from a data source and inconsistencies
across data sources,
b) data integration enriches data sources by combining information from
multiple data sources, and
c) data transformation ensures that the data is in a suitable format for use
in your models.
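As an illustration only (the text does not prescribe a tool), the three subphases might look as follows in Python with pandas; the column names and values are invented.

# Toy sketch of the three subphases; data and columns are invented.
import numpy as np
import pandas as pd

sales = pd.DataFrame({"customer_id": [1, 2, 2, 3],
                      "amount": [100.0, -5.0, -5.0, 250.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "region": ["north", "south", "north"]})

# a) Data cleansing: remove duplicate rows and impossible (negative) amounts.
sales = sales.drop_duplicates()
sales = sales[sales["amount"] >= 0]

# b) Data integration: enrich sales with information from a second source.
combined = sales.merge(customers, on="customer_id", how="left")

# c) Data transformation: bring the data into a format suitable for models.
combined["log_amount"] = np.log1p(combined["amount"])
print(combined)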
The data science process

 Data exploration: Data exploration is concerned with
building a deeper understanding of your data.
 You try to understand how variables interact with each
other, the distribution of the data, and whether there are
outliers.
 To achieve this you mainly use descriptive statistics,
visual techniques, and simple modeling.
 This step often goes under the abbreviation EDA, for
Exploratory Data Analysis.
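A minimal sketch of typical EDA steps in Python with pandas and matplotlib; the toy data and the particular commands are assumptions, not prescriptions from the text.

# A few typical EDA moves on a small, invented DataFrame.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"age": [23, 25, 31, 35, 29, 120],   # 120 looks like an outlier
                   "income": [30, 32, 45, 50, 41, 39]})

print(df.describe())      # distributions: mean, std, quartiles
print(df.corr())          # how the variables interact with each other

df.boxplot(column="age")  # visual check for outliers
plt.show()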
The data science process

 Presentation and automation: Finally, you present the
results to your business.
 These results can take many forms, ranging from
presentations to research reports.
 Sometimes you’ll need to automate the execution of the
process because the business will want to use the insights
you gained in another project or enable an operational
process to use the outcome from your model.
AN ITERATIVE PROCESS:

 The previous description of the data science process gives you the
impression that you walk through this process in a linear way, but in
reality you often have to step back and rework certain findings.
 For instance, you might find outliers in the data exploration phase
that point to data import errors.
 As part of the data science process you gain incremental insights,
which may lead to new questions.
 To prevent rework, make sure that you scope the business question
clearly and thoroughly at the start.
 Data modeling or model building: In this phase you use models, domain
knowledge, and insights about the data you found in the previous steps to
answer the research question.
 You select a technique from the fields of statistics, machine learning,
operations research, and so on.
 Building a model is an iterative step between selecting the variables for the
model, executing the model, and running model diagnostics.
Challenges in Data Science

 Preparing data (noisy, incomplete, diverse, streaming, …)
 Analyzing data (scalable, accurate, real-time, advanced methods,
probabilities and uncertainties)
 Representing analysis results, i.e., the data product (storytelling,
interactive, explainable, …)
What is Data Science?

 Data science is a deep study of the massive amount of data, which
involves extracting meaningful insights from raw, structured, and
unstructured data that is processed using the scientific method,
different technologies, and algorithms.
 It is a multidisciplinary field that uses tools and techniques to
manipulate the data so that you can find something new and
meaningful.
 Data science uses the most powerful hardware, programming systems,
and the most efficient algorithms to solve data-related problems. It is
the future of artificial intelligence.
Data science is all about:

 Asking the correct questions and analyzing the raw data.


 Modeling the data using various complex and efficient
algorithms.
 Visualizing the data to get a better perspective.
 Understanding the data to make better decisions and
finding the final result.
Data science is all about
Example:

 Let’s suppose we want to travel from station A to station B by car.
 Now, we need to make some decisions, such as which route
will be the best route to reach the location faster, on which
route there will be no traffic jam, and which will be
cost-effective.
 All these decision factors will act as input data, and we
will get an appropriate answer from these decisions; this
analysis of data is called data analysis, which is a
part of data science.
Need for Data Science:
Need for Data Science

 Some years ago, data was less abundant and mostly available in a structured form,
which could be easily stored in Excel sheets and processed using BI tools.
 But in today’s world, data is becoming so vast: approximately 2.5
quintillion bytes of data are generated every day, which has led to a data
explosion.
 Researchers estimated that by 2020, 1.7 MB of data would be created
every single second for every person on earth. Every company
requires data to work, grow, and improve its business.
 Now, handling such a huge amount of data is a challenging task for every
organization. To handle, process, and analyze it, we require some
complex, powerful, and efficient algorithms and technology, and that
technology came into existence as data science.
Following are some main reasons for
using data science technology:
 With the help of data science technology, we can convert the massive amount
of raw and unstructured data into meaningful insights.
 Data science technology is being adopted by various companies, whether big
brands or startups. Google, Amazon, Netflix, etc., which handle huge
amounts of data, are using data science algorithms for a better customer
experience.
 Data science is at work in automating transportation, such as creating
self-driving cars, which are the future of transportation.
 Data science can help in different kinds of predictions, such as surveys,
elections, flight ticket confirmation, etc.
Data Science Jobs:

 As per various surveys, the data scientist job is becoming the most
in-demand job of the 21st century due to increasing demand for
data science.
 Some people have also called it “the hottest job title of the 21st
century”.
 Data scientists are the experts who can use various statistical tools
and machine learning algorithms to understand and analyze the data.
 The average salary range for a data scientist is
approximately $95,000 to $165,000 per annum, and as per
different research, about 11.5 million jobs will be created by
the year 2026.
Types of Data Science Jobs

 If you learn data science, then you get the opportunity to find various
exciting job roles in this domain. The main job roles are given below:
 Data Scientist
 Data Analyst
 Machine learning expert
 Data engineer
 Data Architect
 Data Administrator
 Business Analyst
 Business Intelligence Manager
Some critical job titles of data science

 1. Data Analyst:
a) A data analyst is an individual who mines huge amounts of data and
models the data, looking for patterns, relationships, trends, and so on.
b) At the end of the day, he comes up with visualizations and reports for
analyzing the data for decision-making and problem-solving processes.
c) Skills required: To become a data analyst, you must have a good background
in mathematics, business intelligence, data mining, and basic knowledge
of statistics.
d) You should also be familiar with some computer languages and tools such
as MATLAB, Python, SQL, Hive, Pig, Excel, SAS, R, JS, Spark, etc.
Machine Learning Expert

 A machine learning expert is one who works with the various
machine learning algorithms used in data science, such as regression,
clustering, classification, decision trees, random forests, etc.
 Skills required: Computer programming languages such as Python, C++,
R, Java, and Hadoop.
 You should also have an understanding of various algorithms,
problem-solving and analytical skills, probability, and statistics.
Data Engineer

 A data engineer works with massive amounts of data and is responsible
for building and maintaining the data architecture of a data science
project.
 A data engineer also works on the creation of data set processes used
in modeling, mining, acquisition, and verification.
 Skills required: A data engineer must have in-depth knowledge of SQL,
MongoDB, Cassandra, HBase, Apache Spark, Hive, and MapReduce, with
knowledge of languages such as Python, C/C++, Java, Perl, etc.
Data Scientist

 A data scientist is a professional who works with an enormous amount
of data to come up with compelling business insights through the
deployment of various tools, techniques, methodologies, algorithms,
etc.
 Skills required: To become a data scientist, one should have technical
language skills in R, SAS, SQL, Python, Hive, Pig, Apache Spark, and
MATLAB.
 Data scientists must have an understanding of statistics,
mathematics, and visualization, along with communication skills.
Prerequisites for Data Science

 Non-Technical Prerequisites:
 Curiosity:
a) To learn data science, one must have curiosity.
b) When you have curiosity and ask various questions, you can understand the
business problem easily.
 Critical thinking:
a) It is also required for a data scientist so that you can find multiple new ways to
solve a problem efficiently.
 Communication skills:
a) Communication skills are most important for a data scientist because, after
solving a business problem, you need to communicate it to the team.
Technical Prerequisites:
 Machine learning: To understand data science, one needs to understand the concept of
machine learning. Data science uses machine learning algorithms to solve various
problems.
 Mathematical modeling: Mathematical modeling is required to make fast mathematical
calculations and predictions from the available data.
 Statistics: A basic understanding of statistics is required, such as mean, median, and
standard deviation (see the short example after this list). It is needed to extract knowledge
and obtain better results from the data.
 Computer programming: For data science, knowledge of at least one programming
language is required. R, Python, and Spark are some of the programming languages
required for data science.
 Databases: An in-depth understanding of databases such as SQL is essential for data
science in order to get the data and work with it.
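A small Python example of the basic statistics mentioned above; the numbers are invented and the snippet is purely illustrative.

# Mean, median, and standard deviation on a toy sample.
import numpy as np

data = np.array([12, 15, 14, 10, 18, 95])   # 95 is a deliberate outlier
print("mean:", data.mean())
print("median:", np.median(data))           # far less affected by the outlier
print("std dev:", data.std(ddof=1))         # sample standard deviation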
Difference between BI and Data Science

 BI stands for business intelligence, which is also used for the analysis of
business data. Below are some differences between BI and data
science:
Criterion    | Business Intelligence                          | Data Science
Data source  | Deals with structured data, e.g., a data       | Deals with structured and unstructured data,
             | warehouse.                                     | e.g., weblogs, feedback, etc.
Method       | Analytical (historical data).                  | Scientific (goes deeper to know the reason
             |                                                | behind the data report).
Skills       | Statistics and visualization are the two       | Statistics, visualization, and machine
             | skills required.                               | learning are the required skills.
Focus        | Focuses on both past and present data.         | Focuses on past and present data, and also
             |                                                | future predictions.
Data Science Components:
Components of Data Science

 1. Statistics: Statistics is one of the most important components of data science. Statistics is a
way to collect and analyze numerical data in large amounts and find meaningful insights
from it.
 2. Domain expertise: Domain expertise binds data science together. Domain
expertise means specialized knowledge or skills in a particular area. In data science, there are
various areas for which we need domain experts.
 3. Data engineering: Data engineering is a part of data science which involves acquiring,
storing, retrieving, and transforming the data. Data engineering also adds metadata (data
about data) to the data.
 4. Visualization: Data visualization means representing data in a visual context so that
people can easily understand its significance. Data visualization makes it easy to
grasp huge amounts of data through visuals.
 5. Advanced computing: Advanced computing does the heavy lifting of data science. It
involves designing, writing, debugging, and maintaining the source code of computer
programs.
Components of Data Science
Components of Data Science

 6. Mathematics: Mathematics is a critical part of data science. Mathematics
involves the study of quantity, structure, space, and change. For a data
scientist, good knowledge of mathematics is essential.
 7. Machine learning: Machine learning is the backbone of data science. Machine
learning is all about training a machine so that it can act like a
human brain. In data science, we use various machine learning algorithms to
solve problems.
Tools for Data Science

 Data analysis tools: R, Python, Statistics, SAS, Jupyter, RStudio, MATLAB,
Excel, RapidMiner.
 Data warehousing tools: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift.
 Data visualization tools: R, Jupyter, Tableau, Cognos.
 Machine learning tools: Spark, Mahout, Azure ML Studio.
Machine learning in Data Science
 To become a data scientist, one should also be aware of machine learning and
its algorithms, as in data science various machine learning algorithms are
broadly used.
 Following are the names of some machine learning algorithms used in data
science:
 Regression
 Decision tree
 Clustering
 Principal component analysis
 Support vector machines
 Naive Bayes
 Artificial neural network
 Apriori
Linear Regression Algorithm:

 Linear regression is the most popular machine
learning algorithm, based on supervised
learning. This algorithm performs regression,
which is a method of modeling target values
based on independent variables.
 It takes the form of a linear
equation, which captures the relationship between
a set of inputs and the predicted output.
 This algorithm is mostly used in forecasting
and predictions. Since it shows a linear
relationship between the input and output
variables, it is called linear regression.
 The relationship between the x and y variables can be described by the equation:
 y = mx + c
where y = dependent variable,
x = independent variable,
m = slope, and
c = intercept.
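A minimal sketch of fitting y = mx + c by least squares in Python; the data is synthetic, and np.polyfit is just one of several ways to obtain m and c.

# Fit y = m*x + c to synthetic data by least squares.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])    # roughly y = 2x

m, c = np.polyfit(x, y, deg=1)              # slope and intercept
print(f"slope m = {m:.2f}, intercept c = {c:.2f}")
print("prediction at x = 6:", m * 6 + c)    # use the line for forecasting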
Decision Tree:
 Decision Tree algorithm is another machine learning
algorithm, which belongs to the supervised learning
algorithm.
 This is one of the most popular machine learning
algorithms.
 It can be used for both classification and regression
problems.
 In the decision tree algorithm, we solve the problem
by using a tree representation in which each node
represents a feature, each branch represents a decision,
and each leaf represents an outcome.
Example

In the decision tree, we start from the
root of the tree and compare the
value of the root attribute with the
record’s attribute.

On the basis of this comparison, we
follow the branch corresponding to that
value and then move to the next node.

We continue comparing these values
until we reach a leaf node with the
predicted class value.
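A short illustrative sketch of building such a tree, assuming scikit-learn and its bundled Iris dataset (neither is mandated by the text).

# Train a small decision tree classifier on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Each internal node tests one feature, each branch is a decision,
# and each leaf holds a predicted class value.
print(tree.predict(X[:5]))   # predicted classes for the first five records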
K-Means Clustering:

 K-means clustering is one of the most popular
machine learning algorithms; it belongs to the
unsupervised learning family.
 It solves the clustering problem.
 If we are given a data set of items with certain
features and values, and we need to categorize
those items into groups, such problems can be
solved using the k-means clustering algorithm.
 The k-means clustering algorithm aims at
minimizing an objective function, known as the
squared error function.
The objective function is

J(V) = Σ_{i=1}^{C} Σ_{j=1}^{c_i} ( ||x_i - v_j|| )²

where J(V) = the objective function,
||x_i - v_j|| = the Euclidean distance between x_i and v_j,
c_i = the number of data points in the i-th cluster, and
C = the number of clusters.
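A minimal k-means sketch, assuming scikit-learn (an assumption, not the text's requirement); the fitted model's inertia_ attribute corresponds to the squared-error objective J(V) above.

# Cluster six invented 2-D points into two groups with k-means.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [0.5, 1.5],
              [8, 8], [8.5, 9], [9, 8.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels:", km.labels_)
print("centroids:", km.cluster_centers_)
print("objective J(V):", km.inertia_)   # sum of squared distances to centroids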
How to solve a problem in Data Science using Machine
learning algorithms?

 Now, let's understand what are the


most common types of problems
occurred in data science and what is
the approach to solving the problems.
 So in data science, problems are
solved using algorithms, and below is
the diagram representation for
applicable algorithms for possible
questions:
 Is this A or B?
This type of problem has only two fixed solutions, such as yes or no, 1 or
0, may or may not. Problems of this type can be solved using classification algorithms.
 Is this different?
This type of question concerns a set of patterns in which we need to find the odd
one out. Such problems can be solved using anomaly detection algorithms.
 How much or how many?
Another type of problem asks for numerical values or figures, such as what the
time is now or what the temperature will be today; these can be solved using regression algorithms.
 How is this organized?
If you have a problem that deals with the organization of data, it can be
solved using clustering algorithms. A clustering algorithm organizes and groups data based on
features, colors, or other common characteristics.
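Purely as an illustration, the mapping can be written down with one plausible scikit-learn estimator per question type; the specific estimator choices are assumptions, not the only options.

# One representative estimator per question type (illustrative choices).
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans

question_to_algorithm = {
    "Is this A or B?":        LogisticRegression(),    # classification
    "Is this different?":     IsolationForest(),       # anomaly detection
    "How much or how many?":  LinearRegression(),      # regression
    "How is this organized?": KMeans(n_clusters=3),    # clustering
}

for question, model in question_to_algorithm.items():
    print(f"{question:<25} -> {type(model).__name__}")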
Data Science Lifecycle
Data Science Lifecycle

 The main phases of the data science life cycle are given below:
 Discovery: The first phase is discovery, which involves asking the
right questions.
 When you start any data science project, you need to determine
the basic requirements, priorities, and project budget.
 In this phase, we need to determine all the requirements of the
project, such as the number of people, technology, time, data, and the
end goal, and then we can frame the business problem at a first
hypothesis level.
Data Science Lifecycle

 Data preparation: Data preparation is also known as data munging. In this
phase, we need to perform the following tasks:
 Data cleaning
 Data reduction
 Data integration
 Data transformation

After performing all the above tasks, we can easily use this data for our
further processes.
Data Science Lifecycle

 Model planning: In this phase, we need to determine the various methods
and techniques to establish the relations between input variables.
 We will apply exploratory data analysis (EDA), using various statistical
formulas and visualization tools, to understand the relations between
variables and to see what the data can tell us.
Common tools used for model planning are:
 SQL Analysis Services
 R
 SAS
 Python
Data Science Lifecycle

 Model building: In this phase, the process of model building starts. We will
create datasets for training and testing purposes.
 We will apply different techniques, such as association, classification, and
clustering, to build the model.
Following are some common model building tools:
 SAS Enterprise Miner
 WEKA
 SPSS Modeler
 MATLAB
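A minimal sketch of this phase, assuming scikit-learn: create training and testing datasets, fit a model, and run a basic diagnostic. The dataset and model choice are illustrative.

# Build and evaluate a model on separate training and testing datasets.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)   # classification technique
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))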
Data Science Lifecycle

 5. Operationalize:
a) In this phase, we will deliver the final reports of the project, along with
briefings, code, and technical documents.
b) This phase provides you a clear overview of the complete project performance
and other components on a small scale before the full deployment.
 6. Communicate results:
a) In this phase, we will check whether we have reached the goal we set in the
initial phase.
b) We will communicate the findings and final result to the business team.
Applications of Data Science:

 Image recognition and speech recognition:
Data science is currently used for image and speech recognition. When you
upload an image on Facebook, you start getting suggestions to tag your
friends.
This automatic tagging suggestion uses an image recognition algorithm, which
is part of data science.
When you say something to “Ok Google”, Siri, Cortana, etc., and these
devices respond as per the voice command, this is made possible by speech
recognition algorithms.
Applications of Data Science:

 Gaming world:
In the gaming world, the use of machine learning algorithms is increasing day
by day. EA Sports, Sony, and Nintendo are widely using data science for
enhancing user experience.
 Internet search:
When we want to search for something on the internet, we use different
search engines such as Google, Yahoo, Bing, Ask, etc.
 All these search engines use data science technology to make the search
experience better, and you can get a search result within a fraction of a second.
Applications of Data Science:

 Transport:
a) Transport industries are also using data science technology to create self-driving
cars.
b) With self-driving cars, it will be easier to reduce the number of road accidents.
 Healthcare:
a) In the healthcare sector, data science is providing lots of benefits.
b) Data science is being used for tumor detection, drug discovery, medical
image analysis, virtual medical bots, etc.
Applications of Data Science:

 Recommendation systems:
Most companies, such as Amazon, Netflix, Google Play, etc., are using data science
technology to create a better user experience with personalized recommendations.
For example, when you search for something on Amazon, you start getting suggestions for
similar products; this is because of data science technology.
 Risk detection:
The finance industry has always had issues of fraud and risk of losses, but with the help of
data science, these can be reduced.
Most finance companies are looking for data scientists to avoid risk and any type
of loss while increasing customer satisfaction.
Security of Data Science

In this section, we discuss challenges and solution approaches related to the
security of data science methods and applications:
 infrastructure security,
 software security,
 data protection, and
 data anonymization.
Infrastructure Security

 Infrastructure security is concerned with securing information systems against
physical and virtual intruders, insider threats, and technical failures of the
infrastructure itself.
 As a consequence, some of the more important building blocks to secure an
infrastructure are access control, encryption of data at rest and in transit,
vulnerability scanning and patching, security monitoring, network
segmentation, firewalls, anomaly detection, server hardening, and (endpoint)
security policies.
 Resources such as the NIST special publications series (National Institute of
Standards and Technology, 2017) or the CIS top twenty security controls
(Center for Internet Security, 2017) provide guidance and (some) practical
advice. However, getting all of this right is far from easy and failing might
carry a hefty price tag.
Infrastructure Security

 Fortunately, data scientists rarely have to build and secure an
infrastructure from scratch.
 However, they often have to select, configure, and deploy base technologies
and products such as MongoDB, Elasticsearch, or Apache Spark.
 It is therefore important that data scientists are aware of the security of
these products. What security mechanisms do they offer? Are they secure
by default?
 Can they be configured to be secure, or is there a need for additional security
measures and tools? Recent events have demonstrated that this is often not
the case.
Software Security

 Software security focuses on the methodologies by which applications can
be implemented and protected so that they do not have or expose any
vulnerabilities.
 To achieve this, traditional software development life cycle (SDLC) models
(waterfall, iterative, agile, etc.) must integrate activities to help discover
and reduce vulnerabilities early and effectively, and refrain from the common
practice of performing security-related activities only towards the end of the
SDLC as part of testing.
 A secure SDLC (SSDLC) ensures that security assurance activities, such as
defining security requirements, defining the security architecture, code
reviews, and penetration tests, are an integral part of the entire development
process.
Software Security

 An important aspect of implementing an SSDLC is to know
the threats and how relevant they are for a specific
product.
 This makes it possible to prioritize the activities in the SSDLC. For
data products, injection attacks and components that are
insecure by default are among the biggest threats.
 Data products process data from untrusted sources, including data
from IoT devices, data from public data sources such as
Twitter, and various kinds of user input used to control and
operate the data product.
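As one concrete mitigation for the injection attacks mentioned above, here is a minimal Python sketch using parameterized queries with the standard library's sqlite3 module; the table and data are invented.

# Parameterized queries: a standard defence against SQL injection.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (author TEXT, body TEXT)")
conn.execute("INSERT INTO tweets VALUES ('alice', 'hello')")

user_input = "alice' OR '1'='1"   # untrusted input, e.g. from a web form

# The ? placeholder keeps untrusted input as data, never as SQL code;
# building the statement by string concatenation would splice the attack in.
rows = conn.execute("SELECT body FROM tweets WHERE author = ?",
                    (user_input,)).fetchall()
print(rows)   # [] -- the injection attempt matches nothing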
Data Protection

 A core activity in data science is the processing of (large amounts of) data.
For most processing tasks, the data must be available in unencrypted form.
 This has two main drawbacks. The first one is that when security measures
such as access control fail, attackers can easily steal the data and make use
of any information it contains.
 To make this more difficult, the data should always be stored in encrypted
form. This way, the attacker must steal the data when it is being processed or
manage to steal the keys used to encrypt it.
 The second drawback is that the vast amount of processing power available in
data centers around the world cannot be exploited if the data contains
confidential information or is subject to data protection laws prohibiting the
processing by (foreign) third parties.
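A minimal sketch of storing data in encrypted form, assuming the third-party cryptography package (pip install cryptography); key handling is simplified here for illustration.

# Encrypt a record before writing it to disk; decrypt only when processing.
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # in practice, keep this in a key management system
f = Fernet(key)

record = b"patient_id=42, diagnosis=..."   # invented example record
token = f.encrypt(record)          # ciphertext: safe to store at rest
print(f.decrypt(token))            # recoverable only while holding the key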
Privacy Preservation / Data
Anonymization
 In many cases, data science analyzes data of human individuals, for instance
health data. Due to legal and ethical obligations, such data should be
anonymized to make sure the privacy of the individuals is protected.
 For instance, in 2006, Netflix started an open competition with the goal of
finding algorithms that predict user ratings for films. As a basis,
Netflix provided a large data set of user ratings as training data, in which both
users and movies were replaced by numerical IDs.
 By correlating this data with ratings from the Internet Movie Database, two
researchers demonstrated that it is possible to de-anonymize users.
Machine Learning under Attack

 The combination of sophisticated algorithms and untrusted data can open the
door for different kinds of attacks.
 In the 2010 Flash Crash (Kirilenko, Kyle, Samadi, & Tuzun, 2015), the New
York Stock Exchange experienced a temporary market loss of one trillion
dollars caused by market manipulation.
 Mozaffari-Kermani, Sur-Kolay, Raghunathan, and Jha (2015) propose a method
to generate data, which, when added to the training set, causes the machine
learning algorithms to deliver wrong predictions for specific queries.
 Thus, this method could, for example, be used to compromise the
effectiveness of a system to diagnose cancer or to identify anomalies in
computer networks.
