
Advanced Data Science

Unit -I

Data science
Data science is the study of data to extract meaningful insights for business. It is a
multidisciplinary approach that combines principles and practices from the fields of
mathematics, statistics, artificial intelligence, and computer engineering to analyze large
amounts of data. This analysis helps data scientists to ask and answer questions like what
happened, why it happened, what will happen, and what can be done with the results.
Importance of Data Science
Data science is important because it combines tools, methods, and technology to generate
meaning from data. Modern organizations are inundated with data; there is a proliferation of
devices that can automatically collect and store information. Online systems and payment
portals capture more data in the fields of e-commerce, medicine, finance, and every other
aspect of human life. We have text, audio, video, and image data available in vast quantities.

Fundamentals and Components


Domain Knowledge:
Many people think that domain knowledge is not important in data science, but it is essential. The foremost objective of data science is to extract useful insights from data so that they can be profitable to the company's business. If you are not aware of the business side of the company, of how its business model works and how to make it better, you are of little use to the company.
You need to know how to ask the right questions of the right people so that you can obtain the information you need. There are visualization tools used on the business end, like Tableau, that help you display your valuable results or insights in a proper non-technical format, such as graphs or pie charts, that business people can understand.

Math Skills:
Linear Algebra, Multivariable Calculus & Optimization Technique: These three things are
very important as they help us in understanding various machine learning algorithms that
play an important role in Data Science.
Statistics & Probability: Understanding of Statistics is very significant as this is a part of
Data analysis. Probability is also significant to statistics and it is considered a prerequisite for
mastering machine learning.
Programming Knowledge: One needs to have a good grasp of programming concepts such
as Data structures and Algorithms. The programming languages used are Python, R, Java,
Scala. C++ is also useful in some places where performance is very important.
Relational Databases:
One needs to know SQL and relational databases such as Oracle so that he/she can retrieve the necessary data from them whenever required.
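As a minimal illustration (not part of the original notes), the snippet below uses Python's built-in sqlite3 module to run a simple SQL query; the database file, table, and column names are hypothetical.

```python
import sqlite3

# Connect to a local SQLite database file (created if it does not exist).
conn = sqlite3.connect("company.db")
cur = conn.cursor()

# Hypothetical table for demonstration purposes.
cur.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("North", 1200.0), ("South", 950.5), ("North", 430.0)])
conn.commit()

# Retrieve the necessary data with a SQL query.
cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
for region, total in cur.fetchall():
    print(region, total)

conn.close()
```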
Non-Relational Databases:
There are many types of non-relational databases, but the most commonly used are Cassandra, HBase, MongoDB, CouchDB, Redis, and Dynamo.
Machine Learning:
It is one of the most vital parts of data science and one of the hottest subjects of research, with new advancements made each year. One at least needs to understand the basic algorithms of supervised and unsupervised learning. There are multiple libraries available in Python and R for implementing these algorithms.
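For example, a basic supervised-learning workflow can be sketched with scikit-learn in Python (a hedged illustration, assuming scikit-learn is installed; the dataset is the bundled iris sample):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load a small labelled dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out part of the data to evaluate the trained model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a simple supervised model and measure its accuracy.
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```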
Distributed Computing: It is also one of the most important skills for handling large amounts of data, because one can't process that much data on a single system. The tools most commonly used are Apache Hadoop and Spark. The two major parts of these tools are HDFS (Hadoop Distributed File System), which is used for storing data over a distributed file system, and MapReduce, by which we manipulate the data. One can write MapReduce programs in Java or Python. There are various other tools such as Pig, Hive, etc.
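As a hedged sketch of what a Python MapReduce program can look like, the two scripts below follow the Hadoop Streaming convention of reading lines from standard input and writing tab-separated key-value pairs to standard output (the file names and the word-count task are illustrative, not from the original notes):

```python
# mapper.py - emit (word, 1) for every word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py - sum the counts for each word (input arrives sorted by key).
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```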
Communication Skill:
It includes both written and verbal communication. In a data science project, after drawing conclusions from the analysis, the findings have to be communicated to others. Sometimes this may be a report you send to your boss or team at work. Other times it may be a blog post. Often it may be a presentation to a group of colleagues. Regardless, a data science project always involves some form of communication of the project's findings. So it's necessary to have communication skills for becoming a data scientist.

Data Scientist —

a data scientist is defined as someone:


“who integrates the skills of software programmer, statistician and storyteller slash artist to
extract the nuggets of gold hidden under mountains of data”

Roles & Responsibilities of a Data Scientist:


Management: The Data Scientist plays only a minor managerial role, supporting the construction of the base of futuristic and technical abilities within the Data and Analytics field in order to assist various planned and continuing data analytics projects.
Analytics: The Data Scientist represents a scientific role where he plans, implements, and
assesses high-level statistical models and strategies for application in the business’s most
complex issues. The Data Scientist develops econometric and statistical models for various
problems including projections, classification, clustering, pattern analysis, sampling,
simulations, and so forth.
Strategy/Design: The Data Scientist performs a vital role in the advancement of innovative
strategies to understand the business’s consumer trends and management as well as ways to
solve difficult business problems, for instance, the optimization of product fulfillment and
entire profit.
Collaboration: The role of the Data Scientist is not a solitary one; in this position, he collaborates with senior data scientists to communicate obstacles and findings to relevant stakeholders in an effort to drive business performance and improve decision-making.
Knowledge: The Data Scientist also takes leadership to explore different technologies and
tools with the vision of creating innovative data-driven insights for the business at the most
agile pace feasible. In this situation, the Data Scientist also uses initiative in assessing and utilizing new and enhanced data science methods for the business, which he delivers to senior management for approval.
Other Duties: A Data Scientist also performs related tasks and duties as assigned by the Senior Data Scientist, Head of Data Science, Chief Data Officer, or the Employer.

Terminologies Used in Big Data Environments —

Data science
Data science is the professional field that deals with turning data into value such as new
insights or predictive models. It brings together expertise from fields including statistics,
mathematics, computer science, communication as well as domain expertise such as business
knowledge. Data scientist has recently been voted the No 1 job in the U.S., based on current
demand and salary and career opportunities.

Data mining
Data mining is the process of discovering insights from data. In terms of Big Data, because it
is so large, this is generally done by computational methods in an automated way using
methods such as decision trees, clustering analysis and, most recently, machine learning. This
can be thought of as using the brute mathematical power of computers to spot patterns in data
which would not be visible to the human eye due to the complexity of the dataset.

Hadoop
Hadoop is a framework for Big Data computing which has been released into the public
domain as open source software, and so can freely be used by anyone. It consists of a number
of modules all tailored for a different vital step of the Big Data process – from file storage (Hadoop Distributed File System – HDFS) to database (HBase) to carrying out data operations (Hadoop
MapReduce – see below). It has become so popular due to its power and flexibility that it has
developed its own industry of retailers (selling tailored versions), support service providers
and consultants.

Predictive modelling
At its simplest, this is predicting what will happen next based on data about what has
happened previously. In the Big Data age, because there is more data around than ever before,
predictions are becoming more and more accurate. Predictive modelling is a core component
of most Big Data initiatives, which are formulated to help us choose the course of action
which will lead to the most desirable outcome. The speed of modern computers and the
volume of data available means that predictions can be made based on a huge number of
variables, allowing an ever-increasing number of variables to be assessed for the probability that they will lead to success.

MapReduce
MapReduce is a computing procedure for working with large datasets, which was devised
due to difficulty of reading and analysing really Big Data using conventional computing
methodologies. As its name suggests, it consists of two procedures – mapping (sorting information into the format needed for analysis, e.g. sorting a list of people according to their age) and reducing (performing an operation, such as checking the age of everyone in the dataset to see who is over 21).
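To make the mapping and reducing steps concrete, here is a tiny pure-Python sketch of the age example above (the names and ages are made up for illustration):

```python
from functools import reduce

people = [("Asha", 34), ("Ben", 19), ("Chen", 25), ("Dina", 17)]

# Map: reshape each record into a (key, value) pair - here, (age, name).
mapped = [(age, name) for name, age in people]

# Reduce: perform an operation over the mapped data, e.g. count how many are over 21.
over_21 = reduce(lambda total, pair: total + (1 if pair[0] > 21 else 0), mapped, 0)
print("People over 21:", over_21)  # -> 2
```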

NoSQL
NoSQL refers to a database format designed to hold more than data which is simply arranged
into tables, rows, and columns, as is the case in a conventional relational database. This
database format has proven very popular in Big Data applications because Big Data is often
messy, unstructured and does not easily fit into traditional database frameworks.

Python
Python is a programming language which has become very popular in the Big Data space due
to its ability to work very well with large, unstructured datasets (see Part II for the difference
between structured and unstructured data). It is considered to be easier to learn for a data
science beginner than other languages such as R (see also Part II) and more flexible.

R
R is another programming language commonly used in Big Data, and can be thought of as
more specialised than Python, being geared towards statistics. Its strength lies in its powerful
handling of structured data. Like Python, it has an active community of users who are
constantly expanding and adding to its capabilities by creating new libraries and extensions.

Recommendation engine
A recommendation engine is basically an algorithm, or collection of algorithms, designed to
match an entity (for example, a customer) with something they are looking for.
Recommendation engines used by the likes of Netflix or Amazon heavily rely on Big Data
technology to gain an overview of their customers and, using predictive modelling, match
them with products to buy or content to consume. The economic incentives offered by recommendation engines have been a driving force behind a lot of commercial Big Data initiatives and developments over the last decade.

Real-time
Real-time means “as it happens” and in Big Data refers to a system or process which is able
to give data-driven insights based on what is happening at the present moment. Recent years
have seen a large push for the development of systems capable of processing and offering
insights in real-time (or near-real-time), and advances in computing power as well as
development of techniques such as machine learning have made it a reality in many
applications today.

Reporting
The crucial “last step” of many Big Data initiatives involves getting the right information to
the people who need it to make decisions, at the right time. When this step is automated,
analytics is applied to the insights themselves to ensure that they are communicated in a way
that they will be understood and easy to act on. This will usually involve creating multiple
reports based on the same data or insights but each intended for a different audience (for
example, in-depth technical analysis for engineers, and an overview of the impact on the
bottom line for c-level executives).

Spark
Spark is another open source framework like Hadoop but more recently developed and more
suited to handling cutting-edge Big Data tasks involving real time analytics and machine
learning. Unlike Hadoop it does not include its own filesystem, though it is designed to work
with Hadoop’s HDFS or a number of other options. However, for certain data related
processes it is able to calculate at over 100 times the speed of Hadoop, thanks to its in-
memory processing capability. This means it is becoming an increasingly popular choice for
projects involving deep learning, neural networks and other compute-intensive tasks.

Visualisation
Humans find it very hard to understand and draw insights from large amounts of text or
numerical data – we can do it, but it takes time, and our concentration and attention is limited.
For this reason effort has been made to develop computer applications capable of rendering
information in a visual form – charts and graphics which highlight the most important
insights which have resulted from our Big Data projects. A subfield of reporting (see above),
visualisation is now often an automated process, with visualisations customised by algorithm to
be understandable to the people who need to act or take decisions based on them.

Types of Digital Data —

Structured Data
- Able to be processed, sorted, analyzed, and stored in a predetermined format, then retrieved in a fixed format
- Accessed by a computer with the help of search algorithms
- First type of big data to be gathered
- Easiest of the three types of big data to analyze
- Examples of structured data include:
  - Application-generated data
  - Dates
  - Names
  - Numbers (e.g., telephone, credit card, US ZIP Codes, social security)
Semi-Structured Data
- Contains both structured as well as unstructured information
- Data may be formatted in segments
- Appears to be fully structured, but may not be
- Not in the standardized database format of structured data
- Has some properties that make it easier to process than unstructured data
- Examples:
  - CSV
  - Electronic data interchange (EDI)
  - HTML
  - JSON documents
  - NoSQL databases
  - Portable Document Format (PDF)
  - RDF
  - XML
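As a brief illustration (not from the original text), a JSON document is semi-structured: it carries its own field names but no fixed table schema. The snippet below parses one and flattens it into tabular form with pandas; the record fields are hypothetical.

```python
import json
import pandas as pd

raw = '[{"id": 1, "name": "Asha", "address": {"city": "Pune"}},' \
      ' {"id": 2, "name": "Ben", "address": {"city": "Delhi"}}]'

records = json.loads(raw)        # parse the JSON text into Python objects
df = pd.json_normalize(records)  # flatten nested fields into columns
print(df)                        # columns: id, name, address.city
```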
Unstructured Data
- Not in any predetermined format (i.e., no apparent format)
- Accounts for the majority of the digital data that makes up big data
- Examples of the different types of unstructured data include:
  - Human-generated data
    - Email
    - Text messages
    - Invoices
    - Text files
    - Social media data
  - Machine-generated data
    - Geospatial data
    - Weather data
    - Data from IoT and smart devices
    - Radar data
    - Videos
    - Satellite images
    - Scientific data

Introduction to Big Data —


Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that none of the traditional data management tools can store or process it efficiently. In short, Big Data is data, but of huge size.

Following are some of the Big Data examples-


The New York Stock Exchange is an example of Big Data that generates about one
terabyte of new trade data per day.
Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousand flights per day, the data generated reaches many petabytes.
Characteristics of Data —
Data quality is crucial – it assesses whether information can serve its purpose in a particular
context (such as data analysis, for example). So, how do you determine the quality of a given
set of information? There are data quality characteristics of which you should be aware. 
There are five traits that you’ll find within data quality: accuracy, completeness, reliability,
relevance, and timeliness – read on to learn more. 
- Accuracy
- Completeness
- Reliability
- Relevance
- Timeliness

Characteristic: How it's measured
Accuracy: Is the information correct in every detail?
Completeness: How comprehensive is the information?
Reliability: Does the information contradict other trusted resources?
Relevance: Do you really need this information?
Timeliness: How up to date is the information? Can it be used for real-time reporting?
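A hedged example of checking some of these traits programmatically with pandas (the column names and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x.com", "c@x.com"],
    "updated_at": pd.to_datetime(["2024-01-05", "2023-06-01", "2024-02-10", "2022-11-20"]),
})

# Completeness: share of missing values per column.
print(df.isna().mean())

# Reliability/accuracy proxy: duplicated identifiers that may contradict each other.
print(df[df.duplicated("customer_id", keep=False)])

# Timeliness: how stale is each record relative to today?
print(pd.Timestamp.today().normalize() - df["updated_at"])
```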

Big Data Characteristics


Big Data contains a large amount of data that cannot be processed by traditional data storage or processing units. It is used by many multinational companies to process the data and run the business of many organizations. The data flow would exceed 150 exabytes per day before replication.

There are five V's of Big Data that explain its characteristics.

5 V's of Big Data


Volume
Veracity
Variety
Value
Velocity
Volume
The name Big Data itself is related to an enormous size. Big Data is a vast 'volumes' of data
generated from many sources daily, such as business processes, machines, social media
platforms, networks, human interactions, and many more.
Facebook, for example, generates approximately a billion messages, records around 4.5 billion clicks of the "Like" button, and receives more than 350 million new post uploads each day. Big data technologies can handle such large amounts of data.


Variety
Big Data can be structured, unstructured, or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data comes in many forms, such as PDFs, emails, audio, social media posts, photos, videos, etc.



The data is categorized as below:

Structured data: Data with a defined schema and all the required columns, in a tabular form. Structured data is stored in a relational database management system.
Semi-structured: In semi-structured data, the schema is not appropriately defined, e.g., JSON, XML, CSV, TSV, and email. (OLTP (Online Transaction Processing) systems, in contrast, are built to work with structured data stored in relations, i.e., tables.)
Unstructured Data: All the unstructured files, log files, audio files, and image files are included in unstructured data. Some organizations have a lot of data available, but they do not know how to derive value from it since the data is raw.
Quasi-structured Data: This data format contains textual data with inconsistent formats that can be formatted with effort, time, and some tools.
Example: Web server logs, i.e., a log file that is created and maintained by a server and contains a list of activities.

Veracity
Veracity refers to how reliable the data is. It covers the many ways of filtering or translating the data, and the process of being able to handle and manage data efficiently. Veracity of data is also essential in business development.
For example, Facebook posts with hashtags.

Value
Value is an essential characteristic of big data. What matters is not simply the data that we process or store, but the valuable and reliable data that we store, process, and analyze.



Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at which data is created in real time. It covers the speed of incoming data sets, the rate of change, and bursts of activity. A primary aspect of Big Data is to provide demanded data rapidly.

Big data velocity deals with the speed at which the data flows from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.

The primary characteristics of Big Data are –

1. Volume
Volume refers to the huge amounts of data that is collected and generated every second in
large organizations. This data is generated from different sources such as IoT devices, social
media, videos, financial transactions, and customer logs.

Storing and processing this huge amount of data was a problem earlier. But now distributed
systems such as Hadoop are used for organizing data collected from all these sources. The
size of the data is crucial for understanding its value. Also, the volume is useful in
determining whether a collection of data is Big Data or not.

Data volume can vary. For example, a text file is a few kilobytes whereas a video file is a few
megabytes. In fact, Facebook from Meta itself can produce an enormous proportion of data in
a single day. Billions of messages, likes, and posts each day contribute to generating such
huge data.

Global mobile traffic was tallied at around 6.2 exabytes (6.2 billion GB) per month in the year 2016.

2. Variety
Another one of the most important Big Data characteristics is its variety. It refers to the
different sources of data and their nature. The sources of data have changed over the years.
Earlier, it was only available in spreadsheets and databases. Nowadays, data is present in
photos, audio files, videos, text files, and PDFs.
The variety of data is crucial for its storage and analysis.
A variety of data can be classified into three distinct parts:
Structured data
Semi-Structured data
Unstructured data

3. Velocity
This term refers to the speed at which the data is created or generated. The speed of data production is also related to how fast this data is going to be processed, because only after analysis and processing can the data meet the demands of the clients/users.

Massive amounts of data are produced from sensors, social media sites, and application logs
– and all of it is continuous. If the data flow is not continuous, there is no point in investing
time or effort on it.
As an example, per day, people generate more than 3.5 billion searches on Google.

4. Value
Among the characteristics of Big Data, value is perhaps the most important. No matter how
fast the data is produced or its amount, it has to be reliable and useful. Otherwise, the data is
not good enough for processing or analysis. Research says that poor quality data can lead to
almost a 20% loss in a company’s revenue.

Data scientists first convert raw data into information. Then this data set is cleaned to retrieve
the most useful data. Analysis and pattern identification is done on this data set. If the process
is a success, the data can be considered to be valuable.

5. Veracity
This feature of Big Data is connected to the previous one. It defines the degree of
trustworthiness of the data. As most of the data you encounter is unstructured, it is important
to filter out the unnecessary information and use the rest for processing.

Veracity is one of the characteristics of big data analytics that denotes data inconsistency as
well as data uncertainty.

As an example, a huge amount of data can create confusion, while on the other hand a small amount of data may convey inadequate or incomplete information.

Other than these five traits of big data in data science, there are a few more characteristics of
big data analytics that have been discussed down below:

1. Volatility
One of the big data characteristics is volatility, which means rapid change. Big Data is in continuous change; data collected from a particular source may change within a span of a few days or so. This characteristic of Big Data hampers data homogenization and is also known as the variability of data.

2. Visualization
Visualization is one more characteristic of big data analytics. Visualization is the method of representing the big data that has been generated in the form of graphs and charts. Big data professionals have to share their big data insights with non-technical audiences on a daily basis.
Evolution of Big Data —

Looking at the last few decades, we can see that Big Data technology has grown enormously. There are a number of milestones in the evolution of Big Data, which are described below:
1. Data Warehousing:
In the 1990s, data warehousing emerged as a solution to store and analyze large
volumes of structured data.
2. Hadoop:
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. It is an open-source framework that provides distributed storage and large-scale data processing.
3. NoSQL Databases:
In 2009, NoSQL databases were introduced, which provide a flexible way to store and
retrieve unstructured data.
4. Cloud Computing:
Cloud Computing technology helps companies to store their important data in data
centers that are remote, and it saves their infrastructure cost and maintenance costs.
5. Machine Learning:
Machine Learning algorithms are those algorithms that work on large data, and
analysis is done on a huge amount of data to get meaningful insights from it. This has
led to the development of artificial intelligence (AI) applications.
6. Data Streaming:
Data Streaming technology has emerged as a solution to process large volumes of
data in real time.
7. Edge Computing:
Edge Computing is a kind of distributed computing paradigm that allows data
processing to be done at the edge or the corner of the network, closer to the source of
the data.
Overall, big data technology has come a long way since the early days of data warehousing.
The introduction of Hadoop, NoSQL databases, cloud computing, machine learning, data
streaming, and edge computing has revolutionized how we store, process, and analyze large
volumes of data. As technology evolves, we can expect Big Data to play a very important
role in various industries.
Big Data Analytics —
Big data analytics is the often complex process of examining big data to uncover information
-- such as hidden patterns, correlations, market trends and customer preferences -- that can
help organizations make informed business decisions.
On a broad scale, data analytics technologies and techniques give organizations a way to
analyze data sets and gather new information. Business intelligence (BI) queries answer basic
questions about business operations and performance.
Big data analytics is a form of advanced analytics, which involves complex applications with elements such as predictive models, statistical algorithms and what-if analysis powered by analytics systems.
Classification of Analytics —

There are four types of analytics: Descriptive, Diagnostic, Predictive, and Prescriptive. These four categories can be compared by the amount of value added to an organization versus the complexity they take to implement. The idea is that you should start with the easiest to implement, Descriptive Analytics. Below, we review the four analytics types, examples of their use cases, and how they all work together.

Predictive Analytics
Predictive analytics turns data into valuable, actionable information. It uses data to determine the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics draws on a variety of statistical techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts to make predictions about future events. Techniques used for predictive analytics include (a small sketch follows the lists below):
- Linear Regression
- Time Series Analysis and Forecasting
- Data Mining

Basic cornerstones of predictive analytics:
- Predictive modeling
- Decision analysis and optimization
- Transaction profiling
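As a small hedged sketch of the time-series flavour of predictive analytics, the snippet below fits a plain linear trend to synthetic monthly sales figures and projects the next period:

```python
import numpy as np

# Synthetic monthly sales figures (illustrative only).
sales = np.array([120, 132, 128, 141, 150, 158, 163, 171])
months = np.arange(len(sales))

# Fit a linear trend (a very simple predictive model).
slope, intercept = np.polyfit(months, sales, deg=1)

# Predict the next month's sales from the fitted trend.
next_month = len(sales)
forecast = slope * next_month + intercept
print(f"Forecast for month {next_month}: {forecast:.1f}")
```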
Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It looks at past performance and understands it by mining historical data to identify the causes of success or failure in the past. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify customers or prospects into groups. Unlike a predictive model that focuses on predicting the behavior of a single customer, descriptive analytics identifies many different relationships between customers and products.
Common examples of descriptive analytics are company reports that provide historic reviews, such as (see the short example after this list):
- Data queries
- Reports
- Descriptive statistics
- Data dashboards
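For instance, a minimal descriptive-statistics report can be produced with pandas (the sales data below is made up for illustration):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "revenue": [1200, 950, 1430, 800, 1010],
})

# Descriptive statistics summarising past performance.
print(sales["revenue"].describe())

# A simple management-style report: total revenue per region.
print(sales.groupby("region")["revenue"].sum())
```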
Prescriptive Analytics
Prescriptive Analytics automatically synthesizes big data, mathematical science, business rules, and machine learning to make a prediction and then suggests a decision option to take advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions that benefit from the predictions and showing the decision maker the implications of each decision option. Prescriptive Analytics not only anticipates what will happen and when it will happen, but also why it will happen. Further, Prescriptive Analytics can suggest decision options on how to take advantage of a future opportunity or mitigate a future risk, and illustrate the implication of each decision option.
For example, Prescriptive Analytics can benefit healthcare strategic planning by using
analytics to leverage operational and usage data combined with data of external factors such
as economic data, population demography, etc.

Diagnostic Analytics
In this analysis, we generally use historical data to answer a question or solve a problem. We try to find dependencies and patterns in the historical data of the particular problem.

For example, companies go for this analysis because it gives great insight into a problem, and because they already keep detailed information at their disposal; otherwise, data collection may have to be repeated individually for every problem, which would be very time-consuming. Common techniques used for Diagnostic Analytics are listed below, followed by a small correlation example:

Data discovery
Data mining
Correlations
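A short hedged example of the correlation step using pandas (the columns and values are hypothetical):

```python
import pandas as pd

data = pd.DataFrame({
    "ad_spend":    [10, 15, 9, 20, 25, 18],
    "site_visits": [200, 260, 180, 330, 400, 300],
    "sales":       [22, 30, 20, 38, 47, 33],
})

# Pairwise correlations help diagnose which factors move together
# with the outcome being investigated.
print(data.corr())
```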

Top Challenges Facing Big Data —


Storage
With vast amounts of data generated daily, the greatest challenge is storage (especially when
the data is in different formats) within legacy systems. Unstructured data cannot be stored in
traditional databases.
Processing
Processing big data refers to the reading, transforming, extraction, and formatting of useful
information from raw information. The input and output of information in unified formats
continue to present difficulties.
Security
Security is a big concern for organizations. Non-encrypted information is at risk of theft or
damage by cyber-criminals. Therefore, data security professionals must balance access to
data against maintaining strict security protocols.
Finding and Fixing Data Quality Issues
Many of you are probably dealing with challenges related to poor data quality, but solutions are available. The following are some approaches to fixing data problems:
- Correct information in the original database.
- Repair the original data source to resolve any data inaccuracies.
- Use highly accurate methods of determining who someone is.
Scaling Big Data Systems
Database sharding, memory caching, moving to the cloud and separating read-only and write-
active databases are all effective scaling methods. While each one of those approaches is
fantastic on its own, combining them will lead you to the next level.
Evaluating and Selecting Big Data Technologies
Companies are spending millions on new big data technologies, and the market for such tools
is expanding rapidly. In recent years, however, the IT industry has caught on to big data and
analytics potential. The trending technologies include the following:
- Hadoop Ecosystem
- Apache Spark
- NoSQL Databases
- R Software
- Predictive Analytics
- Prescriptive Analytics
Big Data Environments
In an extensive data environment, data is constantly being ingested from various sources, making it more dynamic than a data warehouse. The people in charge of the big data environment can quickly lose track of where each data collection came from and what it contains.
Real-Time Insights
The term "real-time analytics" describes the practice of performing analyses on data as a
system is collecting it. Decisions may be made more efficiently and with more accurate
information thanks to real-time analytics tools, which use logic and mathematics to deliver
insights on this data quickly.
Data Validation
Before using data in a business process, its integrity, accuracy, and structure must be
validated. The output of a data validation procedure can be used for further analysis, BI, or
even to train a machine learning model.
Healthcare Challenges
Electronic health records (EHRs), genomic sequencing, medical research, wearables, and
medical imaging are just a few examples of the many sources of health-related big data.
Barriers to Effective Use Of Big Data in Healthcare
- The price of implementation
- Compiling and polishing data
- Security
- Disconnect in communication
Importance of Big Data Analytics —
Take the music streaming platform Spotify for example. The company has nearly 96 million
users that generate a tremendous amount of data every day. Through this information, the
cloud-based platform automatically generates suggested songs—through a smart
recommendation engine—based on likes, shares, search history, and more. What enables this
is the techniques, tools, and frameworks that are a result of Big Data analytics.
If you are a Spotify user, you must have come across the top recommendations section, which is based on your likes, past history, and other signals. Spotify does this by utilizing a recommendation engine that leverages data filtering tools to collect data and then filter it using algorithms.
Data Analytics Tools

Data analysis software tools make it easier for users to process and manipulate information and to analyze the relationships and correlations between datasets:
- Data analysis software provides tools to assist with qualitative analysis, such as transcription analysis, content analysis, discourse analysis, and grounded theory methodology.
- Data analysis software has the statistical and analytical capability needed for decision-making methods.
- The data analysis software process can be classified into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA).

Popular data analytics tools include:
- R and Python
- Microsoft Excel
- Tableau
- RapidMiner
- KNIME
- Power BI
- Apache Spark
- QlikView
- Talend
- Splunk

Linear Regression—
Linear regression analysis is used to predict the value of a variable based on the value of
another variable. The variable you want to predict is called the dependent variable. The
variable you are using to predict the other variable's value is called the independent variable.
This form of analysis estimates the coefficients of the linear equation, involving one or more
independent variables that best predict the value of the dependent variable. Linear regression
fits a straight line or surface that minimizes the discrepancies between predicted and actual
output values. There are simple linear regression calculators that use a “least squares” method to discover the best-fit line for a set of paired data. You then estimate the value of Y (the dependent variable) from X (the independent variable).
You can perform the linear regression method in a variety of programs and environments, including:
- R linear regression
- MATLAB linear regression
- Sklearn linear regression
- Linear regression Python
- Excel linear regression
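As a hedged sketch of the "Sklearn linear regression" option from the list above, fitting a least-squares line to made-up paired data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Independent variable X and dependent variable y (illustrative values).
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.3, 6.0, 8.2, 9.9])

model = LinearRegression().fit(X, y)          # least-squares fit
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction for x=6:", model.predict([[6]])[0])
```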

Why linear regression is important


Linear-regression models are relatively simple and provide an easy-to-interpret mathematical
formula that can generate predictions. Linear regression can be applied to various areas in
business and academic study.

Key assumptions of effective linear regression


Assumptions to be considered for success with linear-regression analysis:
- For each variable: Consider the number of valid cases, mean and standard deviation.
- For each model: Consider regression coefficients, correlation matrix, part and partial correlations, multiple R, R2, adjusted R2, change in R2, standard error of the estimate, analysis-of-variance table, predicted values and residuals. Also, consider 95-percent-confidence intervals for each regression coefficient, variance-covariance matrix, variance inflation factor, tolerance, Durbin-Watson test, distance measures (Mahalanobis, Cook and leverage values), DfBeta, DfFit, prediction intervals and case-wise diagnostic information.
- Plots: Consider scatterplots, partial plots, histograms and normal probability plots.
- Data: Dependent and independent variables should be quantitative. Categorical variables, such as religion, major field of study or region of residence, need to be recoded to binary (dummy) variables or other types of contrast variables.
- Other assumptions: For each value of the independent variable, the distribution of the dependent variable must be normal. The variance of the distribution of the dependent variable should be constant for all values of the independent variable. The relationship between the dependent variable and each independent variable should be linear and all observations should be independent.

Polynomial Regression —
o Polynomial Regression is a regression algorithm that models the relationship between a dependent variable (y) and an independent variable (x) as an nth degree polynomial. The Polynomial Regression equation is given below:
y = b0 + b1x1 + b2x1^2 + b3x1^3 + ... + bnx1^n
o It is also called a special case of Multiple Linear Regression in ML, because we add some polynomial terms to the Multiple Linear Regression equation to convert it into Polynomial Regression.
o It is a linear model with some modification in order to increase the accuracy.
o The dataset used in Polynomial regression for training is of non-linear nature.
o It makes use of a linear regression model to fit the complicated and non-linear
functions and datasets.
o Hence, "In Polynomial regression, the original features are converted into
Polynomial features of required degree (2,3,..,n) and then modeled using a linear
model."

Need for Polynomial Regression:


The need of Polynomial Regression in ML can be understood in the below points:
o If we apply a linear model on a linear dataset, then it provides a good result, as we have seen in Simple Linear Regression; but if we apply the same model without any modification on a non-linear dataset, then it will produce drastically poor output. The loss function will increase, the error rate will be high, and accuracy will decrease.
o So for such cases, where data points are arranged in a non-linear fashion, we
need the Polynomial Regression model. We can understand it in a better way using
the below comparison diagram of the linear dataset and non-linear dataset.

Comparing the two cases, if we take a dataset which is arranged non-linearly and try to cover it with a linear model, we can clearly see that it hardly covers any data points. On the other hand, a curve, which comes from the Polynomial model, is suitable to cover most of the data points.
o Hence, if the datasets are arranged in a non-linear fashion, then we should use the
Polynomial Regression model instead of Simple Linear Regression.

Multivariate Regression

Multivariate regression is a technique used to measure the degree to which multiple independent variables and multiple dependent variables are linearly related to each other. The relation is said to be linear due to the correlation between the variables. Once multivariate regression is applied to the dataset, the method can be used to predict the behaviour of the response variables based on their corresponding predictor variables.
Multivariate regression is commonly used as a supervised algorithm in machine learning, a model to predict the behaviour of dependent variables from multiple independent variables.
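A hedged sketch with scikit-learn, predicting two response variables from several predictor variables at once (all numbers invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Several independent (predictor) variables per row.
X = np.array([[25, 3.1, 1], [32, 2.4, 0], [47, 4.0, 1], [51, 3.6, 0], [38, 2.9, 1]])

# Two dependent (response) variables per row.
Y = np.array([[200, 12], [180, 10], [260, 15], [240, 13], [210, 12]])

# LinearRegression handles multiple outputs directly.
model = LinearRegression().fit(X, Y)
print(model.predict([[40, 3.0, 1]]))   # predictions for both responses
```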
UNIT 2

Introducing Hadoop

Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.
The Hadoop framework application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed to
scale up from single server to thousands of machines, each offering local computation and
storage.
Hadoop Overview —
Hadoop makes it easier to use all the storage and processing capacity in cluster servers, and
to execute distributed processes against huge amounts of data. Hadoop provides the building
blocks on which other services and applications can be built.
Applications that collect data in various formats can place data into the Hadoop cluster by
using an API operation to connect to the NameNode. The NameNode tracks the file directory
structure and placement of “chunks” for each file, replicated across DataNodes. To run a job
to query the data, provide a MapReduce job made up of many map and reduce tasks that run
against the data in HDFS spread across the DataNodes. Map tasks run on each node against
the input files supplied, and reducers run to aggregate and organize the final output.
The Hadoop ecosystem has grown significantly over the years due to its extensibility. Today,
the Hadoop ecosystem includes many tools and applications to help collect, store, process,
analyze, and manage big data. Some of the most popular applications are:
- Spark – An open source, distributed processing system commonly used for big data workloads. Apache Spark uses in-memory caching and optimized execution for fast performance, and it supports general batch processing, streaming analytics, machine learning, graph databases, and ad hoc queries.
- Presto – An open source, distributed SQL query engine optimized for low-latency, ad-hoc analysis of data. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. Presto can process data from multiple data sources including the Hadoop Distributed File System (HDFS) and Amazon S3.
- Hive – Allows users to leverage Hadoop MapReduce using a SQL interface, enabling analytics at a massive scale, in addition to distributed and fault-tolerant data warehousing.
- HBase – An open source, non-relational, versioned database that runs on top of Amazon S3 (using EMRFS) or the Hadoop Distributed File System (HDFS). HBase is a massively scalable, distributed big data store built for random, strictly consistent, real-time access for tables with billions of rows and millions of columns.
- Zeppelin – An interactive notebook that enables interactive data exploration.
RDBMS versus Hadoop —
1. RDBMS: Traditional row-column based databases, basically used for data storage, manipulation and retrieval. Hadoop: An open-source software framework used for storing data and running applications or processes concurrently.
2. RDBMS: Mostly structured data is processed. Hadoop: Both structured and unstructured data are processed.
3. RDBMS: It is best suited for the OLTP environment. Hadoop: It is best suited for BIG data.
4. RDBMS: It is less scalable than Hadoop. Hadoop: It is highly scalable.
5. RDBMS: Data normalization is required. Hadoop: Data normalization is not required.
6. RDBMS: It stores transformed and aggregated data. Hadoop: It stores huge volumes of data.
7. RDBMS: It has no latency in response. Hadoop: It has some latency in response.
8. RDBMS: The data schema is of static type. Hadoop: The data schema is of dynamic type.
9. RDBMS: High data integrity available. Hadoop: Lower data integrity available than RDBMS.
10. RDBMS: Cost is applicable for licensed software. Hadoop: Free of cost, as it is open source software.

HDFS (Hadoop Distributed File System):
HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop
applications. This open source framework works by rapidly transferring data between nodes.
It's often used by companies who need to handle and store big data. HDFS is a key
component of many Hadoop systems, as it provides a means for managing big data, as well as
supporting big data analytics.
There are many companies across the globe that use HDFS, so what exactly is it and why is it
needed? Let's take a deep dive into what HDFS is and why it may be useful for businesses.

What is HDFS?
HDFS stands for Hadoop Distributed File System. HDFS operates as a distributed file system
designed to run on commodity hardware.
HDFS is fault-tolerant and designed to be deployed on low-cost, commodity hardware. HDFS
provides high throughput data access to application data and is suitable for applications that
have large data sets and enables streaming access to file system data in Apache Hadoop.
So, what is Hadoop? And how does it vary from HDFS? A core difference between Hadoop
and HDFS is that Hadoop is the open source framework that can store, process and analyze
data, while HDFS is the file system of Hadoop that provides access to data. This essentially
means that HDFS is a module of Hadoop.
Let's take a look at HDFS architecture:
As we can see, it focuses on NameNodes and DataNodes. The NameNode is the hardware that contains the GNU/Linux operating system and the NameNode software. It acts as the master server of the Hadoop distributed file system and can manage the files, control a client's access to files, and oversee file operations such as renaming, opening, and closing files.
A DataNode is hardware having the GNU/Linux operating system and DataNode software. For every node in an HDFS cluster, you will locate a DataNode. These nodes help to control the data storage of the system, as they can perform operations on the file system when the client requests, and also create, replicate, and delete blocks when the NameNode instructs.
The HDFS meaning and purpose is to achieve the following goals:
- Manage large datasets - Organizing and storing datasets can be a hard task to handle. HDFS is used to manage the applications that have to deal with huge datasets. To do this, HDFS can have hundreds of nodes per cluster.
- Detecting faults - HDFS should have technology in place to scan and detect faults quickly and effectively, as it includes a large number of commodity hardware components. Failure of components is a common issue.
- Hardware efficiency - When large datasets are involved, it can reduce the network traffic and increase the processing speed.
Components and Block Replication —
Replication of blocks

HDFS is a reliable storage component of Hadoop. This is because every block stored in the filesystem is replicated on different DataNodes in the cluster. This makes HDFS fault-tolerant.

The default replication factor in HDFS is 3. This means that every block will have two more copies of it, each stored on separate DataNodes in the cluster. However, this number is configurable.

But you may be wondering: doesn't that mean we are taking up too much storage? For instance, if we have 5 blocks of 128 MB each, that amounts to 5 × 128 × 3 = 1920 MB. True. But these nodes are commodity hardware, and we can easily scale the cluster to add more of these machines. The cost of buying machines is much lower than the cost of losing the data!

Now, you may be wondering how the NameNode decides which DataNode to store the replicas on. Before answering that question, we need to look at what a Rack is in Hadoop.

Processing Data with Hadoop —

Components and Daemons of Hadoop


Hadoop consists of three major components: HDFS, MapReduce, and YARN.
1. Hadoop HDFS
It is the storage layer for Hadoop. Hadoop Distributed File System stores data across various
nodes in a cluster. It divides the data into blocks and stores them on different nodes. The
block size is 128 MB by default. We can configure the block size as per our requirements.
2. Hadoop MapReduce
It is the processing layer in Hadoop. Hadoop MapReduce processes the data stored in Hadoop HDFS in parallel across various nodes in the cluster. It divides the task submitted by the user into independent tasks and processes them as subtasks across the commodity hardware.
3. Hadoop YARN
It is the resource and process management layer of Hadoop. YARN is responsible for sharing
resources amongst the applications running in the cluster and scheduling the task in the
cluster.
These are the three core components in Hadoop.
Daemons running in the Hadoop Cluster
There are some daemons that run on the Hadoop cluster. Daemons are lightweight processes that run in the background.
Some daemons run on the master node and some on the slave nodes. Let us now study the Hadoop daemons.
The major Hadoop daemons are:
1. Master Daemons
- NameNode: It is the master daemon in Hadoop HDFS. It maintains the filesystem namespace. It stores metadata about each block of the files.
- ResourceManager: It is the master daemon of YARN. It arbitrates resources amongst all the applications running in the cluster.
2. Slave Daemons
- DataNode: DataNode is the slave daemon of Hadoop HDFS. It runs on slave machines. It stores actual data or blocks.
- NodeManager: It is the slave daemon of YARN. It takes care of all the individual computing nodes in the cluster.
How Hadoop works?

Hadoop stores and processes the data in a distributed manner across the cluster of commodity
hardware. To store and process any data, the client submits the data and program to the
Hadoop cluster.
Hadoop HDFS stores the data, MapReduce processes the data stored in HDFS, and YARN
divides the tasks and assigns resources.

Introduction to MapReduce —
Traditional enterprise systems normally have a centralized server to store and process data. The traditional model is certainly not suitable for processing huge volumes of scalable data, which cannot be accommodated by standard database servers. Moreover, the centralized system creates too much of a bottleneck while processing multiple files simultaneously.
Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce
divides a task into small parts and assigns them to many computers. Later, the results are
collected at one place and integrated to form the result dataset.
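A hedged, single-machine sketch of that idea using Python's multiprocessing module: the input is split into parts, mapped in parallel by several workers, and the partial results are collected and reduced into one final result (the sum-of-squares task is illustrative only).

```python
from multiprocessing import Pool

def map_chunk(chunk):
    # "Map" step: each worker processes its own part of the data.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]        # divide the task into small parts

    with Pool(processes=4) as pool:
        partials = pool.map(map_chunk, chunks)     # assign the parts to workers

    print("sum of squares:", sum(partials))        # "Reduce": integrate the results
```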

 
Features of MapReduce

Key Features of MapReduce


The following advanced features characterize MapReduce:

1. Highly scalable
Apache Hadoop MapReduce is a framework with excellent scalability. This is because of its capacity for distributing and storing large amounts of data across numerous servers. These servers can all run simultaneously and are all reasonably priced.

By adding servers to the cluster, we can simply grow the amount of storage and computing
power. We may improve the capacity of nodes or add any number of nodes (horizontal
scalability) to attain high computing power. Organizations may execute applications from
massive sets of nodes, potentially using thousands of terabytes of data, thanks to Hadoop
MapReduce programming.

2. Versatile
Businesses can use MapReduce programming to access new data sources. It makes it possible
for companies to work with many forms of data. Enterprises can access both organized and
unstructured data with this method and acquire valuable insights from the various data
sources.
Since Hadoop is an open-source project, its source code is freely accessible for review,
alterations, and analyses. This enables businesses to alter the code to meet their specific
needs. The MapReduce framework supports data from sources including email, social media,
and clickstreams in different languages.

3. Secure
The MapReduce programming model uses the HBase and HDFS security approaches, and
only authenticated users are permitted to view and manipulate the data. HDFS uses a
replication technique in Hadoop 2 to provide fault tolerance. Depending on the replication
factor, it makes a clone of each block on the various machines. One can therefore access data
from the other devices that house a replica of the same data if any machine in a cluster goes
down. Erasure coding has taken the role of this replication technique in Hadoop 3. Erasure coding delivers the same level of fault tolerance with less storage space; the storage overhead with erasure coding is less than 50%.

4. Affordability
With the help of the MapReduce programming framework and Hadoop’s scalable design, big
data volumes may be stored and processed very affordably. Such a system is particularly cost-
effective and highly scalable, making it ideal for business models that must store data that is
constantly expanding to meet the demands of the present.

In terms of scalability, processing data with older, conventional relational database


management systems was not as simple as it is with the Hadoop system. In these situations,
the company had to minimize the data and execute classification based on presumptions
about how specific data could be relevant to the organization, hence deleting the raw data.
The MapReduce programming model in the Hadoop scale-out architecture helps in this
situation.

5. Fast-paced
The Hadoop Distributed File System, a distributed storage technique used by MapReduce, is
a mapping system for finding data in a cluster. The data processing technologies, such as MapReduce programming, are typically placed on the same servers, which enables quicker data processing.

Thanks to Hadoop’s distributed data storage, users may process data in a distributed manner
across a cluster of nodes. As a result, it gives the Hadoop architecture the capacity to process
data exceptionally quickly. Hadoop MapReduce can process unstructured or semi-structured
data in high numbers in a shorter time.

6. Based on a simple programming model


Hadoop MapReduce is built on a straightforward programming model and is one of the
technology’s many noteworthy features. This enables programmers to create MapReduce
applications that can handle tasks quickly and effectively. Java is a very well-liked and
simple-to-learn programming language used to develop the MapReduce programming model.
Java programming is simple to learn, and anyone can create a data processing model that
works for their company. Hadoop is straightforward to utilize because customers don’t need
to worry about computing distribution. The framework itself does the processing.

7. Parallel processing-compatible
The parallel processing involved in MapReduce programming is one of its key components.
The tasks are divided in the programming paradigm to enable the simultaneous execution of
independent activities. As a result, the program runs faster because of the parallel processing,
which makes it simpler for the processes to handle each job. Multiple processors can carry
out these broken-down tasks thanks to parallel processing. Consequently, the entire software
runs faster.

8. Reliable
The same set of data is transferred to some other nodes in a cluster each time a collection of
information is sent to a single node. Therefore, even if one node fails, backup copies are
always available on other nodes that may still be retrieved whenever necessary. This ensures
high data availability.

The framework offers a way to guarantee data trustworthiness through the use of Block
Scanner, Volume Scanner, Disk Checker, and Directory Scanner modules. Your data is safely
saved in the cluster and is accessible from another machine that has a copy of the data if your
device fails or the data becomes corrupt.

9. Highly available
Hadoop’s fault tolerance feature ensures that even if one of the DataNodes fails, the user may
still access the data from other DataNodes that have copies of it. Moreover, a high-availability Hadoop cluster comprises two or more NameNodes, active and passive, running on hot standby. The active NameNode is the active node. A passive node is a backup node that applies the changes made in the active NameNode's edit logs to its own namespace.
