
UNIT-I

INTRODUCTION TO BIG DATA

What is big data?

Big data is a term used to describe data of great variety, huge volume, and high velocity.
Apart from its significant volume, big data is also so complex that none of the conventional
data management tools can store or process it effectively. The data can be structured or
unstructured.

Examples of big data include:

 Mobile phone details

 Social media content

 Health records

 Transactional data

 Web searches

 Financial documents

 Weather information.

Big data can be generated by users (emails, images, transactional data, etc.) or by machines (IoT
devices, ML algorithms, etc.). Depending on the owner, the data may be made commercially
available to the public through an API or FTP. In some instances, a subscription may be required
to gain access to it.

Types of big data:

Big data is a term used to describe large volumes of data that are hard to manage. Due to its large
size and complexity, traditional data management tools cannot store or process it efficiently.
There are three types of big data:

 Structured
 Unstructured
 Semi-structured
1. Structured:
Structured data is any data that can be stored, accessed, and processed in a fixed format.
Although recent advancements in computer science have made it possible to process such data,
experts agree that issues can arise when the data grows to a huge extent.
2. Unstructured:
Unstructured data is data whose form and structure are undefined. In addition to being large,
unstructured data also poses multiple challenges in terms of processing. Large organizations
have data sources containing a combination of text, video, and image files. Despite having such
an abundance of data, they still struggle to derive value from it due to its intricate format.
3. Semi-structured:
Semi-structured data contains both structured and unstructured elements. At its essence, we can
view semi-structured data in a structured form, but the structure is not rigidly defined, as with
data in an XML or JSON file.
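For illustration, a minimal sketch in Python of a hypothetical semi-structured record (shown here
as JSON; an XML file behaves similarly): the named fields give it some structure, but there is no
fixed relational schema.

    import json

    # A hypothetical semi-structured record: tagged fields give partial structure,
    # but fields and nesting can vary from record to record (no fixed table schema)
    record = '{"name": "A. Kumar", "id": 1023, "skills": ["Hadoop", "Spark"]}'
    parsed = json.loads(record)
    print(parsed["skills"])   # ['Hadoop', 'Spark']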

Characteristics of big data:

IBM describes the phenomenon of big data through the four V’s:
 Volume
 Velocity
 Variety
 Veracity
 Volume:- Because data are collected electronically, we are able to collect more of it. To be
useful, these data must be stored, and this storage has led to vast quantities of data. Many
companies now store in excess of 100 terabytes of data (a terabyte is 1,024 gigabytes).

 Velocity:- Real-time capture and analysis of data present unique challenges both in how data
are stored and the speed with which those data can be analyzed for decision making.
Ex:- The New York Stock Exchange collects 1 terabyte of data in a single trading session,
and having current data and real-time rules for trades and predictive modeling are important
for managing stock portfolios.

 Variety:- In addition to the sheer volume and speed with which companies now collect data,
more complicated types of data are now available and are proving to be of great value to
businesses.
 Text data are collected by monitoring what is being said about a company’s products or
services on social media platforms such as Twitter.
 Audio data are collected from service calls (on a service call, you will often hear “this
call may be monitored for quality control”).
 Video data collected by in-store video cameras are used to analyze shopping behavior.
 Analyzing information generated by these nontraditional sources is more complicated in
part because of the processing required to transform the data into a numerical form that
can be analyzed.
 Veracity :- Veracity has to do with how much uncertainty is in the data. For example, the
data could have many missing values, which makes reliable analysis a challenge.
Inconsistencies in units of measure and the lack of reliability of responses in terms of bias
also increase the complexity of the data. These challenges have led to new technologies such as
Hadoop and MapReduce.

Big Data platforms:

The constant stream of information from various sources is becoming more intense,
especially with the advance in technology. And this is where big data platforms come in to store
and analyze the ever-increasing mass of information.

A big data platform is an integrated computing solution that combines numerous software
systems, tools, and hardware for big data management. It is a one-stop architecture that solves all
the data needs of a business regardless of the volume and size of the data at hand. Due to their
efficiency in data management, enterprises are increasingly adopting big data platforms to gather
tons of data and convert them into structured, actionable business insights.
Currently, the marketplace is flooded with numerous Open source and commercially available
big data platforms. They boast different features and capabilities for use in a big data
environment.

Features of Big Data Platform:

a) A Big Data platform should be able to accommodate new platforms and tools based on
business requirements, because business needs can change with new technologies or with
changes in business processes.
b) It should support linear scale-out.
c) It should be capable of rapid deployment.
d) It should support a variety of data formats.
e) The platform should provide data analysis and reporting tools.
f) It should provide real-time data analysis software.
g) It should have tools for searching through large data sets.

Big Data platforms:

The following are some of the Big Data platforms:

a) Hadoop b) Cloudera c) Amazon Web Services d) Hortonworks e) MapR

a) Hadoop:
 Hadoop is an open-source, Java-based programming framework and server software used to
store and analyze data with the help of hundreds or even thousands of commodity servers in a
clustered environment.  Hadoop is designed to store and process large datasets extremely fast
and in a fault-tolerant way.
 Hadoop uses HDFS (Hadoop Distributed File System) for storing data on a cluster of
commodity computers. If any server goes down, Hadoop knows how to replicate the data, so
there is no loss of data even in the event of hardware failure.

b) Cloudera:
 Cloudera is one of the first commercial Hadoop-based Big Data analytics platforms offering
Big Data solutions.
 Its product range includes Cloudera Analytic DB, Cloudera Operational DB, Cloudera Data
Science & Engineering and Cloudera Essentials.
 All these products are based on Apache Hadoop and provide real-time processing and
analytics of massive data sets.
c) Amazon Web Services:
 Amazon offers a Hadoop environment in the cloud as part of its Amazon Web Services
package.
 The AWS Hadoop solution is a hosted solution that runs on Amazon's Elastic Compute Cloud
(EC2) and Simple Storage Service (S3).
 Enterprises can use Amazon AWS to run their Big Data processing and analytics in the cloud
environment.
 Amazon EMR allows companies to set up and easily scale Apache Hadoop, Spark, HBase,
Presto, Hive, and other Big Data frameworks using its cloud hosting environment.
d) Hortonworks:
 Hortonworks uses 100% open-source software without any proprietary components.
Hortonworks was the first to integrate support for Apache HCatalog.
 Hortonworks is a Big Data company based in California.
 The company develops and supports applications for Apache Hadoop. The Hortonworks
Hadoop distribution is 100% open source and enterprise-ready, with the following features:
 Centralized management and configuration of clusters
 Security and data governance built into the system
 Centralized security administration across the system
e) MapR:
 MapR is another Big Data platform, which uses a Unix-style file system for handling data.
 It does not use HDFS, and the system is easy to learn for anyone familiar with Unix.
 This solution integrates Hadoop, Spark, and Apache Drill with real-time data processing
features.
Web Data and use cases of Web Data:
Web data is an incredibly broad term. It encompasses a wide range of information which is
collected from websites and apps about different users’ browsing habits, online behaviors and
preferences. It can also include information about the consumer themselves, such as their details,
search and purchase intent or online interests. Examples of web data include online product
reviews, social media posts, website traffic statistics, and search engine results.
Use Cases
E-commerce Price Monitoring
One of the main use cases of web data is e-commerce price monitoring. With the vast amount of
products and prices available online, businesses can leverage web data to track and monitor the
prices of their competitors’ products. By collecting data from various e-commerce websites,
businesses can gain insights into market trends, identify pricing strategies, and adjust their own
pricing accordingly. This use case helps businesses stay competitive and make informed pricing
decisions.
Sentiment Analysis and Brand Monitoring
Web data is also widely used for sentiment analysis and brand monitoring. By analyzing data
from social media platforms, review websites, and online forums, businesses can gain valuable
insights into customer opinions, feedback, and sentiments towards their brand or products. This
use case allows businesses to understand customer preferences, identify areas for improvement,
and manage their brand reputation effectively.
Market Research and Trend Analysis
Web data is a valuable resource for market research and trend analysis. By collecting data from
various sources such as news websites, blogs, and industry forums, businesses can gather
information about market trends, consumer behaviour, and emerging technologies. This use case
helps businesses make data-driven decisions, identify new market opportunities, and stay ahead
of their competitors.
These are just a few examples of the main use cases of web data. The versatility and abundance
of web data make it a valuable asset for businesses across various industries.
Main Attributes of Web Data
Web data refers to the vast amount of information available on the internet, encompassing
various attributes that can be associated with it. Some possible attributes of web data include the
source or website from which the data originates, the date and time of data retrieval, the format
in which the data is presented (such as HTML, XML, JSON), the structure of the data (such as
tables, lists, or graphs), the content or topic of the data (ranging from news articles and social
media posts to scientific research papers and e-commerce product listings), and the metadata
associated with the data (such as author, title, keywords, and tags). Additionally, web data can
have attributes related to its accessibility, quality, reliability, and licensing.

CHALLENGES OF CONVENTIONAL SYSTEMS


Introduction to Conventional Systems
 Conventional systems are traditional data management systems, such as relational databases,
that store and process structured data on a single server or a small cluster of servers.
 Big data is a huge amount of data which is beyond the capacity of conventional database
systems to store, manage and analyze within a specific time interval.

Difference between conventional computing and intelligent computing:


 Conventional computing functions logically with a set of rules and calculations, while
neural computing can function via images, pictures, and concepts.
 Conventional computing is often unable to manage the variability of data obtained in the real
world.
 On the other hand, neural computing, like our own brains, is well suited to situations that have
no clear algorithmic solution and can manage noisy, imprecise data. This allows it to excel in
those areas that conventional computing often finds difficult.

List of challenges of Conventional Systems:


The following challenges dominate in the case of conventional systems in real-time scenarios:
1) Uncertainty of Data Management Landscape:
 Because big data is continuously expanding, new companies and technologies are being
developed every day.
 A big challenge for companies is to find out which technology works best for them without
introducing new risks and problems.
2) The Big Data Talent Gap:
 While Big Data is a growing field, there are very few experts available in this field.
 This is because Big Data is a complex field, and people who understand the complexity and
intricate nature of this field are few and far between.
3) Getting data into the big data platform:
 Data is increasing every single day. This means that companies have to tackle a limitless
amount of data on a regular basis.
 The scale and variety of data that is available today can overwhelm any data practitioner and
that is why it is important to make data accessibility simple and convenient for brand managers
and owners.
4) Need for synchronization across data sources:
 As data sets become more diverse, there is a need to incorporate them into an analytical
platform.
 If this is ignored, it can create gaps and lead to wrong insights and messages.
5) Getting important insights through the use of Big data analytics:
 It is important that companies gain proper insights from big data analytics and it is important
that the correct department has access to this information.
 A major challenge in the big data analytics is bridging this gap in an effective fashion.
Some of the Big Data challenges are:
Big data challenges include storing and analyzing extremely large and fast-growing data.

1. Sharing and Accessing Data:


 Perhaps the most frequent challenge in big data efforts is the inaccessibility of data sets
from external sources.
 Sharing data can cause substantial challenges.
 It includes the need for inter- and intra-institutional legal documents.
 Accessing data from public repositories leads to multiple difficulties.
 It is necessary for the data to be available in an accurate, complete and timely manner,
because if the data in a company's information system is to be used to make accurate and
timely decisions, the data must be available in this form.
2. Privacy and Security:
 This is another very important challenge with Big Data. It includes sensitive, conceptual,
technical as well as legal aspects.
 Most organizations are unable to maintain regular checks due to the large amounts of data
generated. However, it is necessary to perform security checks and observation in real time,
because this is most beneficial.
 Some information about a person, when combined with external large datasets, may reveal
facts about that person which are private and which he or she might not want the data owner
to know.
 Some organizations collect information about people in order to add value to their business.
This is done by deriving insights into their lives that the people themselves are unaware of.
3. Analytical Challenges:
 There are some huge analytical challenges in big data, which raise questions such as: how do
we deal with a problem if the data volume gets too large?
 Or how do we find out the important data points?
 Or how do we use the data to the best advantage?
 The large amounts of data on which this type of analysis is to be done can be structured
(organized data), semi-structured (semi-organized data) or unstructured (unorganized data).
There are two techniques through which decision making can be done:
 Either incorporate massive data volumes in the analysis,
 Or determine upfront which big data is relevant.

4. Technical challenges:
 Quality of data:
 When there is a collection of a large amount of data and storage of this data, it comes
at a cost. Big companies, business leaders and IT leaders always want large data
storage.
 For better results and conclusions, big data focuses on storing quality data rather than
irrelevant data.
 This further raises the questions of how to ensure that the data is relevant, how much data
is enough for decision making, and whether the stored data is accurate or not.
 Fault tolerance:
 Fault tolerance is another technical challenge and fault tolerance computing is
extremely hard, involving intricate algorithms.
 New technologies such as cloud computing and big data are designed so that whenever a
failure occurs, the damage is kept within an acceptable threshold, i.e. the whole task does not
have to begin again from scratch.
 Scalability:
 Big data projects can grow and evolve rapidly. The scalability issue of Big Data has led
towards cloud computing.
 It leads to various challenges, such as how to run and execute various jobs so that the goal
of each workload can be achieved cost-effectively.
 It also requires dealing with system failures in an efficient manner. This again raises the
question of what kinds of storage devices should be used.
Modern tools of Big Data:
1. Apache Hadoop:
The Apache Hadoop software library is a big data framework. It enables massive data sets to
be processed across clusters of computers in a distributed manner. It is one of the most powerful
big data technologies, with the ability to grow from a single server to thousands of machines.
Features
• When utilizing an HTTP proxy server, authentication is improved.
• Specification of the Hadoop Compatible File System effort. Extended attributes for POSIX-style
file systems are supported.
• It offers a robust ecosystem of big data technologies and tools that is well suited to meet the
analytical needs of developers.
• It brings flexibility in data processing and allows for faster data processing.
2. HPCC:
HPCC is a big data tool developed by LexisNexis Risk Solutions. It delivers, on a single
platform, a single architecture and a single programming language for data processing.
Features
• It is a highly efficient big data tool that accomplishes big data tasks with far less code.
• It offers high redundancy and availability.
• It can be used for complex data processing on a Thor cluster. Its graphical IDE simplifies
development, testing and debugging. It automatically optimizes code for parallel processing.
• It provides enhanced scalability and performance. ECL code compiles into optimized C++, and
it can also be extended using C++ libraries.
3. Apache STORM:
Storm is a free, open-source big data computation system. It is one of the best big data tools,
offering a distributed, real-time, fault-tolerant processing system with real-time computation
capabilities.
Features
• It is benchmarked as processing one million 100-byte messages per second per node.
• It uses parallel calculations that run across a cluster of machines.
• It will automatically restart in case a node dies; the worker will be restarted on another node.
Storm guarantees that each unit of data will be processed at least once or exactly once.
• Once deployed, Storm is one of the easiest tools for big data analysis.
4. Qubole:
Qubole is an autonomous big data management platform. It is a big data open-source tool which
is self-managed, self-optimizing and allows the data team to focus on business outcomes.
Features
• Single Platform for every use case
• It is an Open-source big data software having Engines, optimized for the Cloud.
• Comprehensive Security, Governance, and Compliance
• Provides actionable Alerts, Insights, and Recommendations to optimize reliability,
performance, and costs.
• Automatically enacts policies to avoid performing repetitive manual actions.
5. Apache Cassandra:
The Apache Cassandra database is widely used today to provide an effective management of
large amounts of data.
Features
• Support for replicating across multiple data centers by providing lower latency for users
• Data is automatically replicated to multiple nodes for fault-tolerance
• It is one of the best big data tools and is most suitable for applications that can't afford to lose
data, even when an entire data center is down.
• Support contracts and services for Cassandra are available from third parties.
6. CouchDB:
CouchDB stores data in JSON documents that can be accessed via the web or queried using
JavaScript. It offers distributed scaling with fault-tolerant storage. It allows data to be accessed
and replicated by defining the Couch Replication Protocol.
Features
• CouchDB is a single-node database that works like any other database
• It is one of the big data processing tools that allows running a single logical database server on
any number of servers.
• It makes use of the ubiquitous HTTP protocol and JSON data format. Easy replication of a
database across multiple server instances. Easy interface for document insertion, updates,
retrieval and deletion
• JSON-based document format can be translatable across different languages.
7. Apache Flink:
Apache Flink is one of the best open source data analytics tools for stream processing of big
data. It supports distributed, high-performing, always-available, and accurate data streaming
applications.
Features:
• Provides results that are accurate, even for out-of-order or late-arriving data.
• It is stateful and fault-tolerant and can recover from failures.
• It is big data analytics software which can perform at a large scale, running on thousands of
nodes.
• It has good throughput and latency characteristics.
• This big data tool supports stream processing and windowing with event-time semantics. It
supports flexible windowing based on time, count, or sessions, as well as data-driven windows.
• It supports a wide range of connectors to third-party systems for data sources and sinks.
8. Cloudera:
Cloudera is the fastest, easiest and most secure modern big data platform. It allows anyone to
get any data across any environment within a single, scalable platform.

Features:
• High-performance big data analytics software
• It offers provision for multi-cloud
• Deploy and manage Cloudera Enterprise across AWS, Microsoft Azure and Google Cloud
Platform. Spin up and terminate clusters, and only pay for what is needed when need it.
• Developing and training data models
• Reporting, exploring, and self-servicing business intelligence
• Delivering real-time insights for monitoring and detection
• Conducting accurate model scoring and serving

ANALYTIC PROCESS AND TOOLS:


Step 1: Deployment
• Here we need to plan the deployment, monitoring and maintenance, produce a final report,
and review the project.
• In this phase, we deploy the results of the analysis.
• This is also known as reviewing the project.
Step 2: Business Understanding
• The very first step consists of business understanding. Whenever any requirement occurs, we
first need to determine the business objective, assess the situation, determine the data mining
goals and then produce the project plan as per the requirement.
• Business objectives are defined in this phase.
Step 3: Data Exploration
• The second step consists of data understanding. For the further process, we need to gather
initial data, describe and explore the data and verify data quality to ensure it contains the data we
require. Data collected from the various sources is described in terms of its application and the
need for the project in this phase. This is also known as data exploration.
• This step is necessary to verify the quality of the data collected.
Step 4: Data Preparation
• From the data collected in the last step, we need to select data as per the need, clean it,
construct it to get useful information and then integrate it all.
• Finally, we need to format the data to get the appropriate data.
• Data is selected, cleaned, and integrated into the format finalized for the analysis in this phase.
Step 5: Data Modeling
• We need to select a modeling technique, generate a test design, build a model and assess the
model built.
• The data model is built to analyze relationships between the various selected objects in the
data; test cases are built for assessing the model, and the model is tested and implemented on
the data in this phase.
• Where is processing hosted? – Distributed servers / cloud (e.g. Amazon EC2)
• Where is data stored? – Distributed storage (e.g. Amazon S3)
• What is the programming model? – Distributed processing (e.g. MapReduce)
• How is data stored and indexed? – High-performance schema-free databases (e.g. MongoDB)
• What operations are performed on the data? – Analytic / semantic processing
• Big data tools for HPC and supercomputing – MPI
• Big data tools on clouds – MapReduce model, iterative MapReduce model, DAG model,
graph model, collective model
• Other BDA tools – SAS, R, Hadoop
Thus, BDA tools are used throughout the development of BDA applications.
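As a rough illustration of the MapReduce programming model mentioned above, here is a
plain-Python word-count sketch (a simulation of the map and reduce steps, not Hadoop's actual
API; the sample documents are made up):

    from collections import defaultdict

    def map_phase(documents):
        # Map step: emit a (word, 1) pair for every word in every document
        for doc in documents:
            for word in doc.lower().split():
                yield (word, 1)

    def reduce_phase(pairs):
        # Shuffle/group by key, then reduce: sum the counts for each word
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return {key: sum(values) for key, values in grouped.items()}

    docs = ["big data needs distributed processing",
            "hadoop processes big data with mapreduce"]
    print(reduce_phase(map_phase(docs)))   # e.g. {'big': 2, 'data': 2, ...}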

Analytics vs Reporting: Key Differences & Importance


Analytics and reporting can help a business improve operational efficiency and production in
several ways. Analytics is the process of making decisions based on the data presented, while
reporting is used to make complicated information easier to understand.
Analytics vs reporting?
Analytics is the technique of examining data and reports to obtain actionable insights that can be
used to comprehend and improve business performance. Business users may gain insights from
data, recognize trends, and make better decisions with workforce analytics.
On the one hand, analytics is about finding value or making new data to help you decide.
This can be performed either manually or mechanically. Next-generation analytics uses new
technologies like AI or machine learning to make predictions about the future based on past and
present data.
The steps involved in data analytics are as follows:
 Developing a data hypothesis
 Data collection and transformation
 Creating analytical research models to analyze and provide insights
 Utilization of data visualization, trend analysis, deep dives, and other tools.
 Making decisions based on data and insights
On the other hand, reporting is the process of presenting data from numerous sources clearly and
simply. The procedure is always carefully set out to report correct data and avoid
misunderstandings.
Today’s reporting applications offer cutting-edge dashboards with advanced data
visualization features. Companies produce a variety of reports, such as financial reports,
accounting reports, operational reports, market studies, and more. This makes it easier to see how
each function is operating quickly.
In general, the procedures needed to create a report are as follows:
 Determining the business requirement
 Obtaining and compiling essential data
 Technical data translation
 Recognizing the data context
 Building dashboards for reporting
 Providing real-time reporting
 Allowing users to dive down into reports.
Key differences between analytics vs reporting:
Understanding the differences between analytics and reporting can significantly benefit your
business. If you want to use both to their full potential and not miss out on essential parts of
either one, knowing the difference between the two is important. Some key differences are:

Analytics vs Reporting
• Analytics is the method of examining and analyzing summarized data to make business
decisions, whereas reporting is an action that gathers all the needed information and data and
puts it together in an organized way.
• Questioning the data, understanding it, investigating it, and presenting it to the end users are
all part of analytics, whereas identifying business events, gathering the required information,
and organizing, summarizing, and presenting existing data are all part of reporting.
• The purpose of analytics is to draw conclusions based on data, whereas the purpose of
reporting is to organize the data into meaningful information.
• Analytics is used by data analysts, data scientists, and business people to make effective
decisions, whereas reporting is provided to the appropriate business leaders so they can
perform effectively and efficiently within a firm.

Analytics and reporting can be used to reach a number of different goals. Both of these can be
very helpful to a business if they are used correctly.
Importance of analytics vs reporting:
A business needs to understand the differences between analytics and reporting. Better data
knowledge through analytics and reporting helps businesses in decision-making and action inside
the organization. It results in higher value and performance.
Analytics is not really possible without advanced reporting, but analytics is more than just
reporting. Both tools are made for sharing important information that will help business people
make better decisions.
Sampling:
Sampling is the process of selecting a group of observations from a population in order to study
the characteristics of the data and draw conclusions about the population.
Example: Covaxin (a COVID-19 vaccine) was tested on thousands of males and females before
being given to all the people of the country.

Types of Sampling:
Whether the data set for sampling is randomized or not, sampling is classified into two major
groups:
 Probability Sampling

 Non-Probability Sampling
Probability Sampling (Random Sampling):
In this type, data is randomly selected so that every observation in the population gets an equal
chance of being selected for sampling.
Probability sampling is of 4 types:
 Simple Random Sampling
 Cluster Sampling
 Stratified Sampling
 Systematic Sampling
Non-Probability Sampling:
In this type, data is not randomly selected. It mainly depends upon how the statistician wants to
select the data. The results may or may not be biased with respect to the population. Unlike
probability sampling, each observation in the population does not get an equal chance of being
selected for sampling.
Non-probability sampling is of 4 types:
 Convenience Sampling
 Judgmental/Purposive Sampling
 Snowball/Referral Sampling
 Quota Sampling.
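A minimal sketch, using Python and NumPy on a hypothetical population, of how two of the
probability sampling methods above might be drawn:

    import numpy as np

    rng = np.random.default_rng(seed=42)
    population = np.arange(1, 1001)          # hypothetical population of 1000 units

    # Simple random sampling: every unit has an equal chance of selection
    simple_random_sample = rng.choice(population, size=50, replace=False)

    # Systematic sampling: choose every k-th unit after a random start
    k = len(population) // 50
    start = rng.integers(0, k)
    systematic_sample = population[start::k]

    print(len(simple_random_sample), len(systematic_sample))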
Sampling Error:
Errors which occur during the sampling process are known as sampling errors. In other words, a
sampling error is the difference between the observed value of a sample statistic and the actual
value of the population parameter.
Mathematical formula for sampling error:
Sampling Error = Sample statistic − Population parameter (for example, the sample mean x̄
minus the population mean μ).

Sampling error can be reduced by:


 Increasing the sample size
 Classifying population into different groups
Advantage of Sampling:
 Reduce cost and Time
 Accuracy of Data
 Inferences can be applied to a larger population
 Less resource needed
Resampling:
Resampling is a method that consists of repeatedly drawing samples from the population. It
involves the selection of randomized cases, with replacement, from a sample.
Note: In machine learning, resampling is used to improve the performance of a model.
Types of Resampling:
Two common methods of resampling are:
 K-fold Cross-validation
 Bootstrapping.
K-fold cross-validation:
In this method the data is divided into k equal sets, in which one set is used as the test set for
the experiment while all the other sets are used to train the model.
In the first experiment, the first set is used as the test set and all the others as the training set.
The process is repeated k times, choosing a different set as the test set each time.

Bootstrapping:
In bootstrapping, samples are drawn with replacement (i.e. one observation can be repeated in
more than one group) and the remaining data which are not used in samples are used to test the
model.
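A short sketch of both resampling methods, using scikit-learn's KFold for the cross-validation
splits and NumPy for a bootstrap sample (the ten observations are purely hypothetical):

    import numpy as np
    from sklearn.model_selection import KFold

    data = np.arange(10)                      # hypothetical observations 0..9

    # K-fold cross-validation: each observation appears in the test set exactly once
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(data):
        print("train:", data[train_idx], "test:", data[test_idx])

    # Bootstrapping: sample with replacement; points never drawn form the "out-of-bag" test set
    rng = np.random.default_rng(0)
    boot_idx = rng.choice(len(data), size=len(data), replace=True)
    out_of_bag = np.setdiff1d(np.arange(len(data)), boot_idx)
    print("bootstrap sample:", data[boot_idx], "out-of-bag:", data[out_of_bag])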

Statistics is a branch of Mathematics that deals with the collection, analysis, interpretation and
the presentation of the numerical data. In other words, it is defined as the collection of
quantitative data. The main purpose of Statistics is to make an accurate conclusion using a
limited sample about a greater population.
Types of Statistical Inference
There are different types of statistical inferences that are extensively used for making
conclusions. They are:
 One sample hypothesis testing
 Confidence Interval
 Pearson Correlation
 Bi-variate regression
 Multi-variate regression
 Chi-square statistics and contingency table
 ANOVA or T-test

Statistical Inference Procedure:


The procedure involved in inferential statistics is:
 Begin with a theory
 Create a research hypothesis
 Operationalize the variables
 Recognize the population to which the study results should apply
 Formulate a null hypothesis for this population
 Accumulate a sample from the population and continue the study
 Conduct statistical tests to see if the collected sample properties are adequately different
from what would be expected under the null hypothesis to be able to reject the null
hypothesis.
Prediction error:
It is the failure of some expected event to occur. When predictions fail, humans can
use metacognitive functions, examining prior predictions and failures and deciding, for example,
whether there are correlations and trends, such as consistently being unable to foresee outcomes
accurately in particular situations. Applying that type of knowledge can inform decisions and
improve the quality of future predictions. Predictive analytics software processes new and
historical data to forecast activity, behaviour and trends. The programs apply statistical
analysis techniques, analytical queries and machine learning algorithms to data sets to
create predictive models that quantify the likelihood of a particular event happening.
Errors are an inescapable element of predictive analytics that should also be quantified and
presented along with any model, often in the form of a confidence interval that indicates how
accurate its predictions are expected to be. Analysis of prediction errors from similar or previous
models can help determine confidence intervals. In artificial intelligence (AI), the analysis of
prediction errors can help guide machine learning (ML), similarly to the way it does for human
learning. In reinforcement learning, for example, an agent might use the goal of minimizing error
feedback as a way to improve. Prediction errors, in that case, might be assigned a negative value
and predicted outcomes a positive value, in which case the AI would be programmed to attempt
to maximize its score. That approach to ML, sometimes known as error-driven learning, seeks to
stimulate learning by approximating the human drive for mastery.
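As a small, hypothetical illustration of quantifying prediction error, the sketch below computes
two common error measures, mean absolute error and root mean squared error, for a handful of
made-up predictions:

    import numpy as np

    y_actual = np.array([10.0, 12.5, 9.0, 14.0])       # hypothetical observed values
    y_predicted = np.array([11.0, 12.0, 10.5, 13.0])   # hypothetical model predictions

    errors = y_actual - y_predicted
    mae = np.mean(np.abs(errors))              # mean absolute error
    rmse = np.sqrt(np.mean(errors ** 2))       # root mean squared error
    print(mae, rmse)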
Questions:

1. What is Big Data and Explain about types of Big Data .

2. Explain about Big Data platforms.

3. Explain about Web Data and also explain about use cases of web data.

4. What are conventional systems and explain about list of challenges of conventional systems.
UNIT-II
DATA ANALYSIS
Regression analysis:
Regression analysis is used to determine how points or variables might be related. It helps in
determining the equation for a curve or line that captures the relationship between two
variables.
Bi-variate Regression Analysis:
Bi-variate (simple linear) regression depends on two variables, and 'linear' indicates that we are
fitting a straight line through the data points. In linear regression the dependent variable is
denoted by 'y' and the independent variable by 'x'.
The equation of the line is given by
y = β0 + β1x + ϵ
In this model y is the dependent and x is the independent variable, β0 and β1 are the parameters
of the linear model, and ϵ is the error term.
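A minimal sketch of fitting the line y = β0 + β1x by least squares with NumPy (the x and y
values below are made up for illustration):

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)   # hypothetical independent variable
    y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])      # hypothetical dependent variable

    beta1, beta0 = np.polyfit(x, y, deg=1)       # polyfit returns the slope first, then the intercept
    residuals = y - (beta0 + beta1 * x)          # estimates of the error term
    print(beta0, beta1)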
Multivariate Analysis:
Multivariate analysis is based on the observations and analysis of more than one
statistical outcome variable at a time. There are two types of multivariate techniques namely 1)
Dependence techniques and 2) interdependence techniques.
1) Dependence techniques:
Dependence methods are used when one or more of the variables are dependent on others. In
machine learning, dependence techniques are used to build predictive models. As a simple
example, the dependent variable "weight" might be predicted by independent variables such as
"height" and "age".
2) Interdependence techniques:
These methods are used to understand the structural makeup and underlying patterns within a
dataset. In this case no variables are dependent on others, so you are not looking for causal
relationships. Rather, interdependence methods seek to give meaning to a set of variables or to
group them together in meaningful ways.
Methods for Multivariate Analysis:
The following are the Multivariate Analysis techniques
Multiple Linear Regression:
Multiple linear regression is a dependence method which looks at the relationship between one
dependent variable and more than one independent variable. This is useful as it helps you to
understand which factors are likely to influence a certain outcome, allowing you to estimate
future outcomes. For example, the growth of a crop depends on rainfall, temperature, fertilizer
and the amount of sunlight.
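A brief scikit-learn sketch of the crop example, with made-up rainfall and temperature values
predicting a hypothetical growth figure, to show how more than one independent variable enters
the model:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical data: each row is (rainfall in mm, temperature in degrees C)
    X = np.array([[100, 24], [120, 26], [90, 22], [150, 28], [110, 25]])
    y = np.array([3.1, 3.8, 2.7, 4.5, 3.4])      # hypothetical crop growth

    model = LinearRegression().fit(X, y)
    print(model.intercept_, model.coef_)         # one coefficient per independent variable
    print(model.predict([[130, 27]]))            # predicted growth for new conditions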
Multiple logistic regression:
Logistic regression analysis is used to calculate the probability of a binary event occurring. A
binary outcome is one where there are only two possible outcomes; either the event occurs (1) or
it does not (0). Based on independent variables, logistic regression can predict how likely it is
that a certain scenario will arise. For example, in the insurance sector an analyst may need to
predict how likely it is that each potential customer will make a claim.
Multivariate analysis of variance (MANOVA):
It is used to measure the effect of multiple independent variables on two or more
dependent variables. With this technique, it’s important to note that the independent variables are
categorical, while the dependent variables are metric in nature. For example, consider an
engineering company on a mission to build a super-fast, eco-friendly rocket. In this example the
independent variables are:
 Engine type (E1, E2 or E3)
 Material used for the rocket exterior
 Type of fuel used to power the rocket

Bayesian Modeling:
The Bayesian technique is an approach in statistics used in data analysis and parameter
estimation. This approach is based on the Bayes theorem.

Bayesian Statistics follows a unique principle wherein it helps determine the joint
probability distribution for observed and unobserved parameters using a statistical model. The
knowledge of statistics is essential to tackle analytical problems in this scenario.

Ever since the introduction of the Bayes theorem in the 1770s by Thomas Bayes, it has
remained an indispensable tool in statistics. Bayesian models are a classic replacement for
frequentist models, as recent innovations in statistics have helped breach milestones in a wide
range of industries, including medical research, understanding web searches, and processing
natural languages (Natural Language Processing).

For example, Alzheimer’s is a disease known to pose a progressive risk as a person ages.
However, with the help of the Bayes theorem, doctors can estimate the probability of a person
having Alzheimer’s in the future. It also applies to cancer and other age-related illnesses that a
person becomes vulnerable to in the later years of his life.
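To show the mechanics of the theorem, here is a small worked sketch of estimating
P(disease | positive test); every number below (prevalence, sensitivity, false-positive rate) is an
assumption invented purely for illustration:

    # Hypothetical inputs, for illustration only
    p_disease = 0.01                 # assumed prior prevalence of the disease
    p_pos_given_disease = 0.90       # assumed test sensitivity
    p_pos_given_healthy = 0.05       # assumed false-positive rate

    # Bayes theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
    p_positive = (p_pos_given_disease * p_disease
                  + p_pos_given_healthy * (1 - p_disease))
    p_disease_given_positive = p_pos_given_disease * p_disease / p_positive
    print(round(p_disease_given_positive, 3))    # about 0.154 with these assumed numbers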

Bayesian networks
Bayesian networks are a type of probabilistic graphical model that uses Bayesian inference for
probability computations. Bayesian networks aim to model conditional dependence, and therefore
causation, by representing conditional dependence by edges in a directed graph. Through these
relationships, one can efficiently conduct inference on the random variables in the graph through
the use of factors.
Probability

Before going into exactly what a Bayesian network is, it is first useful to review probability
theory.

First, remember that the joint probability distribution of random variables A_0, A_1, …, A_n,
denoted as P(A_0, A_1, …, A_n), is equal to P(A_0 | A_1, …, A_n) * P(A_1 | A_2, …, A_n) *
… * P(A_n) by the chain rule of probability. We can consider this a factorized representation of
the distribution, since it is a product of localized conditional probabilities.

Next, recall that conditional independence between two random variables, A and B, given
another random variable, C, is equivalent to satisfying the following property: P(A, B | C) =
P(A | C) * P(B | C). In other words, as long as the value of C is known and fixed, A and B are
independent. Another way of stating this, which we will use later on, is that P(A | B, C) = P(A | C).

The Bayesian Network

Using the relationships specified by our Bayesian network, we can obtain a compact, factorized
representation of the joint probability distribution by taking advantage of conditional
independence.

A Bayesian network is a directed acyclic graph in which each edge corresponds to a conditional
dependency, and each node corresponds to a unique random variable. Formally, if an edge (A, B)
exists in the graph connecting random variables A and B, it means that P(B|A) is a factor in the
joint probability distribution, so we must know P(B|A) for all values of B and A in order to
conduct inference. In the above example, since Rain has an edge going into Wet Grass, it means
that P(WetGrass|Rain) will be a factor, whose probability values are specified next to the Wet
Grass node in a…
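To make the factorization concrete, here is a tiny inference-by-enumeration sketch for a
two-node Rain → WetGrass network; every probability value is an assumption chosen only for
illustration:

    # Hypothetical probability tables for a Rain -> WetGrass network
    p_rain = {True: 0.2, False: 0.8}             # P(Rain)
    p_wet_given_rain = {True: 0.9, False: 0.1}   # P(WetGrass = True | Rain)

    # The joint factorizes as P(Rain, WetGrass) = P(Rain) * P(WetGrass | Rain)
    def joint(rain, wet):
        p_wet = p_wet_given_rain[rain] if wet else 1 - p_wet_given_rain[rain]
        return p_rain[rain] * p_wet

    # Inference by enumeration: P(Rain = True | WetGrass = True)
    numerator = joint(True, True)
    evidence = joint(True, True) + joint(False, True)
    print(numerator / evidence)                  # posterior probability of rain given wet grass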

Probabilistic Bayesian Networks Inference

A Bayesian Network (BN) is used to estimate the probability that a hypothesis is true based on
evidence.

Bayesian Networks Inference:

 Deducing Unobserved Variables


 Parameter Learning
 Structure Learning.

1. Deducing Unobserved Variables

With the help of this network, we can develop a comprehensive model that delineates the
relationships between the variables and can be used to answer probabilistic queries about them.
We can use it to update our knowledge of the state of a subset of variables when other variables
(the evidence) are observed. Computing the posterior distribution of the variables given the
evidence is called probabilistic inference. It provides universal statistics for detection
applications, when one wants to select values for a subset of variables that minimize some
expected loss function, for instance the probability of decision error. A BN is a mechanism for
applying Bayes' theorem to complex problems.
Popular inference methods are:

1.1 Variable Elimination

Variable elimination eliminates the non-observed, non-query variables one by one by
distributing the sum over the product.

1.2 Clique Tree Propagation

It caches the computation to query many variables at one time and also to propagate new
evidence.

1.3 Recursive Conditioning

Recursive conditioning allows a tradeoff between space and time. It is equivalent to the variable
elimination method if sufficient space is available.

2. Parameter Learning

To specify the BN and thus represent the joint probability distribution, it is necessary to specify
for each node X its probability distribution conditional on its parents. The distribution of X may
take many forms; using a discrete or Gaussian distribution simplifies calculations. Sometimes
only constraints on the distribution are known. To determine a single distribution, we can then
use the principle of maximum entropy: the distribution chosen is the one with the greatest
entropy given the constraints.

Conditional distributions often include parameters that are unknown and must be estimated
from data, sometimes using the maximum likelihood approach. Direct maximization of the
likelihood is often complex when there are unobserved variables. EMA refers to the Expectation-
Maximization algorithm: it computes expected values of the unobserved variables and then
maximizes the likelihood, assuming that the prior expectations are correct. Under mild
conditions this process converges on maximum likelihood values for the parameters. A Bayesian
approach treats the parameters as additional unobserved variables: we use the BN to compute a
posterior distribution conditional upon the observed data and then integrate out the parameters.
This approach can be costly and lead to large-dimension models, so in real practice classical
parameter-setting approaches are more common.

3. Structure Learning

In the simplest case, a BN is specified by an expert and is then used to perform inference. In
other applications, the task of defining the network is too complex for humans; in this case the
network structure and the parameters of the local distributions must be learned from data.

Automatically learning the graph structure of a BN is a challenge pursued within machine
learning. The basic idea goes back to an algorithm developed by Rebane and Pearl (1987).
The triplets allowed in a Directed Acyclic Graph (DAG) are:
 X → Y → Z
 X ← Y → Z
 X → Y ← Z
In the first two triplets, X and Z are independent given Y; they represent the same dependencies
and so are indistinguishable. Type 3 can be uniquely identified, since X and Z are marginally
independent and all other pairs are dependent. So, while the skeletons of these three triplets are
identical, the direction of the arrows is partially identifiable. The same distinction applies when
X and Z have common parents, except that one must first condition on those parents. Algorithms
first determine the skeleton of the underlying graph and then orient all arrows whose
directionality is dictated by the observed conditional independencies.

Optimization-based search is an alternative method used for structure learning. It requires a
scoring function and a search strategy. A common scoring function is the posterior probability of
the structure given the training data. The time requirement of an exhaustive search that returns a
structure maximizing the score is super-exponential in the number of variables. In practice we
make incremental changes aimed at improving the overall score, using a local search strategy. A
global search algorithm like Markov chain Monte Carlo can avoid getting trapped in local
minima.

Support vector Machine:


Support Vector Machine (SVM) is a supervised machine learning algorithm used for both
classification and regression. Though it can be used for regression problems as well, it is best
suited for classification. The main objective of the SVM algorithm is to find the optimal
hyperplane in an N-dimensional space that can separate the data points of different classes in
the feature space. The hyperplane is chosen so that the margin between the closest points of the
different classes is as large as possible. The dimension of the hyperplane depends upon the
number of features. If the number of input features is two, then the hyperplane is just a line. If
the number of input features is three, then the hyperplane becomes a 2-D plane. It becomes
difficult to imagine when the number of features exceeds three.

Let's consider two independent variables x1, x2, and one dependent variable which is either a
blue circle or a red circle.

Linearly Separable Data points

From the figure above it is very clear that there are multiple lines (our hyperplane here is a line
because we are considering only two input features x1, x2) that segregate our data points or do a
classification between the red and blue circles. So how do we choose the best line, or in general
the best hyperplane, that segregates our data points?

How does SVM work?

One reasonable choice for the best hyperplane is the one that represents the largest separation,
or margin, between the two classes.
Multiple hyperplanes separate the data from two classes

So we choose the hyperplane whose distance from it to the nearest data point on each side is
maximized. If such a hyperplane exists it is known as the maximum-margin
hyperplane/hard margin. So from the above figure, we choose L2. Let’s consider a scenario
like shown below

Selecting hyperplane for data with outlier


Here we have one blue ball in the boundary of the red balls. So how does SVM classify the
data? It's simple! The blue ball in the boundary of the red ones is an outlier of the blue balls. The
SVM algorithm has the characteristic of ignoring outliers and finding the best hyperplane that
maximizes the margin. SVM is robust to outliers.

Hyperplane which is the most optimized one

So in this type of data point, what SVM does is find the maximum margin as done with previous
data sets, and along with that it adds a penalty each time a point crosses the margin. The margins
in these types of cases are called soft margins. When there is a soft margin in the data set, the
SVM tries to minimize (1/margin + λ·Σ penalty). Hinge loss is a commonly used penalty: if there
are no violations there is no hinge loss; if there are violations, the hinge loss is proportional to
the distance of the violation.

Till now, we were talking about linearly separable data (the group of blue balls and red balls are
separable by a straight line). What do we do if the data is not linearly separable?
Original 1D dataset for classification

Say our data is as shown in the figure above. SVM solves this by creating a new variable using
a kernel. We take a point xi on the line and create a new variable yi as a function of its distance
from the origin o. If we plot this, we get something like what is shown below.

Mapping 1D data to 2D to become able to separate the two classes

In this case, the new variable y is created as a function of distance from the origin. A non-
linear function that creates a new variable is referred to as a kernel.
Types of Support Vector Machine

Based on the nature of the decision boundary, Support Vector Machines (SVM) can be divided
into two main parts:
 Linear SVM: Linear SVMs use a linear decision boundary to separate the data points of
different classes. When the data can be precisely linearly separated, linear SVMs are very
suitable. This means that a single straight line (in 2D) or a hyperplane (in higher
dimensions) can entirely divide the data points into their respective classes. A hyperplane
that maximizes the margin between the classes is the decision boundary.
 Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be
separated into two classes by a straight line (in the case of 2D). By using kernel functions,
nonlinear SVMs can handle nonlinearly separable data. The original input data is
transformed by these kernel functions into a higher-dimensional feature space, where the
data points can be linearly separated. A linear SVM is used to locate a nonlinear decision
boundary in this modified space.
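A compact scikit-learn sketch of both variants on a tiny synthetic dataset (all of the values below
are made up for illustration):

    import numpy as np
    from sklearn.svm import SVC

    # Tiny synthetic two-feature dataset with two classes
    X = np.array([[1, 2], [2, 3], [3, 3], [6, 7], [7, 8], [8, 8]])
    y = np.array([0, 0, 0, 1, 1, 1])

    linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)            # linear decision boundary
    rbf_svm = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)   # non-linear boundary via the RBF kernel

    print(linear_svm.predict([[2, 2], [7, 7]]))
    print(rbf_svm.support_vectors_)              # the training points that define the margin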

What is Kernel method?


The kernel method is a mathematical technique used in machine learning for analyzing data.
This method uses a kernel function that maps data from one space to another space.
It is generally used in Support Vector Machines (SVMs) where the algorithms classify data by
finding the hyperplane that separates the data points of different classes.
The most important benefit of Kernel Method is that it can work with non-linearly separable
data, and it works with multiple Kernel functions - depending on the type of data.
Because the linear classifier can solve a very limited class of problems, the kernel trick is
employed to empower the linear classifier, enabling the SVM to solve a larger class of problems.
Characteristics of Kernel Function

Kernel functions used in machine learning, including in SVMs (Support Vector Machines), have
several important characteristics, including:

o Mercer's condition: A kernel function must satisfy Mercer's condition to be valid. This
condition ensures that the kernel function is positive semi definite, which means that it is
always greater than or equal to zero.
o Positive definiteness: A kernel function is positive definite if it is always greater than
zero except for when the inputs are equal to each other.
o Non-negativity: A kernel function is non-negative, meaning that it produces non-
negative values for all inputs.
o Symmetry: A kernel function is symmetric, meaning that it produces the same value
regardless of the order in which the inputs are given.
o Reproducing property: A kernel function satisfies the reproducing property if it can be
used to reconstruct the input data in the feature space.
o Smoothness: A kernel function is said to be smooth if it produces a smooth
transformation of the input data into the feature space.
o Complexity: The complexity of a kernel function is an important consideration, as more
complex kernel functions may lead to overfitting and reduced generalization
performance.

Major Kernel Function in Support Vector Machine

In Support Vector Machines (SVMs), there are several types of kernel functions that can be used
to map the input data into a higher-dimensional feature space. The choice of kernel function
depends on the specific problem and the characteristics of the data.

Here are some most commonly used kernel functions in SVMs:

Linear Kernel

A linear kernel is a type of kernel function used in machine learning, including in SVMs
(Support Vector Machines). It is the simplest and most commonly used kernel function, and it
defines the dot product between the input vectors in the original feature space.
Polynomial Kernel

A polynomial kernel is a particular kind of kernel function used in machine learning, such as in
SVMs (Support Vector Machines). It is a nonlinear kernel function that employs polynomial
functions to map the input data into a higher-dimensional feature space.

Gaussian (RBF) Kernel

The Gaussian kernel, also known as the radial basis function (RBF) kernel, is a popular kernel
function used in machine learning, particularly in SVMs (Support Vector Machines). It is a
nonlinear kernel function that maps the input data into a higher-dimensional feature space using
a Gaussian function.

Laplace Kernel

The Laplacian kernel, also known as the Laplace kernel or the exponential kernel, is a type of
kernel function used in machine learning, including in SVMs (Support Vector Machines). It is a
non-parametric kernel that can be used to measure the similarity or distance between two input
feature vectors.
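A short NumPy sketch of how the kernel functions above score the similarity of two feature
vectors (the vectors, the polynomial degree and the gamma value are arbitrary examples):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    z = np.array([2.0, 1.0, 0.5])
    gamma = 0.5

    linear = np.dot(x, z)                              # linear kernel: dot product of the inputs
    polynomial = (np.dot(x, z) + 1) ** 3               # polynomial kernel, degree 3, constant term 1
    rbf = np.exp(-gamma * np.sum((x - z) ** 2))        # Gaussian (RBF) kernel
    laplace = np.exp(-gamma * np.sum(np.abs(x - z)))   # Laplacian kernel

    print(linear, polynomial, rbf, laplace)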

Time Series Analysis:


An arrangement of statistical data in accordance with the time of occurrence, or in
chronological order, is called time series data. In time series analysis, current data in a series
may be compared with past data in the same series.
The main aims of this analysis are forecasting and evaluating past performance. The
essential requirements of a time series are as follows:
 The time gap between the various values must, as far as possible, be equal.
 It must consist of a homogeneous set of values.
 Data must be available for a long period.

What Is Time Series Analysis?

Time series analysis is indispensable in data science, statistics, and analytics.

At its core, time series analysis focuses on studying and interpreting a sequence of data points
recorded or collected at consistent time intervals. Unlike cross-sectional data, which captures a
snapshot in time, time series data is fundamentally dynamic, evolving over chronological
sequences both short and extremely long. This type of analysis is pivotal in uncovering
underlying structures within the data, such as trends, cycles, and seasonal variations.

Technically, time series analysis seeks to model the inherent structures within the data,
accounting for phenomena like autocorrelation, seasonal patterns, and trends. The order of data
points is crucial; rearranging them could lose meaningful insights or distort interpretations.
Furthermore, time series analysis often requires a substantial dataset to maintain the statistical
significance of the findings. This enables analysts to filter out 'noise,' ensuring that observed
patterns are not mere outliers but statistically significant trends or cycles.

Types of Data

When embarking on time series analysis, the first step is often understanding the type of data
you're working with. This categorization primarily falls into three distinct types: Time Series
Data, Cross-Sectional Data, and Pooled Data. Each type has unique features that guide the
subsequent analysis and modeling.

 Time Series Data: Comprises observations collected at different time intervals. It's
geared towards analyzing trends, cycles, and other temporal patterns.

 Cross-Sectional Data: Involves data points collected at a single moment in time. Useful
for understanding relationships or comparisons between different entities or categories at
that specific point.

 Pooled Data: A combination of Time Series and Cross-Sectional data. This hybrid
enriches the dataset, allowing for more nuanced and comprehensive analyses.

Time Series Analysis Techniques

Time series analysis is critical for businesses to predict future outcomes, assess past
performances, or identify underlying patterns and trends in various metrics. Time series analysis
can offer valuable insights into stock prices, sales figures, customer behavior, and other time-
dependent variables. By leveraging these techniques, businesses can make informed decisions,
optimize operations, and enhance long-term strategies.

Time series analysis offers a multitude of benefits to businesses. The applications are also wide-
ranging, whether it is forecasting sales to manage inventory better, identifying seasonality in
consumer behaviour to plan marketing campaigns, or analyzing financial markets for investment
strategies. Different techniques serve distinct purposes and offer varied granularity and accuracy,
making it vital for businesses to understand the methods that best suit their specific needs. Two
of the simplest techniques, the moving average and exponential smoothing, are sketched in code
after the list below.

 Moving Average: Useful for smoothing out short-term fluctuations to reveal the longer-term trend. It is ideal for removing noise and identifying the general direction in which values are moving (a minimal sketch follows this list).

 Exponential Smoothing: Suited for univariate data with a systematic trend or seasonal
component. Assigns higher weight to recent observations, allowing for more dynamic
adjustments.

 Autoregression: Leverages past observations as inputs for a regression equation to predict future values. It is good for short-term forecasting when past data is a good indicator.

 Decomposition: This breaks down a time series into its core components—trend,
seasonality, and residuals—to enhance the understanding and forecast accuracy.

 Time Series Clustering: Unsupervised method to categorize data points based on similarity, aiding in identifying archetypes or trends in sequential data.

 Wavelet Analysis: Effective for analyzing non-stationary time series data. It helps in
identifying patterns across various scales or resolutions.

 Intervention Analysis: Assesses the impact of external events on a time series, such as
the effect of a policy change or a marketing campaign.
 Box-Jenkins ARIMA models: Focuses on using past behavior and errors to model time
series data. Assumes data can be characterized by a linear function of its past values.

 Box-Jenkins Multivariate models: Similar to ARIMA, but accounts for multiple variables. Useful when other variables influence one time series.

 Holt-Winters Exponential Smoothing: Best for data with a distinct trend and
seasonality. Incorporates weighted averages and builds upon the equations for
exponential smoothing.
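
To make the first two techniques above concrete, here is a minimal sketch in Python, assuming the pandas library and illustrative parameter choices (a 3-period window and alpha = 0.5) on a toy sales series:

import pandas as pd

sales = pd.Series([10, 12, 13, 12, 15, 16, 18, 17, 20, 22])

# Moving average: smooths short-term noise with a 3-period window
moving_avg = sales.rolling(window=3).mean()

# Simple exponential smoothing: recent values receive a higher weight (alpha = 0.5)
exp_smoothed = sales.ewm(alpha=0.5, adjust=False).mean()

print(pd.DataFrame({"sales": sales, "ma(3)": moving_avg, "ses(0.5)": exp_smoothed}))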

The Advantages of Time Series Analysis

Time series analysis is a powerful tool for data analysts that offers a variety of advantages for
both businesses and researchers. Its strengths include:

1. Data Cleansing: Time series analysis techniques such as smoothing and seasonality
adjustments help remove noise and outliers, making the data more reliable and
interpretable.

2. Understanding Data: Models like ARIMA or exponential smoothing provide insight
into the data's underlying structure. Autocorrelations and stationarity measures can help
understand the data's true nature.

3. Forecasting: One of the primary uses of time series analysis is to predict future values
based on historical data. Forecasting is invaluable for business planning, stock market
analysis, and other applications.

4. Identifying Trends and Seasonality: Time series analysis can uncover underlying
patterns, trends, and seasonality in data that might not be apparent through simple
observation.

5. Visualizations: Through time series decomposition and other techniques, it's possible to
create meaningful visualizations that clearly show trends, cycles, and irregularities in the
data.

6. Efficiency: With time series analysis, less data can sometimes be more. Focusing on
critical metrics and periods can often derive valuable insights without getting bogged
down in overly complex models or datasets.

7. Risk Assessment: Volatility and other risk factors can be modeled over time, aiding
financial and operational decision-making processes.

Linear systems analysis can also be applied in the context of data analytics, particularly in the
analysis of linear relationships between variables and in modeling the behavior of systems that
exhibit linear responses to input.

Here's how linear systems analysis concepts can be relevant in data analytics:

1. Linear Regression : Linear regression is a fundamental technique in data analytics for modeling the relationship between a dependent variable and one or more independent variables (a sketch follows this list).
It assumes a linear relationship between the variables and aims to find the best-fitting linear
equation that describes this relationship. Techniques such as ordinary least squares (OLS)
estimation and gradient descent optimization are often used in linear regression analysis.

2. Time Series Analysis : Many time series data sets can be effectively modeled using linear
systems analysis techniques. For example, autoregressive (AR), moving average (MA), and
autoregressive integrated moving average (ARIMA) models are commonly used in time series
analysis to capture linear dependencies between observations over time.

3. Principal Component Analysis (PCA) : PCA is a dimensionality reduction technique that can
be viewed as a linear transformation of the data into a new coordinate system, such that the
greatest variance lies along the first coordinate (principal component), the second greatest
variance lies along the second coordinate, and so on. PCA is based on the analysis of the
covariance matrix of the data, which is a linear systems analysis concept.
4. Kalman Filtering : Kalman filters are used in data analytics for state estimation in systems
with linear dynamics and Gaussian noise. They are commonly applied in tracking applications,
sensor fusion, and signal processing tasks where there's a need to estimate the true state of a
system based on noisy measurements.

5. Linear Discriminant Analysis (LDA) : LDA is a technique used for dimensionality reduction
and classification. It seeks to find the linear combinations of features that best separate the
classes in the data. LDA is closely related to PCA but takes into account class labels when
finding the optimal feature space transformation.

6. Sparse Linear Models : Techniques like LASSO (Least Absolute Shrinkage and Selection
Operator) and ridge regression are used in data analytics for regression tasks where the number
of predictors is large compared to the number of observations. These techniques introduce
regularization to the linear regression model, encouraging sparsity in the coefficient estimates.
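
As a hedged sketch of items 1 and 6 above, both ordinary least squares and ridge regression have closed-form linear-algebra solutions; the NumPy example below uses a small synthetic dataset and an illustrative regularization strength lambda = 1.0:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 observations, 3 predictors
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=100)

X1 = np.column_stack([np.ones(len(X)), X])    # add an intercept column

# Ordinary least squares: beta = (X'X)^-1 X'y
beta_ols = np.linalg.solve(X1.T @ X1, X1.T @ y)

# Ridge regression: beta = (X'X + lambda*I)^-1 X'y (illustrative lambda = 1.0)
lam = 1.0
I = np.eye(X1.shape[1])
I[0, 0] = 0.0                                 # conventionally the intercept is not penalized
beta_ridge = np.linalg.solve(X1.T @ X1 + lam * I, X1.T @ y)

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)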

What is a Rule Induction? Rule Induction Explained

Rule induction is a machine-learning technique that involves the discovery of patterns or rules in
data. It aims to extract explicit if-then rules that can accurately predict or classify instances based
on their features or attributes.

The process of rule induction typically involves the following steps:

Data Preparation: The input data is prepared by organizing it into a structured format, such as a
table or a matrix, where each row represents an instance or observation, and each column
represents a feature or attribute.
Rule Generation: The rule generation process involves finding patterns or associations in the
data that can be expressed as if-then rules. Various algorithms and methods can be used for rule
generation, such as decision tree algorithms (e.g., CART), association rule mining algorithms
(e.g., Apriori), and logical reasoning approaches (e.g., inductive logic programming).
Rule Evaluation: Once the rules are generated, they need to be evaluated to determine their
quality and usefulness. Evaluation metrics can include accuracy, coverage, support, confidence,
lift, and other measures depending on the specific application and domain.
Rule Selection and Pruning: Depending on the complexity of the rule set and the specific
requirements, rule selection and pruning techniques can be applied to refine the rule set. This
process involves removing redundant, irrelevant, or overlapping rules to improve interpretability
and efficiency.
Rule Application: Once a set of high-quality rules is obtained, they can be applied to new,
unseen instances for prediction or classification. Each instance is evaluated against the rules, and
the applicable rule(s) with the highest confidence or support is used to make predictions or
decisions.
Rule induction has been widely used in various domains, such as data mining, machine learning,
expert systems, and decision support systems. It provides interpretable and human-readable
models, making it useful for generating understandable insights and explanations from data.

While rule induction can be effective in capturing explicit patterns and associations in the data, it
may struggle with capturing complex or non-linear relationships. Additionally, rule induction
algorithms may face challenges when dealing with large and high-dimensional datasets, as the
search space of possible rules can become exponentially large.
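
As a hedged illustration of rule generation with a decision-tree algorithm (one of the approaches mentioned above), the sketch below trains a shallow tree on the well-known Iris dataset and prints the induced if-then rules; it assumes scikit-learn is installed:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# export_text renders the induced tree as readable if-then rules
print(export_text(tree, feature_names=list(iris.feature_names)))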

What is a neural network?


A neural network is a method in artificial intelligence that teaches computers to process data in a
way that is inspired by the human brain. It is a type of machine learning process, called deep
learning, that uses interconnected nodes or neurons in a layered structure that resembles the
human brain. It creates an adaptive system that computers use to learn from their mistakes and
improve continuously. Thus, artificial neural networks attempt to solve complicated problems,
like summarizing documents or recognizing faces, with greater accuracy.
Why are neural networks important?

Neural networks can help computers make intelligent decisions with limited human assistance.
This is because they can learn and model the relationships between input and output data that are
nonlinear and complex. For instance, they can do the following tasks.

Make generalizations and inferences

Neural networks can comprehend unstructured data and make general observations without
explicit training. For instance, they can recognize that two different input sentences have a
similar meaning:

 Can you tell me how to make the payment?

 How do I transfer money?

A neural network would know that both sentences mean the same thing. Or it would be able to
broadly recognize that Baxter Road is a place, but Baxter Smith is a person’s name.

What are neural networks used for?

Neural networks have several use cases across many industries, such as the following:

 Medical diagnosis by medical image classification

 Targeted marketing by social network filtering and behavioral data analysis

 Financial predictions by processing historical data of financial instruments


 Electrical load and energy demand forecasting

 Process and quality control

 Chemical compound identification

We give four of the important applications of neural networks below.

Computer vision

Computer vision is the ability of computers to extract information and insights from images and
videos. With neural networks, computers can distinguish and recognize images similar to
humans. Computer vision has several applications, such as the following:

 Visual recognition in self-driving cars so they can recognize road signs and other road
users

 Content moderation to automatically remove unsafe or inappropriate content from image and video archives

 Facial recognition to identify faces and recognize attributes like open eyes, glasses, and
facial hair

 Image labeling to identify brand logos, clothing, safety gear, and other image details

Speech recognition

Neural networks can analyze human speech despite varying speech patterns, pitch, tone,
language, and accent. Virtual assistants like Amazon Alexa and automatic transcription software
use speech recognition to do tasks like these:

 Assist call center agents and automatically classify calls

 Convert clinical conversations into documentation in real time

 Accurately subtitle videos and meeting recordings for wider content reach

Natural language processing

Natural language processing (NLP) is the ability to process natural, human-created text. Neural
networks help computers gather insights and meaning from text data and documents. NLP has
several use cases, including in these functions:

 Automated virtual agents and chatbots

 Automatic organization and classification of written data

 Business intelligence analysis of long-form documents like emails and forms


 Indexing of key phrases that indicate sentiment, like positive and negative comments on
social media

 Document summarization and article generation for a given topic

Recommendation engines

Neural networks can track user activity to develop personalized recommendations. They can also
analyze all user behavior and discover new products or services that interest a specific user. For
example, Curalate, a Philadelphia-based startup, helps brands convert social media posts into
sales. Brands use Curalate’s intelligent product tagging (IPT) service to automate the collection
and curation of user-generated social content. IPT uses neural networks to automatically find and
recommend products relevant to the user’s social media activity. Consumers don't have to hunt
through online catalogs to find a specific product from a social media image. Instead, they can
use Curalate’s auto product tagging to purchase the product with ease.

How do neural networks work?

The human brain is the inspiration behind neural network architecture. Human brain cells, called
neurons, form a complex, highly interconnected network and send electrical signals to each other
to help humans process information. Similarly, an artificial neural network is made of artificial
neurons that work together to solve a problem. Artificial neurons are software modules, called
nodes, and artificial neural networks are software programs or algorithms that, at their core, use
computing systems to solve mathematical calculations.

Simple neural network architecture

A basic neural network has interconnected artificial neurons in three layers:

Input Layer

Information from the outside world enters the artificial neural network from the input layer.
Input nodes process the data, analyze or categorize it, and pass it on to the next layer.

Hidden Layer

Hidden layers take their input from the input layer or other hidden layers. Artificial neural
networks can have a large number of hidden layers. Each hidden layer analyzes the output from
the previous layer, processes it further, and passes it on to the next layer.

Output Layer

The output layer gives the final result of all the data processing by the artificial neural network. It
can have single or multiple nodes. For instance, if we have a binary (yes/no) classification
problem, the output layer will have one output node, which will give the result as 1 or 0.
However, if we have a multi-class classification problem, the output layer might consist of more
than one output node.
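
A minimal sketch of this three-layer architecture as a forward pass in NumPy (the layer sizes, random weights, and sigmoid activation are illustrative assumptions, not a prescribed design):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)   # input layer (4 features) -> hidden layer (3 nodes)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # hidden layer -> single output node

def forward(x):
    hidden = sigmoid(x @ W1 + b1)               # hidden layer activations
    output = sigmoid(hidden @ W2 + b2)          # probability of the "yes" class
    return output

x = np.array([0.2, 0.7, 0.1, 0.5])              # one input example
print("Predicted probability:", float(forward(x)))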

Deep neural network architecture

Deep neural networks, or deep learning networks, have several hidden layers with millions of
artificial neurons linked together. A number, called weight, represents the connections between
one node and another. The weight is a positive number if one node excites another, or negative if
one node suppresses the other. Nodes with higher weight values have more influence on the
other nodes.
Theoretically, deep neural networks can map any input type to any output type. However, they
also need much more training as compared to other machine learning methods. They need
millions of examples of training data rather than perhaps the hundreds or thousands that a
simpler network might need.

What are the types of neural networks?

Artificial neural networks can be categorized by how the data flows from the input node to the
output node. Below are some examples:

Feedforward neural networks

Feedforward neural networks process data in one direction, from the input node to the output node. Every node in one layer is connected to every node in the next layer. During training, a feedforward network still uses a feedback process (such as backpropagation) to improve its predictions over time.

Backpropagation algorithm

Artificial neural networks learn continuously by using corrective feedback loops to improve their
predictive analytics. In simple terms, you can think of the data flowing from the input node to the
output node through many different paths in the neural network. Only one path is the correct one
that maps the input node to the correct output node. To find this path, the neural network uses a
feedback loop, which works as follows:

1. Each node makes a guess about the next node in the path.

2. It checks if the guess was correct. Nodes assign higher weight values to paths that lead to
more correct guesses and lower weight values to node paths that lead to incorrect
guesses.

3. For the next data point, the nodes make a new prediction using the higher weight paths
and then repeat Step 1.
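
The numbered steps above describe the corrective feedback idea informally; in practice it is usually realised as gradient descent on the connection weights. The sketch below is an assumption-laden toy (a single weight and mean squared error) that shows the repeated guess-check-adjust loop:

import numpy as np

# Toy data: the target is simply 2*x, so the ideal weight is 2.0
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w, lr = 0.0, 0.05                       # initial weight and learning rate
for step in range(50):
    pred = w * x                        # forward pass (the "guess")
    error = pred - y                    # how wrong the guess was
    grad = np.mean(2 * error * x)       # gradient of mean squared error w.r.t. w
    w -= lr * grad                      # corrective feedback: adjust the weight
print("Learned weight:", round(w, 3))   # approaches 2.0
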
Convolutional neural networks

The hidden layers in convolutional neural networks perform specific mathematical functions,
like summarizing or filtering, called convolutions. They are very useful for image classification
because they can extract relevant features from images that are useful for image recognition and
classification. The new form is easier to process without losing features that are critical for
making a good prediction. Each hidden layer extracts and processes different image features, like
edges, color, and depth.

Generalization in Neural Networks


When training a neural network in deep learning, its performance on processing new data is key.
Improving the model's ability to generalize relies on preventing overfitting using these important
methods.

Whenever we train our own neural networks, we need to take care of something called
the generalization of the neural network. This essentially means how good our model is at
learning from the given data and applying the learnt information elsewhere.

When training a neural network, there’s going to be some data that the neural network trains
on, and there’s going to be some data reserved for checking the performance of the neural
network. If the neural network performs well on the data which it has not trained on, we can say
it has generalized well on the given data. Let’s understand this with an example.

Suppose we are training a neural network which should tell us if a given image has a dog or not.
Let’s assume we have several pictures of dogs, each dog belonging to a certain breed, and there
are 12 total breeds within those pictures. We will keep all the images of 10 breeds of dogs for training, and the remaining images of the 2 breeds will be kept aside for now.
Dogs training/testing data split.

Now before going to the deep learning side of things, let’s look at this from a human perspective.
Let’s consider a human being who has never seen a dog in their entire life (just for the sake of an
example). Now we will show this human the 10 breeds of dogs and tell them that these are dogs.
After this, if we show them the other 2 breeds, will they be able to tell that they are also dogs?
Well, hopefully they should; 10 breeds should be enough to understand and identify the unique
features of a dog. This concept of learning from some data and correctly applying the gained
knowledge on other data is called generalization.

Coming back to deep learning, our aim is to make the neural network learn as effectively from
the given data as possible. If we successfully make the neural network understand that the other
2 breeds are also dogs, then we have trained a very general neural network, and it will perform
really well in the real world.
What is Competitive Learning?

Competitive learning is a subset of machine learning that falls under the umbrella
of unsupervised learning algorithms. In competitive learning, a network of artificial neurons
competes to "fire" or become active in response to a specific input. The "winning" neuron, which
typically is the one that best matches the given input, is then updated while the others are left
unchanged. The significance of this learning method lies in its power to automatically cluster
similar data inputs, enabling us to find patterns and groupings in data where no prior knowledge
or labels are given.

Competitive Learning Explained

Artificial neural networks often utilize competitive learning models to classify input
without the use of labeled data. The process begins with an input vector (often a data set). This
input is then presented to a network of artificial neurons, each of which has its own set of
weights, which act like filters. Each neuron computes a score based on its weight and the input
vector, typically through a dot product operation (a way of multiplying the input information
with the filter and adding the results together).

After the computation, the neuron that has the highest score (the "winner") is updated,
usually by shifting its weights closer to the input vector. This process is often referred to as the
"Winner-Takes-All" strategy. Over time, neurons become specialized as they get updated toward
input vectors they can best match. This leads to the formation of clusters of similar data, hence
enabling the discovery of inherent patterns within the input dataset.

To illustrate how one can use competitive learning, imagine an ecommerce business
wants to segment its customer base for targeted marketing, but they have no prior labels or
segmentation. By feeding customer data (purchase history, browsing pattern, demographics, etc.)
to a competitive learning model, they could automatically find distinct clusters (like high
spenders, frequent buyers, discount lovers) and tailor marketing strategies accordingly.

The Competitive Learning Process: A Step-by-Step Example

For this simple illustration, let's assume we have a dataset composed of 1-dimensional input
vectors ranging from 1 to 10 and a competitive learning network with two neurons.

Step 1: Initialization

We start by initializing the weights of the two neurons to random values. Let's assume:

 Neuron 1 weight: 2
 Neuron 2 weight: 8
Step 2: Presenting the input vector

Now, we present an input vector to the network. Let's say our input vector is '5'.

Step 3: Calculating distance

We calculate the distance between the input vector and the weights of the two neurons.
The neuron with the weight closest to the input vector 'wins.' This could be calculated using any
distance metric, for example, the absolute difference:

 Neuron 1 distance: |5-2| = 3


 Neuron 2 distance: |5-8| = 3
Since both distances are equal, we can choose the winner randomly. Let's say Neuron 1 is the
winner.

Step 4: Updating weights

We adjust the winning neuron's weight to bring it closer to the input vector. If our
learning rate (a tuning parameter in an optimization algorithm that determines the step size at
each iteration) is 0.5, the weight update would be:

 Neuron 1 weight: 2 + 0.5*(5-2) = 3.5


 Neuron 2 weight: 8 (unchanged)

Step 5: Iteration

We repeat the process with all the other input vectors in the dataset, updating the weights after
each presentation.

Step 6: Convergence

After several iterations (also known as epochs), the neurons' weights will start to
converge to the centers of their corresponding input clusters. In this case, with 1-dimensional
data ranging from 1 to 10, we could expect one neuron to converge around the lower range (1 to
5) and the other around the higher range (6 to 10).

This process exemplifies how competitive learning works. Over time, each neuron
specializes in a different cluster of the data, enabling the system to identify and represent the
inherent groupings in the dataset.
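
The whole loop described in Steps 1 to 6 can be sketched in a few lines of plain Python, reusing the illustrative values above (two neurons initialised at 2 and 8, a learning rate of 0.5, and 1-dimensional inputs from 1 to 10):

import random

random.seed(0)
weights = [2.0, 8.0]                 # Step 1: initial neuron weights
learning_rate = 0.5
data = list(range(1, 11))            # 1-dimensional inputs from 1 to 10

for epoch in range(20):              # Steps 5-6: iterate until the weights settle
    random.shuffle(data)
    for x in data:                   # Step 2: present an input vector
        distances = [abs(x - w) for w in weights]                  # Step 3: distance to each neuron
        winner = distances.index(min(distances))                   # winner-takes-all
        weights[winner] += learning_rate * (x - weights[winner])   # Step 4: update the winner only

print("Final neuron weights:", [round(w, 2) for w in weights])
# Typically one weight ends up near the centre of the low values and the other near the high values.
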
What Is Principal Component Analysis?

Principal component analysis, or PCA, is a dimensionality reduction method that is often used to
reduce the dimensionality of large data sets, by transforming a large set of variables into a
smaller one that still contains most of the information in the large set.
Reducing the number of variables of a data set naturally comes at the expense of accuracy, but
the trick in dimensionality reduction is to trade a little accuracy for simplicity. Because smaller
data sets are easier to explore and visualize and make analyzing data points much easier and
faster for machine learning algorithms without extraneous variables to process.
So, to sum up, the idea of PCA is simple — reduce the number of variables of a data set, while
preserving as much information as possible.

Step-by-Step Explanation of PCA

STEP 1: STANDARDIZATION

The aim of this step is to standardize the range of the continuous initial variables so that each one
of them contributes equally to the analysis.

More specifically, the reason why it is critical to perform standardization prior to PCA, is that the
latter is quite sensitive regarding the variances of the initial variables. That is, if there are large
differences between the ranges of initial variables, those variables with larger ranges will
dominate over those with small ranges (for example, a variable that ranges between 0 and 100
will dominate over a variable that ranges between 0 and 1), which will lead to biased results. So,
transforming the data to comparable scales can prevent this problem.

Mathematically, this can be done by subtracting the mean and dividing by the standard deviation
for each value of each variable.
Once the standardization is done, all the variables will be transformed to the same scale.

STEP 2: COVARIANCE MATRIX COMPUTATION

The aim of this step is to understand how the variables of the input data set are varying from the
mean with respect to each other, or in other words, to see if there is any relationship between
them. Because sometimes, variables are highly correlated in such a way that they contain
redundant information. So, in order to identify these correlations, we compute the covariance
matrix.
The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that
has as entries the covariances associated with all possible pairs of the initial variables. For
example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is a 3×3 matrix of this form:

Cov(x,x)  Cov(x,y)  Cov(x,z)
Cov(y,x)  Cov(y,y)  Cov(y,z)
Cov(z,x)  Cov(z,y)  Cov(z,z)

Since the covariance of a variable with itself is its variance (Cov(a,a)=Var(a)), in the main
diagonal (Top left to bottom right) we actually have the variances of each initial variable. And
since the covariance is commutative (Cov(a,b)=Cov(b,a)), the entries of the covariance matrix
are symmetric with respect to the main diagonal, which means that the upper and the lower
triangular portions are equal.

STEP 3: COMPUTE THE EIGENVECTORS AND EIGENVALUES OF THE COVARIANCE MATRIX TO IDENTIFY THE PRINCIPAL COMPONENTS

Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the
covariance matrix in order to determine the principal components of the data. Before getting to
the explanation of these concepts, let’s first understand what do we mean by principal
components.
Principal components are new variables that are constructed as linear combinations or mixtures
of the initial variables. These combinations are done in such a way that the new variables (i.e.,
principal components) are uncorrelated and most of the information within the initial variables is
squeezed or compressed into the first components. So, the idea is 10-dimensional data gives you
10 principal components, but PCA tries to put maximum possible information in the first
component, then maximum remaining information in the second and so on, until having
something like shown in the scree plot below.

Percentage of Variance (Information) for each PC.

Organizing information in principal components this way allows you to reduce dimensionality without losing much information, by discarding the components with low information and treating the remaining components as your new variables.
An important thing to realize here is that the principal components are less interpretable and
don’t have any real meaning since they are constructed as linear combinations of the initial
variables.
STEP 4: FEATURE VECTOR

As we saw in the previous step, computing the eigenvectors and ordering them by their
eigenvalues in descending order, allow us to find the principal components in order of
significance. In this step, what we do is, to choose whether to keep all these components or
discard those of lesser significance (of low eigenvalues), and form with the remaining ones a
matrix of vectors that we call Feature vector.
So, the feature vector is simply a matrix that has as columns the eigenvectors of the components
that we decide to keep. This makes it the first step towards dimensionality reduction, because if
we choose to keep only p eigenvectors (components) out of n, the final data set will have
only p dimensions.

STEP 5: RECAST THE DATA ALONG THE PRINCIPAL COMPONENTS AXES

In the previous steps, apart from standardization, you do not make any changes on the data, you
just select the principal components and form the feature vector, but the input data set remains
always in terms of the original axes (i.e, in terms of the initial variables).
In this step, which is the last one, the aim is to use the feature vector formed using the
eigenvectors of the covariance matrix, to reorient the data from the original axes to the ones
represented by the principal components (hence the name Principal Components Analysis). This
can be done by multiplying the transpose of the original data set by the transpose of the feature
vector.
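
The five steps above can be condensed into a short NumPy sketch (a minimal illustration on random data, keeping the top two components; the data and the choice of two components are assumptions for demonstration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                      # 200 samples, 5 variables

# Step 1: standardization
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix
cov = np.cov(Z, rowvar=False)

# Step 3: eigenvectors and eigenvalues of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]                  # sort by descending eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: feature vector (keep the top 2 components)
feature_vector = eigvecs[:, :2]

# Step 5: recast the data along the principal component axes
X_pca = Z @ feature_vector

print("Explained variance ratio:", eigvals[:2] / eigvals.sum())
print("Projected data shape:", X_pca.shape)
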
Fuzzy Logic introduction
The word fuzzy refers to things which are not clear or are vague. Any event, process, or
function that is changing continuously cannot always be defined as either true or false, which
means that we need to define such activities in a Fuzzy manner.

What is Fuzzy Logic?


The term fuzzy refers to things which are not clear or are vague. In the real world we often encounter situations where we cannot determine whether a state is true or false; here fuzzy logic provides very valuable flexibility for reasoning. In this way, we can take into account the inaccuracies and uncertainties of any situation. In a Boolean system, the truth value 1.0 represents absolute truth and 0.0 represents absolute falsehood. Fuzzy logic, however, is not restricted to these two extremes: intermediate values are also allowed, representing states that are partially true and partially false.

In other words, we can say that fuzzy logic is not logic that is fuzzy, but logic that is used to describe fuzziness. There can be numerous other examples like this with the help of which we can understand the concept of fuzzy logic. Fuzzy Logic was introduced in 1965 by Lotfi A. Zadeh in his research paper “Fuzzy Sets”. He is considered the father of Fuzzy Logic.
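
To make the idea of partial truth concrete, the hedged sketch below defines a triangular membership function for a fuzzy set "warm temperature"; the breakpoints 15, 25, and 35 degrees Celsius are purely illustrative:

def warm_membership(temp_c):
    # Triangular membership: 0 below 15 C, rising to 1 at 25 C, falling back to 0 at 35 C
    if temp_c <= 15 or temp_c >= 35:
        return 0.0
    if temp_c <= 25:
        return (temp_c - 15) / 10.0
    return (35 - temp_c) / 10.0

for t in [10, 18, 25, 30, 40]:
    print(f"{t} C is warm to degree {warm_membership(t):.2f}")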

Fuzzy Logic - Classical Set Theory

A set is an unordered collection of different elements. It can be written explicitly by listing its elements using the set bracket. If the order of the elements is changed or any element of a set is repeated, it does not make any changes in the set.
Example

• A set of all positive integers.

• A set of all the planets in the solar system.

• A set of all the states in India.

• A set of all the lowercase letters of the alphabet

EXTRACTING FUZZY MODELS FROM DATA,

Introduction:
Fuzzy-Logic theory has introduced a framework whereby human knowledge can be
formalized and used by machines in a wide variety of applications, ranging from cameras to
trains. The basic ideas that we discussed in the earlier posts were concerned with only this aspect
with regards to the use of Fuzzy Logic-based systems; that is the application of human experience
into machine-driven applications. While there are numerous instances where such techniques are
relevant; there are also applications where it is challenging for a human user to articulate the
knowledge that they hold. Such applications include driving a car or recognizing images. Machine
learning techniques provide an excellent platform in such circumstances, where sets of inputs and
corresponding outputs are available, building a model that provides the transformation from the
input data to the outputs using the available data.
Procedure
The objective of this exercise is, as we have explained in the introduction, given a set of
input/output combinations; we will generate a rule set that determines the mapping between the
inputs and outputs. In this discussion, we will consider a two-input, single-output system.
Extending this procedure for more complex systems should be a straightforward task to the
reader.
Step 1 — Divide the input and output spaces into fuzzy regions.
We start by assigning some fuzzy sets to each input and output space. Wang and Mendel
specified an odd number of evenly spaced fuzzy regions, determined by 2N+1 where N is an
integer. As we will see later on, the value of N affects the performance of our models and can result in under- or over-fitting at times. N is, therefore, one of the hyperparameters that we will use to tweak this system’s performance.

Divisions of an input space into fuzzy regions where N=2

Step 2 — Generate Fuzzy Rules from data.


We can use our input and output spaces, together with the fuzzy regions that we have just defined,
and the dataset for the application to generate fuzzy rules in the form of:
If {antecedent clauses} then {consequent clauses}
We start by determining the degree of membership of each sample in the dataset to the different
fuzzy regions in that space. If, as an example, we consider a sample (x1, x2; y) depicted below, we obtain the following degree-of-membership values.

Degree-of-membership values for sample-1

We then assign to each space the region having the maximum degree of membership, indicated by the highlighted elements in the above table, so that it is possible to obtain a rule:
sample 1 => If x1 is b1 and x2 is s1 then y is ce => Rule 1
The next illustration shows a second example, together with the degree of membership results that
it generates.
This sample will, therefore, produce the following rule:
sample 2=> If x1 is b1 and x2 is ce then y is b1 => Rule 2
Step 3 — Assign a degree to each rule.
Step 2 is very straightforward to implement, yet it suffers from one problem; it will
generate conflicting rules, that is, rules that have the same antecedent clauses but different
consequent clauses. Wang and Mendel solved this issue by assigning a degree to each rule, using a product strategy such that the degree is the product of all the degree-of-membership values from both antecedent and consequent spaces forming the rule. We retain the rule having the most significant degree, while we discard the rules having the same antecedent but a smaller degree.
If we refer to the previous example, the degree of Rule 1 equates to:

D(Rule 1) = m_b1(x1) * m_s1(x2) * m_ce(y)

and for Rule 2 we obtain:

D(Rule 2) = m_b1(x1) * m_ce(x2) * m_b1(y)

where m_R(v) denotes the degree of membership of the value v in the fuzzy region R.

We notice that this procedure reduces the number of rules radically in practice.
It is also possible to fuse human knowledge to the knowledge obtained from data by
introducing a human element to the rule degree that has high applicability in practice, as human
supervision can assess the reliability of data, and hence the rules generated from it directly. In the
cases where human intervention is not desirable, this factor is set to 1 for all rules. Rule 1 can hence be defined as follows:

D(Rule 1) = m_b1(x1) * m_s1(x2) * m_ce(y) * m_E(1)

where m_E(1) is the expert-assigned degree reflecting the reliability of the data from which Rule 1 was generated.

Step 4 — Create a Combined Fuzzy Rule Base


The notion of the Combined Fuzzy Rule Base was examined in a previous post. It is a
matrix that holds the fuzzy rule-base information for a system. A Combined Fuzzy Rule Base can
contain the rules that are generated numerically using the procedure described above, but also
rules that are obtained from human experience.
Combined Fuzzy Rule Base for this system. Note Rule 1 and Rule 2.

Step 5 — Determine a mapping based on the Combined Fuzzy Rule Base.


The final step in this procedure explains the defuzzification strategy used to determine the
value of y, given (x1, x2). Wang and Mendel suggest a different approach to the max-min
computation used by Mamdani. We have to consider that, in practical applications, the number of
input spaces will be significant when compared to the typical control application where Fuzzy
Logic is typically used. Besides, this procedure will generate a large number of rules, and
therefore it would be impractical to compute an output using the ‘normal’ approach.
For a given input combination (x1, x2), we combine the antecedents of a given rule to determine the degree of output control corresponding to (x1, x2) using the product operator. If m_O^i is the degree of output control for the ith rule, then

m_O^i = m_I1(x1) * m_I2(x2)

where I1 and I2 are the input regions appearing in the antecedent of the ith rule. Therefore, for Rule 1 (If x1 is b1 and x2 is s1 then y is ce):

m_O^1 = m_b1(x1) * m_s1(x2)

We now define the centre of a fuzzy region as the point that has the smallest absolute value
among all points at which the membership function for this region is equal to 1 as illustrated
below;

Center of fuzzy region

The value of y for a given (x1, x2) combination is thus the weighted average of the output-region centres:

y = ( sum over i = 1..K of m_O^i * ybar^i ) / ( sum over i = 1..K of m_O^i )

where ybar^i denotes the centre of the output region of the ith rule and K is the number of rules.
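
A hedged Python sketch of this defuzzification step, assuming each rule has already been reduced to its degree of output control m_O^i and the centre ybar^i of its output region (the numbers used are illustrative):

def defuzzify(rule_degrees, rule_centres):
    # Weighted average of the rule output centres, weighted by each rule's degree of output control
    num = sum(d * c for d, c in zip(rule_degrees, rule_centres))
    den = sum(rule_degrees)
    return num / den if den > 0 else 0.0

# Illustrative values only: two rules firing with degrees 0.6 and 0.2,
# whose output regions are centred at 0.5 and 1.0 respectively
print(defuzzify([0.6, 0.2], [0.5, 1.0]))   # -> 0.625
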
Evolution of Analytic scalability

Traditional Analytic Architecture:

 Traditional analytics collects data from heterogeneous data sources and pulls it all together into a separate analytics environment for analysis, which can be an analytical server or a personal computer with greater computing capability.
 The heavy processing occurs in the analytic environment as shown in figure.
 In such environments, shipping of data becomes a must, which might result in issues
related with security of data and its confidentiality.

Modern In-Database Architecture:

 Data from heterogeneous sources are collected, transformed and loaded into data
warehouse for final analysis by decision makers.
 The processing stays in the database where the data has been consolidated.
 The data is presented in aggregated form for querying.
 Queries from users are submitted to OLAP (online analytical processing) engines for
execution.
 Such in-database architectures are tested for their query throughput rather than
transaction throughput as in traditional database environments.
 More metadata is required for directing the queries, which helps reduce the time taken to answer queries and hence increases the query throughput.
 Moreover, the data in consolidated form are free from anomalies, since they are pre-
processed before loading into warehouses which may be used directly for analysis.

Massively Parallel Processing (MPP)

Massively Parallel Processing (MPP) is the “shared nothing” approach of parallel computing.

 It is a type of computing wherein the process is being done by many CPUs working in
parallel to execute a single program. One of the most significant differences between
a Symmetric Multi-Processing or SMP and Massive Parallel Processing is that with
MPP, each of the many CPUs has its own memory to assist it in preventing a possible
hold up that the user may experience with using SMP when all of the CPUs attempt to
access the memory simultaneously.
The salient feature of MPP systems is:

 Loosely coupled nodes


 Nodes linked together by a high-speed connection
 Each node has its own memory
 Disks are not shared; each is attached to only one node (the “shared nothing” architecture)

The Cloud Computing:

 Cloud computing is the delivery of computing services over the Internet.


 Examples of cloud services include online file storage, social networking sites,
webmail, and online business applications.
 The cloud computing model allows access to information and computer resources
from anywhere that a network connection is available.
 Cloud computing provides a shared pool of resources, including data storage space,
networks, computer processing power, and specialized corporate and user
applications.
 McKinsey and Company has indicated the following as characteristic features of cloud:

1. Mask the underlying infrastructure from the user

2. Be elastic to scale on demand

3. Be available on a pay-per-use basis

 The National Institute of Standards and Technology (NIST) lists five essential characteristics of cloud computing:

1. On-demand self-service

2. Broad network access

3. Resource pooling

4. Rapid elasticity

5. Measured service

There are two types of cloud environment:

1. Public Cloud:

The services and infrastructure are provided off-site over the internet

Less secured and more vulnerable than private clouds

2. Private Cloud:

Infrastructure operated solely for a single organization

Offer the greatest level of security and control

Grid Computing:

 Grid computing is a form of distributed computing whereby a “super and virtual computer” is composed of a cluster of networked, loosely coupled computers, acting in concert to perform very large tasks.
 Grid computing (Foster and Kesselman, 1999) is a growing technology that facilitates the execution of large-scale, resource-intensive applications on geographically distributed computing resources.
 It facilitates flexible, secure, coordinated, large-scale resource sharing among dynamic collections of individuals, institutions, and resources.
 Distributed or grid computing in general is a special type of parallel computing that relies on complete computers (commodity hardware) connected to a network by a conventional network interface, as opposed to the lower efficiency of designing and constructing a small number of custom supercomputers.

Disadvantage of Grid Computing:

 The various processors and local storage areas do not have high-speed connections.
Hadoop:

 Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.
 Two main building blocks inside this runtime environment are MapReduce and the Hadoop Distributed File System (HDFS).

Map Reduce:

Hadoop MapReduce is a software framework for easily writing applications which process
vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of
nodes) of commodity hardware in a reliable, fault-tolerant manner.
 A MapReduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner.
 The framework sorts the outputs of the maps, which are then input to the reduce
tasks.
 Typically, both the input and the output of the job are stored in a file-system.
 The framework takes care of scheduling tasks, monitoring them and re-executes the
failed tasks.
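
The classic illustration of this model is word count. The sketch below is not Hadoop API code; it simulates the map, shuffle/sort, and reduce phases in plain Python to show how the data flows:

from collections import defaultdict

documents = ["big data needs big storage", "map reduce splits big jobs"]

# Map phase: each mapper emits (word, 1) pairs for its chunk of input
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/sort phase: group all values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: each reducer sums the counts for one key
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # {'big': 3, 'data': 1, ...}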

HDFS:

 HDFS stands for Hadoop Distributed File System.
 HDFS is one of the core components of the Hadoop framework and is responsible for the storage aspect.
 Unlike the usual storage available on our computers, HDFS is a Distributed File System, and parts of a single large file can be stored on different nodes across the cluster.
 HDFS is a distributed, reliable, and scalable file system.
2. Types of Analytics and Types of Big Data:

Types Of Analytics:

1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics
