Unit - I Part I

Introduction to Big Data

Course Outcomes:
 CO 1 - Demonstrate knowledge of Big Data Analytics
concepts and its applications in business.
 CO 2 - Demonstrate functions and components of Map
Reduce Framework and HDFS.
 CO 3 - Discuss Data Management concepts in NoSQL
environment.
 CO 4 - Explain process of developing Map Reduce based
distributed processing applications.
 CO 5 - Explain process of developing applications using
HBASE, Hive, Pig etc.
Unit - I
• Introduction to Big Data
Types of digital data, history of Big Data innovation,
introduction to Big Data platform, drivers for Big Data,
Big Data architecture and characteristics, 5 Vs of Big
Data, Big Data technology components, Big Data
importance and applications
Big Data features – security, compliance, auditing and
protection, Big Data privacy and ethics, Big Data
Analytics, Challenges of conventional systems, intelligent
data analysis, nature of data, analytic processes and tools,
analysis vs reporting, modern data analytic tools.
Definition – Big Data
• Data that contains greater variety, arriving in
increasing volumes and with more velocity.
• In other words, Big Data is a collection of data that
is huge in volume, yet growing exponentially
with time.
• It is data of such large size and complexity
that no traditional data management tool
can store or process it efficiently. In short,
Big Data is still data, just at a much larger scale.
Example of Big Data

• The New York Stock Exchange is an example of Big Data:
it generates about one terabyte of new trade data per day.
• Statistics show that 500+ terabytes of new data get
ingested into the databases of the social media site Facebook
every day. This data is mainly generated through photo
and video uploads, message exchanges, comments, etc.
• A single jet engine can generate 10+ terabytes of data
in 30 minutes of flight time. With many thousands of flights
per day, data generation reaches many petabytes.
Importance of Big Data
• Big Data is not about the amount of data; it is
about how the data can be used
• Reduction in cost
• Reduction in time
• New product development with optimized offers
• Better-informed decision making
• Faster risk management by calculating risk portfolios
• Real-time determination of root causes of failures,
problems or faults
Three Vs of Big Data
• Volume: The amount of data matters.
With big data, you’ll have to process high volumes of low-density, unstructured
data.
This can be data of unknown value, such as Twitter data feeds, click streams
on a web page or a mobile app, or readings from sensor-enabled equipment.
For some organizations, this might be tens of terabytes of data. For others, it
may be hundreds of petabytes
• Velocity: Velocity is the fast rate at which data is received and (perhaps)
acted on.
Normally, the highest velocity of data streams directly into memory versus
being written to disk.
Some internet-enabled smart products operate in real time or near real time
and will require real-time evaluation and action.
• Variety: Variety refers to the many types of data that are available.
Traditional data types were structured and fit neatly in a relational database.
With the rise of big data, data comes in new unstructured data types.
Unstructured and semi-structured data types, such as text, audio, and video,
require additional preprocessing to derive meaning and support metadata.
Five Vs of Big Data
• Veracity:
It refers to the assurance
of quality/integrity/credibility/accuracy of
the data. Since the data is collected from
multiple sources, we need to check the data
for accuracy before using it for business
insights.
• Value:
Value refers to how useful the data is in decision
making. We need to extract value from
Big Data using proper analytics.
Types of Digital Data
• Structured
• Unstructured
• Semi-structured
Structured Data
• Any data that can be stored, accessed and processed in a
fixed format is termed ‘structured’ data.
• This is the data which is in an organized form (e.g., in rows
and columns) and can be easily used by a computer
program.
• Relationships exist between entities of data, such as classes
and their objects.
• Examples Of Structured Data
• Data stored in a relational database management system is
one example of ‘structured’ data.
• An ‘Employee’ table in a database is an example of
Structured Data
• Structured data is organized in semantic chunks (entities)
• Similar entities are grouped together (relations or classes)
• Entities in the same group have the same descriptions
(attributes)
• Descriptions for all entities in a group (schema)
• have the same defined format
• have a predefined length
• are all present
• and follow the same order
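As a minimal sketch (not from the slides; the table and column names are
hypothetical), the Python snippet below shows how an ‘Employee’ table with a
fixed schema behaves as structured data: every row has the same attributes in
the same order, so a program can query it directly.

import sqlite3

# Hypothetical 'Employee' table: every row follows the same fixed schema,
# which is what makes the data "structured".
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE employee (
           emp_id     INTEGER PRIMARY KEY,
           name       TEXT NOT NULL,
           department TEXT NOT NULL,
           salary     REAL NOT NULL
       )"""
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "Asha", "Sales", 52000.0), (2, "Ravi", "IT", 61000.0)],
)

# Because the format is fixed, retrieval is a simple, well-defined query.
for row in conn.execute("SELECT name, salary FROM employee WHERE department = 'IT'"):
    print(row)  # ('Ravi', 61000.0)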
Structured Data Comes From….
• Spreadsheets
• Databases
• SQL
• OLTPs
Unstructured Data
• This is the data which does not conform to a data model
or is not in a form which can be used easily by a
computer program.
• About 80-90% of an organization’s data is in this format;
• for example, memos, chat rooms, PowerPoint
presentations, images, videos, letters, research papers, white
papers, the body of an email, etc.
• Unstructured data is also known as “dark data” because
it cannot be analyzed without the proper software tools.
• Anything in a non-database form is unstructured data.
• It can be classified into two broad categories:
• Bitmap objects : For example, image, video, or audio
files.
• Textual objects : For example, Microsoft Word
documents, emails, or Microsoft Excel spread-sheets.
• A lot of unstructured data is also noisy text such as
chats, emails and SMS texts.
• The language of noisy text differs significantly from the
standard form of language.
Manage Unstructured Data
• Indexing: Data is indexed to enable faster search and retrieval. An index is
defined on some value in the data; it is simply an identifier that stands for a
larger record in the data set.
In the absence of an index, the whole data set/document has to be scanned
to retrieve the desired information.
In the case of unstructured data too, indexing helps in searching and
retrieval (see the sketch after this list).
• Tags/Metadata:
Using metadata, data in a document, etc. can be tagged; this enables
search and retrieval. But with unstructured data this is difficult, as little or no
metadata is available. The structure of the data has to be determined, which is
very difficult because the data itself has no particular format and comes from
more than one source.
• Classification/Taxonomy: Taxonomy is classifying data on the basis of the
relationships that exist between data. Data can be arranged in groups and
placed in hierarchies based on the taxonomy prevalent in an organization.
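As a minimal sketch (not part of the slides; the documents and terms are
hypothetical), the snippet below builds a tiny inverted index over a few text
documents, illustrating how indexing makes unstructured text searchable
without scanning every document.

from collections import defaultdict

# Hypothetical unstructured documents (e.g. emails or chat messages).
docs = {
    "doc1": "invoice attached for the march order",
    "doc2": "team meeting moved to friday",
    "doc3": "please resend the march invoice",
}

# Build an inverted index: word -> set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# Lookup is now a dictionary access instead of a full scan of every document.
print(sorted(index["invoice"]))  # ['doc1', 'doc3']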
Semi Structured Data
• Semi-structured data is not bound by any rigid schema for data storage and
handling.
• The data is not in the relational format and is not neatly organized into rows and
columns like that in a spreadsheet.
• However, there are some features like key-value pairs that help in discerning the
different entities from each other.
• Since semi-structured data doesn’t need a structured query language, it is
commonly called NoSQL data.
• A data serialization language is used to exchange semi-structured data across
systems that may even have varied underlying infrastructure.
• Semi-structured content is often used to store metadata about a business
process but it can also include files containing machine instructions for computer
programs.
• This type of information typically comes from external sources such as social
media platforms or other web-based data feeds.
Examples of Semi Structured Data
• Similar entities in the data are grouped and
organized in a hierarchy. The attributes or the
properties within a group may or may not be
the same.
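A small illustrative sketch (not from the slides; the records and field names are
hypothetical): in the JSON below both records describe a customer, but the
attributes within the group are not identical, which is typical of
semi-structured data.

import json

# Two records of the same entity type; the second has extra fields and
# omits "email" - the schema is not rigid.
customers = json.loads("""
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phone": "+91-98765-43210", "city": "Pune"}
]
""")

for c in customers:
    # Key-value pairs still let a program discern the entities,
    # even though the attributes differ from record to record.
    print(c["name"], "->", sorted(c.keys()))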
Manage Semi-structured Data
• Schemas
• Graph Based Data Models
• XML
Store Semi-structured Data
• XML: XML allows you to define tags and attributes to store
data. Data can be stored in a hierarchical/nested
structure (see the sketch after this list)
• RDBMS: Semi-structured data can be stored in a
relational database by mapping the data to a relational
schema, which is then mapped to a table
• Special Databases: Databases that are specifically
designed to store semi-structured data
• OEM (Object Exchange Model): Data can be stored and exchanged in the
form of a graph in which entities are represented as objects, which are
the vertices of the graph
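As a brief sketch (element names are hypothetical, not from the slides), the
snippet below uses Python's ElementTree to store semi-structured data in a
hierarchical/nested XML structure and read it back.

import xml.etree.ElementTree as ET

# Hypothetical nested structure: an order containing items.
xml_doc = """
<order id="1001">
  <customer>Asha</customer>
  <items>
    <item sku="A12" qty="2"/>
    <item sku="B07" qty="1" gift="true"/>
  </items>
</order>
"""

root = ET.fromstring(xml_doc)
print(root.get("id"), root.findtext("customer"))  # 1001 Asha
for item in root.iter("item"):
    # Attributes vary per element (only one item has "gift"),
    # which is acceptable for semi-structured storage.
    print(item.get("sku"), item.get("qty"), item.get("gift"))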
Extract Information from Semi-structured
Data - Challenges
• Semi-structured data is usually stored in flat files,
which are difficult to index and search
• Heterogeneous sources
• Extracting structure when there is none and
interpreting the relations existing in the
structure which is present is a difficult task
Extract Information from Semi-structured
Data - Solutions
• Indexing data in a graph-based model enables
quick search
• Allows data to be stored in a graph-based data
model which is easier to index and search
• Allows data to be arranged in a hierarchical or
tree-like structure which enables indexing and
searching
• Various mining tools are available which search
data based on graphs, schemas, structure, etc.
Big data Technologies
• Big Data technology enables highly accurate decision making and
operational efficiency by reducing costs and trade risks, and it provides
the infrastructure to facilitate, manage and process huge data
volumes in real time.
• Two types:
Operational Big Data – Offers real-time, interactive capabilities for
large data operations, e.g. MongoDB,
Apache Cassandra, CouchDB
Analytical Big Data – Offers the analytical capability to run
complex analysis on large datasets, e.g.
MapReduce, BigQuery, Apache Spark or Massively
Parallel Processing (MPP) databases.
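As an illustrative sketch (not the actual Hadoop MapReduce API; the input
lines are hypothetical), the Python snippet below mimics the map, shuffle and
reduce phases of a word count, the canonical example of the analytical style
named above.

from collections import defaultdict
from itertools import chain

lines = [
    "big data needs big tools",
    "map reduce splits big jobs",
]

# "Map" phase: emit (word, 1) pairs from every input line.
def map_phase(line):
    return [(word, 1) for word in line.split()]

mapped = chain.from_iterable(map_phase(line) for line in lines)

# "Shuffle" phase: group the emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# "Reduce" phase: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts["big"])  # 3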
History of Big Data
• The earliest records of using data to track and control
businesses date back about 7,000 years, when accounting
was introduced in Mesopotamia in order to record the growth
of crops and herds.
• The first trace of big data appeared in 1663, when John Graunt dealt with
overwhelming amounts of information while he studied the
bubonic plague, which was haunting Europe at the time.
• Graunt was the first person ever to use statistical data analysis.
• The earliest milestone of modern data processing came in 1887,
when Herman Hollerith invented a computing machine that
could read holes punched into paper cards in order to organize
census data.
History …
• After Herman Hollerith’s input, the next noteworthy data development
leap happened in 1937 under Franklin D. Roosevelt’s presidential
administration in the United States.
• After the United States congress passed the Social Security Act, the
government was required to keep track of millions of Americans.
• The government contracted IBM to develop a punch card-reading
system that would be applied in this extensive data project.
• The first data-processing machine, named ‘Colossus’, was built during
World War II to search for patterns that appeared regularly in intercepted messages.
• The machine worked at a record rate of five thousand characters per second.
• The National Security Agency (NSA), founded in 1952, took on the task of
decrypting intercepted messages during the course of the Cold War.
History…
• This machine could independently and automatically
collect and process information.
• The first data centre was built by the United States
government in 1965 to store millions of tax returns and
fingerprint sets.
• Tim Berners-Lee, a British computer scientist, invented
the World Wide Web in 1989 to enable the sharing of
information through a hypertext system.
• As we entered the 1990s, the creation of data grew at
an extremely high rate as more devices gained the capacity
to access the internet.
History…
• The first super-computer was built in 1995. This computer had the
capacity to handle work that would take a single person thousands of
years in a matter of seconds.
• In 2005, Yahoo created the now open-source Hadoop with the intention
of indexing the entire World Wide Web.
• Hadoop is used by millions of businesses to go through colossal
amounts of data.
• During this period, social networks were rapidly increasing and large
amounts of data were being created on a daily basis.
• Businesses and governments alike began to establish big data projects.
• For example, in 2009 the Indian government began storing fingerprint and iris
scans of all of its citizens in the largest biometric database ever created.
History…
• Over the past several years, various
organizations have emerged
to deal with big data, for
example, HCL.
• These organizations’ business is helping other
businesses understand and use big data.
Big data platform
• A big data platform is a type of IT solution that combines the features and
capabilities of several big data applications and utilities within a single
solution.
It is an enterprise-class IT platform that enables an organization to develop,
deploy, operate and manage a big data infrastructure/environment.
• A big data platform generally consists of big data storage, servers, databases,
big data management, business intelligence and other big data
management utilities.
• It supports custom development, querying and integration with other
systems.
• The primary benefit is reducing the complexity of multiple vendors/solutions
into one cohesive solution.
• Big data platforms are also delivered through the cloud, where the provider
offers all-inclusive big data solutions and services.
• A big data platform focuses on providing its users with efficient
analytics tools for massive datasets.
• These platforms are often used by data
engineers to aggregate, clean, and prepare
data for business analysis.
Best Big Data Platforms
• Judged on the criteria of Scalability,
Availability, Performance, and Security (SAPS),
some widely used platforms are listed below:
 Hadoop Delta Lake Migration Platform
 Data Catalog Platform
 Data Ingestion Platform
 IoT Analytics Platform
 Data Integration and Management Platform
 ETL Data Transformation Platform
• Hadoop - Delta Lake Migration Platform
It is an open-source software platform managed by Apache
Software Foundation.
It is used to manage and store large data sets at a low cost and
with great efficiency.
• Data Catalog Platform
Provides a single self-service environment to the users, helping
them find, understand, and trust the data source.
Helps the users to discover the new data sources if there are any.
Discovering and understanding data sources are the initial steps
for registering the sources.
• Data Ingestion Platform
This layer is the first step for the data coming
from variable sources to start its journey.
This means the data here is prioritized and
categorized, making data flow smoothly in
further layers in this process flow.
• IoT Analytics Platform
It provides a wide range of tools to work on big data;
this functionality comes in handy when applied to
IoT use cases.
• Data Integration and Management Platform
ElixirData provides a highly customizable solution for
Enterprises. ElixirData provides Flexibility, Security, and
Stability for an Enterprise application and Big Data
Infrastructure to deploy on-premises and Public Cloud
with cognitive insights using Machine Learning and
Artificial Intelligence.
• ETL Data Transformation Platform
This platform can be used to build data-transformation
pipelines and even schedule their
runs.
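As a minimal sketch (not from the slides; the record layout and cleaning rule
are hypothetical), the Python snippet below shows the extract-transform-load
shape such a pipeline typically has.

import csv, io, sqlite3

# Extract: read raw records (an in-memory CSV stands in for a source file).
raw = io.StringIO("name,amount\nasha,120\nravi,not_a_number\nmeena,75\n")
rows = list(csv.DictReader(raw))

# Transform: clean and normalize, dropping records that fail validation.
def transform(row):
    try:
        return (row["name"].title(), float(row["amount"]))
    except ValueError:
        return None  # discard malformed amounts

clean = [r for r in (transform(row) for row in rows) if r is not None]

# Load: write the cleaned records into a warehouse table (SQLite stands in here).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (name TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)", clean)
print(db.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone())  # (2, 195.0)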
Essential components of Big Data Platform
• Data Ingestion, Management, ETL, and Warehouse – It provides these resources for
effective data management and effective data warehousing, and this manages data as a
valuable resource.
• Stream Computing – Helps compute streaming data that is used for real-time
analytics (a small sketch follows this list).
• Analytics/ Machine Learning – Features for advanced analytics and machine learning.
• Integration – It provides its user with features like integrating big data from any source
with ease.
• Data Governance – It also provides comprehensive security, data governance, and
solutions to protect the data.
• Provides Accurate Data – It delivers analytic tools that help to filter out
inaccurate or unanalyzed data. This also helps the business make the right
decisions by utilizing accurate information.
• Scalability – It helps scale the application to analyze ever-growing data and
provide efficient analysis. It offers scalable storage capacity.
• Price Optimization – Data analytics with the help of a big data platform provides insight
for B2C and B2B enterprises which helps the business to optimize the prices they charge
accordingly.
• Reduced Latency – With the set of the warehouse, analytics tools, and efficient Data
transformation, it helps to reduce the data latency and provide high throughput.
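A minimal sketch of the stream-computing idea mentioned above (the event
source and window size are hypothetical): readings are processed one at a
time as they arrive, and a rolling window keeps only recent values so an
aggregate can be reported in real time.

from collections import deque

WINDOW = 5  # keep only the 5 most recent readings

window = deque(maxlen=WINDOW)

def on_event(value):
    """Called for each arriving reading; returns a rolling average in real time."""
    window.append(value)
    return sum(window) / len(window)

# A simulated sensor stream stands in for a real message queue.
for reading in [10, 12, 11, 50, 13, 12, 11]:
    print(reading, "->", round(on_event(reading), 2))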
Big Data Drivers
• Drivers can be viewed through two different lenses:
Business and Technology.
• Business entails the market, sales and financial
side of things,
• whereas Technology covers drivers
targeted at the technology and IT
infrastructure side of things.
Business Drivers
1. Data driven initiatives: They are primarily categorized into 3 broad areas:
a. Data Driven Innovation
b. Data Driven Decision Making: Data-driven decision-making is the inherent ability of
analytics to sift through globs of data and identify the best path forward, whether that means
finding the best route, validating the current route, or estimating the success or failure of the
current strategy.
c. Data Driven Discovery
2. Data Science as a competitive advantage
In commodity businesses there is a consistent push to build big data as a
capability that adds to competitive advantage. With a proper data-driven framework,
businesses can build sustainable capabilities and further leverage these capabilities as a
competitive edge. Businesses that master big-data-driven capabilities
can even use them to establish a secondary source of revenue by
selling them to other businesses.
3. Sustained processes: A data-driven approach creates sustainable processes, which is a
strong endorsement of a big data analytics strategy for enterprise adoption.
Randomness kills businesses and adds frightening risks, while a data-driven strategy reduces
risk by bringing in statistical models, which are measurable.
• 4. Cost advantages of commodity hardware & open source software: Cost advantage is
music to CXOs’ ears. Consider the savings IT enjoys from moving workloads to
commodity hardware and leveraging open source platforms as cost-effective ways to
achieve enterprise-level computation and beyond. There is no more overpaying for premium
hardware when similar or better analytical processing can be done using commodity and
open-source systems.
• 5. Quick turnaround and less bench time: Have you dealt with the IT folks in your company?
More and more people, complex processes and communication charters make it hard to
connect with someone who can get the task done. Things take forever and cost
a fortune, with substandard quality. A good big data and analytics strategy can reduce
proof-of-concept time smoothly and substantially. It reduces the burden on IT and gets more
high-quality, fast and cost-effective solutions built. So you waste less time waiting for
analysis/insights and spend more time digging through more and more data, using it for
insights and analyses that were never possible before.
• 6. Automation to backfill redundant/mundane tasks: How about doing something about the
80% of time that is wasted on data cleaning and preprocessing? A great deal of
automation can take over this work and sky-rocket enterprise efficiency. Less manual time is
spent on data prep, and more time is spent on analysis that has substantial ROI
compared to mundane data prep and monotonous tasks.
• 7. Optimize workforce to leverage high talent cost: This is an interesting area that I am
keeping a close eye on. Businesses already have the right talent pools to solve some
pieces of the big data puzzle on data science. Businesses have BI analysts, modelers and IT people
working in harmony in some shape or form. So, a good big data & analytics strategy ensures the
current workforce is leveraged to its core in handling enterprise big data, and also ensures the
right number of data scientists are involved, with a clearer view of their contribution and their
ROI.
Technology Drivers
8. Data continues to grow exponentially: Whether you like it or not, data is increasing. One key
technological push is this growth and the threat of not being able to use the exploding
enterprise data for insights. Having a good strategy pacifies concerns about growing,
unutilized data.
• 9. Data is everywhere and in many formats: Besides having to sift through data in huge
volumes, dealing with a stream of disparate data also poses its own threats. Text, voice, video, logs and
other emerging formats make it harder to gain insights using traditional tools. So, businesses
need to build up their big data toolkit to prepare for the exploding data types entering corporate
data DNA.
• 10. Alternate, multiple synchronous & asynchronous data streams: Data comes through
multiple silos in real time, creating problems in keeping up with it in existing data systems.
These multiple streams put pressure on businesses to have an effective strategy for handling
these sources. With tools available to handle such situations, it has become important to
acquire such capabilities before the competition does.
• 11. Low barrier to entry: As with any business, a low barrier to entry gives
businesses great leverage to try different technologies and come up with the best strategy. Easy frameworks &
paradigms have made available lots of tools that are relatively easy to deploy. These tools
can deliver phenomenal computing horsepower.
• 12. Traditional solutions failing to catch up with new market conditions: Big data has given rise
to exploding volume, velocity and variety of data. These 3 Vs are difficult to handle and demand
cutting-edge technologies. New requirements have emerged from changing market dynamics
that cannot be addressed by old tools and demand new big data tools. Hence, a big data and
analytics strategy is needed to embrace these tools before the business goes obsolete.
Big Data architecture and characteristics

• Big data architecture refers to the logical and physical structure
that dictates how high volumes of data are ingested, processed,
stored, managed, and accessed.
Layers in BIG DATA Architecture
• Big Data Ingestion Layer
This layer of Big Data Architecture is the first step for the data coming
from variable sources to start its journey. Data ingestion means the data
is prioritized and categorized, making data flow smoothly in further
layers in the Data ingestion process flow.
Tools used by this layer are:
Apache Flume - a straightforward and flexible architecture based on
streaming data flows;
Apache NiFi - supports robust and scalable directed graphs of data
routing, transformation, and system mediation logic;
Elastic Logstash - an open-source data ingestion tool and server-side data
processing pipeline that ingests data from many sources, simultaneously
transforms it, and then sends it to your “stash,” i.e., Elasticsearch.
• Data Collector Layer
In this Layer, more focus is on the transportation of data from the
ingestion layer to the rest of the data pipeline. It is the Layer of
data architecture where components are decoupled so that
analytic capabilities may begin.
Data Processing Layer
• In this primary layer of Big Data Architecture, the focus is on
the data pipeline processing system. The data
collected in the previous layer is processed in
this layer. Here the data is routed
to different destinations and the data flow is classified; it is the
first point where analytics may occur.
Data Storage Layer
Storage becomes a challenge when the size of the data you are dealing with
becomes large. Several possible solutions, such as data ingestion patterns, can
come to the rescue. Finding a storage solution is very
important when the size of your data becomes large. This layer of Big Data
Architecture focuses on “where to store such large data efficiently.”
Data Query Layer
This is the architectural layer where active analytic processing of Big Data
takes place. Here, the primary focus is to gather the data value to be more
helpful for the next layer.
Data Visualization Layer
The visualization, or presentation, tier is probably the most prestigious tier;
it is where data pipeline users can feel the VALUE of DATA.
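To make the layer sequence above concrete, here is an illustrative sketch (the
functions, record format and data are hypothetical): each layer is represented
by a small Python function, and data flows through them in order.

# Each function stands in for one architectural layer.
def ingest():                        # Ingestion layer: data arrives from sources
    return ["user=asha action=buy", "user=ravi action=view"]

def collect(records):                # Collector layer: transport / decoupling
    return list(records)

def process(records):                # Processing layer: parse and classify
    return [dict(kv.split("=") for kv in r.split()) for r in records]

def store(events, db):               # Storage layer: persist for later querying
    db.extend(events)

def query(db, action):               # Query layer: derive value from stored data
    return [e["user"] for e in db if e["action"] == action]

db = []
store(process(collect(ingest())), db)
print(query(db, "buy"))              # Visualization layer would present this: ['asha']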
Importance of Big Data
• To understand Where, When and Why their customers buy
• Protect the company’s client base with improved loyalty
programs
• Seizing cross-selling and upselling opportunities
• Provide targeted promotional information
• Optimize Workforce planning and operations
• Reduce inefficiencies in the company’s supply chain
• Predict market trends
• Predict future needs
• Make companies more innovative and competitive
• It helps companies to discover new sources of revenue
Applications of Big Data
• Transportation: Big Data helps run GPS in smart phone
applications which sources data from government agencies and
even satellite images. Airplanes also generate a huge volume of
data for transatlantic flights to optimize fuel efficiency, balance
cargo and passenger weights, and analyze weather conditions in
order to ensure the maximum level of safety.
• Advertising and Marketing: Big Data is a major constituent
of marketing and advertising to target particular segments of the
consumer base. Advertisers purchase or collect large volumes of
data to identify what consumers like.
• Banking and Financial Services: Big Data plays an important role
in the financial industry because it is used for fraud detection,
managing and mitigating risks, optimizing customer
relationships as well as personalized marketing.
• Media and Entertainment: Big Data is extensively used by the
entertainment industry for gaining insights from reviews sent by
consumers, predicting audience preferences and interests, and targeting
campaigns for marketing purposes.
• Meteorology: Weather sensors and satellites all over the globe help
collect large volumes of data to track climate conditions. Meteorologists
extensively use Big Data to study the patterns of natural disasters,
prepare forecasts of weather, and the like.
• Healthcare: Big Data has significantly impacted the healthcare industry at
large. Healthcare providers and organizations have widely used Big Data
for various purposes, including predicting outbreaks of diseases,
detecting early symptoms of preventable diseases, e-records of health,
real-time cautioning, improving patient engagement, predicting and
preventing grave medical conditions, strategic planning, telemedicine
and research, and the like.
• Education: Many educational institutions have embraced the usage of Big
Data for improving curricula, attracting the best talent, and reducing
rates of dropouts by improving student outcomes, targeting global
recruiting, and optimizing the overall student experience.
