
B.Tech IV-Yr I-Sem CSE
2023-24
BIG DATA ANALYTICS
(A6515) (VCE-R20)

UNIT-1
INTRODUCTION TO BIG DATA
BHANU PRASAD ANDRAJU
Associate Professor, Dept. of CSE
a.bhanuprasad@vardhaman.org
andrajub4u@gmail.com
9885990509

VARDHAMAN COLLEGE OF ENGINEERING


(AUTONOMOUS)
Shamshabad – 501218, Hyderabad, AP
Course Outcomes (COs)
After the completion of the course, the student will be able to:                         POs  PSOs
A6515.1  Identify the fundamental concepts of big data analytics.                         1    1
A6515.2  Select Hadoop environment and apply HDFS commands on file management tasks.      1    1
A6515.3  Utilize optimization techniques of MapReduce programming to process massive
         amounts of data in parallel.                                                     5    1
A6515.4  Make use of NoSQL databases like MongoDB and Cassandra to store log data to be
         pulled for analysis.                                                             5    1
A6515.5  Identify appropriate modern tools like Pig and Hive for complex data flow and
         analysis.                                                                        5    1
Bloom’s Taxonomy Level
CO#       Remember(L1)  Understand(L2)  Apply(L3)  Analyze(L4)  Evaluate(L5)  Create(L6)
A6515.1   ✔
A6515.2   ✔
A6515.3   ✔
A6515.4   ✔
A6515.5   ✔
Course Contents - Theory
Introduction to Big Data: Classification of Digital Data, Characteristics of Data,
Definition of Big Data, Challenges with Big Data, Definitional Traits of Big Data,
Traditional Business Intelligence (BI) versus Big Data, Coexistence of Big Data and Data
Warehouse, Realms of Big Data, Big Data Analytics, Classification of Analytics,
Challenges of Big Data, Terminologies Used in Big Data Environments, Few Top
Analytics Tools.
The Big Data Technology Landscape: NoSQL (Not
Only SQL), Types of NoSQL Databases, SQL versus NoSQL, Introduction to Hadoop,
RDBMS versus Hadoop, Distributed Computing Challenges, Hadoop Overview, Hadoop
Distributors, HDFS (Hadoop Distributed File System), Working with HDFS commands,
Interacting with Hadoop Ecosystem.
MapReduce Programming: Processing Data with Hadoop, Mapper, Reducer, Combiner,
Partitioner, Searching, Sorting, Compression, Managing Resources and Applications with
Hadoop YARN.
Cassandra: Features of Cassandra, CQL Data Types, Keyspaces, CRUD Operations,
Collection Types, Table Operations.
MONGODB: Features of MongoDB, RDBMS vs MongoDB, Data Types in MongoDB,
MongoDB Query Language, CRUD operations, Count, Limit, Sort, and Skip.
PIG: The Anatomy of Pig, Pig Philosophy, Pig Latin Overview, Data Types in Pig,
Running Pig, Execution Modes of Pig, Relational Operators, Eval Functions, Word Count
using Pig.
HIVE: Introduction to Hive, Hive Architecture, Hive Data Types, Hive File Format, Hive
Query Language (HQL): DDL, DML, Partitions, Pig versus Hive.
UNIT – I CONTENTS
1. INTRODUCTION TO BIG DATA
1.1 Classification of Digital Data
1.2 Characteristics of Data
1.3 Definition of Big Data
1.4 Challenges with Big Data
1.5 Characteristics of Big Data / Definitional Traits of Big Data
1.6 Traditional Business Intelligence (BI) versus Big Data
1.7 Coexistence of Big Data and Data Warehouse
1.8 Realms of Big Data
1.9 Big Data Analytics
1.10 Classification of Analytics
1.11 Challenges of Big Data
1.12 Terminologies Used in Big Data Environments
1.13 Few Top Analytics Tools.

INTRODUCTION TO BIG DATA
1.1 Classification of Digital Data
Digital data can be classified into 3 types
1) Structured Data
➢ This is the data which is in an organized form
(e.g., in rows and columns) and can be easily
used by a computer program.
➢ Relationships exist between entities of data, such
as classes and their objects.
➢ Ex: data stored in databases
2) Semi-structured data:
➢ This is the data which does not conform to a data model but has some
structure. However, it is not in a form which can be used easily by a
computer program.
➢ Ex: emails, XML, markup languages like HTML, etc. Metadata for this
data is available but is not sufficient.
3) Unstructured data:
➢ This is the data which does not conform to a data model or is not in a form
which can be used easily by a computer program.
➢ About 80–90% of an organization’s data is in this format.
➢ Ex: memos, chat rooms, PowerPoint presentations, images, videos, letters,
research papers, white papers, body of an email, etc.
1.1.1 Structured Data
➢ Structured Data: when data conforms to a pre-defined schema/structure we
say it is structured data.
➢ Most of the structured data is held in RDBMS. An RDBMS conforms to the
relational data model wherein the data is stored in rows/columns. Refer Table
1.1.
➢ The number of rows/records/tuples in a relation is called the cardinality of a
relation. The number of columns is referred to as the degree of a relation.
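As a small, hedged illustration (not from the text), the following Python sketch uses the built-in sqlite3 module and an invented employee relation to show a pre-defined schema, its cardinality (number of rows), and its degree (number of columns):

```python
import sqlite3

# In-memory relational database; rows and columns follow a pre-defined schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (emp_id INTEGER, emp_name TEXT, dept TEXT)")
cur.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Asha", "CSE"), (2, "Ravi", "ECE"), (3, "Meena", "CSE")],
)

# Cardinality = number of rows/tuples in the relation.
cardinality = cur.execute("SELECT COUNT(*) FROM employee").fetchone()[0]

# Degree = number of columns/attributes in the relation.
degree = len(cur.execute("SELECT * FROM employee LIMIT 1").description)

print("cardinality:", cardinality, "degree:", degree)   # cardinality: 3 degree: 3
```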

Sources of Structured Data
➢ Databases: Oracle, IBM DB2, Microsoft SQL Server, MySQL (open source)
➢ Online transaction processing (OLTP): transactional/operational data from
day-to-day business activities, e.g., online banking, online shopping
➢ Uses simple queries
➢ Requires read/write operations
➢ Size is relatively small: 100 MB to 10 GB
(Figure: Sources of Structured Data – databases such as Oracle, IBM DB2, Microsoft SQL
Server and MySQL storing fields like age, billing, contact, address, expenses and
debit/credit card numbers; online banking and online transactions.)
Ease of working with Structured data
1) Insert/delete/update: The Data Manipulation Language (DML)
operations provide the required ease with data input, storage, access,
process, analysis, etc.
2) Security: Encryption and tokenization solutions are available for the
security of information throughout its lifecycle. Only authorized
individuals are able to decrypt and view sensitive information.
3) Indexing: An index is a data structure that speeds up the data
retrieval operations (see the sketch after this list).
4) Scalability: The storage and processing capabilities of the traditional
RDBMS can be easily scaled up by increasing the horsepower of the
database server.
5) Transactional processing: RDBMS supports the Atomicity,
Consistency, Isolation, and Durability (ACID) properties of
transactions. Atomicity: either a transaction happens in its entirety or
none of it at all. Consistency: the database must be in a consistent
state before and after the execution of a transaction. Isolation:
transactions can execute concurrently without interfering with one
another. Durability: all changes made to the database during a
transaction are permanent.
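A minimal sketch (not from the text) of points 1 and 3 above, using Python's built-in sqlite3 module and an invented orders table: DML statements manipulate the rows, and an index speeds up lookups on a column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")

# 1) DML: insert, update, delete rows.
cur.execute("INSERT INTO orders VALUES (?, ?, ?)", (101, "Asha", 2500.0))
cur.execute("UPDATE orders SET amount = ? WHERE order_id = ?", (2750.0, 101))
cur.execute("DELETE FROM orders WHERE amount < ?", (100.0,))

# 3) Indexing: a B-tree index on `customer` speeds up lookups on that column.
cur.execute("CREATE INDEX idx_orders_customer ON orders(customer)")
rows = cur.execute("SELECT * FROM orders WHERE customer = ?", ("Asha",)).fetchall()
print(rows)   # [(101, 'Asha', 2750.0)]
```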
1.1.2 Semi-Structured Data
➢ Semi-structured data is also referred to as self-describing structure. It
has the following features:
1) It does not conform to the data models that one typically associates
with relational databases or any other form of data tables.
2) It uses tags to segregate semantic elements.
3) Tags are also used to enforce hierarchies of records and fields within
data.
4) There is no separation between the
data and the schema. The amount
of structure used is dictated by the
purpose at hand.
5) In semi-structured data, entities
belonging to the same class and
also grouped together need not
necessarily have the same set of
attributes.
And even if they have the same set of attributes, the order of the
attributes may differ, which for all practical purposes is not important.
Sources of Semi-Structured Data
➢ Amongst the sources for semi-structured data, the front runners are
“XML” and “JSON” as depicted in Fig.
1) XML: eXtensible Markup Language (XML) is hugely popularized by
web services developed utilizing the Simple Object Access Protocol
(SOAP) principles.
2) JSON: JavaScript Object Notation (JSON) is used to transmit data
between a server and a web application. JSON is popularized by web
services developed utilizing REpresentational State Transfer
(REST) – an architectural style for creating scalable web services.
MongoDB (open-source, distributed, NoSQL, document-oriented
database) and Couchbase (open-source, distributed, NoSQL,
document-oriented database) store data natively in JSON format.
Sources of Semi-structured Data: XML, JSON, HTML, CSV files, TSV files, email.
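As a hedged illustration of the JSON case (invented customer records, not from the text): the keys act as self-describing tags, and two records of the same class need not have the same set of attributes, echoing point 5 in the features list above.

```python
import json

# Two "customer" records of the same class with different attribute sets;
# the keys act as tags that describe the structure (self-describing data).
customers = [
    {"id": 1, "name": "Asha", "email": "asha@example.com"},
    {"id": 2, "name": "Ravi", "phone": "9000000000", "city": "Hyderabad"},
]

text = json.dumps(customers, indent=2)   # serialize for transmission or storage
parsed = json.loads(text)                # parse it back into Python objects

for record in parsed:
    print(sorted(record.keys()))         # the attribute sets differ per record
```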
1.1.3 Unstructured Data
➢ Unstructured data does not conform to any pre-defined data model.
➢ The structure of unstructured data is quite unpredictable. Various
sources of unstructured data are listed below.
Sources of Unstructured Data
➢ Web pages, images, chats, body of e-mail, audio, social media data,
free-form text, videos, Twitter messages, text messages, animations,
Facebook posts, Word documents, hyperlinks, WhatsApp messages, and log files.
Issues with “Unstructured” Data

How to Deal with Unstructured Data?
➢ The following techniques are used to find patterns in or
interpret unstructured data:
1) Data mining: We use methods at the intersection of artificial
intelligence, machine learning, statistics, and database systems to
unearth consistent patterns in large data sets and/or systematic
relationships between variables. It is the analysis step of the
“knowledge discovery in databases” process. Few popular data mining
algorithms are as follows:
▪ Association rule mining: It is also called “market basket analysis”
or “affinity analysis”. It identifies, when you buy a product, which
other products you are likely to purchase along with it.
▪ Regression analysis: It helps to predict the relationship between
two variables. The variable whose value needs to be predicted is
called the dependent variable and the variables which are used to
predict the value are referred to as the independent variables.
▪ Collaborative filtering: It is about predicting a user’s preference or
preferences based on the preferences of a group of users.
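A rough sketch of collaborative filtering (invented ratings, not from the text): user-to-user cosine similarity over a tiny ratings dictionary; items liked by the most similar users would then be recommended.

```python
from math import sqrt

# Invented user -> {product: rating} data.
ratings = {
    "u1": {"laptop": 5, "mouse": 4, "keyboard": 5},
    "u2": {"laptop": 4, "mouse": 5},
    "u3": {"headset": 5, "mouse": 2},
}

def cosine(a, b):
    """Cosine similarity between two users' rating vectors over their common items."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[item] * b[item] for item in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Rank the other users by similarity to u1; items they liked become candidates for u1.
sims = sorted(((cosine(ratings["u1"], ratings[u]), u) for u in ratings if u != "u1"),
              reverse=True)
print(sims)
```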
Deal with Unstructured Data Contd..
2) Text analytics or text mining: Text is largely unstructured,
amorphous, and difficult to deal with algorithmically. Text mining is
the process of gleaning (collecting) high quality and meaningful
information (through devising of patterns and trends by means of
statistical pattern learning) from text. It includes tasks such as text
categorization, text clustering, sentiment analysis, concept/entity
extraction, etc.
3) Natural language processing (NLP): It is related to the area of human
computer interaction. It is about enabling computers to understand
human or natural language input.
4) Noisy text analytics: It is the process of extracting structured or semi-
structured information from noisy unstructured data such as chats,
blogs, wikis, emails, message-boards, text messages, etc. The noisy
unstructured data usually comprises one or more of the following:
Spelling mistakes, abbreviations, acronyms, non-standard words,
missing punctuation, missing letter case, filler words such as “uh”,
“um”, etc.

Deal with Unstructured Data Contd..
5) Manual tagging with metadata: This is about tagging manually with
adequate metadata to provide the requisite semantics to understand
unstructured data.
6) Part-of-speech tagging: It is also called POS or POST or grammatical
tagging. It is the process of reading text and tagging each word in the
sentence as belonging to a particular part of speech such as “noun”,
“verb”, “adjective”, etc. (see the sketch after this list).
7) Unstructured Information Management Architecture (UIMA): It is an
open source platform from IBM. It is used for real-time content
analytics. It is about processing text and other unstructured data to
find latent meaning and relevant relationship buried therein.
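One possible way to try part-of-speech tagging (point 6 above) is with the open-source NLTK library; the sketch below assumes NLTK is installed and that its tokenizer and tagger models can be downloaded.

```python
import nltk

# One-time downloads of the tokenizer and tagger models (network access assumed).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "Big data analytics uncovers hidden patterns in large datasets."
tokens = nltk.word_tokenize(sentence)   # split the sentence into words
tagged = nltk.pos_tag(tokens)           # tag each word: noun, verb, adjective, ...

print(tagged)   # e.g. [('Big', 'JJ'), ('data', 'NNS'), ('analytics', 'NNS'), ...]
```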

➢ In contrast to structured data, which
is stored in data warehouses,
unstructured data is placed in data lakes,
which preserve the raw format of the
data and all of the information it
holds.
➢ In warehouses, the data is limited to
its defined schema. This is not true of
lakes which make the data more
malleable.

Differences between Structured, Semi-structured and Unstructured Data

Factors          Structured data              Semi-structured data          Unstructured data
Flexibility      Schema-dependent and         More flexible than            Flexible in nature; there
                 less flexible                structured data but less      is an absence of a schema
                                              than unstructured data
Transaction      Matured transaction          Transaction management        No transaction management
management       management and various       adapted from the DBMS,        and no concurrency
                 concurrency techniques       not matured
Query            Structured queries allow     Queries over anonymous        Only textual queries are
performance      complex joins                nodes are possible            possible
Technology       Based on the relational      Based on RDF and XML          Based on character and
                 database table                                             binary data
Scalability      Very difficult to scale      Scaling is simpler than       More scalable
                 the DB schema                structured data
1.2 Characteristics of Data
➢ Data has three key characteristics:
1) Composition: The composition of data deals with the structure of data,
that is, the sources of data, the granularity, the types, and the nature of
data as to whether it is static or real-time streaming.
2) Condition: The condition of data deals with the state of data, that is,
“Can one use this data as is for analysis?” or “Does it require cleansing
for further enhancement and enrichment?”
3) Context: The context of data deals with “Where has this data been
generated?” “Why was this data generated?” “How sensitive is this data?”
“What are the events associated with this data?” and so on.
➢ Small data (data as it existed prior to the big data revolution) is about
certainty. It is about fairly known data sources; it is about no major
changes to the composition or context of data.
➢ Big data is about complexity in terms of
multiple and unknown datasets, exploding
volume, speed at which the data is being
generated and needs to be processed, and in
terms of the variety of data (internal or
external, behavioral or social) that is being
generated.
1.3 Definition of BIG DATA
➢ Different sources defined Big data in different ways:
➢ Big data is high-volume, high-velocity, and high-variety information
assets that demand cost effective, innovative forms of information
processing for enhanced insight and decision making. (or)
➢ Big data is anything beyond the human and technical infrastructure
needed to support storage, processing, and analysis. (or)
➢ Big data is the term for the collection of datasets so large and complex
that it becomes difficult to process using database system tools and
traditional processing applications
➢ Today’s BIG may be tomorrow’s NORMAL.
➢ The 3Vs (Volume, Velocity, Variety) concept was proposed by the
Gartner analyst Doug Laney
➢ There is no explicit definition of how big a dataset should be for it to
be considered “big data.” Big data is data that is too big, moves too
fast, or does not fit the structures of typical database systems. The
data changes are highly dynamic.
1.4 Challenges with Big Data
➢ Following are a few challenges with big data:
1) Data Generation: Data today is growing at an exponential rate. The
key questions here are: “Will all this data be useful for analysis?”, “Do
we work with all this data or a subset of it?”, “How will we separate
the knowledge from the noise?”, etc.
2) Cloud computing and virtualization: Cloud computing is the answer
to managing infrastructure for big data as far as cost-efficiency,
elasticity, and easy upgrading/downgrading are concerned. However,
concerns such as data security and governance further complicate the
decision to host big data solutions outside the enterprise.
3) Retention: How long should one retain this data? Some data is
useful for making long-term decisions, whereas in a few cases the data
may become irrelevant and obsolete just a few hours after having
been generated.
4) Lack of Talent: There are a lot of Big Data projects in major
organizations, but there is a lack of skilled professionals who possess
a high level of proficiency in data sciences that is vital in
implementing big data solutions.
Challenges with Big Data Contd..
5) Data visualization: Data visualization is becoming popular as a
separate discipline, and there is a shortage of skilled business
visualization experts.
6) Data Quality: The problem is with the veracity of data. The data is
often messy, inconsistent, and incomplete.
7) Discovery: Analyzing petabytes of data using extremely powerful
algorithms to find patterns and insights is very difficult.
8) Storage: The more data an organization has, the more complex the
problems of managing it can become. The question that arises here is
“Where to store it?”. We need a storage system which can easily scale
up or down on-demand
9) Analytics: In the case of Big Data, most of the time we are unaware of
the kind of data we are dealing with, so analyzing that data is even
more difficult.
10) Security: Since the data is huge in size, keeping it secure is another
challenge. It includes user authentication, restricting access based on
a user, recording data access histories, proper use of data encryption
etc.
1.5 Definitional Traits of Big Data /
Characteristics of Big Data
➢ Big data is data that is big in
1) Volume
2) Velocity
3) Variety
➢ Refer Figure 1.5.
1) Volume
➢ Volume refers to the ‘amount of data’, which is growing day by day at
a very fast pace. Whether data can actually be considered big data or
not depends upon the volume of data.
➢ Data is rapidly increasing from GB to TB to PB and beyond.
Sources of big data
1) Typical internal data sources: Data present within an organization’s
firewall. It is as follows:
• Data storage: File systems, SQL (RDBMSs – Oracle, MS SQL
Server, DB2, MySQL, PostgreSQL, etc.), NoSQL (MongoDB,
Cassandra, etc.), and so on.
• Archives: Archives of scanned documents, paper archives, customer
correspondence records, patients’ health records, students’
admission records, students’ assessment records, and so on.
2) External data sources: Data residing outside an organization’s firewall.
It is as follows:
• Public Web: Wikipedia, weather, regulatory, compliance, census,
etc.

Sources of big data Contd..
3) Both (internal + external data sources)
• Sensor data: Car sensors, smart electric meters, office buildings, air
conditioning units, refrigerators, and so on.
• Machine log data: Event logs, application logs, Business process
logs, audit logs, clickstream data, etc.
• Social media: Twitter, blogs, Facebook, LinkedIn, YouTube,
Instagram, etc.
• Business apps: ERP, CRM, HR, Google Docs, and so on.
• Media: Audio, Video, Image, Podcast, etc.
• Docs: Comma separated value (CSV), Word Documents, PDF, XLS,
PPT, and so on.

(Figure: Sources of big data – internal data sources such as data storage and archives;
external data sources such as the public Web; and sources that can be both internal and
external, such as sensor data, machine log data, social media, business apps, media, and
docs, as listed above.)
Big Data Contd..
2) Velocity: Refers to the speed of generation of data, How fast the data
is generated and processed to meet the demands.
We have moved from the days of batch processing (remember our payroll
applications) to real-time processing.
Batch → Periodic → Near real time → Real-time processing
1990: hard disk 1 GB–20 GB, RAM 28 MB, read speed 10 kbps.
3) Variety: Variety deals with a wide range of data types and sources of
data. There are three categories:
1) Structured data: From traditional transaction processing systems and
RDBMS, etc.
2) Semi-structured data: For example Hyper Text Markup Language
(HTML), eXtensible Markup Language (XML).
3) Unstructured data: For example unstructured text documents, audios,
videos, emails, photos, PDFs, social media, etc.

Other Characteristics of Data Which are
not Definitional Traits of Big Data
➢ There are yet other characteristics of data which are not necessarily the
definitional traits of big data. Few of these are listed as follows:
1) Veracity and validity: Veracity refers to biases, noise, and
abnormality in data. The key question here is: “Is all the data that is
being stored, mined, and analyzed meaningful and pertinent to the
problem under consideration?” Validity refers to the accuracy and
correctness of the data. Any data that is picked up for analysis needs
to be accurate. It is not just true about big data alone.
2) Volatility: Volatility of data deals with, how long is the data valid?
And how long should it be stored? There is some data that is required
for long-term decisions and remains valid for longer periods of time.
However, there are also pieces of data that quickly become obsolete
minutes after their generation.
3) Variability: Data flows can be highly inconsistent, with periodic
peaks. Variability is about being able to handle and manage such data
effectively.
4) Value: Value refers to the usefulness of big data. Is it adding to the
benefits of the organizations that are analyzing it?
1.6 Traditional Business Intelligence (BI)
versus Big Data
➢ Some of the differences between traditional BI and big data.
Comparison of
Objectives              Business Intelligence                   Big Data
1) Environment          In a traditional BI environment, all    In a big data environment, data
                        the enterprise’s data is housed in a    resides in a distributed file
                        typical central database server that    system that scales horizontally.
                        scales vertically.
2) Data analyzed        Data is generally analyzed in an        Data is analyzed in both real
                        offline mode.                           time and offline mode.
3) Processing           Traditional BI is about structured      Big data is about variety:
   functions            data, and the data is taken to the      structured, semi-structured, and
                        processing functions (move data to      unstructured data, and the
                        code).                                  processing functions are taken
                                                                to the data (move code to data).
Traditional BI Vs Big Data Contd..
Comparison of
Objectives              Business Intelligence                   Big Data
4) Purpose              To help the business make better        To capture, process, and analyze
                        decisions and deliver accurate          data, both structured and
                        reports by extracting information       unstructured, to improve customer
                        directly from the data source.          outcomes.
5) Ecosystem /          ERP databases, Data Warehouse,          Hadoop, Spark, R Server, Hive,
   components /         Dashboards, Tableau, Qlik Sense,        Polybase, Cassandra, Presto,
   tools                OLAP, Sisense, Data Warehousing.        Cloudera, Storm, etc.
6) Properties /         Location intelligence, Executive        Volume, Velocity, Variety,
   characteristics      Dashboards, “what if” analysis,         Variability, and Veracity.
                        Interactive and Ranking reports,
                        Metadata layer.
Traditional BI Vs Big Data Contd..
Comparison of
Objectives              Business Intelligence                   Big Data
7) Benefits             Helps in making better business         Better decision making, fraud
                        decisions; faster and more accurate     detection; storage, mining, and
                        reporting and analysis; improved        analysis of data; market
                        data quality; reduced costs;            prediction and forecasting; helps
                        increased revenues; improved            in implementing new strategies;
                        operational efficiency, etc.            keeping up with customer trends.
8) Applied fields       Social media, healthcare, gaming        The banking sector, entertainment,
                        industry, food industry, etc.           social media, healthcare, retail
                                                                and wholesale, etc.
1.7 Coexistence of Big Data and
Data Warehouse
➢ A few companies are quite comfortable working with their incumbent data
warehouse for standard BI and analytics reporting, for example the
quarterly sales report, customer dashboard, etc.
➢ However, Hadoop brings to the table the power to perform different types of
analysis on different types of data.
➢ The same operational systems that power the data warehouse can also
populate the big data environment when the data is needed for
computation-rich processing or for raw data exploration.
➢ We cannot ignore the powerful analytics capability of Hadoop or the
revolutionary developments in RDBMS. So, the need of the hour is to
have both the data warehouse and Hadoop co-exist in today’s
environment.
1.8 Realms of Big Data
➢ Three very important reasons why companies should compulsorily
consider leveraging big data:
1) Competitive advantage:
➢ The most important resource with any organization today is their
data.
➢ What they do with it will determine their fate in the market.
2) Decision making:
➢ Decision making has shifted from the hands of the elite few to the
empowered many.
➢ Good decisions play a significant role in furthering customer
engagement, reducing operating margins in retail, cutting cost and
other expenditures in the health sector.
3) Value of data:
➢ The value of data continues to see a steep rise.
➢ As the all-important resource, it is time to look at newer
architecture, tools, and practices to leverage this.
1.9. Big Data Analytics
➢ Raw data is collected, classified, and organized.
➢ Associating it with adequate metadata and laying bare the context
converts this data into meaningful information.
➢ It is then aggregated and summarized so that it becomes easy to
consume it for analysis.
➢ Gradual accumulation of such meaningful information builds a
knowledge repository. This, in turn, helps with actionable insights
which prove useful for decision making. Refer the figure below.

Fig: Transformation of data to yield actionable insights.


What is Big Data Analytics?
Big data analytics is the process of examining big data to uncover
patterns, unearth trends, and find unknown correlations and other useful
information to make faster and better decisions.
1) Technology-enabled analytics: Quite a few data analytics and
visualization tools are available in the market today from leading
vendors such as IBM, Tableau, SAS, R Analytics, Statistica, etc. to help
process and analyze your big data.
2) About gaining a meaningful, deeper, and richer insight into your
business to steer it in the right direction, understanding the
customer’s demographics to cross-sell and up-sell to them, better
leveraging the services of your vendors and suppliers, etc.
3) About a competitive edge over your competitors by enabling you with
findings that allow quicker and better decision-making.
4) A tight handshake between three communities: IT, business users, and
data scientists.
5) Working with datasets whose volume and variety exceed the current
storage, processing capabilities and infrastructure of your enterprise.
6) About moving code to data. This makes perfect sense as the program
for distributed processing is tiny (a few KBs) compared to the data
(TBs/PBs/ZBs).
1.10 Classification of Analytics
➢ Analytics can be classified into Analytics 1.0, Analytics 2.0,
and Analytics 3.0. Refer Table 3.1.

Analytics 1.0                     Analytics 2.0                      Analytics 3.0
Mid-1950s to 2009                 2005 to 2012                       2012 to present
Descriptive statistics (report    Descriptive statistics +           Descriptive + predictive +
on events, occurrences, etc.      predictive statistics (use data    prescriptive statistics (use data
of the past)                      from the past to make              from the past to make prophecies
                                  predictions for the future)        for the future and make
                                                                     recommendations)
Key questions asked:              Key questions asked:               Key questions asked:
What happened?                    What will happen?                  What will happen?
Why did it happen?                Why will it happen?                When will it happen?
                                                                     Why will it happen?
                                                                     What should be the action taken
                                                                     to take advantage of what will
                                                                     happen?
Classification of Analytics Contd..
Analytics 1.0                     Analytics 2.0                      Analytics 3.0
Data from legacy systems,         Big data                           A blend of big data and data from
ERP, CRM, and 3rd party                                              legacy systems, ERP, CRM, and 3rd
applications                                                         party applications
Small and structured data         Big data is being taken up         A blend of big data and
sources; data stored in           seriously; data is mainly          traditional analytics to yield
enterprise data warehouses        unstructured, arriving at a        insights and offerings with speed
or data marts                     much higher pace                   and impact
Data was internally sourced       Data was often externally          Data is both internally and
                                  sourced                            externally sourced
Relational databases              Database appliances, Hadoop        In-memory analytics, in-database
                                  clusters, SQL to Hadoop            processing, agile analytical
                                  environments, etc.                 methods, machine learning
                                                                     techniques, etc.
Classification of Analytics Contd..
Analytics 4.0 → Automated Capabilities
Employing data-mining techniques and machine learning algorithms
along with the existing descriptive-predictive-prescriptive analytics
comes to full fruition in this era.

1.11 Top Challenges Facing Big Data
➢ Following are the various top challenges of big data:
1) Scale: Storage (RDBMS or NoSQL) is one major concern that needs to be
addressed to handle the need for scaling rapidly and elastically. The need of
the hour is a storage system that can best withstand the onslaught of the large
volume, velocity, and variety of big data. Should you scale vertically or should
you scale horizontally?
2) Security: Most of the NoSQL big data platforms have poor security
mechanisms (lack of proper authentication and authorization mechanisms)
when it comes to safeguarding big data.
3) Schema: Rigid schemas have no place. We want the technology to be able to
fit our big data and not the other way around. The need of the hour is a
dynamic schema; static (pre-defined) schemas are outdated.
4) Continuous availability: The big question here is how to provide 24/7
support because almost all RDBMS and NoSQL big data platforms have a
certain amount of downtime built in.
5) Consistency: Should one opt for consistency or eventual consistency?
6) Partition tolerance: How to build partition-tolerant systems that can take care
of both hardware and software failures?
7) Data quality: How to maintain data quality – data accuracy, completeness,
timeliness, etc.? Do we have appropriate metadata in place?
1.12 Terminologies Used in
Big Data Environments
1) In-Memory Analytics: Data access from non-volatile storage such as a
hard disk is a slow process. In in-memory analytics, all the relevant data is
stored in Random Access Memory (RAM) or primary storage, thus
eliminating the need to access the data from the hard disk. The advantages
are faster access, rapid deployment, better insights, and minimal IT
involvement.
2) In-Database Processing (analytics): works by blending data
warehouses with analytical systems. With in-database processing, the
database program itself can run the computations, eliminating the
need for Extraction, Transformation, and Loading (ETL) of data into
the data warehouse and thereby saving time.
3) Symmetric Multiprocessor System (SMP): In SMP, there is a single
common main memory that is shared by two or more identical
processors. The processors have full access to all I/O devices and are
controlled by a single operating system instance. SMP systems are tightly
coupled multiprocessor systems. Each processor has its own high-speed
memory, called cache memory, and the processors are connected using a
system bus.
Terminologies in Big Data Contd..
4) Massively Parallel Processing (MPP): refers to the coordinated
processing of programs by a number of processors working in parallel.
The processors each have their own operating system and dedicated
memory. They work on different parts of the same program and all
the executing segments can communicate with each other.
5) Difference Between Parallel and Distributed Systems: A parallel
database system is a tightly coupled system in which the processors
co-operate for query processing. The user is unaware of the
parallelism since he/she has no access to a specific processor of the
system. Either the processors have access to a common memory or
make use of message passing for communication. Distributed
database systems are known to be loosely coupled and are composed
of individual machines that can run their own applications and
serve their own respective users. The data is usually distributed across
several machines, thereby necessitating quite a number of machines
to be accessed to answer a user query.

Terminologies in Big Data Contd..
6) Shared Nothing Architecture: The three most common types of
architecture for multiprocessor high transaction rate systems are:
1. Shared Memory (SM) architecture: a common central memory is
shared by multiple processors
2. Shared Disk (SD) architecture: multiple processors share a
common collection of disks while having their own private
memory.
3. Shared Nothing (SN) architecture: neither memory nor disk is
shared among multiple processors.
Advantages of a “Shared Nothing Architecture”
1. Fault Isolation: A fault in a single node is contained and
confined to that node exclusively and exposed only through
messages (or lack of it).
2. Scalability: Assume that the disk is a shared resource in which
different nodes will have to take turns to access the critical
data. This imposes a limit on how many nodes can be added to
the distributed shared disk system, thus compromising on
scalability.
Terminologies in Big Data Contd..
7) CAP Theorem (Brewer’s Theorem): states that in a distributed
computing environment (a collection of interconnected nodes that
share data), it is impossible to provide all three of the following
guarantees simultaneously; at any time only two can be met and one
must be sacrificed.
1. Consistency implies that every read fetches the last write.
2. Availability implies that reads and writes always succeed.
3. Partition tolerance implies that the system will continue to
function when network partition occurs.

1.12.1 Atomicity, Consistency, Isolation and
Durability (ACID)
➢ The key ACID guarantee is that it provides a safe environment in which to
operate on your data. The ACID acronym stands for:
➢ Atomicity: Either the task (or all tasks) within a transaction are performed
or none of them are. This is the all-or-none principle. If one element of a
transaction fails, the entire transaction fails.
➢ Consistency: The transaction must meet all protocols or rules defined by
the system at all times. The transaction does not violate those protocols
and the database must remain in a consistent state at the beginning and
end of a transaction; there are never any half-completed transactions.
➢ Isolation: No transaction has access to any other transaction that is in an
intermediate or unfinished state. Thus, each transaction is independent
unto itself.
➢ Durability: Once the transaction is complete, it will persist as complete
and cannot be undone; it will survive system failure, power loss and
other types of system breakdowns.
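A minimal single-node sketch of atomicity, consistency, and durability (not a big data system), using Python's built-in sqlite3 module and an invented account table: a money transfer either commits as a whole or is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
cur.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

def transfer(amount):
    """Move `amount` from A to B as one all-or-nothing transaction."""
    try:
        cur.execute("UPDATE account SET balance = balance - ? WHERE name = 'A'", (amount,))
        cur.execute("UPDATE account SET balance = balance + ? WHERE name = 'B'", (amount,))
        if cur.execute("SELECT balance FROM account WHERE name = 'A'").fetchone()[0] < 0:
            raise ValueError("insufficient funds")   # would violate a consistency rule
        conn.commit()        # durability: the committed change persists
    except Exception:
        conn.rollback()      # atomicity: undo every partial change of the transaction

transfer(30)    # succeeds: A = 70, B = 80
transfer(500)   # fails and rolls back: balances stay A = 70, B = 80
print(cur.execute("SELECT * FROM account ORDER BY name").fetchall())
```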
1.12.2 Basically Available Soft State Eventual
Consistency (BASE)
➢ Basically Available: This constraint states that the system does
guarantee the availability of the data as regards CAP Theorem; there
will be a response to any request. But, that response could still be
‘failure’ to obtain the requested data or the data may be in an
inconsistent or changing state, much like waiting for a check to clear
in your bank account.
➢ Soft state: The state of the system could change over time, so even
during times without input there may be changes going on due to
‘eventual consistency,’ thus the state of the system is always ‘soft.’
➢ Eventual consistency: The system will eventually become consistent
once it stops receiving input. The data will propagate to everywhere it
should sooner or later, but the system will continue to receive input
and is not checking the consistency of every transaction before it
moves onto the next one.
➢ Google’s BigTable, Amazon’s Dynamo, and Facebook’s Cassandra are a few
such big examples that tolerate a temporary loss of consistency and still
maintain system reliability with the help of distributed data systems that
provide them BASE properties.
1.13 Few Top Data Analytics Tools
➢ Data analytics tools offer diverse capabilities that help organizations in the
industry identify trends and understand their customer base. Data analytics
software can track and analyze data, allowing users to create actionable
reports and dashboards.
➢ Below is a list of a few top analytics tools.
1. Excel - Microsoft Excel is the world’s best-known commercial
spreadsheet software for data collection and analysis. With lots of
useful functions and plug-ins, Excel is one of the easiest ways to store
data, create data visualizations and calculations, clean data, and report
data in an understandable manner, which helps in decision making. Its
limitations are that it is expensive, prone to calculation errors, and poor
at handling big data.
2. Python - is an open-source, high-level, object-oriented
programming language with thousands of free libraries. It is easy to
learn, highly versatile, widely used, and supports multiple file formats
for data visualization, data masking, merging, indexing and grouping
data, data cleaning, etc. Python’s main drawback is its speed: it is
memory intensive and slower than many languages.
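For example, a few lines of pandas (one of Python's free libraries) can clean, group, and summarize a small dataset; the columns and values below are invented for illustration and assume pandas is installed.

```python
import pandas as pd

# A tiny, messy dataset: one missing amount and inconsistent city spellings.
df = pd.DataFrame({
    "city": ["Hyderabad", "hyderabad", "Chennai", "Chennai"],
    "amount": [2500, None, 1800, 2200],
})

df["city"] = df["city"].str.title()                       # normalize text values
df["amount"] = df["amount"].fillna(df["amount"].mean())   # fill the missing value

summary = df.groupby("city")["amount"].sum()              # group and aggregate
print(summary)
```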
Top Data Analytics Tools Contd..
3. Hadoop – is a Java-based open-source software
framework to store data and run applications in parallel on clusters of
commodity hardware. It can process both structured and unstructured
data, scale from one server to multiple computers, and offers
cross-platform support for its users.
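As a hedged illustration of how applications run on such clusters, a word-count job for Hadoop Streaming can be written as a small Python script that reads from standard input and writes tab-separated key/value pairs; the file name and the exact streaming command depend on the cluster and are assumptions here.

```python
#!/usr/bin/env python3
"""Word count for Hadoop Streaming. Run this file once as the mapper and once as
the reducer, e.g. `wordcount.py map` and `wordcount.py reduce` (the script name
and the hadoop-streaming command line are illustrative assumptions)."""
import sys
from itertools import groupby

def map_stdin():
    # Mapper: emit "word<TAB>1" for every word read from standard input.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word.lower()}\t1")

def reduce_stdin():
    # Reducer: Hadoop sorts mapper output by key, so equal words arrive together.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    map_stdin() if len(sys.argv) > 1 and sys.argv[1] == "map" else reduce_stdin()
```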

4. Tableau - is a popular commercial Business Intelligence tool for
fast analytics, great visualizations, interactivity, and mobile support,
able to explore any type of data – spreadsheets, databases, data on
Hadoop, and cloud services. It has a visual drag-and-drop interface. Its
limitations are poor version control and no data pre-processing.

5. Spark - is a free, open-source, integrated analytics engine
for real-time processing and analysis of large amounts of data, aimed at
developers, researchers, and data scientists. It is a framework that
supports applications while maintaining MapReduce’s scalability and
fault tolerance. Its drawbacks are no file management system and a
rigid user interface.
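A tiny PySpark sketch (assuming the pyspark package is installed locally; the log lines are invented) that distributes data, filters it in parallel, and counts matching records.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("log-errors").master("local[*]").getOrCreate()

logs = [
    "INFO  service started",
    "ERROR disk full on node-3",
    "INFO  heartbeat ok",
    "ERROR timeout talking to node-7",
]

# Distribute the data, filter it in parallel, and count the error lines.
rdd = spark.sparkContext.parallelize(logs)
errors = rdd.filter(lambda line: line.startswith("ERROR"))
print(errors.count())   # 2

spark.stop()
```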

Top Data Analytics Tools Contd..
6. Power BI - is commercial software with Business Intelligence
capabilities ranging from interactive data visualization to predictive
analytics. It offers great data connectivity and regular updates, and it
operates seamlessly with Excel, text files, SQL Server, and cloud sources.
It also has limitations like a clunky user interface, rigid formulas, and
data limits (in the free version).
7. MongoDB - is a free, open-source platform and a document-oriented
(NoSQL) schema-less database that is used to store a high volume of
data. It stores data as collections of documents made up of key-value pairs.
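A short sketch with the pymongo driver (assuming it is installed and a MongoDB server is running locally; the database, collection, and documents are invented): documents in the same collection need not share the same fields.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["logdb"]          # database (created lazily)
events = db["events"]         # collection of schema-less documents

# Documents in the same collection need not share the same fields.
events.insert_one({"level": "ERROR", "node": "node-3", "msg": "disk full"})
events.insert_one({"level": "INFO", "service": "auth", "latency_ms": 12})

# Query documents by a key-value condition.
for doc in events.find({"level": "ERROR"}):
    print(doc)
```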
8. Cassandra - is a free, open-source NoSQL distributed
database that is designed to handle large amounts of data. It offers high
scalability and availability with data storage flexibility, a data
distribution system, fast processing, and fault tolerance.
9. Talend - is a data integration (ETL) tool that combines data integration,
data quality, and data governance in a single, low-code platform that works
with virtually any data source and data architecture.
10. QlikView - is a self-service Business Intelligence, data visualization,
and data analytics tool that supports data literacy and executive dashboards.