1 Bda A6515 Intro Bda
UNIT-1
INTRODUCTION TO BIG DATA
BHANU PRASAD ANDRAJU
Associate Professor, Dept. of CSE
a.bhanuprasad@vardhaman.org
andrajub4u@gmail.com
9885990509
1.1 Classification of Digital Data
Digital data can be classified into three types:
1) Structured data:
➢ Data that is in an organized form (e.g., in rows and columns) and can be easily used by a computer program.
➢ Relationships exist between entities of data, such as classes and their objects.
➢ Ex: data stored in databases
2) Semi-structured data:
➢ This is the data which does not conform to a data model but has some
structure. However, it is not in a form which can be used easily by a
computer program.
➢ Ex: emails, XML, markup languages like HTML, etc. Metadata for this
data is available but is not sufficient.
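As a small illustration, semi-structured data such as XML carries self-describing tags but no enforced schema. The sketch below, using a made-up email record, shows how the structure can be read and how an absent field simply yields nothing rather than violating a schema:

```python
# A minimal sketch showing why XML is "semi-structured": the tags give it
# structure, but the schema is implicit -- fields may or may not be present.
# (The sample email fields here are hypothetical.)
import xml.etree.ElementTree as ET

doc = """
<email>
  <from>alice@example.com</from>
  <to>bob@example.com</to>
  <body>Meeting moved to 3 PM.</body>
</email>
"""

root = ET.fromstring(doc)
# Tags act as self-describing metadata, but nothing enforces their presence.
for child in root:
    print(child.tag, "->", child.text)
subject = root.findtext("subject")   # absent field: no error, just None
print("subject:", subject)
```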
3) Unstructured data:
➢ This is the data which does not conform to a data model or is not in a form
which can be easily used by a computer program.
➢ About 80–90% of an organization's data is in this format.
➢ Ex: memos, chat logs, PowerPoint presentations, images, videos, letters,
research papers, white papers, the body of an email, etc.
1.1.1 Structured Data
➢ Structured Data: when data conforms to a pre-defined schema/structure we
say it is structured data.
➢ Most of the structured data is held in RDBMS. An RDBMS conforms to the
relational data model wherein the data is stored in rows/columns. Refer Table
1.1.
➢ The number of rows/records/tuples in a relation is called the cardinality of a
relation. The number of columns is referred to as the degree of a relation.
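The cardinality and degree of a relation can be seen directly with a small SQL example. The sketch below uses SQLite (bundled with Python) and a hypothetical student relation:

```python
# A minimal sketch of structured data in an RDBMS, using a hypothetical
# "student" relation. Cardinality = number of rows; degree = number of columns.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_no INTEGER, name TEXT, branch TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Asha", "CSE"), (2, "Ravi", "ECE"), (3, "Meena", "CSE")],
)

cardinality = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]
degree = len(conn.execute("SELECT * FROM student").description)
print("cardinality:", cardinality)   # 3 rows
print("degree:", degree)             # 3 columns
```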
Sources of Structured data
➢ Oracle, IBM-DB2, Microsoft SQL server,
MySQL(open source)
➢ Online transaction processing (OLTP):
transactional/operational data from day-to-day
business activities, e.g., online banking,
online shopping
➢ Uses simple queries
➢ Requires read/write operations
➢ Size is smaller: typically 100 MB to 10 GB
How to Deal with Unstructured Data?
➢ The following techniques are used to find patterns in or
interpret unstructured data:
1) Data mining: We use methods at the intersection of artificial
intelligence, machine learning, statistics, and database systems to
unearth consistent patterns in large data sets and/or systematic
relationships between variables. It is the analysis step of the
“knowledge discovery in databases” process. A few popular data mining
algorithms are as follows:
▪ Association rule mining: It is also called “market basket analysis”
or “affinity analysis”. It identifies products that tend to be bought
together: when you buy a product, what other product are you
likely to purchase with it?
▪ Regression analysis: It helps to predict the relationship between
two variables. The variable whose value needs to be predicted is
called the dependent variable and the variables which are used to
predict the value are referred to as the independent variables.
▪ Collaborative filtering: It is about predicting a user’s preference or
preferences based on the preferences of a group of users.
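To make regression analysis concrete, the sketch below fits a line to toy data using the ordinary least-squares formulas for one independent variable; the advertising-spend example is purely illustrative:

```python
# A minimal sketch of regression analysis with one independent variable,
# using the ordinary least-squares formulas (toy data, no libraries).
def fit_line(xs, ys):
    """Return slope and intercept for y ~ slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    return slope, mean_y - slope * mean_x

# Hypothetical data: advertising spend (independent) vs sales (dependent).
spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]           # exactly sales = 2*spend + 1
slope, intercept = fit_line(spend, sales)
print(slope, intercept)             # 2.0 1.0
```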
Deal with Unstructured Data Contd..
2) Text analytics or text mining: Text is largely unstructured,
amorphous, and difficult to deal with algorithmically. Text mining is
the process of gleaning (collecting) high quality and meaningful
information (through devising of patterns and trends by means of
statistical pattern learning) from text. It includes tasks such as text
categorization, text clustering, sentiment analysis, concept/entity
extraction, etc.
3) Natural language processing (NLP): It is related to the area of human
computer interaction. It is about enabling computers to understand
human or natural language input.
4) Noisy text analytics: It is the process of extracting structured or semi-
structured information from noisy unstructured data such as chats,
blogs, wikis, emails, message-boards, text messages, etc. The noisy
unstructured data usually comprises one or more of the following:
Spelling mistakes, abbreviations, acronyms, non-standard words,
missing punctuation, missing letter case, filler words such as “uh”,
“um”, etc.
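A minimal flavor of noisy text analytics: the sketch below normalizes a noisy message using a small abbreviation map and filler-word list, both purely illustrative (a real system would use much larger dictionaries and statistical models):

```python
# A minimal sketch of noisy text clean-up: expand a few common abbreviations,
# drop filler words, and normalize case. The abbreviation map is illustrative.
import re

ABBREV = {"u": "you", "r": "are", "pls": "please", "thx": "thanks"}
FILLERS = {"uh", "um", "er"}

def clean(text):
    words = re.findall(r"[a-z']+", text.lower())
    words = [ABBREV.get(w, w) for w in words if w not in FILLERS]
    return " ".join(words)

print(clean("uh, pls tell me when u r free... thx!"))
# please tell me when you are free thanks
```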
Deal with Unstructured Data Contd..
5) Manual tagging with metadata: This is about tagging manually with
adequate metadata to provide the requisite semantics to understand
unstructured data.
6) Part-of-speech tagging: It is also called POS or POST or grammatical
tagging. It is the process of reading text and tagging each word in the
sentence as belonging to a particular part of speech such as “noun”,
“verb”, “adjective”, etc.
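A toy illustration of POS tagging: real taggers use trained statistical models, but the sketch below shows the idea with a tiny hand-made lexicon and a crude suffix rule (both purely illustrative):

```python
# A minimal rule-based POS tagging sketch (a real tagger would use a trained
# model; the lexicon and suffix heuristics here are illustrative only).
LEXICON = {"the": "DET", "a": "DET", "cat": "NOUN", "dog": "NOUN",
           "sat": "VERB", "runs": "VERB", "quickly": "ADV"}

def pos_tag(sentence):
    tags = []
    for word in sentence.lower().split():
        if word in LEXICON:
            tags.append((word, LEXICON[word]))
        elif word.endswith("ly"):
            tags.append((word, "ADV"))       # crude suffix heuristic
        else:
            tags.append((word, "NOUN"))      # default guess
    return tags

print(pos_tag("The cat sat quickly"))
# [('the', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB'), ('quickly', 'ADV')]
```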
7) Unstructured Information Management Architecture (UIMA): It is an
open source platform from IBM. It is used for real-time content
analytics. It is about processing text and other unstructured data to
find latent meaning and relevant relationships buried therein.
➢ In contrast to structured data, which is stored in data warehouses,
unstructured data is placed in data lakes, which preserve the raw format
of the data and all of the information it holds.
➢ In warehouses, the data is limited to its defined schema. This is not true
of lakes, which keep the data more malleable.
Differences between
Structured, Semi-structured and Unstructured data
Flexibility:
• Structured data: schema-dependent and less flexible.
• Semi-structured data: more flexible than structured data but less than unstructured data.
• Unstructured data: flexible in nature; there is an absence of a schema.

Transaction management:
• Structured data: matured transaction and various concurrency techniques.
• Semi-structured data: transaction management adapted from the DBMS, not matured.
• Unstructured data: no transaction management and no concurrency.

Query performance:
• Structured data: structured queries allow complex joins.
• Semi-structured data: queries over anonymous nodes are possible.
• Unstructured data: only textual queries are possible.

Technology:
• Structured data: based on the relational database table.
• Semi-structured data: based on RDF and XML.
• Unstructured data: based on character and binary data.

Scalability:
• Structured data: scaling the DB schema is very difficult.
• Semi-structured data: scaling is simpler than for structured data.
• Unstructured data: more scalable.
1.2 Characteristics of Data
➢ Data has three key characteristics:
1) Composition: The composition of data deals with the structure of data,
that is, the sources of data, the granularity, the types, and the nature of
data as to whether it is static or real-time streaming.
2) Condition: The condition of data deals with the state of data, that is,
“Can one use this data as is for analysis?” or “Does it require cleansing
for further enhancement and enrichment?”
3) Context: The context of data deals with “Where has this data been
generated?” “Why was this data generated?” “How sensitive is this data?”
“What are the events associated with this data?” and so on.
➢ Small data (data as it existed prior to the big data revolution) is about
certainty. It is about fairly known data sources; it is about no major
changes to the composition or context of data.
➢ Big data is about complexity: in terms of multiple and unknown datasets,
exploding volume, the speed at which the data is being generated and
needs to be processed, and the variety of data (internal or external,
behavioral or social) that is being generated.
1.3 Definition of BIG DATA
➢ Different sources defined Big data in different ways:
➢ Big data is high-volume, high-velocity, and high-variety information
assets that demand cost effective, innovative forms of information
processing for enhanced insight and decision making. (or)
➢ Big data is anything beyond the human and technical infrastructure
needed to support storage, processing, and analysis. (or)
➢ Big data is the term for the collection of datasets so large and complex
that it becomes difficult to process using database system tools and
traditional processing applications
➢ Today’s BIG may be tomorrow’s NORMAL.
➢ The 3Vs (Volume, Velocity, Variety) concept was proposed by the
Gartner analyst Doug Laney
➢ There is no explicit definition of how big a dataset should be for it to
be considered “big data.” Big data is data that is too big, moves too fast,
or does not fit the structures of typical database systems. The data
changes are highly dynamic.
1.4 Challenges with Big Data
➢ Following are a few challenges with big data:
1) Data Generation: Data today is growing at an exponential rate. The
key questions here are: “Will all this data be useful for analysis?”, “Do
we work with all this data or a subset of it?”, “How will we separate
the knowledge from the noise?”, etc.
2) Cloud computing and virtualization: Cloud computing is the answer
to managing infrastructure for big data as far as cost-efficiency,
elasticity, and easy upgrading/downgrading are concerned. However,
security concerns complicate the decision to host big data solutions
outside the enterprise.
3) Retention: How long should one retain this data? Some data is
useful for making long-term decisions, whereas in a few cases the data
may become irrelevant and obsolete just a few hours after having
been generated.
4) Lack of Talent: There are a lot of big data projects in major
organizations, but there is a lack of skilled professionals who possess
the high level of proficiency in data science that is vital in
implementing big data solutions.
Challenges with Big Data Contd..
5) Data visualization: Visualization is becoming popular as a separate
discipline, and there is a shortage of business visualization experts.
6) Data Quality: The problem is with the veracity of data: the data is
messy, inconsistent, and incomplete.
7) Discovery: Analyzing petabytes of data using extremely powerful
algorithms to find patterns and insights is very difficult.
8) Storage: The more data an organization has, the more complex the
problems of managing it can become. The question that arises here is
“Where to store it?”. We need a storage system which can easily scale
up or down on demand.
9) Analytics: In the case of Big Data, most of the time we are unaware of
the kind of data we are dealing with, so analyzing that data is even
more difficult.
10) Security: Since the data is huge in size, keeping it secure is another
challenge. It includes user authentication, restricting access based on
a user, recording data access histories, proper use of data encryption,
etc.
1.5 Definitional Traits of Big Data /
Characteristics of Big Data
➢ Big data is data that is big in
1) Volume
2) Velocity
3) Variety
➢ Refer Figure 1.5.
1) Volume
➢ Volume refers to the ‘amount of data’, which is growing day by day at
a very fast pace. Whether data can actually be considered big data or
not depends on its volume.
➢ Data is rapidly increasing: GB, TB, PB, ….
Sources of big data
1) Typical internal data sources: Data present within an organization’s
firewall. It is as follows:
• Data storage: File systems, SQL (RDBMSs – Oracle, MS SQL
Server, DB2, MySQL, PostgreSQL, etc.), NoSQL (MongoDB,
Cassandra, etc.), and so on.
• Archives: Archives of scanned documents, paper archives, customer
correspondence records, patients’ health records, students’
admission records, students’ assessment records, and so on.
2) External data sources: Data residing outside an organization’s firewall.
It is as follows:
• Public Web: Wikipedia, weather, regulatory, compliance, census,
etc.
Sources of big data Contd..
3) Both (internal + external data sources)
• Sensor data: Car sensors, smart electric meters, office buildings, air
conditioning units, refrigerators, and so on.
• Machine log data: Event logs, application logs, Business process
logs, audit logs, clickstream data, etc.
• Social media: Twitter, blogs, Facebook, LinkedIn, YouTube,
Instagram, etc.
• Business apps: ERP, CRM, HR, Google Docs, and so on.
• Media: Audio, Video, Image, Podcast, etc.
• Docs: Comma separated value (CSV), Word Documents, PDF, XLS,
PPT, and so on.
Other Characteristics of Data Which are
not Definitional Traits of Big Data
➢ There are yet other characteristics of data which are not necessarily the
definitional traits of big data. A few of these are listed as follows:
1) Veracity and validity: Veracity refers to biases, noise, and
abnormality in data. The key question here is: “Is all the data that is
being stored, mined, and analyzed meaningful and pertinent to the
problem under consideration?” Validity refers to the accuracy and
correctness of the data. Any data that is picked up for analysis needs
to be accurate. It is not just true about big data alone.
2) Volatility: Volatility of data deals with how long the data is valid
and how long it should be stored. There is some data that is required
for long-term decisions and remains valid for longer periods of time.
However, there are also pieces of data that quickly become obsolete
minutes after their generation.
3) Variability: Data flows can be highly inconsistent, with periodic
peaks. The challenge is being able to handle and manage such data
effectively.
4) Value: Is the big data adding to the benefits of the organizations
that are analyzing it?
1.6 Traditional Business Intelligence (BI)
versus Big Data
➢ Some of the differences between traditional BI and big data.
1) Environment:
• Business Intelligence: In a traditional BI environment, all the enterprise's data is housed in a central database server that scales vertically.
• Big Data: In a big data environment, data resides in a distributed file system that scales horizontally.

2) Data analyzed:
• Business Intelligence: Data is generally analyzed in an offline mode.
• Big Data: Data is analyzed in both real time and offline mode.

3) Processing functions:
• Business Intelligence: Traditional BI is about structured data, and the data is taken to the processing functions (move data to code).
• Big Data: Big data is about variety (structured, semi-structured, and unstructured data), and the processing functions are taken to the data (move code to data).
Traditional BI Vs Big Data Contd..
4) Purpose:
• Business Intelligence: To help the business make better decisions and deliver accurate reports by extracting information directly from the data source.
• Big Data: To capture, process, and analyze data, both structured and unstructured, to improve customer outcomes.

5) Ecosystem / Components / Tools:
• Business Intelligence: ERP databases, data warehouses, dashboards; Tableau, Qlik Sense, OLAP, Sisense.
• Big Data: Hadoop, Spark, R Server; Hive, PolyBase; Cassandra, Presto; Cloudera, Storm, etc.

6) Properties / Characteristics:
• Business Intelligence: Location intelligence, executive dashboards, “what if” analysis, interactive and ranking reports, metadata layer.
• Big Data: Volume, velocity, variety, variability, and veracity.
Traditional BI Vs Big Data Contd..
7) Benefits:
• Business Intelligence: Helps in making better business decisions; faster and more accurate reporting and analysis; improved data quality; reduced costs; increased revenues; improved operational efficiency, etc.
• Big Data: Better decision making; fraud detection; storage, mining, and analysis of data; market prediction and forecasting; helps in implementing new strategies; keeping up with customer trends.

8) Applied fields:
• Business Intelligence: Social media, healthcare, the gaming industry, the food industry, etc.
• Big Data: The banking sector, entertainment and social media, healthcare, retail and wholesale, etc.
1.7 Coexistence of Big Data and
Data Warehouse
➢ A few companies are quite comfortable working with their incumbent data
warehouses for standard BI and analytics reporting, for example
quarterly sales reports, customer dashboards, etc.
➢ Hadoop brings to the table the power to perform different types of
analysis on different types of data.
➢ The same operational systems that power the data warehouse can also
populate the big data environment when they are needed for
computation-rich processing or for raw data exploration.
➢ We cannot ignore the powerful analytics capability of Hadoop and the
revolutionary developments in RDBMS. So, the need of the hour is to
have both data warehouse and Hadoop co-exist in today’s
environment.
1.8 Realms of Big Data
➢ Three very important reasons why companies should compulsorily
consider leveraging big data:
1) Competitive advantage:
➢ The most important resource with any organization today is their
data.
➢ What they do with it will determine their fate in the market.
2) Decision making:
➢ Decision making has shifted from the hands of the elite few to the
empowered many.
➢ Good decisions play a significant role in furthering customer
engagement, reducing operating margins in retail, cutting cost and
other expenditures in the health sector.
3) Value of data:
➢ The value of data continues to see a steep rise.
➢ As the all-important resource, it is time to look at newer
architecture, tools, and practices to leverage this.
1.9. Big Data Analytics
➢ Raw data is collected, classified, and organized.
➢ Associating it with adequate metadata and laying bare the context
converts this data into meaningful information.
➢ It is then aggregated and summarized so that it becomes easy to
consume it for analysis.
➢ Gradual accumulation of such meaningful information builds a
knowledge repository. This, in turn, helps with actionable insights
which prove useful for decision making. Refer Figure 3.1.
1.11 Top Challenges Facing Big Data
➢ Following are the various top challenges of big data:
1) Scale: Storage (RDBMS or NoSQL) is one major concern that needs to be
addressed to handle the need for scaling rapidly and elastically. The need of
the hour is a storage system that can best withstand the onslaught of the
large volume, velocity, and variety of big data. Should you scale vertically
or should you scale horizontally?
2) Security: Most of the NoSQL big data platforms have poor security
mechanisms (lack of proper authentication and authorization mechanisms)
when it comes to safeguarding big data.
3) Schema: Rigid schemas have no place. We want the technology to be able to
fit our big data and not the other way around. The need of the hour is
dynamic schemas; static (pre-defined) schemas are outdated.
4) Continuous availability: The big question here is how to provide 24/7
support because almost all RDBMS and NoSQL big data platforms have a
certain amount of downtime built in.
5) Consistency: Should one opt for consistency or eventual consistency?
6) Partition tolerant: How to build partition tolerant systems that can take care
of both hardware and software failures?
7) Data quality: How to maintain data quality – data accuracy, completeness,
timeliness, etc.? Do we have appropriate metadata in place?
1.12 Terminologies Used in
Big Data Environments
1) In-Memory Analytics: Data access from non-volatile storage such as a
hard disk is a slow process. In in-memory analytics, all the relevant data
is stored in Random Access Memory (RAM) or primary storage, thus
eliminating the need to access the data from the hard disk. The advantages
are faster access, rapid deployment, better insights, and minimal IT
involvement.
2) In-Database Processing (analytics): works by blending data
warehouses with analytical systems. With in-database processing, the
database program itself can run the computations eliminating the
need for Extraction Transformation and Loading data into data
warehouse and thereby saving on time.
3) Symmetric Multiprocessor System (SMP): In SMP, there is a single
common main memory that is shared by two or more identical
processors. The processors have full access to all I/O devices and are
controlled by a single operating system instance. SMPs are tightly
coupled multiprocessor systems. Each processor has its own high-speed
memory, called cache memory, and the processors are connected using
a system bus.
Terminologies in Big Data Contd..
4) Massively Parallel Processing (MPP): refers to the coordinated
processing of programs by a number of processors working in parallel.
The processors each have their own operating system and dedicated
memory. They work on different parts of the same program, and all
the executing segments can communicate with each other.
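The MPP idea of splitting one program across processors, each with its own memory, can be sketched with Python's multiprocessing module: each worker process computes a partial result on its own slice of the data, and the partial results are combined at the end:

```python
# A minimal sketch of the MPP idea: split one job into parts, let separate
# worker processes (each with its own memory) handle a part, then combine.
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker runs this independently on its own slice of the data.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000))
    chunks = [data[i::4] for i in range(4)]   # 4 parts for 4 workers
    with Pool(processes=4) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total == sum(x * x for x in data))  # True
```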
5) Difference Between Parallel and Distributed Systems: A parallel
database system is a tightly coupled system in which the processors
co-operate for query processing. The user is unaware of the
parallelism since he/she has no access to a specific processor of the
system. Either the processors have access to a common memory or
make use of message passing for communication. Distributed
database systems are known to be loosely coupled and are composed
of individual machines that can each run their own applications and
serve their own respective users. The data is usually distributed across
several machines, thereby necessitating that quite a number of machines
be accessed to answer a user query.
Terminologies in Big Data Contd..
6) Shared Nothing Architecture: The three most common types of
architecture for multiprocessor high transaction rate systems are:
1. Shared Memory (SM) architecture: a common central memory is
shared by multiple processors
2. Shared Disk (SD) architecture: multiple processors share a
common collection of disks while having their own private
memory.
3. Shared Nothing (SN) architecture: neither memory nor disk is
shared among multiple processors.
Advantages of a “Shared Nothing Architecture”
1. Fault Isolation: A fault in a single node is contained and
confined to that node exclusively and exposed only through
messages (or the lack of them).
2. Scalability: Assume that the disk is a shared resource in which
different nodes will have to take turns to access the critical
data. This imposes a limit on how many nodes can be added to
the distributed shared disk system, thus compromising on
scalability.
Terminologies in Big Data Contd..
7) CAP Theorem (Brewer's Theorem): states that in a distributed
computing environment (a collection of interconnected nodes that
share data), it is impossible to simultaneously provide all three of the
following guarantees; one must be sacrificed.
1. Consistency implies that every read fetches the last write.
2. Availability implies that reads and writes always succeed.
3. Partition tolerance implies that the system will continue to
function when network partition occurs.
1.12.1 Atomicity, Consistency, Isolation and
Durability (ACID)
➢ The key ACID guarantee is that it provides a safe environment in which to
operate on your data. The ACID acronym stands for:
➢ Atomicity: Either the task (or all tasks) within a transaction are performed
or none of them are. This is the all-or-none principle. If one element of a
transaction fails, the entire transaction fails.
➢ Consistency: The transaction must meet all protocols or rules defined by
the system at all times. The transaction does not violate those protocols
and the database must remain in a consistent state at the beginning and
end of a transaction; there are never any half-completed transactions.
➢ Isolation: No transaction has access to any other transaction that is in an
intermediate or unfinished state. Thus, each transaction is independent
unto itself.
➢ Durability: Once the transaction is complete, it will persist as complete
and cannot be undone; it will survive system failure, power loss, and
other types of system breakdowns.
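The all-or-none principle of atomicity can be demonstrated with SQLite (bundled with Python): a simulated failure in the middle of a transfer causes the whole transaction to roll back, so no half-completed transfer is ever visible. The account data here is purely illustrative.

```python
# A minimal sketch of atomicity: a failed transfer rolls back entirely,
# so the database never shows a half-completed transaction.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('A', 100), ('B', 0)")
conn.commit()

try:
    with conn:  # the connection context manager wraps one transaction
        conn.execute("UPDATE account SET balance = balance - 50 WHERE name = 'A'")
        raise RuntimeError("power loss mid-transfer")  # simulated failure
        conn.execute("UPDATE account SET balance = balance + 50 WHERE name = 'B'")
except RuntimeError:
    pass  # the transaction was rolled back automatically

balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)  # {'A': 100, 'B': 0} -- the debit was undone
```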
1.12.2 Basically Available Soft State Eventual
Consistency (BASE)
➢ Basically Available: This constraint states that the system does
guarantee the availability of the data as regards CAP Theorem; there
will be a response to any request. But, that response could still be
‘failure’ to obtain the requested data or the data may be in an
inconsistent or changing state, much like waiting for a check to clear
in your bank account.
➢ Soft state: The state of the system could change over time, so even
during times without input there may be changes going on due to
‘eventual consistency,’ thus the state of the system is always ‘soft.’
➢ Eventual consistency: The system will eventually become consistent
once it stops receiving input. The data will propagate to everywhere it
should sooner or later, but the system will continue to receive input
and is not checking the consistency of every transaction before it
moves onto the next one.
➢ Google's BigTable, Amazon's Dynamo, and Facebook's Cassandra are a few
big examples that deal with a loss of consistency and still maintain
system reliability with the help of distributed data systems which
provide them BASE properties.
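A toy simulation of eventual consistency (not any real system's API): a write is acknowledged after reaching one replica, other replicas serve stale reads for a while ("soft state"), and a background sync eventually makes all replicas agree:

```python
# A toy sketch of eventual consistency: writes land on one replica and
# propagate asynchronously; reads may be stale until replication catches up.
class Replica:
    def __init__(self):
        self.data = {}

class EventualStore:
    def __init__(self, n=3):
        self.replicas = [Replica() for _ in range(n)]
        self.pending = []                     # replication queue

    def write(self, key, value):
        self.replicas[0].data[key] = value    # ack after one replica ("basically available")
        self.pending.append((key, value))

    def read(self, key, replica=1):
        return self.replicas[replica].data.get(key)   # may be stale ("soft state")

    def replicate(self):                      # background sync
        for key, value in self.pending:
            for r in self.replicas:
                r.data[key] = value
        self.pending.clear()

store = EventualStore()
store.write("x", 42)
print(store.read("x"))    # None -- replica 1 has not caught up yet
store.replicate()
print(store.read("x"))    # 42  -- now eventually consistent
```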
1.13 Few Top Data Analytics Tools
➢ Data analytics tools offer diverse capabilities that help organizations
identify trends and understand their customer base. Data analytics
software can track and analyze data, allowing users to create actionable
reports and dashboards.
➢ Below are the list of few top analytics tools.
1. Excel - Microsoft Excel is the world's best-known commercial
spreadsheet software for data collection and analysis. With lots of
useful functions and plug-ins, Excel is one of the easiest ways to store
data, create data visualizations and calculations, clean data, and report
data in an understandable manner, which helps in decision making. Its
limitations include cost, calculation errors, and poor handling of big data.
2. Python - is an open-source, high-level, object-oriented
programming language with thousands of free libraries. It is easy to
learn, highly versatile, and widely used; it supports multiple file formats
and tasks such as data visualization, data masking, merging, indexing
and grouping data, data cleaning, etc. Python's main drawback is its
speed: it is memory intensive and slower than many languages.
Top Data Analytics Tools Contd..
3. Hadoop - is a Java-based open-source software
framework used to store data and run applications in parallel on clusters
of commodity hardware. It can process both structured and unstructured
data, scales from one server to multiple computers, and offers
cross-platform support for its users.
Top Data Analytics Tools Contd..
6. Power BI - is commercial software with Business Intelligence
capabilities ranging from interactive data visualization to predictive
analytics. It offers great data connectivity and regular updates, and it
operates seamlessly with Excel, text files, SQL Server, and cloud sources.
It also has limitations such as a clunky user interface, rigid formulas,
and data limits (in the free version).
7. MongoDB - is a free, open-source platform and a document-oriented
(NoSQL) schema-less database that is used to store a high volume of
data. It stores data as collections of documents made up of key-value pairs.
8. Cassandra- is a free, open-source NoSQL distributed
database that is used to fetch large amounts of data. It has high
scalability and availability with Data Storage Flexibility, Data
Distribution System, Fast Processing and Fault-tolerance.
9. Talend - data integration ETL, combines data integration, data quality,
and data governance in a single, low-code platform that works with
virtually any data source and data architecture.
10. QlikView - is a self-service business intelligence, data visualization,
and data analytics platform that supports data literacy and executive
dashboards.