Big Data Architectures Foundations
Recap
Questions to think about
1) What is a big data framework?
2) What are the components of a big data management system?
3) When do I need a big data management system/model/framework?
Big Data Architecture
• Big data architecture is the foundation for big data analytics.
• The goal of designing any data architecture is to create a model that gives a complete view of all the required elements of the big data.
• Designing the model may consume more time up front, but implementing it can subsequently save a significant amount of time, effort, and rework in managing big data.
• The configuration of the model/architecture may vary depending on the specific needs of the organisation.
• But for any data architecture, the basic layers and components remain more or less the same.
• To design a big data architecture model, we need to think of Big Data as a strategy, not a project.
Do I Need Big Data Architecture?
• Not everyone needs to leverage a big data architecture.
• Single computing tasks rarely top 100GB of data, which does not require a big data architecture.
• Unless you are analyzing terabytes, petabytes, or even zettabytes of data, and doing it consistently, look to a scalable server instead of a massively scaled-out architecture like Hadoop.
• If you need analytics, then consider a scalable array that offers native
analytics for stored data.
Do I Need Big Data Architecture? (cont.)
You probably do need big data architecture if any of the following applies to you:
• You want to extract information from extensive networking or web logs.
• You process massive datasets over 100GB in size. Some of these computing tasks run 8
hours or longer.
• You are willing to invest in a big data project, including third-party products to optimize your
environment.
• You store large amounts of unstructured data that you need to summarize or transform into a
structured format for better analytics.
• You have multiple large data sources to analyze, including structured and unstructured.
• You want to proactively analyze big data for business needs, such as analyzing store sales by
season and advertising, applying sentiment analysis to social media posts, or investigating
email for suspicious communication patterns, or all of the above.
Big Data Architecture
The strategy includes the design principles related to creating an environment to support Big Data. The principles deal with data storage, analytics, reporting, and applications.
• During the creation of a Big Data architecture, consideration must be given to hardware, software infrastructure, operational software, management software, APIs, and software developer tools.
Ingestion Layer
• The ingestion layer of the Big Data environment must fulfill all fundamental requirements to perform the following functions:
Identification
Filtration
Validation
Noise reduction
Transformation
Compression
Integration
Ingestion Layer (cont.)
Identification: Data is categorised into various known data formats, or unstructured data is assigned default formats.
Filtration: The information relevant for the enterprise is filtered on the basis of the Enterprise Master
Data Management (MDM) repository.
Validation: The filtered data is analysed against MDM metadata.
Noise reduction: Data is cleaned by removing the noise and minimising the related disturbances.
Transformation: Data is split or combined on the basis of its type, contents, and the requirement of the
organisation.
Compression: The size of the data is reduced without affecting its relevance for the required process. It should be remembered that compression does not affect the analysis results.
Integration: The refined data set is integrated with the Hadoop storage layer, which consists of the Hadoop Distributed File System (HDFS) and a NoSQL database.
Data ingestion in the Hadoop world means ELT (Extract, Load, and Transform), as opposed to ETL (Extract, Transform, and Load) in traditional warehouses.
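To make these functions concrete, here is a minimal Python sketch of an ELT-style ingestion step. It is illustrative only: the KNOWN_SOURCES whitelist stands in for an MDM repository, and all field and function names are assumptions, not part of any real product.

    # Minimal ELT-style ingestion sketch (illustrative; KNOWN_SOURCES stands
    # in for an MDM repository, and field names are assumptions).
    import gzip
    import json

    KNOWN_SOURCES = {"web", "pos", "crm"}

    def identify(raw):
        # Identification: parse the record and assign a default format.
        rec = json.loads(raw)
        rec.setdefault("format", "unstructured")
        return rec

    def relevant(rec):
        # Filtration/Validation: keep only records whose source the "MDM" knows.
        return rec.get("source") in KNOWN_SOURCES

    def clean(rec):
        # Noise reduction + Transformation: drop empty fields, normalise text.
        return {k: v.strip().lower() if isinstance(v, str) else v
                for k, v in rec.items() if v not in ("", None)}

    def ingest(lines, out_path):
        # Compression + Integration: write gzip-compressed JSON lines, ready
        # to be copied into HDFS (ELT: load first, transform heavily later).
        with gzip.open(out_path, "wt") as out:
            for raw in lines:
                rec = identify(raw)
                if relevant(rec):
                    out.write(json.dumps(clean(rec)) + "\n")

    ingest(['{"source": "web", "msg": "  Hello "}'], "batch.json.gz")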
Storage Layer
Storage becomes a challenge as the size of the data grows, so finding a suitable storage solution is critical.
This layer focuses on "where to store such large data efficiently."
Hadoop is an open source framework normally used to store high volumes of data in a distributed manner across multiple machines.
There are two major components of Hadoop: a scalable Hadoop Distributed File System (HDFS) that can support petabytes of data, and a MapReduce engine that computes results in batches.
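As a concrete illustration of the batch MapReduce model, here is a minimal word-count mapper and reducer in Python, written for Hadoop Streaming; the jar name and HDFS paths in the comment are placeholders, not values from this course.

    #!/usr/bin/env python3
    # Minimal word-count mapper/reducer for Hadoop Streaming (the paths in
    # this example invocation are placeholders):
    #   hadoop jar hadoop-streaming.jar \
    #       -input /data/in -output /data/out \
    #       -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
    #       -file wordcount.py
    import sys

    def mapper():
        # Map phase: emit one "<word>TAB1" pair per token.
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")

    def reducer():
        # Reduce phase: input arrives sorted by key, so the counts for each
        # word are contiguous and can be summed in a single pass.
        current, count = None, 0
        for line in sys.stdin:
            word, _, n = line.rstrip("\n").partition("\t")
            if word != current:
                if current is not None:
                    print(f"{current}\t{count}")
                current, count = word, 0
            count += int(n)
        if current is not None:
            print(f"{current}\t{count}")

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()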
Hadoop has its own database, known as HBase, but others, including Amazon's DynamoDB, MongoDB, and Cassandra (used by Facebook), all based on the NoSQL architecture, are popular as well.
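For comparison with HDFS, a document-oriented NoSQL store needs no fixed schema. Below is a minimal sketch using the third-party pymongo package; the server URI and the database and collection names are placeholder assumptions.

    # Minimal NoSQL sketch using pymongo (pip install pymongo); the URI and
    # the database/collection names below are illustrative placeholders.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    events = client["bigdata_demo"]["events"]

    # Documents need no fixed schema, which suits unstructured, evolving data.
    events.insert_one({"source": "web", "msg": "hello", "tags": ["demo"]})
    print(events.find_one({"source": "web"}))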
Digging into Big Data Technology Components
Physical Infrastructure Layer
As big data is all about high-velocity, high-volume, and high-variety data, the physical infrastructure will literally "make or break" the implementation.
Most big data implementations need to be highly available, so the networks, servers, and
physical storage must be both resilient and redundant. Resiliency and redundancy are
interrelated.
An infrastructure, or a system, is resilient to failure or changes when sufficient redundant resources are in place, ready to jump into action.
Redundancy ensures that the malfunction of a single component won't cause an outage, and resiliency helps to eliminate single points of failure in your infrastructure.
When the infrastructure is consumed as a service, the technical and operational complexity is masked behind a collection of services, each with specific terms for performance, availability, recovery, and so on. These terms are described in service-level agreements (SLAs) and are usually negotiated between the service provider and the customer, with penalties for noncompliance.
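As a rough aid when reading such SLAs, the short Python sketch below converts an availability percentage into the downtime it permits per year; the sample percentages are illustrative, not values from any particular SLA.

    # Translate an availability SLA into the downtime it allows per year.
    def allowed_downtime_hours(sla_percent: float) -> float:
        return (1 - sla_percent / 100) * 365 * 24

    for sla in (99.0, 99.9, 99.99):
        print(f"{sla}% uptime -> {allowed_downtime_hours(sla):.2f} hours/year")
    # 99.0% -> 87.60, 99.9% -> 8.76, 99.99% -> 0.88 hours of downtime per year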
Physical Infrastructure Layer (cont.)
A prioritized list of big data principles should include statements about the following:
Performance: How responsive do you need the system to be? Performance is often measured end to end as latency, based on a single transaction or query request.
Availability: Do you need a 100 percent uptime guarantee of service? How long can your
business wait in the case of a service interruption or failure?
Scalability: How big does your infrastructure need to be? How much disk space is needed
today and in the future? How much computing power do you need? Typically, you need to
decide what you need and then add a little more scale for unexpected challenges.
Flexibility: How quickly can you add more resources to the infrastructure? How quickly can
your infrastructure recover from failures?
Cost: What can you afford? Because the infrastructure is a set of components, you might be able to buy the "best" networking and decide to save money on storage. You need to establish requirements for each of these areas in the context of an overall budget and then make trade-offs where necessary.
A. PHYSICAL REDUNDANT NETWORKS
Networks should be redundant and must have enough capacity to accommodate the
anticipated volume and velocity of the inbound and outbound data in addition to the
“normal” network traffic experienced by the business.
As you begin making big data an integral part of your computing strategy, it is
reasonable to expect volume and velocity to increase.
Infrastructure designers should plan for these expected increases and try to create
physical implementations that are “elastic.”
As network traffic ebbs and flows, so too does the set of physical assets associated with
the implementation.
Your infrastructure should offer monitoring capabilities so that operators can react when
more resources are required to address changes in workloads.
B. MANAGE HARDWARE: STORAGE AND SERVERS
The hardware (storage and server) assets must have sufficient speed and capacity to
handle all expected big data capabilities.
It’s of little use to have a high-speed network with slow servers because the servers will
most likely become a bottleneck.
However, a very fast set of storage and compute servers can overcome variable network
performance.
Of course, nothing will work properly if network performance is poor or unreliable.
C. INFRASTRUCTURE OPERATIONS
Infrastructure operations manage the core components of Hadoop, such as HDFS and MapReduce, along with other tools used to store, access, and analyse large amounts of data, including real-time analysis.
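To illustrate the store-and-access part, here is a minimal sketch using the third-party hdfs Python package over the WebHDFS API; the NameNode address, user name, and paths are placeholder assumptions.

    # Minimal store-and-read sketch over WebHDFS, using the third-party
    # `hdfs` package (pip install hdfs). The NameNode URL, user, and paths
    # below are illustrative placeholders.
    from hdfs import InsecureClient

    client = InsecureClient("http://namenode.example.com:9870", user="analyst")

    # Store: copy a local batch file into HDFS (the "Load" step of ELT).
    client.upload("/data/raw/batch.json.gz", "batch.json.gz", overwrite=True)

    # Access: list the directory and stream the file back.
    print(client.list("/data/raw"))
    with client.read("/data/raw/batch.json.gz") as reader:
        payload = reader.read()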