
Submitted By:

AJAY KUMAR
Assistant Professor-II
Computer Science & Engineering

BIG DATA ANALYTICS


UNIT 2: Integration of Big Data & Data Warehouse
Topics to be Covered
 Integration of Big Data and Data Warehousing
 Data Driven Architecture
 Information Management and Lifecycle
 Big Data Analytics
 Visualization and Data Scientist
 Implementing the “Big Data” data warehouse
 R interfaces
Data warehousing
 Data warehousing is the process of
constructing and using a data warehouse.
 A data warehouse is constructed by
integrating data from multiple heterogeneous
sources that support analytical reporting,
structured and/or ad hoc queries, and decision
making.
 Data warehousing involves data cleaning, data
integration, and data consolidation.
Data Warehouse
 A data warehouse is a large collection of business data used to help an organization
make decisions. The concept of the data warehouse has existed since the 1980s, when it
was developed to help transition data from merely powering operations to fueling decision
support systems that reveal business intelligence. The large amount of data in data
warehouses comes from different places such as internal applications such as marketing,
sales, and finance; customer-facing apps; and external partner systems, among others.
 Organizations that use a data warehouse to assist their analytics and business intelligence
see a number of substantial benefits:
• Better data — Adding data sources to a data warehouse enables organizations to ensure
that they are collecting consistent and relevant data from that source. They don’t need to
wonder whether the data will be inaccessible or inconsistent as it comes into the system.
This ensures higher data quality and data integrity for sound decision making.
• Faster decisions — Data in a warehouse is stored in consistent formats so that it is ready to
be analyzed. It also provides the analytical power and a more complete dataset to base
decisions on hard facts. Therefore, decision makers no longer need to rely on hunches,
incomplete data, or poor-quality data and risk delivering slow and inaccurate results.
Functions of Data Warehouse Tools and Utilities

 The following are the functions of data warehouse tools and utilities −
• Data Extraction − Involves gathering data from multiple
heterogeneous sources.
• Data Cleaning − Involves finding and correcting the errors in data.
• Data Transformation − Involves converting the data from legacy
format to warehouse format.
• Data Loading − Involves sorting, summarizing, consolidating,
checking integrity, and building indices and partitions.
• Refreshing − Involves updating from data sources to warehouse.
 Note − Data cleaning and data transformation are important steps
in improving the quality of data and data mining results.
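
As a concrete illustration of how these functions fit together, here is a minimal Python sketch of an extract, clean, transform, and load flow. The source files, column names, and the warehouse.db target are hypothetical placeholders for illustration, not part of any specific tool.

```python
import csv
import sqlite3

def extract(paths):
    """Data extraction: gather rows from multiple heterogeneous CSV sources (hypothetical files)."""
    for path in paths:
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

def clean(rows):
    """Data cleaning: drop rows with missing keys and correct obvious errors."""
    for row in rows:
        if row.get("customer_id"):
            row["amount"] = abs(float(row.get("amount") or 0))  # e.g., fix sign errors
            yield row

def transform(row):
    """Data transformation: convert a legacy-format row into the warehouse format."""
    return (row["customer_id"].strip(), row["region"].upper(), row["amount"])

def load(rows, db="warehouse.db"):
    """Data loading: consolidate, check integrity, and build an index."""
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, region TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", (transform(r) for r in rows))
    con.execute("CREATE INDEX IF NOT EXISTS idx_region ON sales (region)")
    con.commit()
    con.close()

# Refreshing would simply rerun extract -> clean -> load on a schedule against the sources.
load(clean(extract(["marketing.csv", "finance.csv"])))
```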
Integration of Big Data and Data Warehousing

Components of the next-generation data warehouse.


Integration Strategies
 Data integration refers to combining data from different source systems for usage by
business users to study different behaviors of the business and its customers. In the
early days of data integration, the data was limited to transactional systems and their
applications.
 The limited data set provided the basis for creating decision support platforms that
were used as analytic guides for making business decisions.
 The growth of the volume of data and the data types over the last three decades, along
with the advent of data warehousing, coupled with the advances in infrastructure and
technologies to support the analysis and storage requirements for data, have changed
the landscape of data integration forever.
 Traditional data integration techniques have been focused on ETL, ELT, CDC, and EAI
types of architecture and associated programming models.
 In the world of Big Data, however, these techniques will need to be modified to suit the
size and processing complexity demands, including the formats of data that need to be
processed. Big Data processing needs to be implemented as a two-step process.
 The first step is a data-driven architecture that includes analysis and design of data
processing. The second step is the physical architecture implementation.
Data Driven Integration
 In this technique of building the next-generation
data warehouse, all the data within the enterprise
are categorized according to the data type, and
depending on the nature of the data and its
associated processing requirements, the data
processing is completed using business rules
encapsulated in processing logic and integrated into
a series of program flows incorporating enterprise
metadata, MDM, and semantic technologies like
taxonomies.
Data Driven Integration
 Figure shows the inbound data processing of different categories of data. This
model segments each data type based on the format and structure of the data, and
then applies the appropriate processing rules within the ETL, ELT, CDC, or text
processing techniques. Let us analyze the data integration architecture and its
benefits.
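
The following Python sketch illustrates the data-driven routing idea described above. The classification rules, category names, and processor actions are hypothetical stand-ins for the metadata-driven business rules an enterprise would actually maintain.

```python
# Hypothetical sketch: segment inbound data by format and dispatch it to the
# processing technique named above (ETL/ELT, CDC, or text processing).
def classify(record):
    """Categorize an inbound record by its structure (illustrative rules only)."""
    if isinstance(record, dict) and "table" in record:
        return "transactional"      # structured, relational source
    if isinstance(record, dict) and "change_type" in record:
        return "change_capture"     # CDC feed
    return "textual"                # unstructured / semi-structured content

PROCESSORS = {
    "transactional": lambda r: f"ETL/ELT pipeline for table {r['table']}",
    "change_capture": lambda r: f"CDC apply for {r['change_type']} change",
    "textual": lambda r: "text-processing pipeline (tagging, taxonomy lookup)",
}

def process_inbound(records):
    """Route each record through the processing rules for its category."""
    for record in records:
        category = classify(record)
        yield category, PROCESSORS[category](record)

sample = [
    {"table": "orders", "order_id": 42},
    {"change_type": "UPDATE", "key": 7},
    "customer email: please cancel my subscription",
]
for category, action in process_inbound(sample):
    print(category, "->", action)
```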
External data integration
 Figure shows the external data integration approach to creating the
next-generation data warehouse.
 In this approach the existing data processing and data warehouse
platforms are retained, and a new platform for processing Big
Data is created in new technology architecture.
 A data bus is developed using metadata and semantic
technologies, which will create a data integration environment
for data exploration and processing.
 Workload processing is clearly divided in this architecture into
processing Big Data in its infrastructure and the current-state data
warehouse in its infrastructure.
Data Driven Approach
 When a company employs a “data-driven”
approach, it means it makes strategic decisions
based on data analysis and interpretation.
 A data-driven approach enables companies to
examine and organise their data with the goal
of better serving their customers and
consumers. By using data to drive its actions,
an organisation can contextualise and/or
personalise its messaging to its prospects and
customers for a more customer-centric approach.
Data Driven Architecture

Data Management Stages


Data Driven Architecture- Metadata

 Metadata is defined as data about data or, in other words, information
about data within any data environment.
 Metadata changes in the lifetime of a database when changes occur
within the business, such as mergers and acquisitions, new systems
deployment, integration between legacy, and new applications.
 To maintain the metadata associated with the data, we need to
implement business and technology processes and governance policies.
 Many enterprises today do not track the life cycle of metadata, which
will cost them when data is brought back from backups or data is
restored from an archive database, and nobody can quite decipher the
contents and its relationships, hierarchies, and business processing
rules.
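
A minimal sketch of what tracking the life cycle of metadata could look like, assuming a simple versioned catalog. The class and field names below are illustrative assumptions, not the structures of any specific metadata tool.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class ColumnMetadata:
    name: str
    business_definition: str          # business-facing meaning of the column
    source_system: str                # where the data originates
    valid_from: date                  # when this definition became effective
    valid_to: Optional[date] = None   # closed out when the business changes (e.g., a merger)

@dataclass
class TableMetadata:
    table: str
    columns: list = field(default_factory=list)

    def retire_column(self, name: str, on: date):
        """Close out a definition instead of overwriting it, so data restored from
        backups or archives can still be deciphered against its original meaning."""
        for col in self.columns:
            if col.name == name and col.valid_to is None:
                col.valid_to = on
```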
Data Driven Architecture- Master data management

 Master data management (MDM) is the core process used to manage, centralize,
organize, categorize, localize, synchronize, and enrich master data according to the
business rules of the sales, marketing, and operational strategies of your company.
 The efficient management of master data in a central
repository gives you a single authoritative view of
information and eliminates costly inefficiencies caused by
data silos.
 It supports your business initiatives and objectives through
identification, linking and syndication of information and
content across products, customers, stores/locations,
employees, suppliers, digital assets and more.
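
A small, hedged sketch of the single-authoritative-view idea: duplicate customer records from two silos are merged with a simple survivorship rule. The sample records and the most-recent-non-null-wins rule are illustrative assumptions; real MDM hubs apply much richer business rules.

```python
# Hypothetical records for the same customer held in two data silos.
RECORDS = [
    {"source": "crm",     "customer_id": "C-100", "name": "A. Kumar",
     "email": None, "updated": "2023-01-10"},
    {"source": "billing", "customer_id": "C-100", "name": "Ajay Kumar",
     "email": "ajay@example.com", "updated": "2023-06-02"},
]

def golden_record(records):
    """For each attribute, keep the most recently updated non-null value."""
    merged = {}
    for rec in sorted(records, key=lambda r: r["updated"]):
        for key, value in rec.items():
            if key != "source" and value is not None:
                merged[key] = value
    return merged

print(golden_record(RECORDS))   # one authoritative view across the silos
```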
Data Driven Architecture
 Figure shows the data-driven architecture that can
be deployed based on the metadata and master data
solutions.
Data Driven Architecture
 This approach streamlines the data assets across the
enterprise data warehouse and enables seamless
integration with metadata and master data for data
management in the data warehouse. While this
architecture is more difficult to implement, it is a
reusable approach: new data can be easily added
into the infrastructure because the processing is
driven by the data itself. Extending this concept to
new systems, including Big Data, is therefore more
feasible.
Information Management Life Cycle
 Information life-cycle management is the practice of
managing the life cycle of data across an enterprise
from its creation or acquisition to archival.
 The concept of information life-cycle management
has always existed as “records management” since
the early days of computing, but the management of
records meant archival and deletion with extremely
limited capability to reuse the same data when
needed later on.
Information life-cycle management components

 Information life-cycle
management forms one of
the foundational pillars in
the management of data
within an enterprise.
 It is the platform on which
the three pillars of data
management are designed.
 The first pillar represents
process, the second
represents the people, and
the third represents the
technology.
Phases of Information Life cycle
 Capturing Data
 Preserving data
 Grouping data
 Processing data
 Publishing data
 Archiving data
 Removing data
Governance
 Information and program governance
are two important aspects of managing
information within an enterprise.
Information governance deals with
setting up governance models for data
within the enterprise and program
governance deals with implementing
the policies and processes set forth in
information governance.
 Both of these tasks are fairly people-
specific as they involve both the
business user and the technology teams.
 A governance process is a
multistructured organization of people
who play different roles in managing
information.
Data Governance Teams
Technology
 Implementing the program from a concept to reality within data governance falls in the
technology layers. There are several different technologies that are used to implement the
different aspects of governance. These include tools and technologies used in data
acquisition, data cleansing, data transformation, and database code such as stored
procedures, programming modules coded as application programming interfaces (APIs),
semantic technologies, and metadata libraries.
 Data quality
 Is implemented as a part of the data movement and transformation processes.
 Is developed as a combination of business rules developed in ETL/ELT programs
and third-party data enrichment processes.
 Is measured in percentage of corrections required per execution per table. The
lower the percentage of corrections, the higher the quality of data.
 Data enrichment
 We have always enriched data to improve its accuracy and information quality.
 In the world of Big Data, data enrichment is accomplished by integrating
taxonomies, ontologies, and third-party libraries as a part of the data processing
architecture.
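
As a rough illustration of the data-quality metric mentioned above (percentage of corrections required per execution, per table), here is a short Python sketch; the run-log figures are made up for the example.

```python
# Hedged sketch of the data-quality metric: corrections as a percentage of rows
# processed in one execution of a table load (lower is better).
def correction_rate(rows_processed: int, rows_corrected: int) -> float:
    """Return corrections as a percentage of rows processed in one execution."""
    if rows_processed == 0:
        return 0.0
    return 100.0 * rows_corrected / rows_processed

# Hypothetical run log: table name -> (rows processed, rows corrected)
run_log = {"orders": (120_000, 240), "customers": (8_000, 400)}
for table, (processed, corrected) in run_log.items():
    print(f"{table}: {correction_rate(processed, corrected):.2f}% corrections")
# orders at 0.20% indicates higher data quality than customers at 5.00%.
```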
Technology
 Enriched data will provide the user capabilities:
 To define and manage hierarchies.
 To create new business rules on-the-fly for tagging and classifying the data.
 To process text and semi-structured data more efficiently.
 To explore and process multilingual and multistructured data analysis.
 Data transformation
 Is implemented as part of ETL/ELT processes.
 Is defined as business requirements by the user teams.
 Uses master data and metadata program outputs for referential data processing
and data standardization.
 Is developed by IT teams.
 Includes auditing and traceability framework components for recording data
manipulation language (DML) outputs and rejects from data quality and integrity
checks.
 Data archival and retention
 Is implemented as part of the archival and purging process.
 Is developed as a part of the database systems by many vendors.
 Is often misquoted as a database feature.
 Often fails when legacy data is imported back due to lack of correct metadata and
underlying structural changes. This can be avoided easily by exporting the
metadata and the master data along with the data set.
Technology
 Master data management
 Is implemented as a standalone program.
 Is implemented in multiple cycles for customers and products.
 Is implemented for location, organization, and other smaller data sets as an add-on
by the implementing organization.
 Measured as a percentage of changes processed every execution from source
systems.
 Operationalized as business rules for key management across operational,
transactional, warehouse, and analytical data.
 Metadata
 Is implemented as a data definition process by business users.
 Has business-oriented definitions for data for each business unit. One central
definition is regarded as the enterprise metadata view of the data.
 Has IT definitions for metadata related to data structures, data management
programs, and semantic layers within the database.
 Has definitions for semantic layers implemented for business intelligence and
analytical applications.
 All the technologies used in the processes described above have a database, a user
interface for managing data, rules and definitions, and reports available on the
processing of each component and its associated metrics.
Measuring the impact of information life-cycle
management

 To measure and monitor the impact of governance and
processes for information life-cycle management, you
can implement scorecards and dashboards and even
extend the currently used models for the data warehouse.
 This is a very nascent and emerging topic, but an
important topic to consider implementing when starting a
Big Data program.
 There will be a lot of evolutions and improvements in
this area in the next few years, especially in the world of
Big Data and the next generation of the data warehouse.
Big Data Analytics
 Analytics programs provide a platform and the opportunity to measure everything
across an enterprise.
 The associated effect of this approach is the creation of transparency across the
different layers of data, its associated processes and methods, and the exposure of
insights into potential opportunities, threats, risks, and issues.
 Executives, when provided outcomes from analytical exercises, gain a better
understanding of the decisions and can provide more effective guidance to their
enterprise as a whole.
 Big Data analytics can be defined as the combination of traditional analytics and data
mining techniques along with large volumes of data to create a foundational platform to
analyze, model, and predict the behavior of customers, markets, products, services, and
the competition, thereby enabling an outcomes-based strategy precisely tailored to meet
the needs of the enterprise for that market and customer segment.
 Big Data analytics is the process of discovering patterns and insights from Big Data and
modeling them for use with corporate data. The results of Big Data processing for
analytics can be modeled as a list of key attributes and their associated values that can
leverage the existing metadata and semantic layers for integration with traditional data.
How does this work in reality?
 When you engage in a web search for a product, you can see that along
with results for the product you searched, you are provided details on sales,
promotions, and coupons for the product in your geographical area and the
nearest ten miles, as well as promotions being offered by web retailers.
 Analyzing the data needed to create search results tailored to your individual search,
we can see that the companies who have targeted
you for a promotional offer or a discount coupon have used the outcomes
of behavioral analytics from clickstream data of thousands of other people
who have searched for a similar product or service, and combined them
with promotional data targeted for your geographical area to compete for
your wallet share.
 Sometimes all of these activities are done by a third-party company as a
service and these third-party vendors use Big Data processing and analytics
techniques to provide this kind of service.
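
A toy Python sketch of the flow just described, assuming simplified clickstream and promotion records: interest inferred from other shoppers' clicks is combined with promotions targeted at the searcher's geographic area. All data and field names are illustrative assumptions.

```python
from collections import Counter

clickstream = [  # simplified events from other shoppers who searched a similar product
    {"query": "noise cancelling headphones", "clicked_sku": "HDX-1"},
    {"query": "noise cancelling headphones", "clicked_sku": "HDX-1"},
    {"query": "noise cancelling headphones", "clicked_sku": "HDX-2"},
]

promotions = [  # promotional data targeted by geography (hypothetical)
    {"sku": "HDX-1", "zip_prefix": "160", "offer": "10% off + coupon"},
    {"sku": "HDX-2", "zip_prefix": "900", "offer": "free shipping"},
]

def offers_for(query, user_zip):
    """Rank SKUs by how often similar searches led to clicks, then attach local offers."""
    popularity = Counter(e["clicked_sku"] for e in clickstream if e["query"] == query)
    local = {p["sku"]: p["offer"] for p in promotions if user_zip.startswith(p["zip_prefix"])}
    return [(sku, local[sku]) for sku, _ in popularity.most_common() if sku in local]

print(offers_for("noise cancelling headphones", "16001"))  # [('HDX-1', '10% off + coupon')]
```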
Visualization
 Big Data visualization is not like traditional business intelligence where the data is
interactive and can be processed as drilldowns and rollups in a hierarchy or can be
drilled into in a real-time fashion.
 This data is static in nature and will be minimally interactive in a visualization
situation. The underlying reason for this static nature is the design of Big Data
platforms like Hadoop or NoSQL, where the data is stored in files rather than table
structures, and processing changes will require massive file operations, which are
best performed in a microbatch environment as opposed to a real-time environment.
 This limitation is being addressed in the next generation of Hadoop and other Big
Data platforms.
 Today, the data that is available for visualization is largely integrated using mashup
tools and software that support such functionality, including Datameer,
Karmasphere, Tableau, and Spotfire.
 The mashup platform provides the capability for the user to integrate data from
multiple streams into one picture, by linking common data between the different
data sets.
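
A minimal mashup-style sketch: two data sets are linked on a common field and rendered as one picture. The pandas/matplotlib stack and the sample data are assumptions made for illustration; they are not the mashup tools named above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Two hypothetical streams: social mentions and warehouse sales for the same products.
social_mentions = pd.DataFrame({"product": ["A", "B", "C"], "mentions": [1200, 300, 950]})
warehouse_sales = pd.DataFrame({"product": ["A", "B", "C"], "units_sold": [400, 150, 500]})

# Link the data sets on the common field and plot one integrated view.
combined = social_mentions.merge(warehouse_sales, on="product")
combined.plot(x="product", y=["mentions", "units_sold"], kind="bar")
plt.title("Mentions vs. units sold (mashup view)")
plt.show()
```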
Visualization
 Another form of visualization of Big Data is
delivered through the use of statistical software like
R, SAS, and KXEN, where the predefined models
for different statistical functions can use the data
extracted from the discovery environment and
integrate it with corporate and other data sets to
drive the statistical visualizations.
 A very popular software package that uses R to
accomplish this type of functionality is RStudio.
Evolving Role of Data Scientists
 The key role that enables the difference between success and failure of a Big Data
program is the data scientist. The term was originally coined by two of the original
data scientists, D. J. Patil and Jeff Hammerbacher, when they were working at
LinkedIn and Facebook.
 What defines a data scientist? Is this a special skill or education? How different are
these roles from an analyst or an engineer? There is no standard definition for the role
of a data scientist, but here is a close description:
 A data scientist is an expert business analyst or an engineer who uses data discovery
tools to find new insights in data by using techniques that are statistical or scientific
in nature.
 They work on a variety of hypotheses and design multiple models that they
experiment with to arrive at new insights.
 To accomplish this they use a large volume of data, which is collectively called Big
Data. Data scientists work very closely with data and often question everything that
is input to or output from the data. In fact, in every enterprise there are a handful of
senior business analysts or data analysts who are playing the role of data scientist
without being formally called one.
Evolving Role of Data Scientists
 Data scientists use the data discovery tools discussed in this chapter to create the
visualization and analytics associated with Big Data. This role is still evolving, and in
the future we will see many teams of data scientists in enterprises as opposed to the
handful we see today. If data is the new oil, then the data scientist is the new explorer.
 As we evolve the data within enterprises from now into the future, we will need more
explorers who can journey through the data, asking the questions of why and where
and looking for insights and patterns. This is the role that will largely fall onto the
data scientist teams.
Big Data Implementation- Hadoop and MySQL
drives innovation
 Implementation of a Big Data–based data warehouse architecture is driven by
integrating and augmenting the incumbent platform with a Hadoop and MySQL
architecture. The difference in this architecture approach is the migration from a
traditional RDBMS to the new data architecture platform.
 The business problem is that a leading electronics manufacturer has had weak market
penetration with its products and services. The biggest threats facing the enterprise
include loss of traditional markets, increased customer attrition, poor market
performance from a wallet-share perspective, lack of customer confidence, and overall
weak performance.
 Hadoop—the Big Data platform can be deployed as the enterprise repository of data.
With its lower cost and greater scalability, the platform will bring a lot of performance
boost to the overall data architecture.
 MySQL—this database technology can be deployed to manage the web databases
where transactional data needs to be managed. Low cost and simple replication
techniques are the biggest draws for this technology in the data architecture.
 NoSQL—Cassandra or HBase will be useful to manage nontransactional data such as
call center data and text analytics data.
 Textual analytics databases—can be useful in managing textual data without having
to worry about deploying MapReduce code.
 Datameer, Karmasphere, and R—can be used by the team of analysts and business
users to create powerful data mining models.
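
A hedged sketch of the platform split described on this slide, expressed as a simple routing policy in Python; the category names and the policy table are illustrative assumptions only, not a prescribed configuration.

```python
# Illustrative routing policy: each category of inbound data goes to the store
# suggested on the slide above.
STORAGE_POLICY = {
    "web_transactions":   "MySQL (low-cost, simple replication)",
    "enterprise_history": "Hadoop / HDFS (enterprise repository, cheap scale)",
    "call_center_logs":   "NoSQL - Cassandra or HBase (nontransactional data)",
    "documents_and_text": "Textual analytics database (no MapReduce code to write)",
}

def route(category: str) -> str:
    """Return the target platform for a category of inbound data."""
    return STORAGE_POLICY.get(category, "Hadoop / HDFS (default landing zone)")

for cat in ("web_transactions", "call_center_logs", "sensor_feed"):
    print(cat, "->", route(cat))
```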
Big Data Implementation- Hadoop and MySQL
drives innovation
 The data isolated with this technique was modeled based on each layer’s input and output
requirements and the interfaces that needed to be developed. Once the design approach was
finalized, as shown in the figure, the steps listed on the next slide were carried out.
Big Data Implementation- Hadoop and MySQL
drives innovation
 Created additional data governance process for integrating all data types into the new
data warehouse.
 Created the data interfaces for each technology layer using metadata-driven interfaces.
 Created a migration plan for moving history and legacy data, along with the respective
metadata, to Hadoop.
 Created a best-practices approach for deploying Cassandra and HBase across the
different regions of the world, where the data interfaces were deployed based on
corporate standards.
 Integrated a semantic data discovery platform with interfaces to internal and external
products and service hierarchies.
 Created and deployed a tagging process for data management and integration,
especially with unstructured data.
 Implemented a metadata integration and reference architecture for centralized
management of metadata across the enterprise.
 Designed and developed data parsing algorithms for integrating semi-structured and
unstructured data components.
Big Data Implementation- Hadoop and MySQL
drives innovation
 Integrated algorithms for parsing machine logs and clickstream logs to understand the user
and device behaviors.
 Integrated algorithms for predictive analytics integration.
 Integrated algorithms for statistical modeling and integration.
 Integrated reporting and visualization platforms into the different data layers.
 Provided business users capabilities to discover, analyze, and visualize data across the
enterprise in an integrated platform.
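
To make the log-parsing step above more concrete, here is a small, self-contained Python sketch that pulls page views and error counts out of web-server-style lines. The log format, regular expression, and sample lines are assumptions for illustration, not the algorithms actually deployed.

```python
import re
from collections import Counter

# Hypothetical access-log layout: ip, timestamp, request, status code.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3})'
)

sample_logs = [
    '10.0.0.1 - - [10/Oct/2012:13:55:36 +0000] "GET /product/HDX-1 HTTP/1.1" 200',
    '10.0.0.2 - - [10/Oct/2012:13:55:40 +0000] "GET /product/HDX-1 HTTP/1.1" 200',
    '10.0.0.1 - - [10/Oct/2012:13:56:02 +0000] "GET /cart HTTP/1.1" 500',
]

hits, errors = Counter(), Counter()
for line in sample_logs:
    m = LOG_LINE.match(line)
    if m:
        hits[m["path"]] += 1                  # page views per path
        if m["status"].startswith("5"):
            errors[m["path"]] += 1            # server-side failures per path

print("page views:", hits.most_common())
print("errors:", errors.most_common())
```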
Benefits of Architecture
 The implementation process was a planned migration to the heterogeneous platform over 24 months
on a global basis. The main benefits of the architecture were:

• Design once and deploy in parallel worldwide
• Modular scalability
• Standardization
• Lower cost of maintenance
• Higher scalability and flexibility
• Security standards compliant
• Always available
• Self-service capabilities
• Fault tolerant
• Easy recovery from failure
