
Introduction to Big Data

William El Kaim
Oct. 2016 – V3.0
This Presentation is part of the
Enterprise Architecture Digital Codex

Copyright © William El Kaim 2016 http://www.eacodex.com/ 2


Plan
• Taming The Data Deluge
• What is Big Data?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Data Science?
• Hadoop Processing Paradigms
• Hadoop Tools Landscape

Copyright © William El Kaim 2016 3


Taming the Data Deluge

Copyright © William El Kaim 2016 Source: Domo 4


Copyright © William El Kaim 2016 5
Taming the Data Deluge

Copyright © William El Kaim 2016 6


Taming the Data Deluge

Copyright © William El Kaim 2016 7


At The Same Time …

Copyright © William El Kaim 2016 8


Data Science (R)evolution

Source: Capgemini

Copyright © William El Kaim 2016 9


Plan
• Taming The Data Deluge
• What is Big Data?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Data Science?
• Hadoop Processing Paradigms
• Hadoop Tools Landscape

Copyright © William El Kaim 2016 10


What is Big Data?
• A collection of data sets so large and complex that it becomes difficult to
process using on-hand database management tools or traditional data
processing applications
• Due to its technical nature, the same challenges arise in Analytics at much lower
volumes than what is traditionally considered Big Data.
• Big Data Analytics is:
• The same as ‘Small Data’ Analytics, only with the added challenges (and potential) of
large datasets (~50M records or 50GB size, or more)
• Challenges:
• Data storage and management
• De-centralized/multi-server architectures
• Performance bottlenecks, poor responsiveness
• Increasing hardware requirements

Copyright © William El Kaim 2015 Source: SiSense 11


Copyright © William El Kaim 2015 12
What is Big Data: the “Vs” to Nirvana


Big Data: A collection of data sets so large and complex that it becomes difficult to process
using on-hand database management tools or traditional data processing applications

Big Data: When the data can't fit in Excel. The limit used to be 65,536 rows; it is now 1,048,576.

Big Data: When it's cheaper to keep everything than spend the effort to decide what to throw
away (David Brower @dbrower)

Copyright © William El Kaim 2016 Source: James Higginbotham 13


Six Vs to Nirvana

Copyright © William El Kaim 2016 Source: Bernard Marr 14


Six Vs to Nirvana

Copyright © William El Kaim 2016 Source: IBM 15


Six Vs to Nirvana

Copyright © William El Kaim 2016 Source: IBM 16


Six Vs to Nirvana

Copyright © William El Kaim 2016 Source: IBM 17


Six Vs to Nirvana

Copyright © William El Kaim 2016 Source: IBM 18


Six Vs to Nirvana
Visualization

Copyright © William El Kaim 2016 Source: Bernard Marr 19


Plan
• Taming The Data Deluge
• What is Big Data?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Data Science?
• Hadoop Processing Paradigms
• Hadoop Tools Landscape

Copyright © William El Kaim 2016 20


Hadoop Genesis
• Scalability issues arise when jobs process terabytes of data
• Reading that amount of data on one computer could take dozens of days
• Lots of cheap computers are needed
• They fix the speed problem but lead to reliability problems
• In large clusters, computers fail every day
• Cluster size is not fixed
• A common infrastructure is needed
• It must be efficient and reliable

Copyright © William El Kaim 2016 21


Hadoop Genesis

[Diagram: in 2004, Google published the "Map-reduce" paper ("It is an important technique!"); Doug Cutting extended Apache Nutch with the technique.]

Copyright © William El Kaim 2016 Source: Xiaoxiao Shi 22


Hadoop Genesis
“Hadoop came from the name my kid gave a stuffed yellow elephant. Short,
relatively easy to spell and pronounce, meaningless, and not used
elsewhere: those are my naming criteria.”
Doug Cutting, Hadoop project creator

• Open Source Apache Project


• Written in Java
• Runs on commodity hardware and all major OSes
• Linux, Mac OS X, Windows, and Solaris

Copyright © William El Kaim 2016 23


Enter Hadoop, the Big Data Refinery
• Hadoop is not replacing anything.
• Hadoop has become another component in an organization's enterprise data platform.
• Hadoop (the Big Data Refinery) can ingest data from all types of different sources.
• Hadoop then exchanges data with the traditional systems that provide transactions and interactions (relational databases) and business intelligence and analytics (data warehouses).

Copyright © William El Kaim 2015 Source: DBA Journey Blog 24


Hadoop Platform

Copyright © William El Kaim 2016 Source: Octo Technology 25


Plan
• Taming The Data Deluge
• What is Big Data?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Data Science?
• Hadoop Processing Paradigms
• Hadoop Tools Landscape

Copyright © William El Kaim 2016 26


When to Use Hadoop?
Two Main Use Cases

1. Data Mgt. and Storage
2. Big Data Analytics and Use

Copyright © William El Kaim 2016 27


When to Use Hadoop?
1. Data Mgt and Storage
2. Data Analytics and Use

Copyright © William El Kaim 2016 Source: Octo Technology 28


Hadoop for Data Mgt and Storage (Use Case 1)

ETL Pre-processor
• Shift the pre-processing of ETL from the data warehouse staging area to Hadoop
• Shifts high-cost data warehousing to lower-cost Hadoop clusters

Copyright © William El Kaim 2016 Source: Microsoft 29


Hadoop for Data Mgt and Storage (Use Case 1)

Massive Storage
• Offload large volumes of historical data into cold storage with Hadoop
• Keep the data warehouse for hot data to allow BI and analytics
• When data from cold storage is needed, it can be moved back into the warehouse

Copyright © William El Kaim 2016 Source: Microsoft 30


Hadoop for Data Analytics and Use (Use Case 2)

Data Discovery
• Keep the data warehouse for operational BI and analytics
• Allow data scientists to make new discoveries on raw data (no required format or structure)
• Operationalize discoveries back into the warehouse

Copyright © William El Kaim 2016 Source: Microsoft 31


Hadoop for Data Analytics and Use (Use Case 2)

From Hindsight to Insight to Foresight

• Traditional analytics tools are not well suited to capturing the full value of big data.
• The volume of data is too large for comprehensive analysis.
• The range of potential correlations and relationships between disparate data sources is too great for any analyst to test all hypotheses and derive all the value buried in the data.
• Basic analytical methods used in business intelligence and enterprise reporting tools reduce to reporting sums, counts, simple averages, and running SQL queries.
• Online analytical processing is merely a systematized extension of these basic analytics that still relies on a human to direct activities and specify what should be calculated.
Copyright © William El Kaim 2016 32
Plan
• Taming The Data Deluge
• What is Big Data?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Data Science?
• Hadoop Processing Paradigms
• Hadoop Tools Landscape

Copyright © William El Kaim 2016 33


How to Implement Hadoop?
The Rise of Big Data Fabric
• Definition:
• Bringing together disparate big data sources automatically, intelligently, and securely,
and processing them in a big data platform technology, such as Hadoop and Apache
Spark, to deliver a unified, trusted, and comprehensive view of customer and business
data.
• Big data fabric focuses on automating the ingestion, curation, and integration of big data sources to deliver intelligent insights that are critical for businesses to succeed.
• The platform minimizes complexity by automating processes, generating big data
technology and platform code automatically, and integrating workflows to simplify the
deployment.
• Big data fabric is not just about Hadoop or Spark
• It comprises several components, all of which must work in tandem to deliver a flexible,
integrated, secure, and scalable platform.

Copyright © William El Kaim 2016 Source: Forrester 34


How to Implement Hadoop?
Make Big Data Fabric Part Of Your Big Data Strategy!
• Enterprise architects whose companies are pursuing a big data strategy can
benefit from a big data fabric implementation that automates, secures,
integrates, and curates big data sources intelligently.
• Most enterprises that have a big data fabric platform are building it themselves by integrating various core open source technologies; however, this requires significant time and effort.
• Your big data fabric strategy should:
• Integrate only a few big data sources at first.
• Start top-down rather than bottom-up, keeping the end in mind.
• Separate analytics from data management. Analytics tools should focus primarily on
data visualization and advanced statistical/data mining algorithms with limited
dependence on data management functions. Decoupling data management from data
analytics reduces the time and effort needed to deliver trusted analytics.
• Create a team of experts to ensure success.
• Use automation and machine learning to accelerate deployment.

Copyright © William El Kaim 2016 Source: Forrester 35


How to Implement Hadoop?
Typical Project Development Steps
[Diagram: unstructured data, business transactions and interactions (CRM, ERP, web, mobile, point of sale), log files, exhaust data, social media, sensor/device data, and database data flow into the Big Data Platform alongside classic data integration and ETL; results feed business intelligence and analytics (dashboards, reports, visualization). Stages: capture big data, process, distribute results, feedback.]

1. Collect data from all sources, structured and unstructured
2. Transform, refine, aggregate, analyze, report
3. Interoperate and share data with applications/analytics
4. Use operational data within the big data platform

Copyright © William El Kaim 2016 Source: HortonWorks 36


How to Implement Hadoop?
Typical Three Steps Process
Collect Explore Enrich

Copyright © William El Kaim 2016 Source: HortonWorks 37


How to Implement Hadoop?
Typical Project Development Steps

[Diagram: raw data flows from the data lake to the lakeshore to data science outputs, across four layers.]

1. Data Source Layer: internal and external data
2. Data Storage Layer: the data lake
3. Data Processing / Analysis Layer: business- or usage-oriented extractions, analytics, and machine learning
4. Data Output & Business Layer: visualization, correlations, business intelligence

Copyright © William El Kaim 2016 38


How to Implement Hadoop?
CRISP-DM Methodology

Copyright © William El Kaim 2016 Source: The Modeling Agency and sv-europe 39
Plan
• Taming The Data Deluge
• What is Big Data?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Data Science?
• Hadoop Processing Paradigms
• Hadoop Tools Landscape

Copyright © William El Kaim 2016 40


What is a Data Lake?
• A data lake:
• Is typically built using Hadoop
• Supports structured and unstructured data
• Benefits from a variety of storage and processing tools to extract value quickly
• Requires little or no processing to adapt the structure to an enterprise schema
• Provides a central location in which to store all your data in its native form, regardless of its source or format

Copyright © William El Kaim 2016 Source: Zaloni 41


Who is Using Data Lakes?

Copyright © William El Kaim 2016 Source: PWC 42


Data Flow In The Data Lake

Copyright © William El Kaim 2016 Source: PWC 43


Schema On Read or Schema On Write?
• The Hadoop data lake concept can be summed up as, “Store it all in one
place, figure out what to do with it later.”
• But while this might be the general idea of your Hadoop data lake, you won’t
get any real value out of that data until you figure out a logical structure for it.
• And you'd better keep track of your metadata one way or another. It does no good to have a lake full of data if you have no idea what lies beneath the shiny surface.
• At some point, you have to give that data a schema, especially if you want to
query it with SQL or something like it. The eternal Hadoop question is
whether to apply the brave new strategy of schema on read, or to stick with
the tried and true method of schema on write.

Copyright © William El Kaim 2016 Source: AdaptiveSystems 44


Schema On Write …
• Before any data is written to the database, the structure of that data is strictly defined, and that metadata is stored and tracked.
• Irrelevant data is discarded; data types, lengths, and positions are all delineated.
• The schema (the columns, rows, tables, and relationships) is defined first, for the specific purpose that database will serve.
• Then the data is filled into its pre-defined positions. The data must all be cleansed, transformed, and made to fit in that structure before it can be stored, in a process generally referred to as ETL (Extract, Transform, Load).
• It is called "schema on write" because the data structure is already defined when the data is written and stored.
• For a very long time, it was believed that this was the only right way to
manage data.
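
To make this concrete, here is a minimal, hypothetical sketch of schema on write, using Python's built-in sqlite3 as a stand-in for a relational warehouse (the table and records are invented for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a relational data warehouse

# 1. Define the schema BEFORE any data is written: names, types, constraints.
conn.execute("""
    CREATE TABLE sales (
        sale_id    INTEGER PRIMARY KEY,
        region     TEXT NOT NULL,
        amount_usd REAL NOT NULL
    )
""")

# 2. Cleanse and transform incoming records to fit that structure (the "T"
#    in ETL), discarding irrelevant fields before loading.
raw = [{"sale_id": 1, "region": "EMEA", "amount_usd": "1200.50", "debug": "x"}]
rows = [(r["sale_id"], r["region"], float(r["amount_usd"])) for r in raw]

# 3. Only then is the data written into its pre-defined positions.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
print(conn.execute("SELECT region, SUM(amount_usd) FROM sales GROUP BY region").fetchall())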

Copyright © William El Kaim 2016 Source: AdaptiveSystems 45


Schema On Read …
• Revolutionary concept: “You don’t have to know what you’re going to do with
your data before you store it.”
• Data of many types, sizes, shapes and structures can all be thrown into the Hadoop
Distributed File System, and other Hadoop data storage systems.
• While some metadata needs to be stored, to know what’s in there, no need yet to know
how it will be structured!
• Therefore, the data is stored in its original granular form, with nothing thrown away.
• In fact, no structural information is defined at all when the data is stored.
• So “schema on read” implies that the schema is defined at the time the data
is read and used, not at the time that it is written and stored.
• When someone is ready to use that data, then, at that time, they define what pieces are
essential to their purpose, where to find those pieces of information that matter for that
purpose, and which pieces of the data set to ignore.
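
As a contrast, here is a minimal schema-on-read sketch with PySpark (assuming a Spark installation; the path and field names are hypothetical). The structure is decided only at read time, for this particular purpose:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The raw JSON was stored as-is, with no structure defined at write time;
# Spark infers a schema only now, at read time.
events = spark.read.json("hdfs:///lake/raw/events.json")
events.createOrReplaceTempView("events")

# The reader decides which pieces matter for THIS purpose and ignores the rest.
spark.sql("""
    SELECT user_id, COUNT(*) AS clicks
    FROM events
    WHERE event_type = 'click'
    GROUP BY user_id
""").show()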

Copyright © William El Kaim 2016 Source: AdaptiveSystems 46


Schema On Read or Schema On Write?
• Schema on read options tend to be a better choice:
• for exploration, for “unknown unknowns,” when you don’t know what kind of questions
you might want to ask, or the kinds of questions might change over time.
• when you don’t have a strong need for immediate responses. They’re ideal for data
exploration projects, and looking for new insights with no specific goal in mind.
• Schema on write options tend to be very efficient for “known unknowns.”
• When you know what questions you’re going to need to ask, especially if you will need
the answers fast, schema on write is the only sensible way to go.
• This strategy works best for old school BI types of scenarios on new school big data
sets.

Copyright © William El Kaim 2016 Source: AdaptiveSystems 47


Data Lake Resources
• Introduction
• PWC: Data lakes and the promise of unsiloed data
• Zaloni: Resources
• Tools
• Bigstep
• Informatica Data Lake Mgt & Intelligent Data Lake
• Microsoft Azure Data Lake and Azure Data Catalog
• Koverse
• Oracle BigData Discovery
• Platfora
• Podium Data
• Waterline Data Inventory
• Zaloni Bedrock
• Zdata Data Lake

Copyright © William El Kaim 2016 48


Data Lake Tools: Platfora

Copyright © William El Kaim 2016 Source: Platfora 49


Data Lake Tool: Waterline

Copyright © William El Kaim 2016 Source: Waterline 50


Plan
• Taming The Data Deluge
• What is Big Data?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Data Science?
• Hadoop Processing Paradigms
• Hadoop Tools Landscape

Copyright © William El Kaim 2016 51


What is BI on Hadoop?
Three Options

Copyright © William El Kaim 2016 Source: Dremio 52


What is BI on Hadoop?
Three Options: ETL to Data Warehouse
• Pros
• Relational databases and their BI integrations are very mature
• Use your favorite tools
• Tableau, Excel, R, …
• Cons
• Traditional ETL tools don’t work well with modern data
• Changing schemas, complex or semi-structured data, …
• Hand-coded scripts are a common substitute
• Data freshness
• How often do you replicate/synchronize?
• Data resolution
• Can’t store all the raw data in the RDBMS (due to scalability and/or cost)
• Need to sample, aggregate or time-constrain the data

Copyright © William El Kaim 2016 Source: Dremio 53


What is BI on Hadoop?
Three Options: Monolithic Tools

• A single piece of software on top of Big Data
• Performs both data visualization (BI) and execution
• Utilizes sampling or manual pre-aggregation to reduce the data volume that the user is interacting with

• Arcadia • AtScale • Datameer • Indexima • Jethro • Looker • Platfora • Tamr • ZoomData
Copyright © William El Kaim 2016 Source: Modified from Platfora 54
What is BI on Hadoop?
Three Options: Monolithic Tools
• Pros
• Only one tool to learn and operate
• Easier than building and maintaining an ETL-to-RDBMS pipeline
• Integrated data preparation in some solutions
• Cons
• Can’t analyze the raw data
• Rely on aggregation or sampling before primary analysis
• Can’t use your existing BI or analytics tools (Tableau, Qlik, R, …)
• Can’t run arbitrary SQL queries

Copyright © William El Kaim 2016 Source: Dremio 55


What is BI on Hadoop?
Three Options: SQL-on-Hadoop
• The combination of a familiar interface (SQL) along with a modern computing
architecture (Hadoop) enables people to manipulate and query data in new
and powerful ways.
• There’s no shortage of SQL on Hadoop offerings, and each Hadoop
distributor seems to have its preferred flavor.
• Not all SQL-on-Hadoop tools are equal, so picking the right tool is a challenge.

Copyright © William El Kaim 2016 Source: Datanami & Cloudera & Dremio 56
What is BI on Hadoop?
Three Options: SQL-on-Hadoop
• SQL-on-Hadoop tools can be categorized as:
• Interactive or Native SQL
• Batch & Data-Science SQL
• OLAP Cubes (In-memory) on Hadoop

Copyright © William El Kaim 2016 57


What is BI on Hadoop?
SQL-on-Hadoop: Native SQL
• When to use it?
• These engines excel at executing ad-hoc SQL queries for self-service data exploration (often used directly by data analysts) and at executing the machine-generated SQL code from BI tools like Qlik and Tableau.
• Latency is usually measured in seconds to minutes.
• One of the key differentiator among the interactive SQL-on-Hadoop tools is how they
were built.
• Some of the tools, such as Impala and Drill, were developed from the beginning to run on
Hadoop clusters, while others are essentially ports of existing SQL engines that previously ran
on vendors’ massively parallel processing (MPP) databases
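
As an illustration, an analyst can also hit such an engine directly over its DB-API instead of going through a BI tool; a sketch against Impala using the impyla client (host, port, and table are hypothetical):

from impala.dbapi import connect  # impyla package

conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

# The same kind of ad-hoc SQL a BI tool such as Tableau would generate.
cur.execute("""
    SELECT region, COUNT(*) AS orders
    FROM sales_parquet
    GROUP BY region
    ORDER BY orders DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)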

Copyright © William El Kaim 2016 Source: Datanami 58


What is BI on Hadoop?
SQL-on-Hadoop: Native SQL
• Pros
• Highest performance for Big Data workloads
• Connect to Hadoop and also NoSQL systems
• Make Hadoop "look like a database"
• Cons
• Queries may still be too slow for interactive analysis on many TB/PB
• Can't defeat physics
• Interactive tools
• In 2012, Cloudera rolled out the first release of Apache Impala
• MapR has been pushing the schema-less bounds of SQL querying with Apache Drill, which is based on Google's Dremel
• Presto (created by Facebook, now backed by Teradata)
• VectorH (backed by Actian)
• Apache Hawq (backed by Pivotal)
• Apache Phoenix
• BigSQL (backed by IBM)
• Big Data SQL (backed by Oracle)
• Vertica SQL on Hadoop (backed by Hewlett-Packard)

Copyright © William El Kaim 2016 Source: Datanami & Dremio 59


What is BI on Hadoop?
SQL-on-Hadoop: Batch & Data Science SQL
• When to use it?
• Most often used for running big and complex jobs, including ETL and production data
“pipelines,” against massive data sets.
• Apache Hive is the best example of this tool category. The software essentially recreates a
relational-style database atop HDFS, and then uses MapReduce (or more recently, Apache
Tez) as an intermediate processing layer.
• Tools
• Apache Hive, Apache Tez, Apache Spark SQL
• Pros
• Potentially simpler deployment (no daemons)
• New YARN job (MapReduce/Spark) for each query
• Check-pointing support enables very long-running queries
• Days to weeks (ETL work)
• Works well in tandem with machine learning (Spark)
• Cons
• Latency prohibitive for interactive analytics
• Tableau, Qlik Sense, …
• Slower than native SQL engines
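
For flavor, a batch-style job of this category expressed with Spark SQL's DataFrame API (a sketch only; the paths and columns are hypothetical):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("nightly-etl").getOrCreate()

# A long-running ETL "pipeline" step over a massive data set: parse,
# aggregate, and write the refined result back for downstream consumers.
raw = spark.read.json("hdfs:///lake/raw/clickstream/2016-10-*")
daily = (raw.filter(F.col("status") == 200)
            .groupBy("user_id", F.to_date("timestamp").alias("day"))
            .agg(F.count("*").alias("hits")))
daily.write.mode("overwrite").parquet("hdfs:///lake/curated/daily_hits")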

Copyright © William El Kaim 2016 Source: Datanami & Dremio 60


What is BI on Hadoop?
SQL-on-Hadoop: OLAP Cubes (In-memory) on Hadoop
• When to use it?
• Data scientists doing self-service data exploration needing performance (in milliseconds to
seconds).
• Apache Spark SQL pretty much owns this category, although Apache Flink could provide Spark SQL with
competition in this category.
• Often requires an in-memory computing architecture
• Tools
• Apache Kylin, AtScale, Kyvos Insights,
• In-memory: Spark SQL, Apache Flink
• Pros
• Fast queries on pre-aggregated data
• Can use SQL and MDX tools
• Cons
• Explicit cube definition/modeling phase
• Not “self-service”
• Frequent updates required due to dependency on business logic
• Aggregate creation and maintenance can be long (and large)
• User connects to and interacts with the cube
• Can’t interact with the raw data
Copyright © William El Kaim 2016 Source: Dremio 61
What is BI on Hadoop?
SQL-on-Hadoop: OLAP Cubes (In-memory) on Hadoop
• Apache Kylin lets you query massive data sets at sub-second latency in 3 steps:
1. Identify a star schema on Hadoop.
2. Build a cube on Hadoop.
3. Query the data with ANSI SQL and get results via ODBC, JDBC, or a RESTful API (see the sketch below).
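
For step 3, one way to submit ANSI SQL is Kylin's REST query endpoint; a hedged sketch using the requests library (the host is hypothetical, and the project and table come from Kylin's bundled sample cube):

import requests

# POST /kylin/api/query is Kylin's documented REST query interface;
# authentication is HTTP Basic (ADMIN/KYLIN is the sample default).
resp = requests.post(
    "http://kylin.example.com:7070/kylin/api/query",
    auth=("ADMIN", "KYLIN"),
    json={
        "sql": "SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",
        "project": "learn_kylin",
        "limit": 100,
    },
)
resp.raise_for_status()
for row in resp.json()["results"]:
    print(row)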

Copyright © William El Kaim 2016 Source: Apache Kylin 62


What is BI on Hadoop?
SQL-on-Hadoop: Synthesis
• Pros
• Continue using your favorite BI tools and SQL-based clients
• Tableau, Qlik, Power BI, Excel, R, SAS, …
• Technical analysts can write custom SQL queries
• Cons
• Another layer in your data stack
• May need to pre-aggregate the data depending on your scale
• Need a separate data preparation tool (or custom scripts)

Copyright © William El Kaim 2016 Source: Dremio 63


What is BI on Hadoop?
SQL-on-Hadoop: Synthesis

Copyright © William El Kaim 2016 Source: Dremio 64


What is BI on Hadoop?
SQL-on-Hadoop: Encoding Formats
• The different encoding standards result in different block sizes, and that can
impact performance.
• ORC files compress smaller than Parquet files, which can be a decisive choice factor.
• Impala, for example, accesses HDFS data that’s encoded in the Parquet format, while
Hive and others support optimized row column (ORC) files, sequence files, or plain text.
• Semi-structured data format like JSON is gaining traction
• Previously Hadoop users were using MapReduce to pound unstructured data into a
more structured or relational format.
• Drill opened up SQL-based access directly to semi-structured data, such as JSON,
which is a common format found on NoSQL and SQL databases. Cloudera also recently
added support for JSON in Impala.
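
As a small illustration of columnar encoding, a sketch that writes and reads Parquet with the pyarrow library (file name and columns are invented):

import pyarrow as pa
import pyarrow.parquet as pq

# Build a small columnar table and persist it in the Parquet format that
# engines such as Impala and Spark SQL can scan directly.
table = pa.table({
    "user_id": [1, 2, 3],
    "event_type": ["click", "view", "click"],
})
pq.write_table(table, "events.parquet", compression="snappy")

# Columnar read-back: only the requested columns are materialized.
print(pq.read_table("events.parquet", columns=["event_type"]))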

Copyright © William El Kaim 2016 Source: Datanami 65


What is BI on Hadoop?
SQL-on-Hadoop: Decision Tree

Copyright © William El Kaim 2016 Source: Dremio 66


Plan
• Taming The Data Deluge
• What is Big Data?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Data Science?
• Hadoop Processing Paradigms
• Hadoop Tools Landscape

Copyright © William El Kaim 2016 67


What is Data Science?
• Data Science is an interdisciplinary field about processes and systems to
extract knowledge or insights from data in various forms

Copyright © William El Kaim 2016 68


From Hindsight to Insight to Foresight

Copyright © William El Kaim 2016 69


What is Data Science?

Source: Forrester

Source: wikipedia 70
Copyright © William El Kaim 2016
What is Data Science?
Algorithms In Decision Making

Copyright © William El Kaim 2016 71


What is Data Science?
Tools: Anaconda

Copyright © William El Kaim 2016 https://www.continuum.io/ 72


What is Data Science?
Tools: Dataiku DSS

Combine and Join Datasets Create Machine Learning Models


Copyright © William El Kaim 2016 http://www.dataiku.com/dss/ 73
What is Data Science?
Tools: IBM Data Science Experience

• Cloud-based development
environment for near real-time,
high performance analytics
• Available on IBM Cloud Bluemix
platform
• Provides
• 250 curated data sets
• Open source tools and a
collaborative workspace like H2O,
RStudio, Jupyter Notebooks on
Apache Spark

Copyright © William El Kaim 2016 http://datascience.ibm.com/ 74


What is Data Science?
Tools: Tamr

Copyright © William El Kaim 2016 http://www.tamr.com/ 75


Plan
• Taming The Data Deluge
• What is Big Data?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Data Science?
• Hadoop Processing Paradigms
• Hadoop Tools Landscape

Copyright © William El Kaim 2016 76


Hadoop Processing Paradigms

Batch processing
• Large amounts of static data
• Generally incurs high latency
• Volume

Real-time processing
• Compute streaming data
• Low latency
• Velocity

Hybrid computation
• Lambda Architecture
• Volume + Velocity

Copyright © William El Kaim 2016 Source: Rubén Casado & Cloudera 77


Storage Technologies: Cost & Speed

Copyright © William El Kaim 2016 78


Storage Technology: Encoding Format
• Apache Parquet vs. Apache ORC

Copyright © William El Kaim 2016 79


Hadoop Batch processing

• Scalable
• Distributed
• Parallel
• Fault tolerant
• High latency
• Large amounts of static data (Volume)

Copyright © William El Kaim 2016 Source: Rubén Casado 80


Hadoop – Batch Processing - Map Reduce
• MapReduce was designed by
Google as a programming model for
processing large data sets with a
parallel, distributed algorithm on a
cluster.
• Key Terminology
• Job: A “full program” - an execution of a
Mapper and Reducer across a data set
• Task: An execution of a Mapper or a
Reducer on a slice of data – a.k.a. Task-
In-Progress (TIP)
• Task Attempt: A particular instance of an
attempt to execute a task on a machine

Copyright © William El Kaim 2016 81


Hadoop – Batch Processing - Map Reduce
• Processing can occur on data stored either in a filesystem (unstructured) or
in a database (structured).
• MapReduce can take advantage of the locality of data, processing it near the
place it is stored in order to reduce the distance over which it must be
transmitted.
• "Map" step
• Each worker node applies the "map()" function to the local data, and writes the output to a
temporary storage.
• A master node ensures that only one copy of redundant input data is processed.
• "Shuffle" step
• Worker nodes redistribute data based on the output keys (produced by the "map()" function),
such that all data belonging to one key is located on the same worker node.
• "Reduce" step
• Worker nodes now process each group of output data, per key, in parallel.
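
The classic word-count example makes the three steps concrete; a minimal single-process sketch in Python, in the style of a Hadoop Streaming job (on a real cluster the framework, not sorted(), performs the shuffle):

import sys
from itertools import groupby

def mapper(lines):
    # "Map" step: each worker emits (key, value) pairs from its local data.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # "Reduce" step: process each group of values, per key.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    pairs = mapper(sys.stdin)
    # sorted() stands in for the "shuffle" step, which routes all pairs
    # sharing a key to the same worker node.
    for word, total in reducer(sorted(pairs)):
        print(word, total, sep="\t")

Running echo "to be or not to be" | python wordcount.py prints the count per word.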

Copyright © William El Kaim 2016 82


Hadoop – Batch Processing - Map Reduce

Copyright © William El Kaim 2016 83


Real-time Processing

• Low latency
• Continuous unbounded streams of data (Velocity)
• Distributed
• Parallel
• Fault-tolerant

Copyright © William El Kaim 2016 Source: Rubén Casado 84


Real-time (Stream) Processing
• Computational model and Infrastructure for continuous data processing, with
the ability to produce low-latency results
• Data collected continuously is naturally processed continuously (Event Processing or
Complex Event Processing -CEP)

Copyright © William El Kaim 2016 Source: Trivadis 85


Real-time (Stream) Processing
• (Event-) Stream Processing
• A one-at-a-time processing model
• A datum is processed as it arrives
• Sub-second latency
• Difficult to process state data efficiently
• Micro-Batching
• A special case of batch processing with very small (tiny) batch sizes
• A nice mix between batching and streaming
• At the cost of latency
• Gives stateful computation, making windowing an easy task (see the sketch below)
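
A minimal micro-batching sketch with Spark Structured Streaming (a hypothetical local socket source; Spark groups arriving records into tiny batches, which is what makes the stateful aggregation below easy):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("micro-batch").getOrCreate()

# Unbounded input stream, read micro-batch by micro-batch.
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

# Stateful computation across batches: a running count per word.
words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
               .outputMode("complete")   # re-emit full state after each tiny batch
               .format("console")
               .start())
query.awaitTermination()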

Copyright © William El Kaim 2016 Source: Trivadis 86


Real-time (Stream) Processing

Copyright © William El Kaim 2016 Source: Trivadis 87


Real-time (Stream) Processing Arch. Pattern

Copyright © William El Kaim 2016 Source: Cloudera 88


Hybrid Computation Model
• Low latency
• Massive data + streaming data (Volume + Velocity)
• Scalable
• Combine batch and real-time results

Copyright © William El Kaim 2016 Source: Rubén Casado 89


Hybrid Computation: Lambda Architecture
• Data-processing architecture designed to handle massive quantities of data
by taking advantage of both batch- and stream-processing methods.
• A system consisting of three layers: batch processing, speed (or real-time)
processing, and a serving layer for responding to queries.
• This approach to architecture attempts to balance latency, throughput, and fault-
tolerance by using batch processing to provide comprehensive and accurate views of
batch data, while simultaneously using real-time stream processing to provide views of
online data.
• The two view outputs may be joined before presentation.
• Lambda Architecture case stories via lambda-architecture.net

Copyright © William El Kaim 2016 Source: Kreps 90


Hybrid Computation: Lambda Architecture
• Batch layer
• Receives arriving data, combines it with historical data and recomputes results by
iterating over the entire combined data set.
• The batch layer has two major tasks:
• managing historical data; and recomputing results such as machine learning models.
• Operates on the full data and thus allows the system to produce the most accurate
results. However, the results come at the cost of high latency due to high computation
time.
• The speed layer
• Is used in order to provide results in a low-latency, near real-time fashion.
• Receives the arriving data and performs incremental updates to the batch layer results.
• Thanks to the incremental algorithms implemented at the speed layer, computation cost is
significantly reduced.
• The serving layer enables various queries of the results sent from the batch
and speed layers.
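
An intentionally tiny sketch of the three layers in plain Python (invented event data; a real system would run the batch layer on Hadoop and the speed layer on a stream processor):

# Batch layer: recompute an accurate view over ALL historical data (slow).
history = ["click", "view", "click"]
batch_view = {}
for event in history:
    batch_view[event] = batch_view.get(event, 0) + 1

# Speed layer: incremental, low-latency updates for events the last
# batch run has not absorbed yet.
realtime_view = {}
def on_new_event(event):
    realtime_view[event] = realtime_view.get(event, 0) + 1

on_new_event("click")

# Serving layer: answer queries by merging both views.
def query(event):
    return batch_view.get(event, 0) + realtime_view.get(event, 0)

print(query("click"))  # 3 = 2 from the batch view + 1 from the speed view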

Copyright © William El Kaim 2016 Source: Kreps 91


Hybrid computation: Lambda Architecture

Copyright © William El Kaim 2016 Source: Mapr 92


Hybrid computation: Kappa Architecture
• Proposal from Jay Kreps (LinkedIn) in this article.
• Then talk “Turning the database inside out with Apache Samza” by Martin Kleppmann
• Main objective
• Avoid maintaining two separate code bases for the batch and speed layers (lambda).
• Key benefits
• Handle both real-time data processing and continuous data reprocessing using a single
stream processing engine.
• Data reprocessing is an important requirement for making visible the effects of code
changes on the results.

Copyright © William El Kaim 2016 Source: Kreps 93


Hybrid computation: Kappa Architecture
• Architecture is composed of only two layers:
• The stream processing layer runs the stream processing jobs.
• Normally, a single stream processing job is run to enable real-time data processing.
• Data reprocessing is only done when some code of the stream processing job needs to be modified.
• This is achieved by running another, modified stream processing job and replaying all previous data.
• The serving layer is used to query the results (like the Lambda architecture).
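
A hedged sketch of the Kappa idea with the kafka-python client (topic and brokers are hypothetical): the same stream job serves both real-time processing and reprocessing, because replaying is just reading the log again from the earliest offset:

from kafka import KafkaConsumer  # kafka-python package

# A modified job replays all previous data simply by starting a new
# consumer group at the earliest offset of the log.
consumer = KafkaConsumer(
    "events",                                  # hypothetical topic
    bootstrap_servers=["kafka.example.com:9092"],
    auto_offset_reset="earliest",
    group_id="stream-job-v2",                  # new group id => full replay
)

counts = {}
for message in consumer:
    event = message.value.decode("utf-8")
    counts[event] = counts.get(event, 0) + 1
    # ... write the updated result to the serving layer here ...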

Copyright © William El Kaim 2016 Source: O’Reilly 94


Hybrid computation: Lambda vs. Kappa

Lambda: used to derive value from all data in a single processing chain

Kappa: used to provide the freshest data to customers

Source: Kreps

Copyright © William El Kaim 2016 95


Hadoop Processing Paradigms Evolutions

Source: Rubén Casado Tejedor


Copyright © William El Kaim 2016 96
Hadoop Architecture
[Diagram: end-to-end Hadoop architecture, from data sources to application services.
• Data sources: open data, operational systems (ODS, IoT), existing sources of data (databases, data warehouses, data marts)
• Data ingestion: batch and streaming
• Data lake: Hadoop and Spark, holding metadata, source data, and computed data
• Lakeshore: SQL and NoSQL databases
• Consumers: data science tools and platforms (Dataiku, Tableau, Python, R, etc.), BI tools and platforms (Qlik, Tibco, IBM, SAP, BIME, etc.), and application services (Cascading, Crunch, Hfactory, Hunk, Spring for Hadoop), feeding data-driven business processes, applications, and services.]

Copyright © William El Kaim 2016 97


Hadoop Technologies
[Diagram: example technologies at each stage of the pipeline.
• Encoding formats: JSON, RCFile, Parquet, ORCFile
• Ingestion: Apex, Flink, Flume, Kafka, Amazon Kinesis, Nifi, Samza, Spark, Sqoop, Scribe, Storm, NFS Gateway, etc.
• Distributed file systems: GlusterFS, HDFS, Amazon S3, MapRFS, ElasticSearch
• Processing: Map Reduce (batch); event stream and micro-batch (streaming)
• NoSQL databases: Cassandra, Ceph, DynamoDB, HBase, Hive, Impala, Ring, OpenStack Swift, etc.
• Distributions: Cloudera, HortonWorks, MapR, SyncFusion, Amazon EMR, Azure HDInsight, Altiscale, Pachyderm, Qubole, etc.
• Data science: Dataiku, Datameer, Tamr, R, SaS, Python, RapidMiner, etc.; machine learning: BigML, Mahout, Predicsys, Azure ML, TensorFlow, H2O, etc.
• Lakeshore and analytics stores: Cassandra, Druid, DynamoDB, MongoDB, Redshift, Google BigQuery, etc.; data warehouses
• BI tools and platforms: Qlik, Tableau, Tibco, Jethro, Looker, IBM, SAP, BIME, etc.
• App services: Cascading, Crunch, Hfactory, Hunk, Spring for Hadoop, D3.js, Leaflet
Stages: Data Sources → Data Ingestion → Data Lake → Lakeshore & Analytics → Analytics Apps and Services]
Copyright © William El Kaim 2016 98
Plan
• Taming The Data Deluge
• What is Big Data?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Data Science?
• Hadoop Processing Paradigms
• Hadoop Tools Landscape

Copyright © William El Kaim 2016 99


Big Data Landscape
Hadoop Distributors
• Three Main pure-play Hadoop distributors
• Cloudera, Hortonworks, and MapR Technologies
• Other Hadoop distributors
• SyncFusion: Hadoop for Windows
• Pivotal Big Data Suite
• Pachyderm
• Hadoop Cloud Provider:
• Altiscale, Amazon EMR, BigStep, Google Cloud
DataProc, HortonWorks SequenceIQ, IBM
BigInsights, Microsoft HDInsight, Oracle Big
Data, Qubole, Rackspace
• Hadoop Infrastructure as a Service
• BlueData, Packet

[Figure: Forrester Big Data Hadoop Cloud, Q1 2016]

Copyright © William El Kaim 2016 100


Big Data Landscape
Data Preparation

Source: Bloor Research

Copyright © William El Kaim 2016 101


Big Data Landscape
Business Intelligence and Analytics Platforms

Copyright © William El Kaim 2016 102


Big Data Landscape
Hadoop Distributions To Start With
• Apache Hadoop
• Cloudera Live
• Dataiku
• Hortonworks Sandbox
• IBM BigInsights
• MapR Sandbox
• Microsoft Azure HDInsight
• Syncfusion Hadoop for Windows
• W3C Big Data Europe Platform

Copyright © William El Kaim 2016 103


EA Digital Codex: http://www.eacodex.com/
Twitter: http://www.twitter.com/welkaim
Linkedin: http://fr.linkedin.com/in/williamelkaim
SlideShare: http://www.slideshare.net/welkaim

Copyright © William El Kaim 2016 104
