Introduction To Big Data: William El Kaim Oct. 2016 - V3.0
William El Kaim
Oct. 2016 – V3.0
This presentation is part of the Enterprise Architecture Digital Codex
Source: Capgemini
Visualization
Big Data: A collection of data sets so large and complex that it becomes difficult to process
using on-hand database management tools or traditional data processing applications
Big Data: When the data cannot fit in Excel. The limit used to be 65,536 rows; it is now 1,048,576.
Big Data: When it's cheaper to keep everything than spend the effort to decide what to throw
away (David Brower @dbrower)
Doug Cutting extended Apache Nutch
1. Data Management and Storage
2. Big Data Analytics and Use
ETL Pre-processor
• Shift ETL pre-processing from the staging data warehouse to Hadoop
• Moves work from high-cost data warehousing to lower-cost Hadoop clusters
Massive Storage
• Offload large volumes of historical data into cold storage on Hadoop
• Keep data warehouse for hot data to allow BI and analytics
• When data from cold storage is needed, it can be moved back into the warehouse
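The hot/cold split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record` type, the 90-day hot window, and the function names are hypothetical and do not come from the presentation; a real deployment would move data between a warehouse and HDFS rather than between in-memory lists.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Record:
    """Hypothetical warehouse row: a key plus the date of the event."""
    key: str
    event_date: date


def split_hot_cold(records, today, hot_window_days=90):
    """Partition records into (hot, cold) by age.

    Hot records stay in the warehouse for BI; cold records are
    offloaded. The 90-day default is an assumption for illustration.
    """
    cutoff = today - timedelta(days=hot_window_days)
    hot = [r for r in records if r.event_date >= cutoff]
    cold = [r for r in records if r.event_date < cutoff]
    return hot, cold


def recall_from_cold(cold, keys):
    """Return (recalled, remaining): cold records requested by key, and the rest."""
    wanted = set(keys)
    recalled = [r for r in cold if r.key in wanted]
    remaining = [r for r in cold if r.key not in wanted]
    return recalled, remaining


today = date(2016, 10, 1)
records = [
    Record("order-1", date(2016, 9, 15)),
    Record("order-2", date(2014, 3, 2)),
    Record("order-3", date(2016, 1, 20)),
]

hot, cold = split_hot_cold(records, today)

# An analyst needs an old order again: move it back into the warehouse.
recalled, cold = recall_from_cold(cold, ["order-2"])
hot = hot + recalled
```

After the recall step, `hot` holds `order-1` and `order-2`, while `order-3` stays in cold storage.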
Data Discovery
• Keep data warehouse for operational BI and analytics
• Allow data scientists to make new discoveries on raw data (no format or structure)
• Operationalize discoveries back into the warehouse
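One way to picture discovery on raw data is to apply structure only when the data is read, then push the cleaned result back as warehouse-ready rows. The sketch below assumes raw sensor lines in loosely formatted JSON; the field names `sensor` and `temp`, the sample data, and the function names are all hypothetical.

```python
import json
from collections import defaultdict

# Raw, unvalidated lines as they might land in Hadoop (hypothetical sample).
RAW_LINES = [
    '{"sensor": "s1", "temp": 21.5}',
    '{"sensor": "s2", "temp": "n/a"}',
    'not json at all',
    '{"sensor": "s1", "temp": 22.5, "extra": true}',
]


def read_with_schema(lines):
    """Apply a (sensor, temp) schema at read time; skip lines that do not fit."""
    for line in lines:
        try:
            obj = json.loads(line)
            yield str(obj["sensor"]), float(obj["temp"])
        except (ValueError, KeyError, TypeError):
            # Malformed JSON, missing field, or non-numeric value: ignore.
            continue


def average_by_sensor(rows):
    """Aggregate readings into one structured row per sensor."""
    totals = defaultdict(lambda: [0.0, 0])
    for sensor, temp in rows:
        totals[sensor][0] += temp
        totals[sensor][1] += 1
    return {s: total / n for s, (total, n) in totals.items()}


# The structured result is what would be loaded back into the warehouse.
warehouse_rows = average_by_sensor(read_with_schema(RAW_LINES))
```

Only the two well-formed `s1` readings survive, so `warehouse_rows` contains a single average for `s1`.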
Figure: layered Big Data architecture
1. Data Source Layer: sensors, devices, DB data, external data
2. Data Storage Layer
3. Data Processing / Analysis Layer
Business Intelligence & Analytics: dashboards, reports, visualization, …
Copyright © William El Kaim 2016. Source: The Modeling Agency and sv-europe
Plan
• Taming The Data Deluge
• What is Big Data?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Data Science?
• Hadoop Processing Paradigms
• Hadoop Tools Landscape
Source: Datanami & Cloudera & Dremio
What is BI on Hadoop?
Three Options: SQL-on-Hadoop
• SQL-on-Hadoop tools can be categorized as:
• Interactive or Native SQL
• Batch & Data-Science SQL
• OLAP Cubes (In-memory) on Hadoop
Source: Forrester
Source: Wikipedia
What is Data Science?
Algorithms In Decision Making
• Cloud-based development environment for near real-time, high-performance analytics
• Available on the IBM Cloud Bluemix platform
• Provides
• 250 curated data sets
• A collaborative workspace with open source tools such as H2O, RStudio, and Jupyter Notebooks on Apache Spark
Batch processing
• Large amount of static data
• Generally incurs high latency
• Volume
Real-time processing
• Compute streaming data
• Low latency
• Velocity
Hybrid computation
• Lambda Architecture
• Volume + Velocity
Batch (Volume)
• Scalable
• Large amount of static data
• Distributed
• Parallel
• Fault tolerant
• High latency
Real-time (Velocity)
• Low latency
• Continuous unbounded streams of data
• Distributed
• Parallel
• Fault tolerant
Hybrid
• Volume + Velocity
Source: Kreps
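The hybrid (Lambda Architecture) idea can be illustrated with a toy counter: a batch layer periodically recomputes a view from the full data set, a speed layer keeps low-latency increments for data that arrived since the last batch run, and queries merge the two. This is a minimal single-process sketch; the class and method names are hypothetical, and a real system would run the layers on separate engines (for example a batch framework and a stream processor) rather than in one object.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda Architecture: batch view + speed view, merged at query time."""

    def __init__(self):
        self.master_log = []         # append-only master data set (volume)
        self.batch_view = Counter()  # recomputed from the whole log, high latency
        self.speed_view = Counter()  # incremental, low latency (velocity)

    def ingest(self, key):
        """New events go to the master log and to the speed layer."""
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Recompute the batch view from scratch, then reset the speed view."""
        self.batch_view = Counter(self.master_log)
        self.speed_view = Counter()

    def query(self, key):
        """Serving layer: merge the batch and real-time results."""
        return self.batch_view[key] + self.speed_view[key]


lam = LambdaCounter()
for page in ["home", "cart", "home"]:
    lam.ingest(page)

before_batch = lam.query("home")  # answered by the speed view only

lam.run_batch()                   # the batch view now covers the first 3 events
lam.ingest("home")                # a new event arrives after the batch run

after_batch = lam.query("home")   # batch view + speed view
```

The design choice to note is that the batch view is always rebuilt from the complete log, so any error in the speed layer is corrected at the next batch run; the price is keeping two code paths that must agree.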
Data Sources
• Open Data
• Operational Systems (ODS, IoT)
• Existing Sources of Data (Databases, DW, DataMart)
Data Ingestion
• Batch: MapReduce
• Streaming: Event Stream & Micro Batch
• Ingestion Technologies: Apex, Flink, Flume, Kafka, Amazon Kinesis, NiFi, Samza, Spark, Sqoop, Scribe, Storm, NFS Gateway, etc.
Data Lake
• Encoding Format: JSON, RCFile, Parquet, ORCFile
• Distributed File System: GlusterFS, HDFS, Amazon S3, MapRFS, ElasticSearch
• NoSQL Databases: Cassandra, Ceph, DynamoDB, HBase, Hive, Impala, Ring, OpenStack Swift, etc.
• Distributions: Cloudera, Hortonworks, MapR, SyncFusion, Amazon EMR, Azure HDInsight, Altiscale, Pachyderm, Qubole, etc.
Lakeshore & Analytics
• Data Science: Dataiku, Datameer, Tamr, R, SAS, Python, RapidMiner, etc.
• Machine Learning: BigML, Mahout, Predicsys, Azure ML, TensorFlow, H2O, etc.
• Data Warehouse: Cassandra, Druid, DynamoDB, MongoDB, Redshift, Google BigQuery, etc.
• BI Tools & Platforms: Qlik, Tableau, Tibco, Jethro, Looker, IBM, SAP, BIME, etc.
Analytics App and Services
• App. Services: Cascading, Crunch, Hfactory, Hunk, Spring for Hadoop, D3.js, Leaflet
Claudine O'Sullivan