Professional Documents
Culture Documents
IIM Cal Big Data Course Slides
Big Data – Opportunities / Case Studies / Examples
Dissecting the components of MapReduce and programming
Distributions – installation and configuration of a Cloudera single-node cluster in a local environment and in AWS
Hands on – MapReduce and practice exercise
Basics of JSON and the Cypher query language (Neo4j)
Hands on – Pig scripts and Hive QL
Demonstration of a practical use case in Neo4j and MongoDB
The Ecosystem Part 3 – Why NoSQL? Ex. graph (Neo4j) and document (MongoDB) databases
Visualization Showcase – Data in Motion using R & googleVis
Practice Exercises
Q & A + Exam
Bytes ++
Every two days we now produce more data than was created from the beginning of time until 3-4 years ago.
There are over 5,000 exabytes of data in the cloud; printed as books and stacked, this would reach from Earth to Pluto roughly 90 times over.
Every minute we send ~400 million emails, ~2.5 million Facebook likes, and ~350 thousand tweets, and upload ~300,000 photos to Facebook.
Burned to DVDs and stacked, all the data we have created so far would reach the moon and back three times.
Decoding the human genome originally took 10 years to process; now it can be achieved in one week.
Walmart handles more than 1 million customer transactions every hour, which are imported into
databases estimated to contain more than 2.5 petabytes of data.
The hype: a comparison over the last 3 years
http://sortbenchmark.org/
Instructions to Install Cloudera CDH 5.13/5.14 in Google Cloud
- for the APDS 2018-19 batch, IIM Calcutta
Prepare Your Own Google Cloud Environment
1. Create a Free Trial in Google Cloud
5. Perform an "Edit" and upload your SSH keys. These are essential for connecting to the VM.
7. Once the keys are uploaded in the SSH tab, "Save" them and go to the "VM instances" option on the left.
2. Access scopes: Allow full access to all Cloud APIs. Firewall: check both "Allow HTTP traffic" and "Allow HTTPS traffic".
10. Once your VM is up, it will have a public IP; copy it.
11. Now in the Project Dashboard, go to the "VPC network" option in the "Networking" section and select "Firewall rules" in the sub-menu.
Prepare Your Own Google Cloud Environment…
12. Select "CREATE FIREWALL RULE" in the right frame
4. Protocols and Ports: select "Allow All" OR type "tcp:6000-9000" (without the quotes) and keep the "Specified protocols and ports" option selected.
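The firewall steps above can also be done from the command line; a sketch, assuming the gcloud SDK is installed and authenticated, and the rule name "cdh-ports" is arbitrary (not from the course material):

```shell
# Hypothetical equivalent of steps 11-12 and 4 above: open TCP ports
# 6000-9000 so the Cloudera Manager and CDH web UIs are reachable.
gcloud compute firewall-rules create cdh-ports \
    --allow=tcp:6000-9000 \
    --source-ranges=0.0.0.0/0 \
    --description="Open ports for Cloudera Manager and CDH web UIs"
```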
2. https://www.youtube.com/watch?v=S2MocgFZMPU
Install the JDK (there are quite a few videos on YouTube for this).
sudo ./cloudera-manager-installer.bin
Note: view the video in VLC media player. You can download VLC from –
https://www.videolan.org/vlc/index.html
1. Some of you already know Linux, networking, and Big Data. For those who do not, I have kept
some tutorials and links to free classes on Linux and networking.
https://s3.amazonaws.com/iimcal/Tutorials/Client+Server+Communication.pdf
https://s3.amazonaws.com/iimcal/Tutorials/Ubuntu+Pocket+Guide.pdf
https://s3.amazonaws.com/iimcal/Tutorials/Ubuntu+Reference.pdf
https://s3.amazonaws.com/iimcal/Tutorials/Free+Additional+Courses+Links.txt
Some Important Points to Remember
Never keep your VM up and running idle, as it will eat into your free-tier quota.
First: log in to the Cloudera console (http://<VMPublicIP>:7180/) and stop the cluster.
Second: whenever you are not practicing, "Stop" the VM. Always check the free
credit amount in the Google Cloud console dashboard at the top left.
Third: if you know you are not going to use the VM for the next 7 days, terminate it.
You can always create another cluster within 30-45 minutes.
Fourth: if you terminate the VM, make sure you back up your files and work.
Big Data – Ecosystem
Distributions
Cloudera
Hortonworks
MapR (Fastest)
Why Another File System..
Let us understand Blocks..
HDFS..
Data is split into blocks and distributed across multiple nodes in the cluster.
Suitable for applications that require high throughput access to large data sets.
Hardware failure
An HDFS instance consists of hundreds of machines, each of which can fail. A key goal of the HDFS
architecture is to support detection of such faults and quick recovery from them.
Data Locality
HDFS achieves greater efficiency by moving computation to the data. Since files are spread across the distributed file system as chunks,
each compute process running on a node operates on a subset of the data. Which data a node operates on is chosen based on its
locality to the node: most data is read from the local disk straight into the CPU, alleviating strain on network bandwidth and
preventing unnecessary network transfers.
Portability
Designed to be portable from one platform to another facilitating wider adoption.
Economy
Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to run on clusters of commodity hardware
(commonly available hardware from multiple vendors) for which the chance of node failure across the cluster is high,
at least for large clusters.
When NOT to use HDFS..
An HDFS cluster consists of a single NameNode, a master server that manages the file system
namespace and regulates access to files by clients.
In addition, there are a number of DataNodes, usually one per node in the cluster, which manage
storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.
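The block splitting described above can be sketched in a few lines; a toy illustration (not HDFS source code), with a tiny block size standing in for HDFS's 128 MB default:

```python
# Sketch: how a file's bytes are split into fixed-size blocks, the way
# HDFS splits files into blocks before distributing them to DataNodes.
def split_into_blocks(data: bytes, block_size: int):
    """Split a byte string into blocks; only the last block may be smaller."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"0123456789ABCDEF", block_size=6)
print([len(b) for b in blocks])  # [6, 6, 4] - last block is smaller
```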
NameNode..
The NameNode (master) executes file system namespace operations like opening, closing, and renaming files and directories. It also
determines the mapping of blocks to DataNodes.
The NameNode maintains the file system tree and metadata for all files and directories in the tree. Any change to the file system
namespace or its properties is recorded by the NameNode.
This information is stored persistently on the local disk in the form of two files: the FsImage and the Edit log.
The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. The
NameNode uses a file in its local host OS file system to store the EditLog.
o The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage
(stored in local file system).
o The NameNode keeps an image of the entire file system namespace and file Blockmap in memory.
o When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory
representation of the FsImage, and flushes out this new version into a new FsImage on disk.
o It then truncates the old EditLog. This process is called a checkpoint (it generally occurs during startup).
The namenode also knows the datanodes on which all the blocks are stored for a given file.
It does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.
A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes.
The client presents a filesystem interface, so the user code does not need to know about the namenode and datanode to function.
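The FsImage/EditLog checkpoint described above can be sketched as a snapshot plus a replayable log; a simplified illustration (the real NameNode stores binary on-disk structures, and the operation names here are invented for the example):

```python
# Sketch of a checkpoint: apply every EditLog entry to the FsImage
# snapshot, write the new image, and truncate the log.
fsimage = {"/a.txt": ["blk_1"], "/b.txt": ["blk_2"]}   # path -> block list
editlog = [
    ("create", "/c.txt", ["blk_3"]),
    ("rename", "/a.txt", "/archived.txt"),
    ("delete", "/b.txt", None),
]

def checkpoint(image, log):
    image = dict(image)                  # start from the on-disk snapshot
    for op, path, arg in log:
        if op == "create":
            image[path] = arg
        elif op == "rename":
            image[arg] = image.pop(path)
        elif op == "delete":
            image.pop(path, None)
    return image, []                     # new FsImage, truncated EditLog

fsimage, editlog = checkpoint(fsimage, editlog)
print(sorted(fsimage))  # ['/archived.txt', '/c.txt']
```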
DataNode..
The DataNodes are responsible for serving read and write requests from the file system’s clients.
The DataNodes also perform block creation, deletion, and replication upon instruction from the
NameNode.
Each DataNode periodically sends a Heartbeat and a Blockreport to the NameNode in the cluster.
Receipt of a Heartbeat implies that the DataNode is functioning properly; a Blockreport contains a
list of all blocks on the DataNode.
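The heartbeat mechanism above amounts to a timeout check; a hypothetical sketch (the timeout and node names are made up for the example; real HDFS defaults are a 3-second heartbeat interval and roughly a 10.5-minute dead-node timeout):

```python
# Sketch: the NameNode marks a DataNode dead if no heartbeat has
# arrived within the timeout window.
HEARTBEAT_TIMEOUT = 30  # seconds (illustrative value, not the HDFS default)

last_heartbeat = {"dn1": 100.0, "dn2": 112.0, "dn3": 70.0}  # node -> last-seen time

def dead_nodes(last_seen, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the nodes whose last heartbeat is older than the timeout."""
    return sorted(dn for dn, t in last_seen.items() if now - t > timeout)

print(dead_nodes(last_heartbeat, now=115.0))  # ['dn3']
```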
Replication facts..
HDFS is designed to reliably store very large files across machines in a large cluster.
It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size.
Files in HDFS are write-once and have strictly one writer at any time.
In most cases, network bandwidth between machines in the same rack is greater than network
bandwidth between machines in different racks.
The NameNode determines the rack ID each DataNode belongs to via a process called Hadoop
Rack Awareness.
The default placement policy puts the first replica on the writer's node, the second on a node in a
different (remote) rack, and the third on a different node in that same remote rack.
This policy cuts the inter-rack write traffic, which generally improves write performance.
The chance of rack failure is far less than that of node failure - this policy does not impact data
reliability and availability guarantees.
However, it does reduce the aggregate network bandwidth used when reading data since a block is
placed in only two unique racks rather than three.
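The rack-aware placement above can be sketched directly; a simplified illustration of the default policy (the real `BlockPlacementPolicyDefault` also weighs node load and free space, which this ignores):

```python
# Sketch: first replica on the writer's node, second on a node in a
# different rack, third on another node in that same remote rack -
# so each block lands in exactly two racks, not three.
import random

def place_replicas(writer, topology, rng=random):
    """topology: {rack: [nodes]}; writer must appear in topology."""
    writer_rack = next(r for r, ns in topology.items() if writer in ns)
    remote_rack = rng.choice([r for r in topology if r != writer_rack])
    second = rng.choice(topology[remote_rack])
    third = rng.choice([n for n in topology[remote_rack] if n != second])
    return [writer, second, third]

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
print(place_replicas("n1", topology))
```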
Hands on - HDFS Sample Commands practice
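A minimal set of commands to practice on the cluster VM; the paths and file names are examples, not from the course material (`hdfs dfs` is the modern form of the older `hadoop fs`):

```shell
hdfs dfs -ls /                                   # list the root of HDFS
hdfs dfs -mkdir -p /user/train/input             # create a directory tree
hdfs dfs -put localfile.txt /user/train/input    # copy from the local FS into HDFS
hdfs dfs -cat /user/train/input/localfile.txt    # print a file's contents
hdfs dfs -get /user/train/input/localfile.txt .  # copy back to the local FS
hdfs dfs -rm -r /user/train/input                # remove a directory recursively
hdfs fsck / -files -blocks                       # report block-level health
```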
HDFS Read Write – In Detail..
Hands on - HDFS Java Programs for Read and Write..
[Figure: MapReduce word-count flow - mappers emit (word, 1) pairs such as (Cat, 1), (Dog, 1), (Bird, 1); the shuffle groups pairs by word; reducers sum each group into totals such as (Cat, 3) and (Bird, 2).]
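The map, shuffle, and reduce stages of the word count can be mirrored in plain Python; a sketch with an invented input (this is not Hadoop code, just the same dataflow):

```python
# Word count, MapReduce-style: map emits (word, 1), shuffle groups
# by key, reduce sums each group.
from collections import defaultdict

lines = ["Cat Dog Cat", "Mouse Cat", "Bird Bird", "Bird"]

# Map: emit a (word, 1) pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the emitted values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum each group into a total
counts = {word: sum(vals) for word, vals in groups.items()}
print(sorted(counts.items()))
# [('Bird', 3), ('Cat', 3), ('Dog', 1), ('Mouse', 1)]
```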
Patrick Meier is an internationally recognized expert and consultant on humanitarian technology and innovation. His book,
Digital Humanitarians, has been praised by Harvard, MIT, Stanford, Oxford, the UN, the Red Cross, the World Bank, USAID, and
others.
https://irevolutions.org/bio/
Deb Roy is a tenured professor at MIT and served as Chief Media Scientist of Twitter from 2013-2017.
A native of Winnipeg, Manitoba, Canada, Roy received his PhD in Media Arts and Sciences from MIT.
MIT researcher Deb Roy wanted to understand how his infant son learned language -- so he wired up his house with video
cameras to catch every moment (with exceptions) of his son's life, then parsed 90,000 hours of home video to watch "gaaaa"
slowly turn into "water."
https://dkroy.media.mit.edu/
https://www.ted.com/talks/deb_roy_the_birth_of_a_word?language=en
Pre-requisites
Background in a programming language; core Java is preferred, as the examples (Spark) will be in Java.
A brief knowledge of cloud computing; some handouts can be given before the course.
Astronomy
LSST - The Large Synoptic Survey Telescope (LSST) is a wide-field survey reflecting telescope with an 8.4-
meter primary mirror
PAN-STARRS - The Panoramic Survey Telescope and Rapid Response System (Pan-STARRS; code: F51
and F52) located at Haleakala Observatory, Hawaii, consists of astronomical cameras, telescopes and a
computing facility that is surveying the sky for moving objects on a continual basis, including accurate
astrometry and photometry of already detected objects
SDSS - The Sloan Digital Sky Survey or SDSS is a major multi-filter imaging and spectroscopic redshift
survey using a dedicated 2.5-m wide-angle optical telescope at Apache Point Observatory in New Mexico,
United States. The project was named after the Alfred P. Sloan Foundation, which contributed significant
funding.
N-body simulations - In physics and astronomy, an N-body simulation is a simulation of a dynamical system of
particles, usually under the influence of physical forces, such as gravity (see n-body problem). N-body
simulations are widely used tools in astrophysics, from investigating the dynamics of few-body systems like
the Earth-Moon-Sun system to understanding the evolution of the large-scale structure of the universe. In
physical cosmology, N-body simulations are used to study processes of non-linear structure formation such
as galaxy filaments and galaxy halos from the influence of dark matter. Direct N-body simulations are used
to study the dynamical evolution of star clusters.
Reference Terminology
Ocean Sciences
Terms
Fertility Rate - Number of live births per 1000 women between the ages of 15 and 44 years
Life Expectancy - Life expectancy equals the average number of years a person born in a given country is
expected to live if mortality rates at each age were to remain steady in the future
GDP Per Capita - GDP per capita is a measure of a country's economic output that accounts for population.
It divides the country's gross domestic product by its total population. That makes it the best measurement
of a country's standard of living: it tells you how prosperous a country feels to each of its citizens.
Why the Largest Economies Aren't the Richest per Capita - GDP per capita allows you to compare the
prosperity of countries with different population sizes. For example, U.S. GDP was $18.56 trillion in 2016.
But one reason that total is so large is that the U.S. has so many people; it's the third most populous
country after China and India.
The United States must spread its wealth among 324 million people. As a result, its GDP per capita is only
$57,300. That makes it the 18th most prosperous country per person.
China has the largest GDP in the world on a purchasing-power-parity basis, producing $21.2 trillion in 2016. But its GDP
per capita was only $15,400, because it has four times as many people as the United States.
Big Data – Why NoSQL?
Why? Types
A video showing the athletic power of quadcopters and the data they generate for processing.
History of Spark
Why Spark ?
Why Spark – More Speed Samples?
[Chart: running time (s) vs. number of iterations for Hadoop and Spark on an iterative job - Hadoop takes ~110 sec per iteration; Spark takes ~80 sec for the first iteration and ~1 sec for further iterations.]
Spark emerged as the fastest open-source solution to sort 100 TB of data in the Daytona GraySort Benchmark (http://sortbenchmark.org/).
Overview of Spark
An emerging open-source big data cluster computing framework, an alternative to MapReduce, and a top-priority project in
Apache.
Reliable, scalable, fast, parallel, in-memory cluster computing.
SQL, streaming, and complex analytics in one framework.
[Diagram: applications (App 1 … App N) running on Spark's components - Spark SQL, Spark Streaming, MLlib, and GraphX.]
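The iteration-time gap between Hadoop and Spark comes largely from where the working set lives between iterations; a plain-Python sketch of the effect (not Spark code - re-parsing the raw input each pass stands in for MapReduce re-reading from disk, while parsing once and reusing the list stands in for Spark caching an RDD in memory):

```python
# Compare re-parsing the input every iteration vs. parsing once and
# caching the result, counting how many times the input is parsed.
raw = "\n".join(str(i) for i in range(1000))

def iterate_without_cache(iterations):
    parses = 0
    for _ in range(iterations):
        data = [int(x) for x in raw.split("\n")]  # re-parse every iteration
        parses += 1
        total = sum(data)
    return parses, total

def iterate_with_cache(iterations):
    data = [int(x) for x in raw.split("\n")]      # parse once, keep in memory
    parses = 1
    for _ in range(iterations):
        total = sum(data)
    return parses, total

print(iterate_without_cache(10)[0], iterate_with_cache(10)[0])  # 10 1
```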