Professional Documents
Culture Documents
IIM Cal Big Data Course Slides
Big Data – Opportunities / Case Studies / Examples
Dissecting the components of MapReduce and programming
Distributions – installation and configuration of a Cloudera single-node cluster in a local environment and in AWS
Hands on – MapReduce and practice exercise
Basics of JSON and the Cypher query language (Neo4j)
Hands on – Pig scripts and Hive QL
Demonstration of a practical use case in Neo4j and MongoDB
The Ecosystem Part 3 – Why NoSQL? Ex. graph (Neo4j) and document (MongoDB) databases
Visualization Showcase – Data in Motion using R & googleVis
Practice Exercises
Q & A + Exam
Bytes ++
Every two days we now produce more data than was created from the beginning of time until 3-4 years ago.
There are over 5,000 exabytes of data in the cloud; printed as books and stacked, this would reach from Earth to Pluto roughly 90 times over.
Every minute we send ~400 million emails, ~2.5 million Facebook likes, and ~350 thousand tweets, and upload ~300,000 photos to Facebook.
Burned to DVDs and stacked, all the data we have created so far would reach the moon and back three times.
Decoding the human genome originally took 10 years to process; now it can be achieved in one week.
Walmart handles more than 1 million customer transactions every hour, which are imported into
databases estimated to contain more than 2.5 petabytes of data.
The hype: a comparison over the last 3 years
http://sortbenchmark.org/
Instructions to Install Cloudera CDH 5.13/5.14 in Google Cloud
- for the APDS 2018-19 batch, IIM Calcutta
Prepare Your Own Google Cloud Environment
1. Create a Free Trial in Google Cloud
5. Perform an "Edit" and upload your SSH keys. These are essential for connecting to the VM.
7. Once the keys are uploaded in the SSH tab, "Save" them and go to the "VM instances" option on the left.
2. Access scopes: Allow full access to all Cloud APIs. Firewall: check both "Allow HTTP traffic" and "Allow HTTPS traffic".
10. Once your VM is up, it will have a public IP; copy it.
11. Now in the Project Dashboard, go to the "VPC network" option in the "Networking" section and select "Firewall rules" in the sub-menu.
Prepare Your Own Google Cloud Environment…
12. Select "CREATE FIREWALL RULE" in the right frame
4. Protocols and Ports: select "Allow All" OR type "tcp:6000-9000" (without the quotes) and keep the "Specified protocols and ports" option selected.
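The firewall steps above can also be done from the command line; a sketch, assuming the gcloud SDK is installed and authenticated, and the rule name "cdh-ports" is arbitrary (not from the course material):

```shell
# Hypothetical equivalent of steps 11-12 and 4 above: open TCP ports
# 6000-9000 so the Cloudera Manager and CDH web UIs are reachable.
gcloud compute firewall-rules create cdh-ports \
    --allow=tcp:6000-9000 \
    --source-ranges=0.0.0.0/0 \
    --description="Open ports for Cloudera Manager and CDH web UIs"
```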
2. https://www.youtube.com/watch?v=S2MocgFZMPU
Install the JDK (there are quite a few videos on YouTube for this).
sudo ./cloudera-manager-installer.bin
Note: view the video in VLC media player. You can download VLC from –
https://www.videolan.org/vlc/index.html
1. Some of you already know Linux, networking, and Big Data. For those who do not, I have kept
some tutorials and links to free classes on Linux and networking.
https://s3.amazonaws.com/iimcal/Tutorials/Client+Server+Communication.pdf
https://s3.amazonaws.com/iimcal/Tutorials/Ubuntu+Pocket+Guide.pdf
https://s3.amazonaws.com/iimcal/Tutorials/Ubuntu+Reference.pdf
https://s3.amazonaws.com/iimcal/Tutorials/Free+Additional+Courses+Links.txt
Some Important Points to Remember
Never keep your VM up and running idle, as it will eat into your free-tier quota.
First: log in to the Cloudera console (http://<VMPublicIP>:7180/) and stop the cluster.
Second: whenever you are not practicing, "Stop" the VM. Always check the free
credit amount in the Google Cloud console dashboard at the top left.
Third: if you know you are not going to use the VM for the next 7 days, terminate it.
You can always create another cluster within 30-45 minutes.
Fourth: if you terminate the VM, make sure you back up your files and work.
Big Data – Ecosystem
Distributions
Cloudera
Hortonworks
MapR (Fastest)
Why Another File System..
Let us understand Blocks..
HDFS..
Data is split into blocks and distributed across multiple nodes in the cluster.
Suitable for applications that require high throughput access to large data sets.
Hardware failure
An HDFS instance consists of hundreds of machines, each of which can fail. A key goal of the HDFS
architecture is to support detection of such faults and quick recovery from them.
Data Locality
HDFS achieves greater efficiency by moving computation to the data. Since files are spread across the distributed file system as chunks,
each compute process running on a node operates on a subset of the data. Which data a node operates on is chosen based on its
locality to the node: most data is read from the local disk straight into the CPU, alleviating strain on network bandwidth and
preventing unnecessary network transfers.
Portability
Designed to be portable from one platform to another facilitating wider adoption.
Economy
Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to run on clusters of commodity hardware
(commonly available hardware from multiple vendors) for which the chance of node failure across the cluster is high,
at least for large clusters.
When NOT to use HDFS..
An HDFS cluster consists of a single NameNode, a master server that manages the file system
namespace and regulates access to files by clients.
In addition, there are a number of DataNodes, usually one per node in the cluster, which manage
storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.
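The block splitting described above can be sketched in a few lines; a toy illustration (not HDFS source code), with a tiny block size standing in for HDFS's 128 MB default:

```python
# Sketch: how a file's bytes are split into fixed-size blocks, the way
# HDFS splits files into blocks before distributing them to DataNodes.
def split_into_blocks(data: bytes, block_size: int):
    """Split a byte string into blocks; only the last block may be smaller."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"0123456789ABCDEF", block_size=6)
print([len(b) for b in blocks])  # [6, 6, 4] - last block is smaller
```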
NameNode..
The NameNode (master) executes file system namespace operations like opening, closing, and renaming files and directories. It also
determines the mapping of blocks to DataNodes.
The NameNode maintains the file system tree and metadata for all files and directories in the tree. Any change to the file system
namespace or its properties is recorded by the NameNode.
This information is stored persistently on the local disk in the form of two files: the FsImage and the Edit log.
The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. The
NameNode uses a file in its local host OS file system to store the EditLog.
o The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage
(stored in local file system).
o The NameNode keeps an image of the entire file system namespace and file Blockmap in memory.
o When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory
representation of the FsImage, and flushes out this new version into a new FsImage on disk.
o It then truncates the old EditLog. This process is called a checkpoint (it generally occurs during startup).
The namenode also knows the datanodes on which all the blocks are stored for a given file.
It does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.
A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes.
The client presents a filesystem interface, so the user code does not need to know about the namenode and datanode to function.
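The FsImage/EditLog checkpoint described above can be sketched as a snapshot plus a replayable log; a simplified illustration (the real NameNode stores binary on-disk structures, and the operation names here are invented for the example):

```python
# Sketch of a checkpoint: apply every EditLog entry to the FsImage
# snapshot, write the new image, and truncate the log.
fsimage = {"/a.txt": ["blk_1"], "/b.txt": ["blk_2"]}   # path -> block list
editlog = [
    ("create", "/c.txt", ["blk_3"]),
    ("rename", "/a.txt", "/archived.txt"),
    ("delete", "/b.txt", None),
]

def checkpoint(image, log):
    image = dict(image)                  # start from the on-disk snapshot
    for op, path, arg in log:
        if op == "create":
            image[path] = arg
        elif op == "rename":
            image[arg] = image.pop(path)
        elif op == "delete":
            image.pop(path, None)
    return image, []                     # new FsImage, truncated EditLog

fsimage, editlog = checkpoint(fsimage, editlog)
print(sorted(fsimage))  # ['/archived.txt', '/c.txt']
```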
DataNode..
The DataNodes are responsible for serving read and write requests from the file system’s clients.
The DataNodes also perform block creation, deletion, and replication upon instruction from the
NameNode.
Each DataNode periodically sends a Heartbeat and a Blockreport to the NameNode in the cluster.
Receipt of a Heartbeat implies that the DataNode is functioning properly; a Blockreport contains a
list of all blocks on the DataNode.
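The heartbeat mechanism above amounts to a timeout check; a hypothetical sketch (the timeout and node names are made up for the example; real HDFS defaults are a 3-second heartbeat interval and roughly a 10.5-minute dead-node timeout):

```python
# Sketch: the NameNode marks a DataNode dead if no heartbeat has
# arrived within the timeout window.
HEARTBEAT_TIMEOUT = 30  # seconds (illustrative value, not the HDFS default)

last_heartbeat = {"dn1": 100.0, "dn2": 112.0, "dn3": 70.0}  # node -> last-seen time

def dead_nodes(last_seen, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the nodes whose last heartbeat is older than the timeout."""
    return sorted(dn for dn, t in last_seen.items() if now - t > timeout)

print(dead_nodes(last_heartbeat, now=115.0))  # ['dn3']
```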
Replication facts..
HDFS is designed to reliably store very large files across machines in a large cluster.
It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size.
Files in HDFS are write-once and have strictly one writer at any time.
In most cases, network bandwidth between machines in the same rack is greater than network
bandwidth between machines in different racks.
The NameNode determines the rack ID each DataNode belongs to via a process called Hadoop
Rack Awareness.
The default placement policy puts the first replica on the writer's node, the second on a node in a
different (remote) rack, and the third on a different node in that same remote rack.
This policy cuts the inter-rack write traffic, which generally improves write performance.
The chance of rack failure is far less than that of node failure - this policy does not impact data
reliability and availability guarantees.
However, it does reduce the aggregate network bandwidth used when reading data since a block is
placed in only two unique racks rather than three.
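The rack-aware placement above can be sketched directly; a simplified illustration of the default policy (the real `BlockPlacementPolicyDefault` also weighs node load and free space, which this ignores):

```python
# Sketch: first replica on the writer's node, second on a node in a
# different rack, third on another node in that same remote rack -
# so each block lands in exactly two racks, not three.
import random

def place_replicas(writer, topology, rng=random):
    """topology: {rack: [nodes]}; writer must appear in topology."""
    writer_rack = next(r for r, ns in topology.items() if writer in ns)
    remote_rack = rng.choice([r for r in topology if r != writer_rack])
    second = rng.choice(topology[remote_rack])
    third = rng.choice([n for n in topology[remote_rack] if n != second])
    return [writer, second, third]

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
print(place_replicas("n1", topology))
```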
Hands on - HDFS Sample Commands practice
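A minimal set of commands to practice on the cluster VM; the paths and file names are examples, not from the course material (`hdfs dfs` is the modern form of the older `hadoop fs`):

```shell
hdfs dfs -ls /                                   # list the root of HDFS
hdfs dfs -mkdir -p /user/train/input             # create a directory tree
hdfs dfs -put localfile.txt /user/train/input    # copy from the local FS into HDFS
hdfs dfs -cat /user/train/input/localfile.txt    # print a file's contents
hdfs dfs -get /user/train/input/localfile.txt .  # copy back to the local FS
hdfs dfs -rm -r /user/train/input                # remove a directory recursively
hdfs fsck / -files -blocks                       # report block-level health
```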
HDFS Read Write – In Detail..
Hands on - HDFS Java Programs for Read and Write..
[Figure: MapReduce word-count flow - mappers emit (word, 1) pairs such as (Cat, 1), (Dog, 1), (Bird, 1); the shuffle groups pairs by word; reducers sum each group into totals such as (Cat, 3) and (Bird, 2).]
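The map, shuffle, and reduce stages of the word count can be mirrored in plain Python; a sketch with an invented input (this is not Hadoop code, just the same dataflow):

```python
# Word count, MapReduce-style: map emits (word, 1), shuffle groups
# by key, reduce sums each group.
from collections import defaultdict

lines = ["Cat Dog Cat", "Mouse Cat", "Bird Bird", "Bird"]

# Map: emit a (word, 1) pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the emitted values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum each group into a total
counts = {word: sum(vals) for word, vals in groups.items()}
print(sorted(counts.items()))
# [('Bird', 3), ('Cat', 3), ('Dog', 1), ('Mouse', 1)]
```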
Patrick Meier is an internationally recognized expert and consultant on humanitarian technology and innovation. His book,
Digital Humanitarians, has been praised by Harvard, MIT, Stanford, Oxford, the UN, the Red Cross, the World Bank, USAID, and
others.
https://irevolutions.org/bio/
Deb Roy is a tenured professor at MIT and served as Chief Media Scientist of Twitter from 2013-2017.
A native of Winnipeg, Manitoba, Canada, Roy received his PhD in Media Arts and Sciences from MIT.
MIT researcher Deb Roy wanted to understand how his infant son learned language -- so he wired up his house with video
cameras to catch every moment (with exceptions) of his son's life, then parsed 90,000 hours of home video to watch "gaaaa"
slowly turn into "water."
https://dkroy.media.mit.edu/
https://www.ted.com/talks/deb_roy_the_birth_of_a_word?language=en
Pre-requisites
Background in a programming language; core Java is preferred, as the examples (Spark) will be in Java.
A brief knowledge of cloud computing; some handouts can be given before the course.
Astronomy
LSST - The Large Synoptic Survey Telescope (LSST) is a wide-field survey reflecting telescope with an 8.4-
meter primary mirror
PAN-STARRS - The Panoramic Survey Telescope and Rapid Response System (Pan-STARRS; code: F51
and F52) located at Haleakala Observatory, Hawaii, consists of astronomical cameras, telescopes and a
computing facility that is surveying the sky for moving objects on a continual basis, including accurate
astrometry and photometry of already detected objects
SDSS - The Sloan Digital Sky Survey or SDSS is a major multi-filter imaging and spectroscopic redshift
survey using a dedicated 2.5-m wide-angle optical telescope at Apache Point Observatory in New Mexico,
United States. The project was named after the Alfred P. Sloan Foundation, which contributed significant
funding.
N-body simulations - In physics and astronomy, an N-body simulation is a simulation of a dynamical system of
particles, usually under the influence of physical forces, such as gravity (see n-body problem). N-body
simulations are widely used tools in astrophysics, from investigating the dynamics of few-body systems like
the Earth-Moon-Sun system to understanding the evolution of the large-scale structure of the universe. In
physical cosmology, N-body simulations are used to study processes of non-linear structure formation such
as galaxy filaments and galaxy halos from the influence of dark matter. Direct N-body simulations are used
to study the dynamical evolution of star clusters.
Reference Terminology
Ocean Sciences
Terms
Fertility Rate - Number of live births per 1000 women between the ages of 15 and 44 years
Life Expectancy - Life expectancy equals the average number of years a person born in a given country is
expected to live if mortality rates at each age were to remain steady in the future
GDP Per Capita - GDP per capita is a measure of a country's economic output that accounts for population.
It divides the country's gross domestic product by its total population. That makes it the best measurement
of a country's standard of living: it tells you how prosperous a country feels to each of its citizens.
Why the Largest Economies Aren't the Richest per Capita - GDP per capita allows you to compare the
prosperity of countries with different population sizes. For example, U.S. GDP was $18.56 trillion in 2016.
But one reason that total is so large is that the U.S. has so many people; it's the third most populous
country after China and India.
The United States must spread its wealth among 324 million people. As a result, its GDP per capita is only
$57,300. That makes it the 18th most prosperous country per person.
China has the largest GDP in the world on a purchasing-power-parity basis, producing $21.2 trillion in 2016. But its GDP
per capita was only $15,400, because it has four times as many people as the United States.
Big Data – Why NoSQL?
Why? Types
A video showing the athletic power of quadcopters and the data they generate for processing.
History of Spark
Why Spark ?
Why Spark – More Speed Samples?
[Chart: running time (s) vs. number of iterations for Hadoop and Spark on an iterative job - Hadoop takes ~110 sec per iteration; Spark takes ~80 sec for the first iteration and ~1 sec for further iterations.]
Spark emerged as the fastest open-source solution to sort 100 TB of data in the Daytona GraySort Benchmark (http://sortbenchmark.org/).
Overview of Spark
An emerging open-source big data cluster computing framework, an alternative to MapReduce, and a top-priority project in
Apache.
Reliable, scalable, fast, parallel, in-memory cluster computing.
SQL, streaming, and complex analytics in one framework.
[Diagram: applications (App 1 … App N) running on Spark's components - Spark SQL, Spark Streaming, MLlib, and GraphX.]
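The iteration-time gap between Hadoop and Spark comes largely from where the working set lives between iterations; a plain-Python sketch of the effect (not Spark code - re-parsing the raw input each pass stands in for MapReduce re-reading from disk, while parsing once and reusing the list stands in for Spark caching an RDD in memory):

```python
# Compare re-parsing the input every iteration vs. parsing once and
# caching the result, counting how many times the input is parsed.
raw = "\n".join(str(i) for i in range(1000))

def iterate_without_cache(iterations):
    parses = 0
    for _ in range(iterations):
        data = [int(x) for x in raw.split("\n")]  # re-parse every iteration
        parses += 1
        total = sum(data)
    return parses, total

def iterate_with_cache(iterations):
    data = [int(x) for x in raw.split("\n")]      # parse once, keep in memory
    parses = 1
    for _ in range(iterations):
        total = sum(data)
    return parses, total

print(iterate_without_cache(10)[0], iterate_with_cache(10)[0])  # 10 1
```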