Intro Apache Spark Jun5 V4


Think Scalable

Think Parallel
Let’s know our “Limit”s
• Does a “limit” mean there is a bound? Yes.
• What are the limits of your computer or server?
• They lie in its capacity to store data and to process data (CPU, RAM).
• Every machine has a limit on how much data it can store and process.
• Can I store or process data that is beyond the limits of one machine?
• Two ways to push the limit:
• Scale-up: expensive, and there is still a boundary. Scale-out: inexpensive, and no boundary (?).
Which is the scalable solution?
• Scale-up or scale-out?
• Scale-up is always limited to one machine’s capacity.
• A scale-out system’s overall capacity is the collective capacity of its individual machines.
• Scale-out is therefore the convenient scalable solution.
• A scale-out system is usually called a “cluster”.
• When multiple networked systems work together as one system, the result is called a “cluster” system.
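The capacity argument above can be checked with a little arithmetic. This is only a sketch: the node names and the 500GB figures are hypothetical, and it ignores replication and other overhead that a real cluster would add.

```python
# Hypothetical per-node storage capacities in GB (illustrative, not from the slides).
node_capacity_gb = {"NODE1": 500, "NODE2": 500, "NODE3": 500}


def cluster_capacity(nodes):
    """Collective capacity of a scale-out system: the sum over its machines."""
    return sum(nodes.values())


def fits(dataset_gb, capacity_gb):
    """True if a dataset of the given size fits within the given capacity."""
    return dataset_gb <= capacity_gb


single = max(node_capacity_gb.values())     # the best a single machine can offer
total = cluster_capacity(node_capacity_gb)  # what the machines offer together

print("fits on one machine:", fits(1200, single))
print("fits on the cluster:", fits(1200, total))
```

Adding a fourth node raises `total` without touching the existing machines, which is the sense in which scale-out pushes the limit.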
Can we “think” / “work” in parallel?
• How does a cluster system facilitate parallel work?
• When you place data in a cluster, the data is spread across multiple machines.
• A complete copy of your computation is sent to each targeted block of data.
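The two points above can be imitated on a single machine. The sketch below is plain Python, not Spark: threads stand in for nodes, and the helper names `split_into_blocks` and `count_words` are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(lines, num_blocks):
    """Spread the lines across num_blocks blocks, round-robin."""
    return [lines[i::num_blocks] for i in range(num_blocks)]


def count_words(block):
    """The computation. The same copy runs against every block."""
    counts = Counter()
    for line in block:
        counts.update(line.split())
    return counts


lines = ["spark is fast", "spark scales out", "a cluster is many machines"]
blocks = split_into_blocks(lines, 3)

# Each worker thread plays the role of one node working on its own block.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_words, blocks))

for block, counts in zip(blocks, partial_counts):
    print(block, dict(counts))
```

Each worker sees only its own block, so the results are partial counts. Combining them is a separate step.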

[Figure: A 50GB input for a program that finds the frequency of words is split into blocks of 5GB, 20GB, 8GB and 17GB, spread across the cluster nodes NODE1 to NODE5.]


[Figure: The program (which contains the computation steps) is sent to the data blocks (5GB, 20GB, 8GB, 5GB) on NODE1 to NODE5.]


[Figure: Each node produces an intermediate output (Int.OutPut) from its data block (5GB, 20GB, 8GB, 5GB on NODE1 to NODE5); the intermediate outputs lead to the final output.]

Intermediate outputs are written to the local system disk; this is called the map phase. A reduce phase may follow if required, and the final output is then written to the user-specified path.
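The map and reduce phases described above can be sketched in plain Python. This is an illustration under stated assumptions, not how any particular framework implements it: the function names, the JSON file format, and the sample text are all hypothetical, and a temporary directory stands in for both the local disks and the user-specified path.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def map_phase(block, local_dir, node_name):
    """Count words in one block and write the intermediate output to local disk."""
    counts = Counter(word for line in block for word in line.split())
    out = Path(local_dir) / f"{node_name}.json"
    out.write_text(json.dumps(counts))
    return out


def reduce_phase(intermediate_files, output_path):
    """Merge all intermediate outputs and write the final output."""
    total = Counter()
    for f in intermediate_files:
        total.update(json.loads(Path(f).read_text()))
    Path(output_path).write_text(json.dumps(total, sort_keys=True))
    return total


# Two hypothetical nodes, each holding one block of the input.
blocks = {"NODE1": ["to be or not"], "NODE2": ["to be"]}

with tempfile.TemporaryDirectory() as tmp:
    files = [map_phase(block, tmp, node) for node, block in blocks.items()]
    final = reduce_phase(files, Path(tmp) / "final_output.json")

print(dict(final))
```

Word counting needs the reduce step because the same word can appear on several nodes. A job whose per-block results are already final could stop after the map phase.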
