Intro Apache Spark Jun5 V4
Think Parallel
Let’s know our “Limits”
Limit: does it mean there is a “bound”? Yes, agreed?
What is the “limit”(ation) of your computer or server?
In terms of its capacity to store data and to process data (CPU, RAM)
Every machine has a “limit” on how much data it can store and process
Can I store or process data that is beyond the limits of one machine?
Solutions to push the limit:
(scale-up -> a bigger machine: expensive, and there is still a boundary), (scale-out -> more machines: inexpensive, and no boundary (?))
Which is the scalable solution?
Scale-up or scale-out?
Scale-up is always limited to one machine’s capacity
A scale-out system’s overall capacity is the collective capacity of its individual machines
Scale-out is therefore the convenient, scalable solution
A scale-out system is usually called a “cluster”
If multiple networked systems work together as one system, it is called a “cluster” system
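A small arithmetic sketch of the scale-up versus scale-out point above. The node capacities and the dataset size are hypothetical numbers chosen for illustration; they do not come from the slides.

```python
# Hypothetical capacities in GB (illustrative numbers, not from the slides).
node_capacity_gb = [20, 20, 20, 20, 20]  # five identical machines
dataset_gb = 50

# Scale-up view: the dataset must fit on a single machine.
fits_on_one_machine = dataset_gb <= max(node_capacity_gb)

# Scale-out view: the cluster's capacity is the sum of its machines.
cluster_capacity_gb = sum(node_capacity_gb)
fits_in_cluster = dataset_gb <= cluster_capacity_gb

print(fits_on_one_machine)  # False
print(cluster_capacity_gb)  # 100
print(fits_in_cluster)      # True
```

Adding a sixth machine to the list raises the cluster capacity again, which is what "no boundary (?)" hints at; in practice, network and coordination costs still apply.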
Can we “think” / “work” in parallel?
How does a cluster system facilitate parallel work?
When you place data in a cluster, the data is spread across multiple machines
A full copy of your computation is sent to each targeted block of data
[Figure: data blocks of 5GB, 20GB, 8GB and 17GB placed on cluster nodes (labels shown: NODE1, NODE5)]
[Figure: data blocks of 5GB, 20GB, 8GB and 5GB on cluster nodes (labels shown: NODE1, NODE5)]
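The idea of spreading data over machines and sending the same computation to every block can be sketched on a single machine. This is only an analogy, not how Spark or Hadoop is implemented: threads stand in for cluster nodes, and `split_into_blocks` and `count_words` are hypothetical helper names introduced here.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, num_nodes):
    """Spread records round-robin over num_nodes blocks (one block per 'node')."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def count_words(block):
    """The computation shipped to every block: count the words in its lines."""
    return sum(len(line.split()) for line in block)


lines = [
    "think parallel",
    "scale out not up",
    "a cluster works as one system",
    "spark",
]
blocks = split_into_blocks(lines, num_nodes=2)

# Each worker thread plays the role of a node and runs the same function
# on its own block of data.
with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
    partial_counts = list(pool.map(count_words, blocks))

total = sum(partial_counts)
print(partial_counts)  # [8, 5]
print(total)           # 13
```

The function travels to the data rather than the data to the function; only the small per-block results are brought back and combined.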
(Intermediate outputs are written to the local system disk; this is called the map phase. A reduce phase may happen if required, and then the final output is written to the user-specified path.)
[Figure: each data block (5GB, 20GB, 8GB, 5GB) on the cluster nodes (labels shown: NODE1, NODE5) produces its own intermediate output (Int.OutPut)]
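The map and reduce phases described above can be sketched as follows. This is a single-machine illustration under stated assumptions: sub-directories of a temporary folder stand in for each node's local disk, the file names `intermediate.json` and `final_output.json` are made up, and `map_phase` and `reduce_phase` are hypothetical helpers, not a Hadoop or Spark API.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def map_phase(block, node_dir):
    """Count words in one block and write the intermediate output to the node's local disk."""
    counts = Counter(word for line in block for word in line.split())
    out_file = node_dir / "intermediate.json"
    out_file.write_text(json.dumps(counts))
    return out_file


def reduce_phase(intermediate_files, output_path):
    """Merge all intermediate outputs and write the final result to output_path."""
    total = Counter()
    for path in intermediate_files:
        total.update(json.loads(path.read_text()))
    output_path.write_text(json.dumps(total, sort_keys=True))
    return total


# Hypothetical blocks of text, keyed by the node that holds them.
blocks = {
    "NODE1": ["spark is fast", "spark scales"],
    "NODE5": ["hadoop scales", "spark"],
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    intermediate = []
    for node, block in blocks.items():
        node_dir = root / node  # stands in for that node's local disk
        node_dir.mkdir()
        intermediate.append(map_phase(block, node_dir))
    # The "user specified path" is a file inside the temporary folder here.
    final_counts = reduce_phase(intermediate, root / "final_output.json")

print(final_counts["spark"])   # 3
print(final_counts["scales"])  # 2
```

Each node writes only its own partial result in the map phase; the reduce phase is the only step that needs to see every intermediate output.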