What are basic characteristics of Data and how is Parallel processing system different from distributed system?
The main difference between parallel and distributed computing is that parallel computing lets multiple processors execute tasks simultaneously within a single machine, while distributed computing divides a single task among multiple computers that cooperate to achieve a common goal.
Parallel processing takes place in a single computer, whereas distributed processing takes place across different systems.
In a parallel system, processors communicate through buses, whereas in a distributed system communication takes place over a network.
In a parallel system the processors can share memory, whereas in a distributed system each computer has its own memory.
5. Explore and list few use cases of big data Analytics in the following domains
Healthcare
Retail
Telecom
Entertainment
1) Healthcare:-
Hospital quality and patient safety in the ICU.
Real-Time Alerting.
Enhancing Patient Engagement.
Precision medicine, personalized care, and genomics.
Population health management, risk stratification, and prevention.
Big data might just cure cancer.
2) Retail:-
Up-Sell/Cross-Sell Recommendations.
Fraud Detection.
Personalizing customer experience.
Forecasting demand in Retail.
Customer journey analytics.
3) Telecom:-
Improved Network Security
Better Customer Service
Contextualized Location-Based Promotions
Predictive Maintenance
Targeted Campaigns
Real-Time Network Analytics
4) Entertainment:-
Predicting what your audience wants.
Insights into customer churn.
Optimized scheduling of media streams.
Content monetization.
Effective Ad Targeting.
8.How is a file stored in HDFS to support parallel processing as well as support fault
tolerance?
HDFS provides fault tolerance by replicating the data blocks and distributing them among different DataNodes across the cluster. By default the replication factor is set to 3, and it is configurable.
So, if I store a file of 1 GB in HDFS with the replication factor at its default of 3, it will finally occupy a total space of 3 GB because of the replication. Now, even if one DataNode fails, the data can still be retrieved from the replicas stored on other DataNodes. Because the blocks of a file are spread across several DataNodes, different blocks can also be read and processed in parallel, which is what supports parallel processing.
9.What are the HDFS Daemons and what are their responsibilities?
Daemons are the processes that run in the background. There are mainly 4 daemons
which run for Hadoop.
NameNode – runs on the master node for HDFS; it maintains the file system namespace and the metadata recording which DataNodes hold each block.
DataNode – runs on the slave nodes for HDFS; it stores the actual data blocks and serves read/write requests.
ResourceManager – runs on the master node for YARN; it allocates cluster resources among applications.
NodeManager – runs on the slave nodes for YARN; it launches and monitors containers on its node.
10.What assumptions were made in the design of HDFS? Do these assumptions make
sense? Discuss why?
1. Large datasets: The architecture is designed to be the best fit for large amounts of data.
2. Write once, read many: It assumes that a file in HDFS, once written, will not be modified, though it can be accessed any number of times. This assumption helps ensure high throughput of data access.
3. Commodity hardware: HDFS assumes that the cluster(s) will run on commodity hardware, that is, inexpensive, ordinary machines. This reduces the overall cost to a great extent.
4. Data replication and fault tolerance: HDFS works on the assumption that hardware is bound to fail at some point in time. To overcome such failures, each block is stored on three nodes (replication factor 3 by default): two on the same rack and one on a different rack. This redundancy enables robustness, fault detection and quick recovery.
5. Moving code to the data, rather than data to the code: This increases overall efficiency, since it is much better to perform the computation near the data and send back the results than to send the data itself. It also reduces network congestion, because large amounts of data are not transferred.
MapReduce serves two essential functions: it filters and parcels out work to the various nodes within the cluster (the map function, executed by the mapper), and it organizes and reduces the results from each node into a cohesive answer to a query (the reduce function, executed by the reducer).
How MapReduce works
JobTracker -- the master node that manages all the jobs and resources in a cluster;
TaskTrackers -- agents deployed to each machine in the cluster to run the map and reduce
tasks; and
JobHistory Server -- a component that tracks completed jobs and is typically deployed as
a separate function or with JobTracker.
12.What are the two main phases of the Map Reduce programming ? Explain through an
example.
The two main phases are the Map phase and the Reduce phase.
The Mapper phase takes the input as (K, V) pairs and emits intermediate (K, V) pairs. The framework's sort-and-shuffle step then groups these by key into Key and List-of-Values pairs (K, List(v)).
The Reducer phase takes (K, List(v)) as input and generates the output as (K, V). The Reducer phase output is the final output.
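For example, word count, the classic MapReduce illustration. The sketch below mimics the phases with plain Scala collections (it is not the Hadoop API, just an illustration of the data flow):

```scala
// Input: each element stands for one line of a file.
val input = Seq("deer bear river", "car car river", "deer car bear")

// Map phase: each line is split into (word, 1) pairs.
val mapped: Seq[(String, Int)] =
  input.flatMap(_.split(" ").map(w => (w, 1)))

// Sort-and-shuffle: pairs are grouped by key into (K, List(v)).
val shuffled: Map[String, Seq[Int]] =
  mapped.groupBy(_._1).map { case (k, pairs) => (k, pairs.map(_._2)) }

// Reduce phase: the list of values for each key is summed into the final (K, V).
val counts: Map[String, Int] =
  shuffled.map { case (k, vs) => (k, vs.sum) }
// counts contains car -> 3, deer -> 2, bear -> 2, river -> 2
```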
Hadoop YARN is the resource management and job scheduling technology in the open
source Hadoop distributed processing framework. One of Apache Hadoop's core components,
YARN is responsible for allocating system resources to the various applications running in
a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.
In a cluster architecture, Apache Hadoop YARN sits between HDFS and the processing engines
being used to run applications. It combines a central resource manager with containers,
application coordinators and node-level agents that monitor processing operations in individual
cluster nodes. YARN can dynamically allocate resources to applications as needed, a capability
designed to improve resource utilization and application performance compared with
MapReduce's more static allocation approach.
In addition, YARN supports multiple scheduling methods, all based on a queue format for
submitting processing jobs. The default FIFO Scheduler runs applications on a first-in-first-out
basis, as reflected in its name. However, that may not be optimal for clusters that are shared by
multiple users. Apache Hadoop's pluggable Fair Scheduler tool instead assigns each job running
at the same time its "fair share" of cluster resources, based on a weighting metric that the
scheduler calculates.
Hadoop YARN also includes a Reservation System feature that lets users reserve cluster
resources in advance for important processing jobs to ensure they run smoothly. To avoid
overloading a cluster with reservations, IT managers can limit the amount of resources that can
be reserved by individual users and set automated policies to reject reservation requests that
exceed the limits.
14.What are the YARN daemons and what are their functions ?
Resource Manager:- Runs on the master node and manages the resource allocation in the cluster.
Node Manager:- They run on the slave nodes and are responsible for the execution of tasks on every single DataNode.
Application Master:- Manages the user job life cycle and the resource needs of an individual application. It works along with the Node Manager and monitors the execution of tasks.
15.Explain how Map Reduce, YARN and HDFS work together. Diagrams are not
necessary for the explanation, but the sequence of handshakes between the components
should be clearly explained.
To process any data, the client submits the data and the program to Hadoop. HDFS stores the data, MapReduce processes the data, and YARN divides and schedules the tasks.
• NameNode is the daemon running on the master machine. It stores the directory tree of all files in the file system and tracks where across the cluster the file data resides.
• The DataNode daemon runs on the slave nodes. It stores data in the Hadoop file system; in a functional file system, data is replicated across many DataNodes.
➢ MapReduce: The general idea of the MapReduce algorithm is to process the data in parallel
on your distributed cluster. It subsequently combines it into the desired result or output.
• Map: Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
• Reduce: which takes the output from a map as an input and combines those data tuples into a
smaller set of tuples.
➢ YARN: Yarn divides the task on resource management and job scheduling/monitoring into
separate daemons. Yarn supports the concept of Resource Reservation via Reservation System.
In this, a user can fix several resources for execution of a job over time and temporal constraints.
➢ In summary: input data is broken into blocks of size 128 MB and the blocks are moved to different nodes. Once all the blocks of the data are stored on DataNodes, the user can process the data. The ResourceManager then schedules the program (submitted by the user) on individual nodes. Once all the nodes have processed the data, the output is written back to HDFS.
16.What shortcomings of Hadoop motivated the development of Spark. How were the
shortcomings addressed by Spark ?
Low processing speed: MapReduce is well suited only to large batch workloads, so other workloads run slowly.
Latency: In Hadoop, the MapReduce framework is slower, since it supports different formats, structures, and huge volumes of data.
Lengthy code: Since Hadoop programs are written in Java, the code is lengthy, and this takes more time to write and execute.
These shortcomings are addressed by Spark as follows.
In-memory processing: In-memory processing is faster than Hadoop's approach, as no time is spent moving data and processes in and out of the disk.
Stream processing: Apache Spark supports stream processing, which involves continuous input and output of data. Stream processing is also called real-time processing.
Lower latency: Apache Spark is considerably faster than Hadoop, since it caches most of the input data in memory.
Lazy evaluation: Apache Spark starts evaluating only when it is absolutely needed. This plays an important role in contributing to its speed.
Fewer lines of code: Although Spark exposes both Scala and Java APIs, the implementation is in Scala, so the number of lines is relatively smaller in Spark compared with Hadoop.
17.Is Spark completely different from Hadoop ? If not, what is same between the two and
what is different ?
Spark can run on top of Hadoop and provides better computational speed; at the same time, there are various similarities between the two.
• Recovery: RDDs allow recovery of partitions on failed nodes in Spark by recomputing the DAG (directed acyclic graph), while Spark also supports a recovery style more similar to Hadoop's by way of checkpointing.
• Fault tolerance: Both have countermeasures for handling faults, so there is no need to restart the system.
• Speed: Since Spark works in memory, it runs much faster than Hadoop MapReduce.
• Real-time analysis: Spark enables real-time analysis of the data, while Hadoop MapReduce does not.
Like Hadoop, Spark is Apache open source, so there is no license cost.
Hardware cost is higher than for MapReduce: even though Spark can work on commodity hardware, it needs much more memory (RAM) than MapReduce, since for optimal performance it should be able to fit all the data in memory. The cluster therefore needs fairly high-end commodity hardware with lots of RAM, or performance suffers.
Machine Learning: Spark comes with an integrated framework for performing advanced
analytics that helps users run repeated queries on sets of data which essentially amounts to
processing machine learning algorithms.
Interactive analysis: Apache Spark is fast enough to perform exploratory queries without sampling. Spark also interfaces with a number of development languages, including SQL, R, and Python.
20.In the Scala REPL, type “3.” and then hit the TAB key. What do you see ? Note: do not
ignore the “.” after the 3.
It displays all the operations that can be performed on the Int, such as !=, +, <<, >>, abs, compareTo, getClass, isNaN, isWhole, round, toInt, %, -, ^, ==, /, *, <, > and many more.
21.Do the same as above by typing “Hello.” followed by the TAB key. Note: do not ignore
the “.” after “Hello”. Repeat by typing “Hello.s” and then applying the TAB key.
=> "Hello". followed by the TAB key displays all the operations that can be performed on Strings, such as *, ++, capitalize, contentEquals, flatten, charAt, lengthCompare, map, isBlank, max, min, exists, equals and many more.
"Hello".s followed by the TAB key displays all the operations on Strings that start with the letter "s", such as sameElements, scanLeft, seq, slice, sortBy, split, startsWith, substring, splitAt and many more.
22.In the Scala REPL, compute the square root of 3, and then square that value. What do
you observe and how can you explain your observation ?
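A REPL session (the exact trailing digits may vary slightly by platform, but the effect is the same):

```scala
val r = math.sqrt(3)  // 1.7320508075688772
val sq = r * r        // 2.9999999999999996, not exactly 3
```

Observation: squaring the square root of 3 does not give back exactly 3. Double values have finite binary precision, and sqrt(3) is irrational, so the stored root is already rounded; squaring that rounded value yields a result a tiny fraction below 3.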
23.Scala lets you multiply a string with a number. Try out "crazy" * 3 in the REPL. What does this operation do?
scala> "crazy" * 3
res0: String = crazycrazycrazy
The * operator repeats the string: "crazy" * 3 concatenates three copies of "crazy".
24.How do you get the first character of a string in Scala? The last character?
(Assuming some string s whose first character is 'W' and last character is 'e', e.g. val s = "White".)
scala> s.head
res0: Char = W
scala> s(0)
res1: Char = W
scala> s.last
res2: Char = e
scala> s(s.length - 1)
res3: Char = e
Lazy initialization is a technique that defers the creation of an object until the first time it is
needed. In other words, initialization of the object happens only on demand. We can improve the
application’s performance by avoiding unnecessary computation and memory consumption.
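A minimal sketch of the behaviour; the flag is only there to make the deferred evaluation visible:

```scala
var initialized = false

// The right-hand side does not run at declaration time...
lazy val words = { initialized = true; "expensive result" }

// ...so at this point `initialized` is still false.
val w = words // first access triggers the initialization
// `initialized` is now true; later accesses reuse the cached value.
```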
26.Write a Scala equivalent for the Java loop for (int i = 10; i >= 0; i--)
System.out.println(i);
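One equivalent, using a range with a negative step:

```scala
// Scala equivalent of: for (int i = 10; i >= 0; i--) System.out.println(i);
for (i <- 10 to 0 by -1)
  println(i)
```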
27. Write a procedure countdown(n: Int) that prints the numbers from n to 0.
=> Two equivalent definitions:
scala> def countdown(n: Int) {
     |   for (i <- (0 to n).reverse)
     |     println(i)
     | }
countdown: (n: Int)Unit
OR
scala> def countdown(n: Int) {
     |   for (i <- n to (0, -1))
     |     println(i)
     | }
countdown: (n: Int)Unit
scala> countdown(4)
4
3
2
1
0
scala> countdown(5)
5
4
3
2
1
0
28.Write a for loop for computing the product of the Unicode codes of all letters in a string.
For example, the product of the characters in "Hello" is 825152896.
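A sketch; note that with Int arithmetic the product of "Hello" overflows and wraps around to exactly the 825152896 quoted in the exercise, while Long arithmetic gives the true product:

```scala
// Product of the Unicode codes of all letters in "Hello", with Int overflow:
var product = 1
for (c <- "Hello") product *= c
// product == 825152896 (wrapped around)

// With Long, no overflow occurs:
var productL = 1L
for (c <- "Hello") productL *= c
// productL == 9415087488
```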
29.Write a function that computes x^n, where n is an integer. Use the following recursive
definition:
x^0 = 1.
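The printed definition is cut off after the base case; the standard full definition for this exercise is: x^n = y*y where y = x^(n/2) if n is even and positive; x * x^(n-1) if n is odd and positive; 1 if n = 0; and 1/x^(-n) if n is negative. A sketch:

```scala
// x^n by the recursive definition above.
def pow(x: Double, n: Int): Double =
  if (n == 0) 1
  else if (n < 0) 1 / pow(x, -n)
  else if (n % 2 == 0) { val y = pow(x, n / 2); y * y }
  else x * pow(x, n - 1)

pow(2, 10)  // 1024.0
pow(2, -2)  // 0.25
```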
31.Set up a map of gadgets that you want, along with their prices. Create a second map of
the gadgets with a 10% discount. Print the second map and inspect its contents.
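A sketch; the gadget names and prices are made-up sample values:

```scala
// Gadgets I want, with prices.
val gadgets = Map("iPhone" -> 800.0, "Kindle" -> 120.0, "Drone" -> 300.0)

// Second map: the same gadgets with a 10% discount.
val discounted = gadgets.map { case (name, price) => (name, price * 0.9) }

println(discounted)
// contains iPhone -> 720.0, Kindle -> 108.0, Drone -> 270.0
```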
32.Use a mutable map to count the number of times each word appears in a sentence
(provided as a string).
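A sketch using a mutable map; words are split on whitespace, and punctuation handling is omitted:

```scala
import scala.collection.mutable

// Count how many times each word appears in a sentence.
def wordCount(sentence: String): mutable.Map[String, Int] = {
  val counts = mutable.Map[String, Int]()
  for (word <- sentence.split("\\s+"))
    counts(word) = counts.getOrElse(word, 0) + 1
  counts
}

println(wordCount("to be or not to be"))
// contains to -> 2, be -> 2, or -> 1, not -> 1
```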
34. In the scala console, type “Hello”.zip(“World”). What does it do ? Give a scenario
where you
scala> "Hello".zip("World")
res38: scala.collection.immutable.IndexedSeq[(Char, Char)] = Vector((H,W), (e,o), (l,r), (l,l),
(o,d))
It pairs up corresponding characters of the two strings into tuples, stopping at the end of the shorter string. A scenario where this is useful: zipping a sequence of keys with a sequence of values and calling .toMap to build a Map.
Comprehension is the ability to understand something after processing text and grasping its meaning. Scala offers a lightweight notation for expressing sequence comprehensions. Comprehensions have the form for (enums) yield e, where enums refers to a semicolon-separated list of enumerators. An enumerator is either a generator, which introduces new variables, or a filter. It is similar to comprehension in that it filters to obtain certain data.
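For example, a comprehension with one generator and one filter:

```scala
// Generator: i <- 1 to 10; filter: if i % 2 == 0; yield builds the result.
val evensSquared = for (i <- 1 to 10 if i % 2 == 0) yield i * i
// Vector(4, 16, 36, 64, 100)
```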
36 .When does it make sense to use iterators ? Why is map / foreach usually a better option
?
It makes sense to use iterators when the collection is too big to fit completely in
memory, for example when processing large files.
map/foreach is usually a better option otherwise, because these methods visit the items one at a time for you, without the error-prone manual hasNext/next state that an explicit iterator requires.
37.Given an array of strings, find the sum of the length of all the strings.
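A sketch:

```scala
// Sum of the lengths of all strings in an array.
def totalLength(strings: Array[String]): Int = strings.map(_.length).sum

totalLength(Array("big", "data", "analytics"))  // 3 + 4 + 9 = 16
```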
38.Write a function that given a string, produces a map of the indexes of all the characters
as a list. For example, indexes(“Mississippi”) should return a map associating ‘M’ with the
List(0), ‘i’ with the List(1, 4, 7, 10) and so on. Use a mutable map of characters and
ListBuffer’s which are mutable, in place of List’s which are not mutable.
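A sketch:

```scala
import scala.collection.mutable
import scala.collection.mutable.ListBuffer

// Map each character to the ListBuffer of all indexes at which it occurs.
def indexes(s: String): mutable.Map[Char, ListBuffer[Int]] = {
  val result = mutable.Map[Char, ListBuffer[Int]]()
  for (i <- s.indices)
    result.getOrElseUpdate(s(i), ListBuffer[Int]()) += i
  result
}

println(indexes("Mississippi"))
// M -> ListBuffer(0), i -> ListBuffer(1, 4, 7, 10),
// s -> ListBuffer(2, 3, 5, 6), p -> ListBuffer(8, 9)
```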
39.Write a class Time with read only properties hours and minutes and a method
before(other: Time): Boolean that checks whether this time comes before the other. A
Time object should be constructed as new Time(hrs, min), where hrs is in 24 hour format
(0 to 23).
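A sketch:

```scala
// Time with read-only hours and minutes (val parameters generate getters only).
class Time(val hours: Int, val minutes: Int) {
  require(hours >= 0 && hours <= 23 && minutes >= 0 && minutes <= 59)

  // True if this time comes strictly before the other.
  def before(other: Time): Boolean =
    hours * 60 + minutes < other.hours * 60 + other.minutes
}

val t1 = new Time(9, 30)
val t2 = new Time(17, 5)
println(t1.before(t2))  // true
```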
scala> Conversions.inchesToCms(5)
res5: Double = 12.7
scala> Conversions.milesToKms(10)
res6: Double = 16.0934
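The definition of Conversions is not shown in these notes; an object consistent with the transcript above would be:

```scala
// Unit-conversion helpers (method names taken from the transcript above).
object Conversions {
  def inchesToCms(inches: Double): Double = inches * 2.54
  def milesToKms(miles: Double): Double = miles * 1.60934
}
```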
Getters are a technique through which we get the value of the variables of a class. There are two ways:
Getting the value of a public variable directly, by specifying the name of the variable on the object.
Getting the value of a variable through a method call on the object. This technique is used when the class variables are not directly accessible but public methods are available.
Example:
// Class with a private field plus getter/setter methods
class Student {
  var student_name = ""
  var student_age = 0
  private var student_rollno = 0
  // Class methods
  def set_rollno(x: Int) {
    student_rollno = x
  }
  // Getter
  def get_rollno(): Int = {
    return student_rollno
  }
}
// Creating object
object Main
{
  // Main method
  def main(args: Array[String])
  {
    // Class object
    var obj = new Student()
    obj.student_name = "Yash"
    obj.student_age = 22
    obj.set_rollno(59)
    // Read the private field through the getter
    println("Student Rollno: " + obj.get_rollno())
  }
}
Setters
Setters are a technique through which we set the value of variables of a class.
Setting a variable of a class is simple and can be done in two ways:
First, if the members of a class are accessible from anywhere, i.e. no access modifier is specified.
Example:
// Class with public members
class Student {
  var student_name = ""
  var student_age = 0
  var student_rollno = 0
}
// Creating object
object Main
{
  // Main method
  def main(args: Array[String])
  {
    // Class object
    var obj = new Student()
    obj.student_name = "Yash"
    obj.student_age = 22
    obj.student_rollno = 59
    println("Student Name: " + obj.student_name)
    println("Student Age: " + obj.student_age)
    println("Student Rollno: " + obj.student_rollno)
  }
}
Output:
Student Name: Yash
Student Age: 22
Student Rollno: 59
For security reasons this is not recommended: accessing the members of a class directly is not a good way to initialize and change values, since it exposes the variables to everyone.
Second, if the members of a class are defined as private, initialization of the variables is done by passing the values to a public method of that class, using an object of the class.
Example:
// Class with a private member and a public setter method
class Student {
  var student_name = ""
  var student_age = 0
  private var student_rollno = 0
  // Class method
  def set_roll_no(x: Int)
  {
    student_rollno = x
  }
}
// Creating object
object GFG
{
  // Main method
  def main(args: Array[String])
  {
    // Class object
    var obj = new Student()
    obj.student_name = "Yash"
    obj.student_age = 22
    obj.set_roll_no(59)
  }
}
43.Using pattern matching, write a function swap that swaps the first two elements of an
array provided its length is at least two. Refer to the example on List matching for hints.
scala> swap2(num)
res11: Array[Int] = Array(3, 4, 6, 7, 9)
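The definition behind this transcript is not shown; one possible pattern-matching implementation (named swap2 to match the transcript, with the input assumed to be Array(4, 3, 6, 7, 9)) is:

```scala
// Swap the first two elements of an array if it has at least two elements.
def swap2(a: Array[Int]): Array[Int] = a match {
  case Array(first, second, rest @ _*) => Array(second, first) ++ rest
  case _                               => a // fewer than two elements: unchanged
}

val num = Array(4, 3, 6, 7, 9)
swap2(num)  // Array(3, 4, 6, 7, 9)
```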
The key factor of functional programming is "immutable state". In other words, once a variable
is initialized with one value, it cannot be assigned a different value later. This is in contrast with
imperative programming, where it is common to re-assign variables to new values.
• All built-in operators and user defined functions are "pure functions"
A functional style can be used in most (though not all) imperative languages. To do so, the programmer must avoid mutable data by discipline, since the language itself continues to allow it. As a side note, modern software engineering principles generally advise functional-style code when possible, even in non-functional languages. So if you code in an imperative language, spend time learning a functional language -- you'll improve your habits and learn a new set of patterns to apply to solving problems. But in a language not designed for functional programming, the programmer would find it difficult to accomplish some straightforward tasks in purely functional terms, looping being a good example. To make functional coding easier, most functional languages provide features such as:
• Recursion
• Closures
• Currying