
Name: Cheong Tong Chon Student ID: DB925799

CISC3018 Cloud Computing and Big Data Systems


Assignment (04) No. of Pages: 3
Question 1:
Resilient Distributed Datasets (RDDs) enable reliable in-memory storage. Spark proposes the RDD as a block of read-only data that can be kept in memory for a long period if necessary. Reading data from an RDD in memory is much more efficient than reading data from disk (as in MapReduce). Each RDD can only be generated from external storage (e.g., disk) or from “transformation” operations on data already in memory, which is more reliable and fault tolerant than traditional shared memory. Different RDDs are connected via lineage, so a lost RDD can be recovered directly from its “parent” RDD. An RDD never modifies data in place (it is immutable), so RDDs are also compatible with MapReduce. In a narrow RDD dependency, each parent generates one child, so no additional data is duplicated. In a wide RDD dependency, each parent generates multiple children, which duplicates additional data but increases reliability.
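As a brief illustration, a minimal PySpark sketch of these ideas (the data values are made up): mapValues is a narrow transformation, groupByKey is a wide one, and toDebugString shows the lineage Spark would replay to recover a lost RDD.

from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-demo")

# Build an RDD from in-memory data; it could equally come from external storage (e.g. sc.textFile).
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Narrow transformation: each parent partition feeds at most one child partition.
doubled = pairs.mapValues(lambda v: v * 2)

# Wide transformation: parent partitions are shuffled into multiple child partitions.
grouped = doubled.groupByKey()

# RDDs are read-only; transformations return new RDDs and record lineage.
print(grouped.toDebugString())
print(grouped.mapValues(list).collect())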

Question 2:
(a) The four key components of Yet Another Resource Negotiator (YARN) are the Resource Manager (RM), the Node Manager (NM), the Container, and the Application Master (AM). The Resource Manager manages the resources for the applications running in an HDFS cluster and usually works on the master node. The Node Manager manages the memory and disk resources within an individual node and works on the slave systems (Data Nodes). A Container is a collection of physical resources such as RAM, CPU cores, and disk on a single slave node. The Application Master manages the user job lifecycle and the resource needs of an individual application; it works along with the Node Manager on the slave systems and monitors the execution of tasks.
(b) YARN lets multiple Resource Managers coexist simultaneously in an active/standby mode: at any point in time, one Resource Manager is active, and the others are in standby mode. If the currently active Resource Manager fails, one of the standby Resource Managers is invoked to take over the management. In practice, due to an incomplete switchover or other reasons, the clients and the slaves may mistakenly believe that there are two active Resource Managers. When the two nodes collide in occupying the physical resources, a third-party arbiter decides which one to listen to; when the status of a node cannot be determined, the other party is killed (fenced) to ensure that the shared resources are completely released.
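As a conceptual illustration only (this is not YARN's actual failover code; the class and function names are invented), a toy Python sketch of the active/standby switchover and fencing described above:

class ResourceManager:
    # Toy model of a Resource Manager that is "active", "standby", or "fenced".
    def __init__(self, name):
        self.name = name
        self.state = "standby"
        self.alive = True

def failover(rms):
    # The arbiter's role reduced to its essence: promote one healthy RM to active,
    # keep the other healthy ones in standby, and fence any RM whose status is
    # unknown so it cannot keep occupying the shared resources.
    healthy = [rm for rm in rms if rm.alive]
    for i, rm in enumerate(healthy):
        rm.state = "active" if i == 0 else "standby"
    for rm in rms:
        if not rm.alive:
            rm.state = "fenced"

rms = [ResourceManager("rm1"), ResourceManager("rm2")]
failover(rms)                # rm1 becomes active, rm2 stays standby
rms[0].alive = False         # the active RM fails
failover(rms)                # rm2 takes over; rm1 is fenced
print([(rm.name, rm.state) for rm in rms])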

Question 3:
Batch processing extracts knowledge and information from data after several hours or days, once the whole batch of data is available. It processes a significant amount of data in a parallel and distributed manner (e.g., on a virtual cloud), and there is no strict latency limit for completing jobs. Stream processing analyzes real-time generated data in sequence, right after it is produced. Data is generated in a real-time fashion (“velocity”) as time passes and needs to be processed as it arrives, in sequence, as a “stream”; it processes data or tasks as data flows through a directed graph model. Interactive processing provides feedback results instantly when a query is posed; in practice it uses an architecture that integrates batch processing and stream processing.
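A brief PySpark sketch contrasting the batch and streaming styles (the file path and the socket host/port are placeholders for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

# Batch: the whole dataset is already available; the job reads it, processes it, and finishes.
batch_df = spark.read.csv("hdfs:///data/events.csv", header=True)   # placeholder path
print(batch_df.groupBy("user").count().collect())

# Streaming: records arrive continuously and are processed in sequence as they come in.
stream_df = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()        # placeholder source
query = stream_df.groupBy("value").count() \
    .writeStream.outputMode("complete").format("console").start()
query.awaitTermination()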

Question 4:
(a) The classification methods under the rule-based approach are decision-tree-based methods, rule-based methods, and nearest-neighbor methods; under the optimization-based approach they are Bayesian-analysis-based methods, Support Vector Machines, and Neural Networks. A decision-tree-based method uses the generated decision tree to determine the “label” of a new data item. Support Vector Machines find a hyperplane that separates the two groups of points (i.e., performs the classification) in space. Neural Networks generate (i.e., learn) an artificial neural network for classification based on the training data.
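For illustration, a scikit-learn sketch that trains three of these classifiers on the same toy dataset (the choice of dataset is arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Decision tree: rules read off a learned tree; SVM: a separating hyperplane;
# MLP: an artificial neural network learned from the training data.
for clf in (DecisionTreeClassifier(), SVC(kernel="linear"), MLPClassifier(max_iter=2000)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))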
(b) A Support Vector Machine (SVM) uses the margin, which is the minimum distance from the hyperplane to each of the two categories. The larger the margin, the better the classification result. B1 provides the better classification result, because the distance from the hyperplane to both categories should be as large as possible.
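For a linear SVM the margin can be read off the learned weight vector as 2/||w||; a small sketch on made-up 2-D points:

import numpy as np
from sklearn.svm import SVC

# Two toy classes of 2-D points (the values are made up for illustration).
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.5], [5.0, 5.0], [5.5, 4.0], [6.0, 5.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # a very large C approximates a hard margin
w = clf.coef_[0]
print("margin =", 2.0 / np.linalg.norm(w))    # the larger this value, the better the separator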

Question 5:
(a) When a single neuron receives an input, the sign function is used as the activation function. In Figure 2, the inputs x1, x2 and x3 each have their own weight w1, w2 and w3. The neuron first computes \sum_{i=1}^{3} w_i x_i, then combines this sum with the bias b, and the output y is generated by:

f(x) = \begin{cases} 1, & \text{if } w^{T} x + b > 0 \\ -1, & \text{otherwise} \end{cases}
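A minimal NumPy sketch of this forward pass (the weights, bias and inputs are arbitrary example values):

import numpy as np

def neuron_output(x, w, b):
    # y = sign(w . x + b), mapped to {-1, +1} as in the formula above
    return 1 if np.dot(w, x) + b > 0 else -1

w = np.array([0.4, -0.2, 0.7])   # example weights w1, w2, w3
b = -0.1                         # example bias
x = np.array([1.0, 2.0, 0.5])    # example inputs x1, x2, x3
print(neuron_output(x, w, b))    # -> 1, since w.x + b = 0.4 - 0.4 + 0.35 - 0.1 = 0.25 > 0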
(b) Different inputs x give different outputs y, so the model is trained with an iterative, gradient-based approach to find the most suitable w1, w2, w3 and b.
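Because the sign activation is not differentiable, the classic iterative update for this single neuron is the perceptron rule, shown here as a stand-in for the iterative, gradient-style training described above (the training data is made up):

import numpy as np

# Made-up training data: rows of (x1, x2, x3) with labels in {-1, +1}.
X = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.0]])
y = np.array([1, 1, -1, -1])

w = np.zeros(3)
b = 0.0
lr = 0.1                                    # learning rate

for _ in range(100):                        # iterative passes over the data
    for xi, yi in zip(X, y):
        pred = 1 if np.dot(w, xi) + b > 0 else -1
        if pred != yi:                      # only misclassified points move w and b
            w += lr * yi * xi
            b += lr * yi

print("learned w =", w, "b =", b)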

Question 6:
Number of parameters: (5 + 1) × 3 + (3 + 1) × 1 = 18 + 4 = 22
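Assuming the network in the question has 5 inputs, one hidden layer of 3 neurons, and a single output neuron (which is what the count above implies), the total can be verified with a short PyTorch sketch:

import torch.nn as nn

# Assumed architecture: 5 inputs -> 3 hidden neurons -> 1 output, each layer with biases.
model = nn.Sequential(nn.Linear(5, 3), nn.Linear(3, 1))

# nn.Linear(5, 3): 5*3 weights + 3 biases = 18; nn.Linear(3, 1): 3 weights + 1 bias = 4.
print(sum(p.numel() for p in model.parameters()))   # -> 22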

Question 7:
(a) The main structure of a Convolutional Neural Network (CNN) includes the convolutional layer, the pooling layer, and the fully connected layer. The convolutional layer convolves the image with different kernels and produces different results; it is similar to the operation of a neuron but more complicated. The pooling layer reduces the dimensionality of the data by combining the outputs of clusters of neurons in one layer into a single neuron in the next layer. The fully connected layer uses the features extracted by the convolutional and pooling layers to output the final classification result.
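A minimal PyTorch sketch stacking these three layer types (the 1×28×28 input size and the channel/class counts are arbitrary example choices):

import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3),   # convolutional layer: convolve the image with 8 learned 3x3 kernels
    nn.ReLU(),
    nn.MaxPool2d(2),                  # pooling layer: combine 2x2 clusters of outputs into single values
    nn.Flatten(),
    nn.Linear(8 * 13 * 13, 10),       # fully connected layer: map the extracted features to 10 class scores
)

x = torch.randn(1, 1, 28, 28)         # a batch containing one 28x28 single-channel image
print(cnn(x).shape)                   # -> torch.Size([1, 10])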
(b) An Artificial Neural Network (ANN) generates (i.e., learns) its neural network for classification directly from the training data, while a Convolutional Neural Network (CNN) generates its neural network based on convolution.

Question 8:
Classification is given a collection of records (the training set) and learns a model that maps each attribute set x to one of the predefined labels y; the trained model is then used to classify another set of data with attribute set x but unknown y. Clustering identifies the similarity among different data objects: the grouping is achieved by determining the similarities between data according to characteristics found in the real data.
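A scikit-learn sketch contrasting the two (the dataset and the number of clusters are arbitrary illustration choices):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Classification: learn a model from (x, y) pairs, then predict labels for new x with unknown y.
clf = DecisionTreeClassifier().fit(X[:120], y[:120])   # training set with known labels
print(clf.predict(X[120:]))                            # predicted labels for unseen records

# Clustering: no labels at all; group the data purely by similarity between attribute sets.
km = KMeans(n_clusters=3, n_init=10).fit(X)
print(km.labels_[:10])                                 # cluster assignments discovered from the data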
