
Q.1] State and explain machine learning algorithms in Apache Spark.
Ans:- Apache Spark MLlib is a machine learning library that provides a wide range of algorithms for classification, regression, clustering, collaborative filtering, dimensionality reduction, and feature extraction. MLlib is built on top of Spark's distributed computing engine, which makes it scalable to large datasets.
MLlib algorithms are implemented in two different APIs:
a.) RDD-based API: This API is built on top of Spark's Resilient Distributed Datasets (RDDs). It is the original MLlib API and is still supported, but it is in maintenance mode.
b.) DataFrame-based API: i.) This API is built on top of Spark's DataFrames. It is the newer and preferred MLlib API, and it is the primary API for machine learning in Spark. ii.) The DataFrame-based API provides a higher-level abstraction for machine learning, making it easier to build and deploy machine learning pipelines. It also provides a number of features that are not available in the RDD-based API, such as support for model persistence and distributed hyperparameter tuning.
Popular machine learning algorithms in Apache Spark:
a.) Classification: logistic regression, naive Bayes, decision trees, random forests, gradient-boosted trees
b.) Regression: generalized linear regression, survival regression, decision trees, random forests, gradient-boosted trees
c.) Clustering: k-means, Gaussian mixtures, latent Dirichlet allocation (LDA)
d.) Collaborative filtering: alternating least squares (ALS)
e.) Dimensionality reduction: principal component analysis (PCA), singular value decomposition (SVD)
f.) Feature extraction: Word2Vec, TF-IDF, one-hot encoding
How to use machine learning algorithms in Apache Spark:- To use a machine learning algorithm in Apache Spark, you first load the data into a Spark DataFrame. Once the data is loaded, you can use the MLlib API to create and train a model. Once the model is trained, you can use it to make predictions on new data, as the sketch below shows.
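A minimal PySpark sketch of this workflow, training a logistic regression classifier through the DataFrame-based Pipeline API (the file data.csv and the column names f1, f2, and label are hypothetical, not from the original):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-example").getOrCreate()

# Load the data into a Spark DataFrame.
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Assemble the raw columns into the single feature vector MLlib expects.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Train the pipeline, then use the fitted model to predict on (here, the same) data.
model = Pipeline(stages=[assembler, lr]).fit(df)
model.transform(df).select("label", "prediction").show()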
Q.2] What is deep learning? Describe four applications of deep learning and its challenges.
Ans:- Deep learning is a subset of machine learning that has gained significant popularity due to its ability to learn and make predictions from data with minimal feature engineering. It is based on artificial neural networks, which are inspired by the structure and function of the human brain. Deep learning has found numerous applications across various domains:
1.) Computer Vision:- a.) Image Classification: Deep learning models like Convolutional Neural Networks (CNNs) can classify images into different categories (a minimal CNN sketch follows this answer). This is widely used in applications such as facial recognition, object detection, and identifying diseases in medical images.
2.) Natural Language Processing (NLP):- a.) Language Translation: Recurrent Neural Networks (RNNs) and Transformer models like BERT have revolutionized machine translation, enabling systems like Google Translate to provide more accurate translations.
3.) Speech Recognition:- Deep learning is used in automatic speech recognition systems, converting spoken language into text. Applications include voice assistants, transcription services, and accessibility tools for people with disabilities.
4.) Reinforcement Learning:- Deep reinforcement learning combines deep neural networks with reinforcement learning algorithms to enable machines to learn how to make decisions by interacting with an environment. This is used in autonomous robotics, game playing (e.g., AlphaGo), and optimization problems.
Challenges: deep learning typically requires large labeled datasets, is computationally expensive to train, and produces models that are difficult to interpret.
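To make the image-classification application concrete, here is a minimal CNN sketch in PyTorch (the framework choice and all layer sizes are illustrative assumptions; the original names no framework):

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """A tiny CNN for 28x28 grayscale images and 10 output classes."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # learn local filters
            nn.ReLU(),
            nn.MaxPool2d(2),                             # downsample 28 -> 14
        )
        self.classifier = nn.Linear(16 * 14 * 14, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = SmallCNN()
logits = model(torch.randn(8, 1, 28, 28))  # a batch of 8 fake images
print(logits.shape)                        # torch.Size([8, 10])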
Q.3] What is graph processing? Describe Pregel, Giraph, and Apache Gelly.
Ans:- Graph processing is a type of data processing that deals with analyzing and manipulating data structured as a graph. In a graph, data is represented as a collection of nodes (vertices) and edges that connect those nodes. Graphs are used to model various types of relationships and networks, such as social networks, transportation networks, biological networks, and more.
Pregel:
a.) Pregel is a graph processing framework developed by Google for large-scale graph computations.
b.) It provides a programming model that is centered around the concept of vertices and their associated states.
c.) Pregel is designed to handle large graphs and distributed computing environments efficiently.
d.) It allows users to define custom vertex and message handlers to perform computations on the graph (a sketch of this vertex-centric model follows this answer).
Giraph:
a.) Apache Giraph is an open-source graph processing framework based on Pregel, designed to work with the Hadoop ecosystem.
b.) It extends the Pregel model and is integrated with Apache Hadoop's HDFS (Hadoop Distributed File System) for data storage.
c.) Giraph is well-suited for processing large-scale graphs in parallel across a cluster of machines.
d.) It is often used for tasks like graph traversal, community detection, and graph analytics.
Apache Gelly (part of Apache Flink):
a.) Apache Gelly is a graph processing library that is part of the Apache Flink project, a stream processing framework.
b.) Gelly leverages Flink's capabilities for distributed data processing to handle graph data.
c.) It provides a high-level API for graph operations, making it easier for developers to work with graph data in a Flink-based environment.
d.) Gelly supports common graph algorithms and can be used for various graph analytics tasks.
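A minimal single-machine Python sketch of the vertex-centric, superstep-based model that Pregel (and hence Giraph) popularized; it illustrates the idea only and is not any framework's actual API. In each superstep, every vertex consumes its incoming messages, updates its state, and sends messages to its neighbours; here the minimum vertex id is propagated, which labels connected components:

def superstep(value, messages, edges):
    # One Pregel superstep: consume messages, update state, emit messages.
    out = {}
    for v, inbox in messages.items():
        m = min(inbox)
        if m < value[v]:          # state improves: update and notify neighbours
            value[v] = m
            for u in edges.get(v, []):
                out.setdefault(u, []).append(m)
    return out

def connected_components(vertices, edges):
    value = {v: v for v in vertices}      # each vertex starts with its own id
    messages = {}
    for v in vertices:                    # superstep 0: announce own id
        for u in edges.get(v, []):
            messages.setdefault(u, []).append(v)
    while messages:                       # halt when no messages are in flight
        messages = superstep(value, messages, edges)
    return value

# Undirected 3-node chain plus an isolated vertex.
adj = {1: [2], 2: [1, 3], 3: [2], 4: []}
print(connected_components([1, 2, 3, 4], adj))  # {1: 1, 2: 1, 3: 1, 4: 4}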
Q.4] Describe the key features of MongoDB.
Ans:- 1. Document Oriented:- MongoDB stores all data in the form of documents instead of tables as in an RDBMS. In these documents, data is stored as key-value pairs instead of rows and columns, which makes the data much more flexible than in an RDBMS. Each document carries a unique ID.
2. Schema-less Database:- A schema-less design is a great feature of MongoDB: a single collection can hold documents of different types, and those documents may differ in their number of fields, content, and size. A document is not required to resemble the other documents in its collection, as it would be in a relational database. This gives MongoDB great flexibility.
3. Scalability:- MongoDB provides horizontal scalability through a mechanism known as sharding. Sharding distributes data across multiple servers: a large amount of data is partitioned into chunks using the shard key, and these chunks are spread evenly across shards that reside on many physical servers. New machines can also be added to a running database.
4. Indexing:- MongoDB can index any field in a document with primary and secondary indices, which makes retrieving or searching data from the pool much faster. If the data is not indexed, the database must scan every document against the query, which takes far more time and is inefficient.
5. Aggregation:- MongoDB also allows operations on grouped data that return a single or computed result. It provides three kinds of aggregation: the aggregation pipeline, the map-reduce function, and single-purpose aggregation methods.
6. High Performance:- Thanks to features like scalability, indexing, and replication, MongoDB delivers very high performance and data persistence compared to many other databases. A short pymongo sketch of features 1, 4, and 5 follows this answer.
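A short pymongo sketch illustrating documents, indexing, and the aggregation pipeline (the database, collection, and field names are hypothetical, and a local mongod is assumed):

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# 1./2. Document oriented and schema-less: documents in one collection may differ.
orders.insert_one({"customer": "alice", "total": 120, "items": 3})
orders.insert_one({"customer": "bob", "total": 80})  # no "items" field

# 4. Indexing: a secondary index on "customer" speeds up lookups.
orders.create_index([("customer", ASCENDING)])

# 5. Aggregation pipeline: group documents and compute a result per group.
pipeline = [{"$group": {"_id": "$customer", "spent": {"$sum": "$total"}}}]
for row in orders.aggregate(pipeline):
    print(row)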
Q.5] How does MongoDB provide database replication?
Ans:- a.) Replication exists primarily to offer data redundancy and high availability. We maintain the durability of data by keeping multiple copies, or replicas, of that data on physically isolated servers. That's replication: the process of creating redundant data to streamline and safeguard data availability and durability.
b.) Replication allows you to increase data availability by creating multiple copies of your data across servers. This is especially useful if a server crashes or if you experience service interruptions or hardware failure.
c.) If your data resides only in a single database, any of these events would make accessing the data impossible. Thanks to replication, your applications can stay online in case of database server failure, while also providing disaster recovery and backup options.
d.) With MongoDB, replication is achieved through a replica set. Write operations are sent to the primary server (node), which applies the operations across the secondary servers, replicating the data.
e.) If the primary server fails (through a crash or system failure), one of the secondary servers takes over and becomes the new primary node via an election. If the failed server comes back online, it rejoins as a secondary once it fully recovers, aiding the new primary node. A connection sketch follows this answer.
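A minimal pymongo sketch of working with a replica set (the host names and the replica-set name rs0 are hypothetical); the driver discovers the primary from the seed list, routes writes to it, and fails over automatically once a new primary is elected:

from pymongo import MongoClient, ReadPreference

# Seed list of replica-set members; the driver locates the current primary.
client = MongoClient(
    "mongodb://db1.example.com:27017,db2.example.com:27017/?replicaSet=rs0"
)

db = client["app"]
db.events.insert_one({"msg": "hello"})  # writes always go to the primary

# Optionally serve reads from secondaries to spread the load.
secondary_db = client.get_database(
    "app", read_preference=ReadPreference.SECONDARY_PREFERRED
)
print(secondary_db.events.count_documents({}))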
Q.6] Explain horizontal scaling and vertical scaling in MongoDB.
Ans:- Scaling alters the size of a system. In the scaling process, we either compress or expand the system to meet the expected needs. Scaling can be achieved by adding resources to the current system, by adding a new system to the existing one, or both.
Vertical Scaling: When new resources are added to the existing system to meet the expectation, it is known as vertical scaling. Consider a rack of servers and resources that comprises the existing system: when it fails to meet the expected needs, and those needs can be met by just adding resources, that is vertical scaling. Vertical scaling is based on the idea of adding more power (CPU, RAM) to existing machines. It is easier and cheaper than horizontal scaling and requires less time to carry out.
a.) It expands the size of the existing system vertically. b.) It is harder to upgrade and may involve downtime. c.) It is easy to implement. d.) It is cheaper, as we need only add new resources.
Horizontal Scaling: When new server racks are added to the existing system to meet a higher expectation, it is known as horizontal scaling. When the existing system fails to meet the expected needs, and those needs cannot be met by just adding resources, we need to add completely new servers; that is horizontal scaling. Horizontal scaling is based on the idea of adding more machines to our pool of resources. It is more difficult and costlier than vertical scaling and requires more time to carry out.
a.) It expands the size of the existing system horizontally. b.) It is easier to upgrade. c.) It is difficult to implement. d.) It is costlier, as new server racks comprise a lot of resources.
In MongoDB, horizontal scaling is realized through sharding (see Q.4); a sketch follows this answer.
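A pymongo sketch of turning on horizontal scaling via sharding (this assumes a sharded cluster reachable through a mongos router; the database and collection names and the shard key are hypothetical):

from pymongo import MongoClient

# Connect to the mongos query router of a sharded cluster.
client = MongoClient("mongodb://mongos.example.com:27017")

# Enable sharding for the database, then shard a collection on a
# hashed shard key so chunks spread evenly across the shards.
client.admin.command("enableSharding", "app")
client.admin.command("shardCollection", "app.users", key={"user_id": "hashed"})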
