ABD Exam Prep PDF
Databricks
#1 We’ll move with a strong focus and detail on Databricks because:
• It is a leading platform for Big Data processing
• It is Spark based and leads Spark adoption (alongside the ASF)
• It provides a free platform to learn and work with Spark

Scale Up vs Scale Out
• Up – there is a limit. Adding bigger, more powerful machines. SMP: Symmetric multiprocessing
• Out – Add more commodity data servers: adding more, smaller and cheaper machines. MPP: Massively parallel processing
Delta Lake is an open source storage layer that
brings reliability to data lakes. Delta Lake
provides ACID transactions, scalable metadata
handling, and unifies streaming and batch data
processing. Delta Lake runs on top of your
existing data lake and is fully compatible with
Apache Spark APIs
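Since Delta Lake is used through the normal Spark DataFrame APIs, a minimal sketch looks like the following (the path /tmp/delta_events and the example DataFrame are illustrative, not from the course material):

events = spark.range(0, 100).withColumnRenamed("id", "event_id")   # any Spark DataFrame
# Write it as a Delta table on top of the existing data lake storage
events.write.format("delta").mode("overwrite").save("/tmp/delta_events")
# Read it back with the same Spark API; the write above was an ACID transaction
spark.read.format("delta").load("/tmp/delta_events").count()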
Big Data vs Traditional DB technologies
Traditional DBMS Systems
• Rigid data models
• Weak fault-tolerance architecture
• Scalability constraints
• Expensive to scale
• Limitation on unstructured data
• Proprietary HW and SW
Additionally:
%sh: allows you to execute shell code in your notebook. Add the -e option in order to fail the cell (and subsequently a job or a run all command) if the shell command has a non-zero exit status
%fs: allows you to use dbutils filesystem commands
%md: allows you to include various types of documentation, including text, images, and mathematical formulas and equations

Dictionaries
• A dictionary maps a set of objects (keys) to another set of objects (values)
• Dictionaries are mutable and unordered (the order that the keys are added doesn't necessarily reflect what order they may be reported back)
• Use {} curly brackets to construct the dictionary, and [] square brackets to index it. Separate the key and value with colons : and with commas , between each pair. Keys must be quoted
Tuples
• A tuple consists of a set of ordered values
separated by commas
• Tuples are immutable
• Tuples are always enclosed in parentheses
Lambda expressions
The lambda operator or lambda function is used for creating small, one-time, anonymous functions.
Lists
• Lists are collections of items where each item in
the list has an assigned index value
• A list is mutable (meaning you can change its
contents)
• Lists are enclosed in square brackets [ ] and
each item is separated by a comma
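A small Python sketch tying the four concepts above together (variable names are only for illustration):

person = {"name": "Alice", "age": 31}       # dictionary: built with {}, indexed with []
print(person["age"])                        # -> 31
point = (3, 4)                              # tuple: ordered, immutable, in parentheses
temps = [50, 59.2, 57.2]                    # list: mutable, indexed, in square brackets
temps.append(53.5)
to_celsius = lambda f: (f - 32) * 5 / 9     # lambda: small, anonymous, single-expression function
print([to_celsius(t) for t in temps])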
Avro data files - data serialization
• To understand Avro, you must first understand serialization
• It’s the process of converting data into a sequence of bytes/bits
o A way of representing data in memory as a series of bytes
o Allows you to save data to disk or send it across the network
o Deserialization allows you to read the data back into memory
• Many programming languages and libraries support serialization
o Ex: Serializable in Java or pickle in Python
• Backwards compatibility and cross-language support can be challenging
o Avro was developed to address these challenges
• Efficient storage due to optimized binary encoding
• Widely supported throughout the Hadoop ecosystem
o Can also be used outside of Hadoop
• Ideal for long-term storage of important data
o Can read and write from many languages
o Embeds the schema in the file, so it will always be readable
o Schema evolution can accommodate changes

Columnar formats
• These organize data storage by column, rather than by row
• Very efficient when selecting only a small subset of a table’s columns
• Ex: RCFile, ORC (Optimized Row Columnar), Parquet (default in Spark)
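A minimal PySpark sketch of why columnar formats pay off: write a DataFrame as Parquet (Spark's default) and read back only one column (the path and column names are illustrative):

df = spark.createDataFrame([(1, "Alice", 31), (2, "Bob", 29)], ["id", "name", "age"])
df.write.mode("overwrite").parquet("/tmp/people_parquet")
# Because Parquet stores data by column, selecting a single column can skip the others on disk
spark.read.parquet("/tmp/people_parquet").select("age").show()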
A Transformation is a function that produces a new RDD / DF:
• textFile(file)
• map(func)
• flatMap(func)
• filter(func)
• sortBy(func)
• union(rdd)
• intersection(rdd)
• distinct()
• join(key-value pair rdd)
• groupByKey()
• reduceByKey(func)

An Action is a function that produces a value:
• count()
• collect()
• take(n)
• reduce()
• min()
• max()
• stats()
• foreach()
• countByKey()
• countByValue()
• saveAsTextFile(file)

• RDDs are immutable, data in an RDD is never changed
• Transformations are the core of the program business logic
• It’s how you define the changes you want to make in your data to produce the desired outcome

Use actions for:
• View data in the console i) collect(), take(n), count()
• Collect data from native objects in the respective format i) parallelize(collection)
• Write data to output data sources i) saveAsTextFile(file)

Reading and writing text files (transformation and action)
• Spark can create distributed datasets from any storage source
• Local file system, HDFS/DBFS, Cassandra, HBase, cloud provider, etc.
• Use the sc.textFile method to create the dataset

Reading a text file (is a transformation): sc.textFile(file)
o Reads a text file (or files) and creates an RDD of lines (list)
o Each line in the file(s) is a separate RDD element
o Works only with line-delimited text files
o Accepts a single file, a wildcard list or a comma-separated list of files
• sc.textFile(“file.txt”) > One file
• sc.textFile(“data/*.log”) > Wildcards
• sc.textFile(“file1.txt,file2.txt”) > List of files

Writing a text file (is an action): rdd.saveAsTextFile(“path”)

Operations in Spark are lazy
• Lazy means that they do not compute their
results right away. Instead, they just remember
the transformations applied to some base dataset
(e.g., a file)
• The transformations are only computed when
an action requires a result to be returned to the
driver program
• This provides immense benefits to the end user
because Spark can optimize the entire data flow
from end to end
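A short sketch of that laziness with RDDs; nothing is read or filtered until the count() action runs (the file path is illustrative):

lines = sc.textFile("dbfs:/FileStore/tables/sample.txt")   # transformation: just remembered
errors = lines.filter(lambda line: "ERROR" in line)        # transformation: still nothing executed
errors.count()                                             # action: triggers the whole data flow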
SUMMARY
• Spark is a distributed computation framework
• Can be used interactively via the Spark shell
(Python or Scala)
• Spark has 3 main data abstraction concepts:
RDD, DataFrames & Datasets
• RDDs are the basic data unit in Spark, used for unstructured data; DataFrames are data organized in columns, for structured data
• Spark has 2 types of operations on data objects:
i) Transformations ii) Actions
• Lazy execution → Transformations are not
executed until required by an action
• Spark uses functional programming i) Passing
functions as parameters ii) Anonymous functions
• Spark terminology
o Application: set of jobs managed by a
driver
o Job: a set of tasks executed as a result of
an action
o Stage: a set of tasks in a job (one per
partition)
o Task: an individual unit of work sent to an
executor
• Spark keeps track of each RDD’s lineage → to provide fault tolerance
• RDD operations are executed on partitions in
parallel
o More parallelism means more speed
o The level of parallelism can be controlled
o Operations that depend on multiple
partitions are executed in separate stages
(ex: join, reduceByKey)
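A minimal sketch of controlling that level of parallelism when an RDD is created (the data and the partition count are arbitrary):

rdd = sc.parallelize(range(1000), 8)   # ask for 8 partitions explicitly
rdd.getNumPartitions()                 # -> 8
rdd.map(lambda x: x * 2).sum()         # the map runs on the 8 partitions in parallel; sum() is the action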
#5 Summary statistics
Note: The input function to map returns a single element, while flatMap returns a list of elements. The output of flatMap is flattened.

Lambda Architecture
Proposed by Nathan Marz to address the latency problem of typical DWs.
Creating a GraphFrame
myGraphFrame = GraphFrame(verticesDF, edgesDF)
• A GraphFrame is created from a vertex DataFrame and an edges DataFrame
• The vertex DataFrame must contain a column named id that stores unique vertex IDs
• The edges DataFrame must contain a column named src that stores the source of the edge and a column named dst that stores the destination of the edge
• Both DFs can have additional columns

Graph operators (GraphX, Scala)
List operators: numEdges, numVertices, degrees
Property operators: mapVertices, mapEdges, mapTriplets
Structural operators: reverse, subgraph, groupEdges
Join operators: joinVertices, outerJoinVertices
Neighborhood aggregation: aggregateMessages (primitives for developing graph algorithms)

GraphFrames operators (GraphFrames, Python)
show, display (vertices/edges)
filter, groupBy, inDegrees, outDegrees
inDegrees/outDegrees = # of edges pointing in/out
g.vertices.filter(“column = ‘string’”).show()
g.filterEdges(“column = ‘string’”).dropIsolatedVertices().vertices.show()
g.vertices.groupBy().max(“column”).show()
g.inDegrees.show()
Motif finding: searching for structural patterns in a graph. DSL - Domain Specific Language: () - [] -> ()
g.find( “(v1) - [e1] -> (v2); (v2) - [e2] -> (v1)” )
g.find( “(v1) - [] -> (v2); !(v2) - [] -> (v1)” )
Subgraphs: select subgraphs based on a combination of motif finding and DataFrame filters. Ex: filterVertices(cond), filterEdges(cond), dropIsolatedVertices()
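A small sketch that exercises the GraphFrames operators listed above, assuming the graphframes package is attached to the cluster (the vertex and edge data are illustrative):

from graphframes import GraphFrame

verticesDF = spark.createDataFrame([("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edgesDF = spark.createDataFrame([("a", "b", "friend"), ("b", "a", "friend"), ("b", "c", "follow")],
                                ["src", "dst", "relationship"])
g = GraphFrame(verticesDF, edgesDF)
g.inDegrees.show()                              # number of edges pointing into each vertex
g.find("(v1)-[]->(v2); (v2)-[]->(v1)").show()   # motif: pairs of vertices connected in both directions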
Summary
• ML systems are designed to “learn” from data by building a mathematical model and to make predictions based on that model (“past data”)
• ML implies a standard process:
o Load data and clean/transform the data
o Build/train and test the model
o Deploy and use the model
o Evaluate the model
• Spark has unique characteristics for ML model development and implementation
• Spark does ML with two libraries: MLlib and Spark ML
• Spark implements many ML functions and algorithms (but still a work in progress)

Data preparation
• Derive a new feature based on existing one(s)
• Eliminate duplicates
• Filter nulls (eliminate rows with null values)
• Identifying outliers
o Common rule: a value is an outlier if it lies more than 1.5 * IQR below Q1 or above Q3
o IQR is the interquartile range, defined as the difference between the upper (Q3) and lower (Q1) quartiles → Meaning: there are no outliers if all the values are roughly within the Q1 − 1.5*IQR and Q3 + 1.5*IQR range
• One-hot encoding: for every distinct category, create a feature (vector)
o Better than just StringIndexer for handling non-ordered categories
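A hedged PySpark sketch of both preparation steps, assuming a DataFrame df with a numeric charges column and a categorical smoker column like the insurance examples later in these notes (OneHotEncoder as an estimator is the Spark 3.x API):

from pyspark.ml.feature import StringIndexer, OneHotEncoder

# IQR rule: keep only rows whose value lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df.approxQuantile("charges", [0.25, 0.75], 0.0)
iqr = q3 - q1
df_clean = df.filter((df.charges >= q1 - 1.5 * iqr) & (df.charges <= q3 + 1.5 * iqr))

# One feature per distinct category: index the string column, then one-hot encode the index
indexer = StringIndexer(inputCol="smoker", outputCol="smoker_idx")
encoder = OneHotEncoder(inputCols=["smoker_idx"], outputCols=["smoker_vec"])
indexed = indexer.fit(df_clean).transform(df_clean)
encoded = encoder.fit(indexed).transform(indexed)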
ABD 21 1st Exam - 11 January (pdf)

1. What is the meaning of the data processing model “Scale Out”?
a. Means adding bigger, more powerful machines
b. Means adding more, smaller machines
c. Means processing data with scalable machines
d. Means processing data outside the cluster

2. In Databricks notebooks you can?
a. Program only in Python and Spark
b. Program only in the language defined in the Notebook creation
c. Select a language at the cell level
d. Select a language for a set of cells

3. In a Databricks notebook, to access the cluster driver node console, what magic command is used?
a. %fs
b. %drive
c. dbutils.fs.mount()
d. %sh

(options of a question whose stem is not captured in this extract)
a. A program to load data with high parallelization
b. A column-based data format
c. A row-based data format
d. A text-based data format for compatibility and portability

7. What is a Pair RDD?
a. Two RDDs in sequence in a transformation statement
b. An RDD with only two rows
c. An RDD with a pair operation
d. An RDD with key-value pairs

8. What is the result of the Spark statement below?
sc.parallelize(mydata,3)
a. Creates an RDD with a minimum of 3 partitions
b. Creates an RDD named mydata and the value 3
c. Generates an error of too many parameters
d. Creates 3 RDDs with mydata
13. What is a lambda function?
a. It’s a function defined without a name and with only one parameter
b. It’s a function defined without a name and with only one expression
c. It’s a function that can be reused with many parameters
d. It’s a function that can be reused with many expressions

14. collect() is a Spark function that?
a. You can extensively use to display Dataframes content
b. It's not available for Dataframes
c. You can extensively use to display RDDs content
d. You should use with caution to display RDDs content
Note: use collect() only on smaller datasets, usually after filter(), group(), count(), etc.; calling it on a larger dataset results in an out-of-memory error on the driver.

18. Select the right statement regarding Spark transformations:
a. Wide transformations are very efficient because they don’t move data from the node
b. Narrow transformations are very efficient because they don’t move data from the node
c. Both wide and narrow transformations move data from the node
d. None of the narrow or wide transformations move data from the node

19. In Spark, lazy execution means that:
a. Execution will take some time because it needs to be sent to the worker nodes
b. Execution will take some time because the code is interpreted
c. Execution is done one line at a time
d. Execution is triggered only when an action is found
15. With the instruction sc.textFile(“file:/data”) you are?
a. Reading a file from your hdfs file system
b. Reading a file called “file:/data”
c. Reading a file from your local non-Hadoop file system

20. What is the difference between Spark Streaming and Structured Streaming?
a) Structured Streaming is for structured streaming data processing and Spark Streaming is for unstructured streaming data processing
b) Spark Streaming is the new ASF library for Streaming Data and Structured Streaming the old one
c) Structured Streaming is a stream processing engine and Spark Streaming is an extension to the core Spark API to streaming data processing
d) Structured Streaming relies on micro batch and RDDs while Spark Streaming relies on DataFrames and Datasets

21. What is a Tumbling Window in Spark Streaming?
a. A fixed-sized, non-overlapping and contiguous window of data
b. An overlapping and contiguous window of data
c. A non-contiguous window of data
d. A dynamic size window of data

22. Spark ML library can be classified as:
a. A mature ML library with a very wide range of predictive and descriptive models to choose from
b. A strong Deep Learning library
c. A complete ML framework for data analysis
d. A ML library with a reasonable set of models but still with work in progress

23. What is the main difference between Spark MLlib and ML?
a. There is no difference apart from the bigger set of algorithms available on Spark ML
b. Spark ML works with Streaming Data
c. Spark MLlib is faster
d. Spark ML is based/works on/with Dataframes
Note: Spark MLlib carries the original API built on top of RDDs; Spark ML contains the higher-level API built on top of DataFrames for constructing ML pipelines.

24. In a Spark ML program, what is the purpose of the code below?
model.transform(mydata)
a. Create a machine learning model based on the data of ‘mydata’
b. Apply the model in ‘model’ to the data in ‘mydata’
c. Create a new model based on ‘mydata’
d. Adjust the model based on the data in ‘mydata’

25. In a Spark ML program, what is the purpose of the code below?
model.fit(mydata)
a. Train a machine learning model based on the data of ‘mydata’
b. Apply the model in ‘model’ to the data in ‘mydata’
c. Create a new model based on ‘mydata’
d. Adjust the model based on the data in ‘mydata’

26. What is the result of the Spark ML instruction below?
lr = LogisticRegression(maxIter=10)
a. A logistic regression object is declared with a maximum of 10 iterations
b. A logistic regression is executed with a maximum of 10 iterations
c. A logistic regression is trained with a maximum of 10 iterations
d. A logistic regression is estimated with a maximum of 10 iterations

27. The vertex DataFrame in a GraphFrame is:
a. A free form DataFrame
b. A DataFrame that must contain a column named 'id'
c. A DataFrame that must contain a column named 'src' and 'dst'
d. A DataFrame that must contain a column named 'id', 'src' and 'dst'

28. What is DSL used for in GraphFrames?
a. Formatting the output of a GraphFrame query
b. Declare a GraphFrame object
c. Search for patterns in a graph
d. Define properties in a GraphFrame
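For the tumbling window in question 21, a minimal Structured Streaming sketch of a fixed, non-overlapping 10-minute window might look like this (the built-in rate source and the column names are illustrative):

from pyspark.sql.functions import window

events = spark.readStream.format("rate").load().withWatermark("timestamp", "10 minutes")
# window() with only a window duration (no slide interval) gives tumbling, contiguous windows
counts = events.groupBy(window("timestamp", "10 minutes")).count()
query = counts.writeStream.outputMode("update").format("console").start()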
29. Based on the figure below (not reproduced in this extract), explain the shown algorithm line by line and the expected outcome with an example (of the outcome). Complementarily, explain what is supposed to be the input content for the algorithm and where it (the name of the object) is identified in the code.
The code is a map-reduce algorithm to perform a word count.
In the first line, the outcome is a list of words (one word per element). Each line is split on the delimiter “ ”, after applying the flatMap, which flattens the results. In the second line, the map function applies a lambda function to each word to create a tuple (word, 1). This way, the outcome is a key-value pair, with the word as the key and 1 as the value. In the third line, an aggregation is performed by key (word), the outcome being a list of the distinct words with the count of occurrences registered.
The input content of the algorithm is an RDD representing at least a line of text, which is referenced in the first line before the first lambda function.
Example: rdd = sc.parallelize([“This is the first line”, “This is the second line”, “This is the last line”])
With this rdd, the outcome for the first line (after applying the method collect()) would be:
['This', 'is', 'the', 'first', 'line', 'This', 'is', 'the', 'second', 'line', 'This', 'is', 'the', 'last', 'line']
With this rdd, the outcome for the second line (after applying the method collect()) would be:
[('This', 1), ('is', 1), ('the', 1), ('first', 1), ('line', 1), ('This', 1), ('is', 1), ('the', 1), ('second', 1), ('line', 1), ('This', 1), ('is', 1), ('the', 1), ('last', 1), ('line', 1)]
Finally, the outcome after the third line (after applying the method collect()) would be:
[('line', 3), ('This', 3), ('last', 1), ('the', 3), ('first', 1), ('is', 3), ('second', 1)]

30. What are the general steps of a Spark ML program? Give an example.
Regarding a Spark ML pipeline, the general steps are:
Data ingestion: in this step, the data is loaded.
Data preparation: after loading the data, an exploratory data analysis should be performed, followed by cleaning the dataset (incoherences, missing values). Then the data should be properly split into training and test datasets.
Model creation and training, with the training data.
Model analysis and testing.
Make predictions, by applying the trained model to the test dataset (model.transform(test_data)).
Model evaluation, where we should print the evaluation metrics.
Example for a multiple linear regression problem: we have insurance customer data and we want to predict the health insurance charges based on the clients' ages, BMI, smoking habits and number of children.
1 - First we need to create a dataframe with the customer data, which is in a list called records:
df = spark.createDataFrame(records, ["age","bmi","children","charges","smoker"])
2 - Convert the smoker column to a binary column using the StringIndexer:
smokerIndexer = StringIndexer(inputCol = "smoker", outputCol = "smoker_binary")
df2 = smokerIndexer.fit(df).transform(df)
df3 = df2.drop("smoker")
3 - Then we should vectorise the data to be used by the model, creating a features column:
stages = []
assembler = VectorAssembler(inputCols=["age", "bmi", "children", "smoker_binary"], outputCol="features")
stages += [assembler]
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df3)
dataset = pipelineModel.transform(df3)
4 - Then keep the relevant columns (label and features) and rename the column "charges" to "label":
df_final = dataset.select(["features", "charges"]).selectExpr("features as features", "charges as label")
5 - The next step is to split the data into training and test sets (example 80% train, 20% test):
(trainingData, testData) = df_final.randomSplit([0.8,0.2])
6 - Create a linear regression object:
lr = LinearRegression(maxIter = 10)
7 - Chain the Linear Regression in a Pipeline:
pipeline = Pipeline(stages = [lr])
8 - Train the Model:
model = pipeline.fit(trainingData)
9 - Make Predictions:
predictions = model.transform(testData)
10 - Evaluate the model:
eval = RegressionEvaluator(labelCol = "label", predictionCol = "prediction")
print("Coefficients: " + str(model.stages[0].coefficients))
print("R2:", eval.evaluate(predictions, {eval.metricName: "r2"}))

31. Explain the main differences between real time data processing and batch data processing.
Batch data processing deals with groups of transactions that have already been collected over a period of time. The goal of a batch processing system is to automatically execute periodic jobs in a batch. It is ideal for large volumes of data/transactions, as it increases efficiency in comparison with processing them individually. However, there can be a delay between the collection of the data and getting the result after the batch process, as it is normally a very time-consuming process. The latency in these systems is of the order of a few hours. An example of batch processing is the processing of salaries within a company every month.
On the other hand, real-time data processing deals with continuously flowing data (individual records or micro batches) in real time. Real-time processing systems need to be very responsive and active all the time, in order to supply an immediate response at every instant. In these systems, the information is always up to date. However, the complexity of this process is higher than in batch processing. The latency in these systems is of the order of a few seconds or milliseconds. Examples are a radar system, weather forecasting or temperature measurement, which normally involve several IoT sensors.

ABD_Apoio_exame_vf9

1. Which of the following is NOT a component of big data architecture?
a) Data storage
b) Data sources
c) Machine Learning
d) Anonymization

2. Only one of the following sentences is NOT true. Choose it.
a. Workspaces allow you to organize all the work that you are doing on Databricks
b. The objective of a data lake is to break data out of silos
c. Clusters are a single computer that you treat as a group of computers
d. The data lake stores data of any type: structured, unstructured, streaming

3. Which of the following is the most appropriate definition for “Jobs”?
a. Are packages or modules that provide additional functionality that you need to solve your business problems
b. Are structured data that you and your team will use for analysis
c. Are the tool by which you can schedule execution to occur either on an already existing cluster or a cluster of its own
d. Are third party integrations with the Databricks platform

4. Only one of the following options is NOT true. Choose it.
a. Notebooks in order to be able to execute commands
b. Dashboards can be created from notebooks as a way of displaying the output of cells without the code that generates them
c. Clusters allow you to execute code from apps
d. Workspaces allow you to organize all the work that you are doing on Databricks
5. Only one of the following sentences is correct. Choose it.
a. Clusters allow you to execute code from notebooks or libraries on a set of data
b. Dashboards can not be created from notebooks
c. Tables can not be stored on the cluster that you're currently using
d. Applications like Tableau are jobs

6. Only one of the following sentences is correct. Choose it. The command “%sh” allows you:
a. To display the files of the folder
b. To execute shell code in your notebook
c. To use dbutils filesystem commands
d. To include various types of documentation, including text, images and mathematical formulas and equations

7. Concerning the characteristics of “Lists”, only one of the following options is NOT true. Choose it.
a. Lists are collections of items where each item in the list has an assigned index value
b. Lists consist of values separated by commas
c. A list is mutable (meaning you can change its contents)
d. Lists are enclosed in square brackets [ ] and each item is separated by a comma

10. The example annex uses the lambda function. Are the syntaxes correct? (the annexed example is not reproduced in this extract)
a. Yes
b. No

11. Concerning the use of Hadoop, only one of the following sentences is correct. Choose it.
a. A Node is a group of computers working together
b. A Cluster is an individual computer in the cluster
c. A Daemon is a program running on a node
d. With Hadoop we can’t explore the nodes (name or data)

12. Which of the following is NOT a component of Hadoop data architecture?
a) HDFS
b) MapIncrease (note: the real Hadoop component is MapReduce)
c) Yarn
d) Spark
15. Only one of the following sentences is NOT true. Choose it. Avro data files:
a. Is a row-based storage format for Hadoop
b. It stores data in a non-binary format
c. It’s an efficient data serialization framework
d. Uses JSON for defining the data schema

16. Concerning Parquet files, only one of the following sentences is NOT true. Choose it.
a. Parquet is supported in Spark, MapReduce, Hive, Pig, Impala, and others
b. Parquet reduces performance
c. Parquet is a columnar format developed by Cloudera and Twitter
d. Parquet is most efficient when adding many records at once

17. Which of the following is NOT a Delta Lake key feature?
a. Closed Format
b. Scalable Metadata Handling
c. Unified Batch and Streaming Source and Sink
d. Schema Enforcement and Evolution

18. Choose the option with the correct command to copy the file foo.txt from local disk to the user’s directory in HDFS
a) $ hdfs dfs -put foo.txt foo.txt
b) $ hdfs dfs -ls foo.txt foo.txt
c) $ hdfs dfs -get foo.txt foo.txt
d) $ hdfs dfs -rm foo.txt foo.txt

19. Concerning Apache Flume, only one of the following sentences is NOT true. Choose it.
a. Apache Flume is a real time data ingestion tool
b. The use of Apache Flume is only restricted to log data aggregation
c. Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of streaming data from many different sources to a centralized data store

20. Can you explain why use Apache Storm?
Apache Storm is a free and open source distributed realtime computation system. Apache Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Apache Storm is simple, can be used with any programming language, and is a lot of fun to use!
Apache Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Apache Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

21. Concerning Apache Spark, only one of the following sentences is NOT true. Choose it.
a. Apache Spark is a unified analytics engine for large-scale data processing
b. Apache Spark is an open-source distributed computation framework for executing code in parallel across many different machines
c. Apache Spark is a slow, in-memory data processing engine without development APIs
d. Apache Spark is easy to use and lets you write applications quickly in Java, Scala, Python, R and SQL

22. Which one of the following sentences is NOT a key Spark characteristic?
a. Processing of unstructured, structured and streaming data
b. Distributed processing
c. Unified analytics and evolving platform
d. Works only with the Python language

23. Only one of the following sentences is NOT true. Choose it. Spark Core is:
a. Responsible for memory management and fault recovery
b. The base engine for large scale parallel data processing
c. Responsible for interacting with storage systems
d. Home to the API that defines Data warehouses
24. Define Spark’s main benefits in terms of performance and development productivity.
Spark’s main benefits in terms of performance: using in-memory computing, Spark is considerably faster than Hadoop MapReduce (100x in some tests). It can also be used for batch and real-time data processing. In terms of developer productivity: easy-to-use APIs for processing large datasets, including 100+ operators for transforming data.

25. Concerning the main abstractions provided by Spark to work with data, only one of the following sentences is correct. Choose it.
a. Data Frames are the main approach to work with unstructured data (Scala objects representing data)
b. RDDs are structured data objects in Spark (data organized into named columns - tables)
c. Datasets are extensions of the DataFrame API which provide a type-safe, object-oriented programming interface and an optimizer
d. A partition is a collection of columns that sit on one physical machine in our cluster

26. Only one of the following sentences is NOT true. Choose it.
a. Data Locality means moving data towards computation rather than moving computation close to data
b. Hadoop stores data in HDFS, which splits files into blocks and distributes them among various Data Nodes
c. With Hadoop, when a job is submitted, it is divided into map jobs and reduce jobs
d. RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel

27. Define the 3 ways to create an RDD in Spark.
Meaning: loading data into the cluster memory
From data in memory - sc.parallelize
From a file or set of files - sc.textFile, spark.read
From another RDD - RDD operations, ex: new_rdd = rdd.map(func)

28. Which one of the following sentences is NOT a key RDD characteristic?
a. Resilient
b. Distributed
c. Mutable
d. In-memory

29. Which one of the following functions is NOT a single RDD transformation?
a. sortBy
b. flatMap
c. reduce
d. distinct

30. Only one of the following sentences is NOT true. Choose it. Working with RDDs, the function “collect” means:
a. Return the first element of the dataset
b. Return an array with the first n elements of the dataset
c. Return all the elements of the dataset as an array at the driver program
d. Return the first n elements of the RDD

31. Concerning Spark terminology, only one of the following sentences is correct. Choose it.
a. Task is an individual unit of work sent to an executor
b. Application is a set of tasks executed as a result of an action
c. Job is a set of tasks in a job that can be executed in parallel
d. Stage is a set of jobs managed by a driver

32. Concerning RDDs persistence, only one of the following sentences is NOT true. Choose it.
a. Storage levels let you control partition replication
b. The persist method offers other options called persistence levels
c. Storage levels let you control storage location
d. By default the persist method stores data in memory only
33. Concerning RDDs persistence, describe how to choose the persistence level.
Memory only: when possible, for best performance; saving a serialized object in memory saves space.
Disk: when re-computation is more expensive than a disk read. Ex: expensive functions or filtering large datasets.
Replication: when re-computation is more expensive than memory.

34. Concerning RDDs checkpointing, only one of the following sentences is NOT true. Choose it.
a. Risk of stack overflow
b. Maintaining RDD lineage provides resilience but can cause problems when the lineage is very short
c. Checkpointing saves data to HDFS
d. Recovery can be expensive

35. Only one of the following sentences is NOT true. Choose it. In Spark:
a. Tasks are executed on a data node within a contained processing environment
b. Broadcast variables are used to give to every node a copy of input data
c. Read-write shared variables across tasks would be very efficient
d. Accumulators are used to give to the driver a consolidated value from the data nodes

36. Only one of the following sentences is NOT true. Choose it. Spark is especially useful when working with any combination of:
a. Iterative algorithms
b. Large amounts of data
c. Intensive computations
d. Plus privacy

38. What is the Spark algorithm used in the code of this picture? (figure not reproduced in this extract)
a. Word Count
b. PI estimation
c. K-means clustering
d. PageRank implementation

39. Concerning K-means clustering, only one of the following sentences is NOT true. Choose it.
a. K-means partitions the data into k clusters
b. There is only one version of k-means algorithms and implementations
c. K-means is a common iterative algorithm used in Machine Learning for cluster analysis
d. The objective of k-means is to group the observations into clusters where each observation belongs to the cluster with the nearest mean

40. Describe the 5 steps to implement the K-means clustering algorithm in Spark
1. Choose the seeds (k)
2. Each individual is associated with the nearest seed
3. Calculate the centroids of the formed clusters
4. Go back to step 2
5. End when the centroids cease to be re-centered

41. Which one of the following functions is NOT a Spark SQL component?
a. A SQL command line interface
b. The DataFrame API
c. Integration with HiveQL
d. The SQL Engine
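A compact sketch of those K-means steps with Spark ML (the points and the choice of k are illustrative):

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

points = spark.createDataFrame([(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (8.5, 9.0)], ["x", "y"])
data = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(points)
kmeans = KMeans(k=2, seed=1)          # step 1: choose the number of seeds k
model = kmeans.fit(data)              # steps 2-5: assign points, re-centre, repeat until stable
print(model.clusterCenters())         # final centroids
model.transform(data).show()          # each observation with its assigned cluster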
62. Using Spark ML programming, what results can we obtain with the code annex? (the annexed code is not reproduced in this extract)

64. Which one of the following is NOT a regression model available in Spark?
a. Multilayer perceptron classifier
b. Decision classifier
c. Gradient-boosted tree classifier
d. Random forest classifier

65. Concerning Machine Learning, only one of the following sentences is NOT true. Choose it.
a. ML requires a process with data loading, analysis and preparation
b. Spark provides two libraries for ML: “pyspark.ml” and “pyspark.mllib”
c. Spark includes the “dropna()” function for data preparation for ML
d. Spark includes “UDFs” function for data modelling

In Spark, lazy execution means that:
Select one: Execution is triggered only when an action is found

The vertex DataFrame in a GraphFrame is:
Select one: A DataFrame that must contain a column named 'id'

In Databricks notebooks you can?
Select one:

ABD_Exames passados_jan2021 (repeated questions removed)

3. What is an RDD?
Select one:
a. A Hadoop data format
b. A dataset in-memory
c. A dataset in-disk
d. A dataset in-disk and in-memory

8. What is the output object type that results from applying a map() function to an RDD that was created from a text file with the sc.textFile() method?
Select one:
a. String
b. Tuple
c. List
d. Dictionary
9. Select the right statement to create an RDD:
Select one:
a. myRDD = [“Alice”, “Carlos”, “Frank”, “Barbara”]
b. myRDD = sc.load(“Alice”, “Carlos”, “Frank”, “Barbara”)
c. myRDD = sc.parallelize([“Alice”, “Carlos”, “Frank”, “Barbara”])
d. myRDD = load(“Alice”, “Carlos”, “Frank”, “Barbara”)

16. Select the right instruction to create a Dataframe?
Select one:
a. Dataframe = spark.range(10)
Out[1]: DataFrame[id: bigint]
b. Dataframe = sc.textFile(“mydata”)
c. Dataframe = spark.dataFrame(“Mydata”)
d. Dataframe = sc.parallelize(“mydata”)

19. What is the difference between Spark Streaming and Structured Streaming?
Select one:
a. Structured Streaming is for structured streaming data processing and Spark Streaming is for unstructured streaming data processing
b. Spark Streaming is the new ASF library for Streaming Data and Structured Streaming the old one
c. Structured Streaming is a stream processing engine and Spark Streaming is an extension to the core Spark API to streaming data processing
d. Structured Streaming relies on micro batch and RDDs while Spark Streaming relies on DataFrames and Datasets

20. What are DStreams?
Select one:
a. Data abstractions provided from Spark core library
b. Data abstractions provided from Spark MLlib
c. Data abstractions provided from Spark Streaming
d. Data abstractions provided from Spark Structured Streaming

30 – Explain the differences between the map() and flatMap() transformations in Spark. Complementarily, show an example created by you (with code) and explain the output differences.
• Both map() and flatMap() are transformations that apply a function to the elements of an RDD and return a new RDD with the transformed elements.
• On the one hand, the map() transformation takes one element and produces one element (one-to-one transformation). On the other hand, flatMap() takes one element and produces zero, one or more elements (one-to-many transformation).
• In this example, let’s create an RDD which has a list of 2 lines of text:
o rdd = sc.parallelize(["First Word", "Second Word"])
• If we perform an upper case transformation using map(), the output will be a list with the two lines of text with all the words in uppercase:
o code: rdd.map(lambda line: line.upper()).collect()
o output: ['FIRST WORD', 'SECOND WORD']
• If we perform an upper case transformation using flatMap(), the output will be a list with all the characters in upper case, as the results are flattened:
o code: rdd.flatMap(lambda line: line.upper()).collect()
o output: ['F', 'I', 'R', 'S', 'T', ' ', 'W', 'O', 'R', 'D', 'S', 'E', 'C', 'O', 'N', 'D', ' ', 'W', 'O', 'R', 'D']

2nd Exam 2021
Question 1
Write a Databricks Notebook program to do the following tasks:
1. Display your Spark session version, master and AppName
3. Copy one of the files (you may suggest a non-existing name) to a new version with the name created by the prefix “New_”
1 – sc
2 – %fs ls /FileStore/tables
3 – dbutils.fs.cp("dbfs:/FileStore/tables/File.csv", "dbfs:/FileStore/tables/New_File.csv", True)

Question 2
Write a Spark program to do the following tasks:
1. Create a Python list of Temperatures in ºF as in: [50, 59.2, 59, 57.2, 53.5, 53.2, 55.4, 51.8, 53.6, 55.4, 54.7]
2. Create an RDD based on the list above
1 – List = [50, 59.2, 59, 57.2, 53.5, 53.2, 55.4, 51.8, 53.6, 55.4, 54.7]
2 – rdd = sc.parallelize(List)
rdd.collect()
3 – rdd.stats()
4 – rdd_Celsius = rdd.map(lambda T: (T-32)*5/9)
5 – rdd_Celsius.collect()

Question 3
Write a Spark program to do the following tasks:
1. Create a RDD with the following 7 lines of text: "First Line", "Now the 2nd", "This is the 3rd line", "This is not the 3rd line?", "This is the 5th ", "This is the 6th", "Last Line"
2. Create a RDD (based on the previous one, with Spark functions) with only the lines that start with the word "This"
1 –
text = ["First Line", "Now the 2nd", "This is the 3rd line", "This is not the 3rd line?", "This is the 5th ", "This is the 6th", "Last Line"]
rdd = sc.parallelize(text)
rdd.collect()
2 –
rdd_2 = rdd.filter(lambda line: line.startswith("This"))
rdd_2.collect()
3 –
rdd_3 = rdd.filter(lambda line: "not" not in line)
rdd_3.collect()
4 –
rdd_4 = rdd.map(lambda line: line.upper())
rdd_4.collect()
5 – rdd_4.take(2)

Question 4
Write a Spark program to do the following tasks:
1. Create an RDD with the data below (simulating customer acquisitions):
o 'Client1:p1,p2,p3'
o 'Client2:p1,p2,p3'
o 'Client3:p3,p4'
1 –
rdd = sc.parallelize(['Client1:p2,p3', 'Client2:p1,p2,p3', 'Client3:p3,p4'])
rdd.collect()
2 –
rdd2 = rdd.map(lambda line: line.split(":"))
rdd3 = rdd2.map(lambda fields: (fields[0],fields[1]))
rdd4 = rdd3.flatMapValues(lambda p: p.split(","))
rdd4.collect()

Question 5
Write a Spark algorithm to count the number of distinct words in the text lines below (or similar input you may write).
o "First Line"
o "Now the 2nd line"
o "This is the 3rd line"
o "This is not the final line?"
o "This is the 5th "
o "This is the 6th"
o "Last Line is the 7th"
The output must be a pair RDD with the distinct words and the corresponding number of occurrences.
text = ["First Line", "Now the 2nd line", "This is the 3rd line", "This is not the final line?", "This is the 5th", "This is the 6th", "Last Line is the 7th"]
rdd = sc.parallelize(text)
rdd2 = rdd.flatMap(lambda line: line.split()) \
  .map(lambda word: (word, 1)) \
  .reduceByKey(lambda w1,w2: w1+w2).collect()
Question 6
Write a Spark program to do the following tasks:
1. Create a RDD based on the list of the following 4 tuples: [('Mark',25),('Tom',22),('Mary',20),('Sofia',26)]
2. Create a Dataframe based on the previous RDD with the 2 columns named "Name" and "Age"
3. Display the new DataFrame
4. Create a new DataFrame based on the previous one, adding a new column named "AgePlus" with the content of Age multiplied by 1.2
1 – rdd = sc.parallelize([('Mark',25),('Tom',22),('Mary',20),('Sofia',26)])
2 – df = spark.createDataFrame(rdd).toDF("Name","Age")
3 – display(df)
4 – df2 = df.withColumn("AgePlus", df["Age"]*1.2)
5 – df2.write.format("delta").save("/FileStore/tables/df2")
6 – %fs ls /FileStore/tables/df2

Question 7
Explain the major differences between Spark SQL, Hive, and Impala. Give examples supporting your explanation for each of the 3 cases.
• Spark SQL is a distributed in-memory computation engine. It is a Spark module for structured data processing which is built on top of Spark Core.
o Spark SQL can handle several independent processes in a distributed manner across thousands of clusters that are distributed among several physical and virtual machines.
o It supports several other Spark modules being used in applications such as Stream Processing and Machine Learning.
• On the other hand, Hive is a data warehouse software for querying and managing large distributed datasets, built on top of the Hadoop File System (HDFS).
o Hive is designed for batch processing through the use of MapReduce programming.
• Finally, Impala is a massively parallel processing (MPP) engine developed by Cloudera.
o Contrary to Spark, it supports a multi-user environment while having all the qualities of Hadoop: it supports column storage, tree architecture, Apache HBase storage and HDFS.
o It has significantly higher query throughput than Spark SQL and Hive.
o However, in large analytical queries Spark SQL and Hive outperform Impala.
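A tiny sketch of the Spark SQL side of that comparison: register a DataFrame as a temporary view and query it with plain SQL (the data and view name are illustrative):

people = spark.createDataFrame([("Alice", 31), ("Bob", 29)], ["name", "age"])
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()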
1 - from graphframes import *
# Create the vertices
vertices = sqlContext.createDataFrame([
("a", "Alice", 31),
("b", "Esther", 35),
("c", "David", 34),
("d", "Bob", 29)], ["id", "name", "age"])
# Create the edges
edges = sqlContext.createDataFrame([
("a", "d", "married"),
("a", "b", "friend"),
("b", "c", "married"),
("b", "a", "friend"),
("c", "d", "friend")], ["src", "dst", "relationship"])
# Create the graph
g = GraphFrame(vertices, edges)
2–
display(g.vertices)
display(g.edges)
3–
friends = g.edges.filter("relationship = 'friend' ")
friends.show()

Question 9 - Write a Spark program to do the following tasks:
1. Create a DataFrame simulating insurance customer data with:
o The columns: ["age","bmi","children","charges","smoker"]
o 4 records with the following values: [ [19,27,0,168,"y"], [18,33,1,177,"n"], [28,35,2,191,"s"], [32,38,3,208,"n"] ]
2. Create a new DataFrame based on the previous one with the values in the smoker column encoded in 1/0 values
3. Display the new DataFrame without the original "smoker" column
1 –
records = [ [19,27,0,168,"y"], [18,33,1,177,"n"], [28,35,2,191,"y"], [32,38,3,208,"n"] ]
df = spark.createDataFrame(records, ["age","bmi","children","charges","smoker"])
2 –
from pyspark.ml.feature import StringIndexer
smokerIndexer = StringIndexer(inputCol = "smoker", outputCol = "smokerIndex")
df2 = smokerIndexer.fit(df).transform(df)
3 –
df_final = df2.drop("smoker")
display(df_final)

1. Write a Spark program to do the following tasks:
o Create a DataFrame simulating insurance customer data, with (or assume you already have a file with the data and read it from dbfs):
o The columns: ["age","bmi","children","charges","smoker"]
o 4 records with the following values: [ [19,27,0,168,"y"], [18,33,1,177,"n"], [28,35,2,191,"s"], [32,38,3,208,"n"] ]
2. Write a Spark ML program to perform a multiple linear regression in an attempt to predict the health insurance charges based on the client's age, bmi, number of children and his/her smoking habits
3. At the end of your program print also the coefficients, RMSE and the r2 of your model
1 –
records = [ [19,27,0,168,"y"], [18,33,1,177,"n"], [28,35,2,191,"y"], [32,38,3,208,"n"] ]
df = spark.createDataFrame(records, ["age","bmi","children","charges","smoker"])
2 –
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
# Convert the smoker column to a binary column
smokerIndexer = StringIndexer(inputCol = "smoker", outputCol = "smoker_binary")
df2 = smokerIndexer.fit(df).transform(df)
df3 = df2.drop("smoker")
# Vectorize the data
stages = []
assembler = VectorAssembler(inputCols=["age", "bmi", "children", "smoker_binary"], outputCol="features")
stages += [assembler]
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df3)
dataset = pipelineModel.transform(df3)
# Keep relevant columns and rename the column "charges" to "label"
df_final = dataset.select(["features", "charges"]).selectExpr("features as features", "charges as label")
# Split the data into training and test sets (80% train, 20% test)
(trainingData, testData) = df_final.randomSplit([0.8,0.2])
# Create a linear regression object
lr = LinearRegression(maxIter = 10, regParam = 0.3, elasticNetParam = 0.8)
# Chain Linear Regression in a Pipeline
pipeline = Pipeline(stages = [lr])
# Train the Model
model = pipeline.fit(trainingData)
# Make Predictions
predictions = model.transform(testData)
3 –
eval = RegressionEvaluator(labelCol = "label", predictionCol = "prediction")
print("Coefficients: " + str(model.stages[0].coefficients))
print("RMSE:", eval.evaluate(predictions, {eval.metricName: "rmse"}))
print("R2:", eval.evaluate(predictions, {eval.metricName: "r2"}))