
Analyzing Big Data #2

Databricks
#1
Scale Up vs Scale Out
• Up – there is a limit. Adding bigger, more powerful machines. SMP: Symmetric multiprocessing
• Out – Add more commodity data servers. Adding more, smaller and cheaper machines. MPP: Massive parallel processing

We’ll move with a strong focus and detail on Databricks because:
• It is a leading platform for Big Data processing
• It is Spark based and leads Spark adoption (alongside the ASF)
• It provides a free platform to learn and work with Spark
Delta Lake is an open source storage layer that
brings reliability to data lakes. Delta Lake
provides ACID transactions, scalable metadata
handling, and unifies streaming and batch data
processing. Delta Lake runs on top of your
existing data lake and is fully compatible with
Apache Spark APIs
Big Data vs Traditional DB technologies
Traditional DBMS Systems
• Rigid data models
• Weak fault-tolerance architecture
• Scalability constraints
• Expensive to scale
• Limitations on unstructured data
• Proprietary HW and SW

Big Data Technologies


• Schema free data models
• Fault tolerance architecture
• Highly scalable
• Lower costs
• Handles unstructured and streaming data
• Commodity hardware and open-source software

The Data Lake (the new kid on the block)
• Term coined by James Dixon – CTO Pentaho (~2010)
• Concept to contrast with the traditional “Data Mart”
• The objective is to break data out of silos
• Stores “non modeled” data, i.e., “raw data” with no formal governance and (business) structure
• Stores data of any type: structured, unstructured, streaming
• Tries to solve the data accessibility problems
• Data is used by the Business Users for experimentation, not for operations
• Similar terms: Data Sandbox, Analytical Lab

Databricks (just what you need)
Workspaces allow you to organize all the work that you are doing on Databricks. Like a folder structure on your computer, they allow you to save notebooks and libraries and share them with other users.
Notebooks are a collection of runnable cells (commands). Cells hold code in any of the following languages: Scala, Python, R, SQL, or Markdown. Notebooks need to be connected to a cluster in order to be able to execute commands.
Dashboards can be created from notebooks as a way of displaying the output of cells without the code that generates them.
Clusters are groups of computers that you treat as a single computer. In Databricks, this means that you can effectively treat 20 computers as you might treat one computer. Clusters allow you to execute code from notebooks or libraries on a set of data.

Using Notebooks
The primary language of each cell is shown in () next to the notebook name.

You can override the primary language by specifying the language magic command %<language> at the beginning of a cell. The supported magic commands are: %python, %r, %scala and %sql.

%sql select * from diamonds

Additionally:
• %sh: allows you to execute shell code in your notebook. Add the -e option in order to fail the cell (and subsequently a job or a run command) if the shell command has a non-zero exit status
• %fs: allows you to use dbutils filesystem commands
• %md: allows you to include various types of documentation, including text, images, and mathematical formulas and equations

Databricks utilities – dbutils.help()

Dictionaries
• A dictionary maps a set of objects (keys) to another set of objects (values)
• Dictionaries are mutable and unordered (the order that the keys are added doesn't necessarily reflect what order they may be reported back)
• Use {} curly brackets to construct the dictionary, and [] square brackets to index it. Separate the key and value with colons : and with commas , between each pair. Keys must be quoted

Tuples
• A tuple consists of a set of ordered values separated by commas
• Tuples are immutable
• Tuples are always enclosed in parentheses

Lists
• Lists are collections of items where each item in the list has an assigned index value
• A list is mutable (meaning you can change its contents)
• Lists are enclosed in square brackets [ ] and each item is separated by a comma

Lambda expressions
The lambda operator or lambda function is used for creating small, one-time and anonymous function objects. A lambda can have any number of arguments, but it can have only one expression. It cannot contain any statements, and it returns a function object which can be assigned to any variable.
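A minimal Python sketch (not from the slides) tying these constructs together – dictionary, tuple, list, lambda, and the map()/filter() functions described in the next section; all names and values are illustrative:

# Dictionary: maps keys to values; indexed with [] and quoted keys
prices = {"diamond": 5000, "ruby": 1200, "emerald": 800}
prices["ruby"] = 1300                      # dictionaries are mutable

# Tuple: ordered and immutable, enclosed in parentheses
point = (41.15, -8.61)

# List: mutable, indexed collection in square brackets
gems = ["diamond", "ruby", "emerald"]
gems.append("sapphire")

# Lambda: anonymous single-expression function assigned to a variable
double = lambda x: x * 2
print(double(21))                          # 42

# map() applies a function to every element; filter() keeps only the
# elements for which the function returns True
values = [1, 2, 3, 4, 5]
squares = list(map(lambda x: x ** 2, values))       # [1, 4, 9, 16, 25]
evens = list(filter(lambda x: x % 2 == 0, values))  # [2, 4]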

map and filter Functions

map(function, iterable)
The map() function applies a given function to each item of an iterable (list, tuple etc.) and returns a list of the results.

filter(function, iterable)
The filter() function constructs an iterator from the elements of an iterable for which a function returns true. It filters the given iterable with the help of a function that tests each element in the iterable to be true or not.

#3
Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
• It’s a framework
• Distributed storage (HDFS) and computing (MapReduce, Hive etc.)
• Reliable (designed for failure)
• Designed for commodity hardware
• Optimized for batch processing
• Open source
• Low cost

Assumptions and goals
• Hardware Failure
Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
• Streaming Data Access
HDFS is designed more for batch processing rather than interactive use; the emphasis is on high throughput of data access rather than low latency
• Large Data Sets
HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster
• Simple Coherency Model
A file once created, written, and closed need not be changed except for appends and truncates. This assumption simplifies data coherency issues and enables high throughput data access
• “Moving Computation is Cheaper than Moving Data”
A computation requested is much more efficient if it is executed near the data it operates on. It’s often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running
• Portability Across Heterogeneous HW and SW Platforms
HDFS has been designed to be easily portable from one platform to another

Naming conventions
• A Cluster is a group of computers working together - Provides data storage, data processing and resource management
• A Node is an individual computer in the cluster - Master nodes manage distribution of work and data to worker nodes
• A Daemon is a program running on a node - Each Hadoop daemon (runs on a JVM) performs a specific function in the cluster

Big Data architecture

File System
• The file system is responsible for storing data on the cluster
• HDFS is a file system written in Java - Based on Google GFS
• Sits on top of a native file system. Ex: FAT, NTFS, ext3, ext4, xfs
• Provides redundant storage - For massive amounts of data
• Each node processes data stored on that node (when possible)
• Performs best with large files
• Files are “write once”
• The file system is optimized for reading large streams of files (rather than random files)
• Data is split into blocks and distributed across multiple nodes. Each block is typically 64MB or 128MB (it is configurable)
• In HDFS each block is replicated multiple times (standard is 3 times)
• Name Node stores metadata

Resource Manager (YARN)
• YARN - Yet Another Resource Negotiator
o Introduced in Hadoop 2 to improve the MapReduce implementation
• YARN is Hadoop’s Resource Manager
o Resource Manager
o Job scheduler
• YARN allows multiple data processing engines to run on a single cluster
o Batch processing (Spark, MapReduce)
o Interactive SQL (Hive, Impala)
o Advanced Analytics (Spark ML, Mahout)
o Streaming (Storm, Spark Streaming)

Resource Manager (YARN daemons)
• Resource Manager
o Runs on a master node
o Global resource scheduler
o Arbitrates system resources between competing applications
• Node Manager
o Runs on slave nodes
o Communicates with RM

Processing with MapReduce (the starting point)
• MapReduce is a programming model for processing data in a distributed way on many machines (nodes)
• It’s a “batch query processor”
• Although the Hadoop framework is implemented in Java, MapReduce applications need not be written in Java
• The concept behind MapReduce is to Map the original data into a collection of key-value pairs and then Reduce the output with the same key
• Consists of two main step functions:
o Map (plus shuffle & sort)
o Reduce
• The Mapper
o Map tasks run on the node where the file (block) is stored
o Each Map task operates on a single HDFS node/file block
• Shuffle & Sort
o Transfers the mappers’ output to the reducers as inputs
• The Reducer
o Operates on the shuffled/sorted intermediate data
o Produces the final product

MapReduce drawbacks
• Complex code maintenance (originally developed in Java)
• No shell
• Disk based: data stored (/processed) on disk
• Slow data access (many disk reads)
• No memory operations (caching)
• No abstract concepts for data manipulation and processing
• Not optimized for iterative algorithms
• Limited to batch processing
• No additional libraries: ML, Graph, SQL, Streaming etc.

Spark – Next generation MapReduce

Data formats
Text files
• Text files are the most basic file type in Hadoop
o Can be read or written from virtually any programming language
o Comma and tab delimited files are compatible with many applications
• Text files are human readable since everything is a string
o Useful when debugging
• However, the format is inefficient at scale
o Representing numeric values as strings wastes storage space
o Difficult to represent binary data such as images
o Conversion to/from native types adds a performance penalty

Avro data files - data serialization
• To understand Avro, you must first understand serialization
• It’s the process of converting data into a sequence of bytes/bits
o A way of representing data in memory as a series of bytes
o Allows you to save data to disk or send it across the network
o Deserialization allows you to read the data back into memory
• Many programming languages and libraries support serialization
o Ex: Serializable in Java or pickle in Python
• Backwards compatibility and cross-language support can be challenging
o Avro was developed to address these challenges

Avro data files
• Avro is a row-based storage format
• It stores data in a binary format using serialization with a compact binary encoding
• It’s an efficient data serialization framework
o Top-level Apache project (created by Doug Cutting)
o Widely supported throughout Hadoop and its ecosystem
• Data is stored according to specific rules → the schema (type, order, format)
o The schema describes the fields and their data types
o The schema is created using JSON format (Avro uses JSON for defining the data schema)
• Schema metadata is embedded in the file
• Offers compatibility without sacrificing performance
o Data is serialized according to a schema you define
o Serializes data using a highly-optimized binary encoding
o Specifies rules for evolving your schemas over time
• Efficient storage due to the optimized binary encoding
• Widely supported throughout the Hadoop ecosystem
o Can also be used outside of Hadoop
• Ideal for long-term storage of important data
o Can read and write from many languages
o Embeds the schema in the file, so it will always be readable
o Schema evolution can accommodate changes

Columnar formats
• These organize data storage by column, rather than by row
• Very efficient when selecting only a small subset of a table’s columns
• Ex: RCFile, ORC (Optimized Row Columnar), Parquet (default in Spark)

Parquet files
• Parquet is a columnar format developed by Cloudera and Twitter
o Supported in Spark, MapReduce, Hive, Pig, Impala, and others
o Schema metadata is embedded in the file (like Avro)
• Uses advanced optimizations (from Google’s Dremel project)
o Reduces storage space
o Increases performance
• Most efficient when adding many records at once
o Some optimizations rely on identifying repeated patterns

Delta Lake
• Is an open-source storage layer based on Parquet
• It was developed by Databricks and donated to the Linux Foundation
• Adds reliability, quality and performance to data lakes
• Includes a transaction log on top of Parquet files
• Provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing
• Brings together the best of data warehousing and data lakes
• Components:
o Delta engine
o Delta tables
o Delta format and storage layer
• Key Features:
o ACID transactions
o Scalable metadata
o Time travel
o Open source
o Unified Batch/Streaming
o Schema Evolution / Enforcement
o Audit History
o DML Operations

Flume (introduced in the Streaming Data section below)
• The use of Apache Flume is not restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible
• Apache Flume is a top-level project at the Apache Software Foundation

SUMMARY
• HDFS/DBFS is the storage layer for Big Data processing
o Chunks data into blocks and distributes them across the cluster when data is stored
o Slave nodes run DataNode daemons, managed by a single NameNode on a master node
o Access storage via UI or APIs
• The resource manager schedules jobs and manages resources
o YARN works with the File System to run tasks where the data is stored
o Slave nodes run NodeManager daemons, managed by a ResourceManager on a master node
o Monitor jobs using the UI
• Data Formats are available with different purposes/advantages
o Text formats
o Row based and binary/serialized formats
o Columnar formats

Streaming data
Streaming data is data that is continuously generated by different sources. Such data should be processed incrementally using Stream Processing techniques without having access to all of the data. In addition, it should be considered that concept drift may happen in the data, which means that the properties of the stream may change over time. It is usually used in the context of big data, in which it is generated by many different sources at high speed.

Flume
• Apache Flume is a real time data ingestion tool
• It’s a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of streaming data from many different sources to a centralized data store

#4
Apache Spark
Was designed to be a modern distributed computation framework as a response to the limitations of MapReduce.
Databricks have written over 75% of the code in Apache Spark and have contributed more than 10 times more code than any other organization.
Databricks makes big data simple by providing Apache Spark as a hosted solution.
• Apache Spark is an open-source, distributed computation framework for parallel data processing across many machines (clusters)
• Complementary to data abstractions and interfaces, it manages the clusters of computers and ensures production-level stability (not simple at all)
• It’s a fast, in-memory, parallel data processing engine with development APIs to allow data workers to execute batch, streaming, SQL and machine learning workloads
• Unified engine for large scale data analytics

Spark’s philosophy (Databricks) and what Spark is not
1. Unified: Spark’s key driving goal is to offer a unified platform for writing big data applications
2. Computing Engine: At the same time that Spark strives for unification, Spark limits its scope to a computing engine
3. Libraries: Spark’s final component is its libraries, which build on its design as a unified engine to provide a unified API for common data analysis tasks

Apache Spark main characteristics
In-memory distributed processing framework written in Scala
• Functional programming language that runs on a JVM
Support for many data abstraction concepts
• For data storage (RDD, DataFrame, Dataset etc.)
• For data processing (map, join etc.)
API based
• Requires an API to run Spark code
• APIs: Scala, Java, Python, SQL and R
Unified analytics evolving platform
• Spark core for managing the cluster and basic data objects
• Specific libraries for additional functionalities (Streaming, Graphs, ML etc.)

Spark Core
• Is the base engine for large scale parallel data processing
• It’s responsible for:
o Memory management and fault recovery
o Scheduling, distributing and monitoring jobs on a Cluster
o Interacting with the storage systems
• It’s also home to the API that defines RDDs, DataFrames, Datasets etc.

Spark Use Cases
• Data integration, ETL and ELT: pipelines for data integration to refresh Decision Support Systems (DSS/DW)
• Interactive analytics or business intelligence: gaining insight from massive data sets to inform product or business decisions in ad hoc investigations or regularly planned dashboards
• High performance computation: reducing the time to run complex algorithms against large scale data
• Machine learning and advanced analytics: applying sophisticated algorithms to predict outcomes, detect fraud, infer hidden information or make decisions based on input data
• Real-time data processing: capturing and processing data continuously with low latency and high reliability

Spark applications
Spark applications consist of:
• A driver process (running on a node)
o Maintains info about the Spark App and defines the SparkSession
o Executes program input and output operations
o Articulates resource allocation with the resource manager
o Analyses, distributes and schedules work on the executors
• Executor processes
o Execute code given from the driver
o Report on the state of the execution to the driver

The Spark Context and Spark Session
• The driver process is instantiated as an object called the (SparkContext and) SparkSession
• It represents the connection to the existing Spark cluster and provides the entry point for interacting with Spark
• The SparkContext is the 1st version and the SparkSession was introduced in Spark 2.0 as the new entry point that subsumes SparkContext, SQLContext, StreamingContext, and HiveContext (for backward compatibility, they are preserved)
• In Databricks notebooks and the Spark REPL, the SparkSession is created for you and stored in a variable called spark

Partitions (see Partitions and RDDs below)
• Each partition is created in a data node / core
• An action triggers execution of tasks in parallel
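Outside Databricks (where spark already exists), the SparkSession is typically built explicitly. A minimal sketch, assuming a local PySpark installation; the application name is illustrative:

from pyspark.sql import SparkSession

# Build (or reuse) the session; the SparkContext is available as spark.sparkContext
spark = (SparkSession.builder
         .appName("analyzing-big-data")   # illustrative name
         .getOrCreate())
sc = spark.sparkContext

print(spark.version)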
What is data locality?
• Spark is specifically designed to solve Big Data problems; it must deal with large amounts of data. As it is not feasible to move large data sets towards the computation, Spark provides this feature called data locality
• Data locality means moving computation (mapper code) close to data rather than moving data towards the computation
• The filesystem splits files into blocks and distributes them among various Data Nodes. When a job is submitted, it is divided into map jobs and reduce jobs. A map job is assigned to a Data Node according to the availability of the data, i.e., it assigns the task to a Data Node which is closer to or stores the data on its local disk
• Data locality refers to the process of placing computation near to data, which helps in high throughput and faster execution

Spark terminology
• Application: set of jobs managed by a driver
• Job: a set of tasks executed as a result of an action
• Stage: a set of tasks in a job (one per partition) that can be executed in parallel
• Task: an individual unit of work sent to an executor

Spark provides 3 main abstractions to work with data
• RDDs: logical distributed abstraction of your data. Immutable collection of data elements partitioned across the cluster. Scala/Java/Python objects representing data. The main approach to work with unstructured data
• DataFrames: structured data objects (schema RDD, row). Data organized into named columns - tables
• Datasets: extensions of the DataFrame API which provide type-safe object-oriented programming

RDDs
• RDDs are Spark’s main data abstraction concept (unit of data)
• RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel
• RDDs are immutable
• Spark Core provides many APIs for building and manipulating these collections
➔ Resilient: fault-tolerant with the help of the RDD lineage graph and so able to recompute missing or damaged partitions due to node failures
➔ Distributed: data residing on multiple nodes in a cluster
➔ Dataset: a collection of partitioned data with primitive values or values of values, e.g., tuples, lists or other objects
• RDDs can hold any type of data element
o Primitive types: integers, characters, Booleans etc.
o Sequence types: strings, lists, arrays, tuples, dicts etc.
o Also support nested data types and mixed types
• Some types of RDDs have additional functionality
o Pair RDDs → RDDs consisting of key-value pairs
o Double RDDs → RDDs consisting of numeric data

Partitions and RDDs
• In order to allow every executor to perform work in parallel, Spark breaks up the data into chunks, called partitions
• A partition is an atomic chunk of data. It’s a collection of rows that sit on one physical machine in the cluster
• By default, Spark tries to create as many partitions as cores in the cluster (or as many as the ones already saved on the filesystem)
• Partitions are basic units of parallelism in Apache Spark
• RDDs are a collection of partitions

Partitions and data locality
• By default, Spark partitions file-based RDDs by blocks

How to create an RDD? (i.e. loading data into the cluster memory)
• From data in memory → sc.parallelize
• From a file or set of files → sc.textFile
• From another RDD with operations → new_rdd = rdd.map(func)

Spark operations
RDDs (and DFs) support two types of operations:
• Transformations: define a new RDD based on the current one. Ex: map(), filter()
• Actions: return values. Ex: collect(), count()
Transformations are the ones that produce new RDDs/Datasets; Actions are the ones that trigger computation and return results.

A Transformation is a function that produces a new RDD / DF:
• textFile(file)
• map(func)
• flatMap(func)
• filter(func)
• sortBy(func)
• union(rdd)
• intersection(rdd)
• distinct()
• join(key value pair rdd)
• groupByKey()
• reduceByKey(func)

An Action is a function that produces a value:
• count()
• collect()
• take(n)
• reduce()
• min()
• max()
• stats()
• foreach()
• countByKey()
• countByValue()
• saveAsTextFile(file)

Use transformations for:
• Transforming data in sequence to modify it as needed
• Creating a new RDD from an existing one. Ex: i) map(function) ii) filter(function)
• RDDs are immutable, data in an RDD is never changed
• Transformations are the core of the program business logic
• It’s how you define the changes you want to make in your data to produce the desired outcome
• Transformations are executed in parallel on each partition
• When possible, tasks are executed on the worker nodes where data is in memory

Use actions for:
• Viewing data in the console: i) collect(), take(n), count()
• Collecting data from native objects in the respective format: i) parallelize(collection)
• Writing data to output data sources: i) saveAsTextFile(file)

There are 2 types of transformations:
• Narrow transformations (preserve partitioning)
o Input partitions will contribute to only one output partition
o Are the result of map(), filter()
• Wide transformations (shuffles)
o Input partitions will contribute to more than one output partition
o Are the result of groupByKey() and reduceByKey()

Reading and writing text files (transformation and action)
• Spark can create distributed datasets from any storage source
• Local file system, HDFS/DBFS, Cassandra, HBase, Cloud provider, etc.
• Use the sc.textFile method to create the dataset

Reading a text file (is a transformation):
• sc.textFile(file)
o Reads a text file(s) and creates an RDD of lines (list)
o Each line in the file(s) is a separate RDD element
o Works only with line-delimited text files
o Accepts a single file, wildcard list or a comma separated list of files
• sc.textFile(“file.txt”) → One file
• sc.textFile(“data/*.log”) → Wildcards
• sc.textFile(“file1.txt,file2.txt”) → List of files

Writing a text file (is an action):
• rdd.saveAsTextFile(“path”)
outcome Operations in Spark are lazy
• Lazy means that they do not compute their
results right away. Instead, they just remember
the transformations applied to some base dataset
(e.g., a file)
• The transformations are only computed when
an action requires a result to be returned to the
driver program
• This provides immense benefits to the end user
because Spark can optimize the entire data flow
from end to end

RDD / DataFrame lineage is a graph of all the parent RDDs of an RDD.
It is built as a result of applying transformations to the RDD and creates a logical execution plan.
• mydata.toDebugString() - (Verbose output)
• mydataframe.explain() - Only on DataFrames

SUMMARY
• Spark is a distributed computation framework
• Can be used interactively via the Spark shell
(Python or Scala)
• Spark has 3 main data abstraction concepts:
RDD, DataFrames & Datasets
• RDD are the basic data unit in Spark and for
unstructured data. DataFrames are data
organized in columns for structured data
• Spark has 2 types of operations on data objects:
i) Transformations ii) Actions
• Lazy execution → Transformations are not
executed until required by an action
• Spark uses functional programming i) Passing
functions as parameters ii) Anonymous functions
• Spark terminology
o Application: set of jobs managed by a
driver
o Job: a set of tasks executed as a result of
an action
o Stage: a set of tasks in a job (one per
partition)
o Task: an individual unit of work sent to an
executor
• Spark keeps track of each RDD’s lineage → To provide fault tolerance
• RDD operations are executed on partitions in
parallel
o More parallelism means more speed
o The level of parallelism can be controlled
o Operations that depend on multiple
partitions are executed in separate stages
(ex: join, reduceByKey)
#5
Summary statistics
map() and filter() with text strings
map() vs flatMap()
Note: the input function to map returns a single element, while flatMap returns a list of elements. The output of flatMap is flattened.
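A minimal sketch (not from the slides) contrasting map() and flatMap() on an RDD of text lines, ending with the classic word count pattern discussed later in this chapter; the data is illustrative:

lines = sc.parallelize(["to be or not", "to be"])

# map: one output element per input element (a list per line)
print(lines.map(lambda line: line.split()).collect())
# [['to', 'be', 'or', 'not'], ['to', 'be']]

# flatMap: the per-line lists are flattened into a single RDD of words
words = lines.flatMap(lambda line: line.split())
print(words.collect())
# ['to', 'be', 'or', 'not', 'to', 'be']

# Word count with a pair RDD: map to (word, 1) and reduce by key
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())
# e.g. [('to', 2), ('be', 2), ('or', 1), ('not', 1)]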

RDDs persistence (or caching)


• Spark doesn’t guarantee RDD data will be kept
in memory
• However, having data in-memory can be very
useful
Access speed
No need for reprocessing
Very useful for iterative algorithms
• Use the rdd.persist() or rdd.cache() to keep
data in memory
Persist offers options for Storage Levels o Number of unique words could exceed
o Storage location available memory
o Format in memory • Statistics are often simple aggregate functions
o Partition replication o Distributive in nature
• Stop persisting and remove data in memory o Ex: max, min, sum, count
(and disk) • Map-Reduce breaks complex tasks down into
o rdd.unpersist() smaller elements which can be executed in
• In-memory persistence is a suggestion to Spark parallel
o If not enough memory is available, • Many common tasks are very similar to word
persisted partitions will be cleared count • Ex: log files analysis
o Transformations will be re-executed
using the lineage when necessary Pi estimation (using Monte Carlo method)
• Compute intensive task
Checkpointing o Data generation and many computations
• Maintaining RDD lineage provides resilience but o Distributed computing
can cause problems when the lineage is very long • Iterative algorithms
• Recovery can be expensive o In-memory processing
• Risk of stack overflow o The more iterations, the better the
• Checkpointing saves data to HDFS → Provides answer will be
fault-tolerance across nodes • Many scientific algorithms use similar
• Lineage is not saved approaches
o Generating data and iterative
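A small sketch (not from the slides) of the persistence and checkpointing calls described above; the storage level, input path and checkpoint directory are illustrative:

from pyspark import StorageLevel

rdd = sc.textFile("data/events.txt")            # hypothetical input
rdd.persist(StorageLevel.MEMORY_AND_DISK)       # or simply rdd.cache()

rdd.count()      # first action materializes and caches the RDD
rdd.count()      # served from memory, no recomputation

rdd.unpersist()  # stop persisting and free the memory/disk copies

# Checkpointing: truncates the lineage by saving data to reliable storage
sc.setCheckpointDir("/tmp/checkpoints")          # illustrative directory
rdd.checkpoint()
rdd.count()      # the action triggers the checkpoint write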
Shared variables computations
• In Spark tasks are executed on a data node
within a contained processing environment (JVM)
• Having general, read-write shared variables
across tasks would be inefficient SUMMARY
• To support some level of driver and node Working with RDDs
communication, Spark provides two shared • RDDs are the basic data concept (unstructured)
variables: • Broadcast variables • Accumulators in Spark
• RDDs are partitioned and process in parallel and
Broadcast variables in memory
• Used to give to every node a copy of input data • RDDs can have any type of data element
• Broadcast variables are created by calling: • Operations are the elements to work with RDDs
sc.broadcast(broadcast_var) o Actions
• Broadcast variables can be access by calling o Transformations
(method value): broadcast_var.value • Pair RDDs are a special form of RDD consisting
of Key-value pairs
Accumulators • Pair RDD operations are available for different
• Used to give to the driver a consolidated value purposes
from the data nodes • Reading files: textFile()
• Accumulator variables are created by calling: • Writing files: saveAsTextFile()
acc1 = sc.accumulator(initial_value) Use persist if an RDD will be used multiple times
• Accumulator variables can be changed by: • (avoid re-computation)
Calling the add method • Using the += operator Use shared variable to communicate with the
cluster and driver
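A brief sketch (not from the slides) of the two shared variables just described – a broadcast lookup table and an accumulator counter; the data is illustrative:

# Broadcast: read-only copy of a small lookup table sent to every node
country_names = sc.broadcast({"PT": "Portugal", "ES": "Spain"})

# Accumulator: workers add to it, only the driver reads the total
bad_records = sc.accumulator(0)

def to_name(code):
    if code not in country_names.value:
        bad_records.add(1)      # workers can only add, not read
        return "unknown"
    return country_names.value[code]

codes = sc.parallelize(["PT", "ES", "XX", "PT"])
print(codes.map(to_name).collect())  # ['Portugal', 'Spain', 'unknown', 'Portugal']
print(bad_records.value)             # 1 (read back on the driver)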
Why do we care about counting words • Broadcast: passes variables to each node
• Word count is challenging over massive • Accumulators: accumulates values from the
amounts of data nodes to the driver
o Using a single compute node would be Spark is especially suited for big data problems
too time-consuming that require iteration and intensive computations
• Parallel processing and in-memory persistence • DataFrames are built on base RDDs containing
makes this very efficient Row objects (each representing a record with
MapReduce is a generic programming model for named columns).
distributed processing • A distributed collection of data elements
• Spark implements MapReduce with Pair RDDs organized into named columns (meaning: data
• Spark provides many Maps and Reducers and metadata) - Row’s, like Pandas in Python
Common Spark algorithms are: • Conceptually equivalent to a table in a
• Word count, Pi estimation relational database or a dataframe in R/Python,
• PageRank, k-means but with richer optimizations
Spark includes examples and specialized libraries
to implement many common algorithms DataFrames can be created:
• GraphX and GraphFrames • From data in-memory
• Mllib and Spark ML • From an existing RDD
• From an existing structured data source
#6 (Parquet file, JSON file, ODBC/JDBC etc.)
Apache Spark SQL • By performing an operation or query on
• Spark module for structured data processing another DataFrame
- “Table” like data objects, SQL engine and syntax • By programmatically defining a schema
• Built on top of core Spark • From a Table (in Spark catalog)
• Works with the concept of DataFrame
- Data organized in columns DF = spark.range(5) → Range is a method to
- Data and Metadata create a DF
DF.show() → Show is the standard method to
Spark SQL components display a DF
• The DataFrame API DF. → Hit Tab to see the DataFrame available
o Library for working with data as tables method
o Defines DataFrames containing Rows and
Columns DF.collect() → Show the DF as a RDD
• The SQL Engine (Interpreter and Optimizer) Display(DF) → Use Display to show a DataFrame
- Catalyst query optimizer and Tungsten memory Display(DF) → Use Display to show a DataFrame
management
• A SQL command line interface
• Integration with Apache Hive (HiveQL) and
ODBC/JDBC drivers

The context / environments


The main Spark SQL entry point is a SQL Context
object sqlContext.
• Is similar to Spark Context in core Spark
• Requires a SparkContext -sc
The sqlContext makes the DataFrame
functionality available while the sparkContext
focuses on the Apache Spark engine itself
Creating a DF reading a json file
After Apache Spark 2.X there is one new context -
spark.read.json(“/…)
the SparkSession (named as spark by default)
Creating a DF reading a csv file
with the SQL Context embedded
spark.read.csv(“/…)
Creating a DF reading a text file
DataFrames
spark.read.text(“/…)
• DataFrames are the main abstraction in Spark
SQL (like RDDs in core Spark), representing a two-
Generic Load and Save Functions
dimensional labeled data structure
To read/write data, Spark relies on generic - Database: global_temp
load/save functions - Query: select * from
• df = spark.read.load(“path/file.parquet”) global_temp.temp_view_name
• df = spark.read.load(“path/file.json”, • You won't find it in the tables button list on the
format=“json”) left because it's not globally registered on the
• df.select(“col1”, (Hive) metastore
“col2”).write.save(“file.parquet") • If registered as GlobalTempView other users
• df.select(“col1”, and notebooks that attach to this cluster can
“col2”).write.save(“file.json“, access this table
format=“json”) • If we restart our Application, our table will have
• df = spark.read.load(“path/file.json”, to be reregistered
format=“Json”)
• df.select(“col1”, Saving to persisting Tables
“col2”).write.save(“file.parquet“, • DataFrames are objects from your spark session
format=“parquet”) • DataFrames can be saved as persistent tables
• Data in DataFrames can be saved to a data into metastore with the saveAsTable method –
source (json, parquet, delta, jdbc, orc, libsvm, csv, df.write.saveAsTable(“table”)
text) • saveAsTable will materialize the contents of the
• Spark SQL save functions are not atomic, DataFrame and create a pointer to the data in the
transactional, nor do they use any locking metastore (unlike the createTempView method)
• The generic save method has as optional “save • An existing Hive environment is not necessary.
mode” parameter which determines what Spark will create a default Hive metastore if
happens if you attempt to save a file/table that necessary
already exists • Persistent tables will still exist even after your
Spark program has restarted
• You can create a DataFrame from a Table with
df = spark.table() or df = spark.sql(“select
SaveMode options statement”)
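The SaveMode options table did not survive extraction; as a reference, a hedged PySpark sketch of the standard modes (error/errorifexists, append, overwrite, ignore) applied through the generic save path – file paths are illustrative:

df = spark.read.load("path/file.parquet")              # illustrative input

# Default mode: "error" / "errorifexists" – fail if the target already exists
df.write.save("out/table.parquet")

# "append": add the new rows to the existing data
df.write.mode("append").save("out/table.parquet")

# "overwrite": replace whatever is already there
df.write.mode("overwrite").save("out/table.parquet")

# "ignore": silently do nothing if the target already exists
df.write.mode("ignore").save("out/table.parquet")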
Managed and Unmanaged Tables
• Managed Tables: Spark manages both the
metadata and the data in the file store
spark.sql(“ CREATE TABLE table_name (col1 Type,
col2 type, …)“) or
df.write.saveAsTable(“table_name”)

• Unmanaged Tables: Spark only manages the


Register a DataFrame as a View or Table metadata. You manage the data
Useful for sharing the data or declarative SQL spark.sql(””” CREATE TABLE table_name (col1
• df.createTempView(“name”) - session Type, col2 type, …)
scope USING format OPTION (PATH ‘path’ ”””)
• df.createOrReplaceTempView(“name”) or
• df.createGlobalTempView(“name”) - df.write.option(“path” ,
application scope “path”).saveAsTable(“table_name”)
• df.createOrReplaceGlobalTempView(“na
me”) global_temp.view Running SQL queries programmatically
• df.write.saveAsTable(“name”) • Spark SQL also supports free-form SQL
• spark.catalog.dropTempView("name") • The DataFrame must be registered as View or
• spark.catalog.dropGlobalTempView("nam Table
e")
• Registering a DataFrame as View saves the
metadata on the local cluster
Note: triple quote syntax (”””) is an easy way of diamond.select(“color”,”clarity”).where(“color =
creating a string that contains special characters ‘E’”).show(5) The where() clause is equivalent to
(like quotes) without having to escape them. Ex: filter()15
spark.sql(””” select … from … ”””)
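A short sketch (not from the slides) combining the view registration and free-form SQL described above; the input path and view name are illustrative:

diamonds = spark.read.csv("path/diamonds.csv",          # illustrative path
                          header=True, inferSchema=True)

# Register the DataFrame so it can be queried with SQL (session scope)
diamonds.createOrReplaceTempView("diamonds")

result = spark.sql("""
    SELECT color, AVG(price) AS avg_price
    FROM diamonds
    GROUP BY color
    ORDER BY avg_price DESC
""")
result.show(5)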
DataFrame Actions (examples)
Basic Operations with DataFrames (metadata) • collect
• schema - returns a schema object describing • take
the data • count
• printSchema - displays the schema as a visual • show
tree
• cache/persist - persists the DataFrame to disk Sort and orderBy
or memory sort orders at partition level, orderBy orders at
• columns - returns as array containing the DF level (less efficient)
names of the columns diamond.sort(‘table’).show()
• dtypes - returns an array of (columns-name,
type) pairs GroupBy
• explain - prints debug info about the DataFrame display(diamond.select(“color”,”price”).groupby(
• alias - renames column names “color”).agg(avg(“price”)))

Operations with DataFrames join


• Queries – create a new DataFrame
o DataFrames are immutable
o Queries are analogous to RDD
transformations
o Queries can be chained like
transformations

• Actions – return data to the drives


o Actions trigger “lazy” execution of
queries Spark User Defined Functions – UDF
• Functions are very useful to operate on data
DataFrame Queries (examples) • You must register a function in order to use it in
• join - joins DataFrames a query (meaning, on the data on your data
• union - a new DF with the content of the DF storage)
added to base DF
• limit - a new DF with the first n rows of the base
DF diamond.limit(3).show()
• distinct - returns a new DF with distinct
elements
• select - a new DF with data from one or more
columns of the base DF
diamond.select(“carat”,”cut”,”color”).show(5)
• filter - a new DF with rows meeting a specific condition: diamond.filter(diamond["color"] == ‘E’).show(5)

DataFrames are built on RDDs
• DataFrames are base RDDs containing Row objects
• Use the rdd method to get the underlying RDD Processing techniques without having access to
peopleRDD = peopledDF.rdd all of the data.
It is usually used in the context of big data in
Summary which it is generated by many different sources
• Spark SQL is a Spark API for handling structured at high speed
and semistructured data
• Requires SQLContext as entry point or Streaming Data sources
SparkContext • Web Log Files / System Log Files
• Use a Catalog and SQL Optimizer • Sensors
• DataFrames are the key unit for structured data • Mobile Devices
• DataFrames are based on RDD of Row objects • Social Media
• DataFrames allow transformation • Network sockets
s and actions as RDDs • Wearables/ IoT devices
• DataFrames allow “SQL like” operations
• Spark is not a replacement for a Data Base Streaming Data use cases
• Fraud detection in credit cards / bank
#7 - SQL examples with Spark transactions
Building your Data Lake • Continuous patient monitoring
• Dynamic inventory management
• Dynamic routing of traffic data
• Thresholds / anomalies detection in sensors
data
• Bronze - Raw Ingestion • Next Best Action in on-line user engagement
• Patterns detection in social streaming data
• Silver – Filtered, Cleaned, Augmented
• Gold – Business Level aggregates
Processing Streaming Data
• Many small and fast data inputs/streams
Summary
• Can’t have all data ready up front
• Spark SQL supports many data operations and
• Data streams contain raw and unprocessed
functions like:
data: Noise; Data quality and integrity
• GroupBy and pivot
• Focus on the analysis of the streams patterns:
• join, union
Stream content and behavior
• Many SQL functions
• Actions might be required in real-time
• ETL/ELT pipelines with Spark Dataframes have
important advantages like performance and data
Batch vs Streaming Data
locality
• Data Scope: batch processing can process all
• The more recent Delta format allows ACID
the data in the dataset. Stream processing
transformations and more SQL like operations on
typically only has access to the most recent data
the data
or within a rolling time window
• The Data lake evolved to more sophisticated
• Data Size: batch processing is suitable for
data management and quality operations by
handling large datasets efficiently. Stream
using the Delta format
processing is intended for individual records or
• The Delta format supported the creation of the
micro batches consisting of few records
Data Lakehouse
• Performance: the latency for batch processing
is typically a few hours. Stream processing
#9 - Spark Streaming typically occurs immediately, with latency in the
Streaming Data order of seconds or milliseconds. Latency is the
Streaming data is data that is continuously time taken for the data to be received and
generated by different sources. Such data should processed
be processed incrementally using Stream • Analysis: you typically use batch processing for
performing complex analytics. Stream processing
is used for simple response functions, aggregates • Low latency and high scalability
or calculations such as rolling averages • Fault tolerance and reliability
• Integration with on-line sources and RDBMS
Impact on technology Systems
• Need to process (large volumes of) data quickly
and on the fly
• Means in memory parallel processing
• You have the queries but not the data
• As opposed to traditional batch systems
• You store the outputs, not the input raw data
• As opposed to traditional batch systems
• Typically, a component of a larger system
• Means that connectivity and integration are
key

Lambda Architecture
Proposed by Nathan Marz to address the
latency problem of typical DW’s

Spark Structured Streaming

• Apache Structured Streaming - Is a scalable and


fault-tolerant stream processing engine

• Spark Structured Streaming


• Since Spark 2.0
• Built on top of Spark SQL Library
• Separate engine based on DataFrames
• Streams are unbounded DataFrames -
Streaming DataFrames
• Allow micro-batch processing and a new low-
latency processing mode - Continuous processing
• Stream processing engine built on the Spark
SQL engine and DataFrames / Datasets • Spark
SQL engine will run queries incrementally and
Kappa Architecture continuously and updating the result as
Proposed by Jay Kreps as alternative to Lambda streaming data continues to arrive (small batch
jobs)
• Structured Streaming queries are processed
using a microbatch processing engine, which
processes data streams as a series of small batch
jobs thereby achieving end-to-end latencies as
low as 100 milliseconds
• Since Spark 2.3 was introduced a new low-
• All data follows the same path as stream events latency processing mode called Continuous
• Based on a stream processing engine (event- Processing, which can achieve endto-end
based and real-time) latencies as low as 1 millisecond

Streaming Data processing engine (aka:ESP, CEP)


• Specific engine to deal with Streaming Data
• Able to process large volumes of data in real-
time
Programming basic concepts
• The streaming context (legacy DStream API):
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, batch_interval_seconds)
• Spark Structured Streams – StreamingDataFrames (since Spark 2.0): DataFrames and Datasets can represent static, bounded data, as well as streaming, unbounded data
Remember: you want to have a program (agent) always running

• Reading a Stream
The method spark.readStream returns a DataStreamReader used to configure the stream
streamingDF = ( spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", port_number)
  .load() )

Input sources format
• File: reads files written in a directory as a stream of data. Ex: .json() .csv() .parquet() .text() …
• Kafka: reads data from Kafka
• Socket (for testing): reads UTF8 text data from a socket connection
• Rate source (for testing): generates data at the specified number of rows per second; each output row contains a timestamp and a value

• Processing the Stream
streamingDFnew = ( streamingDF
  .withColumnRenamed("Index", "User_ID")   # Pick a "better" column name
  .drop("_corrupt_record") )               # Remove an unnecessary column
streamingDFnew = streamingDF.select("device").where("signal = 1")

• Checking if it’s a Stream
streamingDF.isStreaming

• Writing the Stream
The method DataFrame.writeStream returns a DataStreamWriter used to configure the output of the stream
• Parameters of the DataStreamWriter configuration - aka Query:
o Query’s name (optional) - the name must be unique among all the currently active queries in the spark context
o Trigger (optional) - the default value is ProcessingTime(0): it will run the query as fast as possible
o Checkpointing directory (optional for pub/sub sinks)
o Output Mode and Output Sink
o Configuration specific to the output sink, such as: the host, port and topic of the receiving Kafka server; the file format and destination of files; a custom sink via writeStream.foreach(…)
Once the configuration is completed, we can trigger the job with a call to .start()

Queries - The query that is continually being executed over streaming data
• Managing Streaming Queries
o query.name
o query.stop()
o query.explain()
o query.awaitTermination()

The three steps of a streaming query:
• Receiving the data
o The data is received from sources using receivers
o Use readStream() – create your input StreamingDF
• Transforming the data
o The received data is transformed
o Use DF transformations – define your query logic
• Pushing out the data
o The final transformed data is pushed out (to external systems)
o Use writeStream() - create your query name and output
Remember: you have to start() and stop() your agent

Triggers – specify when the system should process the next set of data

Checkpointing
• A checkpoint stores the current state of your streaming job on reliable storage (Ex: HDFS)
• With a checkpoint, a terminated stream can be restarted and will continue from where it stopped
• To enable checkpointing you need to specify the location of the directory with:
.option("checkpointLocation", checkpointPath)
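Putting the pieces together, a minimal end-to-end sketch (not from the slides) of a Structured Streaming query with a checkpoint location – the rate source, query name and paths are illustrative:

# Read: a test "rate" source generating rows with a timestamp and a value
streamingDF = (spark.readStream
               .format("rate")
               .option("rowsPerSecond", 5)
               .load())

# Transform: plain DataFrame operations define the query logic
evenDF = streamingDF.where("value % 2 = 0")

# Write: configure the sink, output mode and checkpoint, then start the query
query = (evenDF.writeStream
         .queryName("even_values")                       # illustrative name
         .format("console")                              # console sink for testing
         .outputMode("append")
         .option("checkpointLocation", "/tmp/chk/even")  # illustrative path
         .start())

# query.stop() when done; query.awaitTermination() to block the driver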

• Window operations on Event Time: apply


transformations over a sliding window of data
• window(w_length, slide_interval)
Return a new StrmDF computed based on
Output Modes windowed batches of the source StrmDF
• countByWindow(w_length,
slide_interval) Return a sliding window
count of elements in the stream
Note: Append mode is the default • reduceByWindow(func, w_length,
slide_interval)
Output Sinks • Tumbling windows: fixed-sized, non-
overlapping and contiguous time intervals
• Sliding windows: overlapping and
contiguous time intervals
• Session windows (Spark 3.2): dynamic
size of window length

• Join operations → Join_output =


Structure of a Spark Streaming program StrmDF1.join(StrmDF2, “join_type”)
• Define the input sources by creating input • DataFrame and SQL operations →
DStreams / StreamingDFs Using DataFrames and SQL operations on
• Define the streaming process/computations by streaming data
applying transformation StrmDF.createOrReplaceTempView(“New
• Define the streaming output operation _Strm”) spark.sql(“select count(*) from
• Start receiving data and processing it using New_Strm”)
streamingContext.start() or query.start()
• Wait for the processing to be stopped Summary
(manually or due to any error) using • Streaming data is data that is continuously
streamingContext.awaitTermination() being generated and need to be analyzed in near
• The processing can be manually stopped using or real time
streamingContext.stop() or query.stop()
• Streaming data has been growing in terms of
importance and use cases with the new sources
like IoT and Social Media
• Streaming data requires a specific computing
architecture to cope with the speed and resilience
required to process the data
• Spark incorporates key advantages to process
streaming data like parallel computing and in-
memory processing Vertices = {a, b, c}
• Spark Streaming and later, Spark Structured Edges = {{a,b}, {b,c}, {c,a}}
Streaming, are additional Spark libraries for
processing streaming data Types of Graph algorithms
• Pathfinding Finding the shortest path with the
#10 – Graph Analytics fewest hops or lowest weight
Intro to Graphs • Centrality Understanding which nodes are
• Graph algorithms provide one of the most more important in a network like Ability to
potent approaches to analyzing connected data quickly spread information or bridging distinct
because their mathematical calculations are groups
specifically built to operate on relationships • Community detection Connectedness and
• Making sense of connected data by studying connectivity analysis is used to find communities
their relationships is key to have new insights like and quantify the quality of groupings
understanding the data structures and revealing
patterns Graph programming concepts
Spark Graph Analysis library
Graph analytics use cases • Spark library - API for Graph analysis and graph-
• In Spark, a job is associated with a chain of a parallel computation
Data Object (RDD/Dataset) dependencies
organized in a direct acyclic graph (DAG) GraphX (legacy)
• A DAG (finite direct graph) is a set of Vertices • Spark library, Scala
and Edges, where vertices represents the Data • Property Graph - RDD extension for Graph
Objects (RDD/Dataset) and the edges represents abstraction
the Operations to be applied on the Data Object. - Direct Multigraph with user defined
• Find influencers in a network objects attached to vertex and edges
• Find vulnerable components in a network • Graph operators and Graph algorithms
• Find the best path
• Fraud detection GraphFrames
• Understanding customer behavior • Not Spark itself, it’s a separate package (GitHub
• Understanding disease behavior – neo4j)
• Cost analysis and rout optimization in - Collaboration between Databricks, UC
transports Berkeley & MIT
• Wheather analysis and prediction • GraphX extended, Scala, Python, Java
• Web navigation analysis (web clicks path • DataFrame Graphs
analysis) - With Spark optimizations: Catalyst
• Track objects and operations sequence optimizer and Tungsten memory management
(logistics/computing) • Included in Databricks runtime for ML

Creating a GraphFrame
myGraphFrame = GraphFrame( verticesDF, edgesDF )
• A GraphFrame is created from a vertex DataFrame and an edges DataFrame
• The vertex DataFrame must contain a column named id that stores unique vertex IDs
• The edges DataFrame must contain a column named src that stores the source of the edge and a column named dst that stores the destination of the edge
• Both DFs can have additional columns, the vertex/edge attributes (depending on the specific needs)
Graph operators (GraphX, Scala)
• List operators: numEdges, numVertices, degrees
• Property operators: mapVertices, mapEdges, mapTriplets
• Structural operators: reverse, subgraph, groupEdges
• Join operators: joinVertices, outerJoinVertices
• Neighborhood aggregation: aggregateMessages (primitives for developing graph algorithms)

GraphFrames operators (GraphFrames, Python)
• show, display (vertices/edges)
• filter, groupBy, inDegrees, outDegrees (inDegrees/outDegrees = number of edges pointing in/out)
  g.vertices.filter(“column = ‘string’”).show()
  g.filterEdges(“column = ‘string’ “).dropIsolatedVertices().vertices.show()
  g.vertices.groupBy().max(“column”).show()
  g.inDegrees.show()
• Motif finding: searching for structural patterns in a graph. DSL - Domain Specific Language: () - [] -> ()
  g.find( “(v1) - [e1] -> (v2); (v2) - [e2] -> (v1)” )
  g.find( “(v1) - [] -> (v2); !(v2) - [] -> (v1)” )
• Subgraphs: select subgraphs based on a combination of motif finding and DataFrame filters. Ex: filterVertices(cond), filterEdges(cond), dropIsolatedVertices()
• Message passing: primitives for developing graph algorithms

GraphFrames algorithms
• Breadth-first search (BFS): finds the shortest path(s) from one vertex to another vertex
  paths = g.bfs(“column_start = ‘string’”, “column_end = ‘string’”)
• Connected components: computes the connected component (id) membership of each vertex
  results = g.connectedComponents()
• Strongly connected components: computes the strongly connected component (SCC) of each vertex
  results = g.stronglyConnectedComponents(maxIter=10)
• Label Propagation Algorithm (LPA): runs static LPA for detecting communities in networks
  result = g.labelPropagation(maxIter=5)
• PageRank: computes a rank/score for vertices (identifies important vertices in a graph)
  results = g.pageRank(maxIter=10, resetProbability=0.15)
• Shortest paths: computes shortest paths from each vertex to the given vertices
  results = g.shortestPaths( [ “a”, “c” ] )
• Triangle count: computes the number of triangles (3 linked nodes) passing through each vertex
  results = g.triangleCount()

Loading and Saving GraphFrames
v = spark.read.parquet(“path_vertices”)
e = spark.read.parquet(“path_edges”)
g = GraphFrame( v , e )
g.vertices.write.parquet(“path”)
g.edges.write.parquet(“path”)
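A compact sketch (not from the slides) of building and querying a GraphFrame in a Databricks/PySpark environment where the graphframes package is available; the data is illustrative:

from graphframes import GraphFrame

# Vertex DataFrame needs an "id" column; edges need "src" and "dst"
v = spark.createDataFrame([("a", "Alice"), ("b", "Bob"), ("c", "Carol")],
                          ["id", "name"])
e = spark.createDataFrame([("a", "b", "follows"),
                           ("b", "c", "follows"),
                           ("c", "a", "follows")],
                          ["src", "dst", "relationship"])

g = GraphFrame(v, e)

g.inDegrees.show()                                   # edges pointing in
results = g.pageRank(resetProbability=0.15, maxIter=10)
results.vertices.select("id", "pagerank").show()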
Summary
• Graphs are mathematical structures to model relationships between objects
• Graphs are everywhere in the world: Sociology, Biology, Physics, Medicine, Computer Sciences etc.
• Graphs are defined with Vertices and Edges
• In Spark we have two libraries to work with Graphs: GraphX (legacy) and GraphFrames (GitHub)
• To operate with GraphFrames we have the Graph Operators (show, filter, …, motif finding, subgraphs) and GraphFrame Algorithms (BFS, Connected Components, LPA, PageRank, Shortest Path, Triangle Count)

Supervised Learning (Predictive Modeling)
• Regression: to predict a real value output (linear variables)
• Classification: to predict a discrete value output (nonlinear / categorical variables)

Unsupervised Learning (Descriptive Modeling)
• Clustering
• Association
  - Association rules
  - Collaborative filtering
• Visualization
#11 – Spark ML Introduction

Machine Learning (ML)
• Many programs are process driven (procedural, tell the computer what to do)
• Database transactions and queries
• Controlling systems: transport, manufacturing processes etc.
• Search engines
• Social systems: blogs, chats, email etc.
• An alternative technique is to have computers "learn" what to do
• ML refers to programs that leverage collected data to drive future program behavior
• Development of systems that can "learn" from "past data" to identify patterns (describe the data) and apply them (to new data)
• ML is based on the construction and application of algorithms and statistical models to perform a specific task without using a procedural/declarative approach
• ML algorithms build a mathematical model (based on sample data) that best represents the data (reality)
• ML requires a process to train the model and apply the model
• The process to train (build) the model requires heavy data processing capabilities to:
• Prepare the data in the right format for the model
• Build (train) the model

ML methods are classified in
• Supervised learning - When we have (target) labelled data to train the model
• Unsupervised learning - When we don't have labeled data to train the model

Predictive Modeling / Descriptive Modeling
The three "Cs":
• Collaborative filtering (or recommendation systems) - Unsupervised Learning
• CF is an "unsupervised learning" technique for recommendations (recommendation systems)
• The algorithm discovers groups or recommendations based on similarities between users and items
• It is a technique for making automatic predictions (by filtering) based on historical events
• Helps users navigate data by expanding to topics that have affinity with their established interests
• CF algorithms are agnostic to the different types of data items involved
• Examples: • Given people who each like certain books, learn to suggest what someone may like in the future based on what they already like • Recommendations on music playlists, videos, locations

• Clustering - Unsupervised Learning
• Clustering algorithms discover structure in collections of data – where no formal structure previously existed
• They discover what groupings (clusters) naturally occur in data
• Examples: • Finding related news articles • Identify similar customers • Computer vision (groups of pixels that cohere into objects)

• Classification and regression - Supervised Learning
• Takes a set of data records with (a) known label(s) - learns how to label new records based on that information
• Examples: • Given a set of emails identified as spam/not spam, label new emails as spam/not spam • Given tumors identified as benign or malignant, classify new tumors

Application of ML
• Fraud detection
• Speech recognition
• Object Recognition / Computer vision
• Customer segmentation
• Customer churn / attrition / erosion
• Next Best Offer, Next Best Action
• Recommendation Systems
• Classification systems
• Transport optimization
• Production Plan optimization

Main concepts in Pipelines
Estimator: Learns (fits) parameters from the input DF and returns a model, which is a transformer
MyModel = SparkModel.fit(inputDF)
Transformer: Returns a new DF (transformed) based on the input DF
MyModel.transform(inputDF)
Pipeline: Combines multiple steps, named stages, into a single workflow
pipeline = Pipeline(stages=[stg1_object, stg2_obj, …])
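A minimal sketch of how the three concepts fit together in PySpark (the DataFrame inputDF and the column names category, amount and label are assumptions for illustration only, not part of the original slides):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

# each stage is either a transformer or an estimator
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
assembler = VectorAssembler(inputCols=["categoryIndex", "amount"], outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10)

pipeline = Pipeline(stages=[indexer, assembler, lr])   # combine the stages into a single workflow
model = pipeline.fit(inputDF)                          # estimator: fit() returns a model (a transformer)
predictions = model.transform(inputDF)                 # transformer: transform() returns a new DF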

Apache Spark for Machine Learning
• ML brings many computational challenges:
• Large volumes of source data to develop and train the model
• Very intensive in terms of computational resources
• Use of iterative algorithms (optimization functions)
• Use both structured and unstructured data
• Spark is a good fit for an ML processing engine
• Large parallelized datasets
• In-memory processing
• Scalability
• Iterative algorithms (using an API of choice)
• Doesn't require movement of data from the Big Data cluster

Spark MLlib
• MLlib is an Apache Spark library on top of core Spark
• MLlib already includes many common ML algorithms:
• K-means
• Alternating least squares
• Logistic regression
• Linear regression
• Gradient descent
• Still work in progress
• Still limited set of operations and algorithms
• Moving from MLlib to Spark ML
Primary API, which is the original Spark MLlib (RDD-based)
High-level API, which is a DataFrame-based API in spark.ml

Spark MLlib and ML Algorithms

General steps of a ML pipeline
• Data ingestion (load the data)
• Data exploration and preparation (analyze, clean and transform data)
Ex: missing values, split data into training and test, encoders etc.
• Model creation and training
model = model_type(), model.fit(training_data)
• Model analysis and test
print(coefficients), print(error/precision)
• Make predictions
prediction = model.transform(test_data)
• Model deployment
• Model evaluation

Data exploration and preparation for ML
• Eliminate duplicates
• Filter NULLs
• Derive new features
• Drop features
• Outliers' identification
• Vectorize features
• Index Strings
• One Hot Encoding

Summary
• ML systems are designed to "learn" from data by building a mathematical model and make predictions based on that model ("past data")
• ML implies a standard process
• Load data and clean/transform the data
• Build/train and test the model
• Deploy and use the model
• Evaluate the model
• Spark has unique characteristics for ML models development and implementation
• Spark does ML with two libraries: MLlib and Spark ML
• Spark implements many ML functions and algorithms (but still with work in progress)
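Tying the standard process above together, a minimal end-to-end skeleton (a sketch only; df, the "features"/"label" columns and the choice of LinearRegression are assumptions, not part of the original slides):

from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# data exploration and preparation (df is assumed to already have "features" and "label" columns)
clean_df = df.dropna().dropDuplicates()
train_data, test_data = clean_df.randomSplit([0.8, 0.2])

# model creation and training
model = LinearRegression(maxIter=10).fit(train_data)

# model analysis and test
print(model.coefficients)

# make predictions and evaluate
predictions = model.transform(test_data)
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
print("RMSE:", evaluator.evaluate(predictions))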

#12 – ML data preparation with Spark

Machine Learning example (the simple thing)

Drop features of a DataFrame

Eliminate duplicates

Filter Nulls (eliminate rows with null values)

Derive a new feature based on existing one(s)

Identifying Outliers
• Common rule: an outlier value is more than 1.5 * IQR
• IQR is the interquartile range, defined as the difference between the upper (Q3) and lower (Q1) quartiles → Meaning: there are no outliers if all the values are roughly within the Q1−1.5*IQR and Q3+1.5*IQR range
approxQuantile() method to identify approximate quantiles of numerical columns
• 1st parameter is the name of the column
• 2nd parameter can be either a number between 0 and 1 (where 0.5 means the median is calculated) or a list (as in this case)
• 3rd parameter specifies the acceptable level of error for each metric (0 means an exact value - it can be very expensive)
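A short sketch of that recipe (the DataFrame df and the column name "amount" are assumptions for illustration):

# approximate Q1 and Q3 for one numeric column (relative error 0.01 keeps it cheap)
q1, q3 = df.approxQuantile("amount", [0.25, 0.75], 0.01)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# keep only the rows inside the "no outlier" range
no_outliers = df.filter((df["amount"] >= lower) & (df["amount"] <= upper))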
Vectorize features (models take features as a vector)

Index Strings – StringIndexer
• Used to transform categorical string features into numerical features
• To use algorithms like Logistic Regression, you must first convert the categorical variables into numeric variables

One Hot Encoding – OHE
• For every distinct category, create a feature (vector)
• Better than just StringIndexer to solve non-ordered categories
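A minimal sketch combining these three preparation steps with the Spark 3 API (the DataFrame df and its columns "colour" and "amount" are illustrative assumptions):

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

indexer = StringIndexer(inputCol="colour", outputCol="colour_idx")              # string -> numeric index
indexed = indexer.fit(df).transform(df)

encoder = OneHotEncoder(inputCols=["colour_idx"], outputCols=["colour_vec"])    # index -> one-hot vector
encoded = encoder.fit(indexed).transform(indexed)

assembler = VectorAssembler(inputCols=["amount", "colour_vec"], outputCol="features")  # assemble the feature vector
prepared = assembler.transform(encoded)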

Summary
• ML requires a process with data loading, analysis, preparation and modeling
• The data preparation process requires many specific operations for getting the data ready for modeling
• Spark includes many useful functions for data preparation for ML
• dropna()
• dropDuplicates()
• selectExpr()
• UDFs
• VectorAssembler()
• StringIndexer
• OneHotEncoder

#13 – ML examples with Spark

1) Regression - Supervised learning
• Using a set of feature values (ex: claimed amount / house area) to predict a value (ex: amount of fraud / house price) – continuous output variables
• Linear regression is a regression model that fits a line into the data
Regression info – on notebook
Regression models available in Spark (continuous output variables):
• Linear regression
• Generalized linear regression
• Decision tree regression
• Random forest regression
• Gradient-boosted tree regression
• Survival regression
• Isotonic regression
• Factorization machines regressor
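A compact sketch of fitting one of the regressors in this list (train_df and test_df are assumed to be already prepared DataFrames with "features" and "label" columns):

from pyspark.ml.regression import RandomForestRegressor

rf = RandomForestRegressor(featuresCol="features", labelCol="label", numTrees=20)
rf_model = rf.fit(train_df)             # train on the prepared training DataFrame
preds = rf_model.transform(test_df)     # adds a "prediction" column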

2) Classification - Supervised learning
• Instead of predicting a value, in classification we predict a class (ex: 0 and 1)
• Logistic regression is a classification model that fits a logistic function into the data
Classification models available in Spark (discrete or categorical output variables):
• Logistic regression
• Decision tree classifier
• Random forest classifier
• Gradient-boosted tree classifier
• Multilayer perceptron classifier
• Linear Support Vector Machine
• One-vs-Rest classifier
• Naive Bayes
• Factorization machines classifier
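A short sketch of a logistic regression classifier, assuming prepared train_df and test_df DataFrames with "features" and binary "label" columns (names are assumptions for illustration):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

logr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
logr_model = logr.fit(train_df)                            # learn the logistic function
preds = logr_model.transform(test_df)                      # adds prediction and probability columns
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(preds)
print("Area under ROC:", auc)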

3) Clustering - Unsupervised learning
• Clustering is an unsupervised learning problem where we aim to group subsets of entities with one another based on some notion of similarity
• K-means aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean
Clustering models available in Spark:
• K-means
• Latent Dirichlet Allocation (LDA)
• Bisecting k-means
• Gaussian Mixture Model (GMM)
• Power Iteration Clustering (PIC)
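A minimal K-means sketch (prepared_df is an assumed DataFrame that already has a "features" vector column; no label is needed since this is unsupervised):

from pyspark.ml.clustering import KMeans

kmeans = KMeans(featuresCol="features", k=3, seed=1)    # k clusters, fixed seed for repeatability
kmeans_model = kmeans.fit(prepared_df)
print(kmeans_model.clusterCenters())                    # the learned centroids
clustered = kmeans_model.transform(prepared_df)         # adds a "prediction" (cluster id) column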
4) Association rules - Unsupervised learning
• Association rules are used to discover relations between variables in large datasets. In a transaction with a variety of items, association rules are meant to discover how or why certain items are connected
• The FP-growth algorithm operates on item sets to identify frequent items and association rules with the confidence, lift and support indicators
Confidence: how often (0-1) the association rule has been found to be true
Lift: measures the correlation between the antecedent and the consequent (> 1: yes)
Support: association rule occurrences / total transactions (0-1)
Association Rules (FPM) models available in Spark:
• FP-Growth
• PrefixSpan
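A small FP-Growth sketch with toy transactions (the data and thresholds are assumptions for illustration only):

from pyspark.ml.fpm import FPGrowth

# each row holds one transaction as an array of items
transactions = spark.createDataFrame(
    [(0, ["bread", "milk"]), (1, ["bread", "butter"]), (2, ["bread", "milk", "butter"])],
    ["id", "items"])

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
fp_model = fp.fit(transactions)
fp_model.freqItemsets.show()         # frequent item sets with their frequencies
fp_model.associationRules.show()     # rules with confidence, lift and support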
Summary
Spark has many functionalities to cover the ML programming lifecycle
• Data exploration and preparation
• Model creation and training
• Model analysis and execution
• Model evaluation
Spark ML available algorithms
• Regression (continuous output variables)
• Classification (discrete or categorical output variables)
• Clustering
• Frequent Pattern Mining (association)
• Collaborative Filtering (association)

#14 – Review Class and Examples

Summary
Analysing Big Data and Apache Spark
• Analyzing Big Data means processing large datasets (over multiple machines) with parallel processing technologies like Apache Spark
• Apache Spark is a multi-language and multi-purpose engine for in-memory data processing in distributed environments
• Batch and streaming data
• SQL analytics
• Data engineering and Machine learning
• Support for building the Data Lakehouse
• Apache Spark is the most widely-used engine for scalable computing. Thousands of companies, including 80% of the Fortune 500, use Apache Spark™. Over 2,000 contributors to the open-source project from industry and academia.
ABD 21 1st Exam - 11 January (pdf)

1. What is the meaning of the data processing model: "Scale Out"?
a. Means adding bigger, more powerful machines
b. Means adding more, smaller machines
c. Means processing data with scalable machines
d. Means processing data outside the cluster

2. In Databricks notebooks you can?
a. Program only in Python and Spark
b. Program only in the language defined in the Notebook creation
c. Select a language at the cell level
d. Select a language for a set of cells

3. In a Databricks notebook, to access the cluster driver node console, what magic command is used?
a. %fs
b. %drive
c. dbutils.fs.mount()
d. %sh

4. What is a DataFrame?
a. A data processing method in Spark
b. A distributed dataset in-memory
c. A distributed dataset on-disk
d. A distributed dataset on-disk and in-memory
Commented [GMS1]: Word says it's this one, and ChatGPT too

5. What is MapReduce?
a. A programming language
b. A programming model
c. A set of functions for processing big data
d. The original Apache Software Foundation query engine for big data

6. What is Avro in Hadoop?
a. A program to load data with high parallelization
b. A column-based data format
c. A row-based data format
d. A text-based data format for compatibility and portability

7. What is a Pair RDD?
a. Two RDDs in sequence in a transformation statement
b. An RDD with only two rows
c. An RDD with a pair operation
d. An RDD with key-value pairs

8. What is the result of the Spark statement below?
sc.parallelize(mydata,3)
a. Creates an RDD with a minimum of 3 partitions
b. Creates an RDD named mydata and the value 3
c. Generates an error of too many parameters
d. Creates 3 RDDs with mydata

9. What Spark function has, as an output, a Pair RDD?
a. map
b. flatmap
c. textFile
d. wholeTextFiles
Commented [GMS2]: Word says it's this one, and the abd_exames passados jan2021 too

10. Select the right statement to create a DataFrame:
a. DF = spark.createDataFrame(["Line 1", "Line 2"])
b. DF = spark.table("mytable")
c. DF = spark.load("mytable")
d. DF = sc.parallelize(["Line1", "Line2"])


11. Select the right statement regarding reduceByKey():
a. reduceByKey() is a wide transformation
b. reduceByKey() is a narrow transformation
c. reduceByKey() is a lazy transformation
d. reduceByKey() is an action

12. Select the false statement regarding Spark terminology:
a. A Job is a set of tasks executed as a result of an action
b. A Stage is a set of tasks in a job that can be executed in parallel
c. A task is an individual unit of work sent to an executor
d. An Application can only contain one job

13. What is a lambda function?
a. It's a function defined without a name and with only one parameter
b. It's a function defined without a name and with only one expression
c. It's a function that can be reused with many parameters
d. It's a function that can be reused with many expressions

14. collect() is a Spark function that?
a. You can extensively use to display Dataframes content
b. It's not available for Dataframes
c. You can extensively use to display RDDs content
d. You should use with caution to display RDDs content
We should use collect() on smaller datasets, usually after filter(), group(), count(), etc. Retrieving a larger dataset results in out of memory.

15. With the instruction sc.textFile("file:/data") you are?
a. Reading a file from your hdfs file system
b. Reading a file called "file:/data"
c. Reading a file from your local non-Hadoop file system
d. Reading a file called "data" stored in a folder called "file"

16. Window operations in Spark are used for?
a. Define a window to display data
b. Select a window of data to display
c. Freeze data in memory from a window of data
d. Apply transformations over a window of data

17. What is the data type that results from the following Spark instruction: spark.range(10)?
a. A list
b. A tuple
c. An RDD
d. A DataFrame

18. Select the right statement regarding Spark transformations:
a. Wide transformations are very efficient because they don't move data from the node
b. Narrow transformations are very efficient because they don't move data from the node
c. Both wide and narrow transformations move data from the node
d. None of the narrow or wide transformations move data from the node

19. In Spark, lazy execution means that:
a. Execution will take some time because it needs to be sent to the worker nodes
b. Execution will take some time because the code is interpreted
c. Execution is done one line at the time
d. Execution is triggered only when an action is found

20. What is the difference between Spark Streaming and Structured Streaming?
a) Structured Streaming is for structured streaming data processing and Spark Streaming is for unstructured streaming data processing
b) Spark Streaming is the new ASF library for Streaming Data and Structured Streaming the old one
c) Structured Streaming is a stream processing engine and Spark Streaming is an extension to the core Spark API to streaming data processing
d) Structured Streaming relies on micro batch and RDDs while Spark Streaming relies on DataFrames and Datasets

21. What is a Tumbling Window in Spark Streaming?
a. A fixed-sized, non-overlapping and contiguous window of data
b. An overlapping and contiguous window of data
c. A non-contiguous window of data
d. A dynamic size window of data

22. Spark ML library can be classified as:
a. A mature ML library with a very wide range of predictive and descriptive models to choose from
b. A strong Deep Learning library
c. A complete ML framework for data analysis
d. A ML library with a reasonable set of models but still with work in progress

23. What is the main difference between Spark MLlib and ML?
a. There is no difference apart from the bigger set of algorithms available on Spark ML
b. Spark ML works with Streaming Data
c. Spark MLlib is faster
d. Spark ML is based/works on/with Dataframes
Spark MLlib carries the original API built on top of RDDs.
Spark ML contains the higher-level API built on top of DataFrames for constructing ML pipelines.

24. In a Spark ML program, what is the purpose of the code below?
model.transform(mydata)
a. Create a machine learning model based on the data of 'mydata'
b. Apply the model in 'model' to the data in 'mydata'
c. Create a new model based on 'mydata'
d. Adjust the model based on the data in 'mydata'

25. In a Spark ML program, what is the purpose of the code below?
model.fit(mydata)
a. Train a machine learning model based on the data of 'mydata'
b. Apply the model in 'model' to the data in 'mydata'
c. Create a new model based on 'mydata'
d. Adjust the model based on the data in 'mydata'

26. What is the result of the Spark ML instruction below?
lr = LogisticRegression(maxIter=10)
a. A logistic regression object is declared with a maximum of 10 iterations
b. A logistic regression is executed with a maximum of 10 iterations
c. A logistic regression is trained with a maximum of 10 iterations
d. A logistic regression is estimated with a maximum of 10 iterations

27. The vertex DataFrame in a GraphFrame is:
a. A free form DataFrame
b. A DataFrame that must contain a column named 'id'
c. A DataFrame that must contain a column named 'src' and 'dst'
d. A DataFrame that must contain a column named 'id', 'src' and 'dst'

28. What is DSL used for in GraphFrames?
a. Formatting the output of a GraphFrame query
b. Declare a GraphFrame object
c. Search for patterns in a graph
d. Define properties in a GraphFrame
29. Based on the figure below, explain the shown algorithm line by line and the expected outcome with an example (of the outcome). Additionally, explain what the input content for the algorithm is supposed to be and where it is identified in the code (the name of the object).

The code is a map-reduce algorithm to perform a word count.
In the first line, the outcome is a list of words (one word per line). Each word is split on the delimiter " " after applying the flatMap, which flattens the results. In the second line, the map function applies a lambda function to each word to create a tuple with (word, 1).
This way, the outcome is a key-value pair, the word being the key and the value 1.
In the third line, an aggregation is performed by key (word), the outcome being a list of the distinct words with the count of occurrences registered.
The input content of the algorithm is an RDD representing at least a line of text, which is called in the first line before the first lambda function.
Example: rdd = sc.parallelize(["This is the first line", "This is the second line", "This is the last line"])
With this rdd, the outcome for the first line (after applying the method collect()) would be:
['This', 'is', 'the', 'first', 'line', 'This', 'is', 'the', 'second', 'line', 'This', 'is', 'the', 'last', 'line']
With this rdd, the outcome for the second line (after applying the method collect()) would be:
[('This', 1), ('is', 1), ('the', 1), ('first', 1), ('line', 1), ('This', 1), ('is', 1), ('the', 1), ('second', 1), ('line', 1), ('This', 1), ('is', 1), ('the', 1), ('last', 1), ('line', 1)]
Finally, the outcome after the third line (after applying the method collect()) would be:
[('line', 3), ('This', 3), ('last', 1), ('the', 3), ('first', 1), ('is', 3), ('second', 1)]
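Since the referenced figure is not reproduced here, a sketch of the kind of code the answer describes (it matches the word-count program used in Question 5 of the 2nd Exam further below; the RDD name rdd is an assumption):

rdd = sc.parallelize(["This is the first line", "This is the second line", "This is the last line"])
counts = (rdd.flatMap(lambda line: line.split(" "))   # 1st line: split each text line into words and flatten
             .map(lambda word: (word, 1))             # 2nd line: build a (word, 1) pair per word
             .reduceByKey(lambda a, b: a + b))        # 3rd line: sum the 1s per distinct word
counts.collect()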
30. What are the general steps of a Spark ML program? Give an example.
Regarding a Spark ML pipeline, the general steps are:
Data ingestion: in this step, the data is loaded
Data preparation: after loading the data, an exploratory data analysis should be performed, followed by cleaning the dataset (incoherences, missing values). Then the data should be properly split into training and test datasets
Model creation and training, with the training data
Model analysis and testing
Make predictions, by applying the trained model to the test dataset (model.transform(test_data))
Model evaluation, where we should print the evaluation metrics
Example for a multiple linear regression problem: We have data regarding clients from an insurance customer dataset and we want to predict the health insurance charges based on their ages, BMI, smoking habits and the number of children.
1 - First we need to create a dataframe with the customer data, which is in a list called records
df = spark.createDataFrame(records, ["age","bmi","smoker","charges"])
2 - Convert the smoker column to a binary column using the StringIndexer:
smokerIndexer = StringIndexer(inputCol = "smoker", outputCol = "smoker_binary")
df2 = smokerIndexer.fit(df).transform(df)
df3 = df2.drop("smoker")
3 - Then we should vectorise the data, to be used by the model, creating a features column
assembler = VectorAssembler(inputCols=["age", "bmi", "children", "smoker_binary"], outputCol="features")
stages += [assembler]
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df3)
dataset = pipelineModel.transform(df3)
4 - Then keep relevant columns (label and features) and rename the column "charges" to "label"
df_final = dataset.select(["features", "charges"]).selectExpr("features as features", "charges as label")
5 - The next step is to split the data into training and test sets (example 80% train, 20% test)
(trainingData, testData) = df_final.randomSplit([0.8,0.2])
6 - Create a linear regression object
lr = LinearRegression(maxIter = 10)
7 - Chain Linear Regression in a Pipeline
pipeline = Pipeline(stages = [lr])
8 - Train the Model
model = pipeline.fit(trainingData)
9 - Make Predictions
predictions = model.transform(testData)
10 - Evaluate the model
eval = RegressionEvaluator(labelCol = "label", predictionCol = "prediction")
print("Coefficients: " + str(model.stages[0].coefficients))
print("R2:", eval.evaluate(predictions, {eval.metricName: "r2"}))

31. Explain the main differences between real time data processing and batch data processing.
Batch data processing deals with groups of transactions that have already been collected over a period of time. The goal of a batch processing system is to automatically execute periodic jobs in a batch. It is ideal for large volumes of data/transactions, as it increases efficiency in comparison with processing them individually. However, it can have a delay between the collection of data and getting the result after the batch process, as it is normally a very time-consuming process. The latency in these systems is in the order of a few hours. An example of batch processing is the processing of the salaries within a company every month.
On the other hand, real-time data processing deals with continuously flowing data (individual records or micro batches) in real time. Real-time processing systems need to be very responsive and active all the time, in order to supply an immediate response at every instant. In these systems, the information is always up to date. However, the complexity of this process is higher than in batch processing. The latency in these systems is in the order of a few seconds or milliseconds. An example can be a radar system, weather forecast or temperature measurement, and they normally involve several IoT sensors.

ABD_Apoio_exame_vf9
1. Which of the following is NOT a component of big data architecture?
a) Data storage
b) Data sources
c) Machine Learning
d) Anonymization

2. Only one of the following sentences is NOT true. Choose it.
a. Workspaces allow you to organize all the work that you are doing on Databricks
b. The objective of the data lake is to break data out of silos
c. Clusters are a single computer that you treat as a group of computers
d. The data lake stores data of any type: structured, unstructured, streaming

3. Which of the following is the most appropriate definition for "Jobs"
a. Are packages or modules that provide additional functionality that you need to solve your business problems
b. Are structured data that you and your team will use for analysis
c. Are the tool by which you can schedule execution to occur either on an already existing cluster or a cluster of its own
d. Are third party integrations with the Databricks platform

4. Only one of the following options is NOT true. Choose it.
a. Notebooks in order to be able to execute commands
b. Dashboards can be created from notebooks as a way of displaying the output of cells without the code that generates them
c. Clusters allow you to execute code from apps
d. Workspaces allow you to organize all the work that you are doing on Databricks
5. Only one of the following sentences is correct. Choose it.
a. Clusters allow you to execute code from notebooks or libraries on sets of data
b. Dashboards can not be created from notebooks
c. Tables can not be stored on the cluster that you're currently using
d. Applications like Tableau are jobs

6. Only one of the following sentences is correct. Choose it. The command "%sh" allows you:
a. To display the files of the folder
b. To execute shell code in your notebook
c. To use dbutils filesystem commands
d. To include various types of documentation, including text, images and mathematical formulas and equations

7. Concerning the characteristics of "Lists" only one of the following options is NOT true. Choose it.
a. Lists are collections of items where each item in the list has an assigned index value
b. Lists consist of values separated by commas
c. A list is mutable (meaning you can change its contents)
d. Lists are enclosed in square brackets [ ] and each item is separated by a comma

8. Only one of the following sentences is correct. Choose it.
a. A dictionary maps a set of keys to another set of values
b. Tuples are never enclosed in parentheses
c. Dictionaries are immutable
d. The Lambda function is used for creating big and multimer function objects

9. Only one of the following sentences is NOT true. Choose it.
Several types of data are stored in the following DBFS root locations:
a. /databricks-datasets
b. databricks-results
c. /FileStore
d. databricks-finalizations

10. The example annex uses the lambda function. Are the syntaxes correct?
a. Yes
b. No

11. Concerning the use of Hadoop, only one of the following sentences is correct. Choose it.
a. A Node is a group of computers working together
b. A Cluster is an individual computer in the cluster
c. A Daemon is a program running on a node
d. With Hadoop we can't explore the nodes (name or data)

12. Which of the following is NOT a component of Hadoop data architecture?
a) HDFS
b) MapIncrease – (note: what it is, is MapDecrease)
c) Yarn
d) Spark

13. Concerning HDFS, only one of the following sentences is NOT true. Choose it.
a. HDFS is responsible for storing data on the cluster
b. HDFS is a File System written in Java
c. HDFS sits on top of a native file system
d. HDFS provides non-redundant storage

14. Only one of the following sentences is correct. Choose it.
a. YARN allows multiple data processing engines to run on a single cluster
b. MapReduce is a programming model for processing data in a distributed way on a unique node
c. MapReduce drawbacks: is optimized for iterative algorithms
d. MapReduce drawbacks: is not limited to batch processing

15. Only one of the following sentences is NOT true. Choose it. Avro data files:
a. Is a row-based storage format for Hadoop
b. It stores data in a non-binary format
c. It's an efficient data serialization framework
d. Uses JSON for defining the data schema

16. Concerning Parquet files, only one of the following sentences is NOT true. Choose it.
a. Parquet is supported in Spark, MapReduce, Hive, Pig, Impala, and others
b. Parquet reduces performance
c. Parquet is a columnar format developed by Cloudera and Twitter
d. Parquet is most efficient when adding many records at once

17. Which of the following is NOT a Delta Lake key feature?
a. Closed Format
b. Scalable Metadata Handling
c. Unified Batch and Streaming Source and Sink
d. Schema Enforcement and Evolution

18. Choose the option with the correct command to copy the file foo.txt from local disk to the user's directory in HDFS
a) $hdfs dfs -put foo.txt foo.txt
b) $hdfs dfs -ls foo.txt foo.txt
c) $hdfs dfs -get foo.txt foo.txt (??)
d) $hdfs dfs -rm foo.txt foo.txt

19. Concerning Apache Flume, only one of the following sentences is NOT true. Choose it.
a. Apache Flume is a real time data ingestion tool
b. The use of Apache Flume is only restricted to log data aggregation
c. Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of streaming data from many different sources to a centralized data store
d. Apache Flume is a top level project at the Apache Software Foundation

20. Can you explain why use Apache Storm?
Apache Storm is a free and open source distributed realtime computation system. Apache Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Apache Storm is simple, can be used with any programming language, and is a lot of fun to use!
Apache Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Apache Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

21. Concerning Apache Spark, only one of the following sentences is NOT true. Choose it.
a. Apache Spark is a unified analytics engine for large-scale data processing
b. Apache Spark is an open-source distributed computation framework for executing code in parallel across many different machines
c. Apache Spark is a slow, in-memory data processing engine without development APIs
d. Apache Spark is easy to use and lets you write applications quickly in Java, Scala, Python, R and SQL.

22. Which one of the following sentences is NOT a key Spark characteristic?
a. Processing of unstructured, structured and streaming data
b. Distributed processing
c. Unified analytics and evolving platform
d. Works only with Python language

23. Only one of the following sentences is NOT true. Choose it. Spark Core is:
a. Responsible for memory management and fault recovery
b. The base engine for large scale parallel data processing
c. Responsible for interacting with storage systems
d. Home to the API that defines Data warehouses

24. Define Spark's main benefits in terms of performance and development productivity
Spark's main benefits in terms of performance: using in-memory computing, Spark is considerably faster than Hadoop MapReduce (100x in some tests). It can also be used for batch and real-time data processing. In terms of developer productivity: easy-to-use APIs for processing large datasets that include 100+ operators for transforming data.

25. Concerning the main abstractions provided by Spark to work with data, only one of the following sentences is correct. Choose it.
a. Data Frames are the main approach to work with unstructured data - Scala objects representing data
b. RDDs are structured data objects in Spark - data organized into named columns (tables)
c. Datasets are extensions of the Data Frame API which provides a type-safe, object-oriented programming interface and an optimizer
d. A partition is a collection of columns that sit on one physical machine in our cluster

26. Only one of the following sentences is NOT true. Choose it.
a. Data Locality means moving data towards computation rather than moving computation close to data
b. Hadoop stores data in HDFS, which splits files into blocks and distributes them among various Data Nodes
c. With Hadoop, when a job is submitted, it is divided into map jobs and reduce jobs
d. RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel

27. Define the 3 ways to create an RDD in Spark
Meaning: loading data in the cluster memory
From data in memory - sc.parallelize
From a file or set of files - sc.textFile, spark.read
From another RDD - RDD operations, ex: new-rdd.map(rdd)

28. Which one of the following sentences is NOT a key RDD characteristic?
a. Resilient
b. Distributed
c. Mutable
d. In-memory

29. Which one of the following functions is NOT a single RDD transformation?
a. sortBy
b. flatMap
c. reduce
d. distinct

30. Only one of the following sentences is NOT true. Choose it. Working with RDDs, the function "collect" means:
a. Return the first element of the dataset
b. Return an array with the first n elements of the data set
c. Return all the elements of the dataset as an array at the driver program
d. Return the first n elements of the RDD

31. Concerning Spark terminology, only one of the following sentences is correct. Choose it.
a. Task is an individual unit of work sent to an executor
b. Application is a set of tasks executed as a result of an action
c. Job is a set of tasks in a job that can be executed in parallel
d. Stage is a set of jobs managed by a driver

32. Concerning RDD persistence, only one of the following sentences is NOT true. Choose it.
a. Storage levels let you control partition replication
b. The persist method offers other options called persistence levels
c. Storage levels let you control storage location
d. By default the persist method stores data in memory only
Commented [GMS3]: OpenAI says it's this one, and people don't disagree
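To illustrate the answer to question 27 above, a minimal sketch of the three ways to create an RDD (sc is the SparkContext available in a Databricks notebook; the file path is a hypothetical example):

rdd_from_memory = sc.parallelize(["Alice", "Carlos", "Frank"])        # 1) from data already in memory
rdd_from_file = sc.textFile("/FileStore/tables/some_file.txt")        # 2) from a file (hypothetical path)
rdd_from_rdd = rdd_from_memory.map(lambda name: name.upper())         # 3) from another RDD, via a transformation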
38. What is the Spark algorithm used in
33. Concerning RDDs persistence, the code of this picture?
describe how to choose the persistence
level
Memory only: when possible for best
performance and saves space by saving a
serialized object in memory
Disk: when re-computation is more expensive a. Word Count
than disk read. Ex: expensive functions or b. PI estimation
filtering large datasets c. K-means clustering
Replication: when re-computation is more d. PageRank implementation
expensive than memory
39. Concerning K-means clustering, only
34. Concerning RDDs checkpointing, only one of the following sentences is NOT
one of the following sentences is NOT true. Choose it.
true. Choose it. a. K-means partitions the data into k clusters
a. Risk of stack overflow b. There are only one version on k-means
b. Maintaining RDD lineage provides algorithms and implementations
resilience but can cause problems when c. K-means is a common iterative algorithm
the lineage is very short used in Machine Learning for cluster analysis
c. Checkpointing saves data to HDFS d. The objective of k-means is to group the
d. Recovery can be expensive observations into clusters where each
observation belongs to the cluster with the
35. Only one of the following sentences is nearest mean
NOT true. Choose it. In Spark:
a. Tasks are executed on a data node within 40. Describe the 5 steps to implement the
a contained processing environment K-means clustering algorithm in Spark
b. Broadcast variables are used to give to 1.Choose the seeds (k)
every node a copy of input data 2.Each individual is associated with the
c. Read-write shared variables across nearest seed
tasks would be very efficient 3.Calculate the centroids of the formed
d. Accumulators are used to give to the driver clusters
a consolidated value from the data notes 4.Go back to step 2
5.End when the centroids cease to be re-
36. Only one of the following sentences is centered
NOT true. Choose it.
Spark is especially useful when working 41. Which one of the following functions
with any combination of: is NOT a Spark SQL component?
a. Iterative algorithms a. A SQL command line interface
b. Large amounts of data b. The DataFrame IPA
c. Intensive computations c. Integration with HiveQL
d. Plus privacy d. The SQL Engine

37. Which one of the following functions


is NOT an iterative algorithm used in
Spark? 42. Concerning Spark SQL, only one of
a. K-means the following sentences is NOT true.
b. PI estimation Choose it.
c. Sum count
a. DataFrames can be created by performing 46. Which one of the following functions
an operation or query on another DataFrame is NOT a streaming data source?
b. Instead of using Java serialization a. Sensors
datasets use a specialized encoder to b. Social media
serialize the objects for processing or c. Survey questionnaires
transmitting over the network d. Mobile devices
c. Python supports the Dataset API
d. DataFrames are built on base RDDs 47. Which one of the following functions
containing Row objects is NOT a streaming data use cases?
a. Dynamic routing of traffic data
43. In Spark SQL, when you operate with b. Fraud detection in bank transactions
DataFrame (DF) Queries, if you use the c. Anomalies detection in sensors data
command “limit” what result do you d. Discrete patient monitoring
obtain?
a. A new DF with data from one or more 48. Concerning processing streaming
columns of the base DF data, only one of the following sentences
b. Returns a new DF with distinct elements are correct. Choose it.
c. A new DF with the content of the DF added a. Stream processing is used for simple
to base DF response functions, aggregates or
d. A new DF with the first n rows of the calculations such as rolling averages
base DF b. Stream processing typically occurs
immediately, with latency in the order of
44. Concerning Spark SQL, only one of minutes
the following sentences is NOT true. c. Stream processing is intended for grouped
Choose it. records
a. DataFrames are the key unit for d. Stream processing typically only has
unstructured data access to the oldest data
b. Spark SQL is a Spark API for handling
structured and semi-structured data 49. Which one of the following values is
c. Requires SQLContext as entry point or NOT accepted by the command
SparkContext DataSreamWriter.format in Spark
d. Use a Catalog and SQL Optimizer Streaming?
a. Kafka
45. Using Spark SQL present an b. Context
instruction to join Table A and Table B c. Memory
using the field “name” d. Delta

50. Describe the main steps of the


structure of a Spark Streaming program.
Define the input sources by creating input
DStreams / StreamingDFs
Define the streaming process/ computations
by applying transformation and output
operations to Dstreams / StreamingDFs
Receiving the data - The data is received from
sources using receivers, use readStream() –
join = ta.join(tb, ta.name == tb.name) Create your input StreamingDF
Transforming the data - The received data is
join.show() transformed, use DF Transformations –
Define your query logic
Pushing out the data - The final transformed 56. Describe the differences between ML
data is pushed out (to external systems), Use supervised learning and unsupervised
writeStream() - Create your query name and Learning.
output Supervised leaning is when we have (a target)
Remember: you have to start() and stop() your labelled data to train the model. Regression (to
agent predict a real value output (linear variables)
and Classification (to predict a discrete value
51. Which one of the following is NOT a output (nonlinear / categorical variables)).
Type of Graph? Unsupervised leaning is when we don’t have
a. Undirect graph labeled data to train the model. Clustering,
b. Total graph association and Visualization.
c. Directed graph
d. Directed star 57. Concerning Machine Learning, only
one of the following sentences is NOT
52. Which one of the following is NOT a true. Choose it.
Type of Graph algorithms? a. A classification / regression system takes a
a. MapReduce set of data records with (a) known label(s)
b. Community detection b. Classification and regression are forms
c. Centrality of “unsupervised learning”
d. Pathfinding c. Clustering is classified as “unsupervised
learning”
53. Which one of the following is NOT a d. Collaborative Filtering helps users navigate
Graph operator? data by expanding to topics that have affinity
a. Neighborhood aggregation with their established interests
b. Join operators
c. Spark operators 58. Which one of the following is NOT an
d. Property operators application of ML?
a. Object Recognition
54. What is the GraphFrames algorithm b. Transport detection
used in the code annex? paths = c. Speech recognition
g.bfs(“column_start= ‘string’”, “columns_end= d. Classification systems
‘string’”)
a. Strongly connected components 59. Describe the reasons Spark is a good
b. Breadth-first search fit for a ML processing engine.
c. Label Propagation Algorithm Large parallelized datasets
d. Page Rank In-memory processing
Scalability
55. Concerning Graph Analysis, only one Iterative algorithms (using an API of choice)
of the following sentences is NOT true. Doesn’t require movement of data from the
Choose it. Big Data cluster
a. In Spark we have two libraries to work with
Graphs: GraphX and GraphFrames 60. Concerning Machine Learning, only
b. Graph are mathematical structures to one of the following sentences is NOT
model relationships between objects true. Choose it.
c. Graphs are defined with Squares and a. Spark does ML with two libraries:
Edges Hadoop ML and Spark ML
d. Graphs are everywhere in the world: b. Spark has unique characteristics for ML
Sociology, Biology, Physics, Medicine, models development and implementation
Computer Sciences etc. c. Spark implements many ML functions and
algorithms
d. ML system are design to “learn” from data Select a language at the cell level
by building a mathematical model and make
predictions based on that model What is an RDD?
Select one:
61. Which one of the following is NOT a A dataset in-memory
step in data preparation for ML?
a. Filter NULLs What is a lambda function?
b. Eliminate duplicates Select one:
c. Index Strings It’s a function that can be reused with many
d. Vectorize tables expressions

62. Using Spark ML programming what In Spark, lazy execution means that:
results can we obtain with the code Select one:
annex? Execution is triggered only when an action
is found

What is the difference between Spark


a. Drop features of a DataFrame Streaming and Structured Streaming?
b. Identifying Outliers Select one:
c. Vectorize features Structured Steaming is a stream
d. StringIndexer processing engine and Spark Streaming is
an extension to the core Spark API to
63. Which one of the following is NOT a streaming data processing
regression model available in Spark?
a. Perceptron regression What is the main difference between Spark
b. Gradient-boosted tree regression MLlib and ML?
c. Random forest regression Select one:
d. Decision tree regression Spark ML works with Dataframes

64. Which one of the following is NOT a The vertex DataFrame in a GraphFrame is:
regression model available in Spark? Select one:
a. Multilayer perceptron classifier A DataFrame that must contain a column
b. Decision classifier named 'id'
c. Gradient-boosted tree classifier
d. Random forest classifier
ABD_Exames passados_jan2021 (removed the repeated questions)
65. Concerning Machine Learning, only 3. What is an RDD?
one of the following sentences is NOT Select one:
true. Choose it. a. A Hadoop data format
b. A dataset in-memory
a. ML requires a process with data loading,
c. A dataset in-disk
analysis and preparation d. A dataset in-disk and in-memory
b. Spark provides two libraries for ML:
“pyspark.ml” and “pyspark.mllib” 8. What is the output object type that results from
applying a map() function to an RDD that was created
c. Spark includes “dropna()” function for data
from a text file with the sc.textFile() method?
preparation for ML Select one:
d. Spark includes “UDFs” function for a. String
data modelling b. Tuple
c. List
d. Dictionary
In Databricks notebooks you can?
Select one:
9. Select the right statement to create an RDD: • Both map() and flatMap() are an transformations
Select one: that apply a function to the elements of an RDD an
a. myRDD = [“Alice”, “Carlos”, “Frank”, “Barbara”] return a new RDD with the transformed elements.
b. myRDD = sc.load(“Alice”, “Carlos”, “Frank”, • On the one hand, map() transformation takes one
“Barbara”) element and produces one element (one-to-one
c. myRDD = sc.parallelize([“Alice”, “Carlos”, “Frank”, transformation). On the other hand, flatMap() takes
“Barbara”])
one element and produces zero, one or more
d. myRDD = load(“Alice”, “Carlos”, “Frank”, “Barbara”)
elements (one-to-many transformation)
• In this example, let’s create an RDD which has a
16. Select the right instruction to create a list of 2 lines of text o rdd =
Dataframe? sc.parallelize(["First Word",
Select one: "Second Word"])
a. Dataframe = spark.range(10)
Out[1]: DataFrame[id: bigint] • If we perform an upper case transformation using
b. Dataframe = sc.textFile(“mydata”) the map(), the output will be a list with the two
c. Dataframe = spark.dataFrame(“Mydata”)
lines of text with all the words in uppercase: o
d. Dataframe = sc.parallelize(“mydata”)
code: rdd.map(lambda line: line.upper()).collect()
o output: ['FIRST WORD', 'SECOND
19. What is the difference between Spark Streaming WORD']
and Structured Streaming?
Select one: • If we perform an upper case transformation using
a. Structured Streaming is for structured streaming the flatMap(), the output will be a list with all the
data processing and Spark Streaming is for characters character in upper case, as the results
unstructured streaming data processing
are flattened: o code: rdd.flatMap(lambda line:
b. Spark Streaming is the new ASF library for
line.upper()).collect()
Streaming Data and Structured Streaming the old one
c. Structured Steaming is a stream processing engine o output: ['F', 'I','R','S', 'T','
and Spark Streaming is an extension to the core ','W','O','R','D','S','E','C','O','N','D',' ','W','O','R','D']
Spark API to streaming data processing
d. Structured Streaming relies on micro batch and 2nd Exam 2021
RDDs while Spark Streaming relies on DataFrames and Question 1
Datasets Write a Databricks Notebook program to do the
following tasks
20. What are DStreams? 1. Display your Spark session version, master and
Select one: AppName
a. Data abstractions provided from Spark core library
b. Data abstractions provided from Spark MLlib
c. Data abstractions provided from Spark Streaming
d. Data abstractions provided from Spark Structured
Streaming

Exams ABD (removed the repeated questions)

6 – What is a pair RDD?
2. Print the list of the files in /FileStore/tables
a) Two sets of RDDs in a transformation
b) An RDD with only two rows
c) An RDD with two data types
d) An RDD with only two columns

30 – Explain the differences between the map() 3. Copy one of the files (you may suggest a non-
and flatMap() transformations in Spark. existing) name to a new version with the name
Complementary, show and example created by prefix “New_”
you (with code) and explain the output
differences.
1. Create a RDD with the following 7 lines of text:
"First Line", "Now the 2nd", "This is the 3rd line",
"This is not the 3rd line?", "This is the 5th ", "This
is the 6th", "Last Line"

1 – sc
2 – %fs ls /FileStore/tables
3 - dbutils.fs.cp("dbfs:/FileStore/tables/File.csv",
2. Create a RDD (based on the previous one with
"dbfs:/FileStore/tables/New_File.csv", True) Spark functions) with only the lines that start
with the word "This"
Question 2
Write a Spark program to do the following tasks:
1. Create a Python list of Temperatures in ºF as
in: [50, 59.2, 59, 57.2, 53.5, 53.2, 55.4, 51.8, 53.6,
55.4, 54.7]
2. Create an RDD based on the list above

3. Create a RDD (based on the previous one with


Spark functions) without lines with the word
"not"

3. Display the main stats of the RDD (like count,


mean, stdev etc.)

4. Create a program to convert the elements of


the RDD from ºF to ºC
5. Display the result of the new RDD with the 4. Create a RDD (based on the previous one with
elements in ºC Spark functions) with the text in capital letters

5. Display only two elements of the resulting RDD

1 – List = [50, 59.2, 59, 57.2, 53.5, 53.2, 55.4, 51.8, 53.6,
55.4, 54.7]
2 – rdd = sc.parallelize(List)
rdd.collect()
3 – rdd.stats() 1–
4 – rdd_Celsius = rdd.map(lambda T: (T-32)*5/9) text = ["First Line", "Now the 2nd", "This is the 3rd line",
5 – rdd_Celsius.collect() "This is not the 3rd line?", "This is the 5th ", "This is the 6th",
"Last Line"]
rdd = sc.parallelize(text)
Question 3 rdd.collect()
Write a Spark program to do the following tasks: 2–
rdd_2 = rdd.filter(lambda line: line.startswith("This"))
rdd_2.collect() Question 5
3– Write a Spark algorithm to count the number of
rdd_3 = rdd.filter(lambda line: "not" not in line)
rdd_3.collect() distinct words on the text line below (or similar
4– input you may write).
rdd_4 = rdd.map(lambda line: line.upper()) • o "First Line"
rdd_4.collect() • o "Now the 2nd line"
5 – rdd_4.take(2) • o "This is the 3rd line"
• o "This is not the final
Question 4 line?"
Write a Spark program to do the following tasks: • o "This is the 5th "
1. Create an RDD with the data below (simulating • o "This is the 6th"
customer acquisitions): o • o "Last Line is the 7th"
'Client1:p1,p2,p3'
o 'Client2:p1,p2,p3' The output must be a pair RDD with the distinct
o 'Client3:p3,p4' words and the corresponding number of
occurrences.

2. Write a Spark program to convert the above


RDD in the following output: o 'Client1',
'p2'
o 'Client1', 'p3'
o 'Client2', 'p1'
o 'Client2', 'p2'
o 'Client2', 'p3'
o 'Client3', 'p3'
o 'Client3', 'p4'

1–
rdd = sc.parallelize(['Client1:p2,p3', 'Client2:p1,p2,p3',
'Client3:p3,p4'])
rdd.collect()
2- text = ["First Line", "Now the 2nd line", "This is the 3rd line",
rdd2 = rdd.map(lambda line: line.split(":")) "This is not the final line?", "This is the 5th", "This is the 6th",
rdd3 = rdd2.map(lambda fields: (fields[0],fields[1])) "Last Line is the 7th"]
rdd4 = rdd3.flatMapValues(lambda p: p.split(",")) rdd = sc.parallelize(text)
rdd4.collect() rdd2 = rdd.flatMap(lambda line: line.split()) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda w1,w2: w1+w2).collect()
Question 7
Question 6 Explain the major differences between Spark SQL,
Write a Spark program to do the following tasks: Hive, and Impala. Give examples supporting your
1. Create a RDD based on the list of the following explanation for each of the 3 cases.
4 tuples: • Spark SQL is a distributed in-memory computation engine.
It is a spark module for structured data processing which is
built on top of Spark Core.
[('Mark',25),('Tom',22),('Mary',20),('Sofia',26)] o Spark SQL can handle several independent processes and
2. Create a Dataframe based on the previous RDD in a distributed manner across thousands of clusters that are
with the 2 columns named "Name" and "Age" distributed among several physical and virtual clusters.
3. Display the new DataFrame o It supports several other Spark Modules being used in
applications such as Stream Processing and Machine
4. Create a new DataFrame based on the
Learning
previous one, adding a new column named • On the other hand, Hive is a data warehouse software for
"AgePlus" with the content of Age multiplied by querying and managing large distributed datasets, built on
1.2 top of Hadoop File System (HDFS) o Hive is designed for
Batch Processing through the use of Map Reduce
Programming
• Finally, Impala is a massively parallel processing (MPP)
engine developed by Cloudera.
o Contrarly to Spark, it supports multi-user environment
while having all the qualities of Hadoop it supports column
storage, tree architecture, Apache HBase storage and HDFS.
o It has significantly higher query throughput than Spark
SQL and Hive.
o However, in large analytical queries Spark SQL and Hive
outperform Impala.

Question 8 - Write a Spark program to do the


following tasks:
1. Create a Graph that depicts the following data and
relationships: o Alice: age=31; Esther:
age=35; David: age=34; Bob age=29
o Alice is married to Bob and is
friend of Esther
o Esther is married to David and is
friend of Alice
o Bob and David are friends
5. Write the new DataFrame (with the columns
"Name", "Age" and "AgePlus") in dbfs in Delta 2. Print the edges and vertices of your graph
format 3. Create a subgraph with only the friend relationships
6. Check that the written DataFrame/file is in and show the result
dbfs

1 – rdd =
sc.parallelize([('Mark',25),('Tom',22),('Mary',20),('Sofia',26)])
2 – df = spark.createDataFrame(rdd).toDF("Name","Age")
3 – display(df)
4 – df2 = df.withColumn("AgePlus", df["Age"]*1.2)
5 – df2.write.format("delta").save("/FileStore/tables/df2")
6 - %fs ls /FileStore/tables/df2
1 - from graphframes import *
# Create the vertices
vertices = sqlContext.createDataFrame([
("a", "Alice", 31),
("b", "Esther", 35),
("c", "David", 34),
("d", "Bob", 29)], ["id", "name", "age"])
# Create the edges
edges = sqlContext.createDataFrame([
("a", "d", "married"),
("a", "b", "friend"),
("b", "c", "married"),
("b", "a", "friend"),
("c", "d", "friend")], ["src", "dst", "relationship"])
# Create the graph
g = GraphFrame(vertices, edges)
2–
display(g.vertices)
display(g.edges)
3–
friends = g.edges.filter("relationship = 'friend' ")
friends.show() 1–
records = [ [19,27,0,168,"y"], [18,33,1,177,"n"],
Question 9 - Write a Spark program to do the [28,35,2,191,"y"], [32,38,3,208,"n"] ]
following tasks: df = spark.createDataFrame(records,
["age","bmi","children","charges","smoker"])
1. Create a DataFrame simulating insurance
2–
customer data with: o The columns: from pyspark.ml.feature import StringIndexer
["age","bmi","children","charges" smokerIndexer = StringIndexer(inputCol = "smoker",
,"smoker"] outputCol = "smokerIndex")
o 4 records with the following values: [ df2 = smokerIndexer.fit(df).transform(df)
3-
[19,27,0,168,"y"], [18,33,1,177,"n"], df_final = df2.drop("smoker")
[28,35,2,191,"s"], [32,38,3,208,"n"]] display(df_final)

2. Create a new DataFrame based on the 1. Write a Spark program to do the following
previous one with the values in the smoker tasks:
column encoded in 1/0 values o Create a DataFrame simulating insurance
3. Display the new DataFrame without the customer data, with (or assume you already have
original "smoker" column a file with the data and read it from dbfs):
o The columns:
["age","bmi","children","charges"
,"smoker"]
o 4 records with the following values: [
[19,27,0,168,"y"], [18,33,1,177,"n"],
[28,35,2,191,"s"], [32,38,3,208,"n"] ]
3. At the end of your program print also the:
coefficients, RMSE and the r2 of your model

1–
records = [ [19,27,0,168,"y"], [18,33,1,177,"n"],
[28,35,2,191,"y"], [32,38,3,208,"n"] ]
df = spark.createDataFrame(records,
["age","bmi","children","charges","smoker"])’
2–
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
# Convert the smoker column to a binary column
2. Write a Spark ML program to perform a smokerIndexer = StringIndexer(inputCol = "smoker",
outputCol = "smoker_binary")
multiple linear regression in an attempt to df2 = smokerIndexer.fit(df).transform(df)
predict the health insurance charges based on df3 = df2.drop("smoker")
the clients age, bmi, number of children and # Vectorize the data
his/hers smoking habits from pyspark.ml import Pipeline
stages = []
assembler = VectorAssembler(inputCols=["age", "bmi",
"children", "smoker_binary"], outputCol="features")
stages += [assembler]
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df3)
dataset = pipelineModel.transform(df3)
# Keep relevant columns and rename the column "charges"
to "label"
df_final = dataset.select(["features",
"charges"]).selectExpr("features as features", "charges as
label")
# Split the data into training and test sets (80% train, 20%
test)
(trainingData, testData) = df_final.randomSplit([0.8,0.2])
# Create a linear regression object
lr = LinearRegression(maxIter = 10, regParam = 0.3,
elasticNetParam = 0.8)
# Chain Linear Regression in a Pipeline
pipeline = Pipeline(stages = [lr])
# Train the Model
model = pipeline.fit(trainingData)
# Make Predictions
predictions = model.transform(testData)
3–
eval = RegressionEvaluator (labelCol = "label", predictionCol
= "prediction")
print("Coefficients: " + str(model.stages[0].coefficients))
print("RMSE:", eval.evaluate(predictions, {eval.metricName:
"rmse"}))
print("R2:", eval.evaluate(predictions, {eval.metricName:
"r2"}))

