
CSIE59830/CSIEM0410 Big Data Systems, Spring 2021

Final Exam
(This is a limited time open-book, open-Web take-home exam. Upon
receiving the exam file, edit the file with your answers, convert it into a
PDF file, and send the PDF file to me through email by 12:30pm. You are
allowed to refer to class notes, textbooks, or even search the Web. However,
you have to answer all questions on your own without discussion with
anyone else.)

ID: 610821245 Dept: CSIE Name:楊宗翰

1. (15%) For each description of the problem, specify the big data system/tool(s)
which is (are) best for solving the problem and briefly explain your choice.

Problem: The CDC (Taiwan Centers for Disease Control, 衛生福利部疾病管制署)
wants to provide a new service reflecting the current status of the COVID-19
pandemic by combining the coronavirus updates sent by all hospitals in real time.
Tool: Spark Streaming, because the service needs updates in real time.

Problem: The 7-11 headquarters wants to analyze seasonal trends in product sales
for all 7-11 retail stores in Taiwan to make better predictions on future demand
and improve inventory management.
Tool: Spark, because analyzing seasonal trends over a huge data volume is a batch
analytics workload, and Spark MLlib provides regression models we can use to
predict future demand.

Problem: The United Daily News (聯合報) wants to convert all of its past
newspaper articles into a scalable database that can be queried and searched.
Tool: MongoDB, because it stores documents such as newspaper articles natively
and supports scalable indexing, querying, and searching.

Problem: The NDHU Environment Protection Project wants to provide real-time
analytics of air quality data from IoT sensors installed around the campus and
neighboring areas.
Tool: Spark Streaming or Structured Streaming, because we want to analyze air
quality in real time. The IoT devices can publish readings over MQTT into Kafka,
and Spark integrates with Kafka easily.

Problem: The Federal Bureau of Investigation (FBI) needs to analyze user
relationships and group activities on Twitter to uncover potential terrorist
organizations.
Tool: Apache Hama, because analyzing user relationships is a graph problem: each
user is a node, and a graph-computing tool can find the connections between
users.

2. (15%) Answer the following questions briefly and to the point.
(a) What is lineage in Spark? Where and when do we use it? Why is it useful?
(b) What is (are) the main reason(s) why Spark is faster than Hadoop MapReduce on
iterative jobs?
(c) Why is structured streaming considered more advanced than Spark streaming?

a. A lineage graph is the dependency graph between an existing RDD and the new
RDDs derived from it; Spark records it for every transformation. When a failure
occurs, Spark can rebuild the lost partitions by replaying the lineage. It is
useful because it makes RDDs fault tolerant without replicating the data: the
lineage information alone is enough to recompute what was lost.
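
A minimal Scala sketch (spark-shell, where sc is the SparkContext) of how
lineage is tracked and inspected:

```scala
// Each transformation extends the lineage graph rather than executing
// immediately; toDebugString prints the recorded chain of parent RDDs.
val base    = sc.parallelize(1 to 100)     // root RDD, no parent
val doubled = base.map(_ * 2)              // depends on base
val evens   = doubled.filter(_ % 4 == 0)   // depends on doubled
println(evens.toDebugString)
// If a partition of evens is lost, Spark replays map and filter from
// base to rebuild just that partition; no data replication is needed.
```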
b. Because Spark keeps intermediate results in memory across iterations, while
Hadoop MapReduce writes them to HDFS after every job and reads them back.
c. Because Structured Streaming offers a SQL-like DataFrame API: normal users
who find RDDs hard to understand only need to know SQL to process streaming
data. In addition, the computation is executed on the same optimized Spark SQL
engine, whereas with raw RDDs badly written code easily leads to poor
performance.
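
A minimal Structured Streaming sketch (the socket source, host, and port are
assumptions) showing the SQL-like DataFrame API:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("wordcount").getOrCreate()
// treat a text stream as an unbounded table
val lines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 9999).load()
val counts = lines.groupBy("value").count()  // same API as a batch DataFrame
// the optimized Spark SQL engine runs this query incrementally
val query = counts.writeStream.outputMode("complete")
  .format("console").start()
query.awaitTermination()
```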

3. (20%) Consider the problem of footprint tracking and contact tracing of
COVID-19 confirmed cases. Assume that we have N locations, M users and a
temporal threshold T. Each user is equipped with an APP to automatically
detect and upload all relevant events. More specifically, when a user visits a
location or meets with another user, an event is recorded and uploaded to the
CDC database with time, location ID, and user ID(s). Upon the confirmation
of a particular case, we want to print out the travel history of the case in the
past two weeks as well as a list of users that may have been exposed to the
coronavirus. The travel history is a list of locations ordered by visit time. The
list of users includes all users who had direct contact with the confirmed case
or visited the same location as the confirmed case within the threshold T. Both are
for the past two weeks.
(a) Propose a big data framework that is best for solving the problem. Explain
your choice.
(b) Select an appropriate tool within the framework to solve the problem.
(c) Write program segments to generate the travel history and list of users with the
tool of your choice.
(* Design your own input/output formats. Explain it clearly. *)
a. Spark, because Spark can easily solve this big data problem with map- and
reduce-style transformations, and Spark is faster than Hadoop.
b. I prefer using Spark Core and RDDs to solve this problem.
c. Input data (one event per line: userID <name> location <lat,long> date <date> time <time>):
userID Anna location 41.40338,2.17403 date 12/24 time 12:24
userID Anna location 41.40338,2.17404 date 12/24 time 12:25
userID Anna location 41.40338,2.17408 date 12/24 time 12:26
userID Janna location 41.40338,2.17404 date 12/24 time 12:25
userID Janna location 41.40338,2.17409 date 12/24 time 12:26
userID Janna location 41.40332,2.17404 date 12/24 time 12:27
val data = sc.textFile("inputdata").map(_.split(" "))
// fields after split: 1 = user, 3 = coordinates, 5 = date, 7 = time
// if we want to find Anna's travel history, ordered by visit time
val annaTravelHistory = data.filter(x => x(1) == "Anna")
  .sortBy(x => (x(5), x(7)))
  .map(x => x(3))
// then print
annaTravelHistory.collect().foreach(println)
The output will be like:
41.40338,2.17403
41.40338,2.17404
41.40338,2.17408

This means Anna visited these locations, in order of visit time.

// if we want to find people who were at the same place at the same time
val sameTravelHistory = data
  .filter(x => x(3) == "41.40338,2.17404" && x(7) == "12:25")
  .map(x => x(1))
  .distinct()
// then print
sameTravelHistory.collect().foreach(println)

the output will be like:


Janna
Anna
This means Janna and Anna were at location (41.40338,2.17404) at the same time.
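
The exact-time match above can be generalized to the temporal threshold T. A
sketch, assuming times are in HH:mm within one day, T is given in minutes, and
toMinutes is a helper defined here:

```scala
// Sketch: users who visited the same location as the confirmed case
// within T minutes of the case's visit.
def toMinutes(t: String): Int = {
  val Array(h, m) = t.split(":").map(_.toInt)
  h * 60 + m
}
val T = 30                                      // assumed threshold, in minutes
val events = sc.textFile("inputdata").map(_.split(" "))
val caseVisits = events.filter(x => x(1) == "Anna")
  .map(x => (x(3), toMinutes(x(7))))            // (location, visit time)
val otherVisits = events.filter(x => x(1) != "Anna")
  .map(x => (x(3), (x(1), toMinutes(x(7)))))    // (location, (user, time))
val exposed = caseVisits.join(otherVisits)      // co-located pairs
  .filter { case (_, (tCase, (_, tOther))) => math.abs(tCase - tOther) <= T }
  .map    { case (_, (_, (user, _))) => user }
  .distinct()
exposed.collect().foreach(println)
```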

4. (20%) Consider the problem of analyzing YouTube user behavior. Assume that
you have a list of YouTubers, each having one or more channels. Also given is a
list of Subscribers, each of whom may subscribe to zero or more channels. Solve
the following problems with Neo4j.
(a) Briefly describe how would you represent the problem with Neo4j.
(b) Define and construct the YouTuber-Subscriber database with the Cypher
query language.
(c) Write a query to print out the list of top 50 most-subscribed YouTube
channels.
(d) Write a query to print out the list of top 50 most-subscribed YouTubers based
on the sum of subscribers of all channels of the same YouTuber without
consideration of the overlap of subscribers among different channels.
(* Design your own input/output formats. Explain it clearly. *)
a. We can use a node to represent each YouTuber, each channel, and each
Subscriber in Neo4j, with relationships connecting YouTubers to their channels
and Subscribers to the channels they subscribe to; finding the relationships
between YouTubers and Subscribers then becomes a graph query.
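
A Cypher sketch of this representation (the labels YouTuber, Channel,
Subscriber and the relationship types OWNS, SUBSCRIBES_TO are my own naming,
not given in the problem):

```cypher
// (b) construct a small sample database
CREATE (y:YouTuber {name: "Alice"})-[:OWNS]->(c:Channel {name: "AliceVlogs"});
MATCH (c:Channel {name: "AliceVlogs"})
CREATE (s:Subscriber {name: "Bob"})-[:SUBSCRIBES_TO]->(c);

// (c) top 50 most-subscribed channels
MATCH (s:Subscriber)-[:SUBSCRIBES_TO]->(c:Channel)
RETURN c.name, count(s) AS subs
ORDER BY subs DESC LIMIT 50;

// (d) top 50 YouTubers by the sum of subscribers over all their channels
// (subscribers counted once per channel, overlaps not deduplicated)
MATCH (y:YouTuber)-[:OWNS]->(c:Channel)<-[:SUBSCRIBES_TO]-(s:Subscriber)
RETURN y.name, count(s) AS totalSubs
ORDER BY totalSubs DESC LIMIT 50;
```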

5. (20%) Answer the following questions about HBase.
(a) Summarize the data model of HBase.
(b) HBase was built on top of other Apache technologies. What are these
technologies?
(c) What are the fault tolerance mechanisms built into the HBase system? Briefly
describe the main purposes of each mechanism.
(d) Briefly describe the role of ZooKeeper in HBase.

a. Tables are stored sorted by row key. A table schema defines only its column
families; each family can have any number of columns, and each column any
number of versions. Columns within a family are stored together.
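
A brief hbase shell sketch of the model (the table and family names are made
up):

```
create 'users', {NAME => 'info', VERSIONS => 3}, 'activity'  # two column families
put 'users', 'row1', 'info:name', 'Anna'   # a column inside family 'info'
put 'users', 'row1', 'info:name', 'Ann'    # a newer version of the same cell
get 'users', 'row1', {COLUMN => 'info:name', VERSIONS => 2}
```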
b. HDFS (the underlying storage layer) and ZooKeeper (coordination); HBase runs
on top of Apache Hadoop.
c.
 HMaster fault tolerance: HMaster can be configured for HA, and ZooKeeper is
used to elect a new active HMaster when the current one fails.
 HRegionServer fault tolerance: when a RegionServer goes down, HMaster detects
it through ZooKeeper, reassigns the Regions of that server, and splits the WAL
of the failed server; the new RegionServers replay it to recover the data.
 ZooKeeper fault tolerance: as the coordinator, ZooKeeper is itself a reliable
replicated distributed service.
 HDFS fault tolerance: the underlying storage relies on HDFS, which keeps
multiple redundant replicas of the data.
d. ZooKeeper provides the distributed coordination service: it maintains the
cluster state, elects the active HMaster, and notifies it of server failures.

6. (20%) Briefly answer the following questions about NoSQL, NewSQL and
distributed SQL.
(a) What are the differences between NewSQL and NoSQL?
(b) What is eventual consistency? What guarantees do we get from eventual
consistency? When and why do we use it?
(c) What are the purposes of using consistent hashing for partitioning in
DynamoDB?
(d) Given an existing set of configuration values for sloppy quorum, how do you
modify the value(s) to increase the availability? How about increasing
consistency? Can we get both of them?
(e) Give MongoDB statements to create a student collection with three student
documents.
(f) What is the main technology that enables Google Spanner to be a Distributed
SQL system? How is the technology used in Spanner? Why is it important?
a.
        NoSQL                          NewSQL
  Schema-free                    Fixed schema (some are also schema-free)
  Does not support SQL           Supports SQL
  Non-relational DBMS            Relational DBMS
  Promotes CAP properties        Promotes ACID properties
  High query complexity          Very high query complexity
  Supports OLTP                  Fully supports OLTP
  Supports the cloud             Fully supports the cloud
b. Eventual consistency is a consistency model used in distributed computing to
achieve high availability. It informally guarantees that, if no new updates are
made to a given data item, eventually all accesses to that item will return the
last updated value. Eventually-consistent services are often classified as
providing BASE. We use it when availability and low latency matter more than
always reading the newest value, because relaxing consistency lets every
replica keep serving requests even during network partitions.
c. To scale incrementally; to partition data over the set of nodes dynamically,
moving only a small fraction of the keys when nodes join or leave; and to keep
the data distribution balanced across nodes.
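
A toy Scala sketch of the idea (not DynamoDB's actual implementation, which
also uses virtual nodes): nodes and keys hash onto the same ring, and a key is
stored on the first node clockwise from its hash, so adding or removing a node
only remaps the keys in one arc of the ring.

```scala
import scala.collection.immutable.TreeMap

// Non-negative hash so ring positions are comparable.
def hash(s: String): Int = s.hashCode & Int.MaxValue

class Ring(nodes: Seq[String]) {
  // Ring positions sorted by hash value.
  private val ring: TreeMap[Int, String] =
    TreeMap(nodes.map(n => hash(n) -> n): _*)

  // First node at or clockwise after the key's hash; wrap to the start.
  def nodeFor(key: String): String = {
    val it = ring.iteratorFrom(hash(key))
    if (it.hasNext) it.next()._2 else ring.head._2
  }
}
```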
d. To increase availability, decrease R and W (e.g. R = 1, W = 1), so a request
succeeds as long as any replica responds. To increase consistency, increase
them so that R + W > N (e.g. R = 1, W = N). We cannot maximize both at once:
larger read/write quorums strengthen consistency but make operations fail
whenever too many replicas are unavailable.
e. db.createCollection("students")
db.students.insertOne({ "name": "student1" })
db.students.insertOne({ "name": "student2" })
db.students.insertOne({ "name": "student3" })

f. The TrueTime API, which represents time as a TTinterval, an interval with
bounded time uncertainty. TrueTime is implemented by a set of time master
machines per datacenter and a timeslave daemon per machine; the underlying time
references are GPS and atomic clocks.
The TrueTime API is the key enabler of the synchronous replication and
distributed transactions in Spanner: bounded clock uncertainty lets Spanner
assign globally meaningful commit timestamps, which is essential for external
consistency across datacenters.
