
Spark SQL & Hadoop Course
(For Data Scientists & Big Data Analysts)

Data Engineering
Solutions to Problems

PROBLEM 1 - SOLUTION

/// Read in data & create a DataFrame

var df_q1 = spark.read.format("csv")
.option("header", "true")
.load("/user/verulam_blue/data/WHO_data/population_data.csv.bz2")

/// Write code & save results

var col_names = df_q1.columns

df_q1
.na.fill("NIL", col_names)
.write.format("avro")
.option("compression", "snappy")
.mode("overwrite")
.save("/user/vb_student/problems/section07/problem01/")

/// Check Results

var check_1 = spark.read.format("avro")
.option("compression", "snappy")
.load("/user/vb_student/problems/section07/problem01/")

check_1.show(3, false)
check_1.count()

Should be: 8665
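Optional extra check (not part of the original solution): since every column was filled with the placeholder "NIL", you can confirm the fill worked by counting rows that contain "NIL" in at least one column. A minimal sketch, assuming check_1 was read back as above:

import org.apache.spark.sql.functions.col

// count rows where any column holds the "NIL" placeholder
check_1
.filter(check_1.columns.map(c => col(c) === "NIL").reduce(_ || _))
.count()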

PROBLEM 2 - SOLUTION

/// Read in data & create a DataFrame

var df_q2 = spark.read.format("parquet")
.option("compression", "gzip")
.load("/user/verulam_blue/data/gp_db/practice_demographics/")

/// Write code & save results

df_q2
.filter("nbr_of_patients > 3000 AND nbr_of_patients < 4000")
.selectExpr("practice_code", "nbr_of_patients")
.write.format("json")
.option("compression", "deflate")
.mode("overwrite")
.save("/user/vb_student/problems/section07/problem02")

/// Check Results

var check_2 = spark.read.format("json")
.option("compression", "deflate")
.load("/user/vb_student/problems/section07/problem02")

check_2.show(3, false)
check_2.count()

Should be: 806
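As an optional variation (not required by the solution), the same filter can also be written with the Column API instead of a SQL expression string; both return the 806 rows above:

// equivalent filter using Column expressions rather than a SQL string
df_q2
.filter($"nbr_of_patients" > 3000 && $"nbr_of_patients" < 4000)
.selectExpr("practice_code", "nbr_of_patients")
.count()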

PROBLEM 3 - SOLUTION

/// Read in data & create a DataFrame

var df_q3 = spark.read.format("parquet")
.option("compression", "gzip")
.load("/user/verulam_blue/data/credit_cards")

/// Write code & save results

df_q3
.select("card_holder_name", "issuing_bank", "issue_date")
.write.format("orc")
.option("compression", "zlib")
.mode("overwrite")
.save("/user/vb_student/problems/section07/problem03")

/// Check Results

var check_3 = spark.read.format("orc")
.option("compression", "zlib")
.load("/user/vb_student/problems/section07/problem03/")

check_3.show(3, false)
check_3.count()

Should be: 2000000
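If you also want to confirm that zlib-compressed ORC part files were actually produced, a sketch like the one below (using the Hadoop FileSystem API available in spark-shell) lists the output directory; the exact part-file names vary by Spark version, but with zlib they typically end in .zlib.orc:

import org.apache.hadoop.fs.{FileSystem, Path}

// list the files written by the Problem 3 solution
var fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("/user/vb_student/problems/section07/problem03/"))
.foreach(s => println(s.getPath.getName))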

PROBLEM 4 - SOLUTION

/// Read in data & create a DataFrame

var df_q4 = spark.read.format("parquet")
.option("compression", "gzip")
.load("/user/verulam_blue/data/gp_db/gp_rx")

/// Write code & save results

df_q4
.selectExpr("concat_ws('\t', sha, pct, practice_code, bnf_code, bnf_name, items, nic,
act_cost, quantity, period) as results")
.write.format("text")
.option("compression", "lz4")
.mode("overwrite")
.save("/user/vb_student/problems/section07/problem04")

/// Check Results

var check_4 = spark.read.format("csv")
.option("sep", "\t")
.option("compression", "lz4")
.load("/user/vb_student/problems/section07/problem04/")

check_4.show(3, false)
check_4.count()

Should be: 10272116
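Because the text output has no header row, the CSV check above comes back with generic column names (_c0 ... _c9). Optionally, the original names can be re-attached with toDF; check_4_named is just an illustrative name:

// re-attach the original column names to the header-less check DataFrame
var check_4_named = check_4.toDF("sha", "pct", "practice_code", "bnf_code", "bnf_name", "items", "nic", "act_cost", "quantity", "period")
check_4_named.show(3, false)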

PROBLEM 5 - SOLUTION

/// Read in data & create a DataFrame

var df_q5 = spark.read.format("parquet")
.option("compression", "gzip")
.load("/user/verulam_blue/data/taxi_data")

/// Write code & save results

import org.apache.spark.sql.functions.month  // needed for the month() column function

df_q5
.where(month($"pickup_datetime") === "03")
.selectExpr("concat_ws('|',*)")
.write.format("text")
.option("compression", "gzip")
.mode("overwrite")
.save("/user/vb_student/problems/section07/problem05")

/// Check Results

var check_5 = spark.read.format("csv")
.option("sep", "|")
.option("compression", "gzip")
.load("/user/vb_student/problems/section07/problem05/")

check_5.show(3, false)
check_5.count()

Should be: 834429
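The same month filter can also be expressed as a SQL string, which avoids the functions import entirely; shown here only as an optional variation:

// equivalent month filter expressed as a SQL string
df_q5
.where("month(pickup_datetime) = 3")
.count()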

PROBLEM 6 - SOLUTION

/// Read in data & create a DataFrame

var df_q6 = spark.read.format("parquet")
.option("compression", "gzip")
.load("/user/verulam_blue/data/gp_db/gp_rx/")

/// Write code & save results

df_q6
.selectExpr("practice_code", "bnf_code", "bnf_name", "items", "nic", "act_cost",
"abs(nic-act_cost) as difference")
.where($"difference" > 2)
.drop("difference")
.coalesce(1)
.write.format("parquet")
.option("compression", "gzip")
.option("path", "/user/vb_student/problems/section07/problem06/")
.mode("append")
.saveAsTable("gp_db.q6_soln")

/// Check Results

var check_6 = spark.sql("SELECT * FROM gp_db.q6_soln")

check_6.show(3, false)
check_6.count()

Should be: 4736749
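Since the solution registers q6_soln as a table backed by the explicit path option, you can optionally confirm the table's location, format and compression through the metastore:

// inspect the table metadata recorded in the metastore
spark.sql("DESCRIBE FORMATTED gp_db.q6_soln").show(100, false)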

End of Solutions to Problems


For more problems that focus on the “Data Engineering” section of this course,
see the course:

CCA175 Exam Prep Questions Part A ETL Focus (With Spark 2.4 Hadoop Cluster VM)

See the **Bonus Section** for more details.
