
Spark SQL & Hadoop Course
(For Data Scientists & Big Data Analysts)

Data Engineering
Solutions to Problems

PROBLEM 1 - SOLUTION

/// Read in data & create a DataFrame

var df_q1 = spark.read.format("csv")
.option("header", "true")
.load("/user/verulam_blue/data/WHO_data/population_data.csv.bz2")

/// Write code & save results

var col_names = df_q1.columns

df_q1
.na.fill("NIL", col_names)
.write.format("avro")
.option("compression", "snappy")
.mode("overwrite")
.save("/user/vb_student/problems/section07/problem01/")

/// Check Results

var check_1 = spark.read.format("avro")
.option("compression", "snappy")
.load("/user/vb_student/problems/section07/problem01/")

check_1.show(3, false)
check_1.count()

Should be: 8665
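Optional extra check (not part of the original solution): since every column was filled with the placeholder "NIL", you can confirm the fill worked by counting rows that contain "NIL" in at least one column. A minimal sketch, assuming check_1 was read back as above:

import org.apache.spark.sql.functions.col

// count rows where any column holds the "NIL" placeholder
check_1
.filter(check_1.columns.map(c => col(c) === "NIL").reduce(_ || _))
.count()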

PROBLEM 2 - SOLUTION

/// Read in data & create a DataFrame

var df_q2 = spark.read.format("parquet")
.option("compression", "gzip")
.load("/user/verulam_blue/data/gp_db/practice_demographics/")

/// Write code & save results

df_q2
.filter("nbr_of_patients > 3000 AND nbr_of_patients < 4000")
.selectExpr("practice_code", "nbr_of_patients")
.write.format("json")
.option("compression", "deflate")
.mode("overwrite")
.save("/user/vb_student/problems/section07/problem02")

/// Check Results

var check_2 = spark.read.format("json")
.option("compression", "deflate")
.load("/user/vb_student/problems/section07/problem02")

check_2.show(3, false)
check_2.count()

Should be: 806
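As an optional variation (not required by the solution), the same filter can also be written with the Column API instead of a SQL expression string; both return the 806 rows above:

// equivalent filter using Column expressions rather than a SQL string
df_q2
.filter($"nbr_of_patients" > 3000 && $"nbr_of_patients" < 4000)
.selectExpr("practice_code", "nbr_of_patients")
.count()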

PROBLEM 3 - SOLUTION

/// Read in data & create a DataFrame

var df_q3 = spark.read.format("parquet")
.option("compression", "gzip")
.load("/user/verulam_blue/data/credit_cards")

/// Write code & save results

df_q3
.select("card_holder_name", "issuing_bank", "issue_date")
.write.format("orc")
.option("compression", "zlib")
.mode("overwrite")
.save("/user/vb_student/problems/section07/problem03")

/// Check Results

var check_3 = spark.read.format("orc")
.option("compression", "zlib")
.load("/user/vb_student/problems/section07/problem03/")

check_3.show(3, false)
check_3.count()

Should be: 2000000
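If you also want to confirm that zlib-compressed ORC part files were actually produced, a sketch like the one below (using the Hadoop FileSystem API available in spark-shell) lists the output directory; the exact part-file names vary by Spark version, but with zlib they typically end in .zlib.orc:

import org.apache.hadoop.fs.{FileSystem, Path}

// list the files written by the Problem 3 solution
var fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("/user/vb_student/problems/section07/problem03/"))
.foreach(s => println(s.getPath.getName))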

PROBLEM 4 - SOLUTION

/// Read in data & create a DataFrame

var df_q4 = spark.read.format("parquet")
.option("compression", "gzip")
.load("/user/verulam_blue/data/gp_db/gp_rx")

/// Write code & save results

df_q4
.selectExpr("concat_ws('\t', sha, pct, practice_code, bnf_code, bnf_name, items, nic,
act_cost, quantity, period) as results")
.write.format("text")
.option("compression", "lz4")
.mode("overwrite")
.save("/user/vb_student/problems/section07/problem04")

/// Check Results

var check_4 = spark.read.format("csv")
.option("sep", "\t")
.option("compression", "lz4")
.load("/user/vb_student/problems/section07/problem04/")

check_4.show(3, false)
check_4.count()

Should be: 10272116
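Because the text output has no header row, the CSV check above comes back with generic column names (_c0 ... _c9). Optionally, the original names can be re-attached with toDF; check_4_named is just an illustrative name:

// re-attach the original column names to the header-less check DataFrame
var check_4_named = check_4.toDF("sha", "pct", "practice_code", "bnf_code", "bnf_name", "items", "nic", "act_cost", "quantity", "period")
check_4_named.show(3, false)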

PROBLEM 5 - SOLUTION

/// Read in data & create a DataFrame

var df_q5 = spark.read.format("parquet")
.option("compression", "gzip")
.load("/user/verulam_blue/data/taxi_data")

/// Write code & save results

import org.apache.spark.sql.functions.month  // needed for the month() column function

df_q5
.where(month($"pickup_datetime") === "03")
.selectExpr("concat_ws('|',*)")
.write.format("text")
.option("compression", "gzip")
.mode("overwrite")
.save("/user/vb_student/problems/section07/problem05")

/// Check Results

var check_5 = spark.read.format("csv")
.option("sep", "|")
.option("compression", "gzip")
.load("/user/vb_student/problems/section07/problem05/")

check_5.show(3, false)
check_5.count()

Should be: 834429
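The same month filter can also be expressed as a SQL string, which avoids the functions import entirely; shown here only as an optional variation:

// equivalent month filter expressed as a SQL string
df_q5
.where("month(pickup_datetime) = 3")
.count()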

PROBLEM 6 - SOLUTION

/// Read in data & create a DataFrame

var df_q6 = spark.read.format("parquet")
.option("compression", "gzip")
.load("/user/verulam_blue/data/gp_db/gp_rx/")

/// Write code & save results

df_q6
.selectExpr("practice_code", "bnf_code", "bnf_name", "items", "nic", "act_cost",
"abs(nic-act_cost) as difference")
.where($"difference" > 2)
.drop("difference")
.coalesce(1)
.write.format("parquet")
.option("compression", "gzip")
.option("path", "/user/vb_student/problems/section07/problem06/")
.mode("append")
.saveAsTable("gp_db.q6_soln")

/// Check Results

var check_6 = spark.sql("SELECT * FROM gp_db.q6_soln")

check_6.show(3, false)
check_6.count()

Should be: 4736749
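Since the solution registers q6_soln as a table backed by the explicit path option, you can optionally confirm the table's location, format and compression through the metastore:

// inspect the table metadata recorded in the metastore
spark.sql("DESCRIBE FORMATTED gp_db.q6_soln").show(100, false)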

End of Solutions to Problems


For more problems that focus on the “Data Engineering” section of this course,
see the course:

CCA175 Exam Prep Questions Part A ETL Focus (With Spark 2.4 Hadoop Cluster VM)

See the **Bonus Section** for more details.
