
DB Admin Technical Test

General Knowledge Check


1. How would you describe what a database is?
2. How would you define what data engineering is?
3. What are the differences between a star schema and a snowflake schema?
4. Explain the ACID properties of a database. How are they implemented?
5. What is the difference between row level and table level locking?
6. What is an index, and what types of indexes can be created on a table?
7. While executing a DELETE statement, a Data Engineer keeps getting an error about a “foreign
key constraint failing”. What should they do?
8. What are the differences between RDBMS and NoSQL databases?
9. When would you use NoSQL?
10. What pros and cons can you describe for MongoDB and PostgreSQL?
11. What challenges arise when migrating data from a NoSQL database to a SQL database, and vice versa?
12. How would you deploy ETL pipelines using public cloud service providers?
13. What do you know about Apache Hadoop and Apache Spark? How would the two
interact with each other in data ingestion pipelines?
14. What are the core components of a distributed application in Apache Spark?
15. What is “Lazy Evaluation” in Apache Spark?

Take-Home Tests
This section uses AWS terminology, but it should be applicable to other public cloud services as
well.

Overview
You are working as a Data Engineer at ACME, and it's your first day at the office.

The hiring manager shows you around the office and explains the company's current data
ingestion architecture:

• ACME utilizes 3 AWS accounts: INGESTOR, OTHER, and DATABASES.


• ACME also rents a co-location space, COLO, where it deploys some applications that
are currently not feasible to move into the cloud.
• ACME's data engineers are given access only to INGESTOR, and they can use all AWS
services there.
• The INGESTOR and DATABASES accounts are located in the same network, but they
use different subnets. The InfoSec team mandates that cross-account S3 access use an
IAM keypair.
• INGESTOR and COLO can talk to each other over an L3 VPN tunnel established
between them, but some firewall configurations need to be modified.
• You are tasked with bringing the data from 4 data sources into ACME's data lake (S3).

The architecture diagram is as follows:

The data sources are:

1. A publicly accessible shared spreadsheet in CSV format, stored in AWS S3 in the OTHER
account, with the following properties:

Bucket Name: interview-bucket
Prefix: datasets/sample.csv
Access Key ID: "accesskeyid"
Secret Access Key: "supersecretaccesskey"
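
For illustration, a minimal ingestion sketch in Python using boto3, with the keypair above used for the cross-account read. The data lake bucket name (acme-data-lake) and the output prefix are placeholders, not values from the scenario:

    # Minimal sketch: pull the shared CSV from the OTHER account and land it in the data lake.
    import boto3

    # Client authenticated with the keypair supplied for the OTHER account's bucket.
    source_s3 = boto3.client(
        "s3",
        aws_access_key_id="accesskeyid",
        aws_secret_access_key="supersecretaccesskey",
    )
    source_s3.download_file("interview-bucket", "datasets/sample.csv", "/tmp/sample.csv")

    # Client using the INGESTOR account's own credentials (default credential chain)
    # to write the raw file into the data lake bucket (placeholder name).
    lake_s3 = boto3.client("s3")
    lake_s3.upload_file("/tmp/sample.csv", "acme-data-lake", "raw/other/sample.csv")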
2. An Amazon RDS for PostgreSQL instance running in the DATABASES account, with the
following properties:

Hostname: interview-db.ap-southeast-3.rds.amazonaws.com
Port: 5432
Instance name: interviewInstance
Schema name: interviewSchema
Table name: interviewTable
Username: candidate
Password: candidatepassword
Table size: 20 GB
Estimated ingestion time from start to finish: 3 hours
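
For illustration, a minimal sketch of a chunked pull in Python with pandas and SQLAlchemy. It assumes interviewInstance is the database name, that the INGESTOR subnet can reach the RDS instance on port 5432, and that pyarrow and s3fs are installed for the Parquet writes; the data lake bucket is a placeholder:

    # Minimal sketch: stream the 20 GB table in chunks so it never has to fit in memory at once.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine(
        "postgresql+psycopg2://candidate:candidatepassword@"
        "interview-db.ap-southeast-3.rds.amazonaws.com:5432/interviewInstance"
    )

    query = "SELECT * FROM interviewSchema.interviewTable"

    # Each chunk is written to the data lake as a separate Parquet part file.
    for i, chunk in enumerate(pd.read_sql(query, engine, chunksize=100_000)):
        chunk.to_parquet(
            f"s3://acme-data-lake/raw/postgres/interviewTable/part-{i:05d}.parquet",
            index=False,
        )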

3. An on-premises MS SQL Server database deployed in COLO, with the following properties:

Hostname: INTERVIEWDB.DATASINTESA.NET
Port: 1433
Instance name: interviewInstance
Schema name: interviewSchema
Table name: interviewTable
Username: candidate
Password: candidatepassword
Table size: 50 MB
Estimated ingestion time from start to finish: less than 15 minutes
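
For illustration, a minimal sketch in Python using pyodbc over the VPN tunnel. It assumes the firewall has been opened for port 1433, treats interviewInstance as the database name, and uses a placeholder data lake bucket (pyarrow and s3fs are again assumed for the Parquet write):

    # Minimal sketch: the table is only ~50 MB, so it can be read in one shot.
    import pandas as pd
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=INTERVIEWDB.DATASINTESA.NET,1433;"
        "DATABASE=interviewInstance;"
        "UID=candidate;PWD=candidatepassword"
    )

    df = pd.read_sql("SELECT * FROM interviewSchema.interviewTable", conn)
    df.to_parquet("s3://acme-data-lake/raw/mssql/interviewTable.parquet", index=False)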

4. A third-party API that exposes the data layer as a REST endpoint to the client, with the
following properties:

URL: https://interview-api-datasintesa.net/table_name=interview_table
Returned object: JSON
Username: candidate
Password: candidatepassword
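
For illustration, a minimal sketch in Python with the requests library. It assumes the endpoint accepts HTTP basic authentication with the credentials above and returns the full table as a single JSON payload; the data lake bucket and key are placeholders:

    # Minimal sketch: fetch the JSON payload and land it unchanged in the data lake.
    import json

    import boto3
    import requests

    resp = requests.get(
        "https://interview-api-datasintesa.net/table_name=interview_table",
        auth=("candidate", "candidatepassword"),
        timeout=60,
    )
    resp.raise_for_status()
    records = resp.json()

    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="acme-data-lake",
        Key="raw/api/interview_table.json",
        Body=json.dumps(records).encode("utf-8"),
    )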

Goals
1. Write a brief explanation of how you would access and ingest the datasets from (1), (2), (3), and (4).
2. Write example code to ingest the data from scenarios (1), (2), (3), and (4).
3. In which situation would you use a NoSQL database in your ETL pipeline?
4. Cloud services knowledge check:
1. Which AWS services would you use, and why? (Note: you may use equivalent Azure /
GCP services.)
2. Which programming language and framework would you use, and why?
3. Describe the high-level data flow from source to sink, e.g. do you have to mutate the data
in any way?
4. Credentials management, i.e. where will you put the username and password used to
access the source systems in your pipeline?

Outputs
1. Submit your answers within 2 days after the hiring team sends you the questions.
2. Write your answers as a text document (.txt, .docx, etc.).
3. Code blocks must be formatted in a human-readable way.
