Project Guidelines - Analytics Engineering


Data Engineering

Project 3: Data Warehousing


Project Objectives

1 Explore and Manipulate an Unstructured Dataset

2 Build ETL Pipelines for Data Warehousing

3 Create your own Analytics Application


Explore and Manipulate an Unstructured Dataset

Chess Games Database (365chess.com)


● Game history of the current top 200 chess players (over 300,000 games)
● Link (Click me)
● 200 MB uncompressed, 41 MB compressed
● For each player:
○ 1 file with current player data (ranking, Elo rating)
○ 1 file with all recorded games

Steam Games Database (vginsights.com)


● Usage statistics for PC video games sold on Steam (about 19,000 games)
● Link (Click me)
● 1.01 GB uncompressed, 21 MB compressed
● For each game:
○ 9 JSON files containing diverse data (History, Company, Ranking, Pricing, etc.)
Explore and Manipulate an Unstructured Dataset

Main challenges
● The databases are largely unstructured and undocumented, so exploration will be required to
understand their contents (see the exploration sketch below).
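
Since the only documentation is the files themselves, a first pass can simply walk the unpacked folder tree and report what each file contains. A minimal Python sketch, assuming the dataset has been unpacked to a local folder (the `steam_data/` layout is an assumption, not the documented structure):

```python
import json
from pathlib import Path

# Walk the unpacked dataset and print the top-level keys of every JSON file.
# The folder name "steam_data" is a placeholder -- point it at wherever the
# archive was actually extracted.
DATA_DIR = Path("steam_data")

for json_file in sorted(DATA_DIR.rglob("*.json")):
    with json_file.open(encoding="utf-8") as f:
        content = json.load(f)
    if isinstance(content, dict):
        summary = list(content)  # top-level keys
    else:
        summary = f"list of {len(content)} items"
    print(f"{json_file.relative_to(DATA_DIR)}: {summary}")
```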

Main Tasks

● Choose a dataset and design a database schema in third normal form (3NF), a.k.a.
Snowflake Schema (Facts + Dimensions), containing as much data from the original dataset as
possible (a minimal schema sketch follows this list).
● 1NF:
○ Each table cell should contain a single value (no lists)
○ No duplicate rows
● 2NF:
○ Every non-key field must depend on the whole primary key (no partial dependencies on
part of a composite key)
● 3NF:
○ No non-key field may depend on another non-key field (no transitive dependencies)
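
To make the normal forms concrete, here is a minimal, hypothetical 3NF slice of the chess dataset. In a flat games table, opening_name would depend on opening_code (a non-key field), violating 3NF, so openings become their own dimension table. All table and column names below are illustrative assumptions, not the required schema:

```python
import sqlite3

# Hypothetical 3NF chess schema: two dimension tables (players, openings)
# and one fact table (games). Splitting openings out removes the transitive
# dependency opening_code -> opening_name.
conn = sqlite3.connect("chess.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS openings (
    opening_code TEXT PRIMARY KEY,   -- e.g. ECO code 'B20'
    opening_name TEXT NOT NULL       -- e.g. 'Sicilian Defence'
);
CREATE TABLE IF NOT EXISTS players (
    player_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    elo       INTEGER
);
CREATE TABLE IF NOT EXISTS games (   -- fact table: one row per game
    game_id      INTEGER PRIMARY KEY,
    white_id     INTEGER REFERENCES players(player_id),
    black_id     INTEGER REFERENCES players(player_id),
    opening_code TEXT REFERENCES openings(opening_code),
    result       TEXT                -- '1-0', '0-1', or '1/2-1/2'
);
""")
conn.commit()
conn.close()
```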
Build ETL Pipelines for Data Warehousing

Main challenges
● ETL processes will need to be fully automated, driven by the folder and file structures,
since there are hundreds of files.
● Specific libraries might be needed for loading uncommon data types.

Main Tasks
● Build ETL pipelines to populate the Database:
○ Extract: load the data from the original sources. Do NOT download the dataset
manually; download it programmatically from the link and unzip it (see the sketch below).
○ Transform: merging, encoding, granularity
○ Load: normalize and load the data into the target system (SQL database)
● Run the pipeline to integrate as much data from the original dataset as possible
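
A minimal sketch of the extract step, assuming the dataset ships as a single zip archive; DATASET_URL is a placeholder for the link on the dataset slide, not the real address:

```python
import io
import zipfile
from pathlib import Path

import requests

# Download the compressed dataset and unpack it locally -- no manual step.
DATASET_URL = "https://example.com/dataset.zip"  # placeholder: use the slide's link
RAW_DIR = Path("raw_data")

response = requests.get(DATASET_URL, timeout=60)
response.raise_for_status()  # fail loudly on a bad link
with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
    archive.extractall(RAW_DIR)
print(f"Extracted {len(list(RAW_DIR.rglob('*')))} files to {RAW_DIR}/")
```

The transform and load steps can then iterate over RAW_DIR with the same kind of folder walk used in the exploration sketch, inserting rows into the 3NF tables.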
Create your own Analytics Application

Main challenges
● The dataset will need to be denormalized into a Star Schema (2NF) to make the SQL
queries behind the app efficient.

Main Tasks
● Build a BI dashboard (Power BI / Streamlit) or an ML model (Streamlit optional)
● The dataset used by the app must be a view over the normalized database, buildable with a
single SQL query (see the sketch after the examples below)
● Examples for Chess:
○ Openings Explorer (BI)
○ Player Repertoires and KPIs (BI)
● Examples for Steam Games:
○ Predict Sales for each game (ML)
○ Games/Publisher/Genre KPIs (BI)
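
A minimal sketch of such a view, reusing the hypothetical chess tables from the schema example above: one CREATE VIEW statement (a single SQL query) flattens the 3NF tables into a wide, star-schema-style table the app can read directly.

```python
import sqlite3

# Denormalization step: expose the 3NF tables as one flat view for the app.
# Table and column names match the earlier hypothetical schema, not the
# real dataset.
conn = sqlite3.connect("chess.db")
conn.executescript("""
CREATE VIEW IF NOT EXISTS games_flat AS
SELECT g.game_id,
       w.name AS white_player, w.elo AS white_elo,
       b.name AS black_player, b.elo AS black_elo,
       o.opening_code, o.opening_name,
       g.result
FROM games AS g
JOIN players  AS w ON w.player_id = g.white_id
JOIN players  AS b ON b.player_id = g.black_id
JOIN openings AS o ON o.opening_code = g.opening_code;
""")
conn.close()
```

Power BI or Streamlit can then query games_flat directly, with no joins at dashboard time.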
Deadlines & Deliverables

1 Explore and Manipulate an Unstructured Dataset


● Deliverables:
○ Database Schema in 3NF designed using Diagrams.net or any other tool.
● Deadline: January 5th, 2023

2 Build ETL Pipelines for Data Warehousing


● Deliverables:
○ ETL code (Python)
○ SQL database file in 3NF (Snowflake Schema)
○ Optional: Airflow DAG, if needed
● Deadline: February 2nd, 2023

3 Create your own Analytics Application


● Deliverables:
○ ETL code for denormalization (SQL query)
○ Power BI / Streamlit / ML app
○ Optional: if using Streamlit, deploy your app
● Deadline: February 23rd, 2023
● Project Defense: February 27th
