
Welcome to Data Engineering

February 13, 2020
Dieter Devlaminck
Overview

● Defining data engineering
● Course topics
● Class format, lab sessions, exam and project
About me

Previously

● Master in Informatics at Ghent University: reinforcement learning with echo state neural networks
● PhD in biomedical engineering at Ghent University: optimization of brain-computer interfaces
● Expert engineer at Institut de Recherche en Informatique et en Automatique (INRIA), France
● Post-doc at University of Antwerp: targeted advertising

Currently: full-time data engineer at PrediCube, a spinoff founded by Prof. D. Martens

[Image: CoAdapt P300 speller]
About you
Differentiating between data engineering and data science?

Defining a data engineer by differentiating it from a data scientist
A data scientist’s principal role is to find value or discover new opportunities in the company’s data, or to fulfill business needs using that data. The data scientist/analyst uses the company’s tools and infrastructure together with his/her knowledge of basic mathematics, machine learning and statistics.

The role of the data engineer is to provide the data scientist with the software infrastructure for fetching and processing the data, so that the data scientist can easily explore and gain insight into the data. He/she is responsible for deploying new models and applications, typically making use of a workflow management platform.

Besides supporting data science, the data engineer is more generally responsible for the processing of data.

https://www.element61.be/en/colleagues/data-engineer
Extract/Transform/Load (ETL)
The data engineer is responsible for implementing the interfaces that are necessary for managing the data flow and keeping the data available for analysis.

The data architect is usually the person responsible for the design of the whole system.

Typically there are many different data sources within the company. To enable data scientists to gain insight into that data and generate value, all that data should be accessible in a central repository in some uniform format.

[Diagram: several data sources feed a data warehouse via extract, transform and load steps]
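To make the extract/transform/load idea concrete, below is a minimal sketch in Python of a toy pipeline: two hypothetical sources (a CSV file and a JSON file) are extracted, transformed into one uniform record format and loaded into a central SQLite database. The file names, field names and schema are invented purely for illustration.

```python
import csv
import json
import sqlite3

# Toy ETL sketch: two hypothetical sources with different formats are extracted,
# transformed into a single uniform record format and loaded into a central
# SQLite table. File names and field names are made up for illustration.

def extract_csv(path):
    # Extract + transform: CSV rows -> uniform {"customer", "amount"} records.
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {"customer": row["name"], "amount": float(row["amount"])}

def extract_json(path):
    # Extract + transform: JSON objects -> the same uniform record format.
    with open(path) as f:
        for record in json.load(f):
            yield {"customer": record["client"], "amount": float(record["total"])}

def load(records, db_path="warehouse.db"):
    # Load: append the uniform records to a central table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:customer, :amount)", list(records))
    con.commit()
    con.close()

if __name__ == "__main__":
    records = list(extract_csv("shop_sales.csv")) + list(extract_json("web_sales.json"))
    load(records)
```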


The data pipeline

The set of processes to automatically extract data from different sources, transform it into some uniform format and store it in a central place defines the data pipeline.

The data pipeline can also contain production models made by data scientists. Depending on the requirements, these models have to run in real time, once per hour/day...

Data engineers need to maintain this data flow and ensure its availability and quality:

● make changes if data is added/removed
● solve bottlenecks in the pipeline
● monitor, log and solve errors
● handle duplicate, incorrect or corrupted data
● scale
● test
● ...
Workflow Management Platform

[Screenshot: DAG configuration and monitoring @PrediCube]

The advantages of having a good data pipeline and a central data repository:

● Makes reporting and results of analysis more consistent (e.g. if the data scientist has to combine multiple sources each time he/she wants to do an analysis, chances are that this goes wrong at some point)
● The data scientist does not need to bother with all the interfaces and formats of the different data sources, and can focus on the job of generating value from the data
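As an illustration of what such a workflow management platform looks like in code, here is a minimal sketch of a daily ETL DAG, assuming Apache Airflow 2.x (the slide does not say which tool PrediCube uses); the task callables and schedule are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; a real pipeline would call the actual interfaces
# to the data sources and the central repository here.
def extract():
    print("pull data from the sources")

def transform():
    print("convert the data to a uniform format")

def load():
    print("write the data to the central repository")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2020, 2, 13),
    schedule_interval="@daily",  # run the pipeline once per day
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # The dependencies define the DAG: extract -> transform -> load
    t_extract >> t_transform >> t_load
```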
Typical skill set of a data engineer
● Knowledge of Python, Java or Scala
● Experience with big data technologies: Hadoop Map-Reduce, Spark, Drill, Hive, Pig,
Storm...
● Knowledge of algorithms and data structures
● Knowledge of operating systems, mainly Linux-based
● Understanding the basics of distributed systems and parallel programming
● Experience with cloud platforms, like Amazon Web Services
● Good understanding of SQL and NoSQL databases
● Basic machine learning knowledge
● Knowledge of DevOps tools: version control, testing frameworks, application
containers....
● Basic knowledge of how computers and networks work
● REST APIs
● Some software engineering skills: design and architecture, agile development...
Job offer at Levi Strauss, Zaventem (indeed.com)

A small overview of the data engineer’s tools, technologies and platforms
About this course
Goal

To give an introduction to some important techniques and technologies for handling data in practice
(and setting up data projects/pipelines)

The focus won’t be on machine learning and data mining techniques. Instead, the course should introduce you to numerous practical tools for storing and processing (large amounts of) data, help you understand their complexities, and teach you when to use which tools and technologies.

You won’t necessarily gain in-depth knowledge of any single topic, but at the very least you will know where and what to look for in order to solve your problem.

It should also enable you to communicate more effectively with developers, software engineers and data engineers.
Where we will deviate
Data engineers typically have a strong background in computer science and programming languages
such as Python, Java, Scala… That’s why we will give an introduction to some typical computer
science subjects like data structures, algorithms, operating systems and distributed/parallel computing

There are some topics we won’t touch, or will only touch partially:

● Software architecture and design (e.g. microservices)
● Web technology
● Versioning (e.g. git)
● Continuous integration and delivery (e.g. Jenkins)
● Containers (e.g. Docker) and container orchestration (e.g. Kubernetes)
● Monitoring and logging systems
● Handling streaming data
● ...
Instead

Although not strictly the task of data engineers:

● We will revisit some algorithms from the Data Mining course and see some extensions to handle larger amounts of data
● We will talk about visualization tools for exploring data and reporting
Overview

Five parts:

● Foundations of computer engineering: hardware, algorithms and data structures, Linux as OS, regexes and cloud services
● Storing and querying data: SQL and NoSQL databases, Hadoop Map-Reduce, Spark
● Some data mining techniques extended for handling larger amounts of data
● Big Data technologies and REST APIs
● Visualisation
Part 1: Foundations
● Data representations and file formats: JSON, XML, CSV, Parquet, compressed or not...
● Basic computer architecture: memory, disk, CPU, GPU, parallel processes...
● Operating system: Linux command line tools for navigating cloud servers
● Computer networks
● Cloud services: AWS as an example
● Basic insight in complexity analysis of algorithms
● Data structures to efficiently access and store data
● Regular expressions for matching patterns in text
Part 2: Data Querying

● SQL databases: focus on practicing querying relational data from a database (MySQL as an example)
● Massively parallel data processing: Map-Reduce algorithms and theory, with practice sessions using Apache Spark (a small plain-Python sketch of the map-reduce idea follows below)
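As a small preview of the Map-Reduce part, the sketch below shows the map/shuffle/reduce idea in plain Python on the classic word-count example; in the lab sessions this will be done with Apache Spark rather than hand-rolled functions.

```python
from collections import defaultdict

def map_phase(doc):
    # Map: emit one (word, 1) pair per word in the document.
    for word in doc.lower().split():
        yield word, 1

def shuffle_and_reduce(pairs):
    # Shuffle: group values by key; reduce: sum the counts per word.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

docs = ["spark makes map reduce easy", "map reduce scales map jobs"]
pairs = (pair for doc in docs for pair in map_phase(doc))
print(shuffle_and_reduce(pairs))  # e.g. {'spark': 1, ..., 'map': 3, 'reduce': 2, ...}
```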
Part 3: Data Mining

We will revisit some data mining techniques and discuss ways to extend them
for use on larger data sets:

● Apriori algorithm for frequent-itemset mining
● Hashing as a powerful technique for finding similar elements
● Clustering (k-means) and dimensionality reduction
● Predictive models (perceptron)
Overview of the class weeks and their topics

● week 1 (13/02): intro to the course, computer architecture, file formats and python basics
● week 2 (20/02): computer networks, linux and regexes
● week 3 (27/02): json queries, cloud services, algorithms and their complexity
● week 4 (05/03): parallel computing, data structures and community detection algorithm as use case
● week 5 (12/03): databases: SQL (and NoSQL) (Guest Speaker: Vinayak Javaly)
● week 6 (19/03): map-reduce framework and algorithms for matrix multiplication and pagerank
● week 7 (26/03): frequent itemsets and minhashing
● week 8 (02/04): no class: honorary doctorate award ceremony (no classes in the morning or afternoon)
● week 9 (23/04): clustering, dimensionality reduction and predictive models at scale, naive Bayes map-reduce implementation
● week 10 (30/04): visualisation
● week 11 (07/05): taxonomy of big data technologies, REST APIs
● week 12 (14/05): PrediCube use case + questions
Class format
General format of the weekly sessions

● Theory session each Thursday morning between 10.30 and 12.30
● Lab/practice session in the afternoon between 14.00 and 16.00
● Lab sessions will focus on trying out the techniques/technologies discussed in the morning session: alone or in groups of two
● No need to hand in anything during lab sessions, but it’s a good opportunity to better understand the theory in preparation for the exam

Teaching assistants:

Yanou Ramon, Stiene Praet, Tom Vermeire, Kevin Milis


For the lab sessions we use Python as the language

● Used in the course Data Mining
● Easy to prototype and get results fast
● Big community which provides a lot of libraries for data science
● It’s a popular language for data science/engineering in industry as well

It’s less ideal for implementing computational algorithms: Python is an interpreted language and is quite slow.
Lab session locations

- Week 21: P010 + D013
- Week 22: P010 + P011
- Week 23: P010 + P011
- Week 24: P010 + S.D.013
- Week 25: P010 + R231
- Week 26: P010 + P011
- Week 27: P010 + P011
- Week 31: P010 + P011
- Week 32: P010 + P011
- Week 33: P010 + P011
- Week 34: P010 + S.D.013
- Week 35: P010 + P011
Not enough computer rooms available

Some weeks the lab sessions will be spread across P010 and a non-computer classroom.

Consequence: students will need to install some software on their laptop for some of these lab sessions.

Week 21: P010 + D013 → install Python 3 and Jupyter Notebook
Week 24: P010 + S.D.013 → install Python 3 and Jupyter Notebook
Week 25: P010 + S.D.014 → install MySQL Server and MySQL Workbench (see tutorial)
Lab sessions

● week 1 (13/02): Python extras and/or a Python tutorial for those who have no experience with Python
● week 2 (20/02): finding patterns in text defined by regular expressions, processing data chunks in parallel
● week 3 (27/02): setting up a server on Amazon’s AWS cloud
● week 4 (05/03): data structures and algorithms for sorting and community detection
● week 5 (12/03): practice session on querying a relational database
● week 6 (19/03): processing IMDB data using Spark map-reduce
● week 7 (26/03): applying minhashing and frequent itemset mining on the YahooMovie/Movielens database
● week 9 (23/04): clustering and predictive models at scale using Spark
● week 10 (30/04): project
● week 11 (07/05): project
● week 12 (14/05): project
Project

● Choice of 5 projects: extract data, transform it, load it into a database, visualize it in a dashboard
● Groups of 3: preferably people with different backgrounds
● 5 points out of 20
Exam

● Oral exam with written preparation
● Closed book
● Expect questions about the project
● Expect to do exercises on paper instead of on the computer
