1 Intro
Engineering
February 13, 2020
Dieter Devlaminck
Overview
Previously
The role of the data engineer is to provide the data scientist with the
software infrastructure for fetching and processing the data, so that the
data scientist can easily explore and gain insight into the data. He/she is
responsible for deploying new models and applications, typically making
use of a workflow management platform
https://www.element61.be/en/colleagues/data-engineer
Extract/Transform/Load (ETL)
The data engineer is responsible for implementing the interfaces that are necessary for
managing the data flow and keeping the data available for analysis
The data architect is usually the person responsible for the design of the whole system
Typically there are many different data sources within the company. To enable data scientists to
gain insight into that data and generate value, all that data should be accessible in a central place
[Diagram: several data sources feed an extract → transform → load flow into a central data warehouse]
The set of processes to automatically extract data from different sources, transform it into some
uniform format and store it in a central place defines the data pipeline
The data pipeline can also contain production models made by data scientists. Depending on the
requirements, these models have to run in real time, once per hour/day...
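As a minimal sketch of such a pipeline in Python (the file names, the sales table and its columns are invented for the example, and a SQLite file stands in for the central data warehouse):

# Minimal ETL sketch: extract rows from two (hypothetical) CSV sources,
# transform them into one uniform format, and load them into a central table.
import csv
import sqlite3

def extract(path):
    # Extract: read the raw rows from one data source
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows, source_name):
    # Transform: normalise every row to a uniform schema and tag its origin
    return [(source_name, row["customer"].strip().lower(), float(row["amount"]))
            for row in rows]

def load(rows, conn):
    # Load: append the cleaned rows to the central table
    conn.executemany("INSERT INTO sales (source, customer, amount) VALUES (?, ?, ?)", rows)
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS sales (source TEXT, customer TEXT, amount REAL)")
    for name, path in [("webshop", "webshop.csv"), ("stores", "stores.csv")]:
        load(transform(extract(path), name), conn)

In a real pipeline a workflow management platform would schedule and monitor these steps instead of a single script.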
Data engineers need to maintain this data flow and ensure its availability and quality:
● Makes reporting and results of analysis more consistent (e.g. if the data scientist has to combine
multiple sources each time he/she wants to do an analysis, chances are that this goes wrong at
some point)
● The data scientist does not need to bother with all the interfaces and formats of the different data
sources, so they can focus on their job of generating value from the data
Typical skill set of a data engineer
● Knowledge of Python, Java or Scala
● Experience with big data technologies: Hadoop Map-Reduce, Spark, Drill, Hive, Pig,
Storm...
● Knowledge of algorithms and data structures
● Knowledge of operating systems, mainly Linux-based
● Understanding the basics of distributed systems and parallel programming
● Experience with cloud platforms, like Amazon Web Services
● Good understanding of SQL and NoSQL databases
● Basic machine learning knowledge
● Knowledge of DevOps tools: version control, testing frameworks, application
containers...
● Basic knowledge of how computers and networks work
● REST APIs
● Some software engineering skills: design and architecture, agile development...
Job offer at Levi Strauss
Zaventem (indeed.com)
A small overview of the data engineer’s tools,
technologies and platforms
About this course
Goal
To give an introduction to some important techniques and technologies for handling data in practice
(and setting up data projects/pipelines)
The focus won’t be on machine learning and data mining techniques. Instead, the course should introduce
you to numerous practical tools for storing and processing (large amounts of) data, help you understand
their complexities and teach you when to use which tools and technologies
You won’t necessarily have in-depth knowledge of anything, but at the very least you will know where
and what to look for in order to solve your problem
It should enable you to communicate more effectively with developers, software and data engineers
Where we will deviate
Data engineers typically have a strong background in computer science and programming languages
such as Python, Java, Scala… That’s why we will give an introduction to some typical computer
science subjects like data structures, algorithms, operating systems and distributed/parallel computing
● We will revisit some algorithms from the Data Mining course and see some extensions to handle
larger amounts of data
● Talk about visualization tools for exploring data and reporting
Overview
Five parts:
● Foundations of computer engineering: hardware, algorithms and data structures, Linux as OS,
regexes and cloud services
● Storing and querying data: SQL and NoSQL databases, Hadoop Map-Reduce, Spark
● Some data mining techniques extended for handling larger amounts of data
● Big Data technologies and REST APIs
● Visualisation
Part 1: Foundations
● Data representations and file formats: json, xml, csv,
parquet, compressed or not...
● Basic computer architecture: memory, disk, cpu, gpu,
parallel processes…
● Operating System: Linux command line tools for
navigating cloud servers
● Computer networks
● Cloud services: AWS as an example
● Basic insight in complexity analysis of algorithms
● Data structures to efficiently access and store data
● Regular expressions for matching patterns in text (see the small sketch after this list)
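To make the last point concrete, a small sketch with Python’s re module (the log lines and the pattern are invented for the example):

# Minimal regex sketch: pull the date and the level out of (invented) log lines.
import re

log = """2020-02-13 10:04:12 ERROR disk full on /data
2020-02-13 10:05:01 INFO backup finished"""

pattern = re.compile(r"(\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2} (ERROR|INFO)")

for line in log.splitlines():
    match = pattern.match(line)
    if match:
        date, level = match.groups()
        print(date, level)   # e.g. 2020-02-13 ERROR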
Part 2: Data Querying
We will revisit some data mining techniques and discuss ways to extend them for use on larger data sets.
Schedule
● week 1 (13/02): intro to the course, computer architecture, file formats and python basics
● week 2 (20/02): computer networks, linux and regexes
● week 3 (27/02): json queries, cloud services, algorithms and their complexity
● week 4 (05/03): parallel computing, data structures and community detection algorithm as use case
● week 5 (12/03): databases: SQL (and NoSQL) (Guest Speaker: Vinayak Javaly)
● week 6 (19/03): map-reduce framework and algorithms for matrix multiplication and pagerank
● week 7 (26/03): frequent itemsets and minhashing
● week 8 (02/04): no class: honorary doctorate ceremony (no classes in the morning or afternoon)
● week 9 (23/04): clustering, dimensionality reduction and predictive models at scale, naive bayes map-reduce
implementation
● week 10 (30/04): visualisation
● week 11 (07/05): taxonomy on big data technologies. REST APIs.
● Week 12 (14/05): PrediCube use case + questions
Class format
General format of the weekly sessions
Teaching assistants:
Python is less ideal for implementing computational algorithms: it is an interpreted language and quite
slow
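As a rough illustration of this point, the same computation in a pure-Python loop and as a vectorised NumPy call (the array size is arbitrary):

# The interpreted loop dispatches bytecode for every element; the NumPy call
# runs the loop in compiled code and is typically orders of magnitude faster.
import time
import numpy as np

data = np.random.rand(1_000_000)

start = time.time()
total = 0.0
for x in data:
    total += x * x
print("pure Python:", time.time() - start, "seconds")

start = time.time()
total = np.sum(data * data)
print("NumPy:", time.time() - start, "seconds")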
Lab session locations
Some weeks the lab sessions will be spread across P010 and a non-computer classroom
Consequence: students will need to install some software on their laptop for some of these lab sessions
● week 1 (13/02): Python extras and/or Python tutorial for those who have no experience with
Python
● week 2 (20/02): finding patterns in text defined by regular expressions, processing data chunks
in parallel
● week 3 (27/02): setting up a server on Amazon’s AWS cloud
● week 4 (05/03): data structures and algorithms for sorting and community detection
● week 5 (12/03): practice session on querying a relational database
● week 6 (19/03): processing IMDB data using Spark map-reduce
● week 7 (26/03): applying minhashing and frequent itemset mining on YahooMovie/Movielens
database
● week 9 (23/04): clustering and predictive models at scale using Spark
● week 10 (30/04): project
● week 11 (07/05): project
● week 12 (14/05): project
Project
● Choice of 5 projects: extract data, transform, load into database, visualize in dashboard
● Groups of 3: preferably people of different backgrounds
● 5 points out of 20
Exam