Make Data Pipeline

Building a Data Pipeline from Scratch | by Alan Marazzi... https://medium.com/the-data-experience/building-a-dat...
Building a Data Pipeline from Scratch

Alan Marazzi
Jun 20, 2016 · 9 min read
This is the story of my first project as a Data Scientist: fighting with databases, Excel
files, APIs and cloud storage. If you ever had to build something like this you know
exactly what I’m talking about.
“Why are you using data and pipeline in the same sentence?”
For those who don’t know it, a data pipeline is a set of actions that extract data (or
directly produce analytics and visualizations) from various sources.
1 of 11 10/28/20, 9:22 AM
Extract, Transform, Load
It is an automated process: take these columns from this database, merge them
with these columns from this API, subset rows according to a value, substitute NAs
with the median and load them in this other database. This is known as a “job”, and
pipelines are made of many jobs.
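The job described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the setup used in the original project: the function `run_job`, the fields `id`, `age` and `spend`, the `customers` table and the `min_age` threshold are all hypothetical names chosen for the example.

```python
import sqlite3
from statistics import median


def run_job(source_rows, api_rows, target_conn, min_age=18):
    """One illustrative job: merge, subset, impute, load (all names are hypothetical)."""
    # Merge: attach the API column to the database rows, matching on id.
    api_by_id = {row["id"]: row for row in api_rows}
    merged = [
        {**row, "spend": api_by_id[row["id"]]["spend"]}
        for row in source_rows
        if row["id"] in api_by_id
    ]

    # Subset: keep rows according to a value.
    kept = [row for row in merged if row["age"] >= min_age]

    # Impute: substitute missing values (None) with the median of the known ones.
    known = [row["spend"] for row in kept if row["spend"] is not None]
    fill = median(known) if known else None
    for row in kept:
        if row["spend"] is None:
            row["spend"] = fill

    # Load: write the result into the target database.
    target_conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(id INTEGER PRIMARY KEY, age INTEGER, spend REAL)"
    )
    target_conn.executemany(
        "INSERT OR REPLACE INTO customers (id, age, spend) "
        "VALUES (:id, :age, :spend)",
        kept,
    )
    target_conn.commit()
    return kept
```

Calling `run_job(rows_from_db, rows_from_api, sqlite3.connect("warehouse.db"))` on a schedule would make this one automated job; a real pipeline would chain many of them.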
Why do we need an automated pipeline, you ask?
First, we will have most of the data we care about in one place and in the same
format.
Second, we don’t have to do it ourselves every once in a while.
Third, it is reproducible (more on reproducibility later).
Business needs != Research Questions

Big Data, Machine Learning, AI and Data Science are just buzzwords, right? Mmh.
Probably not. Or better: yes. The gist is that as long as businesses invest in these
“things”, people will work on them. Full stop.
The fact is that these buzzwords are often misused, especially in a business context.
We can have small Big Data (just use HBase for far less than terabytes of data), we can
have “Artificial Stupidity” (apply the same algorithm to all business problems) and
we can have Data Ignorance (bad communication is enough to let ignorance
spread like a disease).
The usual businessman reaction to Big Data
In my opinion the important thing is not to lose focus: tools are means to an end,
and not vice versa. Hadoop is neither bad nor good per se, it is just a way to store
and retrieve semi-structured data. Saying otherwise would be like saying that
chainsaws are bad because someone uses them to kill people.
We have come to the point. The questions are the important thing, and the tools
are the means to reach the answers to these questions.
http://xkcd.com/
Businesses are facing tough times not only because of an elusive lack of talent, but
also because they need to adopt processes and practices typical of academia. This is
disruptive for them. For decades business people used to mock academia: it’s slow,
it’s bureaucratic, it isn’t cost effective, etc.
Now they’re mocking it less, or at least they should. Businesses are starting to
understand that there isn’t only one question, “How do we make more money?”,
but many questions connected to it, and that making more money is an objective rather than a question.
“What’s the cost connected to the retention of churning customers?” “Can we reduce
it by focusing on customers before they churn?” “Which customers are going to
churn?” These are questions that can be answered with data, but many people
are not used to stating issues in this way.
So the first problem when building a data pipeline is that you need a
translator. This translator is going to try to understand what the real questions
tied to business needs are. This can be a long, slow process, especially if starting
from scratch.
Infrastructure or: how I learned to stop worrying and love SQL

On the internet you’ll find countless resources about pipeline and warehouse
infrastructure possibilities. You won’t find as many resources on the process to
follow or on best practices. This is bad.
A bit dated, but always good.
Here’s the second risk: many will focus on tools and technology, not on ends. This
is bad as well.
Instead of thinking about what would be the easiest, fastest and most efficient solution
to get an answer to business questions, some people are going to focus on raw power
and huge amounts of storage.
Fun fact: working with big data is relatively easy (once it’s all set up); the really
difficult thing is working with small data. Or worse, with non-existent data.
Usually you’ll have someone sitting at a table talking about scalability and about the
limitless possibilities of some tool. It’s easy to be dragged into the discussion without
thinking that in order to have Big Data one has to have, well, a lot of data.
For using Neural Networks you need a huge amount of data
It can happen that some information is lost in businesses because it is not collected at
all, or not consistently. When you find out that some data you would like to analyze
are not collected consistently, you’ll have a small data problem (or worse, a zero
data problem).
Sure, you can envision a new data collection process, but for some time you’ll be
stuck with little or no data. In this case tools won’t be able to help you. Either
you’re able to perform robust (literally robust, not the methods) statistical analysis
or you’ll have to wait.
The point is that the infrastructure must be simple and efficient: one can even think
of integrating missing information through a new collection process directly within
the pipeline.
Process — Project Every Tiny Detail

The process is the most important step. You will define what, where and how data
are collected, transformed and loaded. Though we hear every day about AI
and its endless possibilities, there is still at least one thing it cannot do yet: decide
the pipeline process.
This means that you’ll need to manually pick every field, table, data source,
transformation, join, etc. The good news is that if you do it right you’ll have to do it
just once. Afterwards everything will be automated.
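One way to picture "picking every field and transformation by hand, once" is to write the job down as data and let a small runner execute it. The sketch below is only an assumption about how this could look: the step functions `select_fields` and `filter_rows`, the `JOB` list and the field names are invented for illustration.

```python
def select_fields(rows, fields):
    """Keep only the fields that were explicitly chosen for this job."""
    return [{name: row[name] for name in fields} for row in rows]


def filter_rows(rows, field, equals):
    """Keep only the rows whose `field` has the requested value."""
    return [row for row in rows if row[field] == equals]


# Hypothetical job description: every field and transformation is spelled out
# once, by hand, and then executed the same way on every run.
JOB = [
    (select_fields, {"fields": ["customer_id", "country", "revenue"]}),
    (filter_rows, {"field": "country", "equals": "IT"}),
]


def run(job, rows):
    """Apply each declared step in order and return the final rows."""
    for step, params in job:
        rows = step(rows, **params)
    return rows
```

Because the whole job lives in one written description, it can be reviewed, versioned and rerun without anyone repeating the steps manually.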
Vincent Vega unable to reproduce some calculation in Excel
Why is automation so important? Well, of course because you won’t have to do the
same thing over and over, saving a lot of time. But probably the most important
reason is that to have automation you need to think through, plan and write down
somewhere, in some sort of language, the whole process.
This makes it reproducible, which means it can be reproduced by almost anyone
and nearly everywhere (only if they have access to the data, of course). Security
and the ability to back up the process are important too, but the major feature is
that you can debug it.
In this case debugging refers not only to the code, but to the whole process.
What if you have a transformation of a categorical variable from 0–1 to Male-Female,
but later on you find out that it is not 0 = Male, but 0 = Female?
The proper reaction when you find out the coding of the variable is wrong
If you have a well-built and structured pipeline, you can find the
wrong transformation, change it, and that’s it. If you don’t have a pipeline, either you
go changing the coding in every analysis, transformation, merge, data whatever,
or you have to consider every analysis made before void.
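To make the 0 = Male versus 0 = Female mix-up concrete, here is a small sketch in which the coding lives in exactly one place. The names `SEX_CODES`, `decode_sex` and `count_by_sex` are hypothetical, and the mapping shown is simply assumed to be the corrected one.

```python
# Hypothetical coding table: the single place where the 0/1 coding is defined.
# If it turns out to be wrong, this is the only line that has to change.
SEX_CODES = {0: "Female", 1: "Male"}


def decode_sex(rows, codes=SEX_CODES):
    """Return copies of the rows with the numeric `sex` code replaced by its label."""
    return [{**row, "sex": codes[row["sex"]]} for row in rows]


def count_by_sex(rows):
    """A downstream analysis that relies on the shared decoding step."""
    counts = {}
    for row in decode_sex(rows):
        counts[row["sex"]] = counts.get(row["sex"], 0) + 1
    return counts
```

Every analysis that goes through `decode_sex` picks up a correction automatically, which is the difference from fixing the coding by hand in each spreadsheet or script.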
There are clear issues with both “no-pipeline-no-party” solutions. Businesses must
understand that it is much better to spend a bit more time up front, when building the
pipeline, than to risk losing months and/or even money because of wrong
decisions later on.
Data Driven Decision Making: a matter of culture

What comes first: the evidence based decision making or the data culture?
It’s the usual chicken-and-egg problem and it can be tricky to solve. The usual answer
is that the data culture has to come from somewhere. It’s not unusual to find
people in many businesses who were never exposed to structured data and
never had to make decisions based on evidence.
For these people even talking about averages, medians, distributions and other
simple descriptive statistics can be overwhelming. During the startup phase it’s
important to not overload people with data: there’s a reason downloading a
database to a file is called “dump”.
You have to carefully decide which metrics really matter for every business area and
which ones have the best chance of making something click in people’s heads. The only way to find
out is to talk to them, both managers and employees. Expect the
unexpected, especially if you’re undertaking this process from zero.
It’s almost certain you’ll find duplication of data, as well as a lack of potentially
vital information. During this phase you can also start to present your
objectives and let people start to enter a new mindset.
This is important because the best solutions are those that make employees
virtually independent in accessing, visualizing and analyzing data. Independence is a
goal; to reach it you’ll have to guide and help people. As for the pipeline, it will take
more time at the beginning but it is going to pay off in the long run.
For many people this would be a legit working dashboard
At the beginning don’t use very fancy modeling: simple insights and descriptive
statistics will be more than enough to uncover many major patterns.
I find that explaining carefully to people the difference between mean and median
and why they should almost always use the median is already enough as one of
the first steps. For some this will already be pretty difficult to understand, since we
have means ingrained very deeply in our reasoning.
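A tiny example can help when explaining the difference. The numbers below are made up for illustration: five ordinary orders and one very large one.

```python
from statistics import mean, median

# Hypothetical order values: five ordinary orders and one very large one.
order_values = [20, 22, 25, 27, 30, 1000]

typical_by_mean = mean(order_values)      # pulled far above every ordinary order
typical_by_median = median(order_values)  # stays among the ordinary orders

print(f"mean:   {typical_by_mean:.2f}")
print(f"median: {typical_by_median:.2f}")
```

A single outlier is enough to drag the mean well above anything a typical customer spends, while the median stays in the middle of the ordinary orders.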
To thrive, a data culture must be raised slowly but firmly: the first point is to let
people trust data. Accuracy is always better than precision: don’t be afraid to give
ranges and confidence intervals and explain them carefully.
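As one possible way to produce such a range, the sketch below uses a percentile bootstrap around the median. This is a technique chosen for the example, not something the article prescribes; the function name `bootstrap_median_interval` and its parameters are hypothetical.

```python
import random
from statistics import median


def bootstrap_median_interval(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap range for the median (illustrative sketch only)."""
    rng = random.Random(seed)  # fixed seed so the range is reproducible
    # Resample the data with replacement many times and record each median.
    medians = sorted(
        median(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    # Cut off an equal share of resampled medians in each tail.
    tail = (1.0 - level) / 2.0
    lower = medians[int(tail * n_resamples)]
    upper = medians[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lower, upper
```

Reporting "the typical value is probably between `lower` and `upper`" is less precise than a single number, but it is more honest about what a small sample can support.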
Finally! Some Analysis eventually

The whole pipeline process must be designed around the analysis you
would like to perform and present. Visualization is also an important goal, and I
can never stress enough how important it is to present data in a good way.
A good example of what you shouldn’t do
You can use the fanciest models, the latest convolutional neural network and get the
best possible results, but if you’re unable to communicate them effectively you’ll
have a tough time convincing people of their value.
Besides, most of the business can be run efficiently with very simple metrics, and
more advanced models are usually left for production or one-shot analysis. Just a
few of them will ever enter the pipeline, and it is good practice to put them as
close as possible to the last step. The reason is that you will change and tweak them
many times, and the less they are ingrained in the pipeline, the better.
After analysis the process will restart: it is a loop. You’ll find something interesting
and you’ll want to dig deeper. After discovering a new trend or correlation you’ll
want to monitor it constantly, and a new process in the pipeline is born.
When you reach this step you’ll congratulate yourself for all the hard work spent
on building a pipeline!
Enjoyed it? Hated it? Was it useful? Let me know with a comment, or get
in touch on Twitter! If you want to keep following my adventure follow me
on Medium.