
5 things you should know for a career in data engineering
stitchdata.com/blog/5-things-you-should-know-for-career-in-data-engineering

23 January 2019

The demand for big data professionals has never been higher. "Machine Learning
Engineers, Data Scientists, and Big Data Engineers rank among the top emerging jobs on
LinkedIn," Forbes proclaims. Many people are building high-salary careers working with
big data. We've already talked about things you should know before getting a job in data
science — now let's talk about data engineering.

First, you should know that a data science degree isn't training for a data engineering
career. Data science is heavily math-oriented. By contrast, data engineers work primarily
on the tech side, building data pipelines. What the two roles have in common is that both
work with big data.

Working with big data often takes a big team. Data engineers work with people in roles
like data warehouse engineer, data platform engineer, data infrastructure engineer,
analytics engineer, data architect, and devops engineer.

To help students and mid-career professionals decide whether data engineering is for
them, we spoke with people who've worked as data engineers themselves and hired data
engineering teams:

Jesse Anderson is both a data engineer and managing director of the Big Data
Institute. He's the author of The Ultimate Guide to Switching Careers to Big Data, a
79-page ebook that would interest anyone who enjoys this post.
Paul Lappas is co-founder and CEO of Intermix, which provides a monitoring
dashboard for data engineers to keep an eye on mission-critical data flows.
Alex Ng is a senior data engineer for Manifold, a full-service AI consulting company
offering a complete range of AI engineering services.
Stephanie Tam is an engineering director for Kyruus, which uses data to help solve
the enterprise-wide patient access paradox.

1) You must be a strong developer


Everyone agrees that you need strong developer skills for a data engineering job.

"You'll have to write scripts and maybe some glue code," Ng says. "Everything is code
now: infrastructure as code, pipeline as code, etc. Courses are OK but nothing beats real-
world experience. A textbook doesn't teach you how to handle a data pipeline outage – at
least none of mine did!"

In a blog post about what he looks for in a data engineer, Anderson said, "I can't stress
enough how important it is for a data engineer to have a strong programming
background. They also need a love of or at least an interest in data, in finding patterns in
data, otherwise they may find the work boring. Also, they have to like and have the ability
to create systems that are difficult and complex. Big data projects are 10 times more
complex than small data. So it's a love of data combined with a love of programming to
create data pipelines."

In addition to being comfortable coding, Lappas says, "You have to have the operations
mindset that uptime is critically important. You have to be careful how you build your
infrastructure for reliability, so that any changes won't break any of the pieces. Devops
experience is very valuable. And you need DBA skills." In fact, he says, "I generally see the
title data engineer in midmarket and smaller companies. When you get to bigger
companies the title is still DBA or senior DBA."

2) You need to know about a lot of technologies


Lappas says, "A data engineer has three main duties:

To ensure that the data pipeline – the acquisition and processing of data – is
working
To serve the needs of internal customers – the data scientists and data analysts
To control the cost of moving and storing data

"The critical skills are SQL, Python, and R, and ETL methodologies and practices."
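To make "SQL, Python, and ETL" concrete, here is a minimal sketch of an extract-transform-load step using only the Python standard library. It is an illustration rather than anything the people quoted here described: the sample CSV, the `payments` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """user_id,amount,currency
1,19.99,usd
2,,usd
3,5.00,USD
"""

def extract(text):
    """Read CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Drop rows with no amount, then normalize types and currency casing."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["user_id"]), float(row["amount"]), row["currency"].upper())
        )
    return cleaned

def load(conn, records):
    """Write the cleaned records into a SQL table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(user_id INTEGER, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", records)
    conn.commit()

# SQLite in memory stands in for a warehouse in this sketch.
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

# Once loaded, analysts can query the data with plain SQL.
total = conn.execute(
    "SELECT COUNT(*), ROUND(SUM(amount), 2) FROM payments WHERE currency = 'USD'"
).fetchone()
print(total)
```

Keeping extract, transform, and load as separate functions is one common way to make each stage testable on its own; real pipelines add scheduling, retries, and monitoring on top.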

Tam confirms the value of knowing SQL and having competency in a language. "Just
understanding the foundation of a language will allow you to work at any company."

Anderson says most data engineering is done in Java, but "you have to be aware that most
universities are teaching programming from an academic point of view, and there's a
disparity between what industry wants and what academia is providing. A university may
have classes on programming, but people who want to become data engineers may have to
learn the technical and systems side on their own."

But there's more to being a data engineer than knowing SQL and programming, he says.
"A qualified data engineer's value is to know the right tool for the job. People who are new
to big data think I'm exaggerating when I say that data engineers need to know 10 to 30
different technologies to choose the right tool for the job in technologies, such as:

Apache Hadoop
Apache Spark
Apache Hive
Apache HBase
Apache Impala
Apache Kafka
Apache Crunch
Hue
Apache Oozie
Apache NiFi
Apache Flink
Apache Apex
Apache Storm
Heron
Apache Beam
Apache Cassandra

To make the right decision in choosing, for example, a NoSQL cluster, you'll need to have
learned the pros and cons of five to 10 different NoSQL technologies," Anderson says.
From that list, you can narrow it down to two to three for a more in-depth look.

Learning all those tools takes time. You can't spend a weekend watching YouTube
videos and studying MOOCs and expect to do well in a job interview. Also,
Anderson cautions that many low-cost classes are useless. "They're too general, taught by
people with not enough knowledge, and they won't help you get a job." Similarly, most
certifications just don't make enough of a difference to be worth it. "You'd be better off
putting your time and money into a better personal project that shows true mastery.

"The best options are to get professional training, read books, and work on big data
projects. You have to both internalize the knowledge and practice it. If you've learned
passively but never practiced, you won't be able to code a project, and that will come out
in an interview. Practice practice practice!"

Most companies standardize on a single vendor's suite of cloud computing services, so Ng
recommends, "Go deep on one of the big three: Google Cloud Platform (GCP), Amazon
Web Services (AWS), or Microsoft Azure. Understand how their service offerings can be
used as building blocks for highly scalable and available data pipelines."

"So if your company uses AWS, for instance," Lappas adds, "you'd need to know Amazon
Redshift, Amazon EMR, Amazon Athena, and Amazon S3."

Finally, Ng says, "A neglected skill that I feel is core to being a good engineer is technical
diagramming. Data engineers will inevitably need to map out their pipeline architectures
in a clear and presentable way. No amount of words can beat a thoughtful and clean
diagram. My advice would be to pick a diagramming tool and get really good with it. My
favorite is Lucidchart and I use it for everything from AWS architecture diagrams to block
diagrams and flow charts."

3) Experience beats education
How do you pick up all those skills? Typically, on the job. Everyone we spoke with told us
it wasn't necessary to have an advanced degree to get a job as a data engineer.

"I think education has its place, but a lot of things you don't learn until you operate in the
real world, meaning you have to deal with real customers," Ng says. "I think anyone with a
software background that has had some experience in operations or systems can make a
smooth transition to data engineering. A lot of the skills that devops and site reliability
engineers have overlap significantly with data engineering responsibilities."

"A lot of the skills you need you can pick up yourself or on the job," Lappas agrees. "If
you're just starting out, I recommend starting as an analyst to get a feel for the business
value that the data brings. Eventually you can move down the stack into data
engineering."

Nevertheless, he says, training in both software development and data science skills such
as statistics and math is important. "Data engineers are responsible for acquiring data for
data scientists and data analysts, who need all the company's data available in a format
that lets them query it with the tool of their choice. The data engineer has to migrate it
from where it lives and transform it so that it makes sense to the data scientists and data
analysts. That may require aggregating it and running statistical methods to derive higher
insights. For example, if a mobile app generates 10,000 events per second, chances are
you're going to have to do some transformation on that raw data to make it useful for the
rest of the data team."
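As a rough illustration of the kind of transformation Lappas describes, the sketch below rolls raw app events up into per-minute counts. The event names, timestamps, and the `events_per_minute` helper are hypothetical; a production pipeline handling thousands of events per second would more likely do this in a stream processor or in the warehouse, but the aggregation idea is the same.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical raw events: (Unix timestamp in seconds, event name).
raw_events = [
    (1548230400, "app_open"),
    (1548230401, "tap"),
    (1548230459, "tap"),
    (1548230460, "app_open"),
    (1548230475, "tap"),
]

def events_per_minute(events):
    """Count events in each (minute start, event name) bucket."""
    counts = Counter()
    for ts, name in events:
        minute = ts - ts % 60  # truncate the timestamp to the start of its minute
        counts[(minute, name)] += 1
    return counts

summary = events_per_minute(raw_events)

# Print one line per bucket with a human-readable UTC minute label.
for (minute, name), n in sorted(summary.items()):
    label = datetime.fromtimestamp(minute, tz=timezone.utc).strftime("%Y-%m-%d %H:%M")
    print(label, name, n)
```

Analysts can then query a few rows per minute instead of scanning every raw event.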

Tam says, "Education has its place but experience makes the best engineer. I've seen
people in customer-facing areas like support and customer success, if they have interest in
programming, move into a data engineering role. Most support people are so in tune with
what customers are asking for in terms of custom data integrations that it's an easy
sidestep for them, in the sense that they understand the use case and the thought process
behind it. I've also seen data scientists move from the analysis side into data engineering.
Generally, it's people who have a hand and the experience in the day-to-day use of the
data."

She says, "I've hired people of many different educational backgrounds – from people
who've just graduated with a computer science degree to people who've done bootcamp
courses in Python. You shouldn't be pigeonholed by your background. It depends on the
person's overall goal. If they have the vision and drive, anyone could make a good data
engineer with time."

Anderson, however, says, "Data engineering teams generally skew toward senior people.
More broadly based software engineering teams will have people with a wider range of
experience.

"If you have only a bachelor's degree and want to get on a data engineering team, I
recommend you make a personal project that shows what you can do, not just what you
can talk about."

4) Social and communication skills are important
Ng says, "Aside from hard technical skills, a good data engineer should also have certain
soft skills and qualities":

1. Attention to detail: Data quality is extremely important when building pipelines. All
downstream work is only as good as the quality and integrity of the data you're
moving through the pipeline. You have to really care about and appreciate the
"garbage in, garbage out" principle.

2. Appreciation for clean design: There's never one way to design and build a pipeline
for moving data from point A to point B. A good data engineer should appreciate the
elegance of clean and simple designs that are not over-architected.

3. Good communication skills: A lot of times there's a discovery period when you start
to design a pipeline because your data is sitting in different silos that may be located
in different areas of your infrastructure. You'll have to talk to people to understand the
playing field before you design anything. This discovery step isn't easy, but it's a
requirement for making sure you're building the right thing. A good data engineer
should find satisfaction in helping their customers solve painful problems.

4. Excitement about working on back-end systems: Data engineers don't build a lot of
UIs and front-end apps. They work deep in the systems stack, and in many cases they
won't be able to point to something shiny and say "I built that!" You have to be OK
with that and take pride in being the hero behind the scenes.

5. A love of learning: This isn't really data engineering-specific, it's just how the
software engineering world operates. You have to keep up with new libraries,
frameworks, and tools out there in the community. Things change fast and you need to
be able to quickly understand, evaluate, and learn new tools if necessary.

"Having good people skills is critical," Lappas agrees. "A data engineer serves internal
teams, so he or she has to understand the business goal that the data analyst wants to
achieve to best support them. If a data scientist has a specific tool they want to use, the
data engineer has to set up the environment in a way that lets them use it. So you have to
be really good at interacting with the rest of the data team."

Tam says, "A lot of teams are really collaborative, so you have to communicate what you're
doing technically with people who aren't as technical in the process of getting the
requirements vetted out." But she says another kind of skill is also important. "I look for
people who like stories and puzzles – the process of piecing together stories that might
not seem like they make sense into a more complete picture. That mindset is useful as you
move up the data engineering ladder and you have more input into designing systems that
make sense."

5) The job is changing


While all of the above is important, data engineering is an evolving discipline.

Lappas says, "We're seeing a shift to data services, which means a change in the job of the
data engineer to delivering data services. When it was expensive to store and process, data
was siloed. Few people had access to it, and it was hard to make changes to it. With the
cloud, it's now cheap and easy to store and process data, so everyone is putting data into
cloud data warehouses and allowing anyone in the company to connect to it. Data
engineers are still responsible for the performance of that infrastructure."

In other words, Tam says, "As data becomes even more ginormous than it is today, it
becomes more about infrastructure and sustainable processes than it does about single
processes. The job is growing toward more being able to maintain things."

Ng is more cautious. "I think this one is very company-specific. For large companies a
data engineer may be able to have a narrower focus, e.g. just building pipelines. If you're
at a small startup and you're the only data engineer, you'll inevitably have to wear
multiple hats. Both of these scenarios exist today; you just have to decide what is best for
you."

Conclusion
What would our experts tell someone who was considering a career as a data engineer?

Lappas says, "The job is very difficult. It's an unsexy job, but it's super-critical. Data
engineers are kind of like the unsung heroes of the data world. Their job is incredibly
complex, involving new skills and new tech. It's really hard to build new ETL pipelines."

Anderson agrees. "It's more difficult than a regular software engineering job. It may not
be for everybody. The technical bar for data engineers is pretty high. Some people's efforts
may be better put elsewhere."

Ng's advice: "Work for a startup and find a great mentor. Whether this is at an internship
or your first job, find a place where you can work directly for someone who's a great
teacher. More than anything else, a great mentor is the most efficient way to learn the
right things and learn those things quickly. By working at a startup you'll be forced to
wear multiple hats and will learn an incredible amount while doing that. Each hat is an
opportunity to learn something new. Be a hat collector."

"Data engineering is a job that takes a lifetime to master," Tam says. "Every year there's
something new to learn. You're never doing the same thing year after year."

Anderson agrees. "There is no such thing as future-proofing your career by choosing the
right technology. The right technology will eventually become the wrong technology.
You'll need to spend time and effort to keep up with what's happening."

Or as Tam puts it, "You can continue to grow forever."

Image credit: Image Editor
