In this video, we will listen to several data professionals talk about their experience of working with various data sources and types of data.

You'd be surprised by the different ways that data can come at you. I
tend to be a relational database fan. And so I spend a lot of time
with SQL and using the power of SQL to deal with moving data
from one place to another, to deal with structuring of data,
to deal with all the security details around data. But
obviously that doesn't apply to every scenario, and even when we're dealing entirely in relational databases, we're often moving data from one relational database to another. And especially when we're talking about going from one vendor to another, that can be challenging. The things that also get in the way tend to be versioning. Sometimes a feature you want is in a version two levels above where you are, or it doesn't work the same way as it did two versions ago. So working with multiple data sources is about flexibility. It's about finding the approach that works, and works with the performance you need.
Moving data one time is usually not all that hard, as long as we're sub-terabyte. But moving data consistently, continually, and in a performant way can cause us to evaluate a lot of different solutions. So we really need to be open to new ideas and to look for new solutions that meet the requirements that we have.
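
As a rough sketch of that kind of continual movement, here is what a batched copy between two relational databases can look like in Python. It uses sqlite3 in-memory databases purely as stand-ins, and the orders table is invented for the example; a real pipeline would use each vendor's own driver and data types.

    import sqlite3

    # Stand-in databases; a real migration would use each vendor's own driver
    # (for example, a Db2 driver on one side and a SQL Server driver on the other).
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")

    # Hypothetical table, populated so the sketch is self-contained.
    source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    source.executemany("INSERT INTO orders VALUES (?, ?)",
                       [(1, 9.99), (2, 24.50), (3, 5.00)])
    target.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

    # Copy in batches so memory use stays flat even for large tables.
    cursor = source.execute("SELECT id, amount FROM orders")
    while True:
        batch = cursor.fetchmany(1000)
        if not batch:
            break
        target.executemany("INSERT INTO orders VALUES (?, ?)", batch)
    target.commit()

    print(target.execute("SELECT COUNT(*) FROM orders").fetchone()[0], "rows copied")
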
Mostly I work with relational databases. They are extremely flexible and have withstood the test of time. However, with the evolution of unstructured data such as logs, documents, XML, and JSON, their reputation as a cure for all of your data problems came under intense scrutiny, and most data-intensive applications, such as IoT and social media applications, started to look elsewhere. For example, Google released a white paper back in 2006 called Google BigTable. That idea quickly caught fire: Cassandra and HBase came out of the same architectural model as Google BigTable, and they became widely popular databases that solve some of the problems relational databases failed to solve. For example, relational databases struggle a little bit with heavy write-intensive applications, such as IoT or sensor data and social media data, because the B-tree data structures that drive, or power, these relational databases slow down under the random reads and random writes that write-heavy applications generate.
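
The contrast being drawn here is between update-in-place and append-only write patterns. Below is a minimal sketch, using plain files rather than a real storage engine, of why random in-place writes tend to cost more than sequential appends; the page size and counts are arbitrary, and the actual gap depends heavily on the disk.

    import os
    import random
    import time

    PAGE = 4096    # a typical database page size
    PAGES = 5000

    def append_only(path):
        # LSM-style stores (Cassandra, HBase): every write goes to the end of the file.
        with open(path, "wb") as f:
            for _ in range(PAGES):
                f.write(os.urandom(PAGE))
            f.flush()
            os.fsync(f.fileno())

    def in_place_random(path):
        # B-tree-style updates: writes land on pages scattered across the file.
        with open(path, "wb") as f:
            f.truncate(PAGES * PAGE)
            for _ in range(PAGES):
                f.seek(random.randrange(PAGES) * PAGE)
                f.write(os.urandom(PAGE))
            f.flush()
            os.fsync(f.fileno())

    for name, fn in [("sequential", append_only), ("random", in_place_random)]:
        start = time.perf_counter()
        fn(name + ".dat")
        print(name, "writes:", round(time.perf_counter() - start, 2), "seconds")
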
It's an inevitable part of a data engineer's job to work with a variety of data. You will need to work with standard formats like CSV, JSON, and XML, but you'll also need to work with proprietary formats. And you will need to get data from different sources, whether that's relational databases, NoSQL, or big data repositories. You will need to work with data
at rest, streaming data, or data in motion. And you might not
have the skills to work with all of these different types of data
sources from day one. But you need to be able to learn
as you go and pick up the skills required for the project to work
with different datasets, different data formats, and
different data sources. When it comes to data formats, log data, XML data, JSON, etc., each comes with its own challenges. For example, log data is extremely challenging because it's unstructured, and you may need to write your own custom tools to parse the data depending on what you want to look at.
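
As an illustration of that kind of custom tooling, here is a minimal sketch that pulls fields out of web-server-style log lines with a regular expression. The log format and field names are assumptions for the example, and real logs vary, which is exactly why these parsers end up being custom.

    import re

    # Hypothetical access-log lines, roughly in the common log format.
    log_lines = [
        '203.0.113.9 - - [12/Mar/2024:10:01:44 +0000] "GET /index.html HTTP/1.1" 200 5120',
        '198.51.100.7 - - [12/Mar/2024:10:01:45 +0000] "POST /login HTTP/1.1" 401 340',
    ]

    # One named group per field we care about; everything else is ignored.
    pattern = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+)'
    )

    for line in log_lines:
        match = pattern.match(line)
        if match:  # skip lines that don't fit the expected shape
            fields = match.groupdict()
            print(fields["ip"], fields["method"], fields["path"], fields["status"])
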
XML, on the other hand, was widely popular a decade or so ago, especially with the SOAP protocol in web applications. However, web developers and corporations soon discovered that it can be resource intensive, especially on memory, because it has both starting and ending tags. So then JSON came into the picture. It got rid of the ending tags, looked just like key-value pairs, and saved some resources.
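
The savings are easy to see side by side. This sketch serializes the same made-up record as XML and as JSON and compares sizes; the person record and its fields are invented for the example.

    import json
    import xml.etree.ElementTree as ET

    record = {"id": "42", "name": "Ada", "city": "London"}

    # XML: every field carries both a starting and an ending tag.
    root = ET.Element("person")
    for key, value in record.items():
        ET.SubElement(root, key).text = value
    xml_bytes = ET.tostring(root)

    # JSON: the same record as bare key-value pairs.
    json_bytes = json.dumps(record).encode()

    print(xml_bytes.decode())   # <person><id>42</id><name>Ada</name>...
    print(json_bytes.decode())  # {"id": "42", "name": "Ada", "city": "London"}
    print(len(xml_bytes), "bytes as XML vs", len(json_bytes), "bytes as JSON")
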
JSON is now widely used as part of RESTful APIs. And then even newer data formats, such as Apache Avro, are gaining wide popularity because of how efficiently they store data.
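
Part of that efficiency comes from Avro storing the schema once per file, so field names are not repeated in every record the way they are in JSON or XML. Here is a minimal sketch assuming the third-party fastavro package; the Person schema is made up for the example.

    import io
    from fastavro import writer, reader, parse_schema  # pip install fastavro

    # Hypothetical schema; Avro stores it once per file, so the rows
    # themselves carry no field names at all.
    schema = parse_schema({
        "type": "record",
        "name": "Person",
        "fields": [
            {"name": "id", "type": "int"},
            {"name": "name", "type": "string"},
            {"name": "city", "type": "string"},
        ],
    })

    records = [
        {"id": 1, "name": "Ada", "city": "London"},
        {"id": 2, "name": "Grace", "city": "New York"},
    ]

    buffer = io.BytesIO()
    writer(buffer, schema, records)   # write compact binary rows

    buffer.seek(0)
    for record in reader(buffer):     # the schema travels with the file
        print(record)
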
where
we were converting data from a Db2 database into a SQL Server
database and it was challenging because the way that each of
those expect imports and exports to happen is a little bit
different. The data was particularly challenging, and
that's where a lot of your challenge might come from in
these projects, is from the data itself. In this particular case,
the data had a lot of different characters in it. So usually
we're looking for a character we can use as a
delimiter. Oftentimes that's comma delimited,
so we can separate our fields using commas, but
we also have to think about situations where we have
data that has commas in it. How do we properly separate
that data? How do we properly define our fields? And in this
particular case we had to use different separators for
different tables, because every single special character we could think of was in one of those tables. And the special characters that weren't in the data were sometimes ones we couldn't use as separators, such as the bell character.
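
Python's csv module shows both ways out of that problem: quote fields so embedded commas survive, or switch to a delimiter that does not appear in the data. The rows here are invented for the example.

    import csv
    import io

    # Hypothetical rows where the data itself contains commas.
    rows = [
        ["1", "Smith, John", "likes commas, semicolons; and pipes"],
        ["2", "O'Brien, Mary", "plain value"],
    ]

    # Option 1: keep commas as the delimiter and let quoting protect the data.
    quoted = io.StringIO()
    csv.writer(quoted, quoting=csv.QUOTE_ALL).writerows(rows)
    print(quoted.getvalue())

    # Option 2: pick a delimiter that never appears in the data, such as a tab.
    tabbed = io.StringIO()
    csv.writer(tabbed, delimiter="\t").writerows(rows)
    print(tabbed.getvalue())

    # Reading back requires the same settings on the other side.
    for row in csv.reader(io.StringIO(quoted.getvalue())):
        print(row)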
