
Data Preparation & Exploration
Understanding Data Science

Course Instructor
ANAM SHAHID
Data Preparation
Data preparation happens after collecting and storing the data.

Why prepare data?


Data rarely comes in ready for analysis. Real-life data is messy and dirty, and it needs to be
cleaned. Skipping this step may lead to errors down the road or incorrect results, or throw
off your algorithms. You would not use vegetables without cleaning, peeling and dicing
them, as your soup would taste weird and no one would eat it. Well, if you don't clean,
peel and dice your data, your results will look weird, and no one will use them!

Let's start cleaning


Let's take a simple, but dirty, dataset, and clean it together. Maybe you can already
notice a few things.

Tidy data
One fundamental aspect of cleaning data is "tidiness". Tidy data is a way of presenting
a matrix of data with observations as rows and variables as columns. That is not the
case here: our observations (people) are in columns, and their features are in rows.
Let's take care of that. It's easy to do programmatically with Python or R, which also
help with the other cases you're about to see.
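As a rough sketch of what that looks like in pandas (the values below are made up to mirror the example in the lesson):

```python
import pandas as pd

# Hypothetical "dirty" layout: people as columns, features as rows
raw = pd.DataFrame(
    {"Sara": [1.65, "FR", "27"], "Lis": [5.5, "US", "28"], "Hadrien": [1.80, "Belgium", None]},
    index=["size", "country", "age"],
)

# Tidy layout: one row per observation (person), one column per variable
df = raw.T
print(df)
```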

Tidy data output


The data looks much clearer this way.

1. Remove duplicates
In general, you want to remove duplicates. Python and R make them easy to identify.
Here we can see that Lis appears twice. Let’s remove the duplicate.
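Assuming the tidy table from above lives in a DataFrame called df, a minimal sketch:

```python
# Keep only the first occurrence of any fully identical row
# (the second "Lis" entry in the lesson's table would be dropped here)
df = df.drop_duplicates(keep="first")
```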

2. Unique ID
What if there's another person called Lis? Then, you want a way to uniquely identify
each observation. It can be a combination of features (name plus last name plus year of
birth, for example)

Unique ID | output

but the safest way is to assign a unique ID. Sara's ID is now 0, Lis' 1 and Hadrien's 2.
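A minimal pandas sketch of assigning such an ID (column names are the hypothetical ones from the example above):

```python
# Turn the names back into a regular column, then give each observation a sequential ID
df = df.reset_index().rename(columns={"index": "name"})
df.insert(0, "id", range(len(df)))  # Sara -> 0, Lis -> 1, Hadrien -> 2
```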
3. Homogeneity (the quality or state of being all the same or all of the same kind)
Something fishy is going on in the size column. Lis simply can't be that tall (or Hadrien
and Sara that small, depending on where you're from). Lis is in the US, so she entered her
height in feet. Sara and Hadrien are based in Europe, where the metric system is used. All
variables should follow the same standard.

Programmatically, you can select the values above 2.5 (those can't be meters) and divide
them by 3.281 to convert feet to meters. Here we go.
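A minimal sketch of that conversion, assuming the size column from the example above:

```python
# Heights above 2.5 can't be meters; assume they are in feet and convert them
df["size"] = df["size"].astype(float)
in_feet = df["size"] > 2.5
df.loc[in_feet, "size"] = (df.loc[in_feet, "size"] / 3.281).round(2)
```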

Homogeneity, again
Similarly, countries should follow the same format. The United States and France are
abbreviated, but Belgium is written in full. Let's fix that.
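One way to do that in pandas, mapping only the value that appears in this example:

```python
# Use the same abbreviated country format everywhere
df["country"] = df["country"].replace({"Belgium": "BE"})
```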
Looking better already!

4. Data types
Another common issue relates to data types. The tools you use might be able to infer
data types for each column, but you'd better make sure they are correct. Here, the Age
column is encoded as text. If you try to get the mean, you'll get an error, because the
average of two words doesn't make sense. You should change the type of this feature
to numbers.
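A minimal sketch of that conversion, assuming the age column from the example above:

```python
# "age" was read as text; convert it to numbers so aggregations like the mean work
df["age"] = pd.to_numeric(df["age"])
print(df["age"].mean())  # no error anymore
```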

Ages are now numbers; you can see the quotes have disappeared.

5. Missing values
Last but not least, missing values. They are common and occur for various reasons: the
agent doing the entry was distracted, the person surveyed did not understand the
question, or it's on purpose, for example an event that has not happened yet. There are
several ways to deal with missing values. You can substitute the exact value if you
have access to the source. You can use an aggregate value, like the mean, median or max,
depending on the situation. You can drop the observation altogether, but each observation
you remove means less training data for your model. Or you can keep it as is and ignore it,
if your algorithm allows it.

Here, we take the mean, 27.5, and round it up to get 28, which happens to be the
correct value.
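A minimal sketch of filling the gap with the mean, continuing the example above:

```python
# Fill the missing age with the mean of the known ages (27.5), rounded to 28
df["age"] = df["age"].fillna(round(df["age"].mean()))
```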
Exploratory Data Analysis (EDA)
What is EDA?
Exploratory Data Analysis, or EDA for short, is a process that was promoted by John
Tukey, a respected statistician. It consists of exploring the data, formulating hypotheses
about it, and assessing its main characteristics, with a strong emphasis on visualization.

Data workflow
EDA happens after data preparation, but the two can overlap: EDA can reveal new issues
that need cleaning.

Example: SpaceX launches (data set)


Let's look at SpaceX launches!

Knowing your data


I mean, let's look at the data behind SpaceX launches. The first thing to do is to know
what features we're looking at. We have different information, such as the flight number
or what the rocket transported. All have the correct data type.
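Assuming the launch data has been loaded into a pandas DataFrame called launches (the file name below is hypothetical), a quick way to list the features and their inferred types:

```python
import pandas as pd

# Hypothetical file name; load the launch data however it is provided
launches = pd.read_csv("spacex_launches.csv")

# One line per column: name, non-null count and inferred data type
launches.info()
```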

Previewing your data


Looking at your tables helps make sense of your observations. Can you notice the
missing payload mass for the first two rows?
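A one-line preview, reusing the hypothetical launches DataFrame:

```python
# First five rows; the missing payload masses show up as NaN
print(launches.head())
```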
Descriptive statistics
It's always a good idea to calculate descriptive statistics. The SpaceX dataset is mainly
qualitative, but we still get a lot of information. We have a count of 55 pretty much
everywhere, because we have 55 launches. The Payload Mass column shows 53
because of the two missing values we saw before. Only 1 mission failed. Most of the
time, there is no attempt at landing. You could also calculate the average payload mass,
or the count of launches per year. But do you know what would be best for this last
option?
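Descriptive statistics for every column, numeric and qualitative alike, come from a single call:

```python
# Count, mean and min/max for numeric columns; count, unique and top value for text columns
print(launches.describe(include="all"))
```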

Visualize!
Visualization! At a glance, we can see that there were no launches in 2011. The count of
launches then gradually increased before doubling in 2017. 2018 looks lower, but
remember we only have 3 months of data for that year, so it actually looks like the count
is going to double again.

Now, this launch count is informative, but you probably have a couple more questions. How
about the count by launch site? Rockets originally launched from Cape Canaveral Air Force
Station, but in 2017 most rockets launched from Kennedy Space Center Launch
Complex 39.
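A minimal matplotlib/pandas sketch of the per-year chart, assuming hypothetical column names Year (it may need to be derived from a launch date column first) and Launch Site:

```python
import matplotlib.pyplot as plt

# Number of launches per year, in chronological order
per_year = launches["Year"].value_counts().sort_index()

per_year.plot(kind="bar")
plt.title("SpaceX launches per year")
plt.xlabel("Year")
plt.ylabel("Number of launches")
plt.show()

# The same idea answers the second question:
# launches["Launch Site"].value_counts().plot(kind="bar")
```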
Outliers
Another thing you do during EDA is look for outliers, that is, unusual values. Whether
they are errors or valid, it's nice to know about them, as they can throw your results off.
Here, we can see we have only 5 launches with a payload mass greater than 7,000 kg, when
the average mass is closer to 3,800 kg.
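A quick way to surface those heavy payloads, assuming the column is called "Payload Mass (kg)" (the actual name may differ):

```python
# Select launches whose payload mass is well above the ~3,800 kg average
heavy = launches[launches["Payload Mass (kg)"] > 7000]
print(len(heavy), "launches heavier than 7,000 kg")
```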

Interactive dashboards
We finished the last lesson looking at some graphs, so let's talk a little more about
visualization.

1. One picture...
One picture is worth a thousand words, they say. Based on what we saw in the previous
lesson, we would tend to agree. However, there are a few things to pay attention to, to
ensure your chart is easily understandable and straight to the point.

2. Use color purposefully


For example, you should use color purposefully. Remember this graph? Count of
launches by year, pretty straightforward.

What about this one?


Wrong! Granted, it's aesthetically pleasing, but it's also confusing. What do the colors
correspond to? Nothing. We're just counting launches per year. One color is enough.

3. Colorblindness
You should also be mindful of colorblindness. You may distinguish red and green very
well, but some people can't, and more of them than you might think. You can find a lot of
information on colorblindness online, as well as color palettes that are accessible to
colorblind people.
4. Readable fonts
You should also use readable fonts. Sans-serif fonts are easier to read. There are
fancier fonts available, sure, but your readers should focus on your visualization's
message, not on the font.

5. Labeling
An image is worth a thousand words, but words do help. Your graphs should
always have a title, so we know what we're looking at; the x and y axes should have
labels, otherwise they could be anything; and you should provide a legend if you use
colors or patterns, so that we know what they refer to.
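A small matplotlib sketch that puts these guidelines together on the launches-per-year chart: one purposeful, colorblind-friendly color, a sans-serif font, a title, axis labels and a legend. Column names are the same hypothetical ones as before:

```python
import matplotlib.pyplot as plt

per_year = launches["Year"].value_counts().sort_index()

plt.rcParams["font.family"] = "sans-serif"   # readable font
fig, ax = plt.subplots()
ax.bar(per_year.index.astype(str), per_year.values,
       color="#0072B2",                      # one purposeful, colorblind-friendly color
       label="Launches")
ax.set_title("SpaceX launches per year")     # title: what are we looking at?
ax.set_xlabel("Year")                        # axis labels
ax.set_ylabel("Number of launches")
ax.legend()                                  # legend for the plotted series
plt.show()
```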

Question
If a picture is worth a thousand words, then what is worth a thousand pictures?

A dashboard!
A dashboard! Well, technically, a dashboard of a few pictures is enough already. My
point is, showing several pictures together can be more insightful than looking at them
separately, or trying to pack all the insights into one graph. Your car's dashboard shows
your speed, the engine's rotation speed and how much gas is left. Individually, these
pieces of information are useful. But together, they paint a much bigger picture and
make your trip safer and more comfortable.

Photo by Marek Szturc on Unsplash


Dashboards
That's what dashboards do: group all the relevant information in one place to make it
easier to gather insights and act on them. On this dashboard, any salesperson can see
not only how sales are progressing this quarter, but also how this progression compares
to previous quarters. On top of that, they can keep track of transactions and
opportunities, as well as the customer count. They can filter all of this data by software,
service or maintenance sales. All of that in one place, customized for their needs!

BI tools
Business Intelligence tools let you clean, explore and visualize data, and build dashboards,
without requiring any programming knowledge. Examples of such tools are Tableau, Looker,
and Power BI. Of course, you can also do all of that programmatically using Python, R, or
even JavaScript.

Bibliography
https://campus.datacamp.com/courses/understanding-data-science/experimentation-and-prediction?ex=14
