Data Preparation & Exploration
Understanding Data Science
Course Instructor
ANAM SHAHID
Data Preparation
Data preparation happens after collecting and storing the data.
Tidy data
One fundamental aspect of cleaning data is "tidiness". Tidy data is a way of presenting
a matrix of data, with observations in rows and variables in columns. That is not the
case here: our observations (people) are in columns, and their features are in rows.
Let's take care of that. It's easy to do programmatically with Python or R, which also
help with the other cases you're about to see.
1. Remove duplicates
In general, you want to remove duplicates. Python and R make them easy to identify.
Here we can see that Lis appears twice. Let’s remove the duplicate.
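In pandas, for instance, this step might be sketched as follows (the table and its values are hypothetical, recreating the lesson's example):

```python
import pandas as pd

# Hypothetical recreation of the example table: Lis appears twice
people = pd.DataFrame({
    "name": ["Sara", "Lis", "Hadrien", "Lis"],
    "country": ["FR", "US", "BE", "US"],
})

# drop_duplicates keeps the first occurrence of each repeated row
people = people.drop_duplicates().reset_index(drop=True)
```

After this, each person appears exactly once.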
2. Unique ID
What if there's another person called Lis? Then you want a way to uniquely identify
each observation. It can be a combination of features (first name plus last name plus
year of birth, for example), but the safest way is to assign a unique ID. Sara's ID is
now 0, Lis's is 1, and Hadrien's is 2.
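A minimal sketch of assigning such an ID, again with hypothetical data:

```python
import pandas as pd

# Hypothetical table after de-duplication
people = pd.DataFrame({"name": ["Sara", "Lis", "Hadrien"]})

# Assign a unique ID per observation: Sara -> 0, Lis -> 1, Hadrien -> 2
people["id"] = range(len(people))
```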
3. Homogeneity (the quality or state of being all the same or all of the same kind)
Something fishy is going on in the size column. Lis simply can't be that tall (or Hadrien
and Sara that short, depending on where you're from). Lis is in the US, so she entered
her size in feet. Sara and Hadrien are based in Europe, where the metric system is
used. All variables should use the same standard.
Programmatically, you can filter values above 2.5 meters, and apply a division by 3.281
to get the metric value. Here we go.
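That filter-and-convert step might look like this in pandas (the sizes below are hypothetical stand-ins for the example):

```python
import pandas as pd

# Hypothetical sizes: Lis's 5.5 is in feet, the others are in metres
people = pd.DataFrame({
    "name": ["Sara", "Lis", "Hadrien"],
    "size": [1.65, 5.5, 1.80],
})

# Any value above 2.5 is assumed to be in feet; divide by 3.281 for metres
feet = people["size"] > 2.5
people.loc[feet, "size"] = people.loc[feet, "size"] / 3.281
```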
Homogeneity, again
Similarly, countries should follow the same format. The United States and France are
abbreviated, but Belgium is written in full. Let's fix that.
Looking better already!
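One way to sketch that fix is a simple mapping from full names to the abbreviated format (the values here are hypothetical):

```python
import pandas as pd

# Hypothetical country column: Belgium is written in full, the rest abbreviated
people = pd.DataFrame({"country": ["FR", "US", "Belgium"]})

# Map full names onto the same abbreviated format as the other rows
people["country"] = people["country"].replace({"Belgium": "BE"})
```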
4. Data types
Another common issue relates to data types. The tools you use might be able to infer
data types for each column, but you'd better make sure they are correct. Here, the Age
column is encoded as text. If you try to get the mean, you'll get an error, because the
average of two words doesn't make sense. You should change the type of this feature
to numbers.
Ages are now numbers; you can see the quotes have disappeared.
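A sketch of that type conversion, assuming hypothetical ages stored as text:

```python
import pandas as pd

# Ages encoded as text (quoted values); the mean of strings is an error
people = pd.DataFrame({"age": ["27", "28"]})

# Converting the column to a numeric type makes aggregates like mean work
people["age"] = people["age"].astype(int)
mean_age = people["age"].mean()
```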
5. Missing values
Last but not least, missing values. They are common and occur for various reasons: the
agent doing the entry was distracted, the person surveyed did not understand the
question, or they are missing on purpose, for example for an event that has not
happened yet. There are several ways to deal with missing values. You can substitute
the exact value if you have access to the source. You can take an aggregate value, like
the mean, median, or max, depending on the situation. You can drop the observation
altogether, but each observation you remove means less training data for your model.
Or you can keep it as is and ignore it, if your algorithm allows it.
Here, we take the mean, 27.5, and round it up to get 28, which happens to be the
correct value.
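Filling a missing value with the rounded mean might be sketched like this (the ages are hypothetical, chosen so the mean of the known values is 27.5):

```python
import pandas as pd

# Hypothetical ages with one missing value
people = pd.DataFrame({"age": [26.0, None, 29.0]})

# Mean of the known ages is 27.5; rounding gives 28
fill = round(people["age"].mean())
people["age"] = people["age"].fillna(fill)
```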
Exploratory Data Analysis (EDA)
What is EDA?
Exploratory Data Analysis, or EDA for short, is a process promoted by the respected
statistician John Tukey. It consists of exploring the data, formulating hypotheses
about it, and assessing its main characteristics, with a strong emphasis on
visualization.
Data workflow
EDA happens after data preparation, but the two can overlap: EDA can reveal new
issues that need cleaning.
Visualize!
At a glance, we can see that there were no launches in 2011. The count of launches
then gradually increased before doubling in 2017. The count for 2018 is lower, but
remember we only have three months of data for that year, so it actually looks like it's
going to double again.
Now this launch count is informative, but you probably have a couple more questions.
How about the count by launch site? Rockets originally launched from Cape Canaveral
Air Force Station, but in 2017 most rockets launched from Kennedy Space Center
Launch Complex 39.
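Counts like these come straight out of a group-by. A sketch with a hypothetical slice of launch records (the real analysis would use the full dataset):

```python
import pandas as pd

# Hypothetical launch records: year and launch site per launch
launches = pd.DataFrame({
    "year": [2012, 2013, 2013, 2017, 2017, 2017, 2017],
    "site": ["CCAFS", "CCAFS", "CCAFS",
             "KSC LC-39", "KSC LC-39", "CCAFS", "KSC LC-39"],
})

# Launches per year; chaining .plot(kind="bar") would draw the chart
per_year = launches["year"].value_counts().sort_index()

# Launches broken down by year and site
per_site = launches.groupby(["year", "site"]).size()
```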
Outliers
Another thing you do during EDA is look for outliers, that is, unusual values. Whether
they are errors or valid, it's nice to know about them, as they can throw your results off.
Here, we can see we have only 5 launches with a mass greater than 7,000 kg, when
the average mass is closer to 3,800 kg.
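Flagging such values is a one-line filter. A sketch with hypothetical payload masses:

```python
import pandas as pd

# Hypothetical payload masses in kg
launches = pd.DataFrame({"mass_kg": [2500, 3600, 4100, 9600, 3800, 8900]})

# Flag unusual values: here, anything above 7,000 kg
outliers = launches[launches["mass_kg"] > 7000]
```

Whether you then correct, drop, or keep them depends on whether they turn out to be errors or genuinely heavy payloads.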
Interactive dashboards
We finished the last lesson looking at some graphs, so let's talk a little more about
visualization.
1. One picture...
One picture is worth a thousand words, they say. Based on what we saw in the previous
lesson, we would tend to agree. However, there are a few things to pay attention to, to
ensure your chart is easily understandable and straight to the point.
3. Colorblindness
You should also be mindful of colorblindness. You may distinguish red and green very
well, but some people don't, and more than you think. You can find a lot of information
on colorblindness online, as well as palettes of colors that are accessible to colorblind
people.
4. Readable fonts
You should also use readable fonts. Sans-serif ones are easier to read. There are
nicer fonts available, sure, but your readers should focus on your viz message, not on
the font.
5. Labeling
An image is worth a thousand words, but words do help. Your graphs should
always have a title, so we know what we're looking at; the x and y axis should have
labels, otherwise they could be anything; and you should provide a legend if you use
colors and patterns, so that we know what they refer to.
Question
If a picture is worth a thousand words, then what is worth a thousand pictures?
A dashboard!
A dashboard! Well, technically, a dashboard with just a few pictures is enough. My
point is, showing several pictures together can be more insightful than looking at them
separately, or than trying to pack all the insights into one graph. Your car dashboard
indicates the car's speed, the engine's rotation speed, and the amount of gas left.
Individually, these pieces of information are useful. But together, they paint a much
bigger picture and make your trip safer and more comfortable.
BI tools
Business Intelligence tools let you clean, explore, and visualize data, and build
dashboards, without requiring any programming knowledge. Examples include Tableau,
Looker, and Power BI. Of course, you can also do all of this programmatically using
Python, R, or even JavaScript.