EDA Structuring With Python


You've learned about how structuring data can help professionals analyze, understand, and learn more about their data. Now let's use a Python notebook and discover how it works in practice. We'll continue using our NOAA lightning strike dataset. For this video, we'll consider the data for 2018 and use our structuring tools to learn more about whether lightning strikes are more prevalent on some days than others. Before we do anything else, let's import our Python packages and libraries. These are all packages and libraries you're familiar with: pandas, NumPy, seaborn, datetime, and matplotlib.pyplot.
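As a quick sketch, the import cell might look like this (standard aliases assumed):

    import datetime
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import seaborn as sns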

For a quick refresher, let's convert our date column to datetime and take a look at our column headers. We do this to get our dates ready for any future string manipulation we may want to do, and to remind us of what is in our data. As you remember, there are three columns in the dataset: date, number of strikes, and center point geom, which you'll see after running the head function.
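A minimal sketch of that refresher step; the file name and the exact column labels (date, number_of_strikes, center_point_geom) are assumptions, so match them to your own copy of the dataset:

    # Load the 2018 lightning strike data (assumed file name).
    df = pd.read_csv('lightning_strikes_2018.csv')

    # Convert the date column to datetime and preview the first rows.
    df['date'] = pd.to_datetime(df['date'])
    df.head()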

Next, let's learn about the shape of our data by using the df.shape function. When we run this cell, we get (3401012, 3). Take a moment to picture the shape of this dataset: only three columns wide and nearly 3.5 million rows long. That's incredibly long and thin. We'll also use a function for finding any duplicates. When we enter df.drop_duplicates with an empty argument field, followed by .shape, the notebook will return the number of rows and columns remaining after duplicates are removed. Because this returns the exact same numbers as our shape function, we know there are no duplicate values.
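Those two checks, sketched out, look something like this:

    # Shape of the full dataset: (rows, columns), roughly (3401012, 3) here.
    df.shape

    # Shape after dropping duplicate rows; an identical result means no duplicates.
    df.drop_duplicates().shape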

Let's discuss some of those structuring concepts we learned about earlier, starting with sorting. We'll sort the number of strikes column in descending order, or most to least. While we do this, let's consider the dates with the highest number of strikes. We'll input df.sort_values, then in the argument field type by, then an equals sign, then the column we want to sort: number of strikes, followed by ascending equals False. If we add the head function to the end, the notebook outputs the top 10 rows for us to analyze. We find that the highest daily counts are in the lower 2,000s. It does seem like a lot of lightning strikes in just one day, but given that it happened in August, when storms are likely, it is probable these 2,000-plus strikes were counted during a storm.
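A sketch of that sort, assuming the column is named number_of_strikes:

    # Sort from most to fewest strikes and show the top 10 rows.
    df.sort_values(by='number_of_strikes', ascending=False).head(10)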

Next, let's look at the number of strikes based on the geographic coordinates, latitude and longitude. We can do this by using the value_counts function. We type in df, followed by the center point geom column, then .value_counts with an empty argument field. Based on our result, we learn that the locations with the most strikes have lightning, on average, every one in three days, with counts in the low 100s. Meanwhile, some locations reported only one lightning strike for the entire year of 2018.
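The value counts step might look like the following, with center_point_geom as the assumed column name:

    # Count how many days each latitude/longitude point recorded lightning.
    df['center_point_geom'].value_counts()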
We also want to learn whether we have an even distribution of values, or whether 108 is a notably high value for lightning strikes in the US. To do this, copy the same value_counts function, but input a colon and 20 in the brackets so that you can see the first 20 lines. The rest of the coding here is to help present the data clearly: we rename the axis and index to unique values and counts, respectively, and lastly we add a gradient background to the counts column for visual effect. After running the cell, we discover no notable drops in lightning strike counts among the top 20 locations. This suggests that there are no notably high lightning strike data points and that the data values are evenly distributed.
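One way to sketch that top-20 view, using pandas styling for the gradient background:

    # Top 20 locations by count, relabeled for readability,
    # with a color gradient applied to the counts column.
    (
        df['center_point_geom'].value_counts()[:20]
        .rename_axis('unique_values')
        .reset_index(name='counts')
        .style.background_gradient()
    )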

Next, let's use another structuring method: grouping. You'll often find stories hidden among different groups in your data, like the most profitable times of day for a retail store, for instance. For this dataset, one useful grouping is categorizing lightning strikes by day of the week, which will tell us whether any particular day has fewer or more lightning strikes than others. Let's first create some new data columns. We create a column called week by inputting df.date.dt.isocalendar. Let's leave the argument field blank and add .week at the end. This will create a column assigning a week number, 1 through 52, to each date in 2018. Let's also add a column that names the weekday: type in df.date.dt.day_name, leaving the argument field blank. For this last part, let's input df.head. You'll discover the dates now have week numbers and assigned weekdays.
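A sketch of those two new columns:

    # ISO week number and weekday name for each date.
    df['week'] = df['date'].dt.isocalendar().week
    df['weekday'] = df['date'].dt.day_name()
    df.head()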
We have some new columns, so let's group the number of strikes by weekday to determine whether any particular day of the week has more lightning strikes than others. Let's create a DataFrame with just the weekday and the number of lightning strikes. We'll do this by inputting df, double brackets, then weekday, comma, number of strikes, both in single quotes, followed by closing double brackets. Next, we'll add one of our structuring functions, groupby, with weekday in the argument field and .mean on the end. What we're telling the notebook here is to create a DataFrame with weekday and number of strikes, but then also group the total number of strikes by day of the week, giving us the mean number of strikes for that day.
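That grouping step, sketched with the assumed column names:

    # Mean daily strike count for each day of the week.
    df[['weekday', 'number_of_strikes']].groupby(['weekday']).mean()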

To understand what this data is telling us, let's plot a box plot chart. A box plot is a data visualization that depicts the locality, spread, and skew of groups of values within quartiles. For this dataset and notebook, a box plot visualization will be the most helpful because it will tell us a lot about the distribution of lightning strike values. Most of the lightning strike values will be shown grouped into colored boxes, which is why this visualization is called a box plot. The rest of the values will string out to either side with a straight line that ends in a T. We will discuss more about box plots in an upcoming video.

Now before we plot, let's set the weekday order to start with Monday. To code that, input g, an equals sign, then sns.boxplot. Next, in the argument field, let's have x equal weekday and y equal number of strikes. For order, let's use weekday order, and for the showfliers field, let's input False. Showfliers refers to outliers that may or may not be included in the box plot: if you input True, outliers are included; if you input False, outliers are left off the box plot chart. Keep in mind, we aren't deleting any outliers from the dataset when we create this chart; we're only excluding them from our visualization to get a good sense of the distribution of strikes across the days of the week. Lastly, we will plug in our visualization title, lightning distribution per weekday for 2018, and click run cell.
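Putting that plotting step together might look like this; weekday_order and the exact title wording are illustrative:

    # Fix the weekday order so the plot starts with Monday.
    weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                     'Friday', 'Saturday', 'Sunday']

    # Box plot of daily strike counts per weekday, with outliers hidden.
    g = sns.boxplot(data=df, x='weekday', y='number_of_strikes',
                    order=weekday_order, showfliers=False)
    g.set_title('Lightning distribution per weekday (2018)')
    plt.show()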

Now you'll discover something really interesting. The median, indicated by the horizontal black line in each box, remains the same on all of the days of the week. As for Saturday and Sunday, however, the distributions are both lower than the rest of the week. Let's consider why that is. What do you think is more likely: that lightning strikes across the United States take a break on the weekends, or that people do not report as many lightning strikes on weekends? While we don't know for sure, we have clear data suggesting the total quantity of weekend lightning strikes is lower than on weekdays. We've also learned a story about our dataset that we didn't know before we tried grouping it in this way.

Let's get back into our notebook and learn some more about our lightning data. One common structuring method we learned about in another video was merging, which you'll remember means combining two different data sources into one. We'll need to know how to perform this method in Python if we want to learn more about our data across multiple years. Let's add two more years to our data: 2016 and 2017. To merge three years of data together, we need to make sure each dataset is formatted the same. The new datasets do not have the extra columns, week and weekday, that we created earlier. To merge them successfully, we need to either remove those columns from our 2018 data or add them to the new datasets. There's an easy way to merge the three years of data and remove the extra columns at the same time.

Let's call our new data frame union_df. We'll use the pandas function concat to merge, or more accurately concatenate, the three years of data. Inside the concat argument field, we'll type df.drop to pull the weekday and week columns out. We also input the axis we want to drop along, which is one for columns. Lastly, and most essentially, we add the name of the data frame we are concatenating to ours, df_2. We also input True for ignore index because the two data frames will already align along their first columns. And now you've just learned to merge three years of data.
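A sketch of that concatenation; df_2 is assumed to hold the combined 2016-2017 data with the original three columns:

    # Drop the extra columns from the 2018 data so both frames match,
    # then stack the rows into one DataFrame.
    union_df = pd.concat([df.drop(['weekday', 'week'], axis=1), df_2],
                         ignore_index=True)
    union_df.head()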

To help us with the next part of structuring, create three date columns following the same steps you used previously. We've already added the columns for year, month, and month_text to the code. Now let's add all the lightning strikes together by year so we can compare them. We can do this by simply taking the two columns we want to look at, year and number of strikes, and grouping them by year with the function .sum on the end. You'll find that 2017 did have fewer total strikes than 2016 and 2018.
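Those steps might be sketched as follows; the column names year, month, and month_txt are illustrative:

    # Make sure the combined date column is datetime, then derive the new columns.
    union_df['date'] = pd.to_datetime(union_df['date'])
    union_df['year'] = union_df['date'].dt.year
    union_df['month'] = union_df['date'].dt.month
    union_df['month_txt'] = union_df['date'].dt.month_name()

    # Total strikes per year for comparison.
    union_df[['year', 'number_of_strikes']].groupby(['year']).sum()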

Because the totals are different, it might be interesting, as part of our analysis, to see lightning strike percentages by month of each year. Let's call this lightning by month, grouping our union data frame by month text and year. Additionally, let's aggregate the number of strikes column by using the pandas function NamedAgg. In the argument field, we place our column name and our aggregate function, equal to sum, so that we get the totals for each of the months in all three years. When we input the head function, we have the months in alphabetical order, along with the sums for each month.
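A sketch of that monthly aggregation; lightning_by_month follows the transcript's naming, and the reset_index call keeps month_txt and year as ordinary columns for the merge that comes next:

    # Total strikes for each month of each year.
    lightning_by_month = (
        union_df.groupby(['month_txt', 'year'])
        .agg(number_of_strikes=pd.NamedAgg(column='number_of_strikes', aggfunc='sum'))
        .reset_index()
    )
    lightning_by_month.head()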

We can do the same aggregation for year and yearly strikes to review those same numbers we saw before, with 2017 having fewer strikes than the two other years. We created those two data frames, lightning by month and lightning by year, in order to derive our percentages of lightning strikes by month and year.

We can get those percentages by typing lightning by month, then .merge with lightning by year and on equals year in the argument field. You'll find that the merge function is merging lightning by year into our lightning by month data frame according to the year. Lastly, we can create a percentage lightning per month column by dividing the monthly number of strikes by the yearly number of strikes, after which we multiply by 100 to give us a percentage. Now, when we use our head function, we have a restructured data frame.
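Sketching the yearly aggregation, the merge, and the percentage column; year_strikes and percentage_lightning are assumed names:

    # Total strikes per year, aggregated the same way as the monthly totals.
    lightning_by_year = (
        union_df.groupby(['year'])
        .agg(year_strikes=pd.NamedAgg(column='number_of_strikes', aggfunc='sum'))
        .reset_index()
    )

    # Merge the yearly totals into the monthly totals on year, then divide
    # each month's strikes by that year's total and multiply by 100.
    percentage_lightning = lightning_by_month.merge(lightning_by_year, on='year')
    percentage_lightning['percentage_lightning_per_month'] = (
        percentage_lightning['number_of_strikes']
        / percentage_lightning['year_strikes'] * 100
    )
    percentage_lightning.head()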

To more easily review our percentages by month, let's plot a data visualization. For this one, a simple grouped bar graph will work well. We'll adjust our figure size to 10 by 6 first. Then we use the seaborn library's barplot with our x-axis as month text and our y-axis as percentage lightning per month. For some color, we'll have our hue change according to the year column, with the data following the month order column. Finally, let's input our x and y labels and our title, and run the cell.
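The bar chart cell might be sketched like this; month_order stands in for the month order column mentioned above:

    # Calendar order for the x-axis.
    month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
                   'August', 'September', 'October', 'November', 'December']

    # Grouped bar chart: one bar per month, colored by year.
    plt.figure(figsize=(10, 6))
    sns.barplot(data=percentage_lightning, x='month_txt',
                y='percentage_lightning_per_month', hue='year', order=month_order)
    plt.xlabel('Month')
    plt.ylabel('% of lightning strikes')
    plt.title('% of lightning strikes per month (2016-2018)')
    plt.show()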
When you analyze the bar chart, August 2018 really stands out. In fact, more than one-third of the lightning strikes for 2018 occurred in August of that year. The next step for a data professional trying to understand these findings might be to research storm and hurricane data, to learn whether those factors contributed to a greater number of lightning strikes for this particular month. Now that you've learned some of the Python code for the EDA practice of structuring, you'll have time to try them out yourself. Good luck finding those stories about your data.
