EDA Structuring With Python
how structuring data can help professionals analyze, understand, and learn
consider the data for 2018, and use our structuring tools
to learn more about whether lightning strikes are more prevalent on some
days than others. Before we do anything else, let's import our Python
the shape of our data by using df.shape. When we run this cell, we get (3401012, 3). Take a
moment to picture
cells vertically. That's incredibly long and thin. We'll use a function for
finding any duplicates. When we enter df.drop_duplicates() with an empty argument field and get
the same exact number as our shape function, we know there are no
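The duplicate check described here can be sketched in a few lines. This is a minimal example with hypothetical stand-in data, not the actual lightning dataset:

```python
import pandas as pd

# Hypothetical stand-in for the lightning-strike data in the video.
df = pd.DataFrame({
    'date': ['2018-01-01', '2018-01-01', '2018-01-02'],
    'number_of_strikes': [100, 150, 75],
    'center_point_geom': ['POINT(-82 22)', 'POINT(-83 23)', 'POINT(-82 22)'],
})

# df.shape gives a (rows, columns) tuple.
print(df.shape)  # (3, 3)

# drop_duplicates() removes fully duplicated rows; if its shape matches
# df.shape, there are no duplicate rows in the data.
print(df.drop_duplicates().shape)  # (3, 3) -> no duplicates
```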
or most to least. While we do this, let's consider the dates with the highest number of strikes. We'll
input df.sort_values. Then in the argument
field type by, then the equals sign. Next, we input the
column we want to sort, number of strikes followed by ascending equals sign false. If we add the head
highest number of strikes are in the lower 2000s. It does seem like a lot of lightning strikes
in just one day, but given that it happened in August when storms are likely, it is probable these
2000 plus strikes were counted during a storm. Next, let's look
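The sorting step just described might look like this minimal sketch, again using hypothetical sample values:

```python
import pandas as pd

# Hypothetical sample of daily strike totals.
df = pd.DataFrame({
    'date': pd.to_datetime(['2018-08-01', '2018-08-02',
                            '2018-08-03', '2018-08-04']),
    'number_of_strikes': [1200, 2156, 980, 2045],
})

# Sort by number_of_strikes, largest first, and view the top rows.
top = df.sort_values(by='number_of_strikes', ascending=False).head(10)
print(top)
```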
lightning on average, every one in three days, with numbers in the low 100s. Meanwhile, some
for the entire year of 2018. We also want to learn if we have an even distribution of values, or whether
108 is a notably high value for
value_counts function, but input a colon, 20 in the brackets so that you can see the first 20 lines. The rest
of the
coding here is to help present the data clearly. We rename the axis and index to unique_values
and counts, respectively. Lastly, we'll add a gradient background to the counts column
for visual effect. After running the cell, we discover zero
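Putting those pieces together, a sketch of that value-count chain could look like this. The column name and data here are hypothetical:

```python
import pandas as pd

# Hypothetical sample: repeated location points, as in the dataset.
df = pd.DataFrame({
    'center_point_geom': ['POINT(-82 22)'] * 5
                       + ['POINT(-83 23)'] * 3
                       + ['POINT(-84 24)'] * 2
})

# Count each unique value, keep the first 20 rows, and rename the
# index and values column so the output reads clearly.
counts = (
    df['center_point_geom']
    .value_counts()[:20]
    .rename_axis('unique_values')
    .reset_index(name='counts')
)
print(counts)

# In a notebook, a gradient background can be added for visual effect:
# counts.style.background_gradient()
```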
structuring method, grouping. You'll often find stories hidden among different
groups in your data. Like the most profitable times a day for retail
field blank and add a .week at the end. This will create a
assigned weekdays. We have some new columns, so let's group the number of strikes by weekday to
determine whether any particular
strikes than others. Let's create a DataFrame with just the weekday and number
of lightning strikes. We'll do this by inputting df, double bracket, weekday, comma, number of strikes
both
and number of strikes, but then also group the total number of strikes
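A minimal sketch of that grouping step, assuming hypothetical daily data and using day_name() to derive the weekday labels:

```python
import pandas as pd

# Hypothetical daily totals spanning one week (Mon..Sun).
df = pd.DataFrame({
    'date': pd.date_range('2018-08-06', periods=7, freq='D'),
    'number_of_strikes': [500, 600, 550, 700, 650, 300, 250],
})

# Derive a weekday name column from the date.
df['weekday'] = df['date'].dt.day_name()

# Keep just the two columns of interest, then total strikes per weekday.
df_by_day = df[['weekday', 'number_of_strikes']]
grouped = df_by_day.groupby('weekday').sum()
print(grouped)
```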
data visualization that depicts the locality, spread, and skew of groups
of values within quartiles. For this dataset and notebook, a box plot visualization will be the most
plots in an upcoming video. Now before we plot, let's set the weekday order
to start with Monday. Now to code that, input g, equal sign, sns.boxplot. Next, in the argument field,
let's have x equal weekday and
field, let's input False. The showfliers parameter refers to outliers that may or may not be
displayed. If you input True, outliers are included. If you input False, outliers are left off
the box plot chart. Keep in mind, we aren't deleting any outliers from the dataset
per weekday for 2018 and click run cell. Now you'll discover something
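The box plot described above could be sketched like this, with hypothetical data and a fixed weekday order starting on Monday:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical daily strike counts with weekday labels.
df = pd.DataFrame({
    'weekday': ['Monday', 'Tuesday', 'Saturday', 'Sunday'] * 5,
    'number_of_strikes': range(20),
})

# Fix the weekday order so the plot starts with Monday.
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']

# showfliers=False leaves outliers off the chart without deleting
# anything from the underlying dataset.
g = sns.boxplot(data=df, x='weekday', y='number_of_strikes',
                order=weekday_order, showfliers=False)
g.set_title('Lightning distribution per weekday (2018)')
plt.savefig('boxplot.png')
```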
Sunday, however, the distributions are both lower than the rest of the week. Let's consider why that is.
What do you think
take a break on the weekends or that people do not report as many lightning
strikes on weekends? While we don't know for sure, we have clear data suggesting the total quantity of
weekend lightning strikes
video was merging, which you'll remember means combining two different
how to perform this method in Python if we want to learn more about our
dataset is formatted the same. The new datasets do not have the extra columns week and weekday that
we created earlier. To merge them successfully, we need to either remove the new columns or add
and week columns out. We also input the axis we want to drop,
frames will already align along their first columns and now you've just learned to
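Dropping the extra columns so the frames line up might look like this sketch; the frames and values are hypothetical, and pd.concat stands in for whichever combining method the notebook uses:

```python
import pandas as pd

# df_2018 carries the extra columns created earlier; the hypothetical
# newly loaded year does not.
df_2018 = pd.DataFrame({
    'date': pd.to_datetime(['2018-08-01']),
    'number_of_strikes': [1200],
    'week': [31],
    'weekday': ['Wednesday'],
})
df_2017 = pd.DataFrame({
    'date': pd.to_datetime(['2017-08-01']),
    'number_of_strikes': [900],
})

# Drop the week and weekday columns (axis=1 means columns) so both
# frames share the same columns before combining them.
df_2018_trimmed = df_2018.drop(['week', 'weekday'], axis=1)
combined = pd.concat([df_2018_trimmed, df_2017], ignore_index=True)
print(combined)
```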
part of structuring, create three date columns following the same steps
we can compare them. We can do this by simply taking the two columns
and adding .sum() on the end. You'll find that 2017 did have fewer total strikes
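Comparing two columns' totals is a one-liner each; here is a sketch with hypothetical column names and values:

```python
import pandas as pd

# Hypothetical strike totals for two years, side by side.
df = pd.DataFrame({
    'strikes_2017': [900, 800, 700],
    'strikes_2018': [1200, 1100, 1000],
})

# .sum() on each column gives yearly totals we can compare directly.
total_2017 = df['strikes_2017'].sum()
total_2018 = df['strikes_2018'].sum()
print(total_2017, total_2018)  # 2400 3300
```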
pandas function NamedAgg. In the argument field, we place our column name and our aggregate
function
equal to sum, so that we get the totals for each of the months
in all three years. When we input the head function, we have the months in
alphabetical order, along with the sums
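A minimal sketch of that NamedAgg aggregation, with hypothetical monthly data; note that groupby sorts its keys by default, which is why the months come out alphabetically:

```python
import pandas as pd

# Hypothetical monthly totals across years.
df = pd.DataFrame({
    'month': ['April', 'April', 'August', 'August'],
    'number_of_strikes': [100, 120, 300, 350],
})

# pd.NamedAgg names the output column and sets the aggregate function;
# here we sum strikes per month.
monthly = df.groupby('month').agg(
    strikes=pd.NamedAgg(column='number_of_strikes', aggfunc='sum')
)
print(monthly.head())
```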
aggregation for year and years strikes to review those same numbers
percentages by month, let's plot a data visualization. For this one, a simple grouped
library bar plot with our x-axis as the month text and our y-axis as percentage
our hue change according to the year column with the data following the month
order column. Finally, let's input our x and y labels and our title
and run the cell. When you analyze the bar chart, August 2018 really stands out. In fact, more than one-
third of the lightning strikes for 2018 occurred in
August of that year. The next step for a data professional trying
this particular month. Now that you've learned some of the Python code for the EDA
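The grouped bar plot walked through above could be sketched as follows. The column names, the two-month order list, and the percentage values are all hypothetical stand-ins:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical month/year percentages of each year's total strikes.
df = pd.DataFrame({
    'month_txt': ['Jul', 'Aug', 'Jul', 'Aug'],
    'year': [2017, 2017, 2018, 2018],
    'percentage_lightning': [12.0, 15.0, 10.0, 34.0],
})
month_order = ['Jul', 'Aug']  # stand-in for the full Jan..Dec order

# A grouped bar plot: x is the month text, y the percentage, and hue
# splits the bars by year, following the month order.
ax = sns.barplot(data=df, x='month_txt', y='percentage_lightning',
                 hue='year', order=month_order)
ax.set_xlabel('Month')
ax.set_ylabel('% of yearly lightning strikes')
ax.set_title('Percentage of lightning strikes by month and year')
plt.savefig('barplot.png')
```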