
# ## Cleaning

# Standard imports

# In[41]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as pp
get_ipython().run_line_magic('matplotlib', 'inline')
# In[42]:
#billboard = pd.read_csv('billboard.csv')
# On Unix or macOS, we can easily find out the encoding with the command-line utility file. On
# Windows, you may have to open the file with an editor.
# In[43]:
get_ipython().system('file billboard.csv')
# In[44]:
billboard = pd.read_csv('billboard.csv',encoding='latin-1')
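# Where the file utility is not available, one rough alternative is to try candidate encodings
# from Python. This is only a sketch: read_csv_with_fallback and the toy file are hypothetical
# names, not part of the original notebook. Because latin-1 can decode any byte sequence, a
# successful read does not prove that the guess is right.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: try each encoding in turn, return (DataFrame, encoding used)."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")


# Toy file (made up): an accented name stored as Latin-1 bytes, which are not valid UTF-8.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("artist,track\nBeyoncé,Example\n".encode("latin-1"))
    toy_path = tmp.name

toy, used = read_csv_with_fallback(toy_path)
os.remove(toy_path)
print(used)
```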
# In Pandas, we can use head to see the first few lines of a table.
# In[45]:
# Let's look at the name of the columns.
# In[46]:
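# A minimal sketch of both inspections on a made-up table (the rows and column names here are
# hypothetical stand-ins, not the Billboard data): head returns the first five rows by default,
# and the columns attribute holds the column names.

```python
import pandas as pd

# Toy table with six rows so that head() visibly truncates it.
toy = pd.DataFrame({
    "artist": ["A", "B", "C", "D", "E", "F"],
    "track": ["Song A", "Song B", "Song C", "Song D", "Song E", "Song F"],
    "wk1": [87, 59, 76, 91, 44, 68],
})

first_rows = toy.head()        # first five rows by default
first_two = toy.head(2)        # or ask for a specific number of rows
column_names = list(toy.columns)

print(first_rows)
print(column_names)
```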
# We can plot the evolution of the ranking for any given song.
# In[47]:
# In[48]:
# We're going to tell plot explicitly to use a range of one through 76.
# In[49]:
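# A sketch of plotting one song's ranking against an explicit week range of one through 76. The
# rankings below are made up, and the Agg backend is an assumption so that the example runs
# outside a notebook; weeks after the song leaves the chart are NaN and are simply not drawn.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as pp
import numpy as np

weeks = range(1, 77)  # week numbers 1 through 76

# Hypothetical rankings for a single song: ten charted weeks, then missing values.
ranks = np.full(76, np.nan)
ranks[:10] = [80, 70, 60, 50, 40, 35, 30, 45, 60, 90]

fig, ax = pp.subplots()
line, = ax.plot(weeks, ranks)  # x values given explicitly instead of the default 0..75
ax.set_xlabel("week")
ax.set_ylabel("rank")
```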
# Plot several songs at once:
# We can iterate over rows with the DataFrame method iterrows, which yields both the index and
# the content of the row.
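# A small illustration of what iterrows yields, using a made-up two-row table with a non-default
# index (all names and values here are hypothetical): each step gives the index label and the row
# as a Series keyed by column name.

```python
import pandas as pd

toy = pd.DataFrame(
    {"track": ["Song A", "Song B"], "wk1": [87, 59], "wk2": [82, 53]},
    index=[10, 11],
)

seen = []
for index, row in toy.iterrows():
    # row is a Series, so individual values are looked up by column name
    seen.append((index, row["track"], row["wk1"]))

print(seen)
```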

# In[50]:
for index, row in billboard.iterrows():
# In[51]:
for i in range (0,2):
# One problem with this DataFrame is that the rankings are not very usable for analysis because
# they're divided over multiple columns.
# It would make more sense to have each ranking in a separate row, with the week number in a
# column of its own.
# Technically, this is called melting each row into multiple ones, each of which represents a
# single observation.
# In[52]:
bshort =
# In[53]:
# We should also rename the columns to better, simpler, and more consistent variable names. We
# can do that by assigning directly to the columns attribute.
# In[54]:
bshort.columns = ['artist','track','time','date.entered','wk1','wk2','wk3']
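# Assigning to columns replaces every name by position, so the list must match the number and
# order of the existing columns. As a sketch on a made-up table (the original column names below
# are hypothetical), the rename method with a mapping gives the same result without depending on
# column order.

```python
import pandas as pd

# Hypothetical raw column names, standing in for the real ones.
raw = pd.DataFrame([["A", "Song A", 87]], columns=["Artist Name", "Track Title", "Week.1"])

# Positional renaming: one new name per existing column, in order.
direct = raw.copy()
direct.columns = ["artist", "track", "wk1"]

# Renaming by mapping: old name -> new name, order does not matter.
mapped = raw.rename(columns={"Week.1": "wk1", "Artist Name": "artist", "Track Title": "track"})

print(list(direct.columns), list(mapped.columns))
```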
# We get to melting using the Pandas melt method.
# Melt requires that we specify the identifier variables, which are repeated across several
# rows. In our case, we need artist, track, time, and date.entered.
# Next, we tell melt which columns hold the values, or observations. In this case, week one,
# week two, and week three.
# Next, the name of the column that holds the type of observation. It would be week.
# And finally, the name of the column that holds the value of the observation. It should
# be rank.
# In[55]:
bmelt = bshort.melt(['artist','track','time','date.entered'],['wk1','wk2','wk3'],'week','rank')
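# The four positional arguments above correspond to the melt parameters id_vars, value_vars,
# var_name, and value_name. A sketch with keyword arguments on a made-up two-song table (the
# rows are hypothetical) shows the shape of the result: one row per song and week.

```python
import pandas as pd

bshort_toy = pd.DataFrame({
    "artist": ["A", "B"],
    "track": ["Song A", "Song B"],
    "wk1": [87, 59],
    "wk2": [82, 53],
})

melted = bshort_toy.melt(
    id_vars=["artist", "track"],   # identifier columns, repeated on every output row
    value_vars=["wk1", "wk2"],     # columns whose values become observations
    var_name="week",               # column holding the former column name
    value_name="rank",             # column holding the observed value
)

print(melted)
```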
# To see that this works, it's better to select a specific song.
# We can do that with the Pandas method query which takes a query as a string
# in something like a natural language specification.
# In[56]:
bmelt.query('track == "Liar"')
# In[57]:
bmelt.query('artist == "Madonna"')
# In[58]:
bmelt.query('artist == "Savage Garden"')
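# For comparison, a sketch on made-up rows showing that a query string selects the same rows as a
# boolean mask, and that a Python variable can be referenced inside the string with @ (the table
# and the variable name wanted are hypothetical).

```python
import pandas as pd

toy = pd.DataFrame({
    "artist": ["A", "B", "A"],
    "track": ["Song 1", "Song 2", "Song 3"],
    "rank": [41, 71, 78],
})

by_query = toy.query('artist == "A"')      # condition written as a string
by_mask = toy[toy["artist"] == "A"]        # equivalent boolean-mask selection

wanted = "A"
by_variable = toy.query("artist == @wanted")  # @ refers to a local Python variable

print(by_query)
```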
# Converting the week to a number:
# We can do this using the apply method on the Pandas Series for the column week. Apply takes a
# Python function, and we're going to define it on the fly using lambda. We need the third
# character in each string, and we need to turn that into a number, an integer.
# In[59]:
bmelt['week'] = bmelt['week'].apply(lambda s: int(s[2]))
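# Taking only the third character works for wk1 through wk3, but it would read just one digit of
# a two-digit week. A sketch of a variant that keeps everything after the two-letter prefix (the
# label "wk10" is hypothetical; the table above only has three weeks):

```python
import pandas as pd

weeks = pd.Series(["wk1", "wk2", "wk10"])  # "wk10" is a made-up label for illustration

third_char = weeks.apply(lambda s: int(s[2]))    # one character only
all_digits = weeks.apply(lambda s: int(s[2:]))   # everything after "wk"
vectorized = weeks.str[2:].astype(int)           # same result with string methods, no lambda

print(list(third_char), list(all_digits))
```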
# In[60]:
# Once we have true dates, we can do date arithmetic and apply, for instance, Timedeltas,
# obtaining the correct date for each ranking in week one, two, and three.
# In[61]:
bmelt['date.entered'] = pd.to_datetime(bmelt['date.entered'])
# In[62]:
# In[63]:
bmelt['date.entered'][0] + pd.Timedelta('7 days')
# In[64]:
bmelt['date'] = bmelt['date.entered'] + pd.Timedelta('7 days') * (bmelt['week'] - 1)
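# The same arithmetic on a made-up three-row table (the entry date is hypothetical): week one
# keeps the entry date, and each later week adds another seven days.

```python
import pandas as pd

toy = pd.DataFrame({
    "date.entered": ["2000-02-26", "2000-02-26", "2000-02-26"],  # made-up entry date
    "week": [1, 2, 3],
})

# Strings become true datetimes, so arithmetic with Timedelta is possible.
toy["date.entered"] = pd.to_datetime(toy["date.entered"])
toy["date"] = toy["date.entered"] + pd.Timedelta("7 days") * (toy["week"] - 1)

dates_as_text = toy["date"].dt.strftime("%Y-%m-%d").tolist()
print(dates_as_text)
```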
# In[78]:
# In[65]:
# At this point, we may as well drop the column date.entered. We need to tell Pandas that we're
# working with the columns, so axis 1.
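# A sketch of dropping a column on a made-up one-row table (values are hypothetical). The axis=1
# form and the columns= keyword are equivalent, and drop returns a new DataFrame rather than
# changing the original.

```python
import pandas as pd

toy = pd.DataFrame({
    "track": ["Song A"],
    "date.entered": ["2000-02-26"],
    "date": ["2000-03-04"],
})

by_axis = toy.drop(["date.entered"], axis=1)       # axis=1 means "look among the columns"
by_keyword = toy.drop(columns=["date.entered"])    # same effect, stated by keyword

print(list(by_axis.columns))
```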
# In[66]:
# In[67]:
bmelt.query('track == "Liar"')
# We sort both the columns and the rows. Artist, track, time, date, week, and rank.
# In[68]:
bfinal = bmelt[['artist','track','time','date','week','rank']]
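# Selecting with a list of names fixes the column order. For the rows, one option is sort_values;
# the sort keys below (artist, track, week) are an assumption for illustration, as is the made-up
# table.

```python
import pandas as pd

toy = pd.DataFrame({
    "rank": [53, 87, 82, 59],
    "week": [2, 1, 2, 1],
    "artist": ["B", "A", "A", "B"],
    "track": ["Song B", "Song A", "Song A", "Song B"],
})

ordered = toy[["artist", "track", "week", "rank"]]  # column order set by the list
sorted_rows = ordered.sort_values(["artist", "track", "week"]).reset_index(drop=True)

print(sorted_rows)
```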
# In[69]:
# The artist name, the track name, and the track length appear in multiple rows.
# The correct way to remove the redundancy is to create a separate table with track data and
# link it to the rankings using an index.
# In the context of relational databases, this is called data normalization.
# In[70]:
tracks = bfinal[['artist','track','time']].drop_duplicates()
# The index is unique, and we need to carry it explicitly as a column. So we reset the index and
# rename it id. We can assign directly to the name attribute of the index.
# In[71]:
tracks.index.name = 'id'
tracksid = tracks.reset_index()
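# A sketch of building the track table on made-up rows. It assumes the index is named id before
# reset_index is called, so that the old index reappears as a column called id; the data are
# hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "artist": ["A", "A", "B", "B"],
    "track": ["Song A", "Song A", "Song B", "Song B"],
    "time": ["3:30", "3:30", "4:05", "4:05"],
    "rank": [87, 82, 59, 53],
})

# One row per distinct track; the surviving index labels (0 and 2) are unique.
tracks = toy[["artist", "track", "time"]].drop_duplicates()

tracks.index.name = "id"          # assumption: name the index first
tracksid = tracks.reset_index()   # the named index becomes an ordinary column

print(tracksid)
```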
# We now perform a database-style join operation between the new table of tracks and the
# Billboard table. We do so by matching on artist, track, and time.
# In[72]:
# In[73]:
# In[74]:
tidy = pd.merge(tracksid,bfinal,on=['track','artist','time']).drop(['artist','track','time'],axis=1)
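# The same join on made-up tables (all rows hypothetical): every ranking row picks up the id of
# its track, after which the descriptive columns can be dropped from the rankings.

```python
import pandas as pd

tracksid = pd.DataFrame({
    "id": [0, 2],
    "artist": ["A", "B"],
    "track": ["Song A", "Song B"],
    "time": ["3:30", "4:05"],
})
rankings = pd.DataFrame({
    "artist": ["A", "A", "B"],
    "track": ["Song A", "Song A", "Song B"],
    "time": ["3:30", "3:30", "4:05"],
    "week": [1, 2, 1],
    "rank": [87, 82, 59],
})

# Inner join on the shared key columns, then keep only id plus the observations.
joined = pd.merge(tracksid, rankings, on=["track", "artist", "time"])
tidy = joined.drop(["artist", "track", "time"], axis=1)

print(tidy)
```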
# In[80]:
tidy[tidy.week == 3]['rank']
# In[76]:
tidy.loc[tidy[tidy.week == 1]['rank'].idxmin()]
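# Step by step, on a made-up table: the mask keeps week-one rows, idxmin returns the index label
# of the smallest rank among them, and loc fetches that row. The column is written as ['rank']
# because tidy.rank would refer to the DataFrame method of that name, not the column.

```python
import pandas as pd

tidy = pd.DataFrame({
    "id": [0, 0, 2, 2],
    "week": [1, 2, 1, 2],
    "rank": [87, 82, 59, 53],   # made-up rankings
})

week1 = tidy[tidy.week == 1]          # rows for the first week only
best_label = week1["rank"].idxmin()   # index label of the lowest (best) rank
best_row = tidy.loc[best_label]       # full row at that label

print(best_label)
print(best_row)
```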
# In[77]:
tracksid.query('id == 1')
