
Data Sciences

Introduction
Data visualization is part of data exploration, which is a critical step in the AI cycle. You will
use this technique to gain understanding of and insight into the data you have gathered, and
to determine whether the data is ready for further processing or whether you need to collect
more data or clean the data.
You will also use this technique to present your results.
In this notebook, we will explore Python packages that are central to data science. Packages
such as pandas, NumPy and Matplotlib are used throughout the process.

About the Notebook


This Jupyter notebook focuses on data visualisation in Python. To help learners understand
the topic as well as possible, many additional resources are provided in the notebook
as links. Readers can follow those links to explore the subject further.

Context
We will be working with Jaipur weather data obtained from Kaggle, a platform for data
enthusiasts to gather, share knowledge and compete for many prizes!
The data has been cleaned and simplified, so that we can focus on data visualization instead
of data cleaning. Our data is stored in the file named JaipurFinalCleanData.csv. This file
contains weather information of Jaipur and is saved at the same location as the notebook.
What do you do next?

Side note: What is csv?


A CSV (Comma-Separated Values) file contains a set of data in which the values are separated by commas.
We usually access these files using spreadsheet applications such as Excel or Google Sheets.
Do you know how this is done?
Today, we will learn how to use Python to open csv files.
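To make the format concrete, here is a minimal sketch that parses a small piece of CSV text with pandas. The three rows are copied from the Jaipur data used in this notebook. The `csv_text` and `sample` names and the use of `io.StringIO` (which lets a string stand in for a file) are assumptions made for illustration and are not part of the original notebook.

```python
import io
import pandas as pd

# A tiny CSV: the first line holds the headers, each later line is one row.
csv_text = """date,mean_temperature,rainfall
2016-05-04,34,0.0
2016-05-05,31,0.0
2016-05-06,28,5.0
"""

# io.StringIO makes the string behave like an open file for read_csv.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

Each comma marks the boundary between two columns, which is why the parsed result has three columns and three rows.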

Use Python to open csv files


We will use the pandas library to work with our dataset. Pandas is a popular Python library
for data science. It offers powerful and flexible data structures to make data manipulation
and analysis easier.
Import Pandas
# "import pandas as pd" means we can type "pd" to call the pandas library
import pandas as pd

Now that we have imported pandas, let's start by reading the csv file.
#saving the csv file into a variable which we will call data frame
dataframe = pd.read_csv("JaipurFinalCleanData.csv")

Exploring our data


Great! We now have a variable that contains our weather data. Let's explore our data. Use
the .head() function to see the first few rows of data.
#dataframe.head() means we are getting the first 5 rows of data
# try running it to see what data is in the jaipur csv file
print (dataframe.head())

         date  mean_temperature  max_temperature  min_temperature  \
0  2016-05-04                34               41               27
1  2016-05-05                31               38               24
2  2016-05-06                28               34               21
3  2016-05-07                30               38               23
4  2016-05-08                34               41               26

   Mean_dew_pt  mean_pressure  max_humidity  min_humidity  max_dew_pt_1  \
0            6        1006.00            27             5            12
1            7        1005.65            29             6            13
2           11        1007.94            61            13            16
3           13        1008.39            69            18            17
4           10        1007.62            50             8            14

   max_dew_pt_2  min_dew_pt_1  min_dew_pt_2  max_pressure_1  max_pressure_2  \
0            10            -2            -2            1009            1008
1            12             0            -2            1008            1009
2            13             6             0            1011            1008
3            16             9             6            1011            1011
4            17             6             9            1010            1011

   min_pressure_1  min_pressure_2  rainfall
0            1000            1001       0.0
1            1001            1000       0.0
2            1003            1001       5.0
3            1004            1003       0.0
4            1002            1004       0.0

Task 1: Display the first 10 rows of data by modifying the function above
print (dataframe.head(10))

         date  mean_temperature  max_temperature  min_temperature  \
0  2016-05-04                34               41               27
1  2016-05-05                31               38               24
2  2016-05-06                28               34               21
3  2016-05-07                30               38               23
4  2016-05-08                34               41               26
5  2016-05-09                34               42               27
6  2016-05-10                34               41               27
7  2016-05-11                32               40               25
8  2016-05-12                34               42               27
9  2016-05-13                34               42               26

   Mean_dew_pt  mean_pressure  max_humidity  min_humidity  max_dew_pt_1  \
0            6        1006.00            27             5            12
1            7        1005.65            29             6            13
2           11        1007.94            61            13            16
3           13        1008.39            69            18            17
4           10        1007.62            50             8            14
5            8        1006.73            32             7            12
6           11        1005.75            45             7            16
7           16        1007.10            51            12            18
8           16        1006.78            66            16            22
9           13        1003.83            58             9            20

   max_dew_pt_2  min_dew_pt_1  min_dew_pt_2  max_pressure_1  max_pressure_2  \
0            10            -2            -2            1009            1008
1            12             0            -2            1008            1009
2            13             6             0            1011            1008
3            16             9             6            1011            1011
4            17             6             9            1010            1011
5            14             6             6            1010            1010
6            12             7             6            1008            1010
7            16            13             7            1010            1008
8            18            10            13            1011            1010
9            22            10            10            1007            1011

   min_pressure_1  min_pressure_2  rainfall
0            1000            1001       0.0
1            1001            1000       0.0
2            1003            1001       5.0
3            1004            1003       0.0
4            1002            1004       0.0
5            1002            1002       0.0
6            1000            1002       0.3
7            1002            1000       0.8
8            1001            1002       2.0
9             998            1001       0.3

Now that you have listed the first few rows of the data, what do you notice?
• What headers are there? If you are not sure what they mean, look them up online!
• Do the values recorded make sense to you?

Find out your data type


You can use dtypes to find out the type of data (e.g. string, float, integer) you have.
dataframe.dtypes

date object
mean_temperature int64
max_temperature int64
min_temperature int64
Mean_dew_pt int64
mean_pressure float64
max_humidity int64
min_humidity int64
max_dew_pt_1 int64
max_dew_pt_2 int64
min_dew_pt_1 int64
min_dew_pt_2 int64
max_pressure_1 int64
max_pressure_2 int64
min_pressure_1 int64
min_pressure_2 int64
rainfall float64
dtype: object
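The date column is listed as object, which means pandas has read the dates as plain text rather than as dates. A minimal sketch of converting such a column with `pd.to_datetime` follows. The small `sample` frame is a stand-in built for illustration (the original notebook keeps the dates as text), so treat this as an optional extra rather than a required step.

```python
import pandas as pd

# Stand-in for the real dataframe: dates are stored as text.
sample = pd.DataFrame({
    "date": ["2016-05-04", "2016-05-05", "2016-05-06"],
    "mean_temperature": [34, 31, 28],
})
print(sample.dtypes)  # the date column is text at this point

# Convert the text dates into real timestamps.
sample["date"] = pd.to_datetime(sample["date"])
print(sample.dtypes)  # the date column is now a datetime type
```

Once a column holds real timestamps, pandas can do date arithmetic on it, for example extracting the month with `sample["date"].dt.month`.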

Side note: Find out about a function


Sometimes, we encounter a function we are not familiar with. We can use the help function
within Jupyter to obtain more information about a function: its purpose, expected input,
parameters and output. It will even give you examples of how to use the function! Let's try
it with the .head() function.
dataframe.head?

What do you notice? What are the parameters involved? What are the outputs generated?

Task 2: What does dtypes do?


Use the help function ? to find out what dtypes is for
dataframe.dtypes?

What do you notice? What are the parameters involved? What are the outputs generated?
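The ? syntax only works inside Jupyter and IPython. As a sketch of an alternative that also works in a plain Python script, the built-in `help()` function prints much the same documentation. The `doc` variable below is just an illustrative name.

```python
import pandas as pd

# help() prints the documentation of a function to the screen.
help(pd.DataFrame.head)

# The same text is stored on the function as its __doc__ attribute.
doc = pd.DataFrame.head.__doc__
print(doc[:200])  # show only the beginning of the documentation
```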

Choosing only the data we are interested in


Let's see all the headers that we have first and consider which columns we want to work
with for now.

Task 3: List the headers from the dataset


dataframe.dtypes

date object
mean_temperature int64
max_temperature int64
min_temperature int64
Mean_dew_pt int64
mean_pressure float64
max_humidity int64
min_humidity int64
max_dew_pt_1 int64
max_dew_pt_2 int64
min_dew_pt_1 int64
min_dew_pt_2 int64
max_pressure_1 int64
max_pressure_2 int64
min_pressure_1 int64
min_pressure_2 int64
rainfall float64
dtype: object
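dtypes lists each header together with its data type. If only the header names are wanted, the `columns` attribute of a dataframe is an alternative. The sketch below uses a small stand-in frame named `sample` (an assumption for illustration) instead of the real dataframe.

```python
import pandas as pd

# Stand-in for the real dataframe.
sample = pd.DataFrame({
    "date": ["2016-05-04", "2016-05-05"],
    "mean_temperature": [34, 31],
    "rainfall": [0.0, 0.0],
})

headers = list(sample.columns)  # column names only, without data types
print(headers)
print(len(headers), "columns")
```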

Looks like there are 17 columns in this dataset, and we don't need all of them for the
purposes of this activity. One way to handle this is to drop the columns that we
don't need. Pandas provides an easy way for us to drop columns using the ".drop" function.
dataframe = dataframe.drop(["max_dew_pt_2"], axis=1)

Let's check that the column has been dropped. Try printing the columns with head() or dtypes.
dataframe.dtypes

date object
mean_temperature int64
max_temperature int64
min_temperature int64
Mean_dew_pt int64
mean_pressure float64
max_humidity int64
min_humidity int64
max_dew_pt_1 int64
min_dew_pt_1 int64
min_dew_pt_2 int64
max_pressure_1 int64
max_pressure_2 int64
min_pressure_1 int64
min_pressure_2 int64
rainfall float64
dtype: object
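Note that .drop returns a new dataframe and leaves the original untouched, which is why the result is assigned back to dataframe above. The sketch below shows this on a small stand-in frame. The `sample` and `trimmed` names are illustrative, and `columns=[...]` is an equivalent spelling of passing the list with axis=1.

```python
import pandas as pd

# Stand-in for the real dataframe.
sample = pd.DataFrame({
    "mean_temperature": [34, 31],
    "max_dew_pt_1": [12, 13],
    "max_dew_pt_2": [10, 12],
})

# columns=[...] is equivalent to passing the list together with axis=1.
trimmed = sample.drop(columns=["max_dew_pt_2"])

print(list(sample.columns))   # the original still has all three columns
print(list(trimmed.columns))  # the new frame no longer has max_dew_pt_2
```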

Task 4: Drop the following columns: (min_dew_pt_2, max_pressure_2, min_pressure_2)


dataframe = dataframe.drop(["min_dew_pt_2", "max_pressure_2",
"min_pressure_2"], axis=1)

Task 5: Now check again whether these columns have been dropped


dataframe.dtypes

date object
mean_temperature int64
max_temperature int64
min_temperature int64
Mean_dew_pt int64
mean_pressure float64
max_humidity int64
min_humidity int64
max_dew_pt_1 int64
min_dew_pt_1 int64
max_pressure_1 int64
min_pressure_1 int64
rainfall float64
dtype: object

Great! We can now focus on this set of data!

Sorting values using pandas


Often, you want to get a sense of the range of your data to help you understand it better.
Another feature of the pandas dataframe is sorting of values. You can do so by using the
sort_values() function.
jaipur_weather = dataframe.sort_values(by='date',ascending = False)
print(jaipur_weather.head(5))

           date  mean_temperature  max_temperature  min_temperature  \
675  2018-03-11                26               34               18
674  2018-03-10                26               34               19
673  2018-03-09                26               33               19
672  2018-03-08                24               32               15
671  2018-03-07                24               32               15

     Mean_dew_pt  mean_pressure  max_humidity  min_humidity  max_dew_pt_1  \
675            4        1013.76            38             6             8
674            3        1014.16            37             8             6
673            1        1014.41            42             7             5
672            2        1014.07            55             5             8
671            4        1015.39            48             6             9

     min_dew_pt_1  max_pressure_1  min_pressure_1  rainfall
675             0            1017            1009       0.0
674            -1            1017            1009       0.0
673            -5            1017            1011       0.0
672            -6            1017            1011       0.0
671            -3            1018            1012       0.0

What do you notice from the numbers? Look at the date. Can you see how the function helps
us sort data based on the date?
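The dates here are stored as text, yet sorting still gives calendar order because the YYYY-MM-DD format sorts alphabetically in the same order as chronologically. The sketch below illustrates this on a small stand-in frame. The `sample` and `newest_first` names are illustrative assumptions.

```python
import pandas as pd

# Stand-in for the real dataframe, with rows deliberately out of order.
sample = pd.DataFrame({
    "date": ["2016-05-05", "2018-03-11", "2016-05-04"],
    "mean_temperature": [31, 26, 34],
})

newest_first = sample.sort_values(by="date", ascending=False)
print(newest_first)

# sort_values returns a new frame; each row keeps its original row label.
print(list(newest_first.index))
```

The row labels in the left-most column travel with their rows, which is why the sorted output above starts at 675 instead of 0.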

Task 6: Sort the values in ascending order of mean temperature and print the first 5 rows
jaipur_weather = dataframe.sort_values(by='mean_temperature', ascending=True)
print(jaipur_weather.head(5))

           date  mean_temperature  max_temperature  min_temperature  \
252  2017-01-11                10               18                3
253  2017-01-12                12               19                4
254  2017-01-13                12               20                4
255  2017-01-14                12               20                5
258  2017-01-17                12               20                5

     Mean_dew_pt  mean_pressure  max_humidity  min_humidity  max_dew_pt_1  \
252            3        1017.00            94            17             9
253           -3        1017.54            70            13             2
254           -5        1017.24            75             4             2
255           -1        1017.75            70            10             1
258            3        1017.35            74            15             7

     min_dew_pt_1  max_pressure_1  min_pressure_1  rainfall
252            -5            1019            1015       0.0
253            -7            1020            1015       0.0
254           -93            1020            1015       0.0
255            -8            1020            1016       0.0
258            -2            1019            1015       0.0

Look at the max and min temperature! See the range of temperature that one can
experience within a day.

Task 7: Sort the values in descending order of mean temperature and print the first 5 rows
jaipur_weather = dataframe.sort_values(by='mean_temperature', ascending=False)
print(jaipur_weather.head(5))

          date  mean_temperature  max_temperature  min_temperature  \
32  2016-06-05                38               45               31
15  2016-05-19                38               46               29
31  2016-06-04                38               44               31
34  2016-06-07                38               45               30
35  2016-06-08                38               44               31

    Mean_dew_pt  mean_pressure  max_humidity  min_humidity  max_dew_pt_1  \
32            5        1004.67            27             4            18
15           11         999.88            45             5            17
31           13        1004.93            34            10            18
34           13        1003.29            51             5            21
35           12        1002.83            47             4            22

    min_dew_pt_1  max_pressure_1  min_pressure_1  rainfall
32             2            1007             999       0.0
15             6            1002             994       0.0
31             7            1008             999       0.0
34             5            1007             997       0.0
35             2            1006             996       0.0

See how the temperature ranges within a year? 10C to 38C!
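Sorting twice is one way to find the range. As a sketch of a more direct route, the `min()` and `max()` methods of a column return the extremes in one step. The values in `sample` below are a handful of mean temperatures taken from the outputs above, and the variable names are illustrative.

```python
import pandas as pd

# A few mean temperatures from the outputs above, standing in for the full column.
sample = pd.DataFrame({"mean_temperature": [34, 31, 28, 10, 38, 12]})

coldest = sample["mean_temperature"].min()
hottest = sample["mean_temperature"].max()
print(f"Mean temperature ranges from {coldest}C to {hottest}C")

# describe() reports count, mean, min, max and quartiles in one call.
print(sample["mean_temperature"].describe())
```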


Now we have a clearer picture of our dataset. Using these functions, we can analyze our
data and gain insights from it.
However, we want to get an even better picture. We want to learn how to explore this
data visually.
Let's now use the matplotlib library to help us with data visualization in Python.

Importing matplotlib
Matplotlib is a Python 2D plotting library that we can use to produce high-quality data
visualizations. It is highly usable (as you will soon find out): you can create simple and
complex graphs with just a few lines of code!
Now let's load matplotlib to start plotting some graphs.
import matplotlib.pyplot as plt
import numpy as np

Scatter plot
Scatter plots use a collection of points on a graph to display values from two variables. This
allows us to see if there is any relationship or correlation between the two variables.
Let's see how mean temperature changes over the years!
x = dataframe.date
y = dataframe.mean_temperature

plt.scatter(x,y)
plt.show()
Do you see that the x axis is filled with a thick line, and that there's no tick label available?
This makes us unable to analyze the data.
Let's try to modify this scatter plot so that we can see the ticks!

Choose only several ticks


One reason why there's a thick bar below the x axis is that there are numerous labels (about
two years of daily data) on the x axis.
The first thing we are going to do is reduce the number of ticks on the x
axis. We do this using the np.arange function as below:
plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 30))
plt.show()
Now you can see the tick labels more clearly, but they are still overlapping.
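To see what np.arange actually hands to plt.xticks, here is a small sketch that prints the tick positions on their own. The `ticks` and `fewer` names are illustrative.

```python
import numpy as np

# np.arange(start, stop, step): start at 0, stop before 731, step by 30.
ticks = np.arange(0, 731, 30)
print(ticks[:5])   # the first few tick positions
print(len(ticks))  # 25 positions in total

# A larger step gives fewer ticks, so the labels have more room.
fewer = np.arange(0, 731, 180)
print(fewer)
print(len(fewer))  # 5 positions in total
```

Each number is a row position in the data, so a step of 30 places a tick at roughly every 30th day.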

Task 8: Change x ticks interval so that you can see the dates clearly
plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 180))
plt.show()

Which interval lets you see all the dates? Do you notice that we now
have only a few ticks?
Let's try to rotate our tick labels. See the example on Stack Overflow!
Note: Stack Overflow is a site where programmers ask and answer technical questions.
You can search the site to see whether others have already solved your problem!

Task 9: Rotate our x ticks label so that we can see more ticks more clearly
plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=90)
plt.show()

Now we can see the x-ticks clearly.
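An alternative to choosing tick positions by hand is to give matplotlib real dates, in which case it picks a sensible number of date ticks by itself. The sketch below is only an illustration under stated assumptions: the `sample` frame holds synthetic temperatures rather than the Jaipur data, and the "Agg" line is needed only when running as a script without a display (leave it out inside Jupyter).

```python
import matplotlib
matplotlib.use("Agg")  # off-screen drawing for scripts; omit inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic stand-in data: 60 consecutive days with made-up temperatures.
sample = pd.DataFrame({
    "date": pd.date_range("2016-05-04", periods=60, freq="D"),
    "mean_temperature": [30 + (i % 7) for i in range(60)],
})

fig, ax = plt.subplots()
ax.scatter(sample["date"], sample["mean_temperature"])

# Rotate and right-align the date labels so they do not overlap.
fig.autofmt_xdate(rotation=30)
fig.canvas.draw()
```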


Notice how temperature changes according to the time of the year. Compare it with this
website. Does it inform you when to best plant your crop?

Giving label to the x and y axis


You can also add labels to the x and y axes. This will make it easier for you to visualise and
share your data.
plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=30)

# Add x and y labels and set a font size


plt.xlabel ("Date", fontsize = 14)
plt.ylabel ("Mean Temperature", fontsize = 14)

plt.show()

Looks good!

Task 10: Now, let's add a title.


See how to do it here.
plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=30)

# Add x and y labels and set a font size


plt.xlabel ("Date", fontsize = 14)
plt.ylabel ("Mean Temperature", fontsize = 14)
plt.title('Mean Temperature at Jaipur')

plt.show()
Task 11: Change the title size to be bigger than the x and y labels!
plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=30)

# Add x and y labels, title and set a font size


plt.xlabel ("Date", fontsize = 14)
plt.ylabel ("Mean Temperature", fontsize = 14)
plt.title('Mean Temperature at Jaipur', fontsize = 20)

plt.show()
Good! Now, we can also change the size of the plot
# Change the default figure size
plt.figure(figsize=(10,10))

plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=30)

# Add x and y labels, title and set a font size


plt.xlabel ("Date", fontsize = 24)
plt.ylabel ("Mean Temperature", fontsize = 24)
plt.title('Mean Temperature at Jaipur', fontsize = 30)

# Set the font size of the number labels on the axes


plt.xticks (fontsize = 12)
plt.yticks (fontsize = 12)

plt.xticks (rotation=30, horizontalalignment='right')


plt.show()

Looking good! Now, let's customize our graphs with the shapes and colours that we like.
See here for examples

Task 12: Change your marker shape!


# Change the default figure size
plt.figure(figsize=(10,10))

plt.scatter(x,y, marker='*')
plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=30)
# Add x and y labels, title and set a font size
plt.xlabel ("Date", fontsize = 24)
plt.ylabel ("Mean Temperature", fontsize = 24)
plt.title('Mean Temperature at Jaipur', fontsize = 30)

# Set the font size of the number labels on the axes


plt.xticks (fontsize = 12)
plt.yticks (fontsize = 12)

plt.xticks (rotation=30, horizontalalignment='right')

plt.show()



Changing color
You can also change the marker color. Check out the code below, which shows you how to do
it!
# Change the default figure size
plt.figure(figsize=(10,10))

plt.scatter(x,y, c='green', marker='*')


plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=30)

# Add x and y labels, title and set a font size


plt.xlabel ("Date", fontsize = 24)
plt.ylabel ("Mean Temperature", fontsize = 24)
plt.title('Mean Temperature at Jaipur', fontsize = 30)

# Set the font size of the number labels on the axes


plt.xticks (fontsize = 12)
plt.yticks (fontsize = 12)

plt.xticks (rotation=30, horizontalalignment='right')

plt.show()

If you have saved your plot (see "Saving plot" below), look at your working directory and check that the new image file has been created!
Task 13: Change the data points to your favourite colour.
Change the font colour and size too!
If you are wondering how to get the nice bubble-like scatter plot shown in the slides, here is
some sample code. Try running it!
import numpy as np

colors = np.random.rand(len(y)) # one random value per point, mapped to a colour

plt.scatter(x,y,c=colors,alpha=0.5)
plt.show()

Task 14: Try changing the alpha value and see what happens?

Saving plot
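You can use plt.savefig("figurename.png") to save the figure. Call it in the same cell as the
plotting code and before plt.show(); otherwise the saved image may be blank. Try it!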
plt.savefig("jaipur_scatter_plot.png")
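As a self-contained sketch of the whole sequence (draw first, then save), the example below plots three points and writes them to a PNG file. The points, the `out_path` name and the choice of the system temporary folder are assumptions for illustration. The "Agg" line is needed only when running as a script without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # off-screen drawing for scripts; omit inside Jupyter
import matplotlib.pyplot as plt

# Draw the plot first...
fig, ax = plt.subplots()
ax.scatter([0, 1, 2], [34, 31, 28])
ax.set_title("Mean Temperature at Jaipur")

# ...then save it, before the figure is shown or closed.
out_path = os.path.join(tempfile.gettempdir(), "jaipur_scatter_sketch.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)

print(os.path.exists(out_path))
```

Saving through the figure object (`fig.savefig`) makes it explicit which figure ends up in the file.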

Line Plots
Besides showing a relationship with a scatter plot, time data like the above can also be
represented with a line plot. Let's see how this is done!
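y = dataframe.mean_temperature

plt.plot(y)
plt.ylabel("Mean Temperature")
plt.xlabel("Time")

# Approximate month labels, one for every 60th row starting from May 2016
x_tick_labels = ['May16', 'Jul16', 'Sept16', 'Nov16', 'Jan17', 'Mar17',
                 'May17', 'Jul17', 'Sept17', 'Nov17', 'Jan18', 'Mar18']

# np.arange(0, 720, 60) gives 12 positions, one for each of the 12 labels
plt.xticks(np.arange(0, 720, 60), x_tick_labels, rotation=30)

plt.show()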

Note: if this cell stops with "NameError: name 'plt' is not defined", matplotlib has not been
imported in the current session. Run the import cell under "Importing matplotlib" first, then
run this cell again.

Task 15: Change the labels and add title so that it is clearer and easier for you to show this
graph to others

Drawing multiple lines in a plot


x = dataframe.date
y_1 = dataframe.max_temperature
y_2 = dataframe.min_temperature

plt.plot(x,y_1, label = "Max temp")


plt.plot(x,y_2, label = "Min temp")

plt.xticks(np.arange(0, 731, 60))


plt.xticks (rotation=30)

plt.legend()
plt.show()
Task 16: Draw at least 3 line graphs in one plot!
# Change the default figure size
plt.figure(figsize=(20,10))

x = dataframe.date
y_1 = dataframe.max_temperature
y_2 = dataframe.min_temperature
y_3 = dataframe.mean_temperature

z = y_1-y_2

plt.plot(x,y_1, label = "Max temp")


plt.plot(x,y_2, label = "Min temp")
plt.plot(x,y_3, label = "Mean temp")
plt.plot(x,z, label = "range")

plt.xticks(np.arange(0, 731, 60))


plt.xticks (rotation=30)

plt.legend()
plt.show()
Histograms
The histogram is useful for looking at the density of values. For example, you might want to know
how many days are hotter than 35C so that you can see what types of plants would survive
better in your climate zone. The histogram allows us to see the distribution of
our variables.
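The question of how many days are hotter than 35C can also be answered with a direct count, which is a useful cross-check on what the histogram shows. The sketch below uses a short made-up series named `temps` in place of the real mean temperature column.

```python
import pandas as pd

# Made-up temperatures standing in for dataframe.mean_temperature.
temps = pd.Series([34, 31, 28, 36, 38, 38, 26, 37])

# The comparison gives True/False per day; summing counts the True values.
hot_days = (temps > 35).sum()
print(hot_days)  # 4 of these days are hotter than 35C
```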
Let's look at how histograms are plotted.
y = dataframe.mean_temperature

plt.hist(y,bins=15)

plt.show()


What does the plot above show?


Let's label the graph clearly as follows:
• Title: Probability distribution of temperature over 2 years (2016 - 2018) in Jaipur
• Y-axis: No.of days
• X-axis: Temperature
y = dataframe.mean_temperature

plt.hist(y,bins=10)

plt.ylabel("No.of days")
plt.xlabel("Temperature")
plt.title('Probability distribution of temperature over 2 years (2016 - 2018) in Jaipur')

plt.show()

What is the mode of this dataset? Which temperature range is represented the most, and
which the least?

Task 17: What do you think are bins? Try changing the number of bins to 20. What do you
notice?
y = dataframe.mean_temperature

plt.hist(y,bins=20)

plt.ylabel("No.of days")
plt.xlabel("Temperature")
plt.title('Probability distribution of temperature over 2 years (2016 - 2018) in Jaipur')

plt.show()
What does the histogram tell you about the temperature in Jaipur over the last two years?
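To see what bins are without drawing anything, `np.histogram` returns the same counts and bin boundaries that a histogram plot is built from. The sketch below uses a short made-up array named `temps`, and the choice of 4 bins is arbitrary.

```python
import numpy as np

# Made-up temperatures standing in for the real column.
temps = np.array([10, 12, 12, 24, 26, 26, 28, 30, 31, 34, 34, 38])

# Split the range from the smallest to the largest value into 4 equal-width bins.
counts, edges = np.histogram(temps, bins=4)

print(edges)   # 5 edges mark the boundaries of the 4 bins
print(counts)  # how many values fall inside each bin
```

More bins means narrower temperature ranges per bar, so the same data is spread over more, shorter bars.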

Bar Charts
A bar chart looks like a histogram, but they are not the same! See the difference between bar
charts and histograms here.
Now, head over to the Matplotlib documentation and look at the example for bar charts. Here's the
link!
import matplotlib.pyplot as plt
import numpy as np

x_pos = ['Python', 'C++', 'Java', 'Perl', 'Scala', 'Lisp']

usage = [10,8,6,4,2,1]

plt.bar(x_pos, usage, align='center')

plt.show()
Because we are not dealing with categories in the Jaipur weather data, we will not use it to
make a bar chart. However, do remember how to create your bar chart!
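Daily weather readings are not categories, but grouping them (for example by month) produces categories that a bar chart can show. The sketch below is only an illustration under stated assumptions: the `sample` frame holds made-up rows, the dates are converted to real timestamps first, and the "Agg" line is needed only when running as a script without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen drawing for scripts; omit inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up daily rows standing in for the real dataframe.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2016-05-04", "2016-05-06", "2016-06-05", "2016-06-07"]),
    "rainfall": [0.0, 5.0, 0.3, 0.8],
})

# Each month number becomes one category; rainfall is summed within it.
monthly = sample.groupby(sample["date"].dt.month)["rainfall"].sum()

fig, ax = plt.subplots()
ax.bar([str(month) for month in monthly.index], monthly.values)
ax.set_xlabel("Month")
ax.set_ylabel("Total rainfall")
```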

Boxplots
A boxplot is used to show the distribution of a dataset.
We will explore boxplots using a sample tutorial from the Matplotlib website.

First, we will create random data for our example


np.random.seed(10)
data = np.random.normal(100, 10, 200)

Next, we will draw our boxplot


fig1, ax1 = plt.subplots()
ax1.set_title('Basic Plot')
ax1.boxplot(data)
plt.show()
Great! You now have your boxplot. Do you remember how to read it? See here.
From the plot, can you read off the values of the following?
• Outlier
• Median
• First quartile
• Third quartile
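The values read off the plot can be checked numerically. The sketch below recomputes the quartiles for the same random data with `np.percentile` and flags outliers with the 1.5 x IQR rule, which is the default whisker setting of Matplotlib's boxplot. The variable names are illustrative.

```python
import numpy as np

# Same random data as in the boxplot example above.
np.random.seed(10)
data = np.random.normal(100, 10, 200)

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

# By default the whiskers reach at most 1.5 * IQR beyond the box;
# points outside that range are drawn as outliers.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < low) | (data > high)]

print("First quartile:", q1)
print("Median:", median)
print("Third quartile:", q3)
print("Outliers:", outliers)
```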
Now that you understand how a boxplot is plotted, apply a boxplot to the data on
the temperature of Jaipur.
y=dataframe.mean_temperature

plt.boxplot(y)
plt.show()
What does the boxplot tell you about the temperature of Jaipur over the past two years?

Subplots
Often, you want to plot more than one graph side by side. You can use subplots to do
that!
Here's how you can make them!
x = dataframe.date
y = dataframe.mean_temperature

fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True) # create 2 subplots with shared y axis
fig.suptitle('Sharing Y axis')
ax1.plot(x, y)
ax2.scatter(x, y)
plt.show()
Task 18: Create subplots with 3 plots where the y axis is shared!
x = dataframe.date
y_1 = dataframe.mean_temperature
y_2 = dataframe.max_temperature
y_3 = dataframe.min_temperature
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True) # create 3 subplots with shared y axis
fig.suptitle('Sharing Y axis')
ax1.plot(x, y_1)
ax2.plot(x, y_2)
ax3.plot(x, y_3)

plt.show()

Task 19: Create subplots where the x axis is shared!


Check out this link to see how this is done!
x = dataframe.date
y = dataframe.mean_temperature
fig, axarr = plt.subplots(2, sharex=True)
fig.suptitle('Sharing X axis')
axarr[0].plot(x, y)
axarr[1].scatter(x, y)

plt.show()
Great! You have now gained the ability to visualize data using matplotlib.
