Data

Data Sciences
Introduction
Data visualization is part of data exploration, which is a critical step in the AI cycle. You will
use this technique to understand and gain insights into the data you have gathered, and to
determine whether the data is ready for further processing or whether you need to collect
more data or clean it.
You will also use this technique to present your results.
In this notebook, we will explore Python packages that are crucial for data science. Packages
such as Pandas, NumPy and Matplotlib are used throughout the process.
About the Notebook

This Jupyter notebook focuses on data visualisation in Python. To help young learners
understand the topic, many additional resources are provided in the notebook as links.
Readers can follow those links to explore the subject further.
Context
We will be working with Jaipur weather data obtained from Kaggle, a platform for data
enthusiasts to gather, share knowledge and compete for many prizes!
The data has been cleaned and simplified, so that we can focus on data visualization instead
of data cleaning. Our data is stored in the file named JaipurFinalCleanData.csv. This file
contains weather information of Jaipur and is saved at the same location as the notebook.
What do you do next?
Side note: What is csv?

CSV (Comma-Separated Values) is a plain-text file format in which each line holds one record
and the values within a record are separated by commas.
We usually open these files with spreadsheet applications such as Excel or Google Sheets.
Do you know how this is done?
Today, we will learn how to use Python to open CSV files.
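To make the format concrete, here is a minimal sketch that parses a few CSV lines with Python's built-in csv module. The sample text is a hypothetical excerpt that reuses three columns and two rows from the weather table shown later in this notebook; it is not the real file.

```python
import csv
import io

# Hypothetical excerpt: a header line followed by two data rows.
sample_text = """date,mean_temperature,rainfall
2016-05-04,34,0.0
2016-05-06,28,5.0
"""

# csv.DictReader uses the first line as the column names.
reader = csv.DictReader(io.StringIO(sample_text))
rows = list(reader)

for row in rows:
    print(row["date"], row["mean_temperature"], row["rainfall"])
```

Note that the csv module returns every value as text. Libraries such as pandas go one step further and guess a numeric type for each column.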
Use Python to open csv files

We will use the pandas library to work with our dataset. Pandas is a popular Python library
for data science. It offers powerful and flexible data structures that make data manipulation
and analysis easier.
Import Pandas
# "import pandas as pd" means we can type "pd" to call the pandas library
import pandas as pd
Now that we have imported pandas, let's start by reading the csv file.
#saving the csv file into a variable which we will call data frame
dataframe = pd.read_csv("JaipurFinalCleanData.csv")
Exploring our data

Great! We now have a variable that contains our weather data. Let's explore our data. Use
the .head() function to see the first few rows of data.
#dataframe.head() means we are getting the first 5 rows of data
# try running it to see what data is in the jaipur csv file
print (dataframe.head())
         date  mean_temperature  max_temperature  min_temperature  \
0  2016-05-04                34               41               27
1  2016-05-05                31               38               24
2  2016-05-06                28               34               21
3  2016-05-07                30               38               23
4  2016-05-08                34               41               26

   Mean_dew_pt  mean_pressure  max_humidity  min_humidity  max_dew_pt_1  \
0            6        1006.00            27             5            12
1            7        1005.65            29             6            13
2           11        1007.94            61            13            16
3           13        1008.39            69            18            17
4           10        1007.62            50             8            14

   max_dew_pt_2  min_dew_pt_1  min_dew_pt_2  max_pressure_1  max_pressure_2  \
0            10            -2            -2            1009            1008
1            12             0            -2            1008            1009
2            13             6             0            1011            1008
3            16             9             6            1011            1011
4            17             6             9            1010            1011

   min_pressure_1  min_pressure_2  rainfall
0            1000            1001       0.0
1            1001            1000       0.0
2            1003            1001       5.0
3            1004            1003       0.0
4            1002            1004       0.0
Task 1: Display the first 10 rows of data by modifying the function above
print (dataframe.head(10))

         date  mean_temperature  max_temperature  min_temperature  \
0  2016-05-04                34               41               27
1  2016-05-05                31               38               24
2  2016-05-06                28               34               21
3  2016-05-07                30               38               23
4  2016-05-08                34               41               26
5  2016-05-09                34               42               27
6  2016-05-10                34               41               27
7  2016-05-11                32               40               25
8  2016-05-12                34               42               27
9  2016-05-13                34               42               26

   Mean_dew_pt  mean_pressure  max_humidity  min_humidity  max_dew_pt_1  \
0            6        1006.00            27             5            12
1            7        1005.65            29             6            13
2           11        1007.94            61            13            16
3           13        1008.39            69            18            17
4           10        1007.62            50             8            14
5            8        1006.73            32             7            12
6           11        1005.75            45             7            16
7           16        1007.10            51            12            18
8           16        1006.78            66            16            22
9           13        1003.83            58             9            20

   max_dew_pt_2  min_dew_pt_1  min_dew_pt_2  max_pressure_1  max_pressure_2  \
0            10            -2            -2            1009            1008
1            12             0            -2            1008            1009
2            13             6             0            1011            1008
3            16             9             6            1011            1011
4            17             6             9            1010            1011
5            14             6             6            1010            1010
6            12             7             6            1008            1010
7            16            13             7            1010            1008
8            18            10            13            1011            1010
9            22            10            10            1007            1011

   min_pressure_1  min_pressure_2  rainfall
0            1000            1001       0.0
1            1001            1000       0.0
2            1003            1001       5.0
3            1004            1003       0.0
4            1002            1004       0.0
5            1002            1002       0.0
6            1000            1002       0.3
7            1002            1000       0.8
8            1001            1002       2.0
9             998            1001       0.3
Now that you have listed the first few rows of the data, what do you notice?
• What headers are there? If you are not sure, look them up online!
• Do the recorded values make sense to you?
Find out your data type

You can use dtypes to find out the type of data (i.e. string, float, integer) you have.
dataframe.dtypes
date                 object
mean_temperature      int64
max_temperature       int64
min_temperature       int64
Mean_dew_pt           int64
mean_pressure       float64
max_humidity          int64
min_humidity          int64
max_dew_pt_1          int64
max_dew_pt_2          int64
min_dew_pt_1          int64
min_dew_pt_2          int64
max_pressure_1        int64
max_pressure_2        int64
min_pressure_1        int64
min_pressure_2        int64
rainfall            float64
dtype: object
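The date column is reported as object, which in pandas usually means text. As a minimal sketch (the three-row sample below is hypothetical and only reuses values from the table above), pd.to_datetime can turn such a column into real timestamps, which tends to make date-based sorting and plotting easier later on:

```python
import pandas as pd

# Hypothetical three-row sample reusing values from the table above.
sample = pd.DataFrame({
    "date": ["2016-05-04", "2016-05-05", "2016-05-06"],
    "mean_temperature": [34, 31, 28],
    "mean_pressure": [1006.00, 1005.65, 1007.94],
})
# date is text here (shown as object, or as a string dtype in newer pandas versions)
print(sample.dtypes)

# Convert the text column into real timestamps.
sample["date"] = pd.to_datetime(sample["date"])
print(sample.dtypes)

# Timestamps expose parts of the date, for example the month number.
print(sample["date"].dt.month.tolist())
```

This conversion is optional for the rest of the notebook, which keeps the dates as text.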
Side notes: Find out about a function

Sometimes we encounter a function we are not familiar with. In Jupyter, we can add a question
mark after a function's name to obtain more information about it: what it does, its expected
inputs, its parameters and its output. It will even give you examples of how to use the
function! Let's try it with the .head() function.
dataframe.head?
What do you notice? What are the parameters involved? What are the outputs generated?
Task 2: What does dtypes do?

Use the help function ? to find out what dtypes is for
dataframe.dtypes?
What do you notice? What are the parameters involved? What are the outputs generated?
Choosing only the columns we are interested in

Let's first see all the headers that we have and consider which columns we want to work
with for now.
Task 3: List the headers from the dataset

dataframe.dtypes
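date                 object
mean_temperature      int64
max_temperature       int64
min_temperature       int64
Mean_dew_pt           int64
mean_pressure       float64
max_humidity          int64
min_humidity          int64
max_dew_pt_1          int64
max_dew_pt_2          int64
min_dew_pt_1          int64
min_dew_pt_2          int64
max_pressure_1        int64
max_pressure_2        int64
min_pressure_1        int64
min_pressure_2        int64
rainfall            float64
dtype: object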
Looks like there are 17 columns in this dataset, and we don't need all of them for the
purposes of this activity. One way to handle this is to drop the columns that we don't need.
Pandas provides an easy way to drop columns: the ".drop" function.
dataframe = dataframe.drop(["max_dew_pt_2"], axis=1)
Let's check that the column has been dropped. Try printing the result with head() or dtypes.
dataframe.dtypes
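date                 object
mean_temperature      int64
max_temperature       int64
min_temperature       int64
Mean_dew_pt           int64
mean_pressure       float64
max_humidity          int64
min_humidity          int64
max_dew_pt_1          int64
min_dew_pt_1          int64
min_dew_pt_2          int64
max_pressure_1        int64
max_pressure_2        int64
min_pressure_1        int64
min_pressure_2        int64
rainfall            float64
dtype: object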
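Two details of .drop are easy to miss: it returns a new DataFrame instead of changing the original, and the columns= keyword is an alternative spelling of axis=1. The sketch below illustrates both on a small hypothetical sample (the values are taken from the first two rows shown earlier):

```python
import pandas as pd

# Hypothetical two-row sample with four of the columns.
sample = pd.DataFrame({
    "date": ["2016-05-04", "2016-05-05"],
    "max_dew_pt_1": [12, 13],
    "max_dew_pt_2": [10, 12],
    "rainfall": [0.0, 0.0],
})

# columns=[...] is equivalent to passing the labels together with axis=1.
trimmed = sample.drop(columns=["max_dew_pt_2"])

print(list(sample.columns))   # the original still has all four columns
print(list(trimmed.columns))  # the copy no longer has max_dew_pt_2
```

This is why the notebook assigns the result back to dataframe: without the assignment, the dropped column would still be there.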
Task 4: Drop the following columns: (min_dew_pt_2, max_pressure_2, min_pressure_2)

dataframe = dataframe.drop(["min_dew_pt_2", "max_pressure_2",
"min_pressure_2"], axis=1)
Task 5: Now check again if these columns have been dropped

dataframe.dtypes
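date                 object
mean_temperature      int64
max_temperature       int64
min_temperature       int64
Mean_dew_pt           int64
mean_pressure       float64
max_humidity          int64
min_humidity          int64
max_dew_pt_1          int64
min_dew_pt_1          int64
max_pressure_1        int64
min_pressure_1        int64
rainfall            float64
dtype: object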
Great! We can now focus on this set of data!
Sorting values using pandas

Many times, you want to have a sense of range of data to help you understand more about
it. Another feature of pandas dataframe is sorting of values. You can do so by using the
sort_values() function.
jaipur_weather = dataframe.sort_values(by='date',ascending = False)
print(jaipur_weather.head(5))

           date  mean_temperature  max_temperature  min_temperature  \
675  2018-03-11                26               34               18
674  2018-03-10                26               34               19
673  2018-03-09                26               33               19
672  2018-03-08                24               32               15
671  2018-03-07                24               32               15

     Mean_dew_pt  mean_pressure  max_humidity  min_humidity  max_dew_pt_1  \
675            4        1013.76            38             6             8
674            3        1014.16            37             8             6
673            1        1014.41            42             7             5
672            2        1014.07            55             5             8
671            4        1015.39            48             6             9

     min_dew_pt_1  max_pressure_1  min_pressure_1  rainfall
675             0            1017            1009       0.0
674            -1            1017            1009       0.0
673            -5            1017            1011       0.0
672            -6            1017            1011       0.0
671            -3            1018            1012       0.0
What do you notice about the row numbers? Look at the dates. Can you see how the function
helps us sort the data by date?
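As a small sketch of what happens to the row numbers, the hypothetical three-row sample below shows that sort_values keeps each row's original index label, and that reset_index can renumber the rows if that is preferred. Dates written as YYYY-MM-DD text happen to sort chronologically even though they are stored as strings.

```python
import pandas as pd

# Hypothetical sample reusing the first three dates from the dataset.
sample = pd.DataFrame({
    "date": ["2016-05-04", "2016-05-05", "2016-05-06"],
    "mean_temperature": [34, 31, 28],
})

newest_first = sample.sort_values(by="date", ascending=False)
print(newest_first)  # the index labels now read 2, 1, 0

# Optional: renumber the rows from 0 after sorting.
renumbered = newest_first.reset_index(drop=True)
print(renumbered)
```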
Task 6: Sort the values in ascending order of mean temperature and print the first 5 rows
jaipur_weather = dataframe.sort_values(by='mean_temperature', ascending=True)
print(jaipur_weather.head(5))
           date  mean_temperature  max_temperature  min_temperature  \
252  2017-01-11                10               18                3
253  2017-01-12                12               19                4
254  2017-01-13                12               20                4
255  2017-01-14                12               20                5
258  2017-01-17                12               20                5

     Mean_dew_pt  mean_pressure  max_humidity  min_humidity  max_dew_pt_1  \
252            3        1017.00            94            17             9
253           -3        1017.54            70            13             2
254           -5        1017.24            75             4             2
255           -1        1017.75            70            10             1
258            3        1017.35            74            15             7

     min_dew_pt_1  max_pressure_1  min_pressure_1  rainfall
252            -5            1019            1015       0.0
253            -7            1020            1015       0.0
254           -93            1020            1015       0.0
255            -8            1020            1016       0.0
258            -2            1019            1015       0.0
Look at the max and min temperature! See the range of temperature that one can
experience within a day.
Task 7: Sort the values in descending order of mean temperature and print the first 5 rows
jaipur_weather = dataframe.sort_values(by='mean_temperature', ascending=False)
print(jaipur_weather.head(5))
          date  mean_temperature  max_temperature  min_temperature  \
32  2016-06-05                38               45               31
15  2016-05-19                38               46               29
31  2016-06-04                38               44               31
34  2016-06-07                38               45               30
35  2016-06-08                38               44               31

    Mean_dew_pt  mean_pressure  max_humidity  min_humidity  max_dew_pt_1  \
32            5        1004.67            27             4            18
15           11         999.88            45             5            17
31           13        1004.93            34            10            18
34           13        1003.29            51             5            21
35           12        1002.83            47             4            22

    min_dew_pt_1  max_pressure_1  min_pressure_1  rainfall
32             2            1007             999       0.0
15             6            1002             994       0.0
31             7            1008             999       0.0
34             5            1007             997       0.0
35             2            1006             996       0.0
See how widely the mean temperature ranges within a year? From 10°C to 38°C!
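Sorting is one way to find the extremes; pandas can also report them directly. The sketch below uses a hypothetical three-row sample built from rows that appear in the sorted outputs above, and also computes the daily range as the difference between the maximum and minimum temperature of each day.

```python
import pandas as pd

# Hypothetical sample reusing three rows shown in the outputs above.
sample = pd.DataFrame({
    "date": ["2016-05-04", "2016-05-19", "2017-01-11"],
    "mean_temperature": [34, 38, 10],
    "max_temperature": [41, 46, 18],
    "min_temperature": [27, 29, 3],
})

coldest = sample["mean_temperature"].min()
hottest = sample["mean_temperature"].max()
print(coldest, hottest)

# idxmin returns the row label of the smallest value.
print(sample.loc[sample["mean_temperature"].idxmin(), "date"])

# Daily range: the day's maximum minus the day's minimum.
sample["daily_range"] = sample["max_temperature"] - sample["min_temperature"]
print(sample[["date", "daily_range"]])
```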

Now we have a clearer picture of our dataset. Using these functions, we can analyze our
data and gain insights into it.
However, we want an even better picture. We want to learn how to explore this data
visually.
Let's now use the matplotlib library to help us with data visualization in Python.
Importing matplotlib
Matplotlib is a Python 2D plotting library that we can use to produce high-quality data
visualizations. It is easy to use (as you will soon find out): you can create simple and
complex graphs with just a few lines of code!
Now let's load matplotlib to start plotting some graphs.
import matplotlib.pyplot as plt
import numpy as np
Scatter plot
Scatter plots use a collection of points on a graph to display values from two variables. This
allows us to see if there is any relationship or correlation between the two variables.
Let's see how mean temperature changes over the years!
x = dataframe.date
y = dataframe.mean_temperature
plt.scatter(x,y)
plt.show()
Do you see that the x axis is filled with a thick line, and that no tick label is readable?
This makes the plot hard to analyze.
Let's try to modify this scatter plot so that we can see the ticks!
Choose only several ticks

One reason why there's a thick bar below the x axis is that there are numerous labels (2
years of daily data) on the x axis.
The first thing we are going to do is reduce the number of ticks on the x axis. We do this
using the np.arange function as below:
plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 30))
plt.show()
Now you can see the numbers more clearly, but they still overlap.
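To see what np.arange actually passes to plt.xticks, here is a small sketch. np.arange(start, stop, step) returns evenly spaced values from start up to, but not including, stop. The step of 120 is only an illustrative choice, not the notebook's answer.

```python
import numpy as np

# Tick positions every 30 rows and every 120 rows.
every_30 = np.arange(0, 731, 30)
every_120 = np.arange(0, 731, 120)

print(len(every_30), every_30[:4])  # how many ticks, and the first few positions
print(len(every_120), every_120)
```

A larger step means fewer tick positions, so the labels have more room.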
Task 8: Change x ticks interval so that you can see the dates clearly
plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 120)) # one possible interval; try other values
plt.show()
What interval did you use so that you can see all the dates? Do you notice that we now
have only very few ticks?
Let's try to rotate our ticks. See the example on Stack Overflow!
Note: Stack Overflow is a site where technical people gather and share their knowledge.
You can search the site for your question and see if others have already solved it!
Task 9: Rotate the x tick labels so that more ticks fit and each one is still readable
plt.scatter(x,y)
plt.xticks (rotation=90)
plt.show()
Now we can see the x-ticks clearly.
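The crowded axis comes from the dates being stored as text: Matplotlib treats each distinct string as its own category and draws one tick per value. An alternative approach, sketched here on a hypothetical five-row sample (values taken from rows shown earlier), is to convert the dates to timestamps so that Matplotlib chooses a readable number of date ticks by itself. This is not the method used in the rest of the notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample: dates stored as text, as they are in the CSV file.
dates_as_text = pd.Series(
    ["2016-05-04", "2016-06-05", "2017-01-11", "2018-03-07", "2018-03-11"]
)
temps = [34, 38, 10, 24, 26]

# Convert the text to real timestamps before plotting.
dates = pd.to_datetime(dates_as_text)

fig, ax = plt.subplots()
ax.scatter(dates, temps)
fig.autofmt_xdate(rotation=30)  # tilt the date labels so they do not overlap
ax.set_xlabel("Date")
ax.set_ylabel("Mean Temperature")
# In a notebook, finish with plt.show() to display the figure.
```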

Notice how temperature changes according to the time of the year. Compare it with this
website. Does it inform you when to best plant your crop?
Giving label to the x and y axis

You can also give labels to the x and y axes. This makes your plot easier to read and to
share with others.
plt.scatter(x,y)
# Add x and y labels and set a font size

plt.xlabel ("Date", fontsize = 14)
plt.ylabel ("Mean Temperature", fontsize = 14)
plt.show()
Looks good!
Task 10: Now, let's add a title.

See how to do it here.
plt.scatter(x,y)
# Add x and y labels and set a font size
plt.xlabel ("Date", fontsize = 14)
plt.ylabel ("Mean Temperature", fontsize = 14)
# Add a title
plt.title('Mean Temperature at Jaipur')
plt.show()
Task 11: Change the title size to be bigger than the x and y labels!
plt.scatter(x,y)
# Add x and y labels, title and set a font size
plt.xlabel ("Date", fontsize = 14)
plt.ylabel ("Mean Temperature", fontsize = 14)
plt.title('Mean Temperature at Jaipur', fontsize = 20)
plt.show()
Good! Now, we can also change the size of the plot.
# Set the figure size (width, height) in inches
plt.figure(figsize=(10,10))
plt.scatter(x,y)

# Set the font size and rotation of the tick labels on the axes
plt.xticks (fontsize = 12, rotation=30, horizontalalignment='right')
plt.yticks (fontsize = 12)

plt.show()
Looking good! Now, let's customize our graphs with the shapes and colours that we like.
See here for examples
Task 12: Change your marker shape!

plt.scatter(x,y, marker='*')

plt.show()

Changing color
You can also change the marker color. Check out the code below, which shows you how to do
it!
plt.scatter(x,y, c='green', marker='*')



plt.show()
If you save this figure (see the "Saving plot" section), look at your working directory and
check if the new image file has been created!
Task 13: Change the data points to your favourite colour.
Change the font colour and size too!
If you are wondering how to get the nice bubble-like scatter plot shown in the slides, here is
some sample code. Try running it!
import numpy as np
colors = np.random.rand(len(y)) # one random value per point, mapped to a colour

plt.scatter(x,y,c=colors,alpha=0.5) # alpha sets the transparency of the markers
plt.show()
Task 14: Try changing the alpha value and see what happens.
Saving plot
You can use plt.savefig("figurename.png") to save the figure. Call it in the same cell that
draws the plot, before plt.show(). Try it!
plt.scatter(x,y)
plt.savefig("jaipur_scatter_plot.png")
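The sketch below shows one way to confirm that a file was really written. It uses three hypothetical points and a hypothetical file name, example_scatter_plot.png; the dpi value is an arbitrary choice. Once a figure has been shown, a later savefig call typically writes an empty image, which is why saving comes first here.

```python
import os
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter([1, 2, 3], [34, 31, 28])  # hypothetical points
ax.set_title("Example scatter plot")

# Save while the figure still holds the drawing, i.e. before showing it.
fig.savefig("example_scatter_plot.png", dpi=150)

# Check that the file now exists in the working directory.
print(os.path.exists("example_scatter_plot.png"))
```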
Line Plots
Besides showing relationships with a scatter plot, time-based data such as the above can also
be represented with a line plot. Let's see how this is done!
y = dataframe.mean_temperature

plt.plot(y)
plt.ylabel("Mean Temperature")
plt.xlabel("Time")
# One label every 60 rows (about two months), starting from May 2016
Y_tick = ['May16','Jul16','Sept16','Nov16','Jan17','Mar17',
          'May17','Jul17','Sept17','Nov17','Jan18','Mar18']
# 12 tick positions (0, 60, ..., 660) to match the 12 labels
plt.xticks(np.arange(0, 720, 60), Y_tick, rotation=30)
plt.show()
Note: if this cell stops with "NameError: name 'plt' is not defined", the import cell has not
been run in the current session. Run import matplotlib.pyplot as plt and import numpy as np
first, then run this cell again.
Task 15: Change the labels and add title so that it is clearer and easier for you to show this
graph to others
Drawing multiple lines in a plot

x = dataframe.date
y_1 = dataframe.max_temperature
y_2 = dataframe.min_temperature
plt.plot(x,y_1, label = "Max temp")

plt.plot(x,y_2, label = "Min temp")

plt.legend()
plt.show()
Task 16: Draw at least 3 line graphs in one plot!
x = dataframe.date
y_3 = dataframe.mean_temperature
z = y_1-y_2
plt.plot(x,y_1, label = "Max temp")

plt.plot(x,y_2, label = "Min temp")
plt.plot(x,y_3, label = "Mean temp")
plt.plot(x,z, label = "range")

plt.legend()
plt.show()
Histograms
The histogram is useful for looking at the density of values. For example, you might want to
know how many days are hotter than 35°C so that you can see what types of plants would
survive better in your climate zone. The histogram lets us see how the values of a variable
are distributed.
Let's look at how histograms are plotted.
y = dataframe.mean_temperature

plt.hist(y,bins=15)
plt.show()
What does the histogram above tell you?

Let's label the graph clearly as follows:
• Title: Probability distribution of temperature over 2 years (2016 - 2018) in Jaipur
• Y-axis: No.of days
• X-axis: Temperature
plt.hist(y,bins=10)
plt.ylabel("No.of days")
plt.xlabel("Temperature")
plt.title('Probability distribution of temperature over 2 years (2016 - 2018) in Jaipur')
plt.show()
What is the mode of this dataset? Which temperature range is represented the most, and
which the least?
Task 17: What do you think bins are? Try changing the number of bins to 20. What do you
notice?
plt.hist(y,bins=20)
plt.ylabel("No.of days")
plt.xlabel("Temperature")
plt.title('Probability distribution of temperature over 2 years (2016 - 2018) in Jaipur')
plt.show()
What does the histogram tell you about the temperature in Jaipur over the last two years?
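To look at the numbers behind a histogram, the sketch below uses np.histogram, which does the same kind of binning without drawing anything. The twelve temperatures are hypothetical values chosen for illustration, not the real dataset. It also counts the days above 35°C with a boolean mask, as mentioned at the start of this section.

```python
import numpy as np

# Hypothetical mean temperatures for twelve days.
temps = np.array([10, 12, 12, 24, 26, 28, 30, 31, 34, 34, 36, 38])

# Split the range from the smallest to the largest value into 4 equal-width bins.
counts, edges = np.histogram(temps, bins=4)
print(edges)   # 5 edges mark the boundaries of the 4 bins
print(counts)  # number of values that fall inside each bin

# Count the days hotter than 35 degrees with a boolean mask.
hot_days = int((temps > 35).sum())
print(hot_days)
```

With more bins, each bar covers a narrower temperature range, so the plot shows finer detail but each bar holds fewer days.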
Bar Charts
A bar chart looks like a histogram, but they are not the same! See the difference between
bar charts and histograms here.
Now, head over to the matplotlib library and look at the example for bar charts. Here's the
link!
import matplotlib.pyplot as plt
import numpy as np
x_pos = ['Python', 'C++', 'Java', 'Perl', 'Scala', 'Lisp']
usage = [10,8,6,4,2,1]
plt.bar(x_pos, usage, align='center')
plt.show()
Because the Jaipur weather data does not contain categories, we will not use it to make a
bar chart. However, do remember how to create one!
Boxplots
A boxplot is used to show the distribution of our dataset.
We will explore boxplots using a sample tutorial from the matplotlib website.
First, we will create random data for our example.

np.random.seed(10)
data = np.random.normal(100, 10, 200)
Next, we will draw our boxplot

fig1, ax1 = plt.subplots()
ax1.set_title('Basic Plot')
ax1.boxplot(data)
plt.show()
Great! You now have your boxplot. Do you remember how to read it? See here.
From the plot, can you identify the values of the following:
• Outlier
• Median
• First quartile
• Third quartile
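To check your reading of the plot, the values can also be computed directly. This sketch regenerates the same random data and assumes Matplotlib's default whisker rule, under which points lying more than 1.5 times the interquartile range beyond the box are drawn as outliers.

```python
import numpy as np

# Same random data as in the example above.
np.random.seed(10)
data = np.random.normal(100, 10, 200)

# First quartile, median and third quartile.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

# Assumed default rule: whiskers reach at most 1.5 * IQR beyond the box.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = data[(data < lower_limit) | (data > upper_limit)]

print(round(q1, 1), round(median, 1), round(q3, 1))
print(len(outliers))
```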
Now that you understand how a boxplot is plotted, apply a boxplot to the temperature data
of Jaipur.
y=dataframe.mean_temperature
plt.boxplot(y)
plt.show()
What does the boxplot tell you about the temperature of Jaipur over the past two years?
Subplots
Many times, you want to plot more than one graph side by side. You can use subplots to do
that!
Here's how you can make them!
x = dataframe.date
# create 2 subplots with shared y axis
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
fig.suptitle('Sharing Y axis')
ax1.plot(x, y)
ax2.scatter(x, y)
plt.show()
Task 18: Create subplots with 3 plots where the y axis is shared!
x = dataframe.date
y_1 = dataframe.max_temperature
y_2 = dataframe.min_temperature
y_3 = dataframe.mean_temperature
# create 3 subplots with shared y axis
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True)
fig.suptitle('Sharing Y axis')
ax1.plot(x, y_1)
ax2.plot(x, y_2)
ax3.plot(x, y_3)
plt.show()
Task 19: Create subplots where the x axis is shared!

Check out this link to see how this is done!
x = dataframe.date
fig, axarr = plt.subplots(2, sharex=True)
fig.suptitle('Sharing X axis')
axarr[0].plot(x, y)
axarr[1].scatter(x, y)
plt.show()
Great! You have now gained the ability to visualize data using matplotlib.
