
Introduction: What is NumPy? Pandas?

NumPy and Pandas are Python libraries that are incredibly useful
for all data scientists. NumPy is used for scientific computing, and
its main feature is its high-performance implementations of arrays
and matrices. You’ll find NumPy extremely useful in working with
large-scale, multi-dimensional data, and you can use it in
conjunction with many popular machine learning libraries such as
scikit-learn and TensorFlow.

Pandas is a library used for data manipulation, and its main feature
is the use of DataFrame objects to work with data in an easy-to-use
table format. Pandas is built on top of the functionality provided by
NumPy.

Assuming you have Python and pip (Python Package Installer), you
can easily install NumPy and Pandas using your command line.
pip install numpy
pip install pandas

And with that, let’s delve into a quick overview of NumPy and
Pandas!

Intro to NumPy

Like any regular Python package, you’ll need to import NumPy
before you do anything with it.
import numpy as np

Creating NumPy arrays


There are several ways to create an array in NumPy, such as
np.array, np.zeros, np.ones, etc. Each has its own purpose.

np.array allows you to pass in a regular Python list in order to
create a NumPy array. Note that the object you get is different from
the Python list type.
>>> a = np.array([43, 56, 35, 3])
>>> a
array([43, 56, 35,  3])
>>> b = [43, 56, 35, 3]
>>> type(a)
<class 'numpy.ndarray'>
>>> type(b)
<class 'list'>

Note that you can create multidimensional arrays as well!
>>> c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
>>> c
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

np.ones creates a NumPy array full of ones. You can specify the
shape of the array too (shape describes the dimensions of the array
object; I’ll discuss this in just a moment). np.zeros works the same
way, but with zeros!
>>> d = np.ones((2, 2))
>>> d
array([[1., 1.],
       [1., 1.]])
>>> d = np.zeros((2, 2))
>>> d
array([[0., 0.],
       [0., 0.]])

np.linspace takes a start point, end point, and the number of
elements you want in the array. It then makes an array with evenly
spaced numbers.
>>> e = np.linspace(0, 5, 3)
>>> e
array([0. , 2.5, 5. ])  # THREE evenly spaced numbers
>>> e = np.linspace(0, 5, 4)
>>> e
array([0.        , 1.66666667, 3.33333333, 5.        ])  # FOUR of em
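To make "evenly spaced" concrete, here is a small sketch (the variable names are just for illustration). Since np.linspace includes both endpoints by default, num points leave num - 1 gaps, so each gap should be (stop - start) / (num - 1).

```python
import numpy as np

start, stop, num = 0, 5, 4          # illustrative values, matching the last example
e = np.linspace(start, stop, num)

# Both endpoints are included, so num points leave num - 1 gaps between them.
step = (stop - start) / (num - 1)
gaps = np.diff(e)                   # differences between consecutive elements

print(e)
print(step)
print(np.allclose(gaps, step))      # every gap matches the computed step
```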

NumPy arrays have some attributes which are very useful to know.
These include:

1. ndim: the number of dimensions (axes) of the array

2. shape: a tuple of integers indicating the size of the array in
each dimension

3. size: the total number of elements in the array

4. dtype: the type of the elements in the array (ex. int64,
float, bool)

Let’s look at one of the simple arrays we created earlier.
>>> c
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
>>> c.ndim
2
>>> c.shape
(3, 3)
>>> c.size
9
>>> c.dtype
dtype('int32')
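The same four attributes work on arrays with more dimensions. Below is a quick sketch on a 3D array of zeros (the shape (2, 3, 4) is an arbitrary choice for illustration). As a side note, the integer dtype reported for c can be int32 or int64 depending on your platform and NumPy version.

```python
import numpy as np

# A 3-dimensional array: 2 blocks, each with 3 rows and 4 columns.
t = np.zeros((2, 3, 4))

print(t.ndim)    # number of dimensions (axes)
print(t.shape)   # size along each dimension
print(t.size)    # total number of elements: 2 * 3 * 4
print(t.dtype)   # np.zeros defaults to float64
```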

Accessing array elements

NumPy array elements can be accessed using a similar indexing
scheme to good ole Python’s (called slicing notation). Let’s say we
have an array A.

• A[2] will give the element at index 2. Remember indices start
at 0.

• A[2:5] will give the elements from index 2 to 4. The endpoint is
not inclusive.

• A[:3] will give all elements from the beginning up to, but not
including, index 3. Similarly, A[3:] will give all elements from
index 3 to the very end.

• You can specify multiple dimensions by using a comma (“,”) in
between the ranges you want to specify. For example, A[2:4,
1:4].
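Here is a runnable sketch of those four rules. The arrays A and M are made up for illustration (built with np.arange so each value tells you where it came from).

```python
import numpy as np

A = np.arange(10)                 # array([0, 1, 2, ..., 9])
print(A[2])                       # element at index 2
print(A[2:5])                     # indices 2, 3, 4 (5 is excluded)
print(A[:3])                      # indices 0, 1, 2
print(A[3:])                      # index 3 through the end

M = np.arange(20).reshape(4, 5)   # 4 rows, 5 columns
print(M[2:4, 1:4])                # rows 2-3, columns 1-3
```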

Let’s look at this in action. We’ve got an array c.
>>> c
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

We can access a single element two ways.


>>> c[2][2]
9
>>> c[2,2]
9

You can also access a full row or column. Note that putting just ‘:’
will refer to the entire range of values from beginning to end.
>>> c[1]
array([4, 5, 6])
>>> c[:,1]
array([2, 5, 8])

For more info on indexing NumPy arrays, see the indexing page of the
NumPy documentation.

Array operations

There are a lot of things you can do with NumPy arrays, and I won’t get
the chance to cover them all. I’ll start with the basics.

Operations with scalar values apply the operation to each element
of the array.
>>> a
array([1, 2, 3])
>>> a + 1
array([2, 3, 4])
>>> a * 2
array([2, 4, 6])

Operations between two arrays will be element-wise.


>>> a
array([1, 2, 3])
>>> b
array([4, 5, 6])
>>> a * b
array([ 4, 10, 18])

Make sure they’re the same shape (or shapes that NumPy can broadcast
together) though!
>>> x
array([1, 2, 3])
>>> y
array([1, 2])
>>> x + y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (3,) (2,)
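Strictly speaking, the shapes only have to be compatible under NumPy's broadcasting rules: comparing dimensions from the right, each pair must be equal or one of them must be 1. The sketch below uses made-up arrays to show two combinations that work and one that fails.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
row = np.array([10, 20, 30])     # shape (3,)
col = np.array([[100], [200]])   # shape (2, 1)

print(m + row)   # row is added to every row of m
print(m + col)   # col is added to every column of m

try:
    m + np.array([1, 2])         # shape (2,) does not line up with (2, 3)
except ValueError as err:
    print("ValueError:", err)
```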

You can also check for equality (or do any other comparison). You can
do it element-wise, and get back an array of boolean values. You
can also check whether two arrays are equal as a whole
using np.array_equal().
>>> a = np.array([1, 2, 3, 4])
>>> b = np.array([4, 2, 2, 4])
>>> a == b
array([False,  True, False,  True])
>>> a > b
array([False, False,  True, False])
>>> np.array_equal(a, b)
False
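If you want a single yes/no answer out of an element-wise comparison, the boolean array has any() and all() methods. This is a small sketch of how they relate to np.array_equal(), reusing the same two arrays.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([4, 2, 2, 4])

mask = a == b                        # element-wise comparison -> boolean array
print(mask.any())                    # is at least one pair equal?
print(mask.all())                    # are all pairs equal?
print(np.array_equal(a, b))          # same shape AND all elements equal?
print(np.array_equal(a, a.copy()))   # a copy compares equal to the original
```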

You can transpose a matrix too.
>>> a
array([[1, 1],
       [0, 0]])
>>> a.T
array([[1, 0],
       [1, 0]])

You can also reshape an array by specifying a tuple, which will be
the shape of the resulting array.
>>> a.reshape((4,1))
array([[1],
       [1],
       [0],
       [0]])
>>> a.reshape((1,4))
array([[1, 1, 0, 0]])
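One thing worth knowing about reshape: the new shape has to hold exactly the same number of elements as the old one. A minimal sketch, using the same 2x2 array:

```python
import numpy as np

a = np.array([[1, 1],
              [0, 0]])               # 4 elements, shape (2, 2)

column = a.reshape((4, 1))           # 4 * 1 = 4 elements: allowed
print(column.shape)

try:
    a.reshape((3, 2))                # 3 * 2 = 6 elements: not allowed
except ValueError as err:
    print("ValueError:", err)
```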

Let’s move on to slightly more complex operations. We’ll go over
some basic but useful reductions, which are values you calculate
from all of the elements in an array.

Let’s take a simple array. I’ll simply show you some of the reductions
you should know.
>>> x = np.array([1, 3, 2])
>>> x.sum()
6
>>> x.min()
1
>>> x.max()
3
>>> x.mean()
2.0
>>> np.median(x)
2.0
>>> x.std()
0.816496580927726

You can also do reductions on rows and columns of multi-dimensional
arrays by specifying the axis along which to do your operation, like so:
>>> a
array([[1, 1],
       [2, 2]])
>>> a.sum(axis=0)
array([3, 3])
>>> a.sum(axis=1)
array([2, 4])

[Figure: How axes are labelled in a matrix]
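The axis argument is easier to follow on a non-square array, where the result shapes differ. A small sketch with a made-up 2x3 array: axis=0 collapses the rows (one result per column), and axis=1 collapses the columns (one result per row).

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])     # shape (2, 3)

col_sums = m.sum(axis=0)      # collapses the rows    -> one value per column
row_sums = m.sum(axis=1)      # collapses the columns -> one value per row

print(col_sums, col_sums.shape)
print(row_sums, row_sums.shape)
print(m.max(axis=1))          # largest value in each row
```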

The last couple of operations that’ll be super important are your cross
product, dot product, and matrix multiplication operations.
>>> x = np.array([1, 2, 3])
>>> y = np.array([2, 3, 4])
>>> # Dot product
>>> np.dot(x, y)
20
>>> # Cross product
>>> np.cross(x, y)
array([-1,  2, -1])
>>> # Matrix multiplication
>>> a
array([[1, 1],
       [2, 2]])
>>> b
array([[1, 3],
       [2, 4]])
>>> np.matmul(a,b)
array([[ 3,  7],
       [ 6, 14]])
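A common slip is to reach for * when you mean matrix multiplication. The sketch below contrasts the two on the same matrices. The @ operator is Python's matrix-multiplication operator, which NumPy arrays support as the operator form of np.matmul.

```python
import numpy as np

a = np.array([[1, 1],
              [2, 2]])
b = np.array([[1, 3],
              [2, 4]])

print(a @ b)     # matrix product, same result as np.matmul(a, b)
print(a * b)     # element-wise product, NOT matrix multiplication
```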
Phew! That was a lot of examples with NumPy, and there’s a bunch
more you can do with it. Let’s move on to Pandas, which is very
useful for organizing data you’ll typically encounter in the real
world!

Intro to Pandas

Just like NumPy, we have to import pandas.


import pandas as pd

Creating Pandas data structures

The two main data structures you’ll come across in Pandas are
the DataFrame and the Series.

A Series can be treated as a 1D array, similar to a single column in a
spreadsheet. A DataFrame is a 2D table, analogous to an entire
spreadsheet.

A Series can be created by passing a list of values to
the pd.Series() function.
>>> s = pd.Series([1, 2, 5, np.nan, 6, 8])
>>> s
0    1.0
1    2.0
2    5.0
3    NaN    # note that np.nan creates a 'not a number', or a 'NaN'
4    6.0
5    8.0
dtype: float64
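Notice that the integers came out as 1.0, 2.0, and so on. A quick sketch of why: NaN is a floating-point value, so a Series that contains one is stored as float64 by default.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 5])
with_nan = pd.Series([1, 2, 5, np.nan])

print(ints.dtype)              # an integer dtype
print(with_nan.dtype)          # float64, because NaN is a float
print(with_nan.isna().sum())   # number of missing values
```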

You can create a DataFrame by using the pd.DataFrame() constructor.
See the pandas documentation page for DataFrame for a full overview
of how to create one.
>>> df = pd.DataFrame(np.random.randn(6, 4), columns=['A', 'B', 'C', 'D'])
>>> df
          A         B         C         D
0  1.529833 -0.933167  0.728422 -0.797813
1 -0.508315 -0.952360 -0.148712  0.702790
2 -1.590158  0.376262  0.367797 -0.226617
3  1.066155  1.067526 -0.684484  1.310766
4  0.385859  0.087228  1.476244  0.511632
5  1.035326  1.011037 -0.753938 -0.285154

Here, I’ve created a DataFrame with random numbers, with
columns A, B, C, and D. Let’s see what we can do with it!
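Because the numbers are random, your values will differ from the ones printed here. If you want repeatable output, one option is a seeded generator. The sketch below also shows building a DataFrame from a dict of columns. The seed, the names df_random and df_small, and the sample data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)            # seed chosen arbitrarily
df_random = pd.DataFrame(rng.standard_normal((6, 4)),
                         columns=['A', 'B', 'C', 'D'])

# A DataFrame can also be built from a dict of column name -> values.
df_small = pd.DataFrame({'name': ['a', 'b', 'c'],      # hypothetical data
                         'score': [1.5, 2.0, 3.5]})

print(df_random.shape)
print(df_small)
```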

Viewing data in a DataFrame

Firstly, we can view the head (first couple of rows) and tail (last
couple of rows) of the DataFrame. Normally, it’d give you 5 rows,
but we can specify how many rows we’d want.
>>> df.head()
          A         B         C         D
0  1.529833 -0.933167  0.728422 -0.797813
1 -0.508315 -0.952360 -0.148712  0.702790
2 -1.590158  0.376262  0.367797 -0.226617
3  1.066155  1.067526 -0.684484  1.310766
4  0.385859  0.087228  1.476244  0.511632
>>> df.tail(2)
          A         B         C         D
4  0.385859  0.087228  1.476244  0.511632
5  1.035326  1.011037 -0.753938 -0.285154

You can view the column labels of the DataFrame.


>>> df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')

You can grab a single column, which yields a Series, or you can grab
rows using Python slice notation.
>>> df['B'] # Specify the column label
0   -0.933167
1   -0.952360
2    0.376262
3    1.067526
4    0.087228
5    1.011037
Name: B, dtype: float64
>>> df[0:3] # Specify the row indices using slice notation
          A         B         C         D
0  1.529833 -0.933167  0.728422 -0.797813
1 -0.508315 -0.952360 -0.148712  0.702790
2 -1.590158  0.376262  0.367797 -0.226617

EDIT: You can also select multiple columns by passing a list of the
column names you want! For example: df[['A', 'B']].

However, you cannot select a single row using [] notation. For that,
you’ll need to use either loc or iloc. loc will use the named label for
the index, while iloc will use the integer index.

In our example, we used the default integer indexing scheme, but
your data could be indexed by date or something like that, in which
case you’d use loc.
>>> df.iloc[0]
A    1.529833
B   -0.933167
C    0.728422
D   -0.797813
Name: 0, dtype: float64
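To see loc and iloc side by side, here is a sketch with a date-indexed DataFrame. The prices data and column names are made up for illustration. loc looks a row up by its index label, while iloc looks it up by position.

```python
import pandas as pd

# Hypothetical data indexed by date instead of 0, 1, 2, ...
dates = pd.date_range('2021-01-01', periods=3)
prices = pd.DataFrame({'open': [10.0, 11.0, 12.0],
                       'close': [10.5, 11.5, 12.5]},
                      index=dates)

by_label = prices.loc[pd.Timestamp('2021-01-02')]   # row with this index label
by_position = prices.iloc[1]                        # row at integer position 1

print(by_label)
print(by_label.equals(by_position))                 # both refer to the same row
```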

You can also use slice notation for more powerful data accesses.
>>> df.iloc[0:3]
          A         B         C         D
0  1.529833 -0.933167  0.728422 -0.797813
1 -0.508315 -0.952360 -0.148712  0.702790
2 -1.590158  0.376262  0.367797 -0.226617
>>> df.iloc[0:3, 1:3] # Rows 0-2, columns 1-2
          B         C
0 -0.933167  0.728422
1 -0.952360 -0.148712
2  0.376262  0.367797

Editing DataFrames

There are a number of ways we can change our DataFrames.

Firstly, we have setting. You can set by column using a Series.
>>> s1 = pd.Series([1, 2, 3, 4, 5, 6])
>>> s1
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
>>> df['F'] = s1 # This creates a new column 'F' equal to Series s1

Or you could set values using labels.
>>> df.at[0, 'A'] = 0 # The first 0 is the row label, 'A' is the column label

Or you could set using an integer position.
>>> df.iat[0, 1] = np.nan # Row position 0, column position 1

Here’s the resulting DataFrame after these changes were made:
>>> df
          A         B         C         D  F
0  0.000000       NaN  0.728422 -0.797813  1
1 -0.508315 -0.952360 -0.148712  0.702790  2
2 -1.590158  0.376262  0.367797 -0.226617  3
3  1.066155  1.067526 -0.684484  1.310766  4
4  0.385859  0.087228  1.476244  0.511632  5
5  1.035326  1.011037 -0.753938 -0.285154  6
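The three kinds of setting can be tried on a tiny DataFrame where the effect is easy to check by eye. This is a sketch with made-up data, and df_demo is a hypothetical name.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                        'B': [4.0, 5.0, 6.0]})   # hypothetical data

df_demo['F'] = pd.Series([10, 20, 30])   # new column, matched up by index
df_demo.at[0, 'A'] = 0                   # row label 0, column label 'A'
df_demo.iat[0, 1] = np.nan               # row position 0, column position 1 ('B')

print(df_demo)
```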

If you have multiple DataFrames and Series that you want to
combine, you can do that! This is done using concatenation.

To add rows on to an existing DataFrame, pass both DataFrames to
pd.concat(). (Older versions of pandas also offered df.append() for
this, but it was deprecated in pandas 1.4 and removed in pandas 2.0.)
>>> df1 = pd.DataFrame(np.random.randn(3, 5), columns=['A', 'B', 'C', 'D', 'F'])
>>> df1
          A         B         C         D         F
0 -0.075624  0.210857  0.215464 -0.732181  2.151847
1 -0.265325  1.323702 -0.488284  1.253780 -1.949705
2 -0.592924 -0.442635  0.601039  1.839268 -1.247409
>>> df = pd.concat([df, df1])
>>> df
          A         B         C         D         F
0  0.000000       NaN  0.728422 -0.797813  1.000000
1 -0.508315 -0.952360 -0.148712  0.702790  2.000000
2 -1.590158  0.376262  0.367797 -0.226617  3.000000
3  1.066155  1.067526 -0.684484  1.310766  4.000000
4  0.385859  0.087228  1.476244  0.511632  5.000000
5  1.035326  1.011037 -0.753938 -0.285154  6.000000
0 -0.075624  0.210857  0.215464 -0.732181  2.151847
1 -0.265325  1.323702 -0.488284  1.253780 -1.949705
2 -0.592924 -0.442635  0.601039  1.839268 -1.247409

You can renumber the rows of the resulting DataFrame using
reset_index(). The ‘drop’ setting makes sure the original indices
are not saved into a new column.
>>> df = df.reset_index(drop=True)
>>> df
          A         B         C         D         F
0  0.000000       NaN  0.728422 -0.797813  1.000000
1 -0.508315 -0.952360 -0.148712  0.702790  2.000000
2 -1.590158  0.376262  0.367797 -0.226617  3.000000
3  1.066155  1.067526 -0.684484  1.310766  4.000000
4  0.385859  0.087228  1.476244  0.511632  5.000000
5  1.035326  1.011037 -0.753938 -0.285154  6.000000
6 -0.075624  0.210857  0.215464 -0.732181  2.151847
7 -0.265325  1.323702 -0.488284  1.253780 -1.949705
8 -0.592924 -0.442635  0.601039  1.839268 -1.247409
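As an aside, pd.concat() can stack rows and renumber them in one step if you pass ignore_index=True. A small sketch with made-up frames (top and bottom are hypothetical names):

```python
import pandas as pd

top = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})         # hypothetical data
bottom = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

stacked = pd.concat([top, bottom])                        # keeps the old labels
renumbered = pd.concat([top, bottom], ignore_index=True)  # fresh 0..n-1 labels

print(list(stacked.index))
print(list(renumbered.index))
print(stacked.reset_index(drop=True).equals(renumbered))
```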

pd.concat() can add both rows and columns, as long as you specify
the axis along which you’re adding new data.
>>> c1 = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9], name='Z')
>>> c1
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
Name: Z, dtype: int64
>>> pd.concat([df, c1], axis=1) # Specify axis 1 to add a column
          A         B         C         D         F  Z
0  0.000000       NaN  0.728422 -0.797813  1.000000  1
1 -0.508315 -0.952360 -0.148712  0.702790  2.000000  2
2 -1.590158  0.376262  0.367797 -0.226617  3.000000  3
3  1.066155  1.067526 -0.684484  1.310766  4.000000  4
4  0.385859  0.087228  1.476244  0.511632  5.000000  5
5  1.035326  1.011037 -0.753938 -0.285154  6.000000  6
6 -0.075624  0.210857  0.215464 -0.732181  2.151847  7
7 -0.265325  1.323702 -0.488284  1.253780 -1.949705  8
8 -0.592924 -0.442635  0.601039  1.839268 -1.247409  9
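When adding columns this way, rows are matched up by index label, not by position. The sketch below uses made-up data to show what happens when the new column is shorter than the DataFrame: the unmatched row gets NaN.

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2, 3]})      # index 0, 1, 2
extra = pd.Series([10, 20], name='Z')      # index 0, 1 only

wide = pd.concat([left, extra], axis=1)    # rows are matched by index label
print(wide)
print(wide['Z'].isna().sum())              # rows with no matching value
```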
