Numpy & Pandas
NumPy and Pandas are Python libraries that are incredibly useful
for all data scientists. NumPy is used for scientific computing, and
its main feature is its high-performance implementation of arrays
and matrices. You’ll find NumPy extremely useful in working with
large-scale, multi-dimensional data, and you can use it in
conjunction with many popular machine learning libraries such as
scikit-learn and TensorFlow.
Pandas is a library used for data manipulation, and its main feature
is the use of DataFrame objects to work with data in an easy-to-use
table format. Pandas is built on top of the functionality provided by
NumPy.
Assuming you have Python and pip (Python Package Installer), you
can easily install NumPy and Pandas using your command line.
pip install numpy
pip install pandas
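To confirm that both installs worked, a quick check like the one below can help. It only imports the two libraries and prints their versions; the version numbers you see will depend on your environment.

```python
import numpy as np
import pandas as pd

# If either import fails, the corresponding package is not installed
# in the Python environment you are currently running.
print("NumPy version:", np.__version__)
print("Pandas version:", pd.__version__)
```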
And with that, let’s delve into a quick overview of NumPy and
Pandas!
Intro to NumPy
np.ones creates a NumPy array full of ones. You can specify the
shape of the array too (shape describes the dimensions of the array
object. I’ll discuss this in just a moment). np.zeros works the same
way, but with zeros!
>>> import numpy as np
>>> d = np.ones((2, 2))
>>> d
array([[1., 1.],
       [1., 1.]])
>>> d = np.zeros((2, 2))
>>> d
array([[0., 0.],
       [0., 0.]])
NumPy arrays have some attributes which are very useful to know.
These include:
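As a minimal sketch, here are four attributes that are commonly used: shape, ndim, size, and dtype. Picking these four is my assumption, and the array arr is made up for illustration.

```python
import numpy as np

# A hypothetical 2x3 integer array, used only for illustration
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print(arr.shape)  # (2, 3): 2 rows, 3 columns
print(arr.ndim)   # 2: number of dimensions (axes)
print(arr.size)   # 6: total number of elements
print(arr.dtype)  # an integer dtype; the exact width depends on the platform
```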
A[:3] gives all elements from the beginning up to, but not
including, index 3. Similarly, A[3:] gives all elements from index 3
to the very end.
You can also access a full row or column. Note that ‘:’ on its own
refers to the entire range of values from beginning to end.
>>> c[1]
array([4, 5, 6])
>>> c[:,1]
array([2, 5, 8])
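Here is a self-contained sketch of the slicing rules described above. The definition of c is an assumption, chosen so that it reproduces the outputs shown, and the 1-D array A is a hypothetical example.

```python
import numpy as np

# Assumed definition of c: it reproduces the c[1] and c[:, 1] outputs shown above
c = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Hypothetical 1-D array to illustrate A[:3] and A[3:]
A = np.array([10, 20, 30, 40, 50])

print(A[:3])    # [10 20 30] -> indices 0, 1, 2 (index 3 is excluded)
print(A[3:])    # [40 50]    -> index 3 through the end
print(c[1])     # [4 5 6]    -> the second row
print(c[:, 1])  # [2 5 8]    -> the second column
```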
There are a lot of things you can do with NumPy arrays, and I won’t
get the chance to cover them all. I’ll start with the basics.
You can also check for equality (or make any other comparison). You
can do it element-wise and get back an array of boolean values. You
can also check whether two arrays are equal as a whole
using np.array_equal().
>>> a = np.array([1, 2, 3, 4])
>>> b = np.array([4, 2, 2, 4])
>>> a == b
array([False,  True, False,  True])
>>> a > b
array([False, False,  True, False])
>>> np.array_equal(a, b)
False
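If you want a single True/False answer out of an element-wise comparison, the boolean array can be reduced with .any() or .all(). The sketch below reuses the same a and b; the variable name elementwise is just for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([4, 2, 2, 4])

elementwise = a == b  # boolean array, one entry per position

print(elementwise.any())            # True: at least one position matches
print(elementwise.all())            # False: not every position matches
print(np.array_equal(a, b))         # False: the arrays differ somewhere
print(np.array_equal(a, a.copy()))  # True: same shape and same values
```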
Let’s take a simple array. I’ll simply show you some of the reductions
you should know.
>>> x = np.array([1, 3, 2])
>>> x.sum()
6
>>> x.min()
1
>>> x.max()
3
>>> x.mean()
2.0
>>> np.median(x)
2.0
>>> x.std()
0.816496580927726
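It can help to see where the standard deviation value comes from. By default, x.std() divides by the number of elements (the population formula); passing ddof=1 switches to the sample formula. The sketch below recomputes the value by hand, with variable names chosen only for illustration.

```python
import numpy as np

x = np.array([1, 3, 2])

# Population standard deviation computed step by step
mean = x.sum() / x.size                      # 2.0
variance = ((x - mean) ** 2).sum() / x.size  # (1 + 1 + 0) / 3
manual_std = np.sqrt(variance)

print(manual_std)                        # matches x.std()
print(np.isclose(manual_std, x.std()))   # True
print(x.std(ddof=1))                     # 1.0: sample version, divides by n - 1
```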
The last couple operations that’ll be super important are your cross
product, dot product, and matrix multiplication operations.
>>> x = np.array([1, 2, 3])
>>> y = np.array([2, 3, 4])
# Dot product
>>> np.dot(x, y)
20
# Cross product
>>> np.cross(x, y)
array([-1,  2, -1])
# Matrix multiplication
>>> a
array([[1, 1],
       [2, 2]])
>>> b
array([[1, 3],
       [2, 4]])
>>> np.matmul(a,b)
array([[ 3,  7],
       [ 6, 14]])
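One thing that trips people up is that * on NumPy arrays is element-wise, not matrix multiplication. The sketch below rebuilds a and b from the values displayed above and compares the @ operator (equivalent to np.matmul for 2-D arrays) with *.

```python
import numpy as np

# a and b rebuilt from the values displayed above
a = np.array([[1, 1],
              [2, 2]])
b = np.array([[1, 3],
              [2, 4]])

matrix_product = a @ b       # same result as np.matmul(a, b)
elementwise_product = a * b  # multiplies matching entries only

print(matrix_product)       # [[ 3  7]
                            #  [ 6 14]]
print(elementwise_product)  # [[1 3]
                            #  [4 8]]
```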
Phew! That was a lot of examples with NumPy, and there’s a bunch
more you can do with it. Let’s move on to Pandas, which is very
useful for organizing data you’ll typically encounter in the real
world!
Intro to Pandas
The two main data structures you’ll come across in Pandas are
the DataFrame and the Series.
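As a minimal sketch, here is one way to build each of them. The values are made up: the Series holds a few hand-picked numbers, and the DataFrame is filled with random values in four columns labeled A to D, so your numbers will differ from any output shown in this article.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labeled array (values are hypothetical)
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# A DataFrame is a two-dimensional table; here 6 rows of random numbers
rng = np.random.default_rng(seed=0)  # seeded so repeated runs match
df = pd.DataFrame(rng.standard_normal((6, 4)), columns=list("ABCD"))

print(s)
print(df.shape)       # (6, 4)
print(type(df["A"]))  # each column of a DataFrame is a Series
```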
First, we can view the head (the first few rows) and tail (the last
few rows) of the DataFrame. By default, each returns 5 rows,
but we can specify how many rows we want.
>>> df.head()
          A         B         C         D
0  1.529833 -0.933167  0.728422 -0.797813
1 -0.508315 -0.952360 -0.148712  0.702790
2 -1.590158  0.376262  0.367797 -0.226617
3  1.066155  1.067526 -0.684484  1.310766
4  0.385859  0.087228  1.476244  0.511632
>>> df.tail(2)
          A         B         C         D
4  0.385859  0.087228  1.476244  0.511632
5  1.035326  1.011037 -0.753938 -0.285154
You can grab a single column, which yields a Series, or you can grab
rows using Python slice notation.
>>> df['B'] # Specify the column label
0   -0.933167
1   -0.952360
2    0.376262
3    1.067526
4    0.087228
5    1.011037
Name: B, dtype: float64
>>> df[0:3] # Specify the row indices using slice notation
          A         B         C         D
0  1.529833 -0.933167  0.728422 -0.797813
1 -0.508315 -0.952360 -0.148712  0.702790
2 -1.590158  0.376262  0.367797 -0.226617
EDIT: You can also select multiple columns by passing a list of the
column names you want! For example: df[['A', 'B']] .
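The return type depends on what you pass in: a single label gives back a Series, while a list of labels gives back a DataFrame, even if the list has only one entry. Here is a small sketch using a made-up DataFrame.

```python
import pandas as pd

# Hypothetical data, just to show the return types
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

single = df["B"]          # one label -> Series
multi = df[["A", "B"]]    # list of labels -> DataFrame with those columns
one_col_frame = df[["B"]] # one-element list -> still a DataFrame

print(type(single).__name__)         # Series
print(type(multi).__name__)          # DataFrame
print(type(one_col_frame).__name__)  # DataFrame
```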
However, you cannot select a single row using [] notation. For that,
you’ll need to use either loc or iloc. loc will use the named label for
the index, while iloc will use the integer index.
You can also use slice notation for more powerful data accesses.
>>> df.iloc[0:3]
          A         B         C         D
0  1.529833 -0.933167  0.728422 -0.797813
1 -0.508315 -0.952360 -0.148712  0.702790
2 -1.590158  0.376262  0.367797 -0.226617
>>> df.iloc[0:3, 1:3] # Rows 0-2, columns 1-2
          B         C
0 -0.933167  0.728422
1 -0.952360 -0.148712
2  0.376262  0.367797
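The difference between loc and iloc is easiest to see when the index labels are not integers. The sketch below uses a made-up DataFrame with string labels. Note that a loc slice includes its end label, while an iloc slice excludes its end position, just like regular Python slicing.

```python
import pandas as pd

# Hypothetical data with string labels for the index
df = pd.DataFrame(
    {"A": [10, 20, 30], "B": [1.5, 2.5, 3.5]},
    index=["x", "y", "z"],
)

row_by_label = df.loc["y"]    # select by index label
row_by_position = df.iloc[1]  # select by integer position

print(row_by_label.equals(row_by_position))  # True: both are the second row

# loc slices are inclusive of the end label; iloc slices are exclusive
print(len(df.loc["x":"y"]))  # 2 rows: "x" and "y"
print(len(df.iloc[0:1]))     # 1 row: position 0 only
```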
Editing DataFrames
You can reset the indices of the resulting DataFrame using
reset_index() so that they run consecutively from 0 again. The ‘drop’
setting makes sure the original indices are not saved into a new column.
>>> df = df.reset_index(drop=True)
>>> df
          A         B         C         D         F
0  0.000000       NaN  0.728422 -0.797813  1.000000
1 -0.508315 -0.952360 -0.148712  0.702790  2.000000
2 -1.590158  0.376262  0.367797 -0.226617  3.000000
3  1.066155  1.067526 -0.684484  1.310766  4.000000
4  0.385859  0.087228  1.476244  0.511632  5.000000
5  1.035326  1.011037 -0.753938 -0.285154  6.000000
6 -0.075624  0.210857  0.215464 -0.732181  2.151847
7 -0.265325  1.323702 -0.488284  1.253780 -1.949705
8 -0.592924 -0.442635  0.601039  1.839268 -1.247409
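To see what the drop setting changes, here is a small sketch on a made-up DataFrame whose index has a gap after two rows are removed. Without drop=True, the old index is kept as a new column named "index".

```python
import pandas as pd

# Hypothetical data; dropping rows 0 and 1 leaves the index as 2, 3
df = pd.DataFrame({"A": [1, 2, 3, 4]})
remaining = df.drop(index=[0, 1])

kept = remaining.reset_index()              # old index saved in an "index" column
dropped = remaining.reset_index(drop=True)  # old index discarded

print(list(remaining.index))  # [2, 3]
print(list(kept.columns))     # ['index', 'A']
print(list(dropped.columns))  # ['A']
print(list(dropped.index))    # [0, 1]
```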