
NUMPY:-

The NumPy library is the core library for scientific computing in Python.
It provides a high-performance multidimensional array object and tools
for working with these arrays.
NumPy (or Numpy) is Python's linear algebra library. It is especially
important for data science in Python because almost every library in
the PyData ecosystem relies on NumPy as one of its building blocks.

Using Numpy:-

import numpy as np

NumPy has a lot of built-in functions and capabilities. We will not cover them all;
instead we will focus on some of the most important aspects of NumPy.

NUMPY ARRAYS:-
NumPy arrays are the main way we will use NumPy throughout the lesson.
NumPy arrays essentially come in two flavors: vectors and matrices. Vectors
are strictly 1-d arrays and matrices are 2-d.


Creating NumPy Arrays

From a Python List

We can create an array by directly converting a list or list of lists:

my_list = [1,2,3,5]
my_list

np.array(my_list)

my_matrix = [[1,2,3],[4,5,6],[7,8,9]]
my_matrix

np.array(my_matrix)

Built-in Methods
There are lots of built-in ways to generate arrays.

arange

Return evenly spaced values within a given interval.

np.arange(0,10)

np.arange(0,11,2)

zeros and ones

Generate arrays of zeros or ones

np.zeros((5,5))

np.ones(3)

np.ones((3,3))

linspace

Return evenly spaced numbers over a specified interval.

np.linspace(0,10,3)

np.linspace(0,10,50)
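The difference from arange is that linspace takes the number of points you want, while arange takes a step size. A quick sketch of the expected results:

np.arange(0, 10, 2)    # array([0, 2, 4, 6, 8])  - step size of 2, stop value excluded
np.linspace(0, 10, 5)  # array([ 0. ,  2.5,  5. ,  7.5, 10. ])  - 5 evenly spaced points, stop included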

eye

Creates an identity matrix

np.eye(4)

Random
Numpy also has lots of ways to create random number arrays:

rand

Create an array of the given shape and populate it with random samples from a uniform
distribution over [0, 1).

np.random.rand(2)

np.random.rand(5,5)

randn

Return a sample (or samples) from the "standard normal" distribution. Unlike rand which is
uniform:

np.random.randn(2)

np.random.randn(5,5)

randint

Return random integers from low (inclusive) to high (exclusive).

np.random.randint(1,100)

np.random.randint(1,100,10)
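A rough sketch of what the three random functions return (the actual numbers will differ on every run):

np.random.rand(2)              # e.g. array([0.37, 0.91])  - uniform over [0, 1)
np.random.randn(2)             # e.g. array([-0.52, 1.30]) - standard normal, values can be negative
np.random.randint(1, 100, 10)  # ten random integers between 1 and 99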

Array Attributes and Methods


Let's discuss some useful attributes and methods of an array:

arr = np.arange(25)
ranarr = np.random.randint(0,50,10)

arr

ranarr

Reshape
Returns an array containing the same data with a new shape.

arr.reshape(5,5)

max,min,argmax,argmin
These are useful methods for finding max or min values, or for finding their index
locations using argmax and argmin.

ranarr

ranarr.max()

ranarr.argmax()

ranarr.min()

ranarr.argmin()
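Since ranarr is random its outputs will vary from run to run, so here is a quick fixed illustration (these values are made up for the example):

v = np.array([4, 9, 1, 7])
v.max()     # 9
v.argmax()  # 1  - index of the maximum
v.min()     # 1
v.argmin()  # 2  - index of the minimum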

Shape
Shape is an attribute that arrays have (not a method):

# Vector
arr.shape

# Notice the two sets of brackets
arr.reshape(1,25)

arr.reshape(1,25).shape

arr.reshape(25,1)

arr.reshape(25,1).shape
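Note that reshape only works when the new shape holds exactly the same number of elements as the original array; a small sketch of what to expect:

arr = np.arange(25)
arr.shape                # (25,)
arr.reshape(5,5).shape   # (5, 5) - still 25 elements
# arr.reshape(5,6)       # would raise a ValueError: cannot reshape array of size 25 into shape (5,6)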

dtype

You can also grab the data type of the object in the array:

arr.dtype

Numpy Indexing and Selection


In this lecture we will discuss how to select elements or groups of elements from an array.

import numpy as np

# Creating sample array
arr = np.arange(0,11)

# Show
arr

Bracket Indexing and Selection


The simplest way to pick one or some elements of an array looks very similar to Python lists:

# Get a value at an index
arr[8]

# Get values in a range
arr[1:5]

# Get values in a range
arr[0:5]

Broadcasting
NumPy arrays differ from a normal Python list because of their ability to broadcast.
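Broadcasting means an operation is applied across every element at once; the simplest case is combining an array with a single scalar (a minimal sketch):

np.arange(3) + 100   # array([100, 101, 102])

Assigning a value to a whole slice, as below, relies on the same mechanism: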

# Setting a value with index range (broadcasting)
arr[0:5] = 100

# Show
arr

# Reset array, we'll see why I had to reset in a moment
arr = np.arange(0,11)

# Show
arr

# Important notes on slices
slice_of_arr = arr[0:6]

# Show slice
slice_of_arr

# Change slice
slice_of_arr[:] = 99

# Show slice again
slice_of_arr

# Now note the changes also occur in our original array!

# Data is not copied, it's a view of the original array! This avoids memory problems!

# To get a copy, need to be explicit
arr_copy = arr.copy()

arr_copy

Indexing a 2D array (matrices)


The general format is arr_2d[row][col] or arr_2d[row,col]. I recommend usually using the
comma notation for clarity.

arr_2d = np.array(([5,10,15],[20,25,30],[35,40,45]))

# Show
arr_2d

# Indexing row
arr_2d[1]

# Format is arr_2d[row][col] or arr_2d[row,col]
# Getting individual element value
arr_2d[1][0]

# Getting individual element value
arr_2d[1,0]

# 2D array slicing

# Shape (2,2) from top right corner
arr_2d[:2,1:]

# Bottom row
arr_2d[2]

# Bottom row
arr_2d[2,:]
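For the 3x3 matrix above, those slices work out as follows (a sketch of the expected output):

arr_2d[:2,1:]   # array([[10, 15], [25, 30]]) - top-right 2x2 corner
arr_2d[2]       # array([35, 40, 45]) - bottom row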

Fancy Indexing
Fancy indexing allows you to select entire rows or columns out of order. To show this,
let's quickly build out a NumPy array:

# Set up matrix
arr2d = np.zeros((10,10))

# Length of array
arr_length = arr2d.shape[1]

# Set up array
for i in range(arr_length):
    arr2d[i] = i

arr2d

# Fancy indexing allows the following
arr2d[[2,4,6,8]]

# Allows selection in any order
arr2d[[6,4,2,7]]

Selection
Let's briefly go over how to use brackets for selection based off of comparison operators.

arr = np.arange(1,11)
arr

# Comparing the array to a value returns a Boolean array
arr > 4

bool_arr = arr > 4

bool_arr

# Use the Boolean array to select only the elements where the condition is True
arr[bool_arr]

# Or do it all in one step
arr[arr > 2]

x = 2
arr[arr > x]

NumPy Operations
Arithmetic
You can easily perform array with array arithmetic, or scalar with array arithmetic. Let's see some
examples:

import numpy as np
arr = np.arange(0,10)

arr + arr

arr * arr

arr - arr

# Warning on division by zero, but not an error!
# Just replaced with nan
arr/arr

arr**3

Universal Array Functions


NumPy comes with many universal array functions (ufuncs), which are essentially just
mathematical operations applied element-wise across the array. Let's show some common
ones:

# Taking square roots
np.sqrt(arr)

# Calculating exponential (e^)
np.exp(arr)

np.max(arr) #same as arr.max()

np.sin(arr)

np.log(arr)   # note: log(0) produces -inf with a runtime warning
PANDAS:-
Pandas has so many uses that it might make sense to list the things it
can't do instead of what it can do.

This tool is essentially your data’s home. Through pandas, you get
acquainted with your data by cleaning, transforming, and analyzing it.

For example, say you want to explore a dataset stored in a CSV on your
computer. Pandas will extract the data from that CSV into a DataFrame
— a table, basically — then let you do things like:

- Calculate statistics and answer questions about the data, like:
  - What's the average, median, max, or min of each column?
  - Does column A correlate with column B?
  - What does the distribution of data in column C look like?
- Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
- Visualize the data with help from Matplotlib: plot bars, lines, histograms, bubbles, and more
- Store the cleaned, transformed data back into a CSV, another file, or a database (see the short sketch after this list)
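As a rough end-to-end sketch of that workflow (the file name purchases.csv and the apples column are hypothetical here):

import pandas as pd

df = pd.read_csv('purchases.csv')     # load the CSV into a DataFrame
df['apples'].mean()                   # a simple statistic on one column
df = df.dropna()                      # drop rows with missing values
df.to_csv('purchases_clean.csv')      # store the cleaned data back to disk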
Before you jump into modeling or complex visualizations, you need to have a good
understanding of the nature of your dataset, and pandas is the best avenue through
which to do that.

Not only is the pandas library a central component of the data science toolkit,
but it is also used in conjunction with other libraries in that collection.

Pandas is built on top of the NumPy package, meaning a lot of the structure of
NumPy is used or replicated in pandas. Data in pandas is often used to feed
statistical analysis in SciPy, plotting functions from Matplotlib, and machine
learning algorithms in Scikit-learn.

Pandas First Steps

Install and import

Pandas is an easy package to install. Open up your terminal program (for Mac users)
or command line (for PC users) and install it using either of the following commands:

conda install pandas

pip install pandas

Then import it into your script or notebook:

import pandas as pd

Core components of pandas: Series and DataFrames

The primary two components of pandas are the Series and the DataFrame.

A Series is essentially a column, and a DataFrame is a multi-dimensional table
made up of a collection of Series.

DataFrames and Series are quite similar in that many operations that
you can do with one you can do with the other, such as filling in null
values and calculating the mean.
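For instance, a minimal sketch of the relationship (the values here are made up):

import pandas as pd

s = pd.Series([3, 2, 0, 1], name='apples')   # a Series is a single labelled column
df = pd.DataFrame({'apples': s})             # a DataFrame is a collection of Series
df['apples']                                 # selecting one column gives back a Series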

You'll see how these components work when we start working with data
below.

Creating DataFrames from scratch


Creating DataFrames right in Python is good to know and quite useful
when testing new methods and functions you find in the pandas docs.

There are many ways to create a DataFrame from scratch, but a great option is to
just use a simple dict.

Let's say we have a fruit stand that sells apples and oranges. We want to
have a column for each fruit and a row for each customer purchase. To
organize this as a dictionary for pandas we could do something like:

data = {
    'apples': [3, 2, 0, 1],
    'oranges': [0, 3, 7, 2]
}

And then pass it to the pandas DataFrame constructor:

purchases = pd.DataFrame(data)

purchases

The Index of this DataFrame was given to us on creation as the numbers 0-3, but we
could also create our own when we initialize the DataFrame.

Let's have customer names as our index:

purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases

Now we can locate a customer's order by using their name:

purchases.loc['June']

Reading data from CSVs

With CSV files all you need is a single line to load in the data:

df = pd.read_csv('purchases.csv')

df

Reading JSON files works much the same way:

df = pd.read_json('purchases.json')

df
Now let's move to a larger movie dataset, designating the Title column as the index
when reading the CSV, and take a quick look at it:

movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")

movies_df.head()     # first 5 rows
movies_df.tail(2)    # last 2 rows
movies_df.shape      # (number of rows, number of columns)

movies_df.info()     # column names, data types, and non-null counts

Handling duplicates

This dataset does not have duplicate rows, but it is always important to
verify you aren't aggregating duplicate rows.

To demonstrate, let's simply double up our movies DataFrame by appending it to itself:

temp_df = movies_df.append(movies_df)
temp_df.shape
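Note that DataFrame.append() was removed in pandas 2.0, so if the call above fails on your version, the same doubling can be done with pd.concat() (a minimal equivalent sketch):

temp_df = pd.concat([movies_df, movies_df])  # same doubled DataFrame, without append()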

Using append() will return a copy without affecting the original DataFrame. We are
capturing this copy in temp_df so we aren't working with the real data. Notice that
calling .shape quickly proves our DataFrame rows have doubled. Now we can try
dropping duplicates:

temp_df = temp_df.drop_duplicates()

temp_df.shape

Just like append(), the drop_duplicates() method will also return a copy of your
DataFrame, but this time with duplicates removed. Calling .shape again confirms the
duplicate rows are gone.
It's a little verbose to keep assigning DataFrames to the same variable like in this
example. For this reason, pandas has the inplace keyword argument on many of its
methods. Using inplace=True will modify the DataFrame object in place:

temp_df.drop_duplicates(inplace=True)

Now our temp_df will have the transformed data automatically. The drop_duplicates()
method also takes a keep argument: the default 'first' keeps the first occurrence of
each duplicated row, 'last' keeps the last, and keep=False drops every row that has a
duplicate. Since each row in our doubled DataFrame has an exact duplicate, keep=False
leaves nothing behind:

temp_df = movies_df.append(movies_df) # make a new copy

temp_df.drop_duplicates(inplace=True, keep=False)

temp_df.shape

Column cleanup
Many times datasets will have verbose column names with symbols,
upper and lowercase words, spaces, and typos. To make selecting data
by column name easier we can spend a little time cleaning up their
names.

Here's how to print the column names of our dataset:

movies_df.columns

# Rename every column by assigning a new list of names...
movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
                     'rating', 'votes', 'revenue_millions', 'metascore']

# ...or simply lowercase each existing name with a list comprehension
movies_df.columns = [col.lower() for col in movies_df]

movies_df.columns

How to work with missing values

When exploring data, you'll most likely encounter missing or null values, which are
essentially placeholders for non-existent values. Most commonly you'll see Python's
None or NumPy's np.nan, each of which are handled differently in some situations.

There are two options in dealing with nulls:

1. Get rid of rows or columns with nulls
2. Replace nulls with non-null values, a technique known as imputation

Let's calculate the total number of nulls in each column of our dataset. The first
step is to check which cells in our DataFrame are null:

movies_df.isnull()                 # True/False for every cell

movies_df.isnull().sum()           # count of nulls per column

movies_df.dropna()                 # returns a copy with rows containing nulls removed

revenue = movies_df['revenue_millions']
revenue.head()

revenue_mean = revenue.mean()

revenue_mean

revenue.fillna(revenue_mean, inplace=True)   # impute missing revenue with the mean

movies_df.isnull().sum()

movies_df.describe()               # summary statistics for the numeric columns
movies_df['genre'].describe()      # summary for a categorical column

movies_df['genre'].value_counts().head(10)   # ten most frequent genres

movies_df.corr()                   # pairwise correlations between numeric columns
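Note that on newer pandas versions corr() no longer silently skips non-numeric columns, so you may need to restrict it explicitly:

movies_df.corr(numeric_only=True)   # correlations between the numeric columns only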

DataFrame slicing, selecting, extracting

Up until now we've focused on some basic summaries of our data. We've learned about
simple column extraction using single brackets, and we imputed null values in a
column using fillna(). Below are the other methods of slicing, selecting, and
extracting you'll need to use constantly.

It's important to note that, although many methods are the same, DataFrames and
Series have different attributes, so you'll need to be sure you know which type you
are working with or else you will receive attribute errors.

Let's look at working with columns first.


By column

You already saw how to extract a column using square brackets like this:

genre_col = movies_df['genre']

type(genre_col)    # single brackets return a Series

genre_col = movies_df[['genre']]

type(genre_col)    # a list of column names (double brackets) returns a DataFrame

By rows

For rows, we have two options: .loc locates by the index name and .iloc locates by the numerical index:

prom = movies_df.loc["Prometheus"]

prom

prom = movies_df.iloc[1]

# Slicing rows works too; note that .loc slices include the end label while .iloc excludes the end position
movie_subset = movies_df.loc['Prometheus':'Sing']

movie_subset = movies_df.iloc[1:4]

movie_subset
We can also make conditional selections by comparing a column to a value, which returns a Boolean Series:

condition = (movies_df['director'] == "Ridley Scott")

condition.head()
movies_df[(movies_df['director'] == 'Christopher Nolan') | (movies_df['director'] == 'Ridley Scott')].head()

# The same selection written more concisely with isin()
movies_df[movies_df['director'].isin(['Christopher Nolan', 'Ridley Scott'])].head()

# Each condition is wrapped in parentheses and combined with & (and) / | (or)
movies_df[
    ((movies_df['year'] >= 2005) & (movies_df['year'] <= 2010))
    & (movies_df['rating'] > 8.0)
    & (movies_df['revenue_millions'] < movies_df['revenue_millions'].quantile(0.25))
]

Applying functions

It is possible to iterate over a DataFrame or Series as you would with a list, but
doing so, especially on large datasets, is very slow.

An efficient alternative is to apply() a function to the dataset. For example, we
could use a function to convert movies with a rating of 8.0 or greater to a string
value of "good" and the rest to "bad", and use these transformed values to create a
new column.

First we would create a function that, when given a rating, determines if it's good
or bad:

def rating_function(x):
    if x >= 8.0:
        return "good"
    else:
        return "bad"

Now we want to send the entire rating column through this function, which is what
apply() does:

movies_df["rating_category"] = movies_df["rating"].apply(rating_function)

movies_df.head(2)
movies_df["rating_category"] = movies_df["rating"].apply(lambda x:
'good' if x >= 8.0 else 'bad')
movies_df.head(2)

Let's plot the relationship between ratings and revenue. All we need to
do is call .plot() on movies_df with some info about how to construct the
plot:

movies_df.plot(kind='scatter', x='rating', y='revenue_millions', title='Revenue (millions)');

If we want to plot a simple histogram based on a single column, we can call plot on
a column:

movies_df['rating'].plot(kind='hist', title='Rating');
movies_df['rating'].describe()

movies_df['rating'].plot(kind="box");
movies_df.boxplot(column='revenue_millions', by='rating_category');
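If you're running these examples outside a Jupyter notebook, you may also need Matplotlib's show() call to display the figures:

import matplotlib.pyplot as plt
plt.show()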
