
Lecture 2.

Numpy
2-1. Numpy basic
Numpy : efficient implementation of n-dim array
built in C : fast
1-d array, 2-d array, ..., n-d array

In [414]:
import numpy as np

In [415]:
a = [1,2,3,4,5,6,7,8]

print("a : List =", a)

b = np. array(a)

print("b : np array =", b)

a : List = [1, 2, 3, 4, 5, 6, 7, 8]
b : np array = [1 2 3 4 5 6 7 8]

NumPy standard data types


NumPy arrays contain values of a single type
data type can be specified when constructing an array
np.zeros(10, dtype=int)

np.zeros(10, dtype=float)

np.zeros(10, dtype='int16')

np.zeros(10, dtype=np.float32)

Data type     Description
---------     -----------
bool_         Boolean (True or False) stored as a byte
int_          Default integer type (same as C long; normally either int64 or int32)
intc          Identical to C int (normally int32 or int64)
intp          Integer used for indexing (same as C ssize_t; normally either int32 or int64)
int8          Byte (-128 to 127)
int16         Integer (-32768 to 32767)
int32         Integer (-2147483648 to 2147483647)
int64         Integer (-9223372036854775808 to 9223372036854775807)
uint8         Unsigned integer (0 to 255)
uint16        Unsigned integer (0 to 65535)
uint32        Unsigned integer (0 to 4294967295)
uint64        Unsigned integer (0 to 18446744073709551615)
float_        Shorthand for float64 (this alias was removed in NumPy 2.0; use float64)
float16       Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
float32       Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
float64       Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
complex_      Shorthand for complex128 (this alias was removed in NumPy 2.0; use complex128)
complex64     Complex number, represented by two 32-bit floats
complex128    Complex number, represented by two 64-bit floats

In [416]:
np. array([1, 2, 3, 4], dtype= 'float32')

Out[416]: array([1., 2., 3., 4.], dtype=float32)

Creating arrays from scratch


shape :
(d1) : 1-d array of size d1
(d1,d2) : 2-d array of size d1xd2
(d1,d2,d3) : 3-d array of size d1xd2xd3
...

In [417]:
# Create a length-10 integer array filled with zeros

a0 = np. zeros(10, dtype= int)

print(a0. shape)

print(a0)

# Create a 2x5 floating-point array filled with ones

a1 = np. ones((2, 5), dtype= float)

print(a1. shape)

print(a1)

# Create a 2x5 array filled with 3.14

af = np. full((2, 5), 3.14)

print(af)

(10,)
[0 0 0 0 0 0 0 0 0 0]
(2, 5)
[[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]]
[[3.14 3.14 3.14 3.14 3.14]
[3.14 3.14 3.14 3.14 3.14]]

In [418]:
# Create a 3x3 identity matrix

np. eye(3)

Out[418]: array([[1., 0., 0.],


[0., 1., 0.],
[0., 0., 1.]])

In [419]:
# Create an array filled with a linear sequence

# Starting at 0, ending at 20, stepping by 2

# (this is similar to the built-in range() function)

np. arange(0, 20, 2)

Out[419]: array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])

Creating array with random numbers


In [9]:
# setting seed for random number generator

# for reproducibility

import random

import numpy as np

random. seed(0)

np. random. seed(0)

In [10]:
# Create a 3x3 array of uniform[0,1] random numbers

np. random. random(size= (3, 3))

Out[10]: array([[0.5488135 , 0.71518937, 0.60276338],


[0.54488318, 0.4236548 , 0.64589411],
[0.43758721, 0.891773 , 0.96366276]])

In [421]:
# Create a 3x3 array of random integers in the interval [0, 10)

np. random. randint(0, 10, size= (3, 3))

Out[421]: array([[7, 1, 6],


[9, 9, 8],
[6, 3, 4]])

In [422]:
# Create a 3x3 array of N(0,1)

np. random. randn(3,3)

Out[422]: array([[ 0.33786932, 1.39970946, 1.1298669 ],


[-0.07111281, -0.80368313, -1.11158007],
[ 1.01861985, 0.36387617, -0.30621626]])

In [423]:
# Create a 3x3 array of normal random numbers with mean 50 and standard deviation 10

np. random. normal(50, 10, size= (3, 3))

Out[423]: array([[52.7827885 , 50.90993972, 35.35756522],


[66.58160611, 53.7680413 , 36.89402385],
[53.89007158, 60.90909425, 61.22526522]])

NumPy array attributes


dtype : the data type of the array
ndim : the number of axes
shape : the size of each axis
size : the total number of elements in the array
itemsize : the size of each array element (in bytes)
nbytes : the total size of the array (in bytes)

In [424]:
np. random. seed(0) # seed for reproducibility

x3 = np. random. randint(0, 10, size= (3, 4, 5)) # Three-dimensional array

print("dtype:", x3. dtype)

print("ndim: ", x3. ndim)

print("shape:", x3. shape)

print("size: ", x3. size)

print("itemsize:", x3. itemsize, "bytes")

print("nbytes:", x3. nbytes, "bytes")

dtype: int32
ndim: 3
shape: (3, 4, 5)
size: 60
itemsize: 4 bytes
nbytes: 240 bytes

Array indexing, slicing: similar to python list


In [425]:
x1 = np. arange(10)

print (x1)

[0 1 2 3 4 5 6 7 8 9]

In [426]:
x1[1] = 1.8 # truncated to integer

print (x1[0], x1[1], x1[- 1], x1[- 2])

print (x1[:4])

print (x1[4:7])

print (x1[7:])

print (x1[7:- 1])

print (x1[1:8:2])

0 1 9 8
[0 1 2 3]
[4 5 6]
[7 8 9]
[7 8]
[1 3 5 7]

Slicing

- a view (not a copy) of the base array

- to make a copy, use copy()

In [ ]:
x1 = np. arange(10)

y = x1[4:7]

print(x1, y)

y[0] = 0

print(x1, y) # TAQ

z = x1[7:]. copy()

z[0] = 0

print(x1,z) # TAQ
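
The view-versus-copy behaviour in the cell above can be checked directly. The following is a minimal sketch that repeats the same steps and uses np.shares_memory to test whether two arrays share a buffer; the printed values are what these exact statements produce.

```python
import numpy as np

x1 = np.arange(10)

y = x1[4:7]          # basic slice: a view onto x1's data
y[0] = 0             # writes through to x1[4]

z = x1[7:].copy()    # explicit copy: z owns its own data
z[0] = 0             # x1[7] is left untouched

print(x1)                        # [0 1 2 3 0 5 6 7 8 9]
print(np.shares_memory(x1, y))   # True
print(np.shares_memory(x1, z))   # False
```

Writing through a view is convenient for updating part of a large array in place, but it also means a slice handed to other code can silently modify the original.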

In [428]:
x2 = np. random. randint(0, 100, size= (3,4))

print (x2)

print (x2[0]) # first row

print (x2[0,1:3])

print (x2[:,1]) # second col

[[42 58 31 1]
[65 41 57 35]
[11 46 82 91]]
[42 58 31 1]
[58 31]
[58 41 46]

Array reshaping
y = x.reshape(new_shape) : returns the data of x with a new shape
no-copy view whenever possible : y references the original array's data
x.reshape(-1) : flattens to 1-d array
numpy n-d array : 1-d array storage + n-d view

In [ ]:
x1 = np. arange(9)

print(x1, '\n')

x2 = x1. reshape((3, 3))

print(x2, '\n')

x2[1] = 0 # x2[1] = x2[1,:]

print(x1) # TAQ
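
As a rough check of the "no-copy view" claim, the sketch below repeats the cell above and then shows one case where reshape has to copy: flattening a transposed array, whose elements are no longer laid out contiguously in row-major order. The name flat_t is introduced here only for illustration.

```python
import numpy as np

x1 = np.arange(9)
x2 = x1.reshape((3, 3))      # view: same 1-d storage, 2-d indexing
x2[1] = 0                    # zeroes the middle row, and thus x1[3:6]

print(x1)                            # [0 1 2 0 0 0 6 7 8]
print(np.shares_memory(x1, x2))      # True

# Flattening the transpose cannot be expressed as a view of the
# original buffer, so reshape silently returns a copy here.
flat_t = x2.T.reshape(-1)
print(np.shares_memory(x1, flat_t))  # False
```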

In [16]:
x2 = np. arange(12). reshape((3, 4))

print (x2, '\n')

print (x2[0], '\n')

print (x2[1]. reshape((1,4)), '\n')

print (x2[2]. reshape((4,1)), '\n')

[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]

[0 1 2 3]

[[4 5 6 7]]

[[ 8]
[ 9]
[10]
[11]]

In [17]:
print (x2. reshape(- 1), '\n')

print (x2. reshape((4,- 1)), '\n')

print (x2. reshape((- 1,6)), '\n')

print (x2. reshape((2,2,- 1)))

[ 0 1 2 3 4 5 6 7 8 9 10 11]

[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]]

[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]]

[[[ 0 1 2]
[ 3 4 5]]

[[ 6 7 8]
[ 9 10 11]]]

1-d index of n-d array


x.shape = $(d_0, d_1, d_2)$

y = x.reshape(-1) : flattened (1-d) view of x


$x[a, b, c] \equiv y[k]$, where $k = a\,d_1 d_2 + b\,d_2 + c = (a\,d_1 + b)\,d_2 + c$

In [6]:
import numpy as np

x3 = np. arange(2* 3* 4)

x4 = x3. reshape((2,3,4))

a,b,c = 1,1,2

print(x4[a,b,c])

k = (a* 3 + b)* 4 + c

print(x3[k])

x3[k] = 0

print (x4[a,b,c]) # TAQ

18
18
0
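
NumPy can also do the index arithmetic above for us. The sketch below compares the hand-computed flat index with np.ravel_multi_index and inverts it with np.unravel_index; the variable names k_manual, k_numpy and abc are chosen here for illustration.

```python
import numpy as np

shape = (2, 3, 4)            # (d0, d1, d2)
a, b, c = 1, 1, 2

# k = (a*d1 + b)*d2 + c, as in the formula above
k_manual = (a * shape[1] + b) * shape[2] + c

# the same mapping computed by NumPy (row-major order by default)
k_numpy = int(np.ravel_multi_index((a, b, c), shape))

# and the inverse mapping: flat index -> (a, b, c)
abc = tuple(int(i) for i in np.unravel_index(k_manual, shape))

print(k_manual, k_numpy)     # 18 18
print(abc)                   # (1, 1, 2)
```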

Concatenating arrays
In [18]:
x = np. array([1, 2, 3])

y = np. array([3, 2, 1])

print (np. concatenate([x, y]), '\n')

z = [99, 99, 99]

print(np. concatenate([x, y, z]))

[1 2 3 3 2 1]
[ 1 2 3 3 2 1 99 99 99]

In [20]:
x = np. arange(0,8). reshape((2, 4))

y = np. arange(8,16). reshape((2, 4))

print (x, '\n')

print (y, '\n')

print (np. concatenate([x,y], axis= 0), '\n')

print (np. vstack([x,y]), '\n')

print (np. concatenate([x,y], axis= 1), '\n')

print (np. hstack([x,y]))

[[0 1 2 3]
[4 5 6 7]]

[[ 8 9 10 11]
[12 13 14 15]]

[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]

[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]

[[ 0 1 2 3 8 9 10 11]
[ 4 5 6 7 12 13 14 15]]

[[ 0 1 2 3 8 9 10 11]
[ 4 5 6 7 12 13 14 15]]

2-2. Computation on NumPy arrays: Universal Functions
Loops are slow
In [22]:
import numpy as np

a = np. random. random(size= 1000000)

print(a. sum()) # TAQ : any guess?

500387.3135894248

Why is the following code slow?
because it uses a list?
because it uses a for loop?

In [23]:
# using list and for-loop
def reciprocal_1(x):
    n = len(x)
    y = []
    s = 0.0
    for i in range(n):
        z = 1.0 / x[i]
        s += z
        y.append(z)
    return y, s

%timeit b, s = reciprocal_1(a)

381 ms ± 13.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [24]:
# using numpy array and for-loop
def reciprocal_2(x):
    n = len(x)
    y = np.zeros(n)
    for i in range(n):
        y[i] = 1.0 / x[i]
    return y, y.sum()

%timeit b, s = reciprocal_2(a)

373 ms ± 37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [25]:
# using numpy array and no-loop
def reciprocal_3(x):
    y = 1 / x
    return y, y.sum()

# the result is named s rather than sum, to avoid shadowing the built-in sum()
%timeit b, s = reciprocal_3(a)

4.38 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
UFuncs : vectorized operations
ufunc : universal function
very fast
element-wise operations on numpy array
unary
binary: scalar ⊙ np array
binary: np array ⊙ np array

In [438]:
x = np. arange(0,11,2)

y = np. arange(1,12,2)

print(x)

print(y)

[ 0 2 4 6 8 10]
[ 1 3 5 7 9 11]

In [439]:
print("x =", x)

print("x + 5 =", x + 5)

print("x - 5 =", x - 5)

print("x * 2 =", x * 2)

print("x / 2 =", x / 2)

print("x // 2 =", x // 2) # floor division

x = [ 0 2 4 6 8 10]
x + 5 = [ 5 7 9 11 13 15]
x - 5 = [-5 -3 -1 1 3 5]
x * 2 = [ 0 4 8 12 16 20]
x / 2 = [0. 1. 2. 3. 4. 5.]
x // 2 = [0 1 2 3 4 5]

In [440]:
print("-x = ", -x)

print("x ** 2 = ", x ** 2)

print("x % 2 = ", x % 2)

print("-(x/2+1)**2 = ", -(0.5 * x + 1) ** 2)

-x = [ 0 -2 -4 -6 -8 -10]
x ** 2 = [ 0 4 16 36 64 100]
x % 2 = [0 0 0 0 0 0]
-(x/2+1)**2 = [ -1. -4. -9. -16. -25. -36.]

In [441]:
print (x + 2)

print (np. add(x, 2))

[ 2 4 6 8 10 12]
[ 2 4 6 8 10 12]
The following table lists the arithmetic operators implemented in NumPy:

Operator    Equivalent ufunc    Description
--------    ----------------    -----------
+           np.add              Addition (e.g., 1 + 1 = 2)
-           np.subtract         Subtraction (e.g., 3 - 2 = 1)
-           np.negative         Unary negation (e.g., -2)
*           np.multiply         Multiplication (e.g., 2 * 3 = 6)
/           np.divide           Division (e.g., 3 / 2 = 1.5)
//          np.floor_divide     Floor division (e.g., 3 // 2 = 1)
**          np.power            Exponentiation (e.g., 2 ** 3 = 8)
%           np.mod              Modulus/remainder (e.g., 9 % 4 = 1)

Math functions
In [442]:
x = [- 1, 2, - 3]

print("x =", x) # x: python list

y = np. abs(x) # x is converted to np array, so is y

print("y=|x| =", y)

x = [-1, 2, -3]
y=|x| = [1 2 3]

In [443]:
y = [1, 2, 3]

print("e^y =", np. exp(y))

print("2^y =", np. exp2(y))

print("3^y =", np. power(3, y))

e^y = [ 2.71828183 7.3890561 20.08553692]


2^y = [2. 4. 8.]
3^y = [ 3 9 27]

In [444]:
x = [1, 2, 4, 10]

print("x =", x)

print("ln(x) =", np. log(x))

print("log2(x) =", np. log2(x))

print("log10(x) =", np. log10(x))

x = [1, 2, 4, 10]
ln(x) = [0. 0.69314718 1.38629436 2.30258509]
log2(x) = [0. 1. 2. 3.32192809]
log10(x) = [0. 0.30103 0.60205999 1. ]

Special function : np.expm1(x), np.log1p(x)


for computing more precisely when x is small
np.expm1(x) : high precision function for np.exp(x)-1
np.log1p(x) : high precision function for np.log(1+x)

In [445]:
x = np. array([0.001, 0.0001, 0.00001], dtype= np. float32)

y = np. array(x, dtype= np. float64)

print("exp(x) - 1 =", np. exp(x)- 1)

print("exp(y) - 1 =", np. exp(y)- 1)

print("expm1(x) =", np. expm1(x))

print("log(1 + x) =", np. log(1+ x))

print("log(1 + y) =", np. log(1+ y))

print("log1p(x) =", np. log1p(x))

exp(x) - 1 = [1.00052357e-03 1.00016594e-04 1.00135803e-05]


exp(y) - 1 = [1.00050021e-03 1.00004998e-04 1.00000497e-05]
expm1(x) = [1.0005003e-03 1.0000499e-04 1.0000050e-05]
log(1 + x) = [9.99546959e-04 1.00011595e-04 1.00135303e-05]
log(1 + y) = [9.99500381e-04 9.99949978e-05 9.99994975e-06]
log1p(x) = [9.995003e-04 9.999500e-05 9.999950e-06]

Trigonometric functions
In [446]:
theta = np.linspace(0, np.pi, 4)

print("theta = ", theta)

print("sin(theta) = ", np. sin(theta))

print("cos(theta) = ", np. cos(theta))

print("tan(theta) = ", np. tan(theta))

theta = [0. 1.04719755 2.0943951 3.14159265]


sin(theta) = [0.00000000e+00 8.66025404e-01 8.66025404e-01 1.22464680e-16]
cos(theta) = [ 1. 0.5 -0.5 -1. ]
tan(theta) = [ 0.00000000e+00 1.73205081e+00 -1.73205081e+00 -1.22464680e-16]

In [447]:
x = [- 1, 0, 1]

print("x = ", x)

print("arcsin(x) = ", np. arcsin(x))

print("arccos(x) = ", np. arccos(x))

print("arctan(x) = ", np. arctan(x))

x = [-1, 0, 1]
arcsin(x) = [-1.57079633 0. 1.57079633]
arccos(x) = [3.14159265 1.57079633 0. ]
arctan(x) = [-0.78539816 0. 0.78539816]

Specialized functions : gamma, beta, erf, ...


scipy.special : provides many special functions

In [448]:
from scipy import special

In [449]:
# Gamma functions (generalized factorials) and related functions

x = [1, 5, 10]

print("gamma(x) =", special. gamma(x))

print("ln|gamma(x)| =", special. gammaln(x))

print("beta(x, 2) =", special. beta(x, 2))

gamma(x) = [1.0000e+00 2.4000e+01 3.6288e+05]


ln|gamma(x)| = [ 0. 3.17805383 12.80182748]
beta(x, 2) = [0.5 0.03333333 0.00909091]

In [450]:
# Error function (integral of Gaussian)

# its complement, and its inverse

x = np. array([0, 0.3, 0.7, 1.0])

print("erf(x) =", special. erf(x))

print("erfc(x) =", special. erfc(x))

print("erfinv(x) =", special. erfinv(x))

erf(x) = [0. 0.32862676 0.67780119 0.84270079]


erfc(x) = [1. 0.67137324 0.32219881 0.15729921]
erfinv(x) = [0. 0.27246271 0.73286908 inf]

Specifying output
In [27]:
x = np. arange(5)

y = np. arange(10)

np. multiply(x, 10, out= y[3:8]) # store x*10 to y[3:8]

print(y)

y[3:8] = np. multiply(x, 10)

print(y)

[ 0 1 2 0 10 20 30 40 8 9]
[ 0 1 2 0 10 20 30 40 8 9]
2-3. Aggregations : sum, min, max, and so on
numpy aggregation functions are much faster than standard python aggregation

In [452]:
L = np. random. random(100000)

%timeit sum(L)

%timeit np.sum(L)

15.4 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
45.2 µs ± 1.52 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [453]:
%timeit max(L)

%timeit L.max()

10 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
36.3 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [28]:
M = np. random. random((3, 4))

print(M)

[[0.59753225 0.07463842 0.43803321 0.21812627]


[0.53159247 0.10793176 0.8448757 0.03346237]
[0.32140902 0.65124221 0.61473394 0.2195387 ]]

In [29]:
print(M. sum())

print(M. sum(axis= 0), '\n')

print(M. cumsum(axis= 1), '\n')

print(M. prod(axis= 0), '\n')

print(M. cumprod(axis= 1), '\n')

4.653116331422877
[1.45053374 0.83381239 1.89764286 0.47112734]

[[0.59753225 0.67217067 1.11020389 1.32833016]


[0.53159247 0.63952423 1.48439993 1.5178623 ]
[0.32140902 0.97265124 1.58738518 1.80692388]]

[0.10209353 0.00524631 0.22750296 0.00160242]

[[0.59753225 0.04459886 0.01953578 0.00426127]


[0.53159247 0.05737571 0.04847534 0.0016221 ]
[0.32140902 0.20931512 0.12867311 0.02824873]]

In [30]:
print('min =', M. min(axis= 0))

print('max =', M. max(axis= 0))

print('mean=', M. mean(axis= 0))

print('var =', M. var(axis= 0))

print('std =', M. std(axis= 0))

print('med =', np. median(M, axis= 0)) # M.median(axis=0) does not work

print('p75%=', np. percentile(M, 75, axis= 0))

min = [0.32140902 0.07463842 0.43803321 0.03346237]


max = [0.59753225 0.65124221 0.8448757 0.2195387 ]
mean= [0.48351125 0.27793746 0.63254762 0.15704245]
var = [0.01386324 0.06986296 0.02774547 0.00763635]
std = [0.11774227 0.26431602 0.1665697 0.08738621]
med = [0.53159247 0.10793176 0.61473394 0.21812627]
p75%= [0.56456236 0.37958699 0.72980482 0.21883248]

In [31]:
x = np. arange(9,- 1,- 1)

print (np. argmin(x))

print (x. argmax())

9
0

2-4. Example: What is the Average Height of US Presidents?
Aggregates available in NumPy can be extremely useful for summarizing a set of values. As a simple example, let's consider the heights of all US presidents. This data is available in the file president_heights.csv, which is a simple comma-separated list of labels and values:

president_heights.csv
order,name,height(cm)

1,George Washington,189

2,John Adams,170

3,Thomas Jefferson,189
...

pandas to read the file


pandas will be explored more fully later

In [458]:
import pandas as pd

data = pd. read_csv('data/president_heights.csv')

heights = np. array(data['height(cm)'])

print(heights)

[189 170 189 163 183 171 185 168 173 183 173 173 175 178 183 193 178 173
174 183 183 168 170 178 182 180 183 178 182 188 175 179 183 193 182 183
177 185 188 188 182 185]

summary statistics:
In [459]:
print("Mean height: ", heights. mean())

print("Standard deviation:", heights. std())

print("Minimum height: ", heights. min())

print("Maximum height: ", heights. max())

Mean height: 179.73809523809524


Standard deviation: 6.931843442745892
Minimum height: 163
Maximum height: 193

quantiles:
In [460]:
print("25th percentile: ", np. percentile(heights, 25))

print("Median: ", np. median(heights))

print("75th percentile: ", np. percentile(heights, 75))

25th percentile: 174.25


Median: 182.0
75th percentile: 183.0

In [461]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # set plot style

In [462]:
plt. hist(heights)

plt. title('Height Distribution of US Presidents')

plt. xlabel('height (cm)')

plt. ylabel('number');


2-5. Broadcasting
Motivation
ufunc : element-wise operation
A⊙B
what if A and B have different shapes?
we want to match the shapes, as long as there is a natural way to do so
broadcasting : rules for applying a binary ufunc when the shapes differ

In [463]:
import numpy as np

In [464]:
a = np. array([0, 1, 2])

b = np. array([5, 5, 5])

print (a + b)

[5 6 7]

In [465]:
print(a + 5)

[5 6 7]

we can view a + 5 as :
duplicate the value 5 into the array [5, 5, 5]
then add element-wise
this is only a mental model (a simple way of thinking about broadcasting)
numpy does not actually build the duplicated array, so it is more efficient
We can similarly extend this to arrays of higher dimension

In [466]:
a = np. array([0, 1, 2])

M = np. ones((3, 3))

print(M+ a)

[[1. 2. 3.]
[1. 2. 3.]
[1. 2. 3.]]

M + a
a has shape (3,), so it is treated as a single row of shape (1, 3)
that row is duplicated (broadcast) along the first axis, i.e. down the rows,
in order to match the shape of M .

In [32]:
a = np. arange(3)

b = np. arange(3). reshape((3,1))

print(a, '\n')

print(b, '\n')

print(a+ b)

[0 1 2]

[[0]
[1]
[2]]

[[0 1 2]
[1 2 3]
[2 3 4]]

[Figure: visualization of broadcasting in a + 5 , M + a , and a + b]

The light boxes represent the broadcasted values: again, this extra memory is not actually
allocated in the course of the operation, but it can be useful conceptually to imagine that it is.
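
One way to see that the stretched values are not really allocated is np.broadcast_to, which returns a read-only broadcast view. The sketch below is only an illustration of the idea (the name stretched is chosen here): the stride along the duplicated axis is 0, so every row reads the same three numbers in memory.

```python
import numpy as np

a = np.arange(3)

# a (3, 3) array that "repeats" a down the rows without copying it
stretched = np.broadcast_to(a, (3, 3))

print(stretched)
print(stretched.strides[0])             # 0: moving to the next row moves 0 bytes
print(np.shares_memory(a, stretched))   # True
```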

Rules of Broadcasting
Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two
arrays:

Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with
fewer dimensions is padded with ones on its leading (left) side.
Rule 2: If the shape of the two arrays does not match in any dimension, the array with
shape equal to 1 in that dimension is stretched to match the other shape.
Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
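
The three rules can be written out as a small function that works on shapes only. This is a sketch for illustration, not NumPy's implementation; broadcast_shape is a hypothetical helper defined here, and it is compared against np.broadcast_shapes, which is assumed to be available (NumPy 1.20 or later).

```python
import numpy as np


def broadcast_shape(shape_a, shape_b):
    """Apply the three broadcasting rules to two shapes (illustrative helper)."""
    ndim = max(len(shape_a), len(shape_b))
    # Rule 1: pad the shorter shape with ones on its left side
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for da, db in zip(a, b):
        if da == db or db == 1:
            result.append(da)   # Rule 2: a size-1 axis of b is stretched to da
        elif da == 1:
            result.append(db)   # Rule 2: a size-1 axis of a is stretched to db
        else:
            # Rule 3: the sizes disagree and neither is 1
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(result)


print(broadcast_shape((3, 3), (3,)))      # (3, 3)
print(broadcast_shape((3, 1), (3,)))      # (3, 3)
print(np.broadcast_shapes((3, 1), (3,)))  # (3, 3)

try:
    broadcast_shape((3, 2), (3,))         # (3, 2) vs (1, 3): 2 != 3
except ValueError as err:
    print("error:", err)
```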

Non-compatible example
In [468]:
M = np. ones((3, 2))

a = np. arange(3)

M + a # TAQ : result?

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-468-d4adfa68cd62> in <module>
      1 M = np.ones((3, 2))
      2 a = np.arange(3)
----> 3 M + a

ValueError: operands could not be broadcast together with shapes (3,2) (3,)

Broadcasting rules apply to any binary ufunc .


e.g. logaddexp(a, b) = log(exp(a) + exp(b))

In [33]:
a = np. array([0, 1, 2])

M = np. ones((3, 3))

print (np. logaddexp(M, a),'\n')

print (np. logaddexp(M, a. reshape((3,1))))

[[1.31326169 1.69314718 2.31326169]


[1.31326169 1.69314718 2.31326169]
[1.31326169 1.69314718 2.31326169]]

[[1.31326169 1.31326169 1.31326169]


[1.69314718 1.69314718 1.69314718]
[2.31326169 2.31326169 2.31326169]]

Broadcasting Example : Centering an array (i.e. zero mean)


Some data analysis algorithms assume zero-mean data for simplicity
PCA
How to do centering?

In [34]:
X = np. random. random((10, 3))

print(X,'\n')

Xmean = X. mean(axis= 0)

print(Xmean)

[[0.20020551 0.71990961 0.04386778]


[0.24253241 0.58362154 0.19365921]
[0.73131058 0.54673692 0.34738314]
[0.4634587 0.36476761 0.48515853]
[0.83694219 0.67311311 0.08619293]
[0.08002749 0.40544108 0.42883816]
[0.34285624 0.21887864 0.82597284]
[0.91164433 0.76492665 0.4030136 ]
[0.2637624 0.37390141 0.97775962]
[0.40045545 0.65564646 0.33258765]]

[0.44731953 0.5306943 0.41244335]


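Xmean holds the mean of each feature (column), computed with the mean aggregate across the first
dimension. Subtracting it from X is a broadcasting operation, and the centered array should have a
mean of zero in each column, up to floating-point error: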

In [471]:
X_centered = X - Xmean

print (X_centered. mean(axis= 0))

[ 9.43689571e-17 -5.55111512e-17 4.44089210e-17]

Broadcasting Example : Plotting a two-dimensional function


plot a function z = f (x, y)

need to evaluate f (x, y) at 50x50 grid points


use broadcasting to compute z = f (x, y)

then plot z using matplotlib, which will be covered later

In [472]:
# x and y have 50 steps from 0 to 5

x = np. linspace(0, 5, 50) # shape=(50,)

y = np. linspace(0, 5, 50). reshape((- 1,1)) # shape=(50,1)

z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

print('shape of z = ', z. shape)

print(z)

shape of z = (50, 50)


[[-0.83907153 -0.83470697 -0.8216586 ... 0.8956708 0.68617261
0.41940746]
[-0.83907153 -0.82902677 -0.8103873 ... 0.92522407 0.75321348
0.52508175]
[-0.83907153 -0.82325668 -0.79876457 ... 0.96427357 0.84172689
0.66446403]
...
[-0.83907153 -0.48233077 -0.01646558 ... 0.96449925 0.75196531
0.41982581]
[-0.83907153 -0.47324558 0.00392612 ... 0.92542163 0.68540362
0.37440839]
[-0.83907153 -0.46410908 0.02431613 ... 0.89579384 0.65690314
0.40107702]]

In [473]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.imshow(z, origin='lower', extent=[0, 5, 0, 5], cmap='viridis')
plt.colorbar();

2-6. Comparisons, Masks, and Boolean Logic


In [ ]:
x = np.array([1, 2, 3, 4, 5])
b = (x <= 3)
print("x <= 3 : ", b)
print(np.sum(x <= 3)) # TAQ
print(np.count_nonzero(x <= 3)) # TAQ

In [475]:
print(x * b) # masking
print(np.sum(x * b))
print(x[x <= 3])
print(x[b])
print(np.sum(x[x <= 3]))

[1 2 3 0 0]
6
[1 2 3]
[1 2 3]
6

In [476]:
print((3 <= x) & (x <= 4))
print(np.any((3 <= x) & (x <= 4)))
print((x < 3) | (x > 4))
print(np.all((x < 3) | (x > 4)))

[False False True True False]


True
[ True True False False True]
False

In [477]:
b = (x <= 3)

print (x* b) # masking

print (np. sum(x* b))

[1 2 3 0 0]
6

Motivating Example: Sleepless in Seattle


Is Seattle really a rainy city?
Let's get data first!
daily rainfall data from January 1 to December 31, 2014.

In [37]:
import numpy as np
import pandas as pd

# use pandas to extract the daily precipitation (PRCP column) as a NumPy array

data = pd. read_csv('data/Seattle2014.csv')

rainfall = data['PRCP']. values

rainfall. shape

Out[37]: (365,)

In [43]:
%matplotlib inline
import matplotlib.pyplot as plt

# you may need to install seaborn to set nice plot styles
# >>> conda install seaborn
import seaborn; seaborn.set()

plt. hist(rainfall, 40);

Questions (on Seattle rainfall data in 2014)


number of rainy days
number of rainy days in non-summer
precipitation in summer
precipitation in non-summer
...

In [480]:
print("Number days without rain: ", np.sum(rainfall == 0))
print("Number days with rain: ", np.sum(rainfall > 0))
print("Days with more than 10 mm: ", np.sum(rainfall > 10))
print("Rainy days with < 5 mm: ", np.sum((rainfall > 0) & (rainfall < 5)))

Number days without rain: 215


Number days with rain: 150
Days with more than 10 mm: 120
Rainy days with < 5 mm: 10

In [481]:
# construct a mask of all rainy days

rainy = (rainfall > 0)

# construct a mask of all summer days (June 21st is the 172nd day)

days = np. arange(365)

summer = (days > 172) & (days < 262)

print("Median precip on rainy days in 2014 (mm): ",

np. median(rainfall[rainy]))

print("Median precip on summer days in 2014 (mm): ",

np. median(rainfall[summer]))

print("Maximum precip on summer days in 2014 (mm): ",

np. max(rainfall[summer]))

print("Median precip on non-summer rainy days (mm):",

np. median(rainfall[rainy & ~ summer]))

Median precip on rainy days in 2014 (mm): 49.5


Median precip on summer days in 2014 (mm): 0.0
Maximum precip on summer days in 2014 (mm): 216
Median precip on non-summer rainy days (mm): 51.0

2-7. Fancy Indexing


Indexing np array
simple index: arr[0]
slice: arr[:5]
Boolean mask: arr[arr > 0]
fancy indexing: arr[[1,3,7]]

In [482]:
x = np. random. randint(100, size= 10)

print(x)

[ 7 8 89 16 52 87 72 34 4 0]

In [ ]:
ind = [3, 7, 4]

print (x[ind]) # TAQ : result?

In [ ]:
ind = np. array([[3, 7],

[4, 5]])

x[ind] # TAQ : result?
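
With fancy indexing, the shape of the result follows the shape of the index array, not the shape of the array being indexed. The sketch below illustrates this; it hard-codes the values of x printed above so that it does not depend on the random seed, and the name ind2 is introduced here for the 2-d index.

```python
import numpy as np

# the values printed for x above, typed in by hand
x = np.array([7, 8, 89, 16, 52, 87, 72, 34, 4, 0])

ind = [3, 7, 4]
print(x[ind])            # [16 34 52]

ind2 = np.array([[3, 7],
                 [4, 5]])
print(x[ind2].shape)     # (2, 2): same shape as ind2
print(x[ind2])           # rows are [16 34] and [52 87]
```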

In [485]:
X = np.arange(12).reshape((3, 4))

X

Out[485]: array([[ 0, 1, 2, 3],


[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

In [486]:
row = np. array([0, 1, 2])

col = np. array([2, 1, 3])

X[row, col] # TAQ : result?

Out[486]: array([ 2, 5, 11])

In [487]:
row = np. array([0, 2]). reshape((2,1))

col = np. array([2, 1, 3])

X[row, col] # TAQ : result? Hint : broadcasting is applied

Out[487]: array([[ 2, 1, 3],


[10, 9, 11]])
Combined Indexing
combining simple, slice, mask, and fancy index

In [488]:
print(X)

col = [2,0,1]

print(X[2, col])

print(X[1:, col])

[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[10 8 9]
[[ 6 4 5]
[10 8 9]]

In [489]:
row = np. array([0, 2]). reshape((2,1))

col_mask = np.array([True, False, True, False])

X[row, col_mask]

Out[489]: array([[ 0, 2],


[ 8, 10]])

Example: Selecting Random Points


consider a set of N points in D dimensions
we generate N = 100 points in 2D
bi-variate normal
plot them using matplotlib
randomly select 20 points from them
mark selected points in different shape

In [45]:
mean = [0, 0]

cov = [[1, 2],

[2, 5]]

X = np. random. multivariate_normal(mean, cov, 100)

X. shape

Out[45]: (100, 2)

In [46]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # for plot styling

plt. scatter(X[:, 0], X[:, 1]);

In [47]:
indices = np.random.choice(X.shape[0], 20, replace=False)

print (indices)

selection = X[indices] # fancy indexing here

print (selection. shape)

[ 2 81 12 6 94 68 82 30 1 23 37 3 64 21 11 45 83 67 92 71]
(20, 2)

In [48]:
plt. scatter(X[:, 0], X[:, 1], alpha= 0.3)

plt. scatter(selection[:, 0], selection[:, 1],

facecolor= 'red', s= 7);

Modifying Values with Fancy Indexing


In [494]:
x = np. arange(10)

idx = np. array([2, 1, 8, 4])

x[idx] = 99

print(x)

x[idx] -= 10

print(x)

[ 0 99 99 3 99 5 6 7 99 9]
[ 0 89 89 3 89 5 6 7 89 9]
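We can use any assignment-type operator with an index in this way. For example:

In [495]:
x[i] -= 10 # i is a scalar index set in an earlier cell; judging from the output below, it selects the last element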

print(x)

[ 0 89 89 3 89 5 6 7 89 -1]

Avoid duplication in fancy index


may cause unexpected results
use at() method if duplication is unavoidable

In [496]:
# duplication in fancy index

x = np. zeros(10)

x[[0, 0]] = [4, 6]

print(x)

[6. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

In [497]:
# duplication in fancy index

idx = [2, 3, 3, 4, 4, 4]

x[idx] += 1

print(x)

[6. 0. 1. 1. 1. 0. 0. 0. 0. 0.]

In [498]:
x = np. zeros(10)

np. add. at(x, idx, 1)

print(x)

[0. 0. 1. 2. 3. 0. 0. 0. 0. 0.]
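
The reason for the difference is that x[idx] += 1 is evaluated as x[idx] = x[idx] + 1: the right-hand side is computed once, and repeated indices then simply overwrite the same slot. The sketch below contrasts the two behaviours and also shows np.bincount as an alternative way to count occurrences of each index; the names buffered and counts are introduced here for illustration.

```python
import numpy as np

idx = [2, 3, 3, 4, 4, 4]

# buffered update: each repeated index is assigned the same value, 0 + 1
x = np.zeros(10)
x[idx] += 1
buffered = x.copy()
print(buffered)      # [0. 0. 1. 1. 1. 0. 0. 0. 0. 0.]

# unbuffered update: the addition is applied once per occurrence
x = np.zeros(10)
np.add.at(x, idx, 1)
print(x)             # [0. 0. 1. 2. 3. 0. 0. 0. 0. 0.]

# counting occurrences of each index gives the same numbers (as integers)
counts = np.bincount(idx, minlength=10)
print(counts)        # [0 0 1 2 3 0 0 0 0 0]
```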

Sorting
np.sort
np.argsort

In [499]:
x = np. random. random(5)

y = np. sort(x)

print(x) # x is not changed

print(y)

x. sort() # in-place sort

print(x)

[0.59488531 0.19874637 0.33881144 0.23509604 0.80192003]


[0.19874637 0.23509604 0.33881144 0.59488531 0.80192003]
[0.19874637 0.23509604 0.33881144 0.59488531 0.80192003]

In [500]:
height = 150 + 40 * np.random.random(5)
print('height=', height)

money = 100 * np.random.random(5)
print('money =', money)

idx = np.argsort(height) # idx in order of height
print('index =', idx)
print('h[idx]=', height[idx]) # fancy index
print('m[idx]=', money[idx]) # fancy index

height= [189.40070616 161.84410721 163.2017855 167.6655901 184.75984166]


money = [ 3.3759033 12.44325455 84.22738284 5.61546723 45.12811488]
index = [1 2 3 4 0]
h[idx]= [161.84410721 163.2017855 167.6655901 184.75984166 189.40070616]
m[idx]= [12.44325455 84.22738284 5.61546723 45.12811488 3.3759033 ]

Partitioning
complete sorting is not always needed
e.g. when we only want the k smallest values in the array
np.partition(x, k) :
puts the k smallest values to the left of position k
and the remaining values to the right, each side in arbitrary order:

In [501]:
x = np. array([7, 2, 3, 1, 6, 5, 4])

y = np. partition(x, 3)

print (y)

idx = np. argpartition(x, 3)

print (idx)

print (x[idx])

[2 1 3 4 6 5 7]
[1 3 2 6 4 5 0]
[2 1 3 4 6 5 7]

In [502]:
X = np. random. randint(0, 10, (4, 6))

print (X)

print (np. partition(X, 2, axis= 1))

[[5 8 4 2 0 0]
[6 5 1 9 6 8]
[8 4 4 1 2 1]
[0 4 1 0 6 7]]
[[0 0 2 4 5 8]
[1 5 6 9 6 8]
[1 1 2 8 4 4]
[0 0 1 4 6 7]]
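
A common use of partitioning is to get the k smallest values in sorted order without sorting the whole array. The sketch below partitions first and then sorts only the k selected values; the names k, smallest and smallest_sorted are chosen here for illustration.

```python
import numpy as np

x = np.array([7, 2, 3, 1, 6, 5, 4])
k = 3

# after partitioning at position k-1, the first k entries are the
# k smallest values, in no particular order
smallest = np.partition(x, k - 1)[:k]

# sorting just these k values is cheap compared with sorting all of x
smallest_sorted = np.sort(smallest)
print(smallest_sorted)   # [1 2 3]
```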

Example: k-Nearest Neighbors


Randomly create 10 points in 2D

In [509]:
N = 10

X = np. random. rand(N, 2)

In [510]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # Plot styling

plt.scatter(X[:, 0], X[:, 1], s=100);

In [511]:
# squared distance matrix (NxN)

dist_sq = np.sum((X.reshape(N, 1, 2) - X.reshape(1, N, -1)) ** 2, axis=-1)

dist = np. sqrt(dist_sq)

In [512]:
# the above can be done using scipy.spatial.distance

from scipy.spatial.distance import pdist, squareform

# pdist(.) : pairwise distance, metric = 'euclidean' by default

# squareform(.) : nxn matrix form

dist = squareform(pdist(X))
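
To see what the broadcasting expression for the distance matrix computes, it can be compared with a plain double loop. The sketch below uses only NumPy and its own random points (a seeded default_rng generator, an assumption made here so the example is self-contained); diff and dist_loop are names introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
X = rng.random((N, 2))

# (N, 1, 2) - (1, N, 2) broadcasts to (N, N, 2): all pairwise differences
diff = X.reshape(N, 1, 2) - X.reshape(1, N, 2)
dist_sq = np.sum(diff ** 2, axis=-1)      # (N, N) squared distances
dist = np.sqrt(dist_sq)

# reference version: one pair of points at a time
dist_loop = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        dist_loop[i, j] = np.sqrt(np.sum((X[i] - X[j]) ** 2))

print(diff.shape)                         # (10, 10, 2)
print(np.allclose(dist, dist_loop))       # True
```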

In [513]:
K = 2

knn0 = np. argpartition(dist, K + 1, axis= 1)

In [514]:
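plt.scatter(X[:, 0], X[:, 1], s=100)

# draw lines from each point to its two nearest neighbors
# argpartition leaves the first K+1 columns (the point itself and its K
# nearest neighbors) in arbitrary order, so keep all of them;
# the line from a point to itself has zero length
knn = knn0[:, :K + 1]
for i in range(N):
    for j in knn[i]:
        # plot a line from X[i] to X[j]
        # use some zip magic to make it happen:
        plt.plot(*zip(X[j], X[i]), color='black')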
