
Bokeh Cheat Sheet
Python For Data Science
Learn Bokeh online at www.DataCamp.com, taught by Bryan Van de Ven, core contributor

Plotting With Bokeh

The Python interactive visualization library Bokeh enables high-performance
visual presentation of large datasets in modern web browsers.

Bokeh's mid-level general purpose bokeh.plotting interface is centered
around two main components: data and glyphs.

The basic steps to creating plots with the bokeh.plotting interface are:
1. Prepare some data (Python lists, NumPy arrays, Pandas DataFrames and other sequences of values)
2. Create a new plot
3. Add renderers for your data, with visual customizations
4. Specify where to generate the output
5. Show or save the results

>>> from bokeh.plotting import figure
>>> from bokeh.io import output_file, show
>>> x = [1, 2, 3, 4, 5] #Step 1
>>> y = [6, 7, 2, 4, 5]
>>> p = figure(title="simple line example", #Step 2
               x_axis_label='x',
               y_axis_label='y')
>>> p.line(x, y, legend="Temp.", line_width=2) #Step 3
>>> output_file("lines.html") #Step 4
>>> show(p) #Step 5

1 Data  Also see Lists, NumPy & Pandas

Under the hood, your data is converted to Column Data Sources.
You can also do this manually:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.array([[33.9, 4, 65, 'US'],
                                [32.4, 4, 66, 'Asia'],
                                [21.4, 4, 109, 'Europe']]),
                      columns=['mpg', 'cyl', 'hp', 'origin'],
                      index=['Toyota', 'Fiat', 'Volvo'])
>>> from bokeh.models import ColumnDataSource
>>> cds_df = ColumnDataSource(df)

2 Plotting

>>> from bokeh.plotting import figure
>>> p1 = figure(plot_width=300, tools='pan,box_zoom')
>>> p2 = figure(plot_width=300, plot_height=300,
                x_range=(0, 8), y_range=(0, 8))
>>> p3 = figure()

3 Renderers & Visual Customizations

Glyphs

Scatter Markers
>>> p1.circle(np.array([1,2,3]), np.array([3,2,1]),
              fill_color='white')
>>> p2.square(np.array([1.5,3.5,5.5]), [1,4,3],
              color='blue', size=1)

Line Glyphs
>>> p1.line([1,2,3,4], [3,4,5,6], line_width=2)
>>> p2.multi_line(pd.DataFrame([[1,2,3],[5,6,7]]),
                  pd.DataFrame([[3,4,5],[3,2,1]]),
                  color="blue")

Customized Glyphs  Also see Data

Selection and Non-Selection Glyphs
>>> p = figure(tools='box_select')
>>> p.circle('mpg', 'cyl', source=cds_df,
             selection_color='red',
             nonselection_alpha=0.1)

Hover Glyphs
>>> from bokeh.models import HoverTool
>>> hover = HoverTool(tooltips=None, mode='vline')
>>> p3.add_tools(hover)

Colormapping
>>> from bokeh.models import CategoricalColorMapper
>>> color_mapper = CategoricalColorMapper(
        factors=['US', 'Asia', 'Europe'],
        palette=['blue', 'red', 'green'])
>>> p3.circle('mpg', 'cyl', source=cds_df,
              color=dict(field='origin',
                         transform=color_mapper),
              legend='Origin')

Legend Location

Inside Plot Area
>>> p.legend.location = 'bottom_left'

Outside Plot Area
>>> from bokeh.models import Legend
>>> r1 = p2.asterisk(np.array([1,2,3]), np.array([3,2,1]))
>>> r2 = p2.line([1,2,3,4], [3,4,5,6])
>>> legend = Legend(items=[("One", [p1, r1]), ("Two", [r2])],
                    location=(0, -30))
>>> p.add_layout(legend, 'right')

Legend Orientation
>>> p.legend.orientation = "horizontal"
>>> p.legend.orientation = "vertical"

Legend Background & Border
>>> p.legend.border_line_color = "navy"
>>> p.legend.background_fill_color = "white"

Rows & Columns Layout

Rows
>>> from bokeh.layouts import row
>>> layout = row(p1,p2,p3)

Columns
>>> from bokeh.layouts import column
>>> layout = column(p1,p2,p3)

Nesting Rows & Columns
>>> layout = row(column(p1,p2), p3)

Grid Layout
>>> from bokeh.layouts import gridplot
>>> row1 = [p1,p2]
>>> row2 = [p3]
>>> layout = gridplot([[p1,p2],[p3]])

Tabbed Layout
>>> from bokeh.models.widgets import Panel, Tabs
>>> tab1 = Panel(child=p1, title="tab1")
>>> tab2 = Panel(child=p2, title="tab2")
>>> layout = Tabs(tabs=[tab1, tab2])

Linked Plots

Linked Axes
>>> p2.x_range = p1.x_range
>>> p2.y_range = p1.y_range

Linked Brushing
>>> p4 = figure(plot_width=100, tools='box_select,lasso_select')
>>> p4.circle('mpg', 'cyl', source=cds_df)
>>> p5 = figure(plot_width=200, tools='box_select,lasso_select')
>>> p5.circle('mpg', 'hp', source=cds_df)
>>> layout = row(p4,p5)

4 Output & Export

Notebook
>>> from bokeh.io import output_notebook, show
>>> output_notebook()

HTML

Standalone HTML
>>> from bokeh.embed import file_html
>>> from bokeh.resources import CDN
>>> html = file_html(p, CDN, "my_plot")
>>> from bokeh.io import output_file, show
>>> output_file('my_bar_chart.html', mode='cdn')

Components
>>> from bokeh.embed import components
>>> script, div = components(p)

PNG
>>> from bokeh.io import export_png
>>> export_png(p, filename="plot.png")

SVG
>>> from bokeh.io import export_svgs
>>> p.output_backend = "svg"
>>> export_svgs(p, filename="plot.svg")

5 Show or Save Your Plots

>>> show(p1)
>>> show(layout)
>>> save(p1)
>>> save(layout)

Learn Data Skills Online at www.DataCamp.com
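To see how the data, colormapping, and linked-brushing pieces fit together, here is a minimal sketch of a complete script. It reuses the cds_df columns from the examples above and is written against the same pre-3.0 Bokeh API the sheet uses (plot_width/plot_height); the data values and file name are illustrative.

import pandas as pd
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, CategoricalColorMapper
from bokeh.layouts import row
from bokeh.io import output_file, show

# Build a small DataFrame and wrap it in a ColumnDataSource
df = pd.DataFrame({'mpg': [33.9, 32.4, 21.4],
                   'cyl': [4, 4, 4],
                   'hp': [65, 66, 109],
                   'origin': ['US', 'Asia', 'Europe']},
                  index=['Toyota', 'Fiat', 'Volvo'])
cds_df = ColumnDataSource(df)

# Color points by the 'origin' column
color_mapper = CategoricalColorMapper(factors=['US', 'Asia', 'Europe'],
                                      palette=['blue', 'red', 'green'])

# Two plots sharing the same source give linked selections ("brushing")
p1 = figure(plot_width=300, plot_height=300, tools='box_select,lasso_select')
p1.circle('mpg', 'cyl', source=cds_df,
          color=dict(field='origin', transform=color_mapper))
p2 = figure(plot_width=300, plot_height=300, tools='box_select,lasso_select')
p2.circle('mpg', 'hp', source=cds_df,
          color=dict(field='origin', transform=color_mapper))

output_file('linked_example.html')  # write a standalone HTML file
show(row(p1, p2))                   # open it in a browser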


Data Wrangling in Pandas Cheat Sheet
Python For Data Science
Learn Data Wrangling online at www.DataCamp.com

> Reshaping Data

Pivot
>>> df3 = df2.pivot(index='Date', #Spread rows into columns
                    columns='Type',
                    values='Value')

Pivot Table
>>> df4 = pd.pivot_table(df2, #Spread rows into columns
                         values='Value',
                         index='Date',
                         columns='Type')

Stack / Unstack
>>> stacked = df5.stack() #Pivot a level of column labels
>>> stacked.unstack() #Pivot a level of index labels

Melt
>>> pd.melt(df2, #Gather columns into rows
            id_vars=["Date"],
            value_vars=["Type", "Value"],
            value_name="Observations")

> Iteration

>>> df.iteritems() #(Column-index, Series) pairs
>>> df.iterrows() #(Row-index, Series) pairs

> Advanced Indexing  Also see NumPy Arrays

Selecting
>>> df3.loc[:,(df3>1).any()] #Select cols with any vals >1
>>> df3.loc[:,(df3>1).all()] #Select cols with vals >1
>>> df3.loc[:,df3.isnull().any()] #Select cols with NaN
>>> df3.loc[:,df3.notnull().all()] #Select cols without NaN

Indexing With isin()
>>> df[(df.Country.isin(df2.Type))] #Find same elements
>>> df3.filter(items=["a","b"]) #Filter on values
>>> df.select(lambda x: not x%5) #Select specific elements

Where
>>> s.where(s > 0) #Subset the data

Query
>>> df6.query('second > first') #Query DataFrame

Setting/Resetting Index
>>> df.set_index('Country') #Set the index
>>> df4 = df.reset_index() #Reset the index
>>> df = df.rename(index=str, #Rename DataFrame
                   columns={"Country":"cntry",
                            "Capital":"cptl",
                            "Population":"ppltn"})

Reindexing
>>> s2 = s.reindex(['a','c','d','e','b'])

Forward Filling
>>> df.reindex(range(4),
               method='ffill')

Backward Filling
>>> s3 = s.reindex(range(5),
                   method='bfill')

MultiIndexing
>>> arrays = [np.array([1,2,3]),
              np.array([5,4,3])]
>>> df5 = pd.DataFrame(np.random.rand(3, 2), index=arrays)
>>> tuples = list(zip(*arrays))
>>> index = pd.MultiIndex.from_tuples(tuples,
                                      names=['first', 'second'])
>>> df6 = pd.DataFrame(np.random.rand(3, 2), index=index)
>>> df2.set_index(["Date", "Type"])

> Duplicate Data

>>> s3.unique() #Return unique values
>>> df2.duplicated('Type') #Check duplicates
>>> df2.drop_duplicates('Type', keep='last') #Drop duplicates
>>> df.index.duplicated() #Check index duplicates

> Grouping Data

Aggregation
>>> df2.groupby(by=['Date','Type']).mean()
>>> df4.groupby(level=0).sum()
>>> df4.groupby(level=0).agg({'a': lambda x: sum(x)/len(x), 'b': np.sum})

Transformation
>>> customSum = lambda x: (x+x%2)
>>> df4.groupby(level=0).transform(customSum)

> Missing Data

>>> df.dropna() #Drop NaN values
>>> df3.fillna(df3.mean()) #Fill NaN values with a predetermined value
>>> df2.replace("a", "f") #Replace values with others

> Combining Data

Merge
>>> pd.merge(data1,
             data2,
             how='left',
             on='X1')
>>> pd.merge(data1,
             data2,
             how='right',
             on='X1')
>>> pd.merge(data1,
             data2,
             how='inner',
             on='X1')
>>> pd.merge(data1,
             data2,
             how='outer',
             on='X1')

Join
>>> data1.join(data2, how='right')

Concatenate

Vertical
>>> s.append(s2)

Horizontal/Vertical
>>> pd.concat([s,s2], axis=1, keys=['One','Two'])
>>> pd.concat([data1, data2], axis=1, join='inner')

> Dates

>>> df2['Date'] = pd.to_datetime(df2['Date'])
>>> df2['Date'] = pd.date_range('2000-1-1',
                                periods=6,
                                freq='M')
>>> dates = [datetime(2012,5,1), datetime(2012,5,2)]
>>> index = pd.DatetimeIndex(dates)
>>> index = pd.date_range(datetime(2012,2,1), end, freq='BM')

> Visualization  Also see Matplotlib

>>> import matplotlib.pyplot as plt
>>> s.plot()
>>> plt.show()
>>> df2.plot()
>>> plt.show()

Learn Data Skills Online at www.DataCamp.com
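The four how= options in the Merge block differ only in which keys survive the join. A minimal sketch with made-up data1/data2 frames (the X1/X2/X3 column names follow the sheet's examples; the values are illustrative):

import pandas as pd

data1 = pd.DataFrame({'X1': ['a', 'b', 'c'], 'X2': [1, 2, 3]})
data2 = pd.DataFrame({'X1': ['a', 'b', 'd'], 'X3': [4, 5, 6]})

print(pd.merge(data1, data2, how='left', on='X1'))   # keep all keys from data1
print(pd.merge(data1, data2, how='right', on='X1'))  # keep all keys from data2
print(pd.merge(data1, data2, how='inner', on='X1'))  # only keys present in both ('a', 'b')
print(pd.merge(data1, data2, how='outer', on='X1'))  # union of keys; the missing side becomes NaN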
Data Science Cheat Sheet for Business Leaders
Data Science Basics
Types of Data Science

Descriptive Analytics (Business Intelligence): Get useful data in front of the right people in the form of dashboards, reports, and emails.
- Which customers have churned?
- Which homes have sold in a given location, and do homes of a certain size sell more quickly?

Predictive Analytics (Machine Learning): Put data science models continuously into production.
- Which customers may churn?
- How much will a home sell for, given its location and number of rooms?

Prescriptive Analytics (Decision Science): Use data to help a company make decisions.
- What should we do about the particular types of customers that are prone to churn?
- How should we market a home to sell quickly, given its location and number of rooms?

Building a Data Science Team

Your data team members require different skills for different purposes.
- Data Engineer: stores and maintains data. Tools: SQL/Java/Scala/Python
- Data Analyst: visualizes and describes data. Tools: SQL + BI Tools + Spreadsheets
- Machine Learning Engineer: writes production-level code to predict with data. Tools: Python/Java/R
- Data Scientist: builds custom models to drive business decisions. Tools: Python/R/SQL

Data Science Team Organizational Models
- Centralized/isolated: The data team is the owner of data and answers requests from other teams.
- Embedded: Data experts are dispersed across an organization and report to functional leaders.
- Hybrid: Data experts sit with functional teams and also report to the Chief Data Scientist, so data is an organizational priority.

The Standard Data Science Workflow
1. Data Collection: Compile data from different sources and store it for efficient access
2. Exploration and Visualization: Explore and visualize data through dashboards
3. Experimentation and Prediction: The buzziest topic in data science: machine learning!

www.datacamp.com/courses/data-science-for-business  www.datacamp.com/groups/business
Exploration and Visualization

The type of dashboard you should use depends on what you'll be using it for.

Common Dashboard Elements
- Time series: tracking a value over time
- Stacked bar chart: tracking composition over time
- Bar chart: categorical comparison

Popular Dashboard Tools
- Spreadsheets: Excel, Sheets
- BI Tools: Power BI, Tableau, Looker
- Customized Tools: R Shiny, d3.js

When You Should Request a Dashboard
- When you'll use it multiple times
- When you'll need the information updated regularly
- When the request will always be the same

Experimentation and Prediction

Machine Learning
Machine learning is an application of artificial intelligence (AI) that builds algorithms and statistical models trained on data to address specific questions without explicit instructions.

Supervised Machine Learning
- Purpose: makes predictions from data with labels and features
- Examples: recommendation systems, email subject optimization, churn prediction

Unsupervised Machine Learning
- Purpose: makes predictions by clustering data with no labels into categories
- Examples: image segmentation, customer segmentation

Special Topics in Machine Learning
- Time Series Forecasting is a technique for predicting events through a sequence of time and can capture seasonality or periodic events.
- Natural Language Processing (NLP) allows computers to process and analyze large amounts of natural language data.
  - Text as input data
  - Word counts track the important words in a text
  - Word embeddings create features that group similar words
- Deep Learning / Neural Networks enable unsupervised machine learning using data that is unstructured or unlabeled. Highly accurate predictions; better for "What?"
- Explainable AI is an emerging field in machine learning that applies AI such that results can be easily understood. Understandable by humans; better for "Why?"

www.datacamp.com/courses/data-science-for-business  www.datacamp.com/groups/business
Keras Cheat Sheet
Python For Data Science
Learn Keras online at www.DataCamp.com

Keras

Keras is a powerful and easy-to-use deep learning library for Theano and
TensorFlow that provides a high-level neural networks API to develop and
evaluate deep learning models.

A Basic Example
>>> import numpy as np
>>> from keras.models import Sequential
>>> from keras.layers import Dense
>>> data = np.random.random((1000,100))
>>> labels = np.random.randint(2, size=(1000,1))
>>> model = Sequential()
>>> model.add(Dense(32,
                    activation='relu',
                    input_dim=100))
>>> model.add(Dense(1, activation='sigmoid'))
>>> model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
>>> model.fit(data, labels, epochs=10, batch_size=32)
>>> predictions = model.predict(data)

> Data

Your data needs to be stored as NumPy arrays or as a list of NumPy arrays. Ideally, you split the data in
training and test sets, for which you can also resort to the train_test_split function of sklearn.model_selection.

Keras Data Sets
>>> from keras.datasets import boston_housing, mnist, cifar10, imdb
>>> (x_train, y_train), (x_test, y_test) = mnist.load_data()
>>> (x_train2, y_train2), (x_test2, y_test2) = boston_housing.load_data()
>>> (x_train3, y_train3), (x_test3, y_test3) = cifar10.load_data()
>>> (x_train4, y_train4), (x_test4, y_test4) = imdb.load_data(num_words=20000)
>>> num_classes = 10

Other
>>> from urllib.request import urlopen
>>> data = np.loadtxt(urlopen("http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"), delimiter=",")
>>> X = data[:,0:8]
>>> y = data[:,8]

> Preprocessing  Also see NumPy & Scikit-Learn

Sequence Padding
>>> from keras.preprocessing import sequence
>>> x_train4 = sequence.pad_sequences(x_train4, maxlen=80)
>>> x_test4 = sequence.pad_sequences(x_test4, maxlen=80)

One-Hot Encoding
>>> from keras.utils import to_categorical
>>> Y_train = to_categorical(y_train, num_classes)
>>> Y_test = to_categorical(y_test, num_classes)
>>> Y_train3 = to_categorical(y_train3, num_classes)
>>> Y_test3 = to_categorical(y_test3, num_classes)

Train and Test Sets
>>> from sklearn.model_selection import train_test_split
>>> X_train5, X_test5, y_train5, y_test5 = train_test_split(X, y,
                                                            test_size=0.33,
                                                            random_state=42)

Standardization/Normalization
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(x_train2)
>>> standardized_X = scaler.transform(x_train2)
>>> standardized_X_test = scaler.transform(x_test2)

> Model Architecture

Sequential Model
>>> from keras.models import Sequential
>>> model = Sequential()
>>> model2 = Sequential()
>>> model3 = Sequential()

Multilayer Perceptron (MLP)

Binary Classification
>>> from keras.layers import Dense
>>> model.add(Dense(12,
                    input_dim=8,
                    kernel_initializer='uniform',
                    activation='relu'))
>>> model.add(Dense(8, kernel_initializer='uniform', activation='relu'))
>>> model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))

Multi-Class Classification
>>> from keras.layers import Dropout
>>> model.add(Dense(512, activation='relu', input_shape=(784,)))
>>> model.add(Dropout(0.2))
>>> model.add(Dense(512, activation='relu'))
>>> model.add(Dropout(0.2))
>>> model.add(Dense(10, activation='softmax'))

Regression
>>> model.add(Dense(64, activation='relu', input_dim=train_data.shape[1]))
>>> model.add(Dense(1))

Convolutional Neural Network (CNN)
>>> from keras.layers import Activation, Conv2D, MaxPooling2D, Flatten
>>> model2.add(Conv2D(32, (3,3), padding='same', input_shape=x_train.shape[1:]))
>>> model2.add(Activation('relu'))
>>> model2.add(Conv2D(32, (3,3)))
>>> model2.add(Activation('relu'))
>>> model2.add(MaxPooling2D(pool_size=(2,2)))
>>> model2.add(Dropout(0.25))
>>> model2.add(Conv2D(64, (3,3), padding='same'))
>>> model2.add(Activation('relu'))
>>> model2.add(Conv2D(64, (3,3)))
>>> model2.add(Activation('relu'))
>>> model2.add(MaxPooling2D(pool_size=(2,2)))
>>> model2.add(Dropout(0.25))
>>> model2.add(Flatten())
>>> model2.add(Dense(512))
>>> model2.add(Activation('relu'))
>>> model2.add(Dropout(0.5))
>>> model2.add(Dense(num_classes))
>>> model2.add(Activation('softmax'))

Recurrent Neural Network (RNN)
>>> from keras.layers import Embedding, LSTM
>>> model3.add(Embedding(20000, 128))
>>> model3.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
>>> model3.add(Dense(1, activation='sigmoid'))

> Inspect Model

>>> model.output_shape #Model output shape
>>> model.summary() #Model summary representation
>>> model.get_config() #Model configuration
>>> model.get_weights() #List all weight tensors in the model

> Compile Model

MLP: Binary Classification
>>> model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

MLP: Multi-Class Classification
>>> model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

MLP: Regression
>>> model.compile(optimizer='rmsprop',
                  loss='mse',
                  metrics=['mae'])

Recurrent Neural Network
>>> model3.compile(loss='binary_crossentropy',
                   optimizer='adam',
                   metrics=['accuracy'])

> Model Training

>>> model3.fit(x_train4,
               y_train4,
               batch_size=32,
               epochs=15,
               verbose=1,
               validation_data=(x_test4, y_test4))

> Evaluate Your Model's Performance

>>> score = model3.evaluate(x_test,
                            y_test,
                            batch_size=32)

> Prediction

>>> model3.predict(x_test4, batch_size=32)
>>> model3.predict_classes(x_test4, batch_size=32)

> Save/Reload Models

>>> from keras.models import load_model
>>> model3.save('model_file.h5')
>>> my_model = load_model('my_model.h5')

> Model Fine-tuning

Optimization Parameters
>>> from keras.optimizers import RMSprop
>>> opt = RMSprop(lr=0.0001, decay=1e-6)
>>> model2.compile(loss='categorical_crossentropy',
                   optimizer=opt,
                   metrics=['accuracy'])

Early Stopping
>>> from keras.callbacks import EarlyStopping
>>> early_stopping_monitor = EarlyStopping(patience=2)
>>> model3.fit(x_train4,
               y_train4,
               batch_size=32,
               epochs=15,
               validation_data=(x_test4, y_test4),
               callbacks=[early_stopping_monitor])

Learn Data Skills Online at www.DataCamp.com
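Tying the fine-tuning and save/reload blocks together, a minimal end-to-end sketch on random data, written against the same standalone keras API used above (on current installs these names usually come from tensorflow.keras instead):

import numpy as np
from keras.models import Sequential, load_model
from keras.layers import Dense
from keras.callbacks import EarlyStopping

# Random data so the script is self-contained
x = np.random.random((500, 20))
y = np.random.randint(2, size=(500, 1))

model = Sequential()
model.add(Dense(16, activation='relu', input_dim=20))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Stop when the validation loss has not improved for 2 epochs
stopper = EarlyStopping(patience=2)
model.fit(x, y, epochs=50, batch_size=32,
          validation_split=0.2, callbacks=[stopper])

model.save('model_file.h5')             # persist architecture + weights
restored = load_model('model_file.h5')  # reload and reuse
print(restored.predict(x[:5], batch_size=32))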
> Plotting Routines    > Customize Plot

Python For Data Science


1D Data
>>> fig, ax = plt.subplots()

Colors, Color Bars & Color Maps


>>> plt.plot(x, x, x, x**2, x, x**3)

Matplotlib Cheat Sheet


>>> lines = ax.plot(x,y) #Draw points with lines or markers connecting them
>>> ax.plot(x, y, alpha = 0.4)

>>> ax.scatter(x,y) #Draw unconnected points, scaled or colored


>>> ax.plot(x, y, c='k')

>>> axes[0,0].bar([1,2,3],[3,4,5]) #Plot vertical rectangles (constant width)


>>> fig.colorbar(im, orientation='horizontal')

>>> axes[1,0].barh([0.5,1,2.5],[0,1,2]) #Plot horizontal rectangles (constant height)


>>> im = ax.imshow(img,

>>> axes[1,1].axhline(0.45) #Draw a horizontal line across axes


cmap='seismic')
Learn Matplotlib online at www.DataCamp.com >>> axes[0,1].axvline(0.65) #Draw a vertical line across axes

>>> ax.fill(x,y,color='blue') #Draw filled polygons

>>> ax.fill_between(x,y,color='yellow') #Fill between y-values and 0 Markers


2D Data >>> fig, ax = plt.subplots()

Matplotlib >>> fig, ax = plt.subplots()

>>> ax.scatter(x,y,marker=".")

>>> ax.plot(x,y,marker="o")

>>> im = ax.imshow(img, #Colormapped or RGB arrays

Matplotlib is a Python 2D plotting library which produces


cmap='gist_earth',
Linestyles
interpolation='nearest',

publication-quality figures in a variety of hardcopy formats and


vmin=-2,

interactive environments across platforms. vmax=2)


>>> plt.plot(x,y,linewidth=4.0)

>>> axes2[0].pcolor(data2) #Pseudocolor plot of 2D array


>>> plt.plot(x,y,ls='solid')

Also see lists & NumPy


>>> axes2[0].pcolormesh(data) #Pseudocolor plot of 2D array
>>> plt.plot(x,y,ls='--')

>>> CS = plt.contour(Y,X,U) #Plot contours


>>> plt.plot(x,y,'--',x**2,y**2,'-.')

> Prepare The Data >>> axes2[2].contourf(data1) #Plot filled contours

>>> axes2[2]= ax.clabel(CS) #Label a contour plot


>>> plt.setp(lines,color='r',linewidth=4.0)

Text & Annotations


1D Data Vector Fields
>>> ax.text(1,

>>> import numpy as np


>>> axes[0,1].arrow(0,0,0.5,0.5) #Add an arrow to the axes
-2.1,

>>> x = np.linspace(0, 10, 100)


>>> axes[1,1].quiver(y,z) #Plot a 2D field of arrows
'Example Graph',

>>> y = np.cos(x)
>>> axes[0,1].streamplot(X,Y,U,V) #Plot a 2D field of arrows style='italic')

>>> z = np.sin(x) >>> ax.annotate("Sine",

xy=(8, 0),

Data Distributions xycoords='data',

2D Data or Images xytext=(10.5, 0),

textcoords='data',

>>> ax1.hist(y) #Plot a histogram


arrowprops=dict(arrowstyle="->",

>>> data = 2 * np.random.random((10, 10))


>>> ax3.boxplot(y) #Make a box and whisker plot
connectionstyle="arc3"),)
>>> data2 = 3 * np.random.random((10, 10))
>>> ax3.violinplot(z) #Make a violin plot
>>> Y, X = np.mgrid[-3:3:100j, -3:3:100j]
>>> U = -1 - X**2 + Y
>>> V = 1 + X - Y**2
>>> from matplotlib.cbook import get_sample_data
>>> img = np.load(get_sample_data('axes_grid/bivariate_normal.npy'))

Mathtext
>>> plt.title(r'$\sigma_i=15$', fontsize=20)

> Plot Anatomy & Workflow
Plot Anatomy Limits, Legends and Layouts


> Create Plot Limits & Autoscaling
Axes/Subplot
>>> import matplotlib.pyplot as plt >>> ax.margins(x=0.0,y=0.1) #Add padding to a plot

>>> ax.axis('equal') #Set the aspect ratio of the plot to 1

>>> ax.set(xlim=[0,10.5],ylim=[-1.5,1.5]) #Set limits for x-and y-axis

Figure Y-axis Figure >>> ax.set_xlim(0,10.5) #Set limits for x-axis

Legends
>>> fig = plt.figure()

>>> ax.set(title='An Example Axes', #Set a title and x-and y-axis labels

>>> fig2 = plt.figure(figsize=plt.figaspect(2.0)) X- axis


ylabel='Y-Axis',

xlabel='X-Axis')

Axes Workflow >>> ax.legend(loc='best') #No overlapping plot elements

The basic steps to creating plots with matplotlib are: Ticks


All plotting is done with respect to an Axes. In most cases, a subplot will fit your needs.
1 Prepare Data 2 Create Plot 3 Plot 4 Customized Plot 5 Save Plot 6 Show Plot >>> ax.xaxis.set(ticks=range(1,5), #Manually set x-ticks

A subplot is an axes on a grid system. ticklabels=[3,100,-12,"foo"])

>>> import matplotlib.pyplot as plt


>>> ax.tick_params(axis='y', #Make y-ticks longer and go in and out

>>> fig.add_axes()
>>> x = [1,2,3,4] #Step 1
direction='inout',

>>> ax1 = fig.add_subplot(221) #row-col-num


>>> y = [10,20,25,30]
length=10)
>>> ax3 = fig.add_subplot(212)
>>> fig = plt.figure() #Step 2

>>> fig3, axes = plt.subplots(nrows=2,ncols=2)


>>> ax = fig.add_subplot(111) #Step 3
Subplot Spacing
>>> fig4, axes2 = plt.subplots(ncols=3) >>> ax.plot(x, y, color='lightblue', linewidth=3) #Step 3, 4

>>> ax.scatter([2,4,6],
>>> fig3.subplots_adjust(wspace=0.5, #Adjust the spacing between subplots

[5,15,25],
hspace=0.3,

left=0.125,

> Save Plot


color='darkgreen',

marker='^')
right=0.9,

>>> ax.set_xlim(1, 6.5)


top=0.9,

>>> plt.savefig('foo.png') #Step 5


bottom=0.1)

>>> plt.savefig('foo.png') #Save figures


>>> plt.show() #Step 6 >>> fig.tight_layout() #Fit subplot(s) in to the figure area
>>> plt.savefig('foo.png', transparent=True) #Save transparent figures
Axis Spines

>>> ax1.spines['top'].set_visible(False) #Make the top axis line for a plot invisible

> Show Plot > Close and Clear >>> ax1.spines['bottom'].set_position(('outward',10)) #Move the bottom axis line outward

>>> plt.cla() #Clear an axis

>>> plt.show()
>>> plt.clf() #Clear the entire figure

>>> plt.close() #Close a window


Learn Data Skills Online at www.DataCamp.com
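The six workflow steps above (prepare data, create plot, plot, customize, save, show) can be combined into one short script; a minimal sketch with illustrative data and styling:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)          # Step 1: prepare data
y, z = np.cos(x), np.sin(x)

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))  # Step 2: create plot

axes[0].plot(x, y, c='k', ls='--', label='cos')             # Step 3: plot
axes[0].plot(x, z, c='b', label='sin')
axes[1].scatter(x[::10], z[::10], marker='^', color='darkgreen')

axes[0].set(title='Lines', xlabel='x', ylabel='y')          # Step 4: customize
axes[0].legend(loc='best')
axes[1].set_xlim(0, 10.5)
fig.tight_layout()

plt.savefig('foo.png')   # Step 5: save
plt.show()               # Step 6: show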
> Inspecting Your Array > Sorting Arrays
Python For Data Science

>>> a.shape #Array dimensions


>>> a.sort() #Sort an array

>>> len(a) #Length of array


>>> c.sort(axis=0) #Sort the elements of an array's axis
>>> b.ndim #Number of array dimensions

NumPy Cheat Sheet


>>> e.size #Number of array elements

>>> b.dtype #Data type of array elements

>>>
>>>
b.dtype.name #Name of data type

b.astype(int) #Convert an array to a different type > Subsetting, Slicing, Indexing


Learn NumPy online at www.DataCamp.com Subsetting

> Data Types >>> a[2] #Select the element at the 2nd index

>>> b[1,2] #Select the element at row 1 column 2 (equivalent to b[1][2])

1.5 2
2 3

3
6.0 4 5 6
>>> np.int64 #Signed 64-bit integer types

Numpy
>>> np.float32 #Standard single-precision floating point
Slicing
>>> np.complex #Complex numbers represented by 128 floats
>>> a[0:2] #Select items at index 0 and 1
1 2 3
>>> Numpy
np.bool #Boolean type storing TRUE and FALSE values
array([1, 2])

>>> np.object #Python object type


>>> b[0:2,1] #Select items at rows 0 and 1 in column 1
1.5 2 3
The NumPy library is the core library for scientific computing in Python.
>>>
>>>
np.string_ #Fixed-length string type

np.unicode_ #Fixed-length unicode type


array([ 2., 5.])
4 5 6
>>> b[:1] #Select all items at row 0 (equivalent to b[0:1, :])

It provides a high-performance multidimensional array object, and tools for array([[1.5, 2., 3.]])

1.5 2 3
4 5 6
working with these arrays >>> c[1,...] #Same as [1,:,:]

> Array Mathematics


array([[[ 3., 2., 1.],

Use the following import convention: [ 4., 5., 6.]]])

>>> a[ : :-1] #Reversed array a array([3, 2, 1])


>>> import numpy as np
Boolean Indexing
Arithmetic Operations >>> a[a<2] #Select elements from a less than 2
1 2 3
NumPy Arrays array([1])

>>> g = a - b #Subtraction
Fancy Indexing
array([[-0.5, 0. , 0. ],

>>> b[[1, 0, 1, 0],[0, 1, 2, 0]] #Select elements (1,0),(0,1),(1,2) and (0,0)

[-3. , -3. , -3. ]])

array([ 4. , 2. , 6. , 1.5])

>>> np.subtract(a,b) #Subtraction

>>> b[[1, 0, 1, 0]][:,[0,1,2,0]] #Select a subset of the matrix’s rows and columns

>>> b + a #Addition

array([[ 4. ,5. , 6. , 4. ],

array([[ 2.5, 4. , 6. ],

[ 1.5, 2. , 3. , 1.5],

[ 5. , 7. , 9. ]])

[ 4. , 5. , 6. , 4. ],

>>> np.add(b,a) #Addition

[ 1.5, 2. , 3. , 1.5]])
>>> a / b #Division

array([[ 0.66666667, 1. , 1. ],

[ 0.25 , 0.4 , 0.5 ]])

>>> np.divide(a,b) #Division

>>> a * b #Multiplication
> Array Manipulation
> Creating Arrays
array([[ 1.5, 4. , 9. ],

[ 4. , 10. , 18. ]])

>>> np.multiply(a,b) #Multiplication


Transposing Array
>>> np.exp(b) #Exponentiation
>>> i = np.transpose(b) #Permute array dimensions

>>> a = np.array([1,2,3])
>>> np.sqrt(b) #Square root
>>> i.T #Permute array dimensions
>>> b = np.array([(1.5,2,3), (4,5,6)], dtype = float)
>>> np.sin(a) #Print sines of an array

>>> c = np.array([[(1.5,2,3), (4,5,6)],[(3,2,1), (4,5,6)]], dtype = float) >>> np.cos(b) #Element-wise cosine
Changing Array Shape
>>> np.log(a) #Element-wise natural logarithm
>>> b.ravel() #Flatten the array

>>> e.dot(f) #Dot product


>>> g.reshape(3,-2) #Reshape, but don’t change data
Initial Placeholders array([[ 7., 7.],

[ 7., 7.]]) Adding/Removing Elements


>>> h.resize((2,6)) #Return a new array with shape (2,6)

>>> np.zeros((3,4)) #Create an array of zeros


>>> np.append(h,g) #Append items to an array

>>> np.ones((2,3,4),dtype=np.int16) #Create an array of ones


Comparison >>> np.insert(a, 1, 5) #Insert items in an array

>>> d = np.arange(10,25,5) #Create an array of evenly spaced values (step value)


>>> np.delete(a,[1]) #Delete items from an array
>>> np.linspace(0,2,9) #Create an array of evenly spaced values (number of samples)

>>> e = np.full((2,2),7) #Create a constant array


>>> a == b #Element-wise comparison
Combining Arrays
>>> f = np.eye(2) #Create a 2X2 identity matrix
array([[False, True, True],
>>> np.concatenate((a,d),axis=0) #Concatenate arrays

>>> np.random.random((2,2)) #Create an array with random values


[False, False, False]], dtype=bool)
array([ 1, 2, 3, 10, 15, 20])

>>> np.empty((3,2)) #Create an empty array >>> a < 2 #Element-wise comparison


>>> np.vstack((a,b)) #Stack arrays vertically (row-wise)

array([True, False, False], dtype=bool)


array([[ 1. , 2. , 3. ],

>>> np.array_equal(a, b) #Array-wise comparison [ 1.5, 2. , 3. ],

[ 4. , 5. , 6. ]])

> I/O Aggregate Functions


>>> np.r_[e,f] #Stack arrays vertically (row-wise)

>>> np.hstack((e,f)) #Stack arrays horizontally (column-wise)

array([[ 7., 7., 1., 0.],

[ 7., 7., 0., 1.]])

Saving & Loading On Disk >>> a.sum() #Array-wise sum


>>> np.column_stack((a,d)) #Create stacked column-wise arrays

>>> a.min() #Array-wise minimum value


array([[ 1, 10],

>>> b.max(axis=0) #Maximum value of an array row


[ 2, 15],

>>> np.save('my_array', a)
>>> b.cumsum(axis=1) #Cumulative sum of the elements
[ 3, 20]])

>>> np.savez('array.npz', a, b)
>>> a.mean() #Mean
>>> np.c_[a,d] #Create stacked column-wise arrays
>>> np.load('my_array.npy') >>> np.median(b) #Median

>>> np.corrcoef(a) #Correlation coefficient


Splitting Arrays
>>> np.std(b) #Standard deviation >>> np.hsplit(a,3) #Split the array horizontally at the 3rd index

Saving & Loading Text Files [array([1]),array([2]),array([3])]

>>> np.vsplit(c,2) #Split the array vertically at the 2nd index

[array([[[ 1.5, 2. , 1. ],

>>> np.loadtxt("myfile.txt")

>>> np.genfromtxt("my_file.csv", delimiter=',')


> Copying Arrays [ 4. , 5. , 6. ]]]),

array([[[ 3., 2., 3.],

>>> np.savetxt("myarray.txt", a, delimiter=" ") [ 4., 5., 6.]]])]


>>> h = a.view() #Create a view of the array with the same data

>>> np.copy(a) #Create a copy of the array

> Asking For Help


>>> h = a.copy() #Create a deep copy of the array

Learn Data Skills Online at www.DataCamp.com


>>> np.info(np.ndarray.dtype)
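One point worth making explicit for the aggregate functions above: the axis argument picks the direction that gets collapsed. A quick sketch using the same 2x3 array b defined earlier:

import numpy as np

b = np.array([(1.5, 2, 3), (4, 5, 6)], dtype=float)

print(b.sum())           # 21.5 -> all elements
print(b.max(axis=0))     # [4. 5. 6.] -> maximum of each column (collapsing the rows)
print(b.cumsum(axis=1))  # [[ 1.5  3.5  6.5]
                         #  [ 4.   9.  15. ]] -> running sum along each row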
> I/O > Retrieving Series/DataFrame Information
Python For Data Science Read and Write to CSV Basic Information

Pandas Basics Cheat Sheet
>>> pd.read_csv('file.csv', header=None, nrows=5)

>>> df.to_csv('myDataFrame.csv')
>>>
>>>
>>>
df.shape #(rows,columns)

df.index #Describe index

df.columns #Describe DataFrame columns

>>> df.info() #Info on DataFrame

>>> df.count() #Number of non-NA values

Learn Pandas Basics online at www.DataCamp.com

Read and Write to Excel
>>> pd.read_excel('file.xlsx')
>>> df.to_excel('dir/myDataFrame.xlsx', sheet_name='Sheet1')

Read multiple sheets from the same file
>>> xlsx = pd.ExcelFile('file.xls')
>>> df = pd.read_excel(xlsx, 'Sheet1')

Summary
>>> df.sum() #Sum of values
>>> df.cumsum() #Cumulative sum of values
>>> df.min()/df.max() #Minimum/maximum values


>>> df.idxmin()/df.idxmax() #Minimum/Maximum index value

>>> df.describe() #Summary statistics

>>> df.mean() #Mean of values
>>> df.median() #Median of values

The Pandas library is built on NumPy and provides easy-to-use data
structures and data analysis tools for the Python programming language.

Use the following import convention:
>>> import pandas as pd

Read and Write to SQL Query or Database Table
>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite:///:memory:')
>>> pd.read_sql("SELECT * FROM my_table;", engine)
>>> pd.read_sql_table('my_table', engine)
>>> pd.read_sql_query("SELECT * FROM my_table;", engine)
read_sql() is a convenience wrapper around read_sql_table() and read_sql_query().
>>> df.to_sql('myDf', engine)

> Applying Functions
>>> f = lambda x: x*2
>>> df.apply(f) #Apply function
>>> df.applymap(f) #Apply function element-wise

> Pandas Data Structures

Series
> Selection Also see NumPy Arrays
> Data Alignment
A one-dimensional labeled array
a 3
capable of holding any data type b -5 Getting Internal Data Alignment
Index
c 7 >>> s['b'] #Get one element

NA values are introduced in the indices that don’t overlap:


d 4 -5

>>> s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd']) >>> df[1:] #Get subset of a DataFrame
>>> s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])

Country Capital Population


>>> s + s3

1 India New Delhi 1303171035


a 10.0

Dataframe 2 Brazil Brasília 207847528 b NaN

c 5.0

Selecting, Boolean Indexing & Setting


d 7.0
A two-dimensional labeled data structure

with columns of potentially different types


By Position Arithmetic Operations with Fill Methods
Columns Country Capital Population
>>> df.iloc[[0],[0]] #Select single value by row & column

0 Belgium Brussels 11190846 'Belgium'

You can also do the internal data alignment yourself with the help of the fill methods:
Index 1 India New Delhi 1303171035 >>> df.iat([0],[0])
>>> s.add(s3, fill_value=0)

'Belgium' a 10.0

2 Brazil Brasilia 207847528


b -5.0

By Label
>>> data = {'Country': ['Belgium', 'India', 'Brazil'],
c 5.0

'Capital': ['Brussels', 'New Delhi', 'Brasília'],


>>> df.loc[[0], ['Country']] #Select single value by row & column labels
d 7.0

'Population': [11190846, 1303171035, 207847528]}


'Belgium'
>>> s.sub(s3, fill_value=2)

>>> df = pd.DataFrame(data,
>>> df.at([0], ['Country'])
>>> s.div(s3, fill_value=4)

columns=['Country', 'Capital', 'Population']) 'Belgium' >>> s.mul(s3, fill_value=3)

By Label/Position

> Dropping
>>> df.ix[2] #Select single row of subset of rows

Country Brazil

Capital Brasília

Population 207847528

>>> s.drop(['a', 'c']) #Drop values from rows (axis=0)


>>> df.ix[:,'Capital'] #Select a single column of subset of columns

>>> df.drop('Country', axis=1) #Drop values from columns(axis=1) 0 Brussels

1 New Delhi

2 Brasília

>>> df.ix[1,'Capital'] #Select rows and columns

> Asking For Help 'New Delhi'

Boolean Indexing
>>> help(pd.Series.loc) >>> s[~(s > 1)] #Series s where value is not >1

>>> s[(s < -1) | (s > 2)] #s where value is <-1 or >2

>>> df[df['Population']>1200000000] #Use filter to adjust DataFrame

> Sort & Rank Setting

>>> s['a'] = 6 #Set index a of Series s to 6

>>> df.sort_index() #Sort by labels along an axis


Learn Data Skills Online at
>>> df.sort_values(by='Country') #Sort by the values along an axis

>>> df.rank() #Assign ranks to entries


www.DataCamp.com
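Two of the ideas above are easy to mix up: apply vs. applymap, and plain arithmetic vs. the fill-method variants. A minimal sketch reusing the sheet's own df, s, and s3 values:

import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India', 'Brazil'],
                   'Capital': ['Brussels', 'New Delhi', 'Brasília'],
                   'Population': [11190846, 1303171035, 207847528]})

f = lambda x: x * 2
print(df[['Population']].apply(f))     # apply works column-by-column (Series in, Series out)
print(df[['Population']].applymap(f))  # applymap applies f to every single element

s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
print(s + s3)                   # non-overlapping index 'b' becomes NaN
print(s.add(s3, fill_value=0))  # treat the missing side as 0 instead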
> Strings > Lists Also see NumPy Arrays

Python For Data Science >>> my_string = 'thisStringIsAwesome'

>>> my_string

>>>
>>>
a = 'is'

b = 'nice'

Basics Cheat Sheet


'thisStringIsAwesome' >>> my_list = ['my', 'list', a, b]

>>> my_list2 = [[4,5,6,7], [3,4,5,6]]

String Operations
Selecting List Elements Index starts at 0
Learn Python Basics online at www.DataCamp.com >>> my_string * 2

'thisStringIsAwesomethisStringIsAwesome'
Subset
>>> my_string + 'Innit'

'thisStringIsAwesomeInnit'
>>> my_list[1] #Select item at index 1

>>> 'm' in my_string


>>> my_list[-3] #Select 3rd last item
True Slice

> Variables and Data Types String Indexing Index starts at 0


>>>
>>>
>>>
my_list[1:3] #Select items at index 1 and 2

my_list[1:] #Select items after index 0

my_list[:3] #Select items before index 3

>>> my_list[:] #Copy my_list


Variable Assignment >>> my_string[3]

>>> my_string[4:9] Subset Lists of Lists


>>> my_list2[1][0] #my_list[list][itemOfList]

>>> x=5

>>> my_list2[1][:2]
>>> x

5
String Methods
>>> my_string.upper() #String to uppercase
List Operations
Calculations With Variables >>>
>>>
my_string.lower() #String to lowercase

my_string.count('w') #Count String elements


>>> my_list + my_list

>>> my_string.replace('e', 'i') #Replace String elements


['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice']

>>> x+2 #Sum of two variables


>>> my_string.strip() #Strip whitespaces >>> my_list * 2

['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice']

>>> x-2 #Subtraction of two variables

>>> my_list2 > 4

True
>>> x*2 #Multiplication of two variables

10

>>> x**2 #Exponentiation of a variable

25

> NumPy Arrays Also see Lists


List Methods
>>> x%2 #Remainder of a variable

>>> my_list = [1, 2, 3, 4]

1
>>> my_list.index(a) #Get the index of an item

>>> my_array = np.array(my_list)

>>> x/float(2) #Division of a variable


>>> my_list.count(a) #Count an item

>>> my_2darray = np.array([[1,2,3],[4,5,6]])


2.5 >>> my_list.append('!') #Append an item at a time

>>> my_list.remove('!') #Remove an item

Types and Type Conversion Selecting Numpy Array Elements Index starts at 0 >>>
>>>
del(my_list[0:1]) #Remove an item

my_list.reverse() #Reverse the list

>>> my_list.extend('!') #Append an item

Subset >>> my_list.pop(-1) #Remove an item

str()
>>> my_list.insert(0,'!') #Insert an item

'5', '3.45', 'True' #Variables to strings >>> my_array[1] #Select item at index 1
>>> my_list.sort() #Sort the list
2
int()
Slice
5, 3, 1 #Variables to integers
>>> my_array[0:2] #Select items at index 0 and 1

float()
5.0, 1.0 #Variables to floats
array([1, 2])

Subset 2D Numpy arrays


> Python IDEs (Integrated Development Environment)
bool() >>> my_2darray[:,0] #my_2darray[rows, columns]

array([1, 4])
True, True, True #Variables to booleans
Leading open data science
Free IDE that is included
Create and share

Numpy Array Operations platform powered by Python with Anaconda documents with live code

> Libraries >>> my_array > 3

> Asking For Help


array([False, False, False, True], dtype=bool)

>>> my_array * 2

array([2, 4, 6, 8])

>>> my_array + np.array([5, 6, 7, 8])

Data analysis Scientific computing 2D plotting Machine learning array([6, 8, 10, 12]) >>> help(str)

Import Libraries Numpy Array Functions


>>> import numpy
>>> my_array.shape #Get the dimensions of the array

>>> import numpy as np
>>> np.append(my_array, other_array) #Append items to an array

>>> np.insert(my_array, 1, 5) #Insert items in an array

>>> np.delete(my_array,[1]) #Delete items in an array

Selective import >>>


>>>
np.mean(my_array) #Mean of the array

np.median(my_array) #Median of the array

>>> np.corrcoef(my_array) #Correlation coefficient

>>> from math import pi >>> np.std(my_array) #Standard deviation

Learn Data Skills Online at


www.DataCamp.com
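A small sketch contrasting the conversion functions with list vs. NumPy array arithmetic from the blocks above (values are illustrative):

import numpy as np

x = 5
print(str(x), int('3'), float(x), bool(0))   # '5' 3 5.0 False

my_list = [1, 2, 3, 4]
my_array = np.array(my_list)

print(my_list * 2)   # [1, 2, 3, 4, 1, 2, 3, 4] -> lists are repeated
print(my_array * 2)  # [2 4 6 8]                -> arrays multiply element-wise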
Python For Data Science Cheat Sheet Lists Also see NumPy Arrays Libraries
>>> a = 'is' Import libraries
Python Basics >>> b = 'nice' >>> import numpy Data analysis Machine learning
Learn More Python for Data Science Interactively at www.datacamp.com >>> my_list = ['my', 'list', a, b] >>> import numpy as np
>>> my_list2 = [[4,5,6,7], [3,4,5,6]] Selective import
>>> from math import pi Scientific computing 2D plotting
Variables and Data Types Selecting List Elements Index starts at 0
Subset Install Python
Variable Assignment
>>> my_list[1] Select item at index 1
>>> x=5
>>> my_list[-3] Select 3rd last item
>>> x
Slice
5 >>> my_list[1:3] Select items at index 1 and 2
Calculations With Variables >>> my_list[1:] Select items after index 0
>>> my_list[:3] Select items before index 3 Leading open data science platform Free IDE that is included Create and share
>>> x+2 Sum of two variables
>>> my_list[:] Copy my_list powered by Python with Anaconda documents with live code,
7 visualizations, text, ...
>>> x-2 Subtraction of two variables
Subset Lists of Lists
>>> my_list2[1][0] my_list[list][itemOfList]
3
>>> my_list2[1][:2] Numpy Arrays Also see Lists
>>> x*2 Multiplication of two variables
>>> my_list = [1, 2, 3, 4]
10 List Operations >>> my_array = np.array(my_list)
>>> x**2 Exponentiation of a variable
25 >>> my_list + my_list >>> my_2darray = np.array([[1,2,3],[4,5,6]])
>>> x%2 Remainder of a variable ['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice']
Selecting Numpy Array Elements Index starts at 0
1 >>> my_list * 2
>>> x/float(2) Division of a variable ['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice'] Subset
2.5 >>> my_list2 > 4 >>> my_array[1] Select item at index 1
True 2
Types and Type Conversion Slice
List Methods >>> my_array[0:2] Select items at index 0 and 1
str() '5', '3.45', 'True' Variables to strings
my_list.index(a) Get the index of an item array([1, 2])
>>>
int() 5, 3, 1 Variables to integers >>> my_list.count(a) Count an item Subset 2D Numpy arrays
>>> my_list.append('!') Append an item at a time >>> my_2darray[:,0] my_2darray[rows, columns]
my_list.remove('!') Remove an item array([1, 4])
float() 5.0, 1.0 Variables to floats >>>
>>> del(my_list[0:1]) Remove an item Numpy Array Operations
bool() True, True, True >>> my_list.reverse() Reverse the list
Variables to booleans >>> my_array > 3
>>> my_list.extend('!') Append an item array([False, False, False, True], dtype=bool)
>>> my_list.pop(-1) Remove an item >>> my_array * 2
Asking For Help >>> my_list.insert(0,'!') Insert an item array([2, 4, 6, 8])
>>> help(str) >>> my_list.sort() Sort the list >>> my_array + np.array([5, 6, 7, 8])
array([6, 8, 10, 12])
Strings
>>> my_string = 'thisStringIsAwesome' Numpy Array Functions
String Operations Index starts at 0
>>> my_string >>> my_array.shape Get the dimensions of the array
'thisStringIsAwesome' >>> my_string[3] >>> np.append(my_array, other_array) Append items to an array
>>> my_string[4:9] >>> np.insert(my_array, 1, 5) Insert items in an array
String Operations >>> np.delete(my_array,[1]) Delete items in an array
String Methods >>> np.mean(my_array) Mean of the array
>>> my_string * 2
'thisStringIsAwesomethisStringIsAwesome' >>> my_string.upper() String to uppercase >>> np.median(my_array) Median of the array
>>> my_string + 'Innit' >>> my_string.lower() String to lowercase >>> my_array.corrcoef() Correlation coefficient
'thisStringIsAwesomeInnit' >>> my_string.count('w') Count String elements >>> np.std(my_array) Standard deviation
>>> 'm' in my_string >>> my_string.replace('e', 'i') Replace String elements
True >>> my_string.strip() Strip whitespaces DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet
Jupyter Notebook
Learn More Python for Data Science Interactively at www.DataCamp.com

Working with Different Programming Languages
Kernels provide computation and communication with front-end interfaces like the notebooks. There are three main kernels: IPython, IRkernel, and IJulia. Installing Jupyter Notebook will automatically install the IPython kernel.

Widgets
Notebook widgets provide the ability to visualize and control changes in your data, often as a control like a slider, textbox, etc. You can use them to build interactive GUIs for your notebooks or to synchronize stateful and stateless information between Python and JavaScript.
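As a concrete illustration, a minimal widget sketch using the ipywidgets package (not covered by this sheet and assumed to be installed); run it in a notebook cell:

# Run inside a Jupyter notebook cell
from ipywidgets import interact

def show_square(n=3):
    print(n, "squared is", n ** 2)

# interact() builds a slider for the integer argument and
# re-runs the function whenever the slider moves
interact(show_square, n=(0, 10))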
Saving/Loading Notebooks Restart kernel Interrupt kernel
Create new notebook Restart kernel & run Interrupt kernel & Download serialized Save notebook
all cells clear all output state of all widget with interactive
Open an existing
Connect back to a models in use widgets
Make a copy of the notebook Restart kernel & run remote notebook
current notebook all cells Embed current
Rename notebook Run other installed
widgets
kernels
Revert notebook to a
Save current notebook
previous checkpoint Command Mode:
and record checkpoint
Download notebook as
Preview of the printed - IPython notebook 15
notebook - Python
- HTML
Close notebook & stop - Markdown 13 14
- reST
running any scripts - LaTeX 1 2 3 4 5 6 7 8 9 10 11 12
- PDF

Writing Code And Text


Code and text are encapsulated by 3 basic cell types: markdown cells, code
cells, and raw NBConvert cells.
Edit Cells Edit Mode: 1. Save and checkpoint 9. Interrupt kernel
2. Insert cell below 10. Restart kernel
3. Cut cell 11. Display characteristics
Cut currently selected cells Copy cells from 4. Copy cell(s) 12. Open command palette
to clipboard clipboard to current 5. Paste cell(s) below 13. Current kernel
cursor position 6. Move cell up 14. Kernel status
Paste cells from Executing Cells 7. Move cell down 15. Log out from notebook server
clipboard above Paste cells from 8. Run current cell
current cell Run selected cell(s) Run current cells down
clipboard below
and create a new one
Paste cells from current cell
below Asking For Help
clipboard on top Run current cells down
Delete current cells
of current cel and create a new one Walk through a UI tour
Split up a cell from above Run all cells
Revert “Delete Cells” List of built-in keyboard
current cursor Run all cells above the Run all cells below
invocation shortcuts
position current cell the current cell Edit the built-in
Merge current cell Merge current cell keyboard shortcuts
Change the cell type of toggle, toggle Notebook help topics
with the one above with the one below current cell scrolling and clear Description of
Move current cell up Move current cell toggle, toggle current outputs markdown available Information on
down scrolling and clear in notebook unofficial Jupyter
Adjust metadata
underlying the Find and replace all output Notebook extensions
Python help topics
current notebook in selected cells IPython help topics
View Cells
Remove cell Copy attachments of NumPy help topics
attachments current cell Toggle display of Jupyter SciPy help topics
Toggle display of toolbar Matplotlib help topics
Paste attachments of Insert image in logo and filename
SymPy help topics
current cell selected cells Toggle display of cell Pandas help topics
action icons:
Insert Cells - None About Jupyter Notebook
- Edit metadata
Toggle line numbers - Raw cell format
Add new cell above the Add new cell below the - Slideshow
current one in cells - Attachments
current one DataCamp
- Tags
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Inspecting Your Array Subsetting, Slicing, Indexing Also see Lists
>>> a.shape Array dimensions Subsetting
NumPy Basics >>>
>>>
len(a)
b.ndim
Length of array
Number of array dimensions
>>> a[2]
3
1 2 3 Select the element at the 2nd index
Learn Python for Data Science Interactively at www.DataCamp.com >>> e.size Number of array elements >>> b[1,2] 1.5 2 3 Select the element at row 0 column 2
>>> b.dtype Data type of array elements 6.0 4 5 6 (equivalent to b[1][2])
>>> b.dtype.name Name of data type
>>> b.astype(int) Convert an array to a different type Slicing
NumPy >>> a[0:2]
array([1, 2])
1 2 3 Select items at index 0 and 1
2
The NumPy library is the core library for scientific computing in Asking For Help >>> b[0:2,1] 1.5 2 3 Select items at rows 0 and 1 in column 1
>>> np.info(np.ndarray.dtype) array([ 2., 5.]) 4 5 6
Python. It provides a high-performance multidimensional array
Array Mathematics
1.5 2 3
>>> b[:1] Select all items at row 0
object, and tools for working with these arrays. array([[1.5, 2., 3.]]) 4 5 6 (equivalent to b[0:1, :])
Arithmetic Operations >>> c[1,...] Same as [1,:,:]
Use the following import convention: array([[[ 3., 2., 1.],
>>> import numpy as np [ 4., 5., 6.]]])
>>> g = a - b Subtraction
array([[-0.5, 0. , 0. ], >>> a[ : :-1] Reversed array a
NumPy Arrays [-3. , -3. , -3. ]])
array([3, 2, 1])

>>> np.subtract(a,b) Boolean Indexing


1D array 2D array 3D array Subtraction
>>> a[a<2] Select elements from a less than 2
>>> b + a Addition 1 2 3
array([[ 2.5, 4. , 6. ], array([1])
axis 1 axis 2
1 2 3 axis 1 [ 5. , 7. , 9. ]]) Fancy Indexing
1.5 2 3 >>> np.add(b,a) Addition >>> b[[1, 0, 1, 0],[0, 1, 2, 0]] Select elements (1,0),(0,1),(1,2) and (0,0)
axis 0 axis 0 array([ 4. , 2. , 6. , 1.5])
4 5 6 >>> a / b Division
array([[ 0.66666667, 1. , 1. ], >>> b[[1, 0, 1, 0]][:,[0,1,2,0]] Select a subset of the matrix’s rows
[ 0.25 , 0.4 , 0.5 ]]) array([[ 4. ,5. , 6. , 4. ], and columns
>>> np.divide(a,b) Division [ 1.5, 2. , 3. , 1.5],
Creating Arrays >>> a * b
array([[ 1.5, 4. , 9. ],
Multiplication
[ 4. , 5.
[ 1.5, 2.
,
,
6.
3.
,
,
4. ],
1.5]])

>>> a = np.array([1,2,3]) [ 4. , 10. , 18. ]])


>>> b = np.array([(1.5,2,3), (4,5,6)], dtype = float) >>> np.multiply(a,b) Multiplication Array Manipulation
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]], >>> np.exp(b) Exponentiation
dtype = float) >>> np.sqrt(b) Square root Transposing Array
>>> np.sin(a) Print sines of an array >>> i = np.transpose(b) Permute array dimensions
Initial Placeholders >>> np.cos(b) Element-wise cosine >>> i.T Permute array dimensions
>>> np.log(a) Element-wise natural logarithm
>>> np.zeros((3,4)) Create an array of zeros >>> e.dot(f) Dot product
Changing Array Shape
>>> np.ones((2,3,4),dtype=np.int16) Create an array of ones array([[ 7., 7.], >>> b.ravel() Flatten the array
>>> d = np.arange(10,25,5) Create an array of evenly [ 7., 7.]]) >>> g.reshape(3,-2) Reshape, but don’t change data
spaced values (step value)
>>> np.linspace(0,2,9) Create an array of evenly Comparison Adding/Removing Elements
spaced values (number of samples) >>> h.resize((2,6)) Return a new array with shape (2,6)
>>> e = np.full((2,2),7) Create a constant array >>> a == b Element-wise comparison >>> np.append(h,g) Append items to an array
>>> f = np.eye(2) Create a 2X2 identity matrix array([[False, True, True], >>> np.insert(a, 1, 5) Insert items in an array
>>> np.random.random((2,2)) Create an array with random values [False, False, False]], dtype=bool) >>> np.delete(a,[1]) Delete items from an array
>>> np.empty((3,2)) Create an empty array >>> a < 2 Element-wise comparison
array([True, False, False], dtype=bool) Combining Arrays
>>> np.array_equal(a, b) Array-wise comparison >>> np.concatenate((a,d),axis=0) Concatenate arrays
I/O array([ 1, 2,
>>> np.vstack((a,b))
3, 10, 15, 20])
Stack arrays vertically (row-wise)
Aggregate Functions array([[ 1. , 2. , 3. ],
Saving & Loading On Disk [ 1.5, 2. , 3. ],
>>> a.sum() Array-wise sum [ 4. , 5. , 6. ]])
>>> np.save('my_array', a) >>> a.min() Array-wise minimum value >>> np.r_[e,f] Stack arrays vertically (row-wise)
>>> np.savez('array.npz', a, b) >>> b.max(axis=0) Maximum value of an array row >>> np.hstack((e,f)) Stack arrays horizontally (column-wise)
>>> np.load('my_array.npy') >>> b.cumsum(axis=1) Cumulative sum of the elements array([[ 7., 7., 1., 0.],
>>> a.mean() Mean [ 7., 7., 0., 1.]])
Saving & Loading Text Files >>> np.median(b) Median
>>> np.loadtxt("myfile.txt") >>> np.corrcoef(a) Correlation coefficient
>>> np.std(b) Standard deviation [ 2, 15],
>>> np.genfromtxt("my_file.csv", delimiter=',') [ 3, 20]])
>>> np.savetxt("myarray.txt", a, delimiter=" ") >>> np.c_[a,d] Create stacked column-wise arrays
Copying Arrays Splitting Arrays
Data Types >>> h = a.view() Create a view of the array with the same data >>> np.hsplit(a,3) Split the array horizontally at the 3rd
>>> np.copy(a) Create a copy of the array [array([1]),array([2]),array([3])] index
>>> np.int64 Signed 64-bit integer types >>> np.vsplit(c,2) Split the array vertically at the 2nd index
>>> np.float32 Standard single-precision floating point >>> h = a.copy() Create a deep copy of the array
>>> np.complex Complex numbers represented by 128 floats [ 4. , 5. , 6. ]]]),
array([[[ 3., 2., 3.],
>>>
>>>
np.bool
np.object
Boolean type storing TRUE and FALSE values
Python object type Sorting Arrays [ 4., 5., 6.]]])]

>>> np.string_ Fixed-length string type >>> a.sort() Sort an array


>>> np.unicode_ Fixed-length unicode type >>> c.sort(axis=0) Sort the elements of an array's axis DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Linear Algebra Also see NumPy
You’ll use the linalg and sparse modules. Note that scipy.linalg contains and expands on numpy.linalg.
SciPy - Linear Algebra >>> from scipy import linalg, sparse Matrix Functions
Learn More Python for Data Science Interactively at www.datacamp.com
Creating Matrices Addition
>>> np.add(A,D) Addition
>>> A = np.matrix(np.random.random((2,2)))
SciPy >>> B = np.asmatrix(b) Subtraction
>>> C = np.mat(np.random.random((10,5))) >>> np.subtract(A,D) Subtraction
The SciPy library is one of the core packages for >>> D = np.mat([[3,4], [5,6]]) Division
scientific computing that provides mathematical >>> np.divide(A,D) Division
Basic Matrix Routines Multiplication
algorithms and convenience functions built on the
>>> np.multiply(D,A) Multiplication
NumPy extension of Python. Inverse >>> np.dot(A,D) Dot product
>>> A.I Inverse >>> np.vdot(A,D) Vector dot product
>>> linalg.inv(A) Inverse
Interacting With NumPy Also see NumPy >>> A.T Tranpose matrix >>> np.inner(A,D) Inner product
>>> np.outer(A,D) Outer product
>>> import numpy as np >>> A.H Conjugate transposition >>> np.tensordot(A,D) Tensor dot product
>>> a = np.array([1,2,3]) >>> np.trace(A) Trace >>> np.kron(A,D) Kronecker product
>>> b = np.array([(1+5j,2j,3j), (4j,5j,6j)])
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]]) Norm Exponential Functions
>>> linalg.norm(A) Frobenius norm >>> linalg.expm(A) Matrix exponential
Index Tricks >>> linalg.norm(A,1) L1 norm (max column sum) >>> linalg.expm2(A) Matrix exponential (Taylor Series)
>>> linalg.norm(A,np.inf) L inf norm (max row sum) >>> linalg.expm3(D) Matrix exponential (eigenvalue
>>> np.mgrid[0:5,0:5] Create a dense meshgrid decomposition)
>>> np.ogrid[0:2,0:2] Create an open meshgrid Rank Logarithm Function
>>> np.r_[[3,[0]*5,-1:1:10j] Stack arrays vertically (row-wise) >>> np.linalg.matrix_rank(C) Matrix rank >>> linalg.logm(A) Matrix logarithm
>>> np.c_[b,c] Create stacked column-wise arrays Determinant Trigonometric Functions
>>> linalg.det(A) Determinant >>> linalg.sinm(D) Matrix sine
Shape Manipulation Solving linear problems >>> linalg.cosm(D) Matrix cosine
>>> np.transpose(b) Permute array dimensions >>> linalg.solve(A,b) Solver for dense matrices >>> linalg.tanm(A) Matrix tangent
>>> b.flatten() Flatten the array >>> E = np.mat(a).T Solver for dense matrices Hyperbolic Trigonometric Functions
>>> np.hstack((b,c)) Stack arrays horizontally (column-wise) >>> linalg.lstsq(D,E) Least-squares solution to linear matrix >>> linalg.sinhm(D) Hypberbolic matrix sine
>>> np.vstack((a,b)) Stack arrays vertically (row-wise) equation >>> linalg.coshm(D) Hyperbolic matrix cosine
>>> np.hsplit(c,2) Split the array horizontally at the 2nd index Generalized inverse >>> linalg.tanhm(A) Hyperbolic matrix tangent
>>> np.vsplit(d,2) Split the array vertically at the 2nd index Generalized inverse
(least-squares solver) >>> linalg.signm(A) Matrix sign function
Polynomials >>> linalg.pinv2(C) Compute the pseudo-inverse of a matrix
>>> from numpy import poly1d (SVD) Matrix Square Root
>>> linalg.sqrtm(A) Matrix square root
>>> p = poly1d([3,4,5]) Create a polynomial object
Creating Sparse Matrices Arbitrary Functions
Vectorizing Functions >>> linalg.funm(A, lambda x: x*x) Evaluate matrix function
>>> F = np.eye(3, k=1) Create a 3x3 array with ones on the first superdiagonal
>>> def myfunc(a):
if a < 0: >>> G = np.mat(np.identity(2)) Create a 2x2 identity matrix Decompositions
return a*2 >>> C[C > 0.5] = 0
else: >>> H = sparse.csr_matrix(C)
return a/2
Compressed Sparse Row matrix Eigenvalues and Eigenvectors
>>> I = sparse.csc_matrix(D) Compressed Sparse Column matrix >>> la, v = linalg.eig(A) Solve ordinary or generalized
>>> np.vectorize(myfunc) Vectorize functions >>> J = sparse.dok_matrix(A) Dictionary Of Keys matrix eigenvalue problem for square matrix
>>> E.todense() Sparse matrix to full matrix >>> l1, l2 = la Unpack eigenvalues
Type Handling >>> sparse.isspmatrix_csc(A) Identify sparse matrix >>> v[:,0] First eigenvector
>>> v[:,1] Second eigenvector
>>> np.real(c) Return the real part of the array elements
>>> np.imag(c) Return the imaginary part of the array elements Sparse Matrix Routines >>> linalg.eigvals(A) Unpack eigenvalues
>>> np.real_if_close(c,tol=1000) Return a real array if complex parts close to 0 Singular Value Decomposition
>>> np.cast['f'](np.pi) Cast object to a data type Inverse >>> U,s,Vh = linalg.svd(B) Singular Value Decomposition (SVD)
>>> sparse.linalg.inv(I) Inverse >>> M,N = B.shape
Other Useful Functions Norm >>> Sig = linalg.diagsvd(s,M,N) Construct sigma matrix in SVD
>>> sparse.linalg.norm(I) Norm LU Decomposition
>>> np.angle(b,deg=True) Return the angle of the complex argument >>> P,L,U = linalg.lu(C) LU Decomposition
>>> g = np.linspace(0,np.pi,num=5) Create an array of evenly spaced values
Solving linear problems
(number of samples) >>> sparse.linalg.spsolve(H,I) Solver for sparse matrices
>>> g[3:] += np.pi
>>> np.unwrap(g) Unwrap Sparse Matrix Decompositions
>>> np.logspace(0,10,3) Create an array of evenly spaced values (log scale) Sparse Matrix Functions
>>> la, v = sparse.linalg.eigs(F,1) Eigenvalues and eigenvectors
>>> np.select([c<4],[c*2]) Return values from a list of arrays depending on >>> sparse.linalg.expm(I) Sparse matrix exponential >>> sparse.linalg.svds(H, 2) SVD
conditions
>>> misc.factorial(a) Factorial
>>> Combine N things taken at k time
>>>
misc.comb(10,3,exact=True)
misc.central_diff_weights(3) Weights for Np-point central derivative Asking For Help DataCamp
>>> misc.derivative(myfunc,1.0) Find the n-th derivative of a function at a point >>> help(scipy.linalg.diagsvd)
>>> np.info(np.matrix) Learn Python for Data Science Interactively
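To connect a few of the routines above, a minimal sketch that solves a small dense system and checks it with linalg.norm (the matrix reuses the D values from the sheet; the right-hand side is made up):

import numpy as np
from scipy import linalg

A = np.array([[3., 4.], [5., 6.]])
b = np.array([7., 8.])

x = linalg.solve(A, b)          # solve A @ x = b for dense A
print(x)                        # [-5.   5.5]
print(linalg.norm(A @ x - b))   # residual, ~0
print(linalg.det(A))            # -2.0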
Python For Data Science Cheat Sheet
Pandas Basics
Learn Python for Data Science Interactively at www.DataCamp.com

Pandas
The Pandas library is built on NumPy and provides easy-to-use
data structures and data analysis tools for the Python
programming language.

Use the following import convention:
>>> import pandas as pd

Pandas Data Structures
Series
A one-dimensional labeled array capable of holding any data type
Index  a  3
       b -5
       c  7
       d  4
>>> s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])

DataFrame
A two-dimensional labeled data structure with columns of potentially different types
Index    Country  Capital    Population   (Columns)
0        Belgium  Brussels   11190846
1        India    New Delhi  1303171035
2        Brazil   Brasília   207847528
>>> data = {'Country': ['Belgium', 'India', 'Brazil'],
            'Capital': ['Brussels', 'New Delhi', 'Brasília'],
            'Population': [11190846, 1303171035, 207847528]}
>>> df = pd.DataFrame(data,
                      columns=['Country', 'Capital', 'Population'])

Asking For Help
>>> help(pd.Series.loc)

I/O
Read and Write to CSV
>>> pd.read_csv('file.csv', header=None, nrows=5)
>>> df.to_csv('myDataFrame.csv')
Read and Write to Excel
>>> pd.read_excel('file.xlsx')
>>> df.to_excel('dir/myDataFrame.xlsx', sheet_name='Sheet1')
Read multiple sheets from the same file
>>> xlsx = pd.ExcelFile('file.xls')
>>> df = pd.read_excel(xlsx, 'Sheet1')
Read and Write to SQL Query or Database Table
>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite:///:memory:')
>>> pd.read_sql("SELECT * FROM my_table;", engine)
>>> pd.read_sql_table('my_table', engine)
>>> pd.read_sql_query("SELECT * FROM my_table;", engine)
read_sql() is a convenience wrapper around read_sql_table() and read_sql_query()
>>> df.to_sql('myDf', engine)

Selection    Also see NumPy Arrays
Getting
>>> s['b'] Get one element
-5
>>> df[1:] Get subset of a DataFrame
   Country  Capital    Population
1  India    New Delhi  1303171035
2  Brazil   Brasília   207847528

Selecting, Boolean Indexing & Setting
By Position
>>> df.iloc[[0],[0]] Select single value by row & column
'Belgium'
>>> df.iat[0,0]
'Belgium'
By Label
>>> df.loc[[0], ['Country']] Select single value by row & column labels
'Belgium'
>>> df.at[0, 'Country']
'Belgium'
By Label/Position
>>> df.ix[2] Select single row of subset of rows
Country     Brazil
Capital     Brasília
Population  207847528
>>> df.ix[:,'Capital'] Select a single column of subset of columns
0  Brussels
1  New Delhi
2  Brasília
>>> df.ix[1,'Capital'] Select rows and columns
'New Delhi'
Boolean Indexing
>>> s[~(s > 1)] Series s where value is not >1
>>> s[(s < -1) | (s > 2)] s where value is <-1 or >2
>>> df[df['Population']>1200000000] Use filter to adjust DataFrame
Setting
>>> s['a'] = 6 Set index a of Series s to 6

Dropping
>>> s.drop(['a', 'c']) Drop values from rows (axis=0)
>>> df.drop('Country', axis=1) Drop values from columns (axis=1)

Sort & Rank
>>> df.sort_index() Sort by labels along an axis
>>> df.sort_values(by='Country') Sort by the values along an axis
>>> df.rank() Assign ranks to entries

Retrieving Series/DataFrame Information
Basic Information
>>> df.shape (rows, columns)
>>> df.index Describe index
>>> df.columns Describe DataFrame columns
>>> df.info() Info on DataFrame
>>> df.count() Number of non-NA values
Summary
>>> df.sum() Sum of values
>>> df.cumsum() Cumulative sum of values
>>> df.min()/df.max() Minimum/maximum values
>>> df.idxmin()/df.idxmax() Minimum/Maximum index value
>>> df.describe() Summary statistics
>>> df.mean() Mean of values
>>> df.median() Median of values

Applying Functions
>>> f = lambda x: x*2
>>> df.apply(f) Apply function
>>> df.applymap(f) Apply function element-wise

Data Alignment
Internal Data Alignment
NA values are introduced in the indices that don’t overlap:
>>> s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
>>> s + s3
a    10.0
b     NaN
c     5.0
d     7.0
Arithmetic Operations with Fill Methods
You can also do the internal data alignment yourself with the help of the fill methods:
>>> s.add(s3, fill_value=0)
a    10.0
b    -5.0
c     5.0
d     7.0
>>> s.sub(s3, fill_value=2)
>>> s.div(s3, fill_value=4)
>>> s.mul(s3, fill_value=3)

DataCamp
Learn Python for Data Science Interactively
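To make the read_sql() note above concrete, here is a minimal sketch (the table name 'my_table' and the tiny DataFrame are illustrative) showing that read_sql() accepts either a table name or an SQL query:
>>> import pandas as pd
>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite:///:memory:')
>>> pd.DataFrame({'a': [1, 2, 3]}).to_sql('my_table', engine, index=False)
>>> pd.read_sql('my_table', engine) Dispatches to read_sql_table()
>>> pd.read_sql("SELECT * FROM my_table;", engine) Dispatches to read_sql_query()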
Python For Data Science Cheat Sheet Create Your Model Evaluate Your Model’s Performance
Supervised Learning Estimators Classification Metrics
Scikit-Learn
Learn Python for data science Interactively at www.DataCamp.com Linear Regression Accuracy Score
>>> from sklearn.linear_model import LinearRegression >>> knn.score(X_test, y_test) Estimator score method
>>> lr = LinearRegression(normalize=True) >>> from sklearn.metrics import accuracy_score Metric scoring functions
>>> accuracy_score(y_test, y_pred)
Support Vector Machines (SVM)
Scikit-learn >>> from sklearn.svm import SVC Classification Report
>>> svc = SVC(kernel='linear') >>> from sklearn.metrics import classification_report Precision, recall, f1-score
Scikit-learn is an open source Python library that Naive Bayes >>> print(classification_report(y_test, y_pred)) and support
implements a range of machine learning, >>> from sklearn.naive_bayes import GaussianNB Confusion Matrix
>>> gnb = GaussianNB() >>> from sklearn.metrics import confusion_matrix
preprocessing, cross-validation and visualization >>> print(confusion_matrix(y_test, y_pred))
algorithms using a unified interface. KNN
>>> from sklearn import neighbors Regression Metrics
A Basic Example >>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
>>> from sklearn import neighbors, datasets, preprocessing
Mean Absolute Error
>>> from sklearn.model_selection import train_test_split Unsupervised Learning Estimators >>> from sklearn.metrics import mean_absolute_error
>>> from sklearn.metrics import accuracy_score >>> y_true = [3, -0.5, 2]
>>> iris = datasets.load_iris() Principal Component Analysis (PCA) >>> mean_absolute_error(y_true, y_pred)
>>> X, y = iris.data[:, :2], iris.target >>> from sklearn.decomposition import PCA Mean Squared Error
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33) >>> pca = PCA(n_components=0.95) >>> from sklearn.metrics import mean_squared_error
>>> scaler = preprocessing.StandardScaler().fit(X_train) >>> mean_squared_error(y_test, y_pred)
>>> X_train = scaler.transform(X_train)
K Means
>>> X_test = scaler.transform(X_test) >>> from sklearn.cluster import KMeans R² Score
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5) >>> k_means = KMeans(n_clusters=3, random_state=0) >>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_pred)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred) Model Fitting Clustering Metrics
Adjusted Rand Index
Supervised learning >>> from sklearn.metrics import adjusted_rand_score
Loading The Data Also see NumPy & Pandas >>> lr.fit(X, y) Fit the model to the data
>>> adjusted_rand_score(y_true, y_pred)
>>> knn.fit(X_train, y_train)
Your data needs to be numeric and stored as NumPy arrays or SciPy sparse >>> svc.fit(X_train, y_train) Homogeneity
>>> from sklearn.metrics import homogeneity_score
matrices. Other types that are convertible to numeric arrays, such as Pandas Unsupervised Learning >>> homogeneity_score(y_true, y_pred)
DataFrame, are also acceptable. >>> k_means.fit(X_train) Fit the model to the data
>>> pca_model = pca.fit_transform(X_train) Fit to data, then transform it V-measure
>>> import numpy as np >>> from sklearn.metrics import v_measure_score
>>> X = np.random.random((10,5)) >>> v_measure_score(y_true, y_pred)
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F'])
>>> X[X < 0.7] = 0 Prediction Cross-Validation
>>> from sklearn.model_selection import cross_val_score
Supervised Estimators >>> print(cross_val_score(knn, X_train, y_train, cv=4))
Training And Test Data >>> y_pred = svc.predict(np.random.random((2,5))) Predict labels
>>> y_pred = lr.predict(X_test)
>>> print(cross_val_score(lr, X, y, cv=2))
Predict labels
>>> from sklearn.model_selection import train_test_split >>> y_pred = knn.predict_proba(X_test) Estimate probability of a label
>>> X_train, X_test, y_train, y_test = train_test_split(X,
y, Unsupervised Estimators Tune Your Model
random_state=0) >>> y_pred = k_means.predict(X_test) Predict labels in clustering algos Grid Search
>>> from sklearn.model_selection import GridSearchCV
>>> params = {"n_neighbors": np.arange(1,3),
Preprocessing The Data "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn,
Standardization Encoding Categorical Features param_grid=params)
>>> grid.fit(X_train, y_train)
>>> from sklearn.preprocessing import StandardScaler >>> from sklearn.preprocessing import LabelEncoder >>> print(grid.best_score_)
>>> scaler = StandardScaler().fit(X_train) >>> print(grid.best_estimator_.n_neighbors)
>>> enc = LabelEncoder()
>>> standardized_X = scaler.transform(X_train) >>> y = enc.fit_transform(y)
>>> standardized_X_test = scaler.transform(X_test) Randomized Parameter Optimization
Normalization Imputing Missing Values >>> from sklearn.model_selection import RandomizedSearchCV
>>> params = {"n_neighbors": range(1,5),
>>> from sklearn.preprocessing import Normalizer "weights": ["uniform", "distance"]}
>>> from sklearn.preprocessing import Imputer >>> rsearch = RandomizedSearchCV(estimator=knn,
>>> scaler = Normalizer().fit(X_train) >>> imp = Imputer(missing_values=0, strategy='mean', axis=0) param_distributions=params,
>>> normalized_X = scaler.transform(X_train) >>> imp.fit_transform(X_train) cv=4,
>>> normalized_X_test = scaler.transform(X_test) n_iter=8,
random_state=5)
Binarization Generating Polynomial Features >>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)
>>> from sklearn.preprocessing import Binarizer >>> from sklearn.preprocessing import PolynomialFeatures
>>> binarizer = Binarizer(threshold=0.0).fit(X) >>> poly = PolynomialFeatures(5)
>>> binary_X = binarizer.transform(X) >>> poly.fit_transform(X) DataCamp
Learn Python for Data Science Interactively
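Following the note above that types convertible to numeric arrays, such as a Pandas DataFrame, are also acceptable, a minimal sketch (the toy DataFrame and labels are illustrative, not from the sheet):
>>> import pandas as pd
>>> from sklearn import neighbors
>>> X_df = pd.DataFrame({'sepal_length': [5.1, 4.9, 6.2, 5.9],
                         'sepal_width': [3.5, 3.0, 2.9, 3.0]})
>>> y = [0, 0, 1, 1]
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=1)
>>> knn.fit(X_df, y) A DataFrame is accepted directly
>>> knn.predict(X_df.iloc[:2])
array([0, 0])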
Python For Data Science Cheat Sheet Plot Anatomy & Workflow
Plot Anatomy Workflow
Matplotlib Axes/Subplot The basic steps to creating plots with matplotlib are:
Learn Python Interactively at www.DataCamp.com 1 Prepare data 2 Create plot 3 Plot 4 Customize plot 5 Save plot 6 Show plot
>>> import matplotlib.pyplot as plt
>>> x = [1,2,3,4] Step 1
>>> y = [10,20,25,30]
>>> fig = plt.figure() Step 2
Matplotlib Y-axis Figure >>> ax = fig.add_subplot(111) Step 3
>>> ax.plot(x, y, color='lightblue', linewidth=3) Step 3, 4
Matplotlib is a Python 2D plotting library which produces >>> ax.scatter([2,4,6],
publication-quality figures in a variety of hardcopy formats [5,15,25],
color='darkgreen',
and interactive environments across marker='^')
platforms. >>> ax.set_xlim(1, 6.5)
X-axis
>>> plt.savefig('foo.png')

1 Prepare The Data Also see Lists & NumPy


>>> plt.show() Step 6

1D Data 4 Customize Plot


>>> import numpy as np Colors, Color Bars & Color Maps Mathtext
>>> x = np.linspace(0, 10, 100)
>>> y = np.cos(x) >>> plt.plot(x, x, x, x**2, x, x**3) >>> plt.title(r'$\sigma_i=15$', fontsize=20)
>>> z = np.sin(x) >>> ax.plot(x, y, alpha = 0.4)
>>> ax.plot(x, y, c='k') Limits, Legends & Layouts
2D Data or Images >>> fig.colorbar(im, orientation='horizontal')
>>> im = ax.imshow(img, Limits & Autoscaling
>>> data = 2 * np.random.random((10, 10)) cmap='seismic')
>>> data2 = 3 * np.random.random((10, 10)) >>> ax.margins(x=0.0,y=0.1) Add padding to a plot
>>> Y, X = np.mgrid[-3:3:100j, -3:3:100j] >>> ax.axis('equal') Set the aspect ratio of the plot to 1
Markers >>> ax.set(xlim=[0,10.5],ylim=[-1.5,1.5]) Set limits for x-and y-axis
>>> U = -1 - X**2 + Y
>>> V = 1 + X - Y**2 >>> fig, ax = plt.subplots() >>> ax.set_xlim(0,10.5) Set limits for x-axis
>>> from matplotlib.cbook import get_sample_data >>> ax.scatter(x,y,marker=".") Legends
>>> img = np.load(get_sample_data('axes_grid/bivariate_normal.npy')) >>> ax.plot(x,y,marker="o") >>> ax.set(title='An Example Axes', Set a title and x-and y-axis labels
ylabel='Y-Axis',
Linestyles xlabel='X-Axis')
2 Create Plot >>>
>>>
plt.plot(x,y,linewidth=4.0)
plt.plot(x,y,ls='solid')
>>> ax.legend(loc='best')
Ticks
No overlapping plot elements

>>> import matplotlib.pyplot as plt >>> ax.xaxis.set(ticks=range(1,5), Manually set x-ticks


>>> plt.plot(x,y,ls='--') ticklabels=[3,100,-12,"foo"])
Figure >>> plt.plot(x,y,'--',x**2,y**2,'-.') >>> ax.tick_params(axis='y', Make y-ticks longer and go in and out
>>> plt.setp(lines,color='r',linewidth=4.0) direction='inout',
>>> fig = plt.figure() length=10)
>>> fig2 = plt.figure(figsize=plt.figaspect(2.0)) Text & Annotations
Subplot Spacing
Axes >>> ax.text(1, >>> fig3.subplots_adjust(wspace=0.5, Adjust the spacing between subplots
-2.1, hspace=0.3,
All plotting is done with respect to an Axes. In most cases, a 'Example Graph', left=0.125,
style='italic') right=0.9,
subplot will fit your needs. A subplot is an axes on a grid system. >>> ax.annotate("Sine", top=0.9,
>>> fig.add_axes() xy=(8, 0), bottom=0.1)
>>> ax1 = fig.add_subplot(221) # row-col-num xycoords='data', >>> fig.tight_layout() Fit subplot(s) in to the figure area
xytext=(10.5, 0),
>>> ax3 = fig.add_subplot(212) textcoords='data', Axis Spines
>>> fig3, axes = plt.subplots(nrows=2,ncols=2) arrowprops=dict(arrowstyle="->", >>> ax1.spines['top'].set_visible(False) Make the top axis line for a plot invisible
>>> fig4, axes2 = plt.subplots(ncols=3) connectionstyle="arc3"),) >>> ax1.spines['bottom'].set_position(('outward',10)) Move the bottom axis line outward

3 Plotting Routines 5 Save Plot


1D Data Vector Fields Save figures
>>> plt.savefig('foo.png')
>>> fig, ax = plt.subplots() >>> axes[0,1].arrow(0,0,0.5,0.5) Add an arrow to the axes
>>> lines = ax.plot(x,y) Draw points with lines or markers connecting them >>> axes[1,1].quiver(y,z) Plot a 2D field of arrows Save transparent figures
>>> ax.scatter(x,y) Draw unconnected points, scaled or colored >>> axes[0,1].streamplot(X,Y,U,V) Plot a 2D field of arrows >>> plt.savefig('foo.png', transparent=True)
>>> axes[0,0].bar([1,2,3],[3,4,5]) Plot vertical rectangles (constant width)
>>>
>>>
>>>
axes[1,0].barh([0.5,1,2.5],[0,1,2])
axes[1,1].axhline(0.45)
axes[0,1].axvline(0.65)
Plot horizontal rectangles (constant height)
Draw a horizontal line across axes
Draw a vertical line across axes
Data Distributions
>>> ax1.hist(y) Plot a histogram
6 Show Plot
>>> plt.show()
>>> ax.fill(x,y,color='blue') Draw filled polygons >>> ax3.boxplot(y) Make a box and whisker plot
>>> ax.fill_between(x,y,color='yellow') Fill between y-values and 0 >>> ax3.violinplot(z) Make a violin plot
2D Data or Images Close & Clear
>>> fig, ax = plt.subplots() >>> plt.cla() Clear an axis
>>> axes2[0].pcolor(data2) Pseudocolor plot of 2D array >>> plt.clf() Clear the entire figure
>>> im = ax.imshow(img, Colormapped or RGB arrays >>> axes2[0].pcolormesh(data) Pseudocolor plot of 2D array
cmap='gist_earth', >>> plt.close() Close a window
interpolation='nearest', >>> CS = plt.contour(Y,X,U) Plot contours
vmin=-2, >>> axes2[2].contourf(data2) Plot filled contours
vmax=2) >>> axes2[2]= ax.clabel(CS) Label a contour plot DataCamp
Learn Python for Data Science Interactively
Matplotlib 2.0.0 - Updated on: 02/2017
Python For Data Science Cheat Sheet 3 Plotting With Seaborn
Seaborn Axis Grids
Learn Data Science Interactively at www.DataCamp.com >>> g = sns.FacetGrid(titanic, Subplot grid for plotting conditional >>> h = sns.PairGrid(iris) Subplot grid for plotting pairwise
col="survived", relationships >>> h = h.map(plt.scatter) relationships
row="sex") >>> sns.pairplot(iris) Plot pairwise bivariate distributions
>>> g = g.map(plt.hist,"age") >>> i = sns.JointGrid(x="x", Grid for bivariate plot with marginal
>>> sns.factorplot(x="pclass", Draw a categorical plot onto a y="y", univariate plots
y="survived", Facetgrid data=data)
Statistical Data Visualization With Seaborn hue="sex",
data=titanic)
>>> i = i.plot(sns.regplot,
sns.distplot)
The Python visualization library Seaborn is based on >>> sns.lmplot(x="sepal_width", Plot data and regression model fits >>> sns.jointplot("sepal_length", Plot bivariate distribution
y="sepal_length", across a FacetGrid "sepal_width",
matplotlib and provides a high-level interface for drawing hue="species", data=iris,
attractive statistical graphics. data=iris) kind='kde')

Categorical Plots Regression Plots


Make use of the following aliases to import the libraries: >>> sns.regplot(x="sepal_width", Plot data and a linear regression
Scatterplot
>>> import matplotlib.pyplot as plt y="sepal_length", model fit
>>> sns.stripplot(x="species", Scatterplot with one
>>> import seaborn as sns data=iris,
y="petal_length", categorical variable
data=iris) ax=ax)
The basic steps to creating plots with Seaborn are: >>> sns.swarmplot(x="species", Categorical scatterplot with Distribution Plots
y="petal_length", non-overlapping points
1. Prepare some data data=iris) >>> plot = sns.distplot(data.y, Plot univariate distribution
2. Control figure aesthetics Bar Chart kde=False,
color="b")
3. Plot with Seaborn >>> sns.barplot(x="sex", Show point estimates and
y="survived", confidence intervals with Matrix Plots
4. Further customize your plot hue="class", scatterplot glyphs
>>> sns.heatmap(uniform_data,vmin=0,vmax=1) Heatmap
data=titanic)
>>> import matplotlib.pyplot as plt Count Plot
>>>
>>>
>>>
import seaborn as sns
tips = sns.load_dataset("tips")
sns.set_style("whitegrid") Step 2
Step 1
>>> sns.countplot(x="deck",
data=titanic,
Show count of observations
4 Further Customizations Also see Matplotlib
palette="Greens_d")
>>> g = sns.lmplot(x="tip", Step 3
Point Plot Axisgrid Objects
y="total_bill",
data=tips, >>> sns.pointplot(x="class", Show point estimates and >>> g.despine(left=True) Remove left spine
aspect=2) y="survived", confidence intervals as >>> g.set_ylabels("Survived") Set the labels of the y-axis
>>> g = (g.set_axis_labels("Tip","Total bill(USD)"). hue="sex", rectangular bars >>> g.set_xticklabels(rotation=45) Set the tick labels for x
set(xlim=(0,10),ylim=(0,100))) data=titanic, >>> g.set_axis_labels("Survived", Set the axis labels
Step 4 palette={"male":"g", "Sex")
>>> plt.title("title")
>>> plt.show(g) Step 5 "female":"m"}, >>> h.set(xlim=(0,5), Set the limit and ticks of the
markers=["^","o"], ylim=(0,5), x-and y-axis
linestyles=["-","--"]) xticks=[0,2.5,5],

1
Boxplot yticks=[0,2.5,5])
Data Also see Lists, NumPy & Pandas >>> sns.boxplot(x="alive", Boxplot
Plot
y="age",
>>> import pandas as pd hue="adult_male",
>>> import numpy as np >>> plt.title("A Title") Add plot title
data=titanic)
>>> uniform_data = np.random.rand(10, 12) >>> plt.ylabel("Survived") Adjust the label of the y-axis
>>> sns.boxplot(data=iris,orient="h") Boxplot with wide-form data
>>> data = pd.DataFrame({'x':np.arange(1,101), >>> plt.xlabel("Sex") Adjust the label of the x-axis
'y':np.random.normal(0,4,100)}) Violinplot >>> plt.ylim(0,100) Adjust the limits of the y-axis
>>> sns.violinplot(x="age", Violin plot >>> plt.xlim(0,10) Adjust the limits of the x-axis
Seaborn also offers built-in data sets: y="sex", >>> plt.setp(ax,yticks=[0,5]) Adjust a plot property
>>> titanic = sns.load_dataset("titanic") hue="survived", >>> plt.tight_layout() Adjust subplot params
>>> iris = sns.load_dataset("iris") data=titanic)

2 Figure Aesthetics Also see Matplotlib


5 Show or Save Plot Also see Matplotlib
>>> plt.show() Show the plot
Context Functions >>> plt.savefig("foo.png") Save the plot as a figure
>>> f, ax = plt.subplots(figsize=(5,6)) Create a figure and one subplot >>> plt.savefig("foo.png", Save transparent figure
>>> sns.set_context("talk") Set context to "talk" transparent=True)
>>> sns.set_context("notebook", Set context to "notebook",
Seaborn styles font_scale=1.5, Scale font elements and
>>> sns.set() (Re)set the seaborn default
rc={"lines.linewidth":2.5}) override param mapping Close & Clear Also see Matplotlib
>>> sns.set_style("whitegrid") Set the matplotlib parameters Color Palette >>> plt.cla() Clear an axis
>>> sns.set_style("ticks", Set the matplotlib parameters >>> plt.clf() Clear an entire figure
{"xtick.major.size":8, >>> sns.set_palette("husl",3) Define the color palette >>> plt.close() Close a window
"ytick.major.size":8}) >>> sns.color_palette("husl") Use with with to temporarily set palette
>>> sns.axes_style("whitegrid") Return a dict of params or use with >>> flatui = ["#9b59b6","#3498db","#95a5a6","#e74c3c","#34495e","#2ecc71"]
with to temporarily set the style >>> sns.set_palette(flatui) Set your own color palette DataCamp
Learn Python for Data Science Interactively
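The color_palette() entry above mentions temporary use inside a with block; a minimal sketch of that idea, using the built-in iris data set:
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> iris = sns.load_dataset("iris")
>>> with sns.color_palette("husl", 3): Palette applies only inside this block
...     sns.stripplot(x="species", y="petal_length", data=iris)
>>> plt.show()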
Python For Data Science Cheat Sheet 3 Renderers & Visual Customizations
Bokeh Glyphs Grid Layout
Learn Bokeh Interactively at www.DataCamp.com, Scatter Markers >>> from bokeh.layouts import gridplot
taught by Bryan Van de Ven, core contributor >>> p1.circle(np.array([1,2,3]), np.array([3,2,1]), >>> row1 = [p1,p2]
fill_color='white') >>> row2 = [p3]
>>> p2.square(np.array([1.5,3.5,5.5]), [1,4,3], >>> layout = gridplot([[p1,p2],[p3]])
color='blue', size=1)
Plotting With Bokeh Line Glyphs Tabbed Layout
>>> p1.line([1,2,3,4], [3,4,5,6], line_width=2)
>>> p2.multi_line(pd.DataFrame([[1,2,3],[5,6,7]]), >>> from bokeh.models.widgets import Panel, Tabs
The Python interactive visualization library Bokeh >>> tab1 = Panel(child=p1, title="tab1")
pd.DataFrame([[3,4,5],[3,2,1]]),
enables high-performance visual presentation of color="blue") >>> tab2 = Panel(child=p2, title="tab2")
>>> layout = Tabs(tabs=[tab1, tab2])
large datasets in modern web browsers.
Customized Glyphs Also see Data
Linked Plots
Bokeh’s mid-level general purpose bokeh.plotting Selection and Non-Selection Glyphs
>>> p = figure(tools='box_select') Linked Axes
interface is centered around two main components: data >>> p.circle('mpg', 'cyl', source=cds_df, >>> p2.x_range = p1.x_range
and glyphs. selection_color='red', >>> p2.y_range = p1.y_range
nonselection_alpha=0.1) Linked Brushing
>>> p4 = figure(plot_width = 100,
+ = Hover Glyphs tools='box_select,lasso_select')
>>> from bokeh.models import HoverTool
>>> p4.circle('mpg', 'cyl', source=cds_df)
data glyphs plot >>> hover = HoverTool(tooltips=None, mode='vline')
>>> p5 = figure(plot_width = 200,
>>> p3.add_tools(hover)
tools='box_select,lasso_select')
The basic steps to creating plots with the bokeh.plotting >>> p5.circle('mpg', 'hp', source=cds_df)
interface are: US
Colormapping >>> layout = row(p4,p5)
1. Prepare some data: >>> from bokeh.models import CategoricalColorMapper
Asia
Europe

Python lists, NumPy arrays, Pandas DataFrames and other sequences of values
2. Create a new plot
>>> color_mapper = CategoricalColorMapper(
factors=['US', 'Asia', 'Europe'],
palette=['blue', 'red', 'green'])
4 Output & Export
3. Add renderers for your data, with visual customizations >>> p3.circle('mpg', 'cyl', source=cds_df, Notebook
color=dict(field='origin',
4. Specify where to generate the output transform=color_mapper), >>> from bokeh.io import output_notebook, show
5. Show or save the results legend='Origin') >>> output_notebook()
>>> from bokeh.plotting import figure
>>> from bokeh.io import output_file, show Legend Location HTML
>>> x = [1, 2, 3, 4, 5] Step 1
>>> y = [6, 7, 2, 4, 5] Inside Plot Area Standalone HTML
>>> p = figure(title="simple line example", Step 2 >>> p.legend.location = 'bottom_left' >>> from bokeh.embed import file_html
>>> from bokeh.resources import CDN
x_axis_label='x',
>>> html = file_html(p, CDN, "my_plot")
y_axis_label='y') Outside Plot Area
>>> p.line(x, y, legend="Temp.", line_width=2) Step 3 >>> from bokeh.models import Legend
>>> r1 = p2.asterisk(np.array([1,2,3]), np.array([3,2,1])) >>> from bokeh.io import output_file, show
>>> output_file("lines.html") Step 4 >>> r2 = p2.line([1,2,3,4], [3,4,5,6]) >>> output_file('my_bar_chart.html', mode='cdn')
>>> show(p) Step 5 >>> legend = Legend(items=[("One" ,[p1, r1]),("Two",[r2])],
location=(0, -30)) Components
1 Data Also see Lists, NumPy & Pandas
>>> p.add_layout(legend, 'right')

Legend Orientation
>>> from bokeh.embed import components
>>> script, div = components(p)
Under the hood, your data is converted to Column Data
Sources. You can also do this manually: >>> p.legend.orientation = "horizontal" PNG
>>> import numpy as np >>> p.legend.orientation = "vertical"
>>> from bokeh.io import export_png
>>> import pandas as pd >>> export_png(p, filename="plot.png")
>>> df = pd.DataFrame(np.array([[33.9,4,65, 'US'], Legend Background & Border
[32.4,4,66, 'Asia'],
[21.4,4,109, 'Europe']]), >>> p.legend.border_line_color = "navy" SVG
columns=['mpg','cyl', 'hp', 'origin'], >>> p.legend.background_fill_color = "white"
index=['Toyota', 'Fiat', 'Volvo']) >>> from bokeh.io import export_svgs
>>> from bokeh.models import ColumnDataSource Rows & Columns Layout >>> p.output_backend = "svg"
>>> export_svgs(p, filename="plot.svg")
>>> cds_df = ColumnDataSource(df) Rows
>>> from bokeh.layouts import row

2 Plotting >>> layout = row(p1,p2,p3)


Columns
5 Show or Save Your Plots
>>> from bokeh.plotting import figure >>> from bokeh.layouts import column >>> show(p1) >>> show(layout)
>>> p1 = figure(plot_width=300, tools='pan,box_zoom') >>> layout = column(p1,p2,p3) >>> save(p1) >>> save(layout)
>>> p2 = figure(plot_width=300, plot_height=300, Nesting Rows & Columns
x_range=(0, 8), y_range=(0, 8)) >>>layout = row(column(p1,p2), p3) DataCamp
>>> p3 = figure() Learn Python for Data Science Interactively
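To show how the script and div returned by components() (see the Components entry above) are typically used, a hedged sketch; where exactly they go in your own HTML template is up to you, and the page must also load the matching BokehJS library (e.g. the CDN bundle):
>>> from bokeh.plotting import figure
>>> from bokeh.embed import components
>>> p = figure(title="embedded plot")
>>> p.line([1, 2, 3], [4, 6, 5])
>>> script, div = components(p)
>>> # Paste `div` where the plot should appear and `script` near the end of <body>;
>>> # BokehJS must be loaded on the page for the plot to render.
>>> print(div)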
> Linear Algebra Also see NumPy

Python For Data Science


You’ll use the linalg and sparse modules.

Note that scipy.linalg contains and expands on numpy.linalg.


Matrix Functions

SciPy Cheat Sheet


Addition
>>> from scipy import linalg, sparse
>>> np.add(A,D) #Addition

Creating Matrices Subtraction


Learn SciPy online at www.DataCamp.com >>> np.subtract(A,D) #Subtraction

>>> A = np.matrix(np.random.random((2,2)))
Division
>>> B = np.asmatrix(b)
>>> np.divide(A,D) #Division
>>> C = np.mat(np.random.random((10,5)))

>>> D = np.mat([[3,4], [5,6]]) Multiplication


>>> np.multiply(D,A) #Multiplication

SciPy Basic Matrix Routines


>>>
>>>
>>>
np.dot(A,D) #Dot product

np.vdot(A,D) #Vector dot product

np.inner(A,D) #Inner product

Inverse >>> np.outer(A,D) #Outer product

The SciPy library is one of the core packages for


>>> np.tensordot(A,D) #Tensor dot product

>>> A.I #Inverse


>>> np.kron(A,D) #Kronecker product
scientific computing that provides mathematical
>>> linalg.inv(A) #Inverse

algorithms and convenience functions built on the


>>> A.T #Transpose matrix
Exponential Functions
>>> A.H #Conjugate transposition
>>> linalg.expm(A) #Matrix exponential

NumPy extension of Python. >>> np.trace(A) #Trace >>> linalg.expm2(A) #Matrix exponential (Taylor Series)

Norm >>> linalg.expm3(D) #Matrix exponential (eigenvalue decomposition)

>>> linalg.norm(A) #Frobenius norm


Logarithm Function

> Interacting With NumPy Also see NumPy >>> linalg.norm(A,1) #L1 norm (max column sum)

>>> linalg.norm(A,np.inf) #L inf norm (max row sum)


>>> linalg.logm(A) #Matrix logarithm
Trigonometric Functions
Rank >>> linalg.sinm(D) Matrix sine

>>> import numpy as np

>>> a = np.array([1,2,3])
>>> np.linalg.matrix_rank(C) #Matrix rank >>> linalg.cosm(D) Matrix cosine

>>> b = np.array([(1+5j,2j,3j), (4j,5j,6j)])


>>> linalg.tanm(A) Matrix tangent
Determinant
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]]) Hyperbolic Trigonometric Functions
>>> linalg.det(A) #Determinant
>>> linalg.sinhm(D) #Hyperbolic matrix sine

Solving linear problems


Index Tricks >>> linalg.solve(A,b) #Solver for dense matrices

>>> linalg.coshm(D) #Hyperbolic matrix cosine

>>> linalg.tanhm(A) #Hyperbolic matrix tangent


>>> E = np.mat(a).T #Create a matrix (transpose of a)

>>> np.mgrid[0:5,0:5] #Create a dense meshgrid


>>> linalg.lstsq(D,E) #Least-squares solution to linear matrix equation Matrix Sign Function
>>> np.ogrid[0:2,0:2] #Create an open meshgrid
>>> linalg.signm(A) #Matrix sign function
>>> np.r_[3,[0]*5,-1:1:10j] #Stack arrays vertically (row-wise)
Generalized inverse
>>> np.c_[b,c] #Create stacked column-wise arrays >>> linalg.pinv(C) #Compute the pseudo-inverse of a matrix (least-squares solver)
Matrix Square Root
>>> linalg.pinv2(C) #Compute the pseudo-inverse of a matrix (SVD) >>> linalg.sqrtm(A) #Matrix square root

Shape Manipulation Arbitrary Functions


Creating Sparse Matrices >>> linalg.funm(A, lambda x: x*x) #Evaluate matrix function
>>> np.transpose(b) #Permute array dimensions

>>>
>>>
b.flatten() #Flatten the array

np.hstack((b,c)) #Stack arrays horizontally (column-wise)

>>>
>>>
F = np.eye(3, k=1) #Create a 3x3 array with ones on the first superdiagonal

G = np.mat(np.identity(2)) #Create a 2x2 identity matrix


Decompositions
>>> np.vstack((a,b)) #Stack arrays vertically (row-wise)
>>> C[C > 0.5] = 0

>>> np.hsplit(c,2) #Split the array horizontally at the 2nd index


>>> H = sparse.csr_matrix(C) #Compressed Sparse Row matrix
Eigenvalues and Eigenvectors
>>> np.vsplit(d,2) #Split the array vertically at the 2nd index >>> I = sparse.csc_matrix(D) #Compressed Sparse Column matrix
>>> la, v = linalg.eig(A) #Solve ordinary or generalized eigenvalue problem for square matrix

>>> J = sparse.dok_matrix(A) #Dictionary Of Keys matrix


>>> l1, l2 = la #Unpack eigenvalues

>>> E.todense() #Sparse matrix to full matrix

Polynomials >>> sparse.isspmatrix_csc(A) #Identify sparse matrix


>>>
>>>
v[:,0] #First eigenvector

v[:,1] #Second eigenvector

>>> linalg.eigvals(A) #Compute eigenvalues


>>> from numpy import poly1d

>>> p = poly1d([3,4,5]) #Create a polynomial object Sparse Matrix Routines Singular Value Decomposition
>>> U,s,Vh = linalg.svd(B) #Singular Value Decomposition (SVD)

Inverse >>> M,N = B.shape

Vectorizing Functions >>> sparse.linalg.inv(I) #Inverse >>> Sig = linalg.diagsvd(s,M,N) #Construct sigma matrix in SVD

Norm LU Decomposition
>>> def myfunc(a):
...     if a < 0:
...         return a*2
...     else:
...         return a/2
>>> P,L,U = linalg.lu(C) #LU Decomposition
>>> sparse.linalg.norm(I) #Norm
Solving linear problems

>>> sparse.linalg.spsolve(H,I) #Solver for sparse matrices


>>> np.vectorize(myfunc) #Vectorize functions

Type Handling Sparse Matrix Functions


>>> sparse.linalg.expm(I) #Sparse matrix exponential
>>> np.real(c) #Return the real part of the array elements

>>> np.imag(c) #Return the imaginary part of the array elements

>>>
>>>
np.real_if_close(c,tol=1000) #Return a real array if complex parts close to 0

np.cast['f'](np.pi) #Cast object to a data type Sparse Matrix Decompositions


>>> la, v = sparse.linalg.eigs(F,1) #Eigenvalues and eigenvectors

Other Useful Functions >>> sparse.linalg.svds(H, 2) #SVD

>>> np.angle(b,deg=True) #Return the angle of the complex argument

>>>
>>>
g = np.linspace(0,np.pi,num=5) #Create an array of evenly spaced values(number of samples)

g [3:] += np.pi
> Asking For Help Learn Data Skills Online at
>>>
>>>
np.unwrap(g) #Unwrap

np.logspace(0,10,3) #Create an array of evenly spaced values (log scale)

www.DataCamp.com
>>> np.select([c<4],[c*2]) #Return values from a list of arrays depending on conditions
>>> help(scipy.linalg.diagsvd)

>>> misc.factorial(a) #Factorial


>>> np.info(np.matrix)
>>> misc.comb(10,3,exact=True) #Combinations of N things taken k at a time

>>> misc.central_diff_weights(3) #Weights for Np-point central derivative

>>> misc.derivative(myfunc,1.0) #Find the n-th derivative of a function at a point


3 Plotting With Seaborn
Regression Plots
Python For Data Science
Axis Grids
>>> g = sns.FacetGrid(titanic, #Subplot grid for plotting conditional relationships
>>> sns.regplot(x="sepal_width", #Plot data and a linear regression model fit

Seaborn Cheat Sheet col="survived",

row="sex")

>>> g = g.map(plt.hist,"age")

y="sepal_length",

data=iris,

ax=ax)
>>> sns.factorplot(x="pclass", #Draw a categorical plot onto a Facetgrid

y="survived",

Learn Seaborn online at www.DataCamp.com hue="sex",


Distribution Plots
data=titanic)

>>> sns.lmplot(x="sepal_width", #Plot data and regression model fits across a FacetGrid

y="sepal_length",
>>> plot = sns.distplot(data.y, #Plot univariate distribution

hue="species",
kde=False,

data=iris)
color="b")
>>> h = sns.PairGrid(iris) #Subplot grid for plotting pairwise relationships

Statistical Data Visualization With Seaborn >>> h = h.map(plt.scatter)

>>> sns.pairplot(iris) #Plot pairwise bivariate distributions


Matrix Plots
>>> i = sns.JointGrid(x="x", #Grid for bivariate plot with marginal univariate plots

y="y",

The Python visualization library Seaborn is based on matplotlib and provides data=data)

>>> sns.heatmap(uniform_data,vmin=0,vmax=1) #Heatmap

a high-level interface for drawing attractive statistical graphics. >>> i = i.plot(sns.regplot,

Make use of the following aliases to import the libraries:


sns.distplot)

>>> sns.jointplot("sepal_length", #Plot bivariate distribution

Categorical Plots
"sepal_width",

>>> import matplotlib.pyplot as plt


data=iris,
Scatterplot
>>> import seaborn as sns kind='kde')
>>> sns.stripplot(x="species", #Scatterplot with one categorical variable

y="petal_length",

The basic steps to creating plots with Seaborn are:


data=iris)

1. Prepare some data


>>> sns.swarmplot(x="species", #Categorical scatterplot with non-overlapping points

2. Control figure aesthetics

3. Plot with Seaborn

4 Further Customizations Also see Matplotlib


y="petal_length",

data=iris)

Bar Chart
4. Further customize your plot

Axisgrid Objects >>> sns.barplot(x="sex", #Show point estimates & confidence intervals with scatterplot glyphs

5. Show your plot y="survived",

hue="class",

>>> import matplotlib.pyplot as plt


>>> g.despine(left=True) #Remove left spine
data=titanic)
>>> import seaborn as sns
>>> g.set_ylabels("Survived") #Set the labels of the y-axis

>>> tips = sns.load_dataset("tips") #Step 1


>>> g.set_xticklabels(rotation=45) #Set the tick labels for x
Count Plot
>>> sns.set_style("whitegrid") #Step 2
>>> g.set_axis_labels("Survived", #Set the axis labels
>>> sns.countplot(x="deck", #Show count of observations

>>> g = sns.lmplot(x="tip", #Step 3


"Sex")
data=titanic,

y="total_bill",
>>> h.set(xlim=(0,5), #Set the limit and ticks of the x-and y-axis
palette="Greens_d")
data=tips,
ylim=(0,5),

aspect=2)
xticks=[0,2.5,5],
Point Plot
>>> g = (g.set_axis_labels("Tip","Total bill(USD)").
yticks=[0,2.5,5])
>>> sns.pointplot(x="class", #Show point estimates & confidence intervals as rectangular bars

set(xlim=(0,10),ylim=(0,100)))

y="survived",

>>> plt.title("title") #Step 4

>>> plt.show(g) #Step 5 Plot hue="sex",

data=titanic,

palette={"male":"g",

>>> plt.title("A Title") #Add plot title


"female":"m"},

>>> plt.ylabel("Survived") #Adjust the label of the y-axis


markers=["^","o"],

1 Data Also see Lists, NumPy & Pandas


>>>
>>>
>>>
plt.xlabel("Sex") #Adjust the label of the x-axis

plt.ylim(0,100) #Adjust the limits of the y-axis

plt.xlim(0,10) #Adjust the limits of the x-axis

Boxplot
linestyles=["-","--"])

>>> plt.setp(ax,yticks=[0,5]) #Adjust a plot property


>>> sns.boxplot(x="alive", #Boxplot

>>> import pandas as pd


>>> plt.tight_layout() #Adjust subplot params y="age",

>>> import numpy as np


hue="adult_male",

>>> uniform_data = np.random.rand(10, 12)


data=titanic)

>>> data = pd.DataFrame({'x':np.arange(1,101),


>>> sns.boxplot(data=iris,orient="h") #Boxplot with wide-form data
'y':np.random.normal(0,4,100)})
Violinplot
Seaborn also offers built-in data sets:
>>> sns.violinplot(x="age", #Violin plot

>>> titanic = sns.load_dataset("titanic")


y="sex",

>>> iris = sns.load_dataset("iris") hue="survived",

data=titanic)

2 Figure Aesthetics Also see Matplotlib 5 Show or Save Plot Also see Matplotlib

>>> plt.show() #Show the plot

>>> f, ax = plt.subplots(figsize=(5,6)) #Create a figure and one subplot


Context Functions >>> plt.savefig("foo.png") #Save the plot as a figure

>>> plt.savefig("foo.png", #Save transparent figure

Seaborn styles >>> sns.set_context("talk") #Set context to "talk"

transparent=True)

>>> sns.set_context("notebook", #Set context to "notebook",

>>> sns.set() #(Re)set the seaborn default

font_scale=1.5, #Scale font elements and

rc={"lines.linewidth":2.5}) #override param mapping

> Close & Clear


>>> sns.set_style("whitegrid") #Set the matplotlib parameters

>>> sns.set_style("ticks", #Set the matplotlib parameters


Also see Matplotlib
{"xtick.major.size":8,

"ytick.major.size":8})
Color Palette
#Return a dict of params or use with with to temporarily set the style
>>> plt.cla() #Clear an axis

>>> sns.axes_style("whitegrid") >>> sns.set_palette("husl",3) #Define the color palette


>>> plt.clf() #Clear an entire figure
Also see Matplotlib
>>> sns.color_palette("husl") #Use inside a with statement to temporarily set the palette
>>> plt.close() #Close a window
>>> flatui = ["#9b59b6","#3498db","#95a5a6","#e74c3c","#34495e","#2ecc71"]

>>> sns.set_palette(flatui) #Set your own color palette

Learn Data Skills Online at www.DataCamp.com


> Preprocessing The Data > Evaluate Your Model’s Performance
Python For Data Science
Standardization Classification Metrics

Scikit-Learn Cheat Sheet >>>


>>>
>>>
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)

standardized_X = scaler.transform(X_train)

Accuracy Score
>>> knn.score(X_test, y_test) #Estimator score method

>>> from sklearn.metrics import accuracy_score #Metric scoring functions

>>> standardized_X_test = scaler.transform(X_test) >>> accuracy_score(y_test, y_pred)


Learn Scikit-Learn online at www.DataCamp.com
Classification Report
Normalization >>> from sklearn.metrics import classification_report #Precision, recall, f1-score and support

>>> print(classification_report(y_test, y_pred))


>>> from sklearn.preprocessing import Normalizer

Confusion Matrix
>>> scaler = Normalizer().fit(X_train)

Scikit-learn >>>
>>>
normalized_X = scaler.transform(X_train)

normalized_X_test = scaler.transform(X_test)
>>> from sklearn.metrics import confusion_matrix

>>> print(confusion_matrix(y_test, y_pred))

Scikit-learn is an open source Python library that implements a range of Binarization Regression Metrics
machine learning, preprocessing, cross-validation and visualization

algorithms using a unified interface. >>> from sklearn.preprocessing import Binarizer


Mean Absolute Error
>>> binarizer = Binarizer(threshold=0.0).fit(X)
>>> from sklearn.metrics import mean_absolute_error

>>> binary_X = binarizer.transform(X)


A Basic Example >>> y_true = [3, -0.5, 2]

>>> mean_absolute_error(y_true, y_pred)

>>> from sklearn import neighbors, datasets, preprocessing

Encoding Categorical Features Mean Squared Error


>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import mean_squared_error

>>> from sklearn.metrics import accuracy_score


>>> from sklearn.preprocessing import LabelEncoder
>>> mean_squared_error(y_test, y_pred)
>>> iris = datasets.load_iris()
>>> enc = LabelEncoder()

>>> X, y = iris.data[:, :2], iris.target


R² Score
>>> y = enc.fit_transform(y)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
>>> from sklearn.metrics import r2_score

>>> scaler = preprocessing.StandardScaler().fit(X_train)


>>> r2_score(y_true, y_pred)
>>>
>>>
X_train = scaler.transform(X_train)

X_test = scaler.transform(X_test)

Imputing Missing Values


>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)

>>> from sklearn.preprocessing import Imputer

Clustering Metrics
>>> knn.fit(X_train, y_train)

>>> y_pred = knn.predict(X_test)


>>> imp = Imputer(missing_values=0, strategy='mean', axis=0)

>>> imp.fit_transform(X_train) Adjusted Rand Index


>>> accuracy_score(y_test, y_pred)
>>> from sklearn.metrics import adjusted_rand_score

Generating Polynomial Features >>> adjusted_rand_score(y_true, y_pred)

> Loading The Data Also see NumPy & Pandas


>>> from sklearn.preprocessing import PolynomialFeatures

Homogeneity

>>> from sklearn.metrics import homogeneity_score

>>> poly = PolynomialFeatures(5)


>>> homogeneity_score(y_true, y_pred)
Your data needs to be numeric and stored as NumPy arrays or SciPy sparse matrices. Other types that are >>> poly.fit_transform(X)
convertible to numeric arrays, such as Pandas DataFrame, are also acceptable. V-measure
>>> import numpy as np
>>> from sklearn.metrics import v_measure_score

> Create Your Model


>>> X = np.random.random((10,5))
>>> v_measure_score(y_true, y_pred)
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F'])

>>> X[X < 0.7] = 0


Cross-Validation
Supervised Learning Estimators
> Training And Test Data Linear Regression
>>> from sklearn.model_selection import cross_val_score

>>> print(cross_val_score(knn, X_train, y_train, cv=4))

>>> print(cross_val_score(lr, X, y, cv=2))


>>> from sklearn.linear_model import LinearRegression

>>> from sklearn.model_selection import train_test_split

>>> lr = LinearRegression(normalize=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X,

y,

random_state=0)
Support Vector Machines (SVM)
>>> from sklearn.svm import SVC

> Tune Your Model


>>> svc = SVC(kernel='linear')

Grid Search
> Model Fitting
Naive Bayes
>>> from sklearn.naive_bayes import GaussianNB

>>> gnb = GaussianNB() >>> from sklearn.model_selection import GridSearchCV

>>> params = {"n_neighbors": np.arange(1,3),

Supervised learning KNN "metric": ["euclidean", "cityblock"]}

>>> lr.fit(X, y) #Fit the model to the data


>>> from sklearn import neighbors
>>> grid = GridSearchCV(estimator=knn,

>>> knn.fit(X_train, y_train)


>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5) param_grid=params)

>>> svc.fit(X_train, y_train) >>> grid.fit(X_train, y_train)

>>> print(grid.best_score_)

Unsupervised Learning
Unsupervised Learning Estimators >>> print(grid.best_estimator_.n_neighbors)
>>> k_means.fit(X_train) #Fit the model to the data

>>> pca_model = pca.fit_transform(X_train) #Fit to data, then transform it


Principal Component Analysis (PCA) Randomized Parameter Optimization
>>> from sklearn.decomposition import PCA

>>> pca = PCA(n_components=0.95) >>> from sklearn.model_selection import RandomizedSearchCV

> Prediction K Means


>>> params = {"n_neighbors": range(1,5), "weights": ["uniform", "distance"]}

>>> rsearch = RandomizedSearchCV(estimator=knn, param_distributions=params,

>>> from sklearn.cluster import KMeans


cv=4, n_iter=8, random_state=5)

Supervised Estimators >>> k_means = KMeans(n_clusters=3, random_state=0) >>> rsearch.fit(X_train, y_train)

>>> print(rsearch.best_score_)
>>> y_pred = svc.predict(np.random.random((2,5))) #Predict labels

>>> y_pred = lr.predict(X_test) #Predict labels

>>> y_pred = knn.predict_proba(X_test) #Estimate probability of a label


Unsupervised Estimators
Learn Data Skills Online at www.DataCamp.com
>>> y_pred = k_means.predict(X_test) #Predict labels in clustering algos
> Spans > Visualizing
Python For Data Science
Accessing spans If you're in a Jupyter notebook, use displacy.render otherwise,

use displacy.serve to start a web server and show the visualization in your browser.

spaCy Cheat Sheet Span indices are exclusive. So doc[2:4] is a span starting at token 2, up to – but not including! – token 4.
>>> doc = nlp("This is a text")

>>> from spacy import displacy

>>> span = doc[2:4]


Visualize dependencies
Learn spaCy online at www.DataCamp.com
>>> span.text
'a text'
>>> doc = nlp("This is a sentence")

>>> displacy.render(doc, style="dep")

Creating a span manually

spaCy >>> from spacy.tokens import Span #Import the Span object

>>> doc = nlp("I live in New York") #Create a Doc object

>>> span = Span(doc, 3, 5, label="GPE") #Span for "New York" with label GPE (geopolitical)

>>> span.text

spaCy is a free, open-source library for advanced Natural Language 'New York'

processing (NLP) in Python. It's designed specifically for production use and Visualize named entities
helps you build applications that process and "understand" large volumes
>>> doc = nlp("Larry Page founded Google")

of text. Documentation: spacy.io


>>> $ pip install spacy

> Linguistic features >>> displacy.render(doc, style="ent")

>>> import spacy Attributes return label IDs. For string labels, use the attributes with an underscore. For example, token.pos_ .

Part-of-speech tags Predicted by Statistical model


> Statistical models >>> doc = nlp("This is a text.")

> Word vectors and similarity


>>> [token.pos_ for token in doc] #Coarse-grained part-of-speech tags

Download statistical models ['DET', 'VERB', 'DET', 'NOUN', 'PUNCT']


To use word vectors, you need to install the larger models ending in md or lg , for example en_core_web_lg .
>>> [token.tag_ for token in doc] #Fine-grained part-of-speech tags

['DT', 'VBZ', 'DT', 'NN', '.']


Predict part-of-speech tags, dependency labels, named entities

and more. See here for available models: spacy.io/models Comparing similarity
>>> $ python -m spacy download en_core_web_sm Syntactic dependencies Predicted by Statistical model
>>> doc1 = nlp("I like cats")

>>> doc2 = nlp("I like dogs")

Check that your installed models are up to date >>> doc = nlp("This is a text.")

>>> [token.dep_ for token in doc] #Dependency labels

>>>
>>>
doc1.similarity(doc2) #Compare 2 documents

doc1[2].similarity(doc2[2]) #Compare 2 tokens

['nsubj', 'ROOT', 'det', 'attr', 'punct']


>>> doc1[0].similarity(doc2[1:3]) # Compare tokens and spans
>>> $ python -m spacy validate >>> [token.head.text for token in doc] #Syntactic head token (governor)

['is', 'is', 'text', 'is', 'is']


Accessing word vectors
Loading statistical models
Named entities Predicted by Statistical model
>>> doc = nlp("I like cats") #Vector as a numpy array

>>> import spacy


>>> doc[2].vector #The L2 norm of the token's vector

>>> nlp = spacy.load("en_core_web_sm") # Load the installed model "en_core_web_sm" >>> doc = nlp("Larry Page founded Google")
>>> doc[2].vector_norm
>>> [(ent.text, ent.label_) for ent in doc.ents] #Text and label of named entity span

[('Larry Page', 'PERSON'), ('Google', 'ORG')]

> Documents and tokens > Syntax iterators


Processing text
> Pipeline components
Sentences Usually needs the dependency parser
Functions that take a Doc object, modify it and return it.
Processing text with the nlp object returns a Doc object that holds all
>>> doc = nlp("This a sentence. This is another one.")

information about the tokens, their linguistic features and their relationships >>> [sent.text for sent in doc.sents] #doc.sents is a generator that yields sentence spans

['This is a sentence.', 'This is another one.']


>>> doc = nlp("This is a text")

Accessing token attributes Base noun phrases Needs the tagger and parser

>>> doc = nlp("This is a text")

Pipeline information >>> doc = nlp("I have a red car")

#doc.noun_chunks is a generator that yields spans

>>>[token.text for token in doc] #Token texts

>>> nlp = spacy.load("en_core_web_sm")


>>> [chunk.text for chunk in doc.noun_chunks]

['This', 'is', 'a', 'text']


>>> nlp.pipe_names
['I', 'a red car']
['tagger', 'parser', 'ner']

>>> nlp.pipeline

[('tagger', <spacy.pipeline.Tagger>),

> Label explanations ('parser', <spacy.pipeline.DependencyParser>),

('ner', <spacy.pipeline.EntityRecognizer>)]

>>> spacy.explain("RB")

'adverb'

Custom components
>>> spacy.explain("GPE")
Learn Data Skills Online at
'Countries, cities, states' def custom_component(doc): #Function that modifies the doc and returns it

print("Do something to the doc here!")


www.DataCamp.com
return doc

nlp.add_pipe(custom_component, first=True) #Add the component first in the pipeline


Components can be added first, last (the default), or before or after an existing component.
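A minimal sketch of those placement options (spaCy v2-style add_pipe keywords; the component is the custom_component defined above and the names given here are arbitrary):
nlp.add_pipe(custom_component, name="c_last", last=True) #Explicit default: end of the pipeline
nlp.add_pipe(custom_component, name="c_before", before="ner") #Insert just before the entity recognizer
nlp.add_pipe(custom_component, name="c_after", after="tagger") #Insert right after the tagger
print(nlp.pipe_names)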
> Extension attributes > Rule-based matching > Glossary
Custom attributes that are registered on the global Doc, Token and Span classes and become available as ._ .
>>> from spacy.tokens import Doc, Token, Span

Using the matcher Tokenization


>>> doc = nlp("The sky over New York is blue")
# Matcher is initialized with the shared vocab
Segmenting text into words, punctuation etc
>>> from spacy.matcher import Matcher

Attribute extensions With default value # Each dict represents one token and its attributes

>>> matcher = Matcher(nlp.vocab)


Lemmatization
# Add with ID, optional callback and pattern(s)

# Register custom attribute on Token class


>>> pattern = [{"LOWER": "new"}, {"LOWER": "york"}]

>>> Token.set_extension("is_color", default=False)


>>> matcher.add("CITIES", None, pattern)
Assigning the base forms of words, for example:

# Overwrite extension attribute with default value


# Match by calling the matcher on a Doc object

doc[6]._.is_color = True "was" → "be" or "rats" → "rat".


>>> doc = nlp("I live in New York")

>>> matches = matcher(doc)

# Matches are (match_id, start, end) tuples

Property extensions With getter and setter >>> for match_id, start, end in matches:
Sentence Boundary Detection
# Get the matched span by slicing the Doc

span = doc[start:end]

# Register custom attribute on Doc class


print(span.text)
Finding and segmenting individual sentences.
>>> get_reversed = lambda doc: doc.text[::-1]
'New York'
>>> Doc.set_extension("reversed", getter=get_reversed)

# Compute value of extension attribute with getter


Part-of-speech (POS) Tagging
>>> doc._.reversed

'eulb si kroY weN revo yks ehT'


Token patterns
Assigning word types to tokens like verb or noun.
# "love cats", "loving cats", "loved cats"

Method extensions Callable Method >>> pattern1 = [{"LEMMA": "love"}, {"LOWER": "cats"}]

# "10 people", "twenty people"

>>> pattern2 = [{"LIKE_NUM": True}, {"TEXT": "people"}]

Dependency Parsing
# Register custom attribute on Span class
# "book", "a cat", "the sea" (noun + optional article)

>>> has_label = lambda span, label: span.label_ == label


>>> pattern3 = [{"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]
>>> Span.set_extension("has_label", method=has_label)
Assigning syntactic dependency labels,

# Compute value of extension attribute with method


describing the relations between individual

>>> doc[3:5].has_label("GPE")

True
Operators and quantifiers tokens, like subject or object.
Can be added to a token dict as the "OP" key
Named Entity Recognition (NER)
! Negate pattern and match exactly 0 times

Labeling named "real-world" objects,

? Make pattern optional and match 0 or 1 times


like persons, companies or locations.
+ Require pattern to match 1 or more times

* Allow pattern to match 0 or more times


Text Classification

Assigning categories or labels to a whole

document, or parts of a document.

Statistical model
Process for making predictions based on examples.

Training
Updating a statistical model with new examples.

Learn Data Skills Online at


www.DataCamp.com
TensorFlow v2.0 Cheat Sheet

TensorFlow™ A layer instance is called on a tensor and returns a tensor. An


input tensor and output tensor can then be used to define a
TensorFlow is an open-source software library for high- Model, which is compiled and trained just as a Sequential model.
performance numerical computation. Its flexible architecture Models are callable by themselves and can be stacked the
enables to easily deploy computation across a variety of same way while reusing trained weights.
platforms (CPUs, GPUs, and TPUs), as well as mobile and edge
devices, desktops, and clusters of servers. TensorFlow comes Transfer learning and fine-tuning of pretrained models saves
with strong support for machine learning and deep learning. your time if your data set does not differ significantly from the
original one.
High-Level APIs for Deep Learning import tensorflow as tf
Keras is a handy high-level API standard for deep learning import tensorflow_datasets as tfds
models widely adopted for fast prototyping and state-of- dataset = tfds.load(name=’tf_flowers’, as_supervised=True)
the-art research. It was originally designed to run on top of NUMBER_OF_CLASSES_IN_DATASET = 5
different low-level computational frameworks and therefore the IMG_SIZE = 160
TensorFlow platform fully implements it.
def preprocess_example(image, label):
The Sequential API is the most common way to define your image = tf.cast(image, tf.float32)
neural network model. It corresponds to the mental image we image = (image / 127.5) - 1
use when thinking about deep learning: a sequence of layers. image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
return image, label
import tensorflow as tf
from tensorflow.keras import datasets, layers, models DATASET_SIZE = 3670
BATCH_SIZE = 32
# Load data set train = dataset[’train’].map(preprocess_example)
mnist = datasets.mnist train_batches = train.shuffle(DATASET_SIZE).batch(BATCH_SIZE)
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Load MobileNetV2 model pretrained on ImageNet data
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.applications.MobileNetV2(
# Construct a neural network model input_shape=(IMG_SIZE, IMG_SIZE, 3),
model = models.Sequential() include_top=False, weights=’imagenet’, pooling=’avg’)
model.add(layers.Flatten(input_shape=(28, 28))) model.trainable = False
model.add(layers.Dense(512, activation=tf.nn.relu))
# Add a new layer for multiclass classification
model.add(layers.Dropout(0.2))
new_output = tf.keras.layers.Dense(
model.add(layers.Dense(10, activation=tf.nn.softmax))
NUMBER_OF_CLASSES_IN_DATASET, activation=’softmax’)
model.compile(optimizer=’adam’,
new_model = tf.keras.Sequential([model, new_output])
loss=’sparse_categorical_crossentropy’,
new_model.compile(
metrics=[’accuracy’])
loss=tf.keras.losses.categorical_crossentropy,
# Train and evaluate the model optimizer=tf.keras.optimizers.RMSprop(lr=1e-3),
model.fit(x_train, y_train, epochs=5) metrics=[’accuracy’])
model.evaluate(x_test, y_test)
# Train the classification layer
new_model.fit(train_batches.repeat(), epochs=10,
The Functional API enables engineers to define complex
steps_per_epoch=DATASET_SIZE // BATCH_SIZE)
topologies, including multi-input and multi-output models, as
well as advanced models with shared layers and models with
residual connections. After the execution of the given transfer learning code, you can
make MobileNetV2 layers trainable and perform fine-tuning of
from tensorflow.keras.layers import Flatten, Dense, Dropout the resulting model to achieve better results.
from tensorflow.keras.models import Model

# Loading data set must be here <...> Jupyter Notebook


inputs = tf.keras.Input(shape=(28, 28)) Jupyter Notebook is a web-based interactive computational
x = Flatten()(inputs) environment for data science and scientific computing.
x = Dense(512, activation=’relu’)(x)
x = Dropout(0.2)(x) Google Colaboratory is a free notebook environment that
predictions = Dense(10, activation=’softmax’)(x) requires no setup and runs entirely in the cloud. Use it for
model = Model(inputs=inputs, outputs=predictions) jump-starting a machine learning project.
# Compile, train and evaluate the model here <...>
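The snippet above is a single-path model; as a sketch of the residual connections mentioned earlier (layer sizes and names here are illustrative, not from the original example):

from tensorflow.keras.layers import Add, Dense, Input
from tensorflow.keras.models import Model

inputs = Input(shape=(32,))
x = Dense(32, activation='relu')(inputs)
# Residual (skip) connection: add the block input back onto its output
x = Add()([inputs, Dense(32)(x)])
outputs = Dense(10, activation='softmax')(x)
res_model = Model(inputs=inputs, outputs=outputs)
res_model.summary()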

Version 2.1 Get the latest version at www.altoros.com/visuals Order private training at www.altoros.com/training
TensorFlow v2.0 Cheat Sheet

A Reference Machine Learning Workflow tf.data.Dataset represents a sequence of elements each containing
one or more Tensor object(-s). This can be exemplified by a pair of
Here’s a conceptual diagram and a workflow example:
tensors representing an image and a corresponding class label.
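A minimal sketch of that idea with tf.data.Dataset.from_tensor_slices (the random "images" and labels below are placeholders):

import tensorflow as tf

images = tf.random.uniform((4, 8, 8, 3))  # Four tiny placeholder images
labels = tf.constant([0, 1, 0, 2])        # One class label per image
ds = tf.data.Dataset.from_tensor_slices((images, labels))
for image, label in ds.take(2):
    print(image.shape, label.numpy())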

import tensorflow as tf

DATASET_URL = "https://archive.ics.uci.edu/ml/machine-" \
"learning-databases/covtype/covtype.data.gz"
DATASET_SIZE = 387698
dataset_path = tf.keras.utils.get_file(
fname=DATASET_URL.split('/')[-1], origin=DATASET_URL)

COLUMN_NAMES = [
'Elevation', 'Aspect', 'Slope',
'Horizontal_Distance_To_Hydrology',
'Vertical_Distance_To_Hydrology',
'Horizontal_Distance_To_Roadways',
'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
'Horizontal_Distance_To_Fire_Points', 'Soil_Type',
'Cover_Type']

def _parse_line(line):

# Decode the line into values


fields = tf.io.decode_csv(
records=line, record_defaults=[0.0] * 54 + [0])

# Pack the result into a dictionary


features = dict(zip(COLUMN_NAMES,
fields[:10] + [tf.stack(fields[14:54])] + [fields[-1]]))

# Extract one-hot encoded class label from the features


class_label = tf.argmax(fields[10:14], axis=0)
return features, class_label

def csv_input_fn(csv_path, test=False,


batch_size=DATASET_SIZE // 1000):

# Create a dataset containing the csv lines


dataset = tf.data.TextLineDataset(filenames=csv_path,
compression_type='GZIP')
# Parse each line
01 Load the training data using pipelines created with tf.data. dataset = dataset.map(_parse_line)
As an input, you can use either in-memory data (NumPy), or a
# Shuffle, repeat, batch the examples for train and test
local storage, or a remote persistent storage.
dataset = dataset.shuffle(buffer_size=DATASET_SIZE,
02 Build, train, and validate a model with tf.keras, or use seed=42)
premade estimators.
TEST_SIZE = DATASET_SIZE // 10
03 Run and debug with eager execution, then use tf.function return dataset.take(TEST_SIZE).batch(TEST_SIZE) if test \
for the benefits of graphs. else dataset.skip(TEST_SIZE).repeat().batch(batch_size)

Functions from the tf.feature_column namespace are used to put raw data into a TensorFlow data set. A feature column is a high-level configuration abstraction for ingesting and representing features. It does not contain any data but tells the model how to transform the raw data so that it matches the expectation. The exact feature column to choose depends on the feature type and the model type. The continuous feature type is handled by numeric_column and can be fed directly into a neural network or a linear model.

Categorical features can be ingested by functions with the "categorical_column_" prefix, but they need to be wrapped by embedding_column or indicator_column before being fed into neural network models. For linear models, indicator_column is an internal representation when categorical columns are passed in directly.

feature_columns = [tf.feature_column.numeric_column(name)
                   for name in COLUMN_NAMES[:10]]

# Categorical column with ID
feature_columns.append(
    tf.feature_column.categorical_column_with_identity(
        'Cover_Type', num_buckets=8)
)

# Soil_Type[1-40] is a tensor of length 40
feature_columns.append(
    tf.feature_column.numeric_column('Soil_Type', shape=(40,))
)
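As a hedged sketch (not part of the original code), wrapping the categorical column for a deep model could look like the following; the linear model used below consumes the categorical column directly:

# Hypothetical feature columns for a DNN-based estimator
cover_type_col = tf.feature_column.categorical_column_with_identity(
    'Cover_Type', num_buckets=8)
dnn_feature_columns = (
    [tf.feature_column.numeric_column(name)
     for name in COLUMN_NAMES[:10]]
    + [tf.feature_column.indicator_column(cover_type_col)])
# Alternatively, a dense embedding instead of a one-hot indicator:
# tf.feature_column.embedding_column(cover_type_col, dimension=4)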

The Estimator API provides high-level encapsulation for best practices: model training, evaluation, prediction, and export for serving. The tf.estimator.Estimator subclass represents a complete model. Its object creates and manages the tf.Graph and tf.Session for you. Premade estimators include Linear Classifier, DNN Classifier, and Gradient Boosted Trees. BaselineClassifier and BaselineRegressor help to establish a simple model for a sanity check during further model development.
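As a hedged illustration of that sanity check (not part of the original workflow), a baseline that ignores the features can be trained and evaluated with the same input function:

# Predicts class priors only; a real model should beat these metrics
baseline = tf.estimator.BaselineClassifier(n_classes=4)
baseline.train(input_fn=lambda: csv_input_fn(dataset_path),
               steps=1000)
baseline.evaluate(
    input_fn=lambda: csv_input_fn(dataset_path, test=True))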
# Build, train, and evaluate the estimator
model = tf.estimator.LinearClassifier(feature_columns,
                                      n_classes=4)
model.train(input_fn=lambda: csv_input_fn(dataset_path),
            steps=10000)
model.evaluate(
    input_fn=lambda: csv_input_fn(dataset_path, test=True))

SavedModel contains a complete TF program and does not require the original model-building code to run, which makes it useful for deploying and sharing models.

# Export model to SavedModel
_builder = tf.estimator.export. \
    build_parsing_serving_input_receiver_fn
_spec_maker = tf.feature_column.make_parse_example_spec

serving_input_fn = _builder(_spec_maker(feature_columns))

export_path = model.export_saved_model(
    "/tmp/from_estimator/", serving_input_fn)

The following code sample shows how to load and use the saved model with Python.

# Import model from SavedModel
imported = tf.saved_model.load(export_path)

# Use imported model for prediction
def predict(new_object):
    example = tf.train.Example()

    # All regular continuous features
    for column in COLUMN_NAMES[:-2]:
        val = new_object[column]
        example.features.feature[column]. \
            float_list.value.extend([val])

    # One-hot encoded feature of 40 columns
    for val in new_object['Soil_Type']:
        example.features.feature['Soil_Type']. \
            float_list.value.extend([val])

    # Categorical column with ID
    example.features.feature['Cover_Type']. \
        int64_list.value.extend([new_object['Cover_Type']])

    return imported.signatures['predict'](
        examples=tf.constant([example.SerializeToString()]))

predict({
    'Elevation': 2296, 'Aspect': 312, 'Slope': 27,
    'Horizontal_Distance_To_Hydrology': 256,
    'Horizontal_Distance_To_Fire_Points': 836,
    'Horizontal_Distance_To_Roadways': 1273,
    'Vertical_Distance_To_Hydrology': 145,
    'Hillshade_9am': 136, 'Hillshade_Noon': 208,
    'Hillshade_3pm': 206,
    'Soil_Type': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
                  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                  0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'Cover_Type': 6})
