Python For Data Science Cheat Sheet: Keras

Keras is a powerful and easy-to-use deep learning library for
TensorFlow that provides a high-level neural
networks API to develop and evaluate deep learning models.

> Model Architecture

Sequential Model
>>> from tensorflow.keras.models import Sequential
>>> from tensorflow.keras.layers import Dense
>>> model = Sequential()
>>> model.add(Dense(32,
                    activation='relu',
                    input_dim=100))

Multilayer Perceptron (MLP): Binary Classification
>>> model.add(Dense(12,
                    input_dim=8,
                    kernel_initializer='uniform',
                    activation='relu'))
>>> model.add(Dense(8,kernel_initializer='uniform',activation='relu'))

Multilayer Perceptron (MLP): Multi-Class Classification
>>> from tensorflow.keras.layers import Dropout
>>> model.add(Dense(512,activation='relu',input_shape=(784,)))

Convolutional Neural Network (CNN)
>>> from tensorflow.keras.layers import Activation, Conv2D, MaxPooling2D, Flatten
>>> model2.add(Conv2D(32,(3,3),padding='same',input_shape=x_train.shape[1:]))
>>> model2.add(Activation('relu'))
>>> model2.add(Conv2D(32,(3,3)))
>>> model2.add(Activation('relu'))
>>> model2.add(MaxPooling2D(pool_size=(2,2)))
>>> model2.add(Dropout(0.25))
>>> model2.add(Conv2D(64,(3,3)))
>>> model2.add(Activation('relu'))
>>> model2.add(Flatten())
>>> model2.add(Dense(512))
>>> model2.add(Activation('relu'))
>>> model2.add(Dense(num_classes))
>>> model2.add(Activation('softmax'))

Recurrent Neural Network (RNN)
>>> from tensorflow.keras.layers import LSTM
>>> model3.add(LSTM(128,dropout=0.2,recurrent_dropout=0.2))

> Compile Model

MLP: Binary Classification
>>> model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
MLP: Multi-Class Classification
>>> model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
MLP: Regression
>>> model.compile(optimizer='rmsprop',
                  loss='mse',
                  metrics=['accuracy'])
CNN
>>> model2.compile(loss='categorical_crossentropy',
                   optimizer=opt,
                   metrics=['accuracy'])

> Model Training

>>> model.fit(data,labels,epochs=10,batch_size=32)
>>> model2.fit(x_train,
               y_train,
               batch_size=32,
               epochs=15,
               verbose=1)

> Data

Your data needs to be stored as NumPy arrays or as a list of NumPy arrays. Ideally, you split the data into training and
test sets, for which you can also resort to the train_test_split function of scikit-learn (formerly in sklearn.cross_validation,
now in sklearn.model_selection).

Keras Data Sets
>>> from tensorflow.keras.datasets import boston_housing
>>> (x_train2,y_train2),(x_test2,y_test2) = boston_housing.load_data()
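The split itself is not shown in the extracted sheet; a minimal sketch using the current sklearn.model_selection API (the array names and test_size are illustrative):

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(data,
                                                        labels,
                                                        test_size=0.33,
                                                        random_state=42)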
> Evaluate Your Model's Performance

>>> score = model3.evaluate(x_test,
                            y_test,
                            batch_size=32)

> Early Stopping

>>> from tensorflow.keras.callbacks import EarlyStopping
>>> early_stopping_monitor = EarlyStopping(patience=2)
>>> model3.fit(x_train,
               y_train,
               batch_size=32,
               epochs=15,
               callbacks=[early_stopping_monitor])

> Preprocessing

Sequence padding, train and test splits, and standardization/normalization are the usual preprocessing steps before training.

> Save/Reload Models
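The save/reload calls themselves are missing from the extract; a minimal sketch with Keras's standard API (the filename is illustrative):

>>> from tensorflow.keras.models import load_model
>>> model3.save('model_file.h5')            #Save a trained model to a single HDF5 file
>>> my_model = load_model('model_file.h5')  #Reload it later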
NumPy

The NumPy library is the core library for scientific computing in Python.
It provides a high-performance multidimensional array object, and tools for
working with these arrays.

> Creating Arrays
>>> a = np.array([1,2,3])
>>> b = np.array([(1.5,2,3), (4,5,6)], dtype = float)
>>> c = np.array([[(1.5,2,3), (4,5,6)],[(3,2,1), (4,5,6)]], dtype = float)

> Data Types
>>> np.int64        #Signed 64-bit integer types
>>> np.float32      #Standard single-precision floating point
>>> np.complex128   #Complex numbers represented by two 64-bit floats
>>> np.bool_        #Boolean type storing TRUE and FALSE values
>>> b.dtype.name    #Name of data type

> Subsetting, Slicing, Indexing
Subsetting
>>> a[2]            #Select the element at the 2nd index
3
Slicing
>>> a[0:2]          #Select items at index 0 and 1
array([1, 2])
>>> c[1,...]        #Same as [1,:,:]
Fancy Indexing
>>> b[[1, 0, 1, 0],[0, 1, 2, 0]]        #Select elements (1,0),(0,1),(1,2) and (0,0)
array([ 4. , 2. , 6. , 1.5])
>>> b[[1, 0, 1, 0]][:,[0,1,2,0]]        #Select a subset of the matrix's rows and columns
array([[ 4. , 5. , 6. , 4. ],
       [ 1.5, 2. , 3. , 1.5],
       [ 4. , 5. , 6. , 4. ],
       [ 1.5, 2. , 3. , 1.5]])

> Arithmetic Operations
>>> g = a - b       #Subtraction
array([[-0.5,  0. ,  0. ],
       [-3. , -3. , -3. ]])
>>> b + a           #Addition
array([[ 2.5, 4. , 6. ],
       [ 5. , 7. , 9. ]])
>>> a / b           #Division
array([[ 0.66666667, 1. , 1. ],
       [ 0.25      , 0.4, 0.5]])
>>> a * b           #Multiplication
array([[ 1.5,  4. ,  9. ],
       [ 4. , 10. , 18. ]])
>>> np.sqrt(b)      #Square root
>>> np.sin(a)       #Element-wise sine
>>> np.cos(b)       #Element-wise cosine
>>> np.log(a)       #Element-wise natural logarithm

> Aggregate Functions
>>> a.mean()            #Mean
>>> np.median(b)        #Median
>>> b.cumsum(axis=1)    #Cumulative sum of the elements

> Array Manipulation
Changing Array Shape
>>> b.ravel()           #Flatten the array
>>> i = np.transpose(b) #Permute array dimensions
>>> i.T                 #Permute array dimensions
>>> np.c_[a,d]          #Create stacked column-wise arrays

> Saving & Loading
>>> np.save('my_array', a)
>>> np.savez('array.npz', a, b)
>>> np.load('my_array.npy')
>>> np.loadtxt("myfile.txt")
Pandas: Data Wrangling

> Combine Data Sets
>>> pd.merge(data1,
             data2,
             how='left',
             on='X1')     #Join matching rows from data2 to data1
>>> pd.merge(data1,
             data2,
             how='inner',
             on='X1')     #Keep only rows with keys present in both sets
>>> pd.merge(data1,
             data2,
             how='outer',
             on='X1')     #Keep all rows from both sets

> Renaming
>>> df.rename(columns={"Country":"cntry",
                       "Capital":"cptl",
                       "Population":"ppltn"})

> Reindexing
>>> s2 = s.reindex(['a','c','d','e','b'])
>>> df.reindex(range(4), method='ffill')   #Forward fill new index positions
>>> s3 = s.reindex(range(5), method='bfill')   #Backward fill new index positions

> Pivot Table
>>> df4 = pd.pivot_table(df2,
                         values='Value',
                         index='Date',
                         columns='Type')   #Spread rows into columns

> Melting and Method Chaining
>>> pd.melt(df2,                    #Gather columns into rows
            id_vars=["Date"],
            value_vars=["Type", "Value"])
>>> df = (pd.melt(df)
            .rename(columns={'variable':'var',
                             'value':'val'})
            .query('val >= 200'))

> Grouping
>>> df4.groupby(level=0).sum()

> MultiIndexing
>>> arrays = [np.array([1,2,3]),
              np.array([5,4,3])]
>>> df5 = pd.DataFrame(np.random.rand(3, 2), index=arrays)

> Dates
>>> df2['Date'] = pd.to_datetime(df2['Date'])

> Logic in Python
==   Equals                    pd.isnull(obj)              Is NaN
<=   Less than or equals       pd.notnull(obj)             Is not NaN
>=   Greater than or equals    &,|,~,^,df.any(),df.all()   Logical and, or, not, xor, any, all

> Regex (string matching)
'^Sepal'            Matches strings beginning with the word 'Sepal'
'^x[1-5]$'          Matches strings beginning with 'x' and ending with 1,2,3,4,5
'^(?!Species$).*'   Matches strings except the string 'Species'
Cheat sheet for pandas (http://pandas.pydata.org/), originally written by Irv Lustig, Princeton Consultants, inspired by the RStudio Data Wrangling Cheatsheet.
> Summarize Data
df['w'].value_counts()   #Count number of rows with each unique value of variable
len(df)                  #Number of rows in DataFrame
df.shape                 #Tuple of (number of rows, number of columns) in DataFrame
df['w'].nunique()        #Number of distinct values in a column

> Handling Missing Data
df.dropna()              #Drop rows with any column having NA/null data
df.fillna(value)         #Replace all NA/null data with value

> Combine Data Sets (Standard Joins)
adf               bdf
x1  x2            x1  x3
A   1             A   T
B   2             B   F
C   3             D   T
Pandas
Learn Pandas Basics online at www.DataCamp.com

The pandas library is built on NumPy and provides easy-to-use data structures and data analysis tools for Python.
Use the following import convention:
>>> import pandas as pd

> Pandas Data Structures

Series
A one-dimensional labeled array capable of holding any data type.
>>> s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])

DataFrame
A two-dimensional labeled data structure with columns of potentially different types.
>>> data = {'Country': ['Belgium', 'India', 'Brazil'],
            'Capital': ['Brussels', 'New Delhi', 'Brasília'],
            'Population': [11190846, 1303171035, 207847528]}
>>> df = pd.DataFrame(data,
                      columns=['Country', 'Capital', 'Population'])

> Selection                                   Also see NumPy Arrays

Getting
>>> s['b']          #Get one element
-5
>>> df[1:]          #Get subset of a DataFrame
By Position
>>> df.iat[0, 0]    #Select single value by row and column position
By Label
>>> df.at[0, 'Country']   #Select single value by row and column label
By Label/Position
>>> df.iloc[2]      #Select a single row (replaces the deprecated df.ix[2])
Country         Brazil
Capital       Brasília
Population   207847528
Boolean Indexing
>>> s[~(s > 1)]              #Series s where value is not >1
>>> s[(s < -1) | (s > 2)]    #s where value is <-1 or >2

> Dropping
>>> s.drop(['a', 'c'])           #Drop values from rows (axis=0)
>>> df.drop('Country', axis=1)   #Drop values from columns (axis=1)

> Applying Functions
>>> f = lambda x: x*2
>>> df.apply(f)     #Apply function

> Data Alignment

Internal Data Alignment
NA values are introduced in the indices that don't overlap:
>>> s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
>>> s + s3
a    10.0
b     NaN
c     5.0
d     7.0
You can also do the internal data alignment yourself with
the help of the fill methods:
>>> s.add(s3, fill_value=0)
a    10.0
b    -5.0
c     5.0
d     7.0
>>> s.div(s3, fill_value=4)

> Retrieving Series/DataFrame Information
>>> df.shape      #(rows, columns)
>>> df.count()    #Number of non-NA values
>>> df.cumsum()   #Cumulative sum of values

> Asking For Help
>>> help(pd.Series.loc)

> I/O

Read and Write to CSV
>>> df.to_csv('myDataFrame.csv')
Read and Write to Excel
>>> pd.read_excel('file.xlsx')
Read and Write to SQL Query or Database Table
>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite:///:memory:')
>>> pd.read_sql_table('my_table', engine)
>>> pd.read_sql_query("SELECT * FROM my_table;", engine)
>>> df.to_sql('myDf', engine)
read_sql() is a convenience wrapper around read_sql_table() and read_sql_query().
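For instance, read_sql() accepts either a table name or a SQL query and dispatches to the right function (my_table and engine here are the ones created above):

>>> pd.read_sql("my_table", engine)                   #Same as read_sql_table()
>>> pd.read_sql("SELECT * FROM my_table;", engine)    #Same as read_sql_query()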
Importing Data Cheat Sheet

> Text Files
>>> filename = 'titanic.csv'
>>> file = open(filename, mode='r')   #Open the file for reading
>>> print(file.readline())            #Read a single line
>>> print(file.closed)                #Check whether file is closed
Using the context manager with
>>> with open(filename, 'r') as file:
        print(file.readline())

> Importing Flat Files with NumPy
>>> data_array = np.genfromtxt(filename,
                               delimiter=',',
                               dtype=None)
>>> data_array.dtype          #Data type of array elements

> Importing Flat Files with Pandas
>>> df = pd.read_csv(filename, names=['Country'])
>>> help(pd.read_csv)

> Pickled Files
>>> import pickle
>>> with open('pickled_fruit.pkl', 'rb') as file:
        pickled_data = pickle.load(file)

> Excel Spreadsheets
>>> data = pd.ExcelFile('urbanpop.xlsx')
To access the sheet names, use the sheet_names attribute:
>>> data.sheet_names

> SAS Files
>>> df_sas = file.to_data_frame()

> Stata Files
>>> data = pd.read_stata('urbanpop.dta')

> MATLAB Files
>>> print(mat.keys())         #Print dictionary keys

> HDF5 Files / Exploring Dictionaries
>>> for key in data.keys():   #Print dictionary keys
        print(key)
meta
quality
>>> for key in data['meta'].keys():   #Explore the keys of a group
        print(key)
Detector
Duration
GPSstart
Observatory
UTCstart
>>> print(data['meta']['Description'].value)

> Querying Relational Databases
Using the context manager with
>>> with engine.connect() as con:
        rs = con.execute("SELECT * FROM my_table")
        df = pd.DataFrame(rs.fetchmany(size=5))
        con.close()
Querying relational databases with pandas
>>> df = pd.read_sql_query("SELECT * FROM my_table", engine)

> Exploring Your Working Directory (OS Library)
!ls                        #List directory contents of files and directories
>>> import os
>>> os.remove("test1.txt") #Delete an existing file

> Asking For Help
>>> np.info(np.ndarray.dtype)
"""
Learn Python online at www.DataCamp.com List functions and methods A Frame of Data
> How to use this cheat sheet reversed(x) # Reverse the order of elements in x e.g., [2,3,1]
"""
Python is the most popular programming language in data science. It is easy to learn and comes with a wide array of str[0:2] # Get a substring from starting to ending index (exclusive)
powerful libraries for data analysis. This cheat sheet provides beginners and intermediate users a guide to starting
using python. Use it to jump-start your journey with python. If you want more detailed Python cheat sheets, check out Selecting list elements
the following cheat sheets below:
Combining and splitting strings
Python lists are zero-indexed (the first element has index 0). For ranges, the first element is included but the last is not.
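A quick illustration of both rules, using the list x defined above:

x[0]   # Returns 1: the first element has index 0
x[0:2] # Returns [1, 3]: the end index 2 is excluded
x[-1]  # Returns 6: negative indices count from the end of the list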
Concatenating lists
3 * x # Returns [1, 3, 6, 1, 3, 6, 1, 3, 6]

Selecting list elements
reversed(x) # Reverse the order of elements in x, e.g. [6, 3, 1]

> Getting started with strings

str = "Jack and Jill" # Define str
str.lower() # Convert a string to lowercase, returns 'jack and jill'
str[0:2] # Get a substring from starting to ending index (exclusive), returns 'Ja'

> Getting started with dictionaries

A dictionary stores data values in key-value pairs. That is, unlike lists which are indexed by position, dictionaries are indexed
by their keys, the names of which must be unique.
Creating dictionaries
{'a': 1, 'b': 4, 'c': 9} # Create a dictionary with {}
x.values() # Get the values of a dictionary x, returns e.g. dict_values([1, 2, 3])

> Getting started with DataFrames

Pandas is a fast and powerful package for data analysis and manipulation in Python. To import the package, you can
use import pandas as pd. A pandas DataFrame is a structure that contains two-dimensional data stored as rows and
columns. A pandas Series is a structure that contains one-dimensional data.

Creating DataFrames
pd.DataFrame([        # Create a DataFrame from a list of dictionaries
    {'a': 1, 'b': 4},
    {'a': 2, 'b': 8}
])

Selecting data
df['col']              # Select one column
df[['col1', 'col2']]   # Select multiple columns
df.iloc[:, 2]          # Select the third column by position
df.iloc[3, 2]          # Select a single value by row and column position

Manipulating DataFrames
pd.concat([df, df])             # Concatenate DataFrames
df.sort_values(by='col_name')   # Sort rows by a column
df.nlargest(n, 'col_name')      # Get the n largest values of a column
df.mean()                       # Column means

> Getting started with NumPy

NumPy is a Python package for scientific computing. It provides multidimensional array objects and efficient operations
on them. To import NumPy, you can run this Python code: import numpy as np

np.repeat([1, 3, 6], 3) # Returns array([1, 1, 1, 3, 3, 3, 6, 6, 6])
np.mean(x) # Calculate mean
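As a small illustration of "efficient operations on them", arithmetic on a NumPy array applies element-wise without an explicit loop (the values are illustrative):

import numpy as np

arr = np.array([1, 3, 6])
arr * 2       # Returns array([ 2,  6, 12])
arr + arr     # Returns array([ 2,  6, 12])
np.mean(arr)  # Returns 3.3333333333333335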
Python For Data Science Cheat Sheet: Python Basics
Learn More Python for Data Science Interactively at www.datacamp.com

> Variables and Data Types

Variable Assignment
>>> x=5
>>> x
5

Calculations With Variables
>>> x+2            #Sum of two variables
7
>>> x-2            #Subtraction of two variables
3
>>> x*2            #Multiplication of two variables
10
>>> x**2           #Exponentiation of a variable
25
>>> x%2            #Remainder of a variable
1
>>> x/float(2)     #Division of a variable
2.5

Types and Type Conversion
str()    '5', '3.45', 'True'    #Variables to strings
int()    5, 3, 1                #Variables to integers
float()  5.0, 1.0               #Variables to floats
bool()   True, True, True       #Variables to booleans

> Asking For Help
>>> help(str)

> Lists                                        Also see NumPy Arrays
>>> a = 'is'
>>> b = 'nice'
>>> my_list = ['my', 'list', a, b]
>>> my_list2 = [[4,5,6,7], [3,4,5,6]]

Selecting List Elements                        Index starts at 0
Subset
>>> my_list[1]        #Select item at index 1
>>> my_list[-3]       #Select 3rd last item
Slice
>>> my_list[1:3]      #Select items at index 1 and 2
>>> my_list[1:]       #Select items after index 0
>>> my_list[:3]       #Select items before index 3
>>> my_list[:]        #Copy my_list
Subset Lists of Lists
>>> my_list2[1][0]    #my_list[list][itemOfList]
3
>>> my_list2[1][:2]

List Operations
>>> my_list + my_list
['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice']
>>> my_list * 2
['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice']
>>> my_list2 > 4      #(Python 2 only; comparing a list to an int raises TypeError in Python 3)
True

List Methods
>>> my_list.index(a)       #Get the index of an item
>>> my_list.count(a)       #Count an item
>>> my_list.append('!')    #Append an item at a time
>>> my_list.remove('!')    #Remove an item
>>> del(my_list[0:1])      #Remove an item
>>> my_list.reverse()      #Reverse the list
>>> my_list.extend('!')    #Append an item
>>> my_list.pop(-1)        #Remove an item
>>> my_list.insert(0,'!')  #Insert an item
>>> my_list.sort()         #Sort the list

> Strings
>>> my_string = 'thisStringIsAwesome'
>>> my_string
'thisStringIsAwesome'

String Operations
>>> my_string * 2
'thisStringIsAwesomethisStringIsAwesome'
>>> my_string + 'Innit'
'thisStringIsAwesomeInnit'
>>> 'm' in my_string
True

String Subsetting                              Index starts at 0
>>> my_string[3]
>>> my_string[4:9]

String Methods
>>> my_string.upper()            #String to uppercase
>>> my_string.lower()            #String to lowercase
>>> my_string.count('w')         #Count String elements
>>> my_string.replace('e', 'i')  #Replace String elements
>>> my_string.strip()            #Strip whitespaces

> Libraries

Import libraries
>>> import numpy
>>> import numpy as np
Selective import
>>> from math import pi

Data analysis (pandas) · Machine learning (scikit-learn) · Scientific computing (NumPy, SciPy) · 2D plotting (Matplotlib)

Install Python
Anaconda: leading open data science platform powered by Python; Spyder: free IDE that is included with Anaconda;
Jupyter: create and share documents with live code, visualizations, text, ...

> NumPy Arrays                                 Also see Lists
>>> my_list = [1, 2, 3, 4]
>>> my_array = np.array(my_list)
>>> my_2darray = np.array([[1,2,3],[4,5,6]])

Selecting NumPy Array Elements                 Index starts at 0
Subset
>>> my_array[1]       #Select item at index 1
2
Slice
>>> my_array[0:2]     #Select items at index 0 and 1
array([1, 2])
Subset 2D NumPy arrays
>>> my_2darray[:,0]   #my_2darray[rows, columns]
array([1, 4])

NumPy Array Operations
>>> my_array > 3
array([False, False, False, True], dtype=bool)
>>> my_array * 2
array([2, 4, 6, 8])
>>> my_array + np.array([5, 6, 7, 8])
array([6, 8, 10, 12])

NumPy Array Functions
>>> my_array.shape                     #Get the dimensions of the array
>>> np.append(my_array, other_array)   #Append items to an array
>>> np.insert(my_array, 1, 5)          #Insert items in an array
>>> np.delete(my_array,[1])            #Delete items in an array
>>> np.mean(my_array)                  #Mean of the array
>>> np.median(my_array)                #Median of the array
>>> np.corrcoef(my_array)              #Correlation coefficient
>>> np.std(my_array)                   #Standard deviation

DataCamp: Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet: Jupyter Notebook
Learn More Python for Data Science Interactively at www.DataCamp.com

> Working with Different Programming Languages

Kernels provide computation and communication with front-end interfaces
like the notebooks. There are three main kernels: IPython, IRkernel, and IJulia.
Installing Jupyter Notebook will automatically install the IPython kernel.

> Widgets

Notebook widgets provide the ability to visualize and control changes
in your data, often as a control like a slider, textbox, etc.
You can use them to build interactive GUIs for your notebooks or to
synchronize stateful and stateless information between Python and JavaScript.
Widget-related toolbar actions:
- Download serialized state of all widget models in use
- Save notebook with interactive widgets
- Embed current widgets

> Saving/Loading Notebooks

- Create new notebook
- Open an existing notebook
- Make a copy of the current notebook
- Rename notebook
- Save current notebook and record checkpoint
- Revert notebook to a previous checkpoint
- Preview of the printed notebook
- Download notebook as: IPython notebook, Python, HTML, Markdown, reST, LaTeX, or PDF
- Close notebook & stop running any scripts

> Kernel Actions

- Restart kernel
- Restart kernel & run all cells
- Restart kernel & clear all output
- Interrupt kernel
- Connect back to a remote notebook
- Run other installed kernels

Command Mode: keyboard-shortcut diagram (items 1-15) not shown.
Seaborn

> Data                                         Also see Lists, NumPy & Pandas
>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> uniform_data = np.random.rand(10, 12)
>>> data = pd.DataFrame({'x':np.arange(1,101),
                         'y':np.random.normal(0,4,100)})
Seaborn also offers built-in data sets:
>>> titanic = sns.load_dataset("titanic")
>>> iris = sns.load_dataset("iris")

> Categorical Plots
Boxplot
>>> sns.boxplot(x="alive",                 #Boxplot
                y="age",
                hue="adult_male",
                data=titanic)
>>> sns.boxplot(data=iris,orient="h")      #Boxplot with wide-form data
Violinplot
>>> sns.violinplot(x="age",                #Violin plot
                   y="sex",
                   hue="survived",
                   data=titanic)

> Further Customizations                       Also see Matplotlib
>>> plt.title("A Title")        #Add plot title
>>> plt.ylabel("Survived")      #Adjust the label of the y-axis
>>> plt.xlabel("Sex")           #Adjust the label of the x-axis
>>> plt.ylim(0,100)             #Adjust the limits of the y-axis
>>> plt.xlim(0,10)              #Adjust the limits of the x-axis
>>> plt.setp(ax,yticks=[0,5])   #Adjust a plot property
>>> plt.tight_layout()          #Adjust subplot params
Bokeh

> Plotting With Bokeh

The basic steps to creating plots with the bokeh.plotting interface are:
1. Prepare the data: Python lists, NumPy arrays, Pandas DataFrames and other sequences of values
2. Create a new plot
3. Add renderers for your data, with visual customizations
4. Specify where to generate the output
5. Show or save the results

>>> from bokeh.plotting import figure
>>> from bokeh.io import output_file, show
>>> x = [1, 2, 3, 4, 5]                        #Step 1
>>> y = [6, 7, 2, 4, 5]
>>> p = figure(title="simple line example",    #Step 2
               x_axis_label='x',
               y_axis_label='y')
>>> p.line(x, y, legend="Temp.", line_width=2) #Step 3
>>> output_file("lines.html")                  #Step 4
>>> show(p)                                    #Step 5

> Data                                         Also see Lists, NumPy & Pandas

Under the hood, your data is converted to Column Data
Sources. You can also do this manually:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.array([[33.9,4,65, 'US'],
                                [32.4,4,66, 'Asia'],
                                [21.4,4,109, 'Europe']]),
                      columns=['mpg','cyl', 'hp', 'origin'],
                      index=['Toyota', 'Fiat', 'Volvo'])
>>> from bokeh.models import ColumnDataSource
>>> cds_df = ColumnDataSource(df)

> Renderers & Visual Customizations

>>> from bokeh.models import CategoricalColorMapper
>>> color_mapper = CategoricalColorMapper(
        factors=['US', 'Asia', 'Europe'],
        palette=['blue', 'red', 'green'])
>>> p3.circle('mpg', 'cyl', source=cds_df,
              color=dict(field='origin',
                         transform=color_mapper),
              legend='Origin')

Legend Location
Inside Plot Area
>>> p.legend.location = 'bottom_left'
Outside Plot Area
>>> from bokeh.models import Legend
>>> r1 = p2.asterisk(np.array([1,2,3]), np.array([3,2,1]))
>>> r2 = p2.line([1,2,3,4], [3,4,5,6])
>>> legend = Legend(items=[("One",[p1, r1]),("Two",[r2])],
                    location=(0, -30))
>>> p.add_layout(legend, 'right')

Legend Orientation
>>> p.legend.orientation = "horizontal"
>>> p.legend.orientation = "vertical"

Legend Background & Border
>>> p.legend.border_line_color = "navy"
>>> p.legend.background_fill_color = "white"

Rows & Columns Layout
Rows
>>> from bokeh.layouts import row

> Output & Export

Notebook
>>> from bokeh.io import output_notebook, show
>>> output_notebook()

HTML
Standalone HTML
>>> from bokeh.embed import file_html
>>> from bokeh.resources import CDN
>>> html = file_html(p, CDN, "my_plot")
>>> from bokeh.io import output_file, show
>>> output_file('my_bar_chart.html', mode='cdn')

Components
>>> from bokeh.embed import components
>>> script, div = components(p)

PNG
>>> from bokeh.io import export_png
>>> export_png(p, filename="plot.png")

SVG
>>> from bokeh.io import export_svgs
>>> p.output_backend = "svg"
>>> export_svgs(p, filename="plot.svg")
Regular Expressions Cheat Sheet
Learn regular expressions online at www.DataCamp.com

Regular expression (regex or regexp) is a pattern of characters that describes an amount of text. To process
regexes, you will use a "regex engine." Each of these engines uses slightly different syntax, called a regex
flavor. A list of popular engines can be found here. Two common programming languages we discuss on
DataCamp are Python and R, which each have their own engines.

Since regex describes patterns of text, it can be used to check for the existence of patterns in a text,
extract substrings from longer strings, and help make adjustments to text. Regex can be very simple to
describe specific words, or it can be more advanced to find vague patterns of characters like the top-level
domain in a url.

> Definitions

Literal character: A literal character is the most basic regular expression you can use. It simply matches
the actual character you write. So if you are trying to represent an "r," you would write r.

Metacharacter: Metacharacters signify to the regex engine that the following character has a special
meaning. You typically include a \ in front of the metacharacter and they can do things like signify the
beginning or end of a line, a digit, or a word character.

Character class: Character classes are sets or ranges of characters. A character class is signified by [ and ]
with the characters you are looking for in the middle of the brackets.

Capture group: A capture group is signified by opening and closing, round parentheses. They allow you to
group regexes together to apply other regex features like quantifiers (see below) to the group.
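The cheat sheet itself is language-agnostic; as a minimal illustrative sketch, here is how those three uses (checking, extracting, adjusting) look with Python's built-in re module (the example strings are made up):

import re

text = "Contact us at support@example.com or sales@example.org"

re.search(r"@\w+\.\w+", text) is not None   # Check that a pattern exists: True
re.findall(r"\w+@\w+\.\w+", text)           # Extract substrings: ['support@example.com', 'sales@example.org']
re.sub(r"@\w+\.\w+", "@redacted", text)     # Adjust text: 'Contact us at support@redacted or sales@redacted'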
> Anchors

Syntax   Description                                       Example     Matches            Non-matches
^        match start of line                               ^r          rabbit, raccoon    parrot, ferret
$        match end of line                                 t$          rabbit, foot       trap, star
\A       match start of line                               \Ar         rabbit, raccoon    parrot, ferret
\Z       match end of line                                 sleep\Z     eat and sleep
\b       match characters at the start or end of a word    \bfox\b     the fox ate        foxskin scarf
\B       match characters in the middle of other
         non-space characters                              \Bee\B      trees, beef

> Character classes

Character classes are sets or ranges of characters. Rather than matching specific characters, you can match
specific types of characters such as letters, numbers, and more.

Syntax   Description                              Example     Matches            Non-matches
[xy]     match several characters                 gr[ea]y     gray, grey         green, greek
[x-y]    match a range of characters              [a-e]       amber, brand       fox, join
[^xy]    does not match several characters        gr[^ea]y                       green, grey, gray, greek
\w       match word characters                    \wee\w      trees, bee4        The bee, eels eat meat
\W       match non-word characters
\d       match a digit                            \d          6060-842           two
\D       match a non-digit                        \D          The 5 cats ate     52
\s       match whitespace                         \sfox\s     the fox ate        it's the fox., his fox ran, foxfur
\S       match non-whitespace
.        anything except for a linebreak          c.e         clean, cheap       acert, cent
\meta    escape a metacharacter to match          \.          The cat ate.       the cat ate
         on the metacharacter itself              \^          2^3                23
(x|y)    match several alternative patterns

> Repetition

Rather than matching single instances of characters, you can match repeated characters.

Syntax     Description                                Example     Matches            Non-matches
x*         match zero or more times
x+         match one or more times                    re+         green, tree        trap, ruined
x?         match zero or one times                    ro?a        roast, rant        root, rear
x{m}       match m times                              \we{2}\w    deer, seer         red, second
x{m,}      match m or more times
x{m,n}     match between m and n times
x*?, x+?   match the minimum number of times
           possible, known as a lazy quantifier

> Literal matches and modifiers

Modifiers are settings that change the way the matching rules work.

Syntax           Description                               Example          Matches          Non-matches
(?i) ... (?-i)   case-insensitive mode                     (?i)te(?-i)      sTep, tEach      Trench, bear
(?x) ... (?-x)   ignore whitespace in the pattern          (?x)t a p(?-x)   tap, tapdance    c a t
(?s) ... (?-s)   DOTALL mode, which makes the "."
                 also match linebreaks

> Lookahead

You can specify that specific characters must appear before or after you match, without including those
characters in the match.

Syntax    Description                                      Example      Matches             Non-matches
(?=x)     looks ahead at the next characters               an(?=an)     banana              band
          without using them in the match                  iss(?=ipp)   Mississippi         missed
(?!x)     looks ahead at the next characters               ai(?!n)      fail, brail         faint, train
          to not match on them
(?<=x)    looks at previous characters for a match         (?<=tr)a     trail, translate    bear, streak
          without using those in the match
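A short hypothetical Python example of the lookarounds above (the strings are made up):

import re

re.findall(r"an(?=an)", "banana")      # ['an']: the 'an' that is followed by another 'an'
re.findall(r"ai(?!n)", "fail faint")   # ['ai']: only the 'ai' in 'fail', not the one before 'n'
re.findall(r"(?<=tr)a", "trail bear")  # ['a']: only the 'a' preceded by 'tr'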
> Unicode

Graphemes: Is either a codepoint or a character. All characters are made up of one or more graphemes
in a sequence.

Syntax    Description
\X        match graphemes, including ones made up of several codepoints,
          like ones with an accent (e.g. \u0065\u0300, an "e" plus a combining grave accent)
\u0000    match a character by its unicode codepoint

> Capture groups

In order to extract specific parts of a string, you can capture those parts, and even name the parts that you
captured.

Syntax        Description                          Example                        Matches
( x )         capture a pattern                    (iss)+                         Mississippi (non-matches: mist, missed, persist)
(?: x)        create a group without capturing     (?:ab)(cd)                     abcd, with Group 1: cd
(?<name> x)   create a named capture group         (?<first>\d)(?<second>\d)\d*   1325, with first: 1 and second: 3

Learn Data Skills Online at www.DataCamp.com
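As a sketch in Python, named groups can be read back by name; note that Python's re module spells the group as (?P<name>...) rather than (?<name>...), and the input string below follows the table's example:

import re

m = re.match(r"(?P<first>\d)(?P<second>\d)\d*", "1325")
m.group(0)         # '1325'
m.group("first")   # '1'
m.group("second")  # '3'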
SciPy: Linear Algebra

The SciPy library is one of the core packages for scientific computing that provides mathematical
algorithms and convenience functions built on the NumPy extension of Python.

> Interacting With NumPy                       Also see NumPy
>>> import numpy as np
>>> from scipy import linalg, sparse
>>> a = np.array([1,2,3])
>>> b.flatten()                     #Flatten the array
>>> np.real_if_close(c,tol=1000)    #Return a real array if complex parts close to 0
>>> g = np.linspace(0,np.pi,num=5)  #Create an array of evenly spaced values (number of samples)
>>> g[3:] += np.pi
>>> np.unwrap(g)                    #Unwrap
>>> np.select([c<4],[c*2])          #Return values from a list of arrays depending on conditions
>>> F = np.eye(3, k=1)              #Create a 3x3 matrix with ones on the first superdiagonal
>>> p = np.poly1d([3,4,5])          #Create a polynomial object

Vectorizing Functions
>>> def myfunc(a):
        if a < 0:
            return a*2
        else:
            return a/2
>>> np.vectorize(myfunc)            #Vectorize the function

> Creating Matrices
>>> A = np.matrix(np.random.random((2,2)))
>>> B = np.asmatrix(b)
>>> C = np.mat(np.random.random((10,5)))

> Basic Matrix Routines
>>> np.divide(A,D)                  #Division
>>> np.trace(A)                     #Trace
>>> linalg.norm(A,1)                #L1 norm (max column sum)
>>> np.linalg.matrix_rank(C)        #Matrix rank

Matrix Functions
>>> linalg.expm(A)                  #Matrix exponential
>>> linalg.cosm(D)                  #Matrix cosine

Sparse Matrix Routines
>>> sparse.linalg.inv(I)            #Inverse
>>> sparse.linalg.norm(I)           #Norm

> Decompositions
Singular Value Decomposition
>>> U,s,Vh = linalg.svd(B)          #Singular Value Decomposition (SVD)
>>> Sig = linalg.diagsvd(s,M,N)     #Construct sigma matrix in SVD
LU Decomposition
>>> P,L,U = linalg.lu(C)            #LU Decomposition

Solving linear problems (see the sketch at the end of this section)

> Asking For Help
>>> help(scipy.linalg.diagsvd)

Learn Data Skills Online at www.DataCamp.com
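A small end-to-end sketch tying a few of the calls above together for solving a linear system (the matrix and vector values are made up):

>>> import numpy as np
>>> from scipy import linalg
>>> A = np.array([[3., 1.], [1., 2.]])
>>> b = np.array([9., 8.])
>>> x = linalg.solve(A, b)    #Solve A @ x = b for x
>>> np.allclose(A @ x, b)     #True: the solution satisfies the system
>>> linalg.inv(A) @ b         #Same result via the explicit inverse (slower, less accurate)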
row="sex")
>>> g = g.map(plt.hist,"age")
y="sepal_length",
data=iris,
ax=ax)
>>> sns.factorplot(x="pclass", #Draw a categorical plot onto a Facetgrid
y="survived",
>>> sns.lmplot(x="sepal_width", #Plot data and regression model fits across a FacetGrid
y="sepal_length",
>>> plot = sns.distplot(data.y, #Plot univariate distribution
hue="species",
kde=False,
data=iris)
color="b")
>>> h = sns.PairGrid(iris) #Subplot grid for plotting pairwise relationships
y="y",
The Python visualization library Seaborn is based on matplotlib and provides data=data)
Categorical Plots
"sepal_width",
y="petal_length",
data=iris)
Bar Chart
4. Further customize your plot
Axisgrid Objects >>> sns.barplot(x="sex", #Show point estimates & confidence intervals with scatterplot glyphs
hue="class",
y="total_bill",
>>> h.set(xlim=(0,5), #Set the limit and ticks of the x-and y-axis
palette="Greens_d")
data=tips,
ylim=(0,5),
aspect=2)
xticks=[0,2.5,5],
Point Plot
>>> g = (g.set_axis_labels("Tip","Total bill(USD)").
yticks=[0,2.5,5])
>>> sns.pointplot(x="class", #Show point estimates & confidence intervals as rectangular bars
set(xlim=(0,10),ylim=(0,100)))
y="survived",
data=titanic,
palette={"male":"g",
Boxplot
linestyles=["-","--"])
data=titanic)
2 Figure Aesthetics Also see Matplotlib 5 Show or Save Plot Also see Matplotlib
transparent=True)
"ytick.major.size":8})
Color Palette
#Return a dict of params or use with with to temporarily set the style
>>> plt.cla() #Clear an axis
Pandas: String Manipulation
Learn Python online at www.DataCamp.com

Throughout this cheat sheet, we'll be using two pandas series named suits and
rock_paper_scissors.

import pandas as pd
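The series definitions themselves are missing from the extract. A reconstruction of suits that is consistent with the outputs shown below (string lengths 5 8 6 6, the splits on "a", the counts and find positions) would be the following; rock_paper_scissors is not recoverable, so its values here are purely hypothetical:

suits = pd.Series(["clubs", "Diamonds", "hearts", "Spades"])
rock_paper_scissors = pd.Series(["rock", "paper", "scissors"])  # hypothetical values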
> Inspecting and splitting strings

# Get the length of each string with .str.len()
suits.str.len() # Returns 5 8 6 6

# Split strings on a delimiter with .str.split()
suits.str.split(pat="")   # Split each string into its individual characters
suits.str.split(pat = "a")  # Split each string on the letter "a"
# 0        [clubs]
# 1    [Di, monds]
# 2      [he, rts]
# 3      [Sp, des]

> Matching and extracting substrings

# Count the number of matches with .str.count()
suits.str.count("[ae]") # 0 1 2 2

# Locate the position of substrings with .str.find()
suits.str.find("e") # -1 -1 1 4

# Extract capture groups with .str.extractall()
suits.str.extractall("([ae])(.)")
#          0  1
#   match
# 1 0      a  m
# 2 0      e  a
# ...

> Mutating and combining strings

# Convert to uppercase with .str.upper()
suits.str.upper()
# Combine two strings with +

> Formatting numbers for display

df.style.format(precision = 1)
#      x
# 0  0.1
# 1  4.6
# 2  8.9

www.DataCamp.com
Dates and Times in Python
Learn Python online at www.DataCamp.com

> Getting started

import datetime as dt
import time as tm
import pytz
import pandas as pd

In this cheat sheet, we will be using 3 pandas series — iso, us, non_us — that hold the same
datetimes written as strings in ISO 8601, US, and non-US formats.

# Get today's date and the current datetime
dt.date.today()
dt.datetime.now()

# Datetime arithmetic with timedeltas
now = dt.datetime.now()
now - then                    # Subtract two datetimes (then is an earlier datetime)
(now - then).total_seconds()  # Difference in seconds
dt.datetime(2022,8,5,11,13,50) + dt.timedelta(days=1)
pd.Timedelta(7, "d")

> Key definitions

When working with dates and times, you will encounter technical terms and jargon such as the following:
POSIXct: Handles date & time in calendar time
Hms: Parses periods with hour, minute, and second
Timestamp: Represents a single pandas date & time

> The ISO 8601 datetime format

The ISO 8601 datetime format specifies datetimes from the largest to the
smallest unit of time (YYYY-MM-DD HH:MM:SS TZ). Some of the advantages of this format:
- It avoids ambiguities between MM/DD/YYYY and DD/MM/YYYY formats
- The 4-digit year representation mitigates overflow problems after the year 2099
- Using numeric month values (08 not AUG) makes it language independent, so
  dates make sense throughout the world
- Python is optimized for this format since it makes comparison and sorting easier.
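As an illustration of the MM/DD/YYYY vs DD/MM/YYYY ambiguity (the date string is made up), pandas parses the same text differently depending on the dayfirst flag, while ISO 8601 input is unambiguous:

pd.to_datetime("03/04/2022")                 # Timestamp('2022-03-04 00:00:00'), read as March 4 (US style)
pd.to_datetime("03/04/2022", dayfirst=True)  # Timestamp('2022-04-03 00:00:00'), read as 3 April
pd.to_datetime("2022-04-03")                 # ISO 8601: always 3 April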
> Parsing dates, datetimes, and times

# Parse ISO format dates
dttm = pd.to_datetime(iso)
pd.to_datetime(iso, infer_datetime_format=True)
# Parse dates in US format, in a single, specified format
pd.to_datetime(us, format="%m/%d/%Y %H:%M:%S")
# Parse dates in NON US format (day first)
pd.to_datetime(non_us, dayfirst=True)

> Extracting components

dttm = pd.to_datetime(iso)
dttm.dt.year          # Year part of each datetime
dttm.dt.day_of_year   # Day of the year
# Get month name from datetime pandas series
dttm.dt.month_name()
# Get day name from datetime pandas series
dttm.dt.day_name()
# Convert to Python datetime objects
dttm.dt.to_pydatetime()
# Year, month, and day parts of the iso series
# 1969  7 20
# 1969 11 19
# 1971  2  5

> Rounding dates

dttm.dt.round('1min')   # Rounding dates to nearest time unit
dttm.dt.floor('1min')   # Flooring dates to nearest time unit
dttm.dt.ceil('1min')    # Ceiling dates to nearest time unit

> Time Zones

# Localize to a time zone, then convert to another
dttm.dt.tz_localize('+0100')
dttm.dt.tz_localize('+0100').tz_convert('US/Central')

> Intervals

# Create interval datetimes
start_1 = pd.Timestamp('2021-10-21 03:02:10')
finish_1 = pd.Timestamp('2022-09-15 10:03:30')
finish_2 = pd.Timestamp('2022-12-15 10:03:30')
# (start_2 is defined the same way; its value is not shown in the extract)
# Check whether two intervals overlap
pd.Interval(start_1, finish_1,
            closed='right').overlaps(pd.Interval(start_2, finish_2, closed='right'))

www.DataCamp.com