DAO Cheatsheet


Methods of strings
line_upper = line.upper()       # to upper case
line_lower = line.lower()       # to lower case
line_cap = line.capitalize()    # first letter to upper case
line_swap = line.swapcase()     # swap upper and lower case
line_title = line.title()       # capitalize the 1st letter of each word
The replace() method for strings:
string_uk = string_us.replace('modeling', 'modelling')

Functions
result = print('What?')
print(result)         # The returned value of the print function is None
print(type(result))   # The data type of None is NoneType
Note: in Python, all positional arguments must be specified before keyword arguments, otherwise there will be an error message.
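A minimal runnable sketch of the methods above; the sample string is an assumption.
line = 'hello World'
print(line.upper())        # HELLO WORLD
print(line.lower())        # hello world
print(line.capitalize())   # Hello world
print(line.swapcase())     # HELLO wORLD
print(line.title())        # Hello World
print(line.replace('World', 'Python'))   # hello Python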
List methods
empty_list = []
.append(x)        # item is added to the end of the list
.extend([a, b])   # add multiple items to the end of the list
.insert(2, x)     # insert item at index 2
.remove(x)        # only removes the first occurrence of the specific item
.pop(2)           # remove and return the item at index 2; if the position index
                  # is not specified, the pop() method removes the last item of the list

Notes on syntax for variable names:
- Only one word
- Only consist of letters, numbers, and underscores
- Cannot begin with a number
- Avoid conflicts with Python keywords or other variable/function names

List comprehension, e.g. numbers from 1 to 100 that are multiples of 7 but not of 5:
output = [n for n in range(1, 101) if n%7 == 0 and n%5 > 0]
Slicing with a negative step reverses a sequence:
print(greetings[::-1])   # result: dlroW olleH

pandas basics
.columns & .index: column and row labels | .dtypes: data type of each column
.astype(): convert the data type of pandas objects
pd.Series: series.values gives the values of the data elements as an array, e.g.
stocks.values   # shows the values without index and column names
series.index: RangeIndex(start=0, stop=6, step=1)
- The indexer iloc[] is for integer-position based indexes; the stop index is exclusive of the selection.
- The indexer loc[] is for label based indexes; the stop index is inclusive of the selection.
data_frame.iloc[row:row, col:col]
- pandas.DataFrame objects are also mutable, e.g. data_frame.loc[2:3, 'educ'] = 9.0
- The loc[] indexer can even be used to create new columns, e.g. data_frame.loc[:, 'new col'] = 'value'
Specifying row indexes: set_index('colname'). The row indexes can be converted back into a new column, while the default integer row indexes are recovered, by the reset_index() method.
output.reset_index(drop=True)   # drop the original index
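A small made-up frame to contrast the two indexers; the column names and values are assumptions.
import pandas as pd
data_frame = pd.DataFrame({'educ': [12, 16, 14, 18], 'wage': [3.1, 4.5, 3.8, 5.2]})
print(data_frame.iloc[0:2, 0:1])     # rows 0-1 only, first column (stop excluded)
print(data_frame.loc[0:2, 'educ'])   # rows labelled 0, 1 AND 2 (stop included)
data_frame.loc[:, 'new col'] = 'value'   # loc[] creates a new column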
Operators and operands
Comparison operators (==, !=, >, <, >=, <=) return Booleans.
Logical operators: and, or, not | Membership operators: in, not in
Bitwise operators (element-wise): "and" -> & / "or" -> | / "not" -> ~
Boolean indexing is typically used for filtering or segmenting data according to given conditions.
e.g. is_male = data_frame['gender'] != 'F'   # True if the record is a male
e.g. cond1 = data_frame['gender'] == 'F'
cond2 = data_frame['married']
is_wife = cond1 & cond2
data_frame.loc[is_wife, 'remarks'] = 'Wife'
New column named "husband" where each element equals 1 if this individual is a male and is married, otherwise the element is 0:
wage_data.loc[:, 'husband'] = 1
wage_data.loc[(wage_data['female']==1) | (wage_data['married']==0), 'husband'] = 0

Operations on series (vectorized string methods)
Data stored as a series cannot apply string operations directly -> convert via .str first! With .str, the function will be applied to all rows.
1) Vectorized len: e.g. is_long = condo['name'].str.len() > 20
   The output is Boolean -> condo.loc[is_long] shows all rows with names longer than 20 characters.
2) Vectorized count, good for counting the number of words: e.g. project.str.count('A')
3) Vectorized indexing, e.g. names starting with 'D':
   lower_names = condo['name'].str.lower()   # convert to lowercase before matching 'd'
   d_start = lower_names.str[0] == 'd'
   condo.loc[d_start]
4) Data type conversion (convert all elements in a series to a certain type),
   e.g. levels from str ('06 to 10') to int (6th floor to 10th floor):
   condo['level_from'] = condo['level'].str[:2].astype(int)
   condo['level_to'] = condo['level'].str[-2:].astype(int)
stocks.pct_change()   # method to calculate the rate of change between rows
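A tiny made-up condo table showing the vectorized .str pattern end to end; the data values are assumptions.
import pandas as pd
condo = pd.DataFrame({'name': ['Duo Residences', 'The Sail'],
                      'level': ['06 to 10', '11 to 15']})
print(condo.loc[condo['name'].str.len() > 10])   # rows with long names
condo['level_from'] = condo['level'].str[:2].astype(int)
condo['level_to'] = condo['level'].str[-2:].astype(int)
print(condo[['level_from', 'level_to']])         # (6, 10) and (11, 15)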
Tuples
item = '', ''          # comma-separated items; unlike lists, tuples can go without parentheses
item = ('', '')        # comma-separated items within parentheses
feel_empty = ()        # empty tuple, len() = 0
tuple_one = 'here',    # the comma creates a tuple-type object; the comma is necessary
item_one = ('there')   # the parentheses alone do not create a tuple -> still a string
s2 = tuple(range(3))   # a tuple with three items
mixed = ['Jack', 32.5, (1, 2)]   # comma-separated items within brackets
type(mixed)    -> <class 'list'>
type(mixed[2]) -> <class 'tuple'>
range() function: if only one argument is given, the sequence starts at 0 and stops before that argument.

Unpacking and swapping
x, y, z = 'abc'           # unpack a string of three characters
cage = 'bad'; trav = 'good'
cage, trav = trav, cage   # swapping variables
for outcome, prob in zip(outcomes, probs):   # the zip() function iterates two sequences in parallel

Descriptive statistics
Avg: data.mean()   # only numerical columns, incl. Boolean (proportion of True)
Median: data.median() | SD: data.std() | Var: data.var()
data.max() | data.min()
data['gender'].value_counts()                 # actual count of each category
data['gender'].value_counts(normalize=True)   # proportion of each category
data.corr() | data.cov()
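A quick runnable example of value_counts(); the data values are assumptions.
import pandas as pd
gender = pd.Series(['F', 'M', 'F', 'F'])
print(gender.value_counts())                 # F: 3, M: 1
print(gender.value_counts(normalize=True))   # F: 0.75, M: 0.25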
Locating a specific value in a correlation table:
corr_table.loc['educ'].loc['expr']
corr_table.loc['educ'].iloc[2]
summary = data.describe()   # shows count, mean, std, min, quartiles (25%, 50%, 75%), max;
                            # the median is shown as the 50% quartile

Continuous random variables
e.g. Probability that the amount of the soft drink is between 11.92 ounces and 12.12 ounces, given mean 12 and standard deviation 0.05:
from scipy.stats import norm
prob = norm.cdf(12.12, 12, 0.05) - norm.cdf(11.92, 12, 0.05)
Inverse of the cdf: norm.ppf() returns the cut-off value for a given probability.
e.g. a standard normal random variable has a probability of 0.9 to be within the interval (lower, upper), with lower = -1.5 known; find upper:
from scipy.stats import norm
lower = -1.5
p_in = 0.90
p_temp = 1 - p_in - norm.cdf(lower)   # probability above the upper bound
upper = norm.ppf(1 - p_temp)

Return in the function
def sum_sq(x):
    output = sum([item**2 for item in x])
x = [1, 1, 1]
y = [2, 2, 2]
print(sum_sq(y))   # result: None, because there is no return in the function
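A corrected sketch: adding a return statement makes the function produce a value.
def sum_sq(x):
    return sum([item**2 for item in x])
print(sum_sq([2, 2, 2]))   # 12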
break vs continue
- Break the loop with break.
- Skip the subsequent code in the current iteration with continue.
a_string = 'abcdef'
new_string = ''
for letter in a_string:
    new_string = new_string + letter
    if letter == 'c':
        break
print(new_string)   # abc

new_string = ''
for letter in a_string:
    if letter == 'c':   # skip this character to the next iteration directly
        continue        # the subsequent code is skipped; if no code is under
                        # continue, then no code will be skipped
    new_string = new_string + letter
print(new_string)   # abdef

Dictionaries   {'key': 'value'}
for name in stocks.keys():   # or simply: for name in stocks:
    print(name, stocks[name])
for price in stocks.values():
    print(price)             # print values only
for item in stocks.items():
    print(item)              # iterate both keys and values in parallel
Create a new dictionary all_dao that includes all DAO courses:
all_dao = {}
for item in courses:
    if 'DAO' in item:
        all_dao[item] = courses[item]
# OR
for item, value in courses.items():
    if 'DAO' in item:
        all_dao[item] = value
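A runnable version of the filter with a made-up courses dictionary (the names are assumptions), written as the equivalent dictionary comprehension.
courses = {'DAO2702': 'Programming', 'DAO2703': 'Operations', 'ACC1701': 'Accounting'}
all_dao = {item: value for item, value in courses.items() if 'DAO' in item}
print(all_dao)   # {'DAO2702': 'Programming', 'DAO2703': 'Operations'}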
*The output of the input() function is always a str-type object.
*When a dictionary of lists is converted into a DataFrame, each list becomes a column of type Series.

Nested loops give a list of ALL possible name combinations:
first = names['first']
last = names['last']
all_names = []
for a in first:
    for b in last:
        all_names.append(a + ' ' + b)

NumPy
Create NumPy arrays with numpy.array().
1D: similar to a list (one row), shape is a one-item tuple | 2D: shape is (row, col) (rows run vertically down, so the first number shows the no. of rows)
np.arange(start=0 by default, stop, step) | .reshape((rows, cols))
np.sin() / np.cos()
print(np.log(3))                      # natural logarithm of 3
print(np.exp(np.arange(1, 3, 0.5)))   # natural exponentials of 1, 1.5, 2, 2.5
print(np.square(np.arange(3)))        # squares of 0, 1, 2
print(np.power(2, np.arange(3)))      # 2 to the power of 0, 1, 2
Transforming variances to standard deviations:
import numpy as np
var = [3.5, 6.2, 7.3, 8.5]
std = np.array(var) ** 0.5
std = list(std)

Handling missing data (NaN: Not a Number, for missing values)
Detecting missing data: e.g. gdp.loc[:, ['1960', '1961']].isnull()
isnull(): returns True if NaN, False if not missing
notnull(): returns True if not missing, False if NaN
- Select only the non-null rows: e.g. gdp1960 = gdp.loc[gdp['1960'].notnull()]
Dropping missing values: an empty bracket ( ) by default leaves the original data unchanged.
- gdp_subset.dropna() returns a new data frame without the rows that contain missing values
- output = gdp_subset.dropna(inplace=True): the original data frame is overwritten, but the dropna() method returns nothing (print(output) -> None)
Replacing missing values with fillna():
- gdp_subset.fillna('ABC') fills all NaN items with 'ABC'
- output = gdp_subset.fillna(0, inplace=True): the original data is changed; inplace=True behaves the same as in dropna()
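A small made-up frame showing the missing-data methods together; the values are assumptions.
import numpy as np
import pandas as pd
gdp = pd.DataFrame({'1960': [1.2, np.nan], '1961': [1.3, 1.5]})
print(gdp['1960'].isnull())   # False, True
print(gdp.dropna())           # keeps only the first row
print(gdp.fillna(0))          # NaN replaced by 0; gdp itself is unchanged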

Discrete random variables
from scipy.stats import binom
P(X ≥ 6) = 1 − P(X ≤ 5)
e.g. 50 multiple-choice questions, each with only one correct answer among four choices. John is certain about n = 28 questions and is randomly guessing the remaining questions. Probability that John can answer at least m = 32 questions correctly?
n = 28
m = 32
p_right = 1 / 4
p_wrong = 1 - p_right
p_answer = binom.cdf(50 - m, 50 - n, p_wrong)   # at least m correct <=> at most 50 - m wrong
print(p_answer)

Expected values and variances
probs = np.array([[0.08, 0.13, 0.09, 0.06, 0.03],
                  [0.03, 0.08, 0.08, 0.09, 0.07],
                  [0.01, 0.03, 0.06, 0.08, 0.08]])
Y = np.arange(5)   # number of purchases
mean = (probs.sum(axis=0)*Y).sum()
var = (probs.sum(axis=0)*(Y - mean)**2).sum()
Probability of at least 3 purchases -> prob_purc[-2:].sum()

Monte Carlo simulations for decision-making
sample = np.random.normal(mean, std, size=1000)   # 1000 records following a normal distribution
sample = np.random.poisson(lam, size=1000)        # Poisson distribution; lam is both the mean and the variance
rolls = np.random.choice(outcomes, p=probs, size=(smp_size, n))   # discrete random variables
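A quick simulation check of the purchase example above; here prob_purc names the marginal probabilities (an assumption consistent with the snippet).
import numpy as np
probs = np.array([[0.08, 0.13, 0.09, 0.06, 0.03],
                  [0.03, 0.08, 0.08, 0.09, 0.07],
                  [0.01, 0.03, 0.06, 0.08, 0.08]])
prob_purc = probs.sum(axis=0)   # marginal distribution of the number of purchases
draws = np.random.choice(np.arange(5), p=prob_purc, size=100000)
print(draws.mean(), draws.var())   # close to the exact mean and variance
print(prob_purc[-2:].sum())        # P(at least 3 purchases)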


Variance rules
Let X and Y be two random variables, and a, b and c be three constants. Then we have:
Var(c) = 0
Var(aX + c) = a² Var(X)
Var(aX + bY + c) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y)
In the special case that X and Y are uncorrelated, the covariance between X and Y is zero, so the last equation can be written as:
Var(aX + bY + c) = a² Var(X) + b² Var(Y)
Var = std ** 2
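A simulation sketch verifying the last rule for independent X and Y; the constants and distributions are assumptions.
import numpy as np
x = np.random.normal(0, 2, size=100000)   # Var(X) = 4
y = np.random.normal(0, 3, size=100000)   # Var(Y) = 9, independent of X
a, b, c = 2, -1, 5
print((a*x + b*y + c).var())   # close to a**2 * 4 + b**2 * 9 = 25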
Sampling distributions
Estimate the mean value by testing a sample that contains a small fraction of the overall population. To ensure that each record is selected independently, the keyword argument replace is specified to be True, meaning that the samples are selected with replacement.
sample = bulb['Lifespan'].sample(25, replace=True)
sample.mean()
In order to illustrate the sampling distribution of the sample mean, we repeat the sampling experiment 1000 times using a for-loop.
mean_pop = population.mean()
sigma = population.values.std()
The Central Limit Theorem (CLT): for a relatively large sample size, the sample mean is approximately normally distributed, regardless of the distribution of the population. The approximation becomes better with increased sample size.
The sample proportion is centered at the population proportion p; its variance decreases as n increases, and the shape of its sampling distribution approaches a normal distribution as n increases.

Population vs sample variance
By the default setting of the var() (or std()) method of NumPy arrays, the method calculates the population variance (or population standard deviation). For the sample variance (or sample standard deviation):
- Specify the keyword argument ddof=1; the argument ddof indicates the delta degrees of freedom.
- Or convert the NumPy array into a series: the var() (or std()) method of a series calculates the sample variance (or sample standard deviation) by default, e.g. pd.Series(q6).var()
pd.Series(x).var()                   # sample variance
np.array(x).var()                    # population variance
np.array(x).var() * (n/(n-1))        # = sample variance
np.array(x).std() * (n/(n-1))**0.5   # sample std
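A sketch of the 1000-repetition experiment described above, assuming the bulb data frame is already loaded.
import numpy as np
sample_means = []
for _ in range(1000):
    sample = bulb['Lifespan'].sample(25, replace=True)
    sample_means.append(sample.mean())
print(np.mean(sample_means))   # close to the population mean (CLT)
print(np.std(sample_means))    # close to sigma / 25**0.5 (the standard error)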
Simulations with discrete and continuous random variables: the log return
If you buy $1000 worth of this stock at time t=1, what is the probability that after one trading day, i.e. at time t=2, your investment is worth less than $990?
P(Q2 ≤ 990) = P(r2 ≤ log(990/1000)) = P(r2 ≤ log(0.99))
What is the probability that after five trading days, i.e. at time t=6, your investment is worth less than $990? What is the expected value and variance of the investment after five trading days, i.e. at time t=6?

Confidence intervals
When the population standard deviation σ is known, use the z-value; when σ is unknown, use the t-value.
Cut-off value:
cut_value = norm.ppf(1 - alpha) * se
cut_value = -norm.ppf(alpha) * se
Among the 1000 experiments, the chance that the population mean value μ falls within the confidence intervals is around 1 − α = 95%.
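A minimal confidence-interval sketch when σ is known; the sample values and σ are assumptions, and alpha/2 is used for a two-sided interval.
import numpy as np
from scipy.stats import norm
sample = np.array([10.2, 9.8, 10.5, 10.1, 9.9])
sigma = 0.3                     # assumed known population std
se = sigma / len(sample)**0.5   # standard error of the sample mean
alpha = 0.05
cut_value = norm.ppf(1 - alpha/2) * se
print(sample.mean() - cut_value, sample.mean() + cut_value)   # 95% CI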
Hypothesis testing
Two-tailed test: decide whether a population mean, μ, is different from a given constant value μ0. Ha: μ ≠ μ0, i.e. μ is either larger or smaller than the constant μ0, so such a test is called a two-tailed test, where μ could vary from μ0 in two directions.
Left-tailed test: decide whether the population mean μ is less than a given constant value μ0. Ha: μ < μ0, i.e. the test checks if μ is on the left-hand side of μ0.
Right-tailed test: decide whether the population mean μ is greater than a given constant value μ0. Ha: μ > μ0, i.e. the test checks if μ is on the right-hand side of μ0.
If the P-value is larger than α, there is insufficient evidence to reject the null hypothesis under the given significance level; in other words, we do not have sufficient evidence to support the conclusion that the alternative hypothesis is true.
If the P-value is lower than α, we reject the null hypothesis in favor of the alternative hypothesis, which implies that the alternative hypothesis is true.
Worked example: prove that the population mean is larger than a given value, so it is a right-tailed test.
1. Hypotheses
   Null hypothesis H0: μ ≤ μ0 = 1340
   Alternative hypothesis Ha: μ > μ0 = 1340
2. Sampling distribution: the distribution of the sample mean; when the population standard deviation σ is unknown, use the t-value.
3. Significance from the P-value.
Testing a proportion: the null hypothesis assumes the population proportion p = p0, so we choose the z-test model (standard normal distribution). Sample proportion: p-hat = m/n; p-hat = 0.5 is the value that maximises the standard deviation.
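A sketch of step 3 for the right-tailed example, using a t-test since σ is unknown; the sample values are assumptions.
import numpy as np
from scipy.stats import t
sample = np.array([1360, 1345, 1332, 1370, 1355, 1340, 1365, 1350])
mu0 = 1340
n = len(sample)
t_stat = (sample.mean() - mu0) / (sample.std(ddof=1) / n**0.5)
p_value = 1 - t.cdf(t_stat, df=n - 1)   # right-tailed P-value
print(t_stat, p_value)                  # reject H0 if p_value < alpha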
