
Attributes of big data: 1. Volume (The quantity of data is huge.) 2. Variety (The array of available data sources is big: structured, semi-structured, unstructured.) 3. Velocity (The speed at which data is created is high.) Sometimes a fourth attribute is added: 4. Veracity (the credibility and reliability of the different data sources).
Traditional approach vs. ML: study the problem and write rules by hand vs. train an ML algorithm (extract knowledge from data), evaluate the solution, launch it, analyze errors, and if errors occur, study the problem again (loop).
Artificial Intelligence: A program that can sense, reason, act and adapt. Machine Learning:
Algorithms whose performance improves as they are exposed to more data over time.
Machine learning is a subcategory of artificial intelligence. Deep learning: subset of ML in
which multilayered neural networks learn from vast amounts of data.
Common Uses of Machine Learning in Finance: fraud prevention, risk management, wealth management, investment predictions, algorithmic trading… For example, Orbital Insight can predict how much oil is waiting to hit the market by analyzing massive numbers of satellite photos of oil tanks with floating lids. Robo-advisors give advice directly for portfolios according to client preferences.
Types of Machine Learning:
1. supervised machine learning: labelled data, direct feedback, predict outcome/future
Supervised learning involves ML algorithms that infer patterns between a set of inputs (the
X’s) and the desired output (Y). The inferred pattern is then used to map a given input set into
a predicted output. Supervised learning requires a labeled data set, one that contains matched
sets of observed inputs and the associated output.
Supervised machine learning has two types:
Regressions (predict values) (continuous) and Classification (predict classes) (discrete)
Associated algorithms:
Both regression and classification: K-Nearest Neighbors, Decision Trees, Support Vector Machines (we only study the classification case), Ensemble Methods (Random Forest and Gradient Boosted Trees), Artificial Neural Networks.
Regression only: Linear Regression and Penalized Regression
Classification only: Logistic Regression and Naïve Bayes

2. unsupervised machine learning: no data labels, no feedback, search for hidden patterns
Unsupervised learning is machine learning that does not make use of labelled data.

Unsupervised machine learning has two types:
Dimension Reduction (the process of reducing the number of features, or variables, in a dataset while preserving information and overall model performance. It is a common and powerful way to deal with data sets that have a large number of dimensions.) (continuous)
Clustering (allows us to discover hidden structures in data. The goal of clustering is to find a
natural grouping in data so that items in the same cluster are more similar to each other than
to those from different clusters.) (discrete)
3. semi-supervised machine learning: somewhere between supervised and unsupervised ML
4. reinforcement learning: decision process, reward system, learn series of actions
Reinforcement learning is an approach toward training a machine (an agent) to find the best
course of action through optimal policies that maximize rewards and minimize punishments.
Components of RL:
1. Agent: is the entity that performs actions (e.g., computer). 2 Action: is what an agent can
do in each state. 3 Environment: is the world in which the agent resides. 4 State: describes the
current situation of the agent. 5 Reward: The immediate return sent by the environment to
evaluate the last action by the agent. A reward can be positive (reward) or negative
(punishment).
Application of RL in finance: algorithmic trading, derivative hedging, portfolio allocation (e.g., finding the minimum-variance portfolio)
Commonly-used financial data sources: WRDS, OECD, Bloomberg terminal, Yahoo
Finance, FRED
Essential libraries: Scikit-learn (open-source project for ML), Matplotlib (scientific plotting), pandas (data wrangling and analysis)
API (Application Programming Interface) is a computing interface that defines interactions
between multiple software intermediaries. Thus, it is a software-to-software interface. It
defines the kinds of calls or requests that can be made, how to make them, the data formats
that should be used, the conventions to follow, etc. Some examples include quandl (a Python API for financial and economic datasets), fredapi (a Python API for the FRED data provided by the Federal Reserve Bank of St. Louis), and world_bank_data (a Python API for the World Bank data), among many others.
Anatomy of Learning Algorithm: Building Blocks of a Learning Algorithm: 1 a loss
function; 2 an optimization criterion based on the loss function (a cost function, for example);
and 3 an optimization routine leveraging training data to find a solution to the optimization
criterion.
Gradient Descent: Gradient descent is an iterative optimization algorithm for finding the minimum of a function. It can be used to find optimal parameters for linear and logistic regression, SVMs and also neural networks. It iteratively moves in the direction of steepest descent as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model. Consequences of the learning rate (alpha): if it is too large, we end up bouncing between two points and may never reach the minimum; if it is too small, convergence takes a very long time.
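A minimal sketch of gradient descent for a one-feature least-squares regression (the toy data, learning rate and iteration count are illustrative assumptions):

    import numpy as np

    # toy data: y is roughly 2*x + 1 (assumed for illustration)
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

    w, b = 0.0, 0.0        # parameters to learn
    alpha = 0.01           # learning rate
    for _ in range(5000):
        error = w * x + b - y
        grad_w = 2 * np.mean(error * x)   # gradient of the MSE loss w.r.t. w
        grad_b = 2 * np.mean(error)       # gradient of the MSE loss w.r.t. b
        w -= alpha * grad_w               # step in the direction of the negative gradient
        b -= alpha * grad_b
    print(w, b)  # should end up close to 2 and 1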

Variable Types: Categorical: Nominal (e.g., hair color), Ordinal (e.g., age groups or rankings, i.e., ordered categories); Numerical: Discrete (e.g., goals scored), Continuous (e.g., time until dinner)
Encoding Categorical Variables: Most machine learning algorithms work almost exclusively with numeric data. Therefore, we need to encode categorical features into numerical features, i.e., convert categorical variables into numerical ones. There are two conversion methods:
Label Encoding: replace each categorical value with a numeric value between 0 and number of classes - 1. (Drawback: it creates the appearance of a relationship that does not exist in reality; for example, 0 is smaller than 1, but the things they represent, such as the balance sheet and the income statement, have no such ordering.)
One-hot Encoding: for each category of a feature, we create a new column (dummy variable) with binary encoding to denote whether a particular row belongs to this category (the number of columns equals the number of categories). (Drawback: dimensionality; the more categories there are, the more columns are created, which takes up space and can cause memory issues.) drop_first: determines whether to get k-1 dummies out of k categorical levels by removing the first level (the default is False).
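A minimal pandas sketch of both encodings (the column name "statement" and its values are made-up examples):

    import pandas as pd

    df = pd.DataFrame({"statement": ["BS", "IS", "CF", "BS"]})

    # label encoding: each category becomes an integer code 0..k-1
    df["statement_label"] = df["statement"].astype("category").cat.codes

    # one-hot encoding: one dummy column per category; drop_first=True would keep k-1 columns
    dummies = pd.get_dummies(df["statement"], prefix="statement", drop_first=False)
    print(pd.concat([df, dummies], axis=1))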
Data Transformation:
Scaling (data transformation) is a process of adjusting the range of a feature by shifting and changing the scale of the data. Variables such as age and income can have very different ranges, which results in a heterogeneous training dataset, so we bring them to the same scale. The main techniques are: 1 Min-max scaling = (x - min)/(max - min) → feature values are in the range [0, 1]; 2 Standard scaling = (x - mean)/standard deviation, which works especially well for normally distributed data (the result is a standard normal distribution); it is a process of centering (subtract the mean so the new mean is zero) and scaling (divide by the standard deviation) → removes the mean and scales the data to unit variance; 3 Robust scaling = (x - median)/(75th quantile - 25th quantile), whose advantage is that it ignores outliers (data points different from the others, e.g., measurement errors).
Which one to choose? For normally distributed data, use standard scaling; otherwise, use min-max scaling. Both are sensitive to outliers.
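A minimal scikit-learn sketch of the three scalers (the small age/income matrix is an assumed example; in practice the scaler is fit on the training data only and then applied to both training and test data):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

    X = np.array([[25, 30000], [40, 55000], [35, 48000], [60, 120000]], dtype=float)

    print(MinMaxScaler().fit_transform(X))    # values scaled into [0, 1]
    print(StandardScaler().fit_transform(X))  # zero mean, unit variance
    print(RobustScaler().fit_transform(X))    # median/IQR based, less sensitive to outliers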

Binning (discretization): make linear models more powerful on continuous data by splitting it
up into multiple features (e.g., age groups for different ages); Interactions: x1 and x2 are the values of two features and x1*x2 represents the interaction between the two; Polynomials: a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth-degree polynomial in x. Using polynomial features together with a linear regression model yields the classical model of polynomial regression.
Univariate nonlinear transformations: adding squared or cubed features helps linear models for regression; log (applied as log(x+1), since log is not defined at 0), exp, or sin are nonlinear transformations. Such transformations can improve inference, but they need to be applied using expert judgement.
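A minimal sketch of polynomial regression via polynomial features plus linear regression (the sine-shaped toy data and the degree are assumptions for illustration):

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(100, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

    # degree-3 polynomial expansion followed by ordinary least squares
    poly_reg = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
    poly_reg.fit(X, y)
    print(poly_reg.score(X, y))  # R^2 on the training data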
Statistics vs. ML: number of observations – samples; data set size – sample size; variables –
features; dependent variable – label; coefficient – weight.
Applying the ML algorithm to a data set to infer the pattern between the inputs and output is
called “training” the algorithm. Once the algorithm has been trained, the inferred pattern can
be used to predict output values based on new inputs.
Training set: On this part of the data, we train the ML model. Test set: On this part of the
data, we test our model’s performance.
Why do we split the data into training and test data? Answer: to prevent overfitting.
Overfitting refers to learning a function that perfectly explains the training data that the
model learned from but does not generalize well to unseen test data. If done improperly, data
leakage can occur through which one can introduce biases into the data.

Problems with a basic split: For time-series data, set shuffle=False to maintain the time order of the data, which is important for time-series data. For imbalanced data, set stratify=<variable name> to ensure that the training and test sets have (as nearly as possible) the same distribution of the specified variable.
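A minimal sketch of both options with scikit-learn's train_test_split (the tiny toy arrays are assumptions for illustration):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(20).reshape(10, 2)               # toy feature matrix
    y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])   # imbalanced labels

    # time-series data: keep the chronological order of the rows
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, shuffle=False)

    # imbalanced data: preserve the class proportions in both splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)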
Classification vs. Regression: Classification is the task of predicting a discrete class label.
Regression is the task of predicting a continuous quantity. Both use labelled variables to make
predictions.
No free lunch theorem: Each model is a simplification of reality. Simplification is based on
assumptions. Assumptions fail in certain situations. No model works best for all possible
situations!
Use of supervised machine learning in finance: Credit default prediction, derivative
pricing, Robo-advisory, Stock price prediction, asset allocation.
Linear Regression: Linear regression, or ordinary least squares (OLS) is a linear model, e.g.,
a model that assumes a linear relationship between the input variables (x) and the single
output variable (y).
Example: a dataset of 30,000 rows and 25 columns, with one dummy-variable column for the default status. .pop splits off the dependent variable from the independent variables, leaving 30,000 rows x 24 columns for X and 30,000 x 1 for y. Split the data into a training and a test set 80:20. The training data are X_train: 24,000 x 24 and y_train: 24,000 x 1, and the test data are X_test: 6,000 x 24 and y_test: 6,000 x 1. The ML model will learn on X_train and y_train. Once it has learned, it applies y = f(X) to X_test and produces y_pred. .fit trains the algorithm with the training data to determine the weights. .predict lets the algorithm make predictions on the test data. The predictions are compared to the actual results, and accuracy can be calculated based on the achieved performance.
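A minimal sketch of this workflow, assuming the 30,000 x 25 dataset sits in a hypothetical CSV file with a dummy column named "default", and using logistic regression as an example model (the notes do not fix a particular algorithm here):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("credit_default.csv")   # hypothetical file: 30,000 rows x 25 columns

    y = df.pop("default")                    # y: 30,000 x 1; df now holds the 24 features
    X_train, X_test, y_train, y_test = train_test_split(
        df, y, test_size=0.2, random_state=0)      # 24,000 / 6,000 split

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)              # .fit learns the weights on the training data
    y_pred = model.predict(X_test)           # .predict applies y = f(X) to the test data
    print(accuracy_score(y_test, y_pred))    # compare predictions with actual outcomes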
Training the model in two steps: S1: Define a loss function: how inaccurate the model’s
predictions are (Residual sum of squares (RSS): the squared sum of the differences between
the actual and predicted values) S2: find parameters that minimize loss: look at the difference
between each real data point and our model’s prediction. Square differences to avoid negative
numbers and penalize large differences and then add them up and take the average.
Strengths: Easy to understand and interpret. Linear regression has no parameters, which makes it simple, but it also has no way to control model complexity. Main weaknesses: Prone to overfitting. Sensitive to multicollinearity. Does not work well when there is a non-linear relationship between the predicted and predictor variables. No complexity control. Main parameters: None.
Since Linear Regression faces the following challenges, the regularized regression is used.
Challenges of OLS :1 Interpretability: OLS cannot distinguish variables with little or no
influence. These variables distract from the relevant regressors. 2 Overfitting: OLS works
well when the number of observations m is bigger than the number of predictors p, i.e., m is much bigger than p. If m ≈ p, overfitting results in low accuracy on unseen observations. If m ≤ p, the variance of the estimates is infinite and OLS fails. As a remedy, one can identify only the relevant variables by feature selection.
The main idea of regularized regression: Fit linear models with least squares but impose
constraints on the coefficients. Regularization means explicitly restricting a model to avoid
overfitting. Simply put, it is a penalty mechanism that applies shrinkage to model parameters.
Three approaches:

L1 regularization, aka LASSO (Least Absolute Shrinkage and Selection Operator) regression: LASSO models can shrink some parameters exactly to zero. LASSO performs regularization by adding a factor of the sum of the absolute values of the coefficients to the objective function (RSS) for linear regression. The consequence of L1 regularization is that when using LASSO, some
coefficients are exactly zero. This can be seen as a form of automatic feature selection.
Having some coefficients be exactly zero often makes a model easier to interpret, and can
reveal the most important features of your model. The larger the value of λ, the more features
are shrunk to zero. Main strengths LASSO usually results in sparse models that are easier to interpret. Main weaknesses LASSO cannot do group selection. If there is a group of variables among which the pairwise correlations are very high, then LASSO tends to arbitrarily select only one variable from the group. Main parameter λ. If we set λ too low, however, we remove the effect of regularization and end up overfitting, with a result similar to linear regression. Use LASSO for variable selection, i.e., when you have a large number of features and expect only a few of them to be important.
Objective: min (RSS + λ × (sum of the absolute values of the coefficients))
L2 regularization, aka. Ridge regression: Ridge regression can shrink parameters close to
zero. Ridge regression is also a linear model for regression, so the formula it uses to make
predictions is the same one used for ordinary least squares. Ridge regression performs L2
regularization by adding a penalty factor to the cost function used in linear regression. The
penalty term (λ) regularizes the coefficients such that if the coefficients take large values, the
optimization function is penalized. It decreases the complexity of a model but does not
reduce the number of variables. It makes the trade-off between the simplicity of the model
and its performance on the training set. Main strengths Ridge reduces the variance (with an increasing bias). Works best in situations where the OLS estimates have high variance.
Improve predictive performance. Works in situations where the number of observations is
bigger than the number of predictors p. Involves mathematically simple computations. Main
weaknesses Ridge regression is not able to shrink coefficients to exactly zero→cannot
perform variable selection. With enough training data, regularization becomes less important,
and linear regression catches up with ridge in the end.
Objective: min (RSS + λ × (sum of the squared coefficients))
Elastic net: combines the LASSO and ridge regression penalties. It involves tuning two
parameters. They are usually labelled as λ and α. Main strengths Elastic Net reduces the
impact of different features while not eliminating all of the features. Sparse model with good
prediction accuracy, while allowing for a grouping effect. Ability to perform grouped selection. Appropriate for the case where p is much larger than m. Main weaknesses Selection of two parameters is necessary. Main parameters λ and α. Elastic net often works best, at the price of having two parameters to tune.
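A minimal scikit-learn sketch of the three penalized regressions (the synthetic data and penalty values are assumptions; note that scikit-learn calls the λ of the notes alpha, and the elastic-net mixing parameter l1_ratio):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, Ridge, ElasticNet

    X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                           noise=10, random_state=0)

    lasso = Lasso(alpha=1.0).fit(X, y)                     # L1: some coefficients exactly zero
    ridge = Ridge(alpha=1.0).fit(X, y)                     # L2: coefficients shrunk towards zero
    enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # mix of L1 and L2

    print("non-zero LASSO coefficients:", (lasso.coef_ != 0).sum())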
Logistic Regression: (classification, not regression; can also be regularized) model the
probabilities of the output classes given a function that is linear in x, while ensuring that
output probabilities sum up to one and remain between zero and one as we would expect from
probabilities. [0,1] Main strengths Easy to implement, has good interpretability, and
performs very well on linearly separable classes. The output of the model is a probability,
which provides more insight and can be used for ranking. Main weaknesses Overfitting when
provided with large numbers of features. Logistic regression can only learn linear functions
and is less suitable to complex relationships between features and the target variable. Also, it
may not handle irrelevant features well, especially if the features are strongly correlated.
K-Nearest Neighbors: Building the model consists only of storing (memorizing) the training
data set. To make a prediction for a new data point, the algorithm finds the closest data points
in the training dataset — its “nearest neighbors”. K-nearest neighbors (KNN) is considered a
lazy learner, as there is no learning required in the model. KNN does not learn any function
from the training data but memorizes the training data set.
Lazy learning vs. eager learning: Lazy learning (i.e., instance-based learning): The
algorithm simply stores the training data and waits until it is given new data to assess. Eager
learning: Given a set of training data, the algorithm constructs a model before receiving new
data to assess.
Determine which of the K instances in the training dataset are most similar to a new
input: A distance measure is used. Euclidean distance (the straight-line distance): use it if the input variables are similar in type. Manhattan distance (the sum of absolute differences along each axis): use it if the input variables are not similar in type.
Prediction: Classification: Take the majority vote of the class labels among the k-nearest neighbors. Regression: Predict the average of the target values of the k-nearest neighbors. If ties arise (whether when finding the closest instances to the new data point or when voting for the class of the new data point in classification), break them arbitrarily.
Main strengths No training is involved→no learning phase. New data can be added
seamlessly without impacting the accuracy of the algorithm. Intuitive and easy to understand.
Main weaknesses When your training set is very large (either in number of features or in
number of samples) prediction can be slow. Feature scaling is required. KNN does
particularly badly with data sets where most features are 0 most of the time (so-called sparse
data sets) Main parameters Number of neighbors; Distance metric: by default, Euclidean
distance is used.
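A minimal KNN sketch (the iris data set and k=5 are assumed just for illustration; note the feature scaling step):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    scaler = StandardScaler().fit(X_train)        # distance-based methods need scaled features
    knn = KNeighborsClassifier(n_neighbors=5)     # Euclidean distance by default
    knn.fit(scaler.transform(X_train), y_train)
    print(knn.score(scaler.transform(X_test), y_test))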
Support Vector Machines (SVMs) (for classification and regression): maximize the
margin (shown as shaded area in the figure on the previous slide), which is defined as the
distance between the separating hyperplane (or decision boundary) and the training samples
that are closest to this hyperplane, the so-called support vectors. During training, the SVM
learns how important each of the training data points is to represent the decision boundary
between the two classes. Typically, only a subset of the training points matters for defining
the decision boundary: the ones that lie on the border between the classes. These are called
support vectors. To make a prediction for a new point, the distance to each of the support
vectors is measured. A classification decision is made based on the distances to the support
vector, and the importance of the support vectors that was learned during training.
If we have a two-class classification dataset in which the classes are not linearly separable: when they are linearly separable (two clusters that a single line can separate), a linear SVM suffices. When they are not, there are two main steps for the nonlinear generalization of SVM. 1 The first step involves the transformation of the original training (input) data into higher-dimensional data using a nonlinear mapping (e.g., from 2D to 3D). 2 The second step involves finding a linear separating hyperplane in the new space. The maximal-margin hyperplane found in the new space corresponds to a nonlinear separating hypersurface in the original space. In some cases, it is not possible to find a hyperplane or a linear decision boundary, and kernels are used (e.g., a separating line in 2D becomes a separating plane in 3D).
If not possible to find hyperplane or linear decision boundary: Kernelized support vector
machines (often just referred to as SVMs): Allows for more complex models that are not
defined simply by hyperplanes in the input space. A kernel is a transformation of the input
data that allows the SVM algorithm to treat/process the data more easily. Using kernels, the
original data is projected into a higher dimension to classify the data better. Main strengths
Fairly robust against overfitting, especially in higher dimensional space. Allows non-linear
generalizations, with many kernels to choose from. No distributional requirement for the data.
Main weaknesses Inefficient to train and memory-intensive to run and tune. It doesn’t perform well with large datasets. Feature scaling of the data is required. There are also many hyperparameters, and their meanings are often not intuitive.
Naive Bayes Classifiers: a family of algorithms that assume every pair of features being classified is independent of each other given the class. The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(B|A). Gaussian Naive Bayes: can be applied to any continuous data; Multinomial Naive Bayes: can only be applied to count data (integer feature counts). Suitable for classification with discrete features (e.g., word counts for text classification); Bernoulli Naive Bayes: can only be applied to binary data. Suitable for discrete data. The difference is occurrence counts vs. binary/boolean features.
Decision Trees: (classification and regression, known as CART: classification and regression
tree models) The top node: root, represents the whole dataset. Each node in the tree either
represents a question or a terminal node (also called a leaf) that contains the answer. The
edges connect the answers to a question with the next question you would ask (the top node is the root; the bottom nodes are the leaves). Data can come in the form of binary yes/no features or can be represented as continuous features. The questions that are used on continuous data are of the form “Is feature i larger than value a?” In the machine learning setting, these questions are
called tests (not to be confused with the test set). It is a recursive process. It yields a binary
tree of decisions, with each node containing a test. The recursive partitioning of the data is
repeated until each region in the partition (each leaf in the decision tree) only contains a
single target value (a single class or a single regression value). A leaf of the tree that contains
data points that all share the same target value is called pure.
Controlling complexity of decision trees: 1 Pre-pruning (the only one implemented in scikit-learn): Stopping the creation of the complex tree earlier, before it perfectly classifies the training set. 2 Post-pruning: Building a complex tree but then removing or collapsing nodes that contain little information.
Possible criteria for pre-pruning: Limiting the maximum depth of the tree (max_depth),
limiting the maximum number of leaves (max_leaf_nodes), or requiring a minimum number of samples at a leaf node: a split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches (min_samples_leaf). If we don’t restrict the depth of a decision tree, the tree can become arbitrarily
deep and complex. Unpruned trees are therefore prone to overfitting and not generalizing well
to new data.
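A minimal sketch of pre-pruning a decision tree in scikit-learn (the breast-cancer data set and the particular limits are assumptions for illustration):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # pre-pruning: limit depth, number of leaves and minimum samples per leaf
    tree = DecisionTreeClassifier(max_depth=4, max_leaf_nodes=20,
                                  min_samples_leaf=5, random_state=0)
    tree.fit(X_train, y_train)
    print(tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(tree.feature_importances_)   # feature importances, summing to 1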
Feature importance in trees: summarize the workings of the tree. The most commonly used
summary is feature importance, which rates how important each feature is for the decision a
tree makes. It is a number between 0 and 1 for each feature, where 0 means “not used at all”
and 1 means “perfectly predicts the target.” The feature importances always sum to 1.
However, if a feature has a low feature importance, it doesn’t mean that this feature is
uninformative. It only means that the feature was not picked by the tree, likely because
another feature encodes the same information. (shows which variables are important for the
split) Main strengths Resulting model can easily be visualized and understood by nonexperts
(at least for smaller trees); Algorithms are completely invariant to scaling of the data. Main
weaknesses Tend to overfit and provide poor generalization performance. Main parameters max_depth, max_leaf_nodes, min_samples_leaf
Ensembles of decision trees (due to weaknesses of normal trees):
Random forest is essentially a collection of decision trees, where each tree is slightly different
from the others. Random forests are one way to address the problem of overfitting the training data that is common among decision trees. The idea behind random forests is
that each tree might do a relatively good job of predicting, but will likely overfit on part of the
data. If we build many trees, all of which work well and overfit in different ways, we can
reduce the amount of overfitting by averaging their results. This reduction in overfitting,
while retaining the predictive power of the trees, can be shown using rigorous mathematics.
There are two ways in which the trees in a random forest are randomized: by selecting the
data points used to build a tree AND by selecting the features in each split test.
Build a random forest model (decide on the number of trees to build: the n_estimators parameter of RandomForestRegressor or RandomForestClassifier): Take a bootstrap sample of our data. That is, from our n_samples data points, we repeatedly draw an example randomly with replacement (meaning the same sample can be picked multiple times), n_samples times. (Data selection) Next, a decision tree is built based on this newly created dataset. However, the algorithm we described for the decision tree is slightly modified.
Instead of looking for the best test for each node, in each node the algorithm randomly selects
a subset of the features, and it looks for the best possible test involving one of these features.
The number of features that are selected is controlled by the max_features parameter. This selection of a subset of features is repeated separately in each node, so that each node can
make a decision using a different subset of the features. The bootstrap sampling leads to
each decision tree in the random forest being built on a slightly different dataset.
Because of the selection of features in each node, each split in each tree operates on a
different subset of features. Together, these two mechanisms ensure that all the trees in the
random forest are different.
If max_features is set to n_features: there is no randomness in the feature selection, though the randomness due to bootstrapping remains.
Analyzing random forests: regression: average the results to get the final prediction; classification: soft voting: each tree makes a soft prediction, providing a probability for each possible output label. The probabilities are then averaged over all trees, and the class with the highest probability is predicted. Random forests overfit less than any of the trees individually and provide a much more intuitive decision boundary; they provide feature importances calculated by aggregating the feature importances over the trees in the forest; they give nonzero importance to many more features than a single tree.
Main strengths Most widely used machine learning methods. Often work well without heavy
tuning of the parameters, and don’t require scaling of the data. Main weaknesses Time-consuming. Don’t perform well on very high-dimensional, sparse data, such as text data. Main parameters The important parameters to adjust are n_estimators, max_features, and possibly pre-pruning options like max_depth. For n_estimators, larger is always better. Averaging more trees will yield a more robust ensemble by reducing overfitting. However, there are diminishing returns, and more trees need more memory and more time to train. You can use the n_jobs parameter to adjust the number of cores to use. A common rule of thumb is to build “as many as you have time/memory for.”
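A minimal random forest sketch (the data set and parameter values are assumed for illustration):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(n_estimators=100,    # number of trees
                                    max_features="sqrt", # features considered per split
                                    n_jobs=-1,           # use all CPU cores
                                    random_state=0)
    forest.fit(X_train, y_train)
    print(forest.score(X_test, y_test))
    print(forest.feature_importances_)   # aggregated over the trees in the forest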
Gradient Boosted Trees (both for regression and classification, a gradient descent
algorithm): Building trees in a serial manner, where each tree tries to correct the mistakes of
the previous one. No randomization but strong pre-pruning is used. Gradient boosted trees
often use very shallow trees, of depth one to five, → smaller in terms of memory and makes
predictions faster. The main idea is that as more and more trees are added, we can iteratively
improve performance.
Build gradient boosting models: Apart from the pre-pruning and the number of trees in the
ensemble, another important parameter of gradient boosting is the learning rate, which
controls how strongly each tree tries to correct the mistakes of the previous trees. A higher
learning rate means each tree can make stronger corrections, allowing for more complex
models. Adding more trees to the ensemble, which can be accomplished by increasing
n_estimators, also increases the model complexity, as the model has more chances to correct
mistakes on the training set.
Main strengths most powerful and widely used models. works well without scaling and on a
mixture of binary and continuous features. Main weaknesses does not work well on high-
dimensional sparse data. Main parameters the number of trees (n_estimators) and the learning rate. In contrast to random forests, where a higher n_estimators value is always better, increasing n_estimators in gradient boosting leads to a more complex model, potentially leading to overfitting.
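A minimal gradient boosting sketch (the data set, learning rate and tree depth are assumptions for illustration):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import GradientBoostingClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    gbrt = GradientBoostingClassifier(n_estimators=100,  # trees built one after another
                                      learning_rate=0.1, # strength of each tree's correction
                                      max_depth=3,       # shallow trees, strong pre-pruning
                                      random_state=0)
    gbrt.fit(X_train, y_train)
    print(gbrt.score(X_test, y_test))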
Random & Gradient: both perform well, first try with random forests, which work quite
robustly.
Artificial Neural Networks (ANN): ANNs start with the perceptron, which is one of the simplest ANN architectures. An ANN is analogous to a biological neural network. In a biological neuron, inputs enter from the dendrites, pass through the cell body and axon to the axon terminals; in an artificial neuron, the inputs are multiplied by weights, a weighted sum is taken, and after an activation function an output comes out. Similar to the human brain, where neurons are interconnected to one another through synapses, artificial neural networks also have connections between units across the various layers of the network. (Input features and predictions are nodes, and weights are the connections between nodes.)
Architecture of ANN: An ANN architecture comprises input layer, hidden layer(s)
(composed of hidden units) and output layer. Multilayer perceptrons (MLPs) are known as feed-forward neural networks. They are represented by composing together different functions. For example, we might have three functions f(1), f(2), and f(3) connected in a chain, to form f(x) = f(3)(f(2)(f(1)(x))). These chain structures are the most commonly used structures of neural networks. In this case, f(1) is called the first layer of the network, f(2) is called the second layer, and so on. The overall length of the chain gives the depth of the model.
Hidden layers = 1, shallow; Hidden layers >=2, deep.
Feed-forward network: Feed-forward neural networks are called networks because they are
typically represented by composing together many different functions.
Perceptrons are arranged in layers, with first layer being input layer and the last
producing outputs. The middle layers (hidden layers) have no connection with external
world. Each perceptron in one layer is connected to every perceptron on the next layer
(information is fed forward from one layer to the next). However, there is no connection
among perceptrons in the same layer.
Why the name ‘feed-forward backpropagation’ NN architecture: Training a NN means calibrating all of the weights in the ANN. This optimization is an iterative approach involving forward propagation (feedforward) of information and backpropagation of error steps. FF: Input values are fed to the NN and we get an output, which we call the predicted value. The process repeats for all layers until an output value from the last layer is received. BP: After forward propagation, we get the predicted value from the ANN. The difference between the predicted output and the desired one is converted into the loss function.
The goal is to minimize the loss function over the training set. There is a trade-off between the complexity of the model and the expectation of better performance on the training data, depending on the number of hidden layers. More complexity is not always better, since the marginal effect of depth (on test accuracy) decreases.
Use of ANN in finance: option pricing, high frequency trading, portfolio optimization, text
analysis
Activation function: The activation function decides which neurons will be activated—that
is, what information is passed to further layers. In a nutshell, it determines whether the neuron
fires or not. Activation functions (AFs) refer to the functions used over the weighted sum of
inputs in ANNs to get the desired output. Every activation function takes a single number and
performs a certain fixed mathematical operation on it. AFs allow the network to combine the
inputs in more complex ways, and they provide a richer capability in the relationship they can
model and the output they can produce. Typical activation functions: Linear (identity)
function, Sigmoid function, Tanh function, ReLU function
Minimize the cost function: Gradient Descent, Stochastic Gradient Descent, Mini-batch
Gradient Descent
Control complexity of a Neural Network: 1 adjust the number of hidden layers 2 adjust the
number of units in each hidden layer 3 apply regularization
Main strengths Captures the nonlinear relationship between the variables quite well. Given
enough computation time, data, and careful tuning of the parameters, neural networks often
beat other machine learning algorithms (for classification and regression tasks). Main
weaknesses Interpretability of the model. Not good with small data sets and requires a lot of
tweaking and guesswork. Computationally expensive and time-consuming. Main parameters
1 Hidden layers (number of layers and nodes in the ANN architecture) 2 Activation function
(activation function of a hidden layer. Some of the activation functions such as sigmoid, relu,
or tanh, can be used.)
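A minimal sketch of a feed-forward network with scikit-learn's MLPClassifier (the data set, layer sizes and activation are assumptions for illustration):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.neural_network import MLPClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    scaler = StandardScaler().fit(X_train)             # NNs are sensitive to feature scaling
    mlp = MLPClassifier(hidden_layer_sizes=(100, 50),  # two hidden layers
                        activation="relu",             # activation function
                        alpha=0.001,                   # L2 regularization strength
                        max_iter=1000, random_state=0)
    mlp.fit(scaler.transform(X_train), y_train)
    print(mlp.score(scaler.transform(X_test), y_test))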
When to use which algorithm: Choose linear models (first to try), when facing large
datasets and high-dimensional data; Nearest neighbors, when facing small datasets, good as
a baseline, and easy to explain; Support vector machines, when facing medium-sized
datasets of features with similar meaning. Require scaling of data and sensitive to parameters;
Naïve Bayes only for classification. Even faster than linear models, good for very large
datasets and high-dimensional data, but often less accurate than linear models; Decision
trees, very fast and do not need scaling of data, can be visualized and easily explained;
Random forests, almost always better than a single tree, very robust and powerful. No
scaling needed, but not good for very high-dimensional sparse data; Gradient boosted
decision trees: slightly more accurate than random forests, slower to train but faster to predict
than random forests and smaller in memory, but need more parameter tuning; Neural
networks: can build complex models especially for large datasets, but sensitive to scaling and
choice of parameters, large models training is time-consuming.
Explainable AI (XAI): Methods and techniques in the application of artificial intelligence
(AI) such that the results of the solution can be understood by humans. Trade-off between
performance and explainability. Simple models are easy to explain. But more complex ones
tend to perform better. Some useful tools: Feature importance graphs, Local Interpretable
Model-Agnostic Explanations (LIME), Shapley Values (SHAP)
Generalization: If a model is able to make accurate predictions on unseen data, we say it is
able to generalize from the training set to the test set. Overfitting: occurs when you fit a
model too closely to the particularities of the training set and obtain a model that works well
on the training set but is not able to generalize to new data. Choosing too simple a model is
called underfitting. The more complex we allow our model to be, the better its performance
on the training data. However, if our model becomes too complex, we start focusing too much on each individual data point in our training set, and the model will not generalize well to
new data. Trade-off: performance on training data vs. generalization. Sweet spot yields
the best generalization performance.
Bias-variance trade-off: Bias error (error on the training data): the degree to which a model fits the training data. Algorithms with erroneous assumptions produce high bias with poor approximation, causing underfitting and high in-sample error. Bias results in underfitting of the data. Variance error (error on the test data), or how much the model’s results change in
response to new data from validation and test samples. Unstable models pick up noise and
produce high variance, causing overfitting and high out-of-sample error. High variance gives
rise to overfitting.
How to combat overfitting: 1 Using more training data The more training data we have,
the harder it is to overfit the data by learning too much from any single training example.
Usually, collecting more data points will yield more variety. The larger variety of data points
your dataset contains, the more complex a model you can use without overfitting. so larger
data sets allow building more complex models. 2 Using regularization Adding a penalty in
the loss function for building a model that assigns too much explanatory power to any one
feature, or allows too many features to be taken into account. Optimal regularization can be
achieved by varying the regularization parameter.

Model evaluation mainly involves two methods: Cross-Validation and Grid Search (considerations: balanced vs. imbalanced data, high cost of false positives vs. false negatives):
Cross Validation: statistical method of evaluating generalization performance that is more
stable and thorough than using a split into a training and a test set. In cross-validation, the
data is split repeatedly and multiple models are trained. Motivation: with a single random split of the data we might get 1. lucky: all examples that are hard to classify end up in the training set and the test set only contains easy examples, so the test-set accuracy is unrealistically high; or 2. unlucky: the reverse. Remedy: cross-validation, which repeats the split. Main benefits Having multiple splits of the data provides information about how
sensitive our model is to the selection of the training dataset. Compared to using a single split
of the data, we can use our data more effectively. We are able to use e.g., 80-90% of the data
instead of 70-75%. Main drawback computational cost, especially when paired with a grid
search for hyperparameter tuning.
k-fold cross-validation: 1 Data is first partitioned into k parts of (approximately) equal size,
called folds. We end up with k folds. 2 Then we train the model using k-1 folds and evaluate the performance on the k-th fold. 3 We repeat this process k times and average the resulting
scores.
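A minimal k-fold cross-validation sketch (the iris data set, logistic regression and k=5 are assumed just for illustration):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores)          # one score per fold
    print(scores.mean())   # average over the k folds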

Cross-Validation when working with time-series data: 1 Sliding Window: the training window has a constant length and slides forward in time, so the dropped (earlier) portion grows with each split. 2 Expanding Window: nothing is dropped; the training window grows over time.

If the labels are ordered (e.g., 000000000000000111, so one of the splits would contain only zeros) → Stratified K-fold cross-validation: split the data such that the proportions between classes are the same in each fold as they are in the whole dataset. Good to use to evaluate a classifier → more reliable estimates of generalization performance.
If the labels are grouped (e.g., 000000000111111111) and we need to ensure the splits are random → Shuffling the data: remove the ordering of the samples by label. We can do that by setting the shuffle parameter of KFold to True. It is good practice to fix the random state to get a reproducible shuffling.
Leave-One-Out Cross Validation: Predict each instance, training on all N-1 other instances; i.e., k-fold cross-validation where each fold is a single sample. For each split, you pick a single data point to be the test set. Time-consuming, particularly for large datasets, but sometimes provides better estimates on small datasets. (Iterations 1/N, 2/N, …, N/N, with the single test point moving one position each time.)

Shuffle-split Cross Validation: each split samples train_size many points for the training set and test_size many (disjoint) points for the test set. This splitting is repeated n_iter times. ShuffleSplit works iteratively, while KFold just divides the dataset into k folds.
Cross Validation with Groups: Used when there are groups in the data that are highly related → ensure that the training and test sets contain, e.g., images of different people.
Example: logistic regression (C=10) evaluated with 10-fold cross-validation: the reported accuracy is the average of the 10 fold accuracies.
Grid Search(improve the model’s generalization performance by tuning its parameters):
The most commonly used method is grid search, which basically means trying multiple
possible combinations of the parameters.
Parameters vs. Hyperparameters: Parameters: learnt during training; internal to model;
e.g., node weights in a NN Hyperparameters: cannot be learnt but set beforehand; external to
model; e.g., learning rate, hidden layers.

Combining Cross Validation and Grid Search: Grid Search with Cross Validation: while the method of splitting data into a training, a validation and a test set is workable, it is quite sensitive to how exactly the data is split. For a better estimate of generalization performance, we can use cross-validation to evaluate the performance of each parameter combination. Use GridSearchCV to specify the parameters you want to search over using a dictionary. GridSearchCV will then perform all the necessary model fits. GridSearchCV uses cross-validation in place of the split into a training and validation set that we used before. However, we still need to split the data into a training and a test set to avoid overfitting the parameters. With 10-fold CV and 6 gamma and 6 C values, 6x6x10 = 360 models need to be trained. A heat map can be used to visualize the accuracy scores. It is important to make sure that the ranges for the parameters are large enough.
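A minimal grid search with cross-validation sketch (the data set and the 6 C and 6 gamma values are assumed example values, chosen to match the 6x6x10 = 360 fits above):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    param_grid = {"C": [0.01, 0.1, 1, 10, 100, 1000],
                  "gamma": [0.001, 0.01, 0.1, 1, 10, 100]}
    grid = GridSearchCV(SVC(), param_grid, cv=10)   # 6 x 6 x 10 = 360 model fits
    grid.fit(X_train, y_train)                      # the test set is kept aside
    print(grid.best_params_, grid.best_score_)
    print(grid.score(X_test, y_test))               # final check on unseen data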
Supervised Performance Metrics:
Regression: Mean Absolute Error(MAE): measures the average magnitude of the errors in
a set of forecasts, without considering their direction. Linear score, which means that all the
individual differences are weighted equally in the average. It gives an idea of how wrong the
predictions were. The measure gives an idea of the magnitude of the error, but no idea of the
direction (e.g., over- or underpredicting). Mean Squared Error (MSE): represents the
average squared difference between the actual values and the estimated values (residuals).
Root mean squared error (RMSE). R^2: the “goodness of fit” of the predictions to actual
value. In statistics, this measure is also called the coefficient of determination. Adjusted
R^2: how well terms fit a curve or line but adjusts for the number of predictors in a model.
For predictive accuracy → RMSE is the best choice. For explanatory purposes, i.e., indicating how well the selected independent variable(s) explain the variability in the dependent variable → R^2 and adjusted R^2 are often used.
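A minimal sketch of the regression metrics (the actual and predicted values are toy numbers):

    import numpy as np
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    y_true = np.array([3.0, 5.0, 2.5, 7.0])
    y_pred = np.array([2.5, 5.0, 4.0, 8.0])

    print(mean_absolute_error(y_true, y_pred))           # MAE
    print(mean_squared_error(y_true, y_pred))            # MSE
    print(np.sqrt(mean_squared_error(y_true, y_pred)))   # RMSE
    print(r2_score(y_true, y_pred))                      # R^2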
Classification: here binary classification problems.
Confusion Matrix: True positives (TP) Predicted positive and are actually positive. False
positives (FP) = Type I error Predicted positive and are actually negative. True negatives
(TN) Predicted negative and are actually negative. False negatives (FN) = Type II error
Predicted negative and are actually positive. Accuracy (A) = (TP + TN)/(TP + FP + TN + FN)
or Accuracy (A) = (TP + TN)/Total. Precision (P) = TP/(TP + FP). Recall (R) = TP/(TP +
FN) Accuracy is the number of correct predictions made as a ratio of all predictions made.
This is the most common evaluation metric for classification problems and is also the most
misused. It is most suitable when there are an equal number of observations in each class
(which is rarely the case) and when all predictions and the related prediction errors are
equally important, which is often not the case. In the case of imbalanced data (data sets where one of the two classes is much more frequent than the other), e.g., 99% non-fraudulent transactions and 1% fraudulent, a model that always predicts “non-fraudulent” reaches 99% accuracy; imbalanced data is the norm, and it is rare that the events of interest have equal or even similar frequency in the data. Precision:
percentage of positive instances out of the total predicted positive instances. Precision is also
known as positive predictive value (PPV). Precision is a good measure to determine when the
cost of false positives is high (e.g., email spam detection). Recall is the percentage of positive
instances out of the total actual positive instances. Recall is also known as sensitivity, hit rate,
or true positive rate (TPR). Recall is a good measure when there is a high cost associated with
false negatives (e.g., fraud detection, disease diagnostics). F1 score: harmonic mean of
precision and recall. F1 score = (2 * P * R)/(P + R). F1 score is more appropriate than
accuracy when unequal class distribution is in the dataset and it is necessary to measure the
equilibrium of precision and recall. High scores on both of these metrics suggest good model
performance.
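A minimal sketch of the classification metrics (the label vectors are toy examples):

    from sklearn.metrics import (confusion_matrix, accuracy_score,
                                 precision_score, recall_score, f1_score)

    y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]   # actual labels
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]   # model predictions

    print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]]
    print(accuracy_score(y_true, y_pred))     # (TP + TN) / total
    print(precision_score(y_true, y_pred))    # TP / (TP + FP)
    print(recall_score(y_true, y_pred))       # TP / (TP + FN)
    print(f1_score(y_true, y_pred))           # 2PR / (P + R)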
Area under ROC curve (AUC): evaluation metric for binary classification problems.
Receiver Operating Characteristic (ROC): probability curve, and AUC represents degree
or measure of separability. It tells how much the model is capable of distinguishing between
classes.
Model Selection Criteria: Simplicity, Training time, Presence of non-linearity in the data,
Robustness to overfitting, Size of the dataset, Number of features, Model interpretation
Computational linguistics, also known as natural language processing (NLP): the
subfield of computer science concerned with using computational techniques to learn,
understand, and produce human language content. Natural language processing (NLP) is a
branch of AI that deals with the problems of making a machine understand the structure and
the meaning of natural language as used by humans. Several techniques of machine learning
and deep learning are used within NLP.
Goals of NLP: 1 aiding human-human communication, (e.g., in machine translation (MT)); 2
aiding human-machine communication (e.g., with conversational agents); or 3 benefiting both
humans and machines by analyzing and learning from the enormous quantity of human
language content that is now available online.
NLP has many applications in the finance sectors in areas such as: 1 sentiment analysis 2
chatbots 3 document processing 4 risk management (liquidity risk management, credit default
modelling, etc.)
Automation: Automation using NLP is well-suited in the context of finance. It reduces the
strain that repetitive, low-value tasks put on human employees. It tackles the routine,
everyday processes, freeing up teams to finish their high-value work. In doing so, it drives
enormous time and cost savings.
Sources of the text: A lot of information, such as sell-side reports, earnings calls, and newspaper headlines, is communicated via text, making NLP very useful in the financial domain (e.g., the Refinitiv Transcripts database, which covers historical archives in the market).
Uses of the text: Parsing documents. It is unfortunately quite common for companies to obscure machine-readable disclosure by inserting tables into documents in “picture format” (.jpg, .png, etc.). A potential problem is that such tables cannot be easily read, but the pictures can be parsed using optical character recognition (OCR).
Terminology: data set is often called corpus. Each data point, represented as a single text, is
called a document. Token is equivalent to a word.
Types of data represented as strings: Not every string is text data; there are four kinds of string data: categorical data, free strings that can be semantically mapped to categories, structured string data, and text data. Categorical data can be easily mapped into a
variable, e.g., balance sheet, cash-flow statement, income statement; Free strings are the
sources of data with subcategories the same as above, but with some shortcuts and
misspellings like P&L statement, CF statement; Structured string data are data items with a
certain underlying structure like address and phone number; Text data are phrases, sentences
that do not belong to any of the above groups.
NLP processing pipeline: consists of preprocessing (tokenization, stemming, lemmatization,
PoS tagging, named entity recognition, stop words removal), feature representation (BoW,
Co-occurrence matrix, TF-IDF; Word2vec, GloVe) and Inference (Supervised, unsupervised
and reinforcement).
Data preprocessing: 3 Python packages for data preprocessing – NLTK, TextBlob and spaCy.
Tokenization: splitting a text into meaningful segments – tokens, which can be words, punctuation, numbers or other special characters that are the building blocks of a sentence.
Stop words removal: helps remove extremely common words that offer little value in modeling. In finance, we do not always drop all stop words, because some of them can have an important role in, e.g., differentiating sentiment.
Both lemmatization and stemming are methods for normalization that try to extract some
normal form of a word. These words would likely result in overfitting and poor generalization
performance, e.g., vocabulary contains singular and plural versions of some words, different
verb forms and a noun relating to the verb. Stemming: each word can be represented using its
stem. It is done by using a rule-based heuristic like dropping common suffixes. (E.g.,
connection, connections, connective, connected, connecting → connect). The process is referred to as lemmatization when a dictionary of known word forms is used and the role of a word in the sentence is important. Lemmatization is the process of converting inflected forms of a word
into its morphological root. For example, the lemma of analyzed and analyzing is analyze.
Lemmatization is computationally more expensive and advanced. The difference between the
two processes is that stemming can often create nonexistent words, whereas lemmas are
actual words.
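A minimal stemming/lemmatization sketch with NLTK (the example words are assumptions; the exact outputs depend on the NLTK version and downloaded resources):

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("wordnet", quiet=True)   # resource required by the lemmatizer

    stemmer = PorterStemmer()
    print(stemmer.stem("connections"))     # rule-based suffix stripping -> 'connect'
    print(stemmer.stem("analyzing"))       # may yield a non-word stem such as 'analyz'

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("analyzed", pos="v"))   # dictionary-based -> 'analyze'
    print(lemmatizer.lemmatize("analyzing", pos="v"))  # -> 'analyze'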
Part-of-Speech(PoS) tagger: uses language structure and dictionaries to tag every token in
the text with a corresponding part of speech. Some common POS tags are noun, verb, adj…
Named entity recognition (NER): an optional step to locate and classify named entities in
text into predefined categories, e.g., names of persons, locations, …
Additional preprocessing methods: lowercasing (removes distinctions among the same words due to upper and lower case), removal of non-alphanumeric characters (any characters that are not letters or digits, such as >; filter() can be used to remove all non-alphanumeric characters from a string), dependency parsing (extracting the dependency parse of a sentence to represent its grammatical structure; it defines the dependency relationship between headwords and their dependents), and coreference resolution (connecting tokens that represent the same entity, e.g., “Tom has a headache. He did not sleep well.” → Tom = He).
Feature representation: Word embeddings convert textual data into numerical data. The
process of converting NLP text into numbers is called vectorization. Two main methods for
computing word embeddings are frequency-based/ count-based (count vectorization, TF-
IDF vectorization and Co-occurrence vectorization) and prediction- based/ learning- based
(pretrained models, e.g., Word2vec and GloVe and customized deep learning-based feature
representation).
In count-based models: the semantic similarity between words is determined by counting the
co-occurrence frequency. Bag of Words (BoW): documents are described by word
occurrences while ignoring the relative position information of the words, which means any
information about the structure of the sentence is lost. Although the resulting matrix can be
very large in memory, the amount of data can be reduced by using sparse matrices. Problem
of BoW is discarded word order. A solution could be n-grams. N-grams : considers the
counts of pairs or triplets (or more) of tokens that appear next to each other. N-grams are
representations of word or token sequences. They can offer invaluable contextual information
that can complement and enrich unigram. Co-occurrence matrix: how often things co-occur
in some environment. An alternative to BoW is TF-IDF (Term Frequency-Inverse Document Frequency), which calculates word frequencies. A word frequency score tries to
highlight words that are more interesting. To get a complete representation of the value of
each word, TF at the sentence level*IDF of a word across the entire dataset. TF-IDF values
can be useful in measuring the key terms across a compilation of documents and can serve as
word feature values for training an ML model. (TF-IDF=TF x IDF). Higher values indicate
words that appear more frequently within a smaller number of documents, which signifies
relatively more unique terms that are important. Lower values indicate terms that appear in
many documents.
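A minimal BoW and TF-IDF sketch with scikit-learn (the three toy documents are made-up examples):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["revenue increased this quarter",
            "revenue decreased this quarter",
            "the company reported strong revenue growth"]

    bow = CountVectorizer()                    # bag-of-words counts
    print(bow.fit_transform(docs).toarray())
    print(bow.get_feature_names_out())

    tfidf = TfidfVectorizer()                  # term frequency x inverse document frequency
    print(tfidf.fit_transform(docs).toarray())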
Prediction-based word embedding techniques: in predictive models, the word vectors are
learnt by trying to improve on the predictive ability (minimizing the loss between the target
word and the context word). Word2Vec: king-man+woman=queen. GloVe: the distance
between king→queen is roughly the same as the one between man→woman, or
brother→sister.
Inference: Supervised NLP: Naïve Bayes is one of the most frequently used as it can
produce reasonable accuracy using simple assumptions. Unsupervised NLP: Latent Dirichlet
Allocation (LDA) has been used for topic modelling – NLP practitioners build probabilistic
generative models to reveal likely topic attributions for words, as an unsupervised NLP
method.
Topic modelling (unsupervised ML): provides methods for automatically organizing,
understanding, searching and summarizing large electronic archives by discovering the
hidden themes in the collection, annotating the documents according to these themes and
using annotations to organize, summarize, search and form predictions. LDA tries to find
groups of words (the topics) that appear together frequently, e.g., wordcloud.
Sentiment analysis: Robo-readers are automated programs used to analyze large quantities
of text like news articles and social media. In this way, robo-readers are being used by
investors to examine how views expressed in text relate to future company performance.
Robo-readers often look to analyze sentiment polarity- how positive, negative or neutral a
particular phrase or statement is regarding a target. Sentiment provides invaluable predictive power, both alone and when coupled with structured financial data, for predicting stock price movements for individual firms and for portfolios of companies.
Text curation: uses database Financial Phrase Bank and presents data (cross-sectional data,
not time-series data) in a text document format. The sentiment of each sentence has been
labeled as positive, negative or neutral. The sentiment classes are provided from an investor’s perspective and may be useful for predicting whether a sentence may have a corresponding positive, negative or neutral influence on the respective company’s stock price.
A supervised ML model is trained, validated and tested using these data. The final ML model
can be used to predict the sentiment classes of sentences present in similar financial news
statements.
Text cleansing involves removing punctuation, numbers and spaces that may not be necessary for model training, or incorporating appropriate substitutions for potentially extraneous information present in the text.
Stop words: not removed because some of them (e.g., not, more, very and few) carry
significant meaning in the financial texts that is useful for sentiment prediction. Some words
like a, an, the can be removed. However, overall to avoid confusion no words are removed.
Document term matrix (DTM): The last step of text preprocessing is using the final BoW
after normalizing to build a document term matrix (DTM). It is a matrix that is similar to a
data table for structured data and is widely used for text data. Each row belongs to a
document and each column represents a token. The number of rows of DTM is equal to the
number of documents in a sample dataset.
Unsupervised ML is used to draw inferences from data sets consisting of input data without
labeled responses. There are two types of unsupervised ML: dimensionality reduction and
clustering. Main challenge: Algorithm performance evaluation, i.e., whether the algorithm
learned something useful.
Dimensionality reduction: process of reducing the number of features or variables while
preserving data information and overall model performance. The most frequently used techniques for dimensionality reduction include principal component analysis (PCA), kernel principal component analysis (KPCA) and t-distributed stochastic neighbor embedding (t-SNE). PCA is a linear algorithm that forces the new variables to be linear combinations of the original features, whereas nonlinear algorithms such as KPCA and t-SNE can capture more complex structures in the data. Dimensionality reduction can be used in portfolio
management, yield curve construction and interest rate modelling as well as speed and
accuracy enhancement of a trading strategy.
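A brief PCA sketch with scikit-learn; the random return matrix is a stand-in for real data, and the choice of three components is arbitrary.

```python
# Hedged sketch of dimensionality reduction with PCA, e.g. compressing many
# correlated asset returns or yield-curve points into a few principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
returns = rng.normal(size=(250, 20))  # 250 days x 20 assets (illustrative)

pca = PCA(n_components=3)
factors = pca.fit_transform(returns)          # new variables: linear combinations of originals
print(factors.shape)                          # (250, 3)
print(pca.explained_variance_ratio_.round(3)) # share of variance captured by each component
```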
Clustering algorithms (which focus on minimizing dissimilarity, i.e., the distance between data points) allow us to discover hidden structures in data. The goal of clustering is to find a natural grouping of the data so that items in the same cluster are more similar to each other than to those in different clusters; in other words, clustering seeks to learn, from the properties of the data, an optimal division or discrete labeling of groups of points based on similarity. Three typical techniques are k-means, hierarchical and affinity propagation clustering. Clustering can be used in portfolio construction, investor classification and risk management. For example, pairs trading is a non-directional, relative-value investment strategy that seeks to identify two companies or funds with similar characteristics (such as Audi and Mercedes-Benz) whose equity securities are currently trading at a price relationship that is outside its historical trading range. The strategy entails buying the undervalued security while short-selling the overvalued security, all while maintaining market neutrality. Another example is investor classification, which determines an investor's ability and willingness to take risk.
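A toy sketch of the pairs-trading logic above: simulate two co-moving price series, track the spread, and flag trades when it leaves its historical range. The simulated prices and the two-standard-deviation threshold are assumptions for illustration.

```python
# Hedged sketch of the pairs-trading idea: short the relatively overvalued leg and
# buy the undervalued one when the spread moves outside its historical range.
import numpy as np

rng = np.random.default_rng(1)
common = np.cumsum(rng.normal(size=500))          # shared driver of both stocks
price_a = 100 + common + rng.normal(size=500)     # stock A (e.g. one carmaker)
price_b = 100 + common + rng.normal(size=500)     # stock B (a similar carmaker)

spread = price_a - price_b
zscore = (spread - spread.mean()) / spread.std()

# Trade signal based on an illustrative 2-standard-deviation band; market neutral otherwise.
signal = np.where(zscore > 2, "short A / long B",
         np.where(zscore < -2, "long A / short B", "no trade"))
print(signal[-5:])
```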
Under k-means clustering, we need to tell the model how many groups we want (a hyperparameter). It is a centroid-based (distance-based) algorithm that tries to find cluster centers that are representative of certain regions of the data. The algorithm assigns each data point to the closest cluster center and then sets each center to the mean of the data points assigned to it. The algorithm finishes when the assignment of instances to clusters no longer changes. It divides a set of N samples X into K disjoint clusters S, each described by the mean of the samples in the cluster. These means are commonly called the cluster centroids; they are not, in general, points from X, although they live in the same space. The k-means algorithm aims to choose centroids (the center of a cluster, calculated as the arithmetic mean) that minimize the inertia (a measure of how internally coherent clusters are), also known as the within-cluster sum-of-squares criterion. Main strengths: simplicity, a wide range of applicability, fast convergence and linear scalability to large data while producing clusters of an even size; it is most useful when the exact number of clusters k is known beforehand. Main weaknesses: the number of clusters is a hyperparameter that has to be tuned, there is no guarantee of finding a global optimum, it is sensitive to outliers, and it can only capture relatively simple shapes. The optimal number of clusters in the data can be found by 'knee finding' or 'elbow finding'.
If k increases, average distortion will decrease, each cluster will have fewer constituent
instances, and the instances will be closer to their respective centroids. However, the
improvement in average distortion will decline as k increases. The elbow method plots the
value of the cost function produced by different values of k. The median (here: the medoid) is preferred over the mean in the presence of outliers. K-medoids is an approach to overcome extreme values in a dataset that can disrupt a clustering solution significantly; it uses the medoid as the center point of a cluster, which means that the center of a cluster must be one of the observations in that cluster.
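A compact k-means and elbow-method sketch with scikit-learn; the synthetic blobs and the candidate range for k are assumptions.

```python
# Hedged sketch: fit k-means for several values of k and inspect the inertia
# (within-cluster sum of squares); the "elbow" where improvement flattens suggests k.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))  # the improvement flattens after the true k (the elbow)
```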
Hierarchical clustering involves creating clusters that have a predominant ordering, i.e., a
hierarchy so that we do not need to specify the number of clusters. Two types of hierarchical
clustering: agglomerative and divisive hierarchical clustering. Agglomerative hierarchical
clustering is a bottom-up approach, where each observation starts in its own cluster and pairs
of clusters are merged as one moves up the hierarchy. Divisive hierarchical clustering is a top-down approach, where all observations start in one cluster and splits are performed recursively as one moves down the hierarchy. Hierarchical clustering can be visualized with a dendrogram. Main strengths of agglomerative hierarchical clustering: no need to pre-specify the number of clusters; results can be visualized. Main weaknesses: it fails at separating complex shapes in the data structure, and the choice of both distance metric and linkage criteria is often arbitrary. Main strength of divisive hierarchical clustering: whereas bottom-up methods make clustering decisions based on local patterns without taking the global distribution into account, top-down clustering benefits from complete information about the global distribution when making top-level partitioning decisions. Main weakness: sensitivity to initialization, due to the many possible ways of dividing the data into two clusters in the first steps.
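A short sketch of agglomerative clustering visualized as a dendrogram using SciPy; the synthetic data and the choice of Ward linkage are assumptions.

```python
# Hedged sketch: bottom-up (agglomerative) clustering visualised as a dendrogram.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

Z = linkage(X, method="ward")   # repeatedly merge the two closest clusters
dendrogram(Z)                   # the full hierarchy; no need to pre-specify the number of clusters
plt.title("Agglomerative clustering dendrogram")
plt.show()
```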
Affinity propagation clustering also does not require specifying the number of groups; it creates
clusters by sending messages between pairs of samples until convergence. Then a data set is
described using a small number of exemplars, which are identified as those most
representative of other samples. The messages sent between pairs represent the suitability for
one sample to be the exemplar of the other, which is updated in response to the values from
other pairs. This updating happens iteratively until convergence, at which point the final
exemplars are chosen and hence the final clustering is given. In comparison with k-means, it
describes a dataset using a small number of exemplars. These are members of the input set
that are representative of clusters. The centroid in k-means clustering does not have to be one
of the data points, while the exemplar in affinity propagation clustering is one of the data
points. Main strengths: the number of clusters is chosen based on the data provided, and the algorithm is fast. Main weaknesses: high time and memory complexity; it is only appropriate for small and medium-sized data sets; and it may converge only to a suboptimal solution or fail to converge at all.
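A minimal affinity propagation sketch with scikit-learn on synthetic data, showing that the number of clusters is inferred and that the exemplars are actual data points.

```python
# Hedged sketch: affinity propagation does not take the number of clusters as input,
# and each cluster is summarised by an exemplar that is a real observation.
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

ap = AffinityPropagation(random_state=0).fit(X)

print("number of clusters found:", len(ap.cluster_centers_indices_))
print("exemplars (row indices of X):", ap.cluster_centers_indices_)
```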
Presentation:
1. How to speak when AI is listening: Pi audio analysis → advantage of easily implementing the tool myself.
2. ChatGPT was trained at a low cost, and the bias in the training data was removed by Kenyan workers paid low wages, so it won't end up like the Amazon one.
3. Jockeys vs. horses: showed and discussed that we should go for the business (the horses) rather than the charisma of the CEO.
4. The Loughran-McDonald dictionary is easily accessible and it is kept updated.
5. Argentine example: Hedge fund managers sometimes go to surprising lengths and try to seize assets of countries, such as vessels and planes. When a country defaults, investors sometimes go to extra lengths and try to ….