Cheat Sheet: Machine Learning with KNIME Analytics Platform

SUPERVISED LEARNING

Supervised Learning: A set of machine learning algorithms to predict the value of a target class or variable. They produce a mapping function (model) from the input features to the target class/variable. To estimate the model parameters during the training phase, labeled example data are needed in the training set. Generalization to unseen data is evaluated on the test set via scoring metrics.

CLASSIFICATION
Classification: A type of supervised learning where the target is a class. The model learns to produce a class score and to assign each vector of input features to the class with the highest score. A cost can be introduced to penalize one of the classes during class assignment.
Decision Tree: Follows the C4.5 decision tree algorithm. These algorithms generate a tree-like structure, creating data subsets, aka tree nodes. At each node, the data are split based on one of the input features, generating two or more branches as output. Further splits are made in subsequent nodes until a node is generated where all or almost all of the data belong to the same class.
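KNIME packages each of these algorithms as Learner/Predictor node pairs. For readers who want a scriptable reference, here is a minimal sketch with scikit-learn; note that scikit-learn implements an optimized CART variant rather than C4.5, and the dataset and depth limit are illustrative assumptions only:

```python
# Minimal decision tree classification sketch (CART variant, not C4.5).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(max_depth=3)  # shallow tree to curb overfitting
tree.fit(X_train, y_train)                  # training phase: learn the splits
print(tree.score(X_test, y_test))           # accuracy on unseen test data
```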
Naive Bayes: Based on Bayes' theorem and assuming statistical independence between input features (thus "naive"), this algorithm estimates the conditional probability of each output class given the vector of input features. The class with the highest conditional probability is assigned to the input data.

Artificial Neural Networks (ANN, NN): Inspired by biological nervous systems, artificial neural networks are based on architectures of interconnected units called artificial neurons. Artificial neurons' parameters and connections are trained via dedicated algorithms, the most popular being the Back-Propagation algorithm.

Deep Learning: Deep learning extends the family of ANNs with deeper architectures and additional paradigms, e.g. Recurrent Neural Networks (RNN). The training of such networks has been enabled by recent advances in hardware performance as well as parallel execution.
Support Vector Machine (SVM): A supervised algorithm constructing a set of discriminative hyperplanes in high-dimensional space. In addition to linear classification, SVMs can perform non-linear classification by implicitly mapping their inputs into high-dimensional feature spaces, where the two classes are linearly separable.

Logistic Regression: A statistical algorithm that models the relationship between the input features and the categorical output classes by maximizing a likelihood function. Originally developed for binary problems, it has been extended to problems with more than two classes (multinomial logistic regression).
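A minimal, illustrative logistic regression sketch with scikit-learn, reusing the same toy dataset as above; scikit-learn handles the extension to more than two classes automatically:

```python
# Multinomial logistic regression sketch.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Parameters are estimated by maximizing the (log-)likelihood.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test[:3]))  # class scores for three test points
```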
k-Nearest Neighbor (kNN): A non-parametric method that assigns to a new point the class of the k most similar points in the training data, based on a pre-defined distance measure. Class attribution can be weighted by the distance to the k-th point and/or by the class probability.
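A minimal kNN sketch with scikit-learn; k = 5 and distance weighting are illustrative choices mirroring the weighting option described above:

```python
# k-Nearest Neighbor classification sketch.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# weights="distance" weights each neighbor's vote by inverse distance.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)            # kNN is lazy: fit just stores the data
print(knn.score(X_test, y_test))
```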
NUMERIC PREDICTION

Numeric Prediction: A type of supervised learning for numeric target variables. The model learns to associate one or more numbers with the vector of input features. Note that numeric prediction models can also be trained to predict class scores and therefore can be used for classification problems too.

Linear/Polynomial Regression: Linear Regression is a statistical algorithm to model a multivariate linear relationship between the numeric target variable and the input features. Polynomial Regression extends this concept to fitting a polynomial function of a pre-defined degree.
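A minimal sketch contrasting linear and polynomial regression on synthetic quadratic data; in scikit-learn, polynomial regression is simply linear regression on expanded polynomial features:

```python
# Linear vs. polynomial regression sketch.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(scale=0.3, size=100)

linear = LinearRegression().fit(X, y)  # plain multivariate linear fit
# Polynomial regression = linear regression on polynomial feature expansion.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print(linear.score(X, y), poly.score(X, y))  # R^2 of each fit
```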

Regression Tree: Builds a decision tree to predict numeric values through a recursive, top-down, greedy approach known as recursive binary splitting. At each step, the algorithm splits the subsets represented by each node into two new branches using a greedy search for the best split. The average value of the points in a leaf produces the numerical prediction.

Generalized Linear Model (GLM): A statistics-based flexible generalization of ordinary linear regression, valid also for non-normal distributions of the target variable. GLM uses the linear combination of the input features to model an arbitrary function of the target variable (the link function) rather than the target variable itself.
TIME SERIES ANALYSIS

Time Series Analysis: A set of numeric prediction methods to analyze/predict time series data. Time series are time-ordered sequences of numeric values. In particular, time series forecasting aims at predicting future values based on previously observed values.

Auto-Regressive Integrated Moving Average (ARIMA): A linear Auto-Regressive (AR) model is constructed on a specified number p of past values; the data are prepared by a degree of differencing d to correct non-stationarity; and a linear combination, named Moving Average (MA), models the q past residual errors. All ARIMA model parameters are estimated concurrently by various algorithms, mostly following the Box–Jenkins approach.
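A minimal ARIMA(p, d, q) sketch using statsmodels on a synthetic random-walk series; order=(2, 1, 1) is an arbitrary illustrative choice, not a recommendation:

```python
# ARIMA(p, d, q) forecasting sketch.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))  # toy non-stationary series

# p=2 past values (AR), d=1 differencing step, q=1 past residual (MA).
model = ARIMA(series, order=(2, 1, 1))
fitted = model.fit()                      # all parameters estimated jointly
print(fitted.forecast(steps=5))           # forecast the next 5 values
```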

ML-based TSA: A numeric prediction model trained on vectors of past values can predict the current numeric value of the time series.
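To illustrate the lag-vector idea, one can reshape the series into a table of past-value vectors and train any regressor on it; the window size of 3 and the random forest regressor below are arbitrary illustrative choices:

```python
# ML-based time series sketch: lagged feature vectors + ordinary regressor.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
series = np.sin(np.linspace(0, 20, 300)) + rng.normal(scale=0.1, size=300)

lags = 3  # window of past values used as input features
X = np.column_stack([series[i : len(series) - lags + i] for i in range(lags)])
y = series[lags:]                      # target: the value after each window

model = RandomForestRegressor(random_state=0).fit(X[:-50], y[:-50])
print(model.score(X[-50:], y[-50:]))   # R^2 on the held-out tail
```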
Long Short-Term Memory (LSTM) Units: LSTM units produce a hidden state by processing m x n tensors of input values, where m is the size of the input vector at any time and n is the number of past vectors. The hidden state can then be transformed into the current vector of numeric values. LSTM units are suited for time series prediction, as values from past vectors can be remembered or forgotten through a series of logic gates.
UNSUPERVISED LEARNING

Unsupervised Learning: A set of machine learning algorithms to discover patterns in the data. A labeled dataset is not required, since data are ultimately organized and/or transformed based on similarity or statistical measures.

CLUSTERING

Clustering: A branch of unsupervised learning algorithms that groups data together based on similarity measures, without the help of labels, classes, or categories.
k-Means: The n data points in the dataset are clustered into k clusters based on the shortest distance from the cluster prototypes. The cluster prototype is taken as the average data point in the cluster.
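A minimal k-Means sketch with scikit-learn on synthetic blob data; k = 3 matches the number of generated blobs:

```python
# k-Means clustering sketch.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)     # assign each point to the nearest prototype
print(km.cluster_centers_)     # prototypes = per-cluster average points
```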
Hierarchical Clustering: Builds a hierarchy of clusters by either collecting the most similar (agglomerative approach) or separating the most dissimilar (divisive approach) data points and clusters, according to a selected distance measure. The result is a dendrogram clustering the data together bottom-up (agglomerative) or separating the data into different clusters top-down (divisive).
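A minimal agglomerative (bottom-up) sketch using SciPy, which also exposes the dendrogram; average linkage and the cut into three clusters are illustrative choices:

```python
# Agglomerative hierarchical clustering sketch.
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

Z = linkage(X, method="average")                 # build the dendrogram bottom-up
labels = fcluster(Z, t=3, criterion="maxclust")  # cut it into 3 clusters
print(labels)
```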
DBSCAN: A density-based non-parametric clustering algorithm. Data points are classified as core, density-reachable, and outlier points. Core and density-reachable points in high-density regions are clustered together, while points with no close neighbors in low-density regions are labeled as outliers.
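A minimal DBSCAN sketch with scikit-learn; the eps radius and min_samples density threshold are illustrative and normally require tuning:

```python
# DBSCAN density-based clustering sketch.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

# eps: neighborhood radius; min_samples: density threshold for core points.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))  # cluster ids; -1 marks outliers (low-density points)
```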
Self-Organizing Tree Algorithm (SOTA): A special Self-Organizing Map (SOM) neural network. Its cell structure is grown using a binary tree topology.

Fuzzy c-Means: One of the most widely used fuzzy clustering algorithms. It works similarly to the k-Means algorithm, but it allows data points to belong to more than one cluster, with different degrees of membership.
RECOMMENDATION ENGINES

Recommendation Engines: A set of algorithms that use known information about user preferences to predict items of interest.

Association Rules: The node reveals regularities in co-occurrences of multiple products in large-scale transaction data recorded at points-of-sale. Based on the a-priori algorithm, the most frequent itemsets in the dataset are used to generate recommendation rules.

Collaborative Filtering: Based on the Alternating Least Squares (ALS) technique, it produces recommendations (filtering) about the interests of a user by comparing their current preferences with those of multiple users (collaborating).
ENSEMBLE LEARNING

Ensemble Learning: A combination of multiple models from supervised learning algorithms to obtain a more stable and accurate overall model. The most commonly used ensemble techniques are Bagging and Boosting.

BAGGING

Bagging: A method for training multiple classification/regression models on different randomly drawn subsets of the training data. The final prediction is based on the predictions provided by all the models, thus reducing the chance of overfitting.

Tree Ensemble of Decision/Regression Trees: An ensemble model of multiple decision/regression trees trained on different subsets of data. Data subsets with fewer or equal rows and fewer or equal columns are bootstrapped from the original training set. The final prediction is based on a hard vote (majority rule) or a soft vote (averaging all probabilities or numeric predictions) over all involved trees.
Random Forest of Decision/Regression Trees: An ensemble model of multiple decision/regression trees trained on different subsets of data. Data subsets with the same number of rows are bootstrapped from the original training set. At each node, the split is performed on a subset of sqrt(x) features from the original x input features. The final prediction is based on a hard vote (majority rule) or a soft vote (averaging all probabilities or numeric predictions) over all involved trees.
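A minimal random forest sketch with scikit-learn; max_features="sqrt" reproduces the sqrt(x) feature sampling described above:

```python
# Random forest classification sketch.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 trees, each on a bootstrap sample; sqrt(x) features tried per split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=42)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))  # soft-vote accuracy on the test set
```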
Custom Ensemble Model: Combines different supervised models to form a custom ensemble model. The final prediction can be based on a majority vote as well as on the average or other functions of the output results.

BOOSTING

Boosting: A method for training a set of classification/regression models iteratively. At each step, a new model is trained on the prediction errors and added to the ensemble to improve on the results of the previous model state, leading to higher accuracy after each iteration.
Gradient Boosted Regression Trees: An ensemble model combining multiple sequential simple regression trees into a stronger model. The algorithm builds the model stagewise: at each iteration, a simple regression tree is fitted to predict the residuals of the current model, following the gradient of the loss function. This leads to an increasingly accurate and complex overall model. The same regression trees can also be used for classification.
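A minimal gradient boosted trees sketch with scikit-learn; the number of trees, learning rate, and depth are illustrative assumptions:

```python
# Gradient boosted regression trees sketch.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Stagewise fitting: each shallow tree is fitted to the residuals of the
# current ensemble, following the gradient of the squared-error loss.
gbt = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05,
                                max_depth=3)
gbt.fit(X_train, y_train)
print(gbt.score(X_test, y_test))  # R^2 on held-out data
```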
XGBoost: An optimized distributed library for machine learning models in the gradient boosting framework, designed to be highly efficient, flexible, and portable. It features regularization parameters to penalize complex models, effective handling of sparse data for better performance, parallel computation, and more efficient memory usage.
EVALUATION

Evaluation: Various scoring metrics for assessing model quality, in particular a model's predictive ability or propensity to error.

Confusion Matrix: A representation of a classification task's success through the count of matches and mismatches between the actual and predicted classes, aka true positives, false negatives, false positives, and true negatives. One class is arbitrarily selected as the positive class.

Accuracy Measures: Evaluation metrics for a classification model calculated from the values in the confusion matrix, such as sensitivity and specificity, precision and recall, or overall accuracy.

Numeric Error Measures: Evaluation metrics for numeric prediction models quantifying the error size and direction. Common metrics include RMSE, MAE, or R². Most of these metrics depend on the range of the target variable.

ROC Curve: A graphical representation of the performance of a binary classification model, with false positive rates on the x-axis and true positive rates on the y-axis. Multiple points for the curve are obtained for different classification thresholds. The area under the curve is the metric value.

Cross-Validation: A model validation technique for assessing how the results of a machine learning model will generalize to an independent dataset. A model is trained and validated N times on different pairs of training set and test set, both extracted from the original dataset. Some basic statistics on the resulting N error or accuracy measures give insights on overfitting and generalization.
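A minimal evaluation sketch with scikit-learn combining a confusion matrix with 5-fold cross-validation; the model and fold count are illustrative choices:

```python
# Evaluation sketch: confusion matrix and N-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(confusion_matrix(y_test, clf.predict(X_test)))  # matches vs. mismatches

# N=5 folds: mean and spread of the scores hint at generalization.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```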
TRAINING & DEPLOYMENT

A typical training workflow reads and transforms the data, trains a model with a Learner node, applies it with a Predictor node, evaluates it with a Scorer node, and saves it with a Write Model node: Read Data → Transform → Learner → Predictor → Scorer → Write Model. A typical deployment workflow reads the saved model back and applies it to new data: Read Model → Data Input → Predictor → Data Output.

Resources

• E-Books: Learn even more with the KNIME books. From basic concepts in “KNIME Beginner’s Luck”, to advanced concepts in “KNIME Advanced Luck”, through to examples of real-world case studies in “Practicing Data Science”. Available for purchase at knime.com/knimepress

• KNIME Blog: Engaging topics, challenges, industry news, and knowledge nuggets at knime.com/blog

• KNIME Hub: Search, share, and collaborate on KNIME workflows, nodes, and components with the entire KNIME community at hub.knime.com

• KNIME Forum: Join our global community and engage in conversations at forum.knime.com

• KNIME Server: The enterprise software for team-based collaboration, automation, management, and deployment of data science workflows as analytical applications and services. Visit knime.com/server for more information.
