
Kernels and Model Selection and Feature Selection

Kernels
A kernel is a way of computing the dot product of two vectors x and y in some (possibly very
high-dimensional) feature space, which is why kernel functions are sometimes called "generalized
dot products".

Suppose we have a mapping φ: R^n → R^m that brings our vectors in R^n to some feature space R^m.
Then the dot product of x and y in this space is φ(x)^T φ(y). A kernel is a function k that
corresponds to this dot product, i.e. k(x, y) = φ(x)^T φ(y).

Why is this useful? Kernels give a way to compute dot products in some feature space without
explicitly knowing what that space is or what φ is. A kernel is effectively a similarity measure.
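
To make this concrete, here is a minimal sketch (using NumPy; the degree-2 polynomial kernel and the explicit map φ chosen here are just one illustrative pairing, not the only one) showing that the kernel value equals the dot product taken in the feature space:

    import numpy as np

    def phi(x):
        """Explicit feature map for the degree-2 homogeneous polynomial kernel in R^2."""
        x1, x2 = x
        return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

    def poly_kernel(x, y):
        """Kernel k(x, y) = (x . y)^2, computed without ever building phi(x) or phi(y)."""
        return np.dot(x, y) ** 2

    x = np.array([1.0, 2.0])
    y = np.array([3.0, 4.0])

    # Both quantities are the same dot product: one taken explicitly in the
    # feature space, the other computed directly in the input space via the kernel.
    print(np.dot(phi(x), phi(y)))   # 121.0
    print(poly_kernel(x, y))        # 121.0

The point of the trick is the second line of output: the kernel gives the feature-space dot product at the cost of a dot product in the original space.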

Model Selection
Model selection is the task of selecting a statistical model from a set of candidate models, given
data. In the simplest cases, a pre-existing set of data is considered. However, the task can also
involve the design of experiments such that the data collected is well-suited to the problem of
model selection. Given candidate models of similar predictive or explanatory power, the simplest
model is most likely to be the best choice (Occam's razor).

Model selection is the process of choosing between different machine learning approaches (e.g.
SVM, logistic regression, etc.) or choosing between different hyperparameters or sets of features
for the same machine learning approach (e.g. deciding between polynomial degrees/complexities
for linear regression).
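
As a hedged sketch of what this looks like in practice (assuming scikit-learn is available; the breast cancer dataset and the particular candidate models are arbitrary illustrations, not a recommendation), candidates can be compared by cross-validated accuracy:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    # Candidate models: different algorithms and different hyperparameters.
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=5000),
        "svm_linear": SVC(kernel="linear"),
        "svm_rbf": SVC(kernel="rbf", C=1.0),
    }

    # Score each candidate with 5-fold cross-validation and compare the mean scores.
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: mean accuracy = {scores.mean():.3f}")

Selecting whichever candidate has the best held-out score is the simplest version of model selection; more careful workflows keep a separate test set aside for the final estimate.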

Model selection may also refer to the problem of selecting a few representative models from a
large set of computational models for the purpose of decision making or optimization under
uncertainty.

The choice of the actual machine learning algorithm (e.g. SVM or logistic regression) is less
important than you'd think - there may be a "best" algorithm for a particular problem, but often
its performance is not much better than other well-performing approaches for that problem.

There may be certain qualities you look for in a model:

• Interpretable - can we see or understand why the model is making the decisions it makes?
• Simple - easy to explain and understand
• Accurate
• Fast (to train and test)
• Scalable (it can be applied to a large dataset)
Though there are generally trade-offs amongst these qualities.

In its most basic forms, model selection is one of the fundamental tasks of scientific inquiry.
Determining the principle that explains a series of observations is often linked directly to a
mathematical model predicting those observations. Of the countless number of possible
mechanisms and processes that could have produced the data, how can one even begin to choose
the best model? The mathematical approach commonly taken decides among a set of candidate
models; this set must be chosen by the researcher. Often simple models such as polynomials are
used, at least initially. Burnham & Anderson (2002) emphasize throughout their book the
importance of choosing models based on sound scientific principles, such as understanding of the
phenomenological processes or mechanisms (e.g., chemical reactions) underlying the data.

Once the set of candidate models has been chosen, the statistical analysis allows us to select the
best of these models. What is meant by best is controversial. A good model selection technique
will balance goodness of fit with simplicity. More complex models will be better able to adapt
their shape to fit the data (for example, a fifth-order polynomial can exactly fit six points), but
the additional parameters may not represent anything useful. (Perhaps those six points are really
just randomly distributed about a straight line.) Goodness of fit is generally determined using a
likelihood ratio approach, or an approximation of this, leading to a chi-squared test. The
complexity is generally measured by counting the number of parameters in the model.

Model selection techniques can be considered as estimators of some physical quantity, such as
the probability of the model producing the given data. The bias and variance are both important
measures of the quality of this estimator; efficiency is also often considered.

A standard example of model selection is that of curve fitting, where, given a set of points and
other background knowledge (e.g. points are a result of i.i.d. samples), we must select a curve
that describes the function that generated the points.
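
A small illustrative sketch of this curve-fitting setting (assuming NumPy; the noisy straight-line data is synthetic and purely hypothetical) compares polynomial fits of increasing degree on held-out points, so the extra parameters of the higher-degree fits have to justify themselves:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: points scattered about a straight line.
    x = np.linspace(0, 1, 30)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.2, size=x.size)

    # Hold out every other point so additional parameters must earn their keep.
    train, test = np.arange(0, 30, 2), np.arange(1, 30, 2)

    for degree in (1, 2, 5):
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x[test])
        mse = np.mean((pred - y[test]) ** 2)
        print(f"degree {degree}: held-out MSE = {mse:.4f}")

The higher-degree polynomials fit the training points more closely, but on the held-out points the simpler line typically does as well or better, which is the fit-versus-complexity balance described above.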

Feature Selection
In machine learning and statistics, feature selection, also known as variable selection, attribute
selection or variable subset selection, is the process of selecting a subset of relevant features
(variables, predictors) for use in model construction. Feature selection techniques are used for
four reasons:

1. simplification of models to make them easier for researchers/users to interpret,
2. shorter training times,
3. avoidance of the curse of dimensionality,
4. enhanced generalization by reducing overfitting (formally, reduction of variance).

The central premise when using a feature selection technique is that the data contains some
features that are either redundant or irrelevant, and can thus be removed without incurring much
loss of information. Redundant and irrelevant are two distinct notions, since one relevant feature
may be redundant in the presence of another relevant feature with which it is strongly correlated.

Feature selection techniques should be distinguished from feature extraction. Feature extraction
creates new features from functions of the original features, whereas feature selection returns a
subset of the features. Feature selection techniques are often used in domains where there are
many features and comparatively few samples (or data points). Archetypal cases for the
application of feature selection include the analysis of written texts and DNA microarray data,
where there are many thousands of features, and a few tens to hundreds of samples.

Feature Selection Algorithms

There are three general classes of feature selection algorithms: filter methods, wrapper methods
and embedded methods.

• Filter Methods

Filter feature selection methods apply a statistical measure to assign a score to each
feature. The features are ranked by the score and either kept or removed
from the dataset. These methods are often univariate and consider each feature
independently, or only with regard to the dependent variable. Examples of filter
methods include the chi-squared test, information gain and correlation coefficient scores.
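
For instance (a minimal sketch assuming scikit-learn; the iris dataset and the choice of k are arbitrary), a chi-squared filter can score every feature and keep only the top-scoring ones before any model is trained:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2

    X, y = load_iris(return_X_y=True)

    # Score every feature against the target with the chi-squared statistic
    # and keep only the two highest-scoring features.
    selector = SelectKBest(score_func=chi2, k=2)
    X_reduced = selector.fit_transform(X, y)

    print("chi-squared scores:", selector.scores_)
    print("kept feature indices:", selector.get_support(indices=True))
    print("reduced shape:", X_reduced.shape)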

• Wrapper Methods

Wrapper methods consider the selection of a set of features as a search problem, where
different combinations are prepared, evaluated and compared to other combinations. A
predictive model is used to evaluate a combination of features and assign a score based
on model accuracy.

The search process may be methodical, such as a best-first search; it may be stochastic, such
as a random hill-climbing algorithm; or it may use heuristics, such as forward and backward
passes to add and remove features.
An example of a wrapper method is the recursive feature elimination algorithm.
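
A brief sketch of recursive feature elimination (assuming scikit-learn; the logistic regression estimator and the number of features to keep are illustrative choices, not prescriptions):

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)

    # Repeatedly fit the model and drop the weakest feature until 5 remain.
    rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=5)
    rfe.fit(X, y)

    print("selected feature mask:", rfe.support_)
    print("feature ranking (1 = kept):", rfe.ranking_)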

• Embedded Methods

Embedded methods learn which features best contribute to the accuracy of the model
while the model is being created. The most common embedded feature selection
methods are regularization methods.

Regularization methods, also called penalization methods, introduce additional
constraints into the optimization of a predictive algorithm (such as a regression
algorithm) that bias the model toward lower complexity (fewer coefficients).

Examples of regularization algorithms are the LASSO, Elastic Net and Ridge Regression.
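
As an illustration of the embedded approach (a minimal sketch assuming scikit-learn; the diabetes dataset and the regularization strength alpha are arbitrary choices), the L1 penalty of the LASSO drives some coefficients exactly to zero, and the surviving non-zero coefficients identify the selected features:

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Lasso
    from sklearn.preprocessing import StandardScaler

    X, y = load_diabetes(return_X_y=True)
    X = StandardScaler().fit_transform(X)

    # The L1 penalty shrinks some coefficients exactly to zero;
    # the non-zero ones are the features the model effectively keeps.
    lasso = Lasso(alpha=1.0).fit(X, y)

    selected = np.flatnonzero(lasso.coef_)
    print("coefficients:", lasso.coef_)
    print("selected feature indices:", selected)

Here the feature selection happens as a side effect of fitting the model itself, which is what distinguishes embedded methods from filters and wrappers.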

Feature selection, like model selection, is a key part of the applied machine learning process.
It is not a step you can fire and forget.
BIBLIOGRAPHY
The sources used in creating this report are:

1. Kernel method - Wikipedia
2. How to intuitively explain what a kernel is? - Cross Validated
3. Model selection - Wikipedia
4. Model Selection
5. Feature selection - Wikipedia
6. An Introduction to Feature Selection
