Regularized Linear Regression


What do regression coefficients tell us?

Recall that for a normal linear regression model

$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$$

we estimate its coefficients using the least squares criterion, which minimizes the residual sum of squares (RSS). Graphically, we're fitting the blue line to our data (the black points) so that the sum of the squared vertical distances between the points and the line (the red segments) is as small as possible. Mathematically, this can be denoted as:

$$\text{RSS} = \sum_{i=1}^{n} \left( y_i - \Big( \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \Big) \right)^2$$

- $n$ is the total number of observations (data points).
- $y_i$ is the actual output value of the $i$th observation.
- $p$ is the total number of features.
- $\beta_j$ is the model's $j$th coefficient.
- $x_{ij}$ is the $j$th feature of the $i$th observation.
- $\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}$ is the predicted output of the $i$th observation.

In a regression with multiple independent variables, each coefficient tells you how much the dependent variable is expected to increase when that independent variable increases by one, holding all the other independent variables constant. (A short numeric sketch of the least squares fit and the RSS appears at the end of this section.)

Regularization

Before discussing regularization, we'll do a quick recap on the notions of overfitting and the bias-variance tradeoff.

Overfitting: So what is overfitting? To put it in simple terms, it's when we build a model that is so complex that it matches the training data "too closely"; the model has started to learn not only the signal, but also the noise in the data. The result is that our model will do well on the training data, but won't generalize to out-of-sample data, i.e. data that we have not seen before.

Bias-variance tradeoff: When we discuss prediction models, prediction error can be decomposed into two main subcomponents we care about: error due to "bias" and error due to "variance". Understanding these two types of error can help us diagnose model results and avoid the mistake of over- or underfitting. A typical graph for discussing this is shown below:

- Bias: the red line, measures how far off, in general, our model's predictions are from the correct values. As our model gets more and more complex, our predictions become more and more accurate (this error steadily decreases).
- Variance: the cyan line, measures how different our model can be from one fit to another as we look at different possible data sets. If the estimated model varies dramatically from one data set to the other, our predictions will be very erratic, because they are extremely sensitive to which data set we happen to obtain. As the complexity of our model rises, variance becomes our primary concern.

When creating a model, our goal is to locate the optimal model complexity. If our model's complexity exceeds this sweet spot, we are in effect overfitting the model; if our complexity falls short of the sweet spot, we are underfitting it.

With all of that in mind, regularization is simply a useful technique to use when we think our model is too complex (models that have low bias, but high variance). It is a method for "constraining" or "regularizing" the size of the coefficients ("shrinking" them towards zero). The specific regularization techniques we'll be discussing are ridge regression and lasso regression.

[Figure: bias (red), variance (cyan) and total error versus model complexity; the optimal model complexity sits at the minimum of the total error curve.]
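To make the least squares criterion above concrete, here is a minimal sketch that is not part of the original notebook; the synthetic dataset and variable names are made up purely for illustration. It fits an ordinary linear regression and computes the RSS directly from the formula.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# small synthetic dataset: y depends linearly on two features plus noise
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=50)

# ordinary least squares chooses the coefficients that minimize the RSS
linreg = LinearRegression()
linreg.fit(X, y)
print('intercept (beta_0):', linreg.intercept_)
print('coefficients (beta_j):', linreg.coef_)

# RSS = sum over observations of (actual - predicted)^2
y_pred = linreg.predict(X)
print('RSS:', np.sum((y - y_pred) ** 2))
```

The regularized models in the next section minimize this same RSS plus a penalty on the coefficient sizes.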
Ridge and Lasso Regression

Regularized linear regression models are very similar to least squares, except that the coefficients are estimated by minimizing a slightly different objective function: we minimize the sum of the RSS and a "penalty term" that penalizes coefficient size.

Ridge regression (or "L2 regularization") minimizes:

$$\text{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$

Lasso regression (or "L1 regularization") minimizes:

$$\text{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$$

where $\lambda$ is a tuning parameter that seeks to balance the fit of the model to the data against the magnitude of the model's coefficients:

- A tiny $\lambda$ imposes essentially no penalty on the coefficient size, and is equivalent to a normal linear regression.
- Increasing $\lambda$ penalizes the coefficients more heavily and thus shrinks them towards zero.

Lasso stands for "least absolute shrinkage and selection operator".

Thus you can think of it as balancing two things that together measure the model's total quality: the RSS, which measures how well the model fits the data, and the magnitude of the coefficients, which can become problematic if they grow too big.
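As a sanity check on the two objective functions, the sketch below (not from the original notebook; the data and coefficient values are arbitrary) evaluates both penalized objectives by hand, so you can see exactly which quantity each method minimizes and how $\lambda$ scales the penalty.

```python
import numpy as np

def ridge_objective(X, y, beta0, beta, lam):
    """RSS plus the L2 penalty: lambda * sum(beta_j ** 2)."""
    residuals = y - (beta0 + X @ beta)
    return np.sum(residuals ** 2) + lam * np.sum(beta ** 2)

def lasso_objective(X, y, beta0, beta, lam):
    """RSS plus the L1 penalty: lambda * sum(|beta_j|)."""
    residuals = y - (beta0 + X @ beta)
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(beta))

# toy data and an arbitrary candidate coefficient vector
rng = np.random.RandomState(42)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=20)
beta = np.array([0.9, 0.4, 0.1])

# the larger lambda is, the more the coefficient sizes dominate the objective
for lam in (0, 1, 10, 100):
    print(lam,
          round(ridge_objective(X, y, 0.0, beta, lam), 3),
          round(lasso_objective(X, y, 0.0, beta, lam), 3))
```

Note that scikit-learn's `Lasso` scales the RSS term by $1/(2n)$ internally, so its `alpha` is not numerically identical to the $\lambda$ written here, but it plays the same balancing role.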
Visualizing Regularization

[Figure: contours of the RSS as we move away from the least squares estimate $\hat{\beta}$, with the penalty term ("budget") shown as a constraint region, for lasso (left) and ridge regression (right).]

- Lasso regression shrinks coefficients all the way to zero, thus removing them from the model.
- Ridge regression shrinks coefficients toward zero, but they rarely reach exactly zero.

To get a sense of why this happens, the visualization above depicts what happens when we apply the two different regularizations. The general idea is that we are restricting the allowed values of our coefficients to a certain "region", and within that region we wish to find the coefficients that result in the best model. In this diagram, we are fitting a linear regression model with two features, $x_1$ and $x_2$:

- $\hat{\beta}$ represents the pair of coefficients, $\hat{\beta}_1$ and $\hat{\beta}_2$, which minimize the RSS for the unregularized model.
- The ellipses centered around $\hat{\beta}$ represent regions of constant RSS. In other words, all of the points on a given ellipse share a common value of the RSS, despite the fact that they may have different values for $\beta_1$ and $\beta_2$. As the ellipses expand away from the least squares coefficient estimates, the RSS increases.
- Regularization restricts the allowed positions of $\beta$ to the blue constraint region. In this case, $\hat{\beta}$ is not within the blue constraint region, so we need to move $\beta$ away from $\hat{\beta}$ until it intersects the blue region, while increasing the RSS as little as possible.
- For ridge, this region is a circle, because it constrains the square of the coefficients. The intersection will therefore not generally occur on an axis, and so the coefficient estimates will typically be non-zero.
- For lasso, this region is a diamond, because it constrains the absolute value of the coefficients. Because the constraint has corners at each of the axes, the ellipse will often intersect the constraint region at an axis. When this occurs, one of the coefficients will equal zero. In higher dimensions, many of the coefficient estimates may equal zero simultaneously. In the figure above, the intersection occurs at $\beta_1 = 0$, and so the resulting model will only include $x_2$.

The size of the blue constraint region is determined by $\lambda$, with a smaller $\lambda$ resulting in a larger region:

- When $\lambda$ is zero, the blue region is infinitely large, and thus the coefficient sizes are not constrained.
- As $\lambda$ increases, the blue region gets smaller and smaller.

Let us now do a Python implementation of these two regularized linear regressions. We'll load the Boston Housing dataset, which contains data about housing values in the suburbs of Boston, keep only the first few features, train a ridge and a lasso regression separately, and look at the estimated coefficients' weights for different penalty parameters. Note that we're keeping only the first few features to keep the example small.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston  # available in scikit-learn < 1.2
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

boston = load_boston()
boston.data.shape
```

```
(506, 13)
```

```python
# extract the input features and the response variable (housing prices),
# keeping only the first few features
feature_num = 7  # the first seven features, listed below
X = boston.data[:, :feature_num]
y = boston.target
features = boston.feature_names[:feature_num]
pd.DataFrame(X, columns=features).head()
```

The seven features kept are:

- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940

```python
y[:5]
```

```
array([24. , 21.6, 34.7, 33.4, 36.2])
```

```python
# split into training and testing sets and standardize them
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
std = StandardScaler()
X_train_std = std.fit_transform(X_train)
X_test_std = std.transform(X_test)
```

```python
# loop through different penalty scores (alpha) and obtain the
# estimated coefficients (weights)
alphas = 10 ** np.arange(1, 5)
print('different alpha values:', alphas)

# store the weights of each feature
ridge_weight = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha, fit_intercept=True)
    ridge.fit(X_train_std, y_train)
    ridge_weight.append(ridge.coef_)
```

```
different alpha values: [   10   100  1000 10000]
```

```python
def weight_versus_alpha_plot(weight, alphas, features):
    """
    Pass in the estimated weights, the alpha values and the names of the
    features, and plot the model's estimated coefficient weight for the
    different alpha values
    """
    fig = plt.figure(figsize=(8, 6))

    # ensure that the weight is an array
    weight = np.array(weight)
    for col in range(weight.shape[1]):
        plt.plot(alphas, weight[:, col], label=features[col])

    plt.axhline(0, color='black', linestyle='--', linewidth=3)

    # manually specify the coordinates of the legend
    plt.legend(bbox_to_anchor=(1.3, 0.9))
    plt.title('Coefficient Weight as Alpha Grows')
    plt.ylabel('Coefficient Weight')
    plt.xlabel('alpha')
    return fig
```
figsize") = 8, 6 pltcreparans{ ‘fort.size") = 12 ridge tig = weight versus_slota_plot(ridge weight, alphas, features) hntps:ifeolab research google.comidrva/1 WADBz2Do9zC1C-sqXI59PIRUZVOF 24H 7usp=sharingfprintMede=trus 12182, 713 PM Regularized Linear Regression pynb -Colaboratory Coefficient Weight as Alpha Grows We Sa i — j lasso weete = () Iasi. (tec tain st roi) Cocticient Weights alpha Grows pra ying Regularization > Signs and causes of overfitting in the original regression model + Linear models can overfitf you include irrelevant features, meaning features that are unrelated to the response. Orif you include highly correlated features, meaning two or more predictor variables are closely related to one another. Because it will learn a coefficient for every feature you Include in the model, regardless of whether that feature has the signal or the noise. This is especially a problem when p (number of features) is close to n (number of observations) + Linear models that have large estimated coefficients is a sign that the model may be overfitting the data. The larger the absolute value of the coefficient, the more power it has to change the predicted response, resulting ina higher variance. hntps:ifeolab research google.comidrva/1 WADBz2Do9zC1C-sqXI59PIRUZVOF 24H 7usp=sharingfprintMede=trus 9 12182, 713 PM Regularized Linear Regression pynb -Colaboratory ‘Should features be standardized? Yes, because L1 and L2 regularizers of linear models assume that all features are centered around 0 and have variance in the same order. Ifa feature has a variance that is orders of magnitude larger that others, features would be penalized simply because of their scale and make the model unable to learn from other features correctly as expected. Also, standardizing avoids penalizing the intercept, which wouldn't make intuitive sense. How should we choose between Lasso regression (L1) and Ridge regression (L2)? + If model performance is our primary concern or that we are not concerned with explicit feature selection, itis best to try both and see which one works better. Usually L2 regularization can be expected to give superior performance over L1 + Note that there's also a ElasticNet regression, which is a combination of Lasso regression and Ridge regression, + Lasso regression is preferred if we want a sparse model, meaning that we believe many features are irrelevant to the output. + When the dataset includes collinear features, Lasso regression is unstable in a similar way as unregularized linear models are, meaning that the coefficients (and thus feature ranks) can vary significantly even on small data changes. When using L2-norm, since the coefficients are squared in the penalty expression, it has a different effect from L1- norm, namely it forces the coefficient values to be spread out more equally. When the dataset at hand contains correlated features, it means that these features should get similar coefficients. Using an example of a linear model Y = X1 + X2, with strongly correlated feature of X1 and X2, then for L1-norm, the penalty is the same + X14 1+ X2or¥ = 2X1 40+ X2.Inboth cases the penalty is 2* a, For L2-norm, however, the first model's penalty is 1? + 12 = 2a, while for the second model is penalized with 2? + 0? = da. whether the learned model is ¥ ‘The effect of this is that models are much more stable (coefficients do not fluctuate on small data changes as is the case with unregularized or L1 models). 
So while L2 regularization does not perform feature selection the same way L1 does, it is more useful for feature interpretation due to its stability and the fact that useful features still tend to have non-zero coefficients. But again, please do remove collinear features to prevent a bunch of downstream headaches.

```python
def generate_random_data(size, seed):
    """Generate data with three strongly collinear features"""
    rstate = np.random.RandomState(seed)
    X_seed = rstate.normal(0, 1, size)
    X1 = X_seed + rstate.normal(0, 0.1, size)
    X2 = X_seed + rstate.normal(0, 0.1, size)
    X3 = X_seed + rstate.normal(0, 0.1, size)
    y = X1 + X2 + X3 + rstate.normal(0, 1, size)
    X = np.array([X1, X2, X3]).T
    return X, y


def pretty_print_linear(estimator, names=None, sort=False):
    """A helper method for pretty-printing linear models' coefficients"""
    coef = estimator.coef_
    if names is None:
        names = ['X%s' % x for x in range(1, len(coef) + 1)]

    info = zip(coef, names)
    if sort:
        info = sorted(info, key=lambda x: -np.abs(x[0]))

    output = ['{} * {}'.format(round(coef, 3), name) for coef, name in info]
    return ' + '.join(output)
```

```python
# we run the two methods 10 times with different random seeds
size = 100  # sample size for the simulated data
for seed in range(10):
    print('Random seed:', seed)
    X, y = generate_random_data(size, seed)

    lasso = Lasso()
    lasso.fit(X, y)
    print('Lasso model:', pretty_print_linear(lasso))

    ridge = Ridge(alpha=10)
    ridge.fit(X, y)
    print('Ridge model:', pretty_print_linear(ridge))
    print()
```

In the printed output, the ridge coefficients stay close to one another and close to 1 across seeds (e.g. 0.964 * X1 + 0.982 * X2 + 1.098 * X3), while the lasso coefficients vary much more from seed to seed, illustrating the stability argument made above.

To sum it up: overfitting is when we build a predictive model that fits the data "too closely", so that it captures the random noise in the data rather than the true patterns. As a result, the model's predictions will be wrong when applied to new data. Given that our data is sufficiently large and clean, regularization is one good way to prevent overfitting from occurring.
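As a final practical note, the α values above were chosen by hand; in practice the penalty strength is usually selected by cross-validation. The sketch below is not part of the original notebook and uses a synthetic dataset from `make_regression` purely for illustration, but the same two-line pattern applies to the standardized Boston features used earlier.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler

# synthetic regression data standing in for the Boston features used above
X, y = make_regression(n_samples=200, n_features=7, n_informative=4,
                       noise=10.0, random_state=1)
X_std = StandardScaler().fit_transform(X)

alphas = 10.0 ** np.arange(-3, 4)

# RidgeCV picks the alpha with the best cross-validation score
ridge_cv = RidgeCV(alphas=alphas).fit(X_std, y)
print('best ridge alpha:', ridge_cv.alpha_)
print('ridge coefficients:', ridge_cv.coef_.round(3))

# LassoCV builds its own alpha path and cross-validates along it
lasso_cv = LassoCV(cv=5, random_state=1).fit(X_std, y)
print('best lasso alpha:', lasso_cv.alpha_)
print('lasso coefficients:', lasso_cv.coef_.round(3))
```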
