Machine Learning Foundation
Section 2, Part d: Regularization and Gradient Descent
Introduction
We will begin with a short tutorial on regression, polynomial features, and regularization, based on a very simple, sparse data set that contains a column of `x` data and associated noisy `y` data. The data file is called `X_Y_Sinusoid_Data.csv`.
Question 1
Import the data.
Also generate approximately 100 equally spaced x data points over the range 0 to 1. Using these points, calculate the y data, which represents the "ground truth" (the real function), from the equation: $y = \sin(2\pi x)$
Plot the sparse data (`x` vs `y`) and the calculated ("real") data.
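A minimal sketch of one way to do this (the column names `x` and `y` are assumptions; adjust them to match the file):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the sparse, noisy sample data (column names 'x' and 'y' are assumed)
data = pd.read_csv('X_Y_Sinusoid_Data.csv')

# ~100 equally spaced points on [0, 1] and the ground-truth curve y = sin(2*pi*x)
X_real = np.linspace(0, 1.0, 100)
Y_real = np.sin(2 * np.pi * X_real)

# Sparse noisy samples vs. the true function
plt.plot(data['x'], data['y'], 'o', label='noisy data')
plt.plot(X_real, Y_real, '-', label='ground truth')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
```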
Question 2
- Using the `PolynomialFeatures` class from Scikit-learn's preprocessing library, create 20th order polynomial features.
- Fit this data using linear regression.
- Plot the resulting predicted values compared to the calculated data.

Note that `PolynomialFeatures` requires either a dataframe (with one column, not a Series) or a 2D array of shape (N, 1), where N is the number of samples.
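A possible sketch, reusing the `data`, `X_real`, and `Y_real` names from the previous step:

```python
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# 20th-order polynomial features; the input must be 2D, hence data[['x']]
pf = PolynomialFeatures(degree=20)
X_poly = pf.fit_transform(data[['x']])

# Ordinary linear regression on the expanded features
lr = LinearRegression()
lr.fit(X_poly, data['y'])

# Predict on the dense grid and compare with the ground truth
Y_pred = lr.predict(pf.transform(X_real.reshape(-1, 1)))

plt.plot(data['x'], data['y'], 'o', label='noisy data')
plt.plot(X_real, Y_real, '-', label='ground truth')
plt.plot(X_real, Y_pred, '--', label='degree-20 linear fit')
plt.legend()
plt.show()
```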
Question 3
- Perform the regression on the data with polynomial features using ridge regression ($\alpha$=0.001) and lasso regression ($\alpha$=0.0001).
- Plot the results, as was done in Question 1.
- Also plot the magnitude of the coefficients obtained from these regressions, and compare them to those obtained from linear regression in the previous question. The linear regression coefficients will likely need a separate plot (or their own y-axis) due to their large magnitude.
What does the comparatively large magnitude of the linear regression coefficients tell you about the role of regularization?
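A sketch of how this could be done, reusing `X_poly`, `pf`, `lr`, and the other names defined in the previous sketches:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso

# Ridge and lasso fits on the same degree-20 polynomial features
ridge = Ridge(alpha=0.001).fit(X_poly, data['y'])
lasso = Lasso(alpha=0.0001, max_iter=100000).fit(X_poly, data['y'])

# Predictions on the dense grid
X_real_poly = pf.transform(X_real.reshape(-1, 1))
plt.plot(data['x'], data['y'], 'o', label='noisy data')
plt.plot(X_real, Y_real, label='ground truth')
plt.plot(X_real, ridge.predict(X_real_poly), label='ridge')
plt.plot(X_real, lasso.predict(X_real_poly), label='lasso')
plt.legend()
plt.show()

# Coefficient magnitudes: the unregularized coefficients are so large
# that they get their own (right-hand) axis
fig, ax1 = plt.subplots()
ax1.plot(ridge.coef_, 'o-', label='ridge')
ax1.plot(lasso.coef_, 's-', label='lasso')
ax1.set_xlabel('coefficient index')
ax1.set_ylabel('ridge / lasso coefficient')
ax2 = ax1.twinx()
ax2.plot(lr.coef_, 'x--', color='gray', label='linear (right axis)')
ax2.set_ylabel('linear regression coefficient')
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')
plt.show()
```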
Question 4
For the remaining questions, we will be working with the data set from last lesson, which is based on housing prices in Ames, Iowa. There is an extensive number of features; see the exercises from week three for a discussion of them.
To begin:
- Import the data with Pandas, remove any null values, and one-hot encode categoricals. Either Scikit-learn's feature encoders or Pandas' `get_dummies` method can be used.
- Split the data into train and test sets.
- Log transform skewed features.
- Scaling can be attempted, although it can be interesting to see how well regularization works without scaling features.
Create a list of categorical columns and one-hot encode them. Pandas' one-hot encoder (`get_dummies`) works well with data that is defined as categorical.
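One possible sketch (the file name below is a placeholder; substitute the Ames housing file from the previous lesson):

```python
import pandas as pd

# Hypothetical file name: use the Ames housing file from the previous lesson
data = pd.read_csv('Ames_Housing_Data.csv').dropna()

# Treat all object-typed columns as categoricals, then one-hot encode them
categorical_cols = data.columns[data.dtypes == object]
for col in categorical_cols:
    data[col] = data[col].astype('category')

data = pd.get_dummies(data, columns=categorical_cols)
```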
Next, split the data into train and test sets.
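For example (the 70/30 split and random seed are arbitrary choices):

```python
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows as a test set
train, test = train_test_split(data, test_size=0.3, random_state=42)
```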
There are a number of columns that have skewed features, and a log transformation can be applied to them. Note that this includes `SalePrice`, our predictor; however, let's keep that one as is. Transform all the columns where the skew is greater than 0.75, excluding `SalePrice`.
Separate features from predictor.
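A sketch covering this step and the previous one, assuming the `train` and `test` DataFrames from above:

```python
import numpy as np

# Columns (other than SalePrice) whose skew exceeds 0.75, measured on the training set
skew = train.skew(numeric_only=True)
skewed_cols = skew[skew > 0.75].index.drop('SalePrice', errors='ignore')

# log1p handles zero values gracefully; apply the same transform to both sets
for col in skewed_cols:
    train[col] = np.log1p(train[col])
    test[col] = np.log1p(test[col])

# Separate the features from the predictor
X_train = train.drop('SalePrice', axis=1)
y_train = train['SalePrice']
X_test = test.drop('SalePrice', axis=1)
y_test = test['SalePrice']
```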
Question 5
- Write a function `rmse` that takes in truth and prediction values and returns the root-mean-squared error. Use sklearn's `mean_squared_error`.
- Fit a basic linear regression model.
- Print the root-mean-squared error for this model.
- Plot the predicted vs actual sale price based on the model.
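One way this could look, using the `X_train`, `y_train`, `X_test`, and `y_test` names from the sketch above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Root-mean-squared error built on sklearn's mean_squared_error."""
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Baseline: ordinary linear regression
linear = LinearRegression().fit(X_train, y_train)
print('Linear regression RMSE:', rmse(y_test, linear.predict(X_test)))

# Predicted vs actual sale price
plt.scatter(y_test, linear.predict(X_test), alpha=0.5)
plt.xlabel('actual SalePrice')
plt.ylabel('predicted SalePrice')
plt.show()
```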
Question 6
Ridge regression uses an L2 penalty to reduce the magnitude of the coefficients. This can be helpful in situations where there is high variance. The regularization functions in Scikit-learn each have versions with cross-validation built in.
- Fit a regular (non-cross-validated) Ridge model to a range of $\alpha$ values and plot the RMSE using the `rmse` function you created above.
- Use `[0.005, 0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 80]` as the range of alphas.
- Then repeat the fitting using the cross-validated `RidgeCV` over the same range of $\alpha$ values and compare the results.
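A sketch of the first two bullets, reusing the `rmse` function and the train/test split from above:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

alphas = [0.005, 0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 80]

# Fit a plain Ridge model for each alpha and record the test-set RMSE
ridge_errors = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    ridge_errors.append(rmse(y_test, ridge.predict(X_test)))

plt.semilogx(alphas, ridge_errors, 'o-')
plt.xlabel('alpha')
plt.ylabel('RMSE')
plt.show()
```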
Now for the `RidgeCV` method. It's not possible to get the errors for the $\alpha$ values that weren't selected, unfortunately. The resulting error values and $\alpha$ values are very similar to those obtained above.
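A possible sketch, using the same list of alphas (the 4-fold cross-validation is an arbitrary choice):

```python
from sklearn.linear_model import RidgeCV

# RidgeCV selects alpha by cross-validation on the training data
ridgeCV = RidgeCV(alphas=alphas, cv=4).fit(X_train, y_train)

print('chosen alpha:', ridgeCV.alpha_)
print('RidgeCV RMSE:', rmse(y_test, ridgeCV.predict(X_test)))
```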
Question 7
Much like the `RidgeCV` function, there is also a `LassoCV` function that uses an L1 regularization function and cross-validation. L1 regularization will selectively shrink some coefficients to zero, effectively performing feature elimination.
The `LassoCV` function does not allow the scoring function to be set. However, the custom error function (`rmse`) created above can be used to evaluate the error on the final model.
Similarly, there is also an elastic net function with cross-validation, `ElasticNetCV`, which is a combination of L2 and L1 regularization.
- Fit a Lasso model using cross validation and determine the optimum value for $\alpha$ and the RMSE using the function created above. Note that the magnitude of $\alpha$ may be different from the Ridge model.
- Repeat this with the Elastic net model.
- Compare the results via table and/or plot.
Use the following alphas: `[1e-5, 5e-5, 0.0001, 0.0005]`
We can then determine how many of the coefficients remain non-zero, i.e. how many features the Lasso keeps.
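A sketch of the Lasso part (the `cv` and `max_iter` settings are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LassoCV

lasso_alphas = [1e-5, 5e-5, 0.0001, 0.0005]

# LassoCV picks its alpha by cross-validation; max_iter is raised to help convergence
lassoCV = LassoCV(alphas=lasso_alphas, max_iter=100000, cv=3).fit(X_train, y_train)

print('chosen alpha:', lassoCV.alpha_)
print('LassoCV RMSE:', rmse(y_test, lassoCV.predict(X_test)))

# How many coefficients did the L1 penalty leave non-zero?
n_nonzero = np.sum(np.abs(lassoCV.coef_) > 1e-10)
print('non-zero coefficients:', n_nonzero, 'of', len(lassoCV.coef_))
```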
Now try the elastic net, with the same alphas as in Lasso and `l1_ratio` values between 0.1 and 0.9.
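For example, reusing the names from the Lasso sketch:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# l1_ratio controls the mix of L1 and L2 penalties
l1_ratios = np.linspace(0.1, 0.9, 9)

elasticNetCV = ElasticNetCV(alphas=lasso_alphas, l1_ratio=l1_ratios,
                            max_iter=100000, cv=3).fit(X_train, y_train)

print('chosen alpha:', elasticNetCV.alpha_)
print('chosen l1_ratio:', elasticNetCV.l1_ratio_)
print('ElasticNetCV RMSE:', rmse(y_test, elasticNetCV.predict(X_test)))
```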
Comparing the RMSE calculation from all models is easiest in a table.
We can also make a plot of actual vs predicted housing prices as before.
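One way to put the comparison together, reusing the fitted models from the sketches above:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Small comparison table of test-set RMSE values
rmse_table = pd.Series({'Linear': rmse(y_test, linear.predict(X_test)),
                        'Ridge': rmse(y_test, ridgeCV.predict(X_test)),
                        'Lasso': rmse(y_test, lassoCV.predict(X_test)),
                        'ElasticNet': rmse(y_test, elasticNetCV.predict(X_test))},
                       name='RMSE')
print(rmse_table)

# Actual vs predicted sale price for each model
fig, ax = plt.subplots()
for label, model in [('Linear', linear), ('Ridge', ridgeCV),
                     ('Lasso', lassoCV), ('ElasticNet', elasticNetCV)]:
    ax.scatter(y_test, model.predict(X_test), alpha=0.3, label=label)
ax.set_xlabel('actual SalePrice')
ax.set_ylabel('predicted SalePrice')
ax.legend()
plt.show()
```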
Question 8
Let's explore stochastic gradient descent in this exercise.
Recall that linear models in general are sensitive to scaling; SGD, however, is especially sensitive to it.
Moreover, too high a learning rate can cause the algorithm to diverge, whereas too low a value may take too long to converge.
- Fit a stochastic gradient descent model without a regularization penalty (the relevant parameter is `penalty`).
- Now fit stochastic gradient descent models with each of the three penalties (L2, L1, Elastic Net) using the parameter values determined by cross validation above.
- Do not scale the data before fitting the model.
- Compare the results to those obtained without using stochastic gradient descent.
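A possible sketch with `SGDRegressor`, reusing the cross-validated parameters from above (note that `penalty=None` requires a recent scikit-learn; older versions use the string 'none' instead):

```python
from sklearn.linear_model import SGDRegressor

# Penalty settings for each model; alpha and l1_ratio come from the CV fits above
model_parameters = {
    'Linear': {'penalty': None},
    'Ridge': {'penalty': 'l2', 'alpha': ridgeCV.alpha_},
    'Lasso': {'penalty': 'l1', 'alpha': lassoCV.alpha_},
    'ElasticNet': {'penalty': 'elasticnet', 'alpha': elasticNetCV.alpha_,
                   'l1_ratio': elasticNetCV.l1_ratio_},
}

# Fit on the *unscaled* data and report the test-set RMSE for each penalty
for label, params in model_parameters.items():
    sgd = SGDRegressor(random_state=42, **params)
    sgd.fit(X_train, y_train)
    print(label, rmse(y_test, sgd.predict(X_test)))
```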
Notice how high the error values are! The algorithm is diverging. This can be due to the lack of scaling and/or a learning rate that is too high. Let's adjust the learning rate and see what happens.
- Pass in `eta0=1e-7` when creating the instance of `SGDRegressor`.
- Re-compute the errors for all the penalties and compare.
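For example, repeating the loop above with a tiny initial learning rate:

```python
# Same models as before, but with a much smaller initial learning rate
for label, params in model_parameters.items():
    sgd = SGDRegressor(eta0=1e-7, random_state=42, **params)
    sgd.fit(X_train, y_train)
    print(label, rmse(y_test, sgd.predict(X_test)))
```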
Now let's scale our training data and try again.
- Fit a `MinMaxScaler` to `X_train` to create a variable `X_train_scaled`.
- Using the scaler, transform `X_test` and create a variable `X_test_scaled`.
- Apply the same versions of SGD to them and compare the results. Don't pass in `eta0` this time.
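A sketch of this last step, reusing `SGDRegressor` and `model_parameters` from above:

```python
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training data only, then transform both sets
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Default learning rate this time; scaling alone should stabilize the fit
for label, params in model_parameters.items():
    sgd = SGDRegressor(random_state=42, **params)
    sgd.fit(X_train_scaled, y_train)
    print(label, rmse(y_test, sgd.predict(X_test_scaled)))
```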