1. Environment
Python
2. Requirement
The task was to fit a regression model to a provided data set of 900 rows × 22 columns, with y as the target variable and x1 ~ x21 as the features. When I first loaded the data set, I shuffled the row order randomly.
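A minimal loading sketch, assuming the data arrives as a CSV file (the file name data.csv is hypothetical; the column names y and x1 ~ x21 come from the assignment):

import pandas as pd

df = pd.read_csv('data.csv')                   # hypothetical file name
df = df.sample(frac=1).reset_index(drop=True)  # shuffle the row order randomly
X = df.drop(columns='y').values                # features x1 ~ x21, shape (900, 21)
y = df['y'].values                             # target, shape (900,)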
3. Choosing a regression method
I had to decide which type of regression was most appropriate for this problem, and I chose Elastic Net regression. I considered four options at the outset:
①ordinary linear regression, without regularization
②Ridge regression, with an L2 penalty
③Lasso regression, with an L1 penalty
④Elastic Net, whose penalty is a convex combination of the Lasso and Ridge penalties
In this case, linear regression without regularization was not appropriate, because the number of coefficients to estimate (21) is large relative to the number of samples (900), so an unregularized fit was likely to overfit. Based on this, I removed unregularized linear regression from the candidates, leaving three options: Ridge, Lasso, and Elastic Net.
However, when I looked at the objective function of Elastic Net (shown below), I found that it covers both Ridge and Lasso as special cases of the mixing parameter α: setting α = 1 recovers Lasso, and α = 0 recovers Ridge. (This α corresponds to l1_ratio in scikit-learn, and λ to scikit-learn's alpha.)
min MSE + λ(α ⋅ Lasso penalty + (1 − α) ⋅ Ridge penalty) for α ∈ [0, 1]
Therefore, based on this discussion, I applied Elastic Net regression to the problem.
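As a quick sanity check on that claim, the following sketch (on synthetic data, assuming scikit-learn) confirms that ElasticNet with l1_ratio=1.0 produces the same coefficients as Lasso with the same regularization strength:

import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.RandomState(0)
X_demo = rng.randn(100, 5)
y_demo = X_demo @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.randn(100)

enet = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)  # pure L1 penalty
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)
print(np.allclose(enet.coef_, lasso.coef_))  # True: the objectives coincide

(The α = 0 end reduces to a pure L2 penalty; note that scikit-learn scales its Ridge objective slightly differently, so the coefficients do not match Ridge numerically without rescaling λ.)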
4. Grid search
Next, I had to find optimized values for the Elastic Net parameters λ and α. To do so, I ran the parameter-tuning procedure called grid search, scoring each candidate combination by RMSE. At first I tried scikit-learn's GridSearchCV, but (as far as I could tell at the time) it did not let me score directly by RMSE.
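For reference, depending on the scikit-learn version, GridSearchCV can optimize a negated MSE, from which an RMSE follows; here is a minimal sketch of that route, assuming X and y are already loaded as NumPy arrays:

from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import ElasticNet
import numpy as np

param_grid = {'alpha': [0.1, 0.5, 1.0, 10],
              'l1_ratio': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}
cv = KFold(n_splits=10, shuffle=True)
# newer scikit-learn versions also accept scoring='neg_root_mean_squared_error'
gs = GridSearchCV(ElasticNet(), param_grid,
                  scoring='neg_mean_squared_error', cv=cv)
gs.fit(X, y)
print(gs.best_params_)
print(np.sqrt(-gs.best_score_))  # sqrt of the mean CV MSE, an RMSE-like score

In the end, though, I wrote my own grid search, which proceeds in the following steps: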
①I set a range for each parameter, λ ∈ {0.1, 0.5, 1.0, 10} and α ∈ {0.1, 0.2, 0.3, ..., 0.9, 1.0}, and tried every combination with nested for loops in Python.
②For each (λ, α) combination I ran 10-fold cross validation and computed the RMSE on each fold, giving 10 RMSE scores per combination.
③I averaged the 10 fold scores and took the mean as the RMSE score of that (λ, α) combination. This yields 40 scores in total (4 values of λ × 10 values of α).
④I took the combination with the lowest mean RMSE as the optimal one. However, because I set shuffle=True in the cross validation, each run could pick a different winner. I therefore needed to repeat the whole process many times and choose the parameters based on how often each combination won.
⑤After many repetitions, I took the combination that won most often as the best one, and fixed those values as the Elastic Net parameters.
Here is my implementation (there is certainly room for improvement). The number of repetitions from Step ⑤ is passed as the loop argument of gridsearch(loop).
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def gridsearch(loop):
    # Repeat the whole grid search `loop` times (Step ⑤) and return the
    # (alpha, l1_ratio) combination that wins most often.
    # Note: scikit-learn's alpha is the λ of Section 3, and its
    # l1_ratio is the mixing parameter α.
    alpha_opt = []     # winning alpha of each repetition
    l1_ratio_opt = []  # winning l1_ratio of each repetition
    alphas = [0.1, 0.5, 1.0, 10]
    l1_ratios = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    for _ in range(loop):
        # shuffle=True gives a different fold assignment on every repetition
        k_fold = KFold(n_splits=10, shuffle=True)
        best_score = float('inf')
        for i in alphas:               # Step ①: try all combinations
            for j in l1_ratios:
                clf = linear_model.ElasticNet(alpha=i, l1_ratio=j)
                rmse_10 = []           # Step ②: one RMSE per fold
                for train_index, test_index in k_fold.split(X):  # X, y are the global data from Section 2
                    X_train, X_test = X[train_index], X[test_index]
                    y_train, y_test = y[train_index], y[test_index]
                    clf.fit(X_train, y_train)
                    y_pred = clf.predict(X_test)
                    rmse_10.append(mean_squared_error(y_test, y_pred) ** 0.5)
                rmse_mean = np.mean(rmse_10)   # Step ③: CV score of this combination
                if rmse_mean < best_score:     # Step ④: keep the best so far
                    best_score = rmse_mean
                    best_alpha = i
                    best_l1_ratio = j
        alpha_opt.append(best_alpha)
        l1_ratio_opt.append(best_l1_ratio)
    # Step ⑤: count how often each combination won and sort by frequency
    df_param = pd.DataFrame({'alpha': alpha_opt, 'l1': l1_ratio_opt})
    df_param = (df_param.groupby(['alpha', 'l1']).size()
                        .sort_values(ascending=False).reset_index(name='size'))
    alpha = df_param.iloc[0, 0]      # most frequent alpha
    l1_ratio = df_param.iloc[0, 1]   # most frequent l1_ratio
    return alpha, l1_ratio, df_param
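A minimal usage sketch: the repetition count of 50 below is an arbitrary choice, and X and y are assumed to be loaded as in the snippet in Section 2.

best_alpha, best_l1_ratio, df_param = gridsearch(50)  # 50 repetitions of Steps ①-⑤
print(df_param)                                       # win count of each combination
final_model = linear_model.ElasticNet(alpha=best_alpha, l1_ratio=best_l1_ratio)
final_model.fit(X, y)                                 # refit on the full data set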