The goal of machine learning is to build a model with high predictive accuracy. There are several ways to get there, but a common workflow looks like this:
1. Compare multiple machine learning models and select the best one.
2. Tune the hyperparameters of the best model.
Following these steps, you settle on the best model. In this article, we will walk through the first step, comparing multiple machine learning models, together with the code.


Let's get started.
Import the required libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2+
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
Prepare the dataset
The boston dataset is for predicting housing prices: the home price is the target variable.
Target variable: (median) housing price
Explanatory variables: various data that could plausibly relate to the target (the details of the explanatory variables are listed here)
boston = load_boston()
df = pd.DataFrame(data=boston.data, columns=boston.feature_names)
df['PRICE'] = boston.target
df
Output:
The code above turns the data into a DataFrame.
If you are following along with the boston dataset, run the code above.
If you are using your own data, start from the code below.
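If you are bringing your own data, you would typically load a CSV with pandas. A minimal, self-contained sketch (the in-memory CSV below stands in for a hypothetical file such as `my_data.csv`; the last column is assumed to be the target, matching the slicing used in the next step):

```python
from io import StringIO

import pandas as pd

# In practice: df = pd.read_csv("my_data.csv")  (hypothetical file name)
# Here a small CSV is simulated in memory so the sketch runs on its own.
csv_text = """x1,x2,target
1.0,2.0,10.0
2.0,1.0,12.0
3.0,0.5,15.0
"""
df = pd.read_csv(StringIO(csv_text))
print(df.shape)  # (3, 3) -- the last column will be split off as y below
```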
Shape the data for machine learning
X = df.iloc[:, :-1].values  # explanatory variables (all columns but the last)
y = df.iloc[:, -1].values   # target variable (the last column only)
Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)
Let's check the data sizes.
print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of y_test: ", y_test.shape)
Output:
Shape of X_train: (379, 13)
Shape of X_test: (127, 13)
Shape of y_train: (379,)
Shape of y_test:  (127,)
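One note before building the models: the "CV" value printed for each model below is the mean of cross_val_score, which fits the model on all folds but one and scores R² on the held-out fold, once per fold. A minimal sketch of that equivalence on a small synthetic dataset (cv=3 here for speed; the article uses cv=10):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(30, 2)
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.rand(30)

# cross_val_score: one R^2 score per held-out fold (R^2 is the default scorer for regressors)
scores = cross_val_score(LinearRegression(), X, y, cv=3)

# Manual equivalent: fit on the other folds, score the held-out fold
manual = []
for train_idx, test_idx in KFold(n_splits=3).split(X):
    m = LinearRegression().fit(X[train_idx], y[train_idx])
    manual.append(r2_score(y[test_idx], m.predict(X[test_idx])))

print(np.allclose(scores, manual))  # True
```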
Build multiple models and evaluate their accuracy
1. Linear regression
Build the model and fit it to the data
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

steps = [
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
]

linear = Pipeline(steps)
linear.fit(X_train, y_train)
Evaluate the model
from sklearn.metrics import r2_score

cv_linear = cross_val_score(estimator=linear, X=X_train, y=y_train, cv=10)

y_pred_linear_train = linear.predict(X_train)
r2_score_linear_train = r2_score(y_train, y_pred_linear_train)

y_pred_linear_test = linear.predict(X_test)
r2_score_linear_test = r2_score(y_test, y_pred_linear_test)

rmse_linear = np.sqrt(mean_squared_error(y_test, y_pred_linear_test))

print("CV: ", cv_linear.mean())
print('R2_score (train): ', r2_score_linear_train)
print('R2_score (test): ', r2_score_linear_test)
print("RMSE: ", rmse_linear)
Output:
CV: 0.6764932429312001
R2_score (train): 0.7168057552393374
R2_score (test): 0.7789410172622858
RMSE: 4.679504823808764
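For reference, the two headline metrics above are easy to compute by hand: R² is the fraction of the target's variance explained by the predictions, and RMSE is the square root of the mean squared residual. A small numpy sketch with made-up numbers:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

# RMSE: root of the mean squared residual
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: 1 - residual sum of squares / total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse)  # ~0.354
print(r2)    # ~0.975
```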
2. Ridge regression
Build the model and fit it to the data
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

steps = [
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=10, random_state=0))
]

ridge = Pipeline(steps)
ridge.fit(X_train, y_train)
Evaluate the model
from sklearn.metrics import r2_score

cv_ridge = cross_val_score(estimator=ridge, X=X_train, y=y_train, cv=10)

y_pred_ridge_train = ridge.predict(X_train)
r2_score_ridge_train = r2_score(y_train, y_pred_ridge_train)

y_pred_ridge_test = ridge.predict(X_test)
r2_score_ridge_test = r2_score(y_test, y_pred_ridge_test)

rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_ridge_test))

print('CV: ', cv_ridge.mean())
print('R2_score (train): ', r2_score_ridge_train)
print('R2_score (test): ', r2_score_ridge_test)
print("RMSE: ", rmse_ridge)
Output:
CV: 0.6769476230353946
R2_score (train): 0.7151170322655596
R2_score (test): 0.7775108393295396
RMSE: 4.694617837373151
3. Lasso regression
Build the model and fit it to the data
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

steps = [
    ('scaler', StandardScaler()),
    ('model', Lasso(alpha=0.01, random_state=0))
]

lasso = Pipeline(steps)
lasso.fit(X_train, y_train)
Evaluate the model
from sklearn.metrics import r2_score

cv_lasso = cross_val_score(estimator=lasso, X=X_train, y=y_train, cv=10)

y_pred_lasso_train = lasso.predict(X_train)
r2_score_lasso_train = r2_score(y_train, y_pred_lasso_train)

y_pred_lasso_test = lasso.predict(X_test)
r2_score_lasso_test = r2_score(y_test, y_pred_lasso_test)

rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred_lasso_test))

print('CV: ', cv_lasso.mean())
print('R2_score (train): ', r2_score_lasso_train)
print('R2_score (test): ', r2_score_lasso_test)
print("RMSE: ", rmse_lasso)
Output:
CV: 0.6767952938403962
R2_score (train): 0.7167038845044521
R2_score (test): 0.7787621490259894
RMSE: 4.6813976343080315
4. Support vector machine
Build the model and fit it to the data
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

steps = [
    ('scaler', StandardScaler()),
    ('model', SVR(kernel='rbf', gamma=0.1, C=10))
]

svr = Pipeline(steps)
svr.fit(X_train, y_train)
Evaluate the model
from sklearn.metrics import r2_score

cv_svr = cross_val_score(estimator=svr, X=X_train, y=y_train, cv=10)

y_pred_svr_train = svr.predict(X_train)
r2_score_svr_train = r2_score(y_train, y_pred_svr_train)

y_pred_svr_test = svr.predict(X_test)
r2_score_svr_test = r2_score(y_test, y_pred_svr_test)

rmse_svr = np.sqrt(mean_squared_error(y_test, y_pred_svr_test))

print('CV: ', cv_svr.mean())
print('R2_score (train): ', r2_score_svr_train)
print('R2_score (test): ', r2_score_svr_test)
print("RMSE: ", rmse_svr)
Output:
CV: 0.7758265296789186
R2_score (train): 0.8715030332798441
R2_score (test): 0.8796605925578973
RMSE: 3.4526276964020046
5. Decision tree
Build the model and fit it to the data
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(max_depth=5, random_state=0)
dt.fit(X_train, y_train)
Evaluate the model
from sklearn.metrics import r2_score

cv_dt = cross_val_score(estimator=dt, X=X_train, y=y_train, cv=10)

y_pred_dt_train = dt.predict(X_train)
r2_score_dt_train = r2_score(y_train, y_pred_dt_train)

y_pred_dt_test = dt.predict(X_test)
r2_score_dt_test = r2_score(y_test, y_pred_dt_test)

rmse_dt = np.sqrt(mean_squared_error(y_test, y_pred_dt_test))

print('CV: ', cv_dt.mean())
print('R2_score (train): ', r2_score_dt_train)
print('R2_score (test): ', r2_score_dt_test)
print("RMSE: ", rmse_dt)
Output:
CV: 0.7732200171327743
R2_score (train): 0.9204825770764915
R2_score (test): 0.8763987309111113
RMSE: 3.4991074641466478
6. Random forest
Build the model and fit it to the data
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500, max_depth=5, random_state=0)
rf.fit(X_train, y_train)
Evaluate the model
from sklearn.metrics import r2_score

cv_rf = cross_val_score(estimator=rf, X=X_train, y=y_train, cv=10)

y_pred_rf_train = rf.predict(X_train)
r2_score_rf_train = r2_score(y_train, y_pred_rf_train)

y_pred_rf_test = rf.predict(X_test)
r2_score_rf_test = r2_score(y_test, y_pred_rf_test)

rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf_test))

print('CV: ', cv_rf.mean())
print('R2_score (train): ', r2_score_rf_train)
print('R2_score (test): ', r2_score_rf_test)
print("RMSE: ", rmse_rf)
Output:
CV: 0.8481686807822297
R2_score (train): 0.9417895056865595
R2_score (test): 0.897569453187331
RMSE: 3.1853749564009592
7. XGBoost
Build the model and fit it to the data
# Import the class directly so the estimator variable does not shadow the xgboost module
from xgboost import XGBRegressor

xgb = XGBRegressor(n_estimators=500, max_depth=5, random_state=0)
xgb.fit(X_train, y_train)
Evaluate the model
from sklearn.metrics import r2_score

cv_xgb = cross_val_score(estimator=xgb, X=X_train, y=y_train, cv=10)

y_pred_xgb_train = xgb.predict(X_train)
r2_score_xgb_train = r2_score(y_train, y_pred_xgb_train)

y_pred_xgb_test = xgb.predict(X_test)
r2_score_xgb_test = r2_score(y_test, y_pred_xgb_test)

rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb_test))

print('CV: ', cv_xgb.mean())  # fixed: the original printed cv_rf.mean() by mistake
print('R2_score (train): ', r2_score_xgb_train)
print('R2_score (test): ', r2_score_xgb_test)
print("RMSE: ", rmse_xgb)
Output:
CV: 0.8481686807822297
R2_score (train): 0.999999977374972
R2_score (test): 0.9156707707731353
RMSE: 2.8902464814349504
8. LightGBM
Build the model and fit it to the data
import lightgbm as lgb

lgbm = lgb.LGBMRegressor(n_estimators=500, random_state=0)
lgbm.fit(X_train, y_train)
Evaluate the model
from sklearn.metrics import r2_score

cv_lgbm = cross_val_score(estimator=lgbm, X=X_train, y=y_train, cv=10)

y_pred_lgbm_train = lgbm.predict(X_train)
r2_score_lgbm_train = r2_score(y_train, y_pred_lgbm_train)

y_pred_lgbm_test = lgbm.predict(X_test)
r2_score_lgbm_test = r2_score(y_test, y_pred_lgbm_test)

rmse_lgbm = np.sqrt(mean_squared_error(y_test, y_pred_lgbm_test))

print('CV: ', cv_lgbm.mean())
print('R2_score (train): ', r2_score_lgbm_train)
print('R2_score (test): ', r2_score_lgbm_test)
print("RMSE: ", rmse_lgbm)
Output:
CV: 0.8508454179739436
R2_score (train): 0.9986977634003343
R2_score (test): 0.9054846107197084
RMSE: 3.059828457568549
That covers the machine learning models. The individual results are hard to compare as they stand, so let's display them all together.
Evaluate the models together
models = [
    ('Linear', rmse_linear, r2_score_linear_train, r2_score_linear_test, cv_linear.mean()),
    ('Ridge', rmse_ridge, r2_score_ridge_train, r2_score_ridge_test, cv_ridge.mean()),
    ('Lasso', rmse_lasso, r2_score_lasso_train, r2_score_lasso_test, cv_lasso.mean()),
    ('Support Vector', rmse_svr, r2_score_svr_train, r2_score_svr_test, cv_svr.mean()),
    ('Decision Tree', rmse_dt, r2_score_dt_train, r2_score_dt_test, cv_dt.mean()),
    ('Random Forest', rmse_rf, r2_score_rf_train, r2_score_rf_test, cv_rf.mean()),
    ('XGBoost', rmse_xgb, r2_score_xgb_train, r2_score_xgb_test, cv_xgb.mean()),
    ('LightGBM', rmse_lgbm, r2_score_lgbm_train, r2_score_lgbm_test, cv_lgbm.mean()),
]
Convert the evaluation results to a DataFrame.
predict = pd.DataFrame(data=models,
                       columns=['Model', 'RMSE', 'R2_Score(training)', 'R2_Score(test)', 'Cross-Validation'])
predict
Visualize the evaluation results
Cross-validation
predict.sort_values(by=['Cross-Validation'], ascending=False, inplace=True)

f, axe = plt.subplots(1, 1, figsize=(18, 6), dpi=200)
sns.barplot(x='Model', y='Cross-Validation', data=predict, ax=axe)
axe.set_xlabel('Model', size=16)
axe.set_ylabel('Cross-Validation', size=16)
axe.set_ylim(0, 1.0)
plt.show()
Coefficient of determination (training data)
predict.sort_values(by=['R2_Score(training)'], ascending=False, inplace=True)

f, axe = plt.subplots(1, 1, figsize=(18, 6), dpi=200)
sns.barplot(x='Model', y='R2_Score(training)', data=predict, ax=axe)
axe.set_xlabel('Model', size=16)
axe.set_ylabel('R2_Score(training)', size=16)
axe.set_ylim(0, 1.0)
plt.show()
Coefficient of determination (test data)
predict.sort_values(by=['R2_Score(test)'], ascending=False, inplace=True)

f, axe = plt.subplots(1, 1, figsize=(18, 6), dpi=200)
sns.barplot(x='Model', y='R2_Score(test)', data=predict, ax=axe)
axe.set_xlabel('Model', size=16)
axe.set_ylabel('R2_Score(test)', size=16)
axe.set_ylim(0, 1.0)
plt.show()
RMSE (root mean squared error)
predict.sort_values(by=['RMSE'], ascending=True, inplace=True)

f, axe = plt.subplots(1, 1, figsize=(18, 6), dpi=200)
sns.barplot(x='Model', y='RMSE', data=predict, ax=axe)
axe.set_xlabel('Model', size=16)
axe.set_ylabel('RMSE', size=16)
plt.show()
Visualize all the evaluation results at once
f, axes = plt.subplots(4, 1, figsize=(15, 13), dpi=200)
size = 7

# Cross-Validation
predict.sort_values(by=['Cross-Validation'], ascending=False, inplace=True)
sns.barplot(x='Model', y='Cross-Validation', data=predict, ax=axes[0])
axes[0].set_xlabel('Model', size=size)
axes[0].set_ylabel('Cross-Validation', size=size)
axes[0].set_ylim(0, 1.0)

# R2_Score(training)
predict.sort_values(by=['R2_Score(training)'], ascending=False, inplace=True)
sns.barplot(x='Model', y='R2_Score(training)', data=predict, ax=axes[1])
axes[1].set_xlabel('Model', size=size)
axes[1].set_ylabel('R2_Score(training)', size=size)
axes[1].set_ylim(0, 1.0)

# R2_Score(test)
predict.sort_values(by=['R2_Score(test)'], ascending=False, inplace=True)
sns.barplot(x='Model', y='R2_Score(test)', data=predict, ax=axes[2])
axes[2].set_xlabel('Model', size=size)
axes[2].set_ylabel('R2_Score(test)', size=size)
axes[2].set_ylim(0, 1.0)

# RMSE
predict.sort_values(by=['RMSE'], ascending=True, inplace=True)
sns.barplot(x='Model', y='RMSE', data=predict, ax=axes[3])
axes[3].set_xlabel('Model', size=size)
axes[3].set_ylabel('RMSE', size=size)

plt.show()
Output:
This is how you evaluate multiple models and find the best one.
This time, XGBoost came out as the best model.
The next step would be tuning its hyperparameters to push the accuracy even higher.
That's all for this article.
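As a preview of that tuning step, scikit-learn's GridSearchCV is one common approach. A minimal sketch on a small synthetic dataset, reusing the pipeline style from above (the grid of alpha values is purely illustrative, not a recommendation):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(50, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.rand(50)

pipe = Pipeline([('scaler', StandardScaler()), ('model', Ridge())])

# Keys are '<step name>__<parameter name>' when tuning pipeline parameters
param_grid = {'model__alpha': [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5)  # refits the best setting on all the data
search.fit(X, y)
print(search.best_params_)
```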