ayethuzar committed on
Commit cfce332 · unverified · 1 Parent(s): edeee62

Update Milestone4Documentation.md

Files changed (1)
  1. Milestone4Documentation.md +113 -7
Milestone4Documentation.md CHANGED
@@ -87,21 +87,127 @@ This summary plot visualises all of the SHAP values. On the y-axis, the values a

  This summary plot gives additional insight through visualizing the relationship between features and their SHAP interaction values. As we can see, certain features tend to have a more significant impact on the prediction, and the distributions of the plots tell us which interactions are more significant than others, for example Overall Quality, Above Ground Living Area, Total Basement Square Foot, and Neighborhood.

- ## Tuning XGBoostWIthOptuna

  ## Optimized XGBoost

- ## SHAP for Optimized XGBoost

- ## XGBoost Model (baseline)

- ## SHAP for XGBoost baseline

- ## Tuning XGBoostWIthOptuna

- ## Optimized XGBoost

- ## SHAP for Optimized XGBoost

  ## Pickled the models for streamlit app
 
 
+ ## XGBoostWithOptuna
+
+ ```py
+ def objective(trial):
+     param = {
+         'max_depth': trial.suggest_int('max_depth', 1, 10),
+         'learning_rate': trial.suggest_float('learning_rate', 0.01, 1.0),
+         'n_estimators': trial.suggest_int('n_estimators', 50, 1000),
+         'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
+         'gamma': trial.suggest_float('gamma', 0.01, 1.0),
+         'subsample': trial.suggest_float('subsample', 0.01, 1.0),
+         'colsample_bytree': trial.suggest_float('colsample_bytree', 0.01, 1.0),
+         'reg_alpha': trial.suggest_float('reg_alpha', 0.01, 1.0),
+         'reg_lambda': trial.suggest_float('reg_lambda', 0.01, 1.0),
+         'random_state': trial.suggest_int('random_state', 1, 1000)
+     }
+     model = xgb.XGBRegressor(**param)
+     model.fit(X_train, y_train)
+     y_pred = model.predict(X_test)
+     return mean_squared_error(y_test, y_pred)
+ ```
+
+ Optuna is used to perform hyperparameter tuning for the XGBoost model. The objective function defined above takes a set of trial-suggested hyperparameters, fits a model, and returns the MSE on the test set as the metric to minimize. Optuna then searches the hyperparameter space for the combination that yields the lowest MSE.
 
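The commit does not show how the search itself is launched. As a minimal sketch of the usual Optuna driver (the study direction and the trial budget of 100 are assumptions, not values from the commit), consistent with the `study.best_params` used in the next section:

```py
import optuna

# Create a study that minimizes the MSE returned by objective()
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100)  # assumed trial budget

print('Best MSE:', study.best_value)
print('Best hyperparameters:', study.best_params)
```
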
  ## Optimized XGBoost

+ ```py
+ xgb_optimized = xgb.XGBRegressor(**study.best_params)
+ xgb_optimized.fit(X_train, y_train)
+ y_pred = xgb_optimized.predict(X_test)
+ ```

+ After hyperparameter tuning, the best set of hyperparameters found by Optuna is used to create an optimized XGBoost model (xgb_optimized). This model is expected to perform better than the initial XGBoost model due to the fine-tuned hyperparameters.
 
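One quick way to check that expectation is to compare test-set error before and after tuning. A small sketch (`xgb_baseline` is a hypothetical name standing in for the baseline regressor trained earlier):

```py
from sklearn.metrics import mean_squared_error

# RMSE of the tuned model on the held-out test set
rmse_tuned = mean_squared_error(y_test, y_pred, squared=False)

# RMSE of the baseline model (xgb_baseline is a placeholder name)
rmse_baseline = mean_squared_error(y_test, xgb_baseline.predict(X_test), squared=False)

print(f'baseline RMSE: {rmse_baseline:.0f}, tuned RMSE: {rmse_tuned:.0f}')
```
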
+ ## SHAP for Tuned Optimized XGBoost
 
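The plotting code is not part of this hunk; summary and interaction plots like the ones below are typically produced along these lines (a sketch assuming a tree explainer over the test features):

```py
import shap

# Explain the tuned model's predictions on the test set
explainer = shap.TreeExplainer(xgb_optimized)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)

# Interaction values for the interaction summary plot
shap_interaction_values = explainer.shap_interaction_values(X_test)
shap.summary_plot(shap_interaction_values, X_test)
```
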
+ <p align="center">
+ <img src="/img/XGBoostOptimized_SHAP_summary.png">
+ </p>

+ The optimized summary plot gives very similar results to the baseline XGBoost. However, the feature ranking changed, with GrLivArea now taking the top spot; YearBuilt and BsmtExposure also climbed one spot each. The density of each plot also appears a bit more spread out, with distinct gaps between the dense regions.
+
+ <p align="center">
+ <img src="/img/XGBoostOptimized_SHAP_summary_interaction.png">
+ </p>
+
+ This optimized interaction plot also shows a different feature order from the plot above. It has fewer outliers and an expanded range. However, we also see the SHAP values clustering around certain values.
+
+ ## LGBM
+
+ ```py
+ reg_lgbm_baseline = lgbm.LGBMRegressor()  # default - 'regression'
+ reg_lgbm_baseline.fit(X_train, y_train)
+ lgbm_predict = reg_lgbm_baseline.predict(X_test)
+ ```
+
+ LGBM is known for its fast training and strong performance, making it an excellent candidate for comparison with XGBoost. The baseline LGBM model (reg_lgbm_baseline) is trained on the training data (X_train and y_train), and its performance is evaluated using MAE, MSE, and RMSE scores. I used XGBoost for milestone-2 and switched to LGBMRegressor for milestone-3; the LGBM baseline is already better than the XGBoost baseline, with RMSE = 26233.
+
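The metric calls themselves are not shown in the hunk; a minimal sketch of the evaluation described above, reusing `lgbm_predict` from the snippet:

```py
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_test, lgbm_predict)
mse = mean_squared_error(y_test, lgbm_predict)
rmse = mean_squared_error(y_test, lgbm_predict, squared=False)
print(f'MAE: {mae:.0f}, MSE: {mse:.0f}, RMSE: {rmse:.0f}')
```
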
+ ## SHAP for LGBM
+
+ <p align="center">
+ <img src="/img/LGBM_SHAP_summary.png">
+ </p>
+
+ The LGBM baseline plot's feature order is the same as the XGBoost baseline. However, the outliers in the plot are no longer present, and the SHAP value range is more compact. There is also a change in the density shapes of the plots, which can be attributed to the more compact SHAP range.
+
+ <p align="center">
+ <img src="/img/LGBM_SHAP_summary_interaction.png">
+ </p>
+
+ The LGBM baseline interaction plot reverts to the baseline feature order while the spread of values is wider. It also shows more distinct areas of dense SHAP values.
+
+ ## LGBM with Optuna
+
+ ```py
+ def objective(trial, data=X, target=y):
+
+     params = {
+         'metric': 'rmse',
+         'random_state': 22,
+         'n_estimators': 20000,
+         'boosting_type': trial.suggest_categorical("boosting_type", ["gbdt", "goss"]),
+         'reg_alpha': trial.suggest_loguniform('reg_alpha', 1e-3, 10.0),
+         'reg_lambda': trial.suggest_loguniform('reg_lambda', 1e-3, 10.0),
+         'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]),
+         'subsample': trial.suggest_categorical('subsample', [0.6, 0.7, 0.85, 1.0]),
+         'learning_rate': trial.suggest_categorical('learning_rate', [0.005, 0.01, 0.02, 0.03, 0.05, 0.1]),
+         'max_depth': trial.suggest_int('max_depth', 2, 12, step=1),
+         'num_leaves': trial.suggest_int('num_leaves', 13, 148, step=5),
+         'min_child_samples': trial.suggest_int('min_child_samples', 1, 96, step=5),
+     }
+     reg = lgbm.LGBMRegressor(**params)
+     reg.fit(X_train, y_train,
+             eval_set=[(X_test, y_test)],
+             # categorical_feature=cat_indices,
+             callbacks=[log_evaluation(period=1000),
+                        early_stopping(stopping_rounds=50)
+                        ],
+             )
+
+     y_pred = reg.predict(X_test)
+     rmse = mean_squared_error(y_test, y_pred, squared=False)
+
+     return rmse
+ ```
+
+ Similar to XGBoost, Optuna is used to perform hyperparameter tuning for the LGBM model. Optuna searches for the combination of hyperparameters that minimizes the RMSE on the validation data. The tuned LGBM model is expected to improve on the baseline performance.
+
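As with XGBoost, the study driver is not shown in the hunk. A sketch of how this objective would typically be optimized and the best parameters reused (the trial budget and the `lgbm_study` / `reg_lgbm_tuned` names are assumptions):

```py
import optuna

lgbm_study = optuna.create_study(direction='minimize')
lgbm_study.optimize(objective, n_trials=100)  # assumed trial budget

# Refit a tuned model with the best parameters found.
# Note: fixed settings (metric, n_estimators, random_state) are not part of
# best_params and would need to be passed again if desired.
reg_lgbm_tuned = lgbm.LGBMRegressor(**lgbm_study.best_params)
reg_lgbm_tuned.fit(X_train, y_train)
```
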
+ ## SHAP for LGBM tuned with Optuna
+
+ <p align="center">
+ <img src="/img/LGBMTuned_SHAP_summary.png">
+ </p>
+
+ The SHAP summary plot for tuned LGBM introduces Neighborhood into the top 10 features while dropping BsmtExposure. It also reverses the positions of MasVnrArea and SaleCondition. The range of SHAP values increased and outliers are present again. The density shapes are different from those of the XGBoost models.
+
+ <p align="center">
+ <img src="/img/LGBMTuned_SHAP_summary_interaction.png">
+ </p>

+ The tuned LGBM SHAP summary interaction plot also reintroduces outliers while maintaining an expanded range. The density of SHAP values also differs from the XGBoost models.

  ## Pickled the models for streamlit app
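The pickling code itself sits outside this hunk. For reference, saving the trained models for a Streamlit app usually looks something like this (the file names and model variables are assumptions):

```py
import pickle

# Persist the tuned models so the Streamlit app can load them without retraining
with open('xgb_optimized.pkl', 'wb') as f:
    pickle.dump(xgb_optimized, f)

with open('lgbm_tuned.pkl', 'wb') as f:
    pickle.dump(reg_lgbm_tuned, f)
```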