python - Can't reproduce xgb.cv cross-validation results
I am using Python 3.5 and the Python implementation of XGBoost, version 0.6.
I built a forward feature selection routine in Python, which iteratively builds the optimal set of features (the set leading to the best score; here the metric is binary classification error).
On my data set, using the xgb.cv routine, I can get down to an error rate of around 0.21 by increasing max_depth (of the trees) to 40...
But if I do a custom cross-validation, using the same XGBoost parameters, same folds, same metric, and same data set, I reach a best score of 0.70 with a max_depth of 4... If I use the optimal max_depth obtained with the xgb.cv routine, my score drops to 0.65... I just don't understand what is happening...
My best guess is that xgb.cv is using different folds (i.e. it shuffles the data before partitioning), but I think I submit the folds as input to xgb.cv (with the option shuffle=False)... So it might be something completely different...
Here is the code of forward_feature_selection (using xgb.cv):
    import numpy as np
    import pandas as pd
    import xgboost as xgb
    from sklearn.model_selection import KFold

    def forward_feature_selection(train, y_train, params, num_round=30, threshold=0,
                                  initial_score=0.5, to_exclude=[], nfold=5):
        k_fold = KFold(n_splits=13)
        selected_features = []
        gain = threshold + 1
        previous_best_score = initial_score
        train = train.drop(train.columns[to_exclude], axis=1)  # df.columns is a zero-based pd.Index
        features = train.columns.values
        selected = np.zeros(len(features))
        scores = np.zeros(len(features))
        while (gain > threshold):  # start the add-a-feature loop
            for i in range(0, len(features)):
                if (selected[i] == 0):  # take only features not yet selected
                    selected_features.append(features[i])
                    new_train = train.iloc[:][selected_features]
                    selected_features.remove(features[i])
                    dtrain = xgb.DMatrix(new_train, y_train, missing=None)
                    # dtrain = xgb.DMatrix(pd.DataFrame(new_train), y_train, missing=None)
                    if (i % 10 == 0):
                        print("Launching XGBoost for feature " + str(i))
                    xgb_cv = xgb.cv(params, dtrain, num_round, nfold=13, folds=k_fold, shuffle=False)
                    if params['objective'] == 'binary:logistic':
                        scores[i] = xgb_cv.tail(1)["test-error-mean"]  # classification
                    else:
                        scores[i] = xgb_cv.tail(1)["test-rmse-mean"]  # regression
                else:
                    scores[i] = initial_score  # discard already selected variables from the candidates
            best = np.argmin(scores)
            gain = previous_best_score - scores[best]
            if (gain > 0):
                previous_best_score = scores[best]
                selected_features.append(features[best])
                selected[best] = 1
                print("Adding feature: " + features[best] + " increases score by " + str(gain) + ". Final score is now: " + str(previous_best_score))
        return (selected_features, previous_best_score)

And here is my "custom" cross-validation:
    mean_error_rate = 0
    for train, test in k_fold.split(ds):
        dtrain = xgb.DMatrix(pd.DataFrame(ds.iloc[train]), dc.iloc[train]["bin_spread"], missing=None)
        gbm = xgb.train(params, dtrain, 30)
        dtest = xgb.DMatrix(pd.DataFrame(ds.iloc[test]), dc.iloc[test]["bin_spread"], missing=None)
        res.loc[test, "pred"] = gbm.predict(dtest)
        cv_reg = reg.fit(pd.DataFrame(ds.iloc[train]), dc.iloc[train]["bin_spread"])
        res.loc[test, "lasso"] = cv_reg.predict(pd.DataFrame(ds.iloc[test]))
        res.loc[test, "y_xgb"] = res.loc[test, "pred"] > 0.5
        res.loc[test, "xgb_right"] = (res.loc[test, "y_xgb"] == res.loc[test, "bin_spread"])
        # percentage of rows in this fold whose thresholded prediction matches the label,
        # relative to the nominal fold size n/13
        print(str(100*np.sum(res.loc[test, "xgb_right"])/(n/13)))
        mean_error_rate += 100*(np.sum(res.loc[test, "xgb_right"])/(n/13))
    print("mean_error_rate : " + str(mean_error_rate/13))

I am using the following parameters:
    params = {"objective": "binary:logistic", "booster": "gbtree", "max_depth": 4, "eval_metric": "error", "eta": 0.15}
    res = pd.DataFrame(dc["bin_spread"])
    k_fold = KFold(n_splits=13)
    n = dc.shape[0]
    num_trees = 30

And the call to forward feature selection:
    selfeat = forward_feature_selection(dc, dc["bin_spread"], params, num_round=num_trees, threshold=0, initial_score=999, to_exclude=[0, 1, 5, 30, 31], nfold=13)

Any help in understanding what is happening would be greatly appreciated! Thanks in advance for any tip!
This is normal; I have experienced the same. Firstly, KFold may be splitting differently each time. You have specified the folds in XGBoost, but if KFold is not splitting consistently (for example, when shuffling without a fixed seed), differences are normal. Next, the initial state of the model is different each time. There are inner random states within XGBoost that can cause this too, so try changing the eval metric to see if the variance reduces. If a particular metric suits your needs, try to average the best parameters and use those as your optimal parameters.
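Before blaming randomness, two things are worth checking when comparing a hand-written loop with xgb.cv. First, an unshuffled K-fold split assigns contiguous blocks of rows, so the same number of splits on the same data should give the same folds every time. Second, the "error" metric is the fraction of misclassified rows, while the custom loop in the question sums the correctly classified rows (`xgb_right`) and divides by the nominal fold size `n/13`, so the two numbers are not on the same scale and this may account for part of the gap. Below is a minimal pure-Python sketch of both points. The helper names (`contiguous_folds`, `error_rate`, `cross_validated_error`, `predict_fold`) and the toy data are hypothetical, and the constant predictor is only a stand-in for a trained model, not XGBoost.

```python
def contiguous_folds(n_samples, n_splits):
    """Return (train_idx, test_idx) pairs for unshuffled, contiguous folds.

    The first n_samples % n_splits folds get one extra row, which mirrors how
    an unshuffled K-fold split is usually laid out.
    """
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for k in range(n_splits):
        size = base + (1 if k < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


def error_rate(y_true, probs, threshold=0.5):
    """Fraction of rows where the thresholded probability disagrees with the label."""
    wrong = sum((p > threshold) != bool(t) for t, p in zip(y_true, probs))
    return wrong / len(y_true)  # divide by the actual fold size, not n / n_splits


def cross_validated_error(y, predict_fold, n_splits):
    """Average the per-fold error rate over contiguous folds.

    predict_fold(train_idx, test_idx) must return one probability per test row.
    """
    errors = []
    for train, test in contiguous_folds(len(y), n_splits):
        probs = predict_fold(train, test)
        errors.append(error_rate([y[i] for i in test], probs))
    return sum(errors) / len(errors)


# Toy labels (hypothetical data) and a stand-in "model" that predicts the
# positive rate of the training rows for every test row.
y = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]

def constant_predictor(train, test):
    rate = sum(y[i] for i in train) / len(train)
    return [rate] * len(test)

cv_error = cross_validated_error(y, constant_predictor, n_splits=5)
cv_accuracy = 1.0 - cv_error  # accuracy and error rate are complementary
print("mean error rate:", cv_error)
print("mean accuracy:", cv_accuracy)
```

With XGBoost itself, `predict_fold` would wrap `xgb.train` and the booster's `predict` on the rows of each fold, and the resulting error rate (not the accuracy) is the number to compare with the last row of `test-error-mean` from xgb.cv run on the same fold indices.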