python - Can't reproduce xgb.cv cross-validation results


I am using Python 3.5 and the Python implementation of XGBoost, version 0.6.

I built a forward feature selection routine in Python, which iteratively builds the optimal set of features (the one leading to the best score; here the metric is the binary classification error).

On my data set, using the xgb.cv routine, I can get down to an error rate of around 0.21 by increasing max_depth (of the trees) up to 40...

But if I do a custom cross-validation, using the same XGBoost parameters, same folds, same metric, and same data set, I reach a best score of 0.70 with a max_depth of 4... If I use the optimal max_depth obtained with the xgb.cv routine, my score drops to 0.65... I just don't understand what is happening...

My best guess is that xgb.cv is using different folds (i.e. it shuffles the data before partitioning), but I also think that I submit the folds as an input to xgb.cv (with the option shuffle=False)... So, it might be something completely different...

Here is the code of forward_feature_selection (using xgb.cv):

    import numpy as np
    import pandas as pd
    import xgboost as xgb
    from sklearn.model_selection import KFold

    def forward_feature_selection(train, y_train, params, num_round=30, threshold=0,
                                  initial_score=0.5, to_exclude=[], nfold=5):
        k_fold = KFold(n_splits=13)
        selected_features = []
        gain = threshold + 1
        previous_best_score = initial_score
        train = train.drop(train.columns[to_exclude], axis=1)  # df.columns is a zero-based pd.Index
        features = train.columns.values
        selected = np.zeros(len(features))
        scores = np.zeros(len(features))
        while gain > threshold:  # start of the add-a-feature loop
            for i in range(0, len(features)):
                if selected[i] == 0:  # take only features not yet selected
                    selected_features.append(features[i])
                    new_train = train.iloc[:][selected_features]
                    selected_features.remove(features[i])
                    dtrain = xgb.DMatrix(new_train, y_train, missing=None)
                    # dtrain = xgb.DMatrix(pd.DataFrame(new_train), y_train, missing=None)
                    if i % 10 == 0:
                        print("Launching XGBoost for feature " + str(i))
                    xgb_cv = xgb.cv(params, dtrain, num_round, nfold=13, folds=k_fold, shuffle=False)
                    if params['objective'] == 'binary:logistic':
                        scores[i] = xgb_cv.tail(1)["test-error-mean"]  # classification
                    else:
                        scores[i] = xgb_cv.tail(1)["test-rmse-mean"]  # regression
                else:
                    scores[i] = initial_score  # discard already selected variables from the candidates
            best = np.argmin(scores)
            gain = previous_best_score - scores[best]
            if gain > 0:
                previous_best_score = scores[best]
                selected_features.append(features[best])
                selected[best] = 1

            print("Adding feature: " + features[best] + " increases score by " + str(gain) + ". Final score is now: " + str(previous_best_score))

        return (selected_features, previous_best_score)

And here is my "custom" cross-validation:

    # ds (the feature data) and reg (a regression model) are defined elsewhere (not shown)
    mean_error_rate = 0
    for train, test in k_fold.split(ds):
        dtrain = xgb.DMatrix(pd.DataFrame(ds.iloc[train]), dc.iloc[train]["bin_spread"], missing=None)
        gbm = xgb.train(params, dtrain, 30)
        dtest = xgb.DMatrix(pd.DataFrame(ds.iloc[test]), dc.iloc[test]["bin_spread"], missing=None)
        # .loc replaces the deprecated .ix; like the reads below, it assumes a default integer index
        res.loc[test, "pred"] = gbm.predict(dtest)

        cv_reg = reg.fit(pd.DataFrame(ds.iloc[train]), dc.iloc[train]["bin_spread"])
        res.loc[test, "lasso"] = cv_reg.predict(pd.DataFrame(ds.iloc[test]))

        res.loc[test, "y_xgb"] = res.loc[test, "pred"] > 0.5
        # xgb_right marks correct predictions, so the sums below are percentages of correct predictions
        res.loc[test, "xgb_right"] = (res.loc[test, "y_xgb"] == res.loc[test, "bin_spread"])

        print(str(100*np.sum(res.loc[test, "xgb_right"])/(n/13)))
        mean_error_rate += 100*(np.sum(res.loc[test, "xgb_right"])/(n/13))
    print("mean_error_rate : " + str(mean_error_rate/13))
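A note on the metric: the loop above sums `xgb_right`, which flags correct predictions, so the printed figure is a percentage of correct predictions, whereas `test-error-mean` from xgb.cv is a fraction of wrong ones. It also divides by n/13, which is the exact fold size only when n is a multiple of 13. Below is a minimal sketch of an error rate computed the way the XGBoost documentation describes the `error` metric (predictions above 0.5 count as positive). `binary_error` is a hypothetical helper name, and the sample values are made up for illustration:

```python
def binary_error(preds, labels, threshold=0.5):
    """Fraction of wrong predictions: #(wrong cases) / #(all cases)."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    # A prediction is wrong when its thresholded class differs from the label.
    wrong = sum(1 for p, y in zip(preds, labels) if (p > threshold) != bool(y))
    return wrong / len(preds)


# Made-up probabilities and labels, for illustration only.
preds = [0.9, 0.2, 0.6, 0.4]
labels = [1, 0, 0, 1]
error = binary_error(preds, labels)  # the last two predictions are wrong
accuracy = 1.0 - error
```

Comparing `error` (not `accuracy`) against `test-error-mean` keeps both procedures on the same scale.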

I am using the following parameters:

    params = {"objective": "binary:logistic",
              "booster": "gbtree",
              "max_depth": 4,
              "eval_metric": "error",
              "eta": 0.15}
    res = pd.DataFrame(dc["bin_spread"])
    k_fold = KFold(n_splits=13)
    n = dc.shape[0]
    num_trees = 30

And the call to the forward feature selection:

    selfeat = forward_feature_selection(dc,
                                        dc["bin_spread"],
                                        params,
                                        num_round=num_trees,
                                        threshold=0,
                                        initial_score=999,
                                        to_exclude=[0, 1, 5, 30, 31],
                                        nfold=13)

Any help in understanding what is happening would be greatly appreciated! Thanks in advance for any tip!

This is normal; I have experienced the same. Firstly, KFold is splitting differently each time. You have specified the folds in XGBoost, but KFold is not splitting consistently, which is normal. Next, the initial state of the model is different each time. There are inner random states within XGBoost that can cause this too, so try changing the eval metric to see if the variance reduces. If a particular metric suits your needs, try to average the best parameters and use them as your optimal parameters.
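To rule the folds in or out as a source of the difference, one option is to make the split explicit and reuse the very same index lists in both procedures. The sketch below builds contiguous, unshuffled folds in plain Python. `contiguous_folds` is a hypothetical helper, and the assumption that this layout matches what scikit-learn's KFold produces without shuffling is my understanding, not something stated in the post:

```python
def contiguous_folds(n, k):
    """Return k (train_idx, test_idx) pairs over range(n), without shuffling.

    The first n % k folds get one extra element, so every row is tested once.
    """
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    base, extra = divmod(n, k)
    folds, start = [], 0
    for f in range(k):
        stop = start + base + (1 if f < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n))
        folds.append((train_idx, test_idx))
        start = stop
    return folds


# Illustrative sizes only: 30 rows split into 13 folds.
folds = contiguous_folds(30, 13)
fold_sizes = [len(test_idx) for _, test_idx in folds]
```

Because the fold sizes differ when n is not a multiple of k, a per-fold rate is safer divided by the actual fold length than by n/k. The same index lists can be passed to `.iloc` in the manual loop, which makes any remaining gap attributable to something other than the split.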

