xgboost + hyperopt: a hyperparameter tuning routine

Latest recommended article published on 2023-12-02 23:45:16

This content is licensed under CC 4.0 BY-SA.

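This article covers two common approaches to hyperparameter tuning in machine learning: setting hyperparameters by hand, and tuning them automatically with the Hyperopt package. Manual tuning typically splits the data into 70% training, 10% validation, and 20% test. With Hyperopt, the tuning time can be kept under control, and cross-validation guards against overfitting while searching for the best model.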

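1. Setting hyperparameters by hand:

Sometimes, to save time and validate an idea quickly, you skip the tuning tools and set the hyperparameters by hand. In that case the data is usually split into 70% train, 10% validation, and 20% test.

Because fitting works by adding trees one after another, you pass a validation set while fitting on train and stop as soon as the loss on validation no longer decreases. This avoids overfitting on train.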

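2. Tuning with a tool package

Compared with grid search, hyperopt lets you cap the number of evaluations (max_evals_num in the code further down), which bounds the tuning time, so you are not left anxiously waiting on a search that never seems to end.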

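The basic routine can be summarized as follows:

1. Split the data into a training set and a test set, for example 80% training and 20% test.

2. Use hyperopt on the 80% training set to tune the tree-related hyperparameters.

Tuning requires an evaluation function: hyperopt_eval_func().

For example, run 5-fold cross-validation on train: the cross_val_score() result represents how well the current set of hyperparameters performs, which avoids overfitting on train during tuning.

So in this setup there is really no need for a separate external validation set.

The code for tuning with hyperopt is as follows: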

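from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.model_selection import KFold, cross_val_score
from xgboost import XGBClassifier

def hyperopt_eval_func(params, X, y):
    '''Fit the model defined by params on X and return the CV score.
    Args:
        @params: model hyperparameters
        @X: input features
        @y: ground-truth labels
    Return:
        @score: cross-validation loss (negative mean F1 over 5 folds)
    '''
    # hp.quniform returns floats, so cast the integer-valued hyperparameters
    int_feat = ['n_estimators', 'max_depth', 'min_child_weight']
    for p in int_feat:
        params[p] = int(params[p])

    clf = XGBClassifier(**params)

    # use the CV result as the objective; negated because hyperopt minimizes
    shuffle = KFold(n_splits=5, shuffle=True)
    score = -1 * cross_val_score(clf, X, y, scoring='f1', cv=shuffle).mean()

    return score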

def hyperopt_binary_model(params):
    '''Objective passed to hyperopt: wraps hyperopt_eval_func and prints progress.
    Args:
        @params: hyperparameters proposed by hyperopt for this trial
    Return:
        @loss_status: loss and status

    '''
    global best_loss, count, binary_X, binary_y
    count += 1

    # 'type' only labels the model; it is not an XGBClassifier argument
    clf_type = params['type']
    del params['type']
    loss = hyperopt_eval_func(params, binary_X, binary_y)
    print(count, loss)
    if loss < best_loss:
        ss = 'count:%d  new best loss: %4.3f , using %s' % (count, loss, clf_type)
        print(ss)
        best_loss = loss

    loss_status = {'loss': loss, 'status': STATUS_OK}
    return loss_status

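def get_best_model(best):
    '''Build the model for the best hyperparameters found by hyperopt.
    Args:
        @best: best hyperparameters
    Return:
        @clf: xgb model (not yet fitted)
    '''
    # hp.quniform returns floats, so cast the integer-valued hyperparameters
    int_feat = ['n_estimators', 'max_depth', 'min_child_weight']
    for p in int_feat:
        best[p] = int(best[p])

    # fix the random state (random_state is the scikit-learn style name for the seed)
    best['random_state'] = 2018
    clf = XGBClassifier(**best)

    return clf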


def search_best_model(X_train, y_train, predictors, max_evals_num=10):
    '''Use hyperopt to find the best xgb model.
    Args:
        @X_train: training samples, X data
        @y_train: training samples, y target
        @predictors: features used for prediction (not used inside this function)
        @max_evals_num: number of hyperopt evaluations; more evaluations tend to
                        give a better model but take longer
    Return:
        @clf: best model (not yet fitted)
    '''
    space = {
        'type': 'xgb',
        'n_estimators': hp.quniform('n_estimators', 50, 400, 50),
        'max_depth': hp.quniform('max_depth', 2, 8, 1),
        'learning_rate': hp.uniform('learning_rate', 0.01, 0.1),
        'min_child_weight': hp.quniform('min_child_weight', 2, 8, 1),
        'gamma': hp.uniform('gamma', 0, 0.2),
        'subsample': hp.uniform('subsample', 0.7, 1.0),
        'colsample_bytree': hp.uniform('colsample_bytree', 0.7, 1.0)
    }

    # hyperopt train: the objective reads its data and counters from globals
    global count, best_loss, binary_X, binary_y
    count = 0
    best_loss = 1000000
    binary_X = X_train
    binary_y = y_train
    trials = Trials()
    best = fmin(hyperopt_binary_model, space, algo=tpe.suggest, max_evals=max_evals_num, trials=trials)
    print('best param:{}'.format(best))
    print('best cv loss (negative F1) on train:{}'.format(best_loss))

    # build the classifier from the best hyperparameters via get_best_model(best)
    clf = get_best_model(best)

    return clf