sklearn中的逻辑回归

最新推荐文章于 2025-01-26 15:21:22 发布

原创最新推荐文章于 2025-01-26 15:21:22 发布 · 2.4k 阅读

3 ·

本内容遵循CC 4.0 BY-SA版权协议

标签

#sklearn #逻辑回归 #参数选择 #正则化

ml algorithm 专栏收录该内容

44 篇文章

订阅专栏

本文详细介绍了scikit-learn库中与逻辑回归相关的类，包括LogisticRegression、LogisticRegressionCV和logistic_regression_path。LogisticRegression是用于二分类的模型，支持多种优化算法和正则化选项。LogisticRegressionCV则通过交叉验证自动选择最佳的正则化参数C。logistic_regression_path是一个即将废弃的函数，用于计算一系列正则化参数下的逻辑回归路径。此外，文章还强调了不同求解器对正则化的支持情况以及多分类问题的处理方式。

1.sklearn中与逻辑回归有关的三个类

sklearn中，lr相关的代码在linear_model模块中，查看linear_model的__init__文件，内容如下

__all__ = ['ARDRegression',
           'BayesianRidge',
           'ElasticNet',
           'ElasticNetCV',
           'Hinge',
           'Huber',
           'HuberRegressor',
           'Lars',
           'LarsCV',
           'Lasso',
           'LassoCV',
           'LassoLars',
           'LassoLarsCV',
           'LassoLarsIC',
           'LinearRegression',
           'Log',
           'LogisticRegression',
           'LogisticRegressionCV',
           'ModifiedHuber',
           'MultiTaskElasticNet',
           'MultiTaskElasticNetCV',
           'MultiTaskLasso',
           'MultiTaskLassoCV',
           'OrthogonalMatchingPursuit',
           'OrthogonalMatchingPursuitCV',
           'PassiveAggressiveClassifier',
           'PassiveAggressiveRegressor',
           'Perceptron',
           'Ridge',
           'RidgeCV',
           'RidgeClassifier',
           'RidgeClassifierCV',
           'SGDClassifier',
           'SGDRegressor',
           'SquaredLoss',
           'TheilSenRegressor',
           'enet_path',
           'lars_path',
           'lars_path_gram',
           'lasso_path',
           'logistic_regression_path',
           'orthogonal_mp',
           'orthogonal_mp_gram',
           'ridge_regression',
           'RANSACRegressor']

可以看出来，与lr有关的部分一共有三个类：

LogisticRegression
LogisticRegressionCV
logistic_regression_path

2.logistic_regression_path

@deprecated('logistic_regression_path was deprecated in version 0.21 and '
            'will be removed in version 0.23.0')
def logistic_regression_path(X, y, pos_class=None, Cs=10, fit_intercept=True,
                             max_iter=100, tol=1e-4, verbose=0,
                             solver='lbfgs', coef=None,
                             class_weight=None, dual=False, penalty='l2',
                             intercept_scaling=1., multi_class='auto',
                             random_state=None, check_input=True,
                             max_squared_sum=None, sample_weight=None,
                             l1_ratio=None):
    """Compute a Logistic Regression model for a list of regularization
    parameters.

    This is an implementation that uses the result of the previous model
    to speed up computations along the set of solutions, making it faster
    than sequentially calling LogisticRegression for the different parameters.
    Note that there will be no speedup with liblinear solver, since it does
    not handle warm-starting.

    .. deprecated:: 0.21
        ``logistic_regression_path`` was deprecated in version 0.21 and will
        be removed in 0.23.

    Read more in the :ref:`User Guide <logistic_regression>`.

首先可以看到，logistic_regression_path将会在0.23.0版本移出。
由注释说明不难看出，logistic_regression_path主要是基于之前的训练结果用于训练加速。同时特别强调，如果用的是liblinear求解器，logistic_regression_path不能加快速度，因为liblinear不能处理warm-starting场景。

3.LogisticRegression

class LogisticRegression(BaseEstimator, LinearClassifierMixin,
                         SparseCoefMixin):
    """
    Logistic Regression (aka logit, MaxEnt) classifier.

    In the multiclass case, the training algorithm uses the one-vs-rest (OvR)
    scheme if the 'multi_class' option is set to 'ovr', and uses the
    cross-entropy loss if the 'multi_class' option is set to 'multinomial'.
    (Currently the 'multinomial' option is supported only by the 'lbfgs',
    'sag', 'saga' and 'newton-cg' solvers.)

    This class implements regularized logistic regression using the
    'liblinear' library, 'newton-cg', 'sag', 'saga' and 'lbfgs' solvers. **Note
    that regularization is applied by default**. It can handle both dense
    and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit
    floats for optimal performance; any other input format will be converted
    (and copied).

    The 'newton-cg', 'sag', and 'lbfgs' solvers support only L2 regularization
    with primal formulation, or no regularization. The 'liblinear' solver
    supports both L1 and L2 regularization, with a dual formulation only for
    the L2 penalty. The Elastic-Net regularization is only supported by the
    'saga' solver.

    Read more in the :ref:`User Guide <logistic_regression>`.
...
    def __init__(self, penalty='l2', dual=False, tol=1e-4, C=1.0,
                 fit_intercept=True, intercept_scaling=1, class_weight=None,
                 random_state=None, solver='lbfgs', max_iter=100,
                 multi_class='auto', verbose=0, warm_start=False, n_jobs=None,
                 l1_ratio=None):

        self.penalty = penalty
        self.dual = dual
        self.tol = tol
        self.C = C
        self.fit_intercept = fit_intercept
        self.intercept_scaling = intercept_scaling
        self.class_weight = class_weight
        self.random_state = random_state
        self.solver = solver
        self.max_iter = max_iter
        self.multi_class = multi_class
        self.verbose = verbose
        self.warm_start = warm_start
        self.n_jobs = n_jobs
        self.l1_ratio = l1_ratio

看看注释里头提到的几个关键点：
1.首先对于多分类，如果multi_class设置的是ovr，那么采用的是one-vs-rest的方式。如果设置的是multinomial，会使用cross-entropy loss。
2.具体的优化求解方法包括’liblinear’ library, ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’。‘newton-cg’, ‘sag’, and ‘lbfgs’支持L2正则，liblinear支持L1与L2，Elastic-Net正则只能使用’saga’。

关心一下初始化方法中的相关参数：
1.dual：选择目标函数是原始形式还是对偶形式，默认false。
2.tol：停止迭代的标准，默认1e-4。
3.C：正则化系数，默认1.0
4.solver：最优化求解方法，默认lbfgs
5.max_iter：最大迭代次数，默认100。

4.LogisticRegressionCV

class LogisticRegressionCV(LogisticRegression, BaseEstimator,
                           LinearClassifierMixin):
    """Logistic Regression CV (aka logit, MaxEnt) classifier.

    See glossary entry for :term:`cross-validation estimator`.

    This class implements logistic regression using liblinear, newton-cg, sag
    of lbfgs optimizer. The newton-cg, sag and lbfgs solvers support only L2
    regularization with primal formulation. The liblinear solver supports both
    L1 and L2 regularization, with a dual formulation only for the L2 penalty.
    Elastic-Net penalty is only supported by the saga solver.

    For the grid of `Cs` values and `l1_ratios` values, the best hyperparameter
    is selected by the cross-validator
    :class:`~sklearn.model_selection.StratifiedKFold`, but it can be changed
    using the :term:`cv` parameter. The 'newton-cg', 'sag', 'saga' and 'lbfgs'
    solvers can warm-start the coefficients (see :term:`Glossary<warm_start>`).

LogisticRegressionCV的用法与LogisticRegression基本相当，不一样在于LogisticRegressionCV通过cross validation，来选择正则化参数C，同时通过l1_ratios来选择l1正则与l2正则的组合。