GLM (Generalized Linear Model) and LR (Logistic Regression) Explained


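This article introduces the generalized linear model (GLM) and its components: the random component, the systematic component, and the link function. By contrast with linear regression, it presents logistic regression as an application of the GLM and derives its loss function.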

GLM: Generalized Linear Model

George Box said: “All models are wrong, some are useful”

1. Starting from the Linear Model

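As the foundation of the GLM, this section reviews classical Linear Regression and introduces some basic terms.
Linear regression takes the basic form below. In essence, we observe $x$ and use a simple linear function $h(x)$ to predict $y$: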

$y=h(x)=w^Tx$

1.1 dependent variable $y$

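This is the prediction target, also called the response variable. One point is easy to confuse here: $y$ can actually carry three meanings (we model a distribution, we observe a sample, and we predict an expectation):

distribution: when discussing the response variable in the abstract, what we care about is the distribution that $y|\ x,w$ follows given the data and the parameters. In Linear Regression $y$ follows a Gaussian distribution and takes real values, but the focus here is the distribution itself.

observed outcome: our label, sometimes written $t$ to distinguish it; this is the result actually observed and is a single value.

expected outcome: $y=\mathbf{E}[y|x]=h(x)$ denotes the model's prediction. Note that $y$ follows a distribution, but the prediction is the mean $\mu$ of that distribution, which is a single value.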

1.2 independent variable $x$

These are the features, which may have many dimensions; a single feature is also called a predictor.

1.3 hypothesis $h(x)$

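The hypothesis of the linear model is very simple: $h(x) = w^Tx$ , the inner product of the weight vector and the feature vector, is called the linear predictor. This is the linear model, and the GLM is a generalization built on it.
Looking closer, the individual features (predictors) $x_j$ are combined linearly through the coefficients $w_j$ , which aggregates their information; the different weights (coefficients) reflect how much each feature contributes.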
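
As a small illustration of the linear predictor, the sketch below computes $w^Tx$ with NumPy and also shows the per-feature contributions $w_jx_j$ that add up to it. The weight and feature values are hypothetical and chosen only for illustration.

```python
import numpy as np

def linear_predictor(w, x):
    """Return the inner product w^T x of a weight vector and a feature vector."""
    return float(np.dot(w, x))

# Hypothetical weights and features, chosen only for illustration.
w = np.array([0.5, -1.0, 2.0])
x = np.array([2.0, 1.0, 0.5])

# Per-feature contributions w_j * x_j; their sum is the prediction h(x).
contributions = w * x
prediction = linear_predictor(w, x)

print("contributions:", contributions)
print("prediction:", prediction)
```

A larger $|w_j|$ means the same change in $x_j$ moves the prediction more, which is the sense in which the weights reflect each feature's contribution.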

2. Generalizing to the Generalized Linear Model

2.1 Motive & Definition

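The linear model has a very strong limitation: the response variable $y$ must follow a Gaussian distribution. The main consequence is that the fitted target $y$ lives on the real-valued scale $(-\infty,+\infty)$ . Specifically, there are two problems:

The range of $y$ does not match some common problems, for example counts (the number of visitors is never negative) and binary outcomes (a two-class classification problem).

The variance of $y$ is constant. In some problems the variance depends on the mean of $y$ ; for example, the larger the predicted target, the larger the variance (and the less precise the prediction).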
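
To make the second problem concrete, the sketch below evaluates the standard variance formulas of two non-Gaussian distributions: $\mu(1-\mu)$ for Bernoulli and $\mu$ for Poisson. These formulas are textbook facts rather than something stated in the text above; the function names are hypothetical.

```python
def bernoulli_variance(mu):
    """Variance of a Bernoulli variable with mean mu: mu * (1 - mu)."""
    return mu * (1.0 - mu)

def poisson_variance(mu):
    """Variance of a Poisson variable with mean mu: equal to the mean."""
    return mu

# The variance changes with the mean, unlike the constant-variance Gaussian case.
for mu in (0.1, 0.5, 0.9):
    print("Bernoulli mean", mu, "variance", bernoulli_variance(mu))
for mu in (1.0, 10.0, 100.0):
    print("Poisson mean", mu, "variance", poisson_variance(mu))
```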

This is where the Generalized Linear Model comes in, to overcome these two problems.
A one-sentence definition of the GLM (from Wikipedia):
In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution.
In more detail, a GLM can be decomposed into three parts: the Random Component, the Systematic Component, and the Link Function.

2.2 Random Component

An exponential family model for the response

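This means the response variable must follow some exponential family distribution, i.e. $y|x,w \thicksim ExponentialFamily(\eta)$ , where $\eta$ is the natural parameter of the exponential family.
For example, linear regression assumes a Gaussian distribution and logistic regression a Bernoulli distribution. The exponential family contains many other distributions, such as the multinomial and Poisson distributions, and the Laplace distribution when its location parameter is fixed.
This component can also be called the Error Structure: the error distribution model for the response. For the Gaussian case it is intuitive that the residual $\epsilon=y-h(x)$ follows the Gaussian distribution $N(0,\sigma^2)$ ; for Bernoulli, however, there is no direct error term. One can construct other residuals that follow a Binomial distribution, but that is rather abstract and this article does not pursue that angle.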
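
For reference, a brief sketch of what "exponential family" means, using the common textbook parameterization. The symbols $b(y)$ , $T(y)$ and $a(\eta)$ are notation assumed here for illustration and are not defined in the text above:

$p(y;\eta)=b(y)\exp\left(\eta^TT(y)-a(\eta)\right)$

As a short worked case, the Bernoulli distribution with mean $\mu$ can be rewritten in this form:

$p(y;\mu)=\mu^y(1-\mu)^{1-y}=\exp\left(y\ln\displaystyle\frac{\mu}{1-\mu}+\ln(1-\mu)\right)$

Matching terms gives $\eta=\ln\displaystyle\frac{\mu}{1-\mu}$ , $T(y)=y$ , $a(\eta)=-\ln(1-\mu)=\ln(1+e^{\eta})$ and $b(y)=1$ .
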
2.3 Systematic Component

linear predictor

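The generalized linear model is still a linear model at heart. What we generalize is only the distribution of the response variable $y$ ; what the model ultimately learns is still the weight vector in the linear predictor $w^Tx$ .
Note that a fairly strong assumption of the GLM is $\eta=w^Tx$ , i.e. the natural parameter $\eta$ of the exponential family for $y$ equals the linear predictor. This assumption is more of a design choice (from Andrew), but the choice is reasonable: at the very least, $\eta$ and the linear predictor are on the same scale.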

2.4 Link Function

A link function connects the mean of the response to the linear predictor

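Through the Random Component and the Systematic Component above, we have brought $y$ and $w^Tx$ together under an exponential family distribution; the final step is to connect the two through the link function. For any exponential family distribution there is a (canonical) link function $g(\mu)=\eta$ , where $\mu$ is the mean of the distribution and $\eta$ is the natural parameter. For example, the link function of the Gaussian is the identity ( $g(\mu)=\mu=\eta$ ), and the link function of the Bernoulli is the logit ( $g(\mu)=\ln\displaystyle\frac{\mu}{1-\mu}=\eta$ ).
The link function relates the mean of the response variable's distribution (which is what we actually predict) to the linear predictor. (Strictly speaking, this holds only when $T(y)=y$ , but that condition is met in most cases and is not discussed further here.) In effect, the link function converts the original scale of $y$ to the scale of the linear predictor. In addition, the link function of each distribution can be derived directly by rewriting its original density in exponential family form; Chapter 4 of this article explains this in detail.
Finally, it should be stressed that the inverse of the link function, $g^{-1}(\eta)=\mu$ , is called the response function. The response function maps the linear predictor directly to the predicted mean of $y$ . Commonly used response functions include the logistic/sigmoid (the inverse of the logit) and the softmax (the inverse of its multiclass generalization).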
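
The relationship between a link function and its response function can be checked numerically. The following is a minimal sketch using only the standard library; the function names and the value $\mu=0.8$ are assumptions made for illustration.

```python
import math

def logit(mu):
    """Link function for Bernoulli: maps a mean in (0, 1) to the real line."""
    return math.log(mu / (1.0 - mu))

def sigmoid(eta):
    """Response function: the inverse of logit, maps the real line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def softmax(etas):
    """Multiclass counterpart of sigmoid; subtracts the max for numerical stability."""
    m = max(etas)
    exps = [math.exp(e - m) for e in etas]
    total = sum(exps)
    return [e / total for e in exps]

mu = 0.8                  # hypothetical mean, for illustration only
eta = logit(mu)           # move to the linear-predictor scale
recovered = sigmoid(eta)  # move back to the mean scale

# With two classes and scores (eta, 0), softmax reduces to sigmoid(eta).
two_class = softmax([eta, 0.0])

print(eta, recovered, two_class)
```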

2.5 Contrast between LM & GLM

linear predictor $\eta=w^Tx$

Linear Regression

response variable $y \thicksim N(\eta,\sigma_e^2)$

link function $\eta=g(\mu)=\mu$ , called identity

prediction $h(x) = \mathbf{E}[y|x,w]=\mu=g^{-1}(\eta)=\eta$

Generalized Linear Model

response variable $y \thicksim exponential\ family$

link function $g(\mu)$ , e.g. logit for Bernoulli

prediction $h(x) = \mathbf{E}[y|x,w]=\mu=g^{-1}(\eta)$ , e.g. logistic for Bernoulli

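This again emphasizes that the linear predictor part is the same in both; what differs is the assumed distribution of the response variable. The response function of the Gaussian is the identity, $g^{-1}(\eta)=\eta=\mu$ , whereas the GLM uses the response function that matches the assumed exponential family distribution (for example, the sigmoid for Bernoulli).

One more point worth emphasizing: in both the LM and the GLM, different inputs $x$ actually yield different distributions of the response variable, so the $\mu$ of each distribution differs and hence so does the prediction. Although only a single value is predicted for each data point, it corresponds to a whole distribution. And if two data points are similar, their distributions are similar, and so are the predictions.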
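
The contrast can be sketched in a few lines: the linear predictor is computed the same way in both models, and only the response function applied to it changes. The helper `glm_predict` and the numeric values are hypothetical and serve only as an illustration.

```python
import math

def identity(eta):
    """Response function for the Gaussian case (Linear Regression)."""
    return eta

def sigmoid(eta):
    """Response function for the Bernoulli case (Logistic Regression)."""
    return 1.0 / (1.0 + math.exp(-eta))

def glm_predict(w, x, response):
    """Compute eta = w^T x, then map it to the mean mu with the response function."""
    eta = sum(w_j * x_j for w_j, x_j in zip(w, x))
    return response(eta)

# Hypothetical weights and features, chosen only for illustration.
w = [0.5, -1.0, 2.0]
x = [2.0, 1.0, 0.5]

linear_pred = glm_predict(w, x, identity)   # mu = eta
logistic_pred = glm_predict(w, x, sigmoid)  # mu = sigmoid(eta), a probability

print(linear_pred, logistic_pred)
```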

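3. Example: From Linear Regression to Logistic Regression

The GLM we use most often is Logistic Regression, which assumes that $y$ follows a Bernoulli distribution; this section describes it in detail.

3.1 Logistic Regression as a GLM

The GLM concepts map onto LR as follows: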

An exponential family (random component) $y_i\thicksim Bern(\mu_i)$

linear predictor (systematic component): $\eta_i=\sum_j^J w_jx_{ij}$ ; every GLM has this part, and it is the same in all of them

link function: $\eta=g(\mu)=\ln\displaystyle\frac{\mu}{1-\mu}$ ; this function is the log-odds, also called the logit function

response function: $\mu=g^{-1}(\eta)=\displaystyle\frac{1}{1+e^{-\eta}}$ , called the logistic or sigmoid function

prediction: $h(x_i)=\mathbf{E}[y_i|w,x_i]=sigmoid(\eta_i)$

loss function: $E=-y\ln h(x)-(1-y)\ln(1-h(x))$ ; as in the Linear Model, it is derived by MLE, and Section 3.3 gives the full derivation
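
The pieces listed above can be put together in a short NumPy sketch. It averages the per-sample loss $E$ over a tiny hypothetical dataset and runs plain gradient descent. The averaging, the learning rate, the toy data, and the use of the standard gradient $(h(x)-y)x$ of this loss are assumptions made for illustration, not something taken from the text.

```python
import numpy as np

def sigmoid(eta):
    """Response function: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-eta))

def cross_entropy(w, X, y):
    """Mean of E = -y ln h(x) - (1 - y) ln(1 - h(x)) over all samples."""
    h = np.clip(sigmoid(X @ w), 1e-12, 1.0 - 1e-12)  # avoid log(0)
    return float(np.mean(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h)))

def gradient(w, X, y):
    """Gradient of the mean loss with respect to w: X^T (h - y) / n."""
    return X.T @ (sigmoid(X @ w) - y) / len(y)

# Hypothetical toy data: a bias column of ones plus one feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, -1.0], [1.0, -2.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

w = np.zeros(2)
loss_before = cross_entropy(w, X, y)
for _ in range(100):
    w = w - 0.1 * gradient(w, X, y)  # assumed learning rate of 0.1
loss_after = cross_entropy(w, X, y)

print(loss_before, loss_after, w)
```

With all weights at zero every prediction is 0.5, so the starting loss is $\ln 2$ ; the gradient steps then lower it as the predictions move toward the labels.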

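The Scale Insight helps us understand the link function better:

binary outcome: 0 or 1, corresponding to the observed outcome, i.e. the label

probability: $[0,1]$ , corresponding to the expected outcome, i.e. the $\mu$ of the distribution of $y$

odds: $(0,\infty)$ ; probability and odds are interconvertible ( $o_a=\displaystyle\frac{p_a}{1-p_a},\ p_a=\displaystyle\frac{o_a}{1+o_a}$ ); the odds are the ratio of the probability that an event happens to the probability that it does not, a concept common in gambling and related to betting payouts

log-odds/logit: $(-\infty,+\infty)$ , i.e. $\log\displaystyle\frac{p_a}{1-p_a}$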
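
The scales above can be walked through with a few helper functions. This is a minimal sketch; the function names and the sample probability 0.75 are assumptions for illustration.

```python
import math

def prob_to_odds(p):
    """Probability in (0, 1) -> odds in (0, inf)."""
    return p / (1.0 - p)

def odds_to_prob(o):
    """Odds in (0, inf) -> probability in (0, 1)."""
    return o / (1.0 + o)

def log_odds(p):
    """Probability in (0, 1) -> log-odds (logit) on the whole real line."""
    return math.log(prob_to_odds(p))

p = 0.75                # hypothetical probability
o = prob_to_odds(p)     # 3.0, i.e. odds of 3 to 1
back = odds_to_prob(o)  # 0.75 again

# Probabilities below 0.5 give negative log-odds, above 0.5 positive ones.
for q in (0.1, 0.5, 0.9):
    print(q, prob_to_odds(q), log_odds(q))
```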

So the log-odds matches the linear predictor in scale. For Logistic Regression, the logit link function establishes the connection to the simple linear model. A link function is precisely what maps the mean $\mu$ of any exponential family distribution onto the linear scale:

$\eta=g(\mu)=\ln\displaystyle\frac{\mu}{1-\mu}$

$\eta_i=\ln\displaystyle\frac{\mathbf{E}[y_i|w,x_i]}{1-\mathbf{E}[y_i|w,x_i]}=w^Tx_i=\sum_j^J w_jx_{ij}$