Open Notes: Machine Learning Fundamentals (5: (Ada)Boosting, Multi-class, Loss, K-Means|Medoids)

This content is licensed under CC 4.0 BY-SA.

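This article introduces the basic idea of boosting in machine learning, i.e. combining multiple weak learners into a strong learner, and covers bagging, random forests, and AdaBoost. It also discusses multi-class classification strategies and clustering algorithms such as K-means and K-medoids.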

by Max Z. C. Li (843995168@qq.com)

based on lecture notes of Prof. Kai-Wei Chang, UCLA 2018 winter CM 146 Intro. to M.L., with marks and comments (//,==>, words, etc.)
all graphs/pictures are from the lecture notes; I disavow the background ownership watermarks auto-added by csdn.

original acknowledgment: "The instructor gratefully acknowledges Eric Eaton (UPenn), who assembled the original slides, Jessica Wu (Harvey Mudd), David Kauchak (Pomona), Dan Roth (Upenn), Sriram Sankararaman (UCLA), whose slides are also heavily used, and the many others who made their course materials freely available online."

SL, LC: Boosting (12)

Basic Idea:

yet another method to introduce non-linearity into linear models;

==> combine (potentially infinitely) many weak learners to yield a strong one;

Theoretical Motivation

why it may work:

The predictors make different types of mistakes;

Combining them may yield a stronger predictor.

Outline

Example

Bagging (Bootstrap Aggregating, Breiman 1994)

Idea

generating multiple versions of a predictor and using these to get an aggregated predictor.

the idea of "Bootstrapping" essentially means generating multiple datasets by resampling (with replacement) from the single original dataset.

e.g.

Advantage

If perturbing the training set can cause significant changes in the learned classifier (i.e. the learner is unstable), then bagging can improve accuracy (lower variance)
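A minimal sketch of the bagging idea, assuming 1-D inputs with labels in {-1, +1}. The weak learner `train_stump` (a threshold at the midpoint of the two class means) and all function names are hypothetical choices for illustration, not taken from the lecture notes:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) points from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Hypothetical weak learner: threshold halfway between the two class means."""
    pos = [x for x, y in sample if y == +1]
    neg = [x for x, y in sample if y == -1]
    if not pos or not neg:
        # Degenerate resample with a single class: always predict that class.
        label = +1 if pos else -1
        return lambda x: label
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    sign = +1 if mean_pos > mean_neg else -1
    return lambda x: sign if x > threshold else -sign

def bagging(data, n_models, seed=0):
    """Train one weak learner per bootstrap sample; predict by majority vote."""
    rng = random.Random(seed)
    models = [train_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]
    def predict(x):
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]
    return predict

data = [(0.0, -1), (1.0, -1), (2.0, -1), (3.0, +1), (4.0, +1), (5.0, +1)]
predict = bagging(data, n_models=25, seed=0)
print(predict(0.5), predict(4.5))
```

Each model sees a slightly different dataset, so the individual thresholds vary; the vote averages that variation out.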

Application

Bagged Decision Trees

Random Forests (Bagged Trees++)

===> Random Forests generate weak learners by sampling both the data and the feature sets;

===> the other prominent family is boosting, e.g. AdaBoost below (gradient boosting machines are a related family); AdaBoost creates weak learners by assigning and adjusting the weights of sample points according to the mistakes made by the previous weak learners.

AdaBoost

Construct D_t

let

how:

why:

how to choose alpha_t:

===> so that the more accurate the current weak learner h_t is under D_t (smaller epsilon_t, hence larger alpha_t), the more strongly the correctly classified examples are suppressed and the misclassified ones promoted in D_{t+1}.

===> alternatively, if h_t is very error-prone under D_t (epsilon_t > 0.5, hence alpha_t < 0) ===> the correctly classified examples are strengthened and the misclassified ones suppressed for the next weak learner.

==> check the math on both sides of epsilon = 0.5 and see clearly for yourself.
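The check around epsilon = 0.5 can be done numerically. This sketch assumes the standard AdaBoost choices alpha_t = 0.5 * ln((1 - epsilon_t) / epsilon_t) and the unnormalized update factor exp(-alpha_t * y_i * h_t(x_i)); the original figures with the lecture's exact formulas are missing here, so treat these as assumptions. The function names are hypothetical:

```python
import math

def alpha(eps):
    """Vote weight of a weak learner with weighted error eps (assumed standard form)."""
    return 0.5 * math.log((1 - eps) / eps)

def weight_factors(eps):
    """Unnormalized multipliers applied to D_t(i) for correct / wrong examples."""
    a = alpha(eps)
    return {"correct": math.exp(-a), "wrong": math.exp(a)}

for eps in (0.1, 0.4, 0.5, 0.6, 0.9):
    f = weight_factors(eps)
    print(f"eps={eps:.1f}  alpha={alpha(eps):+.3f}  "
          f"correct x{f['correct']:.3f}  wrong x{f['wrong']:.3f}")
```

Below 0.5 the factor for correct examples is less than 1 and the factor for wrong examples is greater than 1; above 0.5 the roles flip, and at exactly 0.5 nothing changes.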

Finding the weak hypothesis with the least weighted error

Easy for weak learners, e.g.:

The efficiency of finding a weak classifier is an attractive quality of boosting.

The Final Hypothesis

//the garbled char is alpha_t

Recall that each weak classifier takes an example x and produces a -1 or a +1:

The Full Alg.
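A compact sketch of the whole loop, assuming 1-D inputs, threshold stumps as weak learners, and the standard AdaBoost formulas for alpha_t and D_{t+1} (the figure with the lecture's pseudocode is not available here). `best_stump`, `adaboost` and `predict` are hypothetical names:

```python
import math

def best_stump(xs, ys, D):
    """Search 1-D threshold stumps for the least weighted error under D."""
    pts = sorted(set(xs))
    # Candidate thresholds: one below all points, then midpoints between neighbours.
    thresholds = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    best = None
    for thr in thresholds:
        for sign in (+1, -1):
            err = sum(d for x, y, d in zip(xs, ys, D)
                      if (sign if x > thr else -sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(xs, ys, T):
    """Return a list of (alpha_t, threshold, sign) after T boosting rounds."""
    m = len(xs)
    D = [1.0 / m] * m                      # D_1 is uniform
    ensemble = []
    for _ in range(T):
        eps, thr, sign = best_stump(xs, ys, D)
        eps = min(max(eps, 1e-12), 1 - 1e-12)   # guard against log(0)
        a = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((a, thr, sign))
        # Re-weight: shrink correct examples, grow misclassified ones, then normalize.
        D = [d * math.exp(-a * y * (sign if x > thr else -sign))
             for x, y, d in zip(xs, ys, D)]
        Z = sum(D)
        D = [d / Z for d in D]
    return ensemble

def predict(ensemble, x):
    """H_final(x) = sign of the alpha-weighted vote of the weak classifiers."""
    score = sum(a * (sign if x > thr else -sign) for a, thr, sign in ensemble)
    return 1 if score >= 0 else -1

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [+1, +1, -1, -1, +1, +1]             # not separable by any single stump
ensemble = adaboost(xs, ys, T=3)
print([predict(ensemble, x) for x in xs])
```

On this toy set no single stump gets every point right, but three weighted stumps together do, which is the non-linearity that boosting adds.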

Analysis of the Training Error

https://svivek.com/teaching/machine-learning/fall2017/readings/boosting.pdf

The training error of the combined classifier decreases exponentially fast in T (the number of weak learners) if the weighted errors of the weak classifiers (the 𝜖_t) are better than chance by some margin (𝜖_t ≤ 0.5 − γ for some γ > 0).

Test error can increase once H_final becomes too “complex” ==> overfitting.
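For reference, the standard form of this bound (as found in the usual AdaBoost analysis; the symbols m, Z_t and γ_t are introduced here and may differ from the lecture's notation) reads, with γ_t = 1/2 − 𝜖_t:

```latex
\frac{1}{m}\sum_{i=1}^{m} \mathbb{1}\left[H_{\text{final}}(x_i) \neq y_i\right]
\;\le\; \prod_{t=1}^{T} Z_t
\;=\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1-\epsilon_t)}
\;=\; \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2}
\;\le\; \exp\!\Big(-2\sum_{t=1}^{T}\gamma_t^2\Big)
```

If every weak classifier has edge γ_t ≥ γ > 0, the right-hand side is at most exp(−2γ²T), which is the exponential decrease in T.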

Summary: Generalized (Ada)boost Alg.

SL: Multi-Class Classification (13)

Overview:

Strategies

comparison

Reduction Strategy: One Against All

Idea:

Algorithm

Inference/Prediction

based on the Ideal case: only the correct label will have a positive score
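A small sketch of one-against-all, assuming a plain binary perceptron as the underlying learner (the notes do not fix one) and winner-takes-all inference by the highest score. All names and the toy data are hypothetical:

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_binary_perceptron(X, y, epochs=20):
    """Plain perceptron for labels in {-1, +1}."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, label in zip(X, y):
            if label * dot(w, x) <= 0:          # mistake (or zero score): update
                w = [wi + label * xi for wi, xi in zip(w, x)]
    return w

def train_one_vs_all(X, labels, classes):
    """One binary problem per class k: class k is +1, every other class is -1."""
    return {k: train_binary_perceptron(X, [1 if l == k else -1 for l in labels])
            for k in classes}

def predict_one_vs_all(W, x):
    """Winner takes all: the class whose classifier gives the highest score."""
    return max(W, key=lambda k: dot(W[k], x))

X = [(1, 0, 0), (3, 0, 0), (0, 1, 0), (0, 3, 0), (0, 0, 1), (0, 0, 3)]
labels = ["a", "a", "b", "b", "c", "c"]
W = train_one_vs_all(X, labels, ["a", "b", "c"])
print(predict_one_vs_all(W, (2, 0.5, 0.1)), predict_one_vs_all(W, (0.2, 0.1, 1.5)))
```

Taking the argmax rather than checking the sign also gives an answer in the non-ideal case where several (or none) of the K classifiers return a positive score.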

Analysis

Reduction Strategy: One against One/ All against All

Idea:

Algorithm

//C(K,2) ===> choose 2 from K = K! / (2! (K-2)!) = K(K-1)/2

Inference ==> all of the following are applied after learning (at prediction time)
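A sketch of one-against-one with majority-vote inference. The pairwise learner here is a nearest-centroid rule, chosen only to keep the example short; it is an assumption, not the learner used in the lecture, and all names are hypothetical:

```python
from itertools import combinations
from collections import Counter

def train_pairwise(X, labels, a, b):
    """Hypothetical binary learner for classes a vs b: nearest class centroid."""
    def centroid(k):
        pts = [x for x, l in zip(X, labels) if l == k]
        return [sum(c) / len(pts) for c in zip(*pts)]
    ca, cb = centroid(a), centroid(b)
    def clf(x):
        da = sum((xi - ci) ** 2 for xi, ci in zip(x, ca))
        db = sum((xi - ci) ** 2 for xi, ci in zip(x, cb))
        return a if da <= db else b
    return clf

def train_one_vs_one(X, labels):
    """One classifier per unordered pair of classes: K(K-1)/2 in total."""
    classes = sorted(set(labels))
    return [train_pairwise(X, labels, a, b) for a, b in combinations(classes, 2)]

def predict_majority(classifiers, x):
    """Each pairwise classifier votes for one of its two classes; most votes wins."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

X = [(0, 0), (0, 1), (5, 0), (5, 1), (0, 5), (1, 5)]
labels = ["a", "a", "b", "b", "c", "c"]
classifiers = train_one_vs_one(X, labels)
print(len(classifiers), predict_majority(classifiers, (0.2, 0.3)))
```

Each pairwise problem only uses the examples of its two classes, so the individual problems are smaller than in one-against-all, at the cost of O(K^2) classifiers.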

Analysis

Assumption:

It is possible to separate all k classes with the O(k^2) classifiers

Comparison:

Problem with Reduction Strategies

Singular Strategy (Multi-class Perceptron)

recall that for the one-against-all strategy we infer by (winner takes all):

using a Perceptron style algorithm:
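A minimal sketch of a multi-class perceptron, assuming the common update rule (on a mistake, add x to the weight vector of the true class and subtract it from the predicted class). The figure with the lecture's exact algorithm is missing, so this may differ in details; names and toy data are hypothetical:

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_multiclass_perceptron(X, labels, classes, epochs=10):
    """Learn one weight vector per class jointly, instead of K separate problems."""
    W = {k: [0.0] * len(X[0]) for k in classes}
    for _ in range(epochs):
        for x, y in zip(X, labels):
            y_hat = max(classes, key=lambda k: dot(W[k], x))
            if y_hat != y:
                # Promote the true class, demote the wrongly predicted class.
                W[y] = [w + xi for w, xi in zip(W[y], x)]
                W[y_hat] = [w - xi for w, xi in zip(W[y_hat], x)]
    return W

def predict(W, x):
    return max(W, key=lambda k: dot(W[k], x))

# Last feature is a constant 1 acting as the bias term.
X = [(2, 0, 1), (-1, 2, 1), (-1, -2, 1)]
labels = ["a", "b", "c"]
W = train_multiclass_perceptron(X, labels, ["a", "b", "c"])
print([predict(W, x) for x in X])
```

Because all weight vectors are updated against each other, the scores are directly comparable at inference time, which is not guaranteed when K classifiers are trained independently.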

Interlude: Learning as Loss Minimization (14)

Idea

Learning = minimize empirical loss on the training set ==> but we have to control overfitting ==> We need something that biases the learner towards simpler hypotheses

Loss Function

Loss functions should penalize mistakes
We are minimizing average loss over the training data

Some Loss Functions in Learning Models

The Ideal Loss Function for SL: 0-1 Loss

Ideally, we would like to minimize the 0-1 loss, but we can't for computational reasons (NP-hard).

To avoid the computational problem, we can minimize an upper bound on it, or approximate it by a surrogate loss from the:

LF Zoo

Remember that we DO NOT care about the magnitude when making predictions, but we DO care about magnitude when making corrections/updates
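A few common surrogate losses written as functions of the margin z = y * (w . x), to make the upper-bound relationship concrete. Which losses appeared in the original figure is not recoverable, so this selection (hinge, base-2 logistic, exponential) is an assumption:

```python
import math

def zero_one(z):
    """0-1 loss: counts a mistake whenever the margin is not positive."""
    return 1.0 if z <= 0 else 0.0

def hinge(z):
    """Hinge loss (SVM)."""
    return max(0.0, 1.0 - z)

def logistic(z):
    """Logistic loss in base 2, so that it equals 1 at z = 0."""
    return math.log2(1.0 + math.exp(-z))

def exponential(z):
    """Exponential loss (the loss AdaBoost minimizes)."""
    return math.exp(-z)

for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"z={z:+.1f}  0-1={zero_one(z):.0f}  hinge={hinge(z):.3f}  "
          f"logistic={logistic(z):.3f}  exp={exponential(z):.3f}")
```

The prediction only uses the sign of z, while these surrogates keep decreasing (or stay positive) as |z| changes, which is where the magnitude matters for updates.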

Choice of R(w)/R(h)

Neural networks use other strategies such as dropout

UL: Clustering

Basic Idea:

Goal

Given a collection of data points, the goal is to find structure in the data: organize that data into sensible groups

Clustering as an Optimization Problem

Given a set of points and a pairwise distance, devise an algorithm 𝑓 that splits the data so that it optimizes some conditions.

===> we assume pairwise distances are given:

K-means Clustering (14)

Idea:

==> we have to measure "closeness" ==> by the overall distortion of the group's representation, i.e. how far the points lie from the representative (center) of their group

Objective

K-means algorithm a.k.a. Lloyd’s algorithm
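A minimal sketch of Lloyd's algorithm, assuming squared Euclidean distance, random initialization from the data points, and a fixed iteration cap; these choices and the function names are assumptions for illustration:

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    """Alternate assignment and mean-update steps until the centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)        # initialize with k distinct data points
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[j]
                       for j, cl in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    # Distortion: sum of squared distances from each point to its nearest center.
    distortion = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, distortion

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, distortion = kmeans(points, k=2)
print(sorted(centers), distortion)
```

Running `kmeans` for a range of K values and recording `distortion` gives the kind of curve used when choosing K by the elbow.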

e.g.

Remarks

More important properties:

Application: Vector Quantization

Choosing K

==> the vertical axis describes the overall distortion of the clusters

==> pick the "elbow"(yellow dot)

==> we need high enough K to include all important clusters of the data (low distortion)

==> we need low enough K to avoid overfitting (avoid chasing marginal gains in distortion that lead to overfitting).