Open Notes: Machine Learning Fundamentals (5: (Ada)Boosting, Multi-class, Loss, K-Means | K-Medoids)

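This article introduces the basic idea of boosting in machine learning, which combines multiple weak learners into a strong learner, and discusses bagging, random forests, and AdaBoost. It also covers multi-class classification strategies and the clustering algorithms K-means and K-medoids.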

by Max Z. C. Li (843995168@qq.com)

based on lecture notes of Prof. Kai-Wei Chang, UCLA 2018 winter CM 146 Intro. to M.L., with marks and comments (//,==>, words, etc.)
all graphs/pictures are from the lecture notes; I disavow the background ownership watermarks auto-added by csdn.

original acknowledgment: "The instructor gratefully acknowledges Eric Eaton (UPenn), who assembled the original slides, Jessica Wu (Harvey Mudd), David Kauchak (Pomona), Dan Roth (Upenn), Sriram Sankararaman (UCLA), whose slides are also heavily used, and the many others who made their course materials freely available online."
 

SL, LC: Boosting (12)

Basic Idea:

yet another method to introduce non-linearity into linear models;

==> combine (potentially infinitely) many weak learners to yield a strong one;

 

Theoretical Motivation

why it may work: 

The predictors make different types of mistakes;

Combining them may yield a stronger predictor.

 

Outline

 

Example

 

Bagging (Bootstrap Aggregating, Breiman 1994)

Idea

generating multiple versions of a predictor and using these to get an aggregated predictor.

the idea of "bootstrapping" essentially means generating multiple sub-datasets by resampling (with replacement) from the single original dataset.

e.g

Advantage

If perturbing the training set can cause significant changes in the learned classifier (i.e., the learner is unstable), then bagging can improve accuracy (lower variance)
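A minimal sketch of bagging, assuming 1-D inputs, labels in {-1, +1}, and a threshold stump as the unstable base learner. The helper names (`bootstrap_sample`, `train_stump`, `bagging`) and the toy data are made up for illustration and are not from the lecture notes.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) points uniformly with replacement from data."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Hypothetical weak learner on 1-D points (x, y), y in {-1, +1}:
    pick the threshold and sign with the fewest mistakes on the sample."""
    best = None
    for thr in sorted({x for x, _ in sample}):
        for sign in (+1, -1):
            errors = sum(1 for x, y in sample
                         if (sign if x >= thr else -sign) != y)
            if best is None or errors < best[0]:
                best = (errors, thr, sign)
    _, thr, sign = best
    return lambda x: sign if x >= thr else -sign

def bagging(data, n_models=25, seed=0):
    """Train one stump per bootstrap replicate; aggregate by majority vote."""
    rng = random.Random(seed)
    models = [train_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]
    def predict(x):
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]
    return predict

data = [(0.0, -1), (1.0, -1), (2.0, -1), (3.0, +1), (4.0, +1), (5.0, +1)]
predict = bagging(data)
print(predict(0.5), predict(4.5))
```

Individual stumps can land on different thresholds depending on which points their replicate happens to contain; the vote averages that instability out.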

 

Application

Bagged Decision Trees

Random Forests (Bagged Trees++)

===> Random Forests generate weak learners by sampling both the data and the feature set;

===> the other prominent family is boosting (e.g., gradient boosting machines and AdaBoost, covered below); these methods build weak learners sequentially, and AdaBoost in particular assigns and adjusts the weights of the sample points according to the mistakes made by the previous weak learners.

 

AdaBoost

Construct D_t

let

how:

why:

how to choose alpha_t:

===> so that the more accurate the current weak hypothesis h_t is on D_t (epsilon_t well below 0.5), the larger alpha_t becomes, and the more strongly the correctly classified examples are down-weighted and the misclassified ones up-weighted in D_{t+1}.

===> conversely, if h_t is worse than chance on D_t (epsilon_t > 0.5), alpha_t turns negative ===> the correctly classified examples are up-weighted and the misclassified ones down-weighted for the next weak learner.

==> check the math on both sides of epsilon_t = 0.5 to see this for yourself.
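To check both sides of epsilon = 0.5 numerically, here is a small sketch. It assumes the standard AdaBoost choice alpha_t = 0.5 * ln((1 - epsilon_t) / epsilon_t) and the usual multiplicative update with normalizer Z_t; the function names are illustrative.

```python
import math

def alpha(eps):
    """AdaBoost step size for weighted error eps in (0, 1) (standard form, assumed)."""
    return 0.5 * math.log((1.0 - eps) / eps)

def reweight(weights, correct, eps):
    """One D_t -> D_{t+1} update: multiply by exp(-alpha) if the example was
    classified correctly, by exp(+alpha) otherwise, then renormalize."""
    a = alpha(eps)
    raw = [w * math.exp(-a if ok else a) for w, ok in zip(weights, correct)]
    z = sum(raw)  # normalizer Z_t
    return [r / z for r in raw]

# alpha is positive below 0.5, zero at 0.5, and negative above 0.5.
for eps in (0.1, 0.4, 0.5, 0.6):
    print(eps, round(alpha(eps), 3))

# Four equally weighted examples, one mistake (eps = 0.25).
print(reweight([0.25] * 4, [True, True, True, False], eps=0.25))
```

With this choice of alpha_t, the misclassified examples end up holding half of the total weight after the update, so the previous weak learner looks no better than chance on the new distribution.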

 

Finding the weak hypothesis with the least weighted error

Easy for weak learners, e.g.:

The efficiency of finding a weak classifier is an attractive quality of boosting.
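As an illustration of why this search is cheap, the sketch below assumes the weak learners are axis-aligned decision stumps and simply enumerates every (feature, threshold, sign) triple. `best_stump` and the toy data are hypothetical.

```python
def best_stump(xs, ys, weights):
    """Return (weighted_error, feature, threshold, sign) of the stump
    h(x) = sign if x[feature] >= threshold else -sign with least weighted error."""
    best = None
    n_features = len(xs[0])
    for j in range(n_features):
        for thr in sorted({x[j] for x in xs}):
            for sign in (+1, -1):
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if (sign if x[j] >= thr else -sign) != y)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

xs = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (4.0, 2.0)]
ys = [-1, -1, +1, +1]
weights = [0.25, 0.25, 0.25, 0.25]
print(best_stump(xs, ys, weights))
```

The loop evaluates at most 2 * (number of features) * (number of examples) candidates, each in one pass over the data, so the cost is polynomial even in this naive form.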

 

The Final Hypothesis

//the garbled char is alpha_t

Recall that each weak classifier takes an example x and produces a -1 or a +1:
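Assuming the standard AdaBoost form (with T the number of rounds and h_t the weak classifier from round t), the final hypothesis is the sign of the alpha_t-weighted vote:

```latex
H_{\text{final}}(x) = \operatorname{sign}\!\left( \sum_{t=1}^{T} \alpha_t \, h_t(x) \right),
\qquad h_t(x) \in \{-1, +1\}
```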

 

The Full Alg.
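The pieces above can be put together in a short sketch. It assumes 1-D inputs, threshold stumps as weak learners, and the standard alpha_t and reweighting rules; all function names are illustrative, and this is not the exact pseudocode from the slides.

```python
import math

def stump_predict(stump, x):
    thr, sign = stump
    return sign if x >= thr else -sign

def fit_stump(xs, ys, d):
    """Weak learner: 1-D threshold stump with the least error under weights d."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (+1, -1):
            err = sum(w for x, y, w in zip(xs, ys, d)
                      if stump_predict((thr, sign), x) != y)
            if best is None or err < best[0]:
                best = (err, (thr, sign))
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    d = [1.0 / n] * n                          # D_1 is uniform
    ensemble = []                              # list of (alpha_t, h_t)
    for _ in range(rounds):
        eps, stump = fit_stump(xs, ys, d)
        eps = min(max(eps, 1e-12), 1 - 1e-12)  # guard against log(0)
        a = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((a, stump))
        # D_{t+1}(i) is proportional to D_t(i) * exp(-alpha_t * y_i * h_t(x_i))
        d = [w * math.exp(-a * y * stump_predict(stump, x))
             for x, y, w in zip(xs, ys, d)]
        z = sum(d)
        d = [w / z for w in d]
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(h, x) for a, h in ensemble)
    return 1 if score >= 0 else -1

# Labels + + - - + + cannot be fit by any single stump.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [+1, +1, -1, -1, +1, +1]
model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

On this toy data three rounds are enough for the weighted vote to fit all six labels, even though every individual stump misclassifies at least two points.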

 

Analysis of the Training Error

https://svivek.com/teaching/machine-learning/fall2017/readings/boosting.pdf

The training error of the combined classifier decreases exponentially fast in T (the number of weak learners) if the errors of the weak classifiers (the 𝜖_t) are better than chance by at least a fixed margin (each 𝜖_t ≤ 0.5 - γ for some γ > 0).

Test error may increase once H_final becomes too “complex” ===> overfitting
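For reference, the standard form of this bound (as in the Freund and Schapire analysis; m is the number of training examples and gamma_t is the edge of round t over chance, both notation assumed here) reads:

```latex
\frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\!\left[ H_{\text{final}}(x_i) \neq y_i \right]
\;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)}
\;=\; \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2}
\;\le\; \exp\!\left( -2 \sum_{t=1}^{T} \gamma_t^2 \right),
\qquad \gamma_t = \tfrac{1}{2} - \epsilon_t
```

If every gamma_t is at least some gamma > 0, the right-hand side is at most exp(-2 gamma^2 T), which is the exponential decrease in T.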

 

Summary: Generalized (Ada)boost Alg.

 

SL: Multi-Class Classification (13)

Overview:

 

Strategies

comparison

 

Reduction Strategy: One Against All

Idea:

Algorithm

Inference/Prediction

based on the Ideal case: only the correct label will have a positive score
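A minimal sketch of one-against-all, assuming a plain perceptron as the binary learner and winner-takes-all inference (pick the class with the highest score, which also covers the non-ideal case where several or no scores are positive). Function names and the one-hot toy data are illustrative.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_binary_perceptron(xs, ys, epochs=20):
    """Plain perceptron on labels in {-1, +1}; a constant 1 is appended as bias."""
    w = [0.0] * (len(xs[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            xb = list(x) + [1.0]
            if y * dot(w, xb) <= 0:          # mistake (or on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, xb)]
    return w

def train_one_vs_all(xs, labels, classes):
    """One binary problem per class k: class k is +1, everything else is -1."""
    return {k: train_binary_perceptron(xs, [1 if l == k else -1 for l in labels])
            for k in classes}

def predict_one_vs_all(models, x):
    """Winner takes all: return the class whose classifier scores highest."""
    xb = list(x) + [1.0]
    return max(models, key=lambda k: dot(models[k], xb))

xs = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
labels = ["A", "B", "C"]
models = train_one_vs_all(xs, labels, classes=["A", "B", "C"])
print(predict_one_vs_all(models, (0.9, 0.1, 0.0)))
```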

 

Analysis

 

Reduction Strategy: One against One/ All against All

Idea:

 

Algorithm

//C(K,2) ===> choose 2 from K = K! / (2! (K-2)!) = K(K-1)/2

Inference ==> all are post learning
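A sketch of the one-against-one scheme, assuming a perceptron as the pairwise learner and majority voting at inference time (other post-learning schemes, such as a tournament, are possible). Names and toy data are illustrative.

```python
from collections import Counter
from itertools import combinations

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_binary_perceptron(xs, ys, epochs=20):
    """Plain perceptron on labels in {-1, +1}; a constant 1 is appended as bias."""
    w = [0.0] * (len(xs[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            xb = list(x) + [1.0]
            if y * dot(w, xb) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, xb)]
    return w

def train_one_vs_one(xs, labels, classes):
    """One classifier per unordered pair (a, b): a is +1, b is -1,
    trained only on the examples of those two classes."""
    models = {}
    for a, b in combinations(classes, 2):
        pairs = [(x, 1 if l == a else -1)
                 for x, l in zip(xs, labels) if l in (a, b)]
        models[(a, b)] = train_binary_perceptron([x for x, _ in pairs],
                                                 [y for _, y in pairs])
    return models

def predict_one_vs_one(models, x):
    """Majority vote over all K(K-1)/2 pairwise classifiers
    (ties go to the class that received a vote first)."""
    xb = list(x) + [1.0]
    votes = Counter(a if dot(w, xb) > 0 else b for (a, b), w in models.items())
    return votes.most_common(1)[0][0]

xs = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
labels = ["A", "B", "C"]
models = train_one_vs_one(xs, labels, classes=["A", "B", "C"])
print(len(models), predict_one_vs_one(models, (0.9, 0.1, 0.0)))
```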

 

Analysis

Assumption:

It is possible to separate all k classes with the O(k^2) classifiers

Comparison:

 

Problem with Reduction Strategies

 

Singular Strategy (Multi-class Perceptron)

recall that for the one-against-all strategy we infer by picking the class whose weight vector gives the highest score:

using a Perceptron style algorithm:
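A sketch of such a perceptron-style algorithm, assuming the common update rule: keep one weight vector per class, predict with the arg max of the scores, and on a mistake promote the true class and demote the predicted one. Names and toy data are illustrative.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_multiclass_perceptron(xs, labels, classes, epochs=20):
    """All weight vectors are learned jointly instead of as separate binary problems."""
    w = {k: [0.0] * len(xs[0]) for k in classes}
    for _ in range(epochs):
        for x, y in zip(xs, labels):
            y_hat = max(classes, key=lambda k: dot(w[k], x))
            if y_hat != y:
                w[y] = [wi + xi for wi, xi in zip(w[y], x)]          # promote
                w[y_hat] = [wi - xi for wi, xi in zip(w[y_hat], x)]  # demote
    return w

def predict(w, x):
    return max(w, key=lambda k: dot(w[k], x))

xs = [(2.0, 0.0), (0.0, 2.0), (-2.0, -2.0)]
labels = ["A", "B", "C"]
w = train_multiclass_perceptron(xs, labels, classes=["A", "B", "C"])
print([predict(w, x) for x in [(3.0, 1.0), (1.0, 3.0), (-1.0, -3.0)]])
```

Because the update only fires when the arg max is wrong, the classes are compared against each other during training, which is the point of the singular strategy.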

 

Interlude: Learning as Loss Minimization (14)

Idea

Learning = minimize empirical loss on the training set  ==> but we have to control overfitting ==> We need something that biases the learner towards simpler hypotheses
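One common way to write this down is regularized empirical risk minimization. The notation below is assumed for illustration: m training examples, loss L, regularizer R, and a trade-off weight lambda.

```latex
\min_{h \in \mathcal{H}} \;
\frac{1}{m} \sum_{i=1}^{m} L\big(h(x_i),\, y_i\big) \;+\; \lambda \, R(h)
```

The first term is the average loss on the training set; the second is the bias towards simpler hypotheses, and lambda controls how strongly it is enforced.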

 

Loss Function

Loss functions should penalize mistakes
We are minimizing average loss over the training data

Some Loss Functions in Learning Models

 

The Ideal Loss Function for SL: 0-1 Loss

Ideally, we would like to minimize the 0-1 loss, but we cannot for computational reasons (the problem is NP-hard).

To avoid the computational problem we can minimize an upper bound on it, or approximate it with one of the surrogate losses from the:

LF Zoo

Remember that we DO NOT care about the magnitude when making predictions, but we DO care about magnitude when making corrections/updates
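A small sketch of a few common surrogate losses, written as functions of the margin y * score. Which losses appeared in the original figure is an assumption; hinge, logistic (base 2, so that it upper-bounds the 0-1 loss), and exponential are used here as typical examples.

```python
import math

def zero_one(margin):
    return 1.0 if margin <= 0 else 0.0

def hinge(margin):          # SVM
    return max(0.0, 1.0 - margin)

def logistic(margin):       # logistic regression, in base 2
    return math.log(1.0 + math.exp(-margin)) / math.log(2.0)

def exponential(margin):    # AdaBoost
    return math.exp(-margin)

def predict(score):
    """Prediction uses only the sign of the score, not its magnitude."""
    return 1 if score >= 0 else -1

for margin in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(margin, zero_one(margin), hinge(margin),
          round(logistic(margin), 3), round(exponential(margin), 3))
```

Scores of 0.1 and 10 give the same prediction, but the surrogate losses treat them very differently, which is what drives the size of the corrections during training.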

 

Choice of R(w)/R(h)

Neural networks use other strategies such as dropout

 

UL: Clustering

Basic Idea:

Goal

Given a collection of data points, the goal is to find structure in the data: organize that data into sensible groups

 

Clustering as an Optimization Problem

Given a set of points and a pairwise distance, devise an algorithm 𝑓 that splits the data so that it optimizes some conditions.

===> we assume pairwise distances are given:

 

K-means Clustering (14)

Idea:

==> we have to measure "closeness" ==> by the overall distortion incurred when each group is represented by a single representative point (its center)

Objective
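In the usual notation (assumed here: N points x_n, K centers mu_k, and indicators r_nk that are 1 when x_n is assigned to cluster k), the K-means distortion objective is:

```latex
J(\mu_1, \dots, \mu_K, r) = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert_2^2,
\qquad r_{nk} \in \{0, 1\}, \quad \sum_{k=1}^{K} r_{nk} = 1
```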

 

K-means algorithm, a.k.a. Lloyd’s algorithm
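A compact sketch of the alternating procedure, assuming squared Euclidean distance and caller-supplied initial centers (random initialization from the data points is common, but it can end in a poor local optimum, so the demo fixes the starting centers). Names and data are illustrative.

```python
def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, centers, iters=100):
    centers = list(centers)
    k = len(centers)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [mean(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:      # nothing moved, so we have converged
            break
        centers = new_centers
    distortion = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, distortion

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centers, distortion = kmeans(points, centers=[points[0], points[4]])
print(centers, distortion)
```

Neither step can increase the distortion, which is why the loop terminates; it does not guarantee the global optimum.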

e.g.

Remarks

More important properties:

 

Application: Vector Quantization

 

Choosing K

==> the vertical axis describes the overall distortion of the clusters

==> pick the "elbow"(yellow dot)

==> we need a high enough K to include all important clusters of the data (low distortion)

==> we need a low enough K to avoid overfitting (avoid marginal gains that lead to overfitting)
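The elbow is usually picked by eye. As one possible automation (a heuristic swapped in here, not something from the lecture notes), the sketch below picks the K where the drop in distortion slows down the most, i.e. the largest second difference. The curve values are made up purely for illustration.

```python
def elbow(distortions):
    """distortions[i] is the distortion for K = i + 1. Return the K whose
    second difference (drop before minus drop after) is largest."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(distortions) - 1):
        drop_before = distortions[i - 1] - distortions[i]
        drop_after = distortions[i] - distortions[i + 1]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k

# Hypothetical distortion values for K = 1..6 (not measured on any real data).
curve = [100.0, 60.0, 15.0, 12.0, 10.0, 9.0]
print(elbow(curve))
```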

 

K-Medoids

Idea:

==> hence K-Medoids

 

K-Medoids Algorithm
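A sketch of one common alternating variant of K-medoids, assuming caller-supplied initial medoids and an arbitrary pairwise distance function. Unlike K-means, every representative is an actual data point, so only pairwise distances are needed. Names and the 1-D toy data (with an outlier at 60) are illustrative.

```python
def k_medoids(points, medoids, dist, iters=100):
    """Alternate (1) assigning points to the nearest medoid and (2) replacing
    each medoid by the cluster member with the least total distance to the
    rest of its cluster."""
    medoids = list(medoids)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in medoids]
        for p in points:
            j = min(range(len(medoids)), key=lambda c: dist(p, medoids[c]))
            clusters[j].append(p)
        new_medoids = [
            min(c, key=lambda m: sum(dist(m, q) for q in c)) if c else medoids[j]
            for j, c in enumerate(clusters)
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters

points = [1.0, 2.0, 3.0, 20.0, 21.0, 22.0, 60.0]
medoids, clusters = k_medoids(points, medoids=[1.0, 22.0],
                              dist=lambda a, b: abs(a - b))
print(medoids, clusters)
```

In this run the second cluster contains the outlier 60, yet its medoid stays inside the 20 to 22 group, whereas the mean of that cluster would be pulled to 30.75.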

 

Compare to kNN

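kNN is an SL algorithm ==> it has to predict a label for the test data based on the "opinions" of the k nearest labeled data points. K-means and K-medoids, by contrast, are UL methods: they need no labels, and K is the number of clusters rather than the number of neighbors.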
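For contrast, a minimal kNN sketch, assuming squared Euclidean distance and a simple majority vote. The function name and toy data are illustrative.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train is a list of (point, label) pairs; return the majority label
    among the k training points closest to query."""
    nearest = sorted(
        train,
        key=lambda pl: sum((a - b) ** 2 for a, b in zip(pl[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
         ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b")]
print(knn_predict(train, (0.5, 0.5)), knn_predict(train, (5.5, 5.5)))
```

Note that the labels in `train` are required here, while the clustering sketches above never look at a label.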
