by Max Z. C. Li (843995168@qq.com)
based on lecture notes of Prof. Kai-Wei Chang, UCLA 2018 winter CM 146 Intro. to M.L., with marks and comments (//,==>, words, etc.)
all graphs/pictures are from the lecture notes; I disavow the background ownership watermarks auto-added by csdn.
original acknowledgment: "The instructor gratefully acknowledges Eric Eaton (UPenn), who assembled the original slides, Jessica Wu (Harvey Mudd), David Kauchak (Pomona), Dan Roth (Upenn), Sriram Sankararaman (UCLA), whose slides are also heavily used, and the many others who made their course materials freely available online."
SL, LC: Boosting (12)
Basic Idea:
yet another method to introduce non-linearity into linear models;
==> combine (potentially infinitely) many weak learners to yield a strong one;

Theoretical Motivation

why it may work:
The predictors make different types of mistakes;
Combining them may yield a stronger predictor.
Outline

Example

Bagging (Bootstrap Aggregating, Breiman 1994)
Idea
generating multiple versions of a predictor and using these to get an aggregated predictor.
the idea of "Bootstrapping" essentially means generating multiple sub-datasets by resampling (with replacement) from the single original dataset.
e.g


Advantage
If perturbing the training set can cause significant changes in the learned classifier, then bagging can improve accuracy (lower variance)
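A minimal sketch of bagging, assuming labels in {-1, +1}, a single-feature threshold stump as the (hypothetical) base predictor, and a majority vote as the aggregation rule. The function names are illustrative, not from the lecture.

```python
import numpy as np

def bootstrap_indices(n, rng):
    """Draw n indices uniformly WITH replacement from range(n)."""
    return rng.integers(0, n, size=n)

def fit_stump(X, y):
    """Assumed base learner: best single-feature threshold rule on (X, y)."""
    best = (np.inf, 0, 0.0, 1)  # (error, feature, threshold, sign)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = np.mean(pred != y)
                if err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]

def predict_stump(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] <= thr, sign, -sign)

def bagging_fit(X, y, n_models=25, seed=0):
    """Train one stump per bootstrap replicate of the training set."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = bootstrap_indices(len(y), rng)
        models.append(fit_stump(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Aggregate by majority vote (an odd n_models avoids ties)."""
    votes = sum(predict_stump(m, X) for m in models)
    return np.where(votes >= 0, 1, -1)
```

Each replicate sees a slightly different dataset, so the stumps disagree near the decision boundary, and the vote averages those disagreements out.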
Application
Bagged Decision Trees

Random Forests (Bagged Trees++)

===> Random Forests generate weak learners by sampling both the data and the feature sets;
===> the other prominent method is boosting (gradient boosting machines are in the same family), e.g. AdaBoost below, which creates weak learners by assigning and adjusting the weights of sample points according to the mistakes made by the previous weak learners.
AdaBoost
Construct D_t
let

how:

why:

how to choose alpha_t:

===> so that the more accurate the current weak classifier is under D_t (small epsilon, large positive alpha_t), the stronger the suppression of the correctly classified examples and the promotion of the misclassified ones in the next distribution.
===> alternatively, if the current weak classifier is very error-prone under D_t (epsilon > 0.5, negative alpha_t) ===> the correctly classified examples are strengthened and the incorrect ones suppressed for the next weak learner.
==> check the math on both sides of epsilon = 0.5 and see clearly for yourself.
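A small numerical check of the remarks above. The formulas were in figures that did not survive, so this assumes the standard AdaBoost choices: alpha_t = 1/2 ln((1 - epsilon)/epsilon) and D_{t+1}(i) proportional to D_t(i) exp(-alpha_t y_i h_t(x_i)). The toy labels and predictions are made up for illustration.

```python
import math

def alpha(eps):
    """Assumed standard AdaBoost step size for weighted error eps."""
    return 0.5 * math.log((1.0 - eps) / eps)

def reweight(D, y, h, a):
    """Multiplicative update followed by normalization (Z is the normalizer)."""
    w = [d * math.exp(-a * yi * hi) for d, yi, hi in zip(D, y, h)]
    Z = sum(w)
    return [wi / Z for wi in w]

D = [0.25, 0.25, 0.25, 0.25]      # uniform starting distribution
y = [1, 1, -1, -1]                # true labels
h = [1, 1, -1, 1]                 # weak classifier output; wrong on the last example
eps = sum(d for d, yi, hi in zip(D, y, h) if yi != hi)
a = alpha(eps)
D_next = reweight(D, y, h, a)

print(eps, a, D_next)
print(alpha(0.5), alpha(0.7))     # zero at chance, negative beyond it
```

With this choice of alpha_t the misclassified examples end up carrying exactly half of the total weight in the next round, which is why the same weak classifier is no better than chance on D_{t+1}.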
Finding the weak hypothesis with the least weighted error
Easy for weak learners, e.g.:

This efficiency in finding a weak classifier is an attractive quality of boosting.
The Final Hypothesis

//the garbled char is alpha_t
Recall that each weak classifier takes an example x and produces a -1 or a +1:

The Full Alg.
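A minimal sketch of the whole loop, assuming decision stumps as weak learners, labels in {-1, +1}, and the standard AdaBoost choices (alpha_t = 1/2 ln((1 - epsilon_t)/epsilon_t), multiplicative re-weighting, sign of the alpha-weighted vote as H_final). Function names are illustrative, not from the lecture.

```python
import numpy as np

def best_stump(X, y, D):
    """Weak learner (assumed): threshold stump with the least weighted error under D."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = D[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    return best, best_err

def stump_predict(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] <= thr, sign, -sign)

def adaboost(X, y, T=10):
    m = len(y)
    D = np.full(m, 1.0 / m)                    # D_1 is uniform
    ensemble = []
    for _ in range(T):
        stump, eps = best_stump(X, y, D)
        eps = np.clip(eps, 1e-12, 1 - 1e-12)   # guard against log(0)
        a = 0.5 * np.log((1 - eps) / eps)      # alpha_t
        pred = stump_predict(stump, X)
        D = D * np.exp(-a * y * pred)          # up-weight mistakes, down-weight hits
        D /= D.sum()                           # normalize (divide by Z_t)
        ensemble.append((a, stump))
    return ensemble

def predict(ensemble, X):
    """H_final(x) = sign(sum_t alpha_t h_t(x))."""
    score = sum(a * stump_predict(s, X) for a, s in ensemble)
    return np.where(score >= 0, 1, -1)
```

For instance, the 1-D labels `[+1, +1, -1, -1, +1, +1]` at x = 0..5 cannot be fit by any single stump, but three rounds of the loop combine three stumps into a classifier that fits all six points.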

Analysis of the Training Error
https://svivek.com/teaching/machine-learning/fall2017/readings/boosting.pdf

The training error of the combined classifier decreases exponentially fast with T (the number of weak learners) if the errors of the weak classifiers (the 𝜖_t) are strictly better than chance.
Test error will increase after H_final becomes too “complex” ====> overfitting
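For reference, the exponential decrease of the training error can be written as the standard bound below. It assumes the usual AdaBoost alpha_t and normalizer Z_t; the edge symbol gamma_t (how much better than chance round t is) and the sample count m are notation introduced here.

```latex
\frac{1}{m}\sum_{i=1}^{m}\mathbb{1}\left[H_{final}(x_i)\neq y_i\right]
\;\le\; \prod_{t=1}^{T} Z_t
\;=\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)}
\;=\; \prod_{t=1}^{T} \sqrt{1-4\gamma_t^{2}}
\;\le\; \exp\Big(-2\sum_{t=1}^{T}\gamma_t^{2}\Big),
\qquad \gamma_t = \tfrac{1}{2}-\epsilon_t
```

If every gamma_t is at least some fixed gamma > 0, the right-hand side is at most exp(-2 T gamma^2), which is the "exponentially fast in T" statement.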

Summary: Generalized (Ada)boost Alg.

SL: Multi-Class Classification (13)
Overview:

Strategies

comparison

Reduction Strategy: One Against All
Idea:

Algorithm

Inference/Prediction
based on the Ideal case: only the correct label will have a positive score
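A minimal sketch of one-against-all, assuming K binary perceptrons (class k vs. the rest) and winner-takes-all inference via argmax of the scores. The perceptron as the binary learner is an assumption; any binary classifier that outputs a score would do.

```python
import numpy as np

def train_binary_perceptron(X, y, epochs=20):
    """Plain perceptron for labels in {-1, +1} (no bias term; add a constant feature if needed)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:      # mistake (or on the boundary): update
                w += yi * xi
    return w

def train_one_vs_all(X, labels, K, epochs=20):
    """One weight vector per class: class k is +1, every other class is -1."""
    return np.stack([
        train_binary_perceptron(X, np.where(labels == k, 1, -1), epochs)
        for k in range(K)
    ])

def predict_one_vs_all(W, X):
    """Winner takes all: pick the class whose classifier gives the highest score."""
    return np.argmax(X @ W.T, axis=1)
```

In the ideal case only one row of `W` gives a positive score, and argmax simply finds it; when several (or none) are positive, argmax is a tie-breaking heuristic over scores that were never calibrated against each other.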

Analysis

Reduction Strategy: One against One/ All against All
Idea:

Algorithm

//C(K,2) ===> choose 2 from K = K! / (2! (K-2)!) = K(K-1)/2
Inference ==> all are post learning
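A minimal sketch of the pairwise bookkeeping, assuming majority voting as the inference rule (a tournament is another common choice). The nearest-center "classifiers" are a made-up stand-in for trained pairwise models.

```python
from collections import Counter
from itertools import combinations

def num_pairwise_classifiers(K):
    """C(K, 2) = K(K-1)/2 classifiers, one per unordered pair of classes."""
    return K * (K - 1) // 2

def predict_one_vs_one(classifiers, x):
    """classifiers maps (i, j), i < j, to a function returning >0 for class i, else class j."""
    votes = Counter()
    for (i, j), clf in classifiers.items():
        votes[i if clf(x) > 0 else j] += 1
    return votes.most_common(1)[0][0]

# Hypothetical stand-in for trained models: class k lives near centers[k] on a line.
centers = [0.0, 10.0, 20.0]
classifiers = {
    (i, j): (lambda x, i=i, j=j: 1 if abs(x - centers[i]) <= abs(x - centers[j]) else -1)
    for i, j in combinations(range(len(centers)), 2)
}

print(num_pairwise_classifiers(len(centers)), predict_one_vs_one(classifiers, 9.0))
```

Each pairwise model is trained only on the examples of its two classes, so the individual problems are small, but the number of models grows quadratically in K.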

Analysis
Assumption:
It is possible to separate all k classes with the O(k^2) classifiers
Comparison:

Problem with Reduction Strategies

Singular Strategy (Multi-class Perceptron)
recall that for the one-against-all strategy we infer by taking the label with the highest score:


using a Perceptron style algorithm:
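A minimal sketch of a perceptron-style multi-class learner, assuming one weight vector per class, argmax inference, and the common update rule "promote the true class, demote the predicted class" on a mistake. The exact update in the lecture figure is not recoverable, so treat this rule as an assumption.

```python
import numpy as np

def train_multiclass_perceptron(X, labels, K, epochs=20):
    """Learn all K weight vectors jointly instead of K independent binary problems."""
    W = np.zeros((K, X.shape[1]))
    for _ in range(epochs):
        for xi, yi in zip(X, labels):
            pred = int(np.argmax(W @ xi))
            if pred != yi:
                W[yi] += xi       # promote the correct label
                W[pred] -= xi     # demote the wrongly predicted label
    return W

def predict_multiclass(W, X):
    return np.argmax(X @ W.T, axis=1)
```

Unlike the reduction strategies, the update only fires when the argmax itself is wrong, so the scores of different classes are trained to be comparable with each other.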


Interlude: Learning as Loss Minimization (14)
Idea

Learning = minimize empirical loss on the training set ==> but we have to control overfitting ==> We need something that biases the learner towards simpler hypotheses
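Written out, this idea is usually stated as regularized empirical risk minimization. The symbols m (number of training examples), L (loss function) and lambda (trade-off weight) are notation assumed here; R(h) is the regularizer that biases the learner towards simpler hypotheses.

```latex
\min_{h}\;\; \frac{1}{m}\sum_{i=1}^{m} L\big(h(x_i),\, y_i\big) \;+\; \lambda\, R(h)
```

A larger lambda pushes harder towards simple hypotheses, while lambda = 0 recovers plain empirical loss minimization.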


Loss Function
Loss functions should penalize mistakes
We are minimizing average loss over the training data
Some Loss Functions in Learning Models

The Ideal Loss Function for SL: 0-1 Loss

Ideally, we would like to minimize the 0-1 loss, but we can’t for computational reasons (NP-hard).
To avoid the computational problem, we can minimize its upper bound or approximate the loss by one from the:
LF Zoo


Remember that we DO NOT care about the magnitude when making predictions, but we DO care about magnitude when making corrections/updates
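A minimal sketch of a few common surrogate losses as functions of the margin z = y * f(x). Which losses the original figure showed is not recoverable, so hinge, logistic and exponential are picked here as typical examples; the logistic loss uses log base 2 so that it upper-bounds the 0-1 loss.

```python
import numpy as np

def zero_one(z):
    """1 for a mistake (margin <= 0), 0 otherwise: ignores magnitude entirely."""
    return (z <= 0).astype(float)

def hinge(z):
    """Used by SVMs: linear penalty until the margin reaches 1."""
    return np.maximum(0.0, 1.0 - z)

def logistic(z):
    """Used by logistic regression (base 2 so that logistic(0) = 1)."""
    return np.log2(1.0 + np.exp(-z))

def exponential(z):
    """Used by AdaBoost: penalty grows exponentially with a negative margin."""
    return np.exp(-z)

z = np.linspace(-3.0, 3.0, 61)
for name, loss in [("hinge", hinge), ("logistic", logistic), ("exponential", exponential)]:
    print(name, bool(np.all(loss(z) >= zero_one(z))))
```

The surrogates depend on how large the margin is, not just on its sign, which is exactly the magnitude information that drives the updates.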
Choice of R(w)/R(h)

Neural networks use other strategies such as dropout
UL: Clustering
Basic Idea:
Goal
Given a collection of data points, the goal is to find structure in the data: organize that data into sensible groups
Clustering as an Optimization Problem

Given a set of points and a pairwise distance, devise an algorithm 𝑓 that splits the data so that it optimizes some conditions.
===> we assume pairwise distances are given:

K-means Clustering (14)
Idea:

==> we have to measure "closeness" ==> by the overall distortion incurred when each group is represented by a single point (its center)

Objective

K-means algorithm a.k.a. Lloyd’s algorithm
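A minimal sketch of Lloyd's algorithm, assuming squared Euclidean distance, random initialization from the data points, and distortion defined as the sum of squared distances to the assigned centers. Function and parameter names are illustrative.

```python
import numpy as np

def assign_to_centers(X, centers):
    """Index of the nearest center for every point (squared Euclidean distance)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        assign = assign_to_centers(X, centers)            # step 1: assign points
        new_centers = np.array([
            X[assign == k].mean(axis=0) if np.any(assign == k) else centers[k]
            for k in range(K)
        ])                                                # step 2: recompute means
        if np.allclose(new_centers, centers):             # no change: converged
            break
        centers = new_centers
    assign = assign_to_centers(X, centers)
    distortion = ((X - centers[assign]) ** 2).sum()
    return centers, assign, distortion
```

Neither step can increase the distortion, so the loop converges, but only to a local optimum that depends on the initialization; running it from several seeds and keeping the lowest distortion is a common remedy.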


e.g.

Remarks

More important properties:

Application: Vector Quantization

Choosing K

==> the vertical axis describes the overall distortion of the clusters
==> pick the "elbow"(yellow dot)
==> we need a high enough K to include all important clusters of the data (low distortion)
==> we need a low enough K to avoid overfitting (avoid marginal gains that lead to overfitting)
K-Medoids
Idea:

==> hence K-Medoids
K-Medoids Algorithm
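A minimal sketch of K-Medoids in its alternating form, assuming only a pairwise distance matrix is given and that each center must be one of the data points. This mirrors the K-means loop with the "mean" step swapped for "pick the member with the smallest total distance to the other members"; the lecture's exact variant may differ.

```python
import numpy as np

def kmedoids(D, K, n_iters=100, seed=0):
    """D: (n, n) symmetric pairwise distance matrix. Returns medoid indices and assignments."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=K, replace=False)
    for _ in range(n_iters):
        assign = D[:, medoids].argmin(axis=1)             # nearest medoid per point
        new_medoids = medoids.copy()
        for k in range(K):
            members = np.flatnonzero(assign == k)
            if len(members) > 0:
                # total distance from each member to all other members of the cluster
                within = D[np.ix_(members, members)].sum(axis=0)
                new_medoids[k] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):          # medoids stable: converged
            break
        medoids = new_medoids
    assign = D[:, medoids].argmin(axis=1)
    return medoids, assign
```

Because it never averages points, this works with any dissimilarity measure, not just Euclidean distance.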

Compare to kNN
kNN is an SL algorithm ==> it has to predict a label for the test data based on the "opinions" of the k nearest data points.
K-means/K-Medoids are UL algorithms ==> they group unlabeled data and predict no labels.
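For contrast, a minimal kNN sketch, assuming Euclidean distance and a plain majority vote among the k nearest labeled training points. Names are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Label of x = most common label among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
```

Here k counts neighbors that vote on a label, whereas the K in K-means counts clusters; the two share a letter but nothing else.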

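This article introduces the basic idea of boosting in machine learning, which forms a strong learner by combining multiple weak learners, and discusses bagging, random forests, and AdaBoost. It also covers multi-class classification strategies and the application of clustering algorithms such as K-means and K-medoids.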