Introduction to Classification Algorithms
Introduction to Classification
Classification is the process of predicting a class or category from observed values or given data points. The categorical output can take forms such as “Black” or “White”, or “spam” or “no spam”.
Mathematically, classification is the task of approximating a mapping function (f) from input variables (X) to output variables (Y). It belongs to supervised machine learning, in which the targets are provided along with the input data set.
Spam detection in email is an example of a classification problem. There are only two output categories, “spam” and “no spam”, so this is a binary classification.
To implement this classification, we first need to train the classifier. For this example, emails labeled “spam” and “no spam” would be used as the training data. Once the classifier has been trained successfully, it can be used to classify an unknown email.
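The train-then-predict workflow described above can be sketched on a toy scale. This is only an illustration: the six emails are hand-written, hypothetical examples, and the choice of CountVectorizer with MultinomialNB is an assumption made for this sketch, not something the walkthrough below uses.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical, hand-written training emails (not real data).
train_emails = [
    "win a free prize now",
    "cheap loans click here",
    "free money claim your prize",
    "meeting agenda for monday",
    "lunch with the project team",
    "please review the attached report",
]
train_labels = ["spam", "spam", "spam", "no spam", "no spam", "no spam"]

# Turn each email into a vector of word counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_emails)

# Train the classifier on the labeled emails.
classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

# Classify an email the model has not seen before.
unknown_email = ["claim your free prize"]
prediction = classifier.predict(vectorizer.transform(unknown_email))
print(prediction[0])
```

A real spam filter would need far more training data and a held-out test set; the point here is only the order of the steps: vectorize, fit, then predict.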
Types of Learners in Classification
There are two types of learners for classification problems:
Lazy Learners
As the name suggests, lazy learners store the training data and wait until testing data appears. Classification is done only once the testing data arrives. They spend less time on training but more time on predicting. Examples of lazy learners are K-nearest neighbors and case-based reasoning.
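To make the "store now, work later" behaviour concrete, here is a minimal pure-Python sketch of a 1-nearest-neighbor classifier. The class name and the toy points are hypothetical and chosen only for illustration; a real project would more likely use a library implementation.

```python
import math


class NearestNeighborClassifier:
    """A 1-nearest-neighbor classifier, a simple lazy learner."""

    def fit(self, points, labels):
        # "Training" only stores the data; no model is built here.
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict(self, queries):
        # All the work happens at prediction time: each query is
        # compared against every stored training point.
        predictions = []
        for query in queries:
            distances = [math.dist(query, point) for point in self.points]
            nearest = distances.index(min(distances))
            predictions.append(self.labels[nearest])
        return predictions


# Hypothetical 2-D toy data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
labels = ["A", "A", "B", "B"]

knn = NearestNeighborClassifier().fit(points, labels)
print(knn.predict([(2.0, 1.0), (8.5, 8.5)]))
```

Note that fit() is essentially free, while predict() costs one distance computation per stored point for every query, which is the trade-off described above.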
Eager Learners
In contrast to lazy learners, eager learners construct a classification model from the training data without waiting for the testing data to appear. They spend more time on training but less time on predicting. Examples of eager learners are Decision Trees, Naïve Bayes and Artificial Neural Networks (ANN).
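As a small illustration of an eager learner, the sketch below uses a nearest-centroid classifier. This is a deliberately simple stand-in, not one of the algorithms listed above: it summarizes the training data into one centroid per class during fit(), so prediction only has to compare against those few centroids. The class name and toy data are hypothetical.

```python
import math


class NearestCentroidClassifier:
    """Builds its model (one centroid per class) at training time."""

    def fit(self, points, labels):
        sums, counts = {}, {}
        for point, label in zip(points, labels):
            total = sums.setdefault(label, [0.0] * len(point))
            for i, value in enumerate(point):
                total[i] += value
            counts[label] = counts.get(label, 0) + 1
        # The training data itself is no longer needed after this step.
        self.centroids = {
            label: [value / counts[label] for value in total]
            for label, total in sums.items()
        }
        return self

    def predict(self, queries):
        # Prediction is cheap: one distance per class, not per sample.
        return [
            min(self.centroids,
                key=lambda label: math.dist(query, self.centroids[label]))
            for query in queries
        ]


# Hypothetical 2-D toy data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
labels = ["A", "A", "B", "B"]

model = NearestCentroidClassifier().fit(points, labels)
print(model.centroids)
print(model.predict([(2.0, 1.0), (8.5, 8.5)]))
```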
Building a Classifier in Python
Scikit-learn, a Python library for machine learning, can be used to build a classifier in Python. The steps are as follows:
Step 1: Importing the necessary Python package
To build a classifier with scikit-learn, we first need to import it. We can do so with the following script:
import sklearn
Step 2: Importing the dataset
After importing the necessary package, we need a dataset to build the classification model. We can load one of the datasets bundled with sklearn or use another one as required. We are going to use sklearn’s Breast Cancer Wisconsin Diagnostic Database, which we can import with the following script:
from sklearn.datasets import load_breast_cancer
The following script will load the dataset:
data = load_breast_cancer()
We also need to organize the data, which can be done with the following script:
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']
The following command will print the names of the labels, which are ‘malignant’ and ‘benign’ for this dataset:
print(label_names)
The output of the above command is the names of the labels:
['malignant' 'benign']
These labels are mapped to the binary values 0 and 1. Malignant tumors are represented by 0 and benign tumors by 1.
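To check the label mapping and see how many samples fall into each class, a short sketch like the following can help. The variable names match the walkthrough; the use of NumPy's bincount is an addition made here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
label_names = data['target_names']
labels = data['target']

# Count how many samples carry each integer label (0 and 1).
counts = np.bincount(labels)
for value, (name, count) in enumerate(zip(label_names, counts)):
    print(value, name, count)
```

Knowing the class balance up front is useful later, because metrics such as accuracy can look better than they are when one class dominates.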
The feature names and feature values can be inspected with the following commands:
print(feature_names[0])
The output of the above command is the name of the first feature (index 0). Note that the index selects a feature, not a label:
mean radius
Similarly, the name of the second feature (index 1) can be printed as follows:
print(feature_names[1])
The output of the above command is the name of the second feature:
mean texture
We can print the feature values of the first sample in the dataset with the following command:
print(features[0])
This will give the following output:
[
1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
4.601e-01 1.189e-01
]
Similarly, we can print the feature values of the second sample with the following command:
print(features[1])
This will give the following output:
[
2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
2.750e-01 8.902e-02
]
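The raw arrays above are easier to read when each value is paired with its feature name. The sketch below is one way to do that; the dictionary first_sample is a hypothetical helper introduced here, not part of the original steps.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
feature_names = data['feature_names']
features = data['data']

# features has one row per sample and one column per feature.
print(features.shape)

# Pair each feature name with its value for the first sample.
first_sample = dict(zip(feature_names, features[0]))
for name, value in list(first_sample.items())[:3]:
    print(f"{name}: {value}")
```

This makes it clear that features[0] and features[1] are rows (samples), while feature_names[0] and feature_names[1] name the columns.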
Step 3: Organizing data into training and testing sets
Because we need to test our model on unseen data, we will divide our dataset into two parts: a training set and a test set. We can use the train_test_split() function of the sklearn Python package to split the data. The following command will import the function:
from sklearn.model_selection import train_test_split
The next command will split the data into training and testing sets. In this example, we use 40 percent of the data for testing and 60 percent for training:
train, test, train_labels, test_labels = train_test_split(
features,labels,test_size = 0.40, random_state = 42
)
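A quick sanity check on the split can be sketched as follows. The first call repeats the split from the step above; the second call, with stratify=labels, is an optional variation added here (it is not part of the original walkthrough) that keeps the class proportions similar in both sets.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
features, labels = data['data'], data['target']

# Same split as in the walkthrough: 60% training, 40% testing.
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.40, random_state=42
)
print(len(train), len(test))

# Optional variation: stratify so both sets keep the class balance.
s_train, s_test, s_train_labels, s_test_labels = train_test_split(
    features, labels, test_size=0.40, random_state=42, stratify=labels
)
# Fraction of benign samples (label 1) in each test set.
print(test_labels.mean(), s_test_labels.mean())
```

Fixing random_state makes the split reproducible, which matters when comparing results between runs.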
Step 4: Building the model and making predictions
After dividing the data into training and testing sets, we need to build the model. We will use the Naïve Bayes algorithm for this purpose. The following command will import the GaussianNB class:
from sklearn.naive_bayes import GaussianNB
Now, initialize the model as follows:
gnb = GaussianNB()
Next, we can train the model with the following command:
model = gnb.fit(train, train_labels)
To evaluate the model, we need to make predictions on the test set. This can be done with the predict() function as follows:
preds = gnb.predict(test)
print(preds)
This will give the following output:
[
1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1
0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0 1 1
1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0 1 1 0
0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 1 1 1
1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 0 1 1 0
1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 0 1 1 0
1
]
The series of 0s and 1s in the output above are the predicted classes for the test samples: 0 for malignant and 1 for benign tumors.
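The integer predictions can be turned back into readable class names by indexing label_names with them. The sketch below repeats the earlier steps so that it runs on its own; predict_proba() is shown as an extra that the original steps do not use, and it returns the model's estimated probability for each class.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
label_names = data['target_names']
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.40, random_state=42
)

gnb = GaussianNB()
gnb.fit(train, train_labels)
preds = gnb.predict(test)

# Map each predicted integer (0 or 1) to its class name.
predicted_names = label_names[preds]
print(predicted_names[:5])

# Estimated class probabilities, one column per class (0, then 1).
probabilities = gnb.predict_proba(test)
print(probabilities[:5].round(3))
```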
Step 5: Finding accuracy
We can find the accuracy of the model built in the previous step by comparing the two arrays test_labels and preds. We will use the accuracy_score() function to determine the accuracy:
from sklearn.metrics import accuracy_score
print(accuracy_score(test_labels,preds))
0.951754385965
The above output shows that the Naïve Bayes classifier is about 95.18% accurate on the test set.
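What accuracy_score() computes can be reproduced by hand: it is the fraction of test samples whose prediction equals the true label. The following sketch repeats the earlier steps so it is self-contained and compares the manual value with the library value; the variable names manual_accuracy and library_accuracy are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.40, random_state=42
)
preds = GaussianNB().fit(train, train_labels).predict(test)

# Element-wise comparison gives True/False; the mean is the accuracy.
manual_accuracy = (preds == test_labels).mean()
library_accuracy = accuracy_score(test_labels, preds)
print(manual_accuracy, library_accuracy)
```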
Classification Evaluation Metrics
The job is not done once you have finished implementing your machine learning application or model. We still have to find out how effective the model is. There are different evaluation metrics, but we must choose among them carefully, because the choice of metric influences how the performance of a machine learning algorithm is measured and compared.
The following are some of the important classification evaluation metrics, from which you can choose based on your dataset and the kind of problem:
Confusion Matrix
It is the easiest way to measure the performance of a classification problem where the output can be of two or more classes. A confusion matrix is a table with two dimensions, “Actual” and “Predicted”, whose cells count the “True Positives (TP)”, “True Negatives (TN)”, “False Positives (FP)” and “False Negatives (FN)”.
The terms associated with the confusion matrix are explained as follows:
True Positives (TP): the actual class and the predicted class of the data point are both 1.
True Negatives (TN): the actual class and the predicted class of the data point are both 0.
False Positives (FP): the actual class of the data point is 0 but the predicted class is 1.
False Negatives (FN): the actual class of the data point is 1 but the predicted class is 0.
We can compute the confusion matrix with the confusion_matrix() function of sklearn. The following script computes the confusion matrix of the binary classifier built above:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(test_labels, preds))
Output
[
[ 73 7]
[ 4 144]
]
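When reading this matrix, it helps to know scikit-learn's layout: rows are actual classes and columns are predicted classes, both ordered by label value. For labels 0 and 1 with class 1 treated as positive, flattening the matrix therefore yields TN, FP, FN, TP in that order. The sketch below repeats the earlier steps so it runs on its own and unpacks the four counts.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.40, random_state=42
)
preds = GaussianNB().fit(train, train_labels).predict(test)

matrix = confusion_matrix(test_labels, preds)
print(matrix)

# Row = actual class, column = predicted class, so for labels (0, 1):
# [[TN, FP],
#  [FN, TP]]   with class 1 as the positive class.
tn, fp, fn, tp = matrix.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Which class counts as "positive" is a modelling choice; in a medical setting one might prefer to treat malignant (label 0) as positive, in which case the roles of the cells swap.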
Accuracy
Accuracy may be defined as the proportion of the predictions made by our ML model that are correct. We can easily calculate it from the confusion matrix with the following formula:
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$
For the binary classifier built above, TP + TN = 73 + 144 = 217 and TP + FP + FN + TN = 73 + 7 + 4 + 144 = 228.
Hence, Accuracy = 217/228 = 0.951754385965, which is the same value we calculated after creating our binary classifier.
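The arithmetic above can be checked with a few lines of plain Python. The helper accuracy_from_matrix is a hypothetical function written for this sketch; it sums the diagonal (the correct predictions) and divides by the total, so it also works for more than two classes.

```python
def accuracy_from_matrix(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# The confusion matrix printed in the text above.
matrix = [
    [73, 7],
    [4, 144],
]

accuracy = accuracy_from_matrix(matrix)
print(accuracy)
```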
Precision
Precision, a measure also used in document retrieval, may be defined as the proportion of the positive predictions made by our ML model that are actually positive. We can easily calculate it from the confusion matrix with the following formula:
$$\text{Precision} = \frac{TP}{TP + FP}$$
scikit-learn's confusion_matrix() puts actual classes in rows and predicted classes in columns, ordered by label, so with class 1 as the positive class the matrix above reads [[TN, FP], [FN, TP]]. For the binary classifier built above, TP = 144 and TP + FP = 144 + 7 = 151.
Hence, Precision = 144/151 = 0.95364
Recall or Sensitivity
Recall may be defined as the proportion of the actual positives that our ML model correctly identifies. We can easily calculate it from the confusion matrix with the following formula:
$$\text{Recall} = \frac{TP}{TP + FN}$$
Reading the matrix above in scikit-learn's layout [[TN, FP], [FN, TP]] (rows are actual classes, columns are predicted classes, class 1 is positive), TP = 144 and TP + FN = 144 + 4 = 148.
Hence, Recall = 144/148 = 0.97297
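The precision and recall formulas can be illustrated on a small, made-up example. The label lists below are hypothetical toy data (not the breast cancer results), and class 1 is treated as the positive class, as in the definitions of TP, FP and FN given earlier.

```python
# Hypothetical toy labels; 1 is the positive class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 1)
fp = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 1)
fn = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 0)

# Precision: of everything predicted positive, how much was right?
precision = tp / (tp + fp)
# Recall: of everything actually positive, how much was found?
recall = tp / (tp + fn)

print("TP:", tp, "FP:", fp, "FN:", fn)
print("precision:", precision, "recall:", recall)
```

In practice, sklearn.metrics provides precision_score() and recall_score(), which compute the same quantities directly from the true and predicted labels.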
Specificity
Specificity, the counterpart of recall for the negative class, may be defined as the proportion of the actual negatives that our ML model correctly identifies. We can easily calculate it from the confusion matrix with the following formula:
$$\text{Specificity} = \frac{TN}{TN + FP}$$
With the matrix above read as [[TN, FP], [FN, TP]] (scikit-learn's layout with class 1 as the positive class), TN = 73 and TN + FP = 73 + 7 = 80.
Hence, Specificity = 73/80 = 0.9125
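scikit-learn has no dedicated specificity function, but because specificity is simply the recall of the negative class, one possible approach is to call recall_score() with pos_label=0. The sketch below checks this against the formula on hypothetical toy labels (not the breast cancer results).

```python
from sklearn.metrics import recall_score

# Hypothetical toy labels; 1 is the positive class, 0 the negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tn = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 0)
fp = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 1)

# Specificity from the formula TN / (TN + FP).
specificity = tn / (tn + fp)

# The same value via scikit-learn: recall computed for class 0.
specificity_sklearn = recall_score(y_true, y_pred, pos_label=0)

print(specificity, specificity_sklearn)
```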
Various ML Classification Algorithms
The following are some important ML classification algorithms:
Logistic Regression
Support Vector Machine (SVM)
Decision Tree
Naïve Bayes
Random Forest
We will discuss all of these classification algorithms in detail in later chapters.
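As a rough preview, the algorithms listed above can all be tried on the same breast cancer split through scikit-learn's common fit/predict interface. This is a sketch under assumptions made here, not a tuned benchmark: default hyperparameters are used, StandardScaler is added in front of logistic regression and the SVM because those two are sensitive to feature scale, and max_iter and random_state are set only to aid convergence and reproducibility.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.40, random_state=42
)

# One estimator per algorithm named in the list above.
classifiers = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=42),
}

scores = {}
for name, classifier in classifiers.items():
    classifier.fit(train, train_labels)
    scores[name] = accuracy_score(test_labels, classifier.predict(test))
    print(f"{name}: {scores[name]:.3f}")
```

A single train/test split gives only a noisy estimate, so small differences between these scores should not be read as one algorithm being better than another.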
Applications
Some of the most important applications of classification algorithms are as follows:
Speech Recognition
Handwriting Recognition
Biometric Identification
Document Classification
Translated from: https://www.tutorialspoint.com/machine_learning_with_python/classification_introduction.htm