
KNN Multi-class Classification Explained, with Python Code

KNN also works for multi-class classification. Because it operates directly in feature space, it handles on the order of ten classes very well; in my experience it can beat a softmax classifier there. The most intuitive reading of KNN: fix a point, take its \(k\) nearest neighbors, and let the most frequent label among those \(k\) points be \(\hat y\).
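
A minimal sketch of that voting rule in plain NumPy (the names here are illustrative, not from the course):

import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # labels of the k nearest neighbors
    nearest_labels = y_train[np.argsort(dists)[:k]]
    # majority vote: the most frequent label wins
    labels, counts = np.unique(nearest_labels, return_counts=True)
    return labels[np.argmax(counts)]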

Supervised Learning with scikit-learn

This is essentially review for me.

  • 4 hours
  • 17 Videos
  • 54 Exercises

Taught by this guy: Andreas Müller | DataCamp.

Andy is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to machine learning with Python”, describing a practical approach to machine learning with Python and scikit-learn.

Following this pattern, buying his book is the best value: someone who produces both videos and a book usually writes decent slides and papers too, with no weak spots, so watching the videos is efficient and you don't have to look things up all over the place. When I was working through xgboost, I got burned badly by this guy: Sergey Fogelson | DataCamp. One last word: questions are welcome, let's discuss and improve together. Honestly, the gap between the two instructors is huge; the latter course is beyond saving.

Supervised learning | Python

Reinforcement learning has a somewhat Bayesian feel to it.

Software agents interact with an environment

  • Learn how to optimize their behavior
  • Given a system of rewards and punishments
  • Draws inspiration from behavioral psychology

Applications

  • Economics
  • Genetics
  • Game playing

AlphaGo: First computer to defeat the world champion in Go

After just the first video, I already felt he teaches well!

Which of these is a classification problem? | Python

It's worth understanding the data structure used here and the plot produced at the end.

My local install isn't set up yet; I should configure the Anaconda path. Since I don't know how for now, I can experiment on Jupyter first and sort out tedious things like paths later.

Installing scikit-learn — scikit-learn 0.19.1 documentation

from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
iris = datasets.load_iris()
In [7]: type(iris)
Out[7]: sklearn.datasets.base.Bunch

I hadn't seen this data structure before: iris behaves like a dictionary (a Bunch is a dictionary-like container).
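
A quick illustration: a Bunch is a dict subclass that also allows attribute access, so both styles below refer to the same objects.

# dictionary-style and attribute-style access are interchangeable
print(iris['data'].shape)       # same array as iris.data
print(iris.feature_names[:2])   # same as iris['feature_names'][:2]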

In [8]: print(iris.keys())
dict_keys(['data', 'target_names', 'DESCR', 'feature_names', 'target'])
In [9]: type(iris.data), type(iris.target)
Out[9]: (numpy.ndarray, numpy.ndarray)
In [10]: iris.data.shape
Out[10]: (150, 4)

So iris.data is a matrix of shape (150, 4).

In [11]: iris.target_names
Out[11]: array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
X = iris.data
y = iris.target

Both are plain NumPy arrays here; no explicit join is needed for plotting, because they align on an implicit row index.

df = pd.DataFrame(X, columns=iris.feature_names)

With this, the matrix is recast as a pd.DataFrame.

_ = pd.plotting.scatter_matrix(df, c = y, figsize = [8, 8],s=150, marker='D')
plt.show()

marker='D' corresponds to a diamond; this is easy to look up in markers — Matplotlib 2.1.1 documentation.

s is the marker size and c is the color; see plt.scatter?

Marker size is scaled by s and marker color is mapped to c.

Looking at the plot, the last panel of the third row clearly separates the classes very well, which already suggests an idea.

Also, failed deploys are usually a network problem, so using my phone's connection is the most stable option.

Visual EDA | Python

Notice that the earlier plots were all for continuous variables: scatter plots and histograms. For a binary variable, a countplot is the natural choice. palette='RdBu' stands for Red and Blue. plt.figure() opens a new figure, much like plt.clf(); without it the plots would overlap.

import seaborn as sns

plt.figure()
sns.countplot(x='education', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()

The classification challenge | Python

KNN (k-Nearest Neighbors) is a rather different kind of method from KMeans: it looks at the nearest few points and takes a majority vote.

Missing values are not allowed, but with the sklearn toolkit that is not a real obstacle; for details see python中变量批量处理集成方案 - A Hugo website.
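
For instance, a pipeline can impute missing values before the classifier ever sees them. A sketch (not from the course) using the Imputer that matches the scikit-learn 0.19 docs linked above; newer versions use SimpleImputer from sklearn.impute:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer  # scikit-learn >= 0.20: sklearn.impute.SimpleImputer
from sklearn.neighbors import KNeighborsClassifier

# Fill NaNs with the column mean, then hand the completed matrix to KNN
pipe = Pipeline([
    ('imputer', Imputer(strategy='mean')),
    ('knn', KNeighborsClassifier(n_neighbors=6)),
])
pipe.fit(X, y)  # X may contain NaNs; the pipeline imputes them first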

Even so, KNN still counts as supervised learning.

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(iris['data'], iris['target'])

See, there is still a y involved.

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=6, p=2,
           weights='uniform')

Also:

iris['data'].shape    # (150, 4)
iris['target'].shape  # (150,)

k-Nearest Neighbors: Fit | Python

# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier

Import the package.

y = df['party'].values
X = df.drop('party', axis=1).values

Adding .values here keeps y and X as NumPy arrays.

# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the data
knn.fit(X,y)

Here n_neighbors = 6 specifies that the 6 nearest neighbors take a majority vote.
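
A side note not in the course: because the prediction is a vote, predict_proba simply reports each label's share of the 6 votes.

# With the default weights='uniform', every probability is a multiple of 1/6:
# the fraction of the 6 nearest neighbors carrying each label.
print(knn.predict_proba(X[:3]))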

k-Nearest Neighbors: Predict | Python

# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier 

# Create arrays for the features and the response variable
y = df.party.values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors: knn
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the data
knn.fit(X,y)

# Predict the labels for the training data X
y_pred = knn.predict(X)

# Predict and print the label for the new data point X_new
new_prediction = knn.predict(X_new)
print("Prediction: {}".format(new_prediction))

Measuring model performance | Python

knn.score(X_test, y_test) reports the accuracy \(Acc\). stratify=y in train_test_split means:

stratify : array-like or None (default is None)
    If not None, data is split in a stratified fashion, using this as
    the class labels.

Per train_test_split(X, y, stratify=y) - CSDN博客: stratify=X splits according to the proportions in X, stratify=y according to the proportions in y; in practice it is almost always stratify=y.

With a large dataset a purely random split should already be fairly stable; but when some classes are rare, this is exactly the pitfall to watch for.
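
One way to see what stratify=y buys you: compare class proportions before and after the split. A quick check, assuming an integer-coded y such as the digits target used below:

import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# With stratify=y, per-class proportions in each split match the full data
print(np.bincount(y) / len(y))              # proportions in all of y
print(np.bincount(y_train) / len(y_train))  # nearly identical proportions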

\[\text{Larger } k \to \text{smoother decision boundary} \to \text{less complex model}\]

\[\text{Smaller } k \to \text{more complex model} \to \text{can lead to overfitting}\]

This is easy to picture: a jagged boundary means the model is pulling too many stunts. It may look counter-intuitive, but it isn't: raising \(k\) does not add parameters, it removes effective complexity. At the extreme, \(k\) equal to the size of the training set predicts the overall majority class everywhere, a constant model.

The digits recognition dataset | Python

So .load_digits() is how sklearn's bundled sample datasets are loaded.

# Import necessary modules
from sklearn import datasets
import matplotlib.pyplot as plt

# Load the digits dataset: digits
digits = datasets.load_digits()
In [4]: print(digits.keys())
dict_keys(['DESCR', 'data', 'images', 'target', 'target_names'])

As a dictionary-like object, digits has exactly these keys.

In [5]: print(digits.DESCR)
Optical Recognition of Handwritten Digits Data Set
===================================================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.

References
----------
  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
    Graduate Studies in Science and Engineering, Bogazici University.
  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
    Linear dimensionalityreduction using relevance weighted LDA. School of
    Electrical and Electronic Engineering Nanyang Technological University.
    2005.
  - Claudio Gentile. A New Approximate Maximal Margin Classification
    Algorithm. NIPS. 2000.

This reads more like a write-up of the dataset itself.

In [6]: print(digits.images.shape)
(1797, 8, 8)

I found this a bit confusing at first: why does the shape have three elements?

In [7]: type(digits.images)
Out[7]: numpy.ndarray

NumPy format.

In [10]: digits.images[0:2]
Out[10]: 
array([[[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
        [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
        [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
        [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
        [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
        [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
        [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
        [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]],

       [[  0.,   0.,   0.,  12.,  13.,   5.,   0.,   0.],
        [  0.,   0.,   0.,  11.,  16.,   9.,   0.,   0.],
        [  0.,   0.,   3.,  15.,  16.,   6.,   0.,   0.],
        [  0.,   7.,  15.,  16.,  16.,   2.,   0.,   0.],
        [  0.,   0.,   1.,  16.,  16.,   3.,   0.,   0.],
        [  0.,   0.,   1.,  16.,  16.,   6.,   0.,   0.],
        [  0.,   0.,   1.,  16.,  16.,   6.,   0.,   0.],
        [  0.,   0.,   0.,  11.,  16.,  10.,   0.,   0.]]])

Looking at the first two samples makes it clear: each element along the first axis is itself an 8x8 matrix, hence the three dimensions.

In [11]: digits.images[0:2,0]
Out[11]: 
array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
       [  0.,   0.,   0.,  12.,  13.,   5.,   0.,   0.]])

This selects, from each of the first two samples, the first row of its 8x8 matrix. That completes the tour of the data.
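
In fact the 3-D images array and the 2-D data matrix carry the same numbers; data is just each 8x8 image flattened into a 64-long row. A quick check:

import numpy as np

# Flatten each 8x8 image into a 64-element row vector
flat = digits.images.reshape(len(digits.images), -1)
print(flat.shape)                         # (1797, 64)
print(np.array_equal(flat, digits.data))  # True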

# Display digit 1010
plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

So this is handwritten digit recognition.

Train/Test Split + Fit/Predict/Accuracy | Python

# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

Now we start computing \(Acc\); import the modules first.

# Create feature and target arrays
X = digits.data
y = digits.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)

Training and test sets are now split.

# Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors = 7)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))
<script.py> output:
    0.983333333333

The predictive performance is impressively strong.

Overfitting and underfitting | Python

In for i, k in enumerate(neighbors):, k supplies n_neighbors and i indexes where each \(Acc\) is recorded.

# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN Classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors = k)

    # Fit the classifier to the training data
    knn.fit(X_train,y_train)
    
    #Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train,y_train)

    #Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()