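\(\boxtimes\) KNN can also be used for multi-class classification. Because it works directly on distances in feature space, it copes well with around 10 classes of y, and in my impression it can do better than a softmax function there (this depends on the data). The most intuitive way to understand KNN: take the \(n\) points nearest to a given point; the label (value of y) that occurs most often among those \(n\) points becomes \(\hat y\).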
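The voting rule described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than how scikit-learn implements it; the function name `knn_predict` and the toy data are made up for this example, and Euclidean distance is assumed.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count each label among the neighbors; on a tie the smallest label wins
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# Toy data (made up for illustration): three classes in 2-D
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0],
                    [5.1, 4.9], [0.0, 5.0], [0.2, 5.1]])
y_train = np.array([0, 0, 1, 1, 2, 2])

label = knn_predict(X_train, y_train, np.array([4.8, 5.2]), k=3)
print(label)  # two of the three nearest points carry label 1
```

Nothing in the vote depends on how many distinct labels there are, which is why the same rule handles binary and multi-class problems alike.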

Supervised Learning with scikit-learn


Taught by this guy: Andreas Müller | DataCamp

Andy is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to machine learning with Python”, describing a practical approach to machine learning with python and scikit-learn.

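By this logic, buying his book is the best value. People who produce both videos and books usually also write good slides and papers, so they have no real weak spots; and watching the videos is efficient because you don't have to look things up all over the place.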

Supervised learning | Python

Reinforcement learning has a somewhat Bayesian feel to it.

Software agents interact with an environment

  • Learn how to optimize their behavior
  • Given a system of rewards and punishments
  • Draws inspiration from behavioral psychology


  • Economics
  • Genetics
  • Game playing

AlphaGo: First computer to defeat the world champion in Go


Which of these is a classification problem? | Python


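scikit-learn is not installed properly on my local machine. I probably need to set the Anaconda path. I don't know how yet, so I will try things in Jupyter first and come back to tedious details like paths once I am more familiar with them.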

Installing scikit-learn — scikit-learn 0.19.1 documentation

from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
iris = datasets.load_iris()
In [7]: type(iris)
Out[7]: sklearn.datasets.base.Bunch

I had not seen this data structure before. iris is a Bunch, which behaves like a dictionary.

In [8]: print(iris.keys())
dict_keys(['data', 'target_names', 'DESCR', 'feature_names', 'target'])
In [9]: type(iris.data), type(iris.target)
Out[9]: (numpy.ndarray, numpy.ndarray)
In [10]: iris.data.shape
Out[10]: (150, 4)

So iris.data is a matrix of shape (150, 4): 150 samples with 4 features each.
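As a small sketch of what "dictionary-like" means for a Bunch, the snippet below checks that key access and attribute access return the same array. It assumes scikit-learn is installed; the variable names `same_data`, `n_samples` and `n_features` are only for illustration.

```python
from sklearn import datasets

iris = datasets.load_iris()

# A Bunch supports both dictionary-style and attribute-style access
same_data = iris['data'] is iris.data
n_samples, n_features = iris.data.shape

print(same_data)                 # both access styles return the same array
print(n_samples, n_features)     # rows are samples, columns are features
print(len(iris.feature_names), len(iris.target))
```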

In [11]: iris.target_names
Out[11]: array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
X = iris.data
y = iris.target


df = pd.DataFrame(X, columns=iris.feature_names)


_ = pd.plotting.scatter_matrix(df, c = y, figsize = [8, 8],s=150, marker='D')

marker='D' stands for diamond; this can be looked up in markers — Matplotlib 2.1.1 documentation


Marker size is scaled by s and marker color is mapped to c.



Visual EDA | Python

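Note that the earlier plots, scatter plots and histograms, are all meant for continuous variables. For binary variables a countplot is the natural choice. palette='RdBu' means Red and Blue. plt.figure() starts a new figure, serving a purpose similar to plt.clf(); without it the plots are drawn on top of each other.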

import seaborn as sns
plt.figure()
sns.countplot(x='education', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])

The classification challenge | Python

KNN (k-Nearest Neighbors) is not the same kind of method as KMeans and the like. It looks at the few nearest points and decides by majority (mode) vote.

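The data must not contain missing values, but with the sklearn package this is not a problem, since they can be filled in beforehand; for details see python中变量批量处理集成方案 - A Hugo website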
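One possible way to handle missing values before k-NN is sketched below. It is not taken from the linked post: it assumes a scikit-learn version that provides `SimpleImputer` (0.20 or later; the 0.19 release referenced in these notes used a differently named class), and the tiny binary dataset is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Made-up binary features with two missing entries (np.nan)
X = np.array([[0, 1], [0, np.nan], [1, 0],
              [np.nan, 0], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1, 0, 1])

# Fill each missing value with the most frequent value of its column, then fit k-NN
model = make_pipeline(SimpleImputer(strategy='most_frequent'),
                      KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)

# Inspect the imputed training matrix and predict one new point
X_filled = model.named_steps['simpleimputer'].transform(X)
prediction = model.predict(np.array([[0.0, 1.0]]))
print(X_filled)
print(prediction)
```

Putting the imputer inside the pipeline means the same fill values learned on the training data are reused at prediction time.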


from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(iris['data'], iris['target'])


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=6, p=2,



k-Nearest Neighbors: Fit | Python

# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier


y = df['party'].values
X = df.drop('party', axis=1).values


# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the data
knn.fit(X, y)

Here n_neighbors = 6 means that the 6 nearest neighbors vote and the majority wins.
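To make the vote visible, the sketch below asks a fitted classifier for the 6 nearest neighbors of one point and tallies their labels by hand. The voting dataset from the course is not available here, so this uses iris instead, and `X_new` is a hypothetical measurement invented for the example.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X, y)

# A hypothetical new flower (sepal length, sepal width, petal length, petal width)
X_new = np.array([[5.9, 3.0, 4.6, 1.5]])

# kneighbors returns the distances to, and the indices of, the 6 nearest training points
distances, indices = knn.kneighbors(X_new)
neighbor_labels = y[indices[0]]

# Majority vote by hand: count the labels among the neighbors, take the most frequent
manual_vote = int(np.bincount(neighbor_labels, minlength=3).argmax())
predicted = int(knn.predict(X_new)[0])

print(neighbor_labels, manual_vote, predicted)
```

With the default uniform weights, the hand-counted vote and `knn.predict` should agree. An even k such as 6 can produce ties, so an odd k is sometimes preferred for binary problems.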

k-Nearest Neighbors: Predict | Python

# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier 

# Create arrays for the features and the response variable
y = df.party.values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors: knn
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the data
knn.fit(X, y)

# Predict the labels for the training data X
y_pred = knn.predict(X)

# Predict and print the label for the new data point X_new
new_prediction = knn.predict(X_new)
print("Prediction: {}".format(new_prediction))

Measuring model performance | Python

knn.score(X_test, y_test) reports the accuracy \(Acc\). stratify=y is passed as an argument to train_test_split; its docstring says:

stratify : array-like or None (default is None)
    If not None, data is split in a stratified fashion, using this as
    the class labels.

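From train_test_split(X, y, stratify=y) - CSDN博客: stratify=X splits according to the proportions in X, and stratify=y splits according to the class proportions in y. In practice it is almost always =y.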
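A quick way to see what stratification does is to split a deliberately imbalanced label vector and compare class shares. The data below is made up (90 samples of class 0 and 10 of class 1); only `train_test_split` itself comes from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90 samples of class 0 and 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y both splits keep the 90/10 class ratio
train_share = y_train.mean()   # share of class 1 in the training set
test_share = y_test.mean()     # share of class 1 in the test set
print(train_share, test_share)
```

Without `stratify=y`, a random 20-sample test set could easily contain no class 1 samples at all, which would make the test accuracy misleading.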


\[\text{Larger } k \to \text{smoother decision boundary} \to \text{less complex model}\]

\[\text{Smaller } k \to \text{more complex model} \to \text{can lead to overfitting}\]

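This is easy to understand: a jagged, uneven boundary means the model is pulling too many fancy tricks. It looks a bit counterintuitive at first, but it isn't: increasing \(n\) (the number of neighbors, \(k\) above) does not add parameters, it effectively removes them.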
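The two extremes of k make this concrete. The sketch below uses made-up random data whose labels have nothing to do with the features, so any perfect training score is pure memorization; the variable names `acc_k1` and `acc_kall` are only for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: 100 random 2-D points, 60 labelled 0 and 40 labelled 1
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 60 + [1] * 40)

# k = 1: every training point is its own nearest neighbor, so the model memorizes the data
acc_k1 = KNeighborsClassifier(n_neighbors=1).fit(X, y).score(X, y)

# k = 100 (all points): every query sees the same 60-vs-40 vote, so the model
# always predicts the majority class and the decision boundary disappears
acc_kall = KNeighborsClassifier(n_neighbors=len(X)).fit(X, y).score(X, y)

print(acc_k1, acc_kall)
```

Here the training accuracy is 1.0 for k = 1 even though there is nothing to learn, and drops to the majority-class share of 0.6 when k equals the sample size. Useful values of k lie between these extremes.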

The digits recognition dataset | Python

# Import necessary modules
from sklearn import datasets
import matplotlib.pyplot as plt

# Load the digits dataset: digits
digits = datasets.load_digits()
In [4]: print(digits.keys())
dict_keys(['DESCR', 'data', 'images', 'target', 'target_names'])


In [6]: print(digits.images.shape)
(1797, 8, 8)


In [7]: type(digits.images)
Out[7]: numpy.ndarray


In [10]: digits.images[0:2]
array([[[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
        [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
        [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
        [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
        [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
        [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
        [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
        [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]],

       [[  0.,   0.,   0.,  12.,  13.,   5.,   0.,   0.],
        [  0.,   0.,   0.,  11.,  16.,   9.,   0.,   0.],
        [  0.,   0.,   3.,  15.,  16.,   6.,   0.,   0.],
        [  0.,   7.,  15.,  16.,  16.,   2.,   0.,   0.],
        [  0.,   0.,   1.,  16.,  16.,   3.,   0.,   0.],
        [  0.,   0.,   1.,  16.,  16.,   6.,   0.,   0.],
        [  0.,   0.,   1.,  16.,  16.,   6.,   0.,   0.],
        [  0.,   0.,   0.,  11.,  16.,  10.,   0.,   0.]]])


In [11]: digits.images[0:2,0]
array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
       [  0.,   0.,   0.,  12.,  13.,   5.,   0.,   0.]])

This takes the first row of each image matrix for the first two samples. That completes the explanation of the sample structure.
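The relationship between digits.images and digits.data (used later as the feature matrix) can be checked directly. A minimal sketch, assuming scikit-learn is installed; `flattened`, `matches` and `first_rows` are illustrative names.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Each 8x8 image flattened row by row gives the 64 features in digits.data
flattened = digits.images.reshape(len(digits.images), -1)
matches = np.array_equal(flattened, digits.data)

# Slicing [0:2, 0] keeps samples 0 and 1 and picks row 0 of each image
first_rows = digits.images[0:2, 0]

print(flattened.shape, matches, first_rows.shape)
```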

# Display digit 1010
plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest')


Train/Test Split + Fit/Predict/Accuracy | Python

# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split


# Create feature and target arrays
X = digits.data
y = digits.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)


# Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors = 7)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))
<script.py> output:


Overfitting and underfitting | Python

In for i, k in enumerate(neighbors):, k is the value passed to n_neighbors and i is the index used to record \(Acc\).
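As a standalone illustration of how i and k differ, the sketch below runs the same loop shape with a placeholder value in place of a real accuracy. The array name `scores` and the `1.0 / k` placeholder are made up for this example.

```python
import numpy as np

neighbors = np.arange(1, 9)          # candidate values of k: 1, 2, ..., 8
scores = np.empty(len(neighbors))    # one slot per candidate k

for i, k in enumerate(neighbors):
    # i is the position (0..7) used to index the results array,
    # k is the actual number of neighbors (1..8)
    scores[i] = 1.0 / k              # placeholder standing in for an accuracy

pairs = [(i, int(k)) for i, k in enumerate(neighbors)]
print(pairs[:3])
```

Because k starts at 1 while array positions start at 0, indexing the result arrays with k directly would run off the end; enumerate supplies the matching position.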

# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN Classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors = k)

    # Fit the classifier to the training data
    knn.fit(X_train, y_train)

    # Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train,y_train)

    #Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.xlabel('Number of Neighbors')