Study notes: Supervised Learning with scikit-learn

$\boxtimes$

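KNN also works for multi-class problems. Because it classifies by distance in feature space, it can perform well when the target has around 10 classes, and in such cases it can compare favorably with a softmax model, although this depends on the data.

The core idea of KNN: for a given sample, find its $n$ nearest neighbors and take the most frequent class among them as the prediction $\hat y$.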
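The majority-vote idea can be sketched in a few lines of NumPy. This is only an illustration on made-up toy data, not how scikit-learn implements it; the function name knn_predict and the choice of Euclidean distance are assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, n_neighbors=3):
    """Predict the label of x_new by majority vote among its nearest neighbors."""
    # Euclidean distance from x_new to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the n_neighbors closest training samples
    nearest = np.argsort(distances)[:n_neighbors]
    # Most frequent label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated clusters
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.05, 0.05]))
pred_b = knn_predict(X_train, y_train, np.array([5.0, 5.1]))
print(pred_a, pred_b)
```

Nothing is "trained" here: the prediction is computed directly from the stored samples, which is why fit on a k-NN model mostly just memorizes the data.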

Supervised Learning with scikit-learn

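This course covers supervised learning: data preprocessing, k-NN, regression, decision trees, logistic regression, and support vector machines, with the most emphasis on k-NN.

Course overview:

- Duration: 4 hours
- Videos: 17
- Exercises: 54

Instructor: the course is taught by Andreas Müller | DataCamp.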

Andy is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to machine learning with Python”, describing a practical approach to machine learning with python and scikit-learn.

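I recommend buying the author's book for deeper study. Combining a video course with its companion book is usually more efficient and saves looking up a lot of material. By comparison, the xgboost material taught by another instructor, Sergey Fogelson | DataCamp, had some shortcomings in its explanations.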

Supervised learning | Python

Reinforcement learning has a somewhat Bayesian feel to it.

Software agents interact with an environment

Learn how to optimize their behavior
Given a system of rewards and punishments
Draws inspiration from behavioral psychology

Applications

Economics
Genetics
Game playing

AlphaGo: First computer to defeat the world champion in Go

After the first video it was already clear that he teaches well!

Which of these is a classification problem? | Python

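This section introduces the basic structure of the dataset and the corresponding visualizations.

My local environment is not fully configured yet (the Anaconda path still needs to be set). For now I am trying things out in Jupyter Notebook and will finish the full setup once I am familiar with the path configuration.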

Installing scikit-learn — scikit-learn 0.19.1 documentation

from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
iris = datasets.load_iris()

In [7]: type(iris)
Out[7]: sklearn.datasets.base.Bunch

This data structure may look unfamiliar at first. The iris object is a Bunch, which stores its data like a dictionary.

In [8]: print(iris.keys())
dict_keys(['data', 'target_names', 'DESCR', 'feature_names', 'target'])

In [9]: type(iris.data), type(iris.target)
Out[9]: (numpy.ndarray, numpy.ndarray)
In [10]: iris.data.shape
Out[10]: (150, 4)

So iris.data is a NumPy array of shape (150, 4): 150 samples, each with 4 features.

In [11]: iris.target_names
Out[11]: array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

X = iris.data
y = iris.target

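X is a 2-D array of shape (150, 4) and y is a 1-D array of length 150. No explicit join is needed when drawing the scatter plots, because the two are matched implicitly by row index.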

df = pd.DataFrame(X, columns=iris.feature_names)

This converts the NumPy array into a pandas DataFrame.

_ = pd.plotting.scatter_matrix(df, c = y, figsize = [8, 8],s=150, marker='D')
plt.show()

marker='D' stands for diamond; this can be looked up in the Matplotlib 2.1.1 documentation page on markers.

s is the marker size (not the shape) and c is the color. See plt.scatter?.

Marker size is scaled by s and marker color is mapped to c.

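In the plot, the panel in the third row, last column separates the classes very clearly, which suggests an idea for the model.

As an aside, failed deployments are usually caused by network problems; connecting through a phone hotspot tends to be more stable.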

Visual EDA | Python

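The earlier plots show that scatter plots and histograms are mainly suited to continuous variables. For binary variables, a count plot (countplot) is more appropriate.

Parameters:

- palette='RdBu': use a red-blue color scheme
- plt.figure(): create a new figure; similar in effect to plt.clf(), it keeps plots from being drawn on top of each other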

import seaborn as sns
plt.figure()
sns.countplot(x='education', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()

The classification challenge | Python

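KNN (k-nearest neighbors) is different from clustering algorithms such as KMeans. KNN finds the k nearest samples and predicts by majority vote.

Requirement: the dataset must not contain missing values. scikit-learn provides thorough preprocessing tools for this; for a detailed approach see: python中变量批量处理集成方案 - A Hugo website

KNN is a supervised learning algorithm, so it needs labeled data for training.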

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(iris['data'], iris['target'])

Note that y is still passed in: the labels are required.

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=6, p=2,
           weights='uniform')

Also, iris['data'].shape is (150, 4) and iris['target'].shape is (150,).

k-Nearest Neighbors: Fit | Python

# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier

Import the class.

y = df['party'].values
X = df.drop('party', axis=1).values

.values is added here so that y and X are NumPy arrays rather than pandas objects.

# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the data
knn.fit(X,y)

n_neighbors = 6 means that the 6 nearest neighbors vote and the majority class wins.

k-Nearest Neighbors: Predict | Python

# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier 

# Create arrays for the features and the response variable
y = df.party.values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors: knn
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the data
knn.fit(X,y)

# Predict the labels for the training data X
y_pred = knn.predict(X)

# Predict and print the label for the new data point X_new
new_prediction = knn.predict(X_new)
print("Prediction: {}".format(new_prediction))

Measuring model performance | Python

knn.score(X_test, y_test) returns the accuracy $Acc$. The stratify=y argument of train_test_split is described in its docstring as follows:

stratify : array-like or None (default is None)
    If not None, data is split in a stratified fashion, using this as
    the class labels.

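According to the CSDN blog post on train_test_split(X, y, stratify=y), the array passed to stratify is the one whose class proportions are preserved in both splits. stratify=y keeps the proportions of y, which is the usual choice.

With a large dataset a purely random split should be fairly stable. With a rare class, however, a random split can leave that class badly under-represented in one of the sets, so this is the case where stratification matters.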
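A quick way to see what stratify=y does is to split an imbalanced toy target and compare the class shares. The data below is made up for illustration (90 samples of class 0 and 10 of class 1); only train_test_split itself comes from the notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels (made up): 90 samples of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y both splits keep the 10% share of class 1
train_share = y_train.mean()
test_share = y_test.mean()
print(train_share, test_share)
```

Without stratify, the share of the rare class in the test set would vary with random_state and could by chance be much lower or higher than 10%.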

$$\text{Larger } k \to \text{smoother decision boundary} \to \text{less complex model}$$

$$\text{Smaller } k \to \text{more complex model} \to \text{can lead to overfitting}$$

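This is easy to understand: a jagged boundary means the model is bending too much to fit individual points. It may seem to contradict intuition, but it does not: increasing $k$ does not add parameters; it effectively removes flexibility from the model.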

The digits recognition dataset | Python

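$\boxtimes$ KNN can also be used for multi-class problems. Because it works on distances in feature space, it handles targets with around 10 classes well. The most intuitive way to see it: take the $n$ points nearest to a given point; the most frequent label among those $n$ points becomes $\hat y$.

.load_digits() shows how scikit-learn's bundled sample datasets are loaded.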

# Import necessary modules
from sklearn import datasets
import matplotlib.pyplot as plt

# Load the digits dataset: digits
digits = datasets.load_digits()

In [4]: print(digits.keys())
dict_keys(['DESCR', 'data', 'images', 'target', 'target_names'])

digits is a dictionary-like object with these keys.

In [5]: print(digits.DESCR)
Optical Recognition of Handwritten Digits Data Set
===================================================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.

References
----------
  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
    Graduate Studies in Science and Engineering, Bogazici University.
  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
    Linear dimensionalityreduction using relevance weighted LDA. School of
    Electrical and Electronic Engineering Nanyang Technological University.
    2005.
  - Claudio Gentile. A New Approximate Maximal Margin Classification
    Algorithm. NIPS. 2000.

This reads more like a description of the dataset.

In [6]: print(digits.images.shape)
(1797, 8, 8)

At first I did not understand why the shape has three elements.

In [7]: type(digits.images)
Out[7]: numpy.ndarray

It is a NumPy array.

In [10]: digits.images[0:2]
Out[10]: 
array([[[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
        [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
        [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
        [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
        [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
        [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
        [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
        [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]],

       [[  0.,   0.,   0.,  12.,  13.,   5.,   0.,   0.],
        [  0.,   0.,   0.,  11.,  16.,   9.,   0.,   0.],
        [  0.,   0.,   3.,  15.,  16.,   6.,   0.,   0.],
        [  0.,   7.,  15.,  16.,  16.,   2.,   0.,   0.],
        [  0.,   0.,   1.,  16.,  16.,   3.,   0.,   0.],
        [  0.,   0.,   1.,  16.,  16.,   6.,   0.,   0.],
        [  0.,   0.,   1.,  16.,  16.,   6.,   0.,   0.],
        [  0.,   0.,   0.,  11.,  16.,  10.,   0.,   0.]]])

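Looking at the first two samples: each sample is itself an 8x8 matrix, and the samples are stacked along the first axis, which is why there are three dimensions.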

In [11]: digits.images[0:2,0]
Out[11]: 
array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
       [  0.,   0.,   0.,  12.,  13.,   5.,   0.,   0.]])

This selects the first row of the 8x8 matrix of each of the first two samples. That covers the structure of the data.
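The relationship between digits.images and digits.data can be checked directly. The sketch below assumes only the load_digits dataset used above; the variable names flattened, same and first_rows are introduced for the example.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# images: one 8x8 matrix per sample; data: the same pixels flattened to 64 values
n_samples = digits.images.shape[0]
flattened = digits.images.reshape(n_samples, -1)

print(digits.images.shape, digits.data.shape)
same = np.array_equal(flattened, digits.data)
first_rows = digits.images[0:2, 0]  # first pixel row of the first two images
print(same, first_rows.shape)
```

So the 3-D images array is convenient for plotting with plt.imshow, while the 2-D data array is the form that estimators such as KNeighborsClassifier expect.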

# Display digit 1010
plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

So this is a handwritten digit recognition task.

Train/Test Split + Fit/Predict/Accuracy | Python

# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

Now we start computing $Acc$. First, the imports.

# Create feature and target arrays
X = digits.data
y = digits.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)

Split the data into training and test sets.

# Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors = 7)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))

<script.py> output:
    0.983333333333

The predictive accuracy is remarkably high.

Overfitting and underfitting | Python

In for i, k in enumerate(neighbors):, k is the value of n_neighbors and i is the index at which the corresponding $Acc$ is stored.

# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN Classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors = k)

    # Fit the classifier to the training data
    knn.fit(X_train,y_train)
    
    #Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train,y_train)

    #Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()

Time for a break and some TV.

Introduction to regression | Python

.reshape: Returns an array containing the same data with a new shape. Refer to numpy.reshape for full documentation.

import numpy
numpy.reshape?

python reshape用法 - a3335581的博客 - CSDN博客

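That post makes it clear.

y.reshape(-1, 1) produces a column vector, not a row vector: -1 lets NumPy infer the number of rows, and each row holds a single element. Applied to both arrays: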

y = y.reshape(-1, 1)
X_rooms = X_rooms.reshape(-1, 1)

The new y and X_rooms are both 2-D arrays with a single column.
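A tiny example makes the effect of -1 concrete. The array values are made up; only reshape itself is taken from the notes.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])  # 1-D array, shape (4,)

# -1 asks NumPy to infer that dimension from the array's size
col = y.reshape(-1, 1)   # column: 4 rows, 1 column
row = y.reshape(1, -1)   # row: 1 row, 4 columns

print(y.shape, col.shape, row.shape)
```

scikit-learn expects the feature matrix X to be 2-D (samples by features), which is why a single feature has to be reshaped into a column before calling fit.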

Finally, np.linspace generates evenly spaced x values along which the fitted line $\hat y = X \hat \beta$ is drawn.

import numpy as np
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(X_rooms, y)
prediction_space = np.linspace(min(X_rooms), max(X_rooms)).reshape(-1,1)
import matplotlib.pyplot as plt
plt.scatter(X_rooms, y, color='blue')
plt.plot(prediction_space, reg.predict(prediction_space), color='black', linewidth=3)
plt.show()

Importing data for supervised learning | Python

# Import numpy and pandas
import numpy as np
import pandas as pd

# Read the CSV file into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create arrays for features and target variable
y = df.life.values
X = df.fertility.values

# Print the dimensions of X and y before reshaping
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X.shape))

# Reshape X and y
y = y.reshape(-1,1)
X = X.reshape(-1,1)

# Print the dimensions of X and y after reshaping
print("Dimensions of y after reshaping: {}".format(y.shape))
print("Dimensions of X after reshaping: {}".format(X.shape))

The "{}".format() style is very intuitive, nicer than the R equivalent.

<script.py> output:
    Dimensions of y before reshaping: (139,)
    Dimensions of X before reshaping: (139,)
    Dimensions of y after reshaping: (139, 1)
    Dimensions of X after reshaping: (139, 1)

The second, 2-D format is clearly the more convenient one.

Exploring the Gapminder data | Python

seaborn.heatmap, sns.heatmap(df.corr(), square=True, cmap='RdYlGn') $\to$ .info(), .describe(), .head().

EDA again.

In [1]: sns.heatmap(df.corr(), square=True, cmap='RdYlGn')
... 
... 
Out[1]: <matplotlib.axes._subplots.AxesSubplot at 0x7fd8205810f0>

In [2]: plt.show()

The basics of linear regression | Python

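It just occurred to me that with this integrated API it is less straightforward to tinker with the regression equation itself, for example by adding an interaction term.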

Fit & predict for regression | Python

# Import LinearRegression
from sklearn.linear_model import LinearRegression

# Create the regressor: reg
reg = LinearRegression()

# Create the prediction space
prediction_space = np.linspace(min(X_fertility), max(X_fertility)).reshape(-1,1)

# Fit the model to the data
reg.fit(X_fertility, y)

# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)

# Print R^2 
print(reg.score(X_fertility, y))

# Plot regression line
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.show()

np.linspace(min(X_fertility),max(X_fertility)).reshape(-1,1) is the usual trick for drawing $\hat y = X \hat \beta$. For OLS, reg.score(X_fertility, y) is $R^2$.

Train/test split for regression | Python

# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

# Create the regressor: reg_all
reg_all = LinearRegression()

# Fit the regressor to the training data
reg_all.fit(X_train, y_train)

# Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)

# Compute and print R^2 and RMSE
print("R^2: {}".format(reg_all.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))

<script.py> output:
    R^2: 0.838046873142936
    Root Mean Squared Error: 3.2476010800377213

Cross-validation | Python

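As I recall, the xgboost course explained this too: cross-validation is needed because $R_{test}^2$ depends heavily on how the data happens to be split.

One look at the diagram makes this clear.

Here cross_val_score defaults to the estimator's own score, so for a regression it returns a cross-validated $R_{CV}^2$; the name is simply CV + score.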
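To make "CV + score" concrete, the loop below repeats by hand what cross_val_score does for a regressor, assuming the default unshuffled 5-fold split. The data is synthetic and the variable names are made up for this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (made up): y depends linearly on two features plus noise
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Manual 5-fold loop: fit on four folds, score R^2 on the held-out fold
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    reg = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(reg.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# cross_val_score does the same bookkeeping in one call
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(manual_scores.mean(), cv_scores.mean())
```

Each fold gives its own $R^2$, and the spread across folds shows how much a single train/test split could have misled.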

5-fold cross-validation | Python

from sklearn.model_selection import cross_val_score shows that cross-validation belongs to the model selection module.

# Import the necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Create a linear regression object: reg
reg = LinearRegression()

# Compute 5-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(reg, X, y, cv = 5)

# Print the 5-fold cross-validation scores
print(cv_scores)

print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))

<script.py> output:
    [ 0.81720569  0.82917058  0.90214134  0.80633989  0.94495637]
    Average 5-Fold CV Score: 0.8599627722793232

Haha.

Regularized regression | Python

$$\text{Loss function} = \text{OLS loss function} + \alpha \sum_{i=1}^n a_{i}^2$$

Here, $\alpha$ too large $\to$ underfitting, and $\alpha$ too small $\to$ overfitting. So $\alpha$ is not an objectively determined quantity; it has to be chosen by the analyst.
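The effect of $\alpha$ can be seen with the closed-form ridge solution. This is a sketch under simplifying assumptions (no intercept, synthetic data, a hypothetical helper ridge_coefficients); it is not how the Ridge class is implemented internally.

```python
import numpy as np


def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution without an intercept: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Synthetic data (made up for illustration)
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

alphas = [0.0, 1.0, 100.0, 10000.0]
for alpha in alphas:
    coef = ridge_coefficients(X, y, alpha)
    print(alpha, np.round(coef, 3), round(float(np.sum(coef ** 2)), 3))

# Sum of squared coefficients for each alpha, i.e. the penalized quantity
norms = [float(np.sum(ridge_coefficients(X, y, a) ** 2)) for a in alphas]
```

As $\alpha$ grows the coefficients are pulled toward zero: a very large $\alpha$ leaves almost no model (underfitting), while $\alpha = 0$ is plain OLS.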

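What does normalize mean? According to help(Ridge), it rescales the variables to unit norm before fitting, because the penalty depends on the units of the variables and would otherwise be affected by them.

Standardization vs. normalization

In short, standardization processes the feature matrix column by column: it computes z-scores so that all features are brought to the same scale. Normalization processes the feature matrix row by row: its purpose is to give sample vectors a common scale when similarity is computed by dot products or other kernel functions, that is, every sample is converted to a "unit vector" ¹. The l2 normalization formula is: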

$$\tilde x = \frac{x}{\|x\|_2}$$

$$\|x\|_2 = \sqrt{\sum_{i=1}^m x_i^2}$$

Here $x$ is a row vector, i.e. the data of one user, and $m$ is the number of features.
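Row-wise l2 normalization is short enough to write out with NumPy. The matrix below is made up for illustration, and the names row_norms and X_unit are introduced only for this sketch.

```python
import numpy as np

# Toy matrix (made up): each row is one sample
X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [2.0, 2.0]])

# L2 norm of every row, kept as a column so it broadcasts over the features
row_norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
X_unit = X / row_norms

print(X_unit)
print(np.linalg.norm(X_unit, axis=1))
```

After the division every row has length 1, so dot products between rows become cosine similarities.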

$$\text{Loss function} = \text{OLS loss function} + \alpha \sum_{i=1}^n |a_{i}|$$

Regularization I: Lasso | Python

# Import Lasso
from sklearn.linear_model import Lasso

# Instantiate a lasso regressor: lasso
lasso = Lasso(alpha = 0.4, normalize=True)

# Fit the regressor to the data
lasso.fit(X,y)

# Compute and print the coefficients
lasso_coef = lasso.coef_
print(lasso_coef)

# Plot the coefficients
plt.plot(range(len(df_columns)), lasso_coef)
plt.xticks(range(len(df_columns)), df_columns.values, rotation=60)
plt.margins(0.02)
plt.show()

Regularization II: Ridge | Python

$L1 \to$ Lasso $\gets + \alpha \sum_{i=1}^n|a_{i}|$
$L2 \to$ Ridge $\gets + \alpha \sum_{i=1}^na_{i}^2$


# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Setup the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []

logspace(start, stop, num=50, endpoint=True, base=10.0, dtype=None)
    Return numbers spaced evenly on a log scale.
    
    In linear space, the sequence starts at ``base ** start``
    (`base` to the power of `start`) and ends with ``base ** stop``
    (see `endpoint` below).
    
    Parameters
    ----------
    start : float
        ``base ** start`` is the starting value of the sequence.
    stop : float
        ``base ** stop`` is the final value of the sequence, unless `endpoint`
        is False.  In that case, ``num + 1`` values are spaced over the
        interval in log-space, of which all but the last (a sequence of
        length `num`) are returned.
    num : integer, optional
        Number of samples to generate.  Default is 50.

In [3]: np.logspace(-4, 0, 50)
Out[3]: 
array([  1.00000000e-04,   1.20679264e-04,   1.45634848e-04,
         1.75751062e-04,   2.12095089e-04,   2.55954792e-04,
         3.08884360e-04,   3.72759372e-04,   4.49843267e-04,
         5.42867544e-04,   6.55128557e-04,   7.90604321e-04,
         9.54095476e-04,   1.15139540e-03,   1.38949549e-03,
         1.67683294e-03,   2.02358965e-03,   2.44205309e-03,
         2.94705170e-03,   3.55648031e-03,   4.29193426e-03,
         5.17947468e-03,   6.25055193e-03,   7.54312006e-03,
         9.10298178e-03,   1.09854114e-02,   1.32571137e-02,
         1.59985872e-02,   1.93069773e-02,   2.32995181e-02,
         2.81176870e-02,   3.39322177e-02,   4.09491506e-02,
         4.94171336e-02,   5.96362332e-02,   7.19685673e-02,
         8.68511374e-02,   1.04811313e-01,   1.26485522e-01,
         1.52641797e-01,   1.84206997e-01,   2.22299648e-01,
         2.68269580e-01,   3.23745754e-01,   3.90693994e-01,
         4.71486636e-01,   5.68986603e-01,   6.86648845e-01,
         8.28642773e-01,   1.00000000e+00])

So np.logspace(-4, 0, 50) produces 50 values evenly spaced on a $\log_{10}$ scale, from $10^{-4}$ to $10^{0}$, in ascending order.

# Create a ridge regressor: ridge
ridge = Ridge(normalize=True)

The goal is to look at the mean and the standard deviation of the ten cross-validation scores for each alpha.

# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Setup the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []

# Create a ridge regressor: ridge
ridge = Ridge(normalize=True)

# Compute scores over range of alphas
for alpha in alpha_space:

    # Specify the alpha value to use: ridge.alpha
    ridge.alpha = alpha
    
    # Perform 10-fold CV: ridge_cv_scores
    ridge_cv_scores = cross_val_score(ridge, X,y, cv = 10)
    
    # Append the mean of ridge_cv_scores to ridge_scores
    ridge_scores.append(np.mean(ridge_cv_scores))
    
    # Append the std of ridge_cv_scores to ridge_scores_std
    ridge_scores_std.append(np.std(ridge_cv_scores))

# Display the plot
display_plot(ridge_scores, ridge_scores_std)

$\Box$add plot best_alpha

Coming back to the function at this point, it makes sense.

def display_plot(cv_scores, cv_scores_std):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.plot(alpha_space, cv_scores)

    std_error = cv_scores_std / np.sqrt(10)

    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha=0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()

<!-- Although this definition looks at $R^2 \gets \alpha$, it does not seem quite right to me; leaving it here for now. -->

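ax.plot(alpha_space, cv_scores) plots $R^2$ as a function of $\alpha$. The standard deviation is computed in order to draw the shaded band, a rough confidence interval. The division is by $\sqrt{10}$, not by 10, because the standard error of the mean is $\frac{\hat \sigma}{n^{\frac{1}{2}}}$, with $n = 10$ folds.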

How good is your model? | Python

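The focus is on the F1 score.

A quick review.

$$\text{Recall} = \frac{TP}{TP+FN} \to \text{share of actual positives that are found}$$

$$\text{Precision} = \frac{TP}{TP+FP} \to \text{share of predicted positives that are correct}$$

$$\text{F1 score} = \frac{2}{\frac{1}{\text{Recall}} + \frac{1}{\text{Precision}}}$$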

In [7]: print(confusion_matrix(y_test, y_pred))
[[52  7]
 [ 3 112]]

In [8]: print(classification_report(y_test, y_pred))
             precision    recall  f1-score   support
          0       0.95      0.88      0.91        59
          1       0.94      0.97      0.96       115
avg / total       0.94      0.94      0.94       174

Depending on which class is treated as positive, $Recall$ and $Precision$ naturally differ.
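The numbers in the classification report can be reproduced from the confusion matrix by hand. The sketch below uses the matrix printed above; the helper class_metrics is a made-up name, and F1 is computed as the harmonic mean of precision and recall.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions
cm = [[52, 7],
      [3, 112]]


def class_metrics(cm, positive):
    """Precision, recall and F1 when class `positive` (0 or 1) is treated as positive."""
    negative = 1 - positive
    tp = cm[positive][positive]
    fn = cm[positive][negative]   # positive-class samples predicted as the other class
    fp = cm[negative][positive]   # other-class samples predicted as positive
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for label in (0, 1):
    p, r, f = class_metrics(cm, label)
    print(label, round(p, 2), round(r, 2), round(f, 2))
```

Swapping which class counts as positive swaps the roles of the rows and columns, which is why the report shows a separate line for each class.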

Metrics for classification | Python

You may have noticed in the video that the classification report consisted of three rows, and an additional support column.

The support gives the number of samples of the true response that lie in that class - so in the video example, the support was the number of Republicans or Democrats in the test set on which the classification report was computed.

The goal is to predict whether or not a given female patient will contract diabetes based on features such as BMI, age, and number of pregnancies. Therefore, it is a binary classification problem. A target value of 0 indicates that the patient does not have diabetes, while a value of 1 indicates that the patient does have diabetes.

# Import necessary modules
from sklearn.metrics import classification_report, confusion_matrix

# Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)

# Instantiate a k-NN classifier: knn
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

<script.py> output:
    [[176  30]
     [ 52  50]]
                 precision    recall  f1-score   support
    
              0       0.77      0.85      0.81       206
              1       0.62      0.49      0.55       102
    
    avg / total       0.72      0.73      0.72       308

Logistic regression and the ROC curve | Python

The steps are:

- from sklearn.metrics import roc_curve imports the function.
- y_pred_prob = logreg.predict_proba(X_test)[:,1] sets up $\hat y$.
- fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob) computes fpr, tpr, and thresholds.
- plt.plot([0, 1], [0, 1], 'k--') draws the diagonal.
- plt.plot(fpr, tpr, label='Logistic Regression') draws the ROC curve.

Building a logistic regression model | Python

The first part is the usual routine.

# Import the necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42)

# Create the classifier: logreg
logreg = LogisticRegression()

# Fit the classifier to the training data
logreg.fit(X_train, y_train)

No explanation needed.

# Predict the labels of the test set: y_pred
y_pred = logreg.predict(X_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

<script.py> output:
    [[176  30]
     [ 35  67]]
                 precision    recall  f1-score   support
    
              0       0.83      0.85      0.84       206
              1       0.69      0.66      0.67       102
    
    avg / total       0.79      0.79      0.79       308

Plotting an ROC curve | Python

In [7]: pd.DataFrame(logreg.predict_proba(X_test)).head()
Out[7]: 
          0         1
0  0.604098  0.395902
1  0.760424  0.239576
2  0.796702  0.203298
3  0.772360  0.227640
4  0.571949  0.428051

So we select the second column, index = 1, which holds the predicted probability of class 1.

# Import necessary modules
from sklearn.metrics import roc_curve

# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

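The integrated API is convenient here: there is nothing to tune by hand, you only need to know the idea. A trick for reading the plot: the x-axis is fpr and the y-axis is tpr.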

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

Precision-recall Curve | Python

$\Box$ add precision-recall-curve

Area under the ROC curve | Python

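from sklearn.metrics import roc_auc_score imports the function. It is called as roc_auc_score(y_test, y_pred_prob); remember that the $\hat y$ passed in must be a probability.

With from sklearn.model_selection import cross_val_score, the AUC can also be computed through CV.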

In [8]: cv_scores = cross_val_score(logreg, X, y, cv=5,
...:                             scoring='roc_auc')

This works as well.

AUC computation | Python

Say you have a binary classifier that in fact is just randomly making guesses. It would be correct approximately 50% of the time, and the resulting ROC curve would be a diagonal line in which the True Positive Rate and False Positive Rate are always equal. The Area under this ROC curve would be 0.5. This is one way in which the AUC, which Hugo discussed in the video, is an informative metric to evaluate a model. If the AUC is greater than 0.5, the model is better than random guessing. Always a good sign!

In other words, pure guessing gives $AUC = 0.5$, and in practice any $AUC > 0.5$ means the model is doing better than chance.
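One way to see why 0.5 is the baseline: the AUC equals the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python; pairwise_auc is a made-up helper name and the scores are toy values.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
auc_partial = pairwise_auc(y_true, [0.1, 0.4, 0.35, 0.8])  # one of four pairs is misordered
auc_uninformative = pairwise_auc(y_true, [0.5, 0.5, 0.5, 0.5])  # every sample scored alike
print(auc_partial, auc_uninformative)
```

A model whose scores carry no information ranks positives above negatives only half the time, which is exactly the diagonal ROC line.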

# Import necessary modules
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

Import the functions.

# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]

This gives $\hat y$.

In [5]: # Compute and print AUC score
        print("AUC: {}".format(roc_auc_score(y_test, y_pred_prob)))
AUC: 0.8254806777079764

# Compute cross-validated AUC scores: cv_auc
cv_auc = cross_val_score(logreg, X, y, cv = 5, scoring = 'roc_auc')

# Print list of AUC scores
print("AUC scores computed using 5-fold cross-validation: {}".format(cv_auc))

<script.py> output:
    AUC: 0.8254806777079764
    AUC scores computed using 5-fold cross-validation: [ 0.80148148  0.8062963   0.81481481  0.86245283  0.8554717 ]

Hyperparameter tuning | Python

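A hyperparameter is not something like $\beta$, which is learned from the data when the model is fitted; it has to be set beforehand and is found by repeatedly trying candidate values.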
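"Trying candidate values" can be written as an explicit loop before reaching for a search helper. The sketch below tunes n_neighbors on the iris data with 5-fold CV; the candidate list is arbitrary and chosen only for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Candidate values for the hyperparameter n_neighbors (chosen arbitrarily here)
candidates = [1, 3, 5, 7, 9]

# Score every candidate with 5-fold CV and keep the mean accuracy
mean_scores = {}
for k in candidates:
    knn = KNeighborsClassifier(n_neighbors=k)
    mean_scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best n_neighbors:", best_k)
```

GridSearchCV automates this kind of loop over every combination in a parameter grid and refits the best model at the end.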

$\Box$ grid-search

Hyperparameter tuning with GridSearchCV | Python

# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression()

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

This sets up the GridSearchCV object and its parameters.

In [12]: # Fit it to the data
         logreg_cv.fit(X,y)
Out[12]: 
GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': array([  1.00000e-05,   8.48343e-05,   7.19686e-04,   6.10540e-03,
         5.17947e-02,   4.39397e-01,   3.72759e+00,   3.16228e+01,
         2.68270e+02,   2.27585e+03,   1.93070e+04,   1.63789e+05,
         1.38950e+06,   1.17877e+07,   1.00000e+08])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [13]: # Print the tuned parameters and score
         print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
         print("Best score is {}".format(logreg_cv.best_score_))
Tuned Logistic Regression Parameters: {'C': 3.7275937203149381}
Best score is 0.7708333333333334

Hyperparameter tuning with RandomizedSearchCV | Python

# Import necessary modules
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Setup the parameters and distributions to sample from: param_dist
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()

# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)

# Fit it to the data
tree_cv.fit(X,y)

In [2]: # Print the tuned parameters and score
        print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
        print("Best score is {}".format(tree_cv.best_score_))
Tuned Decision Tree Parameters: {'max_depth': 3, 'criterion': 'entropy', 'max_features': 3, 'min_samples_leaf': 8}
Best score is 0.7369791666666666

Hold-out set for final evaluation | Python

In hold-out set, hold-out means out of sample and set means dataset.

Hold-out set reasoning | Python

Hold-out set in practice I: Classification | Python

# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Create the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}

# Instantiate the logistic regression classifier: logreg
logreg = LogisticRegression()

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv = 5)

# Fit it to the training data
logreg_cv.fit(X_train, y_train)

In [4]: # Print the optimal parameters and best score
        print("Tuned Logistic Regression Parameter: {}".format(logreg_cv.best_params_))
        print("Tuned Logistic Regression Accuracy: {}".format(logreg_cv.best_score_))
Tuned Logistic Regression Parameter: {'C': 0.43939705607607948, 'penalty': 'l1'}
Tuned Logistic Regression Accuracy: 0.7652173913043478

Hold-out set in practice II: Regression | Python

$$\text{elastic net} = a \times L1 + b \times L2$$
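For reference, the objective that scikit-learn's ElasticNet minimizes can be written in the notation of these notes as follows. Here $N$ is the number of samples, $a_i$ are the coefficients, and $\rho$ stands for the l1_ratio parameter tuned in the grid; the symbols $N$ and $\rho$ are introduced here and do not appear in the course.

$$\frac{1}{2N}\sum_{j=1}^{N}(y_j - \hat y_j)^2 + \alpha \rho \sum_{i=1}^n |a_i| + \frac{\alpha (1 - \rho)}{2} \sum_{i=1}^n a_i^2$$

So in the shorthand above $a = \alpha \rho$ and $b = \frac{\alpha (1 - \rho)}{2}$: $\rho = 1$ gives a pure Lasso penalty and $\rho = 0$ a pure Ridge penalty.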

# Import necessary modules
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)

# Create the hyperparameter grid
l1_space = np.linspace(0, 1, 30)
param_grid = {'l1_ratio': l1_space}

# Instantiate the ElasticNet regressor: elastic_net
elastic_net = ElasticNet()

# Setup the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(elastic_net, param_grid, cv=5)

# Fit it to the training data
gm_cv.fit(X_train, y_train)

# Predict on the test set and compute metrics
y_pred = gm_cv.predict(X_test)
r2 = gm_cv.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
print("Tuned ElasticNet MSE: {}".format(mse))

Preprocessing data | Python

Handling categorical variables

scikit-learn: OneHotEncoder()
pandas: get_dummies() + .drop(..., axis = 1)

Exploring categorical features | Python

If You Feel My Love (Chaow Mix) - Blaxy Girls - 单曲 - 网易云音乐: listening to this today really puts me in the mood to work!

Switching to coffee is a real pleasure.

# Import pandas
import pandas as pd

# Read 'gapminder.csv' into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create a boxplot of life expectancy per region
df.boxplot('life', 'Region', rot=60)

# Show the plot
plt.show()

Note that df.boxplot is a method of pd.DataFrame.

Creating dummy variables | Python

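pd.get_dummies one-hot encodes all categorical variables in df, much like OneHotEncoder(). drop_first=True is the built-in option for dropping the first level of each variable.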

# Create dummy variables: df_region
df_region = pd.get_dummies(df)

# Print the columns of df_region
print(df_region.columns)

# Create dummy variables with drop_first=True: df_region
df_region = pd.get_dummies(df, drop_first=True)

# Print the new columns of df_region
print(df_region.columns)
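To make the effect of `drop_first=True` concrete, here is a minimal sketch on a tiny made-up DataFrame. The `toy` data and its values are hypothetical stand-ins for the Gapminder data, not the course dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the Gapminder DataFrame
toy = pd.DataFrame({
    "life": [75.3, 58.1, 80.2, 66.0],
    "Region": ["America", "Africa", "Europe", "Africa"],
})

# One indicator column per category level
full = pd.get_dummies(toy)

# drop_first=True drops the first level (here 'Africa'),
# which is implied when all remaining indicators are 0
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))     # ['life', 'Region_Africa', 'Region_America', 'Region_Europe']
print(list(reduced.columns))  # ['life', 'Region_America', 'Region_Europe']
```

The numeric column `life` passes through unchanged; only the string column is expanded.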

Regression with categorical features | Python

# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Instantiate a ridge regressor: ridge
# (normalize=True works with the scikit-learn 0.19 used here; the argument was removed in 1.2)
ridge = Ridge(alpha=0.5, normalize=True)

# Perform 5-fold cross-validation: ridge_cv
ridge_cv = cross_val_score(ridge, X, y, cv = 5)

# Print the cross-validated scores
print(ridge_cv)

Handling missing data | Python

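Use `.replace(0, np.nan, inplace=True)` to mark zeros as missing, then `.dropna()` to drop the rows with missing values.

Import the imputer with `from sklearn.preprocessing import Imputer`. Then specify the rule with `imp = Imputer(missing_values='NaN', strategy='mean', axis=0)`. `imp.fit(X)` fits it to the variable `X`, and `X = imp.transform(X)` completes the transformation.

Of course, any integrated sklearn workflow relies on a pipeline, and the imputer is compatible with it.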
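The `Imputer` class used in these notes belongs to scikit-learn 0.19; it was removed in 0.22 in favor of `sklearn.impute.SimpleImputer`, which always imputes column-wise and has no `axis` argument. A minimal sketch of the same mean imputation on a newer install, using a made-up array `X_toy`:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy matrix with one missing value per column
X_toy = np.array([[1.0, 2.0],
                  [np.nan, 4.0],
                  [3.0, np.nan]])

# Replace each NaN with the mean of its column
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imp.fit_transform(X_toy)

print(X_filled)
# Column means are 2.0 and 3.0, so the result is
# [[1. 2.]
#  [2. 4.]
#  [3. 3.]]
```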

Dropping missing data | Python

# Convert '?' to NaN
df[df == '?'] = np.nan

This boolean-mask assignment applies across all rows and columns at once.

# Print the number of NaNs
print(df.isnull().sum())

# Print shape of original DataFrame
print("Shape of Original DataFrame: {}".format(df.shape))

# Drop missing values and print shape of new DataFrame
df = df.dropna()

# Print shape of new DataFrame
print("Shape of DataFrame After Dropping All Rows with Missing Values: {}".format(df.shape))

<script.py> output:
    party                  0
    infants               12
    water                 48
    budget                11
    physician             11
    salvador              15
    religious             11
    satellite             14
    aid                   15
    missile               22
    immigration            7
    synfuels              21
    education             31
    superfund             25
    crime                 17
    duty_free_exports     28
    eaa_rsa              104
    dtype: int64
    Shape of Original DataFrame: (435, 17)
    Shape of DataFrame After Dropping All Rows with Missing Values: (232, 17)
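The same steps can be tried on a small frame to see how quickly `dropna()` discards rows. This is only a sketch: the `votes` DataFrame below is invented and merely mimics the layout of the voting data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the voting-records layout
votes = pd.DataFrame({
    "party": ["democrat", "republican", "democrat"],
    "infants": ["y", "?", "n"],
    "water": ["?", "?", "y"],
})

# The boolean mask covers every cell, so all '?' become NaN in one step
votes[votes == "?"] = np.nan

# Count missing values per column
n_missing = votes.isnull().sum()
print(n_missing)

# Drop every row that has at least one missing value
complete = votes.dropna()
print(votes.shape, "->", complete.shape)  # (3, 3) -> (1, 3)
```

Only one of three rows survives here, just as only 232 of 435 rows survive in the output above, which is the motivation for imputing instead of dropping.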

Imputing missing data in a ML Pipeline I | Python

In [3]: Imputer?
axis : integer, optional (default=0)
    The axis along which to impute.

    - If `axis=0`, then impute along columns.

# Import the Imputer module
from sklearn.preprocessing import Imputer
from sklearn.svm import SVC

# Setup the Imputation transformer: imp
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)

A new option here: `strategy='most_frequent'` fills each missing value with the most frequent value of its column.

# Instantiate the SVC classifier: clf
clf = SVC()

# Setup the pipeline with the required steps: steps
steps = [('imputation', imp),
        ('SVM', clf)]

Imputing missing data in a ML Pipeline II | Python

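# Import necessary modules
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report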

# Setup the pipeline steps: steps
steps = [('imputation', Imputer(missing_values='NaN', strategy='most_frequent', axis=0)),
        ('SVM', SVC())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 42)

# Fit the pipeline to the train set
pipeline.fit(X_train, y_train)

In [5]: y_pred = pipeline.predict(X_test)

In [6]: print(classification_report(y_test, y_pred))
             precision    recall  f1-score   support

   democrat       0.99      0.96      0.98        85
 republican       0.94      0.98      0.96        46

avg / total       0.97      0.97      0.97       131
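For readers on scikit-learn 0.22 or later, where `Imputer` no longer exists, the pipeline above might be written with `SimpleImputer` instead. The sketch below uses invented binary data (`X_toy`, `y_toy`) rather than the voting dataset, so it only demonstrates the mechanics, not the reported scores.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical binary votes (1 = yes, 0 = no) with some missing entries
X_toy = np.array([[1, 1], [1, np.nan], [1, 1],
                  [0, 0], [np.nan, 0], [0, 0]], dtype=float)
y_toy = np.array(["democrat", "democrat", "democrat",
                  "republican", "republican", "republican"])

# Imputation and classification chained as a single estimator
pipeline = Pipeline([
    ("imputation", SimpleImputer(strategy="most_frequent")),
    ("SVM", SVC()),
])
pipeline.fit(X_toy, y_toy)

# Missing values in new samples are filled with the modes learned during fit
X_new = np.array([[1.0, np.nan]])
y_new = pipeline.predict(X_new)
print(y_new)
```

Because the imputer lives inside the pipeline, the fill values are learned from the training data only and reused at prediction time.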

Centering and scaling | Python

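$\Box$ To do: summarize normalization, which rescales samples to unit vectors. I honestly do not understand this yet.

k-NN relies mainly on distances, so scaling keeps features with large ranges from dominating the model.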

Standardization $\frac{x-\mu}{\sigma}\to \hat \mu = 0, \hat \sigma = 1$
MinMax $\to \min = 0, \max = 1$
Normalize $\to \min = -1, \max = +1$
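A minimal sketch of these rescalings in scikit-learn, on a made-up array `X_toy`. One caveat on naming: as far as I can tell, scikit-learn's `Normalizer` rescales each sample (row) to unit norm, which is the unit-vector idea mentioned above, while a per-feature range of $[-1, +1]$ can be obtained with `MinMaxScaler(feature_range=(-1, 1))`. Which of the two the course slide means by "normalize" is my assumption.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Hypothetical toy data: two features on very different scales
X_toy = np.array([[1.0, -10.0],
                  [2.0, 0.0],
                  [3.0, 20.0]])

# Standardization: each column gets mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X_toy)

# Min-max scaling: each column is mapped onto [0, 1]
X_minmax = MinMaxScaler().fit_transform(X_toy)

# Min-max scaling onto [-1, +1] instead
X_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X_toy)

# Normalizer: each ROW is divided by its Euclidean norm (unit vectors)
X_unit = Normalizer().fit_transform(X_toy)

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
print(X_pm1.min(axis=0), X_pm1.max(axis=0))
print(np.linalg.norm(X_unit, axis=1))
```

The first three act on columns (features), whereas `Normalizer` acts on rows (samples), so they are not interchangeable.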

So let's check whether scaling improves the k-NN model.

In [6]: from sklearn.preprocessing import StandardScaler
In [7]: steps = [('scaler', StandardScaler()),
   ...:          ('knn', KNeighborsClassifier())]
In [8]: pipeline = Pipeline(steps)
In [9]: X_train, X_test, y_train, y_test = train_test_split(X, y,
   ...: test_size=0.2, random_state=21)
In [10]: knn_scaled = pipeline.fit(X_train, y_train)
In [11]: y_pred = pipeline.predict(X_test)
In [12]: accuracy_score(y_test, y_pred)
Out[12]: 0.956
In [13]: knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)
In [14]: knn_unscaled.score(X_test, y_test)
Out[14]: 0.928

$$\text{Acc}: 0.928\ (\text{unscaled}) \to 0.956\ (\text{scaled})$$

Note that this is not always the case: In the Congressional voting records dataset, for example, all of the features are binary. In such a situation, scaling will have minimal impact.

When the features are dummy variables, however, scaling is unnecessary.

Centering and scaling your data | Python

# Import numpy and scale
import numpy as np
from sklearn.preprocessing import scale

# Scale the features: X_scaled
X_scaled = scale(X)

# Print the mean and standard deviation of the unscaled features
print("Mean of Unscaled Features: {}".format(np.mean(X))) 
print("Standard Deviation of Unscaled Features: {}".format(np.std(X)))

# Print the mean and standard deviation of the scaled features
print("Mean of Scaled Features: {}".format(np.mean(X_scaled))) 
print("Standard Deviation of Scaled Features: {}".format(np.std(X_scaled)))

<script.py> output:
    Mean of Unscaled Features: 18.432687072460002
    Standard Deviation of Unscaled Features: 41.54494764094571
    Mean of Scaled Features: 2.7314972981668206e-15
    Standard Deviation of Scaled Features: 0.9999999999999999
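Note that `np.mean(X)` and `np.std(X)` without an `axis` argument pool all features into one number. Since `scale` standardizes each column separately, a per-column check is arguably more informative. A sketch on synthetic data (the array `X_toy` and its distribution parameters are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical data: two features with very different location and spread
rng = np.random.RandomState(42)
X_toy = np.column_stack([rng.normal(5, 2, 100),
                         rng.normal(100, 30, 100)])

X_scaled = scale(X_toy)

# axis=0 computes one statistic per column (feature)
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

print(col_means)  # both approximately 0
print(col_stds)   # both approximately 1
```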

Centering and scaling in a pipeline | Python

For instance many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

Well said.

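# Import the necessary modules
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split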

# Setup the pipeline steps: steps
steps = [('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())]
        
# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 42)

# Fit the pipeline to the training set: knn_scaled
knn_scaled = pipeline.fit(X_train, y_train)

# Instantiate and fit a k-NN classifier to the unscaled data
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)

# Compute and print metrics
print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))
print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test, y_test)))

<script.py> output:
    Accuracy with Scaling: 0.7700680272108843
    Accuracy without Scaling: 0.6979591836734694

Bringing it all together I: Pipeline for classification | Python

You’ll return to using the SVM classifier you were briefly introduced to earlier in this chapter. The hyperparameters you will tune are $C$ and $\gamma$. $C$ controls the regularization strength. It is analogous to the $C$ you tuned for logistic regression in Chapter 3, while $\gamma$ controls the kernel coefficient: Do not worry about this now as it is beyond the scope of this course.

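# Import necessary modules
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report

# Setup the pipeline
steps = [('scaler', StandardScaler()),
         ('SVM', SVC())]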

pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'SVM__C':[1, 10, 100],
              'SVM__gamma':[0.1, 0.01]}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 21)

# Instantiate the GridSearchCV object: cv
cv = GridSearchCV(pipeline, parameters, cv = 3)

# Fit to the training set
cv.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)

# Compute and print metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))

<script.py> output:
    Accuracy: 0.7795918367346939
                 precision    recall  f1-score   support
    
          False       0.83      0.85      0.84       662
           True       0.67      0.63      0.65       318
    
    avg / total       0.78      0.78      0.78       980
    
    Tuned Model Parameters: {'SVM__C': 10, 'SVM__gamma': 0.1}
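The grid keys `SVM__C` and `SVM__gamma` follow scikit-learn's `<step name>__<parameter>` convention: the part before the double underscore must match the name given to the step in the pipeline. A self-contained sketch of the same pattern, using the built-in iris data as a stand-in because the course dataset is not included in these notes (variable names with the `_iris` suffix are mine):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: iris instead of the course dataset
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=21, stratify=y_iris)

# The step name 'SVM' is what the grid keys refer to
pipe_iris = Pipeline([("scaler", StandardScaler()),
                      ("SVM", SVC())])
grid_iris = {"SVM__C": [1, 10, 100],
             "SVM__gamma": [0.1, 0.01]}

# The scaler is refit on the training part of every CV fold
cv_iris = GridSearchCV(pipe_iris, grid_iris, cv=3)
cv_iris.fit(X_tr, y_tr)

best_params = cv_iris.best_params_
test_accuracy = cv_iris.score(X_te, y_te)
print(best_params)
print(test_accuracy)
```

Searching over the whole pipeline, rather than scaling first and then searching, keeps the validation folds from leaking into the scaler's statistics.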

Bringing it all together II: Pipeline for regression | Python

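# Import necessary modules
import numpy as np
from sklearn.preprocessing import Imputer, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split

# Setup the pipeline steps: steps
steps = [('imputation', Imputer(missing_values='NaN', strategy='mean', axis=0)),
         ('scaler', StandardScaler()),
         ('elasticnet', ElasticNet())]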

# Create the pipeline: pipeline 
pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'elasticnet__l1_ratio':np.linspace(0,1,30)}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.4, random_state = 42)

# Create the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(pipeline, parameters, cv = 3)

# Fit to the training set
gm_cv.fit(X_train, y_train)

# Compute and print the metrics
r2 = gm_cv.score(X_test, y_test)
print("Tuned ElasticNet Alpha: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))

<script.py> output:
    Tuned ElasticNet Alpha: {'elasticnet__l1_ratio': 1.0}
    Tuned ElasticNet R squared: 0.8862016570888217

Final thoughts | Python

Done! Now I can go read the book.

Statement of Accomplishment

Certificate

Images hosted on GitHub load painfully slowly from within China. What a pain.

If $x^2+y^2+z^2=1$, the vector $[x,y,z]$ is called a unit vector. Any vector whose norm is 1 is a unit vector; there are infinitely many of them, one in every direction.