
Supervised Learning with scikit-learn: Study Notes


Supervised Learning with scikit-learn

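This was essentially a review. The course covers supervised learning and variable preprocessing; k-NN, regression, decision trees, logistic regression and SVM are all touched on, with k-NN getting the most attention.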

  • 4 hours
  • 17 Videos
  • 54 Exercises

Taught by this guy: Andreas Müller | DataCamp

Andy is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to machine learning with Python”, describing a practical approach to machine learning with python and scikit-learn.

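Following this pattern, buying his book is the best value. People who produce both videos and books can usually write good slides and papers too, so they have no real weak spots, and watching the videos is efficient because you do not have to look things up all over the place. When I was working on xgboost, I got badly burned by the course from this guy: Sergey Fogelson | DataCamp. One last remark: "if you have questions, let's discuss and improve together" sounds nice, but the gap there was simply too large to bridge.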

Supervised learning | Python

Reinforcement learning has a somewhat Bayesian feel to it.

Software agents interact with an environment

  • Learn how to optimize their behavior
  • Given a system of rewards and punishments
  • Draws inspiration from behavioral psychology

Applications

  • Economics
  • Genetics
  • Game playing

AlphaGo: First computer to defeat the world champion in Go

After watching just the first video, I already felt he teaches well!

Which of these is a classification problem? | Python

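This exercise is worth a look for the data structure used here and the plot produced at the end.

I have not managed to install scikit-learn properly on my local machine. The Anaconda path probably needs to be set. I do not know how yet, so I will try things in Jupyter first and come back to this once I am more familiar with tedious details like paths.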

Installing scikit-learn — scikit-learn 0.19.1 documentation

from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
iris = datasets.load_iris()
In [7]: type(iris)
Out[7]: sklearn.datasets.base.Bunch

I had not seen this data structure before. iris is a dictionary-like object (a Bunch).

In [8]: print(iris.keys())
dict_keys(['data', 'target_names', 'DESCR', 'feature_names', 'target'])
In [9]: type(iris.data), type(iris.target)
Out[9]: (numpy.ndarray, numpy.ndarray)
In [10]: iris.data.shape
Out[10]: (150, 4)

So iris.data is a matrix of shape (150, 4).

In [11]: iris.target_names
Out[11]: array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
X = iris.data
y = iris.target

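Here X is a (150, 4) array and y is a vector of length 150. They do not need to be joined for plotting, because the rows line up through the implicit index.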

df = pd.DataFrame(X, columns=iris.feature_names)

This turns the matrix into a pd.DataFrame.

_ = pd.plotting.scatter_matrix(df, c = y, figsize = [8, 8],s=150, marker='D')
plt.show()

marker='D' stands for diamond; this can be looked up in markers — Matplotlib 2.1.1 documentation.

s is the marker size (not the shape) and c is the color. See plt.scatter?

Marker size is scaled by s and marker color is mapped to c.

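In the plot, the last panel of the third row clearly separates the classes very well, which suggests an idea for a model.

As an aside, deploy failures are often caused by the network, so using the phone's connection is the most stable option.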

Visual EDA | Python

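The earlier plots, scatter plots and histograms, were all for continuous variables. For binary variables a countplot is the natural choice. palette='RdBu' means Red and Blue. plt.figure() opens a new figure, which serves a similar purpose to plt.clf() here; without it the plots would overlap.

import seaborn as sns

plt.figure()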
sns.countplot(x='education', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()

The classification challenge | Python

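KNN (k-Nearest Neighbors) is quite different from methods such as KMeans. It looks at the nearest few points and takes a majority vote (the mode of their labels).

The data must not contain missing values, but with the sklearn package this is not a problem; for details see python中变量批量处理集成方案 - A Hugo website.

KNN is nevertheless a supervised learning method.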

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(iris['data'], iris['target'])

As you can see, y is still required.

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=6, p=2,
           weights='uniform')

Also, iris['data'].shape is (150, 4) and iris['target'].shape is (150,).

k-Nearest Neighbors: Fit | Python

# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier

Import the package.

y = df['party'].values
X = df.drop('party', axis=1).values

Adding .values here keeps y and X as plain NumPy arrays.

# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the data
knn.fit(X,y)

n_neighbors = 6 means that the 6 nearest neighbors vote and the majority wins.
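The voting idea can be sketched in a few lines of NumPy. This is only an illustration under simple assumptions (Euclidean distance, uniform weights, hypothetical toy data and a hypothetical helper `knn_predict`), not how KNeighborsClassifier is implemented internally; in particular, tie handling may differ.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, n_neighbors=6):
    """Predict one label by majority vote among the nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the n_neighbors closest training points
    nearest = np.argsort(distances)[:n_neighbors]
    # Most frequent label among those neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

label_near_origin = knn_predict(X_toy, y_toy, np.array([0.3, 0.3]), n_neighbors=3)
label_near_five = knn_predict(X_toy, y_toy, np.array([4.8, 5.0]), n_neighbors=3)
print(label_near_origin, label_near_five)
```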

k-Nearest Neighbors: Predict | Python

# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier 

# Create arrays for the features and the response variable
y = df.party.values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors: knn
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the data
knn.fit(X,y)

# Predict the labels for the training data X
y_pred = knn.predict(X)

# Predict and print the label for the new data point X_new
new_prediction = knn.predict(X_new)
print("Prediction: {}".format(new_prediction))

Measuring model performance | Python

knn.score(X_test, y_test) gives \(Acc\). stratify=y is set in train_test_split.

stratify : array-like or None (default is None)
    If not None, data is split in a stratified fashion, using this as
    the class labels.

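train_test_split(X, y, stratify=y) - CSDN博客: with stratify=X the split follows the proportions in X, and with stratify=y it follows the class proportions in y. In practice it is almost always stratify=y.

With a large dataset a purely random split should be fairly stable, I think. But when some classes are rare, watch out for this pitfall: a random split can leave a rare class badly under-represented in one of the sets.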
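A small check of what stratify=y does, using hypothetical imbalanced labels (90 zeros, 10 ones) rather than the course data. With stratification the share of the rare class should be the same in the training and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 zeros and 10 ones
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Share of class 1 in each part of the split
share_train = y_tr.mean()
share_test = y_te.mean()
print(share_train, share_test)
```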

\[\text{Larger } k \to \text{smoother decision boundary} \to \text{less complex model}\]

\[\text{Smaller } k \to \text{more complex model} \to \text{can lead to overfitting}\]

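This is easy to understand: a jagged boundary means the model is pulling too many tricks. It looks counterintuitive at first, but it is not: \(n \uparrow\) (more neighbors) does not add parameters, it effectively removes them.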

The digits recognition dataset | Python

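\(\boxtimes\) KNN can also be used for multi-class classification. Because it works directly on distances in feature space, it handles around 10 classes of y quite well, and in my impression better than a softmax model. The most intuitive way to understand KNN: take the \(n\) points nearest to a given point; the label (value of y) that occurs most often among those \(n\) points becomes \(\hat y\).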

So .load_digits() is how sklearn's sample datasets are loaded.

# Import necessary modules
from sklearn import datasets
import matplotlib.pyplot as plt

# Load the digits dataset: digits
digits = datasets.load_digits()
In [4]: print(digits.keys())
dict_keys(['DESCR', 'data', 'images', 'target', 'target_names'])

digits is dictionary-like, and these are all of its keys.

In [5]: print(digits.DESCR)
Optical Recognition of Handwritten Digits Data Set
===================================================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.

References
----------
  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
    Graduate Studies in Science and Engineering, Bogazici University.
  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
    Linear dimensionalityreduction using relevance weighted LDA. School of
    Electrical and Electronic Engineering Nanyang Technological University.
    2005.
  - Claudio Gentile. A New Approximate Maximal Margin Classification
    Algorithm. NIPS. 2000.

This is essentially a description of the data.

In [6]: print(digits.images.shape)
(1797, 8, 8)

At first I did not quite understand why the shape has three elements.

In [7]: type(digits.images)
Out[7]: numpy.ndarray

It is a NumPy array.

In [10]: digits.images[0:2]
Out[10]: 
array([[[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
        [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
        [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
        [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
        [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
        [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
        [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
        [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]],

       [[  0.,   0.,   0.,  12.,  13.,   5.,   0.,   0.],
        [  0.,   0.,   0.,  11.,  16.,   9.,   0.,   0.],
        [  0.,   0.,   3.,  15.,  16.,   6.,   0.,   0.],
        [  0.,   7.,  15.,  16.,  16.,   2.,   0.,   0.],
        [  0.,   0.,   1.,  16.,  16.,   3.,   0.,   0.],
        [  0.,   0.,   1.,  16.,  16.,   6.,   0.,   0.],
        [  0.,   0.,   1.,  16.,  16.,   6.,   0.,   0.],
        [  0.,   0.,   0.,  11.,  16.,  10.,   0.,   0.]]])

Looking at the first two samples: each sample is itself an 8x8 matrix, so stacking the 1797 samples gives three dimensions.

In [11]: digits.images[0:2,0]
Out[11]: 
array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
       [  0.,   0.,   0.,  12.,  13.,   5.,   0.,   0.]])

This takes the first row of each of the first two sample matrices. That completes the explanation of the samples.
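The same indexing can be tried on a tiny stand-in array. The array below is hypothetical (3 "images" of 2x2 pixels) and only mimics the layout of digits.images; flattening each image into one row is, as far as I understand, how the (1797, 64) digits.data relates to the (1797, 8, 8) digits.images.

```python
import numpy as np

# Hypothetical stand-in for digits.images: 3 "images" of 2x2 pixels
images = np.arange(12).reshape(3, 2, 2)

# First pixel row of each of the first two images
first_rows = images[0:2, 0]

# One flat row of pixels per image, the layout a classifier expects
flat = images.reshape(len(images), -1)

print(images.shape, first_rows.shape, flat.shape)
```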

# Display digit 1010
plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

So this dataset is about recognizing handwritten digits.

Train/Test Split + Fit/Predict/Accuracy | Python

# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

Now we start computing \(Acc\); import the packages first.

# Create feature and target arrays
X = digits.data
y = digits.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)

Split the data into a training set and a test set.

# Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors = 7)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))
<script.py> output:
    0.983333333333

The predictive power is remarkably strong.

Overfitting and underfitting | Python

In for i, k in enumerate(neighbors):, k is the value of n_neighbors and i is the index used to record \(Acc\).

# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN Classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors = k)

    # Fit the classifier to the training data
    knn.fit(X_train,y_train)
    
    #Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train,y_train)

    #Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()

Taking a break to watch some TV.

Introduction to regression | Python

.reshape: Returns an array containing the same data with a new shape. Refer to numpy.reshape for full documentation.

import numpy
numpy.reshape?

python reshape用法 - a3335581的博客 - CSDN博客

Reading this makes it clear.

y.reshape(-1, 1) produces a column vector: -1 lets NumPy infer the number of rows, and 1 fixes a single column. So after

y = y.reshape(-1, 1)
X_rooms = X_rooms.reshape(-1, 1)

the new y and X_rooms are both two-dimensional column arrays.
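A quick way to see the effect of reshape(-1, 1), using a hypothetical three-element array:

```python
import numpy as np

y_flat = np.array([1.0, 2.0, 3.0])

# -1: infer the number of rows; 1: exactly one column
y_col = y_flat.reshape(-1, 1)

print(y_flat.shape, y_col.shape)
```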

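Finally, np.linspace is used to generate x values along which the line \(\hat y = X \hat \beta\) is drawn.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit(X_rooms, y)
# Evenly spaced x values covering the observed range, as a column vector
prediction_space = np.linspace(X_rooms.min(), X_rooms.max()).reshape(-1, 1)
plt.scatter(X_rooms, y, color='blue')
# Draw the fitted line (plt.plot rather than plt.scatter, so the points are connected)
plt.plot(prediction_space, reg.predict(prediction_space), color='black', linewidth=3)
plt.show()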

Importing data for supervised learning | Python

# Import numpy and pandas
import numpy as np
import pandas as pd

# Read the CSV file into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create arrays for features and target variable
y = df.life.values
X = df.fertility.values

# Print the dimensions of X and y before reshaping
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X.shape))

# Reshape X and y
y = y.reshape(-1,1)
X = X.reshape(-1,1)

# Print the dimensions of X and y after reshaping
print("Dimensions of y after reshaping: {}".format(y.shape))
print("Dimensions of X after reshaping: {}".format(X.shape))

The "{}".format() style is quite intuitive, nicer than in R.

<script.py> output:
    Dimensions of y before reshaping: (139,)
    Dimensions of X before reshaping: (139,)
    Dimensions of y after reshaping: (139, 1)
    Dimensions of X after reshaping: (139, 1)

The latter format is clearly the better one.

Exploring the Gapminder data | Python

seaborn.heatmap, sns.heatmap(df.corr(), square=True, cmap='RdYlGn') \(\to\) .info(), .describe(), .head().

EDA again.

In [1]: sns.heatmap(df.corr(), square=True, cmap='RdYlGn')
... 
... 
Out[1]: <matplotlib.axes._subplots.AxesSubplot at 0x7fd8205810f0>

In [2]: plt.show()

The basics of linear regression | Python

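It just occurred to me that with this kind of integrated workflow it becomes harder to tinker with the regression equation itself, for example by adding an interaction term.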

Fit & predict for regression | Python

# Import LinearRegression
from sklearn.linear_model import LinearRegression

# Create the regressor: reg
reg = LinearRegression()

# Create the prediction space
prediction_space = np.linspace(min(X_fertility), max(X_fertility)).reshape(-1,1)

# Fit the model to the data
reg.fit(X_fertility, y)

# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)

# Print R^2 
print(reg.score(X_fertility, y))

# Plot regression line
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.show()

np.linspace(min(X_fertility),max(X_fertility)).reshape(-1,1) is the usual trick for drawing \(\hat y = X \hat \beta\). For OLS, reg.score(X_fertility, y) returns \(R^2\).
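That reg.score returns \(R^2\) can be checked against sklearn.metrics.r2_score. The data below is synthetic and purely illustrative (the names X_syn and y_syn are not from the course).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: a noisy straight line
rng = np.random.RandomState(0)
X_syn = rng.uniform(0, 10, size=(50, 1))
y_syn = 3.0 * X_syn[:, 0] + 2.0 + rng.normal(scale=1.0, size=50)

reg_syn = LinearRegression().fit(X_syn, y_syn)

# .score() on a regressor and r2_score on its predictions should agree
score_from_method = reg_syn.score(X_syn, y_syn)
score_from_metric = r2_score(y_syn, reg_syn.predict(X_syn))
print(score_from_method, score_from_metric)
```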

Train/test split for regression | Python

# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

# Create the regressor: reg_all
reg_all = LinearRegression()

# Fit the regressor to the training data
reg_all.fit(X_train, y_train)

# Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)

# Compute and print R^2 and RMSE
print("R^2: {}".format(reg_all.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
<script.py> output:
    R^2: 0.838046873142936
    Root Mean Squared Error: 3.2476010800377213

Cross-validation | Python

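I remember this was also explained in the xgboost course: cross-validation is needed because \(R_{test}^2\) depends heavily on how the data happens to be split.

The figure makes this clear at a glance.

Also, cross_val_score here reports \(R^2\) by default, giving \(R_{CV}^2\); the name itself reads as CV + score.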

5-fold cross-validation | Python

from sklearn.model_selection import cross_val_score shows that cross-validation clearly belongs to model selection.

# Import the necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Create a linear regression object: reg
reg = LinearRegression()

# Compute 5-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(reg, X, y, cv = 5)

# Print the 5-fold cross-validation scores
print(cv_scores)

print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))
<script.py> output:
    [ 0.81720569  0.82917058  0.90214134  0.80633989  0.94495637]
    Average 5-Fold CV Score: 0.8599627722793232

Haha.
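What cross_val_score does for a regressor can be spelled out as a manual loop. This sketch assumes that cv=5 corresponds to an unshuffled KFold with 5 splits and that the score is the estimator's own .score(), and it uses synthetic data instead of the Gapminder set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data
rng = np.random.RandomState(0)
X_syn = rng.normal(size=(100, 3))
y_syn = X_syn @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Manual 5-fold CV: fit on four folds, score (R^2) on the held-out fold
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_syn):
    model = LinearRegression().fit(X_syn[train_idx], y_syn[train_idx])
    manual_scores.append(model.score(X_syn[test_idx], y_syn[test_idx]))

auto_scores = cross_val_score(LinearRegression(), X_syn, y_syn, cv=5)
print(np.mean(manual_scores), np.mean(auto_scores))
```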

K-Fold CV comparison | Python

# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Create a linear regression object: reg
reg = LinearRegression()

# Perform 3-fold CV
cvscores_3 = cross_val_score(reg,X,y,cv = 3)
print(np.mean(cvscores_3))

# Perform 10-fold CV
cvscores_10 = cross_val_score(reg,X,y,cv = 10)
print(np.mean(cvscores_10))
<script.py> output:
    0.871871278262
    0.843612862013

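Note that in this output the 10-fold average (0.8436) is actually lower than the 3-fold average (0.8719). So \(cv \uparrow\) does not automatically give a better score; it gives more estimates, each trained on more data, at a higher computational cost.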

Regularized regression | Python

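\[\text{Loss function} = \text{OLS loss function} + \alpha \sum_{i=1}^n a_{i}^2\]

Here, \(\alpha\) too large \(\to\) underfitting, and \(\alpha\) too small \(\to\) overfitting. So this is not a very objective parameter; it depends heavily on human choice.

What does normalize mean? According to help(Ridge), it rescales the variables before fitting so that each becomes a unit vector, because the penalty depends on the units of the variables and would otherwise distort the regularization.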

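The difference between standardization and normalization

In short, standardization processes the feature matrix column by column: by computing z-scores it brings the feature values onto a common scale. Normalization processes the feature matrix row by row: its purpose is to give sample vectors a uniform scale when similarity is computed with dot products or other kernel functions, that is, every sample is turned into a "unit vector". The formula for l2 normalization is:

\[\tilde x = \frac{x}{\|x\|_2}\]

\[\|x\|_2 = \sqrt{\sum_{i=1}^m x_i^2}\]

where \(x\) is a row vector, i.e. the data of one user, and \(m\) is the number of features.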
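The column-wise versus row-wise distinction can be checked directly with NumPy. The small matrix below is hypothetical; the two operations are written out by hand rather than taken from any sklearn helper.

```python
import numpy as np

# Hypothetical feature matrix: 3 samples (rows), 2 features (columns)
X_small = np.array([[1.0, 200.0],
                    [2.0, 400.0],
                    [3.0, 600.0]])

# Standardization: z-score each column (feature)
X_std = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)

# l2 normalization: scale each row (sample) to unit length
X_norm = X_small / np.linalg.norm(X_small, axis=1, keepdims=True)

print(X_std.mean(axis=0), np.linalg.norm(X_norm, axis=1))
```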

\[\text{Loss function} = \text{OLS loss function} + \alpha \sum_{i=1}^n |a_{i}|\]

Regularization I: Lasso | Python

# Import Lasso
from sklearn.linear_model import Lasso

# Instantiate a lasso regressor: lasso
lasso = Lasso(alpha = 0.4, normalize=True)

# Fit the regressor to the data
lasso.fit(X,y)

# Compute and print the coefficients
lasso_coef = lasso.coef_
print(lasso_coef)

# Plot the coefficients
plt.plot(range(len(df_columns)), lasso_coef)
plt.xticks(range(len(df_columns)), df_columns.values, rotation=60)
plt.margins(0.02)
plt.show()

Regularization II: Ridge | Python

  • \(L1 \to\) Lasso \(\gets + \alpha \sum_{i=1}^n|a_{i}|\)
  • \(L2 \to\) Ridge \(\gets + \alpha \sum_{i=1}^na_{i}^2\)
# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
# Setup the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []
logspace(start, stop, num=50, endpoint=True, base=10.0, dtype=None)
    Return numbers spaced evenly on a log scale.
    
    In linear space, the sequence starts at ``base ** start``
    (`base` to the power of `start`) and ends with ``base ** stop``
    (see `endpoint` below).
    
    Parameters
    ----------
    start : float
        ``base ** start`` is the starting value of the sequence.
    stop : float
        ``base ** stop`` is the final value of the sequence, unless `endpoint`
        is False.  In that case, ``num + 1`` values are spaced over the
        interval in log-space, of which all but the last (a sequence of
        length `num`) are returned.
    num : integer, optional
        Number of samples to generate.  Default is 50.
In [3]: np.logspace(-4, 0, 50)
Out[3]: 
array([  1.00000000e-04,   1.20679264e-04,   1.45634848e-04,
         1.75751062e-04,   2.12095089e-04,   2.55954792e-04,
         3.08884360e-04,   3.72759372e-04,   4.49843267e-04,
         5.42867544e-04,   6.55128557e-04,   7.90604321e-04,
         9.54095476e-04,   1.15139540e-03,   1.38949549e-03,
         1.67683294e-03,   2.02358965e-03,   2.44205309e-03,
         2.94705170e-03,   3.55648031e-03,   4.29193426e-03,
         5.17947468e-03,   6.25055193e-03,   7.54312006e-03,
         9.10298178e-03,   1.09854114e-02,   1.32571137e-02,
         1.59985872e-02,   1.93069773e-02,   2.32995181e-02,
         2.81176870e-02,   3.39322177e-02,   4.09491506e-02,
         4.94171336e-02,   5.96362332e-02,   7.19685673e-02,
         8.68511374e-02,   1.04811313e-01,   1.26485522e-01,
         1.52641797e-01,   1.84206997e-01,   2.22299648e-01,
         2.68269580e-01,   3.23745754e-01,   3.90693994e-01,
         4.71486636e-01,   5.68986603e-01,   6.86648845e-01,
         8.28642773e-01,   1.00000000e+00])

So np.logspace(-4, 0, 50) produces a sequence of 50 values evenly spaced on a \(\log_{10}\) scale, running from \(10^{-4}\) to \(10^{0}\) in increasing order.
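One way to convince yourself of this: the same grid can be built from np.linspace on the exponents. The variable names here are just for illustration.

```python
import numpy as np

alpha_grid = np.logspace(-4, 0, 50)

# Equivalent construction: evenly spaced exponents, then 10 ** exponent
same_grid = 10.0 ** np.linspace(-4, 0, 50)

# On a log scale, consecutive values differ by a constant ratio
ratios = alpha_grid[1:] / alpha_grid[:-1]
print(alpha_grid[0], alpha_grid[-1], ratios[0])
```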

# Create a ridge regressor: ridge
ridge = Ridge(normalize=True)

For each alpha, record the mean and the standard deviation of the ten cross-validation scores.

# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Setup the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []

# Create a ridge regressor: ridge
ridge = Ridge(normalize=True)

# Compute scores over range of alphas
for alpha in alpha_space:

    # Specify the alpha value to use: ridge.alpha
    ridge.alpha = alpha
    
    # Perform 10-fold CV: ridge_cv_scores
    ridge_cv_scores = cross_val_score(ridge, X,y, cv = 10)
    
    # Append the mean of ridge_cv_scores to ridge_scores
    ridge_scores.append(np.mean(ridge_cv_scores))
    
    # Append the std of ridge_cv_scores to ridge_scores_std
    ridge_scores_std.append(np.std(ridge_cv_scores))

# Display the plot
display_plot(ridge_scores, ridge_scores_std)

\(\Box\)add plot best_alpha

Going back to the plotting function now, it makes sense.

def display_plot(cv_scores, cv_scores_std):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.plot(alpha_space, cv_scores)

    std_error = cv_scores_std / np.sqrt(10)

    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha=0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()

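ax.plot(alpha_space, cv_scores) shows that \(R^2\) is computed as a function of \(\alpha\). The standard deviation is needed to draw the shaded band, a rough confidence interval. Why divide by \(\sqrt{10}\)? Because the standard error of the mean is \(\sigma_{\bar x} = \frac{\hat \sigma}{n^{\frac{1}{2}}}\), with \(n = 10\) folds here.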

How good is your model? | Python

The focus here is on the F1 score.

A quick review.

\[Recall = \frac{TP}{TP+FN} \to \text{share of actual positives that are found}\]

\[Precision = \frac{TP}{TP+FP} \to \text{share of predicted positives that are correct}\]

\[F1 \text{ score} = \frac{2}{\frac{1}{Recall} + \frac{1}{Precision}}\]

In [7]: print(confusion_matrix(y_test, y_pred))
[[52  7]
 [ 3 112]]
In [8]: print(classification_report(y_test, y_pred))
             precision    recall  f1-score   support
          0       0.95      0.88      0.91        59
          1       0.94      0.97      0.96       115
avg / total       0.94      0.94      0.94       174

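Depending on which class is treated as the positive one (and hence what counts as TP and TN), \(Recall\) and \(Precision\) naturally differ.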
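The numbers in the classification report can be reproduced by hand from the confusion matrix printed above, treating each class in turn as the positive one. The helper name metrics_for_class is made up for this sketch.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions
cm = [[52, 7],
      [3, 112]]

def metrics_for_class(cm, k):
    """Precision, recall and F1 when class k is treated as positive."""
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                  # true class k, predicted otherwise
    fp = sum(row[k] for row in cm) - tp   # predicted k, true class otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p0, r0, f0 = metrics_for_class(cm, 0)
p1, r1, f1_1 = metrics_for_class(cm, 1)
print(round(p0, 2), round(r0, 2), round(f0, 2))
print(round(p1, 2), round(r1, 2), round(f1_1, 2))
```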

Metrics for classification | Python

You may have noticed in the video that the classification report consisted of three rows, and an additional support column.

The support gives the number of samples of the true response that lie in that class - so in the video example, the support was the number of Republicans or Democrats in the test set on which the classification report was computed.

The goal is to predict whether or not a given female patient will contract diabetes based on features such as BMI, age, and number of pregnancies. Therefore, it is a binary classification problem. A target value of 0 indicates that the patient does not have diabetes, while a value of 1 indicates that the patient does have diabetes.

# Import necessary modules
from sklearn.metrics import classification_report, confusion_matrix

# Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)

# Instantiate a k-NN classifier: knn
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
<script.py> output:
    [[176  30]
     [ 52  50]]
                 precision    recall  f1-score   support
    
              0       0.77      0.85      0.81       206
              1       0.62      0.49      0.55       102
    
    avg / total       0.72      0.73      0.72       308

Logistic regression and the ROC curve | Python

  • from sklearn.metrics import roc_curve imports the function.
  • y_pred_prob = logreg.predict_proba(X_test)[:,1] sets up \(\hat y\).
  • fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob) computes fpr, tpr and thresholds.
  • plt.plot([0, 1], [0, 1], 'k--') draws the diagonal.
  • plt.plot(fpr, tpr, label='Logistic Regression') draws the ROC curve.

Building a logistic regression model | Python

The first part is the same old routine.

# Import the necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42)

# Create the classifier: logreg
logreg = LogisticRegression()

# Fit the classifier to the training data
logreg.fit(X_train, y_train)

No explanation needed.

# Predict the labels of the test set: y_pred
y_pred = logreg.predict(X_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
<script.py> output:
    [[176  30]
     [ 35  67]]
                 precision    recall  f1-score   support
    
              0       0.83      0.85      0.84       206
              1       0.69      0.66      0.67       102
    
    avg / total       0.79      0.79      0.79       308

Plotting an ROC curve | Python

In [7]: pd.DataFrame(logreg.predict_proba(X_test)).head()
Out[7]: 
          0         1
0  0.604098  0.395902
1  0.760424  0.239576
2  0.796702  0.203298
3  0.772360  0.227640
4  0.571949  0.428051

So we take the second column (index = 1), the predicted probability of class 1.

# Import necessary modules
from sklearn.metrics import roc_curve

# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]
# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

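The integrated tooling is nice here: there is no need to debug anything yourself, you only need to understand the idea. A trick for remembering the plot: x axis is fpr, y axis is tpr.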

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
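Each point on the ROC curve is just the (fpr, tpr) pair obtained at one probability threshold. A hand-rolled sketch with hypothetical labels and scores (the helper roc_point is not part of sklearn):

```python
import numpy as np

# Hypothetical true labels and predicted probabilities of class 1
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

def roc_point(y_true, y_score, threshold):
    """Return (fpr, tpr) when predicting 1 for scores >= threshold."""
    y_hat = (y_score >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    return fp / (fp + tn), tp / (tp + fn)

fpr_half, tpr_half = roc_point(y_true, y_score, 0.5)
fpr_low, tpr_low = roc_point(y_true, y_score, 0.3)
print(fpr_half, tpr_half, fpr_low, tpr_low)
```

Lowering the threshold moves the point up and to the right: more true positives are caught, at the price of more false positives.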

Precision-recall Curve | Python

\(\Box\) add precision-recall-curve

Area under the ROC curve | Python

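from sklearn.metrics import roc_auc_score imports the function. Use it as roc_auc_score(y_test, y_pred_prob); remember that the \(\hat y\) passed in must be a probability.

AUC can also be obtained through cross-validation, via from sklearn.model_selection import cross_val_score.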

In [8]: cv_scores = cross_val_score(logreg, X, y, cv=5,
...:                             scoring='roc_auc')

This works as well.

AUC computation | Python

Say you have a binary classifier that in fact is just randomly making guesses. It would be correct approximately 50% of the time, and the resulting ROC curve would be a diagonal line in which the True Positive Rate and False Positive Rate are always equal. The Area under this ROC curve would be 0.5. This is one way in which the AUC, which Hugo discussed in the video, is an informative metric to evaluate a model. If the AUC is greater than 0.5, the model is better than random guessing. Always a good sign!

In other words, pure guessing gives \(AUC = 0.5\), and in practice anything with \(AUC > 0.5\) is already doing better than chance.

# Import necessary modules
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

Import the packages.

# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]

This gives \(\hat y\).

In [5]: # Compute and print AUC score
        print("AUC: {}".format(roc_auc_score(y_test, y_pred_prob)))
AUC: 0.8254806777079764
# Compute cross-validated AUC scores: cv_auc
cv_auc = cross_val_score(logreg, X, y, cv = 5, scoring = 'roc_auc')

# Print list of AUC scores
print("AUC scores computed using 5-fold cross-validation: {}".format(cv_auc))
<script.py> output:
    AUC: 0.8254806777079764
    AUC scores computed using 5-fold cross-validation: [ 0.80148148  0.8062963   0.81481481  0.86245283  0.8554717 ]

Hyperparameter tuning | Python

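Definition of a hyperparameter: it is not something like \(\beta\), which is learned when the model is fitted, but a setting that has to be fixed before fitting and is chosen by trying different values in turn.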

\(\Box\) grid-search

Hyperparameter tuning with GridSearchCV | Python

# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression()
# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

This sets up the GridSearchCV parameters.

In [12]: # Fit it to the data
         logreg_cv.fit(X,y)
Out[12]: 
GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': array([  1.00000e-05,   8.48343e-05,   7.19686e-04,   6.10540e-03,
         5.17947e-02,   4.39397e-01,   3.72759e+00,   3.16228e+01,
         2.68270e+02,   2.27585e+03,   1.93070e+04,   1.63789e+05,
         1.38950e+06,   1.17877e+07,   1.00000e+08])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
In [13]: # Print the tuned parameters and score
         print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
         print("Best score is {}".format(logreg_cv.best_score_))
Tuned Logistic Regression Parameters: {'C': 3.7275937203149381}
Best score is 0.7708333333333334

Hyperparameter tuning with RandomizedSearchCV | Python

# Import necessary modules
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
# Setup the parameters and distributions to sample from: param_dist
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}
# Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()

# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
# Fit it to the data
tree_cv.fit(X,y)
In [2]: # Print the tuned parameters and score
        print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
        print("Best score is {}".format(tree_cv.best_score_))
Tuned Decision Tree Parameters: {'max_depth': 3, 'criterion': 'entropy', 'max_features': 3, 'min_samples_leaf': 8}
Best score is 0.7369791666666666
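A note on randint(1, 9) in param_dist: scipy's randint samples integers from the lower bound up to, but not including, the upper bound, so the candidates are 1 through 8. A quick check (the sample size and seed are arbitrary):

```python
from scipy.stats import randint

# Discrete uniform distribution over the integers 1, 2, ..., 8
dist = randint(1, 9)

# Draw a reproducible sample to inspect the range
samples = dist.rvs(size=1000, random_state=0)
print(samples.min(), samples.max())
```

This fits the tuned result above, where max_features is 3 and min_samples_leaf is 8.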

Hold-out set for final evaluation | Python

In "hold-out set", "hold-out" means out of sample (data held back from training and tuning), and "set" means dataset.

Hold-out set in practice I: Classification | Python

# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Create the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}

# Instantiate the logistic regression classifier: logreg
# solver='liblinear' supports both the 'l1' and 'l2' penalties in the grid
logreg = LogisticRegression(solver='liblinear')

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)
# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv = 5)
# Fit it to the training data
logreg_cv.fit(X_train, y_train)
In [4]: # Print the optimal parameters and best score
        print("Tuned Logistic Regression Parameter: {}".format(logreg_cv.best_params_))
        print("Tuned Logistic Regression Accuracy: {}".format(logreg_cv.best_score_))
Tuned Logistic Regression Parameter: {'C': 0.43939705607607948, 'penalty': 'l1'}
Tuned Logistic Regression Accuracy: 0.7652173913043478

Hold-out set in practice II: Regression | Python

\[\text{elastic net penalty} = a \times L1 + b \times L2\]
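For reference, this is how the weights \(a\) and \(b\) map onto scikit-learn's ElasticNet parameters, as I understand the documented objective. Here \(\rho\) stands for l1_ratio, \(w\) for the coefficient vector and \(n\) for the number of samples; these symbols are introduced only for this note.

\[\frac{1}{2n}\|y - Xw\|_2^2 + \alpha \rho \|w\|_1 + \frac{\alpha (1-\rho)}{2}\|w\|_2^2\]

So \(a = \alpha \rho\) and \(b = \frac{\alpha (1-\rho)}{2}\): l1_ratio = 1 gives a pure L1 (Lasso) penalty, and l1_ratio = 0 gives a pure L2 penalty.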

# Import necessary modules
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)

# Create the hyperparameter grid
l1_space = np.linspace(0, 1, 30)
param_grid = {'l1_ratio': l1_space}

# Instantiate the ElasticNet regressor: elastic_net
elastic_net = ElasticNet()

# Setup the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(elastic_net, param_grid, cv=5)

# Fit it to the training data
gm_cv.fit(X_train, y_train)

# Predict on the test set and compute metrics
y_pred = gm_cv.predict(X_test)
r2 = gm_cv.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
print("Tuned ElasticNet MSE: {}".format(mse))

Preprocessing data | Python

Handling categorical variables

  • scikit-learn: OneHotEncoder()
  • pandas: get_dummies() + .drop(..., axis = 1)

Exploring categorical features | Python

If You Feel My Love (Chaow Mix) - Blaxy Girls - Single - NetEase Cloud Music. Listening to this today puts me in the mood to get work done!

Switching to coffee is a real pleasure.

# Import pandas and matplotlib (plt is needed for plt.show() below)
import pandas as pd
import matplotlib.pyplot as plt

# Read 'gapminder.csv' into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create a boxplot of life expectancy per region
df.boxplot('life', 'Region', rot=60)

# Show the plot
plt.show()

Note that df.boxplot is a method of pd.DataFrame.

Creating dummy variables | Python

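pd.get_dummies(df) one-hot encodes all categorical variables in df at once, much like OneHotEncoder(); drop_first=True is the built-in option for dropping the redundant first level.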

# Create dummy variables: df_region
df_region = pd.get_dummies(df)

# Print the columns of df_region
print(df_region.columns)

# Create dummy variables with drop_first=True: df_region
df_region = pd.get_dummies(df, drop_first=True)

# Print the new columns of df_region
print(df_region.columns)
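To see what drop_first=True changes, here is a small sketch on a made-up DataFrame. The toy values and the names toy, full, and reduced are assumptions for illustration, not the gapminder data.

```python
import pandas as pd

# Toy stand-in for the gapminder data: one numeric and one categorical column
toy = pd.DataFrame({
    "life": [75.3, 58.3, 81.5],
    "Region": ["America", "Sub-Saharan Africa", "Europe"],
})

# One dummy column per region
full = pd.get_dummies(toy)

# Same encoding, but the first region (alphabetically) is dropped
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

The dropped level becomes the baseline: a row with zeros in all remaining Region_ columns belongs to it, so no information is lost.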

Regression with categorical features | Python

# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Instantiate a ridge regressor: ridge
ridge = Ridge(alpha=0.5,  normalize=True)

# Perform 5-fold cross-validation: ridge_cv
ridge_cv = cross_val_score(ridge, X, y, cv = 5)

# Print the cross-validated scores
print(ridge_cv)

Handling missing data | Python

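Use .replace(0, np.nan, inplace=True) to mark zeros as missing, then .dropna() to drop the rows with missing values. These are two separate calls: with inplace=True, replace returns None, so they cannot be chained.

To impute instead of dropping: import the class with from sklearn.preprocessing import Imputer, specify the rule with imp = Imputer(missing_values='NaN', strategy='mean', axis=0), fit it on X with imp.fit(X), and finish the transformation with X = imp.transform(X).

Any end-to-end sklearn workflow relies on a pipeline, and the imputer is compatible with one.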
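The Imputer class used in these notes comes from scikit-learn 0.19; it was removed in 0.22. On a recent version, SimpleImputer from sklearn.impute fills the same role. It always imputes column-wise, so it has no axis argument. A minimal sketch on a made-up array (X_toy is an assumption for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with one missing value in each column
X_toy = np.array([[1.0, 2.0],
                  [np.nan, 4.0],
                  [3.0, np.nan]])

# Replace each NaN with the mean of its column
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imp.fit_transform(X_toy)
print(X_filled)
```

The same object can be dropped into a Pipeline step in place of Imputer(...).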

Dropping missing data | Python

# Convert '?' to NaN
df[df == '?'] = np.nan

This boolean-mask assignment applies across every row and column at once.
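As a rough illustration of that mask assignment, the sketch below uses a tiny made-up frame (votes and its values are assumptions, not the real voting data) and counts the resulting NaNs per column.

```python
import numpy as np
import pandas as pd

# Toy frame where '?' marks a missing answer
votes = pd.DataFrame({
    "party": ["democrat", "republican", "democrat"],
    "water": ["y", "?", "n"],
    "budget": ["?", "n", "?"],
})

# votes == "?" is a boolean DataFrame of the same shape; assigning through it
# overwrites every matching cell, whatever its row or column
votes[votes == "?"] = np.nan

missing_per_column = votes.isnull().sum()
print(missing_per_column)
```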

# Print the number of NaNs
print(df.isnull().sum())

# Print shape of original DataFrame
print("Shape of Original DataFrame: {}".format(df.shape))

# Drop missing values and print shape of new DataFrame
df = df.dropna()

# Print shape of new DataFrame
print("Shape of DataFrame After Dropping All Rows with Missing Values: {}".format(df.shape))
<script.py> output:
    party                  0
    infants               12
    water                 48
    budget                11
    physician             11
    salvador              15
    religious             11
    satellite             14
    aid                   15
    missile               22
    immigration            7
    synfuels              21
    education             31
    superfund             25
    crime                 17
    duty_free_exports     28
    eaa_rsa              104
    dtype: int64
    Shape of Original DataFrame: (435, 17)
    Shape of DataFrame After Dropping All Rows with Missing Values: (232, 17)

Imputing missing data in a ML Pipeline I | Python

In [3]: Imputer?
axis : integer, optional (default=0)
    The axis along which to impute.

    - If `axis=0`, then impute along columns.
# Import the Imputer module
from sklearn.preprocessing import Imputer
from sklearn.svm import SVC
# Setup the Imputation transformer: imp
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)

Something new to me: strategy='most_frequent', which fills each missing value with the most frequent value in its column.

# Instantiate the SVC classifier: clf
clf = SVC()

# Setup the pipeline with the required steps: steps
steps = [('imputation', imp),
        ('SVM', clf)]

Imputing missing data in a ML Pipeline II | Python

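# Import necessary modules
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report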

# Setup the pipeline steps: steps
steps = [('imputation', Imputer(missing_values='NaN', strategy='most_frequent', axis=0)),
        ('SVM', SVC())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 42)

# Fit the pipeline to the train set
pipeline.fit(X_train, y_train)
In [5]: y_pred = pipeline.predict(X_test)

In [6]: print(classification_report(y_test, y_pred))
             precision    recall  f1-score   support

   democrat       0.99      0.96      0.98        85
 republican       0.94      0.98      0.96        46

avg / total       0.97      0.97      0.97       131

Centering and scaling | Python

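\(\Box\) To do: summarize normalize, which rescales samples to unit vectors. I really don't understand this one yet.

k-NN relies on distances, so scaling keeps features with large ranges from dominating the model.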

  • Standardization \(\frac{x-\mu}{\sigma}\to \hat \mu = 0, \hat \sigma = 1\)
  • MinMax \(\to \min = 0, \max = 1\)
  • Normalize \(\to \min = -1, \max = +1\)
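The three rescalings listed above can be sketched on a toy matrix. StandardScaler and MinMaxScaler are real scikit-learn classes; using MinMaxScaler(feature_range=(-1, 1)) for the -1 to +1 case is my assumption about one way to get that range, and X_toy is made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# Standardization: each column ends up with mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X_toy)

# MinMax: each column ends up with min 0 and max 1
X_minmax = MinMaxScaler().fit_transform(X_toy)

# Range -1 to +1 (assumed here to be what the last bullet means)
X_sym = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X_toy)

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
print(X_sym.min(axis=0), X_sym.max(axis=0))
```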

So let's check whether scaling improves the k-NN model.

In [6]: from sklearn.preprocessing import StandardScaler
In [7]: steps = [('scaler', StandardScaler()),
   ...:          ('knn', KNeighborsClassifier())]
In [8]: pipeline = Pipeline(steps)
In [9]: X_train, X_test, y_train, y_test = train_test_split(X, y,
   ...: test_size=0.2, random_state=21)
In [10]: knn_scaled = pipeline.fit(X_train, y_train)
In [11]: y_pred = pipeline.predict(X_test)
In [12]: accuracy_score(y_test, y_pred)
Out[12]: 0.956
In [13]: knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)
In [14]: knn_unscaled.score(X_test, y_test)
Out[14]: 0.928

\[Acc: 0.928 \;(\text{unscaled}) \to 0.956 \;(\text{scaled})\]

Note that this is not always the case: In the Congressional voting records dataset, for example, all of the features are binary. In such a situation, scaling will have minimal impact.

When the features are dummy variables, though, scaling is unnecessary.

Centering and scaling your data | Python

# Import scale
from sklearn.preprocessing import scale

# Scale the features: X_scaled
X_scaled = scale(X)

# Print the mean and standard deviation of the unscaled features
print("Mean of Unscaled Features: {}".format(np.mean(X))) 
print("Standard Deviation of Unscaled Features: {}".format(np.std(X)))

# Print the mean and standard deviation of the scaled features
print("Mean of Scaled Features: {}".format(np.mean(X_scaled))) 
print("Standard Deviation of Scaled Features: {}".format(np.std(X_scaled)))
<script.py> output:
    Mean of Unscaled Features: 18.432687072460002
    Standard Deviation of Unscaled Features: 41.54494764094571
    Mean of Scaled Features: 2.7314972981668206e-15
    Standard Deviation of Scaled Features: 0.9999999999999999
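Note that np.mean(X) and np.std(X) without an axis argument summarize all entries of the matrix together. Since scale works column by column, a per-column check is a stricter test. A small sketch on made-up data (X_toy is an assumption):

```python
import numpy as np
from sklearn.preprocessing import scale

# Toy matrix whose columns have very different spreads
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

X_toy_scaled = scale(X_toy)

# axis=0 gives one value per feature instead of one value for the whole matrix
col_means = X_toy_scaled.mean(axis=0)
col_stds = X_toy_scaled.std(axis=0)
print(col_means, col_stds)
```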

Centering and scaling in a pipeline | Python

For instance many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger that others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

Well put.

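# Import the necessary modules
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split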

# Setup the pipeline steps: steps
steps = [('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())]
        
# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 42)

# Fit the pipeline to the training set: knn_scaled
knn_scaled = pipeline.fit(X_train, y_train)

# Instantiate and fit a k-NN classifier to the unscaled data
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)

# Compute and print metrics
print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))
print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test, y_test)))
<script.py> output:
    Accuracy with Scaling: 0.7700680272108843
    Accuracy without Scaling: 0.6979591836734694

Bringing it all together I: Pipeline for classification | Python

You’ll return to using the SVM classifier you were briefly introduced to earlier in this chapter. The hyperparameters you will tune are \(C\) and \(gamma\). \(C\) controls the regularization strength. It is analogous to the \(C\) you tuned for logistic regression in Chapter 3, while \(gamma\) controls the kernel coefficient: Do not worry about this now as it is beyond the scope of this course.

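# Import the necessary modules
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Setup the pipeline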
steps = [('scaler', StandardScaler()),
         ('SVM', SVC())]

pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'SVM__C':[1, 10, 100],
              'SVM__gamma':[0.1, 0.01]}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 21)

# Instantiate the GridSearchCV object: cv
cv = GridSearchCV(pipeline, parameters, cv = 3)

# Fit to the training set
cv.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)

# Compute and print metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))
<script.py> output:
    Accuracy: 0.7795918367346939
                 precision    recall  f1-score   support
    
          False       0.83      0.85      0.84       662
           True       0.67      0.63      0.65       318
    
    avg / total       0.78      0.78      0.78       980
    
    Tuned Model Parameters: {'SVM__C': 10, 'SVM__gamma': 0.1}
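The keys SVM__C and SVM__gamma follow the pipeline naming convention: the step name, a double underscore, then the parameter name of that step's estimator. A minimal sketch of how to inspect and set them (pipe, svm_params, and tuned_C are illustrative names):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([("scaler", StandardScaler()), ("SVM", SVC())])

# Every tunable parameter of the 'SVM' step is exposed as 'SVM__<name>'
svm_params = sorted(name for name in pipe.get_params() if name.startswith("SVM__"))
print(svm_params)

# The same names work with set_params, which is what GridSearchCV relies on
pipe.set_params(SVM__C=10, SVM__gamma=0.1)
tuned_C = pipe.named_steps["SVM"].C
print(tuned_C)
```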

Bringing it all together II: Pipeline for regression | Python

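# Import the necessary modules
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer, StandardScaler

# Setup the pipeline steps: steps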
steps = [('imputation', Imputer(missing_values='NaN', strategy='mean', axis=0)),
         ('scaler', StandardScaler()),
         ('elasticnet', ElasticNet())]

# Create the pipeline: pipeline 
pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'elasticnet__l1_ratio':np.linspace(0,1,30)}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.4, random_state = 42)

# Create the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(pipeline, parameters, cv = 3)

# Fit to the training set
gm_cv.fit(X_train, y_train)

# Compute and print the metrics
r2 = gm_cv.score(X_test, y_test)
# Note: the label says "Alpha", but the tuned parameter is l1_ratio
print("Tuned ElasticNet Alpha: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
<script.py> output:
    Tuned ElasticNet Alpha: {'elasticnet__l1_ratio': 1.0}
    Tuned ElasticNet R squared: 0.8862016570888217

Final thoughts | Python

Done! Now I can go read that book.

Statement of Accomplishment

Certificate

Images hosted on GitHub load painfully slowly from within China. What a pain.


  1. If \(x^2+y^2+z^2=1\), the vector \([x,y,z]\) is called a unit vector. Any vector with norm 1 is a unit vector; there are infinitely many of them, one in every direction.
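To connect the unit-vector definition to scikit-learn: sklearn.preprocessing.normalize rescales each sample (row) to norm 1 by default, which is different from the per-column scaling done by StandardScaler. A minimal sketch on made-up rows (X_toy is an assumption):

```python
import numpy as np
from sklearn.preprocessing import normalize

X_toy = np.array([[3.0, 4.0],
                  [1.0, 2.0]])

# Divide each row by its L2 norm so that every row becomes a unit vector
X_unit = normalize(X_toy, norm="l2")
row_norms = np.linalg.norm(X_unit, axis=1)

print(X_unit)
print(row_norms)
```

The first row has length 5, so it becomes [0.6, 0.8], and \(0.6^2 + 0.8^2 = 1\).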