Supervised Learning with scikit-learn: Study Notes

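KNN can also be used for multi-class classification. Because it works directly on positions in feature space, it handles targets with around 10 classes very well, and can do better than a softmax model there. The most intuitive way to read KNN: take the n nearest points around a query point; the label (y value) that occurs most often among those n points becomes \hat y.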

Supervised Learning with scikit-learn

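This was essentially a review. The course covers supervised learning and feature preprocessing; k-NN, regression, decision trees, logistic regression, and SVM all appear, with most of the attention on k-NN.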

Supervised learning | Python

Reinforcement learning has a somewhat Bayesian feel to it.

Software agents interact with an environment

  • Learn how to optimize their behavior
  • Given a system of rewards and punishments
  • Draws inspiration from behavioral psychology


  • Economics
  • Genetics
  • Game playing

AlphaGo: First computer to defeat the world champion in Go


Which of these is a classification problem? | Python


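scikit-learn is not installed properly on my local machine; the Anaconda path probably needs to be set. I do not know how to do that yet, so I will try things in Jupyter first and come back to it once I am familiar with tedious details like paths.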

Installing scikit-learn — scikit-learn 0.19.1 documentation

from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
iris = datasets.load_iris()
In [7]: type(iris)
Out[7]: sklearn.datasets.base.Bunch

I had not seen this data structure before. iris is a dictionary-like object.

In [8]: print(iris.keys())
dict_keys(['data', 'target_names', 'DESCR', 'feature_names', 'target'])
In [9]: type(iris.data), type(iris.target)
Out[9]: (numpy.ndarray, numpy.ndarray)
In [10]: iris.data.shape
Out[10]: (150, 4)

So iris.data is a matrix of shape (150, 4): 150 samples and 4 features.

In [11]: iris.target_names
Out[11]: array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
X = iris.data
y = iris.target


df = pd.DataFrame(X, columns=iris.feature_names)


_ = pd.plotting.scatter_matrix(df, c = y, figsize = [8, 8],s=150, marker='D')

marker='D' stands for diamond; this can be looked up on the markers page of the Matplotlib 2.1.1 documentation.


Marker size is scaled by s and marker color is mapped to c.



Visual EDA | Python

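Note that the earlier plots, scatter plots and histograms, were all for continuous variables. For binary variables a countplot is the natural choice. palette='RdBu' means Red and Blue. plt.figure() opens a new figure, similar in effect to plt.clf(); without it the plots are drawn on top of each other.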

sns.countplot(x='education', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])

The classification challenge | Python

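KNN (k-Nearest Neighbors) is not the same kind of method as KMeans and the like. It looks at the few nearest points and lets them vote; the most frequent label wins.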

Missing values are not allowed, but with the sklearn package this is not a problem; for details see python中变量批量处理集成方案 - A Hugo website.


from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(iris['data'], iris['target'])


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=6, p=2,



k-Nearest Neighbors: Fit | Python

# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier


y = df['party'].values
X = df.drop('party', axis=1).values


# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the data
knn.fit(X, y)

Here n_neighbors = 6 means that the 6 nearest neighbors vote and the majority label wins.
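The voting rule can be written out in a few lines. This is a minimal pure-Python sketch on made-up points, not scikit-learn's implementation; knn_predict, X_toy, and y_toy are hypothetical names introduced only for illustration.

```python
from collections import Counter
import math

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point, paired with that point's label
    dists = sorted((math.dist(x, x_new), label) for x, label in zip(X_train, y_train))
    # Keep the labels of the k closest points and take the most common one
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Made-up data: three points near the origin (class "a"), three near (1, 1) (class "b")
X_toy = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
y_toy = ["a", "a", "a", "b", "b", "b"]

pred_near_a = knn_predict(X_toy, y_toy, (0.15, 0.15), k=3)
pred_near_b = knn_predict(X_toy, y_toy, (1.0, 0.95), k=3)
print(pred_near_a, pred_near_b)
```

With an even k a tie between classes is possible; this sketch then returns whichever tied label was counted first, so an odd k is the simpler choice for two classes.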

k-Nearest Neighbors: Predict | Python

# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier 

# Create arrays for the features and the response variable
y = df.party.values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors: knn
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the data
knn.fit(X, y)

# Predict the labels for the training data X
y_pred = knn.predict(X)

# Predict and print the label for the new data point X_new
new_prediction = knn.predict(X_new)
print("Prediction: {}".format(new_prediction))

Measuring model performance | Python

knn.score(X_test, y_test) reports the accuracy (Acc). stratify=y is an argument of train_test_split; its docstring reads:

stratify : array-like or None (default is None)
    If not None, data is split in a stratified fashion, using this as
    the class labels.

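train_test_split(X, y, stratify=y) - CSDN博客: setting stratify=X splits according to the proportions in X, and setting stratify=y splits according to the proportions in y. In practice it is almost always =y.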
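What "split according to the proportions in y" means can be sketched without scikit-learn. The helper below is a hypothetical illustration (stratified_split is not a library function, and the labels are made up): it splits each class separately so that both parts keep roughly the same class shares.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so that each class keeps roughly its share in both parts."""
    rng = random.Random(seed)
    # Group sample indices by class label
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        # Shuffle within the class, then send a fixed fraction of it to the test part
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["yes"] * 80 + ["no"] * 20  # made-up 80/20 class balance
train_idx, test_idx = stratified_split(labels, test_size=0.2)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

The test part keeps the 80/20 balance of the full label list, which a plain random split only achieves on average.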


\text{Larger } k \to \text{smoother decision boundary} \to \text{less complex model}

\text{Smaller } k \to \text{more complex model} \to \text{can lead to overfitting}

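This is easy to understand: a jagged boundary means the model is making too many fancy moves. It looks counterintuitive at first, but it is not: increasing n_neighbors does not add parameters, it effectively removes them.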

The digits recognition dataset | Python


# Import necessary modules
from sklearn import datasets
import matplotlib.pyplot as plt

# Load the digits dataset: digits
digits = datasets.load_digits()
In [4]: print(digits.keys())
dict_keys(['DESCR', 'data', 'images', 'target', 'target_names'])


In [5]: print(digits.DESCR)
Optical Recognition of Handwritten Digits Data Set

Data Set Characteristics:
    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,

  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
    Graduate Studies in Science and Engineering, Bogazici University.
  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
    Linear dimensionalityreduction using relevance weighted LDA. School of
    Electrical and Electronic Engineering Nanyang Technological University.
  - Claudio Gentile. A New Approximate Maximal Margin Classification
    Algorithm. NIPS. 2000.


In [6]: print(digits.images.shape)
(1797, 8, 8)


In [7]: type(digits.images)
Out[7]: numpy.ndarray


In [10]: digits.images[0:2]
array([[[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
        [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
        [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
        [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
        [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
        [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
        [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
        [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]],

       [[  0.,   0.,   0.,  12.,  13.,   5.,   0.,   0.],
        [  0.,   0.,   0.,  11.,  16.,   9.,   0.,   0.],
        [  0.,   0.,   3.,  15.,  16.,   6.,   0.,   0.],
        [  0.,   7.,  15.,  16.,  16.,   2.,   0.,   0.],
        [  0.,   0.,   1.,  16.,  16.,   3.,   0.,   0.],
        [  0.,   0.,   1.,  16.,  16.,   6.,   0.,   0.],
        [  0.,   0.,   1.,  16.,  16.,   6.,   0.,   0.],
        [  0.,   0.,   0.,  11.,  16.,  10.,   0.,   0.]]])


In [11]: digits.images[0:2,0]
array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
       [  0.,   0.,   0.,  12.,  13.,   5.,   0.,   0.]])

This takes the first two samples and, from each sample's matrix, its first row. That covers the structure of the samples.
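The same indexing pattern can be tried on a small made-up array, where the result is easy to check by eye. images_demo is a hypothetical stand-in, not the digits data.

```python
import numpy as np

# Hypothetical stand-in for digits.images: 2 samples, each a 3x3 "image"
images_demo = np.arange(18).reshape(2, 3, 3)

# First axis: samples 0 and 1; second axis: row 0 of each image
first_rows = images_demo[0:2, 0]
print(first_rows.shape)
print(first_rows)
```

The result has one row per sample, so a (2, 3, 3) array yields a (2, 3) array here, just as the (1797, 8, 8) digits array yields two rows of 8 pixels.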

# Display digit 1010
plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest')


Train/Test Split + Fit/Predict/Accuracy | Python

# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split


# Create feature and target arrays
X = digits.data
y = digits.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)


# Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors = 7)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))
<script.py> output:


Overfitting and underfitting | Python

In for i, k in enumerate(neighbors):, k is the value of n_neighbors and i is the index used to record the accuracy.

# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN Classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors = k)

    # Fit the classifier to the training data
    knn.fit(X_train, y_train)

    #Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train,y_train)

    #Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()


Introduction to regression | Python

.reshape: Returns an array containing the same data with a new shape. Refer to numpy.reshape for full documentation.

import numpy

python reshape用法 - a3335581的博客 - CSDN博客


y.reshape(-1, 1) produces a column vector: -1 lets NumPy infer the number of rows, and each row holds a single element. So:

y = y.reshape(-1, 1)
X_rooms = X_rooms.reshape(-1, 1)


Finally, np.linspace generates evenly spaced x values along which the fitted line \hat y = X \hat \beta is drawn.

import numpy as np
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(X_rooms, y)
prediction_space = np.linspace(min(X_rooms), max(X_rooms)).reshape(-1,1)
import matplotlib.pyplot as plt
plt.scatter(X_rooms, y, color='blue')
# Draw the fitted values as a line, not as another scatter
plt.plot(prediction_space, reg.predict(prediction_space), color='black', linewidth=3)
plt.show()

Importing data for supervised learning | Python

# Import numpy and pandas
import numpy as np
import pandas as pd

# Read the CSV file into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create arrays for features and target variable
y = df.life.values
X = df.fertility.values

# Print the dimensions of X and y before reshaping
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X.shape))

# Reshape X and y
y = y.reshape(-1,1)
X = X.reshape(-1,1)

# Print the dimensions of X and y after reshaping
print("Dimensions of y after reshaping: {}".format(y.shape))
print("Dimensions of X after reshaping: {}".format(X.shape))


<script.py> output:
    Dimensions of y before reshaping: (139,)
    Dimensions of X before reshaping: (139,)
    Dimensions of y after reshaping: (139, 1)
    Dimensions of X after reshaping: (139, 1)


Exploring the Gapminder data | Python

Draw a correlation heatmap with seaborn.heatmap: sns.heatmap(df.corr(), square=True, cmap='RdYlGn'). Then inspect the data with .info(), .describe(), and .head().


In [1]: sns.heatmap(df.corr(), square=True, cmap='RdYlGn')
Out[1]: <matplotlib.axes._subplots.AxesSubplot at 0x7fd8205810f0>

In [2]: plt.show()

The basics of linear regression | Python


Fit & predict for regression | Python

# Import LinearRegression
from sklearn.linear_model import LinearRegression

# Create the regressor: reg
reg = LinearRegression()

# Create the prediction space
prediction_space = np.linspace(min(X_fertility), max(X_fertility)).reshape(-1,1)

# Fit the model to the data
reg.fit(X_fertility, y)

# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)

# Print R^2 
print(reg.score(X_fertility, y))

# Plot regression line
plt.plot(prediction_space, y_pred, color='black', linewidth=3)

np.linspace(min(X_fertility), max(X_fertility)).reshape(-1,1) is the usual trick for drawing the line \hat y = X \hat \beta. For OLS, reg.score(X_fertility, y) returns R^2.
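For a regressor, score returns the coefficient of determination, R^2 = 1 - SS_{res} / SS_{tot}. A minimal pure-Python sketch of that formula follows; r_squared is a hypothetical helper and the numbers are made up.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]           # made-up targets
perfect = r_squared(y_true, y_true)     # predictions equal to the targets
baseline = r_squared(y_true, [2.5] * 4) # always predicting the mean of y_true
print(perfect, baseline)
```

Perfect predictions give 1, always predicting the mean gives 0, and a model worse than the mean gives a negative value.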

Train/test split for regression | Python

# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

# Create the regressor: reg_all
reg_all = LinearRegression()

# Fit the regressor to the training data
reg_all.fit(X_train, y_train)

# Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)

# Compute and print R^2 and RMSE
print("R^2: {}".format(reg_all.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
<script.py> output:
    R^2: 0.838046873142936
    Root Mean Squared Error: 3.2476010800377213

Cross-validation | Python

I remember the xgboost course explaining this as well: cross-validation is needed because R_{test}^2 depends heavily on how the data happens to be split.
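The mechanics of k-fold splitting can be sketched in pure Python. kfold_indices is a hypothetical helper, not scikit-learn's KFold; it assumes contiguous, unshuffled folds.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; every sample lands in the test part exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(10, 3))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)
```

Each fold produces one score, and averaging the scores removes the dependence on any single split.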


Also, cross_val_score uses R^2 by default for a regressor, so what it returns is R_{CV}^2; the name itself is simply CV + score.

5-fold cross-validation | Python

The import from sklearn.model_selection import cross_val_score makes it clear that cross-validation is filed under model selection.

# Import the necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Create a linear regression object: reg
reg = LinearRegression()

# Compute 5-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(reg, X, y, cv = 5)

# Print the 5-fold cross-validation scores
print(cv_scores)

print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))
<script.py> output:
    [ 0.81720569  0.82917058  0.90214134  0.80633989  0.94495637]
    Average 5-Fold CV Score: 0.8599627722793232


K-Fold CV comparison | Python

# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Create a linear regression object: reg
reg = LinearRegression()

# Perform 3-fold CV
cvscores_3 = cross_val_score(reg,X,y,cv = 3)

# Perform 10-fold CV
cvscores_10 = cross_val_score(reg,X,y,cv = 10)
<script.py> output:

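As cv \uparrow, each fold trains on a larger share of the data, so the score tends to look better, but the computational cost grows with the number of folds.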

Regularized regression | Python

\text{Loss function} = \text{OLS loss function} + \alpha \sum_{i=1}^n a_i^2

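Here, \alpha too large \to underfitting, and \alpha too small \to overfitting. So \alpha is not a very objective parameter: it is not estimated from the data by the fit itself and depends heavily on the choice made by hand.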

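What does normalize mean? According to help(Ridge), it rescales the regressors before fitting (subtracting the mean and dividing by the l2-norm). This matters because the penalty depends on the units of the variables.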


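Put simply, standardization works on the columns of the feature matrix: it computes a z-score so that every feature ends up on the same scale. Normalization works on the rows of the feature matrix: its purpose is to give sample vectors a common standard when they are compared through dot products or other kernel similarities, that is, each one is turned into a "unit vector". With the l2 rule, the normalization formula is:

\tilde x = \frac{x}{\|x\|_2}

\|x\|_2 = \sqrt{\sum_{i=1}^m x_i^2}

where x is a row vector, i.e. the data of one user, and m is the number of features.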
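The row-versus-column distinction can be made concrete with a tiny example. This is a pure-Python sketch with made-up numbers; l2_normalize and z_scores are hypothetical helpers, and the l2 norm is taken to be the Euclidean length (the square root of the sum of squared entries).

```python
import math
import statistics

def l2_normalize(row):
    """Scale one sample (a row) to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]

def z_scores(column):
    """Standardize one feature (a column) to mean 0 and standard deviation 1."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)
    return [(v - mu) / sigma for v in column]

unit_row = l2_normalize([3.0, 4.0])   # one sample with two features
std_col = z_scores([1.0, 2.0, 3.0])   # one feature across three samples
print(unit_row, std_col)
```

Normalization changes the length of each sample but says nothing about how a feature is distributed across samples; standardization does the opposite.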

\text{Loss function} = \text{OLS loss function} + \alpha \sum_{i=1}^n |a_i|

Regularization I: Lasso | Python

# Import Lasso
from sklearn.linear_model import Lasso

# Instantiate a lasso regressor: lasso
lasso = Lasso(alpha = 0.4, normalize=True)

# Fit the regressor to the data
lasso.fit(X, y)

# Compute and print the coefficients
lasso_coef = lasso.coef_

# Plot the coefficients
plt.plot(range(len(df_columns)), lasso_coef)
plt.xticks(range(len(df_columns)), df_columns.values, rotation=60)
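The normalize=True argument used in these notes belongs to older scikit-learn releases; it was deprecated in 1.0 and removed in 1.2. A hedged sketch of one common replacement, assuming a recent scikit-learn, is to standardize inside a pipeline. The fitted coefficients are not numerically identical to the old normalize=True behaviour, and X_demo and y_demo are made-up data.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: three features on very different scales, only the first one drives y
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3)) * [1.0, 10.0, 100.0]
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)

# Scale first, then apply the L1 penalty, so the penalty does not depend on units
model = make_pipeline(StandardScaler(), Lasso(alpha=0.4))
model.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
lasso_coef_demo = model.named_steps["lasso"].coef_
print(lasso_coef_demo)
```

Because the scaler sits inside the pipeline, cross-validation refits it on each training fold, which avoids leaking test-fold statistics.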

Regularization II: Ridge | Python

  • L1 \to Lasso \gets + \alpha \sum_{i=1}^n|a_{i}|
  • L2 \to Ridge \gets + \alpha \sum_{i=1}^na_{i}^2
# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
# Setup the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []
logspace(start, stop, num=50, endpoint=True, base=10.0, dtype=None)
    Return numbers spaced evenly on a log scale.
    In linear space, the sequence starts at ``base ** start``
    (`base` to the power of `start`) and ends with ``base ** stop``
    (see `endpoint` below).
    start : float
        ``base ** start`` is the starting value of the sequence.
    stop : float
        ``base ** stop`` is the final value of the sequence, unless `endpoint`
        is False.  In that case, ``num + 1`` values are spaced over the
        interval in log-space, of which all but the last (a sequence of
        length `num`) are returned.
    num : integer, optional
        Number of samples to generate.  Default is 50.
In [3]: np.logspace(-4, 0, 50)
array([  1.00000000e-04,   1.20679264e-04,   1.45634848e-04,
         1.75751062e-04,   2.12095089e-04,   2.55954792e-04,
         3.08884360e-04,   3.72759372e-04,   4.49843267e-04,
         5.42867544e-04,   6.55128557e-04,   7.90604321e-04,
         9.54095476e-04,   1.15139540e-03,   1.38949549e-03,
         1.67683294e-03,   2.02358965e-03,   2.44205309e-03,
         2.94705170e-03,   3.55648031e-03,   4.29193426e-03,
         5.17947468e-03,   6.25055193e-03,   7.54312006e-03,
         9.10298178e-03,   1.09854114e-02,   1.32571137e-02,
         1.59985872e-02,   1.93069773e-02,   2.32995181e-02,
         2.81176870e-02,   3.39322177e-02,   4.09491506e-02,
         4.94171336e-02,   5.96362332e-02,   7.19685673e-02,
         8.68511374e-02,   1.04811313e-01,   1.26485522e-01,
         1.52641797e-01,   1.84206997e-01,   2.22299648e-01,
         2.68269580e-01,   3.23745754e-01,   3.90693994e-01,
         4.71486636e-01,   5.68986603e-01,   6.86648845e-01,
         8.28642773e-01,   1.00000000e+00])

So np.logspace(-4, 0, 50) produces 50 values evenly spaced on a \log_{10} scale, from 10^{-4} to 10^{0}, in ascending order.
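A quick way to confirm this reading of np.logspace is to rebuild the same array from np.linspace. This is a small check, with by_hand and ratios as illustrative names.

```python
import numpy as np

alpha_space = np.logspace(-4, 0, 50)

# Same values: raise the base (10) to evenly spaced exponents
by_hand = 10.0 ** np.linspace(-4, 0, 50)

# Consecutive values differ by a constant factor, not a constant step
ratios = alpha_space[1:] / alpha_space[:-1]
print(np.allclose(alpha_space, by_hand), ratios[0])
```

A log-spaced grid is the usual choice for a regularization strength, since the interesting range of alpha spans several orders of magnitude.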

# Create a ridge regressor: ridge
ridge = Ridge(normalize=True)

For each alpha, record the mean and the standard deviation of the ten cross-validation scores.

# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Setup the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []

# Create a ridge regressor: ridge
ridge = Ridge(normalize=True)

# Compute scores over range of alphas
for alpha in alpha_space:

    # Specify the alpha value to use: ridge.alpha
    ridge.alpha = alpha
    # Perform 10-fold CV: ridge_cv_scores
    ridge_cv_scores = cross_val_score(ridge, X,y, cv = 10)
    # Append the mean of ridge_cv_scores to ridge_scores
    ridge_scores.append(np.mean(ridge_cv_scores))
    # Append the std of ridge_cv_scores to ridge_scores_std
    ridge_scores_std.append(np.std(ridge_cv_scores))

# Display the plot
display_plot(ridge_scores, ridge_scores_std)

\Box add plot best_alpha


def display_plot(cv_scores, cv_scores_std):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.plot(alpha_space, cv_scores)

    std_error = cv_scores_std / np.sqrt(10)

    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha=0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])

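ax.plot(alpha_space, cv_scores) plots the cross-validated R^2 as a function of \alpha. The standard deviation is computed in order to draw the shaded band, a rough confidence interval. Why divide by \sqrt{10}? Because the standard error of the mean over n = 10 folds is SE = \frac{\hat \sigma}{\sqrt{n}}.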
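The band drawn by display_plot is the mean score plus or minus one standard error. A minimal sketch of that arithmetic for a single alpha, using ten made-up fold scores (the values are hypothetical, not results from these notes):

```python
import math
import statistics

# Hypothetical scores from 10-fold cross-validation at one value of alpha
scores = [0.80, 0.82, 0.78, 0.85, 0.81, 0.79, 0.83, 0.80, 0.84, 0.78]

mean_score = statistics.fmean(scores)
std_score = statistics.pstdev(scores)           # population std, like np.std's default
std_error = std_score / math.sqrt(len(scores))  # standard error of the mean

lower, upper = mean_score - std_error, mean_score + std_error
print(mean_score, std_error, (lower, upper))
```

The standard error shrinks with the square root of the number of folds, which is why the shaded band is narrower than the raw spread of the fold scores.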

How good is your model? | Python

The focus here is the F1 score.


Recall = \frac{TP}{TP+FN} \to \text{how many of the actual positives are found}

Precision = \frac{TP}{TP+FP} \to \text{how trustworthy a positive prediction is}

F1 \text{ score} = \frac{2}{\frac{1}{Recall} + \frac{1}{Precision}}

In [7]: print(confusion_matrix(y_test, y_pred))
[[52  7]
 [ 3 112]]
In [8]: print(classification_report(y_test, y_pred))
             precision    recall  f1-score   support
          0       0.95      0.88      0.91        59
          1       0.94      0.97      0.96       115
avg / total       0.94      0.94      0.94       174
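The class-1 row of the report above can be reproduced by hand from the confusion matrix, given scikit-learn's layout in which rows are true classes and columns are predicted classes. A minimal sketch:

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[52, 7],
      [3, 112]]

tn, fp = cm[0]   # true class 0: predicted 0, predicted 1
fn, tp = cm[1]   # true class 1: predicted 0, predicted 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Rounded to two decimals these match the precision, recall, and f1-score reported for class 1.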


Metrics for classification | Python

You may have noticed in the video that the classification report consisted of three rows, and an additional support column.

The support gives the number of samples of the true response that lie in that class - so in the video example, the support was the number of Republicans or Democrats in the test set on which the classification report was computed.

The goal is to predict whether or not a given female patient will contract diabetes based on features such as BMI, age, and number of pregnancies. Therefore, it is a binary classification problem. A target value of 0 indicates that the patient does not have diabetes, while a value of 1 indicates that the patient does have diabetes.

# Import necessary modules
from sklearn.metrics import classification_report, confusion_matrix

# Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)

# Instantiate a k-NN classifier: knn
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
<script.py> output:
    [[176  30]
     [ 52  50]]
                 precision    recall  f1-score   support
              0       0.77      0.85      0.81       206
              1       0.62      0.49      0.55       102
    avg / total       0.72      0.73      0.72       308

Logistic regression and the ROC curve | Python

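The steps for drawing an ROC curve:

  • from sklearn.metrics import roc_curve imports the function.
  • y_pred_prob = logreg.predict_proba(X_test)[:,1] gives \hat y as the predicted probability of class 1.
  • fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob) computes fpr, tpr, and thresholds.
  • plt.plot([0, 1], [0, 1], 'k--') draws the diagonal.
  • plt.plot(fpr, tpr, label='Logistic Regression') draws the ROC curve.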

Building a logistic regression model | Python


# Import the necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42)

# Create the classifier: logreg
logreg = LogisticRegression()

# Fit the classifier to the training data
logreg.fit(X_train, y_train)


# Predict the labels of the test set: y_pred
y_pred = logreg.predict(X_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
<script.py> output:
    [[176  30]
     [ 35  67]]
                 precision    recall  f1-score   support
              0       0.83      0.85      0.84       206
              1       0.69      0.66      0.67       102
    avg / total       0.79      0.79      0.79       308

Plotting an ROC curve | Python

In [7]: pd.DataFrame(logreg.predict_proba(X_test)).head()
          0         1
0  0.604098  0.395902
1  0.760424  0.239576
2  0.796702  0.203298
3  0.772360  0.227640
4  0.571949  0.428051

So we take the second column (index = 1), the predicted probability of the positive class.

# Import necessary modules
from sklearn.metrics import roc_curve

# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]
# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

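The packaged solution is convenient here: there is nothing to debug by hand, and knowing the idea is enough. A trick for remembering it: the x-axis is fpr and the y-axis is tpr.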
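Each point on the ROC curve is the (fpr, tpr) pair obtained at one threshold. A minimal pure-Python sketch of a single point, on made-up labels and scores (roc_point is a hypothetical helper, not the library's roc_curve):

```python
def roc_point(y_true, scores, threshold):
    """Return (fpr, tpr) when every score >= threshold is predicted as class 1."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    return fp / n_neg, tp / n_pos

y_demo = [0, 0, 1, 1]                # made-up true labels
scores_demo = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities of class 1

strict = roc_point(y_demo, scores_demo, 0.5)  # high threshold: few positives predicted
loose = roc_point(y_demo, scores_demo, 0.3)   # low threshold: more positives predicted
print(strict, loose)
```

Lowering the threshold moves the point up and to the right: more true positives are caught, at the price of more false positives.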

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')

Precision-recall Curve | Python

\Box add precision-recall-curve

Area under the ROC curve | Python

Import it with from sklearn.metrics import roc_auc_score and call roc_auc_score(y_test, y_pred_prob). Remember that the \hat y passed in must be a probability.

AUC can also be obtained under cross-validation, using from sklearn.model_selection import cross_val_score:

In [8]: cv_scores = cross_val_score(logreg, X, y, cv=5,
...:                             scoring='roc_auc')


AUC computation | Python

Say you have a binary classifier that in fact is just randomly making guesses. It would be correct approximately 50% of the time, and the resulting ROC curve would be a diagonal line in which the True Positive Rate and False Positive Rate are always equal. The Area under this ROC curve would be 0.5. This is one way in which the AUC, which Hugo discussed in the video, is an informative metric to evaluate a model. If the AUC is greater than 0.5, the model is better than random guessing. Always a good sign!

In other words, pure guessing gives AUC = 0.5, and in practice any AUC > 0.5 means the model is already worth something.
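One way to see why 0.5 is the guessing baseline: AUC equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counting half. A pure-Python sketch of that pairwise definition on made-up data (auc_pairwise is a hypothetical helper):

```python
def auc_pairwise(y_true, scores):
    """AUC as the share of (positive, negative) pairs that are ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrongly ordered pair counts 0
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_demo = [0, 0, 1, 1]
auc_informative = auc_pairwise(y_demo, [0.1, 0.4, 0.35, 0.8])  # made-up scores
auc_constant = auc_pairwise(y_demo, [0.5, 0.5, 0.5, 0.5])      # scores carry no information
print(auc_informative, auc_constant)
```

A score that carries no information orders the pairs correctly half of the time, hence 0.5.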

# Import necessary modules
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score


# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]

This gives \hat y as probabilities.

In [5]: # Compute and print AUC score
        print("AUC: {}".format(roc_auc_score(y_test, y_pred_prob)))
AUC: 0.8254806777079764
# Compute cross-validated AUC scores: cv_auc
cv_auc = cross_val_score(logreg, X, y, cv = 5, scoring = 'roc_auc')

# Print list of AUC scores
print("AUC scores computed using 5-fold cross-validation: {}".format(cv_auc))
<script.py> output:
    AUC: 0.8254806777079764
    AUC scores computed using 5-fold cross-validation: [ 0.80148148  0.8062963   0.81481481  0.86245283  0.8554717 ]

Hyperparameter tuning | Python

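Definition of a hyperparameter: unlike \beta, which is learned when the model is fit, a hyperparameter has to be set before fitting and is found by iterating over candidate values.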
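The iteration behind a grid search can be sketched in a few lines. This is a hedged illustration, not GridSearchCV itself: grid_search is a hypothetical helper, and toy_score stands in for a cross-validated score that would normally come from fitting a model.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best (params, score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(C, penalty):
    """Hypothetical stand-in for a CV score, highest at C=1.0 with penalty='l2'."""
    return -abs(C - 1.0) - (0.1 if penalty == "l1" else 0.0)

param_grid_demo = {"C": [0.01, 1.0, 100.0], "penalty": ["l1", "l2"]}
best_params, best_score = grid_search(param_grid_demo, toy_score)
print(best_params, best_score)
```

The cost is the product of the grid sizes times the cost of one evaluation, which is why a randomized search becomes attractive for larger grids.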

\Box grid-search

Hyperparameter tuning with GridSearchCV | Python

# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression()
# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)


In [12]: # Fit it to the data
         logreg_cv.fit(X, y)
GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': array([  1.00000e-05,   8.48343e-05,   7.19686e-04,   6.10540e-03,
         5.17947e-02,   4.39397e-01,   3.72759e+00,   3.16228e+01,
         2.68270e+02,   2.27585e+03,   1.93070e+04,   1.63789e+05,
         1.38950e+06,   1.17877e+07,   1.00000e+08])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
In [13]: # Print the tuned parameters and score
         print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
         print("Best score is {}".format(logreg_cv.best_score_))
Tuned Logistic Regression Parameters: {'C': 3.7275937203149381}
Best score is 0.7708333333333334

Hyperparameter tuning with RandomizedSearchCV | Python

# Import necessary modules
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
# Setup the parameters and distributions to sample from: param_dist
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}
# Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()

# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
# Fit it to the data
tree_cv.fit(X, y)
In [2]: # Print the tuned parameters and score
        print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
        print("Best score is {}".format(tree_cv.best_score_))
Tuned Decision Tree Parameters: {'max_depth': 3, 'criterion': 'entropy', 'max_features': 3, 'min_samples_leaf': 8}
Best score is 0.7369791666666666

Hold-out set for final evaluation | Python

In "hold-out set", hold-out means out of sample and set means dataset.

Hold-out set in practice I: Classification | Python

# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Create the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}

# Instantiate the logistic regression classifier: logreg
logreg = LogisticRegression()

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)
# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv = 5)
# Fit it to the training data
logreg_cv.fit(X_train, y_train)
In [4]: # Print the optimal parameters and best score
        print("Tuned Logistic Regression Parameter: {}".format(logreg_cv.best_params_))
        print("Tuned Logistic Regression Accuracy: {}".format(logreg_cv.best_score_))
Tuned Logistic Regression Parameter: {'C': 0.43939705607607948, 'penalty': 'l1'}
Tuned Logistic Regression Accuracy: 0.7652173913043478

Hold-out set in practice II: Regression | Python

\text{elastic net} = a \times L1 + b \times L2

# Import necessary modules
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)

# Create the hyperparameter grid
l1_space = np.linspace(0, 1, 30)
param_grid = {'l1_ratio': l1_space}

# Instantiate the ElasticNet regressor: elastic_net
elastic_net = ElasticNet()

# Setup the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(elastic_net, param_grid, cv=5)

# Fit it to the training data
gm_cv.fit(X_train, y_train)

# Predict on the test set and compute metrics
y_pred = gm_cv.predict(X_test)
r2 = gm_cv.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
print("Tuned ElasticNet MSE: {}".format(mse))

Preprocessing data | Python


  • scikit-learn: OneHotEncoder()
  • pandas: get_dummies() + .drop(..., axis = 1)
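What dummy encoding produces, and why one column can be dropped, can be shown with a small pure-Python sketch. get_dummies_sketch is a hypothetical helper that mimics the idea, not the pandas implementation, and the region values are made up.

```python
def get_dummies_sketch(values, prefix, drop_first=False):
    """One-hot encode a list of category labels as a list of dicts (one dict per row)."""
    categories = sorted(set(values))
    if drop_first:
        # Drop the first level: it is implied when all remaining indicators are 0
        categories = categories[1:]
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]

regions = ["Asia", "Europe", "Asia", "Africa"]  # made-up values
full = get_dummies_sketch(regions, "Region")
reduced = get_dummies_sketch(regions, "Region", drop_first=True)
print(list(full[0]), list(reduced[0]))
```

With all indicator columns kept, the columns always sum to 1, which duplicates the intercept in a linear model; dropping one level removes that redundancy.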

Exploring categorical features | Python

# Import pandas
import pandas as pd

# Read 'gapminder.csv' into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create a boxplot of life expectancy per region
df.boxplot('life', 'Region', rot=60)

# Show the plot
plt.show()

Note that df.boxplot is a method of pd.DataFrame.

Creating dummy variables | Python


# Create dummy variables: df_region
df_region = pd.get_dummies(df)

# Print the columns of df_region
print(df_region.columns)

# Create dummy variables with drop_first=True: df_region
df_region = pd.get_dummies(df, drop_first=True)

# Print the new columns of df_region
print(df_region.columns)

Regression with categorical features | Python

# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Instantiate a ridge regressor: ridge
ridge = Ridge(alpha=0.5,  normalize=True)

# Perform 5-fold cross-validation: ridge_cv
ridge_cv = cross_val_score(ridge, X, y, cv = 5)

# Print the cross-validated scores
print(ridge_cv)

Handling missing data | Python


  • from sklearn.preprocessing import Imputer imports the class.
  • imp = Imputer(missing_values='NaN', strategy='mean', axis=0) specifies the rule.
  • imp.fit(X) fits it to the variable X.
  • X = imp.transform(X) completes the transformation.
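The Imputer class used in these notes belongs to older scikit-learn releases. It was later replaced by sklearn.impute.SimpleImputer, which has no axis argument and always imputes column by column. A minimal sketch, assuming a recent scikit-learn and a small made-up matrix X_demo:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up matrix with one missing value in each column
X_demo = np.array([[1.0, 2.0],
                   [np.nan, 4.0],
                   [3.0, np.nan]])

# Replace each NaN with the mean of the observed values in its column
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imp.fit_transform(X_demo)
print(X_filled)
```

strategy="most_frequent" works the same way and is the natural choice for categorical or binary columns.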


Dropping missing data | Python

# Convert '?' to NaN
df[df == '?'] = np.nan


# Print the number of NaNs
print(df.isnull().sum())

# Print shape of original DataFrame
print("Shape of Original DataFrame: {}".format(df.shape))

# Drop missing values and print shape of new DataFrame
df = df.dropna()

# Print shape of new DataFrame
print("Shape of DataFrame After Dropping All Rows with Missing Values: {}".format(df.shape))
<script.py> output:
    party                  0
    infants               12
    water                 48
    budget                11
    physician             11
    salvador              15
    religious             11
    satellite             14
    aid                   15
    missile               22
    immigration            7
    synfuels              21
    education             31
    superfund             25
    crime                 17
    duty_free_exports     28
    eaa_rsa              104
    dtype: int64
    Shape of Original DataFrame: (435, 17)
    Shape of DataFrame After Dropping All Rows with Missing Values: (232, 17)

Imputing missing data in a ML Pipeline I | Python

In [3]: Imputer?
axis : integer, optional (default=0)
    The axis along which to impute.

    - If `axis=0`, then impute along columns.
# Import the Imputer module
from sklearn.preprocessing import Imputer
from sklearn.svm import SVC
# Setup the Imputation transformer: imp
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)


# Instantiate the SVC classifier: clf
clf = SVC()

# Setup the pipeline with the required steps: steps
steps = [('imputation', imp),
        ('SVM', clf)]

Imputing missing data in a ML Pipeline II | Python

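# Import necessary modules
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report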

# Setup the pipeline steps: steps
steps = [('imputation', Imputer(missing_values='NaN', strategy='most_frequent', axis=0)),
        ('SVM', SVC())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 42)

# Fit the pipeline to the train set
pipeline.fit(X_train, y_train)
In [5]: y_pred = pipeline.predict(X_test)

In [6]: print(classification_report(y_test, y_pred))
             precision    recall  f1-score   support

   democrat       0.99      0.96      0.98        85
 republican       0.94      0.98      0.96        46

avg / total       0.97      0.97      0.97       131

Centering and scaling | Python



  • Standardization: \frac{x-\mu}{\sigma}, so that \hat \mu = 0, \hat \sigma = 1
  • MinMax: rescale so that \min = 0, \max = 1
  • Normalize: rescale so that \min = -1, \max = +1
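The three rescalings listed above can be checked on a small made-up column. This is only a sketch: the array x is illustrative, and using MinMaxScaler with feature_range=(-1, 1) for the third case is an assumption about how to reach the [-1, +1] range, not necessarily what the course uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# One made-up feature column
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Standardization: subtract the mean, divide by the standard deviation
x_std = StandardScaler().fit_transform(x)

# MinMax: map the minimum to 0 and the maximum to 1
x_minmax = MinMaxScaler().fit_transform(x)

# Rescale to the range [-1, +1] (assumed implementation of "Normalize")
x_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(x)

print(x_std.mean(), x_std.std())
print(x_minmax.min(), x_minmax.max())
print(x_pm1.min(), x_pm1.max())
```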


In [6]: from sklearn.preprocessing import StandardScaler
In [7]: steps = [('scaler', StandardScaler()),
   ...:          ('knn', KNeighborsClassifier())]
In [8]: pipeline = Pipeline(steps)
In [9]: X_train, X_test, y_train, y_test = train_test_split(X, y,
   ...: test_size=0.2, random_state=21)
In [10]: knn_scaled = pipeline.fit(X_train, y_train)
In [11]: y_pred = pipeline.predict(X_test)
In [12]: accuracy_score(y_test, y_pred)
Out[12]: 0.956
In [13]: knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)
In [14]: knn_unscaled.score(X_test, y_test)
Out[14]: 0.928

Accuracy: 0.956 with scaling vs. 0.928 without scaling.

Note that this is not always the case: In the Congressional voting records dataset, for example, all of the features are binary. In such a situation, scaling will have minimal impact.


Centering and scaling your data | Python

# Import scale
from sklearn.preprocessing import scale

# Scale the features: X_scaled
X_scaled = scale(X)

# Print the mean and standard deviation of the unscaled features
print("Mean of Unscaled Features: {}".format(np.mean(X))) 
print("Standard Deviation of Unscaled Features: {}".format(np.std(X)))

# Print the mean and standard deviation of the scaled features
print("Mean of Scaled Features: {}".format(np.mean(X_scaled))) 
print("Standard Deviation of Scaled Features: {}".format(np.std(X_scaled)))
<script.py> output:
    Mean of Unscaled Features: 18.432687072460002
    Standard Deviation of Unscaled Features: 41.54494764094571
    Mean of Scaled Features: 2.7314972981668206e-15
    Standard Deviation of Scaled Features: 0.9999999999999999

Centering and scaling in a pipeline | Python

For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
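The same effect shows up in distance-based methods such as k-NN. The sketch below uses four made-up points (not the course data) in which the first feature is measured in thousands and the second in single digits. Before scaling, the large-scale feature decides which point is nearest to row 0; after standardization the nearest neighbor changes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: column 0 is on a scale of thousands, column 1 on a scale of ones
X_demo = np.array([[1000.0, 1.0],
                   [1100.0, 9.0],   # close to row 0 in column 0 only
                   [1500.0, 1.1],   # close to row 0 in column 1 only
                   [3000.0, 5.0]])

# Euclidean distances from row 0 on the raw features
d_raw_01 = np.linalg.norm(X_demo[0] - X_demo[1])
d_raw_02 = np.linalg.norm(X_demo[0] - X_demo[2])
print(d_raw_01, d_raw_02)   # row 1 is nearer: column 0 dominates

# The same distances after standardizing each column
X_s = StandardScaler().fit_transform(X_demo)
d_s_01 = np.linalg.norm(X_s[0] - X_s[1])
d_s_02 = np.linalg.norm(X_s[0] - X_s[2])
print(d_s_01, d_s_02)       # row 2 is nearer: both columns now count
```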


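# Import the necessary modules
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split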

# Setup the pipeline steps: steps
steps = [('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 42)

# Fit the pipeline to the training set: knn_scaled
knn_scaled = pipeline.fit(X_train, y_train)

# Instantiate and fit a k-NN classifier to the unscaled data
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)

# Compute and print metrics
print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))
print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test, y_test)))
<script.py> output:
    Accuracy with Scaling: 0.7700680272108843
    Accuracy without Scaling: 0.6979591836734694

Bringing it all together I: Pipeline for classification | Python

You’ll return to using the SVM classifier you were briefly introduced to earlier in this chapter. The hyperparameters you will tune are C and gamma. C controls the regularization strength. It is analogous to the C you tuned for logistic regression in Chapter 3, while gamma controls the kernel coefficient: Do not worry about this now as it is beyond the scope of this course.

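# Import the necessary modules
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report

# Setup the pipeline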
steps = [('scaler', StandardScaler()),
         ('SVM', SVC())]

pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'SVM__C':[1, 10, 100],
              'SVM__gamma':[0.1, 0.01]}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 21)

# Instantiate the GridSearchCV object: cv
cv = GridSearchCV(pipeline, parameters, cv = 3)

# Fit to the training set
cv.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)

# Compute and print metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))
<script.py> output:
    Accuracy: 0.7795918367346939
                 precision    recall  f1-score   support
          False       0.83      0.85      0.84       662
           True       0.67      0.63      0.65       318
    avg / total       0.78      0.78      0.78       980
    Tuned Model Parameters: {'SVM__C': 10, 'SVM__gamma': 0.1}
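For reference on what gamma does: SVC uses the RBF kernel by default, K(x, x') = exp(-gamma * ||x - x'||^2), so a larger gamma makes the similarity between two points fall off faster with distance. A minimal sketch with two made-up points, comparing a manual computation with sklearn.metrics.pairwise.rbf_kernel:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two made-up points; their squared Euclidean distance is 2
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 1.0]])

pairs = []
for gamma in [0.01, 0.1, 1.0]:
    # Manual RBF kernel value: exp(-gamma * squared distance)
    manual = np.exp(-gamma * np.sum((a - b) ** 2))
    # Same quantity computed by scikit-learn
    library = rbf_kernel(a, b, gamma=gamma)[0, 0]
    pairs.append((manual, library))
    print(gamma, manual, library)
```

The kernel value shrinks as gamma grows, which is why gamma is tuned together with C in the grid search.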

Bringing it all together II: Pipeline for regression | Python

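# Import the necessary modules
import numpy as np
from sklearn.preprocessing import Imputer, StandardScaler
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split

# Setup the pipeline steps: steps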
steps = [('imputation', Imputer(missing_values='NaN', strategy='mean', axis=0)),
         ('scaler', StandardScaler()),
         ('elasticnet', ElasticNet())]

# Create the pipeline: pipeline 
pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'elasticnet__l1_ratio':np.linspace(0,1,30)}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.4, random_state = 42)

# Create the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(pipeline, parameters, cv = 3)

# Fit to the training set
gm_cv.fit(X_train, y_train)

# Compute and print the metrics
# Note: the label below says "Alpha", but the hyperparameter tuned here is l1_ratio
r2 = gm_cv.score(X_test, y_test)
print("Tuned ElasticNet Alpha: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
<script.py> output:
    Tuned ElasticNet Alpha: {'elasticnet__l1_ratio': 1.0}
    Tuned ElasticNet R squared: 0.8862016570888217
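A tuned l1_ratio of 1.0 means the penalty is pure L1, so the selected model behaves like a Lasso. The sketch below checks this equivalence on synthetic data; the data and the value alpha=0.1 are made up for illustration and are unrelated to the course dataset.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic regression data (illustrative only)
rng = np.random.RandomState(0)
X_demo = rng.randn(50, 3)
y_demo = X_demo @ np.array([2.0, 0.0, -1.0]) + 0.1 * rng.randn(50)

# ElasticNet with l1_ratio=1.0 uses only the L1 penalty
enet = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)

# Lasso with the same alpha
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)

print(enet.coef_)
print(lasso.coef_)
```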

Final thoughts | Python


  1. If x^2+y^2+z^2=1, the vector [x,y,z] is called a unit vector. Any vector with norm 1 is a unit vector; there are infinitely many of them, one in every direction.
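As a quick check of this definition, any nonzero vector can be turned into a unit vector by dividing it by its norm. A minimal NumPy sketch with a made-up vector:

```python
import numpy as np

# Made-up vector with norm sqrt(9 + 16 + 144) = 13
v = np.array([3.0, 4.0, 12.0])
norm = np.linalg.norm(v)

# Dividing by the norm gives the unit vector in the same direction
u = v / norm

# The squared components of u sum to 1
print(norm, u, np.sum(u ** 2))
```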