Machine Learning with the Experts: School Budgets (Study Notes)

DrivenData: Machine Learning Course

Peter Bull | DataCamp. He teaches this course, so it should be a strong one.

Introducing the challenge | Python


What category of problem is this? | Python

Supervised Learning, because the model is trained on labeled examples, not through a system of rewards and punishments.

Exploring the data | Python


Looking at the datatypes | Python

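In pandas, the category dtype stores string variables as integer codes. Use .astype('category') to convert. Here .astype is called on one Series at a time; recent pandas versions also accept DataFrame.astype('category').

pd.get_dummies(df, ...) always takes the DataFrame as its first argument. prefix_sep='_' sets the separator between the original column name and the category value in each new column name.

A lambda function is a one-line function. Combined with apply(..., axis=0), it operates on every column in a single call.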
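
The three ideas above can be tried together on a toy DataFrame. This is only a sketch: the column names and values are made up and are not the course data.

```python
import pandas as pd

# Toy data; the column names and values are made up for illustration.
df = pd.DataFrame({
    "Function": ["Teacher", "Aide", "Teacher"],
    "Use": ["Instruction", "Instruction", "Operations"],
})

# Convert every column to the category dtype, one Series at a time.
to_category = lambda col: col.astype("category")
df = df.apply(to_category, axis=0)
print(df.dtypes)

# Integer codes that back the first column.
print(df["Function"].cat.codes.tolist())

# One indicator column per (column, value) pair, joined by prefix_sep.
dummies = pd.get_dummies(df, prefix_sep="_")
print(list(dummies.columns))
```

Categories are sorted before codes are assigned, so "Aide" gets code 0 and "Teacher" gets code 1.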

Exploring datatypes in pandas | Python

In [1]: df.dtypes
Function                   object
Use                        object
Sharing                    object
Reporting                  object
dtype: object
In [2]: df.dtypes.value_counts()
object     23
float64     2
dtype: int64


Encode the labels as categorical variables | Python

.dtypes is more precise than df.info().

# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')

# Convert df[LABELS] to a categorical type
df[LABELS] = df[LABELS].apply(categorize_label, axis = 0)

# Print the converted dtypes
print(df[LABELS].dtypes)


In [1]: LABELS
In [4]: print(df[LABELS].dtypes)
Function            category
Use                 category
Sharing             category
Reporting           category
Student_Type        category
Position_Type       category
Object_Type         category
Pre_K               category
Operating_Status    category
dtype: object

Counting unique labels | Python

# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Calculate number of unique values for each label: num_unique_labels
num_unique_labels = df[LABELS].apply(pd.Series.nunique, axis = 0)

# Plot number of unique values for each label
num_unique_labels.plot(kind = 'bar')

# Label the axes
plt.ylabel('Number of unique values')

# Display the plot
plt.show()


How do we measure success? | Python

How sklearn computes log loss, explained in detail - ybdesire's column - CSDN blog


\[logloss = -\frac{1}{N}\sum_{i=1}^N \left[ y_i \log p_i + (1-y_i)\log(1-p_i) \right]\]



Because \(\log e^x = x\), the sum can be moved into an exponent and turned into a product, which shows that log loss is the negative mean log-likelihood:

\[\begin{alignat}{2} logloss & = -\frac{1}{N}\log e^{\sum_{i=1}^N \left[ y_i \log p_i + (1-y_i)\log(1-p_i) \right]} \\ & = -\frac{1}{N}\log \prod_{i=1}^N e^{y_i \log p_i} \, e^{(1-y_i)\log(1-p_i)} \\ & = -\frac{1}{N}\log \prod_{i=1}^N p_i^{y_i} (1-p_i)^{1-y_i} \end{alignat}\]


Suppose \(y_i = 1, \hat y_i = 0, p_i = 0.5\):

\[logloss_i = -(y_i \log p_i + (1-y_i)\log(1-p_i))\]

\[\begin{alignat}{2} logloss_i & = -(y_i \log p_i + (1-y_i)\log(1-p_i)) \\ & = -(1 \times \log 0.5 + (1-1) \times \log(1-0.5)) \\ & = -\log(0.5) \\ & \approx 0.69 \end{alignat}\]

So \(p_i = 0.5 \to \hat y_i = 0 \to \text{not confident and wrong}\), and the penalty is modest.


Now suppose \(y_i = 0, \hat y_i = 1, p_i = 0.9\):

\[\begin{alignat}{2} logloss_i & = -(y_i \log p_i + (1-y_i)\log(1-p_i)) \\ & = -(0 \times \log 0.9 + (1-0) \times \log(1-0.9)) \\ & = -\log(0.1) \\ & \approx 2.30 \end{alignat}\]

So \(p_i = 0.9 \to \hat y_i = 1 \to \text{confident and wrong}\), and the penalty is high.

log loss provides a steep penalty for predictions that are both wrong and confident, i.e., a high probability is assigned to the incorrect class.
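
The two hand calculations above can be checked numerically. The helper name log_loss_single is made up for this sketch; it evaluates the per-sample formula with the natural logarithm, and a third, confident-and-correct case is included for contrast.

```python
import math

def log_loss_single(y, p):
    """Log loss of one prediction: y is the true label (0 or 1), p is P(y = 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

unsure = log_loss_single(1, 0.5)           # true label 1, model undecided
confident_wrong = log_loss_single(0, 0.9)  # true label 0, model says 1 with p = 0.9
confident_right = log_loss_single(1, 0.9)  # true label 1, model says 1 with p = 0.9

print(round(unsure, 2), round(confident_wrong, 2), round(confident_right, 2))
```

The confident wrong answer costs roughly 2.30, more than three times the undecided one at roughly 0.69, while the confident correct answer costs only about 0.11.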


Computing log loss with NumPy | Python

import numpy as np

def compute_log_loss(predicted, actual, eps=1e-14):
    """Compute the logarithmic loss between predicted and actual
    when these are 1D arrays.

    :param predicted: The predicted probabilities as floats between 0-1
    :param actual: The actual binary labels. Either 0 or 1.
    :param eps (optional): log(0) is -inf, so we need to offset our
                           predicted values slightly by eps from 0 or 1.
    """
    # Keep probabilities strictly inside (0, 1) so both logs stay finite
    predicted = np.clip(predicted, eps, 1 - eps)
    loss = -1 * np.mean(actual * np.log(predicted)
                        + (1 - actual) * np.log(1 - predicted))
    return loss

Using the compute_log_loss() function, compute the log loss for the following predicted values (in each case, the actual values are contained in actual_labels):

  • correct_confident. These are predicted probabilities \(\hat y\), not features \(X\).
  • correct_not_confident.
  • wrong_not_confident.
  • wrong_confident.
  • actual_labels.
# Compute and print log loss for 1st case
# (results go into new *_loss names so the input arrays are not overwritten)
correct_confident_loss = compute_log_loss(correct_confident, actual_labels)
print("Log loss, correct and confident: {}".format(correct_confident_loss))

# Compute and print log loss for 2nd case
correct_not_confident_loss = compute_log_loss(correct_not_confident, actual_labels)
print("Log loss, correct and not confident: {}".format(correct_not_confident_loss))

# Compute and print log loss for 3rd case
wrong_not_confident_loss = compute_log_loss(wrong_not_confident, actual_labels)
print("Log loss, wrong and not confident: {}".format(wrong_not_confident_loss))

# Compute and print log loss for 4th case
wrong_confident_loss = compute_log_loss(wrong_confident, actual_labels)
print("Log loss, wrong and confident: {}".format(wrong_confident_loss))

# Compute and print log loss for actual labels
actual_labels_loss = compute_log_loss(actual_labels, actual_labels)
print("Log loss, actual labels: {}".format(actual_labels_loss))
<script.py> output:
    Log loss, correct and confident: 0.05129329438755058
    Log loss, correct and not confident: 0.4307829160924542
    Log loss, wrong and not confident: 1.049822124498678
    Log loss, wrong and confident: 2.9957322735539904
    Log loss, actual labels: 9.99200722162646e-15

Among the four predictions, correct and not confident has the second-lowest loss, so a little humility pays off.



Signature: np.clip(a, a_min, a_max, out=None)
Clip (limit) the values in an array.

Given an interval, values outside the interval are clipped to
the interval edges.  For example, if an interval of ``[0, 1]``
is specified, values smaller than 0 become 0, and values larger
than 1 become 1.
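
A minimal sketch of what the clipping step does to probabilities that sit exactly at 0 or 1. The array values are made up; eps matches the default used in compute_log_loss.

```python
import numpy as np

eps = 1e-14
probs = np.array([0.0, 0.3, 1.0])

# Pull the endpoints slightly inside (0, 1); interior values are untouched.
clipped = np.clip(probs, eps, 1 - eps)
print(clipped)

# After clipping, the logarithm is finite everywhere.
print(np.log(clipped))
```

This also accounts for the 9.99e-15 reported for the actual labels scored against themselves: the clipped values are off from 0 and 1 by eps, so the loss is about eps rather than exactly 0.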


It’s time to build a model | Python


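Stratified sampling prevents some labels \(y\) from being missing from the training set and therefore never learned. The course's multilabel_train_test_split() takes care of this.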


  • One classifier is fitted per class; for each classifier, the class is fitted against all the other classes.
  • The strategy also works for multilabel learning, by fitting on a 2-d matrix in which cell [i, j] is 1 if sample i has label j and 0 otherwise.

sklearn.multiclass.OneVsRestClassifier — scikit-learn 0.19.1 documentation
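
To make the 2-d matrix concrete, here is a small sketch of an indicator matrix built with pd.get_dummies. The three samples and their label values are hypothetical.

```python
import pandas as pd

# Hypothetical samples, each tagged with one value per label column.
labels = pd.DataFrame({
    "Pre_K": ["PreK", "Non PreK", "PreK"],
    "Sharing": ["School", "School", "District"],
})

# Row i, column j is 1 when sample i carries label j, else 0.
indicator = pd.get_dummies(labels).astype(int)
print(indicator)

# One-vs-rest fits one binary classifier per column of this matrix.
for name in indicator.columns:
    print(name, indicator[name].tolist())
```

Each row sums to the number of label columns (2 here), since every sample has exactly one value per label.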


# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Create the DataFrame: numeric_data_only
numeric_data_only = df[NUMERIC_COLUMNS].fillna(-1000)

# Get labels and convert to dummy variables: label_dummies
label_dummies = pd.get_dummies(df[LABELS])

# Create training and test sets
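X_train, X_test, y_train, y_test = multilabel_train_test_split(numeric_data_only,
                                                               label_dummies,
                                                               size=0.2,   # assumed value: the original call is cut off here
                                                               seed=123)   # assumed value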

# Instantiate the classifier: clf
clf = OneVsRestClassifier(LogisticRegression())

# Fit the classifier to the training data
clf.fit(X_train, y_train)

# Print the accuracy
print("Accuracy: {}".format(clf.score(X_test, y_test)))
<script.py> output:
    Accuracy: 0.0
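
A likely reason for the 0.0: with multilabel targets, scikit-learn's classifier score is subset accuracy, so a row counts as correct only if every one of its label columns is predicted correctly. The sketch below reproduces that rule with NumPy on made-up indicator matrices (3 samples, 4 columns).

```python
import numpy as np

# Hypothetical indicator matrices: 3 samples, 4 label columns.
y_true = np.array([[1, 0, 0, 1],
                   [0, 1, 1, 0],
                   [1, 0, 1, 0]])
y_pred = np.array([[1, 0, 0, 0],   # one column wrong
                   [0, 1, 1, 0],   # every column right
                   [0, 0, 1, 0]])  # one column wrong

# Subset accuracy: a row counts only if all of its columns match.
subset_accuracy = np.all(y_true == y_pred, axis=1).mean()

# Per-cell accuracy, for comparison.
cell_accuracy = (y_true == y_pred).mean()

print(subset_accuracy, cell_accuracy)
```

Here 10 of 12 cells are right, yet only 1 of 3 rows passes. With 104 columns the bar is far higher, which is why log loss is the more informative metric for this problem.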


Making predictions | Python


Use your model to predict values on holdout data | Python

Holdout data: this is exactly where I tripped up before.

In [3]: pd.read_csv('HoldoutData.csv').head()
   Unnamed: 0              Object_Description         Program_Description  \
0         237    Personal Services - Teachers       Instruction - Regular   
1         466    Extra Duty/Signing Bonus Pay  Basic Educational Services   
2         784  OTHER PERSONAL SERVICES                                NaN   
3        1786  TERMINAL LEAVE VACATION                                NaN   
4        2643    Extra Duty/Signing Bonus Pay               Undistributed   
In [4]: pd.read_csv('HoldoutData.csv', index_col=0).head()
                  Object_Description         Program_Description  \
237     Personal Services - Teachers       Instruction - Regular   
466     Extra Duty/Signing Bonus Pay  Basic Educational Services   
784   OTHER PERSONAL SERVICES                                NaN   
1786  TERMINAL LEAVE VACATION                                NaN   
2643    Extra Duty/Signing Bonus Pay               Undistributed   

index_col=0 uses the first column as the index. clf.fit(X_train, y_train) trains the model on the training samples.
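
The effect of index_col=0 can be reproduced without the real file. The in-memory CSV below is a made-up stand-in for HoldoutData.csv, with an unnamed first column like the one in the output above.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for HoldoutData.csv; values are made up.
csv_text = ",FTE,Total\n237,1.0,100.0\n466,,250.5\n"

default_index = pd.read_csv(io.StringIO(csv_text))
first_col_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without index_col, the unnamed first column becomes a data column.
print(default_index.columns.tolist())

# With index_col=0, it becomes the row index instead.
print(first_col_index.index.tolist())
```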

# Instantiate the classifier: clf
clf = OneVsRestClassifier(LogisticRegression())

# Fit it to the training data
clf.fit(X_train, y_train)

# Load the holdout data: holdout
holdout = pd.read_csv('HoldoutData.csv', index_col=0)

# Generate predictions: predictions
predictions = clf.predict_proba(holdout[NUMERIC_COLUMNS].fillna(-1000))


In [2]: type(y_train)
Out[2]: pandas.core.frame.DataFrame

In [1]: y_train.columns
Index(['Function_Aides Compensation', 'Function_Career & Academic Counseling',
       'Function_Communications', 'Function_Curriculum Development',
       'Function_Data Processing & Information Services',
       'Function_Development & Fundraising', 'Function_Enrichment',
       'Function_Extended Time & Tutoring',
       'Function_Facilities & Maintenance', 'Function_Facilities Planning',
       'Object_Type_Rent/Utilities', 'Object_Type_Substitute Compensation',
       'Object_Type_Supplies/Materials', 'Object_Type_Travel & Conferences',
       'Pre_K_NO_LABEL', 'Pre_K_Non PreK', 'Pre_K_PreK',
       'Operating_Status_Operating, Not PreK-12',
       'Operating_Status_PreK-12 Operating'],
      dtype='object', length=104)

So in principle, with One vs. Rest, there are also 104 columns of \(\hat y\).

In [3]: holdout.columns
Index(['Object_Description', 'Program_Description', 'SubFund_Description',
       'Job_Title_Description', 'Facility_or_Department',
       'Sub_Object_Description', 'Location_Description', 'FTE',
       'Function_Description', 'Position_Extra', 'Text_4', 'Total', 'Text_2',
       'Text_3', 'Fund_Description', 'Text_1'],

In other words, these columns are all the \(X\) that holdout offers for prediction, and from them we have to predict 104 columns of \(\hat y\).

In [2]: NUMERIC_COLUMNS
Out[2]: ['FTE', 'Total']


Writing out your results to a csv for submission | Python

# Generate predictions: predictions
predictions = clf.predict_proba(holdout[NUMERIC_COLUMNS].fillna(-1000))
In [4]: predictions.shape
Out[4]: (2000, 104)

Nice: 2000 rows and 104 columns of \(\hat y\), exactly as expected.

# Format predictions in DataFrame: prediction_df
prediction_df = pd.DataFrame(columns=pd.get_dummies(df[LABELS]).columns,
                             index=holdout.index,
                             data=predictions)

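This is easy to explain. columns=pd.get_dummies(df[LABELS]).columns supplies the names of the 104 label columns. index=holdout.index supplies the 2000 row labels. data=predictions is the 2000 x 104 array of probabilities that fills them.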
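
The same construction on a miniature scale, as a sketch: 3 rows and 2 label columns stand in for the real 2000 and 104, and all values are made up.

```python
import numpy as np
import pandas as pd

# Stand-ins: 3 holdout rows and 2 label columns instead of 2000 and 104.
predictions = np.array([[0.9, 0.1],
                        [0.2, 0.8],
                        [0.5, 0.5]])
holdout_index = pd.Index([237, 466, 784])
label_columns = ["Pre_K_Non PreK", "Pre_K_PreK"]

# The shape of data must match (len(index), len(columns)).
prediction_df = pd.DataFrame(data=predictions,
                             index=holdout_index,
                             columns=label_columns)
print(prediction_df)
```

Keeping the holdout index on the rows matters for submission, because each predicted row has to be matched back to its original record.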

# Save prediction_df to csv
prediction_df.to_csv('predictions.csv')
In [7]: # Submit the predictions for scoring: score
        score = score_submission(pred_path = 'predictions.csv')
        # Print score
        print('Your model, trained with numeric data only, yields logloss score: {}'.format(score))
Your model, trained with numeric data only, yields logloss score: 1.9067227623381413


Even though your basic model scored 0.0 accuracy, it nevertheless performs better than the benchmark score of 2.0455. You’ve now got the basics down and have made a first pass at this complicated supervised learning problem. It’s time to step up your game and incorporate the text data.


A very brief introduction to NLP | Python


Indeed, bag of words looks only at word frequency and ignores word order, so "Red, not blue" and "Blue, not red" come out identical.


Creating a bag-of-words in scikit-learn | Python



  • Tokenize the text
  • Build the vocabulary
  • Count the frequencies
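
The three steps can be sketched in plain Python. This is not scikit-learn's implementation: the tokenizer is a simplified assumption (lowercase, then runs of letters and digits), chosen only to make each step visible.

```python
import re
from collections import Counter

docs = ["Red, not blue", "Blue, not red"]

# 1. Tokenize: lowercase, then keep runs of letters and digits.
tokenized = [re.findall(r"[a-z0-9]+", doc.lower()) for doc in docs]

# 2. Build the vocabulary: every distinct token, in sorted order.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# 3. Count: one row of frequencies per document.
counts = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

print(vocabulary)
print(counts)
```

Both documents produce the same count row, which is the loss of word order that bag of words accepts.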
