DrivenData: Machine Learning Course
Peter Bull | DataCamp. This course is taught by him; he should be really good.
Introducing the challenge | Python
So you can determine whether there is oversupply or undersupply.
What category of problem is this? | Python
Supervised Learning, because the model will be trained on examples that already carry the correct labels.
Exploring the data | Python
.describe() only summarizes the continuous (numeric) variables.
Looking at the datatypes | Python
In pandas, the category dtype stores string variables as numeric codes under the hood.
Use .astype('category') to convert.
But .astype() works on one Series (column) at a time.
pd.get_dummies(df, ...) always takes the DataFrame as its first argument.
prefix_sep='_' controls how the new dummy column names are formed.
A \(\lambda\) function is a one-line function.
Adding .apply(..., axis=0) lets you run the conversion over all columns in one pass; a small sketch follows.
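A minimal sketch of these two pieces, using a toy DataFrame invented here (not the course data):

```python
# Toy example (invented data): .astype('category') applied column-wise via
# .apply(), then pd.get_dummies() with prefix_sep to build the dummy names.
import pandas as pd

df_toy = pd.DataFrame({'Use': ['Instruction', 'Operations'],
                       'Sharing': ['School', 'Shared']})

categorize_label = lambda x: x.astype('category')   # one-line lambda
df_toy = df_toy.apply(categorize_label, axis=0)      # batch over all columns

dummies = pd.get_dummies(df_toy, prefix_sep='_')
print(dummies.columns.tolist())
# ['Use_Instruction', 'Use_Operations', 'Sharing_School', 'Sharing_Shared']
```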
Exploring datatypes in pandas | Python
In [1]: df.dtypes
Out[1]:
Function object
Use object
Sharing object
Reporting object
...
dtype: object
In [2]: df.dtypes.value_counts()
Out[2]:
object 23
float64 2
dtype: int64
Really clever!
Encode the labels as categorical variables | Python
.dtypes is more precise than df.info().
# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')
# Convert df[LABELS] to a categorical type
df[LABELS] = df[LABELS].apply(categorize_label, axis = 0)
# Print the converted dtypes
print(df[LABELS].dtypes)
We can use it this way here because LABELS is a list.
In [1]: LABELS
Out[1]:
['Function',
'Use',
'Sharing',
'Reporting',
'Student_Type',
'Position_Type',
'Object_Type',
'Pre_K',
'Operating_Status']
In [4]: print(df[LABELS].dtypes)
Function category
Use category
Sharing category
Reporting category
Student_Type category
Position_Type category
Object_Type category
Pre_K category
Operating_Status category
dtype: object
Counting unique labels | Python
# Import matplotlib.pyplot
import matplotlib.pyplot as plt
# Calculate number of unique values for each label: num_unique_labels
num_unique_labels = df[LABELS].apply(pd.Series.nunique, axis = 0)
# Plot number of unique values for each label
num_unique_labels.plot(kind = 'bar')
# Label the axes
plt.xlabel('Labels')
plt.ylabel('Number of unique values')
# Display the plot
plt.show()
pd.Series.nunique is the built-in way to do this.
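As a small aside (my own check with toy data, not from the course), DataFrame.nunique() gives the same per-column counts without the explicit apply():

```python
# Toy data invented here: both lines return the number of unique values
# per column as a Series.
import pandas as pd

toy = pd.DataFrame({'Function': ['A', 'B', 'A'], 'Use': ['X', 'X', 'Y']})
print(toy.apply(pd.Series.nunique, axis=0))   # Function 2, Use 2
print(toy.nunique())                          # same result, shorter
```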
How do we measure success? | Python
Reference: "A detailed walk-through of how sklearn computes log loss" (ybdesire's column, CSDN blog).
So this is a loss function that penalizes over-confidence, to counterbalance plain accuracy \(Acc\).
\[logloss = -\frac{1}{N}\sum_{i=1}^N \left[ y_i \log p_i + (1-y_i)\log(1-p_i) \right]\]
Seeing this, you can borrow the softmax function to handle the multi-class case.
To understand it,
\[logloss = -\frac{1}{N}\log e^{\sum_{i=1}^N \left[ y_i \log p_i + (1-y_i)\log(1-p_i) \right]}\]
\[\begin{alignat}{2} logloss & = -\frac{1}{N}\log e^{\sum_{i=1}^N \left[ y_i \log p_i + (1-y_i)\log(1-p_i) \right]} \\ & = -\frac{1}{N}\log \prod_{i=1}^N e^{y_i \log p_i} \cdot e^{(1-y_i)\log(1-p_i)} \\ & = -\frac{1}{N}\log \prod_{i=1}^N p_i^{y_i} (1-p_i)^{1-y_i} \\ \end{alignat}\]
An example shows that confident mistakes are penalized more heavily.
Suppose \(y_i = 1\) and \(p_i = 0.5\) (the model is not confident):
\[logloss_i = -(y_i \log p_i + (1-y_i)\log(1-p_i))\]
\[\begin{alignat}{2} logloss_i & = -(y_i \log p_i + (1-y_i)\log(1-p_i)) \\ & = -(1 \times \log 0.5 + (1-1) \times \log(1-0.5)) \\ & = -log(0.5) \\ & = 0.69 \end{alignat}\]
So with \(p_i = 0.5\) the prediction is not confident, and even when it is wrong the penalty is modest.
However, suppose \(y_i = 0\) and \(p_i = 0.9\):
\[\begin{alignat}{2} logloss_i & = -(y_i \log p_i + (1-y_i)\log(1-p_i)) \\ & = -(0 \times \log 0.9 + (1-0) \times \log(1-0.9)) \\ & = -log(0.1) \\ & = 2.30 \end{alignat}\]
So \(p_i = 0.9 \to \hat y_i = 1\): confident and wrong, hence the penalty is high.
log loss provides a steep penalty for predictions that are both wrong and confident, i.e., a high probability is assigned to the incorrect class.
So it strongly penalizes confident false positives (\(FP\)) and false negatives (\(FN\)).
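A quick numeric check of the two per-sample penalties above (my own sketch; the numbers just reproduce the hand calculation):

```python
# Verify the two worked examples: y=1, p=0.5 (not confident) vs
# y=0, p=0.9 (confident and wrong).
import numpy as np

def sample_log_loss(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(sample_log_loss(1, 0.5))   # ~0.693
print(sample_log_loss(0, 0.9))   # ~2.303
```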
Computing log loss with NumPy | Python
import numpy as np

def compute_log_loss(predicted, actual, eps=1e-14):
    """Computes the logarithmic loss between predicted and
    actual when these are 1D arrays.

    :param predicted: The predicted probabilities as floats between 0-1
    :param actual: The actual binary labels. Either 0 or 1.
    :param eps (optional): log(0) is inf, so we need to offset our
                           predicted values slightly by eps from 0 or 1.
    """
    predicted = np.clip(predicted, eps, 1 - eps)
    loss = -1 * np.mean(actual * np.log(predicted)
                        + (1 - actual) * np.log(1 - predicted))
    return loss
Using the compute_log_loss() function, compute the log loss for the following predicted values (in each case, the actual values are contained in actual_labels). These are \(\hat y\) values, not \(X\):
- correct_confident
- correct_not_confident
- wrong_not_confident
- wrong_confident
- actual_labels
# Compute and print log loss for 1st case
correct_confident = compute_log_loss(correct_confident, actual_labels)
print("Log loss, correct and confident: {}".format(correct_confident))
# Compute log loss for 2nd case
correct_not_confident = compute_log_loss(correct_not_confident, actual_labels)
print("Log loss, correct and not confident: {}".format(correct_not_confident))
# Compute and print log loss for 3rd case
wrong_not_confident = compute_log_loss(wrong_not_confident, actual_labels)
print("Log loss, wrong and not confident: {}".format(wrong_not_confident))
# Compute and print log loss for 4th case
wrong_confident = compute_log_loss(wrong_confident, actual_labels)
print("Log loss, wrong and confident: {}".format(wrong_confident))
# Compute and print log loss for actual labels
actual_labels = compute_log_loss(actual_labels, actual_labels)
print("Log loss, actual labels: {}".format(actual_labels))
<script.py> output:
Log loss, correct and confident: 0.05129329438755058
Log loss, correct and not confident: 0.4307829160924542
Log loss, wrong and not confident: 1.049822124498678
Log loss, wrong and confident: 2.9957322735539904
Log loss, actual labels: 9.99200722162646e-15
correct and not confident has the second-smallest loss, so being modest pays off.
I still don't fully get it.
Let's understand this function:
Signature: np.clip(a, a_min, a_max, out=None)
Docstring:
Clip (limit) the values in an array.
Given an interval, values outside the interval are clipped to
the interval edges. For example, if an interval of ``[0, 1]``
is specified, values smaller than 0 become 0, and values larger
than 1 become 1.
So eps=1e-14 stands in for 0 and 1-eps=1-1e-14 stands in for 1. Got it: it just sets a lower and an upper bound.
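A minimal sketch of the clipping step (example values invented here):

```python
# Clip predicted probabilities away from exactly 0 and 1 so that the
# subsequent log() calls never hit log(0).
import numpy as np

eps = 1e-14
p = np.array([0.0, 0.3, 1.0])
print(np.clip(p, eps, 1 - eps))   # [1e-14, 0.3, 1 - 1e-14]
```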
It’s time to build a model | Python
Keep going!
Stratified sampling is there so that no class of \(y\) ends up missing from the training data.
multilabel_train_test_split()
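multilabel_train_test_split() is a helper provided with the course/DrivenData materials, so I won't guess its internals. For the single-label case the same idea is sklearn's stratify= option, shown in this minimal sketch with toy data:

```python
# Toy data invented here: stratified splitting keeps the rare class present
# in both the training and the test split.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)   # class 1 is rare

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=123)

print(y_tr.mean(), y_te.mean())    # both splits contain ~20% of class 1
```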
What does OneVsRestClassifier mean?
- For each classifier, the class is fitted against all the other classes.
- Multilabel learning works by fitting on a 2-d matrix in which cell [i, j] is 1 if sample i has label j and 0 otherwise.
sklearn.multiclass.OneVsRestClassifier — scikit-learn 0.19.1 documentation
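A toy sketch (data invented here, not the course data) of those two points: one binary classifier is fitted per label column of a 2-d indicator matrix.

```python
# OneVsRestClassifier fits one LogisticRegression per column of Y.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # one numeric feature
Y = np.array([[1, 0],                        # cell [i, j] = 1 if sample i has label j
              [1, 0],
              [0, 1],
              [0, 1]])

clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)
print(len(clf.estimators_))           # 2: one fitted classifier per label
print(clf.predict_proba(X).shape)     # (4, 2): one probability per label
```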
What's in sklearn.multiclass?
# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
# Create the DataFrame: numeric_data_only
numeric_data_only = df[NUMERIC_COLUMNS].fillna(-1000)
# Get labels and convert to dummy variables: label_dummies
label_dummies = pd.get_dummies(df[LABELS])
# Create training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(numeric_data_only,
                                                               label_dummies,
                                                               size=0.2,
                                                               seed=123)
# Instantiate the classifier: clf
clf = OneVsRestClassifier(LogisticRegression())
# Fit the classifier to the training data
clf.fit(X_train, y_train)
# Print the accuracy
print("Accuracy: {}".format(clf.score(X_test, y_test)))
<script.py> output:
Accuracy: 0.0
Stunned.
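Part of the explanation, I think, is that clf.score() on a multilabel target uses exact-match (subset) accuracy: a row only counts if all 104 label columns are right at once. A toy sketch (arrays invented here):

```python
# Exact-match accuracy can be 0 even when most individual labels are right.
import numpy as np

y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],     # one of three labels wrong
                   [0, 1, 1]])    # one of three labels wrong

print(np.mean(np.all(y_true == y_pred, axis=1)))   # 0.0: no row fully correct
print(np.mean(y_true == y_pred))                   # ~0.67: most cells correct
```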
Making predictions | Python
From here on we start uploading .csv files to enter the competition.
Use your model to predict values on holdout data | Python
Holdout data: this is exactly where I stumbled before.
In [3]: pd.read_csv('HoldoutData.csv').head()
Out[3]:
Unnamed: 0 Object_Description Program_Description \
0 237 Personal Services - Teachers Instruction - Regular
1 466 Extra Duty/Signing Bonus Pay Basic Educational Services
2 784 OTHER PERSONAL SERVICES NaN
3 1786 TERMINAL LEAVE VACATION NaN
4 2643 Extra Duty/Signing Bonus Pay Undistributed
In [4]: pd.read_csv('HoldoutData.csv', index_col=0).head()
Out[4]:
Object_Description Program_Description \
237 Personal Services - Teachers Instruction - Regular
466 Extra Duty/Signing Bonus Pay Basic Educational Services
784 OTHER PERSONAL SERVICES NaN
1786 TERMINAL LEAVE VACATION NaN
2643 Extra Duty/Signing Bonus Pay Undistributed
index_col=0 uses the first column as the index.
clf.fit(X_train, y_train) trains the model on the training samples.
# Instantiate the classifier: clf
clf = OneVsRestClassifier(LogisticRegression())
# Fit it to the training data
clf.fit(X_train, y_train)
# Load the holdout data: holdout
holdout = pd.read_csv('HoldoutData.csv', index_col=0)
# Generate predictions: predictions
predictions = clf.predict_proba(holdout[NUMERIC_COLUMNS].fillna(-1000))
I hadn't noticed there are that many \(y\) columns.
In [2]: type(y_train)
Out[2]: pandas.core.frame.DataFrame
In [1]: y_train.columns
Out[1]:
Index(['Function_Aides Compensation', 'Function_Career & Academic Counseling',
'Function_Communications', 'Function_Curriculum Development',
'Function_Data Processing & Information Services',
'Function_Development & Fundraising', 'Function_Enrichment',
'Function_Extended Time & Tutoring',
'Function_Facilities & Maintenance', 'Function_Facilities Planning',
...
'Object_Type_Rent/Utilities', 'Object_Type_Substitute Compensation',
'Object_Type_Supplies/Materials', 'Object_Type_Travel & Conferences',
'Pre_K_NO_LABEL', 'Pre_K_Non PreK', 'Pre_K_PreK',
'Operating_Status_Non-Operating',
'Operating_Status_Operating, Not PreK-12',
'Operating_Status_PreK-12 Operating'],
dtype='object', length=104)
So in theory, with One vs. Rest, there are also 104 columns of \(\hat y\).
In [3]: holdout.columns
Out[3]:
Index(['Object_Description', 'Program_Description', 'SubFund_Description',
'Job_Title_Description', 'Facility_or_Department',
'Sub_Object_Description', 'Location_Description', 'FTE',
'Function_Description', 'Position_Extra', 'Text_4', 'Total', 'Text_2',
'Text_3', 'Fund_Description', 'Text_1'],
dtype='object')
In other words, these are all the \(X\) features available in the holdout samples, and from them we have to predict 104 columns of \(\hat y\).
In [2]: NUMERIC_COLUMNS
Out[2]: ['FTE', 'Total']
And only two \(X\) variables can be used; that's rough.
Writing out your results to a csv for submission | Python
# Generate predictions: predictions
predictions = clf.predict_proba(holdout[NUMERIC_COLUMNS].fillna(-1000))
In [4]: predictions.shape
Out[4]: (2000, 104)
Haha, 2000 rows and 104 columns of \(\hat y\). Great.
# Format predictions in DataFrame: prediction_df
prediction_df = pd.DataFrame(columns=pd.get_dummies(df[LABELS]).columns,
                             index=holdout.index,
                             data=predictions)
This is easy to explain:
- pd.get_dummies(df[LABELS]).columns: the column names are, of course, the 104 dummy names.
- index=holdout.index: the 2000 holdout rows.
- data=predictions: 2000 rows, with the 104 columns already defined.
# Save prediction_df to csv
prediction_df.to_csv('predictions.csv')
# Submit the predictions for scoring: score
score = score_submission(pred_path = 'predictions.csv')
# Print score
print('Your model, trained with numeric data only, yields logloss score: {}'.format(score))
Your model, trained with numeric data only, yields logloss score: 1.9067227623381413
Not bad!
Even though your basic model scored 0.0 accuracy, it nevertheless performs better than the benchmark score of 2.0455. You’ve now got the basics down and have made a first pass at this complicated supervised learning problem. It’s time to step up your game and incorporate the text data.
That's pretty funny, haha.
A very brief introduction to NLP | Python
This application scenario doesn't really fit my needs.
True: a bag of words only counts frequencies and ignores word order, so "Red, not blue" = "Blue, not red".
NLP isn't that interesting to me; it's not my direction.
Creating a bag-of-words in scikit-learn | Python
NLP isn't my direction, and I don't want to pursue it.
CountVectorizer()
- tokenizes the text
- builds the vocabulary
- counts the frequencies
Pretty boring; treat it as review. A small sketch follows.
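A minimal sketch of those three steps with toy sentences invented here; it also shows the word-order point from above ("Red, not blue" vs "Blue, not red"):

```python
# CountVectorizer: tokenize, build the vocabulary, count frequencies.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Red, not blue", "Blue, not red"]
vec = CountVectorizer()
counts = vec.fit_transform(docs)      # fit = tokenize + vocabulary; transform = count

print(sorted(vec.vocabulary_))        # ['blue', 'not', 'red']
print(counts.toarray())               # identical rows: word order is lost
```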