
Kaggle Python Tutorial on Machine Learning: Study Notes

Get the Data with Pandas | Python

# Import the Pandas library
import pandas as pd
# Load the train and test datasets to create two DataFrames
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)

test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

# Print the `head` of the train and test DataFrames
print(train.head())
print(test.head())

Understanding your data | Python

For a pd.DataFrame, three inspectors are worth remembering. Note that .describe() and .info() are methods, so don't forget the parentheses, while .shape is an attribute and takes none. A minimal sketch follows the list.

  • .describe()
  • .shape
  • .info()
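A short sketch of the three, assuming the train DataFrame loaded above:

# Summary statistics for the numeric columns (a method, needs parentheses)
print(train.describe())

# Dimensions as a (rows, columns) tuple (an attribute, no parentheses)
print(train.shape)

# Column names, dtypes, and non-null counts (a method; prints directly and returns None)
train.info()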

Rose vs Jack, or Female vs Male | Python

These work much like R's table() and prop.table() functions.

# absolute numbers
train["Survived"].value_counts()

# percentages
train["Survived"].value_counts(normalize = True)

You can also write train.Survived.value_counts(normalize = True) to get the proportions.

train["Survived"][train["Sex"] == 'male'].value_counts()
train["Survived"][train["Sex"] == 'female'].value_counts()

Still not as handy as pd.pivot_table(), though; a hedged pivot-table sketch is below.
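For comparison, a minimal sketch of the pivot-table version, assuming the train DataFrame from above:

# Survival rate by Sex in one call, similar to R's prop.table over a table
print(pd.pivot_table(train, values = "Survived", index = "Sex", aggfunc = "mean"))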


# Passengers that survived vs passengers that passed away
print(train.Survived.value_counts())

# As proportions
print(train["Survived"].value_counts(normalize = True))

# Males that survived vs males that passed away
print(train["Survived"][train["Sex"] == 'male'].value_counts())

# Females that survived vs Females that passed away
print(train["Survived"][train["Sex"] == 'female'].value_counts())

# Normalized male survival
print(train["Survived"][train["Sex"] == 'male'].value_counts(normalize = True))

# Normalized female survival
print(train["Survived"][train["Sex"] == 'female'].value_counts(normalize = True))
<script.py> output:
    0    549
    1    342
    Name: Survived, dtype: int64
    0    0.616162
    1    0.383838
    Name: Survived, dtype: float64
    0    468
    1    109
    Name: Survived, dtype: int64
    1    233
    0     81
    Name: Survived, dtype: int64
    0    0.811092
    1    0.188908
    Name: Survived, dtype: float64
    1    0.742038
    0    0.257962
    Name: Survived, dtype: float64

It looks like it makes sense to predict that all females will survive, and all men will die.

Does age play a role? | Python

# Create the column Child and initialize it with NaN
train["Child"] = float('NaN')

# Assign 1 to passengers under 18, 0 to those 18 or older. Print the new column.
train["Child"][train["Age"] < 18] = 1
train["Child"][train["Age"] >= 18] = 0
print(train["Child"])


# Print normalized Survival Rates for passengers under 18
print(train["Survived"][train["Child"] == 1].value_counts(normalize = True))

# Print normalized Survival Rates for passengers 18 or older
print(train["Survived"][train["Child"] == 0].value_counts(normalize = True))
<script.py> output:
    0      0.0
    1      0.0
    2      0.0
    3      0.0
    4      0.0
    5      NaN
          ... 
    886    0.0
    887    0.0
    888    NaN
    889    0.0
    890    0.0
    Name: Child, dtype: float64
    1    0.539823
    0    0.460177
    Name: Survived, dtype: float64
    0    0.618968
    1    0.381032
    Name: Survived, dtype: float64
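As an aside: chained indexing such as train["Child"][train["Age"] < 18] = 1 can raise a SettingWithCopyWarning in recent pandas versions. A minimal sketch of the same assignment using .loc, which avoids the warning:

# Same Child column, built with .loc instead of chained indexing
train["Child"] = float("NaN")
train.loc[train["Age"] < 18, "Child"] = 1
train.loc[train["Age"] >= 18, "Child"] = 0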

First Prediction | Python

# Create a copy of test: test_one
test_one = test.copy()

# Initialize a Survived column to 0
test_one["Survived"] = 0

# Set Survived to 1 if Sex equals "female" and print the `Survived` column from `test_one`
test_one["Survived"][test_one["Sex"] == "female"] = 1
print(test_one.Survived)

Note that test_one.Survived is another way to refer to the column. A hedged one-liner alternative is sketched below.
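A minimal sketch of the same rule as a one-liner, assuming the test DataFrame from above; the boolean mask can be cast to integers directly.

# 1 for female passengers, 0 otherwise, without chained indexing
test_one = test.copy()
test_one["Survived"] = (test_one["Sex"] == "female").astype(int)
print(test_one.Survived.value_counts())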

Intro to decision trees | Python

# Import the Numpy library
import numpy as np
# Import 'tree' from scikit-learn library
from sklearn import tree

Cleaning and Formatting your Data | Python

Now start filling in some of the missing values with .fillna().

# Convert the male and female groups to integer form
train["Sex"][train["Sex"] == "male"] = 0
train["Sex"][train["Sex"] == "female"] = 1

# Impute the Embarked variable
train["Embarked"] = train["Embarked"].fillna("S")

# Convert the Embarked classes to integer form
train["Embarked"][train["Embarked"] == "S"] = 0
train["Embarked"][train["Embarked"] == "C"] = 1
train["Embarked"][train["Embarked"] == "Q"] = 2

# Print the Sex and Embarked columns
print(train.Sex)
print(train.Embarked)

Here Embarked gets a bit of feature engineering: missing values are filled and the classes are recoded as integers. A hedged .map() alternative for the recoding is sketched below.
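A minimal sketch of the same recoding with .map(), starting from the raw string values; it avoids chained assignment, and the mappings follow the tutorial (male = 0, female = 1; S = 0, C = 1, Q = 2):

# Recode Sex and Embarked with explicit dictionaries (assumes the columns still hold strings)
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Embarked"] = train["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})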

Creating your first decision tree | Python

The methods that we will use take numpy arrays as inputs and therefore we will need to create those from the DataFrame that we already have.

In other words, when fitting a DecisionTreeClassifier the inputs need to be converted to NumPy arrays; the simplest way is .values.

In [1]: type(train["Survived"].values)
Out[1]: numpy.ndarray
target = train["Survived"].values

features = train[["Sex", "Age"]].values

my_tree = tree.DecisionTreeClassifier()

my_tree = my_tree.fit(features, target)

.feature_importances_ shows how much each feature contributes to the model, and .score(X, y) reports the mean accuracy on (X, y).

# Print the train data to see the available features
print(train)

# Create the target and features numpy arrays: target, features_one
target = train["Survived"].values
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values

# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)

# Look at the importance and score of the included features
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))

A training score of 0.977553310887 is not bad.

In [5]: train[["Pclass", "Sex", "Age", "Fare"]].columns
Out[5]: Index(['Pclass', 'Sex', 'Age', 'Fare'], dtype='object')

In [6]: my_tree_one.feature_importances_
Out[6]: array([ 0.1269655 ,  0.31274009,  0.23147703,  0.32881738])

This is where Python is less convenient than R: just getting a ranked view of the importances takes some extra juggling. A hedged sketch of that follows.
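A minimal sketch for a ranked view, assuming the fitted my_tree_one from above; wrapping the importances in a pd.Series keeps the feature names attached:

# Pair feature names with importances and sort, roughly what R gives for free
feature_names = ["Pclass", "Sex", "Age", "Fare"]
importances = pd.Series(my_tree_one.feature_importances_, index = feature_names)
print(importances.sort_values(ascending = False))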

Predict and submit to Kaggle | Python

.astype(int) casts the column to integers. In .to_csv("my_solution_one.csv", index_label = ["PassengerId"]), index_label names the index column, so the DataFrame index is written to the .csv as well.

# Impute the missing value with the median
test.Fare[152] = test.Fare.median()

# Extract the features from the test set: Pclass, Sex, Age, and Fare.
test_features = test[['Pclass', 'Sex', 'Age', 'Fare']].values

# Make your prediction using the test set
my_prediction = my_tree_one.predict(test_features)

# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
print(my_solution)

# Check that your data frame has 418 entries
print(my_solution.shape)

# Write your solution to a csv file with the name my_solution.csv
my_solution.to_csv("my_solution_one.csv", index_label = ["PassengerId"])
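One caveat: .predict() above assumes the test DataFrame has already been cleaned the same way as train (Sex recoded to integers, missing Age values filled), which the DataCamp exercise handles behind the scenes. A hedged sketch of that cleaning; the median fill for Age is an assumption of mine, not something shown in the exercise:

# Mirror the train-side encoding on the test set; the median Age fill is an assumption
test["Sex"] = test["Sex"].map({"male": 0, "female": 1})
test["Age"] = test["Age"].fillna(test["Age"].median())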

Overfitting and how to control it | Python

Tree depth is controlled by max_depth, the maximum depth of the tree, and min_samples_split, the minimum number of samples a node must contain before it can be split (a node with fewer samples will not be split further). random_state is like R's set.seed(): it fixes the random state.

# Create a new array with the added features: features_two
features_two = train[["Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"]].values

# Control overfitting by setting "max_depth" to 10 and "min_samples_split" to 5: my_tree_two
max_depth = 10
min_samples_split = 5
my_tree_two = tree.DecisionTreeClassifier(max_depth = max_depth, min_samples_split = min_samples_split, random_state = 1)
my_tree_two = my_tree_two.fit(features_two, target)

# Print the score of the new decision tree
print(my_tree_two.score(features_two, target)) # 0.905723905724

A bit of parameter tuning here: max_depth = 10 and min_samples_split = 5 are passed to tree.DecisionTreeClassifier.

Feature-engineering for our Titanic data set | Python

Building a family_size feature does not feel all that interesting.

# Create train_two with the newly defined feature
train_two = train.copy()
train_two["family_size"] = train_two.SibSp + train_two.Parch + 1

# Create a new feature set and add the new feature
features_three = train_two[["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", "family_size"]].values

# Define the tree classifier, then fit the model
my_tree_three = tree.DecisionTreeClassifier()
my_tree_three = my_tree_three.fit(features_three,target)

# Print the score of this decision tree
print(my_tree_three.score(features_three, target)) #0.979797979798

And yet the score went up. "SibSp", "Parch", and "family_size" are all in the model at the same time; because a tree is non-linear, that redundancy is fine, and the point is to capture the underlying signal.

Your submission scored 0.75598, which is not an improvement of your best score. Keep trying!

A Random Forest analysis in Python | Python

n_estimators needs to be set when using the RandomForestClassifier() class. This argument allows you to set the number of trees you wish to plant and average over.

n_estimators answers the question of how many trees to grow.


# Import the `RandomForestClassifier`
from sklearn.ensemble import RandomForestClassifier

# We want the Pclass, Age, Sex, Fare, SibSp, Parch, and Embarked variables
features_forest = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values

# Building and fitting my_forest
forest = RandomForestClassifier(max_depth = 10, min_samples_split=2, n_estimators = 100, random_state = 1)
my_forest = forest.fit(features_forest, target)

# Print the score of the fitted random forest
print(my_forest.score(features_forest, target))

# Compute predictions on our test set features then print the length of the prediction vector
test_features = test[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
pred_forest = my_forest.predict(test_features)
print(len(pred_forest))
<script.py> output:
    0.939393939394
    418

Remember, the .score() measure should be high but not extreme, because an extreme value would be a sign of overfitting.

Since .score() here is computed on the same training data the model was fit on, it overstates performance; a near-perfect training score is itself a hint of overfitting. A hedged cross-validation sketch follows.
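To get an estimate that is not computed on the training data, cross-validation is one option. A minimal sketch with scikit-learn's cross_val_score (not part of the tutorial), assuming the forest, features_forest, and target objects defined above:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy: a less optimistic number than the training score
cv_scores = cross_val_score(forest, features_forest, target, cv = 5)
print(cv_scores.mean())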

# Request and print the `.feature_importances_` attribute
print(my_tree_two.feature_importances_)
print(my_forest.feature_importances_)

# Compute and print the mean accuracy score for both models
print(my_tree_two.score(features_two, target))
print(my_forest.score(features_two, target))
<script.py> output:
    [ 0.14130255  0.17906027  0.41616727  0.17938711  0.05039699  0.01923751
      0.0144483 ]
    [ 0.10384741  0.20139027  0.31989322  0.24602858  0.05272693  0.04159232
      0.03452128]
    0.905723905724
    0.939393939394

"Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"

The most important feature was "Sex", but it carried more weight in "my_tree_two" than in the forest. A hedged side-by-side sketch follows.
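A minimal sketch to see that side by side, assuming the fitted my_tree_two and my_forest from above (both were trained on the feature order listed):

# Line up both models' importances under the shared feature names
feature_names = ["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]
comparison = pd.DataFrame({
    "tree": my_tree_two.feature_importances_,
    "forest": my_forest.feature_importances_,
}, index = feature_names)
print(comparison.sort_values("tree", ascending = False))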