Study notes: Kaggle Python Tutorial on Machine Learning

Study notes: series navigation

1 Hive/Impala study notes 2017-12-04
2 KS study notes 2017-12-06
3 datacamp pandas DataFrames study notes 2017-12-11
4 datacamp pandas Merging DataFrames study notes 2017-12-13
5 Databases in Python study notes 2017-12-14
6 Data Visualization with Python study notes 2017-12-16
7 Exploratory data analysis in Python study notes 2017-12-17
8 Statistical Thinking in Python (Part-2) study notes 2017-12-18
9 list comprehensions in Python study notes 2017-12-19
10 Unsupervised learning: Unsupervised Learning in Python study notes 2017-12-20
11 Study notes: Deep Learning in Python 2017-12-22
12 Study notes: running notes on learning Python 2017-12-25
13 Study notes: Network Analysis in Python Part 1 2017-12-27
14 Study notes: XGBoost using Python 2017-12-28
15 Study notes: Supervised Learning with scikit-learn 2017-12-30
16 Study notes: Boosting theory 2018-01-02
17 Study notes: Machine Learning with the Experts School Budgets 2018-01-02
18 Study notes: criminal psychology analysis 2018-01-02
19 Study notes: decision tree theory 2018-01-03
20 Study notes: Shell 2018-01-04
21 Study notes: customer value pricing 2018-01-04
22 Study notes: Introduction to Git for Data Science 2018-01-06
23 Study notes: linear algebra, organized notes 2018-01-08
24 Study notes: annealing algorithm 2018-01-09
25 Study notes: Fahrenheit 911 video notes 2018-01-18
26 Study notes: pandas debugging 2018-01-19
27 Study notes: brilliant.org introduction to probability 2018-01-22
28 Study notes: Machine Learning with Tree-Based Models in R 2018-01-22
29 Study notes: Building Web Applications in R with Shiny 2018-01-25
30 Study notes: Inference for Numerical Data 2018-01-26
31 Study notes: Support Vector Machines SVM 2018-01-26
32 Study notes: Introduction to DataCamp Projects 2018-01-28
33 Study notes: Working with Web Data in R 2018-01-28
34 Study notes: how to use the three kinds of averages 2018-01-29
35 Study notes: 戒律的复活 (updated every Saturday) 2018-01-29
36 Study notes: Communicating with Data in the Tidyverse 2018-01-31
37 Study notes: Kaggle R Tutorial on Machine Learning 2018-02-01
38 Tech: ggridges ridgeline plots study notes 2018-02-02
39 Tech: XGBoost study notes 2018-02-02
40 Study notes: 圆桌派 Season 3 video notes 2018-02-05
41 Study notes: basics and tips roundup 2018-02-25
42 Ensemble learning: R SuperLearner package study notes 2018-03-04
43 Study notes: English learning: vocabulary, expressions, and grammar roundup 2018-04-09
44 Tech: principles and applications study notes 2018-04-29
45 Study notes: 魏剑峰 English study notes: expressions and grammar roundup 2018-05-02
46 Tech: methods and practice study notes 2018-05-12
47 Unsupervised learning: principal component analysis (PCA) principles and implementation study notes 2018-05-17
48 Tech: feature selection study notes 2018-05-29
49 Study notes: Planet Money podcast study notes: economics topics explained 2018-06-05
50 Basic algorithms series: gradient descent explained: principles and optimization study notes 2018-07-11
51 Tech: statistical modeling study notes 2018-07-24
52 Tech: metric design study notes 2018-09-20
53 Tech: installation and basic usage study notes 2018-11-07
54 📈 ggplot design approach study notes 2019-12-26
55 🧩 Writing Python functions study notes 2019-12-31
56 Tech: feature engineering, target encoding study notes 2020-01-20
57 📚 Word vectors study notes 2020-07-04
58 Topic models: study notes 2020-07-04
59 Study notes: WSJ 2020-10-19
60 Fitness: study notes 2025-08-19

Get the Data with Pandas | Python

# Import the Pandas library
import pandas as pd
# Load the train and test datasets to create two DataFrames
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)

test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

#Print the `head` of the train and test dataframes
print(train.head())
print(test.head())

Understanding your data | Python

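For a pd.DataFrame, keep three members in mind. .describe() and .info() are methods, so don't forget the (); .shape is an attribute and takes no parentheses.

  • .describe()
  • .shape
  • .info()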
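
As a quick illustration of the three members above, here is a minimal sketch on a small made-up DataFrame. The name toy and its values are hypothetical and are not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical toy data, only to show the three calls
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

summary = toy.describe()   # method: summary statistics per numeric column
dims = toy.shape           # attribute: (rows, columns), no parentheses
toy.info()                 # method: prints dtypes and non-null counts

print(summary)
print(dims)
```

Note that the count row of .describe() skips missing values, which makes it a quick way to spot columns such as Age that need imputation.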

Rose vs Jack, or Female vs Male | Python

Similar to R's table and prop.table functions.

# absolute numbers
train["Survived"].value_counts()

# percentages
train["Survived"].value_counts(normalize = True)

Attribute access works too: train.Survived.value_counts(). Pass normalize = True to get proportions.

train["Survived"][train["Sex"] == 'male'].value_counts()
train["Survived"][train["Sex"] == 'female'].value_counts()

This is less convenient than pd.pivot_table().
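
For comparison, a minimal sketch of the pd.pivot_table() route. The DataFrame toy below is hypothetical stand-in data, and using the mean of the 0/1 Survived column as a survival rate is my own choice of aggregation, not something from the tutorial.

```python
import pandas as pd

# Hypothetical toy data standing in for the train DataFrame
toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male"],
    "Survived": [0, 1, 1, 1, 0],
})

# The mean of a 0/1 column per group is that group's survival rate
rates = pd.pivot_table(toy, values="Survived", index="Sex", aggfunc="mean")
print(rates)
```

One call gives the rate for every group at once, instead of one boolean filter per group.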


# Passengers that survived vs passengers that passed away
print(train.Survived.value_counts())

# As proportions
print(train["Survived"].value_counts(normalize = True))

# Males that survived vs males that passed away
print(train["Survived"][train["Sex"] == 'male'].value_counts())

# Females that survived vs Females that passed away
print(train["Survived"][train["Sex"] == 'female'].value_counts())

# Normalized male survival
print(train["Survived"][train["Sex"] == 'male'].value_counts(normalize = True))

# Normalized female survival
print(train["Survived"][train["Sex"] == 'female'].value_counts(normalize = True))
<script.py> output:
    0    549
    1    342
    Name: Survived, dtype: int64
    0    0.616162
    1    0.383838
    Name: Survived, dtype: float64
    0    468
    1    109
    Name: Survived, dtype: int64
    1    233
    0     81
    Name: Survived, dtype: int64
    0    0.811092
    1    0.188908
    Name: Survived, dtype: float64
    1    0.742038
    0    0.257962
    Name: Survived, dtype: float64

It looks like it makes sense to predict that all females will survive, and all men will die.

Does age play a role? | Python

# Create the column Child and assign to 'NaN'
train["Child"] = float('NaN')

# Assign 1 to passengers under 18, 0 to those 18 or older. Print the new column.
# Use .loc rather than chained indexing, which may assign to a temporary copy
train.loc[train["Age"] < 18, "Child"] = 1
train.loc[train["Age"] >= 18, "Child"] = 0
print(train["Child"])


# Print normalized Survival Rates for passengers under 18
print(train["Survived"][train["Child"] == 1].value_counts(normalize = True))

# Print normalized Survival Rates for passengers 18 or older
print(train["Survived"][train["Child"] == 0].value_counts(normalize = True))
0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
5      NaN
6      0.0
7      1.0
8      0.0
9      1.0
10     1.0
11     0.0
12     0.0
13     0.0
14     1.0
15     0.0
16     1.0
17     NaN
18     0.0
19     NaN
20     0.0
21     0.0
22     1.0
23     0.0
24     1.0
25     0.0
26     NaN
27     0.0
28     NaN
29     NaN
      ... 
861    0.0
862    0.0
863    NaN
864    0.0
865    0.0
866    0.0
867    0.0
868    NaN
869    1.0
870    0.0
871    0.0
872    0.0
873    0.0
874    0.0
875    1.0
876    0.0
877    0.0
878    NaN
879    0.0
880    0.0
881    0.0
882    0.0
883    0.0
884    0.0
885    0.0
886    0.0
887    0.0
888    NaN
889    0.0
890    0.0
Name: Child, dtype: float64
1    0.539823
0    0.460177
Name: Survived, dtype: float64
0    0.618968
1    0.381032
Name: Survived, dtype: float64
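
The Child column above can also be built in one vectorized step. This is only a sketch of an alternative, using a hypothetical ages Series; it keeps NaN wherever the age itself is missing, as in the printed column.

```python
import pandas as pd

ages = pd.Series([10.0, 30.0, None, 18.0])  # hypothetical ages; None becomes NaN

# True/False -> 1.0/0.0, then blank out the rows where the age is missing
child = (ages < 18).astype(float).where(ages.notna())
print(child.tolist())
```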

First Prediction | Python

# Create a copy of test: test_one
test_one = test.copy()

# Initialize a Survived column to 0
test_one["Survived"] = 0

# Set Survived to 1 if Sex equals "female" and print the `Survived` column from `test_one`
# (.loc avoids chained indexing, which may assign to a temporary copy)
test_one.loc[test_one["Sex"] == "female", "Survived"] = 1
print(test_one.Survived)

Note that a column can also be accessed as an attribute, as in test_one.Survived.

Intro to decision trees | Python

# Import the Numpy library
import numpy as np
# Import 'tree' from scikit-learn library
from sklearn import tree

Cleaning and Formatting your Data | Python

Start filling in some of the missing values, using .fillna().

# Convert the male and female groups to integer form
# (.map replaces the chained-indexing assignments, which may write to a copy)
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})

# Impute the Embarked variable
train["Embarked"] = train["Embarked"].fillna("S")

# Convert the Embarked classes to integer form
train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2})

#Print the Sex and Embarked columns
print(train.Sex)
print(train.Embarked)

Here Embarked is encoded as integers, a simple form of feature engineering.
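
The fill value "S" above is hard-coded. One way to derive such a value, assuming the intent is to use the most frequent category, is the mode. The sketch below runs on a hypothetical embarked Series, not the real column.

```python
import pandas as pd

embarked = pd.Series(["S", "C", None, "S", "Q"])  # hypothetical sample

most_common = embarked.mode()[0]        # most frequent non-missing value
filled = embarked.fillna(most_common)   # missing entries take that value
print(most_common, filled.isna().sum())
```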

Creating your first decision tree | Python

The methods that we will use take numpy arrays as inputs and therefore we will need to create those from the DataFrame that we already have.

In other words, DecisionTreeClassifier expects its inputs as numpy arrays, and the simplest way to get one from a DataFrame is .values.

In [1]: type(train["Survived"].values)
Out[1]: numpy.ndarray
target = train["Survived"].values

features = train[["Sex", "Age"]].values

my_tree = tree.DecisionTreeClassifier()

my_tree = my_tree.fit(features, target)

.feature_importances_ gives the importance of each feature; .score(X, y) reports the mean accuracy.

# Print the train data to see the available features
print(train)

# Create the target and features numpy arrays: target, features_one
target = train["Survived"].values
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values

# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)

# Look at the importance and score of the included features
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))

A score of 0.977553310887 is not bad, though it is measured on the training data itself.

In [5]: train[["Pclass", "Sex", "Age", "Fare"]].columns
Out[5]: Index(['Pclass', 'Sex', 'Age', 'Fare'], dtype='object')

In [6]: my_tree_one.feature_importances_
Out[6]: array([ 0.1269655 ,  0.31274009,  0.23147703,  0.32881738])

This is less convenient than in R: just to see a ranking, you have to match the importances to the column names yourself.
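
One way to get a readable ranking is to wrap the importances in a pandas Series indexed by feature name. This is a minimal sketch; the numbers are copied from Out[6] above rather than recomputed, and the variable names are my own.

```python
import pandas as pd

feature_names = ["Pclass", "Sex", "Age", "Fare"]
importances = [0.1269655, 0.31274009, 0.23147703, 0.32881738]  # copied from Out[6] above

# Pair each importance with its column name, then sort from most to least important
ranking = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranking)
```

With a fitted model, the same pattern should work by passing my_tree_one.feature_importances_ in place of the hard-coded list.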

Predict and submit to Kaggle | Python

.astype(int) changes the data type. In .to_csv("my_solution_one.csv", index_label = ["PassengerId"]), index_label names the index column, so the index is written to the .csv as well.

# Impute the missing value with the median
# (.loc avoids chained indexing, which may assign to a temporary copy)
test.loc[152, "Fare"] = test["Fare"].median()

# Extract the features from the test set: Pclass, Sex, Age, and Fare.
test_features = test[['Pclass', 'Sex', 'Age', 'Fare']].values

# Make your prediction using the test set
my_prediction = my_tree_one.predict(test_features)

# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
print(my_solution)

# Check that your data frame has 418 entries
print(my_solution.shape)

# Write your solution to a csv file with the name my_solution_one.csv
my_solution.to_csv("my_solution_one.csv", index_label = ["PassengerId"])
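
To see what index_label does to the file, here is a small sketch that writes to an in-memory buffer instead of disk. The ids and predictions are hypothetical placeholders.

```python
import io

import numpy as np
import pandas as pd

passenger_ids = np.array([892, 893, 894])   # hypothetical ids
predictions = np.array([0, 1, 0])           # hypothetical predictions

solution = pd.DataFrame(predictions, index=passenger_ids, columns=["Survived"])

buffer = io.StringIO()                      # in-memory stand-in for a file
solution.to_csv(buffer, index_label="PassengerId")
csv_text = buffer.getvalue()
print(csv_text)
```

The header row carries the index_label next to the column name, which yields the two-column layout of the submission file.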

Overfitting and how to control it | Python

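Two parameters limit how far the tree grows: max_depth is the maximum depth, and min_samples_split is the minimum number of samples a node must hold to be split (a node with fewer samples is not split). random_state works like R's set.seed(): it fixes the random state.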

# Create a new array with the added features: features_two
features_two = train[["Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"]].values

# Control overfitting by setting "max_depth" to 10 and "min_samples_split" to 5: my_tree_two
max_depth = 10
min_samples_split = 5
my_tree_two = tree.DecisionTreeClassifier(max_depth = max_depth, min_samples_split = min_samples_split, random_state = 1)
my_tree_two = my_tree_two.fit(features_two, target)

# Print the score of the new decision tree
print(my_tree_two.score(features_two, target)) # 0.905723905724

Here the parameters were tuned by adding max_depth = 10 and min_samples_split = 5 to tree.DecisionTreeClassifier.
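
To make the effect of these two parameters concrete, here is a rough sketch on made-up data with pure-noise labels, so any fit is memorization. The data, the limit of 3 and the variable names are all assumptions for illustration.

```python
import numpy as np
from sklearn import tree

rng = np.random.RandomState(0)
X = rng.rand(200, 4)                 # made-up features
y = rng.randint(0, 2, size=200)      # made-up labels, pure noise

# No limits: the tree keeps splitting until every training row is classified
deep = tree.DecisionTreeClassifier(random_state=1).fit(X, y)

# With limits: growth stops early, so the noise cannot be fully memorized
shallow = tree.DecisionTreeClassifier(max_depth=3, min_samples_split=5,
                                      random_state=1).fit(X, y)

print(deep.get_depth(), deep.score(X, y))
print(shallow.get_depth(), shallow.score(X, y))
```

The drop in training score for the restricted tree is the same pattern as the move from 0.9776 to 0.9057 in the notes.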

Feature-engineering for our Titanic data set | Python

Building a family_size feature does not seem very useful.

# Create train_two with the newly defined feature
train_two = train.copy()
train_two["family_size"] = train_two.SibSp + train_two.Parch + 1

# Create a new feature set and add the new feature
features_three = train_two[["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", "family_size"]].values

# Define the tree classifier, then fit the model
my_tree_three = tree.DecisionTreeClassifier()
my_tree_three = my_tree_three.fit(features_three,target)

# Print the score of this decision tree
print(my_tree_three.score(features_three, target)) #0.979797979798

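Surprisingly, the score even went up. "SibSp", "Parch", and "family_size" are all in the model together; because trees are nonlinear, that is fine, and the model can still pick up the signal.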

Your submission scored 0.75598, which is not an improvement of your best score. Keep trying!

A Random Forest analysis in Python | Python

n_estimators needs to be set when using the RandomForestClassifier() class. This argument allows you to set the number of trees you wish to plant and average over.

n_estimators answers the question of how many trees to plant.


# Import the `RandomForestClassifier`
from sklearn.ensemble import RandomForestClassifier

# We want the Pclass, Age, Sex, Fare,SibSp, Parch, and Embarked variables
features_forest = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values

# Building and fitting my_forest
forest = RandomForestClassifier(max_depth = 10, min_samples_split=2, n_estimators = 100, random_state = 1)
my_forest = forest.fit(features_forest, target)

# Print the score of the fitted random forest
print(my_forest.score(features_forest, target))

# Compute predictions on our test set features then print the length of the prediction vector
test_features = test[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
pred_forest = my_forest.predict(test_features)
print(len(pred_forest))
<script.py> output:
    0.939393939394
    418
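
A small sketch to confirm what n_estimators controls, on made-up data. The value 25 and the data are arbitrary choices for illustration; the fitted forest stores its individual trees in estimators_.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                  # made-up features
y = (X[:, 0] > 0.5).astype(int)       # made-up labels driven by the first feature

forest = RandomForestClassifier(n_estimators=25, random_state=1).fit(X, y)

# One fitted decision tree per requested estimator
n_trees = len(forest.estimators_)
print(n_trees)
```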

Remember, .score() measure should be high but not extreme because that would be a sign of overfitting.

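A .score() computed on the training data is itself a reflection of overfitting: it overstates how well the model will do on new data.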
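
A hedged sketch of why the training score alone can mislead: it holds out part of a made-up data set whose labels are pure noise, then compares training and held-out accuracy. The data and the 70/30 split are assumptions; the tutorial itself only scores on the training set.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(300, 4)                 # made-up features
y = rng.randint(0, 2, size=300)      # labels are pure noise: nothing real to learn

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

model = tree.DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

train_score = model.score(X_train, y_train)  # accuracy on rows the tree has seen
test_score = model.score(X_test, y_test)     # accuracy on held-out rows
print(train_score, test_score)
```

An unrestricted tree memorizes its training rows, so the gap between the two numbers, not the training score alone, is what signals overfitting.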

#Request and print the `.feature_importances_` attribute
print(my_tree_two.feature_importances_)
print(my_forest.feature_importances_)

#Compute and print the mean accuracy score for both models
print(my_tree_two.score(features_two, target))
print(my_forest.score(features_two, target))
<script.py> output:
    [ 0.14130255  0.17906027  0.41616727  0.17938711  0.05039699  0.01923751
      0.0144483 ]
    [ 0.10384741  0.20139027  0.31989322  0.24602858  0.05272693  0.04159232
      0.03452128]
    0.905723905724
    0.939393939394

"Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"

With the features in the order listed above, the most important feature was “Sex” (the third importance value), but it was more significant for “my_tree_two”.
