
Kaggle R Tutorial on Machine Learning: Study Notes

Getting to about 0.8 on this dataset is already fairly good; pushing much higher is not worth the effort, so Titanic is wrapped up here for now.

Set Sail | R

library(tidyverse)
library(knitr)
# Import the training set: train
train_url <- "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train <- read_csv(train_url)
  
# Import the testing set: test
test_url <- "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test <- read_csv(test_url)
  
# Print train and test to the console
# train
# test

Understanding your data | R

library(skimr)

skim(train)
skim(test)

Rose vs Jack, or Female vs Male | R

What is the insight here? A between-group comparison on a single variable.

# Your train and test set are still loaded
# str(train)
# str(test)

# Survival rates in absolute numbers
table(train$Survived)

# As proportions
prop.table(table(train$Survived))
  
# Two-way comparison: Sex and Survived
table(train$Sex, train$Survived)

# Two-way comparison: row-wise proportions
prop.table(table(train$Sex, train$Survived), 1)
  • prop.table with margin = 1 \(\to\) normalizes within each row, so each row sums to 1
  • prop.table with margin = 2 \(\to\) normalizes within each column, so each column sums to 1 (a quick side-by-side comparison follows below)
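
The two margins can be compared directly on the same table; this small example just reuses the train data from above.

# Within each Sex, the share that died vs. survived (rows sum to 1)
prop.table(table(train$Sex, train$Survived), margin = 1)

# Within each outcome, the share of each Sex (columns sum to 1)
prop.table(table(train$Sex, train$Survived), margin = 2)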

Here Survived = 1 means survived and Survived = 0 means died. Women survived at a far higher rate than men, so Sex is clearly an important variable. That is the whole insight.

Does age play a role? | R

This feels like plain exploratory analysis, one variable at a time.

# Your train and test set are still loaded in
# str(train)
# str(test)

# Create the column child, and indicate whether child or no child
# train$Child <- NA
# train$Child[train$Age < 18] <- 1
# train$Child[train$Age  >= 18] <- 0

train <- 
train %>% 
  mutate(
    Child = case_when(
      Age < 18 ~ 1,
      Age >= 18 ~ 0
    )
  )

# Two-way comparison
prop.table(table(train$Child,train$Survived),margin = 1)

Note that the commented base-R assignments above cannot be used as-is, because Age contains missing values: a logical index with NA in it breaks the subscripted assignment, whereas case_when simply leaves those rows as NA.

summary(train$Age)

Making your first predictions | R

# Your train and test set are still loaded in
# str(train)
# str(test)

# Copy of test
test_one <- test

# Initialize a Survived column to 0
# test_one$Survived[test_one$Sex == "male"] <- 0

# Set Survived to 1 if Sex equals "female"
# test_one$Survived[test_one$Sex == "female"] <- 1

test_one <- 
  test_one %>% 
  mutate(
    Survived = case_when(
      Sex == "male" ~ 0,
      Sex == "female" ~ 1
    )
  )

At this point the predictions can be written out and submitted for a score (a minimal sketch of the write-out follows the commented lines below).

# read_csv("~/Downloads/gender_submission.csv")
# read_csv("~/Downloads/my_solution.csv")

Your submission scored 0.76555, which is not an improvement of your best score. Keep trying!

Whoa, Sex alone matters this much!!!

And how do those freaks at the top of the leaderboard predict with 100% accuracy?

Creating your first decision tree | R

# Your train and test set are still loaded in
# str(train)
# str(test)

# Build the decision tree
library(rpart)
my_tree_two <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data = train, method = "class")

# Visualize the decision tree using plot() and text()
# plot(my_tree_two)
# text(my_tree_two)

# Load in the packages to build a fancy plot
# library(rattle) # 用fancyRpartPlot
library(rpart.plot)
# library(RColorBrewer)
# Time to plot your fancy tree
# fancyRpartPlot(my_tree_two)
rpart.plot(x = my_tree_two)

Installing rattle here requires RGtk2, and only an old version is available: install.packages("RGtk2", repos="http://www.ggobi.org/r"). It still would not work; too many bugs, fancy but useless, so rpart.plot is used instead.

Predict and submit to Kaggle | R

# my_tree_two and test are available in the workspace

# Make predictions on the test set
my_prediction <- predict(my_tree_two, newdata = test, type = "class")

# Finish the data.frame() call
my_solution <- data.frame(PassengerId = test$PassengerId, 
                          Survived = my_prediction)

# Use nrow() on my_solution
nrow(my_solution)

# Finish the write.csv() call
write_csv(my_solution, "my_solution_dt.csv")

You advanced 3,239 places on the leaderboard! Your submission scored 0.78468, which is an improvement of your previous score of 0.76555. Great job!

Overfitting, the iceberg of decision trees | R

# Your train and test set are still loaded in

# Change this command
# super_model <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
#                      data = train, method = "class", control = rpart.control(minsplit = 2, cp = 0))

my_tree_three <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
                     data = train, method = "class", control = rpart.control(minsplit = 50, cp = 0))

# Visualize my_tree_three
rpart.plot(my_tree_three)

Terrifying!

Passenger Title and survival rate | R

# train_new and test_new are available in the workspace

# Finish the command
my_tree_five <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title,
                      data = train_new, method = "class")

# Visualize my_tree_five
rpart.plot(my_tree_five) # fancyRpartPlot(my_tree_five) would need rattle, which did not install

# Make prediction
my_prediction <- predict(my_tree_five, newdata = test_new, type = "class")

# Make results ready for submission
my_solution <- data.frame(PassengerId = test_new$PassengerId, Survived = my_prediction)
write.csv(my_solution, file = "my_solution.csv", row.names = FALSE)

A new variable, Title, is added here that is not in the original data (train_new and test_new come pre-built from DataCamp).
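
Since train_new and test_new arrive with Title already attached, here is a minimal sketch of how such a column could be derived from Name, assuming the usual "Surname, Title. Given names" format of the Titanic data (DataCamp's version likely also regroups rare titles, which this sketch skips).

# Title is the word between the comma and the first period, e.g. "Braund, Mr. Owen Harris" -> "Mr"
train$Title <- sapply(train$Name, FUN = function(x) {
  trimws(strsplit(x, split = "[,.]")[[1]][2])
})
table(train$Title)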

You advanced 2,894 places on the leaderboard! Your submission scored 0.80382, which is an improvement of your previous score of 0.78468. Great job!

What is a Random Forest | R

For example, if you have trained 3 trees with 2 saying a passenger in the test set will survive and 1 says he will not, the passenger will be classified as a survivor. This approach of overtraining trees, but having the majority’s vote count as the actual classification decision, avoids overfitting.

Majority voting avoids overfitting because the noise each tree fits is random, so it does not systematically tip the majority vote.
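
A toy illustration of the vote described above (my own, not course code): three trees each cast a 0/1 prediction and the majority wins.

# Each row is a passenger, each column one tree's 0/1 prediction
votes <- matrix(c(1, 1, 0,   # passenger 1: two of three trees say "survived"
                  0, 1, 0),  # passenger 2: only one tree says "survived"
                nrow = 2, byrow = TRUE)
majority_vote <- ifelse(rowMeans(votes) > 0.5, 1, 0)
majority_vote  # passenger 1 is classified as a survivor, passenger 2 is not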

Before starting with the actual analysis, you first need to meet one big condition of Random Forests: no missing values in your data frame. Let’s get to work.

Random forests cannot have missing values. Three ways of handling them are used below: a sensible constant for Embarked, the median for Fare, and a model-based prediction for Age; the point is to impute the gaps rather than drop them.

# All data, both training and test set
download.file("http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/all_data.RData","all_data.RData")
load("all_data.RData")

# Passenger on row 62 and 830 do not have a value for embarkment.
# Since many passengers embarked at Southampton, we give them the value S.
all_data$Embarked[c(62, 830)] <- "S"

# Factorize embarkment codes.
all_data$Embarked <- factor(all_data$Embarked)

# Passenger on row 1044 has an NA Fare value. Let's replace it with the median fare value.
all_data$Fare[1044] <- median(all_data$Fare, na.rm = TRUE)

# How to fill in missing Age values?
# We make a prediction of a passengers Age using the other variables and a decision tree model.
# This time you give method = "anova" since you are predicting a continuous variable.
library(rpart)
predicted_age <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + family_size,
                       data = all_data[!is.na(all_data$Age),], method = "anova")
all_data$Age[is.na(all_data$Age)] <- predict(predicted_age, all_data[is.na(all_data$Age),])

# Split the data back into a train set and a test set
train <- all_data[1:891,]
test <- all_data[892:1309,]

One more important element in Random Forest is randomization to avoid the creation of the same tree from the training set. You randomize in two ways: by taking a randomized sample of the rows in your training set and by only working with a limited and changing number of the available variables for every node of the tree.
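
As far as I understand the randomForest package, those two sources of randomness map onto the sampsize/replace arguments (row sampling) and mtry (number of candidate variables per split). The defaults are used in the next section, but they can be set explicitly; the values here are purely illustrative.

library(randomForest)
set.seed(111)
# Bootstrap the rows and try 3 randomly chosen variables at each split
forest_sketch <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title,
                              data = train, ntree = 500,
                              replace = TRUE, sampsize = nrow(train), mtry = 3)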

A Random Forest analysis in R | R

# train and test are available in the workspace
# str(train)
# str(test)

# Load in the package
library(randomForest)

# Train set and test set
# str(train)
# str(test)

# Set seed for reproducibility
set.seed(111)

# Apply the Random Forest Algorithm
my_forest <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title,
                          data = train, 
                          importance = TRUE, 
                          ntree = 1000)

# Make your prediction using the test set
my_prediction <- predict(my_forest, newdata = test)

# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
my_solution <- data.frame(PassengerId = test$PassengerId, Survived = my_prediction)

# Write your solution away to a csv file with the name my_solution.csv
write.csv(my_solution, file = "my_solution.csv", row.names = FALSE)

Your submission scored 0.79425, which is not an improvement of your best score. Keep trying!

Important variables | R

varImpPlot(my_forest)

So Title really matters. How would this run without it? Leaning on that extra engineered information feels a bit like cheating.
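
For the numbers behind the plot, the importance() accessor can be used; a small sketch (the MeanDecreaseAccuracy column is there because importance = TRUE was set above).

importance(my_forest) %>%
  as.data.frame() %>%
  rownames_to_column("variable") %>%
  arrange(desc(MeanDecreaseAccuracy))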

Try GBM

set.seed(1)
library(gbm)
# distribution = "bernoulli" expects a numeric 0/1 response, so Survived is passed as-is rather than as a factor
my_gbm <- gbm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title, 
              distribution = "bernoulli", 
              data = train %>% mutate_if(is.character, as.factor),
              n.trees = 10000)

# Make your prediction using the test set: gbm returns probabilities, so threshold at 0.5
my_prediction <- predict(my_gbm,
                         newdata = test %>% mutate_if(is.character, as.factor),
                         n.trees = 10000, type = "response")
my_prediction <- ifelse(my_prediction > 0.5, 1, 0)

# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
my_solution <- data.frame(PassengerId = test$PassengerId, Survived = my_prediction)

# Write your solution away to a csv file with the name my_solution.csv
write.csv(my_solution, file = "my_solution.csv", row.names = FALSE)

Your submission scored 0.79904, which is not an improvement of your best score. Keep trying!

The score will not climb any higher this way; random forest is the best so far = =. Next, xgboost when there is time.

try XGBoost

Keep chasing the leaderboard, this time with xgboost, tuning the parameters following that article.

One-Hot-Encoding

all_data1 <- 
all_data %>% 
  select(-PassengerId,-Name,-Ticket,-Cabin)
skim(all_data1)

So the variables Embarked, Sex, and Title need one-hot encoding. The caret::dummyVars function is used here.

library(caret)
dummies_model <- dummyVars(~ Embarked +  Sex + Title, data = all_data1)

dummies_model is itself a model-like object, written down with a formula just like a model. The one-hot columns are then the "result", obtained by calling predict on it.

df_all_ohe <- as_tibble(predict(dummies_model, newdata = all_data1))
head(df_all_ohe)

Note that the full set of dummies is collinear (one column per level). Reading help(dummyVars) shows that fullRank = TRUE drops one level per factor to avoid this, so the option is added.

dummies_model <- dummyVars(~ Embarked +  Sex + Title, data = all_data1, fullRank = TRUE)
df_all_ohe <- as_tibble(predict(dummies_model, newdata = all_data1))
head(df_all_ohe)

Seeing only a single Sex.male column (female dropped) confirms the collinear column has been removed.

Finally, the original variables are replaced with their one-hot versions. The pattern below is one way to do it and is worth remembering.

ohe_feats <- c('Embarked', 'Sex', 'Title')
df_all_combined <- bind_cols(
  all_data1[, -which(colnames(all_data1) %in% ohe_feats)],
  df_all_ohe
)
df_all_combined %>% head()

Done. Now split into train and test.

train_and_validation <- df_all_combined[1:891,]
test <- df_all_combined[892:1309,] # 只用于predict
set.seed(123)
train_index <- sample(1:nrow(train_and_validation),round(2/3*nrow(train_and_validation)),replace = F)
train <- train_and_validation[train_index,]
validation <- train_and_validation[-train_index,]
names(train)
names(validation)

train[-1] drops the first column, so it yields all of \(X\) without \(y\). Note that \(y\) must be train$Survived rather than train[1], because the label has to be a plain vector, not a one-column table.

library(xgboost)
dtrain <- xgb.DMatrix(data = as.matrix(train[-1]), label=train$Survived)
dvalidation <- xgb.DMatrix(data = as.matrix(validation[-1]), label=validation$Survived)
watchlist <- list(train=dtrain, validation=dvalidation)

Error in check.custom.obj() : Setting objectives in 'params' and 'obj' at the same time is not allowed. Also, building a watchlist on dtest makes no sense: its Survived values are all missing; this is a competition, of course they do not hand you the answers. And watchlist is an argument of xgb.train, not of xgboost(); do not mix the two up.

library(xgboost)
set.seed(1)  # the 'seed' parameter is ignored by the R package, so set the seed here instead
xgb <- xgb.train(
 data = dtrain,
 eta = 0.1,
 max_depth = 15,
 nrounds = 16,
 subsample = 0.5,
 colsample_bytree = 0.5,
 watchlist = watchlist,
 eval.metric = "auc",
 # eval.metric = "error",
 # eval.metric = "logloss",
 objective = "binary:logistic",
 # num_class = 12, # set the number of classes. To use only with multiclass objectives.
 # nfold is an xgb.cv argument, not an xgb.train one, so it is dropped here
 nthread = 3
)

Note that nfold belongs to xgb.cv, not xgb.train, so adding it to the call above does nothing; getting actual cross-validation takes a separate xgb.cv call, sketched below.
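
A minimal sketch of the cross-validated version, reusing dtrain from above; the parameter values are the same illustrative ones, and the test_auc_mean column name assumes the auc metric.

cv <- xgb.cv(
  data = dtrain,
  nrounds = 50,
  nfold = 5,
  eta = 0.1,
  max_depth = 15,
  subsample = 0.5,
  colsample_bytree = 0.5,
  eval_metric = "auc",
  objective = "binary:logistic",
  verbose = FALSE
)
# Round with the best cross-validated AUC
cv$evaluation_log %>% filter(test_auc_mean == max(test_auc_mean))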

xgb$evaluation_log %>% 
  as_tibble() %>% 
  # filter(validation_auc == max(validation_auc))
  gather(key,value,train_auc:validation_auc) %>% 
  ggplot(aes(x = iter, y = value, col = key)) +
    geom_line()

The best number of iterations found is 16.

# xgb and test are available in the workspace

# Make predictions on the test set: binary:logistic returns probabilities, so threshold at 0.5
my_prediction <- predict(xgb, newdata = as.matrix(test[-1]))
my_prediction <- ifelse(my_prediction > 0.5, 1, 0)
# Finish the data.frame() call
my_solution <- data.frame(PassengerId = (all_data %>% filter(is.na(Survived)))$PassengerId, 
                          Survived = my_prediction)

# Use nrow() on my_solution
nrow(my_solution)

# Finish the write.csv() call
write_csv(my_solution, "my_solution_dt.csv")

Your submission scored 0.78468, which is not an improvement of your best score. Keep trying!

XGBoost still does not beat it! From here on it really does come down to experience.

Here is a summary of the main tuning parameters; a small sketch collecting some of them into a params list for xgb.train follows the lists below.

Booster Parameters

The tree specific parameters

  • eta : The default value is set to 0.3. You need to specify the step size shrinkage used in each update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative. The range is 0 to 1. A low eta value makes the model more robust to overfitting. \(\eta\) is there to control overfitting; in theory, when \(\eta\) is set very small, the number of rounds nrounds should be set larger, since the two trade off against each other.
  • gamma : The default value is set to 0. You need to specify the minimum loss reduction required to make a further partition on a leaf node of the tree. The larger it is, the more conservative the algorithm will be. The range is 0 to \(\infty\). A regularization parameter: the larger it is, the stronger the penalty on overfitting.
  • max_depth : The default value is set to 6. You need to specify the maximum depth of a tree. The range is 1 to \(\infty\). Tree depth; used constantly, tuned in every run.
  • min_child_weight : The default value is set to 1. You need to specify the minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, the building process gives up further partitioning. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger it is, the more conservative the algorithm will be. The range is 0 to \(\infty\). A regularization parameter that limits how small a node may be before splitting stops, a familiar idea from plain decision trees.
  • max_delta_step : The default value is set to 0. This is the maximum delta step we allow each tree's weight estimation to be. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced. Setting it to a value of 1-10 might help control the update. The range is 0 to \(\infty\). Not entirely clear on this one, but it seems to help with imbalanced samples, so it is included.
  • subsample : The default value is set to 1. You need to specify the subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collected half of the data instances to grow trees and this will prevent overfitting. The range is 0 to 1.
  • colsample_bytree : The default value is set to 1. You need to specify the subsample ratio of columns when constructing each tree. The range is 0 to 1.

Linear Booster Specific Parameters

  • lambda and alpha : These are regularization term on weights. Lambda default value assumed is 1 and alpha is 0.
  • lambda_bias : L2 regularization term on bias and has a default value of 0.
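
A small sketch collecting a few of these booster parameters into a params list for xgb.train, reusing dtrain and watchlist from above; the values are purely illustrative, not tuned.

params <- list(
  objective = "binary:logistic",
  eval_metric = "auc",
  eta = 0.05,             # smaller eta, so allow more rounds
  max_depth = 6,
  min_child_weight = 1,
  gamma = 1,
  subsample = 0.8,
  colsample_bytree = 0.8
)
xgb_sketch <- xgb.train(params = params, data = dtrain, nrounds = 200, watchlist = watchlist)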

Derived features + XGBoost

Mine the Name variable to add derived features: members of one family were likely to survive or perish together.

From the Name variable, Title was derived above. For the reasons below, a passenger's surname may also carry some predictive value:

  • In some Western countries given names repeat a lot while surnames repeat far less, so a surname has some identifying power
  • In some countries a surname also carries information about social standing
  • Passengers sharing a surname are likely to be one family (again because surnames rarely repeat), and a family tended to survive or perish together

A surname that appears only once cannot occur in both the training and test sets, so it has no identifying or predictive value; all such surnames are therefore relabelled 'Small'.

# Surname is the part of Name before the comma
all_data$Surname <- sapply(all_data$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][1]})
# FamilyID pastes the family size (stored as family_size in the course data) onto the surname
all_data$FamilyID <- paste(as.character(all_data$family_size), all_data$Surname, sep="")
all_data$FamilyID[all_data$family_size <= 2] <- 'Small'
# Delete erroneous family IDs
famIDs <- data.frame(table(all_data$FamilyID))
famIDs <- famIDs[famIDs$Freq <= 2,]
all_data$FamilyID[all_data$FamilyID %in% famIDs$Var1] <- 'Small'
# Convert to a factor
all_data$FamilyID <- factor(all_data$FamilyID)
all_data2<- 
all_data %>% 
  select(-PassengerId,-Name,-Ticket,-Cabin)
skim(all_data2)
library(caret)
dummies_model <- dummyVars(~ Embarked +  Sex + Title + Surname + FamilyID, data = all_data2, fullRank = TRUE)
df_all_ohe <- as_tibble(predict(dummies_model, newdata = all_data2))
head(df_all_ohe)
ohe_feats <- c('Embarked', 'Sex', 'Title','Surname','FamilyID')
df_all_combined <- 
bind_cols(
all_data2[,-c(which(colnames(all_data2) %in% ohe_feats))]
,
df_all_ohe
)
df_all_combined %>% head()

Done. Split into train and test again.

train_and_validation <- df_all_combined[1:891,]
test <- df_all_combined[892:1309,] # 只用于predict
set.seed(123)
train_index <- sample(1:nrow(train_and_validation),round(2/3*nrow(train_and_validation)),replace = F)
train <- train_and_validation[train_index,]
validation <- train_and_validation[-train_index,]
# names(train)
# names(validation)
library(xgboost)
dtrain <- xgb.DMatrix(data = as.matrix(train[-1]), label=train$Survived)
dvalidation <- xgb.DMatrix(data = as.matrix(validation[-1]), label=validation$Survived)
watchlist <- list(train=dtrain, validation=dvalidation)
library(xgboost)
set.seed(1)  # the 'seed' parameter is ignored by the R package, so set the seed here instead
xgb <- xgb.train(
 data = dtrain,
 eta = 0.1,
 max_depth = 10,
 nrounds = 10000,
 subsample = 0.5,
 colsample_bytree = 0.5,
 watchlist = watchlist,
 eval.metric = "auc",
 # eval.metric = "error",
 # eval.metric = "logloss",
 objective = "binary:logistic",
 # num_class = 12, # set the number of classes. To use only with multiclass objectives.
 # nfold is an xgb.cv argument, not an xgb.train one, so it is dropped here
 nthread = 8,
 gamma = 10,
 max_delta_step = 10
)

Again, nfold belongs to xgb.cv rather than xgb.train; cross-validation would take a separate xgb.cv call, as sketched earlier.

xgb$evaluation_log %>% 
  as_tibble() %>% 
  # filter(validation_auc > max(validation_auc) - 0.005) %>% 
  # arrange(abs(validation_auc-train_auc))
  gather(key,value,train_auc:validation_auc) %>% 
  ggplot(aes(x = iter, y = value, col = key)) +
    geom_line()

The best number of iterations found is 25.

# xgb and test are available in the workspace

# Make predictions on the test set: binary:logistic returns probabilities, so threshold at 0.5
my_prediction <- predict(xgb, newdata = as.matrix(test[-1]))
my_prediction <- ifelse(my_prediction > 0.5, 1, 0)
# Finish the data.frame() call
my_solution <- data.frame(PassengerId = (all_data %>% filter(is.na(Survived)))$PassengerId, 
                          Survived = my_prediction)

# Use nrow() on my_solution
nrow(my_solution)

# Finish the write.csv() call
write_csv(my_solution, "my_solution_dt.csv")

Your submission scored 0.77990, which is not an improvement of your best score. Keep trying!

My feeling is that this dataset is just not a great fit for XGBoost.

The lesson is not to grind away at models but to understand how each variable is treated when running a model with WOE (weight of evidence).
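
A rough sketch of WOE for a single categorical variable (my own, not from the course or the article), using Sex on the training rows of all_data; here WOE for a level is the log of its share among survivors over its share among non-survivors, and sign conventions vary.

# WOE per level of Sex: ln( share among Survived == 1 / share among Survived == 0 )
train_woe <- all_data[1:891, ]
woe_sex <- train_woe %>%
  count(Sex, Survived) %>%
  group_by(Survived) %>%
  mutate(share = n / sum(n)) %>%   # share of each Sex within each outcome
  select(-n) %>%
  spread(Survived, share) %>%
  mutate(woe = log(`1` / `0`))
woe_sex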

Comparing against that article, I notice its AUC is actually quite low:

## [1] "Recall: 0.947976878612717"
## [1] "Precision: 0.841025641025641"
## [1] "Accuracy: 0.850187265917603"
## [1] "AUC: 0.809094822285082"

Yet it ranks higher on the leaderboard, which tells you something: presumably I am overfitting.