
Machine Learning with Tree-Based Models in R: Study Notes

These are study notes; I don't plan to update them.

knitr::opts_chunk$set(warning = FALSE, message = FALSE, eval=F)

Machine Learning with Tree-Based Models in R

All five kinds of trees come with exercises, wow! The code here is all in R, so if you're not fluent in Python, debugging in R is much faster, which makes R a genuinely valuable way to learn tree models. I figure the whole course can be finished in about four hours, so there's not much to worry about.

She likes to mentor and share her knowledge through mentorship programs, tutorials and talks.

So I have a very good impression of the instructor, Gabriela de Queiroz | DataCamp. She focuses on teaching, which makes the material quick to digest, and she's from the NGO R-Ladies, so there's none of the mysterious overconfidence you sometimes get from industry.

Let’s get it.

Well taught; I went back and fixed all the bugs, haha.


Welcome to the course! | R

A regression problem is really just one whose target variable is numeric.

Tree-based models

  • Interpretability + Ease-of-Use + Accuracy
  • Make Decisions + Numeric Predictions

What you’ll learn:

  • Interpret and explain decisions
  • Explore different use cases
  • Build and evaluate classification and regression models
  • Tune model parameters for optimal performance

We will cover:

  • Classification & Regression Trees
  • Bagged Trees
  • Random Forests
  • Boosted Trees (GBM)

Build a classification tree | R

rpart.plot handles the visualization. creditsub holds the German credit data. str() is like pandas' .info().

url="https://assets.datacamp.com/production/course_3022/datasets/credit.csv"
library(tidyverse)
creditsub <- read_csv(url)
# Look at the data
str(creditsub)

# Create the model
library(rpart)
credit_model <- rpart(formula = default ~ ., 
                      data = creditsub, 
                      method = "class")

# Display the results
library(rpart.plot)
rpart.plot(x = credit_model, yesno = 2, type = 0, extra = 0)

Train/test split | R

nrow() is like Python's .shape / len(). In sample(x, size, replace = FALSE, prob = NULL), note that replace defaults to FALSE, size is the sample size to draw, and x is the vector being sampled.

# Total number of rows in the credit data frame
credit <- creditsub
n <- nrow(credit)

# Number of rows for the training set (80% of the dataset)
n_train <- round(0.8 * n) 

# Create a vector of indices which is an 80% random sample
set.seed(123)
train_indices <- sample(1:n, n_train)

# Subset the credit data frame to training indices only
credit_train <- credit[train_indices, ]  
  
# Exclude the training indices to create the test set
credit_test <- credit[-train_indices, ]  

Train a classification tree model | R

# Train the model (to predict 'default')
credit_model <- rpart(formula = default ~ ., 
                      data = credit_train, 
                      method = "class")

# Look at the model output                      
print(credit_model)

Compute confusion matrix | R

# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,  
                        newdata = credit_test,   
                        type = "class")  
                            
# Calculate the confusion matrix for the test set
# (read_csv gives a character column, so coerce the reference to a factor)
caret::confusionMatrix(data = class_prediction,       
                reference = as.factor(credit_test$default))  

confusionMatrix() writes out all the metrics at once.

Compare models with a different splitting criterion | R

ce(): Calculates the classification error.

library(rpart)
# Train a gini-based model
credit_model1 <- rpart(formula = default ~ ., 
                       data = credit_train, 
                       method = "class",
                       parms = list(split = "gini"))

# Train an information-based model
credit_model2 <- rpart(formula = default ~ ., 
                       data = credit_train, 
                       method = "class",
                       parms = list(split = "information"))

# Generate predictions on the validation set using the gini model
pred1 <- predict(object = credit_model1, 
             newdata = credit_test,
             type = "class")    

# Generate predictions on the validation set using the information model
pred2 <- predict(object = credit_model2, 
             newdata = credit_test,
             type = "class")

# Compare classification error with ce() from ModelMetrics
library(ModelMetrics)
ce(actual = credit_test$default, predicted = pred1)
ce(actual = credit_test$default, predicted = pred2)
# equivalently, by hand:
# sum(credit_test$default != pred1) / nrow(credit_test)

Introduction to regression trees | R

When measuring homogeneity,

  • categorical variables use Gini, entropy, …
  • continuous variables use standard deviation, absolute deviation, …

In rpart, method = "class" and method = "anova" determine how the \(y\) variable is treated: the former is for categorical responses, the latter for continuous ones.

Train/Validation/Test Split

  • training set
  • validation set: used for hyperparameter tuning during training
  • test set: used only once, at the very end

Split the data | R

grade <- read_csv(
  "https://assets.datacamp.com/production/course_3022/datasets/grade.csv"
)

prob: A vector of probability weights for obtaining the elements of the vector being sampled.

So it's quite easy; I've finally figured out how R does this. It's as intuitive as Python's train_test_split.

# Look/explore the data
str(grade)

# Randomly assign rows to ids (1/2/3 represents train/valid/test)
# This will generate a vector of ids of length equal to the number of rows
# The train/valid/test split will be approximately 70% / 15% / 15% 
set.seed(1)
assignment <- sample(1:3, size = nrow(grade), prob = c(0.7,0.15,0.15), replace = TRUE)

# Create a train, validation and tests from the original data frame 
grade_train <- grade[assignment == 1, ]    # subset the grade data frame to training indices only
grade_valid <- grade[assignment == 2, ]  # subset the grade data frame to validation indices only
grade_test <- grade[assignment == 3, ]   # subset the grade data frame to test indices only

Train a regression tree model | R

# Train the model
grade_model <- rpart(formula = final_grade ~ ., 
                     data = grade_train, 
                     method = "anova")

# Look at the model output                      
print(grade_model)

# Plot the tree model
rpart.plot(x = grade_model, yesno = 2, type = 0, extra = 0)

\(\Box\) What does yesno mean? (From the rpart.plot docs, as far as I can tell: yesno = 2 writes yes/no labels at every split, not just the top one.)

Performance metrics for regression | R

The Metrics package can be used directly; it has an rmse() function.

Evaluate a regression tree model | R

# Generate predictions on a test set
pred <- predict(object = grade_model,   # model object 
                newdata = grade_test)  # test dataset

# Compute the RMSE
rmse(actual = grade_test$final_grade, 
     predicted = pred)

What are the hyperparameters for a decision tree | R

I used to pay little attention to these parameters, but now they feel familiar. Perhaps that is knowledge quietly accumulating.

  • minsplit: minimum number of data points required to attempt a split; minsplit = 2 gives the smallest possible leaves
  • cp: complexity parameter (footnote 1); \(cp\downarrow\to complexity \uparrow\)
  • maxdepth: depth of the decision tree
> print(model$cptable)
          CP nsplit rel error    xerror       xstd
1 0.06839852      0 1.0000000 1.0080595 0.09215642
2 0.06726713      1 0.9316015 1.0920667 0.09543723
3 0.03462630      2 0.8643344 0.9969520 0.08632297
4 0.02508343      3 0.8297080 0.9291298 0.08571411
5 0.01995676      4 0.8046246 0.9357838 0.08560120
6 0.01817661      5 0.7846679 0.9337462 0.08087153
7 0.01203879      6 0.7664912 0.9092646 0.07982862
8 0.01000000      7 0.7544525 0.9407895 0.08399125
# Prune the model (to optimized cp value)
# Returns the optimal model

model_opt <- prune(tree = model, cp = cp_opt)

Row 4 (CP 0.02508343, nsplit 3, xerror 0.9291298) is clearly the best: row 7 has the minimum xerror, but row 4 is the simplest tree within one standard error of that minimum (the 1-SE rule). A sketch of that selection follows.
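Picking that row programmatically: below is a sketch of the 1-SE rule, assuming the cptable above lives in a fitted rpart object called model (plain which.min() on xerror would pick row 7 instead).

# 1-SE rule: find the minimum cross-validated error, then take the
# simplest tree whose xerror is within one xstd of that minimum
cptab     <- model$cptable
min_row   <- which.min(cptab[, "xerror"])
threshold <- cptab[min_row, "xerror"] + cptab[min_row, "xstd"]
opt_row   <- min(which(cptab[, "xerror"] <= threshold))
cp_opt    <- cptab[opt_row, "CP"]

# Prune with the 1-SE cp (row 4 for the table above)
model_opt <- prune(tree = model, cp = cp_opt)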

Tuning the model | R

which.min(x) returns the index of the minimum.

# Plot the "CP Table"
plotcp(grade_model)

# Print the "CP Table"
print(grade_model$cptable)

# Retrieve optimal cp value based on cross-validated error
opt_index <- which.min(grade_model$cptable[, "xerror"])
cp_opt <- grade_model$cptable[opt_index, "CP"]

# Prune the model (to optimized cp value)
grade_model_opt <- prune(tree = grade_model, 
                         cp = cp_opt)
                          
# Plot the optimized model
rpart.plot(x = grade_model_opt, yesno = 2, type = 0, extra = 0)

Grid search for model selection | R

hyper_grid <- expand.grid(minsplit = minsplit, 
                            maxdepth = maxdepth)
> hyper_grid[1:10,]
   minsplit maxdepth
1         1        5
2         6        5
3        11        5
4        16        5
5        21        5
6        26        5
7         1       15
8         6       15
9        11       15
10       16       15

I think expand.grid is useful for iterating over models.

# create an empty list to store models
models <- list()

# execute the grid search
for (i in 1:nrow(hyper_grid)) {

    # get minsplit, maxdepth values at row i
    minsplit <- hyper_grid$minsplit[i]
    maxdepth <- hyper_grid$maxdepth[i]

    # train a model and store in the list
    # (pass both values; the original slide fetched maxdepth but never used it)
    models[[i]] <- rpart(formula = response ~ ., 
                         data = train, 
                         method = "anova",
                         minsplit = minsplit,
                         maxdepth = maxdepth)
}

I think it finally clicked: a plain for loop is actually much simpler than fiddling with purrr, and models[[i]] is exactly the right idiom.

# create an empty vector to store RMSE values
rmse_values <- c()

# compute validation RMSE for each model
for (i in 1:length(models)) {

    # retrieve the i^th model from the list
    model <- models[[i]]

    # generate predictions on the validation set
    pred <- predict(object = model,
                    newdata = valid)

    # compute validation RMSE and add to the vector
    rmse_values[i] <- rmse(actual = valid$response, 
                           predicted = pred)
}

A fully automated way of thinking, very nice!

Generate a grid of hyperparameter values | R

# Establish a list of possible values for minsplit and maxdepth
minsplit <- seq(1, 4, 1)
maxdepth <- seq(1, 6, 1)

# Create a data frame containing all combinations 
hyper_grid <- expand.grid(minsplit = minsplit, maxdepth = maxdepth)

# Check out the grid
head(hyper_grid)

# Print the number of grid combinations
nrow(hyper_grid)

seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
    length.out = NULL, along.with = NULL, ...)

Remember this function; it comes up all the time. Don't keep reaching for help(), it wastes time.

Generate a grid of models | R

# Number of potential models in the grid
num_models <- nrow(hyper_grid)

# Create an empty list to store models
grade_models <- list()

# Write a loop over the rows of hyper_grid to train the grid of models
for (i in 1:num_models) {

    # Get minsplit, maxdepth values at row i
    minsplit <- hyper_grid$minsplit[i]
    maxdepth <- hyper_grid$maxdepth[i]

    # Train a model and store in the list
    grade_models[[i]] <- rpart(formula = final_grade ~ ., 
                               data = grade_train, 
                               method = "anova",
                               minsplit = minsplit,
                               maxdepth = maxdepth)
}

I roughly get it now. hyper_grid really is convenient, merging everything into one table/matrix, but you still have to unpack it into

minsplit <- hyper_grid$minsplit[i]
maxdepth <- hyper_grid$maxdepth[i]

which is still a hassle, so perhaps purrr is nicer after all.
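For comparison, here is the same grid search with purrr: pmap() unpacks each row of hyper_grid into named arguments, so the manual indexing disappears. A sketch, assuming hyper_grid, grade_train, grade_valid, and Metrics::rmse from this section.

library(purrr)
library(rpart)
library(Metrics)

# pmap() iterates over the rows of hyper_grid, passing minsplit and
# maxdepth as named arguments to the anonymous function
grade_models <- pmap(hyper_grid, function(minsplit, maxdepth) {
  rpart(final_grade ~ ., data = grade_train, method = "anova",
        minsplit = minsplit, maxdepth = maxdepth)
})

# Validation RMSE for every model in a single map_dbl() call
rmse_values <- map_dbl(grade_models, ~ rmse(
  actual    = grade_valid$final_grade,
  predicted = predict(.x, newdata = grade_valid)
))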

Evaluate the grid | R

typeof(grade_models)

It contains 24 models in total (4 minsplit values × 6 maxdepth values).

# Number of potential models in the grid
num_models <- length(grade_models)

# Create an empty vector to store RMSE values
rmse_values <- c()

# Write a loop over the models to compute validation RMSE
for (i in 1:num_models) {

    # Retrieve the i^th model from the list
    model <- grade_models[[i]]
    
    # Generate predictions on grade_valid 
    pred <- predict(object = model,
                    newdata = grade_valid)
    
    # Compute validation RMSE and add to the vector
    rmse_values[i] <- rmse(actual = grade_valid$final_grade, 
                           predicted = pred)
}

# Identify the model with smallest validation set RMSE
best_model <- grade_models[[which.min(rmse_values)]]

# Print the model parameters of the best model
best_model$control

# Compute test set RMSE on best_model
pred <- predict(object = best_model,
                newdata = grade_test)
rmse(actual = grade_test$final_grade, 
     predicted = pred)

Introduction to bagged trees | R

Bootstrap: that is the idea behind bagging (bootstrap aggregating).

> library(ipred)
> bagging(formula = response ~ ., data = dat)

The ipred package! Haha.

Train a bagged tree model | R

nbagg caps the number of bagged trees.

If we want to estimate the model's accuracy using the "out-of-bag" (OOB) samples, we can set the coob parameter to TRUE.

The OOB samples are the training observations that were not selected into the bootstrapped sample (used in training). Since these observations were not used in training, we can use them instead to evaluate the accuracy of the model.

Putting the leftovers to good use.
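Why are there leftovers at all? A bootstrap sample of size \(n\) drawn with replacement misses roughly \(1/e \approx 36.8\%\) of the rows. A quick simulation (any n works):

# Fraction of rows that never appear in one bootstrap sample: about 1/e
set.seed(42)
n <- 10000
boot_idx <- sample(1:n, size = n, replace = TRUE)
mean(!(1:n %in% boot_idx))  # roughly 0.368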

library(ipred)
# Bagging is a randomized model, so let's set a seed (123) for reproducibility
set.seed(123)

# Train a bagged model
credit_model <- bagging(formula = as.factor(default) ~ ., 
                        data = credit_train,
                        coob = TRUE)

# Print the model
print(credit_model)

as.factor(default): the response must be a factor here.

Evaluating the bagged tree performance | R

Think back to the tangent-line way of reading the ROC curve; the maximum point is the optimal solution.

> library(Metrics)
> auc(actual, predicted)
[1] .76765

Prediction and confusion matrix | R

Treat this as a bit of practice writing matrices in LaTeX.

\[\begin{matrix} & & Pred &\\ & & 1 & 0 \\ Actual & 1 & TP & FN \\ & 0 & FP & TN \\ \end{matrix}\]
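From this matrix, the headline numbers reported by confusionMatrix() are simple ratios, e.g.:

\[Accuracy = \frac{TP + TN}{TP + FN + FP + TN}, \qquad Sensitivity = \frac{TP}{TP + FN}, \qquad Specificity = \frac{TN}{FP + TN}\]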

confusionMatrix() comes from caret.

# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,    
                            newdata = credit_test,  
                            type = "class")  # return classification labels

# Print the predicted classes
print(class_prediction)

# Calculate the confusion matrix for the test set
library(caret)
confusionMatrix(data = class_prediction,         
                reference = as.factor(credit_test$default))  # coerce character to factor
  • data:
    a factor of predicted classes (for the default method) or an object of class table.
  • reference:
    a factor of classes to be used as the true results

Predict on a test set and compute AUC | R

type = "prob"区别于type = "class",已经非常熟悉了。

# Generate predictions on the test set
pred <- predict(object = credit_model,
                newdata = credit_test,
                type = "prob")

# `pred` is a matrix
class(pred)
                
# Look at the pred format
head(pred)
                
# Compute the AUC (`actual` must be a binary (or 1/0 numeric) vector)
library(Metrics)
auc(actual = ifelse(credit_test$default == "yes", 1, 0), 
    predicted = pred[,"yes"])                    

Note that predict returns two columns, yes and no, whose probabilities sum to 1. When predicting here we treat yes as the positive class, rather than no as originally assumed; this has to be made explicit in the auc() call.

Using caret for cross-validating models | R

# Specify the training configuration
ctrl <- trainControl(method = "cv",                      # Cross-validation
                     number = 5,                         # 5 folds
                     classProbs = TRUE,                  # For AUC
                     summaryFunction = twoClassSummary)  # For AUC

I didn't understand classProbs = TRUE and summaryFunction = twoClassSummary here, so off to the help pages.

  • classProbs: a logical; should class probabilities be computed for classification models (along with predicted values) in each resample?
  • summaryFunction: a function to compute performance metrics across resamples. The arguments to the function should be the same as those in defaultSummary.

These are detail-level things; don't dwell on them too much.

Previously it was always rpart paired with method = "class" or method = "anova" (footnote 2). Now it's train paired with method = "treebag". These strings are all rather opaque.
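One way to make the strings less opaque: caret keeps a registry of its methods that can be queried. A sketch (if I read the docs right, the "treebag" entry should list ipred among the packages it wraps):

library(caret)

# Look up the "treebag" method in caret's model registry
info <- getModelInfo("treebag", regex = FALSE)[["treebag"]]
info$library  # the packages caret calls under the hood
info$type     # "Classification" and/or "Regression"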

set.seed(1)  #for reproducibility
credit_model <- train(default ~ .,           
                      data = credit_train, 
                      method = "treebag",
                      metric = "ROC",
                      trControl = ctrl)

Bagging really does feel a lot like cross-validation.

Use caret::train() with the "treebag" method to train a model and evaluate the model using cross-validated AUC.

Cross-validate a bagged tree model in caret | R

# Specify the training configuration
ctrl <- trainControl(method = "cv",     # Cross-validation
                     number = 5,      # 5 folds
                     classProbs = TRUE,                  # For AUC
                     summaryFunction = twoClassSummary)  # For AUC

# Cross validate the credit model using "treebag" method; 
# Track AUC (Area under the ROC curve)
set.seed(1)  # for reproducibility
credit_caret_model <- train(default ~ .,
                            data = credit_train, 
                            method = "treebag",
                            metric = "ROC",
                            trControl = ctrl)

# Look at the model object
print(credit_caret_model)

# Inspect the contents of the model list 
names(credit_caret_model)

# Print the CV AUC
credit_caret_model$results[,"ROC"]

Generate predictions from the caret model | R

# Generate predictions on the test set
pred <- predict(object = credit_caret_model, 
                newdata = credit_test,
                type = "prob")

# Compute the AUC (`actual` must be a binary (or 1/0 numeric) vector)
auc(actual = ifelse(credit_test$default == "yes", 1, 0), 
                    predicted = pred[,"yes"])

Compare test set performance to CV performance | R

  • The credit_ipred_model_test_auc object stores the test set AUC from the model trained using the ipred::bagging() function.
  • The credit_caret_model_test_auc object stores the test set AUC from the model trained using the caret::train() function with method = "treebag".

This comparison is worth noting; moments like this are where the fun of learning shows up.

Introduction to Random Forest | R

On to Random Forests; it's mostly reheated material, as bagging was.

Sampling a subset of the features is the crux of Random Forests. Why do this? Feature bagging, i.e. random sub-features.

Each tree gives up some performance (it sees fewer variables), but the correlation between the bagged trees drops. That is the whole rationale.

library(randomForest)

# Train a default RF model (500 trees)

model <- randomForest(formula = response ~ ., data = train)

This silly package defaults to five hundred trees.
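Since bagging is just the special case mtry = p, the decorrelation argument can be checked directly by comparing OOB errors. A minimal sketch, assuming credit_train from earlier (the variable names here are mine):

library(randomForest)
library(dplyr)

train_fct <- credit_train %>% mutate_if(is.character, as.factor)
p <- ncol(train_fct) - 1  # number of predictors

set.seed(1)
rf_default <- randomForest(default ~ ., data = train_fct)            # mtry = floor(sqrt(p))
set.seed(1)
rf_bagged  <- randomForest(default ~ ., data = train_fct, mtry = p)  # plain bagging

# Final OOB error of each forest; the decorrelated one is usually lower
rf_default$err.rate[nrow(rf_default$err.rate), "OOB"]
rf_bagged$err.rate[nrow(rf_bagged$err.rate), "OOB"]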

Train a Random Forest model | R

This is effectively a review!


library(randomForest)
library(tidyverse)
# Train a Random Forest
set.seed(1)  # for reproducibility
credit_model <- randomForest(formula = as.factor(default) ~ ., 
                             data = credit_train %>% mutate_if(is.character,as.factor),
                             type = "classification"
                             )
                             
# Print the model output                             
print(credit_model)
# head(credit_train)

type: one of regression, classification, or unsupervised. So Random Forest can even be used unsupervised?

randomForest requires \(y\) to be a factor.

randomForest has no way to automatically convert character columns to factor. You can use str(train) to see which fields are character, then convert them with as.factor(). Hence data %>% mutate_if(is.character, as.factor) saves the day.

Understanding Random Forest model output | R

# Print the credit_model output

> print(credit_model)


Call:
 randomForest(formula = default ~ ., data = credit_train) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of  error rate: 24.12%
Confusion matrix:
     no yes class.error
no  516  46  0.08185053
yes 147  91  0.61764706

"No. of variables tried at each split: 4" is mtry: the number of predictors sampled for splitting at each node. Typically \(\sqrt{n}\), where \(n\) is the number of features.

In "OOB estimate of error rate: 24.12%", OOB stands for out-of-bag; these samples can be used for evaluation precisely because they were not used in training.
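A quick check of that rule of thumb, assuming the credit data from above (randomForest's classification default is floor(sqrt(p))):

# Default mtry for classification is floor(sqrt(p)); with 16 predictors
# this gives 4, matching "No. of variables tried at each split: 4"
p <- ncol(credit_train) - 1  # predictors, excluding the response
floor(sqrt(p))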

# Grab OOB error matrix & take a look

> err <- credit_model$err.rate

> head(err)
           OOB        no       yes
[1,] 0.3414634 0.2657005 0.5375000
[2,] 0.3311966 0.2462908 0.5496183
[3,] 0.3232831 0.2476636 0.5147929
[4,] 0.3164933 0.2180294 0.5561224
[5,] 0.3197756 0.2095808 0.5801887
[6,] 0.3176944 0.2115385 0.5619469

The err.rate matrix can be used to decide how many trees to use.

Evaluate out-of-bag error | R

# Grab OOB error matrix & take a look
err <- credit_model$err.rate
head(err)

# Look at final OOB error rate (last row in err matrix)
oob_err <- err[nrow(err), "OOB"]
print(oob_err)

# Plot the model trained in the previous exercise
plot(credit_model)

# Add a legend since it doesn't have one by default
legend(x = "right", 
       legend = colnames(err),
       fill = 1:ncol(err))

\(\Box\) Honestly, I don't fully understand OOB here. (Roughly: each row of err.rate is the cumulative OOB error after that many trees, where each observation is predicted only by the trees whose bootstrap sample left it out.)

Evaluate model performance on a test set | R

oob_err is simply \(1-Acc\).

# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,   # model object 
                            newdata = credit_test %>% 
                            mutate_if(is.character,as.factor),  # test dataset
                            type = "class") # return classification labels
                            
# Calculate the confusion matrix for the test set
cm <- confusionMatrix(data = class_prediction,       # predicted classes
                      reference = as.factor(credit_test$default))  # actual classes, as factor
print(cm)

# Compare test set accuracy to OOB accuracy
paste0("Test Accuracy: ", cm$overall[1])
paste0("OOB Accuracy: ", 1 - oob_err)

Test Accuracy and OOB Accuracy are not the same thing.

OOB error vs. test set error | R

Advantages & Disadvantages of OOB estimates

  • Can evaluate your model without a separate test set (no need to hold out a test sample)
  • Computed automatically by the randomForest() function

But …

  • OOB Error only estimates error (not AUC, log-loss, etc.)
  • Can’t compare Random Forest performance to other types of models

In my view, none of these points matter all that much.

# Generate predictions on the test set
pred <- predict(object = credit_model,
            newdata = credit_test %>% 
              mutate_if(is.character,as.factor),
            type = "prob")

# `pred` is a matrix
class(pred)
                
# Look at the pred format
head(pred)
                
# Compute the AUC (`actual` must be a binary 1/0 numeric vector)
auc(actual = ifelse(credit_test$default == "yes", 1, 0), 
    predicted = pred[,"yes"])                    

"How does predict.randomForest estimate class probabilities?" on Cross Validated explains it clearly: type can be response, prob, or votes.

Tuning a Random Forest model | R

  • ntree: number of trees
  • mtry: number of variables randomly sampled as candidates at each split
  • sampsize: number of samples to train on
  • nodesize: minimum size (number of samples) of the terminal nodes
  • maxnodes: maximum number of terminal nodes

The focus here is mtry.

# Execute the tuning process

set.seed(1)              
res <- tuneRF(x = train_predictor_df,
              y = train_response_vector,
              ntreeTry = 500)
 # Look at results
print(res)
      mtry OOBError
2.OOB    2   0.2475
4.OOB    4   0.2475
8.OOB    8   0.2425

tuneRF() is a really handy function.

Tuning a Random Forest via mtry | R

# Execute the tuning process
set.seed(1)              
res <- tuneRF(x = subset(credit_train, select = -default) %>% mutate_if(is.character,as.factor),
              y = credit_train$default %>% as.factor(),
              ntreeTry = 500)
               
# Look at results
print(res)

# Find the mtry value that minimizes OOB Error
mtry_opt <- res[,"mtry"][which.min(res[,"OOBError"])]
print(mtry_opt)

# If you just want to return the best RF model (rather than results)
# you can set `doBest = TRUE` in `tuneRF()` to return the best RF model
# instead of a set performance matrix.

Tuning a Random Forest via tree depth | R

ncol(credit_train) gives the number of variables in the model. nodesize <- seq(3, 8, 2) sets the minimum number of samples allowed at a terminal node. sampsize <- nrow(credit_train) * c(0.7, 0.8) controls how many samples are used to train. model$err.rate[nrow(model$err.rate), "OOB"] extracts the OOB error.

# Establish a list of possible values for mtry, nodesize and sampsize
mtry <- seq(4, ncol(credit_train) * 0.8, 2)
nodesize <- seq(3, 8, 2)
sampsize <- nrow(credit_train) * c(0.7, 0.8)

# Create a data frame containing all combinations 
hyper_grid <- expand.grid(mtry = mtry, nodesize = nodesize, sampsize = sampsize)

# Create an empty vector to store OOB error values
oob_err <- c()

# Write a loop over the rows of hyper_grid to train the grid of models
for (i in 1:nrow(hyper_grid)) {

    # Train a Random Forest model
    model <- randomForest(formula = default ~ ., 
                          data = credit_train %>% mutate_if(is.character,as.factor),
                          mtry = hyper_grid$mtry[i],
                          nodesize = hyper_grid$nodesize[i],
                          sampsize = hyper_grid$sampsize[i]
                          )
    # Store OOB error for the model                      
    oob_err[i] <- model$err.rate[nrow(model$err.rate), "OOB"]
}

# Identify optimal set of hyperparameters based on OOB error
opt_i <- which.min(oob_err)
print(hyper_grid[opt_i,])

Introduction to boosting | R

Here is the difference: boosted trees improve the model fit by considering past fits; bagged trees do not.

Train a GBM model | R

  • Adaboost
  • Gradient Boosting Machine (“GBM”)

Adaboost

  • Train a decision tree in which all observations carry equal weight
  • Increase the weights of the misclassified observations and lower the weights of the rest
  • Second tree is grown on weighted data
  • Repeat this process for a specified number of iterations

Gradient Boosting = Gradient Descent + Boosting

  • Fit an additive model (ensemble) in a forward, stage-wise manner.
  • In each stage, introduce a “weak learner” (e.g. a decision tree) to compensate for the shortcomings of the existing weak learners.
  • In Adaboost, “shortcomings” are identified by high-weight data points.
  • In Gradient Boosting, the “shortcomings” are identified by gradients (see the sketch after this list).
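
To see what "identified by gradients" means in code: for squared-error loss the negative gradient is simply the residual, so each boosting stage fits a small tree to the current residuals. A minimal hand-rolled sketch, assuming grade_train from earlier (an illustration of the idea, not how gbm is implemented internally):

library(rpart)

n_trees   <- 100
shrinkage <- 0.1

# Stage 0: start from the mean prediction
pred  <- rep(mean(grade_train$final_grade), nrow(grade_train))
trees <- vector("list", n_trees)

for (b in 1:n_trees) {
  stage_data <- grade_train
  stage_data$final_grade <- grade_train$final_grade - pred  # residuals = negative gradient
  trees[[b]] <- rpart(final_grade ~ ., data = stage_data, method = "anova",
                      control = rpart.control(maxdepth = 2))  # weak learner
  pred <- pred + shrinkage * predict(trees[[b]], newdata = grade_train)
}

# Training RMSE keeps shrinking as stages accumulate
sqrt(mean((grade_train$final_grade - pred)^2))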

Why is GBM good/bad?

  • Often performs better than any other algorithm
  • Directly optimizes the cost function
  • Overfits (need to find a proper stopping point)
  • Sensitive to extreme values and noises
# Train a 5000-tree GBM model

> model <- gbm(formula = response ~ ., 
               distribution = "bernoulli",
               data = train,
               n.trees = 5000)

distribution = "bernoulli"针对于\(y\)n.trees = 5000就是iteration开关。

Train a GBM model | R

For binary classification, gbm() requires the response to be encoded as 0/1 (numeric), so we have to convert the "no"/"yes" response to a 0/1 numeric column. This can be done with ifelse().

# Convert "yes" to 1, "no" to 0
credit_train$default <- ifelse(credit_train$default == "yes", 1, 0)

# Train a 10000-tree GBM model
set.seed(1)
library(gbm)
credit_model <- gbm(formula = default ~ ., 
                    distribution = "bernoulli", 
                    data = credit_train %>% mutate_if(is.character,as.factor),
                    n.trees = 10000)
                    
# Print the model object                    
print(credit_model)

# summary() prints variable importance
# summary(credit_model)
# str(credit_model)
> print(credit_model)
gbm(formula = default ~ ., distribution = "bernoulli", data = credit_train, 
    n.trees = 10000)
A gradient boosted model with bernoulli loss function.
10000 iterations were performed.
There were 16 predictors of which 16 had non-zero influence.
> 
> # summary() prints variable importance
> summary(credit_model)
                                      var     rel.inf
checking_balance         checking_balance 33.49502510
amount                             amount 11.62938098
months_loan_duration months_loan_duration 11.17113439
credit_history             credit_history 11.15698321
savings_balance           savings_balance  6.44293358
employment_duration   employment_duration  6.06266137
age                                   age  5.73175696
percent_of_income       percent_of_income  3.74219743
other_credit                 other_credit  3.56695375
purpose                           purpose  3.38820798
housing                           housing  1.55169398
years_at_residence     years_at_residence  1.35255308
job                                   job  0.47631930
phone                               phone  0.09142691
existing_loans_count existing_loans_count  0.08924265
dependents                     dependents  0.05152933

This output tells you how many iterations were run, how many predictors had zero influence, and the details of the informative variables. You can build some intuition from it.

Tianqi Chen has given R the xgboost package.

Prediction using a GBM model | R

predict.gbm requires n.trees (here n.trees = 10000); it has no default. type = "response" returns \(p\) for the Bernoulli distribution and the expected count for the Poisson distribution; otherwise predictions come back on the link (log-odds) scale, as the ranges below confirm.

# Since we converted the training response col, let's also convert the test response col
credit_test$default <- ifelse(credit_test$default == "yes", 1, 0)

# Generate predictions on the test set
preds1 <- predict(object = credit_model, 
                  newdata = credit_test,
                  n.trees = 10000)

# Generate predictions on the test set (scale to response)
preds2 <- predict(object = credit_model, 
                  newdata = credit_test,
                  n.trees = 10000,
                  type = "response")

# Compare the range of the two sets of predictions
range(preds1)
range(preds2)
> range(preds1)
[1] -3.210354  2.088293
> range(preds2)
[1] 0.03877796 0.88976007
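The two scales are linked by the inverse logit: for distribution = "bernoulli" the link scale is log-odds, so plogis() should map preds1 onto preds2. A quick sanity check, assuming the two vectors from above:

# type = "response" is just the inverse-logit of the link-scale predictions
all.equal(plogis(preds1), preds2)  # should be TRUE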

Evaluate test set AUC | R

# Generate the test set AUCs using the two sets of predictions & compare
auc(actual = credit_test$default, predicted = preds1)  #default
auc(actual = credit_test$default, predicted = preds2)  #rescaled
> auc(actual = credit_test$default, predicted = preds1)  #default
[1] 0.7875175
> auc(actual = credit_test$default, predicted = preds2)  #rescaled
[1] 0.7875175

The two AUCs are identical because AUC depends only on the ranking of the predictions, and the inverse-logit rescaling is monotone.

GBM hyperparameters | R

GBM Hyperparameters

  • n.trees: number of trees
  • bag.fraction: proportion of observations to be sampled in each tree
  • n.minobsinnode: minimum number of observations in the trees terminal nodes
  • interaction.depth: maximum nodes per tree
  • shrinkage: learning rate

The key one here is shrinkage.
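For example, the learning rate can be set explicitly. A sketch (gbm's default shrinkage is 0.1; smaller values usually need more trees to reach the same fit):

library(gbm)

# Same GBM as before, but with a 10x smaller learning rate; each tree
# contributes less, so more trees are typically required
set.seed(1)
credit_model_slow <- gbm(formula = default ~ .,
                         distribution = "bernoulli",
                         data = credit_train,
                         n.trees = 10000,
                         shrinkage = 0.01)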

Early stopping in GBMs | R

Early stopping means picking the optimal number of iterations, via gbm.perf(). Two methods are available: method = "OOB" and method = "cv".

# Optimal ntree estimate based on OOB
ntree_opt_oob <- gbm.perf(object = credit_model, 
                          method = "OOB", 
                          oobag.curve = TRUE)

# Train a CV GBM model
set.seed(1)
credit_model_cv <- gbm(formula = default ~ ., 
                       distribution = "bernoulli", 
                       data = credit_train,
                       n.trees = 10000,
                       cv.folds = 2)

# Optimal ntree estimate based on CV
ntree_opt_cv <- gbm.perf(object = credit_model_cv, 
                         method = "cv")
 
# Compare the estimates                         
print(paste0("Optimal n.trees (OOB Estimate): ", ntree_opt_oob))                         
print(paste0("Optimal n.trees (CV Estimate): ", ntree_opt_cv))
OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive. Using cv.folds > 0 when calling gbm usually results in improved predictive performance.

Error in plot.window(...) : need finite 'ylim' values

It threw an error (apparently from the oobag.curve plot); no time to dig into it.

> print(paste0("Optimal n.trees (OOB Estimate): ", ntree_opt_oob))
[1] "Optimal n.trees (OOB Estimate): 3233"
> print(paste0("Optimal n.trees (CV Estimate): ", ntree_opt_cv))
[1] "Optimal n.trees (CV Estimate): 7889"

\(\Box\) This deserves a closer look.

OOB vs CV-based early stopping | R

n.trees = ntree_opt_oob: so the optimal iteration count estimated on the training run is used directly here?

# Generate predictions on the test set using ntree_opt_oob number of trees
preds1 <- predict(object = credit_model, 
                  newdata = credit_test,
                  n.trees = ntree_opt_oob)
                  
# Generate predictions on the test set using ntree_opt_cv number of trees
preds2 <- predict(object = credit_model, 
                  newdata = credit_test,
                  n.trees = ntree_opt_cv)   

# Generate the test set AUCs using the two sets of predictions & compare
auc1 <- auc(actual = credit_test$default, predicted = preds1)  #OOB
auc2 <- auc(actual = credit_test$default, predicted = preds2)  #CV 

# Compare AUC 
print(paste0("Test set AUC (OOB): ", auc1))                         
print(paste0("Test set AUC (CV): ", auc2))

Compare all models based on AUC | R

In this final exercise, we will perform a model comparison across all types of models that we’ve learned about so far: Decision Trees, Bagged Trees, Random Forest and Gradient Boosting Machine (GBM).

Time for the clash-of-the-titans round.

Loaded in your workspace are four numeric vectors:

  • dt_preds
  • bag_preds
  • rf_preds
  • gbm_preds

sprintf is really handy!

# Generate the test set AUCs using the two sets of predictions & compare
actual <- credit_test$default
dt_auc <- auc(actual = actual, predicted = dt_preds)
bag_auc <- auc(actual = actual, predicted = bag_preds)
rf_auc <- auc(actual = actual, predicted = rf_preds)
gbm_auc <- auc(actual = actual, predicted = gbm_preds)

# Print results
sprintf("Decision Tree Test AUC: %.3f", dt_auc)
sprintf("Bagged Trees Test AUC: %.3f", bag_auc)
sprintf("Random Forest Test AUC: %.3f", rf_auc)
sprintf("GBM Test AUC: %.3f", gbm_auc)
> sprintf("Decision Tree Test AUC: %.3f", dt_auc)
[1] "Decision Tree Test AUC: 0.627"
> sprintf("Bagged Trees Test AUC: %.3f", bag_auc)
[1] "Bagged Trees Test AUC: 0.781"
> sprintf("Random Forest Test AUC: %.3f", rf_auc)
[1] "Random Forest Test AUC: 0.804"
> sprintf("GBM Test AUC: %.3f", gbm_auc)
[1] "GBM Test AUC: 0.786"

Plot & compare ROC curves | R

The ROCR package draws the curves; this instructor really has good sense.

# prediction() and performance() come from ROCR
library(ROCR)

# List of predictions
preds_list <- list(dt_preds, bag_preds, rf_preds, gbm_preds)

# List of actual values (same for all)
m <- length(preds_list)
actuals_list <- rep(list(credit_test$default), m)

# Plot the ROC curves
pred <- prediction(preds_list, actuals_list)
rocs <- performance(pred, "tpr", "fpr")
plot(rocs, col = as.list(1:m), main = "Test Set ROC Curves")
legend(x = "bottomright", 
       legend = c("Decision Tree", "Bagged Trees", "Random Forest", "GBM"),
       fill = 1:m)

Certificate


  1. cp determines when the splitting up of the decision tree stops; cp = 0 means splits are never stopped.

  2. The method argument of rpart:

    • method = "anova" for continuous responses
    • method = "class" for categorical responses
    • method = "poisson" for count data (Poisson distribution)
    • method = "exp" for survival data (exponential distribution)