
[Study Notes · Technical] 🌲 Machine Learning with Tree-Based Models in R

These are my study notes for the DataCamp course "Machine Learning with Tree-Based Models in R", recording the methods and key techniques for implementing tree-based models in R.

Machine Learning with Tree-Based Models in R

Course Overview

The course covers five core tree-model algorithms and provides plenty of hands-on exercises. It uses R as the implementation language, which makes it a convenient way for learners used to Python to cross-check algorithms in another environment. At roughly four hours, it is a compact way to pick up the core concepts and implementations of tree models.

About the Instructor

The instructor is Gabriela de Queiroz (DataCamp), with the following background:

She likes to mentor and share her knowledge through mentorship programs, tutorials and talks.

  • She focuses on data-science education and excels at explaining complex technical concepts in plain terms.
  • As a member of the R-Ladies NGO, she keeps the teaching style approachable and avoids overly technical language.

Course Highlights

The course is clearly structured and balances theory with practice, helping learners build a systematic understanding of tree models along with practical coding skills.


Study Suggestions

I recommend following along with the video tutorials for the best learning experience.


Core Concepts and Strengths

Key strengths of tree models:

  • Strong interpretability, ease of use, and high predictive accuracy
  • Support for both decision-making and numeric prediction

Course learning objectives:

  • Learn to interpret and explain model decisions
  • Explore different application scenarios
  • Build and evaluate classification and regression models
  • Tune model parameters for best performance

The course covers:

  • Classification and Regression Trees (CART)
  • Bagged Trees
  • Random Forests
  • Gradient Boosted Machines (GBM)

Figure 1: Basic structure of a decision tree

Chapter 1: Classification Trees

1.1 Basic Concepts

Welcome to the course! | R

Regression problems predict a numeric target variable, while classification problems predict a categorical one.

1.2 Building Your First Classification Tree

Build a classification tree | R

Key functions and objects:

  • rpart.plot: visualizes decision trees
  • creditsub: the German credit dataset used in the exercises
  • str(): comparable to pandas' .info(); inspects the structure of the data

Train/test split | R

nrow(): comparable to .shape[0] or len() in Python; returns the number of rows.

Arguments of sample(x, size, replace = FALSE, prob = NULL):

  • x: the vector to sample from
  • size: the number of samples to draw
  • replace: defaults to FALSE, i.e. sampling without replacement
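As a minimal sketch of this pattern (using the built-in mtcars data in place of the course's credit data, so the numbers here are illustrative), a 75/25 train/test split with sample() looks like:

```r
# 75/25 train/test split on the built-in mtcars data (base R only)
set.seed(1)
n <- nrow(mtcars)
train_idx <- sample(x = 1:n, size = round(0.75 * n))  # replace = FALSE by default
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]
nrow(train)  # 24
nrow(test)   # 8
```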

Compute confusion matrix | R

confusionMatrix() automatically computes and displays a full set of classification metrics, including accuracy, sensitivity, and specificity (and, with mode = "everything", precision, recall, and F1).

Compare models with a different splitting criterion | R

ce(): Calculates the classification error.

Introduction to regression trees | R

Measures of homogeneity:

  • Categorical targets: Gini index, information entropy, and similar measures
  • Continuous targets: standard deviation, absolute deviation, and similar measures

The method argument of rpart():

  • "class": for categorical targets
  • "anova": for continuous targets

Data-splitting strategy:

  • Training set: used to fit the model
  • Validation set: used for hyperparameter tuning
  • Test set: used only once, for the final performance estimate

Split the data | R

Argument notes:

  • prob: a vector of probability weights for the sampling

This is an intuitive data-splitting recipe, functionally equivalent to train_test_split in Python; the R version makes each step of the process explicit.
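The prob-weighted assignment idea can be sketched like this (mtcars stands in for the course data, and the 70/15/15 weights are illustrative):

```r
# Assign each row to train/valid/test with 70/15/15 probabilities
set.seed(1)
n <- nrow(mtcars)
assignment <- sample(1:3, size = n, replace = TRUE, prob = c(0.70, 0.15, 0.15))
train <- mtcars[assignment == 1, ]
valid <- mtcars[assignment == 2, ]
test  <- mtcars[assignment == 3, ]
# Every row lands in exactly one of the three sets
stopifnot(nrow(train) + nrow(valid) + nrow(test) == n)
```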

Train a regression tree model | R

Plotting arguments:

  • yesno = 2: controls how the "yes"/"no" labels are shown at the splits
  • type = 0: sets the plot type
  • extra = 0: controls what extra information is displayed

Performance metrics for regression | R

Tools for evaluating regression models:

  • The Metrics package: a complete set of model evaluation metrics
  • rmse(): computes the root mean squared error (RMSE)
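For intuition, Metrics::rmse can be reproduced in one line of base R (the toy vectors below are made up for illustration):

```r
# RMSE is just the square root of the mean squared error
rmse_manual <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

actual    <- c(3, 5, 2, 7)
predicted <- c(2.5, 5, 4, 8)
rmse_manual(actual, predicted)  # 1.145644
```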

What are the hyperparameters for a decision tree | R

Figure 2: The main arguments of rpart.control

Working through these hyperparameters builds a solid understanding of how a decision tree's growth is controlled, which lays the groundwork for model tuning later.

Key hyperparameters:

  • minsplit: the minimum number of observations a node must have before a split is attempted (rpart's default is 20; the smallest possible value is 2)
  • cp: the complexity parameter[^cp]; \(cp \downarrow\ \Rightarrow\ \text{complexity} \uparrow\)
  • maxdepth: the maximum depth of the tree

[^cp]: The threshold that determines when the tree stops splitting. cp = 0 means splitting is unrestricted (growth continues until the other stopping rules apply).

> print(model$cptable)
          CP nsplit rel error    xerror       xstd
1 0.06839852      0 1.0000000 1.0080595 0.09215642
2 0.06726713      1 0.9316015 1.0920667 0.09543723
3 0.03462630      2 0.8643344 0.9969520 0.08632297
4 0.02508343      3 0.8297080 0.9291298 0.08571411
5 0.01995676      4 0.8046246 0.9357838 0.08560120
6 0.01817661      5 0.7846679 0.9337462 0.08087153
7 0.01203879      6 0.7664912 0.9092646 0.07982862
8 0.01000000      7 0.7544525 0.9407895 0.08399125
# Prune the model (to the optimal cp value)
# Returns the pruned, optimal model

model_opt <- prune(tree = model, cp = cp_opt)

Optimal parameter choice: minimizing the cross-validated error (the xerror column) points to row 7: CP = 0.01203879, nsplit = 6, xerror = 0.9092646. (rel error always decreases as the tree grows, so it is xerror, not rel error, that should guide the choice.)
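A self-contained sketch of the prune-to-best-cp workflow (fitting on mtcars, since the course's model object is not available here; minsplit = 5 is just an illustrative setting):

```r
library(rpart)

# Fit a regression tree, then prune to the cp with the lowest
# cross-validated error (xerror) in the cp table
set.seed(1)
model <- rpart(mpg ~ ., data = mtcars, method = "anova", minsplit = 5)
cp_opt <- model$cptable[which.min(model$cptable[, "xerror"]), "CP"]
model_opt <- prune(tree = model, cp = cp_opt)
```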

1.9 Pruning and Optimizing the Model

Grid search for model selection | R

hyper_grid <- expand.grid(minsplit = minsplit, 
                            maxdepth = maxdepth)
> hyper_grid[1:10,]
   minsplit maxdepth
1         1        5
2         6        5
3        11        5
4        16        5
5        21        5
6        26        5
7         1       15
8         6       15
9        11       15
10       16       15

I think expand.grid is ideal for iterating over model configurations.

# create an empty list to store models
models <- list()

# execute the grid search
for (i in 1:nrow(hyper_grid)) {

    # get minsplit, maxdepth values at row i
    minsplit <- hyper_grid$minsplit[i]
    maxdepth <- hyper_grid$maxdepth[i]

    # train a model and store in the list
    models[[i]] <- rpart(formula = response ~ ., 
                         data = train, 
                         method = "anova",
                         minsplit = minsplit,
                         maxdepth = maxdepth)
}

On reflection, the plain for loop is more direct and readable here than reaching for purrr. Storing each fit as models[[i]] makes intuitive sense.

# create an empty vector to store RMSE values
rmse_values <- c()

# compute validation RMSE for each model
for (i in 1:length(models)) {

    # retrieve the i^th model from the list
    model <- models[[i]]

    # generate predictions on the validation set
    pred <- predict(object = model,
                    newdata = valid)

    # compute validation RMSE and add to the vector
    rmse_values[i] <- rmse(actual = valid$response, 
                           predicted = pred)
}

A fully automated workflow. Very nice!

Generate a grid of hyperparameter values | R

seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
    length.out = NULL, along.with = NULL, ...)

Worth memorizing this function: it comes up constantly, and looking it up in help() every time wastes effort.

Generate a grid of models | R

I think I see it now: hyper_grid really is convenient, since it assembles the settings into a single table (or matrix), but each row still has to be unpacked:

minsplit <- hyper_grid$minsplit[i]
maxdepth <- hyper_grid$maxdepth[i]

which remains a bit clumsy, so purrr still looks attractive.
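For what it's worth, base R can also walk the grid without unpacking rows by hand: mapply() iterates the two columns in parallel, much like purrr::pmap(). The grid values below are reconstructed from the printout above, so treat them as illustrative:

```r
# Base-R alternative to indexing hyper_grid row by row
minsplit <- seq(1, 26, 5)
maxdepth <- c(5, 15)
hyper_grid <- expand.grid(minsplit = minsplit, maxdepth = maxdepth)

# mapply applies the function to each (minsplit, maxdepth) pair
describe <- function(minsplit, maxdepth) {
  paste0("minsplit=", minsplit, ", maxdepth=", maxdepth)
}
settings <- mapply(describe, hyper_grid$minsplit, hyper_grid$maxdepth)
length(settings)  # 12
```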

Evaluate the grid | R

The grid contains 24 models in total.

Introduction to bagged trees | R

The bootstrap: this is the idea behind bagging.

Figure 4: Bagging as ensemble learning

> library(ipred)
> bagging(formula = response ~ ., data = dat)

The ipred package!

Train a bagged tree model | R

nbagg限制了包的个数。

If we want to estimate the model's accuracy using the "out-of-bag" (OOB) samples, we can set the coob parameter to TRUE.

The OOB samples are the training observations that were not selected into the bootstrapped sample (used in training). Since these observations were not used in training, we can use them instead to evaluate the accuracy of the model.

This makes full use of the data.
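The OOB idea is easy to demonstrate directly: draw one bootstrap sample and see which observations were never selected (a base-R sketch with a made-up sample size):

```r
# Which observations are "out of bag" for one bootstrap sample?
set.seed(1)
n <- 10
boot_idx <- sample(1:n, size = n, replace = TRUE)  # bootstrap sample
oob_idx  <- setdiff(1:n, boot_idx)                 # never drawn -> OOB
# On average about 1/e (~36.8%) of observations end up out of bag
length(oob_idx) / n
```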

as.factor(default): the response must be a factor here.

Evaluating the bagged tree performance | R

Recall the tangent-line reading of the ROC curve: the maximizing point is the optimum.

> library(Metrics)
> auc(actual, predicted)
[1] 0.76765
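Metrics::auc is the rank-based (Mann-Whitney) statistic; a base-R equivalent makes that concrete (the toy vectors are made up for illustration):

```r
# AUC as the Mann-Whitney statistic: the fraction of (positive, negative)
# pairs in which the positive case gets the higher predicted score
auc_manual <- function(actual, predicted) {
  r <- rank(predicted)               # average ranks handle ties
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

actual    <- c(1, 1, 0, 0)
predicted <- c(0.9, 0.4, 0.6, 0.2)
auc_manual(actual, predicted)  # 0.75
```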

Prediction and confusion matrix | R

A chance to practice a LaTeX matrix:

\[\begin{matrix} & & \text{Pred} & \\ & & 1 & 0 \\ \text{Actual} & 1 & TP & FN \\ & 0 & FP & TN \\ \end{matrix}\]

confusionMatrix() comes from caret:

  • data: a factor of predicted classes (for the default method) or an object of class table.
  • reference: a factor of classes to be used as the true results

Predict on a test set and compute AUC | R

type = "prob" as opposed to type = "class": very familiar by now.

Note that predict() returns two columns, and the yes/no probabilities sum to 1. For prediction we define yes as the positive class, rather than no as originally assumed, and this has to be made explicit when computing the AUC.
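This two-column behaviour can be checked with rpart's bundled kyphosis data (a stand-in for the course's credit data):

```r
library(rpart)

# type = "prob" returns one column per class; each row sums to 1
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
p <- predict(fit, type = "prob")
colnames(p)                       # "absent" "present"
all(abs(rowSums(p) - 1) < 1e-8)   # TRUE
```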

Using caret for cross-validating models | R

# Specify the training configuration
ctrl <- trainControl(method = "cv",                      # Cross-validation
                     number = 5,                         # 5 folds
                     classProbs = TRUE,                  # For AUC
                     summaryFunction = twoClassSummary)  # For AUC

The settings classProbs = TRUE and summaryFunction = twoClassSummary are worth checking in the help files:

  • classProbs: a logical; should class probabilities be computed for classification models (along with predicted values) in each resample?
  • summaryFunction: a function to compute performance metrics across resamples. The arguments to the function should be the same as those in defaultSummary.

These are fine details; no need to dwell on them.

So far rpart has been paired with method = "class" or method = "anova"[^rpart-method]. Now train() is paired with method = "treebag". The naming is all rather loose.

set.seed(1)  #for reproducibility
credit_model <- train(default ~ .,           
                      data = credit_train, 
                      method = "treebag",
                      metric = "ROC",
                      trControl = ctrl)

Bagging really does feel a lot like CV.

Use caret::train() with the "treebag" method to train a model and evaluate the model using cross-validated AUC.

Compare test set performance to CV performance | R

  • The credit_ipred_model_test_auc object stores the test set AUC from the model trained using the ipred::bagging() function.
  • The credit_caret_model_test_auc object stores the test set AUC from the model trained using the caret::train() function with method = "treebag".

This comparison is worth noting. Moments like this are where the fun of learning comes through.

Introduction to Random Forest | R

On to random forests. Much of this, like bagging, revisits familiar ground.

Sampling a subset of the features is the key ingredient of random forests. Why do it? It is feature bagging, also called random sub-feature selection.

Each tree gives up some individual performance (fewer variables to split on), but the correlation between the bagged trees drops. That is the whole rationale.

library(randomForest)

# Train a default RF model (500 trees)

model <- randomForest(formula = response ~ ., data = train)

By default the package grows 500 trees.

Train a Random Forest model | R

Essentially a review!


type: one of regression, classification, or unsupervised. Random forests can even be run unsupervised?

randomForest() requires the response \(y\) to be a factor.

randomForest() cannot convert character columns to factors automatically. Use str(train) to see which columns are character, then convert them with as.factor(). More efficiently, data %>% mutate_if(is.character, as.factor) converts all character variables to factors in one step.
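Without dplyr, the same conversion can be done in base R (toy data frame for illustration):

```r
# Convert every character column to a factor, base-R style (no dplyr needed)
df <- data.frame(x = c("a", "b"), y = c(1, 2), stringsAsFactors = FALSE)
df[] <- lapply(df, function(col) if (is.character(col)) as.factor(col) else col)
str(df)  # x is now a factor; y stays numeric
```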

Understanding Random Forest model output | R

# Print the credit_model output

> print(credit_model)


Call:
 randomForest(formula = default ~ ., data = credit_train) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of  error rate: 24.12%
Confusion matrix:
     no yes class.error
no  516  46  0.08185053
yes 147  91  0.61764706

No. of variables tried at each split: 4 is mtry: the number of predictors sampled as split candidates at each node. For classification it defaults to roughly \(\sqrt{p}\), where \(p\) is the number of predictors.

In OOB estimate of error rate: 24.12%, OOB means out-of-bag: those samples can be used for prediction-based evaluation precisely because they were not used for training.

# Grab OOB error matrix & take a look

> err <- credit_model$err.rate

> head(err)
           OOB        no       yes
[1,] 0.3414634 0.2657005 0.5375000
[2,] 0.3311966 0.2462908 0.5496183
[3,] 0.3232831 0.2476636 0.5147929
[4,] 0.3164933 0.2180294 0.5561224
[5,] 0.3197756 0.2095808 0.5801887
[6,] 0.3176944 0.2115385 0.5619469

This can be used to decide how many trees are needed.
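A sketch of that choice: scan the OOB column of err.rate for its minimum. A mock matrix with the same shape as randomForest's output is used here, so the randomForest package is not required:

```r
# model$err.rate has one row per tree; which.min on the OOB column
# gives the tree count where the OOB error bottoms out
err <- cbind(OOB = c(0.341, 0.331, 0.323, 0.316, 0.320, 0.318),
             no  = c(0.266, 0.246, 0.248, 0.218, 0.210, 0.212),
             yes = c(0.538, 0.550, 0.515, 0.556, 0.580, 0.562))
ntree_opt <- which.min(err[, "OOB"])
ntree_opt  # 4
```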

Figure 5: Random forest error as the number of trees grows

Evaluate out-of-bag error | R

\(\Box\) Honestly, the OOB part is still not fully clear to me.

Evaluate model performance on a test set | R

oob_err is just \(1 - \text{Accuracy}\).

Test Accuracy and OOB Accuracy are not the same thing.

OOB error vs. test set error | R

Advantages & Disadvantages of OOB estimates

  • Can evaluate your model without a separate test set (no need to hold out test data)
  • Computed automatically by the randomForest() function

But …

  • OOB Error only estimates error (not AUC, log-loss, etc.)
  • Can't compare Random Forest performance to other types of models

To my mind, neither limitation matters much.

The Cross Validated post "r - How does predict.randomForest estimate class probabilities?" explains it clearly: type can be response, prob, or votes.

Tuning a Random Forest model | R

  • ntree: number of trees
  • mtry: number of variables randomly sampled as candidates at each split
  • sampsize: number of samples to train on
  • nodesize: minimum size (number of samples) of the terminal nodes
  • maxnodes: maximum number of terminal nodes

The focus here is mtry.

# Execute the tuning process

set.seed(1)              
res <- tuneRF(x = train_predictor_df,
              y = train_response_vector,
              ntreeTry = 500)
 # Look at results
print(res)
      mtry OOBError
2.OOB    2   0.2475
4.OOB    4   0.2475
8.OOB    8   0.2425

tuneRF() is a very handy function.

Tuning a Random Forest via tree depth | R

ncol(credit_train) gives the number of variables in the model. nodesize <- seq(3, 8, 2) sets the minimum number of samples allowed in a terminal node. sampsize <- nrow(credit_train) * c(0.7, 0.8) sets how many samples to train on. model$err.rate[nrow(model$err.rate), "OOB"] extracts the final OOB error.

Introduction to boosting | R

This is the difference: boosted trees improve the model fit by considering past fits, and bagged trees do not.

Train a GBM model | R

  • Adaboost
  • Gradient Boosting Machine (“GBM”)

Adaboost

  • Train a decision tree in which every observation starts with equal weight
  • Increase the weights of misclassified observations and lower those of correctly classified ones
  • Second tree is grown on weighted data
  • Repeat this process for a specified number of iterations

Gradient Boosting = Gradient Descent + Boosting

  • Fit an additive model (ensemble) in a forward, stage-wise manner.
  • In each stage, introduce a "weak learner" (e.g. a decision tree) to compensate for the shortcomings of the existing weak learners.
  • In Adaboost, “shortcomings” are identified by high-weight data points.
  • In Gradient Boosting, the “shortcomings” are identified by gradients.

Why is GBM good (or bad)?

  • Often performs better than any other algorithm
  • Directly optimizes the cost function
  • Overfits (need to find a proper stopping point)
  • Sensitive to extreme values and noise

# Train a 5000-tree GBM model

> model <- gbm(formula = response ~ ., 
               distribution = "bernoulli",
               data = train,
               n.trees = 5000)

distribution = "bernoulli" describes the response \(y\); n.trees = 5000 is the iteration knob.

Train a GBM model | R

For binary classification, gbm() requires the response to be encoded as 0/1 (numeric), so we have to convert the "no"/"yes" factor to a 0/1 numeric response column; ifelse() does the job.
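A tiny illustration of the conversion (made-up factor for demonstration):

```r
# gbm() wants a numeric 0/1 response; convert a "no"/"yes" factor with ifelse
default <- factor(c("no", "yes", "yes", "no"))
default01 <- ifelse(default == "yes", 1, 0)
default01  # 0 1 1 0
```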

> print(credit_model)
gbm(formula = default ~ ., distribution = "bernoulli", data = credit_train, 
    n.trees = 10000)
A gradient boosted model with bernoulli loss function.
10000 iterations were performed.
There were 16 predictors of which 16 had non-zero influence.
> 
> # summary() prints variable importance
> summary(credit_model)
                                      var     rel.inf
checking_balance         checking_balance 33.49502510
amount                             amount 11.62938098
months_loan_duration months_loan_duration 11.17113439
credit_history             credit_history 11.15698321
savings_balance           savings_balance  6.44293358
employment_duration   employment_duration  6.06266137
age                                   age  5.73175696
percent_of_income       percent_of_income  3.74219743
other_credit                 other_credit  3.56695375
purpose                           purpose  3.38820798
housing                           housing  1.55169398
years_at_residence     years_at_residence  1.35255308
job                                   job  0.47631930
phone                               phone  0.09142691
existing_loans_count existing_loans_count  0.08924265
dependents                     dependents  0.05152933

The output tells you how many iterations were run, how many predictors had zero influence (noise variables), and the relative influence of the useful ones. There is real intuition to be gained here.

Tianqi Chen's xgboost is available for R as well.

Prediction using a GBM model | R

predict.gbm() requires n.trees to be supplied (here n.trees = 10000); there is no default. type = "response" returns the probability \(p\) for the Bernoulli distribution and the expected count \(E(n)\) for the Poisson distribution; otherwise predictions come back on the link (log-odds) scale.

> range(preds1)
[1] -3.210354  2.088293
> range(preds2)
[1] 0.03877796 0.88976007
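These two ranges are consistent with the link-scale reading: applying the inverse logit (plogis() in base R) to the endpoints of preds1 reproduces the endpoints of preds2:

```r
# preds1 is on the log-odds (link) scale; type = "response" applies
# the inverse logit, which plogis() reproduces
plogis(-3.210354)  # ~0.03877796, the minimum of preds2
plogis(2.088293)   # ~0.88976007, the maximum of preds2
```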

Evaluate test set AUC | R

# Generate the test set AUCs using the two sets of predictions & compare
auc(actual = credit_test$default, predicted = preds1)  #default
auc(actual = credit_test$default, predicted = preds2)  #rescaled
> auc(actual = credit_test$default, predicted = preds1)  #default
[1] 0.7875175
> auc(actual = credit_test$default, predicted = preds2)  #rescaled
[1] 0.7875175

The two AUCs are identical because AUC depends only on how the predictions rank the observations, and the logistic rescaling is monotone, so it leaves the ranking unchanged.

GBM hyperparameters | R

GBM Hyperparameters

  • n.trees: number of trees
  • bag.fraction: proportion of observations to be sampled in each tree
  • n.minobsinnode: minimum number of observations in the trees terminal nodes
  • interaction.depth: maximum nodes per tree
  • shrinkage: learning rate

The one to focus on here is shrinkage.

Early stopping in GBMs | R

Early stopping means picking the optimal number of iterations, via gbm.perf(). Two methods are available: method = "OOB" and method = "cv".

# Optimal ntree estimate based on OOB
ntree_opt_oob <- gbm.perf(object = credit_model, 
                          method = "OOB", 
                          oobag.curve = TRUE)

# Train a CV GBM model
set.seed(1)
credit_model_cv <- gbm(formula = default ~ ., 
                       distribution = "bernoulli", 
                       data = credit_train,
                       n.trees = 10000,
                       cv.folds = 2)

# Optimal ntree estimate based on CV
ntree_opt_cv <- gbm.perf(object = credit_model_cv, 
                         method = "cv")
 
# Compare the estimates                         
print(paste0("Optimal n.trees (OOB Estimate): ", ntree_opt_oob))                         
print(paste0("Optimal n.trees (CV Estimate): ", ntree_opt_cv))
OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive. Using cv.folds > 0 when calling gbm usually results in improved predictive performance.

Error in plot.window(...) : need finite 'ylim' values

It errored out; no time to look into it.

Figure 6: GBM OOB error curve

Figure 7: GBM cross-validation error curve

> print(paste0("Optimal n.trees (OOB Estimate): ", ntree_opt_oob))
[1] "Optimal n.trees (OOB Estimate): 3233"
> print(paste0("Optimal n.trees (CV Estimate): ", ntree_opt_cv))
[1] "Optimal n.trees (CV Estimate): 7889"

\(\Box\) This deserves a proper look later.

OOB vs CV-based early stopping | R

With n.trees = ntree_opt_oob, the optimal iteration count estimated on the training data is used directly for prediction?

# Generate predictions on the test set using ntree_opt_oob number of trees
preds1 <- predict(object = credit_model, 
                  newdata = credit_test,
                  n.trees = ntree_opt_oob)
                  
# Generate predictions on the test set using ntree_opt_cv number of trees
preds2 <- predict(object = credit_model, 
                  newdata = credit_test,
                  n.trees = ntree_opt_cv)   

# Generate the test set AUCs using the two sets of predictions & compare
auc1 <- auc(actual = credit_test$default, predicted = preds1)  #OOB
auc2 <- auc(actual = credit_test$default, predicted = preds2)  #CV 

# Compare AUC 
print(paste0("Test set AUC (OOB): ", auc1))                         
print(paste0("Test set AUC (CV): ", auc2))

Compare all models based on AUC | R

In this final exercise, we will perform a model comparison across all types of models that we’ve learned about so far: Decision Trees, Bagged Trees, Random Forest and Gradient Boosting Machine (GBM).

A comprehensive comparison of model performance.

Loaded in your workspace are four numeric vectors:

  • dt_preds
  • bag_preds
  • rf_preds
  • gbm_preds

sprintf() is really handy!

# Generate the test set AUCs using the four sets of predictions & compare
actual <- credit_test$default
dt_auc <- auc(actual = actual, predicted = dt_preds)
bag_auc <- auc(actual = actual, predicted = bag_preds)
rf_auc <- auc(actual = actual, predicted = rf_preds)
gbm_auc <- auc(actual = actual, predicted = gbm_preds)

# Print results
sprintf("Decision Tree Test AUC: %.3f", dt_auc)
sprintf("Bagged Trees Test AUC: %.3f", bag_auc)
sprintf("Random Forest Test AUC: %.3f", rf_auc)
sprintf("GBM Test AUC: %.3f", gbm_auc)
> sprintf("Decision Tree Test AUC: %.3f", dt_auc)
[1] "Decision Tree Test AUC: 0.627"
> sprintf("Bagged Trees Test AUC: %.3f", bag_auc)
[1] "Bagged Trees Test AUC: 0.781"
> sprintf("Random Forest Test AUC: %.3f", rf_auc)
[1] "Random Forest Test AUC: 0.804"
> sprintf("GBM Test AUC: %.3f", gbm_auc)
[1] "GBM Test AUC: 0.786"

Plot & compare ROC curves | R

ROCR包能够高效绘制ROC曲线。

# List of predictions
preds_list <- list(dt_preds, bag_preds, rf_preds, gbm_preds)

# List of actual values (same for all)
m <- length(preds_list)
actuals_list <- rep(list(credit_test$default), m)

# Plot the ROC curves
pred <- prediction(preds_list, actuals_list)
rocs <- performance(pred, "tpr", "fpr")
plot(rocs, col = as.list(1:m), main = "Test Set ROC Curves")
legend(x = "bottomright", 
       legend = c("Decision Tree", "Bagged Trees", "Random Forest", "GBM"),
       fill = 1:m)

Figure 8: ROC curves of the four models compared

Certificate


[^rpart-method]: rpart's method argument: method = "anova" for continuous targets; method = "class" for categorical targets; method = "poisson" for count data (Poisson); method = "exp" for survival data (exponential).