These are study notes for the DataCamp course "Machine Learning with Tree-Based Models in R", systematically recording how tree-based models are implemented in R and the key technical points.
Machine Learning with Tree-Based Models in R
Course Overview
This course systematically introduces five core tree-model algorithms with plenty of hands-on exercises. It uses R as the main implementation tool, which also gives learners used to the Python ecosystem an efficient way to debug and cross-check algorithms. The course is compact, taking about 4 hours, and is well suited for quickly grasping the core concepts and implementation of tree models.
About the Instructor
The course is taught by Gabriela de Queiroz | DataCamp, whose background includes:
She likes to mentor and share her knowledge through mentorship programs, tutorials and talks.
- Focused on data science education, skilled at explaining complex technical concepts accessibly
- A member of the R-Ladies NGO, with an approachable teaching style that avoids overly technical language
Course Highlights
The course is clearly structured and balances theory with practice, helping learners build a systematic knowledge framework for tree models and acquire practical programming skills.
Study Suggestions
It is best studied alongside the video tutorials.
Core Concepts and Advantages
Key advantages of tree models:
- Highly interpretable, easy to use, and accurate
- Support both decision making and numeric prediction
Learning objectives:
- Learn to interpret and explain a model's decision process
- Explore different application scenarios
- Build and evaluate classification and regression models
- Tune model parameters for best performance
Topics covered:
- Classification and regression trees (CART)
- Bagged trees (Bagging)
- Random forests
- Gradient boosting machines (GBM)

Figure 1: Basic structure of a decision tree
Chapter 1: Classification Trees
1.1 Basic Concepts
Welcome to the course! | R
Regression problems predict a numeric target variable, while classification problems predict a categorical target variable.
1.2 Building the First Classification Tree
Build a classification tree | R
Key functions and objects:
- rpart.plot: visualizes a decision tree
- creditsub: the German credit dataset used in the exercises
- str(): similar to pandas' .info() method; shows the structure of the data
Train/test split | R
nrow: similar to Python's .shape[0] or len(); returns the number of rows.
sample(x, size, replace = FALSE, prob = NULL) arguments:
- x: the vector to sample from
- size: the number of samples to draw
- replace: defaults to FALSE, i.e. sampling without replacement
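The sample()-based split above can be sketched end to end; `df` here is a toy stand-in for the course's credit data.

```r
# A minimal 70/30 train/test split with sample()
set.seed(1)
df <- data.frame(x = rnorm(100), y = rnorm(100))
n <- nrow(df)
# replace = FALSE (the default) means sampling without replacement
train_idx <- sample(x = 1:n, size = round(0.7 * n))
train <- df[train_idx, ]
test  <- df[-train_idx, ]
nrow(train)  # 70
nrow(test)   # 30
```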
Compute confusion matrix | R
confusionMatrix() automatically computes and displays a full set of classification metrics, including accuracy, sensitivity, specificity, and related measures.
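The cross-tabulation underneath confusionMatrix() can be built by hand with table(); caret's function layers accuracy, sensitivity, etc. on top of this same table. The small vectors below are made-up illustration data.

```r
# A hand-rolled confusion matrix with table()
pred   <- factor(c("yes", "no", "yes", "yes"), levels = c("no", "yes"))
actual <- factor(c("yes", "no", "no",  "yes"), levels = c("no", "yes"))
cm <- table(predicted = pred, actual = actual)
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 0.75
```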
Compare models with a different splitting criterion | R
ce(): calculates the classification error.
Introduction to regression trees | R
Homogeneity measures:
- Categorical variables: Gini index, information entropy, etc.
- Continuous variables: standard deviation, absolute deviation, etc.
rpart() method argument:
- "class": for categorical target variables
- "anova": for continuous target variables
Data-splitting strategy:
- Training set: used to train the model
- Validation set: used for hyperparameter tuning
- Test set: used only once, for the final performance evaluation
Split the data | R
Argument:
prob: a vector of probability weights for the values being sampled
Notes:
This implements an intuitive data-splitting logic, functionally equivalent to Python's train_test_split. The R version keeps the data-processing flow explicit and easy to follow.
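The course's three-way split works by assigning each row a random partition label via the prob argument; a sketch with a toy data frame standing in for the course data:

```r
# 70/15/15 train/valid/test split via sample()'s prob argument
set.seed(1)
df <- data.frame(x = rnorm(1000), y = rnorm(1000))
# each row gets partition label 1, 2, or 3
assignment <- sample(1:3, size = nrow(df), replace = TRUE,
                     prob = c(0.70, 0.15, 0.15))
train <- df[assignment == 1, ]
valid <- df[assignment == 2, ]
test  <- df[assignment == 3, ]
# the three partitions together cover every row exactly once
nrow(train) + nrow(valid) + nrow(test)  # 1000
```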
Train a regression tree model | R
Argument notes:
- yesno = 2: controls how the "yes/no" labels are displayed at the nodes
- type = 0: sets the plot type of the tree
- extra = 0: controls the display of extra information
Performance metrics for regression | R
Regression evaluation tools:
- The Metrics package: a complete set of model evaluation metrics
- rmse(): computes the root mean square error
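rmse() is just the square root of the mean squared difference, which is easy to verify by hand on made-up numbers:

```r
# Metrics::rmse(actual, predicted) computes sqrt(mean((actual - predicted)^2));
# a hand-rolled check on toy values:
actual    <- c(3, 5, 2.5, 7)
predicted <- c(2.5, 5, 4, 8)
rmse_manual <- sqrt(mean((actual - predicted)^2))
round(rmse_manual, 4)  # 0.9354
```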
What are the hyperparameters for a decision tree | R

Figure 2: Main parameters of rpart.control
Systematically studying these hyperparameters gives a solid understanding of how a decision tree is controlled, laying the groundwork for later model tuning.
Key hyperparameters:
- minsplit: the minimum number of observations a node must contain before a split is attempted; minsplit = 2 allows the smallest possible leaf nodes
- cp: the complexity parameter [^cp], \(cp\downarrow\to \text{complexity} \uparrow\)
- maxdepth: the maximum depth of the tree
[^cp]: the threshold that decides when the tree stops splitting. cp = 0 places no limit on splitting (until the other stopping criteria are met).
> print(model$cptable)
CP nsplit rel error xerror xstd
1 0.06839852 0 1.0000000 1.0080595 0.09215642
2 0.06726713 1 0.9316015 1.0920667 0.09543723
3 0.03462630 2 0.8643344 0.9969520 0.08632297
4 0.02508343 3 0.8297080 0.9291298 0.08571411
5 0.01995676 4 0.8046246 0.9357838 0.08560120
6 0.01817661 5 0.7846679 0.9337462 0.08087153
7 0.01203879 6 0.7664912 0.9092646 0.07982862
8 0.01000000 7 0.7544525 0.9407895 0.08399125
# Prune the model (to optimized cp value)
# Returns the optimal model
model_opt <- prune(tree = model, cp = cp_opt)
Optimal parameter selection:
Minimizing the cross-validated error (xerror) selects the row-7 combination:
CP = 0.01203879, nsplit = 6, xerror = 0.9092646
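The "lowest xerror" row can be picked programmatically from the cp table before pruning; a sketch using iris as a stand-in for the course data:

```r
library(rpart)
# Fit a small regression tree, then pick the cp of the row with the lowest
# cross-validated error (xerror) and prune to it
set.seed(1)
fit <- rpart(Sepal.Length ~ ., data = iris, method = "anova")
cp_opt <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
fit_pruned <- prune(tree = fit, cp = cp_opt)
```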
1.9 Pruning the Model
Grid search for model selection | R
hyper_grid <- expand.grid(minsplit = minsplit,
                          maxdepth = maxdepth)
> hyper_grid[1:10,]
minsplit maxdepth
1 1 5
2 6 5
3 11 5
4 16 5
5 21 5
6 26 5
7 1 15
8 6 15
9 11 15
10 16 15
expand.grid() is clearly useful for iterating over model configurations.
# create an empty list to store models
models <- list()

# execute the grid search
for (i in 1:nrow(hyper_grid)) {
  # get minsplit, maxdepth values at row i
  minsplit <- hyper_grid$minsplit[i]
  maxdepth <- hyper_grid$maxdepth[i]

  # train a model and store in the list
  models[[i]] <- rpart(formula = response ~ .,
                       data = train,
                       method = "anova",
                       minsplit = minsplit,
                       maxdepth = maxdepth)
}
In practice, a plain for loop is more direct and readable here than reaching for the purrr package.
Indexing with models[[i]] is what makes this work.
# create an empty vector to store RMSE values
rmse_values <- c()

# compute validation RMSE for each model
for (i in 1:length(models)) {
  # retrieve the i^th model from the list
  model <- models[[i]]

  # generate predictions on the validation set
  pred <- predict(object = model,
                  newdata = valid)

  # compute validation RMSE and add to the vector
  rmse_values[i] <- rmse(actual = valid$response,
                         predicted = pred)
}
A fully automated workflow, nice!
Generate a grid of hyperparameter values | R
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
    length.out = NULL, along.with = NULL, ...)
Worth memorizing this function; it comes up all the time, and reaching for help() every time wastes effort.
Generate a grid of models | R
I roughly get it now: hyper_grid does conveniently combine everything into one table (or matrix), but each row still has to be unpacked with
minsplit <- hyper_grid$minsplit[i]
maxdepth <- hyper_grid$maxdepth[i]
which is tedious, so purrr is still preferable.
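The row-by-row unpacking can indeed be avoided; a sketch with purrr::pmap(), assuming purrr is installed and using iris plus a small made-up grid as stand-ins:

```r
library(purrr)
library(rpart)
# pmap() passes each row of hyper_grid as named arguments,
# so no manual hyper_grid$minsplit[i] indexing is needed
hyper_grid <- expand.grid(minsplit = c(2, 10, 20), maxdepth = c(3, 6))
models <- pmap(hyper_grid, function(minsplit, maxdepth) {
  rpart(Sepal.Length ~ ., data = iris, method = "anova",
        minsplit = minsplit, maxdepth = maxdepth)
})
length(models)  # 6
```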
Evaluate the grid | R
There are 24 models in total.
Introduction to bagged trees | R

Bootstrap sampling is the idea behind bagging.

Figure 4: Ensemble learning with bagging
> library(ipred)
> bagging(formula = response ~ ., data = dat)
Use the ipred package!
Train a bagged tree model | R
nbagg sets the number of bagged trees.
If we want to estimate the model's accuracy using the "out-of-bag" (OOB) samples, we can set the coob parameter to TRUE.
The OOB samples are the training observations that were not selected into the bootstrapped sample (used in training). Since these observations were not used in training, we can use them instead to evaluate the accuracy of the model.
This makes full use of the data.
as.factor(default): the response must be a factor here.
Evaluating the bagged tree performance | R
Recall the tangent-line view of the ROC curve: the maximum point is the optimum.
> library(Metrics)
> auc(actual, predicted)
[1] 0.76765
Prediction and confusion matrix | R
A good chance to practice LaTeX matrices.
\[\begin{matrix} & & Pred &\\ & & 1 & 0 \\ Actual & 1 & TP & FN \\ & 0 & FP & TN \\ \end{matrix}\]
confusionMatrix() comes from caret.
- data: a factor of predicted classes (for the default method) or an object of class table
- reference: a factor of classes to be used as the true results
Predict on a test set and compute AUC | R
type = "prob"区别于type = "class",已经非常熟悉了。
注意predict给的是两列,yes和no概率相加为1。
这里预测的时候,我们定义yes为positive,而非原来假设的no。
这里需要在auc中进行说明的。
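Making the positive class explicit before calling auc() avoids ambiguity; a sketch on made-up probabilities, assuming the Metrics package is installed:

```r
library(Metrics)
# auc() expects `actual` coded 0/1 with 1 = the positive class; if "yes" is
# positive, recode the factor explicitly before calling auc()
actual_factor <- factor(c("no", "yes", "yes", "no", "yes"))
p_yes <- c(0.2, 0.8, 0.6, 0.3, 0.9)   # hypothetical predicted P(yes)
actual01 <- ifelse(actual_factor == "yes", 1, 0)
auc(actual = actual01, predicted = p_yes)  # 1 in this toy case (perfect ranking)
```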
Using caret for cross-validating models | R
# Specify the training configuration
ctrl <- trainControl(method = "cv",       # Cross-validation
                     number = 5,          # 5 folds
                     classProbs = TRUE,   # For AUC
                     summaryFunction = twoClassSummary)  # For AUC
classProbs = TRUE and summaryFunction = twoClassSummary are worth looking up in the help docs:
- classProbs: a logical; should class probabilities be computed for classification models (along with predicted values) in each resample?
- summaryFunction: a function to compute performance metrics across resamples. The arguments to the function should be the same as those in defaultSummary.
These are details; no need to dwell on them.
Previously, rpart was paired with method = "class" or method = "anova".
Now train is paired with method = "treebag".
The naming conventions are a bit fuzzy.
set.seed(1)  # for reproducibility
credit_model <- train(default ~ .,
                      data = credit_train,
                      method = "treebag",
                      metric = "ROC",
                      trControl = ctrl)
Bagging and CV really do feel similar.
Use caret::train() with the "treebag" method to train a model and evaluate it using cross-validated AUC.
Compare test set performance to CV performance | R
- The credit_ipred_model_test_auc object stores the test set AUC from the model trained using the ipred::bagging() function.
- The credit_caret_model_test_auc object stores the test set AUC from the model trained using the caret::train() function with method = "treebag".
A detail worth noting; this is where the fun of learning shows.
Introduction to Random Forest | R
On to random forests; much of this, like bagging, is familiar ground.
Sampling a subset of the features is the key to random forests. Why do this? It is feature bagging, also called random sub-feature selection.
Each individual tree sacrifices some performance (fewer variables to choose from), but the correlation between the bagged trees is reduced. That is the whole rationale.
library(randomForest)
# Train a default RF model (500 trees)
model <- randomForest(formula = response ~ ., data = train)
The package builds 500 decision trees by default.
Train a Random Forest model | R
Basically a review!
type:
one of regression, classification, or unsupervised.
Random forests can even be unsupervised?
randomForest cannot automatically convert character columns to factors; use str(train) to see which columns are character, then convert them with as.factor.
So data %>% mutate_if(is.character, as.factor) handles the character-to-factor conversion efficiently.
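The same conversion can be done without dplyr; a base-R sketch on a toy data frame:

```r
# Base-R equivalent of dplyr::mutate_if(is.character, as.factor)
df <- data.frame(a = c("x", "y"), b = 1:2, stringsAsFactors = FALSE)
# convert only the character columns to factors, leave the rest untouched
df[] <- lapply(df, function(col) if (is.character(col)) as.factor(col) else col)
is.factor(df$a)  # TRUE
is.factor(df$b)  # FALSE
```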
Understanding Random Forest model output | R
# Print the credit_model output
> print(credit_model)
Call:
randomForest(formula = default ~ ., data = credit_train)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 24.12%
Confusion matrix:
no yes class.error
no 516 46 0.08185053
yes 147 91 0.61764706
No. of variables tried at each split: 4 corresponds to
mtry:
number of predictors sampled for splitting at each node.
It is typically \(\sqrt{n}\), where \(n\) is the number of features.
In OOB estimate of error rate: 24.12%,
OOB \(\to\) out-of-bag; these samples can be used for evaluation because they were not used in training.
# Grab OOB error matrix & take a look
> err <- credit_model$err.rate
> head(err)
OOB no yes
[1,] 0.3414634 0.2657005 0.5375000
[2,] 0.3311966 0.2462908 0.5496183
[3,] 0.3232831 0.2476636 0.5147929
[4,] 0.3164933 0.2180294 0.5561224
[5,] 0.3197756 0.2095808 0.5801887
[6,] 0.3176944 0.2115385 0.5619469
This can be used to decide how many trees to grow.
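Since err.rate has one row per tree, the "OOB" column shows how the error stabilises as trees are added; a sketch assuming randomForest is installed, with iris standing in for the course data:

```r
library(randomForest)
set.seed(1)
model <- randomForest(Species ~ ., data = iris, ntree = 300)
# one OOB error value per tree grown so far
oob_err <- model$err.rate[, "OOB"]
length(oob_err)      # 300
which.min(oob_err)   # smallest forest size reaching the minimum OOB error
```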

Figure 5: Random forest error as the number of trees grows
Evaluate out-of-bag error | R
\(\Box\) Honestly, I do not fully understand OOB yet.
OOB error vs. test set error | R
Advantages & Disadvantages of OOB estimates
- Can evaluate your model without a separate test set (no held-out test samples needed)
- Computed automatically by the
randomForest()function
But …
- OOB Error only estimates error (not AUC, log-loss, etc.)
- Can't compare Random Forest performance to other types of models
In my view, none of these points is especially critical.
r - How does predict.randomForest estimate class probabilities? - Cross Validated
The explanation there is very clear.
response, prob. or votes
Tuning a Random Forest model | R
- ntree: number of trees
- mtry: number of variables randomly sampled as candidates at each split
- sampsize: number of samples to train on
- nodesize: minimum size (number of samples) of the terminal nodes
- maxnodes: maximum number of terminal nodes
The focus here is mtry.
# Execute the tuning process
set.seed(1)
res <- tuneRF(x = train_predictor_df,
y = train_response_vector,
ntreeTry = 500)
# Look at results
print(res)
mtry OOBError
2.OOB 2 0.2475
4.OOB 4 0.2475
8.OOB 8 0.2425
tuneRF() is a very handy function.
Tuning a Random Forest via tree depth | R
ncol(credit_train) counts how many variables the model has.
nodesize <- seq(3, 8, 2) sets the minimum number of samples allowed in a terminal node.
sampsize <- nrow(credit_train) * c(0.7, 0.8) sets how many samples are used for training.
model$err.rate[nrow(model$err.rate), "OOB"] extracts the final OOB error.
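Those pieces combine into a manual grid search; a sketch assuming randomForest is installed, with iris and small made-up grids standing in for the course setup:

```r
library(randomForest)
set.seed(1)
hyper_grid <- expand.grid(mtry     = c(2, 3),
                          nodesize = seq(3, 8, 2),
                          sampsize = round(nrow(iris) * c(0.7, 0.8)))
oob_err <- c()
for (i in 1:nrow(hyper_grid)) {
  model <- randomForest(Species ~ ., data = iris, ntree = 100,
                        mtry     = hyper_grid$mtry[i],
                        nodesize = hyper_grid$nodesize[i],
                        sampsize = hyper_grid$sampsize[i])
  # the last row of err.rate is the OOB error after all trees
  oob_err[i] <- model$err.rate[nrow(model$err.rate), "OOB"]
}
hyper_grid[which.min(oob_err), ]  # the best-scoring combination
```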
Introduction to boosting | R
This is the key difference: boosted trees improve the model fit by considering past fits, and bagged trees do not.
Train a GBM model | R
- Adaboost
- Gradient Boosting Machine (“GBM”)
Adaboost
- Train a decision tree with equal weights on all observations
- Increase/Lower the weights of the observations
- Second tree is grown on weighted data
- Repeat this process for a specified number of iterations
Gradient Boosting = Gradient Descent + Boosting
- Fit an additive model (ensemble) in a forward, stage-wise manner.
- In each stage, introduce a “weak learner” (e.g. decision tree) to compensate the shortcomings of existing weak learners.
- In Adaboost, “shortcomings” are identified by high-weight data points.
- In Gradient Boosting, the “shortcomings” are identified by gradients.
Why is GBM good/bad?
- Often performs better than any other algorithm
- Directly optimizes the cost function
- Overfits (need to find a proper stopping point)
- Sensitive to extreme values and noise
# Train a 5000-tree GBM model
model <- gbm(formula = response ~ .,
             distribution = "bernoulli",
             data = train,
             n.trees = 5000)
distribution = "bernoulli"针对于\(y\)。
n.trees = 5000就是iteration开关。
Train a GBM model | R
For binary classification, gbm() requires the response to be encoded as 0/1 (numeric), so we will have to convert from a “no/yes” factor to a 0/1 numeric response column.
The ifelse() function handles this.
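The no/yes-to-0/1 recoding is a one-liner; a minimal sketch on toy values:

```r
# Convert a "no"/"yes" factor to the 0/1 numeric coding gbm() expects
default <- factor(c("no", "yes", "yes", "no"))
default01 <- ifelse(default == "yes", 1, 0)
default01  # 0 1 1 0
```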
> print(credit_model)
gbm(formula = default ~ ., distribution = "bernoulli", data = credit_train,
n.trees = 10000)
A gradient boosted model with bernoulli loss function.
10000 iterations were performed.
There were 16 predictors of which 16 had non-zero influence.
>
> # summary() prints variable importance
> summary(credit_model)
var rel.inf
checking_balance checking_balance 33.49502510
amount amount 11.62938098
months_loan_duration months_loan_duration 11.17113439
credit_history credit_history 11.15698321
savings_balance savings_balance 6.44293358
employment_duration employment_duration 6.06266137
age age 5.73175696
percent_of_income percent_of_income 3.74219743
other_credit other_credit 3.56695375
purpose purpose 3.38820798
housing housing 1.55169398
years_at_residence years_at_residence 1.35255308
job job 0.47631930
phone phone 0.09142691
existing_loans_count existing_loans_count 0.08924265
dependents dependents 0.05152933
This output tells you how many iterations were run, how many near-noise variables there are, and the relative influence of the useful variables. A good source of intuition.
Tianqi Chen's xgboost is also available for R.
Prediction using a GBM model | R
predict.gbm requires n.trees (here n.trees = 10000); it has no default.
type = "response" returns \(p\) for the Bernoulli distribution and \(E(n)\) for the Poisson distribution.
Otherwise the predictions are on the link (log-odds) scale.
> range(preds1)
[1] -3.210354 2.088293
> range(preds2)
[1] 0.03877796 0.88976007
Evaluate test set AUC | R
# Generate the test set AUCs using the two sets of predictions & compare
auc(actual = credit_test$default, predicted = preds1) #default
auc(actual = credit_test$default, predicted = preds2) #rescaled
> auc(actual = credit_test$default, predicted = preds1) #default
[1] 0.7875175
> auc(actual = credit_test$default, predicted = preds2) #rescaled
[1] 0.7875175
GBM hyperparameters | R
GBM Hyperparameters
- n.trees: number of trees
- bag.fraction: proportion of observations to be sampled in each tree
- n.minobsinnode: minimum number of observations in the trees' terminal nodes
- interaction.depth: maximum nodes per tree
- shrinkage: learning rate
The focus here is shrinkage.
Early stopping in GBMs | R
Early stopping means choosing the optimal number of iterations.
Use gbm.perf(), which supports two methods:
method = "OOB" and method = "cv".
# Optimal ntree estimate based on OOB
ntree_opt_oob <- gbm.perf(object = credit_model,
method = "OOB",
oobag.curve = TRUE)
# Train a CV GBM model
set.seed(1)
credit_model_cv <- gbm(formula = default ~ .,
distribution = "bernoulli",
data = credit_train,
n.trees = 10000,
cv.folds = 2)
# Optimal ntree estimate based on CV
ntree_opt_cv <- gbm.perf(object = credit_model_cv,
method = "cv")
# Compare the estimates
print(paste0("Optimal n.trees (OOB Estimate): ", ntree_opt_oob))
print(paste0("Optimal n.trees (CV Estimate): ", ntree_opt_cv))
OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive. Using cv.folds > 0 when calling gbm usually results in improved predictive performance.
Error in plot.window(...) : need finite 'ylim' values
It threw an error; no time to dig into it.

Figure 6: GBM OOB error curve

Figure 7: GBM cross-validation error curve
> print(paste0("Optimal n.trees (OOB Estimate): ", ntree_opt_oob))
[1] "Optimal n.trees (OOB Estimate): 3233"
> print(paste0("Optimal n.trees (CV Estimate): ", ntree_opt_cv))
[1] "Optimal n.trees (CV Estimate): 7889"
\(\Box\) This deserves a closer look.
OOB vs CV-based early stopping | R
n.trees = ntree_opt_oob: so the optimal iteration count estimated on the training data is used directly here?
# Generate predictions on the test set using ntree_opt_oob number of trees
preds1 <- predict(object = credit_model,
newdata = credit_test,
n.trees = ntree_opt_oob)
# Generate predictions on the test set using ntree_opt_cv number of trees
preds2 <- predict(object = credit_model,
newdata = credit_test,
n.trees = ntree_opt_cv)
# Generate the test set AUCs using the two sets of predictions & compare
auc1 <- auc(actual = credit_test$default, predicted = preds1) #OOB
auc2 <- auc(actual = credit_test$default, predicted = preds2) #CV
# Compare AUC
print(paste0("Test set AUC (OOB): ", auc1))
print(paste0("Test set AUC (CV): ", auc2))
Compare all models based on AUC | R
In this final exercise, we will perform a model comparison across all types of models that we’ve learned about so far: Decision Trees, Bagged Trees, Random Forest and Gradient Boosting Machine (GBM).
The comprehensive model comparison.
Loaded in your workspace are four numeric vectors:
dt_preds, bag_preds, rf_preds, gbm_preds
sprintf is really handy!
# Generate the test set AUCs using the four sets of predictions & compare
actual <- credit_test$default
dt_auc <- auc(actual = actual, predicted = dt_preds)
bag_auc <- auc(actual = actual, predicted = bag_preds)
rf_auc <- auc(actual = actual, predicted = rf_preds)
gbm_auc <- auc(actual = actual, predicted = gbm_preds)
# Print results
sprintf("Decision Tree Test AUC: %.3f", dt_auc)
sprintf("Bagged Trees Test AUC: %.3f", bag_auc)
sprintf("Random Forest Test AUC: %.3f", rf_auc)
sprintf("GBM Test AUC: %.3f", gbm_auc)
> sprintf("Decision Tree Test AUC: %.3f", dt_auc)
[1] "Decision Tree Test AUC: 0.627"
> sprintf("Bagged Trees Test AUC: %.3f", bag_auc)
[1] "Bagged Trees Test AUC: 0.781"
> sprintf("Random Forest Test AUC: %.3f", rf_auc)
[1] "Random Forest Test AUC: 0.804"
> sprintf("GBM Test AUC: %.3f", gbm_auc)
[1] "GBM Test AUC: 0.786"
Plot & compare ROC curves | R
The ROCR package makes it easy to plot ROC curves.
# List of predictions
preds_list <- list(dt_preds, bag_preds, rf_preds, gbm_preds)
# List of actual values (same for all)
m <- length(preds_list)
actuals_list <- rep(list(credit_test$default), m)
# Plot the ROC curves
pred <- prediction(preds_list, actuals_list)
rocs <- performance(pred, "tpr", "fpr")
plot(rocs, col = as.list(1:m), main = "Test Set ROC Curves")
legend(x = "bottomright",
legend = c("Decision Tree", "Bagged Trees", "Random Forest", "GBM"),
fill = 1:m)

Figure 8: ROC curves of the different models
rpart method argument by target type:
- continuous: method = "anova"
- categorical: method = "class"
- count: method = "poisson" (Poisson distribution)
- survival: method = "exp" (exponential distribution)