These are study notes for the DataCamp course "Machine Learning with Tree-Based Models in R". They record how tree-based models are implemented in R and the main technical points.
```{r setup, include=FALSE}
knitr::opts_chunk$set(eval = FALSE, echo = FALSE)
```
Machine Learning with Tree-Based Models in R
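Course overview
This course covers five core tree-based algorithms and comes with plenty of hands-on exercises. Everything is implemented in R, which gives learners who normally work in Python a quick way to try out and check the algorithms. The course is compact, taking about 4 hours, and suits anyone who wants to pick up the core concepts and implementation of tree models quickly.
Instructor
The instructor is Gabriela de Queiroz (DataCamp):
- She likes to mentor and share her knowledge through mentorship programs, tutorials and talks.
- She focuses on data science teaching and is good at explaining complex technical concepts in plain terms.
- She is a member of the R-Ladies NGO, and her approachable teaching style avoids overly technical language.
Course features
The content is clearly structured and balances theory with practice, so learners build a systematic picture of tree models while gaining hands-on coding skills.
Study advice
Work through these notes together with the course videos.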
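Core concepts and advantages
Advantages of tree-based models:
- Interpretable, easy to use, and accurate
- Can make decisions (classification) as well as numeric predictions (regression)
Course learning goals:
- Interpret and explain model decisions
- Explore different use cases
- Build and evaluate classification and regression models
- Tune model parameters for optimal performance
Topics covered:
- Classification and regression trees (CART)
- Bagged trees
- Random forests
- Gradient boosted trees (GBM)

Figure 1: Basic structure of a decision tree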
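Chapter 1: Classification Trees
1.1 Basic concepts
### Welcome to the course! | R
Regression problems predict a numeric target variable, whereas classification problems predict a categorical one.
1.2 Building a first classification tree
### Build a classification tree | R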
Key functions and objects:
- `rpart.plot()`: plots a decision tree
- `creditsub`: the data frame that holds the German credit data
- `str()`: shows the structure of an object, similar to `.info()` in pandas

```{r cache=TRUE}
library(tidyverse)
url <- "https://assets.datacamp.com/production/course_3022/datasets/credit.csv"
creditsub <- read_csv(url)
# Look at the data
str(creditsub)

# Create the model
library(rpart)
credit_model <- rpart(formula = default ~ .,
                      data = creditsub,
                      method = "class")

# Display the results
library(rpart.plot)
rpart.plot(x = credit_model, yesno = 2, type = 0, extra = 0)
```
### [Train/test split | R](https://campus.datacamp.com/courses/machine-learning-with-tree-based-models-in-r/classification-trees?ex=7)
`nrow()`: returns the number of rows, similar to `.shape[0]` or `len()` in Python.
Arguments of `sample(x, size, replace = FALSE, prob = NULL)`:
- `x`: the vector to sample from
- `size`: the number of items to draw
- `replace`: defaults to `FALSE`, i.e. sampling without replacement
```r
# Total number of rows in the credit data frame
credit <- creditsub
n <- nrow(credit)
# Number of rows for the training set (80% of the dataset)
n_train <- round(0.8 * n)
# Create a vector of indices which is an 80% random sample
set.seed(123)
train_indices <- sample(1:n, n_train)
# Subset the credit data frame to training indices only
credit_train <- credit[train_indices, ]
# Exclude the training indices to create the test set
credit_test <- credit[-train_indices, ]
```
### Train a classification tree model | R
```{r cache=TRUE}
# Train the model (to predict 'default')
credit_model <- rpart(formula = default ~ .,
                      data = credit_train,
                      method = "class")
# Look at the model output
print(credit_model)
```
### [Compute confusion matrix | R](https://campus.datacamp.com/courses/machine-learning-with-tree-based-models-in-r/classification-trees?ex=10)
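```r
# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,
                            newdata = credit_test,
                            type = "class")
# Calculate the confusion matrix for the test set
# (`reference` must be a factor; read_csv() loads `default` as character)
caret::confusionMatrix(data = class_prediction,
                       reference = as.factor(credit_test$default))
```
`confusionMatrix()` computes and prints a set of classification metrics in one call. By default these include accuracy, kappa, sensitivity and specificity; precision, recall and F1 are reported when `mode = "prec_recall"` or `mode = "everything"` is set.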
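To make the metrics mentioned above concrete, here is a minimal base R sketch that derives accuracy, precision, recall and F1 from a confusion table. The label vectors are hypothetical toy data (not the credit set), and `"yes"` is assumed to be the positive class.

```r
# Toy labels (hypothetical, for illustration only)
actual    <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "yes"),
                    levels = c("no", "yes"))
predicted <- factor(c("yes", "no", "yes", "no", "no", "yes", "no", "yes"),
                    levels = c("no", "yes"))

# Confusion table: rows are predictions, columns are the true classes
cm <- table(Predicted = predicted, Actual = actual)

tp <- cm["yes", "yes"]  # predicted yes, actually yes
fp <- cm["yes", "no"]   # predicted yes, actually no
fn <- cm["no", "yes"]   # predicted no, actually yes
tn <- cm["no", "no"]    # predicted no, actually no

accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

print(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1))
```

With three true positives, one false positive, one false negative and three true negatives, all four metrics come out at 0.75 here.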
### Compare models with a different splitting criterion | R
`ce()`: calculates the classification error.
```{r cache=TRUE}
library(rpart)
# Train a gini-based model
credit_model1 <- rpart(formula = default ~ .,
                       data = credit_train,
                       method = "class",
                       parms = list(split = "gini"))
# Train an information-based model
credit_model2 <- rpart(formula = default ~ .,
                       data = credit_train,
                       method = "class",
                       parms = list(split = "information"))
# Generate predictions on the test set using the gini model
pred1 <- predict(object = credit_model1,
                 newdata = credit_test,
                 type = "class")
# Generate predictions on the test set using the information model
pred2 <- predict(object = credit_model2,
                 newdata = credit_test,
                 type = "class")
# Compare classification error (the misclassification rate)
library(ModelMetrics)
sum(credit_test$default != pred1) / nrow(credit_test)
sum(credit_test$default != pred2) / nrow(credit_test)
```
### [Introduction to regression trees | R](https://campus.datacamp.com/courses/machine-learning-with-tree-based-models-in-r/regression-trees?ex=1)
**Measures of homogeneity:**
- **Categorical target**: Gini index, entropy, etc.
- **Continuous target**: standard deviation, absolute deviation, etc.
**The `method` argument of `rpart()`:**
- `"class"`: categorical target (classification)
- `"anova"`: continuous target (regression)
**Data splitting strategy:**
- **Training set**: used to fit the model
- **Validation set**: used to tune hyperparameters
- **Test set**: used only once, for the final estimate of model performance
### [Split the data | R](https://campus.datacamp.com/courses/machine-learning-with-tree-based-models-in-r/regression-trees?ex=3)
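```r
grade <- read_csv(
  "https://assets.datacamp.com/production/course_3022/datasets/grade.csv"
)
```
Argument note:
- `prob`: a vector of probability weights for the elements being sampled

Implementation note: assigning every row a random id (1, 2 or 3) is an intuitive way to split the data, and it plays the same role as `train_test_split` in Python.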
```{r cache=TRUE}
# Look at/explore the data
str(grade)
# Randomly assign rows to ids (1/2/3 represents train/valid/test)
# This will generate a vector of ids of length equal to the number of rows
# The train/valid/test split will be approximately 70% / 15% / 15%
set.seed(1)
assignment <- sample(1:3, size = nrow(grade), prob = c(0.7, 0.15, 0.15), replace = TRUE)
# Create a train, validation and test set from the original data frame
grade_train <- grade[assignment == 1, ]  # subset the grade data frame to training indices only
grade_valid <- grade[assignment == 2, ]  # subset the grade data frame to validation indices only
grade_test  <- grade[assignment == 3, ]  # subset the grade data frame to test indices only
```
### [Train a regression tree model | R](https://campus.datacamp.com/courses/machine-learning-with-tree-based-models-in-r/regression-trees?ex=4)
```r
# Train the model
grade_model <- rpart(formula = final_grade ~ .,
                     data = grade_train,
                     method = "anova")
# Look at the model output
print(grade_model)
# Plot the tree model
rpart.plot(x = grade_model, yesno = 2, type = 0, extra = 0)
```
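**Arguments of `rpart.plot()`:**
- `yesno = 2`: write yes/no labels at every split (the default, 1, labels only the top split)
- `type = 0`: draw a split label at each split and a node label at each leaf
- `extra = 0`: show no extra information in the nodes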
### Performance metrics for regression | R
Tools for evaluating regression models:
- `Metrics` package: provides a set of model evaluation metrics
- `rmse()`: computes the root mean square error (RMSE)
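As a small illustration of what RMSE measures, the sketch below computes it by hand in base R. The function name `rmse_manual` and the numbers are made up for this example; it is meant to mirror the definition, not to replace `Metrics::rmse()`.

```r
# RMSE: square the errors, average them, then take the square root
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Hypothetical grades and predictions
actual    <- c(10, 12, 14, 16)
predicted <- c(11, 12, 13, 18)

# Errors are -1, 0, 1, -2, so the mean squared error is 1.5
rmse_value <- rmse_manual(actual, predicted)
print(rmse_value)
```

Because the errors are squared before averaging, the single error of size 2 contributes more to the result than the two errors of size 1 combined.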
### Evaluate a regression tree model | R
```{r cache=TRUE}
library(Metrics)
# Generate predictions on a test set
pred <- predict(object = grade_model,   # model object
                newdata = grade_test)   # test dataset
# Compute the RMSE
rmse(actual = grade_test$final_grade,
     predicted = pred)
```
### [What are the hyperparameters for a decision tree | R](https://campus.datacamp.com/courses/machine-learning-with-tree-based-models-in-r/regression-trees?ex=7)

Figure 2: Main parameters of `rpart.control`
Studying these hyperparameters shows what controls the growth of a decision tree and prepares the ground for tuning later on.
**Key hyperparameters:**
- **`minsplit`**: minimum number of observations a node must contain before a split is attempted. The default is `minsplit = 20`; `minsplit = 2` is the smallest usable value and allows the smallest possible leaves.
- **`cp`**: complexity parameter [^cp], $cp\downarrow\to complexity \uparrow$
- **`maxdepth`**: maximum depth of the tree
[^cp]: The threshold that decides when the tree stops splitting. `cp = 0` places no complexity restriction on splitting (the tree grows until another stopping rule applies).
```
> print(model$cptable)
          CP nsplit rel error    xerror       xstd
1 0.06839852      0 1.0000000 1.0080595 0.09215642
2 0.06726713      1 0.9316015 1.0920667 0.09543723
3 0.03462630      2 0.8643344 0.9969520 0.08632297
4 0.02508343      3 0.8297080 0.9291298 0.08571411
5 0.01995676      4 0.8046246 0.9357838 0.08560120
6 0.01817661      5 0.7846679 0.9337462 0.08087153
7 0.01203879      6 0.7664912 0.9092646 0.07982862
8 0.01000000      7 0.7544525 0.9407895 0.08399125
```
```r
# Prune the model (to optimized cp value)
# Returns the optimal model
model_opt <- prune(tree = model, cp = cp_opt)
```
**Choosing the optimal parameters:**
Pick the row with the smallest cross-validated error (`xerror`). In the table above that is row 7:
`CP = 0.01203879, nsplit = 6, xerror = 0.9092646`
### 1.9 Pruning the model
```r
# Plot the "CP Table"
plotcp(grade_model)
# Print the "CP Table"
print(grade_model$cptable)
# Retrieve optimal cp value based on cross-validated error
opt_index <- which.min(grade_model$cptable[, "xerror"])
cp_opt <- grade_model$cptable[opt_index, "CP"]
# Prune the model (to optimized cp value)
model_opt <- prune(tree = grade_model, cp = cp_opt)
```
### [Tuning the model | R](https://campus.datacamp.com/courses/machine-learning-with-tree-based-models-in-r/regression-trees?ex=8)
`which.min(x)` returns the index of the (first) minimum of `x`, not the minimum value itself.
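The following sketch shows `which.min()` at work on the CP table printed earlier in these notes. The matrix is typed in by hand from that printout (only the `CP`, `nsplit` and `xerror` columns), so it stands in for a real `model$cptable` object.

```r
# CP table values copied from the printout in these notes
cptable <- cbind(
  CP     = c(0.06839852, 0.06726713, 0.03462630, 0.02508343,
             0.01995676, 0.01817661, 0.01203879, 0.01000000),
  nsplit = 0:7,
  xerror = c(1.0080595, 1.0920667, 0.9969520, 0.9291298,
             0.9357838, 0.9337462, 0.9092646, 0.9407895)
)

# Index of the row with the smallest cross-validated error
opt_index <- which.min(cptable[, "xerror"])
# The cp value stored in that row
cp_opt <- cptable[opt_index, "CP"]

print(opt_index)  # 7
print(cp_opt)     # 0.01203879
```

The index is then used to look up the matching `CP` value, which is what `prune()` expects.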
```r
# Plot the "CP Table"
plotcp(grade_model)
# Print the "CP Table"
print(grade_model$cptable)
# Retrieve optimal cp value based on cross-validated error
opt_index <- which.min(grade_model$cptable[, "xerror"])
cp_opt <- grade_model$cptable[opt_index, "CP"]
# Prune the model (to optimized cp value)
grade_model_opt <- prune(tree = grade_model,
                         cp = cp_opt)
# Plot the optimized model
rpart.plot(x = grade_model_opt, yesno = 2, type = 0, extra = 0)
```
### Grid search for model selection | R
```r
hyper_grid <- expand.grid(minsplit = minsplit,
                          maxdepth = maxdepth)
```
```
> hyper_grid[1:10,]
   minsplit maxdepth
1         1        5
2         6        5
3        11        5
4        16        5
5        21        5
6        26        5
7         1       15
8         6       15
9        11       15
10       16       15
```
`expand.grid()` is handy for iterating over model configurations.
```r
# create an empty list to store models
models <- list()
# execute the grid search
for (i in 1:nrow(hyper_grid)) {
  # get minsplit, maxdepth values at row i
  minsplit <- hyper_grid$minsplit[i]
  maxdepth <- hyper_grid$maxdepth[i]
  # train a model and store in the list
  models[[i]] <- rpart(formula = response ~ .,
                       data = train,
                       method = "anova",
                       minsplit = minsplit,
                       maxdepth = maxdepth)
}
```
A plain `for` loop feels more direct and readable here than `purrr`. Storing each fitted model in `models[[i]]` is the key idea.
```r
# create an empty vector to store RMSE values
rmse_values <- c()
# compute validation RMSE for each model
for (i in 1:length(models)) {
  # retrieve the i-th model from the list
  model <- models[[i]]
  # generate predictions on the validation set
  pred <- predict(object = model,
                  newdata = valid)
  # compute validation RMSE and add it to rmse_values
  rmse_values[i] <- rmse(actual = valid$response,
                         predicted = pred)
}
```
A fully automated workflow, which is nice.
### Generate a grid of hyperparameter values | R
```{r cache=TRUE}
# Establish a list of possible values for minsplit and maxdepth
minsplit <- seq(1, 4, 1)
maxdepth <- seq(1, 6, 1)
# Create a data frame containing all combinations
hyper_grid <- expand.grid(minsplit = minsplit, maxdepth = maxdepth)
# Check out the grid
head(hyper_grid)
# Print the number of grid combinations
nrow(hyper_grid)
```
```r
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
    length.out = NULL, along.with = NULL, ...)
```
`seq()` is used all the time, so it is worth memorizing; looking it up with `help` every time wastes time.
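A short base R sketch of the most common `seq()` call patterns, followed by the grid built from them. The `minsplit` and `maxdepth` ranges are the ones used in these notes; the other calls are illustrative.

```r
# Step size given explicitly
by_quarter <- seq(0, 1, by = 0.25)
# Number of points given instead of the step size
five_points <- seq(0, 1, length.out = 5)
# Safe way to count from 1 to n (returns an empty vector when n is 0)
first_three <- seq_len(3)

# The candidate values used for the grid in these notes
minsplit <- seq(1, 4, 1)   # 1, 2, 3, 4
maxdepth <- seq(1, 6, 1)   # 1, 2, 3, 4, 5, 6

# Every combination of the two vectors: 4 * 6 = 24 rows
hyper_grid <- expand.grid(minsplit = minsplit, maxdepth = maxdepth)
print(nrow(hyper_grid))
```

`expand.grid()` varies its first argument fastest, so the first four rows all have `maxdepth = 1`.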
### [Generate a grid of models | R](https://campus.datacamp.com/courses/machine-learning-with-tree-based-models-in-r/regression-trees?ex=11)
```r
# Number of potential models in the grid
num_models <- nrow(hyper_grid)
# Create an empty list to store models
grade_models <- list()
# Write a loop over the rows of hyper_grid to train the grid of models
for (i in 1:num_models) {
  # Get minsplit, maxdepth values at row i
  minsplit <- hyper_grid$minsplit[i]
  maxdepth <- hyper_grid$maxdepth[i]
  # Train a model and store in the list
  grade_models[[i]] <- rpart(formula = final_grade ~ .,
                             data = grade_train,
                             method = "anova",
                             minsplit = minsplit,
                             maxdepth = maxdepth)
}
```
`hyper_grid` is convenient because it gathers every combination into one table (or matrix). Each row still has to be unpacked by hand, though:
```r
minsplit <- hyper_grid$minsplit[i]
maxdepth <- hyper_grid$maxdepth[i]
```
That is tedious, so `purrr` still looks like the better option.
### Evaluate the grid | R
```{r cache=TRUE}
typeof(grade_models)
```
The list holds 24 models in total.
```{r cache=TRUE}
library(Metrics)
# Number of potential models in the grid
num_models <- length(grade_models)
# Create an empty vector to store RMSE values
rmse_values <- c()
# Write a loop over the models to compute validation RMSE
for (i in 1:num_models) {
  # Retrieve the i-th model from the list
  model <- grade_models[[i]]
  # Generate predictions on grade_valid
  pred <- predict(object = model,
                  newdata = grade_valid)
  # Compute validation RMSE and add it to rmse_values
  rmse_values[i] <- rmse(actual = grade_valid$final_grade,
                         predicted = pred)
}
# Identify the model with smallest validation set RMSE
best_model <- grade_models[[which.min(rmse_values)]]
# Print the model parameters of the best model
best_model$control
# Compute test set RMSE on best_model
pred <- predict(object = best_model,
                newdata = grade_test)
rmse(actual = grade_test$final_grade,
     predicted = pred)
```
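One way to avoid unpacking each grid row by hand, without reaching for `purrr`, is base R's `Map()`. The sketch below is an illustration under assumptions: it uses the built-in `mtcars` data instead of the grade data, a small made-up grid, and a hypothetical helper named `fit_one`.

```r
library(rpart)

# A small illustrative grid (values chosen for this example only)
hyper_grid <- expand.grid(minsplit = c(2, 5, 10), maxdepth = c(1, 3))

# Hypothetical helper: fit one regression tree for one grid row
fit_one <- function(minsplit, maxdepth) {
  rpart(formula = mpg ~ .,
        data = mtcars,
        method = "anova",
        control = rpart.control(minsplit = minsplit, maxdepth = maxdepth))
}

# Map() walks the two columns in parallel and returns a list of models
models <- Map(fit_one, hyper_grid$minsplit, hyper_grid$maxdepth)

print(length(models))
print(models[[4]]$control$maxdepth)
```

Each element of `models` lines up with the same row of `hyper_grid`, so `which.min()` on a vector of validation errors can still be used to pick the matching row.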
### [Introduction to bagged trees | R](https://campus.datacamp.com/courses/machine-learning-with-tree-based-models-in-r/bagged-trees?ex=1)

The idea behind bagging is the bootstrap.

Figure 4: Bagging as an ensemble learning method (schematic)
```r
library(ipred)
bagging(formula = response ~ ., data = dat)
```
The `ipred` package is the one to use here.
### [Train a bagged tree model | R](https://campus.datacamp.com/courses/machine-learning-with-tree-based-models-in-r/bagged-trees?ex=3)
`nbagg` sets the number of bootstrap replications, i.e. how many trees are bagged.
> If we want to estimate the model's accuracy using the "out-of-bag" (OOB) samples, we can set the `coob` parameter to `TRUE`
> The OOB samples are the training observations that were not selected into the bootstrapped sample (used in training). Since these observations were not used in training, we can use them instead to evaluate the accuracy of the model
This makes full use of the available data.
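To see where the out-of-bag observations come from, here is a minimal base R simulation of a single bootstrap sample. The sample size of 1000 is arbitrary, and the rows are represented only by their indices.

```r
set.seed(123)
n <- 1000  # hypothetical number of training rows

# One bootstrap sample: n draws with replacement from the row indices
in_bag <- sample(seq_len(n), size = n, replace = TRUE)

# Rows that were never drawn are "out-of-bag" for this tree
oob <- setdiff(seq_len(n), in_bag)

# Share of OOB rows; in theory close to (1 - 1/n)^n, about exp(-1)
oob_fraction <- length(oob) / n
print(oob_fraction)
print(exp(-1))
```

Roughly a third of the rows end up out-of-bag for each tree, which is why OOB evaluation works without setting aside a separate validation set.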
```r
library(ipred)
# Bagging is a randomized model, so let's set a seed (123) for reproducibility
set.seed(123)
# Train a bagged model
credit_model <- bagging(formula = as.factor(default) ~ .,
                        data = credit_train,
                        coob = TRUE)
# Print the model
print(credit_model)
```
`as.factor(default)`: the response has to be a factor here.
### Evaluating the bagged tree performance | R
Recall the tangent-line interpretation of the ROC curve: the maximum point is the optimal solution.
```
> library(Metrics)
> auc(actual, predicted)
[1] .76765
```
### Prediction and confusion matrix | R
A chance to practice writing a matrix in LaTeX.
$$\begin{matrix} & & Pred & \\ & & 1 & 0 \\ Actual & 1 & TP & FN \\ & 0 & FP & TN \\ \end{matrix}$$
`confusionMatrix()` comes from `caret`.
```{r cache=TRUE}
# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,
                            newdata = credit_test,
                            type = "class")  # return classification labels
# Print the predicted classes
print(class_prediction)
# Calculate the confusion matrix for the test set
# (`reference` must be a factor)
library(caret)
confusionMatrix(data = class_prediction,
                reference = as.factor(credit_test$default))
```
+ `data`: a factor of predicted classes (for the default method) or an object of class `table`.
+ `reference`: a factor of classes to be used as the true results.
### [Predict on a test set and compute AUC | R](https://campus.datacamp.com/courses/machine-learning-with-tree-based-models-in-r/bagged-trees?ex=6)
`type = "prob"` returns class probabilities, whereas `type = "class"` returns the predicted labels.
```r
# Generate predictions on the test set
pred <- predict(object = credit_model,
                newdata = credit_test,
                type = "prob")
# `pred` is a matrix
class(pred)
# Look at the pred format
head(pred)
# Compute the AUC (`actual` must be a binary (or 1/0 numeric) vector)
library(Metrics)
auc(actual = ifelse(credit_test$default == "yes", 1, 0),
    predicted = pred[, "yes"])
```
Note that `predict()` returns two columns, the probabilities of `yes` and `no`, which sum to 1 in each row. Here `yes` is treated as the positive class rather than `no`, and the `auc()` call has to reflect that choice: `actual` is coded as 1 for `yes`, and `predicted` is the `yes` column.
### Using caret for cross-validating models | R
```r
# Specify the training configuration
ctrl <- trainControl(method = "cv",                     # Cross-validation
                     number = 5,                        # 5 folds
                     classProbs = TRUE,                 # For AUC
                     summaryFunction = twoClassSummary) # For AUC
```
The help page explains `classProbs = TRUE` and `summaryFunction = twoClassSummary`:
- `classProbs`: a logical; should class probabilities be computed for classification models (along with predicted values) in each resample?
- `summaryFunction`: a function to compute performance metrics across resamples. The arguments to the function should be the same as those in `defaultSummary`.

These are details and not worth dwelling on.
Until now `rpart()` was paired with `method = "class"` or `method = "anova"`[^1]. Now `train()` is paired with `method = "treebag"`. The naming is rather loose.
```r
set.seed(1)  # for reproducibility
credit_model <- train(default ~ .,
                      data = credit_train,
                      method = "treebag",
                      metric = "ROC",
                      trControl = ctrl)
```
Bagging and cross-validation do feel quite similar.
Use `caret::train()` with the `"treebag"` method to train a model and evaluate the model using cross-validated AUC.
### Cross-validate a bagged tree model in caret | R
```{r cache=TRUE}
# Specify the training configuration
ctrl <- trainControl(method = "cv",                     # Cross-validation
                     number = 5,                        # 5 folds
                     classProbs = TRUE,                 # For AUC
                     summaryFunction = twoClassSummary) # For AUC
# Cross validate the credit model using "treebag" method;
# Track AUC (Area under the ROC curve)
set.seed(1)  # for reproducibility
credit_caret_model <- train(default ~ .,
                            data = credit_train,
                            method = "treebag",
                            metric = "ROC",
                            trControl = ctrl)
# Look at the model object
print(credit_caret_model)
# Inspect the contents of the model list
names(credit_caret_model)
# Print the CV AUC
credit_caret_model$results[, "ROC"]
```
### [Generate predictions from the caret model | R](https://campus.datacamp.com/courses/machine-learning-with-tree-based-models-in-r/bagged-trees?ex=9)
```r
# Generate predictions on the test set
pred <- predict(object = credit_caret_model,
                newdata = credit_test,
                type = "prob")
# Compute the AUC (`actual` must be a binary (or 1/0 numeric) vector)
auc(actual = ifelse(credit_test$default == "yes", 1, 0),
    predicted = pred[, "yes"])
```
### Compare test set performance to CV performance | R
- The `credit_ipred_model_test_auc` object stores the test set AUC from the model trained using the `ipred::bagging()` function.
- The `credit_caret_model_test_auc` object stores the test set AUC from the model trained using the `caret::train()` function with `method = "treebag"`.

This is a point worth paying attention to, and a place where learning gets enjoyable.
### Introduction to Random Forest | R
Random forests start here. Much of this, like bagging, is familiar material.
Sampling a subset of the features is what sets a random forest apart (feature bagging, or random sub-feature selection). Why do it?
Each tree gets to use fewer variables, which costs some individual performance, but it lowers the correlation between the bagged trees. That is the reason.
```r
library(randomForest)
# Train a default RF model (500 trees)
model <- randomForest(formula = response ~ ., data = train)
```
By default the package grows 500 trees.
### Train a Random Forest model | R
This is basically a review.
```{r cache=TRUE}
library(randomForest)
library(tidyverse)
# Train a Random Forest
set.seed(1)  # for reproducibility
credit_model <- randomForest(formula = as.factor(default) ~ .,
                             data = credit_train %>% mutate_if(is.character, as.factor),
                             type = "classification")
# Print the model output
print(credit_model)
# head(credit_train)
```
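`type`: one of `regression`, `classification`, or `unsupervised`.
Can a random forest really be unsupervised? Yes: when the response `y` is omitted, `randomForest()` runs in unsupervised mode.
[`randomForest` requires $y$ to be a factor for classification.](http://www.guanggua.com/question/39320408-error-in-y-ymean-non-numeric-argument-to-binary-operator-randomforest-r.html)
[randomForest cannot convert character columns to factors automatically. Use str(train) to see which columns are character, then convert them with as.factor.](http://f.dataguru.cn/thread-875121-1-1.html)
So `data %>% mutate_if(is.character, as.factor)` is a compact way to convert every character column to a factor.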
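For reference, the same conversion can be written in base R without `dplyr`. This is a minimal sketch on a made-up data frame; the column names only imitate the credit data.

```r
# Hypothetical data frame with one character and one numeric column
df <- data.frame(purpose = c("car", "education", "car"),
                 amount  = c(1000, 2500, 400),
                 stringsAsFactors = FALSE)

# Flag the character columns, then convert only those to factors
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], as.factor)

str(df)
```

Numeric columns are left untouched, which matches what `mutate_if(is.character, as.factor)` does.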
### [Understanding Random Forest model output | R](https://campus.datacamp.com/courses/machine-learning-with-tree-based-models-in-r/random-forests?ex=4)
```
# Print the credit_model output
> print(credit_model)

Call:
 randomForest(formula = default ~ ., data = credit_train)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of  error rate: 24.12%
Confusion matrix:
     no yes class.error
no  516  46  0.08185053
yes 147  91  0.61764706
```
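`No. of variables tried at each split: 4` is the value of `mtry`: the number of predictors sampled for splitting at each node.
It is typically $\sqrt{n}$, where $n$ is the number of features (for classification, `randomForest()` defaults to $\lfloor\sqrt{n}\rfloor$).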
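A quick arithmetic check of that rule, assuming 16 predictors (the number the GBM output later in these notes reports for the credit data). The regression default is included for comparison; both formulas follow the `randomForest()` documentation.

```r
p <- 16  # number of predictors (assumed from the credit data)

# Default mtry for classification: floor of the square root of p
mtry_classification <- floor(sqrt(p))
# Default mtry for regression: a third of the predictors, at least 1
mtry_regression <- max(floor(p / 3), 1)

print(mtry_classification)  # 4
print(mtry_regression)      # 5
```

The classification value of 4 agrees with the "No. of variables tried at each split" line in the printed model.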
In `OOB estimate of error rate: 24.12%`, `OOB` stands for out-of-bag. OOB observations can be used for prediction because the tree in question was not trained on them.
```r
# Grab OOB error matrix & take a look
err <- credit_model$err.rate
head(err)
```
```
           OOB        no       yes
[1,] 0.3414634 0.2657005 0.5375000
[2,] 0.3311966 0.2462908 0.5496183
[3,] 0.3232831 0.2476636 0.5147929
[4,] 0.3164933 0.2180294 0.5561224
[5,] 0.3197756 0.2095808 0.5801887
[6,] 0.3176944 0.2115385 0.5619469
```
This matrix can be used to decide how many trees are needed.

Figure 5: Random forest error as the number of trees grows
### [Evaluate out-of-bag error | R](https://campus.datacamp.com/courses/machine-learning-with-tree-based-models-in-r/random-forests?ex=5)
```r
# Grab OOB error matrix & take a look
err <- credit_model$err.rate
head(err)
# Look at final OOB error rate (last row in err matrix)
oob_err <- err[nrow(err), "OOB"]
print(oob_err)
# Plot the model trained in the previous exercise
plot(credit_model)
# Add a legend since it doesn't have one by default
legend(x = "right",
       legend = colnames(err),
       fill = 1:ncol(err))
```
$\Box$ To be honest, the OOB part is still not entirely clear to me.
### Evaluate model performance on a test set | R
`oob_err` is $1-Acc$, measured on the OOB samples.
```{r cache=TRUE}
# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,   # model object
                            newdata = credit_test %>% mutate_if(is.character, as.factor),  # test dataset
                            type = "class")          # return classification labels
# Calculate the confusion matrix for the test set
cm <- confusionMatrix(data = class_prediction,                      # predicted classes
                      reference = as.factor(credit_test$default))   # actual classes (as a factor)
print(cm)
# Compare test set accuracy to OOB accuracy
paste0("Test Accuracy: ", cm$overall[1])
paste0("OOB Accuracy: ", 1 - oob_err)
```
`Test Accuracy` and `OOB Accuracy` are not the same.
### [OOB error vs. test set error | R](https://campus.datacamp.com/courses/machine-learning-with-tree-based-models-in-r/random-forests?ex=7)
**Advantages & Disadvantages of OOB estimates**
+ Can evaluate your model without a separate test set (no need to hold out test samples)
+ Computed automatically by the `randomForest()` function

But ...
+ OOB Error only estimates error (not AUC, log-loss, etc.)
+ Can't compare Random Forest performance to other types of models

In my view, none of these matters much.
```r
# Generate predictions on the test set
pred <- predict(object = credit_model,
                newdata = credit_test %>%
                  mutate_if(is.character, as.factor),
                type = "prob")
# `pred` is a matrix
class(pred)
# Look at the pred format
head(pred)
# Compute the AUC (`actual` must be a binary 1/0 numeric vector)
auc(actual = ifelse(credit_test$default == "yes", 1, 0),
    predicted = pred[, "yes"])
```
The Cross Validated question "r - How does predict.randomForest estimate class probabilities?" explains this clearly. The prediction `type` can be `response`, `prob` or `votes`.
### Tuning a Random Forest model | R
- `ntree`: number of trees
- `mtry`: number of variables randomly sampled as candidates at each split
- `sampsize`: number of samples to train on
- `nodesize`: minimum size (number of samples) of the terminal nodes
- `maxnodes`: maximum number of terminal nodes

The focus here is on `mtry`.
```r
# Execute the tuning process
set.seed(1)
res <- tuneRF(x = train_predictor_df,
              y = train_response_vector,
              ntreeTry = 500)
# Look at results
print(res)
```
```
      mtry OOBError
2.OOB    2   0.2475
4.OOB    4   0.2475
8.OOB    8   0.2425
```
`tuneRF()` is a very handy function.
### Tuning a Random Forest via mtry | R
```{r cache=TRUE}
# Execute the tuning process
set.seed(1)
res <- tuneRF(x = subset(credit_train, select = -default) %>% mutate_if(is.character, as.factor),
              y = credit_train$default %>% as.factor(),
              ntreeTry = 500)
# Look at results
print(res)
# Find the mtry value that minimizes OOB Error
mtry_opt <- res[, "mtry"][which.min(res[, "OOBError"])]
print(mtry_opt)
# If you just want to return the best RF model (rather than results)
# you can set `doBest = TRUE` in `tuneRF()` to return the best RF model
# instead of a set performance matrix.
```
### [Tuning a Random Forest via tree depth | R](https://campus.datacamp.com/courses/machine-learning-with-tree-based-models-in-r/random-forests?ex=12)
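`ncol(credit_train)` gives the number of columns in the data, i.e. how many variables are available to the model.
`nodesize <- seq(3, 8, 2)` lists the candidate values for the minimum number of samples in a terminal node.
`sampsize <- nrow(credit_train) * c(0.7, 0.8)` lists the candidate numbers of samples used to train each tree.
`model$err.rate[nrow(model$err.rate), "OOB"]` extracts the final OOB error.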
```r
# Establish a list of possible values for mtry, nodesize and sampsize
mtry <- seq(4, ncol(credit_train) * 0.8, 2)
nodesize <- seq(3, 8, 2)
sampsize <- nrow(credit_train) * c(0.7, 0.8)
# Create a data frame containing all combinations
hyper_grid <- expand.grid(mtry = mtry, nodesize = nodesize, sampsize = sampsize)
# Create an empty vector to store OOB error values
oob_err <- c()
# Write a loop over the rows of hyper_grid to train the grid of models
for (i in 1:nrow(hyper_grid)) {
  # Train a Random Forest model
  model <- randomForest(formula = default ~ .,
                        data = credit_train %>% mutate_if(is.character, as.factor),
                        mtry = hyper_grid$mtry[i],
                        nodesize = hyper_grid$nodesize[i],
                        sampsize = hyper_grid$sampsize[i])
  # Store OOB error for the model
  oob_err[i] <- model$err.rate[nrow(model$err.rate), "OOB"]
}
# Identify optimal set of hyperparameters based on OOB error
opt_i <- which.min(oob_err)
print(hyper_grid[opt_i, ])
```
### Introduction to boosting | R
This is the key difference: boosted trees improve the model fit by considering past fits and bagged trees do not.
### Train a GBM model | R
- Adaboost
- Gradient Boosting Machine ("GBM")

Adaboost
- Train a decision tree where each observation has equal weight
- Increase the weights of observations that are hard to classify and lower the weights of those that are easy to classify
- Second tree is grown on weighted data
- Repeat this process for a specified number of iterations

Gradient Boosting = Gradient Descent + Boosting
- Fit an additive model (ensemble) in a forward, stage-wise manner.
- In each stage, introduce a "weak learner" (e.g. decision tree) to compensate the shortcomings of existing weak learners.
- In Adaboost, "shortcomings" are identified by high-weight data points.
- In Gradient Boosting, the "shortcomings" are identified by gradients.

Pros and cons of GBM

Advantages:
- Often performs better than any other algorithm
- Directly optimizes the cost function

Disadvantages:
- Overfits (need to find a proper stopping point)
- Sensitive to extreme values and noise

```r
# Train a 5000-tree GBM model
model <- gbm(formula = response ~ .,
             distribution = "bernoulli",
             data = train,
             n.trees = 5000)
```
`distribution = "bernoulli"` describes the response $y$ (binary). `n.trees = 5000` sets the number of boosting iterations.
### Train a GBM model | R
For binary classification, `gbm()` requires the response to be encoded as 0/1 (numeric), so we will have to convert from a "no/yes" factor to a 0/1 numeric response column. `ifelse()` can do this.
```{r cache=TRUE}
# Convert "yes" to 1, "no" to 0
credit_train$default <- ifelse(credit_train$default == "yes", 1, 0)
# Train a 10000-tree GBM model
set.seed(1)
library(gbm)
credit_model <- gbm(formula = default ~ .,
                    distribution = "bernoulli",
                    data = credit_train %>% mutate_if(is.character, as.factor),
                    n.trees = 10000)
# Print the model object
print(credit_model)
# summary() prints variable importance
summary(credit_model)
# str(credit_model)
```
```
> print(credit_model)
gbm(formula = default ~ ., distribution = "bernoulli", data = credit_train,
    n.trees = 10000)
A gradient boosted model with bernoulli loss function.
10000 iterations were performed.
There were 16 predictors of which 16 had non-zero influence.
> # summary() prints variable importance
> summary(credit_model)
                                      var     rel.inf
checking_balance         checking_balance 33.49502510
amount                             amount 11.62938098
months_loan_duration months_loan_duration 11.17113439
credit_history             credit_history 11.15698321
savings_balance           savings_balance  6.44293358
employment_duration   employment_duration  6.06266137
age                                   age  5.73175696
percent_of_income       percent_of_income  3.74219743
other_credit                 other_credit  3.56695375
purpose                           purpose  3.38820798
housing                           housing  1.55169398
years_at_residence     years_at_residence  1.35255308
job                                   job  0.47631930
phone                               phone  0.09142691
existing_loans_count existing_loans_count  0.08924265
dependents                     dependents  0.05152933
```
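The output reports:
- how many iterations were run,
- how many predictors had zero influence (noise variables),
- and the relative influence of each useful variable.

This gives some intuition about the model.
Tianqi Chen also provides xgboost code for R.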
### [Prediction using a GBM model | R](https://campus.datacamp.com/courses/machine-learning-with-tree-based-models-in-r/boosted-trees?ex=5)
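`predict.gbm()` should be given `n.trees` explicitly (here `n.trees = 10000`), since the argument has no fixed default value.
`type = "response"` returns the probability $p$ for the Bernoulli distribution and the expected count $E(n)$ for the Poisson distribution.
Otherwise (the default, `type = "link"`) predictions are returned on the link scale, which for Bernoulli is the log-odds rather than a value between $0$ and $1$.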
```r
# Since we converted the training response col, let's also convert the test response col
credit_test$default <- ifelse(credit_test$default == "yes", 1, 0)
# Generate predictions on the test set
preds1 <- predict(object = credit_model,
                  newdata = credit_test,
                  n.trees = 10000)
# Generate predictions on the test set (scale to response)
preds2 <- predict(object = credit_model,
                  newdata = credit_test,
                  n.trees = 10000,
                  type = "response")
# Compare the range of the two sets of predictions
range(preds1)
range(preds2)
```
```
> range(preds1)
[1] -3.210354 2.088293
> range(preds2)
[1] 0.03877796 0.88976007
```
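The two ranges above are linked by the inverse logit. A minimal base R check, assuming (as is the case for `distribution = "bernoulli"`) that the default predictions are log-odds; the two numbers are copied from the `range(preds1)` output in these notes.

```r
# Range of the link-scale predictions, copied from the notes
link_range <- c(-3.210354, 2.088293)

# Inverse logit: 1 / (1 + exp(-x)); plogis() is the base R implementation
response_range <- plogis(link_range)

print(response_range)
```

The result reproduces `range(preds2)` up to rounding, which suggests that `type = "response"` only rescales the same predictions.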
### Evaluate test set AUC | R
```r
# Generate the test set AUCs using the two sets of predictions & compare
auc(actual = credit_test$default, predicted = preds1)  # default
auc(actual = credit_test$default, predicted = preds2)  # rescaled
```
```
> auc(actual = credit_test$default, predicted = preds1) #default
[1] 0.7875175
> auc(actual = credit_test$default, predicted = preds2) #rescaled
[1] 0.7875175
```
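Both AUC values are identical because AUC depends only on how the predictions are ranked, and the inverse logit preserves that ranking. The sketch below illustrates this with a rank-based AUC written in base R; `auc_rank` is a hypothetical helper and the scores are made-up numbers.

```r
# Rank-based AUC (Mann-Whitney form): the probability that a random
# positive case gets a higher score than a random negative case
auc_rank <- function(actual, score) {
  r <- rank(score)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical labels and log-odds predictions
actual   <- c(1, 0, 1, 0, 1, 0)
log_odds <- c(2.1, -0.5, 0.3, 0.8, 1.7, -2.2)

auc_link     <- auc_rank(actual, log_odds)          # link scale
auc_response <- auc_rank(actual, plogis(log_odds))  # probability scale

print(c(link = auc_link, response = auc_response))
```

Eight of the nine positive/negative pairs are ordered correctly in this toy data, so both calls give 8/9.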
### GBM hyperparameters | R
- `n.trees`: number of trees
- `bag.fraction`: proportion of observations to be sampled in each tree
- `n.minobsinnode`: minimum number of observations in the trees' terminal nodes
- `interaction.depth`: maximum depth of each tree
- `shrinkage`: learning rate

The one to focus on here is `shrinkage`.
### Early stopping in GBMs | R
Early stopping means choosing the optimal number of iterations. This is done with `gbm.perf()`, which offers two methods: `method = "OOB"` and `method = "cv"`.
```r
# Optimal ntree estimate based on OOB
ntree_opt_oob <- gbm.perf(object = credit_model,
                          method = "OOB",
                          oobag.curve = TRUE)
# Train a CV GBM model
set.seed(1)
credit_model_cv <- gbm(formula = default ~ .,
                       distribution = "bernoulli",
                       data = credit_train,
                       n.trees = 10000,
                       cv.folds = 2)
# Optimal ntree estimate based on CV
ntree_opt_cv <- gbm.perf(object = credit_model_cv,
                         method = "cv")
# Compare the estimates
print(paste0("Optimal n.trees (OOB Estimate): ", ntree_opt_oob))
print(paste0("Optimal n.trees (CV Estimate): ", ntree_opt_cv))
```
```
OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive. Using cv.folds>0 when calling gbm usually results in improved predictive performance.
Error in plot.window(...) : need finite 'ylim' values
```
This raised an error that I have not had time to look into.

Figure 6: OOB error of the GBM model by number of iterations

Figure 7: Cross-validation error of the GBM model by number of iterations
```
> print(paste0("Optimal n.trees (OOB Estimate): ", ntree_opt_oob))
[1] "Optimal n.trees (OOB Estimate): 3233"
> print(paste0("Optimal n.trees (CV Estimate): ", ntree_opt_cv))
[1] "Optimal n.trees (CV Estimate): 7889"
```
$\Box$ This deserves a closer look.
### OOB vs CV-based early stopping | R
With `n.trees = ntree_opt_oob`, is the optimal number of iterations found on the training data simply reused here?
```r
# Generate predictions on the test set using ntree_opt_oob number of trees
preds1 <- predict(object = credit_model,
                  newdata = credit_test,
                  n.trees = ntree_opt_oob)
# Generate predictions on the test set using ntree_opt_cv number of trees
preds2 <- predict(object = credit_model,
                  newdata = credit_test,
                  n.trees = ntree_opt_cv)
# Generate the test set AUCs using the two sets of predictions & compare
auc1 <- auc(actual = credit_test$default, predicted = preds1)  # OOB
auc2 <- auc(actual = credit_test$default, predicted = preds2)  # CV
# Compare AUC
print(paste0("Test set AUC (OOB): ", auc1))
print(paste0("Test set AUC (CV): ", auc2))
```
### Compare all models based on AUC | R
In this final exercise, we will perform a model comparison across all types of models that we've learned about so far: Decision Trees, Bagged Trees, Random Forest and Gradient Boosting Machine (GBM).
This is the overall model performance comparison.
Loaded in your workspace are four numeric vectors:
- `dt_preds`
- `bag_preds`
- `rf_preds`
- `gbm_preds`

`sprintf()` is very handy here.
```r
# Generate the test set AUCs using the four sets of predictions & compare
actual <- credit_test$default
dt_auc <- auc(actual = actual, predicted = dt_preds)
bag_auc <- auc(actual = actual, predicted = bag_preds)
rf_auc <- auc(actual = actual, predicted = rf_preds)
gbm_auc <- auc(actual = actual, predicted = gbm_preds)
# Print results
sprintf("Decision Tree Test AUC: %.3f", dt_auc)
sprintf("Bagged Trees Test AUC: %.3f", bag_auc)
sprintf("Random Forest Test AUC: %.3f", rf_auc)
sprintf("GBM Test AUC: %.3f", gbm_auc)
```
```
> sprintf("Decision Tree Test AUC: %.3f", dt_auc)
[1] "Decision Tree Test AUC: 0.627"
> sprintf("Bagged Trees Test AUC: %.3f", bag_auc)
[1] "Bagged Trees Test AUC: 0.781"
> sprintf("Random Forest Test AUC: %.3f", rf_auc)
[1] "Random Forest Test AUC: 0.804"
> sprintf("GBM Test AUC: %.3f", gbm_auc)
[1] "GBM Test AUC: 0.786"
```
### Plot & compare ROC curves | R
The `ROCR` package makes it easy to draw ROC curves.
```r
library(ROCR)
# List of predictions
preds_list <- list(dt_preds, bag_preds, rf_preds, gbm_preds)
# List of actual values (same for all)
m <- length(preds_list)
actuals_list <- rep(list(credit_test$default), m)
# Plot the ROC curves
pred <- prediction(preds_list, actuals_list)
rocs <- performance(pred, "tpr", "fpr")
plot(rocs, col = as.list(1:m), main = "Test Set ROC Curves")
legend(x = "bottomright",
       legend = c("Decision Tree", "Bagged Trees", "Random Forest", "GBM"),
       fill = 1:m)
```

Figure 8: ROC curves of the different models compared
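[^1]: The `method` argument of `rpart()` depends on the type of response:
    - continuous: `method = "anova"`
    - categorical: `method = "class"`
    - count: `method = "poisson"` (Poisson distribution)
    - survival: `method = "exp"` (exponential distribution)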