These are study notes; I don't plan to keep updating them.
knitr::opts_chunk$set(warning = FALSE, message = FALSE, eval=F)
Machine Learning with Tree-Based Models in R
All five kinds of trees come with exercises, wow! The code here is R-centric, so if you're not yet fluent in Python, debugging in R is much faster, which makes R a great way to learn trees. I reckon about four hours is enough to work through all of it, so there's no need for much hesitation.
She likes to mentor and share her knowledge through mentorship programs, tutorials and talks.
That left me with a very good impression of the instructor, Gabriela de Queiroz | DataCamp. She works in teaching, which makes the material quick to grasp, and she founded the NGO R-Ladies, so she lacks the puzzling overconfidence common in industry.
Let’s get it.
The course is well taught; I'll go back and debug everything, haha.
Welcome to the course! | R
In essence, a regression problem is just one whose response variable is numeric.
Tree-based models
- Interpretability + Ease-of-Use + Accuracy
- Make Decisions + Numeric Predictions
What you’ll learn:
- Interpret and explain decisions
- Explore different use cases
- Build and evaluate classification and regression models
- Tune model parameters for optimal performance
We will cover:
- Classification & Regression Trees
- Bagged Trees
- Random Forests
- Boosted Trees (GBM)
Build a classification tree | R
rpart.plot handles the visualization. creditsub is the imported German credit dataset. str() is similar to pandas' .info().
url="https://assets.datacamp.com/production/course_3022/datasets/credit.csv"
library(tidyverse)
creditsub <- read_csv(url)
# Look at the data
str(creditsub)
# Create the model
library(rpart)
credit_model <- rpart(formula = default ~ .,
data = creditsub,
method = "class")
# Display the results
library(rpart.plot)
rpart.plot(x = credit_model, yesno = 2, type = 0, extra = 0)
Train/test split | R
nrow is similar to Python's .shape and len().
sample(x, size, replace = FALSE, prob = NULL)
Note that replace defaults to FALSE; size is the number of samples to draw, and x is the vector being sampled.
# Total number of rows in the credit data frame
credit <- creditsub
n <- nrow(credit)
# Number of rows for the training set (80% of the dataset)
n_train <- round(0.8 * n)
# Create a vector of indices which is an 80% random sample
set.seed(123)
train_indices <- sample(1:n, n_train)
# Subset the credit data frame to training indices only
credit_train <- credit[train_indices, ]
# Exclude the training indices to create the test set
credit_test <- credit[-train_indices, ]
Train a classification tree model | R
# Train the model (to predict 'default')
credit_model <- rpart(formula = default ~ .,
data = credit_train,
method = "class")
# Look at the model output
print(credit_model)
Compute confusion matrix | R
# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,
newdata = credit_test,
type = "class")
# Calculate the confusion matrix for the test set
caret::confusionMatrix(data = class_prediction,
                       reference = as.factor(credit_test$default)) # reference must be a factor (read_csv leaves it character)
confusionMatrix prints all of the metrics at once.
Compare models with a different splitting criterion | R
ce(): calculates the classification error (see the sketch after the code below).
library(rpart)
# Train a gini-based model
credit_model1 <- rpart(formula = default ~ .,
data = credit_train,
method = "class",
parms = list(split = "gini"))
# Train an information-based model
credit_model2 <- rpart(formula = default ~ .,
data = credit_train,
method = "class",
parms = list(split = "information"))
# Generate predictions on the validation set using the gini model
pred1 <- predict(object = credit_model1,
newdata = credit_test,
type = "class")
# Generate predictions on the validation set using the information model
pred2 <- predict(object = credit_model2,
newdata = credit_test,
type = "class")
# Compare classification error
library(ModelMetrics)
sum(credit_test$default != pred1)/nrow(credit_test)
sum(credit_test$default != pred2)/nrow(credit_test)
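The note above mentions ce() but the block computes the error by hand. Assuming ce(actual, predicted) from the Metrics package (it returns mean(actual != predicted); ModelMetrics ships a ce() as well), the same comparison can be written as:
# Equivalent comparison using ce() from Metrics
library(Metrics)
ce(credit_test$default, as.character(pred1))
ce(credit_test$default, as.character(pred2))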
Introduction to regression trees | R
For measuring homogeneity,
- categorical responses use Gini, entropy, …
- continuous responses use standard deviation, absolute deviation, …
In rpart, method = "class" and method = "anova" determine how the response \(y\) is treated: the former treats it as categorical, the latter as continuous. A toy illustration follows.
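To make these homogeneity measures concrete, here is a minimal hand computation on toy vectors of my own (not course data):
# Gini impurity and entropy for a categorical node
y_class <- factor(c("yes", "yes", "no", "no", "no"))
p <- prop.table(table(y_class))
1 - sum(p^2)       # Gini impurity: 0 means a perfectly pure node
-sum(p * log2(p))  # entropy (the "information" criterion)
# Sum of squared errors for a numeric node: what "anova" splits minimize
y_num <- c(10, 12, 9, 30, 31)
sum((y_num - mean(y_num))^2)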
Train/Validation/Test Split
- training set
- validation set: for hyperparam in training.
- test set: only use once.
Split the data | R
grade <- read_csv(
"https://assets.datacamp.com/production/course_3022/datasets/grade.csv"
)
prob: a vector of probability weights for obtaining the elements of the vector being sampled.
So this is quite easy; I've finally figured out how R does it. It's as intuitive as Python's train_test_split.
# Look/explore the data
str(grade)
# Randomly assign rows to ids (1/2/3 represents train/valid/test)
# This will generate a vector of ids of length equal to the number of rows
# The train/valid/test split will be approximately 70% / 15% / 15%
set.seed(1)
assignment <- sample(1:3, size = nrow(grade), prob = c(0.7,0.15,0.15), replace = TRUE)
# Create a train, validation and tests from the original data frame
grade_train <- grade[assignment == 1, ] # subset the grade data frame to training indices only
grade_valid <- grade[assignment == 2, ] # subset the grade data frame to validation indices only
grade_test <- grade[assignment == 3, ] # subset the grade data frame to test indices only
Train a regression tree model | R
# Train the model
grade_model <- rpart(formula = final_grade ~ .,
data = grade_train,
method = "anova")
# Look at the model output
print(grade_model)
# Plot the tree model
rpart.plot(x = grade_model, yesno = 2, type = 0, extra = 0)
\(\Box\) What does yesno mean? (In rpart.plot, yesno = 2 prints yes/no labels at every split; 0 suppresses them, and the default 1 labels only the top split.)
Performance metrics for regression | R
The Metrics package can be used directly; it provides the rmse function.
Evaluate a regression tree model | R
# Generate predictions on a test set
pred <- predict(object = grade_model, # model object
                newdata = grade_test) # test dataset
# Compute the RMSE (rmse() comes from the Metrics package)
library(Metrics)
rmse(actual = grade_test$final_grade,
     predicted = pred)
What are the hyperparameters for a decision tree | R
I used to pay little attention to these parameters, but now they feel quite familiar; maybe that is knowledge quietly settling in.
- minsplit: minimum number of data points required to attempt a split; minsplit = 2 allows the smallest possible leaves
- cp: complexity parameter; \(cp \downarrow \to\) complexity \(\uparrow\)
- maxdepth: maximum depth of the decision tree
A sketch of how these are passed to rpart follows.
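As a sketch of how these knobs are passed in practice: rpart accepts them through rpart.control (the values below are rpart.control's documented defaults), and the course code later passes them directly as ... arguments instead.
library(rpart)
# minsplit = 20, cp = 0.01, maxdepth = 30 are rpart.control's defaults
model <- rpart(final_grade ~ ., data = grade_train, method = "anova",
               control = rpart.control(minsplit = 20, cp = 0.01, maxdepth = 30))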
> print(model$cptable)
CP nsplit rel error xerror xstd
1 0.06839852 0 1.0000000 1.0080595 0.09215642
2 0.06726713 1 0.9316015 1.0920667 0.09543723
3 0.03462630 2 0.8643344 0.9969520 0.08632297
4 0.02508343 3 0.8297080 0.9291298 0.08571411
5 0.01995676 4 0.8046246 0.9357838 0.08560120
6 0.01817661 5 0.7846679 0.9337462 0.08087153
7 0.01203879 6 0.7664912 0.9092646 0.07982862
8 0.01000000 7 0.7544525 0.9407895 0.08399125
# Prune the model (to optimized cp value)
# Returns the optimal model
model_opt <- prune(tree = model, cp = cp_opt)
By the minimum cross-validated error (xerror), row 7 (CP = 0.01203879, xerror = 0.9092646) is the best; by the one-standard-error rule, row 4
4 0.02508343 3 0.8297080 0.9291298 0.08571411
is the simplest tree within one xstd of that minimum (0.9291 < 0.9093 + 0.0798), which is why it is taken as optimal.
Tuning the model | R
which.min(x) returns the index of the minimum.
# Plot the "CP Table"
plotcp(grade_model)
# Print the "CP Table"
print(grade_model$cptable)
# Retrieve optimal cp value based on cross-validated error
opt_index <- which.min(grade_model$cptable[, "xerror"])
cp_opt <- grade_model$cptable[opt_index, "CP"]
# Prune the model (to optimized cp value)
grade_model_opt <- prune(tree = grade_model,
cp = cp_opt)
# Plot the optimized model
rpart.plot(x = grade_model_opt, yesno = 2, type = 0, extra = 0)
Grid search for model selection | R
hyper_grid <- expand.grid(minsplit = minsplit,
maxdepth = maxdepth)
> hyper_grid[1:10,]
minsplit maxdepth
1 1 5
2 6 5
3 11 5
4 16 5
5 21 5
6 26 5
7 1 15
8 6 15
9 11 15
10 16 15
I think expand.grid is useful for iterating over model configurations.
# create an empty list to store models
models <- list()
# execute the grid search
for (i in 1:nrow(hyper_grid)) {
    # get minsplit, maxdepth values at row i
    minsplit <- hyper_grid$minsplit[i]
    maxdepth <- hyper_grid$maxdepth[i]
    # train a model and store in the list (pass both tuned values to rpart)
    models[[i]] <- rpart(formula = response ~ .,
                         data = train,
                         method = "anova",
                         minsplit = minsplit,
                         maxdepth = maxdepth)
}
I think something clicked: a plain for loop is actually much simpler than reaching for purrr. models[[i]] is the essential idiom.
# create an empty vector to store RMSE values
rmse_values <- c()
# compute validation RMSE for each model
for (i in 1:length(models)) {
    # retrieve the i^th model from the list
    model <- models[[i]]
    # generate predictions on the validation set
    pred <- predict(object = model,
                    newdata = valid)
    # compute validation RMSE and add it to the vector
    rmse_values[i] <- rmse(actual = valid$response,
                           predicted = pred)
}
A fully automated workflow. Nice!
Generate a grid of hyperparameter values | R
# Establish a list of possible values for minsplit and maxdepth
minsplit <- seq(1, 4, 1)
maxdepth <- seq(1, 6, 1)
# Create a data frame containing all combinations
hyper_grid <- expand.grid(minsplit = minsplit, maxdepth = maxdepth)
# Check out the grid
head(hyper_grid)
# Print the number of grid combinations
nrow(hyper_grid)
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
length.out = NULL, along.with = NULL, ...)
Memorize this function; it comes up all the time, and constantly reaching for help wastes time.
Generate a grid of models | R
# Number of potential models in the grid
num_models <- nrow(hyper_grid)
# Create an empty list to store models
grade_models <- list()
# Write a loop over the rows of hyper_grid to train the grid of models
for (i in 1:num_models) {
# Get minsplit, maxdepth values at row i
minsplit <- hyper_grid$minsplit[i]
maxdepth <- hyper_grid$maxdepth[i]
# Train a model and store in the list
grade_models[[i]] <- rpart(formula = final_grade ~ .,
data = grade_train,
method = "anova",
minsplit = minsplit,
maxdepth = maxdepth)
}
I roughly get it: hyper_grid really is convenient, combining everything into one table/matrix, but each row still has to be unpacked into
minsplit <- hyper_grid$minsplit[i]
maxdepth <- hyper_grid$maxdepth[i]
which is tedious, so purrr still wins; a sketch follows.
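Since the note above prefers purrr, here is a minimal purrr::pmap sketch of the same grid search (my own variant, not course code); pmap passes each row of hyper_grid to the function as named arguments, so no manual unpacking is needed.
library(purrr)
library(rpart)
# One rpart model per row of hyper_grid
grade_models <- pmap(hyper_grid, function(minsplit, maxdepth) {
    rpart(formula = final_grade ~ .,
          data = grade_train,
          method = "anova",
          minsplit = minsplit,
          maxdepth = maxdepth)
})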
Evaluate the grid | R
typeof(grade_models) shows it's a list; it contains 24 models in total (4 minsplit values × 6 maxdepth values).
# Number of potential models in the grid
num_models <- length(grade_models)
# Create an empty vector to store RMSE values
rmse_values <- c()
# Write a loop over the models to compute validation RMSE
for (i in 1:num_models) {
    # Retrieve the i^th model from the list
    model <- grade_models[[i]]
    # Generate predictions on grade_valid
    pred <- predict(object = model,
                    newdata = grade_valid)
    # Compute validation RMSE and add it to the vector
    rmse_values[i] <- rmse(actual = grade_valid$final_grade,
                           predicted = pred)
}
# Identify the model with smallest validation set RMSE
best_model <- grade_models[[which.min(rmse_values)]]
# Print the model parameters of the best model
best_model$control
# Compute test set RMSE on best_model
pred <- predict(object = best_model,
newdata = grade_test)
rmse(actual = grade_test$final_grade,
predicted = pred)
Introduction to bagged trees | R
Bootstrap resampling is the core idea of bagging.
> library(ipred)
> bagging(formula = response ~ ., data = dat)
Use the ipred package! Haha.
Train a bagged tree model | R
nbagg controls the number of bagged trees (ipred's default is 25).
If we want to estimate the model's accuracy using the "out-of-bag" (OOB) samples, we can set the coob parameter to TRUE.
The OOB samples are the training observations that were not selected into the bootstrapped sample (used in training). Since these observations were not used in training, we can use them instead to evaluate the accuracy of the model.
Nothing goes to waste. A quick numerical check of the OOB idea follows.
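The OOB intuition can be checked directly (my own toy computation): in a bootstrap sample of size n drawn with replacement, each observation is left out with probability \((1 - 1/n)^n \approx e^{-1} \approx 0.368\).
# Fraction of rows never drawn into one bootstrap sample
set.seed(42)
n <- 800
boot_idx <- sample(1:n, size = n, replace = TRUE)
mean(!(1:n %in% boot_idx))  # roughly 0.368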
library(ipred)
# Bagging is a randomized model, so let's set a seed (123) for reproducibility
set.seed(123)
# Train a bagged model
credit_model <- bagging(formula = as.factor(default) ~ .,
data = credit_train,
coob = TRUE)
# Print the model
print(credit_model)
as.factor(default): the response must be a factor here.
Evaluating the bagged tree performance | R
Recall the earlier note on the ROC tangent-line interpretation: the maximizing point is the optimal solution.
> library(Metrics)
> auc(actual, predicted)
[1] 0.76765
Prediction and confusion matrix | R
Treat this as practice writing matrices in LaTeX.
\[\begin{matrix} & & Pred &\\ & & 1 & 0 \\ Actual & 1 & TP & FN \\ & 0 & FP & TN \\ \end{matrix}\]
confusionMatrix() comes from caret.
# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,
newdata = credit_test,
type = "class") # return classification labels
# Print the predicted classes
print(class_prediction)
# Calculate the confusion matrix for the test set
library(caret)
confusionMatrix(data = class_prediction,
                reference = as.factor(credit_test$default)) # reference must be a factor
data: a factor of predicted classes (for the default method) or an object of class table.
reference: a factor of classes to be used as the true results.
Predict on a test set and compute AUC | R
type = "prob"
区别于type = "class"
,已经非常熟悉了。
# Generate predictions on the test set
pred <- predict(object = credit_model,
newdata = credit_test,
type = "prob")
# `pred` is a matrix
class(pred)
# Look at the pred format
head(pred)
# Compute the AUC (`actual` must be a binary (or 1/0 numeric) vector)
library(Metrics)
auc(actual = ifelse(credit_test$default == "yes", 1, 0),
predicted = pred[,"yes"])
Note that predict returns two columns, yes and no, whose probabilities sum to 1.
When predicting here we define yes as the positive class, rather than no as originally assumed, and that has to be stated explicitly in the auc call (hence pred[, "yes"]).
Using caret for cross-validating models | R
# Specify the training configuration
ctrl <- trainControl(method = "cv", # Cross-validation
number = 5, # 5 folds
classProbs = TRUE, # For AUC
summaryFunction = twoClassSummary) # For AUC
I didn't understand classProbs = TRUE and summaryFunction = twoClassSummary here, so to the help pages:
- classProbs: a logical; should class probabilities be computed for classification models (along with predicted values) in each resample?
- summaryFunction: a function to compute performance metrics across resamples. The arguments to the function should be the same as those in defaultSummary.
These are detail-level settings; don't dwell on them too much.
Previously it was always rpart paired with method = "class" or method = "anova"; now it's train paired with method = "treebag". These method names are all rather opaque.
set.seed(1) #for reproducibility
credit_model <- train(default ~ .,
data = credit_train,
method = "treebag",
metric = "ROC",
trControl = ctrl)
Bagging and CV really do feel alike; both repeatedly resample the training data.
Use caret::train() with the "treebag" method to train a model and evaluate the model using cross-validated AUC.
Cross-validate a bagged tree model in caret | R
# Specify the training configuration
ctrl <- trainControl(method = "cv", # Cross-validation
number = 5, # 5 folds
classProbs = TRUE, # For AUC
summaryFunction = twoClassSummary) # For AUC
# Cross validate the credit model using "treebag" method;
# Track AUC (Area under the ROC curve)
set.seed(1) # for reproducibility
credit_caret_model <- train(default ~ .,
data = credit_train,
method = "treebag",
metric = "ROC",
trControl = ctrl)
# Look at the model object
print(credit_caret_model)
# Inspect the contents of the model list
names(credit_caret_model)
# Print the CV AUC
credit_caret_model$results[,"ROC"]
Generate predictions from the caret model | R
# Generate predictions on the test set
pred <- predict(object = credit_caret_model,
newdata = credit_test,
type = "prob")
# Compute the AUC (`actual` must be a binary (or 1/0 numeric) vector)
auc(actual = ifelse(credit_test$default == "yes", 1, 0),
predicted = pred[,"yes"])
Compare test set performance to CV performance | R
- The credit_ipred_model_test_auc object stores the test set AUC from the model trained using the ipred::bagging() function.
- The credit_caret_model_test_auc object stores the test set AUC from the model trained using the caret::train() function with method = "treebag".
This comparison is worth noting; here is where the fun of learning kicks in.
Introduction to Random Forest | R
Random Forests next, though it's largely reheated material, as bagging was.
Sampling a subset of the features is the real key to Random Forests. Why do it (feature bagging, a.k.a. random sub-features)? Each tree gives up some performance (fewer variables to split on), but the correlation between the bagged trees drops, and that is the whole justification; see the quick check below.
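A quick check of the decorrelation claim on iris (my own toy example, not course code): setting mtry to the total number of predictors reduces a random forest to plain bagging, while a smaller mtry decorrelates the trees and often lowers the OOB error (the effect can be small on easy data like iris).
library(randomForest)
set.seed(1)
bagged <- randomForest(Species ~ ., data = iris, mtry = 4) # all 4 predictors: plain bagging
rf <- randomForest(Species ~ ., data = iris, mtry = 2)     # random subset: a true forest
bagged$err.rate[500, "OOB"]
rf$err.rate[500, "OOB"]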
library(randomForest)
# Train a default RF model (500 trees)
model <- randomForest(formula = response ~ ., data = train)
This silly package defaults to 500 trees.
Train a Random Forest model | R
This is basically review!
library(randomForest)
library(tidyverse)
# Train a Random Forest
set.seed(1) # for reproducibility
credit_model <- randomForest(formula = as.factor(default) ~ .,
data = credit_train %>% mutate_if(is.character,as.factor),
type = "classification"
)
# Print the model output
print(credit_model)
# head(credit_train)
type: one of regression, classification, or unsupervised.
Random forests can even be unsupervised?
randomForest cannot automatically convert character columns to factors. Use str(train) to check which columns are character, then convert them with as.factor. Hence data %>% mutate_if(is.character, as.factor) saves the day.
Understanding Random Forest model output | R
# Print the credit_model output
> print(credit_model)
Call:
randomForest(formula = default ~ ., data = credit_train)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 24.12%
Confusion matrix:
no yes class.error
no 516 46 0.08185053
yes 147 91 0.61764706
"No. of variables tried at each split: 4" is mtry: the number of predictors sampled for splitting at each node. It defaults to roughly \(\sqrt{n}\), where \(n\) is the number of features; the credit data has 16 predictors, and \(\sqrt{16} = 4\) matches the output.
In "OOB estimate of error rate: 24.12%", OOB \(\to\) out-of-bag: those observations can be used for evaluation because they were not used in training.
# Grab OOB error matrix & take a look
> err <- credit_model$err.rate
> head(err)
OOB no yes
[1,] 0.3414634 0.2657005 0.5375000
[2,] 0.3311966 0.2462908 0.5496183
[3,] 0.3232831 0.2476636 0.5147929
[4,] 0.3164933 0.2180294 0.5561224
[5,] 0.3197756 0.2095808 0.5801887
[6,] 0.3176944 0.2115385 0.5619469
This can be used to decide how many trees are needed.
Evaluate out-of-bag error | R
# Grab OOB error matrix & take a look
err <- credit_model$err.rate
head(err)
# Look at final OOB error rate (last row in err matrix)
oob_err <- err[nrow(err), "OOB"]
print(oob_err)
# Plot the model trained in the previous exercise
plot(credit_model)
# Add a legend since it doesn't have one by default
legend(x = "right",
legend = colnames(err),
fill = 1:ncol(err))
\(\Box\) Honestly, OOB here is still not entirely clear to me. (Each row of err.rate is the running OOB error after that many trees, so the plot shows the error stabilizing as trees are added.)
Evaluate model performance on a test set | R
oob_err is just \(1-Acc\).
# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model, # model object
newdata = credit_test %>%
mutate_if(is.character,as.factor), # test dataset
type = "class") # return classification labels
# Calculate the confusion matrix for the test set
cm <- confusionMatrix(data = class_prediction, # predicted classes
                      reference = as.factor(credit_test$default)) # actual classes (as a factor)
print(cm)
# Compare test set accuracy to OOB accuracy
paste0("Test Accuracy: ", cm$overall[1])
paste0("OOB Accuracy: ", 1 - oob_err)
Test Accuracy and OOB Accuracy are not the same thing.
OOB error vs. test set error | R
Advantages & Disadvantages of OOB estimates
- Can evaluate your model without a separate test set 不需要分test样本。
- Computed automatically by the
randomForest()
function
But …
- OOB Error only estimates error (not AUC, log-loss, etc.)
- Can’t compare Random Forest performace to other types of models
在我看来,都不是特别重要。
# Generate predictions on the test set
pred <- predict(object = credit_model,
newdata = credit_test %>%
mutate_if(is.character,as.factor),
type = "prob")
# `pred` is a matrix
class(pred)
# Look at the pred format
head(pred)
# Compute the AUC (`actual` must be a binary 1/0 numeric vector)
auc(actual = ifelse(credit_test$default == "yes", 1, 0),
predicted = pred[,"yes"])
The Cross Validated question "How does predict.randomForest estimate class probabilities?" explains it clearly: the type argument can be response, prob, or votes.
Tuning a Random Forest model | R
- ntree: number of trees
- mtry: number of variables randomly sampled as candidates at each split
- sampsize: number of samples to train on
- nodesize: minimum size (number of samples) of the terminal nodes
- maxnodes: maximum number of terminal nodes
The focus here is mtry.
# Execute the tuning process
set.seed(1)
res <- tuneRF(x = train_predictor_df,
y = train_response_vector,
ntreeTry = 500)
# Look at results
print(res)
mtry OOBError
2.OOB 2 0.2475
4.OOB 4 0.2475
8.OOB 8 0.2425
tuneRF (a function in randomForest) is very handy.
Tuning a Random Forest via mtry | R
# Execute the tuning process
set.seed(1)
res <- tuneRF(x = subset(credit_train, select = -default) %>% mutate_if(is.character,as.factor),
y = credit_train$default %>% as.factor(),
ntreeTry = 500)
# Look at results
print(res)
# Find the mtry value that minimizes OOB Error
mtry_opt <- res[,"mtry"][which.min(res[,"OOBError"])]
print(mtry_opt)
# If you just want to return the best RF model (rather than results)
# you can set `doBest = TRUE` in `tuneRF()` to return the best RF model
# instead of a set performance matrix.
Tuning a Random Forest via tree depth | R
ncol(credit_train) gives the number of variables in the model.
nodesize <- seq(3, 8, 2) sets the minimum number of samples allowed in a terminal node.
sampsize <- nrow(credit_train) * c(0.7, 0.8) controls how many samples each tree is trained on.
model$err.rate[nrow(model$err.rate), "OOB"] extracts the final OOB error.
# Establish a list of possible values for mtry, nodesize and sampsize
mtry <- seq(4, ncol(credit_train) * 0.8, 2)
nodesize <- seq(3, 8, 2)
sampsize <- nrow(credit_train) * c(0.7, 0.8)
# Create a data frame containing all combinations
hyper_grid <- expand.grid(mtry = mtry, nodesize = nodesize, sampsize = sampsize)
# Create an empty vector to store OOB error values
oob_err <- c()
# Write a loop over the rows of hyper_grid to train the grid of models
for (i in 1:nrow(hyper_grid)) {
# Train a Random Forest model
model <- randomForest(formula = default ~ .,
data = credit_train %>% mutate_if(is.character,as.factor),
mtry = hyper_grid$mtry[i],
nodesize = hyper_grid$nodesize[i],
sampsize = hyper_grid$sampsize[i]
)
# Store OOB error for the model
oob_err[i] <- model$err.rate[nrow(model$err.rate), "OOB"]
}
# Identify optimal set of hyperparameters based on OOB error
opt_i <- which.min(oob_err)
print(hyper_grid[opt_i,])
Introduction to boosting | R
This is the difference: boosted trees improve the model fit by considering past fits, and bagged trees do not.
Train a GBM model | R
- Adaboost
- Gradient Boosting Machine (“GBM”)
Adaboost
- Train a decision tree in which observations start with equal weight
- Increase/Lower the weights of the observations
- Second tree is grown on weighted data
- Repeat this process for a specified number of iterations
Gradient Boosting = Gradient Descent + Boosting
- Fit an additive model (ensemble) in a forward, stage-wise manner.
- In each stage, introduce a "weak learner" (e.g. decision tree) to compensate for the shortcomings of existing weak learners.
- In Adaboost, “shortcomings” are identified by high-weight data points.
- In Gradient Boosting, the “shortcomings” are identified by gradients.
Why is GBM good (or bad)?
- Often performs better than any other algorithm
- Directly optimizes the cost function
- Overfits (need to find a proper stopping point)
- Sensitive to extreme values and noise
A hand-rolled sketch of the boosting loop follows.
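To make "shortcomings identified by gradients" concrete, here is a minimal hand-rolled gradient-boosting loop for squared-error loss (my own toy sketch, not course code). Each stage fits a small rpart tree to the current residuals, which are exactly the negative gradient of \(\frac{1}{2}(y - f)^2\):
library(rpart)
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
dat <- data.frame(x = x, y = y)
f <- rep(mean(y), nrow(dat)) # stage 0: a constant fit
shrinkage <- 0.1             # the learning rate
for (m in 1:50) {
    dat$resid <- y - f                               # residuals = negative gradient
    stump <- rpart(resid ~ x, data = dat, maxdepth = 2) # a weak learner
    f <- f + shrinkage * predict(stump, dat)         # step toward the residuals
}
mean((y - f)^2) # training MSE shrinks as stages accumulate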
# Train a 5000-tree GBM model
> model <- gbm(formula = response ~ .,
distribution = "bernoulli",
data = train,
n.trees = 5000)
distribution = "bernoulli"
针对于\(y\)。
n.trees = 5000
就是iteration开关。
Train a GBM model | R
For binary classification, gbm()
requires the response to be encoded as 0/1 (numeric), so we will have to convert from a “no/yes” factor to a 0/1 numeric response column.
The ifelse() function handles the conversion.
# Convert "yes" to 1, "no" to 0
credit_train$default <- ifelse(credit_train$default == "yes", 1, 0)
# Train a 10000-tree GBM model
set.seed(1)
library(gbm)
credit_model <- gbm(formula = default ~ .,
distribution = "bernoulli",
data = credit_train %>% mutate_if(is.character,as.factor),
n.trees = 10000)
# Print the model object
print(credit_model)
# summary() prints variable importance
# summary(credit_model)
# str(credit_model)
> print(credit_model)
gbm(formula = default ~ ., distribution = "bernoulli", data = credit_train,
n.trees = 10000)
A gradient boosted model with bernoulli loss function.
10000 iterations were performed.
There were 16 predictors of which 16 had non-zero influence.
>
> # summary() prints variable importance
> summary(credit_model)
var rel.inf
checking_balance checking_balance 33.49502510
amount amount 11.62938098
months_loan_duration months_loan_duration 11.17113439
credit_history credit_history 11.15698321
savings_balance savings_balance 6.44293358
employment_duration employment_duration 6.06266137
age age 5.73175696
percent_of_income percent_of_income 3.74219743
other_credit other_credit 3.56695375
purpose purpose 3.38820798
housing housing 1.55169398
years_at_residence years_at_residence 1.35255308
job job 0.47631930
phone phone 0.09142691
existing_loans_count existing_loans_count 0.08924265
dependents dependents 0.05152933
This output tells you how many iterations were performed, how many predictors carried zero influence (noise), and exactly how the useful variables rank. Good for building intuition.
Tianqi Chen has also provided xgboost code for R.
Prediction using a GBM model | R
predict.gbm requires n.trees = 10000 to be passed explicitly; it has no default value.
type = "response" returns the probability \(p\) for the Bernoulli distribution and the expected count \(E(n)\) for the Poisson distribution.
Otherwise predictions come back on the link (log-odds) scale rather than as 0/1 labels, as the ranges below and the sanity check after them show.
# Since we converted the training response col, let's also convert the test response col
credit_test$default <- ifelse(credit_test$default == "yes", 1, 0)
# Generate predictions on the test set
preds1 <- predict(object = credit_model,
newdata = credit_test,
n.trees = 10000)
# Generate predictions on the test set (scale to response)
preds2 <- predict(object = credit_model,
newdata = credit_test,
n.trees = 10000,
type = "response")
# Compare the range of the two sets of predictions
range(preds1)
range(preds2)
> range(preds1)
[1] -3.210354 2.088293
> range(preds2)
[1] 0.03877796 0.88976007
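The two ranges are connected by the logistic function: for distribution = "bernoulli", type = "response" is, to my understanding, just the inverse logit of the link-scale predictions.
# Sanity check: plogis maps log-odds to probabilities
all.equal(plogis(preds1), preds2) # expected TRUE
This is also why the two AUC values in the next section come out identical: AUC depends only on the rank order of predictions, and plogis is monotone.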
Evaluate test set AUC | R
# Generate the test set AUCs using the two sets of preditions & compare
auc(actual = credit_test$default, predicted = preds1) #default
auc(actual = credit_test$default, predicted = preds2) #rescaled
> auc(actual = credit_test$default, predicted = preds1) #default
[1] 0.7875175
> auc(actual = credit_test$default, predicted = preds2) #rescaled
[1] 0.7875175
GBM hyperparameters | R
GBM Hyperparameters
- n.trees: number of trees
- bag.fraction: proportion of observations to be sampled in each tree
- n.minobsinnode: minimum number of observations in the trees' terminal nodes
- interaction.depth: maximum nodes per tree
- shrinkage: learning rate
The one to focus on here is shrinkage.
Early stopping in GBMs | R
Early stopping just means choosing the optimal number of iterations.
Use gbm.perf(); it offers two methods, method = "OOB" and method = "cv".
# Optimal ntree estimate based on OOB
ntree_opt_oob <- gbm.perf(object = credit_model,
method = "OOB",
oobag.curve = TRUE)
# Train a CV GBM model
set.seed(1)
credit_model_cv <- gbm(formula = default ~ .,
distribution = "bernoulli",
data = credit_train,
n.trees = 10000,
cv.folds = 2)
# Optimal ntree estimate based on CV
ntree_opt_cv <- gbm.perf(object = credit_model_cv,
method = "cv")
# Compare the estimates
print(paste0("Optimal n.trees (OOB Estimate): ", ntree_opt_oob))
print(paste0("Optimal n.trees (CV Estimate): ", ntree_opt_cv))
From the gbm.perf documentation: OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive; using cv.folds > 0 when calling gbm usually results in improved predictive performance.
My run also threw: Error in plot.window(...) : need finite 'ylim' values. No time to dig into it yet.
> print(paste0("Optimal n.trees (OOB Estimate): ", ntree_opt_oob))
[1] "Optimal n.trees (OOB Estimate): 3233"
> print(paste0("Optimal n.trees (CV Estimate): ", ntree_opt_cv))
[1] "Optimal n.trees (CV Estimate): 7889"
\(\Box\) This deserves a closer look.
OOB vs CV-based early stopping | R
n.trees = ntree_opt_oob: so the optimal iteration count estimated on the training data is used directly here?
# Generate predictions on the test set using ntree_opt_oob number of trees
preds1 <- predict(object = credit_model,
newdata = credit_test,
n.trees = ntree_opt_oob)
# Generate predictions on the test set using ntree_opt_cv number of trees
preds2 <- predict(object = credit_model,
newdata = credit_test,
n.trees = ntree_opt_cv)
# Generate the test set AUCs using the two sets of preditions & compare
auc1 <- auc(actual = credit_test$default, predicted = preds1) #OOB
auc2 <- auc(actual = credit_test$default, predicted = preds2) #CV
# Compare AUC
print(paste0("Test set AUC (OOB): ", auc1))
print(paste0("Test set AUC (CV): ", auc2))
Compare all models based on AUC | R
In this final exercise, we will perform a model comparison across all types of models that we’ve learned about so far: Decision Trees, Bagged Trees, Random Forest and Gradient Boosting Machine (GBM).
The grand showdown of all the models.
Loaded in your workspace are four numeric vectors:
dt_preds
bag_preds
rf_preds
gbm_preds
sprintf is really handy!
# Generate the test set AUCs using the two sets of predictions & compare
actual <- credit_test$default
dt_auc <- auc(actual = actual, predicted = dt_preds)
bag_auc <- auc(actual = actual, predicted = bag_preds)
rf_auc <- auc(actual = actual, predicted = rf_preds)
gbm_auc <- auc(actual = actual, predicted = gbm_preds)
# Print results
sprintf("Decision Tree Test AUC: %.3f", dt_auc)
sprintf("Bagged Trees Test AUC: %.3f", bag_auc)
sprintf("Random Forest Test AUC: %.3f", rf_auc)
sprintf("GBM Test AUC: %.3f", gbm_auc)
> sprintf("Decision Tree Test AUC: %.3f", dt_auc)
[1] "Decision Tree Test AUC: 0.627"
> sprintf("Bagged Trees Test AUC: %.3f", bag_auc)
[1] "Bagged Trees Test AUC: 0.781"
> sprintf("Random Forest Test AUC: %.3f", rf_auc)
[1] "Random Forest Test AUC: 0.804"
> sprintf("GBM Test AUC: %.3f", gbm_auc)
[1] "GBM Test AUC: 0.786"
Plot & compare ROC curves | R
The ROCR package draws the curves; this instructor really has good sense.
# List of predictions (prediction() and performance() below come from ROCR)
library(ROCR)
preds_list <- list(dt_preds, bag_preds, rf_preds, gbm_preds)
# List of actual values (same for all)
m <- length(preds_list)
actuals_list <- rep(list(credit_test$default), m)
# Plot the ROC curves
pred <- prediction(preds_list, actuals_list)
rocs <- performance(pred, "tpr", "fpr")
plot(rocs, col = as.list(1:m), main = "Test Set ROC Curves")
legend(x = "bottomright",
legend = c("Decision Tree", "Bagged Trees", "Random Forest", "GBM"),
fill = 1:m)