
Ensemble Learning in R with SuperLearner: Study Notes

knitr::opts_chunk$set(warning = FALSE, message = FALSE, eval=F)
library(tidyverse)

SuperLearner is an algorithm that uses cross-validation to estimate the performance of multiple machine learning models, or of the same model with different settings. It then creates an optimal weighted average of those models, also called an “ensemble”, based on their held-out (cross-validated) performance.
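As a toy illustration of the weighting idea (the numbers below are made up, not produced by SuperLearner):

# Hypothetical cross-validated predictions from two models for three cases
pred_m1 <- c(0.20, 0.70, 0.55)
pred_m2 <- c(0.10, 0.90, 0.40)

# Weights chosen to minimise cross-validated risk (hypothetical values)
w <- c(0.35, 0.65)

# The ensemble prediction is simply the weighted average
w[1] * pred_m1 + w[2] * pred_m2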

library(SuperLearner)

Pima Indian Women data set

The type column indicates the presence of diabetes. It is a binary Yes/No outcome, which means it follows a binomial distribution.

# Get the `MASS` library
library(MASS)

# Train and test sets
train <- Pima.tr
test <- Pima.te

# Print out the first lines of `train`
head(train)
# Encode the outcome (column 8, `type`) as 0/1 and keep the 7 predictors
y <- as.numeric(train[, 8]) - 1
ytest <- as.numeric(test[, 8]) - 1
x <- data.frame(train[, 1:7])
xtest <- data.frame(test[, 1:7])
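A quick sanity check (my addition, not in the original tutorial) confirms that the 0/1 encoding lines up with the factor levels:

# y should be 0 wherever `type` is "No" and 1 wherever it is "Yes"
table(y, train$type)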

Models available for stacking

listWrappers()
  • ranger is a random forest package
  • kernlab is an SVM package
  • arm is a Bayesian package, providing Bayesian Generalized Linear Models (GLM)
  • ipred is a bagging package

These packages must be installed before they can be used.
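A one-off install along these lines should cover the wrappers used below:

install.packages(c("ranger", "kernlab", "ipred", "arm"))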

set.seed(150)
single.model <- SuperLearner(y,
                             x,
                             family=binomial(),
                             SL.library=list("SL.ranger"))
single.model
# Set the seed
set.seed(150)

# Fit the ensemble model
model <- SuperLearner(y,
                      x,
                      family=binomial(),
                      SL.library=list("SL.ranger",
                                      "SL.ksvm",
                                      "SL.ipredbagg",
                                      "SL.bayesglm"))

# Return the model
model

What is the risk factor? And what does the variation mean?

Ranger and KSVM have coefficients of zero, which means they are no longer weighted as part of the ensemble.

The coefficients are normalized: they sum to one.

Hello Jiaxiang, The risk factor is simply the error of the model being fit. SuperLearner calculates the error of each model put into the ensemble stack. This is a gauge that it then uses to determine the coefficients or weights of each model. Behind the scenes, SuperLearner is using cross validation to determine the risk of each model. If you have a risk of 0.16, then this can be interpreted as having 0.16 error or 0.84 accuracy. [@Gremmell2018]
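Both quantities can be inspected directly on the fitted object (`cvRisk` and `coef` are components of a SuperLearner fit):

# Cross-validated risk (error) of each algorithm in the stack
model$cvRisk

# The ensemble weights derived from those risks
model$coef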

# Set the seed
set.seed(150)

# Get V-fold cross-validated risk estimate
cv.model <- CV.SuperLearner(y,
                            x,
                            V=5,
                            SL.library=list("SL.ranger",
                                            "SL.ksvm",
                                            "SL.ipredbagg",
                                            "SL.bayesglm"))

# Print out the summary statistics
summary(cv.model)
plot(cv.model)

It’s easy to see that Bayes GLM performs the best on average, while KSVM performs the worst and shows a lot of variation compared to the other models.

predictions <- predict.SuperLearner(model, newdata=xtest)
predictions$pred %>% head()
predictions$library.predict %>% head()

Plain `predict()` does not work here; you have to call `predict.SuperLearner()`.
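With the default method (method.NNLS), the ensemble prediction is just the coefficient-weighted combination of the library predictions, so the two outputs above are directly linked. A quick check (my sketch):

# Manually reconstruct the ensemble prediction from the library predictions
manual <- predictions$library.predict %*% model$coef
all.equal(as.numeric(manual), as.numeric(predictions$pred))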

# Load the package
library(dplyr)

# Recode probabilities
conv.preds <- ifelse(predictions$pred>=0.5,1,0)
# Load in `caret`
library(caret)

# Create the confusion matrix (caret expects factors, not numerics)
cm <- confusionMatrix(as.factor(conv.preds), as.factor(ytest))

# Return the confusion matrix
cm
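The headline accuracy can also be verified without caret:

# Proportion of correct predictions, computed by hand
mean(conv.preds == ytest)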

Tuning hyperparameters

# Wrapper that fixes ranger's hyperparameters
SL.ranger.tune <- function(...){
  SL.ranger(..., num.trees=1000, mtry=2)
}

# Wrapper that fixes the number of bagging iterations
SL.ipredbagg.tune <- function(...){
  SL.ipredbagg(..., nbagg=250)
}
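These wrappers simply pass fixed hyperparameters through `...` to the underlying learner, so you can define as many variants as you like. For example (a hypothetical extra wrapper, not fitted below):

# Another ranger variant with a different mtry, for comparison
SL.ranger.mtry3 <- function(...){
  SL.ranger(..., num.trees=1000, mtry=3)
}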
# Set the seed
set.seed(150)

# Tune the model
cv.model.tune <- 
  CV.SuperLearner(y,
                  x,
                  V=5,
                  SL.library=list("SL.ranger",
                                  "SL.ksvm",                                  
                                  "SL.ipredbagg",
                                  "SL.bayesglm",
                                  "SL.ranger.tune",
                                  "SL.ipredbagg.tune"))

# Get summary statistics
summary(cv.model.tune)
plot(cv.model.tune)

With the cross-validation above done, we continue by fitting the tuned ensemble.

# Set the seed
set.seed(150)

# Create the tuned model
model.tune <- SuperLearner(y,
                           x,
                           family=binomial(),
                           SL.library=
                             list("SL.ranger",
                                  "SL.ksvm",
                                  "SL.ipredbagg",
                                  "SL.bayesglm",
                                  "SL.ranger.tune",
                                  "SL.ipredbagg.tune"))

# Return the tuned model
model.tune

SL.bayesglm and SL.ipredbagg.tune are now the only algorithms weighted in the ensemble.

# Gather predictions for the tuned model
predictions.tune <- predict.SuperLearner(model.tune, newdata=xtest)

# Recode predictions
conv.preds.tune <- ifelse(predictions.tune$pred>=0.5,1,0)

# Return the confusion matrix
confusionMatrix(as.factor(conv.preds.tune), as.factor(ytest))

create.Learner()

learner <- create.Learner("SL.ranger", params=list(num.trees=1000, mtry=2))
learner2 <- create.Learner("SL.ipredbagg", params=list(nbagg=250))
learner
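create.Learner() can also expand a small grid of values via its tune argument, generating one wrapper per combination (see ?create.Learner; the call below is my sketch):

# One wrapper is generated per mtry value
learner3 <- create.Learner("SL.ranger", tune=list(mtry=c(2, 4)))
learner3$names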
# Set the seed
set.seed(150)

# Create a second tuned model
cv.model.tune2 <- CV.SuperLearner(y,
                                  x,
                                  V=5,
                                  SL.library=list("SL.ranger",
                                                  "SL.ksvm",
                                                  "SL.ipredbagg",
                                                  "SL.bayesglm", 
                                                  learner$names,
                                                  learner2$names))

# Get summary statistics
summary(cv.model.tune2)
plot(cv.model.tune2)

In practice, though, I do not really understand blending and the other details yet; this post merely demonstrates the toolkit.