This article was last updated on 2020-10-10. If you find a problem or have a suggestion, please submit an Issue.
knitr::opts_chunk$set(warning = FALSE, message = FALSE, eval=F)
library(dlstats)
rsample_download_data <- cran_stats(c('rsample','recipes'))
library(tidyverse)
library(lubridate)
rsample_download_data %>%
filter(end < floor_date(now(),unit = 'month')) %>%
ggplot(aes(end,downloads)) +
geom_line() +
facet_wrap(~package,scales = 'free_y') +
theme_bw() +
labs(title = "Downloads of tidymodels packages, by month")
1 rsample
1.1 Cross Validation
The bootstraps function adds very little memory overhead, as the following example from GitHub shows.
library(rsample)
library(mlbench) # provides the LetterRecognition data
library(pryr) # provides the object_size function
data(LetterRecognition)
object_size(LetterRecognition)
set.seed(35222)
boots <- bootstraps(LetterRecognition, times = 50)
object_size(boots)
boots %>% head
as.numeric(object_size(boots)/object_size(LetterRecognition))
1.1.1 splits need an as.data.frame call
bootstraps(mtcars, times = 2) %>%
  .$splits %>%
  # [[1]] extracts the rsplit object itself rather than a list of length one
  .[[1]] %>%
  as.data.frame()
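As a minimal sketch of what the conversion returns, the example below compares as.data.frame with the analysis and assessment accessors from rsample. The object names (boots_toy, first_split, and so on) are made up for illustration, and the seed is arbitrary.

```r
library(rsample)

set.seed(1)  # arbitrary seed, only for reproducibility of this sketch
boots_toy <- bootstraps(mtcars, times = 2)
first_split <- boots_toy$splits[[1]]

# By default, as.data.frame() on an rsplit returns the analysis (resampled) rows
analysis_df <- as.data.frame(first_split)

# The held-out rows are requested with data = "assessment"
assessment_df <- as.data.frame(first_split, data = "assessment")

# analysis() and assessment() are shorthand for the two calls above
same_analysis <- identical(analysis_df, analysis(first_split))
same_assessment <- identical(assessment_df, assessment(first_split))

nrow(analysis_df)    # a bootstrap sample has as many rows as the original data
nrow(assessment_df)  # the rows that were never drawn into the bootstrap sample
```

In day-to-day code, analysis() and assessment() are usually easier to read than as.data.frame with a data argument.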
1.2 Data preprocessing
The recipes package contains a data preprocessor that can be used to avoid the potentially expensive formula methods as well as providing a richer set of data manipulation tools than base R can provide. (Kuhn and Wickham 2018)
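The recipes package mainly aims to
- avoid the potentially expensive formula methods, and
- provide many more ways of processing data.
I find the second point especially convenient. The available tools include: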
- signal extraction using principal component analysis
- imputation of missing values
- transformations of individual variables (e.g. Box-Cox transformations) (Kuhn and Wickham 2018)
The functions in recipes can transform the predictor (x) variables. Here is an example.
library(AmesHousing)
ames <- make_ames()
names(ames)
log10(Sale_Price) ~ Neighborhood + House_Style + Year_Sold + Lot_Area
library(ggplot2)
theme_set(theme_bw())
ggplot(ames, aes(x = Lot_Area)) +
geom_histogram(binwidth = 5000, col = "red", fill ="red", alpha = .5)
- theme_bw() makes this plot look quite clean.
- Lot_Area shows the familiar right skew, which a Box-Cox transformation can correct.
ggplot(ames, aes(x = Neighborhood)) + geom_bar() + coord_flip() + xlab("")
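- Some levels occur only rarely; these low-frequency levels eventually have to be removed or pooled.
Given these issues, we continue with the data preprocessing.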
library(recipes)
rec <- recipe(Sale_Price ~ Neighborhood + House_Style + Year_Sold + Lot_Area,
              data = ames) %>%
  # Log the outcome
  step_log(Sale_Price, base = 10) %>%
  # Collapse rarely occurring levels into "other"
  step_other(Neighborhood, House_Style, threshold = 0.05) %>%
  # Dummy variables on the qualitative predictors
  # (roughly a one-step one-hot encoding)
  step_dummy(all_nominal()) %>%
  # Unskew a predictor
  step_BoxCox(Lot_Area) %>%
  # Normalize
  step_center(all_predictors()) %>%
  step_scale(all_predictors())
rec
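- recipe(..., data = ) is more direct: it fixes the model's variables without committing to an algorithm yet, which is closer to how the work actually proceeds.
- step_other(threshold = ) pools low-frequency levels, so the chance of meeting an unknown level later in the development set becomes very small.
- step_BoxCox transforms a predictor to remove skew.
- step_center and step_scale standardize the predictors.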
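The pooling done by step_other can be seen on a small made-up data set. This is only a sketch: the data frame toy, its levels, and the threshold of 0.1 are invented for illustration and are not taken from the Ames example.

```r
library(recipes)

# Hypothetical data: levels "c" and "d" each make up 5% of the rows
toy <- data.frame(
  x = factor(c(rep("a", 10), rep("b", 8), "c", "d"))
)

rec_toy <- recipe(~ x, data = toy)
# Pool every level whose training frequency is below 10%
rec_toy <- step_other(rec_toy, x, threshold = 0.1)

prepped_toy <- prep(rec_toy, training = toy, retain = TRUE)
pooled <- juice(prepped_toy)

# Frequency table after pooling: "c" and "d" no longer appear on their own
table(pooled$x)
```

The same idea applies to Neighborhood and House_Style above, with a threshold of 0.05.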
While the original data object ames is used in the call, it is only used to define the variables and their characteristics so a single recipe is valid across all resampled versions of the data. The recipe can be estimated on the analysis component of the resample. (Kuhn and Wickham 2018)
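This solves a practical problem.
Although the recipe function is given the ames data, it only uses the data to define the variables and their characteristics, so the rules a recipe describes can be applied to other data sets or to resamples.
This takes care of keeping the preprocessing of the test set consistent with that of the training set.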
rec_training_set <- prep(rec, training = ames, retain = TRUE, verbose = TRUE)
rec_training_set
prep estimates the rules of the recipe from a given data set.
# Note: recent versions of recipes name this argument new_data (formerly newdata)
bake(rec_training_set, new_data = head(ames))
ames %>% head %>%
select(Neighborhood,House_Style,Year_Sold,Lot_Area)
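- bake applies the estimated recipe to the new data and returns the processed result; it is analogous to predict.
- Neighborhood and House_Style have been converted to dummy (one-hot style) columns.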
juice(rec_training_set) %>% head
juice is analogous to fitted: it returns the processed version of the training data.
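The division of labour between prep, juice, and bake can be checked on a tiny made-up example. This is a sketch only: the data frames train_toy and test_toy are invented, and it assumes a recipes version in which the bake argument is called new_data.

```r
library(recipes)

train_toy <- data.frame(x = c(1, 2, 3))  # hypothetical training data, mean(x) = 2
test_toy  <- data.frame(x = 10)          # hypothetical new data

rec_toy <- recipe(~ x, data = train_toy)
rec_toy <- step_center(rec_toy, all_predictors())

# prep() learns the centering value (the training mean) from train_toy
prepped_toy <- prep(rec_toy, training = train_toy, retain = TRUE)

# juice() returns the processed training data
train_out <- juice(prepped_toy)

# bake() applies the training mean to new data; it does not re-estimate it
test_out <- bake(prepped_toy, new_data = test_toy)

train_out$x  # 1, 2, 3 centered by the training mean 2
test_out$x   # 10 minus the training mean 2
```

Because the new data are centered with the training mean rather than their own, the training and test sets go through exactly the same transformation.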
1.2.1 Integrating into model training
library(rsample)
set.seed(7712)
bt_samples <- bootstraps(ames)
bt_samples
bt_samples$splits[[1]]
This shows how the first resample splits the data into analysis and assessment rows.
library(purrr)
bt_samples$recipes <- map(bt_samples$splits, prepper, recipe = rec, retain = TRUE, verbose = FALSE)
bt_samples
Once rec has been defined, the whole call is easy to follow.
bt_samples$recipes[[1]]
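prepper is a wrapper around prep for split objects: it estimates the recipe (the variables and their characteristics) on the analysis portion of each split.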
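As a rough check of that description, the sketch below compares prepper with a direct call to prep on the analysis rows of one split. The recipe rec_toy, the choice of mtcars columns, and the seed are made up for illustration.

```r
library(rsample)
library(recipes)

set.seed(1)  # arbitrary seed, only for reproducibility of this sketch
boots_toy <- bootstraps(mtcars, times = 2)
split1 <- boots_toy$splits[[1]]

# Hypothetical recipe: center two predictors of mpg
rec_toy <- recipe(mpg ~ disp + hp, data = mtcars)
rec_toy <- step_center(rec_toy, all_predictors())

# prepper() takes the split first, which makes it convenient inside map()
via_prepper <- prepper(split1, recipe = rec_toy, retain = TRUE)

# The same estimation written out by hand with prep()
via_prep <- prep(rec_toy, training = analysis(split1), retain = TRUE)

# Both should yield the same processed analysis data
same_result <- isTRUE(all.equal(juice(via_prepper), juice(via_prep)))
same_result
```

Having the split as the first argument is what allows map(bt_samples$splits, prepper, recipe = rec) to work without an anonymous function.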
fit_lm <- function(rec_obj, ...)
  lm(..., data = juice(rec_obj, everything()))
bt_samples$lm_mod <-
map(
bt_samples$recipes,
fit_lm,
Sale_Price ~ .
)
bt_samples
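Note how ... is built into fit_lm and where it is placed: the formula that map supplies after fit_lm is forwarded through ... and becomes the first argument of lm.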
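The forwarding of ... can be tried in isolation with base R. In this sketch, fit_lm_sketch is a hypothetical simplification of fit_lm that takes a plain data frame instead of a prepped recipe, and lapply stands in for map.

```r
# Hypothetical simplification: the data come first, everything else goes through ...
fit_lm_sketch <- function(dat, ...) {
  # Whatever the caller passes after `dat` lands in front of `data =`,
  # so an unnamed formula becomes lm()'s first argument
  lm(..., data = dat)
}

# Direct call: the formula travels through ...
mod <- fit_lm_sketch(mtcars, mpg ~ wt)
names(coef(mod))

# Same pattern as map(bt_samples$recipes, fit_lm, Sale_Price ~ .):
# extra arguments after the function name are passed on to every call
data_list <- list(mtcars[1:16, ], mtcars[17:32, ])
mods <- lapply(data_list, fit_lm_sketch, mpg ~ wt)
length(mods)
```

Placing ... before data = keeps the data argument under the function's control while leaving the formula, and any other lm arguments, to the caller.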
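pred_lm <- function(split_obj, rec_obj, model_obj, ...) {
  # Apply the recipe estimated on the analysis set to the assessment set.
  # Recent versions of recipes name this argument new_data (formerly newdata).
  mod_data <- bake(
    rec_obj,
    new_data = assessment(split_obj),
    all_predictors(),
    all_outcomes()
  )
  # Keep the observed outcome and add the model's prediction next to it
  out <- mod_data %>% select(Sale_Price)
  out$predicted <- predict(model_obj, newdata = mod_data %>% select(-Sale_Price))
  out
}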
bt_samples$pred <-
pmap(
lst(
split_obj = bt_samples$splits,
rec_obj = bt_samples$recipes,
model_obj = bt_samples$lm_mod
),
pred_lm
)
bt_samples
rmse <- function(dat)
sqrt(mean((dat$Sale_Price - dat$predicted)^2))
bt_samples$RMSE <- map_dbl(bt_samples$pred, rmse)
summary(bt_samples$RMSE)
Kuhn, Max, and Hadley Wickham. 2018. “Recipes with Rsample.” rsample. 2018. https://tidymodels.github.io/rsample/articles/Applications/Recipes_and_rsample.html.