本文于2020-10-10更新。 如发现问题或者有建议,欢迎提交 Issue
knitr::opts_chunk$set(warning = FALSE, message = FALSE, eval=F)
library(dlstats)
rsample_download_data <- cran_stats(c('rsample','recipes'))
library(tidyverse)
library(lubridate)
rsample_download_data %>%
filter(end < floor_date(now(),unit = 'month')) %>%
ggplot(aes(end,downloads)) +
geom_line() +
facet_wrap(~package,scales = 'free_y') +
theme_bw() +
labs(title = "tidymodel package's downlaod increases by month.")
1 rsample
1.1 Cross Validation
bootstraps
函数增加内存很少,如以下这个Github例子。
library(rsample)
library(mlbench) # 提取数据 LetterRecognition
library(pryr) # 使用函数 object_size
data(LetterRecognition)
object_size(LetterRecognition)
set.seed(35222)
boots <- bootstraps(LetterRecognition, times = 50)
object_size(boots)
boots %>% head
as.numeric(object_size(boots)/object_size(LetterRecognition))
1.1.1 splits
需要加一个as.data.frame
bootstraps(mtcars,times=2) %>%
.$splits %>%
.[1] %>%
as.data.frame()
1.2 数据预处理
The
recipes
package contains a data preprocessor that can be used to avoid the potentially expensive formula methods as well as providing a richer set of data manipulation tools than base R can provide. (Kuhn and Wickham 2018)
recipes
包主要是为了
- 避免花大量时间构建模型和
- 提高很多数据处理的方式
我认为后面一个是非常方便的。
- signal extraction using principal component analysis
- imputation of missing values
- transformations of individual variables (e.g. Box-Cox transformations) (Kuhn and Wickham 2018)
recipes
包的函数可以对x变量的进行修正,这里进行举例。
library(AmesHousing)
ames <- make_ames()
names(ames)
log10(Sale_Price) ~ Neighborhood + House_Style + Year_Sold + Lot_Area
library(ggplot2)
theme_set(theme_bw())
ggplot(ames, aes(x = Lot_Area)) +
geom_histogram(binwidth = 5000, col = "red", fill ="red", alpha = .5)
theme_bw()
这个图看起来很不错。- 有很常见的右偏,可以使用 Box-Cox 方式进行修正。
ggplot(ames, aes(x = Neighborhood)) + geom_bar() + coord_flip() + xlab("")
- 可以发现,有些频率小的level,最后都要剔除。
根据以上问题,下面继续数据处理。
library(recipes)
rec <- recipe(Sale_Price ~ Neighborhood + House_Style + Year_Sold + Lot_Area,
data = ames) %>%
# Log the outcome
step_log(Sale_Price, base = 10) %>%
# Collapse rarely occurring jobs into "other"
step_other(Neighborhood, House_Style, threshold = 0.05) %>%
# Dummy variables on the qualitative predictors
step_dummy(all_nominal()) %>%
# 相当于一键 one-hot
# Unskew a predictor
step_BoxCox(Lot_Area) %>%
# Normalize
step_center(all_predictors()) %>%
step_scale(all_predictors())
rec
recipe(,...,data=)
这样会更直接,定义好模型,但是先不定义算法类型,跟符合现实逻辑。step_other(threshold = )
让低频率的并入,这样就可以保证无论之后,开发集产生unknown的level几率会非常小。step_BoxCox
转换成unskew的。step_center
和step_scale
进行标准化。
While the original data object
ames
is used in the call, it is only used to define the variables and their characteristics so a single recipe is valid across all resampled versions of the data. The recipe can be estimated on the analysis component of the resample. (Kuhn and Wickham 2018)
这里解决了一个实际问题。
recipe
函数中虽然使用了数据ames
,但是只是用来定义变量和变量的特性,因此recipe
反馈的规则可以应用到其他的数据集或者resample上。
这点就解决了测试集需要统一的问题。
rec_training_set <- prep(rec, training = ames, retain = TRUE, verbose = TRUE)
rec_training_set
prep
函数是对某一个数据执行规则的意思。
bake(rec_training_set, newdata = head(ames))
ames %>% head %>%
select(Neighborhood,House_Style,Year_Sold,Lot_Area)
bake
反馈到处理结果到newdata上。相当于predict
- 并且
Neighborhood
和House_Style
进行了one-hot
juice(rec_training_set) %>% head
juice
相当于fitted
1.2.1 整合进模型训练
library(rsample)
set.seed(7712)
bt_samples <- bootstraps(ames)
bt_samples
bt_samples$splits[[1]]
这是切分点的选取。
library(purrr)
bt_samples$recipes <- map(bt_samples$splits, prepper, recipe = rec, retain = TRUE, verbose = FALSE)
bt_samples
只要rec
定义好整个函数是很好理解的。
bt_samples$recipes[[1]]
prepper
是prep
的替代品,主要是为了对split的函数,进行执行变量和变量特性的修改。
fit_lm <- function(rec_obj, ...)
lm(..., data = juice(rec_obj, everything()))
bt_samples$lm_mod <-
map(
bt_samples$recipes,
fit_lm,
Sale_Price ~ .
)
bt_samples
学习fit_lm
的函数中...
的构建和位置。
pred_lm <- function(split_obj, rec_obj, model_obj, ...) {
mod_data <- bake(
rec_obj,
newdata = assessment(split_obj),
all_predictors(),
all_outcomes()
)
out <- mod_data %>% select(Sale_Price)
out$predicted <- predict(model_obj, newdata = mod_data %>% select(-Sale_Price))
out
}
bt_samples$pred <-
pmap(
lst(
split_obj = bt_samples$splits,
rec_obj = bt_samples$recipes,
model_obj = bt_samples$lm_mod
),
pred_lm
)
bt_samples
rmse <- function(dat)
sqrt(mean((dat$Sale_Price - dat$predicted)^2))
bt_samples$RMSE <- map_dbl(bt_samples$pred, rmse)
summary(bt_samples$RMSE)
Kuhn, Max, and Hadley Wickham. 2018. “Recipes with Rsample.” rsample. 2018. https://tidymodels.github.io/rsample/articles/Applications/Recipes_and_rsample.html.