3 min read

tidymodel 使用技巧

本文于2020-10-10更新。 如发现问题或者有建议,欢迎提交 Issue

knitr::opts_chunk$set(warning = FALSE, message = FALSE, eval=F)
library(dlstats)
rsample_download_data <- cran_stats(c('rsample','recipes'))
library(tidyverse)
library(lubridate)
rsample_download_data %>% 
    filter(end < floor_date(now(),unit = 'month')) %>% 
    ggplot(aes(end,downloads)) + 
    geom_line() + 
    facet_wrap(~package,scales = 'free_y') +
    theme_bw() + 
    labs(title = "tidymodel package's downlaod increases by month.")

1 rsample

1.1 Cross Validation

bootstraps函数增加内存很少,如以下这个Github例子。

library(rsample)
library(mlbench) # 提取数据 LetterRecognition
library(pryr) # 使用函数 object_size
data(LetterRecognition)
object_size(LetterRecognition)
set.seed(35222)
boots <- bootstraps(LetterRecognition, times = 50)
object_size(boots)
boots %>% head
as.numeric(object_size(boots)/object_size(LetterRecognition))

1.1.1 splits需要加一个as.data.frame

bootstraps(mtcars,times=2) %>% 
  .$splits %>% 
  .[1] %>% 
  as.data.frame()

1.2 数据预处理

The recipes package contains a data preprocessor that can be used to avoid the potentially expensive formula methods as well as providing a richer set of data manipulation tools than base R can provide. (Kuhn and Wickham 2018)

recipes包主要是为了

  1. 避免花大量时间构建模型和
  2. 提高很多数据处理的方式

我认为后面一个是非常方便的。

  1. signal extraction using principal component analysis
  2. imputation of missing values
  3. transformations of individual variables (e.g. Box-Cox transformations) (Kuhn and Wickham 2018)

recipes包的函数可以对x变量的进行修正,这里进行举例。

library(AmesHousing)
ames <- make_ames()
names(ames)
log10(Sale_Price) ~ Neighborhood + House_Style + Year_Sold + Lot_Area
library(ggplot2)
theme_set(theme_bw())
ggplot(ames, aes(x = Lot_Area)) + 
  geom_histogram(binwidth = 5000, col = "red", fill ="red", alpha = .5)
  1. theme_bw() 这个图看起来很不错。
  2. 有很常见的右偏,可以使用 Box-Cox 方式进行修正。
ggplot(ames, aes(x = Neighborhood)) + geom_bar() + coord_flip() + xlab("")
  1. 可以发现,有些频率小的level,最后都要剔除。

根据以上问题,下面继续数据处理。

library(recipes)

rec <- recipe(Sale_Price ~ Neighborhood + House_Style + Year_Sold + Lot_Area, 
              data = ames) %>%
  # Log the outcome
  step_log(Sale_Price, base = 10) %>%
  # Collapse rarely occurring jobs into "other"
  step_other(Neighborhood, House_Style, threshold = 0.05) %>%
  # Dummy variables on the qualitative predictors
  step_dummy(all_nominal()) %>%
  # 相当于一键 one-hot
  # Unskew a predictor
  step_BoxCox(Lot_Area) %>%
  # Normalize
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) 
rec
  1. recipe(,...,data=)这样会更直接,定义好模型,但是先不定义算法类型,跟符合现实逻辑。
  2. step_other(threshold = )让低频率的并入,这样就可以保证无论之后,开发集产生unknown的level几率会非常小。
  3. step_BoxCox转换成unskew的。
  4. step_centerstep_scale进行标准化。

While the original data object ames is used in the call, it is only used to define the variables and their characteristics so a single recipe is valid across all resampled versions of the data. The recipe can be estimated on the analysis component of the resample. (Kuhn and Wickham 2018)

这里解决了一个实际问题。

recipe函数中虽然使用了数据ames,但是只是用来定义变量和变量的特性,因此recipe反馈的规则可以应用到其他的数据集或者resample上。 这点就解决了测试集需要统一的问题。

rec_training_set <- prep(rec, training = ames, retain = TRUE, verbose = TRUE)
rec_training_set

prep函数是对某一个数据执行规则的意思。

bake(rec_training_set, newdata = head(ames))
ames %>% head %>%
    select(Neighborhood,House_Style,Year_Sold,Lot_Area)
  1. bake反馈到处理结果到newdata上。相当于predict
  2. 并且NeighborhoodHouse_Style进行了one-hot
juice(rec_training_set) %>% head
  1. juice相当于fitted

1.2.1 整合进模型训练

library(rsample)
set.seed(7712)
bt_samples <- bootstraps(ames)
bt_samples
bt_samples$splits[[1]]

这是切分点的选取。

library(purrr)

bt_samples$recipes <- map(bt_samples$splits, prepper, recipe = rec, retain = TRUE, verbose = FALSE)
bt_samples

只要rec定义好整个函数是很好理解的。

bt_samples$recipes[[1]]

prepperprep的替代品,主要是为了对split的函数,进行执行变量和变量特性的修改。

fit_lm <- function(rec_obj, ...) 
  lm(..., data = juice(rec_obj, everything()))

bt_samples$lm_mod <- 
  map(
    bt_samples$recipes, 
    fit_lm, 
    Sale_Price ~ .
  )
bt_samples

学习fit_lm的函数中...的构建和位置。

pred_lm <- function(split_obj, rec_obj, model_obj, ...) {
  mod_data <- bake(
    rec_obj, 
    newdata = assessment(split_obj),
    all_predictors(),
    all_outcomes()
  ) 
  
  out <- mod_data %>% select(Sale_Price)
  out$predicted <- predict(model_obj, newdata = mod_data %>% select(-Sale_Price))
  out
}

bt_samples$pred <- 
  pmap(
    lst(
      split_obj = bt_samples$splits, 
      rec_obj = bt_samples$recipes, 
      model_obj = bt_samples$lm_mod
    ),
    pred_lm 
  )
bt_samples
rmse <- function(dat) 
  sqrt(mean((dat$Sale_Price - dat$predicted)^2))
bt_samples$RMSE <- map_dbl(bt_samples$pred, rmse)
summary(bt_samples$RMSE)

Kuhn, Max, and Hadley Wickham. 2018. “Recipes with Rsample.” rsample. 2018. https://tidymodels.github.io/rsample/articles/Applications/Recipes_and_rsample.html.