1 rsample
- 1.1 Cross Validation
  - 1.1.1 splits需要加一个as.data.frame
- 1.2 数据预处理
  - 1.2.1 整合进模型训练

本文于2020-10-10更新。如发现问题或者有建议，欢迎提交 Issue

knitr::opts_chunk$set(warning = FALSE, message = FALSE, eval=F)

library(dlstats)
rsample_download_data <- cran_stats(c('rsample','recipes'))
library(tidyverse)
library(lubridate)
rsample_download_data %>% 
    filter(end < floor_date(now(),unit = 'month')) %>% 
    ggplot(aes(end,downloads)) + 
    geom_line() + 
    facet_wrap(~package,scales = 'free_y') +
    theme_bw() + 
    labs(title = "tidymodel package's downlaod increases by month.")

1 rsample

1.1 Cross Validation

bootstraps函数增加内存很少，如以下这个Github例子。

library(rsample)
library(mlbench) # 提取数据 LetterRecognition
library(pryr) # 使用函数 object_size
data(LetterRecognition)
object_size(LetterRecognition)
set.seed(35222)
boots <- bootstraps(LetterRecognition, times = 50)
object_size(boots)
boots %>% head
as.numeric(object_size(boots)/object_size(LetterRecognition))

1.1.1 `splits`需要加一个`as.data.frame`

bootstraps(mtcars,times=2) %>% 
  .$splits %>% 
  .[1] %>% 
  as.data.frame()

1.2 数据预处理

The recipes package contains a data preprocessor that can be used to avoid the potentially expensive formula methods as well as providing a richer set of data manipulation tools than base R can provide. (Kuhn and Wickham 2018)

recipes包主要是为了

避免花大量时间构建模型和
提高很多数据处理的方式

我认为后面一个是非常方便的。

signal extraction using principal component analysis

imputation of missing values

transformations of individual variables (e.g. Box-Cox transformations) (Kuhn and Wickham 2018)

recipes包的函数可以对x变量的进行修正，这里进行举例。

library(AmesHousing)
ames <- make_ames()
names(ames)

log10(Sale_Price) ~ Neighborhood + House_Style + Year_Sold + Lot_Area

library(ggplot2)
theme_set(theme_bw())
ggplot(ames, aes(x = Lot_Area)) + 
  geom_histogram(binwidth = 5000, col = "red", fill ="red", alpha = .5)

theme_bw() 这个图看起来很不错。
有很常见的右偏，可以使用 Box-Cox 方式进行修正。

ggplot(ames, aes(x = Neighborhood)) + geom_bar() + coord_flip() + xlab("")

可以发现，有些频率小的level，最后都要剔除。

根据以上问题，下面继续数据处理。

library(recipes)

rec <- recipe(Sale_Price ~ Neighborhood + House_Style + Year_Sold + Lot_Area, 
              data = ames) %>%
  # Log the outcome
  step_log(Sale_Price, base = 10) %>%
  # Collapse rarely occurring jobs into "other"
  step_other(Neighborhood, House_Style, threshold = 0.05) %>%
  # Dummy variables on the qualitative predictors
  step_dummy(all_nominal()) %>%
  # 相当于一键 one-hot
  # Unskew a predictor
  step_BoxCox(Lot_Area) %>%
  # Normalize
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) 
rec

recipe(,...,data=)这样会更直接，定义好模型，但是先不定义算法类型，跟符合现实逻辑。
step_other(threshold = )让低频率的并入，这样就可以保证无论之后，开发集产生unknown的level几率会非常小。
step_BoxCox转换成unskew的。
step_center和step_scale进行标准化。

While the original data object ames is used in the call, it is only used to define the variables and their characteristics so a single recipe is valid across all resampled versions of the data. The recipe can be estimated on the analysis component of the resample. (Kuhn and Wickham 2018)

这里解决了一个实际问题。

recipe函数中虽然使用了数据ames，但是只是用来定义变量和变量的特性，因此recipe反馈的规则可以应用到其他的数据集或者resample上。这点就解决了测试集需要统一的问题。

rec_training_set <- prep(rec, training = ames, retain = TRUE, verbose = TRUE)
rec_training_set

prep函数是对某一个数据执行规则的意思。

bake(rec_training_set, newdata = head(ames))
ames %>% head %>%
    select(Neighborhood,House_Style,Year_Sold,Lot_Area)

bake反馈到处理结果到newdata上。相当于predict
并且Neighborhood和House_Style进行了one-hot

juice(rec_training_set) %>% head

juice相当于fitted

1.2.1 整合进模型训练

library(rsample)
set.seed(7712)
bt_samples <- bootstraps(ames)
bt_samples

bt_samples$splits[[1]]

这是切分点的选取。

library(purrr)

bt_samples$recipes <- map(bt_samples$splits, prepper, recipe = rec, retain = TRUE, verbose = FALSE)
bt_samples

只要rec定义好整个函数是很好理解的。

bt_samples$recipes[[1]]

prepper是prep的替代品，主要是为了对split的函数，进行执行变量和变量特性的修改。

fit_lm <- function(rec_obj, ...) 
  lm(..., data = juice(rec_obj, everything()))

bt_samples$lm_mod <- 
  map(
    bt_samples$recipes, 
    fit_lm, 
    Sale_Price ~ .
  )
bt_samples

学习fit_lm的函数中...的构建和位置。

pred_lm <- function(split_obj, rec_obj, model_obj, ...) {
  mod_data <- bake(
    rec_obj, 
    newdata = assessment(split_obj),
    all_predictors(),
    all_outcomes()
  ) 
  
  out <- mod_data %>% select(Sale_Price)
  out$predicted <- predict(model_obj, newdata = mod_data %>% select(-Sale_Price))
  out
}

bt_samples$pred <- 
  pmap(
    lst(
      split_obj = bt_samples$splits, 
      rec_obj = bt_samples$recipes, 
      model_obj = bt_samples$lm_mod
    ),
    pred_lm 
  )
bt_samples

rmse <- function(dat) 
  sqrt(mean((dat$Sale_Price - dat$predicted)^2))
bt_samples$RMSE <- map_dbl(bt_samples$pred, rmse)
summary(bt_samples$RMSE)

Kuhn, Max, and Hadley Wickham. 2018. “Recipes with Rsample.” rsample. 2018. https://tidymodels.github.io/rsample/articles/Applications/Recipes_and_rsample.html.

tidymodel 使用技巧

1 rsample

1.1 Cross Validation

1.1.1 splits需要加一个as.data.frame

1.2 数据预处理

1.2.1 整合进模型训练

1.1.1 `splits`需要加一个`as.data.frame`