3 min read

"技术：建模流程实战

2018/10/01

{r setup, include=FALSE} knitr::opts_chunk$set(eval = FALSE)

<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>

本文于r format(Sys.Date(), "%Y-%m-%d")更新。如发现问题或者有建议，欢迎提交 Issue

{r} library(dlstats) rsample_download_data <- cran_stats(c('rsample','recipes')) library(tidyverse) library(lubridate) rsample_download_data %>% filter(end < floor_date(now(),unit = 'month')) %>% ggplot(aes(end,downloads)) + geom_line() + facet_wrap(~package,scales = 'free_y') + theme_bw() + labs(title = "tidymodel package's downlaod increases by month.")

rsample

Cross Validation

bootstraps函数增加内存很少，如以下这个Github例子。

{r} library(rsample) library(mlbench) # 提取数据 LetterRecognition library(pryr) # 使用函数 object_size data(LetterRecognition) object_size(LetterRecognition) set.seed(35222) boots <- bootstraps(LetterRecognition, times = 50) object_size(boots) boots %>% head as.numeric(object_size(boots)/object_size(LetterRecognition))

`splits`需要加一个`as.data.frame`

{r} bootstraps(mtcars,times=2) %>% .$splits %>% .[1] %>% as.data.frame()

数据预处理

The recipes package contains a data preprocessor that can be used to avoid the potentially expensive formula methods as well as providing a richer set of data manipulation tools than base R can provide. [@KuhnRecipes]

recipes包主要是为了

避免花大量时间构建模型和
提高很多数据处理的方式

我认为后面一个是非常方便的。

signal extraction using principal component analysis

imputation of missing values

transformations of individual variables (e.g. Box-Cox transformations) [@KuhnRecipes]

recipes包的函数可以对x变量的进行修正，这里进行举例。

{r} library(AmesHousing) ames <- make_ames() names(ames)

{r} log10(Sale_Price) ~ Neighborhood + House_Style + Year_Sold + Lot_Area

{r} library(ggplot2) theme_set(theme_bw()) ggplot(ames, aes(x = Lot_Area)) + geom_histogram(binwidth = 5000, col = "red", fill ="red", alpha = .5)

theme_bw() 这个图看起来很不错。
有很常见的右偏，可以使用 Box-Cox 方式进行修正。

{r} ggplot(ames, aes(x = Neighborhood)) + geom_bar() + coord_flip() + xlab("")

可以发现，有些频率小的level，最后都要剔除。

根据以上问题，下面继续数据处理。

```{r message=FALSE, warning=FALSE} library(recipes)

rec <- recipe(Sale_Price ~ Neighborhood + House_Style + Year_Sold + Lot_Area, data = ames) %>% # Log the outcome step_log(Sale_Price, base = 10) %>% # Collapse rarely occurring jobs into “other step_other(Neighborhood, House_Style, threshold = 0.05) %>% # Dummy variables on the qualitative predictors step_dummy(all_nominal()) %>% # 相当于一键 one-hot # Unskew a predictor step_BoxCox(Lot_Area) %>% # Normalize step_center(all_predictors()) %>% step_scale(all_predictors()) rec

1. `recipe(,...,data=)`这样会更直接，定义好模型，但是先不定义算法类型，跟符合现实逻辑。
1. `step_other(threshold = )`让低频率的并入，这样就可以保证无论之后，开发集产生unknown的level几率会非常小。
1. `step_BoxCox`转换成unskew的。
1. `step_center`和`step_scale`进行标准化。

>
While the original data object `ames` is used in the call, it is only used to define the variables and their characteristics so a single recipe is valid across all resampled versions of the data. The recipe can be estimated on the analysis component of the resample.
[@KuhnRecipes]

这里解决了一个实际问题。

`recipe`函数中虽然使用了数据`ames`，但是只是用来定义变量和变量的特性，因此`recipe`反馈的规则可以应用到其他的数据集或者resample上。
这点就解决了测试集需要统一的问题。

```
rec_training_set <- prep(rec, training = ames, retain = TRUE, verbose = TRUE)
rec_training_set

prep函数是对某一个数据执行规则的意思。

{r} bake(rec_training_set, newdata = head(ames)) ames %>% head %>% select(Neighborhood,House_Style,Year_Sold,Lot_Area)

bake反馈到处理结果到newdata上。相当于predict
并且Neighborhood和House_Style进行了one-hot

{r} juice(rec_training_set) %>% head

juice相当于fitted

整合进模型训练

{r} library(rsample) set.seed(7712) bt_samples <- bootstraps(ames) bt_samples

{r} bt_samples$splits[[1]]

这是切分点的选取。

```{r} library(purrr)

bt_samples$recipes <- map(bt_samples$splits, prepper, recipe = rec, retain = TRUE, verbose = FALSE) bt_samples

只要`rec`定义好整个函数是很好理解的。

```
bt_samples$recipes[[1]]

prepper是prep的替代品，主要是为了对split的函数，进行执行变量和变量特性的修改。

```{r} fit_lm <- function(rec_obj, …) lm(…, data = juice(rec_obj, everything()))

bt_samples$lm_mod <- map( bt_samples$recipes, fit_lm, Sale_Price ~ . ) bt_samples

学习`fit_lm`的函数中`...`的构建和位置。

```
pred_lm <- function(split_obj, rec_obj, model_obj, ...) {
  mod_data <- bake(
    rec_obj, 
    newdata = assessment(split_obj),
    all_predictors(),
    all_outcomes()
  ) 
  
  out <- mod_data %>% select(Sale_Price)
  out$predicted <- predict(model_obj, newdata = mod_data %>% select(-Sale_Price))
  out
}

bt_samples$pred <- 
  pmap(
    lst(
      split_obj = bt_samples$splits, 
      rec_obj = bt_samples$recipes, 
      model_obj = bt_samples$lm_mod
    ),
    pred_lm 
  )
bt_samples

{r} rmse <- function(dat) sqrt(mean((dat$Sale_Price - dat$predicted)^2)) bt_samples$RMSE <- map_dbl(bt_samples$pred, rmse) summary(bt_samples$RMSE)

"技术：建模流程实战

"技术系列导航

rsample

Cross Validation

`splits`需要加一个`as.data.frame`

数据预处理

整合进模型训练

"技术系列导航

"技术：建模流程实战

"技术 系列导航

rsample

Cross Validation

splits需要加一个as.data.frame

数据预处理

整合进模型训练

"技术 系列导航

"技术系列导航

`splits`需要加一个`as.data.frame`

"技术系列导航