{r setup, include=FALSE} knitr::opts_chunk$set(eval = FALSE)
<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
本文于r format(Sys.Date(), "%Y-%m-%d")更新。 如发现问题或者有建议,欢迎提交 Issue
{r} library(dlstats) rsample_download_data <- cran_stats(c('rsample','recipes')) library(tidyverse) library(lubridate) rsample_download_data %>% filter(end < floor_date(now(),unit = 'month')) %>% ggplot(aes(end,downloads)) + geom_line() + facet_wrap(~package,scales = 'free_y') + theme_bw() + labs(title = "tidymodel package's downlaod increases by month.")
rsample
Cross Validation
bootstraps函数增加内存很少,如以下这个Github例子。
{r} library(rsample) library(mlbench) # 提取数据 LetterRecognition library(pryr) # 使用函数 object_size data(LetterRecognition) object_size(LetterRecognition) set.seed(35222) boots <- bootstraps(LetterRecognition, times = 50) object_size(boots) boots %>% head as.numeric(object_size(boots)/object_size(LetterRecognition))
splits需要加一个as.data.frame
{r} bootstraps(mtcars,times=2) %>% .$splits %>% .[1] %>% as.data.frame()
数据预处理
The
recipespackage contains a data preprocessor that can be used to avoid the potentially expensive formula methods as well as providing a richer set of data manipulation tools than base R can provide. [@KuhnRecipes]
recipes包主要是为了
- 避免花大量时间构建模型和
- 提高很多数据处理的方式
我认为后面一个是非常方便的。
- signal extraction using principal component analysis
- imputation of missing values
- transformations of individual variables (e.g. Box-Cox transformations) [@KuhnRecipes]
recipes包的函数可以对x变量的进行修正,这里进行举例。
{r} library(AmesHousing) ames <- make_ames() names(ames)
{r} log10(Sale_Price) ~ Neighborhood + House_Style + Year_Sold + Lot_Area
{r} library(ggplot2) theme_set(theme_bw()) ggplot(ames, aes(x = Lot_Area)) + geom_histogram(binwidth = 5000, col = "red", fill ="red", alpha = .5)
theme_bw()这个图看起来很不错。- 有很常见的右偏,可以使用 Box-Cox 方式进行修正。
{r} ggplot(ames, aes(x = Neighborhood)) + geom_bar() + coord_flip() + xlab("")
- 可以发现,有些频率小的level,最后都要剔除。
根据以上问题,下面继续数据处理。
```{r message=FALSE, warning=FALSE} library(recipes)
rec <- recipe(Sale_Price ~ Neighborhood + House_Style + Year_Sold + Lot_Area, data = ames) %>% # Log the outcome step_log(Sale_Price, base = 10) %>% # Collapse rarely occurring jobs into “other step_other(Neighborhood, House_Style, threshold = 0.05) %>% # Dummy variables on the qualitative predictors step_dummy(all_nominal()) %>% # 相当于一键 one-hot # Unskew a predictor step_BoxCox(Lot_Area) %>% # Normalize step_center(all_predictors()) %>% step_scale(all_predictors()) rec
1. `recipe(,...,data=)`这样会更直接,定义好模型,但是先不定义算法类型,跟符合现实逻辑。
1. `step_other(threshold = )`让低频率的并入,这样就可以保证无论之后,开发集产生unknown的level几率会非常小。
1. `step_BoxCox`转换成unskew的。
1. `step_center`和`step_scale`进行标准化。
>
While the original data object `ames` is used in the call, it is only used to define the variables and their characteristics so a single recipe is valid across all resampled versions of the data. The recipe can be estimated on the analysis component of the resample.
[@KuhnRecipes]
这里解决了一个实际问题。
`recipe`函数中虽然使用了数据`ames`,但是只是用来定义变量和变量的特性,因此`recipe`反馈的规则可以应用到其他的数据集或者resample上。
这点就解决了测试集需要统一的问题。
```
rec_training_set <- prep(rec, training = ames, retain = TRUE, verbose = TRUE)
rec_training_set
prep函数是对某一个数据执行规则的意思。
{r} bake(rec_training_set, newdata = head(ames)) ames %>% head %>% select(Neighborhood,House_Style,Year_Sold,Lot_Area)
bake反馈到处理结果到newdata上。相当于predict- 并且
Neighborhood和House_Style进行了one-hot
{r} juice(rec_training_set) %>% head
juice相当于fitted
整合进模型训练
{r} library(rsample) set.seed(7712) bt_samples <- bootstraps(ames) bt_samples
{r} bt_samples$splits[[1]]
这是切分点的选取。
```{r} library(purrr)
bt_samples$recipes <- map(bt_samples$splits, prepper, recipe = rec, retain = TRUE, verbose = FALSE) bt_samples
只要`rec`定义好整个函数是很好理解的。
```
bt_samples$recipes[[1]]
prepper是prep的替代品,主要是为了对split的函数,进行执行变量和变量特性的修改。
```{r} fit_lm <- function(rec_obj, …) lm(…, data = juice(rec_obj, everything()))
bt_samples$lm_mod <- map( bt_samples$recipes, fit_lm, Sale_Price ~ . ) bt_samples
学习`fit_lm`的函数中`...`的构建和位置。
```
pred_lm <- function(split_obj, rec_obj, model_obj, ...) {
mod_data <- bake(
rec_obj,
newdata = assessment(split_obj),
all_predictors(),
all_outcomes()
)
out <- mod_data %>% select(Sale_Price)
out$predicted <- predict(model_obj, newdata = mod_data %>% select(-Sale_Price))
out
}
bt_samples$pred <-
pmap(
lst(
split_obj = bt_samples$splits,
rec_obj = bt_samples$recipes,
model_obj = bt_samples$lm_mod
),
pred_lm
)
bt_samples
{r} rmse <- function(dat) sqrt(mean((dat$Sale_Price - dat$predicted)^2)) bt_samples$RMSE <- map_dbl(bt_samples$pred, rmse) summary(bt_samples$RMSE)