The idea behind this package is good, but it is too unstable and too buggy to recommend; in the time it takes to fight with it, you could have finished running a regression.
OneR
Holger von Jouanne-Diedrich (2017), Holte (1993). OneR performs single-variable analysis, similar in spirit to WOE, but constructed differently; the key point is that you do not have to cut WOE bins by hand. It bins automatically, which is very convenient.
The following story is one of the most often told in the Data Science community: some time ago the military built a system whose aim was to distinguish military vehicles from civilian ones. They chose a neural network approach and trained the system with pictures of tanks, humvees and missile launchers on the one hand and normal cars, pickups and trucks on the other. After having reached a satisfactory accuracy they brought the system into the field (quite literally). It failed completely, performing no better than a coin toss. What had happened? No one knew, so they re-engineered the black box (no small feat in itself) and found that most of the military pics were taken at dusk or dawn and most civilian pics under brighter weather conditions. The neural net had learned the difference between light and dark! Holte (1993)
In practice there is a balance to strike between shallow learning and deep learning, between simple models and complex ones. We need to divide the work: simple models extract the information we already understand well, while complex models take care of the genuinely complicated parts.
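To make the OneR idea concrete, here is a hedged base-R sketch (my own illustration, not the package's implementation): discretize each predictor, map each bin to its majority class, and keep the single predictor whose one rule scores highest on the training data. The helper name `one_rule` is hypothetical.

```r
# Illustrative sketch of the OneR idea in base R (not the package's code):
# bin each predictor, assign each bin its majority class, keep the best rule.
one_rule <- function(data, target) {
  y <- data[[target]]
  preds <- setdiff(names(data), target)
  best <- list(feature = NULL, accuracy = -Inf, rule = NULL)
  for (p in preds) {
    x <- data[[p]]
    # equal-width binning for numeric columns, mirroring simple discretization
    bins <- if (is.numeric(x)) cut(x, breaks = 5) else factor(x)
    bins <- droplevels(bins)
    # majority class within each bin
    rule <- sapply(split(y, bins), function(ys) names(which.max(table(ys))))
    acc <- mean(rule[as.character(bins)] == as.character(y))
    if (acc > best$accuracy) best <- list(feature = p, accuracy = acc, rule = rule)
  }
  best
}

fit <- one_rule(iris, "Species")
fit$feature   # the single most predictive attribute
fit$accuracy  # training accuracy of that one rule
```

On iris this picks one of the petal measurements, which is exactly the kind of "information we already understand well" that a simple model can capture.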
Example 1
library(OneR)
data <- optbin(iris)                 # automatically bin all numeric columns
model <- OneR(data, verbose = TRUE)  # fit one rule per attribute, keep the best
summary(model)
plot(model)
prediction <- predict(model, data)
eval_model(prediction, data)         # confusion matrix and accuracy
optbin() bins every continuous variable in data in one step.
These few functions are everything a full analysis with this package uses; it is very simple.
Example 2
data(breastcancer)
data <- breastcancer
set.seed(12)                                    # reproducible split
random <- sample(1:nrow(data), 0.8 * nrow(data))  # 80/20 train/test split
data_train <- optbin(data[random, ], method = "infogain")  # bin by information gain
data_test <- data[-random, ]
model_train <- OneR(data_train, verbose = TRUE)
summary(model_train)
plot(model_train)
prediction <- predict(model_train, data_test)
eval_model(prediction, data_test)
The "infogain" method chooses cut points the way a decision tree does, by maximizing information gain.
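As a hedged sketch of the quantity being maximized (my own base-R illustration, not code from the package): information gain is the drop in class entropy after splitting the labels into groups.

```r
# Entropy of a class vector, in bits
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain from splitting class labels y by grouping factor g
info_gain <- function(y, g) {
  weights <- table(g) / length(g)
  cond <- sum(weights * sapply(split(y, g), entropy))
  entropy(y) - cond
}

# Toy example: a perfect split removes all uncertainty
y <- factor(c("a", "a", "b", "b"))
g <- factor(c("L", "L", "R", "R"))
info_gain(y, g)  # 1 bit, equal to the full entropy of y
```

A decision tree, and optbin() with this method, picks the cut point whose grouping maximizes this value.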
Function reference
- OneR(): the main learning function.
- bin(): equal-width binning; key parameters are nbins and labels.
- optbin(): automatic (optimal) binning.
- maxlevels(): drops categorical variables with too many levels.
- eval_model(): model evaluation, similar to summary.
library(OneR)
library(dplyr)  # for the pipe and tibble(); data_frame() is deprecated
library(DT)     # for datatable()
tibble(numeric = 1:26, alphabet = letters) %>%
  maxlevels() %>%   # drops the alphabet column: too many distinct levels
  datatable()
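For reference, bin() with its default equal-width method behaves much like base R's cut(); here is a sketch with cut() on the built-in faithful data (the label names are my own, not from the package docs):

```r
# Equal-width binning of a numeric vector into three labeled intervals,
# a base-R analogue of bin(x, nbins = 3, labels = ...)
x <- faithful$eruptions
binned <- cut(x, breaks = 3, labels = c("short", "medium", "long"))
table(binned)  # counts per bin
```

Equal-width intervals are easy to read but can leave some bins nearly empty, which is why the automatic methods above are usually preferable.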
Errors
The data must be converted with as.data.frame() before use.
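For example, a matrix (and by the same token a tibble) should be coerced to a plain data.frame first; a minimal sketch:

```r
# OneR expects a plain data.frame; coerce other containers before fitting
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))
df <- as.data.frame(m)
is.data.frame(df)  # TRUE
```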
It also does not hold up to large datasets.
Overall, this package is pretty flimsy.