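Holger von Jouanne-Diedrich (2017) and Holte (1993) use univariate analysis. It is similar to WOE (weight of evidence) and related methods, but it is constructed differently. The key point is that you do not have to cut the bins by hand as you would for WOE: the binning is done automatically, which is very convenient.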

The following story is one of the most often told in the Data Science community: some time ago the military built a system whose aim was to distinguish military vehicles from civilian ones. They chose a neural network approach and trained the system with pictures of tanks, humvees and missile launchers on the one hand and normal cars, pickups and trucks on the other. After having reached a satisfactory accuracy they brought the system into the field (quite literally). It failed completely, performing no better than a coin toss. What had happened? No one knew, so they re-engineered the black box (no small feat in itself) and found that most of the military pics were taken at dusk or dawn and most civilian pics under brighter weather conditions. The neural net had learned the difference between light and dark! (Holger von Jouanne-Diedrich 2017; Holte 1993)

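In practice there is a balance to strike between shallow learning and deep learning, and between complex and simple models. We need to separate their roles: simple models extract the information we can already understand well enough, while complex models take care of the particularly complex aspects.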


library(OneR)
data <- optbin(iris)
model <- OneR(data, verbose = TRUE)
##     Attribute    Accuracy
## 1 * Petal.Width  96%     
## 2   Petal.Length 95.33%  
## 3   Sepal.Length 74.67%  
## 4   Sepal.Width  55.33%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
## Call:
## OneR.data.frame(x = data, verbose = TRUE)
## Rules:
## If Petal.Width = (0.0976,0.791] then Species = setosa
## If Petal.Width = (0.791,1.63]   then Species = versicolor
## If Petal.Width = (1.63,2.5]     then Species = virginica
## Accuracy:
## 144 of 150 instances classified correctly (96%)
## Contingency table:
##             Petal.Width
## Species      (0.0976,0.791] (0.791,1.63] (1.63,2.5] Sum
##   setosa               * 50            0          0  50
##   versicolor              0         * 48          2  50
##   virginica               0            4       * 46  50
##   Sum                    50           52         48 150
## ---
## Maximum in each column: '*'
## Pearson's Chi-squared test:
## X-squared = 266.35, df = 4, p-value < 2.2e-16

prediction <- predict(model, data)
eval_model(prediction, data)
## Confusion matrix (absolute):
##             Actual
## Prediction   setosa versicolor virginica Sum
##   setosa         50          0         0  50
##   versicolor      0         48         4  52
##   virginica       0          2        46  48
##   Sum            50         50        50 150
## Confusion matrix (relative):
##             Actual
## Prediction   setosa versicolor virginica  Sum
##   setosa       0.33       0.00      0.00 0.33
##   versicolor   0.00       0.32      0.03 0.35
##   virginica    0.00       0.01      0.31 0.32
##   Sum          0.33       0.33      0.33 1.00
## Accuracy:
## 0.96 (144/150)
## Error rate:
## 0.04 (6/150)
## Error rate reduction (vs. base rate):
## 0.94 (p-value < 2.2e-16)

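optbin bins every continuous variable in the data directly. The functions above are all that a complete analysis with this package needs, so it is very simple to use.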
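To make the idea behind the rules concrete, here is a minimal base-R sketch of the one-rule approach. It is not the OneR package's implementation: `one_rule_sketch` is a hypothetical helper, and it assumes equal-width bins from `cut()` rather than the cut points that `optbin` would choose. Each attribute is binned, each bin predicts its most frequent class, and the attribute with the highest accuracy is kept.

```r
# Minimal base-R sketch of the one-rule idea (not the OneR package's code).
# Assumption: equal-width bins via cut(); optbin() picks cut points differently.
one_rule_sketch <- function(df, target, nbins = 3) {
  predictors <- setdiff(names(df), target)
  results <- lapply(predictors, function(col) {
    x <- df[[col]]
    # Discretise numeric columns into nbins equal-width intervals
    if (is.numeric(x)) x <- cut(x, breaks = nbins)
    tab <- table(x, df[[target]])
    # Each level (bin) predicts its most frequent class
    rule <- colnames(tab)[apply(tab, 1, which.max)]
    names(rule) <- rownames(tab)
    # Correctly classified instances = sum of the row maxima
    list(attribute = col, rule = rule,
         accuracy = sum(apply(tab, 1, max)) / nrow(df))
  })
  accuracies <- vapply(results, function(r) r$accuracy, numeric(1))
  # Keep the attribute whose single rule is most accurate (first one on ties)
  results[[which.max(accuracies)]]
}

best <- one_rule_sketch(iris, target = "Species")
print(best$attribute)
print(best$rule)
print(round(best$accuracy, 4))
```

Because the bins here are equal-width, the cut points and possibly the accuracy can differ slightly from the `optbin` output shown earlier.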


data <- breastcancer
random <- sample(1:nrow(data), 0.8 * nrow(data))
data_train <- optbin(data[random, ], method = "infogain")
data_test <- data[-random, ]
model_train <- OneR(data_train, verbose = TRUE)
##     Attribute                   Accuracy
## 1 * Uniformity of Cell Size     92.87%  
## 2   Uniformity of Cell Shape    91.59%  
## 3   Bare Nuclei                 91.22%  
## 4   Bland Chromatin             90.68%  
## 5   Single Epithelial Cell Size 90.49%  
## 6   Normal Nucleoli             89.4%   
## 7   Marginal Adhesion           87.57%  
## 8   Clump Thickness             85.37%  
## 9   Mitoses                     79.34%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
## Call:
## OneR.data.frame(x = data_train, verbose = TRUE)
## Rules:
## If Uniformity of Cell Size = (0.991,2] then Class = benign
## If Uniformity of Cell Size = (2,10]    then Class = malignant
## Accuracy:
## 508 of 547 instances classified correctly (92.87%)
## Contingency table:
##            Uniformity of Cell Size
## Class       (0.991,2] (2,10] Sum
##   benign        * 324     30 354
##   malignant         9  * 184 193
##   Sum             333    214 547
## ---
## Maximum in each column: '*'
## Pearson's Chi-squared test:
## X-squared = 392.05, df = 1, p-value < 2.2e-16

prediction <- predict(model_train, data_test)
eval_model(prediction, data_test)
## Confusion matrix (absolute):
##            Actual
## Prediction  benign malignant Sum
##   benign        83         3  86
##   malignant     10        44  54
##   Sum           93        47 140
## Confusion matrix (relative):
##            Actual
## Prediction  benign malignant  Sum
##   benign      0.59      0.02 0.61
##   malignant   0.07      0.31 0.39
##   Sum         0.66      0.34 1.00
## Accuracy:
## 0.9071 (127/140)
## Error rate:
## 0.0929 (13/140)
## Error rate reduction (vs. base rate):
## 0.7234 (p-value = 1.69e-11)
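The figures printed by eval_model can be recomputed by hand from the confusion matrix. The sketch below uses base R only and copies the counts from the output above. The definition of the error rate reduction is an assumption inferred from the printed numbers (the share of the base-rate error that the model removes, where the base rate always predicts the most frequent class); it is not taken from the package source.

```r
# Confusion matrix copied from the eval_model() output above
# (rows = prediction, columns = actual).
conf <- matrix(c(83, 10, 3, 44), nrow = 2,
               dimnames = list(Prediction = c("benign", "malignant"),
                               Actual = c("benign", "malignant")))
n <- sum(conf)
accuracy <- sum(diag(conf)) / n            # 127 / 140
error_rate <- 1 - accuracy                 # 13 / 140
# Base rate: always predict the most frequent actual class
base_error <- 1 - max(colSums(conf)) / n   # 47 / 140
# Assumed definition: share of the base-rate error removed by the model
err_reduction <- (base_error - error_rate) / base_error
print(round(c(accuracy = accuracy, error_rate = error_rate,
              err_reduction = err_reduction), 4))
```

Rounded to four decimals this gives 0.9071, 0.0929 and 0.7234, matching the printed output.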



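  • OneR: the main learning function.
  • bin(): equal-width binning; parameters nbins and labels.
  • optbin: automatic binning.
  • maxlevels: removes categorical variables with too many levels.
  • eval_model: similar to summary.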
library(OneR)
library(dplyr)  # provides %>%
data.frame(numeric = 1:26, alphabet = letters) %>% 
  maxlevels()
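As a rough illustration of what a filter like maxlevels does, here is a base-R sketch. `drop_high_cardinality` is a hypothetical helper, not the package's implementation, and the threshold of 20 is an illustrative choice.

```r
# Hypothetical helper, not the package's implementation: drop every factor or
# character column with more than max_levels distinct values.
drop_high_cardinality <- function(df, max_levels = 20) {
  keep <- vapply(df, function(col) {
    if (is.factor(col) || is.character(col)) {
      length(unique(col)) <= max_levels
    } else {
      TRUE  # numeric and other column types are left alone
    }
  }, logical(1))
  df[, keep, drop = FALSE]
}

toy <- data.frame(numeric = 1:26, alphabet = letters)
print(names(drop_high_cardinality(toy)))
```

In this toy data the alphabet column has 26 distinct values, so only the numeric column is kept.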


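Always convert the data with as.data.frame before using the package. It also does not hold up on large datasets. Overall, the package is rather underwhelming.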


