See the course home page to download the reference data.
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  eval = FALSE,
  message = FALSE,
  warning = FALSE,
  cache = TRUE
)
```
# Customer Lifetime Value
```{r}
salesData <- read_csv('../../../picbackup/salesData.csv')
```
```{r}
# Variable dictionary, read from a literal string;
# skip = 1 drops the empty first line of the literal
read_csv("
Variable,Description
id,identification number of customer
mostFreqStore,store person bought mostly from
mostFreqCat,category person purchased mostly
nCats,number of different categories
preferredBrand,brand person purchased mostly
nBrands,number of different brands", col_names = TRUE, skip = 1) %>%
  kable()
```
## Correlation matrix
```{r}
salesData %>%
  select_if(is.numeric) %>%
  select(-id) %>%
  cor() %>%
  # as.data.frame()
  corrplot()
```
The `corrplot()` function takes the correlation matrix returned by `cor()` as its input.
## Multicollinearity
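As a rough illustration of what a variance inflation factor measures, the sketch below computes $\mathrm{VIF}_j = 1 / (1 - R_j^2)$ by hand, where $R_j^2$ comes from regressing predictor $j$ on the remaining predictors. The data are simulated, and the names `toy`, `x1`, `x2`, `x3` and `vif_by_hand` are hypothetical; none of them belong to the sales data.

```r
# Simulated predictors: x2 is built to be strongly correlated with x1 (toy data)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others
vif_by_hand <- sapply(names(toy), function(v) {
  f  <- reformulate(setdiff(names(toy), v), response = v)
  r2 <- summary(lm(f, data = toy))$r.squared
  1 / (1 - r2)
})
vif_by_hand
```

In this toy setup the two collinear predictors get large values while the independent one stays close to 1, which is the pattern to look for when comparing models.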
```{r}
# Estimating the full model
salesModel1 <- lm(salesThisMon ~ . - id, data = salesData)
# Reduced model without preferredBrand and nBrands
salesModel2 <- lm(salesThisMon ~ . - id - preferredBrand - nBrands, data = salesData)

# Compare the VIFs of both models (vif() is assumed to come from the car package)
vif(salesModel1) %>%
  as.data.frame() %>%
  rename(vif1 = '.') %>%
  rownames_to_column(var = 'type') %>%
  left_join(
    vif(salesModel2) %>%
      as.data.frame() %>%
      rename(vif2 = '.') %>%
      rownames_to_column(var = 'type'),
    by = 'type'
  ) %>%
  mutate(delta_vif = vif1 - vif2) %>%
  ggplot(aes(y = delta_vif)) +
  geom_boxplot()
```

For the definition of VIF, see [here](https://jiaxiangli.netlify.com/2018/04/training-model/).

The plot of `delta_vif` shows that the VIFs clearly decrease once `preferredBrand` and `nBrands` are removed.

# Churn Prevention
```{r}
defaultData <-
  read_csv2('../../../picbackup/defaultData.csv')
```

`read_csv2()` is used because the fields in this file are separated by semicolons (`;`).
## $\exp(\hat \beta)$
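To make the meaning of $\exp(\hat \beta)$ concrete, here is a minimal sketch on simulated data (not `defaultData`). It checks that the exponentiated coefficient of a logistic regression equals the ratio of predicted odds when the predictor increases by one unit. The names `toyFit`, `oddsRatio` and the simulated slope of 0.7 are assumptions made only for this illustration.

```r
# Simulated toy data: the true log-odds slope for x is 0.7
set.seed(42)
x <- rnorm(5000)
y <- rbinom(5000, size = 1, prob = plogis(-1 + 0.7 * x))
toyFit <- glm(y ~ x, family = binomial)

# exp(coefficient) is the odds ratio per one-unit increase in x
oddsRatio <- exp(coef(toyFit)[["x"]])

# Check: ratio of predicted odds at x = 1 versus x = 0
p    <- predict(toyFit, newdata = data.frame(x = c(0, 1)), type = "response")
odds <- p / (1 - p)
c(oddsRatio = oddsRatio, ratioOfOdds = unname(odds[2] / odds[1]))
```

Both numbers agree, so a value above 1 means higher odds of the outcome per unit of the predictor and a value below 1 means lower odds.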
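```{r}
# Build logistic regression model
logitModelFull <- glm(PaymentDefault ~ limitBal + sex + education + marriage +
                        age + pay1 + pay2 + pay3 + pay4 + pay5 + pay6 +
                        billAmt1 + billAmt2 + billAmt3 + billAmt4 + billAmt5 + billAmt6 +
                        payAmt1 + payAmt2 + payAmt3 + payAmt4 + payAmt5 + payAmt6,
                      family = binomial, data = defaultData)

# tidy() comes from the broom package; beta holds the exponentiated coefficients
logitModelFull %>%
  tidy() %>%
  mutate(beta = exp(estimate))
```

The `beta` column holds $\exp(\hat \beta)$: the factor by which the odds of default are multiplied when the predictor increases by one unit, holding the other predictors fixed.

## Stepwise method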
```{r}
library(MASS)

# Build the new model
logitModelNew <- stepAIC(logitModelFull, trace = 0)

# Save the formula of the new model (it will be needed for the out-of-sample part)
formulaLogit <- as.formula(summary(logitModelNew)$call)
formulaLogit
```
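As a reminder of what the stepwise search optimizes: with its default penalty, `stepAIC()` compares candidate models by the Akaike information criterion. The standard definition (stated here for reference, not taken from the source) is

$$
\mathrm{AIC} = 2k - 2\ln\hat{L}
$$

where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood of the model. A lower AIC is preferred, so a variable is dropped when the loss in fit is smaller than the penalty saved.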
## Confusion matrix
```{r}
# Make predictions using the full Model
defaultData$predFull <- predict(logitModelFull, type = "response", na.action = na.exclude)

# Construct the in-sample confusion matrix
confMatrixModelFull <- SDMTools::confusion.matrix(defaultData$PaymentDefault,
                                                  defaultData$predFull,
                                                  threshold = 0.5)
confMatrixModelFull

# Calculate the accuracy for the full Model
accuracyFull <- sum(diag(confMatrixModelFull)) / sum(confMatrixModelFull)
accuracyFull
```
```{r}
# Calculate the accuracy for 'logitModelNew'
# Make prediction
defaultData$predNew <- predict(logitModelNew, type = "response", na.action = na.exclude)

# Construct the in-sample confusion matrix
confMatrixModelNew <- SDMTools::confusion.matrix(defaultData$PaymentDefault,
                                                 defaultData$predNew,
                                                 threshold = 0.5)
confMatrixModelNew

# Calculate the accuracy...
accuracyNew <- sum(diag(confMatrixModelNew)) / sum(confMatrixModelNew)
accuracyNew

# and compare it to the full model's accuracy
accuracyFull
```
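If `SDMTools` is not available in your R setup, the same confusion matrix and accuracy can be sketched with base R only. The example below uses made-up labels and probabilities; the names `obs`, `pred`, `predClass` and `confMat` are hypothetical, and classifying with `>=` at the threshold is an assumption about how ties are handled.

```r
# Toy observed labels and predicted probabilities (made up for illustration)
obs  <- c(0, 0, 1, 1, 0, 1, 0, 1)
pred <- c(0.1, 0.6, 0.8, 0.4, 0.2, 0.9, 0.3, 0.7)

# Classify at a 0.5 threshold, then cross-tabulate predictions against observations
predClass <- factor(as.integer(pred >= 0.5), levels = c(0, 1))
confMat   <- table(pred = predClass, obs = factor(obs, levels = c(0, 1)))
confMat

# Accuracy: correctly classified cases (the diagonal) over all cases
accuracy <- sum(diag(confMat)) / sum(confMat)
accuracy
```

Fixing the factor levels to `c(0, 1)` keeps the table 2 x 2 even if one class is never predicted, so `diag()` still picks out the correct cells.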
## Cross validation
Cross-validation is done with `cv.glm()` from the `boot` package, which also accepts a user-defined cost function.
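To show what K-fold cross-validation does under the hood, here is a hand-rolled sketch in base R on simulated data. It is not how `cv.glm()` is implemented internally, only the same idea: fit on K-1 folds, score on the held-out fold, and average. The names `toy`, `accAt`, `folds` and `cvAcc` are hypothetical.

```r
# Simulated toy data (hypothetical, not defaultData)
set.seed(123)
n   <- 600
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.2 * toy$x))

# Accuracy at a fixed threshold; arguments are (observed, predicted probability)
accAt <- function(obs, prob, threshold = 0.5) {
  mean((prob >= threshold) == (obs == 1))
}

K     <- 6
folds <- sample(rep(seq_len(K), length.out = n))   # random fold assignment
cvAcc <- sapply(seq_len(K), function(k) {
  fit  <- glm(y ~ x, family = binomial, data = toy[folds != k, ])
  prob <- predict(fit, newdata = toy[folds == k, ], type = "response")
  accAt(toy$y[folds == k], prob)
})
mean(cvAcc)   # folds are equal-sized here, so a plain mean suffices
```

Because each observation is scored by a model that never saw it, the averaged accuracy is a fairer estimate of out-of-sample performance than the in-sample accuracy.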
```{r}
library(boot)

# Accuracy function
costAcc <- function(r, pi = 0) {
  cm <- SDMTools::confusion.matrix(r, pi, threshold = 0.3)
  acc <- sum(diag(cm)) / sum(cm)
  return(acc)
}

# Cross validated accuracy for logitModelNew
set.seed(534381)
cv.glm(defaultData, logitModelNew, cost = costAcc, K = 6)$delta
```

`$delta` has two components: the first is the raw cross-validation estimate of the cost (here, the accuracy), and the second is the bias-adjusted estimate. They are not separate train and test scores.

# Time to Reorder with Survival Analysis
```{r}
dataNextOrder <-
  read_csv('../../../picbackup/survivalDataExercise.csv')
```
```{r}
# Plot a histogram
ggplot(dataNextOrder) +
  geom_histogram(aes(x = daysSinceFirstPurch, fill = factor(boughtAgain))) +
  facet_grid( ~ boughtAgain) +      # Separate plots for boughtAgain = 1 vs. 0
  theme(legend.position = "none")   # Don't show legend
```
- There are more customers in the data who bought a second time.
- Apart from that, the differences between the distributions are not very large.
The counts in the second panel (`boughtAgain = 1`) are clearly higher.
## Foundation
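Before building the survival object on the real data, it can help to see how `Surv()` encodes censoring on a tiny made-up example. The three toy customers and the names `toyDays`, `toyBought` and `toySurv` are hypothetical; a status of 1 means the second order was observed and 0 means it had not happened yet.

```r
library(survival)

# Toy data: three customers; status 1 = reordered, 0 = not yet (censored)
toyDays   <- c(5, 10, 15)
toyBought <- c(1, 0, 1)
toySurv   <- Surv(toyDays, toyBought)

print(toySurv)                           # censored times carry a trailing "+"
censored <- toySurv[, "status"] == 0     # TRUE where the event was not observed
censored
```

Only the second customer is censored, so only that time is printed with a `+`.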
```{r}
library(survival)

# Create survival object
survObj <- Surv(dataNextOrder$daysSinceFirstPurch, dataNextOrder$boughtAgain)

# Look at structure
str(survObj)
```

In the output of `str(survObj)`, a `+` after a time marks a censored observation: the customer had not yet placed a second order when the observation ended.

## Adding more covariates[^covariate]

[^covariate]: In statistics, a covariate is a variable that is possibly predictive of the outcome under study. A covariate may be of direct interest or it may be a confounding or interacting variable.

    The alternative terms explanatory variable, independent variable, or predictor, are used in a regression analysis. In econometrics, the term "control variable" is usually used instead of "covariate". In a more specific usage, a covariate is a secondary variable that can affect the relationship between the dependent variable and other independent variables of primary interest.

This section focuses on Kaplan-Meier analysis.
```{r}
# Compute and print fit
fitKMSimple <- survfit(survObj ~ 1)
print(fitKMSimple)

# Plot fit
plot(fitKMSimple,
     conf.int = FALSE, xlab = "Time since first purchase",
     ylab = "Survival function", main = "Survival function")

# Compute fit with categorical covariate
fitKMCov <- survfit(survObj ~ voucher, data = dataNextOrder)

# Plot fit with covariate and add labels
plot(fitKMCov, lty = 2:3,
     xlab = "Time since first purchase",
     ylab = "Survival function", main = "Survival function")
legend(90, .9, c("No", "Yes"), lty = 2:3)
```
Customers using a voucher seem to take longer to place their second order. They are maybe waiting for another voucher?
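For intuition about the curves fitted above, the Kaplan-Meier estimate can be computed by hand: at each observed event time $t_i$, the survival probability is multiplied by $1 - d_i / n_i$, where $d_i$ is the number of events and $n_i$ the number still at risk. The sketch below uses five made-up customers; `toyTime`, `toyStatus` and `kmSurv` are hypothetical names, and this is an illustration of the estimator rather than the internals of `survfit()`.

```r
# Toy data (made up): days until reorder, status 1 = reordered, 0 = censored
toyTime   <- c(1, 2, 3, 4, 5)
toyStatus <- c(1, 0, 1, 1, 0)

eventTimes <- sort(unique(toyTime[toyStatus == 1]))
atRisk <- sapply(eventTimes, function(t) sum(toyTime >= t))                   # n_i
events <- sapply(eventTimes, function(t) sum(toyTime == t & toyStatus == 1))  # d_i

# Kaplan-Meier: S(t) = product over event times t_i <= t of (1 - d_i / n_i)
kmSurv <- cumprod(1 - events / atRisk)
data.frame(time = eventTimes, atRisk, events, survival = kmSurv)
```

Censored customers leave the risk set without counting as events, which is why the curve drops only at the times when a reorder was actually observed.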