The course home page is a convenient place to download the reference data.

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  eval = FALSE,
  message = FALSE,
  warning = FALSE,
  cache = TRUE
)
```

# Customer Lifetime Value

```{r}
library(tidyverse)  # read_csv(), dplyr verbs, rownames_to_column(), ggplot2

salesData <- read_csv('../../../picbackup/salesData.csv')
```

```{r}
library(knitr)  # kable()

# Variable dictionary, passed to read_csv() as literal CSV text
read_csv(
"Variable,Description
id,identification number of customer
mostFreqStore,store person bought mostly from
mostFreqCat,category person purchased mostly
nCats,number of different categories
preferredBrand,brand person purchased mostly
nBrands,number of different brands",
  col_names = TRUE
) %>%
  kable()
```

## collinearity
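For orientation, the variance inflation factor of predictor $j$ is usually defined as $\mathrm{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors. A minimal base-R sketch on simulated data may make this concrete (the variables `x1`, `x2`, `x3` are made up for illustration and are not part of `salesData`):

```r
# Simulated predictors: x2 is almost a copy of x1, x3 is independent
set.seed(42)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)
x3 <- rnorm(100)

# R^2 from regressing x2 on the remaining predictors
r2 <- summary(lm(x2 ~ x1 + x3))$r.squared

# VIF of x2: large because x1 explains x2 almost completely
vif_x2 <- 1 / (1 - r2)
vif_x2
```

A common rule of thumb treats values well above 5 or 10 as a sign of problematic collinearity.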

```{r}
library(car)  # vif()

# Estimating the full model
salesModel1 <- lm(salesThisMon ~ . - id, data = salesData)
# Reduced model without preferredBrand and nBrands
salesModel2 <- lm(salesThisMon ~ . - id - preferredBrand - nBrands, data = salesData)

# Compare the VIFs of both models, variable by variable
vif(salesModel1) %>%
  as.data.frame() %>%
  rename(vif1 = '.') %>%
  rownames_to_column(var = 'type') %>%
  left_join(
    vif(salesModel2) %>%
      as.data.frame() %>%
      rename(vif2 = '.') %>%
      rownames_to_column(var = 'type'),
    by = 'type'
  ) %>%
  mutate(delta_vif = vif1 - vif2) %>%
  ggplot(aes(y = delta_vif)) +
  geom_boxplot()
```

For the definition of VIF, see [here](https://jiaxiangli.netlify.com/2018/04/training-model/).
The VIFs clearly drop once `preferredBrand` and `nBrands` are removed from the model.

# Churn Prevention

```{r}
defaultData <-
    read_csv2('../../../picbackup/defaultData.csv')
```

`read_csv2` is used here because the file uses `;` as the field separator.

## $\exp(\hat \beta)$
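As a reminder of why the coefficients are exponentiated (a standard property of logistic regression, not something specific to this data set): the model is linear in the log-odds, so a one-unit increase in $x_j$, with the other predictors held fixed, multiplies the odds by $\exp(\beta_j)$.

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
\qquad
\frac{\mathrm{odds}(x_j + 1)}{\mathrm{odds}(x_j)} = \exp(\beta_j)
$$

For example, a purely illustrative value of $\exp(\hat \beta_j) = 1.2$ would mean the odds of default rise by 20% per unit of $x_j$, while a value below 1 would mean they fall.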

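```{r}
library(broom)  # tidy()

# Build logistic regression model
logitModelFull <- glm(PaymentDefault ~ limitBal + sex + education + marriage + age +
                        pay1 + pay2 + pay3 + pay4 + pay5 + pay6 +
                        billAmt1 + billAmt2 + billAmt3 + billAmt4 + billAmt5 + billAmt6 +
                        payAmt1 + payAmt2 + payAmt3 + payAmt4 + payAmt5 + payAmt6,
                      family = binomial, data = defaultData)

# Coefficient table with the exponentiated estimates added
logitModelFull %>%
  tidy() %>%
  mutate(beta = exp(estimate))
```

The point is to understand what the exported `beta` column means: `beta = exp(estimate)` is the odds ratio, the factor by which the odds of default change for a one-unit increase in the predictor.

## stepwise method
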
```{r}
library(MASS)

# Build the new model
logitModelNew <- stepAIC(logitModelFull, trace = 0)

# Save the formula of the new model (it will be needed for the out-of-sample part)
formulaLogit <- as.formula(summary(logitModelNew)$call)
formulaLogit
```

## confusion matrix
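The accuracy computed in this section is the share of correctly classified cases, $(TP + TN) / N$, which is the diagonal of the confusion matrix divided by its total. A small base-R sketch with made-up vectors (`obs` and `prob` are hypothetical values, and the sketch assumes probabilities at or above the threshold are classified as 1):

```r
# Hypothetical observed classes and predicted probabilities
obs  <- c(0, 0, 1, 1, 1, 0)
prob <- c(0.2, 0.6, 0.7, 0.4, 0.9, 0.1)

# Classify with a 0.5 threshold
pred <- as.integer(prob >= 0.5)

# Confusion matrix: rows are predictions, columns are observations
cm <- table(pred = factor(pred, levels = 0:1),
            obs  = factor(obs,  levels = 0:1))

# Accuracy: correctly classified cases over all cases
acc <- sum(diag(cm)) / sum(cm)
acc
```

Four of the six cases are classified correctly here, so the accuracy is 4/6.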

```{r}
# Make predictions using the full Model
defaultData$predFull <- predict(logitModelFull, type = "response", na.action = na.exclude)

# Construct the in-sample confusion matrix
confMatrixModelFull <- SDMTools::confusion.matrix(defaultData$PaymentDefault,
                                                  defaultData$predFull,
                                                  threshold = 0.5)
confMatrixModelFull

# Calculate the accuracy for the full Model
accuracyFull <- sum(diag(confMatrixModelFull)) / sum(confMatrixModelFull)
accuracyFull
```

```{r}
# Calculate the accuracy for 'logitModelNew'
# Make prediction
defaultData$predNew <- predict(logitModelNew, type = "response", na.action = na.exclude)

# Construct the in-sample confusion matrix
confMatrixModelNew <- SDMTools::confusion.matrix(defaultData$PaymentDefault,
                                                 defaultData$predNew,
                                                 threshold = 0.5)
confMatrixModelNew

# Calculate the accuracy...
accuracyNew <- sum(diag(confMatrixModelNew)) / sum(confMatrixModelNew)
accuracyNew

# and compare it to the full model's accuracy
accuracyFull
```

## cross validation

Cross-validation is done with `cv.glm` from the boot package, which also accepts a user-defined cost function.
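`cv.glm` calls the cost function with two vectors, the observed responses and the predicted probabilities, and expects a single number back. A hedged sketch on simulated data (`toy`, `costAccBase` and the data-generating model are illustrative assumptions; the base-R accuracy stands in for the `SDMTools` call used in this section):

```r
library(boot)

# Simulated binary outcome with one predictor (hypothetical data)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + x))
toy <- data.frame(x = x, y = y)

fit <- glm(y ~ x, family = binomial, data = toy)

# Custom cost: accuracy at a 0.3 threshold
# r = observed 0/1 responses, pi = predicted probabilities
costAccBase <- function(r, pi = 0) {
  mean(as.integer(pi >= 0.3) == r)
}

# 5-fold cross-validation; delta is a vector of length two
delta <- cv.glm(toy, fit, cost = costAccBase, K = 5)$delta
delta
```

Because the cost function returns accuracy rather than an error, higher values of `delta` are better in this setup.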

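```{r}
library(boot)

# Accuracy function
costAcc <- function(r, pi = 0) {
  cm <- SDMTools::confusion.matrix(r, pi, threshold = 0.3)
  acc <- sum(diag(cm)) / sum(cm)
  return(acc)
}

# Cross validated accuracy for logitModelNew
set.seed(534381)
cv.glm(defaultData, logitModelNew, cost = costAcc, K = 6)$delta
```

`$delta` holds two values: the first is the raw cross-validated estimate (here an accuracy, because `costAcc` returns accuracy), and the second is the adjusted estimate, which compensates for the bias introduced by not using leave-one-out cross-validation.

# Time to Reorder with Survival Analysis
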
```{r}
dataNextOrder <-
    read_csv('../../../picbackup/survivalDataExercise.csv')
```

```{r}
# Plot a histogram
ggplot(dataNextOrder) +
  geom_histogram(aes(x = daysSinceFirstPurch, fill = factor(boughtAgain))) +
  facet_grid( ~ boughtAgain) +       # Separate plots for boughtAgain = 1 vs. 0
  theme(legend.position = "none")    # Don't show legend
```

There are more customers in the data who bought a second time.

Apart from that, the differences between the distributions are not very large.

The second panel (`boughtAgain = 1`) clearly has the higher counts.

## foundation

```{r}
library(survival)

# Create survival object
survObj <- Surv(dataNextOrder$daysSinceFirstPurch, dataNextOrder$boughtAgain)

# Look at structure
str(survObj)
```

In the output of `str(survObj)`, a `+` after a time marks a censored observation: the event (a second purchase) had not been observed by that time.

## adding more covariates[^covariate]

[^covariate]:
    In statistics, a covariate is a variable that is possibly predictive of the outcome under study. A covariate may be of direct interest or it may be a confounding or interacting variable.
    The alternative terms explanatory variable, independent variable, or predictor, are used in a regression analysis. In econometrics, the term "control variable" is usually used instead of "covariate". In a more specific usage, a covariate is a secondary variable that can affect the relationship between the dependent variable and other independent variables of primary interest.

This part mainly covers Kaplan-Meier analysis.

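The Kaplan-Meier estimator multiplies, over the observed event times $t_i \le t$, the share of at-risk customers who do not experience the event:

$$
\hat S(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right)
$$

where $d_i$ is the number of events at $t_i$ and $n_i$ is the number still at risk just before $t_i$. A by-hand sketch in base R with made-up data (`time` and `status` are hypothetical and not taken from `dataNextOrder`):

```r
# Hypothetical follow-up times and event indicators (1 = event, 0 = censored)
time   <- c(2, 3, 3, 5, 8)
status <- c(1, 1, 0, 1, 0)

# Distinct times at which an event was observed
event_times <- sort(unique(time[status == 1]))

# d_i: events at each event time; n_i: subjects still at risk at that time
d <- sapply(event_times, function(t) sum(time == t & status == 1))
n <- sapply(event_times, function(t) sum(time >= t))

# Kaplan-Meier survival estimate at each event time
surv <- cumprod(1 - d / n)
data.frame(time = event_times, n.risk = n, n.event = d, surv = surv)
```

Fitting `survfit(Surv(time, status) ~ 1)` from the survival package on the same vectors should report the same survival values at these event times.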
```{r}
# Compute and print fit
fitKMSimple <- survfit(survObj ~ 1)
print(fitKMSimple)

# Plot fit
plot(fitKMSimple,
     conf.int = FALSE, xlab = "Time since first purchase", ylab = "Survival function", main = "Survival function")

# Compute fit with categorical covariate
fitKMCov <- survfit(survObj ~ voucher, data = dataNextOrder)

# Plot fit with covariate and add labels
plot(fitKMCov, lty = 2:3,
     xlab = "Time since first purchase", ylab = "Survival function", main = "Survival function")
legend(90, .9, c("No", "Yes"), lty = 2:3)
```

Customers using a voucher seem to take longer to place their second order. Perhaps they are waiting for another voucher?

# Reducing Dimensionality with Principal Component Analysis