
Marketing Analytics in R: Statistical Modeling (study notes)

The course home page is a convenient place to download the reference data.

Customer Lifetime Value

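library(tidyverse)  # read_csv, dplyr verbs, ggplot2, rownames_to_column
library(knitr)      # kable
library(corrplot)   # corrplot
library(broom)      # tidy
library(rms)        # vif (assumption: the notes do not say which package provides vif)

salesData <- 
    read_csv('../../../picbackup/salesData.csv')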
read_csv("
Variable,Description
id,identification number of customer
mostFreqStore,store person bought mostly from
mostFreqCat,category person purchased mostly
nCats,number of different categories
preferredBrand,brand person purchased mostly
nBrands,number of different brands",col_names=T,skip=1) %>% 
    kable()

Correlation matrix

salesData %>% select_if(is.numeric) %>%
    select(-id) %>%
    cor() %>% 
    # as.data.frame()
    corrplot()

The corrplot function works on the correlation matrix returned by cor().

Multicollinearity

# Estimating the full model
salesModel1 <- lm(salesThisMon ~ . - id, 
                 data = salesData)
salesModel2 <- lm(salesThisMon ~ . - id - preferredBrand - nBrands, 
                 data = salesData)

vif(salesModel1) %>% 
    as.data.frame() %>% 
    rename(`vif1`='.') %>% 
    rownames_to_column(var = 'type') %>% 
    left_join(
        vif(salesModel2) %>% 
            as.data.frame() %>% 
            rename(`vif2`='.') %>% 
            rownames_to_column(var = 'type'),
        by = 'type'
    ) %>% 
    mutate(delta_vif = vif1-vif2) %>% 
    ggplot(aes(y=delta_vif)) + 
        geom_boxplot()

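See here for the definition of VIF. The plot clearly shows that the VIFs drop once preferredBrand and nBrands are removed from the model.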
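As a reminder, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing that predictor on all the other predictors. A minimal sketch in base R follows. It uses the built-in mtcars data as a stand-in for salesData, and the chosen columns and the name vif_manual are illustrative only.

```r
# VIF computed by hand: regress each predictor on the remaining predictors.
# mtcars and the chosen columns are stand-ins, not the course data.
predictors <- c("disp", "hp", "wt", "qsec")

vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

vif_manual
```

A predictor that is well explained by the others has R_j^2 close to 1 and therefore a large VIF, which is why dropping one of two closely related variables lowers the VIF of the one that stays.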

Churn Prevention

defaultData <- 
    read_csv2('../../../picbackup/defaultData.csv')

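read_csv2 is used because the fields in this file are separated by semicolons (;). The function also expects a comma as the decimal mark.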

\(\exp(\hat \beta)\)

# Build logistic regression model
logitModelFull <- glm(PaymentDefault ~ limitBal + sex + education + marriage +
                   age + pay1 + pay2 + pay3 + pay4 + pay5 + pay6 + billAmt1 + 
                   billAmt2 + billAmt3 + billAmt4 + billAmt5 + billAmt6 + payAmt1 + 
                   payAmt2 + payAmt3 + payAmt4 + payAmt5 + payAmt6, 
                family = binomial, data = defaultData)

logitModelFull %>% 
    tidy() %>% 
    mutate(beta = exp(estimate))

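Make sure you understand what the derived beta column means: it holds exp(estimate), the odds ratio, i.e. the factor by which the odds of a payment default are multiplied when the predictor increases by one unit.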
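To make the odds-ratio reading of \(\exp(\hat \beta)\) concrete, here is a small check in base R. It is only a sketch: mtcars stands in for defaultData, and the model am ~ wt is chosen purely for illustration.

```r
# Logistic regression on a built-in data set (a stand-in for defaultData).
fit <- glm(am ~ wt, family = binomial, data = mtcars)

# Odds implied by the model at wt = 3 and wt = 4 (one unit apart).
p <- predict(fit, newdata = data.frame(wt = c(3, 4)), type = "response")
odds <- p / (1 - p)

# The ratio of the two odds equals exp(beta) for wt.
unname(odds[2] / odds[1])
unname(exp(coef(fit)["wt"]))
```

Both printed values agree. A value below 1 means the odds shrink as the predictor grows, and a value above 1 means they increase.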

Stepwise method

library(MASS)

#Build the new model
logitModelNew <- stepAIC(logitModelFull,trace = 0) 

# Save the formula of the new model (it will be needed for the out-of-sample part) 
formulaLogit <- as.formula(summary(logitModelNew)$call)
formulaLogit
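The same stepwise idea can be tried on a small built-in data set. This is a sketch under assumptions: mtcars and a linear model replace the course data and the logistic model, and the names full and reduced are made up for the example.

```r
# Backward stepwise selection by AIC on a built-in data set (stand-in example).
full <- lm(mpg ~ ., data = mtcars)

# MASS::stepAIC is called with the namespace prefix so that MASS::select
# does not mask dplyr::select in a tidyverse session.
reduced <- MASS::stepAIC(full, trace = 0)

formula(reduced)                             # formula of the selected model
c(full = AIC(full), reduced = AIC(reduced))  # the selected model has the lower AIC
```

Calling library(MASS) after the tidyverse masks dplyr::select, so later select() calls may need to be written as dplyr::select().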

confusion matrix

# Make predictions using the full Model
defaultData$predFull <- predict(logitModelFull, type = "response", na.action = na.exclude)

# Construct the in-sample confusion matrix
confMatrixModelFull <- SDMTools::confusion.matrix(defaultData$PaymentDefault,defaultData$predFull, threshold = 0.5)
confMatrixModelFull

# Calculate the accuracy for the full Model
accuracyFull <- sum(diag(confMatrixModelFull)) / sum(confMatrixModelFull)
accuracyFull
# Calculate the accuracy for 'logitModelNew'
# Make prediction
defaultData$predNew <- predict(logitModelNew, type = "response", na.action = na.exclude)

# Construct the in-sample confusion matrix
confMatrixModelNew <- SDMTools::confusion.matrix(defaultData$PaymentDefault,defaultData$predNew, threshold = 0.5)
confMatrixModelNew

# Calculate the accuracy...
accuracyNew <- sum(diag(confMatrixModelNew)) / sum(confMatrixModelNew)
accuracyNew

# and compare it to the full model's accuracy
accuracyFull
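If SDMTools is not available (it may no longer be installable from CRAN), a confusion matrix and the accuracy can be built with base R alone. The sketch below uses made-up labels and probabilities, and the names actual, prob and cm are illustrative.

```r
# Toy labels and predicted probabilities (made up for illustration).
actual <- c(0, 0, 1, 1, 1, 0)
prob   <- c(0.2, 0.6, 0.7, 0.4, 0.9, 0.1)

# Classify as 1 when the predicted probability reaches the threshold.
threshold <- 0.5
predicted <- as.integer(prob >= threshold)

# Rows: predicted class, columns: observed class.
cm <- table(
  pred = factor(predicted, levels = c(0, 1)),
  obs  = factor(actual, levels = c(0, 1))
)
cm

# Accuracy: share of observations on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

Fixing the factor levels keeps the table 2 x 2 even when one class is never predicted, so diag() always picks the correct cells.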

cross validation

Cross-validation is done with cv.glm from the boot package, which also accepts a custom cost function.

library(boot)
# Accuracy function
costAcc <- function(r, pi = 0) {
  cm <- SDMTools::confusion.matrix(r, pi, threshold = 0.3)
  acc <- sum(diag(cm)) / sum(cm)
  return(acc)
}

# Cross validated accuracy for logitModelNew
set.seed(534381)
cv.glm(defaultData, logitModelNew, cost = costAcc, K = 6)$delta

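The first element of $delta is the raw cross-validation estimate of the cost (here, the accuracy). The second is the adjusted estimate, which corrects for the bias introduced by not using leave-one-out cross-validation. Neither of them is a training-set score.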
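The shape of the cv.glm result can be inspected on a small example. This sketch assumes a Gaussian GLM on mtcars instead of the course model, and costMAE is a hypothetical cost function written only for this illustration.

```r
library(boot)

# Gaussian GLM on a built-in data set (stand-in for the course model).
fit <- glm(mpg ~ wt + hp, data = mtcars)

# Custom cost: mean absolute error instead of the default squared error.
costMAE <- function(obs, pred = 0) mean(abs(obs - pred))

set.seed(1)
cvRes <- cv.glm(mtcars, fit, cost = costMAE, K = 4)

# delta[1]: raw K-fold estimate of the cost
# delta[2]: bias-adjusted estimate
cvRes$delta
```

A cost function takes the observed responses as its first argument and the predicted responses as its second, and returns a single non-negative number.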

Time to Reorder with Survival Analysis

dataNextOrder <- 
    read_csv('../../../picbackup/survivalDataExercise.csv')
# Plot a histogram
ggplot(dataNextOrder) +
  geom_histogram(aes(x = daysSinceFirstPurch,
                     fill = factor(boughtAgain))) +
  facet_grid( ~ boughtAgain) + # Separate plots for boughtAgain = 1 vs. 0
  theme(legend.position = "none") # Don't show legend
  • There are more customers in the data who bought a second time.
  • Apart from that, the differences between the distributions are not very large.

The second panel (boughtAgain = 1) clearly has higher counts.

foundation

library(survival)
# Create survival object
survObj <- Surv(dataNextOrder$daysSinceFirstPurch
                ,dataNextOrder$boughtAgain)

# Look at structure
str(survObj)

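In the output of str(survObj), a trailing + marks a censored observation, i.e. a customer whose second order has not been observed (boughtAgain = 0). Times without a + are customers who did reorder.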
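The + notation is easy to see on a tiny hand-made survival object. The two customers below are invented for illustration.

```r
library(survival)

# Two toy customers: the first reordered after 5 days (event observed),
# the second had not reordered by day 10 (censored).
toy <- Surv(time = c(5, 10), event = c(1, 0))

print(toy)   # the censored time is shown with a trailing "+"
```

Only the second entry carries the +, because its event indicator is 0.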

adding more covariates [1]

This part mainly covers Kaplan-Meier analysis.

# Compute and print fit
fitKMSimple <- survfit(survObj ~ 1)
print(fitKMSimple)

# Plot fit
plot(fitKMSimple,
     conf.int = FALSE, xlab = "Time since first purchase", ylab = "Survival function", main = "Survival function")

# Compute fit with categorical covariate
fitKMCov <- survfit(survObj ~ voucher, data = dataNextOrder)

# Plot fit with covariate and add labels
plot(fitKMCov, lty = 2:3,
     xlab = "Time since first purchase", ylab = "Survival function", main = "Survival function")
legend(90, .9, c("No", "Yes"), lty = 2:3)

Customers using a voucher seem to take longer to place their second order. Perhaps they are waiting for another voucher?
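The curves produced by survfit() above are Kaplan-Meier estimates: at each time point the survival probability is multiplied by (1 - d_i / n_i), where d_i is the number of events and n_i the number still at risk. A minimal sketch on invented data, assuming distinct and already sorted times, reproduces the estimate by hand.

```r
library(survival)

# Toy data: event at t = 1, censored at t = 2, events at t = 3 and t = 4.
# The times are distinct and sorted, which keeps the hand computation short.
time   <- c(1, 2, 3, 4)
status <- c(1, 0, 1, 1)

# Kaplan-Meier by hand: S(t) = product over time points of (1 - d_i / n_i).
atRisk <- length(time):1          # n_i: still under observation just before t_i
kmHand <- cumprod(1 - status / atRisk)

# The same estimate from survfit().
kmFit <- survfit(Surv(time, status) ~ 1)

kmHand
kmFit$surv
```

The censored observation at t = 2 leaves the curve unchanged at that time but reduces the number at risk afterwards, so the drop at t = 3 is larger than it would be otherwise.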

Reducing Dimensionality with Principal Component Analysis


  1. In statistics, a covariate is a variable that is possibly predictive of the outcome under study. A covariate may be of direct interest or it may be a confounding or interacting variable. The alternative terms explanatory variable, independent variable, or predictor, are used in a regression analysis. In econometrics, the term “control variable” is usually used instead of “covariate”. In a more specific usage, a covariate is a secondary variable that can affect the relationship between the dependent variable and other independent variables of primary interest.