Course home page, handy for downloading the reference data.
Customer Lifetime Value
library(tidyverse) # read_csv(), dplyr verbs, ggplot2, %>%
salesData <-
read_csv('../../../picbackup/salesData.csv')
read_csv("
Variable,Description
id,identification number of customer
mostFreqStore,store person bought mostly from
mostFreqCat,category person purchased mostly
nCats,number of different categories
preferredBrand,brand person purchased mostly
nBrands,number of different brands", col_names = TRUE, skip = 1) %>%
knitr::kable()
Correlation matrix
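library(corrplot) # provides corrplot()
# Correlations between all numeric variables except the customer id
salesData %>% select_if(is.numeric) %>%
select(-id) %>%
cor() %>%
# as.data.frame()
corrplot()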
The corrplot() function works on the correlation matrix returned by cor().
Multicollinearity
library(car) # provides vif()
# Estimating the full model
salesModel1 <- lm(salesThisMon ~ . - id,
data = salesData)
# Reduced model without preferredBrand and nBrands
salesModel2 <- lm(salesThisMon ~ . - id - preferredBrand - nBrands,
data = salesData)
vif(salesModel1) %>%
as.data.frame() %>%
rename(`vif1`='.') %>%
rownames_to_column(var = 'type') %>%
left_join(
vif(salesModel2) %>%
as.data.frame() %>%
rename(`vif2`='.') %>%
rownames_to_column(var = 'type')
,by = 'type'
) %>%
mutate(delta_vif = vif1-vif2) %>%
ggplot(aes(y=delta_vif)) +
geom_boxplot()
For the definition of VIF, see here. The plot clearly shows that the VIFs dropped after removing preferredBrand and nBrands.
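For reference, the usual textbook definition of the variance inflation factor for a numeric predictor j is sketched below, where R_j^2 is the R-squared from regressing predictor j on all the other predictors. The symbols are notation introduced here, not taken from the course material.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
R_j^2 = 0 \;\Rightarrow\; \mathrm{VIF}_j = 1,
\qquad
R_j^2 = 0.9 \;\Rightarrow\; \mathrm{VIF}_j = 10
```

A predictor that is almost fully explained by the others has R_j^2 close to 1 and therefore a large VIF, which is why dropping strongly related variables lowers the VIFs of those that remain.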
Churn Prevention
defaultData <-
read_csv2('../../../picbackup/defaultData.csv')
read_csv2() is used because the fields in this file are separated by semicolons (;).
\(\exp(\hat \beta)\)
# Build logistic regression model
logitModelFull <- glm(PaymentDefault ~ limitBal + sex + education + marriage +
age + pay1 + pay2 + pay3 + pay4 + pay5 + pay6 + billAmt1 +
billAmt2 + billAmt3 + billAmt4 + billAmt5 + billAmt6 + payAmt1 +
payAmt2 + payAmt3 + payAmt4 + payAmt5 + payAmt6,
family = binomial, data = defaultData)
# exp(estimate) is the odds ratio for each predictor (stored here as beta)
logitModelFull %>%
broom::tidy() %>%
mutate(beta = exp(estimate))
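Interpreting the derived beta column: exp(estimate) is the odds ratio, i.e. the factor by which the odds of default are multiplied when the predictor increases by one unit and the others are held fixed.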
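The interpretation follows from the form of the logistic model. A short derivation, with p the probability of PaymentDefault = 1 and x_1, ..., x_k as generic placeholders for the predictors (notation assumed here):

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\quad\Longrightarrow\quad
\frac{\mathrm{odds}(x_j + 1)}{\mathrm{odds}(x_j)} = \exp(\beta_j),
\qquad
\mathrm{odds} = \frac{p}{1-p}
```

For example, a coefficient of 0.1 gives exp(0.1) ≈ 1.105, so the odds rise by about 10.5% per unit, while a coefficient of -0.2 gives exp(-0.2) ≈ 0.819, so the odds fall by about 18%.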
Stepwise selection
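library(MASS) # note: loading MASS after dplyr masks dplyr::select()
# Build the new model by stepwise AIC selection (trace = 0 suppresses the log)
logitModelNew <- stepAIC(logitModelFull, trace = 0)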
# Save the formula of the new model (it will be needed for the out-of-sample part)
formulaLogit <- as.formula(summary(logitModelNew)$call)
formulaLogit
confusion matrix
# Make predictions using the full Model
defaultData$predFull <- predict(logitModelFull, type = "response", na.action = na.exclude)
# Construct the in-sample confusion matrix
confMatrixModelFull <- SDMTools::confusion.matrix(defaultData$PaymentDefault,defaultData$predFull, threshold = 0.5)
confMatrixModelFull
# Calculate the accuracy for the full Model
accuracyFull <- sum(diag(confMatrixModelFull)) / sum(confMatrixModelFull)
accuracyFull
# Calculate the accuracy for 'logitModelNew'
# Make prediction
defaultData$predNew <- predict(logitModelNew, type = "response", na.action = na.exclude)
# Construct the in-sample confusion matrix
confMatrixModelNew <- SDMTools::confusion.matrix(defaultData$PaymentDefault,defaultData$predNew, threshold = 0.5)
confMatrixModelNew
# Calculate the accuracy...
accuracyNew <- sum(diag(confMatrixModelNew)) / sum(confMatrixModelNew)
accuracyNew
# and compare it to the full model's accuracy
accuracyFull
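If SDMTools cannot be installed (it appears to have been archived on CRAN), the same thresholded confusion matrix and accuracy can be sketched in base R. The helpers confusionAt() and accuracyAt() are hypothetical names introduced here, the toy vectors are made up, and treating a probability equal to the threshold as a predicted 1 is an assumption.

```r
# Hypothetical helper (not part of SDMTools): 2 x 2 confusion matrix at a threshold
confusionAt <- function(obs, pred, threshold = 0.5) {
  # Assumption: probabilities >= threshold are classified as 1
  predClass <- factor(as.integer(pred >= threshold), levels = c(0, 1))
  obsClass <- factor(obs, levels = c(0, 1))
  table(pred = predClass, obs = obsClass)
}

# Hypothetical helper: share of observations on the diagonal
accuracyAt <- function(obs, pred, threshold = 0.5) {
  cm <- confusionAt(obs, pred, threshold)
  sum(diag(cm)) / sum(cm)
}

# Toy data, not taken from defaultData
obs <- c(0, 0, 1, 1, 1)
pred <- c(0.1, 0.6, 0.4, 0.7, 0.9)

confusionAt(obs, pred)
accuracyAt(obs, pred) # 3 of the 5 toy cases are classified correctly
```

With the real data the calls would take defaultData$PaymentDefault and defaultData$predFull (or predNew) in place of the toy vectors.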
cross validation
Cross-validation is done with cv.glm() from the boot package, which also accepts a user-defined cost function.
library(boot)
# Accuracy function
costAcc <- function(r, pi = 0) {
cm <- SDMTools::confusion.matrix(r, pi, threshold = 0.3)
acc <- sum(diag(cm)) / sum(cm)
return(acc)
}
# Cross validated accuracy for logitModelNew
set.seed(534381)
cv.glm(defaultData, logitModelNew, cost = costAcc, K = 6)$delta
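In $delta, the first value is the raw cross-validation estimate of the cost (here, accuracy) and the second is the adjusted estimate, which corrects for the bias introduced by using K folds instead of leave-one-out.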
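As a rough illustration of what a raw K-fold estimate measures, the loop below runs the cross-validation by hand on simulated data. Everything here (the toy data frame, accAt(), the fold assignment) is an assumption made for the sketch, not the course's code, and it does not reproduce the adjusted estimate.

```r
# Toy data (simulated, not defaultData): one predictor, binary outcome
set.seed(1)
n <- 300
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(0.5 + 1.2 * toy$x))

# Accuracy at a fixed threshold (same idea as costAcc, written in base R)
accAt <- function(obs, prob, threshold = 0.3) {
  mean(as.integer(prob >= threshold) == obs)
}

# Assign each row to one of K folds at random
K <- 6
fold <- sample(rep(seq_len(K), length.out = n))

# Fit on K - 1 folds, score on the held-out fold
foldAcc <- sapply(seq_len(K), function(k) {
  fit <- glm(y ~ x, family = binomial, data = toy[fold != k, ])
  prob <- predict(fit, newdata = toy[fold == k, ], type = "response")
  accAt(toy$y[fold == k], prob)
})

# Weight each fold's accuracy by its share of the rows
foldSize <- as.numeric(table(fold))
rawCV <- sum(foldAcc * foldSize / n)
rawCV
```

Every accuracy in foldAcc is computed on rows the model did not see during fitting, so the estimate is out-of-sample throughout.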
Time to Reorder with Survival Analysis
dataNextOrder <-
read_csv('../../../picbackup/survivalDataExercise.csv')
# Plot a histogram
ggplot(dataNextOrder) +
geom_histogram(aes(x = daysSinceFirstPurch,
fill = factor(boughtAgain))) +
facet_grid( ~ boughtAgain) + # Separate plots for boughtAgain = 1 vs. 0
theme(legend.position = "none") # Don't show legend
- There are more customers in the data who bought a second time.
- Apart from that, the differences between the distributions are not very large.
The second panel (boughtAgain = 1) clearly has the higher counts.
foundation
library(survival)
# Create survival object
survObj <- Surv(dataNextOrder$daysSinceFirstPurch
,dataNextOrder$boughtAgain)
# Look at structure
str(survObj)
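In the output of str(survObj), a + after a time marks a censored observation: the customer had not yet placed a second order at that point. Times without a + are observed events.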
adding more covariates [1]
This part focuses on Kaplan-Meier analysis.
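For orientation, the Kaplan-Meier estimate of the survival function can be written as below, where t_i are the distinct event times, d_i is the number of events (second orders) at t_i, and n_i is the number of customers still at risk just before t_i. The symbols are notation assumed here.

```latex
\hat S(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right)
```

A made-up example: with 10 customers at risk and 2 reorders on day 5, the estimate at day 5 is 1 - 2/10 = 0.8. If one customer is then censored and 1 of the remaining 7 reorders on day 8, the estimate at day 8 is 0.8 × 6/7 ≈ 0.686. Censored customers lower n_i without counting as events.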
# Compute and print fit
fitKMSimple <- survfit(survObj ~ 1)
print(fitKMSimple)
# Plot fit
plot(fitKMSimple,
conf.int = FALSE, xlab = "Time since first purchase", ylab = "Survival function", main = "Survival function")
# Compute fit with categorical covariate
fitKMCov <- survfit(survObj ~ voucher, data = dataNextOrder)
# Plot fit with covariate and add labels
plot(fitKMCov, lty = 2:3,
xlab = "Time since first purchase", ylab = "Survival function", main = "Survival function")
legend(90, .9, c("No", "Yes"), lty = 2:3)
Customers using a voucher seem to take longer to place their second order. They are maybe waiting for another voucher?
Reducing Dimensionality with Principal Component Analysis
[1] In statistics, a covariate is a variable that is possibly predictive of the outcome under study. A covariate may be of direct interest or it may be a confounding or interacting variable. The alternative terms explanatory variable, independent variable, or predictor, are used in a regression analysis. In econometrics, the term “control variable” is usually used instead of “covariate”. In a more specific usage, a covariate is a secondary variable that can affect the relationship between the dependent variable and other independent variables of primary interest.