This post was updated on 2026-03-28. If you spot a problem or have a suggestion, feel free to open an Issue.
The current ranking is in the top 6%.

Preface
Plan
- Standardization
- Box plots
- Outlier handling with fct_lump: levels that appear in less than 1% of the observations
- How to use fct_lump for binning (see the sketch below)
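A minimal sketch of the fct_lump idea on toy data (the 1% threshold mirrors the plan above; the data are made up):
library(forcats)
x <- factor(c(rep("a", 120), rep("b", 78), "c", "d"))
# levels appearing in less than 1% of the observations are collapsed into "Other"
table(fct_lump(x, prop = 0.01))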
Changelog
- Updated 9 References
- 9 References
- 8 Ensembling
- Summary of the latest results, reg:gamma in Python
- 5 Pool Quality and Pool Area
- 4 Power Transformation for y
- 3.2 Importance plot
- 2 EDA, update complete
- 3 naive xgboost
- 2.2 dashboard: added field descriptions
- 2.3 proportion of y
- 1.2.1 usage notes
EDA
library(tidyverse)
library(formattable)
library(skimr)
library(DT)
library(readxl)
library(xgboost)
library(SmartEDA)
library(tidyquant)
library(pryr)
theme_ilo <- function(){
theme_minimal() +
theme(
# text = element_text(family = "Bookman", color = "gray25"),
plot.subtitle = element_text(size = 9),
plot.caption = element_text(color = "gray30"),
# plot.background = element_rect(fill = "gray95"),
plot.margin = unit(c(5, 10, 5, 10), units = "mm"),
axis.title.x = element_text(size = 12, face = "bold"),
axis.title.y = element_text(size = 12, face = "bold"),
# bold the x and y axis titles so they read well on slides
axis.text.x = element_text(size = 7, angle = 70, hjust = 1),
# size-7 axis text works best
axis.text.y = element_text(size = 7),
legend.title = element_blank()
)
}

get_path <- function(x){file.path(getwd(), "required_data", x)}
train <- read_csv(get_path("train.csv")) %>%
rename(v1stFlrSF = `1stFlrSF`,
v2ndFlrSF = `2ndFlrSF`,
v3SsnPorch = `3SsnPorch`)
test <- read_csv(get_path("test.csv")) %>%
rename(v1stFlrSF = `1stFlrSF`,
v2ndFlrSF = `2ndFlrSF`,
v3SsnPorch = `3SsnPorch`)
eda_data <- train %>% bind_rows(test)

The variables were renamed.1
Overview
The EDA mainly draws on the SmartEDA package review and the Tidyverse tips posts.
Continuous variables
table
ExpNumStat(train, by = "A", gp = "SalePrice", Qnt = seq(0, 1, 0.1), MesofShape = 2, Outlier = TRUE, round = 2) %>%
mutate_at(vars(Per_of_Missing), percent) %>% datatable()

- Vname – variable name
- Group – target variable
- TN – total sample (including NA observations)
- nNeg – total negative observations
- nZero – total zero observations
- nPos – total positive observations
- NegInf – negative infinite count
- PosInf – positive infinite count
- NA_value – missing-value count
- Per_of_Missing – percentage of missing values
- Min – minimum value
- Max – maximum value
- Mean – average value
- Median – median value
- SD – standard deviation
- CV – coefficient of variation, (SD/mean) × 100
- IQR – interquartile range; the quartile deviation is $QD = \frac{Q3-Q1}{2}$
- Qnt – specified quantiles
- MesofShape – skewness and kurtosis
- Outlier – number of outliers
- Cor – correlation between the target and the independent variables
Categorical variables
table
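The body of this subsection is not shown; SmartEDA's categorical counterpart to ExpNumStat is ExpCTable, so it presumably ran something along these lines (the argument values are my guesses, not the original):
ExpCTable(train, margin = 1, clim = 10, nlim = 5, round = 2) %>% datatable()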
WOE and IV
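For reference, WOE and IV are defined for a binary target (so SalePrice would need to be binarized first). For bin $i$:
$\mathrm{WOE}_i = \ln\frac{\%\,\mathrm{good}_i}{\%\,\mathrm{bad}_i}$, $\quad \mathrm{IV} = \sum_i \left(\%\,\mathrm{good}_i - \%\,\mathrm{bad}_i\right)\mathrm{WOE}_i$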
dashboard
dashboard <-
bind_rows(
skim_to_wide(train) %>% mutate(tag = 'train'),
skim_to_wide(test) %>% mutate(tag = 'test')
) %>%
left_join(comment, by = c("variable"="var")) %>%
select(tag,hist, everything())
dashboard %>% datatable()

The way the skim results are displayed follows a Stack Overflow answer; the field descriptions come from magicyang.
Usage
The table is fairly wide, so drag the page to the right to see the histogram column.

Proportion of y
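The body of this section is not shown; a minimal sketch of how the distribution of SalePrice could be inspected (my code, not the original):
train %>%
ggplot(aes(x = SalePrice)) +
geom_histogram(bins = 50) +
theme_ilo() +
labs(x = "SalePrice", y = "count", title = "distribution of SalePrice")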
naive xgboost
get_watchlist <- function(data){
set.seed(456)
index_train <- sample(1:nrow(data),0.80*nrow(data))
train <- data[index_train,] %>% select(-Id)
test <- data[-index_train,] %>% select(-Id)
get_dmatrix <- function(data){
x <- data %>% select(-SalePrice)
x_mtx <- data.matrix(x)
y <- data$SalePrice
ddata <- xgb.DMatrix(data = x_mtx,label=y)
return(ddata)
}
dtrain <- get_dmatrix(train)
dtest <- get_dmatrix(test)
watchlist <- list(train = dtrain, test = dtest)
return(list(
train = train, test = test,
dtrain = dtrain, dtest = dtest, watchlist = watchlist))
}

# assumed: the watchlist for the naive model is built directly from train
train_xgb <- get_watchlist(train)

train_xgb_mod <- xgb.train(
data = train_xgb$dtrain,
# 1
eta = 0.1,
nround = 2000,
# 2
max_depth = 7,
min_child_weight = 17,
gamma = 0.72,
# 3
subsample = 0.8,
colsample_bytree = 0.95,
# evaluation metric
# eval.metric = "error",
eval.metric = "rmse",
# eval.metric = ks_value,
# eval.metric = "auc",
# eval.metric = "logloss",
# objective
objective = "reg:linear", # this is a regression problem
# other
seed = 596,
watchlist = train_xgb$watchlist,
# use all 3 million rows at once!
nfold = 2,
early.stop = 50,
nthread = 8
)

# assumed: DMatrix for the Kaggle test set, mirroring the pattern used in later sections
dtest <- xgb.DMatrix(data = data.matrix(test %>% select(-Id)))

sm_ljx_180525 <-
tibble(
Id = test$Id,
SalePrice = predict(train_xgb_mod, dtest)
) %>%
write_csv(get_path("sm_ljx_180525.csv")) %>%
select(everything())

lift curve
bind_rows(
tibble(
y = train_xgb$train$SalePrice,
yhat = predict(train_xgb_mod, train_xgb$dtrain)
) %>%
mutate(tag = 'train'),
tibble(
y = train_xgb$test$SalePrice,
yhat = predict(train_xgb_mod, train_xgb$dtest)
) %>%
mutate(tag = 'test')
) %>%
mutate(yhat_bin = ntile(yhat,20)) %>%
group_by(tag,yhat_bin) %>%
summarise(y = mean(y), yhat = mean(yhat)) %>%
gather(key,value,y:yhat) %>%
ggplot(aes(x = yhat_bin,y = value, col = key)) +
geom_line() +
facet_wrap(~ tag) +
theme_ilo() +
labs(
x = "predicted binned value",
y = "actual value",
title = "lift curve on train and validation set",
subtitle = "binning by ntile function in ggplot2 package",
caption = "Jiaxiang Li - jiaxiangli.netlify.com"
)

- The theme_ilo function follows the ggplot2 tips post.
- The dev set is overfitting (see the training model tips); otherwise the fit looks good, but the Kaggle score is not high.
Importance plot
xgb.importance(feature_names = train %>% select(-Id,-SalePrice) %>% names(),
model = train_xgb_mod) %>%
xgb.plot.importance()

For an explanation of the importance measures, see the post on understanding xgboost.
Results
Your submission scored 0.14340, which is not an improvement of your best score. Keep trying!
Power Transformation for y
For the derivation, see the training model tips.
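For reference, the Box-Cox transformation is $y^{(\lambda)} = \frac{y^{\lambda}-1}{\lambda}$ for $\lambda \neq 0$, and $y^{(0)} = \log y$. The code below searches a grid of λ for the value that maximizes the profile log-likelihood returned by MASS::boxcox.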
Finding the best λ
boxcox_lambda <- MASS::boxcox(SalePrice ~ 1, data = train, lambda = seq(-0.25, 0.25, length = 10))
boxcox_lambda_best <- boxcox_lambda %>% as.tibble() %>%
filter(y == max(y)) %>%
.$x
boxcox_lambda_best

SP_cb <-
train %>%
select(SalePrice) %>%
mutate(tag = 'original') %>%
bind_rows(
train %>%
# mutate(SalePrice = log(SalePrice),
# tag = 'transformed') %>%
mutate(comp = SalePrice^boxcox_lambda_best,
SalePrice = comp/(comp-1),
tag = 'transformed') %>%
select(SalePrice,tag)
)
SP_cb %>%
group_by(tag) %>%
summarise(
mean = mean(SalePrice),
sd = sd(SalePrice),
skew = skewness(SalePrice),
kurt = kurtosis(SalePrice)
) %>%
mutate_if(is.double, accounting)
SP_cb %>%
ggplot(aes(x = SalePrice, col = tag)) +
geom_freqpoly() +
facet_wrap(~ tag, scales = "free")

All four statistics improve, so the transformation is effective.
naive xgboost
train_xgb_trans <- get_watchlist(
train %>%
# mutate(SalePrice = SalePrice^boxcox_lambda_best/(SalePrice^boxcox_lambda_best-1))
mutate(SalePrice = log(SalePrice))
)

train_xgb_trans_mod <- xgb.train(
data = train_xgb_trans$dtrain,
# 1
eta = 0.1,
nround = 2000,
# 2
max_depth = 7,
min_child_weight = 17,
gamma = 0.72,
# 3
subsample = 0.8,
colsample_bytree = 0.95,
# evaluation metric
# eval.metric = "error",
eval.metric = "rmse",
# eval.metric = ks_value,
# eval.metric = "auc",
# eval.metric = "logloss",
# objective
objective = "reg:linear", # this is a regression problem
# other
seed = 596,
watchlist = train_xgb_trans$watchlist,
# use all 3 million rows at once!
nfold = 2,
early.stop = 50,
nthread = 8
)

sm_ljx_180527 <-
tibble(
Id = test$Id,
SalePrice = predict(train_xgb_trans_mod, dtest)
) %>%
# mutate(SalePrice = (SalePrice/(SalePrice-1))^(1/boxcox_lambda_best)) %>%
mutate(SalePrice = exp(SalePrice)) %>%
write_csv(get_path("sm_ljx_180527.csv")) %>%
select(everything())

lift curve
bind_rows(
tibble(
y = train_xgb_trans$train$SalePrice,
yhat = predict(train_xgb_trans_mod, train_xgb_trans$dtrain)
) %>%
mutate(tag = 'train'),
tibble(
y = train_xgb_trans$test$SalePrice,
yhat = predict(train_xgb_trans_mod, train_xgb_trans$dtest)
) %>%
mutate(tag = 'test')
) %>%
mutate(yhat_bin = ntile(yhat,20)) %>%
group_by(tag,yhat_bin) %>%
summarise(y = mean(y), yhat = mean(yhat)) %>%
gather(key,value,y:yhat) %>%
ggplot(aes(x = yhat_bin,y = value, col = key)) +
geom_line() +
facet_wrap(~ tag) +
theme_ilo() +
labs(
x = "predicted binned value",
y = "actual value",
title = "lift curve on train and validation set",
subtitle = "binning by ntile function in ggplot2 package",
caption = "Jiaxiang Li - jiaxiangli.netlify.com"
)

- The theme_ilo function follows the ggplot2 tips post.
- The dev set is overfitting (see the training model tips); otherwise the fit looks good, but the Kaggle score is not high.
Importance plot
xgb.importance(feature_names = train %>% select(-Id,-SalePrice) %>% names(),
model = train_xgb_trans_mod) %>%
xgb.plot.importance()

For an explanation of the importance measures, see the post on understanding xgboost.
Results
Your submission scored 0.15594, which is not an improvement of your best score. Keep trying!
The score did not improve, so y does not need a transformation for the tree model.
Pool Quality and Pool Area
eda_data %>%
select(PoolQC,PoolArea) %>%
group_by(PoolQC,PoolArea) %>%
count() %>%
spread(PoolQC,n) %>%
filter(!is.na(`<NA>`))

PoolArea = 0 means the house has no pool, so no PoolQC can be given; treating those 2906 missing values as a level of their own is fine.
However, three of the missing PoolQC values do have a PoolArea, so they were presumably left out by mistake.
[@Owen2017]
These three missing values should therefore be set to one of Ex, Fa, or Gd.
The mean PoolArea of each of the three categories is shown above; assigning each pool to the category with the nearest mean gives either
Ex, Fa, Fa or Ex, Ex, Fa.
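The by-category means referred to above are not shown; they were presumably computed along these lines (my sketch):
eda_data %>%
filter(!is.na(PoolQC)) %>%
group_by(PoolQC) %>%
summarise(mean_PoolArea = mean(PoolArea))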
Ex, Fa, Fa
eda_data_ipt_PQC <-
eda_data %>%
# filter(PoolArea %in% c(368,444,561)) %>%
mutate(PoolQC =
case_when(
PoolArea == 368 ~ 'Ex',
PoolArea == 444 ~ 'Fa',
PoolArea == 561 ~ 'Fa',
TRUE ~ PoolQC
))

# assumed: rebuild the watchlist on the training rows of the imputed data
eda_data_ipt_PQC_xgb <- get_watchlist(eda_data_ipt_PQC %>% semi_join(train, by = "Id"))

eda_data_ipt_PQC_xgb_mod <- xgb.train(
data = eda_data_ipt_PQC_xgb$dtrain,
# 1
eta = 0.1,
nround = 2000,
# 2
max_depth = 7,
min_child_weight = 17,
gamma = 0.72,
# 3
subsample = 0.8,
colsample_bytree = 0.95,
# evaluation metric
# eval.metric = "error",
eval.metric = "rmse",
# eval.metric = ks_value,
# eval.metric = "auc",
# eval.metric = "logloss",
# objective
objective = "reg:linear", # this is a regression problem
# other
seed = 596,
watchlist = eda_data_ipt_PQC_xgb$watchlist,
# use all 3 million rows at once!
nfold = 2,
early.stop = 50,
nthread = 8
)

dtest <- xgb.DMatrix(data = data.matrix(
eda_data_ipt_PQC %>%
semi_join(test,by = "Id") %>%
select(-Id)))

sm_ljx_180528 <-
tibble(
Id = test$Id,
SalePrice = predict(eda_data_ipt_PQC_xgb_mod, dtest)
) %>%
write_csv(get_path("sm_ljx_180528.csv"))
sm_ljx_180528

lift curve
bind_rows(
tibble(
y = eda_data_ipt_PQC_xgb$train$SalePrice,
yhat = predict(eda_data_ipt_PQC_xgb_mod, eda_data_ipt_PQC_xgb$dtrain)
) %>%
mutate(tag = 'train'),
tibble(
y = eda_data_ipt_PQC_xgb$test$SalePrice,
yhat = predict(eda_data_ipt_PQC_xgb_mod, eda_data_ipt_PQC_xgb$dtest)
) %>%
mutate(tag = 'test')
) %>%
mutate(yhat_bin = ntile(yhat,20)) %>%
group_by(tag,yhat_bin) %>%
summarise(y = mean(y), yhat = mean(yhat)) %>%
gather(key,value,y:yhat) %>%
ggplot(aes(x = yhat_bin,y = value, col = key)) +
geom_line() +
facet_wrap(~ tag) +
theme_ilo() +
labs(
x = "predicted binned value",
y = "actual value",
title = "lift curve on train and validation set",
subtitle = "binning by ntile function in ggplot2 package",
caption = "Jiaxiang Li - jiaxiangli.netlify.com"
)

- The theme_ilo function follows the ggplot2 tips post.
- The dev set is overfitting (see the training model tips); otherwise the fit looks good, but the Kaggle score is not high.
Importance plot
xgb.importance(feature_names = train %>% select(-Id,-SalePrice) %>% names(),
model = eda_data_ipt_PQC_xgb_mod) %>%
xgb.plot.importance()

For an explanation of the importance measures, see the post on understanding xgboost.
Results
Your submission scored 0.14340, which is not an improvement of your best score. Keep trying!
The score did not improve; this imputation did not change the predictions.
Ex, Ex, Fa
eda_data_ipt_PQC2 <-
eda_data %>%
# filter(PoolArea %in% c(368,444,561)) %>%
mutate(PoolQC =
case_when(
PoolArea == 368 ~ 'Ex',
PoolArea == 444 ~ 'Ex',
PoolArea == 561 ~ 'Fa',
TRUE ~ PoolQC
))

# assumed: rebuild the watchlist on the training rows of this second imputation
eda_data_ipt_PQC2_xgb <- get_watchlist(eda_data_ipt_PQC2 %>% semi_join(train, by = "Id"))

eda_data_ipt_PQC2_xgb_mod <- xgb.train(
data = eda_data_ipt_PQC2_xgb$dtrain,
# 1
eta = 0.1,
nround = 2000,
# 2
max_depth = 7,
min_child_weight = 17,
gamma = 0.72,
# 3
subsample = 0.8,
colsample_bytree = 0.95,
# evaluation metric
# eval.metric = "error",
eval.metric = "rmse",
# eval.metric = ks_value,
# eval.metric = "auc",
# eval.metric = "logloss",
# objective
objective = "reg:linear", # this is a regression problem
# other
seed = 596,
watchlist = eda_data_ipt_PQC2_xgb$watchlist,
# use all 3 million rows at once!
nfold = 2,
early.stop = 50,
nthread = 8
)

dtest <- xgb.DMatrix(data = data.matrix(
eda_data_ipt_PQC2 %>%
semi_join(test,by = "Id") %>%
select(-Id)))

sm_ljx_180528_02 <-
tibble(
Id = test$Id,
SalePrice = predict(eda_data_ipt_PQC2_xgb_mod, dtest)
) %>%
write_csv(get_path("sm_ljx_180528_02.csv"))
sm_ljx_180528_02

lift curve
bind_rows(
tibble(
y = eda_data_ipt_PQC2_xgb$train$SalePrice,
yhat = predict(eda_data_ipt_PQC2_xgb_mod, eda_data_ipt_PQC2_xgb$dtrain)
) %>%
mutate(tag = 'train'),
tibble(
y = eda_data_ipt_PQC2_xgb$test$SalePrice,
yhat = predict(eda_data_ipt_PQC2_xgb_mod, eda_data_ipt_PQC2_xgb$dtest)
) %>%
mutate(tag = 'test')
) %>%
mutate(yhat_bin = ntile(yhat,20)) %>%
group_by(tag,yhat_bin) %>%
summarise(y = mean(y), yhat = mean(yhat)) %>%
gather(key,value,y:yhat) %>%
ggplot(aes(x = yhat_bin,y = value, col = key)) +
geom_line() +
facet_wrap(~ tag) +
theme_ilo() +
labs(
x = "predicted binned value",
y = "actual value",
title = "lift curve on train and validation set",
subtitle = "binning by ntile function in ggplot2 package",
caption = "Jiaxiang Li - jiaxiangli.netlify.com"
)

- The theme_ilo function follows the ggplot2 tips post.
- The dev set is overfitting (see the training model tips); otherwise the fit looks good, but the Kaggle score is not high.
Importance plot
xgb.importance(feature_names = train %>% select(-Id,-SalePrice) %>% names(),
model = eda_data_ipt_PQC2_xgb_mod) %>%
xgb.plot.importance()

For an explanation of the importance measures, see the post on understanding xgboost.
Results
Your submission scored 0.14340, which is not an improvement of your best score. Keep trying!
The score did not improve; this imputation did not change the predictions either.
xgboost arguments changed
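The watchlist object train_xgb_more_regular is not defined in the excerpt; presumably it was built the same way as the earlier ones, e.g.:
# assumed: same construction as for the naive model
train_xgb_more_regular <- get_watchlist(train)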
train_xgb_more_regular_mod <- xgb.train(
data = train_xgb_more_regular$dtrain,
# 1
eta = 0.002,
nround = 20000,
# 2
max_depth = 4,
min_child_weight = 1,
gamma = 0.5,
# 3
subsample = 0.5,
colsample_bytree = 0.5,
# evaluation metric
# eval.metric = "error",
eval.metric = "rmse",
# eval.metric = ks_value,
# eval.metric = "auc",
# eval.metric = "logloss",
# objective
objective = "reg:gamma", # this is a regression problem
# other
seed = 596,
watchlist = train_xgb_more_regular$watchlist,
# use all 3 million rows at once!
nfold = 2,
early.stop = 50,
nthread = 8
)

train_xgb_more_regular_mod <- xgb.load(file.path("required_data", "train_xgb_more_regular_mod.model"))

sm_ljx_180529 <-
tibble(
Id = test$Id,
SalePrice = predict(train_xgb_more_regular_mod, dtest)
) %>%
write_csv(get_path("sm_ljx_180529.csv")) %>%
select(everything())

lift curve
bind_rows(
tibble(
y = train_xgb_more_regular$train$SalePrice,
yhat = predict(train_xgb_more_regular_mod, train_xgb_more_regular$dtrain)
) %>%
mutate(tag = 'train'),
tibble(
y = train_xgb_more_regular$test$SalePrice,
yhat = predict(train_xgb_more_regular_mod, train_xgb_more_regular$dtest)
) %>%
mutate(tag = 'test')
) %>%
mutate(yhat_bin = ntile(yhat,20)) %>%
group_by(tag,yhat_bin) %>%
summarise(y = mean(y), yhat = mean(yhat)) %>%
gather(key,value,y:yhat) %>%
ggplot(aes(x = yhat_bin,y = value, col = key)) +
geom_line() +
facet_wrap(~ tag) +
theme_ilo() +
labs(
x = "predicted binned value",
y = "actual value",
title = "lift curve on train and validation set",
subtitle = "binning by ntile function in ggplot2 package",
caption = "Jiaxiang Li - jiaxiangli.netlify.com"
)

- The theme_ilo function follows the ggplot2 tips post.
- The dev set is overfitting (see the training model tips); otherwise the fit looks good, but the Kaggle score is not high.
Importance plot
xgb.importance(feature_names = train %>% select(-Id,-SalePrice) %>% names(),
model = train_xgb_more_regular_mod) %>%
xgb.plot.importance()

For an explanation of the importance measures, see the post on understanding xgboost.
Results
best_iteration: 11555
best_ntreelimit: 11555
best_score: 30228.03
Your submission scored 0.13871, which is not an improvement of your best score. Keep trying!
- 3447 iterations.
- Regularization was applied as well, but the model still overfits.
reg:linear
None of this is the real problem; the key is the upstream feature engineering in Python, which is worth studying carefully.
reg:gamma in Python
The approach follows a CSDN blog post. @DMLCXgboostPython provides functions for saving and loading models, among others. The highlights of this modeling run:
You advanced 851 places on the leaderboard! Your submission scored 0.12124, which is an improvement of your previous score of 0.13513. Great job!
- 20000 iterations.
- Used reg:gamma.
You advanced 9 places on the leaderboard! Your submission scored 0.12112, which is an improvement of your previous score of 0.12124. Great job!
- 40000 iterations.
- Used reg:gamma.
Your submission scored 0.12178, which is not an improvement of your best score. Keep trying!
- 100000 iterations, taking 1h 29m 56s, using reg:gamma.

[99999] train-error:0.020631 test-error:0.020631

It seems adding iterations no longer shows any benefit. My reading was that the number of rounds was still too small for such a slow learning rate, so this time I scaled up to 200000 rounds, keeping η = 0.0004.
Your submission scored 0.12247, which is not an improvement of your best score. Keep trying!
- 200000 iterations, taking 5h 2m 15s, using reg:gamma.

[199999] train-error:0.004858 test-error:0.004858

The dividend from extra iterations is gone.
- The loss follows the RMSLE metric, as the competition requires (recalled below); see 15 Root Mean Squared Log Error RMSLE.
- The model equation is built on exponential regression; see Generalized linear model GLM, exponential regression, and the Gamma regression equation.
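For reference, with $n$ observations, predictions $\hat{y}_i$, and actuals $y_i$,
$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(\log(\hat{y}_i+1)-\log(y_i+1)\big)^2}$
which is why fitting on $\log y$ with RMSE (up to the +1 offset), or using a log-link objective such as reg:gamma, lines up with the competition metric.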
Ensembling
naive
file.path(getwd(),"required_data") %>%
list.files(full.names = T) %>%
str_subset(".csv") %>%
str_subset("^(?!.*train.csv|.*test.csv)") %>%
tibble(path = .) %>%
mutate(shortpath =
str_remove(path,
"/Users/JiaxiangLi/Downloads/me/trans/housingPrices/required_data/")) %>%
mutate(data = map(.x = path, .f = read_csv)) %>%
mutate(size = map_dbl(.x = data, .f = object_size)) %>%
select(-path) %>%
filter(size <= 19832) %>%
select(-size) %>%
unnest() %>%
group_by(Id) %>%
summarise(SalePrice = median(unique(SalePrice))) %>%
write_csv(get_path("sm_ljx_180530.csv")) %>%
select(everything())

Your submission scored 0.12545, which is not an improvement of your best score. Keep trying!
naive OLS regression
training
ensemble_data_train_nest <-
file.path(getwd(),"required_data") %>%
list.files(full.names = T) %>%
str_subset("train_pred.csv") %>%
tibble(path = .) %>%
mutate(shortpath =
str_remove(path,
"/Users/JiaxiangLi/Downloads/me/trans/housingPrices/required_data/")) %>%
mutate(data = map(.x = path, .f = read_csv)) %>%
mutate(size = map_dbl(.x = data, .f = object_size)) %>%
select(-path) %>%
select(-size)

expand.grid(
a = ensemble_data_train_nest$shortpath,
b = ensemble_data_train_nest$shortpath
) %>%
left_join(
ensemble_data_train_nest, by = c("a" = "shortpath")
) %>%
rename(data_a = data) %>%
left_join(
ensemble_data_train_nest, by = c("b" = "shortpath")
) %>%
rename(data_b = data) %>%
mutate(equal = map2(.x = data_a, .y = data_b, .f = setequal)) %>%
select(-data_a,-data_b) %>%
.$equal

ensemble_data_train <-
ensemble_data_train_nest %>%
unnest() %>%
spread(shortpath,SalePrice) %>%
add_column(SalePrice = train$SalePrice) %>%
select(-Id) %>%
select(SalePrice, everything())log_gamma_glm <- glm(SalePrice ~ ., data = ensemble_data_train, family=Gamma(link="log"))
summary(log_gamma_glm)□gamma 预测有问题。
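Since the gamma fit misbehaves and the section is titled naive OLS regression, the prediction below presumably comes from a plain least-squares fit; a sketch (ols_mod is my name, not the author's):
ols_mod <- lm(SalePrice ~ ., data = ensemble_data_train)
summary(ols_mod)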
prediction
ensemble_data_test_nest <-
file.path(getwd(),"required_data") %>%
list.files(full.names = T) %>%
str_subset("test_pred.csv") %>%
tibble(path = .) %>%
mutate(shortpath =
str_remove(path,
"/Users/JiaxiangLi/Downloads/me/trans/housingPrices/required_data/")) %>%
mutate(data = map(.x = path, .f = read_csv)) %>%
mutate(size = map_dbl(.x = data, .f = object_size)) %>%
select(-path) %>%
select(-size)

ensemble_data_test <-
ensemble_data_test_nest %>%
unnest() %>%
mutate(shortpath = str_replace_all(shortpath, "test","train")) %>%
spread(shortpath,SalePrice)
ensemble_data_test %>% cor()
ensemble_data_output <-
ensemble_data_test %>%
mutate(
# ols_mod is the least-squares fit sketched above; the original called predict(lm, ...), which cannot run
SalePrice = predict(ols_mod, newdata = ensemble_data_test)
) %>%
select(Id, SalePrice, everything()) %>%
select(Id, SalePrice) %>%
write_csv(get_path("sm_ljx_180531.csv"))

The line
mutate(shortpath = str_replace_all(shortpath, "test", "train")) %>%
unifies the x-variable names between the training and test sets.
Your submission scored 0.75198, which is not an improvement of your best score. Keep trying!
The result is poor, mainly because a linear ensemble needs low correlation among its inputs, while the base-model predictions here are all highly correlated.
References
Model selection and feature engineering draw on @Roberts2018 and @Serigne2017. They try many things that I will not expand on here; for each, three points worth trying are highlighted below, and the code can be found by following the references. Before testing, install a Python environment, Xgboost, and Lightgbm.
@Roberts2018 offers three novel ideas:
- Models of the lasso, ridge, and ElasticNet family give good results and are well suited to later stacking.
- Define outliers using μ (see Andrew Ng Machine Learning notes 2.2).
- Since the competition's loss is RMSLE, the best power transformation is the log transform, i.e. λ = 0 (see the limit below).
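The λ = 0 case is the log transform because, for each $y > 0$,
$\lim_{\lambda \to 0}\frac{y^{\lambda}-1}{\lambda} = \log y$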
Other improvement ideas, not yet verified:
- Apply power transformations to the other continuous variables.
- Most of the models tried overfit. Among the RMSEs2 the author reports, Xgboost does best at 0.0597, yet the best submitted model scores 0.0786; the author reached the best result by repeatedly submitting, so dev-set RMSE is not the only criterion.
Your submission scored 0.11556, which is not an improvement of your best score. Keep trying!
Following @Roberts2018's ideas, the score improved.
@Serigne2017 offers three novel ideas:
- Missing values are handled for most variables, some imputed with the median, some assigned to a new category.
- Highly skewed continuous variables get a power transformation (sketched below).
- Stacking mostly uses simple averaging, which works well.
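A minimal sketch of the skew-based transformation, reusing the skewness() function already loaded above; the 0.75 cutoff is my assumption:
num_vars <- train %>% select_if(is.numeric) %>% select(-Id, -SalePrice)
skews <- map_dbl(num_vars, ~ skewness(.x, na.rm = TRUE))
high_skew <- names(skews[!is.na(skews) & abs(skews) > 0.75])
# log1p keeps zero-valued rows finite
train_ts <- train %>% mutate_at(vars(one_of(high_skew)), log1p)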
You advanced 684 places on the leaderboard! Your submission scored 0.11549, which is an improvement of your previous score of 0.12124. Great job!
Following @Serigne2017's ideas, the score improved.
Summary
setequal(sm_ljx_180525,sm_ljx_180527)
setequal(sm_ljx_180525,sm_ljx_180528)
setequal(sm_ljx_180525,sm_ljx_180528_02)

The imputations in @ref(PQC) did not change the predictions.
Overfitting in xgboost remains a problem. Except for the (model-python) model, every result overfits.
- base → tuning: 0.14340 → 0.13870
- tuning → squared terms: 0.13870 → 0.12124
- squared terms → more rounds: 0.12124 → 0.12112
- more rounds → far more rounds: 0.12112 ↓
With reg:gamma, overfitting drops; train and test differ little.
Bibliography
Running names(train) %>% str_subset("^[:digit:]") turns up the variables
[1] "1stFlrSF" "2ndFlrSF" "3SsnPorch"
which start with a digit and violate R naming rules, so they are renamed consistently.↩︎
Strictly speaking this is RMSLE, because @Roberts2018 log-transforms y.↩︎