Tech: Practical Tips

"技术 系列导航

1 "技术:Poisson分布、回归Python实现 2017-12-20
2 "技术:beta系数理解 2017-12-21
3 "技术:Python List剔除重复值 2017-12-21
4 "技术:t-SNE处理高维数据可视化 2017-12-21
5 "技术:用R语言进行文件系统管理 2017-12-21
6 "技术:fct_lump分箱使用方法 2017-12-22
7 "技术:F1分数为什么可以看不平衡样本的预测能力 2017-12-27
8 "技术:Fisher的一个矩阵预算 2017-12-27
9 "技术:case_when使用方法 2017-12-28
10 "技术:Python接口函数 2017-12-28
11 "技术:fct_relevel调整对照组,批量的方案 2017-12-30
12 "技术:python中变量批量处理集成方案 2017-12-30
13 "技术:Python接口函数-中台 2018-01-03
14 "技术:使用pbd包进行debug 2018-01-03
15 "技术:R实现随机分组 2018-01-04
16 "技术:jiebaR包做文本清洗 2018-01-05
17 "技术:r_WACC使用说明 2018-01-05
18 "技术:使用switchhost安装host 2018-01-05
19 "技术:Git的下载问题 2018-01-06
20 "技术:新闻爬虫 2018-01-07
21 "技术:Scalable Data Processing in R 2018-01-08
22 "技术:使用git创建一个自己的本地仓库 2018-01-11
23 "技术:dashboard构建,来自yihui的包 2018-01-12
24 "技术:最大似然估计再理解 2018-01-12
25 "技术:itchat包提取微信好友公开数据 2018-01-13
26 "技术:do函数和biglm包 2018-01-17
27 "技术:Imputer后X少了一列 2018-01-18
28 "技术:mac连接mysql,理论上win7也可以 2018-01-19
29 "技术:ggpubr提高作图效率 2018-01-20
30 "技术:t-SNE理论部分补充 2018-01-22
31 "技术:RMarkdown的使用技巧 2018-01-30
32 "技术:通过anova检验,理解R2、R_adj.2、F值 2018-01-31
33 "技术:ggridges 山峦图 学习笔记 2018-02-02
34 "技术:Tidyverse使用技巧 2018-02-02
35 "技术:XGBoost 学习笔记 2018-02-02
36 "技术:分布变离散,或者纠正skew 2018-02-02
37 "技术:rsq在R中自定义函数 2018-02-03
38 "技术:Jupyter实战 2018-02-13
39 "技术:美化与定制 2018-02-21
40 "技术:数据对比可视化指南 2018-02-22
41 "技术:功能体验 2018-02-26
42 "技术:高效数据处理 2018-02-27
43 "技术:模型优化技巧 2018-03-05
44 "技术:原理与应用 2018-03-06
45 "技术:模型与可视化 2018-03-08
46 "技术:美化与交互指南 2018-03-11
47 "技术:高效操作指南 2018-03-14
48 "技术:用法与优化技巧 2018-03-17
49 "技术:效率提升指南 2018-03-18
50 "技术:问题排查技巧 2018-03-18
51 "技术:高效操作指南 2018-03-19
52 "技术:方法与代码示例 2018-03-21
53 "技术:进阶技巧与优化 2018-03-21
54 "技术:实战示例 2018-03-22
55 "技术:效率与规范指南 2018-03-24
56 "技术:dplython包测评 2018-03-25
57 "技术:原理与实现 2018-04-02
58 "技术:原理与应用解析 2018-04-03
59 "技术:实战指南 2018-04-05
60 "技术:核心语法与函数整理 2018-04-05
61 "技术:复利计算与应用 2018-04-08
62 "技术:简单规则模型解析 2018-04-14
63 "技术:高效代码设计指南 2018-04-25
64 "技术:原理与应用学习笔记 2018-04-29
65 "技术:实战指南 2018-04-29
66 "技术:原理与应用 2018-05-01
67 "技术:表格格式化指南 2018-05-02
68 "技术:原理与应用介绍 2018-05-08
69 "技术:高效文本拼接 2018-05-11
70 "技术:方法与实践学习笔记 2018-05-12
71 "技术:方法与工具 2018-05-12
72 "技术:功能解析 2018-05-17
73 "技术:高效数据输入 2018-05-21
74 "技术:基础模型与方法 2018-05-22
75 "技术:功能与使用体验 2018-05-26
76 "技术:特征筛选学习笔记 2018-05-29
77 "技术:建模思路解析 2018-06-03
78 "技术:策略与实战 2018-06-03
79 "技术:数据展示指南 2018-06-04
80 "技术:包与环境配置指南 2018-07-14
81 "技术:高效操作指南 2018-07-19
82 "技术:方法与案例解析 2018-07-24
83 "技术:统计建模学习笔记 2018-07-24
84 "技术:展示技巧与原则 2018-08-10
85 "技术:数据采集实战技巧 2018-08-21
86 "技术:指标设计学习笔记 2018-09-20
87 "技术:建模流程实战 2018-10-01
88 "技术:大规模数据探索 2018-10-20
89 "技术:文本提取与分析 2018-10-20
90 "技术:原理与R实现实战 2018-10-21
91 "技术:学习资源获取技巧 2018-10-23
92 "技术:深度学习模型实战 2018-10-24
93 "技术:实战指南 2018-10-30
94 "技术:分析与展示指南 2018-11-03
95 "技术:图片编辑与转换 2018-11-03
96 "技术:安装与使用基础学习笔记 2018-11-07
97 "技术:非结构化数据处理 2018-11-29
98 "技术:表格美化技巧 2018-12-13
99 "技术:分类数据可视化 2018-12-24
100 "技术:流程图绘制技巧 2018-12-24
101 "技术:自动化设置 2018-12-28
102 "技术:协作与版本控制 2018-12-29
103 "技术:pipeline设计 2018-12-30
104 "技术:Git历史记录清理 2018-12-31
105 "技术:AUC指标对比 2019-01-01
106 "技术:时间序列可视化 2019-01-01
107 "技术:变量命名工具指南 2019-01-02
108 "技术:网页自动化截图 2019-01-02
109 "技术:配置与优化 2019-01-03
110 "技术:原理与应用 2019-01-07
111 "技术:语法与核心概念 2019-01-14
112 "技术:R包徽章设计 2019-01-15
113 "技术:项目结构设计 2019-01-21
114 "技术:文本分类基础任务 2019-01-22
115 "技术:线性与非线性模型 2019-01-22
116 "技术:有效性验证 2019-01-27
117 "技术:评估与应用 2019-01-29
118 "技术:循环神经网络入门 2019-01-30
119 "技术:长短期记忆网络入门 2019-01-30
120 "技术:for循环示例 2019-02-03
121 "技术:基础到进阶 2019-02-06
122 "技术:查询与整合 2019-02-06
123 "技术:方法与案例 2019-02-14
124 "技术:R包高效开发指南 2019-02-20
125 "技术:解析与操作 2019-02-20
126 "技术:训练与预测 2019-02-25
127 "技术:原理与代码 2019-02-26
128 "技术:GitHub个人访问令牌(PAT)设置 2019-03-04
129 "技术:方法与工具 2019-03-07
130 "技术:文本特征提取示例 2019-03-08
131 "技术:基础任务示例 2019-03-18
132 "技术:条形图与表头设计 2019-03-20
133 "技术:连续与分类变量差异 2019-03-30
134 "技术:思路与方法 2019-04-08
135 "技术:方法与工具 2019-04-15
136 "技术:多格式读取 2019-04-16
137 "技术:方法与工具 2019-05-11
138 "技术:Git/GitHub/GitLab 2019-05-13
139 "技术:命令与操作 2019-05-19
140 "技术:协作与版本控制 2019-05-26
141 "技术:语法与实践 2019-06-28
142 "技术:功能与API 2019-07-13
143 "技术:安装与使用 2019-07-24
144 "技术:高效数据处理 2019-10-09
145 "技术:性能优化技巧 2019-10-12
146 "技术:配置与运维 2019-10-29
147 "技术:原理与经典模型 2019-12-25
148 "技术:构建到发布流程 2019-12-26
149 "技术:方法与案例 2019-12-27
150 "技术:命令与自动化 2019-12-30
151 "技术:Pandas数据处理实战指南 2020-01-19
152 "技术:特征工程之目标编码学习笔记 2020-01-20
153 "技术:文档编写与美化 2020-01-28
154 "技术:核心算法与应用 2020-01-29
155 "技术:流程图绘制技巧 2020-01-29
156 "技术:DataCamp课程笔记 2020-01-31
157 "技术:Python实用代码片段合集 2020-01-31
158 "技术:自动化构建流程 2020-02-02
159 "技术:自动化工作流配置 2020-02-04
160 "技术:高效查找代码与项目 2020-02-11
161 "技术:代码环境快速部署 2020-02-24
162 "技术:USD数据分析论文收录暨GitBook发布 2020-05-02
163 "技术:Causal Forest 2021-03-18

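This post was last updated on `r format(Sys.Date(), "%Y-%m-%d")`. If you find a problem or have a suggestion, please open an Issue.

@Scavettaggplot1, @Scavettaggplot2 and @Scavettaggplot3 give a fairly systematic tutorial; these notes work through it and write it up.

```{r setup, include=FALSE}
knitr::opts_chunk$set(eval = FALSE)
```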

# Part 1

## geom_histogram

```{r}
# Custom color code (can be chosen with a Color Picker)
myBlue <- "#377EB8"

# Change the fill color to myBlue
ggplot(mtcars, aes(x = mpg, y = ..count..)) +
  geom_histogram(binwidth = 1, aes(y = ..count..), fill = myBlue)
ggplot(mtcars, aes(x = mpg, y = ..density..)) +
  geom_histogram(binwidth = 1, aes(y = ..density..), fill = myBlue)
```

+ `..count..` is the number of observations in each bin.
+ `..density..` is the density of each bin, that is, the relative frequency divided by the bin width.

If `binwidth` is not set, the default bin width is `diff(range(dataset$x))/30`
[@Scavettaggplot1, [Histograms | R](https://campus.datacamp.com/courses/data-visualization-with-ggplot2-1/chapter-4-geometries?ex=5)]

## position

```{r}
for (i in c("stack", "fill", "dodge")) {
  p1 <-
    ggplot(mtcars, aes(x = cyl, fill = factor(am))) +
    geom_bar(position = i) +
    theme_minimal() +
    # Use a ColorBrewer palette instead of choosing colors by hand.
    # "Diamond\nclarity" is a legend title, not a palette name,
    # so the default palette is used here.
    scale_fill_brewer()
  print(p1)
}
```

+ stack: bars are stacked and show counts.
+ fill: bars are stacked and show proportions; a pie chart is __sometimes__ the better choice here.
+ dodge: bars are placed side by side. [@Scavettaggplot1, Position | R]

```{r}
# Convert bar chart to pie chart
ggplot(mtcars, aes(x = factor(1), fill = as.factor(am))) +
  geom_bar(position = "fill", width = 1) +
  facet_grid(. ~ cyl) +
  coord_polar(theta = "y")
```

## geom_rect

`geom_rect` draws rectangles.

```{r}
download.file("https://assets.datacamp.com/production/course_774/datasets/recess.RData", "recess.RData")
load("recess.RData")
ggplot(economics, aes(x = date, y = unemploy/pop)) +
  geom_line() +
  geom_rect(data = recess, inherit.aes = FALSE,
            aes(xmin = begin, xmax = end, ymin = -Inf, ymax = +Inf),
            fill = "red", alpha = 0.2)
```

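# Part 2

## Overriding aes

```{r}
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm", se = F) +
  stat_smooth(method = "lm", se = F, aes(group = 123))
```

`group = 123` overrides the grouping implied by the earlier `aes`: a constant puts every observation into one group, so the second `stat_smooth` fits a single overall line.

## Understanding geom_boxplot and geom_violin

`varwidth` makes the box width reflect the sample size.

> `scale` If "count", areas are scaled proportionally to the number of observations.

```{r}
diamonds %>%
  ggplot(aes(x = color, y = depth)) +
  geom_boxplot(varwidth = T)
diamonds %>%
  ggplot(aes(x = color, y = depth)) +
  geom_violin(scale = "count")
```

## Understanding geom_density

With a finite sample, the value of each observation is taken as the mean of a small distribution with an assumed standard deviation. Adding all of these distributions together gives the overall curve, which is what `geom_density` draws.

The part of the curve that falls outside the range of the observed values is cut off, which is why the left and right ends of the plot look truncated.

The bandwidth plays the role of the standard deviation: the smaller it is, the more peaked each assumed distribution becomes, and the more easily the overall curve shows two or more peaks.

> `bw` The smoothing bandwidth to be used. If numeric, the standard deviation of the smoothing kernel. If character, a rule to choose the bandwidth, as listed in `bw.nrd`.

`weight` brings in the effect of sample size: if the same value occurs many times, the curve around it should be higher.

For the `density` function:

> `x` the n coordinates of the points where the density is estimated.

> `y` the estimated density values. These will be non-negative, but can be zero.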
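The description above can be checked by hand. The following is a minimal sketch, assuming a Gaussian kernel and the `bw.nrd0` bandwidth rule (the defaults of `density()`); `manual_kde` is a hypothetical helper name used only for this illustration. It averages one normal curve per observation and compares the result with `density()`.

```r
# A hand-rolled Gaussian kernel density estimate, for illustration only.
x  <- mtcars$mpg
bw <- bw.nrd0(x)            # default bandwidth rule of density()
d  <- density(x, bw = bw)   # reference estimate from base R

# Average of one normal curve per observation:
# mean = the observation, sd = the bandwidth.
manual_kde <- function(grid, obs, bw) {
  sapply(grid, function(g) mean(dnorm(g, mean = obs, sd = bw)))
}
manual_y <- manual_kde(d$x, x, bw)

# Largest gap between the hand-rolled curve and density()
max_diff <- max(abs(manual_y - d$y))
max_diff
```

The two curves should agree up to the small approximation error of `density()`, which evaluates the estimate on a binned grid rather than point by point.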

```{r}
# Calculating density: d
d <- density(diamonds$depth)

# Use which.max() to calculate mode
mode <- d$x[which.max(d$y)]

# Finish the ggplot call
ggplot(diamonds, aes(x = depth)) +
  geom_rug() +
  geom_density() +
  geom_vline(xintercept = mode, col = "red")
```

This way of getting `mode` is simple and worth borrowing.

```{r}
# Create test_data
test_data = data_frame(norm = rnorm(1000))
# Arguments you'll need later on
fun_args <- list(mean = mean(test_data$norm), sd = sd(test_data$norm))

# Finish the ggplot
ggplot(test_data, aes(x = norm)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density(col = "red") +
  stat_function(fun = dnorm, args = fun_args, col = "blue")
```

Note that

+ the histogram shows the sample data (1000 observations) directly,
+ the red density line is the sum of the kernels from the KDE, so it is cut off at the edges of the data,
+ the blue line is the normal distribution whose $\mu$ and $\sigma$ are estimated directly from the sample data (1000 observations).

### Adjusting density plots

> `adjust` A multiplicative bandwidth adjustment. This makes it possible to adjust the bandwidth while still using a bandwidth estimator. For example, `adjust = 1/2` means use half of the default bandwidth. [@Scavettaggplot2, Adjusting density plots | R]

Let's verify this.

```{r}
# Get the bandwidth
get_bw <- density(test_data$norm)$bw

# Basic plotting object
p <- ggplot(test_data, aes(x = norm)) +
  geom_rug() +
  coord_cartesian(ylim = c(0, 0.5))

# Two plots that should look the same
p + geom_density(adjust = 0.25)
p + geom_density(bw = 0.25 * get_bw)
# The two are practically identical.
```

`kernel` - kernel used for density estimation, defined as

+ `"g"` = gaussian
+ `"r"` = rectangular
+ `"t"` = triangular
+ `"e"` = epanechnikov
+ `"b"` = biweight
+ `"c"` = cosine
+ `"o"` = optcosine

## stat_density_2d

```{r}
diamonds %>%
  ggplot(aes(x = depth, y = table)) +
    stat_density_2d(aes(col = ..level..), h = c(5, 0.5))
# h holds the two bandwidths, one for x and one for y
# ..level.. maps the contour level to color, which is very handy
```

## geom_rug()

```{r}
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point()
p
p + geom_rug()
```

## stat_smooth

> `span`: Smaller numbers produce wigglier lines, larger numbers produce smoother lines.

The smaller the `span`, the smaller the window, and the fewer observations go into each local fit. [@Scavettaggplot2, Modifying stat_smooth | R]

```{r}
# Plot 1: change the LOESS span
for (i in c(0.2, 0.7)) {
  p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    # Add span below
    geom_smooth(se = F, span = i)
  print(p1)
}
```

```{r}
# Plot 2: Set the overall model to LOESS and use a span of 0.7
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm", se = F) +
  # Change method and add span below
  stat_smooth(method = "loess", aes(group = 1),
              se = F, col = "black", span = 0.7)

# Plot 3: Set col to "All", inside the aes layer of stat_smooth()
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm", se = F) +
  stat_smooth(method = "loess",
              # Add col inside aes() so that it shows up in the legend
              aes(group = 1, col = "All"),
              # The separate col argument is removed
              se = F, span = 0.7)

# Plot 4: Add scale_color_manual to change the colors
library(RColorBrewer)
myColors <- c(brewer.pal(3, "Dark2"), "black")
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm", se = F, span = 0.75) +
  stat_smooth(method = "loess", aes(group = 1, col = "All"),
              se = F, span = 0.7) +
  # Add correct arguments to scale_color_manual
  scale_color_manual("Cylinders", values = myColors)
```

> `scale_color_brewer()` to use a default ColorBrewer. This should result in an error, since the default palette, "Blues", only has 9 colors, but we have 16 years here.
[@Scavettaggplot2, [Modifying stat_smooth (2) | R](https://campus.datacamp.com/courses/data-visualization-with-ggplot2-2/chapter-1-statistics?ex=5)]

```{r}
# Vocab comes from the carData package (the car package in older versions)
# Plot 1: Jittered scatter plot, add a linear model (lm) smooth
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  geom_jitter(alpha = 0.2) +
  stat_smooth(method = "lm", se = F)

# Plot 2: Only lm, colored by year
ggplot(Vocab, aes(x = education, y = vocabulary, col = factor(year))) +
  stat_smooth(method = "lm", se = F)

# Plot 3: Set a color brewer palette
ggplot(Vocab, aes(x = education, y = vocabulary, col = factor(year))) +
  stat_smooth(method = "lm", se = F) +
  scale_color_brewer()

# Plot 4: Change col and group, specify alpha, size and geom, and add scale_color_gradient
ggplot(Vocab, aes(x = education, y = vocabulary, col = year, group = factor(year))) +
  stat_smooth(method = "lm", se = F, alpha = 0.6, size = 2) +
  scale_color_gradientn(colors = brewer.pal(9, "YlOrRd"))
```

When there are many groups and the grouping variable is an integer, mapping it as a continuous variable works better than mapping it as a factor.

## geom_quantile

> This fits a quantile regression to the data and draws the fitted quantiles with lines. This is as a __continuous analogue__ to `geom_boxplot`.
[@Scavettaggplot2, [Quantiles | R](https://campus.datacamp.com/courses/data-visualization-with-ggplot2-2/chapter-1-statistics?ex=6)]

In other words, it is meant for data in which the same x has several y values.

```{r}
m <- ggplot(mpg, aes(displ, 1 / hwy)) + geom_point()
m + geom_quantile()
m + geom_quantile(quantiles = 0.5)
q10 <- seq(0.05, 0.95, by = 0.05)
m + geom_quantile(quantiles = q10)
```

## sum

> Another useful stat function is `stat_sum()` which calculates the count for each group. [@Scavettaggplot2, Sum | R]

> `range` a numeric vector of length 2 that specifies the minimum and maximum size of the plotting symbol after transformation.

```{r}
ggplot(mpg, aes(cty, hwy)) +
  geom_point()

# Point size reflects how many observations share the same x and y
ggplot(mpg, aes(cty, hwy)) +
  geom_count()
ggplot(mpg, aes(cty, hwy)) +
  geom_count() +
  scale_size_area()
# Show proportions instead of counts
ggplot(mpg, aes(cty, hwy)) +
  geom_count(aes(size = ..prop..))
```

The first chapter is worth reviewing carefully.
All of these plots can be drawn by hand, but in practice the idea behind each one is what matters.

## stat_summary

> `mult` stands for multiplier:
for `smean.cl.normal` is the multiplier of the standard error of the mean.

> `stat_summary` operates on unique x; `stat_summary_bin` operates on binned x. They are more flexible versions of `stat_bin`: instead of just counting, they can compute any aggregate.

`stat_function` looks very useful as well: it is highly customizable.

```{r}
# Display structure of mtcars
str(mtcars)

# Convert cyl and am to factors:
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$am <- as.factor(mtcars$am)

# Define positions:
posn.d <- position_dodge(width = 0.1)
posn.jd <- position_jitterdodge(jitter.width = 0.1, dodge.width = 0.2)
posn.j <- position_jitter(width = 0.2)

# Base layers:
wt.cyl.am <- mtcars %>%
  ggplot(aes(x = cyl, y = wt, col = am, fill = am, group = am))
# Jitter keeps the points from overlapping
wt.cyl.am +
  geom_point(position = posn.jd, alpha = 0.6)
# Loop over the function names so that the name can be shown in the subtitle
for (i in c("mean_sdl", "mean_cl_normal")) {
  wt.cyl.am.p <- wt.cyl.am +
    stat_summary(fun.data = i, fun.args = list(mult = 1),
                 position = posn.d) +
    labs(
      title = "Mean and SD",
      # Chinese subtitle: "This allows a comparison.
      # geom_pointrange() is used by default, with <function name>"
      subtitle = paste(
        "这个就可以比较分析了。\n这里默认使用了geom_pointrange(),",
        "使用",
        i)
    ) +
    theme(text = element_text(family = "STKaiti"))
  print(wt.cyl.am.p)
}
wt.cyl.am +
  stat_summary(geom = "point", fun.y = mean,
               position = posn.d) +
  stat_summary(geom = "errorbar", fun.data = mean_sdl,
               position = posn.d, fun.args = list(mult = 1), width = 0.1)
```
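To make the two summary functions less of a black box, here is a minimal base R sketch of what they compute for one group (the 4-cylinder cars). It assumes the usual definitions, mean plus or minus `mult` standard deviations for `mean_sdl` and mean plus or minus a t quantile times the standard error for `mean_cl_normal`; the variable names `sdl`, `cl`, `se` and `tq` are made up for this illustration.

```r
# One group from the plots above: weights of the 4-cylinder cars
wt <- mtcars$wt[mtcars$cyl == 4]
n  <- length(wt)

# mean_sdl with mult = 1: mean plus or minus one standard deviation
sdl <- data.frame(y    = mean(wt),
                  ymin = mean(wt) - sd(wt),
                  ymax = mean(wt) + sd(wt))

# mean_cl_normal with its default multiplier:
# mean plus or minus a t quantile times the standard error (95% interval)
se <- sd(wt) / sqrt(n)
tq <- qt(0.975, df = n - 1)
cl <- data.frame(y    = mean(wt),
                 ymin = mean(wt) - tq * se,
                 ymax = mean(wt) + tq * se)

sdl
cl
```

Note that passing `mult = 1` to `mean_cl_normal`, as the loop does, should replace the t quantile with 1, so that call draws the mean plus or minus one standard error rather than a 95% interval.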

```{r}
# Play vector xx
xx <- 1:100
# Function to save range for use in ggplot:
gg_range <- function(x) {
  data.frame(ymin = min(x), # Min
             ymax = max(x)) # Max
}

gg_range(xx)
# Required output:
#   ymin ymax
# 1    1  100

# Custom function for the median and the quartiles:
med_IQR <- function(x) {
  data.frame(y = median(x),          # Median
             ymin = quantile(x)[2],  # 1st quartile
             ymax = quantile(x)[4])  # 3rd quartile
}

med_IQR(xx)
# Required output:
#        y  ymin  ymax
# 25% 50.5 25.75 75.25
```

```{r}
1:100 %>%
  quantile()
```

```{r}
# The base ggplot command
wt.cyl.am <- ggplot(mtcars, aes(x = cyl, y = wt, col = am, fill = am, group = am))

# Add three stat_summary calls to wt.cyl.am
wt.cyl.am +
  stat_summary(geom = "linerange", fun.data = med_IQR,
               position = posn.d, size = 3) +
  stat_summary(geom = "linerange", fun.data = gg_range,
               position = posn.d, size = 3, alpha = 0.4) +
  stat_summary(geom = "point", fun.y = median,
               position = posn.d, size = 3, col = "black", shape = "X") +
  labs(
    # Chinese subtitle: "the middle point is the median,
    # the dark range is the quartiles, the light range is the extremes"
    subtitle = "中间的点是中位数\n深色的是四分位点\n浅色的是极值"
  ) +
  theme(text = element_text(family = "STKaiti"))
```

### errorbar

```{r}
# Base layers
m <- ggplot(mtcars, aes(x = cyl, y = wt, col = as.factor(am), fill = as.factor(am)))

# Plot 1: Draw dynamite plot
m +
  stat_summary(fun.y = mean, geom = "bar", alpha = 0.2) +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.1)

# Plot 2: Set position dodge in each stat function
m +
  stat_summary(fun.y = mean, geom = "bar", position = "dodge", alpha = 0.2) +
  stat_summary(fun.data = mean_sdl,
               fun.args = list(mult = 1),
               geom = "errorbar",
               width = 0.1, position = "dodge")

# Set your dodge posn manually
posn.d <- position_dodge(0.9)

# Plot 3: Redraw dynamite plot
m +
  stat_summary(fun.y = mean, geom = "bar", position = posn.d, alpha = 0.2) +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.1, position = posn.d)
```

```{r}
# A layer to add to an existing plot
stat_summary(fun.data = "mean_cl_normal",
             geom = "crossbar",
             width = 0.2,
             col = "red")
```

This is the easiest way to add a 95% confidence interval.

```{r}
diamonds %>%
  ggplot(aes(x = color, y = depth)) +
  geom_point() +
  stat_summary(fun.data = "mean_cl_normal",
               geom = "crossbar",
               width = 0.2,
               col = "red")
```

The 95% confidence interval of the mean is a narrow band in the middle of each group, which shows how widely the individual observations are spread around it.

## Zoom in

```{r}
# Basic ggplot() command
p <- ggplot(mtcars, aes(x = wt, y = hp, col = am)) +
  geom_point() +
  geom_smooth()

# Add scale_x_continuous
p +
  scale_x_continuous(limits = c(3, 6), expand = c(0, 0)) +
  labs(
    # Chinese caption: "only two green points are left, so no line can be drawn"
    caption = "就是因为绿色的点只有两个,所以画不了线。"
  ) +
  theme(text = element_text(family = "STKaiti"))

# The proper way to zoom in:
p +
  coord_cartesian(xlim = c(3, 6)) +
  labs(
    # Chinese caption: "the smooth is fitted on the full data,
    # so the line is still there after zooming"
    caption = "就是因为用全集画图,也画了线,所以截图的时候,才有smooth line。"
  ) +
  theme(text = element_text(family = "STKaiti"))
```

Setting `limits` on the scale drops the observations outside the range before the smooth is fitted, whereas `coord_cartesian` only changes the visible window.

> `expand`
These constants ensure that the data is placed some distance away from the axes. The defaults are `c(0.05, 0)` for continuous variables, and `c(0, 0.6)` for discrete variables.

## aspect

```{r}
# Complete basic scatter plot function
base.plot <- ggplot(iris, aes(x = Sepal.Length,
                              y = Sepal.Width,
                              col = Species)) +
               geom_jitter() +
               geom_smooth(method = "lm", se = F)

# Plot base.plot with a 1:1 aspect ratio
base.plot + coord_fixed(ratio = 1/1) +
  labs(
    # Chinese subtitle: "the x and y ranges differ;
    # if they were equal the panel would be square"
    subtitle = "因为minmax不一样,如果一样,就是正方形"
  ) +
  theme(text = element_text(family = "STKaiti"))

# Fix aspect ratio (1:1) of base.plot
base.plot + coord_equal()
```

The point is to actually take something away from each exercise.

## coord_polar()

> We can imagine two forms for pie charts - the typical filled circle, or a colored ring.

### Understanding polar coordinates

> As an example, consider the stacked bar chart shown in the viewer. Imagine that we just take the y axis on the left and bend it until it loops back on itself, while expanding the right side as we go along. We'd end up with a pie chart - it's simply a bar chart transformed onto a polar coordinate system.

```{r}
# Create stacked bar plot: thin.bar
thin.bar <- ggplot(mtcars, aes(x = 1, fill = as.factor(cyl))) +
  geom_bar() +
  labs(
    # Chinese subtitle: "x is a constant and there is no y variable;
    # the bar is stacked on x and split by color, so count does the splitting"
    subtitle = "x轴为常数,y轴不存在,\n在x轴上stack,颜色区分,那么就是count来区分"
  ) +
  theme(text = element_text(family = "STKaiti"))

# Convert thin.bar to pie chart
# The y axis becomes the angle; the labels around the circle are counts
thin.bar + coord_polar(theta = "y")

# Create stacked bar plot: wide.bar
wide.bar <- ggplot(mtcars, aes(x = 1, fill = as.factor(cyl))) +
  geom_bar(width = 1)

# Convert wide.bar to pie chart
wide.bar + coord_polar(theta = "y")
```

## facet

```{r}
mtcars %>%
  add_rownames() %>%
  ggplot(aes(x = mpg, y = rowname)) +
  geom_point() +
  facet_grid(cyl ~ ., space = "free_y") +
  labs(
    # Chinese subtitle: "space lets the height of each panel
    # vary with the number of observations"
    subtitle = "space是为了使得y轴的空间都随着样本量变化"
  ) +
  theme(text = element_text(family = "STKaiti"))
```

## theme

These are all the places in a ggplot figure where text appears.

There are three kinds of element to adjust:

+ `element_text()`
+ `element_line()`
+ `element_rect()`

For `element_rect()`, the interior is set with `fill` and the border with `col`.
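As a small illustration of the three element types, the sketch below sets one theme entry of each kind. The particular entries, colors and sizes are arbitrary choices for this example and do not come from the course material.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = mpg, y = disp)) +
  geom_point() +
  labs(title = "Theme element sketch") +
  theme(
    # text element: font face and size of the title
    plot.title = element_text(face = "bold", size = 14),
    # line element: draw the axis lines in black
    axis.line = element_line(colour = "black"),
    # rect element: fill is the interior, colour is the border
    plot.background = element_rect(fill = "grey95", colour = "grey40")
  )
p
```

Any theme entry can also be switched off with `element_blank()`, which the following sections use to remove the grid and the panel background.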

<!-- text.png -->
<!-- line.png -->
<!-- rect.png -->
<!-- summary_elment.png -->
<!-- element_blank.png -->

### panel.grid

```{r}
mtcars %>%
  ggplot(aes(x = mpg, y = disp, col = as.factor(cyl))) +
  geom_point() +
  theme(
    panel.grid = element_blank(),              # removes the grid lines
    panel.background = element_blank(),        # removes the background color
    axis.line = element_line(color = "black")  # looks much cleaner
  )
```

### strip.text

```{r}
mtcars %>%
  ggplot(aes(x = mpg, y = disp)) +
  geom_point() +
  facet_grid(. ~ as.factor(cyl)) +
  theme(
    panel.grid = element_blank(),              # removes the grid lines
    panel.background = element_blank(),        # removes the background color
    axis.line = element_line(color = "black")  # looks much cleaner
  ) +
  theme(
    strip.background = element_blank(),
    strip.text = element_text(face = "bold", size = 12)
  )
```

### Legends

```{r}
z <- mtcars %>%
  ggplot(aes(x = mpg, y = disp, col = as.factor(cyl))) +
  geom_point() +
  facet_grid(. ~ cyl)
# Move legend by position
z + theme(legend.position = c(0.85, 0.85))

# Change direction
z + theme(legend.position = c(0.85, 0.85), legend.direction = "horizontal")

# Change location by name
z + theme(legend.position = "bottom", legend.direction = "horizontal")

# Remove legend entirely
z + theme(legend.position = "none", legend.direction = "horizontal")
```

### margin

```{r}
z
z +
  theme(
    panel.spacing.x = grid::unit(2, "cm")
  )
z +
  theme(
    # plot margins
    plot.margin = unit(c(0, 0, 0, 0), "cm")
  ) +
  labs(
    # Chinese subtitle: "setting the plot margins"
    subtitle = "页边距设置"
  ) +
  theme(text = element_text(family = "STKaiti"))
```

### Get, set, and modify the active theme

> The current/active theme is automatically applied to every plot you draw. Use `theme_get` to get the current theme, and `theme_set` to completely override it. `theme_update` and `theme_replace` are shorthands for changing individual elements.

I do not fully understand this yet.
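A short sketch may make the quoted description more concrete. It only uses `theme_set`, `theme_update` and `theme_get` from ggplot2; the choice of `theme_minimal()` and of the updated element is arbitrary, and the variable names `old` and `updated` are made up for this example.

```r
library(ggplot2)

# theme_set() replaces the active theme and returns the previous one
old <- theme_set(theme_minimal())

# theme_update() changes individual elements of the active theme
theme_update(panel.grid.minor = element_blank())

# theme_get() returns the theme that every new plot now uses
updated <- theme_get()

# This plot picks up the active theme without any theme_*() call
p <- ggplot(mtcars, aes(x = mpg, y = disp)) +
  geom_point()

# Restore the previous theme so later plots are unaffected
theme_set(old)
```

Saving the return value of `theme_set()` is what makes the change reversible: the active theme is global state, so restoring `old` at the end keeps the rest of a document unaffected.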

<!-- ``` -->
<!-- mtcars %>%  -->
<!--   ggplot(aes(x = mpg, y = disp, col = as.factor(cyl))) + -->
<!--     geom_point() + -->
<!--   theme( -->
<!--     panel.background = element_rect(fill = "red") -->
<!--   ) -->
<!-- ``` -->


## ggthemes

```{r}
library(ggthemes)
```

```{r}
mtcars %>%
  ggplot(aes(x = mpg, y = disp, col = as.factor(cyl))) +
  geom_point() +
  facet_grid(. ~ cyl) +
  labs(
    # Chinese title: "testing whether Chinese text survives theme_tufte"
    title = "测试中文是否能够被tufte修改"
  ) +
  theme_tufte() +
  theme(text = element_text(family = "STKaiti"))
```

```{r}
# Base layers
m <- ggplot(mtcars, aes(x = cyl, y = wt))

# Draw dynamite plot
m +
  stat_summary(fun.y = mean, geom = "bar", fill = "skyblue") +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1),
               geom = "errorbar", width = 0.1)
```

## GGally

In the plot matrix,

+ continuous + continuous $\to$ `geom_point`
+ continuous + categorical $\to$ `geom_boxplot`
+ categorical + categorical $\to$ `geom_point`
+ categorical + itself $\to$ `geom_bar`
+ continuous + itself $\to$ `geom_freqpoly`

```{r}
# Pairs plot (scatterplot matrix) using GGally
library(GGally)
ggp <-
  mtcars %>%
  mutate_at(vars(cyl, am), as.factor) %>%
  ggpairs()
ggp +
  theme_tufte()
```
## heat map

```{r}
# Create color palette
library(RColorBrewer)
myColors <- brewer.pal(9, "Reds")

# Build the heat map from scratch
library(lattice)
data(barley)
ggplot(barley, aes(x = year, y = variety, fill = yield)) +
  geom_tile() +
  facet_wrap(~ site, ncol = 1) +
  scale_fill_gradientn(colors = myColors) +
  labs(
    # Chinese caption: "a heat map uses color to show how a third,
    # continuous variable changes across two categorical variables"
    caption = "热力图,就是用颜色来表达两个分类变量之间的关系,\n第三个连续变量的变化。"
  ) +
  theme_tufte() +
  theme(text = element_text(family = "STKaiti"))
```

## ribbon

```{r}
barley %>%
  ggplot(aes(x = year, y = yield, col = site, fill = site, group = site)) +
  stat_summary(fun.y = mean, geom = "line") +
  stat_summary(fun.data = mean_sdl,
               fun.args = list(mult = 1),
               geom = "ribbon",
               alpha = 0.1,
               col = NA)
```

```{r}
# Reproduce the plot
ggplot(diamonds, aes(x = carat, y = price, col = color)) +
  geom_point(alpha = 0.5, size = 0.5, shape = 16) +
  scale_x_log10(expression(log[10](Carat)), limits = c(0.1, 10)) +
  scale_y_log10(expression(log[10](Price)), limits = c(100, 100000)) +
  scale_color_brewer(palette = "YlOrRd") +
  coord_equal() +
  theme_classic()
```

`expression(log[10](...))` writes the axis title with a subscript 10; the log scale itself comes from `scale_x_log10()` and `scale_y_log10()`.

```{r}
diamonds %>%
  ggplot(aes(x = carat, y = price, col = color)) +
  geom_point(alpha = 0.5, size = 0.5, shape = 16) +
  scale_x_log10(expression(log[10](carat)), limits = c(0.1, 10)) +
  # expression(log[10](price)) is a neat way to typeset the axis title
  scale_y_log10(expression(log[10](price)),
                limits = c(1000, 10000)) +
  # a nicer color palette
  scale_color_brewer(palette = "YlOrRd") +
  coord_equal() +
  theme_classic()
```

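# Part 3

## Large dataset

### alpha blending [1]

In fact, once the labels are in place, it is just a matter of putting the `cor` values on the plot. Easy, so let's do it.

### Correlation matrix with ggplot

```{r}
# melt() comes from reshape2
library(reshape2)
cor_list <- function(x) {
  L <- M <- cor(x)

  # lower.tri simply selects the cells where i is greater than j
  M[lower.tri(M, diag = TRUE)] <- NA
  M <- melt(M)
  names(M)[3] <- "points"
  L[upper.tri(L, diag = TRUE)] <- NA
  L <- melt(L)
  names(L)[3] <- "labels"

  merge(M, L)
}

cor_list(iris[1:4])
# Missing values come from three places:
# 1. the diagonal of cor
# 2. the half blanked out with upper.tri
# 3. the half blanked out with lower.tri

iris1 <- iris %>%
  group_by(Species) %>%
  do(cor_list(.[1:4]))
# do() effectively unnests the result, which is handier than map here
iris1 %>%
  ggplot(aes(x = Var1, y = Var2)) +
  geom_point(aes(col = labels, size = abs(labels)), shape = 16) +
  geom_text(aes(x = Var2, y = Var1,
                # x and y are swapped here
                # so that the text lands in the lower triangle
                col = points,
                # size = abs(points),
                # size must not be mapped, or the text becomes invisible
                # hjust = 2,
                label = round(labels, 2))) +
  scale_size(range = c(0, 6)) +
  # controls the point size
  scale_color_gradient2("r", limits = c(-1, 1)) +
  scale_y_discrete("", limits = rev(levels(iris1$Var1))) +
  # rev reverses the factor order, which decides whether the
  # plot sits in the upper or the lower triangle
  scale_x_discrete("") +
  guides(size = FALSE) +
  # not much use
  geom_abline(slope = -1, intercept = nlevels(iris1$Var1) + 1) +
  coord_fixed() +
  facet_grid(. ~ Species) +
  # otherwise the panels overlap and look bad
  labs(
    # Chinese labels: caption "data source: iris",
    # subtitle "building a correlation matrix is simple: get the x and y
    # variables and the computed correlations right",
    # title "correlation matrix with ggplot"
    caption = "数据来源:iris",
    subtitle = "建立相关性矩阵很简单\n抓好x和y轴变量和计算的相关系数就好",
    title = "ggplot实现相关矩阵"
  ) +
  theme_tufte() +
  theme(text = element_text(family = "STKaiti")) +
  # font for displaying Chinese
  theme(axis.text.y = element_text(angle = 45, hjust = 1),
        axis.text.x = element_text(angle = 45, hjust = 1),
        strip.background = element_blank())
```

## ggtern ternary plot

<!-- ternary.png -->

This plot can show three variables,
$x,y,z$.
The scale along the bottom belongs to $z$.
Drop a perpendicular from the $z$ vertex,
and draw lines parallel to the side opposite the $z$ vertex.
The point where one of these parallel lines meets the scale gives the $z$ value of the data points on that line.
Clearly, points on parallel lines closer to the $z$ vertex have higher $z$ values.

<!-- ternary2.png -->

Here is an example.

```{r}
library(ggtern)
download.file("https://assets.datacamp.com/production/course_862/datasets/africa.RData", "africa.RData")
load("africa.RData")
```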
```{r}
# ggtern and ggplot2 are loaded
# Original plot:
ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
  geom_point(shape = 16, alpha = 0.2)

# Plot 1
ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
  geom_density_tern()

# Plot 2
ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
  stat_density_tern(geom = "polygon",
                    aes(fill = ..level.., alpha = ..level..)) +
  guides(fill = FALSE) # Suppress the legend
```

## geomnet

```{r}
# Load geomnet & examine structure of madmen
library(geomnet)
# str(madmen)

# Merge edges and vertices
mmnet <- merge(madmen$edges, madmen$vertices,
               by.x = "Name1", by.y = "label",
               all = TRUE)

# Examine structure of mmnet
# str(mmnet)
madmen$edges %>% head()
madmen$vertices %>% head()
mmnet %>% head()

# Finish the ggplot command
ggplot(data = mmnet, aes(from_id = Name1, to_id = Name2)) +
  geom_net(aes(col = Gender),
  size = 6, linewidth = 1,
  labelon = TRUE,
  # labelon = TRUE puts the labels on the nodes
  fontsize = 3,
  labelcolour = "black",
  directed = TRUE) +
  # directed = TRUE draws the edges as arrows
  scale_color_manual(values = c("#FF69B4", "#0099ff")) +
  xlim(c(-0.05, 1.05)) +
  ggmap::theme_nothing(legend = T) +
  # theme_nothing is a handy way to strip everything;
  # legend = T keeps the legend
  theme(legend.key = element_blank())
  # makes the background of the legend keys transparent
```
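## shape of points

<!-- pointshape.png -->

## ggfortify

ggfortify can turn base plot graphics into ggplot graphics.

This is done with the `autoplot` function, although I still do not understand what leverage is for. It even works for time series (`ts`) and multiple time series (`mts`).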
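On the leverage question: the diagnostic panels for a linear model include a residuals versus leverage plot, and leverage measures how far an observation's predictors are from the bulk of the data, and therefore how strongly it can pull the fitted line. As a minimal sketch, assuming an ordinary least squares fit, leverage is the diagonal of the hat matrix $H = X(X^\top X)^{-1}X^\top$. The model `mpg ~ wt + hp` is an arbitrary choice for illustration.

```r
# Leverage by hand for an ordinary least squares fit
fit <- lm(mpg ~ wt + hp, data = mtcars)

X   <- model.matrix(fit)                  # design matrix, including the intercept
H   <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
lev <- diag(H)                            # leverage of each observation

# Base R computes the same values with hatvalues()
head(cbind(by_hand = lev, hatvalues = hatvalues(fit)))

# The leverages sum to the number of coefficients
sum(lev)
```

Observations with high leverage and a large residual are the ones worth checking, which is what the residuals versus leverage panel is meant to reveal.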

I did not really follow the `cmdscale` function (cmdscale function | R Documentation) used in Distance matrices and Multi-Dimensional Scaling (MDS) | R; no mathematical formula is given there.
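For the missing formula, classical (metric) MDS can be sketched as follows, under the standard textbook definition: square the distances, double-center them to get $B = -\tfrac{1}{2} J D^{(2)} J$ with $J = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top$, and take the coordinates $X = V_k \Lambda_k^{1/2}$ from the top $k$ eigenvalues and eigenvectors of $B$. The code below is an illustration on the built-in `eurodist` data, not a description of how `cmdscale` is implemented internally.

```r
# Classical MDS by hand, compared with cmdscale()
d  <- eurodist                        # built-in distances between European cities
D2 <- as.matrix(d)^2                  # squared distances
n  <- nrow(D2)

J <- diag(n) - matrix(1 / n, n, n)    # centering matrix
B <- -0.5 * J %*% D2 %*% J            # double-centered matrix

e <- eigen(B, symmetric = TRUE)
k <- 2
manual <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))

ref <- cmdscale(d, k = k)

# Eigenvectors are only defined up to sign, so compare absolute values
rel_diff <- max(abs(abs(manual) - abs(ref))) / max(abs(ref))
rel_diff
```

Each column of `manual` is one MDS axis; plotting the two columns against each other gives the familiar two-dimensional map of the cities.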

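### Visualizing clustering models

`cluster::clara()`, `cluster::fanny()` and `cluster::pam()` are clustering models, and `stats::prcomp()` is principal component analysis. ggfortify can visualize the results of all of them, which makes them easier to understand. Here `stats::kmeans` is used as the example.

```{r}
library(stats)
# use kmeans
library(ggfortify)
# Perform clustering
iris_k <- kmeans(iris[-5], centers = 3)

# Autoplot: color according to cluster
autoplot(iris_k, data = iris, frame = T)
# frame = T
# draw a polygon around each cluster.

# Autoplot: above, plus shape according to species
autoplot(iris_k, data = iris, frame = T, shape = "Species")
# Each frame clearly contains more than one species, so the clustering is not great.
```

ggfortify is troublesome to install.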

## map

> A choropleth map (from Greek χώρο ("area/region") + πλήθος ("multitude")) is a thematic map.

```{r}
# maps, ggplot2, and ggmap are pre-loaded
# Use map_data() to create usa and inspect
library(ggmap)
usa <- map_data("usa")
str(usa)

# Build the map
ggplot(usa, aes(x = long, y = lat, group = group)) +
  geom_polygon() +                             # the key layer for drawing the map
  geom_point(aes(col = cut_number(lat, 3))) +  # color the points by latitude band
  coord_map() +
  theme_nothing()                              # ggmap::theme_nothing
```

```{r}
library(tidyverse)
library(ggmap)
get_map(location = "Shanghai") %>% ggmap()
```

__Because this calls Google Maps__, a proxy may be needed to get past network restrictions, and it is somewhat slow (the request goes to the Google Maps database, and the download may be incomplete under those restrictions).

Apart from that, it works very well.

## gganimate

The [gganimate](https://github.com/dgrtwo/gganimate) package is well suited to showing how a plot changes.

```{r}
library(ggthemes)
# Legacy gganimate API; it animates over a `frame` aesthetic,
# which this plot does not define yet.
p <- mtcars %>%
  ggplot(aes(x = mpg, y = disp, col = cyl)) +
  geom_point() +
  # labs() +
  theme_tufte()
gg_animate(p, filename = "mtcars.gif", interval = 1.0)
```

# References


  1. Blending means mixing, similar to blending in modeling.