
Data Visualization with ggplot2 at DataCamp: Study Notes

This post was last updated on 2020-10-10. If you spot a problem or have a suggestion, feel free to open an Issue.

Scavetta (2017a), Scavetta (2017b), and (???) give a fairly systematic tutorial; these notes work through the material separately and write it up.

1 Part 1

1.1 geom_histogram

# Custom color code
myBlue <- "#377EB8"
# This value can be chosen with a color picker

# Change the fill color to myBlue
ggplot(mtcars, aes(x = mpg, y = ..count..)) +
  geom_histogram(binwidth = 1, aes(y = ..count..), fill = myBlue)
ggplot(mtcars, aes(x = mpg, y = ..density..)) +
  geom_histogram(binwidth = 1, aes(y = ..density..), fill = myBlue)
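  • ..count.. is the count (absolute frequency) of observations in each bin
  • ..density.. is the density: the relative frequency divided by the bin width, so the bar areas sum to 1 (with binwidth = 1 it equals the relative frequency)

When no binwidth is given, geom_histogram uses 30 bins, so the bin width is roughly diff(range(dataset$x))/30 (Scavetta 2017a, Histograms | R)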
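
As a quick check of that rule of thumb, the width can be computed by hand. This is a minimal sketch in base R; the variable names are illustrative, and recent ggplot2 releases may place the 30 bins slightly differently, so the number should be read as an approximation.

```r
# Approximate default bin width for mtcars$mpg (30 bins over the data range)
x <- mtcars$mpg
default_binwidth <- diff(range(x)) / 30
default_binwidth
```

Passing a value close to this as binwidth should give a histogram similar to the default one.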

1.2 position

for (i in c("stack", "fill", "dodge")) {
  p1 <-
    ggplot(mtcars, aes(x = cyl, fill = factor(am))) +
    geom_bar(position = i) +
    theme_minimal() +
    # Use a ColorBrewer palette (here "Set1") instead of picking colors by hand
    scale_fill_brewer(palette = "Set1")
  print(p1)
}
  • stack: stacked bars showing counts
  • fill: stacked bars showing proportions; here a pie chart is sometimes the better choice.
  • dodge: bars side by side (Scavetta 2017a, Position | R)
# Convert bar chart to pie chart
ggplot(mtcars, aes(x = factor(1), fill = as.factor(am))) +
  geom_bar(position = "fill", width = 1) +
  facet_grid(. ~ cyl) +
  coord_polar(theta = "y")

1.3 geom_rect

geom_rect draws rectangles, here used to shade the recession periods.

download.file("https://assets.datacamp.com/production/course_774/datasets/recess.RData","recess.RData")
load("recess.RData")
ggplot(economics, aes(x = date, y = unemploy/pop)) +
  geom_line() +
  geom_rect(data = recess, inherit.aes = FALSE,
            aes(xmin = begin, xmax = end, ymin = -Inf, ymax = +Inf), fill = "red", alpha = 0.2)

2 Part 2

2.1 Overriding aes

ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm", se = F) +
  stat_smooth(method = "lm", se = F, aes(group = 123))

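Setting group = 123 (any constant works) overrides the grouping implied by the col aesthetic set in ggplot(), so the second stat_smooth() fits a single overall line through all the points.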

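2.2 Understanding geom_boxplot and geom_violin

In geom_boxplot, varwidth = TRUE makes the box width reflect the sample size: widths are proportional to the square root of the number of observations in each group.

In geom_violin, the scale argument plays a similar role: If "count", areas are scaled proportionally to the number of observations.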

diamonds %>% 
  ggplot(aes(x = color, y = depth)) +
    geom_boxplot(varwidth = T)
diamonds %>% 
  ggplot(aes(x = color, y = depth)) +
    geom_violin(scale = "count")

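2.3 Understanding geom_density

Given a finite sample, each observation's value is taken as the mean of a small distribution (a kernel) with an assumed standard deviation. Adding up all of these distributions and normalizing gives the curve drawn by geom_density.

The curve is only drawn over the range of the observed data; the kernel tails that extend beyond that range are cut off, which is why the plot looks truncated at its left and right edges.

The bandwidth plays the role of the standard deviation: the smaller it is, the narrower and steeper each assumed kernel, and the more easily the overall estimate shows two or more peaks. The documentation describes it as follows.

bw: The smoothing bandwidth to be used. If numeric, the standard deviation of the smoothing kernel. If character, a rule to choose the bandwidth, as listed in bw.nrd.

weight brings in the effect of sample size: if a value occurs many times, its kernel contributes more, so the density is higher around that value.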
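
The "sum of small distributions" idea can be sketched by hand and compared with base R's density(). This is an illustrative sketch, not the code ggplot2 runs internally; the toy sample and the bandwidth of 0.5 are assumptions chosen for the example.

```r
# Manual Gaussian kernel density estimate: one normal curve per observation,
# averaged so that the total area is 1
obs <- c(1, 2, 2.5, 4, 7)   # toy sample (hypothetical values)
bw  <- 0.5                  # assumed bandwidth = standard deviation of each kernel

d <- density(obs, bw = bw)  # reference estimate from base R

# Evaluate the averaged kernels on the same grid that density() used
manual_kde <- sapply(d$x, function(g) mean(dnorm(g, mean = obs, sd = bw)))

# Largest gap between the hand-built curve and density()
max_gap <- max(abs(manual_kde - d$y))
max_gap
```

The gap is not exactly zero because density() uses a binned approximation, but the two curves should be visually indistinguishable.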

The object returned by the density() function contains, among others:

x the n coordinates of the points where the density is estimated.

y the estimated density values. These will be non-negative, but can be zero.

# test_data is available

# Calculating density: d
d <- density(diamonds$depth)

# Use which.max() to calculate mode
mode <- d$x[which.max(d$y)]

# Finish the ggplot call
ggplot(diamonds, aes(x = depth)) +
  geom_rug() +
  geom_density() +
  geom_vline(xintercept = mode, col = "red")

This way of finding the mode is simple and worth reusing.

# Create test_data: 1000 draws from a standard normal distribution
test_data <- data.frame(norm = rnorm(1000))
# Arguments you'll need later on
fun_args <- list(mean = mean(test_data$norm), sd = sd(test_data$norm))

# Finish the ggplot
ggplot(test_data, aes(x = norm)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density(col = "red") +
  stat_function(fun = dnorm, args = fun_args, col = "blue")

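Note that

  • the histogram shows the sample itself (1000 observations),
  • the red density line is the kernel density estimate (the sum of the kernels), which is why it is cut off at the edges of the data,
  • the blue line is the normal distribution whose \(\mu\) and \(\sigma\) are estimated directly from the sample (1000 observations).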

2.3.1 Adjusting density plots

adjust: A multiplicative bandwidth adjustment. This makes it possible to adjust the bandwidth while still using a bandwidth estimator. For example, adjust = 1/2 means use half of the default bandwidth. (Scavetta 2017b, Adjusting density plots | R)

Let's verify this.

# test_data (defined above) is reused here

# Get the bandwidth
get_bw <- density(test_data$norm)$bw

# Basic plotting object
p <- ggplot(test_data, aes(x = norm)) +
  geom_rug() +
  coord_cartesian(ylim = c(0,0.5))

# Create two plots: adjust = 0.25 versus a quarter of the default bandwidth
p + geom_density(adjust = 0.25)
p + geom_density(bw = 0.25 * get_bw)
# The two plots are practically identical.
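
The same equivalence can be checked numerically rather than by eye. A minimal sketch using base R's density(), assuming a simulated sample (the seed and sample size are arbitrary):

```r
set.seed(1)
x <- rnorm(1000)

default_bw  <- density(x)$bw                 # bandwidth picked by the default rule
adjusted_bw <- density(x, adjust = 0.25)$bw  # same rule, scaled by adjust

all.equal(adjusted_bw, 0.25 * default_bw)
```

In other words, adjust only rescales whatever bandwidth the estimator would have chosen.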

kernel - kernel used for density estimation, defined as

  • "g" = gaussian
  • "r" = rectangular
  • "t" = triangular
  • "e" = epanechnikov
  • "b" = biweight
  • "c" = cosine
  • "o" = optcosine

2.4 stat_density_2d

diamonds %>% 
  ggplot(aes(x = depth, y = table)) + 
    stat_density_2d(aes(col = ..level..), h = c(5, 0.5))
# h holds the two bandwidths, one for x and one for y
# ..level.. maps the contour level to color, which is very handy

2.5 geom_rug()

p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point()
p
p + geom_rug()

2.6 stat_smooth

span: Smaller numbers produce wigglier lines, larger numbers produce smoother lines. The smaller the span, the narrower the window and the fewer observations are averaged in each local fit. (Scavetta 2017b, Modifying stat_smooth | R)

# Plot 1: change the LOESS span
for (i in c(0.2,0.7)){
p1 <- 
  ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Add span below
  geom_smooth(se = F, span = i)
print(p1)
}
# Plot 2: Set the overall model to LOESS and use a span of 0.7
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm", se = F) +
  # Change method and add span below
  stat_smooth(method = "loess", aes(group = 1),
              se = F, col = "black",span = 0.7)

# Plot 3: Set col to "All", inside the aes layer of stat_smooth()
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm", se = F) +
  stat_smooth(method = "loess",
              # Add col inside aes()
              aes(group = 1,col = "All"),
              # Remove the col argument below
              # Mapping col inside aes() makes this line appear in the legend
              se = F, span = 0.7)

# Plot 4: Add scale_color_manual to change the colors
library(RColorBrewer)
myColors <- c(brewer.pal(3, "Dark2"), "black")
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm", se = F, span = 0.75) +
  stat_smooth(method = "loess",
              aes(group = 1, col="All"),
              se = F, span = 0.7) +
  # Add correct arguments to scale_color_manual
  scale_color_manual("Cylinders", values = myColors)

scale_color_brewer() to use a default ColorBrewer. This should result in an error, since the default palette, “Blues”, only has 9 colors, but we have 16 years here. (Scavetta 2017b, Modifying stat_smooth (2) | R)

# Plot 1: Jittered scatter plot, add a linear model (lm) smooth
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  geom_jitter(alpha = 0.2) +
  stat_smooth(method = "lm", se = F)

# Plot 2: Only lm, colored by year
ggplot(Vocab, aes(x = education, y = vocabulary, col = factor(year))) +
  stat_smooth(method = "lm", se = F)

# Plot 3: Set a color brewer palette
ggplot(Vocab, aes(x = education, y = vocabulary, col = factor(year))) +
  stat_smooth(method = "lm", se = F) +
  scale_color_brewer()

# Plot 4: Change col and group, specify alpha, size and geom, and add scale_color_gradient
ggplot(Vocab, aes(x = education, y = vocabulary, col = year, group = factor(year))) +
  stat_smooth(method = "lm", se = F, alpha = 0.6, size = 2) +
  scale_color_gradientn(colors = brewer.pal(9,"YlOrRd"))

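When there are many groups and the grouping variable is an integer, mapping it as a continuous variable works better than converting it to a factor.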

2.7 geom_quantile

This fits a quantile regression to the data and draws the fitted quantiles with lines. This is as a continuous analogue to geom_boxplot. (Scavetta 2017b, Quantiles | R)

In other words, each x gets several fitted y values, one per quantile.

m <- ggplot(mpg, aes(displ, 1 / hwy)) + geom_point()
m + geom_quantile()
m + geom_quantile(quantiles = 0.5)
q10 <- seq(0.05, 0.95, by = 0.05)
m + geom_quantile(quantiles = q10)

2.8 sum

Another useful stat function is stat_sum() which calculates the count for each group. (Scavetta 2017b, Sum | R)

range a numeric vector of length 2 that specifies the minimum and maximum size of the plotting symbol after transformation.

ggplot(mpg, aes(cty, hwy)) +
 geom_point()

ggplot(mpg, aes(cty, hwy)) +
 geom_count()
# Point size shows how many observations share each (x, y) position
ggplot(mpg, aes(cty, hwy)) +
 geom_count() +
 scale_size_area()
ggplot(mpg, aes(cty, hwy)) +
 geom_count(aes(size = ..prop..))
# Size now shows the proportion instead of the count
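
What geom_count() computes can be reproduced with a plain table. The sketch below uses mtcars (cyl against gear) instead of mpg so that it runs in base R alone; the column names n and prop mirror the computed variables, and the proportion is taken over all observations, which is an assumption that holds only when no extra grouping is mapped.

```r
# Count observations per (x, y) position, as stat_sum() / geom_count() does
pos_counts <- as.data.frame(table(cyl = mtcars$cyl, gear = mtcars$gear))
pos_counts <- pos_counts[pos_counts$Freq > 0, ]       # keep occupied positions only
names(pos_counts)[3] <- "n"
pos_counts$prop <- pos_counts$n / sum(pos_counts$n)   # share of all observations
pos_counts
```

Each row corresponds to one point in the plot, and n (or prop) is what gets mapped to size.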

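The first chapter is worth reviewing carefully. All of these plots can be reproduced by hand, but what really matters is the idea behind each one.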

2.9 stat_summary

mult stands for multiplier: for smean.sdl it is the multiplier of the standard deviation, and for smean.cl.normal it is the multiplier of the standard error of the mean.
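
To make the role of the multiplier concrete, the two summaries can be written out by hand. The helpers below are hypothetical stand-ins written for this note, not the Hmisc implementations; they assume the usual formulas (mean plus or minus mult standard deviations, and a t-based interval for the mean).

```r
# Hypothetical helper: mean +/- mult standard deviations
sdl_by_hand <- function(x, mult = 2) {
  data.frame(y    = mean(x),
             ymin = mean(x) - mult * sd(x),
             ymax = mean(x) + mult * sd(x))
}

# Hypothetical helper: mean +/- t-quantile times the standard error
cl_normal_by_hand <- function(x, conf.int = 0.95) {
  n    <- length(x)
  se   <- sd(x) / sqrt(n)
  mult <- qt((1 + conf.int) / 2, df = n - 1)  # multiplier of the standard error
  data.frame(y = mean(x), ymin = mean(x) - mult * se, ymax = mean(x) + mult * se)
}

sdl_by_hand(mtcars$wt, mult = 1)
cl_normal_by_hand(mtcars$wt)
```

Both return a one-row data frame with y, ymin and ymax, which is the shape stat_summary() expects from fun.data.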

stat_summary operates on unique x; stat_summary_bin operators on binned x. They are more flexible versions of stat_bin: instead of just counting, they can compute any aggregate.

These stat_* functions are a pleasure to use because they are so customizable.

# Display structure of mtcars
str(mtcars)

# Convert cyl and am to factors:
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$am <- as.factor(mtcars$am)

# Define positions:
posn.d <- position_dodge(width = 0.1) 
posn.jd <- position_jitterdodge(jitter.width = 0.1, dodge.width = 0.2) 
posn.j <- position_jitter(width = 0.2) 

# base layers:
wt.cyl.am <- mtcars %>% 
  ggplot(aes(x = cyl, y = wt, col = am, fill = am, group = am))
wt.cyl.am +
  geom_point(position = posn.jd, alpha = 0.6)
  # Jitter is used here mainly to keep the points from overlapping.
# Loop over the function names so that the name can be shown in the subtitle
for (i in c("mean_sdl", "mean_cl_normal")) {
  wt.cyl.am.p <- wt.cyl.am +
    stat_summary(fun.data = i, fun.args = list(mult = 1),
                 position = posn.d) +
    labs(
      title = "Mean and SD",
      subtitle = paste(
        "This makes the groups easy to compare.\nThe default geom here is geom_pointrange(),",
        "using",
        i)
    )
  print(wt.cyl.am.p)
}
wt.cyl.am +
  stat_summary(geom = "point", fun.y = mean,
               position = posn.d) +
  stat_summary(geom = "errorbar", fun.data = mean_sdl,
               position = posn.d, fun.args = list(mult = 1), width = 0.1)
# Play vector xx is available
xx <- 1:100
# Function to save range for use in ggplot:
gg_range <- function(x) {
  # Change x below to return the instructed values
  data.frame(ymin = min(x), # Min
             ymax = max(x)) # Max
}

gg_range(xx)
# Required output:
#   ymin ymax
# 1    1  100

# Custom function: median and interquartile range
med_IQR <- function(x) {
  # Change x below to return the instructed values
  data.frame(y =  median(x), # Median
             ymin = quantile(x)[2], # 1st quartile
             ymax = quantile(x)[4])  # 3rd quartile
}

med_IQR(xx)
# Required output:
#        y  ymin  ymax
# 25% 50.5 25.75 75.25
1:100 %>% 
  quantile()
# The base ggplot command, you don't have to change this
wt.cyl.am <- 
  ggplot(mtcars, aes(x = cyl,y = wt, col = am, fill = am, group = am))

# Add three stat_summary calls to wt.cyl.am
wt.cyl.am +
  stat_summary(geom = "linerange", fun.data = med_IQR,
               position = posn.d, size = 3) +
  stat_summary(geom = "linerange", fun.data = gg_range,
               position = posn.d, size = 3,
               alpha = 0.4) +
  stat_summary(geom = "point", fun.y = median,
               position = posn.d, size = 3,
               col = "black", shape = "X") +
  labs(
    subtitle =
      "The X marks the median\nThe dark bar spans the quartiles\nThe light bar spans the minimum and maximum"
  )

2.9.1 errorbar

# Base layers
m <- ggplot(mtcars, aes(x = cyl,y = wt, col = as.factor(am), fill = as.factor(am)))

# Plot 1: Draw dynamite plot
m +
  stat_summary(fun.y = mean, geom = "bar", alpha = 0.2) +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.1)

# Plot 2: Set position dodge in each stat function
m +
  stat_summary(fun.y = mean, geom = "bar", position = "dodge", alpha = 0.2) +
  stat_summary(fun.data = mean_sdl, 
               fun.args = list(mult = 1), 
               geom = "errorbar", 
               width = 0.1, position = "dodge")

# Set your dodge posn manually
posn.d <- position_dodge(0.9)

# Plot 3:  Redraw dynamite plot
m +
  stat_summary(fun.y = mean, geom = "bar", position = posn.d, alpha = 0.2) +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.1, position = posn.d)
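
# Layer that adds the mean and its 95% confidence interval as a crossbar
stat_summary(fun.data = "mean_cl_normal",
             geom = "crossbar",
             width = 0.2,
             col = "red")

This is the easiest way to add a 95% confidence interval.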

diamonds %>% 
  ggplot(aes(x = color, y = depth)) +
  geom_point() +
  stat_summary(fun.data = "mean_cl_normal",
               geom = "crossbar",
               width = 0.2,
               col = "red")

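The 95% confidence interval is a narrow band in the middle of the points. It describes the uncertainty of the mean rather than the spread of the data, so with this many observations it stays tiny even though the data themselves are widely scattered.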

2.10 Zoom in

# Basic ggplot() command, coded for you
p <- ggplot(mtcars, aes(x = wt, y = hp, col = am)) + geom_point() + geom_smooth()

# Add scale_x_continuous
p + scale_x_continuous(limits = c(3, 6), expand = c(0, 0)) +
  labs(
    caption =
      "Setting limits drops the points outside them before smoothing;\nonly two green points remain, so no smooth line can be drawn for that group."
  )

# The proper way to zoom in:
p + coord_cartesian(xlim = c(3, 6)) +
  labs(
    caption =
      "The smooth is fitted on the full data set and the view is cropped afterwards,\nso the smooth line is still there after zooming."
  )
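
The difference between the two approaches comes down to whether rows are dropped before the model is fitted. A minimal sketch of that idea in base R, using lm() as a stand-in for the smoother (an assumption made to keep the example short; geom_smooth() uses loess here by default):

```r
full_fit <- lm(hp ~ wt, data = mtcars)

# What scale_x_continuous(limits = c(3, 6)) effectively does: drop rows first
zoomed_data <- subset(mtcars, wt >= 3 & wt <= 6)
subset_fit  <- lm(hp ~ wt, data = zoomed_data)

# coord_cartesian(xlim = c(3, 6)) would keep full_fit and only crop the view
coef(full_fit)
coef(subset_fit)
```

The two sets of coefficients differ, which is why the fitted line changes (or disappears) when limits are set on the scale.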

expand These constants ensure that the data is placed some distance away from the axes. The defaults are c(0.05, 0) for continuous variables, and c(0, 0.6) for discrete variables.

2.11 aspect

# Complete basic scatter plot function
base.plot <- ggplot(iris, aes(x = Sepal.Length, 
                              y = Sepal.Width,
                              col = Species)) +
               geom_jitter() +
               geom_smooth(method = "lm", se = F)

# Fix the aspect ratio at 1:1 with coord_fixed()
base.plot + coord_fixed(ratio = 1/1) +
  labs(
    subtitle =
      "The x and y ranges differ; if they were equal the panel would be square"
  )

# coord_equal() is an alias for coord_fixed() with ratio = 1
base.plot + coord_equal()

The point is to actually take something away from each exercise.

2.12 coord_polar()

We can imagine two forms for pie charts - the typical filled circle, or a colored ring.

Understanding polar coordinates

As an example, consider the stacked bar chart shown in the viewer. Imagine that we just take the y axis on the left and bend it until it loops back on itself, while expanding the right side as we go along. We’d end up with a pie chart - it’s simply a bar chart transformed onto a polar coordinate system.

# Create stacked bar plot: thin.bar
thin.bar <- ggplot(mtcars, aes(x = 1, 
                               fill = as.factor(cyl))) +
  geom_bar() + 
  labs(
    subtitle =
      "x is a constant and no y is mapped,\nso the bar stacks at a single x position and the colored segments are sized by count"
  )

# Convert thin.bar to pie chart
thin.bar +
 coord_polar(theta = "y")
 # y is mapped to the angle of the polar coordinate system
 # the labels around the circle are counts

# Create stacked bar plot: wide.bar
wide.bar <- ggplot(mtcars, aes(x = 1, 
                               fill = as.factor(cyl))) +
              geom_bar(width = 1)

# Convert wide.bar to pie chart
wide.bar + 
  coord_polar(theta = "y")

2.13 facet

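mtcars %>%
  tibble::rownames_to_column("rowname") %>%
  ggplot(aes(x = mpg, y = rowname)) +
  geom_point() +
  # space = "free_y" only takes effect when the y scales are free as well
  facet_grid(cyl ~ ., scales = "free_y", space = "free_y") +
  labs(
    subtitle =
      "space = \"free_y\" makes each panel's height follow its number of rows"
  )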

2.14 theme

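These are all the places in a ggplot figure that carry text.

There are three kinds of element functions for adjusting theme items:

  • element_text()
  • element_line()
  • element_rect()

For element_rect(), fill sets the interior color and col sets the border color.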

2.14.1 panel.grid

mtcars %>% 
  ggplot(aes(x = mpg, y = disp, col = as.factor(cyl))) +
  geom_point() +
  theme(
    panel.grid = element_blank(),
    # removes the grid lines
    panel.background = element_blank(),
    # removes the panel background
    axis.line = element_line(color = "black")
    # the result is much cleaner
  )

2.14.2 strip.text

mtcars %>% 
  ggplot(aes(x = mpg, y = disp)) +
  geom_point() +
  facet_grid(. ~ as.factor(cyl)) +
  theme(
    panel.grid = element_blank(),
    # removes the grid lines
    panel.background = element_blank(),
    # removes the panel background
    axis.line = element_line(color = "black")
    # the result is much cleaner
  ) + 
  theme(
    strip.background = element_blank(),
    strip.text = element_text(face = "bold", size = 12)
  )

2.14.3 Legends

z <- 
  mtcars %>% 
  ggplot(aes(x = mpg, y = disp, col = as.factor(cyl))) + 
  geom_point() +
  facet_grid(. ~ cyl)
# Move legend by position
z + theme(
  legend.position = c(0.85,0.85)
  )

# Change direction
z + theme(
  legend.position = c(0.85,0.85),
  legend.direction = "horizontal"
  )

# Change location by name
z + theme(
  legend.position = "bottom",
  legend.direction = "horizontal"
  )

# Remove legend entirely
z + theme(
  legend.position = "none",
  legend.direction = "horizontal"
  )

2.14.4 margin

z
z + 
  theme(
    panel.spacing.x = grid::unit(2,"cm")
  )
z +
  theme(
    plot.margin = unit(c(0, 0, 0, 0), "cm")
    # margin around the whole plot
  ) +
  labs(
    subtitle = "Plot margin set to zero"
  )

2.14.5 Get, set, and modify the active theme

The current/active theme is automatically applied to every plot you draw. Use theme_get to get the current theme, and theme_set to completely override it. theme_update and theme_replace are shorthands for changing individual elements.

I do not fully understand this yet.
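
One way to read that description is as a global setting that can be swapped, tweaked and restored. The following is a small sketch under that reading, assuming ggplot2 is installed; the variable names are illustrative.

```r
library(ggplot2)

old_theme <- theme_set(theme_minimal())   # replace the active theme; the old one is returned
theme_update(legend.position = "bottom")  # change a single element of the active theme

# Inspect the active theme: any plot printed now would use these settings
active_legend_position <- theme_get()$legend.position
active_legend_position

theme_set(old_theme)                      # restore the previous theme
```

Keeping the value returned by theme_set() makes it easy to undo the change once the plots are drawn.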


2.14.6 ggthemes

library(ggthemes)
mtcars %>% 
  ggplot(aes(x = mpg, y = disp, col = as.factor(cyl))) + 
  geom_point() +
  facet_grid(. ~ cyl) +
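  labs(
    # The title is deliberately left in Chinese ("test whether Chinese text survives theme_tufte()")
    title = "测试中文是否能够被tufte修改"
  ) +
  theme_tufte() +
  # a font with Chinese glyphs is needed to render the title
  theme(text = element_text(family = "STKaiti"))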
# Base layers
m <- ggplot(mtcars, aes(x = cyl, y = wt))

# Draw dynamite plot
m +
  stat_summary(fun.y = mean, geom = "bar", fill = "skyblue") +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.1)

2.15 GGally

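In the ggpairs() matrix, the panel type depends on the types of the two variables:

  • continuous + continuous \(\to\) geom_point
  • continuous + categorical \(\to\) geom_boxplot
  • categorical + categorical \(\to\) geom_point
  • categorical with itself \(\to\) geom_bar
  • continuous with itself \(\to\) geom_freqpoly
# Pairs plot (scatterplot matrix) using GGally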
library(GGally)
ggp <- 
  mtcars %>% 
  mutate_at(vars(cyl,am),as.factor) %>% 
  ggpairs()
ggp +
  theme_tufte()

2.16 heat map

# Create color palette
library(RColorBrewer)
myColors <- brewer.pal(9, "Reds")

# Build the heat map from scratch
library(lattice)
data(barley)
ggplot(barley,aes(x = year, y = variety, fill = yield)) +
  geom_tile() +
  facet_wrap(~ site, ncol = 1) +
  scale_fill_gradientn(colors = myColors) +
  labs(
    caption =
      "A heat map uses color to show how a third, continuous variable\nvaries across the combinations of two categorical variables."
  ) +
  theme_tufte()

2.17 ribbon

barley %>% 
  ggplot(aes(x = year, y = yield, col = site, fill = site,group = site)) +
  stat_summary(fun.y = mean, geom = "line") +
  stat_summary(fun.data = mean_sdl,
               fun.args = list(mult = 1),
               geom = "ribbon",
               alpha = 0.1,
               col = NA)
# Reproduce the plot
ggplot(diamonds, aes(x = carat, y = price, col = color)) +
  geom_point(alpha = 0.5, size = 0.5, shape = 16) +
  scale_x_log10(expression(log[10](Carat)), limits = c(0.1,10)) +
  scale_y_log10(expression(log[10](Price)), limits = c(100,100000)) +
  scale_color_brewer(palette = "YlOrRd") +
  coord_equal() +
  theme_classic()

2.18 Labeling log scales with expression(log10)

diamonds %>% 
  ggplot(aes(x = carat, y = price, col = color)) +
  geom_point(alpha = 0.5, size = 0.5, shape = 16) +
  scale_x_log10(expression(log[10](carat)),
                limits = c(0.1,10)) +
  scale_y_log10(expression(log[10](price)),
  # expression(log[10](price)) renders the axis title with a proper subscript
                limits = c(1000,10000)) +
  scale_color_brewer(palette = "YlOrRd") +
  # a more pleasant color palette
  coord_equal() +
  theme_classic()

3 Part 3

3.1 Large dataset

alpha blending [1]

All it takes is to set up the labels and place the correlation values on them. It is easy, so let's do it.

Correlation matrix with ggplot

library(reshape2)  # provides melt()

cor_list <- function(x) {
  L <- M <- cor(x)

  # lower.tri() selects the entries whose row index i is greater than the column index j
  M[lower.tri(M, diag = TRUE)] <- NA
  M <- melt(M)
  names(M)[3] <- "points"
  L[upper.tri(L, diag = TRUE)] <- NA
  L <- melt(L)
  names(L)[3] <- "labels"

  merge(M, L)
}

cor_list(iris[1:4])
# The missing values come from three places:
# 1. the diagonal of the correlation matrix
# 2. the half removed with upper.tri
# 3. the half removed with lower.tri
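
The masking step inside cor_list() can be inspected in isolation. A minimal sketch in base R, with illustrative variable names, that keeps only the pairs above the diagonal:

```r
M <- cor(iris[1:4])

# lower.tri(M, diag = TRUE) is TRUE where the row index >= the column index
M_upper <- M
M_upper[lower.tri(M_upper, diag = TRUE)] <- NA

n_kept <- sum(!is.na(M_upper))   # 4 * 3 / 2 = 6 pairs above the diagonal
n_kept
```

With four variables there are six distinct pairs, so six correlations survive in each triangle and everything else becomes NA.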


iris1 <- 
iris %>%
  group_by(Species) %>%
  do(cor_list(.[1:4])) 
# do() returns the per-group results already unnested, which is more convenient than map()
iris1 %>% 
  ggplot(aes(x = Var1, y = Var2)) +
    geom_point(aes(col = labels, 
                   size = abs(labels)), shape = 16) +
  geom_text(aes(x = Var2, y = Var1, 
                # x and y are swapped here
                # so that the text lands in the opposite triangle
                col = points,
                # size = abs(points),
                # do not map size here, otherwise small values become unreadable
                # hjust = 2,
                label = round(labels, 2))) +
  scale_size(range = c(0, 6)) +
  # controls the range of point sizes
  scale_color_gradient2("r", limits = c(-1, 1)) +
  scale_y_discrete("", limits = rev(levels(iris1$Var1))) +
  # rev() reverses the factor levels, which decides whether the points sit in the upper or lower triangle
  scale_x_discrete("") +
  guides(size = "none") +
  # the size legend adds little, so it is hidden
  geom_abline(slope = -1, intercept = nlevels(iris1$Var1) + 1) +
  coord_fixed() +
  facet_grid(. ~ Species) +
  # one panel per species; otherwise the points overlap and are hard to read
  labs(
    caption = "Data source: iris",
    subtitle = "Building a correlation matrix is simple:\nmap the variable pairs to x and y and plot the computed coefficients",
    title = "Correlation matrix with ggplot"
  ) +
  theme_tufte() +
  theme(axis.text.y = element_text(angle = 45, hjust = 1),
        axis.text.x = element_text(angle = 45, hjust = 1),
        strip.background = element_blank())

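3.2 Ternary plots with ggtern

A ternary plot shows three variables, \(x, y, z\). In the figure, the scale along the bottom belongs to \(z\). Drop a perpendicular from the \(z\) vertex, and draw lines parallel to the side opposite the \(z\) vertex. The point where each of these parallel lines meets the \(z\) scale gives the \(z\) value of every data point on that line. Clearly, points on parallel lines closer to the \(z\) vertex have larger \(z\) values.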
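
The geometry behind that reading rule can be sketched with barycentric coordinates. The vertex layout below (X at the bottom left, Y at the bottom right, Z at the top) is an assumption chosen for the illustration and is not necessarily how ggtern arranges its axes; tern_to_xy is a hypothetical helper.

```r
# Hypothetical layout: X at (0, 0), Y at (1, 0), Z at (0.5, sqrt(3) / 2)
tern_to_xy <- function(x, y, z) {
  total <- x + y + z                  # normalise so the three parts sum to 1
  y <- y / total
  z <- z / total
  data.frame(h = y + 0.5 * z,         # horizontal position
             v = z * sqrt(3) / 2)     # height above the side opposite Z
}

pts <- tern_to_xy(x = c(60, 20, 10), y = c(20, 60, 10), z = c(20, 20, 80))
pts
```

The first two points share the same \(z\) share and therefore the same height, so they lie on one line parallel to the side opposite Z; the third point has a larger \(z\) and sits closer to the Z vertex.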

Here is an example.

library(ggtern)
download.file("https://assets.datacamp.com/production/course_862/datasets/africa.RData","africa.RData")
load("africa.RData")
# ggtern and ggplot2 are loaded
# Original plot:
ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
  geom_point(shape = 16, alpha = 0.2)

# Plot 1
ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
  geom_density_tern()

# Plot 2
ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
  stat_density_tern(
  geom = "polygon", 
  aes(fill = ..level.., 
  alpha = ..level..)) +
  guides(fill = FALSE)
  # Suppress the legend

3.3 geomnet

# Load geomnet & examine structure of madmen
library(geomnet)
# str(madmen)

# Merge edges and vertices
mmnet <- merge(madmen$edges, madmen$vertices,
               by.x = "Name1", by.y = "label",
               all = TRUE)

# Examine structure of mmnet
# str(mmnet)
madmen$edges %>% head()
madmen$vertices %>% head()
mmnet %>% head()

# Finish the ggplot command
ggplot(data = mmnet, aes(from_id = Name1, to_id = Name2)) +
  geom_net(aes(col = Gender),
  size = 6, linewidth = 1, 
  labelon = TRUE, 
  # turns on the node labels
  fontsize = 3, 
  labelcolour = "black",
  directed = TRUE) +
  # directed = TRUE draws the edges as arrows
  scale_color_manual(values = c("#FF69B4", "#0099ff")) +
  xlim(c(-0.05, 1.05)) +
  ggmap::theme_nothing(legend = TRUE) +
  # theme_nothing() is a handy way to strip everything else from the plot
  # legend = TRUE keeps the legend
  theme(legend.key = element_blank())
  # removes the background of the legend keys

3.4 shape of points

3.5 ggfortify

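ggfortify turns plots that would otherwise be drawn with base graphics into ggplot2 plots.

This is done with the autoplot() function, although I still do not understand what leverage is for. Even time series (ts) and multiple time series (mts) objects are supported.

I did not really follow the cmdscale function (R Documentation) used in Distance matrices and Multi-Dimensional Scaling (MDS) | R, since no mathematical formulas are given.

3.5.0.1 Visualizing clustering models

cluster::clara(), cluster::fanny() and cluster::pam() are clustering models, while stats::prcomp() is principal component analysis. ggfortify can visualize their results, which makes them easier to understand; stats::kmeans is used as the example here.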

library(stats)
# use kmeans
library(ggfortify)
# Perform clustering
iris_k <- kmeans(iris[-5], centers=3)

# Autoplot: color according to cluster
autoplot(iris_k, data = iris, frame = T)
# frame = T
# draw a polygon around each cluster.

# Autoplot: above, plus shape according to species
autoplot(iris_k, data = iris, frame = T,shape ="Species")
# Each frame clearly contains more than one species, so the clustering is not great.

ggfortify is tedious to install.

3.6 map

A choropleth map (from Greek χώρο (“area/region”) + πλήθος (“multitude”)) is a thematic map.

# maps, ggplot2, and ggmap are pre-loaded
# Use map_data() to create usa and inspect
library(ggmap)
usa <- map_data("usa")
str(usa)

# Build the map
ggplot(usa, aes(x = long, y = lat, group = group)) +
  geom_polygon() +
  # geom_polygon() is the key to drawing the map
  geom_point(aes(col = cut_number(lat,3))) +
  # points are colored by latitude, split into three groups
  coord_map() +
  theme_nothing()
  # ggmap::theme_nothing
library(tidyverse)
library(ggmap)
get_map(location = "Shanghai") %>% ggmap()

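Because this calls Google Maps, a VPN may be needed, and it is somewhat slow (it queries the Google Maps database, and network restrictions may leave the download incomplete).

Apart from that, it works very well.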

3.7 gganimate

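The gganimate package is well suited to showing how a plot changes.

library(ggthemes)
# Note: gg_animate() belongs to the gganimate API from before version 1.0,
# which also expects a frame aesthetic (for example aes(frame = cyl)) to animate over.
p <- mtcars %>% 
  ggplot(aes(x = mpg, y = disp, col = cyl)) +
    geom_point() +
    # labs() +
    theme_tufte()
gg_animate(p, filename = "mtcars.gif", interval = 1.0)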

References

Scavetta, Rick. 2017a. “Data Visualization with Ggplot2 (Part 1).” 2017. https://www.datacamp.com/courses/data-visualization-with-ggplot2-1.

———. 2017b. “Data Visualization with Ggplot2 (Part 2).” 2017. https://www.datacamp.com/courses/data-visualization-with-ggplot2-2.


  1. blending: mixing, similar to blending in model ensembling