6 min read

data.table学习笔记

本文于2020-10-10更新。 如发现问题或者有建议,欢迎提交 Issue

Matt Dowle | DataCamp 这个人之牛逼,可以和Hadley Wickhams肩并肩。他是data.table包的作者,语法非常类sql,非常支持sql使用者。 data.table包是十大最流行的的包,超过ggplot2

1 intro

我感觉DT分为DT[where, select, group by]三种结构,是非常类似sql的语言。 而且data.table建表自带recycling的功能,非常方便,本身也是一种data.frame。

1.1 recycling

You can also give columns with different lengths when creating a data.table, and R will “recycle” the shorter column to match the length of the longer one by re-using the first items. In the example below, column x is recycled to match the length of column y:

data.table(x = c("A", "B"), y = 1:4)
##    x y
## 1: A 1
## 2: B 2
## 3: A 3
## 4: B 4
my_first_data_table <- 
  data.table(x = c("a","b","c","d","e"),
             y = c(1,2,3,4,5))
DT <- data.table(a = c(1L,2L),b = LETTERS[1:4])
DT[3,]
##    a b
## 1: 1 C
DT[2:3,]
##    a b
## 1: 2 B
## 2: 1 C

1.2 where

# DT and the data.table package are pre-loaded

# Print the second to last row of DT using .N
DT[2:.N]
##    a b
## 1: 2 B
## 2: 1 C
## 3: 2 D
# .N 相当于最后一行的意思

# Print the column names of DT
names(DT)
## [1] "a" "b"
# Print the number or rows and columns of DT
dim(DT)
## [1] 4 2
# Print a new data.table containing rows 2, 2, and 3 of DT
DT[c(2,2,3)]
##    a b
## 1: 2 B
## 2: 2 B
## 3: 1 C

1.3 select

typeof(DT[,.(b)])
## [1] "list"
D <- 5
DT[, .(D)]
##    D
## 1: 5
# DT[, D]
DT[,b] # vector
## [1] "A" "B" "C" "D"
DT[,.("b")] # data.table by only one chr.
##    V1
## 1:  b
DT[,.(b)] # data.table
##    b
## 1: A
## 2: B
## 3: C
## 4: D
DT[,"b"] # data.table "b" = .(b)
##    b
## 1: A
## 2: B
## 3: C
## 4: D

.()相当于select函数,其中的数字可以相当于真的就是select 5。 然而,, D]相当于select No.5 col

# DT and the data.table package are pre-loaded
DT <- data.table(
  A = 1:5,
  B = letters[1:5],
  C = 6:10
)

# Subset rows 1 and 3, and columns B and C
DT[c(1,3), .(B,C)]
##    B C
## 1: a 6
## 2: c 8
# Assign to ans the correct value
ans <- DT[,.(B, val = A*C)]
  
# Fill in the blanks such that ans2 equals target
target <- data.table(B = c("a", "b", "c", "d", "e", 
                           "a", "b", "c", "d", "e"), 
                     val = as.integer(c(6:10, 1:5)))
ans2 <- DT[, .(B, val = c(C,A))]

我觉得 val = c(C,A)这步操作太强了。

1.4 group by

# iris is already available in your workspace

# Convert iris to a data.table: DT
DT <- as.data.table(iris)

# For each Species, print the mean Sepal.Length
DT[, mean(Sepal.Length), by = Species]
##       Species    V1
## 1:     setosa 5.006
## 2: versicolor 5.936
## 3:  virginica 6.588
# Print mean Sepal.Length, grouping by first letter of Species
DT[, mean(Sepal.Length), by = substr(Species, 1, 1)]
##    substr    V1
## 1:      s 5.006
## 2:      v 6.262
# data.table version of iris: DT
DT <- as.data.table(iris)

# Group the specimens by Sepal area (to the nearest 10 cm2) and count how many occur in each group
DT[, .N, by = 10 * round(Sepal.Length * Sepal.Width / 10)]
##    round   N
## 1:    20 117
## 2:    10  29
## 3:    30   4
# Now name the output columns `Area` and `Count`
DT[, .N, by = 10 * round(Sepal.Length * Sepal.Width / 10)][,.(Area = round,Count  = N)]
##    Area Count
## 1:   20   117
## 2:   10    29
## 3:   30     4
# Create the data.table DT
DT <- data.table(A = rep(letters[2:1], each = 4L), 
                 B = rep(1:4, each = 2L), 
                 C = sample(8))

# Create the new data.table, DT2
DT2 <- DT[,.(C = cumsum(C)),by = .(A,B)]
DT2
##    A B  C
## 1: b 1  6
## 2: b 1 14
## 3: b 2  3
## 4: b 2  5
## 5: a 3  7
## 6: a 3 12
## 7: a 4  1
## 8: a 4  5
# Select from DT2 the last two values from C while you group by A
DT2[,.(C = tail(C,2)),by =.(A)]
##    A C
## 1: b 3
## 2: b 5
## 3: a 1
## 4: a 5

2 intermediate

2.1 chaining

这里好的地方是,不需要像dplyr一样进行ungroup有点麻烦。

# The data.table package has already been loaded

# Build DT
DT <- data.table(A = rep(letters[2:1], each = 4L), 
                 B = rep(1:4, each = 2L), 
                 C = sample(8)) 

# Combine the two steps in a one-liner
DT[, .(C = cumsum(C)), by = .(A, B)][, .(C = tail(C, 2)), by = A]
##    A  C
## 1: b  3
## 2: b  5
## 3: a  8
## 4: a 14
# The data.table DT is loaded in your workspace

# Perform chained operations on DT
 as.data.table(iris)[, .(Sepal.Length = median(Sepal.Length), 
                         Sepal.Width = median(Sepal.Width), 
                         Petal.Length = median(Petal.Length),
                         Petal.Width = median(Petal.Width)), 
                    by = Species][order(-Species)]
##       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:  virginica          6.5         3.0         5.55         2.0
## 2: versicolor          5.9         2.8         4.35         1.3
## 3:     setosa          5.0         3.4         1.50         0.2

order(-Species)倒序。 这里很多函数一定要定义是data.table

2.2 .SD

这里类似于mutate_all

DT <- data.table(
  x = c(2,1,2,1,2,2,1),
  y = seq(1,13,2),
  z = seq(2,14,2)
)
DT[,.(lapply(.SD,mean)),by = x]
##    x       V1
## 1: 2      6.5
## 2: 2      7.5
## 3: 1 7.666667
## 4: 1 8.666667
# A new data.table DT is available

# Mean of columns
DT[,lapply(.SD,mean),by = x]
##    x        y        z
## 1: 2 6.500000 7.500000
## 2: 1 7.666667 8.666667
# Median of columns
DT[,lapply(.SD,median),by = x]
##    x y z
## 1: 2 7 8
## 2: 1 7 8

2.3 .SDcols

.SDcols是一个参数,是可以限定对哪些col执行函数。

> DT[, .SD[-1], by = grp, .SDcols = paste0("Q", 1:3)]
   grp Q1 Q2 Q3
1:   6  4  1  4
2:   8  1  3  1
3:   8  5  2  5

这个的SD[-1]相当于踢掉了每组的第一行。 paste0("Q", 1:3)相当于选用了以"Q"为首字母的前三列。

这点有点难理解,很正常,但是data.table包本身就是base在R的基础函数上开发的,非常快。

> DT[,.(lapply(.SD,sum),.N),by = x]
   x V1 N
1: 2 26 4
2: 2 30 4
3: 1 23 3
4: 1 26 3
> DT[, lapply(.SD, cumsum), by = .(by1 = x, by2 = z > 8), .SDcols = c("x", "y")]
   by1   by2 x  y
1:   2 FALSE 2  1
2:   2 FALSE 4  6
3:   1 FALSE 1  3
4:   1 FALSE 2 10
5:   2  TRUE 2  9
6:   2  TRUE 4 20
7:   1  TRUE 1 13
> 

2.4 :=

这是计算机一个符号,类似于x := x + 10,x在循环中每次增加10.

# The data.table DT
DT <- data.table(A = letters[c(1, 1, 1, 2, 2)], B = 1:5)

# Add column by reference: Total
DT[, Total := sum(B), by = A]

# Add 1 to column B
DT[c(2, 4), B := B + 1L]

# Add a new column Total2
DT[2:4, Total2 := sum(B), by = A]

# Remove the Total column
DT[, Total := NULL]

# Select the third column using `[[`
DT[[3]]
## [1] NA  6  6  5 NA

我感觉这个包到这地方开始比tidyverse复杂了,所以作为backup了。