本文于2020-10-10更新。 如发现问题或者有建议,欢迎提交 Issue
Matt Dowle | DataCamp
这个人之牛逼,可以和Hadley Wickhams肩并肩。他是data.table
包的作者,语法非常类sql,非常支持sql使用者。
data.table
包是十大最流行的的包,超过ggplot2
。
1 intro
我感觉DT分为DT[where, select, group by]
三种结构,是非常类似sql的语言。
而且data.table
建表自带recycling的功能,非常方便,本身也是一种data.frame。
1.1 recycling
data.table(x = c("A", "B"), y = 1:4)
## x y
## 1: A 1
## 2: B 2
## 3: A 3
## 4: B 4
my_first_data_table <-
data.table(x = c("a","b","c","d","e"),
y = c(1,2,3,4,5))
DT <- data.table(a = c(1L,2L),b = LETTERS[1:4])
DT[3,]
## a b
## 1: 1 C
DT[2:3,]
## a b
## 1: 2 B
## 2: 1 C
1.2 where
# DT and the data.table package are pre-loaded
# Print the second to last row of DT using .N
DT[2:.N]
## a b
## 1: 2 B
## 2: 1 C
## 3: 2 D
# .N 相当于最后一行的意思
# Print the column names of DT
names(DT)
## [1] "a" "b"
# Print the number or rows and columns of DT
dim(DT)
## [1] 4 2
# Print a new data.table containing rows 2, 2, and 3 of DT
DT[c(2,2,3)]
## a b
## 1: 2 B
## 2: 2 B
## 3: 1 C
1.3 select
typeof(DT[,.(b)])
## [1] "list"
datatable-faq.pdf 这个地方反馈了一些
data.table
在select功能上的一些bug。
D <- 5
DT[, .(D)]
## D
## 1: 5
# DT[, D]
DT[,b] # vector
## [1] "A" "B" "C" "D"
DT[,.("b")] # data.table by only one chr.
## V1
## 1: b
DT[,.(b)] # data.table
## b
## 1: A
## 2: B
## 3: C
## 4: D
DT[,"b"] # data.table "b" = .(b)
## b
## 1: A
## 2: B
## 3: C
## 4: D
.()
相当于select
函数,其中的数字可以相当于真的就是select 5
。
然而,, D]
相当于select No.5 col
。
# DT and the data.table package are pre-loaded
DT <- data.table(
A = 1:5,
B = letters[1:5],
C = 6:10
)
# Subset rows 1 and 3, and columns B and C
DT[c(1,3), .(B,C)]
## B C
## 1: a 6
## 2: c 8
# Assign to ans the correct value
ans <- DT[,.(B, val = A*C)]
# Fill in the blanks such that ans2 equals target
target <- data.table(B = c("a", "b", "c", "d", "e",
"a", "b", "c", "d", "e"),
val = as.integer(c(6:10, 1:5)))
ans2 <- DT[, .(B, val = c(C,A))]
我觉得
val = c(C,A)
这步操作太强了。
1.4 group by
# iris is already available in your workspace
# Convert iris to a data.table: DT
DT <- as.data.table(iris)
# For each Species, print the mean Sepal.Length
DT[, mean(Sepal.Length), by = Species]
## Species V1
## 1: setosa 5.006
## 2: versicolor 5.936
## 3: virginica 6.588
# Print mean Sepal.Length, grouping by first letter of Species
DT[, mean(Sepal.Length), by = substr(Species, 1, 1)]
## substr V1
## 1: s 5.006
## 2: v 6.262
# data.table version of iris: DT
DT <- as.data.table(iris)
# Group the specimens by Sepal area (to the nearest 10 cm2) and count how many occur in each group
DT[, .N, by = 10 * round(Sepal.Length * Sepal.Width / 10)]
## round N
## 1: 20 117
## 2: 10 29
## 3: 30 4
# Now name the output columns `Area` and `Count`
DT[, .N, by = 10 * round(Sepal.Length * Sepal.Width / 10)][,.(Area = round,Count = N)]
## Area Count
## 1: 20 117
## 2: 10 29
## 3: 30 4
# Create the data.table DT
DT <- data.table(A = rep(letters[2:1], each = 4L),
B = rep(1:4, each = 2L),
C = sample(8))
# Create the new data.table, DT2
DT2 <- DT[,.(C = cumsum(C)),by = .(A,B)]
DT2
## A B C
## 1: b 1 6
## 2: b 1 14
## 3: b 2 3
## 4: b 2 5
## 5: a 3 7
## 6: a 3 12
## 7: a 4 1
## 8: a 4 5
# Select from DT2 the last two values from C while you group by A
DT2[,.(C = tail(C,2)),by =.(A)]
## A C
## 1: b 3
## 2: b 5
## 3: a 1
## 4: a 5
2 intermediate
2.1 chaining
这里好的地方是,不需要像dplyr
一样进行ungroup
有点麻烦。
# The data.table package has already been loaded
# Build DT
DT <- data.table(A = rep(letters[2:1], each = 4L),
B = rep(1:4, each = 2L),
C = sample(8))
# Combine the two steps in a one-liner
DT[, .(C = cumsum(C)), by = .(A, B)][, .(C = tail(C, 2)), by = A]
## A C
## 1: b 3
## 2: b 5
## 3: a 8
## 4: a 14
# The data.table DT is loaded in your workspace
# Perform chained operations on DT
as.data.table(iris)[, .(Sepal.Length = median(Sepal.Length),
Sepal.Width = median(Sepal.Width),
Petal.Length = median(Petal.Length),
Petal.Width = median(Petal.Width)),
by = Species][order(-Species)]
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1: virginica 6.5 3.0 5.55 2.0
## 2: versicolor 5.9 2.8 4.35 1.3
## 3: setosa 5.0 3.4 1.50 0.2
order(-Species)
倒序。
这里很多函数一定要定义是data.table
。
2.2 .SD
这里类似于mutate_all
DT <- data.table(
x = c(2,1,2,1,2,2,1),
y = seq(1,13,2),
z = seq(2,14,2)
)
DT[,.(lapply(.SD,mean)),by = x]
## x V1
## 1: 2 6.5
## 2: 2 7.5
## 3: 1 7.666667
## 4: 1 8.666667
# A new data.table DT is available
# Mean of columns
DT[,lapply(.SD,mean),by = x]
## x y z
## 1: 2 6.500000 7.500000
## 2: 1 7.666667 8.666667
# Median of columns
DT[,lapply(.SD,median),by = x]
## x y z
## 1: 2 7 8
## 2: 1 7 8
2.3 .SDcols
.SDcols
是一个参数,是可以限定对哪些col执行函数。
> DT[, .SD[-1], by = grp, .SDcols = paste0("Q", 1:3)]
grp Q1 Q2 Q3
1: 6 4 1 4
2: 8 1 3 1
3: 8 5 2 5
这个的SD[-1]
相当于踢掉了每组的第一行。
paste0("Q", 1:3)
相当于选用了以"Q"
为首字母的前三列。
这点有点难理解,很正常,但是data.table
包本身就是base在R的基础函数上开发的,非常快。
> DT[,.(lapply(.SD,sum),.N),by = x]
x V1 N
1: 2 26 4
2: 2 30 4
3: 1 23 3
4: 1 26 3
> DT[, lapply(.SD, cumsum), by = .(by1 = x, by2 = z > 8), .SDcols = c("x", "y")]
by1 by2 x y
1: 2 FALSE 2 1
2: 2 FALSE 4 6
3: 1 FALSE 1 3
4: 1 FALSE 2 10
5: 2 TRUE 2 9
6: 2 TRUE 4 20
7: 1 TRUE 1 13
>
2.4 :=
这是计算机一个符号,类似于x := x + 10
,x在循环中每次增加10.
# The data.table DT
DT <- data.table(A = letters[c(1, 1, 1, 2, 2)], B = 1:5)
# Add column by reference: Total
DT[, Total := sum(B), by = A]
# Add 1 to column B
DT[c(2, 4), B := B + 1L]
# Add a new column Total2
DT[2:4, Total2 := sum(B), by = A]
# Remove the Total column
DT[, Total := NULL]
# Select the third column using `[[`
DT[[3]]
## [1] NA 6 6 5 NA
我感觉这个包到这地方开始比tidyverse
复杂了,所以作为backup了。