{r setup, include=FALSE} knitr::opts_chunk$set(eval = FALSE) 本文于r format(Sys.Date(), "%Y-%m-%d")更新。 如发现问题或者有建议,欢迎提交 Issue
Matt Dowle | DataCamp 这个人之牛逼,可以和Hadley Wickhams肩并肩。他是data.table包的作者,语法非常类sql,非常支持sql使用者。 data.table包是十大最流行的的包,超过ggplot2。
{r message=FALSE, warning=FALSE, include=FALSE} knitr::opts_chunk$set(message = FALSE, warning = FALSE, error = TRUE) library(data.table)
intro
我感觉DT分为DT[where, select, group by]三种结构,是非常类似sql的语言。 而且data.table建表自带recycling的功能,非常方便,本身也是一种data.frame。
recycling
{r} data.table(x = c("A", "B"), y = 1:4)
{r} my_first_data_table <- data.table(x = c("a","b","c","d","e"), y = c(1,2,3,4,5))
{r} DT <- data.table(a = c(1L,2L),b = LETTERS[1:4]) DT[3,] DT[2:3,]
where
```{r} # DT and the data.table package are pre-loaded
Print the second to last row of DT using .N
DT[2:.N] # .N 相当于最后一行的意思
Print the column names of DT
names(DT)
Print the number or rows and columns of DT
dim(DT)
Print a new data.table containing rows 2, 2, and 3 of DT
DT[c(2,2,3)]
## select
```
typeof(DT[,.(b)])
-
datatable-faq.pdf 这个地方反馈了一些
data.table在select功能上的一些bug。
{r} D <- 5 DT[, .(D)] # DT[, D] DT[,b] # vector DT[,.("b")] # data.table by only one chr. DT[,.(b)] # data.table DT[,"b"] # data.table "b" = .(b)
.()相当于select函数,其中的数字可以相当于真的就是select 5。 然而,, D]相当于select No.5 col。
```{r} # DT and the data.table package are pre-loaded DT <- data.table( A = 1:5, B = letters[1:5], C = 6:10 )
Subset rows 1 and 3, and columns B and C
DT[c(1,3), .(B,C)]
Assign to ans the correct value
ans <- DT[,.(B, val = A*C)]
Fill in the blanks such that ans2 equals target
target <- data.table(B = c(“a”, “b”, “c”, “d”, “e”, “a”, “b”, “c”, “d”, “e”), val = as.integer(c(6:10, 1:5))) ans2 <- DT[, .(B, val = c(C,A))]
我觉得
`val = c(C,A)`这步操作太强了。
## group by
```
# iris is already available in your workspace
# Convert iris to a data.table: DT
DT <- as.data.table(iris)
# For each Species, print the mean Sepal.Length
DT[, mean(Sepal.Length), by = Species]
# Print mean Sepal.Length, grouping by first letter of Species
DT[, mean(Sepal.Length), by = substr(Species, 1, 1)]
```{r} # data.table version of iris: DT DT <- as.data.table(iris)
Group the specimens by Sepal area (to the nearest 10 cm2) and count how many occur in each group
DT[, .N, by = 10 * round(Sepal.Length * Sepal.Width / 10)]
Now name the output columns Area and Count
DT[, .N, by = 10 * round(Sepal.Length * Sepal.Width / 10)][,.(Area = round,Count = N)]
```
# Create the data.table DT
DT <- data.table(A = rep(letters[2:1], each = 4L),
B = rep(1:4, each = 2L),
C = sample(8))
# Create the new data.table, DT2
DT2 <- DT[,.(C = cumsum(C)),by = .(A,B)]
DT2
# Select from DT2 the last two values from C while you group by A
DT2[,.(C = tail(C,2)),by =.(A)]
intermediate
chaining
这里好的地方是,不需要像dplyr一样进行ungroup有点麻烦。
```{r} # The data.table package has already been loaded
Build DT
DT <- data.table(A = rep(letters[2:1], each = 4L), B = rep(1:4, each = 2L), C = sample(8))
Combine the two steps in a one-liner
DT[, .(C = cumsum(C)), by = .(A, B)][, .(C = tail(C, 2)), by = A]
```
# The data.table DT is loaded in your workspace
# Perform chained operations on DT
as.data.table(iris)[, .(Sepal.Length = median(Sepal.Length),
Sepal.Width = median(Sepal.Width),
Petal.Length = median(Petal.Length),
Petal.Width = median(Petal.Width)),
by = Species][order(-Species)]
order(-Species)倒序。 这里很多函数一定要定义是data.table。
.SD
这里类似于mutate_all
{r} DT <- data.table( x = c(2,1,2,1,2,2,1), y = seq(1,13,2), z = seq(2,14,2) ) DT[,.(lapply(.SD,mean)),by = x]
```{r} # A new data.table DT is available
Mean of columns
DT[,lapply(.SD,mean),by = x]
Median of columns
DT[,lapply(.SD,median),by = x]
## .SDcols
`.SDcols`是一个参数,是可以限定对哪些col执行函数。
DT[, .SD[-1], by = grp, .SDcols = paste0(“Q”, 1:3)] grp Q1 Q2 Q3 1: 6 4 1 4 2: 8 1 3 1 3: 8 5 2 5
这个的`SD[-1]`相当于踢掉了每组的第一行。
`paste0("Q", 1:3)`相当于选用了以`"Q"`为首字母的前三列。
这点有点难理解,很正常,但是`data.table`包本身就是base在R的基础函数上开发的,非常快。
DT[,.(lapply(.SD,sum),.N),by = x] x V1 N 1: 2 26 4 2: 2 30 4 3: 1 23 3 4: 1 26 3 DT[, lapply(.SD, cumsum), by = .(by1 = x, by2 = z > 8), .SDcols = c(“x”, “y”)] by1 by2 x y 1: 2 FALSE 2 1 2: 2 FALSE 4 6 3: 1 FALSE 1 3 4: 1 FALSE 2 10 5: 2 TRUE 2 9 6: 2 TRUE 4 20 7: 1 TRUE 1 13
## :=
这是计算机一个符号,类似于`x := x + 10`,x在循环中每次增加10.
```
# The data.table DT
DT <- data.table(A = letters[c(1, 1, 1, 2, 2)], B = 1:5)
# Add column by reference: Total
DT[, Total := sum(B), by = A]
# Add 1 to column B
DT[c(2, 4), B := B + 1L]
# Add a new column Total2
DT[2:4, Total2 := sum(B), by = A]
# Remove the Total column
DT[, Total := NULL]
# Select the third column using `[[`
DT[[3]]
我感觉这个包到这地方开始比tidyverse复杂了,所以作为backup了。