1 New

duration in filter
Contents
duration
Organized tips for using lubridate

2 Organized

## # A tibble: 1 x 3
##   Package   Version Built
##   <chr>     <chr>   <chr>
## 1 lubridate 1.7.9   3.6.3

library(lubridate)

2.1 The as date-times family

These overlap heavily with base R (R Core Team 2017), but lubridate also adds some nicer features.

2.1.1 `as_date`

as_date(0)

## [1] "1970-01-01"

as_date(1)

## [1] "1970-01-02"

One unit is one day.

sample_date <- "2017-01-01"
as_date(sample_date)

## [1] "2017-01-01"

as.Date(sample_date)

## [1] "2017-01-01"

as_date can be used in place of as.Date.

2.1.2 `as_datetime`

as_datetime(0)

## [1] "1970-01-01 UTC"

as_datetime(1)

## [1] "1970-01-01 00:00:01 UTC"

as_datetime(10)

## [1] "1970-01-01 00:00:10 UTC"

One unit is one second.
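The two unit conventions can be checked by converting a known date back to a number. This is a minimal sketch; the variable names are only illustrative.

```r
library(lubridate)

# Dates count days since 1970-01-01; date-times count seconds since the same origin
days_since_epoch <- as.numeric(ymd("2017-01-01"))
secs_since_epoch <- as.numeric(ymd_hms("2017-01-01 00:00:00"))

days_since_epoch             # 17167
secs_since_epoch / 86400     # also 17167, since a day has 86400 seconds
as_date(days_since_epoch)    # back to "2017-01-01"
```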

2.1.3 `as_hms`

hms::as_hms(0)

## 00:00:00

hms::as_hms(1)

## 00:00:01

hms::as.hms() is deprecated in favor of hms::as_hms(). This is a handy feature; base R (R Core Team 2017) does not seem to have an equivalent.

2.2 round

d <- "2018-05-03 11:19:10 CST"
d <- as_datetime(d)
d

## [1] "2018-05-03 11:19:10 UTC"

floor_date(d,unit = "hour")

## [1] "2018-05-03 11:00:00 UTC"

ceiling_date(d,unit = "hour")

## [1] "2018-05-03 12:00:00 UTC"

round_date(d,unit = "hour")

## [1] "2018-05-03 11:00:00 UTC"

rollback(d)

## [1] "2018-04-30 11:19:10 UTC"

rollback(d,roll_to_first=T)

## [1] "2018-05-01 11:19:10 UTC"

floor_date() $\to$ floor $\to$ round down
ceiling_date() $\to$ ceiling $\to$ round up
round_date() $\to$ round $\to$ round to nearest
rollback makes it easy to get month-start and month-end dates. It is most useful for month-end dates, because the last day differs from month to month.
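The month-end case can be sketched as follows, using made-up dates that straddle a leap-year February. The ceiling_date() route is an alternative shown for comparison, not something the notes above use.

```r
library(lubridate)

x <- ymd(c("2018-03-15", "2020-03-15"))

# Last day of the previous month: February has 28 days in 2018 and 29 in 2020
prev_month_end <- rollback(x)

# Last day of the current month: jump to the first of next month, then step back one day
this_month_end <- ceiling_date(x, unit = "month") - days(1)

prev_month_end   # "2018-02-28" "2020-02-29"
this_month_end   # "2018-03-31" "2020-03-31"
```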

2.3 parse date-times

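In base R (R Core Team 2017),

deparse turns an expression into a string
parse turns a string into an expression

So here, too, parsing means converting a string into a date-time value. Make sure the input is a string.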

ymd_hms("2018-05-03 11:19:10 CST")

## [1] "2018-05-03 11:19:10 UTC"

yq("2018:Q1")

## [1] "2018-01-01"

date_decimal(2017.5, tz = "Asia/Shanghai")

## [1] "2017-07-02 12:00:00 CST"

date_decimal takes a numeric input; 2017.5 is equivalent to $2017 + \frac{5}{10} \cdot 1\ \text{year}$, i.e. halfway through 2017.
The time zone is set with tz = "Asia/Shanghai".
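A small sketch of how the parsers read strings: the function name states the field order, and the separators may vary. Note that the ymd_hms() output above shows UTC even though the string ends in CST; the tz argument is what sets the zone. The variable names below are illustrative.

```r
library(lubridate)

# Same calendar day written in three different field orders
d_ymd <- ymd("2018-05-03")
d_mdy <- mdy("05/03/2018")
d_dmy <- dmy("03.05.2018")
d_ymd == d_mdy   # TRUE
d_mdy == d_dmy   # TRUE

# The time zone comes from the tz argument (default "UTC")
t_sh <- ymd_hms("2018-05-03 11:19:10", tz = "Asia/Shanghai")
tz(t_sh)         # "Asia/Shanghai"
```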

2.4 time zones

Available time zones can be listed with base R's OlsonNames() function (R Core Team 2017).

library(dplyr)
library(stringr)

OlsonNames() %>%
  as_tibble() %>%
  filter(value %in% str_subset(value, "Shanghai|Chicago"))

## # A tibble: 2 x 1
##   value          
##   <chr>          
## 1 America/Chicago
## 2 Asia/Shanghai

d2 <- ymd_hms(d)
d2

## [1] "2018-05-03 11:19:10 UTC"

with_tz(d2,tzone = "US/Pacific")

## [1] "2018-05-03 04:19:10 PDT"

force_tz(d2,tzone = "US/Pacific")

## [1] "2018-05-03 11:19:10 PDT"

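The input to with_tz must be a date-time value, not a string.
with_tz converts the clock reading to another time zone; the instant stays the same.
force_tz keeps the clock reading and changes the time zone, so the instant changes.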
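The difference can be verified by comparing instants. This sketch uses "America/Los_Angeles", which names the same zone as the "US/Pacific" alias used above; the variable names are illustrative.

```r
library(lubridate)

utc_time <- ymd_hms("2018-05-03 11:19:10")   # parsed as UTC by default

# Same instant, different clock reading
shown <- with_tz(utc_time, tzone = "America/Los_Angeles")
# Same clock reading, different instant
forced <- force_tz(utc_time, tzone = "America/Los_Angeles")

shown == utc_time                             # TRUE
difftime(forced, utc_time, units = "hours")   # 7 hours, since PDT is UTC-7
```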

2.5 Math with Date-times

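Back from lunch, so let's dig in.

The main thing here is to tell apart two families of functions, periods and durations:

periods: minutes() and the other *s functions
durations: dminutes() and the other d*s functions

The difference is that periods follow clock and calendar time, so their actual length shifts across daylight saving changes and leap years, while durations are a fixed number of seconds and are unaffected by either. When exact elapsed time matters, durations are the safer choice.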
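A sketch of the distinction across a daylight saving switch. The date is chosen because US daylight saving time began on 2018-03-11; the variable names are illustrative.

```r
library(lubridate)

# The day after this one has only 23 hours in Chicago
before_dst <- ymd_hms("2018-03-10 12:00:00", tz = "America/Chicago")

plus_period   <- before_dst + days(1)    # same clock time on the next day
plus_duration <- before_dst + ddays(1)   # exactly 86400 seconds later

plus_period     # "2018-03-11 12:00:00 CDT"
plus_duration   # "2018-03-11 13:00:00 CDT"
```

Which result is "right" depends on whether the question is about the calendar or about elapsed seconds.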

ymd_hms("2018-05-03 12:52:10 CST") + dminutes(30)

## [1] "2018-05-03 13:22:10 UTC"

interval() builds a dedicated Interval object, though I find I rarely use it.

interval(
  ymd_hms("2018-05-03 12:52:10 CST") + dminutes(30),
  ymd_hms("2018-05-03 12:52:10 CST")
) %>%
  as_tibble()

## # A tibble: 1 x 1
##   value                                           
##   <Interval>                                      
## 1 2018-05-03 13:22:10 UTC--2018-05-03 12:52:10 UTC

2.5.1 duration

A duration is different from an interval.

library(lubridate)
interval(ymd("2019-03-01"),ymd("2019-02-01")) %>% as.duration()

## [1] "-2419200s (~-4 weeks)"

(ymd("2019-03-01")-ymd("2019-02-01"))

## Time difference of 28 days

(ymd("2019-03-01")-ymd("2019-02-01")) %>% as.integer()

## [1] 28

(ymd_hms("2019-03-01 00:00:00")-ymd_hms("2019-03-01 01:00:00")) %>% as.integer() %>% dhours()

## [1] "-3600s (~-1 hours)"

The d*s functions only need a number passed in, with * matching the unit that number is measured in.
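One caveat: subtracting two date-times returns a difftime whose unit is chosen automatically, so the bare number from as.integer() may be in seconds, minutes, hours or days. A hedged sketch that states the unit explicitly (variable names are illustrative):

```r
library(lubridate)

start <- ymd_hms("2019-03-01 01:00:00")
end   <- ymd_hms("2019-03-01 00:00:00")

# Ask for hours explicitly instead of relying on the unit difftime picks
hours_apart <- as.numeric(difftime(end, start, units = "hours"))
dhours(hours_apart)        # "-3600s (~-1 hours)"

# as.duration() on the difference avoids handling units by hand
as.duration(end - start)
```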

3 Unorganized

library(data.table)

# Generate 100 dates starting from 2018-01-01
set.seed(42)
n <- 100
dt <- 
  data.table(date = seq(ymd("2018-01-01"),length.out = n, by = "day"), 
             x = runif(n)
             )
dt %>% head()

##          date         x
## 1: 2018-01-01 0.9148060
## 2: 2018-01-02 0.9370754
## 3: 2018-01-03 0.2861395
## 4: 2018-01-04 0.8304476
## 5: 2018-01-05 0.6417455
## 6: 2018-01-06 0.5190959

by = "day" increments the sequence one day at a time.

3.1 Grouping by week (大猫的R语言课堂 2018)

dt %>% 
  mutate(week = week(date)) %>% 
  group_by(week) %>% 
  summarise(avg = mean(x))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 15 x 2
##     week   avg
##    <dbl> <dbl>
##  1     1 0.695
##  2     2 0.552
##  3     3 0.634
##  4     4 0.567
##  5     5 0.558
##  6     6 0.483
##  7     7 0.700
##  8     8 0.467
##  9     9 0.577
## 10    10 0.520
## 11    11 0.258
## 12    12 0.377
## 13    13 0.389
## 14    14 0.521
## 15    15 0.682

week is a lubridate function.
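week() counts blocks of seven days starting from January 1, which is not the same as ISO 8601 week numbering. A short sketch with made-up dates around a year boundary:

```r
library(lubridate)

x <- ymd(c("2017-01-01", "2017-01-07", "2017-01-08"))   # Sunday, Saturday, Sunday

weeks_simple <- week(x)      # 1 1 2: seven-day blocks counted from January 1
weeks_iso    <- isoweek(x)   # 52 1 1: ISO weeks start on Monday

weeks_simple
weeks_iso
```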

3.2 Grouping by day of the week (大猫的R语言课堂 2018)

dt %>% 
  mutate(weekday = wday(date)) %>% 
  group_by(weekday) %>% 
  summarise(avg = mean(x))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 7 x 2
##   weekday   avg
##     <dbl> <dbl>
## 1       1 0.557
## 2       2 0.451
## 3       3 0.583
## 4       4 0.463
## 5       5 0.530
## 6       6 0.550
## 7       7 0.539

wday is a lubridate function that returns the day of the week.
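By default wday() numbers the days from Sunday, so 1 in the table above means Sunday. A sketch of the relevant arguments, assuming the lubridate.week.start option has not been changed:

```r
library(lubridate)

monday <- ymd("2018-01-01")   # a Monday

default_num <- wday(monday)                   # 2: the week starts on Sunday by default
monday_first <- wday(monday, week_start = 1)  # 1: count from Monday instead

# label = TRUE returns an ordered factor of day names (locale dependent)
wday(monday, label = TRUE)
```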

3.3 Grouping into three-day bins (大猫的R语言课堂 2018)

dt %>% 
  mutate(three.day = ceiling_date(date,unit = "3 days")) %>% 
  head()

##          date         x  three.day
## 1: 2018-01-01 0.9148060 2018-01-04
## 2: 2018-01-02 0.9370754 2018-01-04
## 3: 2018-01-03 0.2861395 2018-01-04
## 4: 2018-01-04 0.8304476 2018-01-07
## 5: 2018-01-05 0.6417455 2018-01-07
## 6: 2018-01-06 0.5190959 2018-01-07

ceiling_date is a lubridate function; unit = "3 days" sets a three-day interval.

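3.4 Converting to `"%Y-%m"` (大猫的R语言课堂 2018)

format(transactiondate, "%Y-%m") works, but the result is text.

DataCamp publishes a cheatsheet for the xts package.

When grouping by month for a summary, either the xts package or lubridate can handle the date arithmetic. With lubridate, extract the year and month with the year and month functions, set a dummy day = 1, and combine them with make_date(year, month, day). The result then needs to be converted with as.POSIXct, because scale_x_datetime(date_breaks = "1 month") in ggplot requires x to be POSIXct.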
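The month-start construction described above can be sketched as follows. The dates are made up, transactiondate is borrowed from the format() call above, and as_datetime() is used here as one way to obtain a POSIXct value.

```r
library(lubridate)

transactiondate <- ymd(c("2018-01-15", "2018-01-31", "2018-02-03"))   # made-up dates

# Collapse every date to the first day of its month; the result is still a Date
month_start <- make_date(year(transactiondate), month(transactiondate), 1)

# floor_date() reaches the same result in one call
all(month_start == floor_date(transactiondate, unit = "month"))   # TRUE

# Convert to POSIXct for scale_x_datetime()
month_start_ct <- as_datetime(month_start)
```

If the column stays a Date, ggplot2's scale_x_date() is the matching scale and no conversion is needed.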

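3.5 Fixing a time zone bug (大猫的R语言课堂 2018)

I suspect the problem was my time zone setting: the input turned out to be in UTC, so it had to be changed to Asia/Taipei. with_tz(.,tzone = "Asia/Shanghai") shows how a given instant reads in the local time zone. mutate(start = ymd_hms(as.character(start), tz = "Asia/Shanghai")) is one way to re-stamp values into a different time zone. In short, Excel's time functions are a minefield.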

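3.6 Use `ymd_hms` sparingly (大猫的R语言课堂 2018)

The value ended up as a double. In mutate(start = ymd_hms(start)), ymd_hms often leaves a date-time column showing up as a double, which is a hassle: converting back needs the origin argument in as.POSIXct(as.numeric(time), origin='1970-01-01'), and the origin is not obvious unless you already know it, hence the trap.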
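For context, a POSIXct value is stored as a double counting seconds since 1970-01-01 00:00:00 UTC, which is why that date is the origin to use. A minimal round-trip sketch (variable names are illustrative):

```r
library(lubridate)

start <- ymd_hms("2018-05-03 11:19:10")

class(start)    # "POSIXct" "POSIXt"
typeof(start)   # "double"

# If the class is lost and only the number remains, the origin restores it
secs <- as.numeric(start)
restored <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
restored == start            # TRUE

# as_datetime() uses the same origin by default
as_datetime(secs) == start   # TRUE
```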

3.7 duration in filter

as.integer turns a duration into a plain number, so it can be used in a filter condition.

library(lubridate)
data_3 %>% 
    select(1:5) %>% 
    transmute(datetime = make_datetime(X1,X2,X3,X4,X5)) %>% 
    arrange(datetime) %>% 
    mutate(duration = interval(datetime,lag(datetime)) %>% as.duration(),
           duration_int = as.integer(duration)) %>% 
    filter(duration_int != -300)
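Since data_3 is not shown, here is a self-contained sketch of the same idea on a made-up tibble named readings, where values are expected every 5 minutes and one is missing. Unlike the code above, the interval runs from the previous row to the current one, so the gaps are positive; as.numeric() is used to get seconds.

```r
library(lubridate)
library(dplyr)

# Hypothetical readings every 5 minutes; the 12:15 reading is missing
readings <- tibble(
  datetime = ymd_hms("2018-05-03 12:00:00") + dminutes(c(0, 5, 10, 20, 25))
)

gaps <- readings %>%
  arrange(datetime) %>%
  mutate(
    # Seconds elapsed since the previous row (NA for the first row)
    gap_secs = as.numeric(as.duration(interval(lag(datetime), datetime)))
  ) %>%
  filter(gap_secs != 300)

gaps   # one row: the 12:20 reading, 600 seconds after its predecessor
```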

3.8 The `%within%` operator

ref_tbl <-
tibble(
    placement = c("NewYorkTimes_iPhone","NewYorkTimes_iPhone"),
    start = c("2018-06-01","2018-06-26"),
    end = c("2018-06-25","2018-06-30"),
    rate = c(5,7)
) %>%
    mutate_at(vars(start, end),as.Date)
des_tbl <-
tibble(
    placement = "NewYorkTimes_iPhone",
    date = "2018-06-15",
    rate = 5
) %>%
    mutate(date = as.Date(date))
ref_tbl

## # A tibble: 2 x 4
##   placement           start      end         rate
##   <chr>               <date>     <date>     <dbl>
## 1 NewYorkTimes_iPhone 2018-06-01 2018-06-25     5
## 2 NewYorkTimes_iPhone 2018-06-26 2018-06-30     7

des_tbl

## # A tibble: 1 x 3
##   placement           date        rate
##   <chr>               <date>     <dbl>
## 1 NewYorkTimes_iPhone 2018-06-15     5

ref_tbl %>%
    left_join(des_tbl, by = c("placement","rate")) %>%
    mutate(
        ifelse(date %within% interval(start,end),1,0)
    )

## # A tibble: 2 x 6
##   placement    start      end         rate date       `ifelse(date %within% int~
##   <chr>        <date>     <date>     <dbl> <date>                          <dbl>
## 1 NewYorkTime~ 2018-06-01 2018-06-25     5 2018-06-15                          1
## 2 NewYorkTime~ 2018-06-26 2018-06-30     7 NA                                 NA

%within% and interval are lubridate functions for working with time spans. I posted an answer related to this function on Stack Overflow.
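Stripped of the join, the check itself looks like this. The sketch reuses the first date range from ref_tbl; the variable names are illustrative.

```r
library(lubridate)

campaign <- interval(ymd("2018-06-01"), ymd("2018-06-25"))

inside  <- ymd("2018-06-15") %within% campaign   # TRUE
on_edge <- ymd("2018-06-25") %within% campaign   # TRUE: the end point is included
outside <- ymd("2018-06-26") %within% campaign   # FALSE
```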

3.9 First day of the month with `floor_date`

This follows the approach in Spring (2018).

x <- ymd(c("2012-03-26", "2012-05-04", "2012-09-23", "2012-12-31"))
floor_date(x, "1 month")

## [1] "2012-03-01" "2012-05-01" "2012-09-01" "2012-12-01"

floor_date(x, "1 month") %>% decimal_date()

## [1] 2012.164 2012.331 2012.667 2012.915

3.10 Weekly data with Thursday as the first day of the week

date_add(date_add(makedate(year(inserttime),1), interval week(date_sub(inserttime,interval 4 day)) week), interval 3 day) as 发标时间,

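makedate(year(inserttime),1): January 1 of that year (201X-01-01)
date_sub(inserttime,interval 4 day): step back four days
week(date_sub(inserttime,interval 4 day)): the week number within the year of the date four days earlier; the week number is unchanged for every Friday, Saturday and Sunday, while Monday, Tuesday, Wednesday and Thursday each get one week less.
date_add(...,interval 3 day): I am still not clear on how this part of the calculation works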
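The SQL above is left as is. For comparison only, lubridate can bucket dates into Thursday-based weeks directly through the week_start argument of floor_date(). This is an alternative sketch, not a translation of the SQL, and the dates are made up.

```r
library(lubridate)

inserttime <- ymd(c("2018-05-02", "2018-05-03", "2018-05-09"))   # Wed, Thu, Wed

# Start of the Thursday-based week each date falls in (week_start = 4 means Thursday)
week_start_thu <- floor_date(inserttime, unit = "week", week_start = 4)

week_start_thu   # "2018-04-26" "2018-05-03" "2018-05-03"
```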

3.11 Creating a time series with `seq` (LaBarr 2018, Chapter 1.2)

seq(as.Date("2014-01-19"), length.out = 176, by = 7)

3.12 Generating random times

as.POSIXct("2017-10-08 07:00:00") + runif(n=100, min=0, max=3600)

Stack Overflow

3.13 Converting Unix time

See Eddelbuettel (2012).

as.POSIXct(1352068320, origin="1970-01-01")

## [1] "2012-11-05 06:32:00 CST"
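The output above is printed in the session's local time zone (CST here). A sketch of the lubridate equivalent with the time zone stated explicitly, so the result does not depend on the machine; the variable names are illustrative.

```r
library(lubridate)

unix_secs <- 1352068320

in_utc      <- as_datetime(unix_secs)                         # "2012-11-04 22:32:00 UTC"
in_shanghai <- as_datetime(unix_secs, tz = "Asia/Shanghai")   # "2012-11-05 06:32:00 CST"

in_utc == in_shanghai   # TRUE: same instant, different clock reading
```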

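3.14 Monthly change

With cohort_index measured in seconds, floor(cohort_index/30/3600/24) counts elapsed 30-day months. I find it more precise and easier to use than this package's functions.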
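Note that 30-day months and calendar months are different conventions and can disagree. The sketch below assumes cohort_index holds elapsed seconds; the dates and the calendar-month formula are illustrative additions, not part of the original note.

```r
library(lubridate)

start <- ymd_hms("2018-01-15 00:00:00")
end   <- ymd_hms("2018-03-15 00:00:00")

# Elapsed seconds, playing the role of cohort_index
cohort_index <- as.numeric(difftime(end, start, units = "secs"))

# 59 elapsed days make one full 30-day month
months_30d <- floor(cohort_index / 30 / 3600 / 24)

# Calendar-month difference from the year and month fields
months_cal <- (year(end) - year(start)) * 12 + (month(end) - month(start))

months_30d   # 1
months_cal   # 2
```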

Eddelbuettel, Dirk. 2012. “Convert Unix Epoch to Date Object.” 2012. https://stackoverflow.com/a/13456338.

LaBarr, Aric. 2018. “Forecasting Product Demand in R.” DataCamp. 2018. https://www.datacamp.com/courses/forecasting-product-demand-in-r.

R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Spring, Jon. 2018. 2018. https://community.rstudio.com/t/add-with-year-option-in-lubridate-month/9704.

大猫的R语言课堂. 2018. “如何快速按照日期分组.” 2018. https://mp.weixin.qq.com/s/8kpewvyOcIGglftkmnT6Bg.

[Tech · R] 📅 Tips for using lubridate