```{r setup, include=FALSE}
knitr::opts_chunk$set(eval = FALSE)
```

> The new pdftools package allows for extracting text and metadata¹ from pdf files in R. [@Ooms2016]
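The pdftools package is mainly used to extract

- text from PDF files and
- metadata (author, etc.),

which saves the time otherwise spent copying information out of PDFs by hand. The sections below introduce some commonly used functions.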
# Page text

Extract the text directly, ignoring highlights and other annotations.
```{r}
# install.packages("pdftools")
library(pdftools)
```
```{r eval=FALSE}
# download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
txt <- pdf_text("1403.2805.pdf")

# first page text
cat(txt[1])

# second page text
cat(txt[2])
```

# Extracting the table of contents

The table of contents is handy for planning your reading.

```{r}
# Table of contents
toc <- pdf_toc("1403.2805.pdf")

# Show as JSON
jsonlite::toJSON(toc, auto_unbox = TRUE, pretty = TRUE)
```
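The nested list returned by `pdf_toc()` can be hard to scan as JSON. As a minimal sketch, assuming each node is a list with a `title` string and a `children` list, the helper below flattens it into an indented outline. `flatten_toc()` and the stand-in `toc_example` list are hypothetical names used only for illustration; with a real file you would pass the result of `pdf_toc("1403.2805.pdf")` instead.

```{r}
# Hypothetical helper: flatten a nested TOC (title + children) into indented lines.
flatten_toc <- function(node, depth = 0) {
  lines <- character(0)
  title <- node$title
  # The root node may have an empty title, so blank titles are skipped.
  if (!is.null(title) && nzchar(title)) {
    lines <- paste0(strrep("  ", depth), title)
    depth <- depth + 1
  }
  # Recurse into the children, one indentation level deeper.
  for (child in node$children) {
    lines <- c(lines, flatten_toc(child, depth))
  }
  lines
}

# Stand-in data with the assumed shape; replace with pdf_toc("1403.2805.pdf").
toc_example <- list(
  title = "",
  children = list(
    list(title = "Introduction", children = list()),
    list(title = "Methods", children = list(
      list(title = "Parsing", children = list()),
      list(title = "Generating", children = list())
    ))
  )
)

outline <- flatten_toc(toc_example)
cat(outline, sep = "\n")
```

The recursion keeps the helper independent of how deep the outline goes, which seems preferable to hard-coding a fixed number of levels.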
# Extracting metadata

```{r eval=FALSE}
# Author, version, etc
(info <- pdf_info("1403.2805.pdf"))

# Table with fonts
(fonts <- pdf_fonts("1403.2805.pdf"))
```

# Converting a PDF to an image

```{r}
# renders pdf to bitmap array
bitmap <- pdf_render_page("1403.2805.pdf", page = 1)

# save bitmap image
png::writePNG(bitmap, "page.png")
jpeg::writeJPEG(bitmap, "page.jpeg")
webp::write_webp(bitmap, "page.webp")
```
This produces an image of the first page.
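To render more than the first page, the same two calls can be wrapped in a loop. The following is only a sketch under stated assumptions, not part of pdftools: `page_filenames()` and `render_pages()` are hypothetical helpers, and the `dpi` value of 150 is an arbitrary choice passed to the `dpi` argument of `pdf_render_page()`.

```{r}
# Hypothetical helper: build zero-padded output names such as "page_001.png".
page_filenames <- function(prefix, pages) {
  sprintf("%s_%03d.png", prefix, as.integer(pages))
}

# Hypothetical helper: render the requested pages and write one PNG per page.
# Needs the pdftools and png packages; nothing is rendered until it is called.
render_pages <- function(pdf, pages, prefix = "page", dpi = 150) {
  files <- page_filenames(prefix, pages)
  for (i in seq_along(pages)) {
    bitmap <- pdftools::pdf_render_page(pdf, page = pages[i], dpi = dpi)
    png::writePNG(bitmap, files[i])
  }
  invisible(files)
}

# File names that would be written for the first three pages.
page_filenames("page", 1:3)

# Example call, assuming 1403.2805.pdf has already been downloaded:
# render_pages("1403.2805.pdf", pages = 1:3)
```

Zero-padded names keep the output files in page order when a directory is sorted alphabetically.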
# Raw table

```{r}
# download.file("http://arxiv.org/pdf/1406.4806.pdf", "1406.4806.pdf", mode = "wb")
txt <- pdf_text("1406.4806.pdf")

# some tables
cat(txt[18])
cat(txt[19])
```

# Subset

See @Oomspdftools-22.

```{r}
# add2bibtex::add_bibtex("online")
# file.edit("add.bib")
# fs::dir_ls(regexp = "pdf$")

# extract some pages
pdf_subset("1403.2805.pdf", pages = 1:3, output = "subset.pdf")

# Should say 3
pdf_length("subset.pdf")
```
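When a long PDF needs to be cut into several pieces, the `pages` argument of `pdf_subset()` can be fed from a precomputed list of page ranges. The sketch below uses only base R to build those ranges. `page_chunks()` is a hypothetical helper, and the page count of 10 is a made-up stand-in for the value `pdf_length()` would report for a real file.

```{r}
# Hypothetical helper: split pages 1..n_pages into consecutive chunks of chunk_size.
page_chunks <- function(n_pages, chunk_size) {
  pages <- seq_len(n_pages)
  # ceiling(pages / chunk_size) assigns each page a chunk index: 1, 1, ..., 2, 2, ...
  unname(split(pages, ceiling(pages / chunk_size)))
}

chunks <- page_chunks(n_pages = 10, chunk_size = 4)
str(chunks)

# Each element can then be passed to pdf_subset(), for example:
# for (i in seq_along(chunks)) {
#   pdf_subset("1403.2805.pdf", pages = chunks[[i]],
#              output = sprintf("part_%02d.pdf", i))
# }
```

The last chunk is simply shorter when the page count is not a multiple of the chunk size.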
# Combine

```{r eval=FALSE}
# Generate another pdf
pdf("test.pdf")
plot(mtcars)
# close the device so that test.pdf is fully written before it is used
dev.off()
# optional preview inside RStudio
rstudioapi::viewer("test.pdf")

# Combine them with the other one
pdf_combine(c("test.pdf", "subset.pdf"), output = "joined.pdf")

# Should say 4
pdf_length("joined.pdf")
```
¹ Metadata here refers mainly to data descriptions, structure, statistical values, and so on. “These descriptive metadata, structural metadata, administrative metadata, reference metadata and statistical metadata.” [@wikimetadata]