
pdftools Tips and Tricks

The new pdftools package allows for extracting text and metadata¹ from pdf files in R. (Ooms 2016)

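The pdftools package is mainly used to extract

  1. the text of a PDF file and
  2. its metadata (author, etc.)

which saves the time otherwise spent copying information out of PDFs by hand. Some commonly used features are introduced below.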

Page text

Extract the text directly, leaving out highlights and other annotations.

# install.packages("pdftools")
library(pdftools)
## Warning: package 'pdftools' was built under R version 3.6.3
## Using poppler version 0.73.0
# download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
txt <- pdf_text("1403.2805.pdf")

# first page text
cat(txt[1])

# second page text
cat(txt[2])
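
pdf_text() returns a character vector with one string per page, so ordinary string functions can be applied to the result. A minimal sketch of that idea follows; the `pages` vector is a made-up stand-in for real pdf_text() output, used here so the example runs without a PDF at hand.

```r
# Hypothetical stand-in for pdf_text() output: one string per page
pages <- c(
  "OpenCPU system\nAbstract\nSome abstract text",
  "1 Introduction\nMore text on page two"
)

# Split each page into its individual lines
page_lines <- strsplit(pages, "\n", fixed = TRUE)
length(page_lines[[1]])   # number of lines on the first page

# Find the pages that mention a keyword
hits <- grep("Introduction", pages, fixed = TRUE)
hits
```

With a real document, `pages` would simply be replaced by `txt` from the call above.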

Extracting the table of contents

Handy for planning your reading.

# Table of contents
toc <- pdf_toc("1403.2805.pdf")

# Show as JSON
jsonlite::toJSON(toc, auto_unbox = TRUE, pretty = TRUE)
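
The table of contents comes back as a nested list in which each node has a `title` and a list of `children`. If a plain outline is easier to read than JSON, the tree can be flattened recursively. The sketch below uses a hand-built list of that shape as a stand-in for pdf_toc() output; `flatten_toc` is a hypothetical helper, not part of pdftools.

```r
# Hypothetical stand-in shaped like pdf_toc() output:
# every node has a `title` and a list of `children`
toc <- list(
  title = "",
  children = list(
    list(title = "Introduction", children = list(
      list(title = "Background", children = list())
    )),
    list(title = "Methods", children = list())
  )
)

# Walk the tree recursively and indent each title by its depth
flatten_toc <- function(node, depth = 0) {
  own <- if (nzchar(node$title)) {
    paste0(strrep("  ", max(depth - 1, 0)), node$title)
  } else {
    character(0)  # the root node usually has an empty title
  }
  kids <- unlist(lapply(node$children, flatten_toc, depth = depth + 1))
  c(own, kids)
}

outline <- flatten_toc(toc)
cat(outline, sep = "\n")
```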

Extracting metadata

# Author, version, etc
(info <- pdf_info("1403.2805.pdf"))

# Table with fonts
(fonts <- pdf_fonts("1403.2805.pdf"))
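
pdf_info() returns a list, so single fields can be picked out instead of printing everything. The sketch below assumes pdftools is installed and builds its own small PDF with the base `pdf()` device so that it does not depend on the downloaded paper; the file name `info-demo.pdf` is made up for the illustration.

```r
library(pdftools)

# Create a small two-page PDF to inspect (hypothetical file in tempdir)
path <- file.path(tempdir(), "info-demo.pdf")
pdf(path, title = "Demo document")
plot(1:10)   # page 1
plot(10:1)   # page 2
invisible(dev.off())

info <- pdf_info(path)
info$pages        # number of pages
info$keys$Title   # document title, if the PDF records one
info$created      # creation time, if present
```

Which entries appear under `keys` depends on what the PDF's author recorded, so any of them may be missing for a given file.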

Converting a PDF to an image

# renders pdf to bitmap array
bitmap <- pdf_render_page("1403.2805.pdf", page = 1)

# save bitmap image
png::writePNG(bitmap, "page.png")
jpeg::writeJPEG(bitmap, "page.jpeg")
webp::write_webp(bitmap, "page.webp")

This produces an image of the first page.
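
When only an image file is needed, pdf_convert() can write it in a single step, without going through a bitmap and a separate image package. A minimal sketch, assuming pdftools is installed and that the underlying poppler build supports PNG output; the demo PDF and the file names are made up for the illustration.

```r
library(pdftools)

# Create a one-page PDF to render (hypothetical file in tempdir)
path <- file.path(tempdir(), "render-demo.pdf")
pdf(path)
plot(1:10)
invisible(dev.off())

# Render page 1 straight to a PNG file at 150 dpi
out <- file.path(tempdir(), "render-demo-1.png")
pdf_convert(path, format = "png", pages = 1, filenames = out, dpi = 150)

file.exists(out)
```

A higher `dpi` gives a sharper but larger image; pdf_render_page() accepts the same `dpi` argument.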

raw table

# download.file("http://arxiv.org/pdf/1406.4806.pdf", "1406.4806.pdf", mode = "wb")
txt <- pdf_text("1406.4806.pdf")

# some tables
cat(txt[18])
##  Method     Target     Action          Parameters            Example
##  GET        object     retrieve        formatting            GET /ocpu/library/MASS/data/cats/json
##             manual     read            formatting            GET /ocpu/library/MASS/man/rlm/html
##             graphic    render          formatting            GET /ocpu/tmp/{key}/graphics/1/png
##             file       download        -                     GET /ocpu/library/MASS/NEWS
##             path       list contents   -                     GET /ocpu/library/MASS/scripts/
##  POST       object     call function   function arguments    POST /ocpu/library/stats/R/rnorm
##             file       run script      control interpreter   POST /ocpu/library/MASS/scripts/ch01.R
##                                  Table 1: Currently implemented HTTP methods
## 4.4     Status codes
## Each HTTP response includes a status code. Table 2 lists some common HTTP status codes used by OpenCPU
## that the client should be able to interpret. The meaning of these status codes is conform HTTP standards.
## The web server may use additional status codes for more general purposes that are not specific to OpenCPU.
##          Status Code            Happens when                   Response content
##          200 OK                 On successful GET request      Requested data
##          201 Created            On successful POST request     Output key and location
##          302 Found              Redirect                       Redirect location
##          400 Bad Request        On computational error in R    Error message from R in text/plain
##          502 Bad Gateway        Back-end server offline        – (See error logs)
##          503 Bad Request        Back-end server failure        – (See error logs)
##                                   Table 2: Commonly used HTTP status codes
## 4.5     Content-types
## Clients can retrieve objects in various formats by adding a format identifier suffix to the URL in a GET request.
## Which formats are supported and how object types map to a particular format is at the discretion of the
## server implementation. Not every format can support any object type. For example, csv can only be used to
## retrieve tabular data structures and png is only appropriate for graphics. Table 3 lists the formats OpenCPU
## supports, the respective internet media type, and the R function that OpenCPU uses to export an object into
## a particular format. Arguments of the GET requests are mapped to this export function. The png format
## has parameters such as width and height as documented in ?png, whereas the tab format has parameters
## sep, eol, dec which specify the delimiting, end-of-line and decimal character respectively as documented in
## ?write.table.
##                                                         18
cat(txt[19])
## Format     Content-type                       Export function             Example
## print      text/plain                         base::print                 /ocpu/cran/MASS/R/rlm/print
## rda        application/octet-stream           base::save                  /ocpu/cran/MASS/data/cats/rda
## rds        application/octet-stream           base::saveRDS               /ocpu/cran/MASS/data/cats/rds
## json       application/json                   jsonlite::toJSON            /ocpu/cran/MASS/data/cats/json
## pb         application/x-protobuf             RProtoBuf::serialize pb     /ocpu/cran/MASS/data/cats/pb
## tab        text/plain                         utils::write.table          /ocpu/cran/MASS/data/cats/tab
## csv        text/csv                           utils::write.csv            /ocpu/cran/MASS/data/cats/csv
## png        image/png                          grDevices::png              /ocpu/tmp/{key}/graphics/1/png
## pdf        application/pdf                    grDevices::pdf              /ocpu/tmp/{key}/graphics/1/pdf
## svg        image/svg+xml                      grDevices::svg              /ocpu/tmp/{key}/graphics/1/svg
##                Table 3: Currently supported export formats and corresponding Content-type
## 4.6     URLs
## The root of the API is dynamic, but defaults to /ocpu/ in the current implementation. Clients should make
## the OpenCPU server address and root path configurable. In the examples we assume the defaults. As discussed
## before, OpenCPU currently implements two container types to hold resources. Table 4 lists the URLs of the
## package container type, which includes objects, data, manual pages and files.
## Path       Description                                                   Examples
## .          Package information                                           /ocpu/cran/MASS/
## ./R        Exported namespace objects                                    /ocpu/cran/MASS/R/
##                                                                          /ocpu/cran/MASS/R/rlm/print
## ./data     Data objects in the package (HTTP GET only)                   /ocpu/cran/MASS/data/
##                                                                          /ocpu/cran/MASS/data/cats/json
## ./man      Manual pages in the package (HTTP GET only)                   /ocpu/cran/MASS/man/
##                                                                          /ocpu/cran/MASS/man/rlm/html
## ./*        Files in installation directory, relative to package the root /ocpu/cran/MASS/NEWS
##                                                                          /ocpu/cran/MASS/scripts/
##                 Table 4: The package container includes objects, data, manual pages and files.
## Table 5 lists URLs of the session container type. This container holds outputs generated from a RPC request
## and includes objects, graphics, source code, stdout and files. As noted earlier, the distinction between
## packages and sessions is considered an implementation detail. The API does not differentiate between objects
## and files that appear in packages or in sessions.
##                                                          19
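
Because pdf_text() keeps the column alignment as runs of spaces, a simple table can often be turned into a data frame by splitting each row on two or more consecutive spaces. The sketch below does this in base R for a few rows copied from Table 2 above. It is a heuristic, not a general parser: it assumes cells contain only single spaces and that no cell is empty, so a table with blank cells (such as the Method column of Table 1) would need a position-based approach instead.

```r
# A few rows of Table 2, as they appear in the extracted page text
page <- paste(
  "         Status Code            Happens when                   Response content",
  "         200 OK                 On successful GET request      Requested data",
  "         201 Created            On successful POST request     Output key and location",
  "         302 Found              Redirect                       Redirect location",
  sep = "\n"
)

# One element per row, with the leading indentation removed
rows <- trimws(strsplit(page, "\n", fixed = TRUE)[[1]])

# Split every row wherever two or more spaces occur
cells <- strsplit(rows, " {2,}")

# First row holds the header, the rest is data
tab <- as.data.frame(do.call(rbind, cells[-1]), stringsAsFactors = FALSE)
names(tab) <- cells[[1]]
tab
```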

subset

See Ooms (2019).

# add2bibtex::add_bibtex("online")
# file.edit("add.bib")
# fs::dir_ls(regexp = "pdf$")

# extract some pages
pdf_subset('1403.2805.pdf',
           pages = 1:3, output = "subset.pdf")

# Should say 3
pdf_length("subset.pdf")

combine

# Generate another pdf
pdf("test.pdf")
plot(mtcars)
dev.off()  # close the device so that test.pdf is actually written
# rstudioapi::viewer("test.pdf")  # optional: preview the file in RStudio

# Combine them with the other one
pdf_combine(c("test.pdf", "subset.pdf"), output = "joined.pdf")

# Should say 4
pdf_length("joined.pdf")

Ooms, Jeroen. 2016. “Introducing pdftools - a Fast and Portable PDF Extractor.” 2016. https://ropensci.org/blog/2016/03/01/pdftools-and-jeroen/.

Ooms, Jeroen. 2019. “Join, Split, and Compress PDF Files with pdftools.” rOpenSci. 2019. https://ropensci.org/technotes/2019/04/24/pdftools-22/.

Wikipedia contributors. 2018. “Metadata — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=Metadata&oldid=864629869.


  1. Mainly refers to descriptions of the data, its structure, summary statistics and the like. “These descriptive metadata, structural metadata, administrative metadata, reference metadata and statistical metadata.” (Wikipedia contributors 2018)