[Tech · R] 📄 pdftools PDF Processing Tips: Text Extraction and Analysis

The new pdftools package allows for extracting text and metadata[1] from pdf files in R. (Ooms 2016)

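The pdftools package is mainly used to extract

  1. the text of a PDF and
  2. its metadata (author, etc.)

which saves us the time of copying information out of PDFs by hand. Some commonly used functions are introduced below.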

Page text

Extract the text directly, ignoring highlights and other annotations.

# install.packages("pdftools")
library(pdftools)
## Warning: package 'pdftools' was built under R version 3.6.3
## Using poppler version 0.73.0
# download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
txt <- pdf_text("1403.2805.pdf")

# first page text
cat(txt[1])

# second page text
cat(txt[2])
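
Since pdf_text() returns one string per page, a common next step is to break the pages into lines and search them. The following is a minimal sketch in base R; the `txt` vector here is a made-up stand-in for real pdf_text() output, and it assumes lines are separated by "\n".

```r
# Stand-in for the output of pdf_text(): one string per page (made-up content)
txt <- c(
  "Title of the paper\nAbstract\nThis page mentions JSON once.",
  "1 Introduction\nJSON is a text format.\nR has data frames."
)

# Split every page into its lines (assumes "\n" line separators)
page_lines <- strsplit(txt, "\n", fixed = TRUE)

# Build a long table: one row per line, tagged with its page number
lines_df <- data.frame(
  page = rep(seq_along(page_lines), lengths(page_lines)),
  text = unlist(page_lines),
  stringsAsFactors = FALSE
)

# Keep only the lines that mention a keyword
hits <- lines_df[grepl("JSON", lines_df$text, fixed = TRUE), ]
print(hits)
```

Keeping the page number next to each line makes it easy to jump back to the right page of the PDF.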

Extract the table of contents

Useful for planning what to read.

# Table of contents
toc <- pdf_toc("1403.2805.pdf")

# Show as JSON
jsonlite::toJSON(toc, auto_unbox = TRUE, pretty = TRUE)
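
For a reading plan, an indented outline can be easier to scan than JSON. The sketch below assumes the table of contents is a nested list in which every node has a `title` and a list of `children`; the `toc` object is hand-built for illustration, and `flatten_toc` is a hypothetical helper, not part of pdftools.

```r
# Hand-built stand-in mimicking the nested shape assumed for pdf_toc() output
toc <- list(
  title = "",
  children = list(
    list(title = "1 Introduction", children = list(
      list(title = "1.1 Background", children = list())
    )),
    list(title = "2 Methods", children = list())
  )
)

# Walk the tree depth-first and collect the titles, indented by nesting depth
flatten_toc <- function(node, depth = 0) {
  out <- character(0)
  if (nzchar(node$title)) {
    out <- paste0(strrep("  ", depth), node$title)
    depth <- depth + 1
  }
  for (child in node$children) {
    out <- c(out, flatten_toc(child, depth))
  }
  out
}

outline <- flatten_toc(toc)
cat(outline, sep = "\n")
```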

Extract metadata

# Author, version, etc
(info <- pdf_info("1403.2805.pdf"))

# Table with fonts
(fonts <- pdf_fonts("1403.2805.pdf"))

Convert PDF to image

# renders pdf to bitmap array
bitmap <- pdf_render_page("1403.2805.pdf", page = 1)

# save bitmap image
png::writePNG(bitmap, "page.png")
jpeg::writeJPEG(bitmap, "page.jpeg")
webp::write_webp(bitmap, "page.webp")

This produces an image of the first page.

raw table

# download.file("http://arxiv.org/pdf/1406.4806.pdf", "1406.4806.pdf", mode = "wb")
txt <- pdf_text("1406.4806.pdf")

# some tables
cat(txt[18])
##  Method     Target     Action          Parameters            Example
##  GET        object     retrieve        formatting            GET /ocpu/library/MASS/data/cats/json
##             manual     read            formatting            GET /ocpu/library/MASS/man/rlm/html
##             graphic    render          formatting            GET /ocpu/tmp/{key}/graphics/1/png
##             file       download        -                     GET /ocpu/library/MASS/NEWS
##             path       list contents   -                     GET /ocpu/library/MASS/scripts/
##  POST       object     call function   function arguments    POST /ocpu/library/stats/R/rnorm
##             file       run script      control interpreter   POST /ocpu/library/MASS/scripts/ch01.R
##                                  Table 1: Currently implemented HTTP methods
## 4.4     Status codes
## Each HTTP response includes a status code. Table 2 lists some common HTTP status codes used by OpenCPU
## that the client should be able to interpret. The meaning of these status codes is conform HTTP standards.
## The web server may use additional status codes for more general purposes that are not specific to OpenCPU.
##          Status Code            Happens when                   Response content
##          200 OK                 On successful GET request      Requested data
##          201 Created            On successful POST request     Output key and location
##          302 Found              Redirect                       Redirect location
##          400 Bad Request        On computational error in R    Error message from R in text/plain
##          502 Bad Gateway        Back-end server offline        – (See error logs)
##          503 Bad Request        Back-end server failure        – (See error logs)
##                                   Table 2: Commonly used HTTP status codes
## 4.5     Content-types
## Clients can retrieve objects in various formats by adding a format identifier suffix to the URL in a GET request.
## Which formats are supported and how object types map to a particular format is at the discretion of the
## server implementation. Not every format can support any object type. For example, csv can only be used to
## retrieve tabular data structures and png is only appropriate for graphics. Table 3 lists the formats OpenCPU
## supports, the respective internet media type, and the R function that OpenCPU uses to export an object into
## a particular format. Arguments of the GET requests are mapped to this export function. The png format
## has parameters such as width and height as documented in ?png, whereas the tab format has parameters
## sep, eol, dec which specify the delimiting, end-of-line and decimal character respectively as documented in
## ?write.table.
##                                                         18
cat(txt[19])
## Format     Content-type                       Export function             Example
## print      text/plain                         base::print                 /ocpu/cran/MASS/R/rlm/print
## rda        application/octet-stream           base::save                  /ocpu/cran/MASS/data/cats/rda
## rds        application/octet-stream           base::saveRDS               /ocpu/cran/MASS/data/cats/rds
## json       application/json                   jsonlite::toJSON            /ocpu/cran/MASS/data/cats/json
## pb         application/x-protobuf             RProtoBuf::serialize pb     /ocpu/cran/MASS/data/cats/pb
## tab        text/plain                         utils::write.table          /ocpu/cran/MASS/data/cats/tab
## csv        text/csv                           utils::write.csv            /ocpu/cran/MASS/data/cats/csv
## png        image/png                          grDevices::png              /ocpu/tmp/{key}/graphics/1/png
## pdf        application/pdf                    grDevices::pdf              /ocpu/tmp/{key}/graphics/1/pdf
## svg        image/svg+xml                      grDevices::svg              /ocpu/tmp/{key}/graphics/1/svg
##                Table 3: Currently supported export formats and corresponding Content-type
## 4.6     URLs
## The root of the API is dynamic, but defaults to /ocpu/ in the current implementation. Clients should make
## the OpenCPU server address and root path configurable. In the examples we assume the defaults. As discussed
## before, OpenCPU currently implements two container types to hold resources. Table 4 lists the URLs of the
## package container type, which includes objects, data, manual pages and files.
## Path       Description                                                   Examples
## .          Package information                                           /ocpu/cran/MASS/
## ./R        Exported namespace objects                                    /ocpu/cran/MASS/R/
##                                                                          /ocpu/cran/MASS/R/rlm/print
## ./data     Data objects in the package (HTTP GET only)                   /ocpu/cran/MASS/data/
##                                                                          /ocpu/cran/MASS/data/cats/json
## ./man      Manual pages in the package (HTTP GET only)                   /ocpu/cran/MASS/man/
##                                                                          /ocpu/cran/MASS/man/rlm/html
## ./*        Files in installation directory, relative to package the root /ocpu/cran/MASS/NEWS
##                                                                          /ocpu/cran/MASS/scripts/
##                 Table 4: The package container includes objects, data, manual pages and files.
## Table 5 lists URLs of the session container type. This container holds outputs generated from a RPC request
## and includes objects, graphics, source code, stdout and files. As noted earlier, the distinction between
## packages and sessions is considered an implementation detail. The API does not differentiate between objects
## and files that appear in packages or in sessions.
##                                                          19
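
The extracted text keeps the table columns apart with runs of spaces, so a simple table can often be rebuilt by splitting each line on two or more consecutive spaces. This is a rough sketch in base R using a few lines copied from the Table 2 output above; it assumes no cell contains a double space and that every row has the same number of cells, which does not hold for tables with empty cells such as Table 1.

```r
# A few lines copied from the Table 2 output above
page <- paste(
  "         Status Code            Happens when                   Response content",
  "         200 OK                 On successful GET request      Requested data",
  "         201 Created            On successful POST request     Output key and location",
  "         302 Found              Redirect                       Redirect location",
  sep = "\n"
)

# One element per line, with the leading indentation removed
lines <- trimws(strsplit(page, "\n", fixed = TRUE)[[1]])

# Split each line into cells on runs of two or more spaces
cells <- strsplit(lines, " {2,}")
stopifnot(all(lengths(cells) == 3))

# First line is the header, the rest are data rows
tab <- as.data.frame(do.call(rbind, cells[-1]), stringsAsFactors = FALSE)
names(tab) <- cells[[1]]
print(tab)
```

For tables with blank cells or wrapped rows, splitting on fixed column positions is likely a safer choice than splitting on whitespace.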

subset

See Ooms (2019).

# add2bibtex::add_bibtex("online")
# file.edit("add.bib")
# fs::dir_ls(regexp = "pdf$")

# extract some pages
pdf_subset('1403.2805.pdf',
           pages = 1:3, output = "subset.pdf")

# Should say 3
pdf_length("subset.pdf")
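
The same idea extends to cutting a long PDF into several parts. The sketch below computes consecutive page ranges in base R; `chunk_pages` and the `part-%d.pdf` output names are hypothetical, and the pdf_subset() calls run only if pdftools is installed and the example file exists.

```r
# Group page numbers 1..n_pages into consecutive chunks of at most chunk_size
chunk_pages <- function(n_pages, chunk_size) {
  pages <- seq_len(n_pages)
  unname(split(pages, ceiling(pages / chunk_size)))
}

# Ten pages in chunks of four: pages 1-4, 5-8 and 9-10
chunks <- chunk_pages(10, 4)
str(chunks)

# Apply the ranges to a real file only when it is available
input <- "1403.2805.pdf"
if (requireNamespace("pdftools", quietly = TRUE) && file.exists(input)) {
  real_chunks <- chunk_pages(pdftools::pdf_length(input), 4)
  for (i in seq_along(real_chunks)) {
    pdftools::pdf_subset(input, pages = real_chunks[[i]],
                         output = sprintf("part-%d.pdf", i))
  }
}
```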

combine

# Generate another pdf
pdf("test.pdf")
plot(mtcars)
dev.off()  # close the device so that test.pdf is written to disk
# rstudioapi::viewer("test.pdf")  # optional preview in RStudio

# Combine them with the other one
pdf_combine(c("test.pdf", "subset.pdf"), output = "joined.pdf")

# Should say 4
pdf_length("joined.pdf")

Ooms, Jeroen. 2016. “Introducing pdftools - a Fast and Portable PDF Extractor.” 2016. https://ropensci.org/blog/2016/03/01/pdftools-and-jeroen/.

Ooms, Jeroen. 2019. “Join, Split, and Compress PDF Files with pdftools.” rOpenSci. 2019. https://ropensci.org/technotes/2019/04/24/pdftools-22/.

Wikipedia contributors. 2018. “Metadata — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=Metadata&oldid=864629869.


  1. Metadata here mainly means a data set's description, structure, summary statistics, and so on. “These descriptive metadata, structural metadata, administrative metadata, reference metadata and statistical metadata.”(Wikipedia contributors 2018)