The new pdftools package allows for extracting text and metadata from PDF files in R (Ooms 2016).
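The pdftools package is mainly used to extract
- the text of a PDF document and
- its metadata (author, etc.),
which saves the time otherwise spent copying information out of PDFs by hand. The sections below introduce some commonly used functions.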
Page text
Extract the text directly, ignoring highlights and other annotations.
# install.packages("pdftools")
library(pdftools)
## Warning: package 'pdftools' was built under R version 3.6.3
## Using poppler version 0.73.0
# download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
txt <- pdf_text("1403.2805.pdf")
# first page text
cat(txt[1])
# second page text
cat(txt[2])
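pdf_text() returns a character vector with one string per page, and the lines within a page are separated by newline characters. The sketch below shows one way to break a page into clean lines. It uses a made-up string, `page`, as a stand-in for `txt[1]` so that it runs without the PDF; the variable names are illustrative only.

```r
# Hypothetical stand-in for txt[1]; real pages come from pdf_text().
page <- "Title of the paper\n\n  First Author\n  Some Department\n"

# Split the page on newlines; strsplit() returns a list, so take element 1.
lines <- strsplit(page, "\n", fixed = TRUE)[[1]]

# Trim surrounding whitespace and drop empty lines.
lines <- trimws(lines)
lines <- lines[nzchar(lines)]

print(lines)
```

The same steps should apply to a real page by replacing `page` with `txt[1]`.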
Extracting the table of contents
Useful for planning your reading.
# Table of contents
toc <- pdf_toc("1403.2805.pdf")
# Show as JSON
jsonlite::toJSON(toc, auto_unbox = TRUE, pretty = TRUE)
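For a reading plan, an indented outline can be easier to scan than JSON. The sketch below assumes the nested shape that pdf_toc() returns, where each node is a list with a `title` and a list of `children`. The `toc` object here is a hypothetical example built by hand, and `flatten_toc` is an illustrative helper, not part of pdftools.

```r
# Hypothetical outline in the shape pdf_toc() returns: nested title/children lists.
toc <- list(
  title = "",
  children = list(
    list(title = "1 Introduction", children = list(
      list(title = "1.1 Background", children = list())
    )),
    list(title = "2 Methods", children = list())
  )
)

# Recursively collect titles, indenting two spaces per nesting level.
flatten_toc <- function(node, depth = 0) {
  out <- character(0)
  has_title <- nzchar(node$title)
  if (has_title) {
    out <- paste0(strrep("  ", depth), node$title)
  }
  # The untitled root does not add a level of indentation.
  next_depth <- if (has_title) depth + 1 else depth
  for (child in node$children) {
    out <- c(out, flatten_toc(child, next_depth))
  }
  out
}

outline <- flatten_toc(toc)
cat(outline, sep = "\n")
```

With a real file, passing the result of `pdf_toc("1403.2805.pdf")` to `flatten_toc()` should work the same way, provided the outline has this shape.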
Extracting metadata
# Author, version, etc
(info <- pdf_info("1403.2805.pdf"))
# Table with fonts
(fonts <- pdf_fonts("1403.2805.pdf"))
Converting a PDF to images
# renders pdf to bitmap array
bitmap <- pdf_render_page("1403.2805.pdf", page = 1)
# save bitmap image
png::writePNG(bitmap, "page.png")
jpeg::writeJPEG(bitmap, "page.jpeg")
webp::write_webp(bitmap, "page.webp")
This produces an image of the first page.
raw table
# download.file("http://arxiv.org/pdf/1406.4806.pdf", "1406.4806.pdf", mode = "wb")
txt <- pdf_text("1406.4806.pdf")
# some tables
cat(txt[18])
## Method  Target   Action         Parameters           Example
## GET     object   retrieve       formatting           GET /ocpu/library/MASS/data/cats/json
##         manual   read           formatting           GET /ocpu/library/MASS/man/rlm/html
##         graphic  render         formatting           GET /ocpu/tmp/{key}/graphics/1/png
##         file     download       -                    GET /ocpu/library/MASS/NEWS
##         path     list contents  -                    GET /ocpu/library/MASS/scripts/
## POST    object   call function  function arguments   POST /ocpu/library/stats/R/rnorm
##         file     run script     control interpreter  POST /ocpu/library/MASS/scripts/ch01.R
## Table 1: Currently implemented HTTP methods
## 4.4 Status codes
## Each HTTP response includes a status code. Table 2 lists some common HTTP status codes used by OpenCPU
## that the client should be able to interpret. The meaning of these status codes is conform HTTP standards.
## The web server may use additional status codes for more general purposes that are not specific to OpenCPU.
## Status Code      Happens when                 Response content
## 200 OK           On successful GET request    Requested data
## 201 Created      On successful POST request   Output key and location
## 302 Found        Redirect                     Redirect location
## 400 Bad Request  On computational error in R  Error message from R in text/plain
## 502 Bad Gateway  Back-end server offline      – (See error logs)
## 503 Bad Request  Back-end server failure      – (See error logs)
## Table 2: Commonly used HTTP status codes
## 4.5 Content-types
## Clients can retrieve objects in various formats by adding a format identifier suffix to the URL in a GET request.
## Which formats are supported and how object types map to a particular format is at the discretion of the
## server implementation. Not every format can support any object type. For example, csv can only be used to
## retrieve tabular data structures and png is only appropriate for graphics. Table 3 lists the formats OpenCPU
## supports, the respective internet media type, and the R function that OpenCPU uses to export an object into
## a particular format. Arguments of the GET requests are mapped to this export function. The png format
## has parameters such as width and height as documented in ?png, whereas the tab format has parameters
## sep, eol, dec which specify the delimiting, end-of-line and decimal character respectively as documented in
## ?write.table.
## 18
cat(txt[19])
## Format  Content-type              Export function          Example
## print   text/plain                base::print              /ocpu/cran/MASS/R/rlm/print
## rda     application/octet-stream  base::save               /ocpu/cran/MASS/data/cats/rda
## rds     application/octet-stream  base::saveRDS            /ocpu/cran/MASS/data/cats/rds
## json    application/json          jsonlite::toJSON         /ocpu/cran/MASS/data/cats/json
## pb      application/x-protobuf    RProtoBuf::serialize pb  /ocpu/cran/MASS/data/cats/pb
## tab     text/plain                utils::write.table       /ocpu/cran/MASS/data/cats/tab
## csv     text/csv                  utils::write.csv         /ocpu/cran/MASS/data/cats/csv
## png     image/png                 grDevices::png           /ocpu/tmp/{key}/graphics/1/png
## pdf     application/pdf           grDevices::pdf           /ocpu/tmp/{key}/graphics/1/pdf
## svg     image/svg+xml             grDevices::svg           /ocpu/tmp/{key}/graphics/1/svg
## Table 3: Currently supported export formats and corresponding Content-type
## 4.6 URLs
## The root of the API is dynamic, but defaults to /ocpu/ in the current implementation. Clients should make
## the OpenCPU server address and root path configurable. In the examples we assume the defaults. As discussed
## before, OpenCPU currently implements two container types to hold resources. Table 4 lists the URLs of the
## package container type, which includes objects, data, manual pages and files.
## Path    Description                                                    Examples
## .       Package information                                            /ocpu/cran/MASS/
## ./R     Exported namespace objects                                     /ocpu/cran/MASS/R/
##                                                                        /ocpu/cran/MASS/R/rlm/print
## ./data  Data objects in the package (HTTP GET only)                    /ocpu/cran/MASS/data/
##                                                                        /ocpu/cran/MASS/data/cats/json
## ./man   Manual pages in the package (HTTP GET only)                    /ocpu/cran/MASS/man/
##                                                                        /ocpu/cran/MASS/man/rlm/html
## ./*     Files in installation directory, relative to package the root  /ocpu/cran/MASS/NEWS
##                                                                        /ocpu/cran/MASS/scripts/
## Table 4: The package container includes objects, data, manual pages and files.
## Table 5 lists URLs of the session container type. This container holds outputs generated from a RPC request
## and includes objects, graphics, source code, stdout and files. As noted earlier, the distinction between
## packages and sessions is considered an implementation detail. The API does not differentiate between objects
## and files that appear in packages or in sessions.
## 19
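The page text comes back as plain strings, so a table still has to be split into columns by hand. A minimal sketch, assuming the columns are padded with runs of two or more spaces (pdf_text() generally tries to keep the page layout, but the exact spacing varies per document). The `tbl_text` vector is a hypothetical excerpt in the style of Table 3, with spacing chosen for illustration.

```r
# Hypothetical excerpt in the style of Table 3; the column spacing is assumed.
tbl_text <- c(
  "Format  Content-type      Export function    Example",
  "json    application/json  jsonlite::toJSON   /ocpu/cran/MASS/data/cats/json",
  "csv     text/csv          utils::write.csv   /ocpu/cran/MASS/data/cats/csv",
  "png     image/png         grDevices::png     /ocpu/tmp/{key}/graphics/1/png"
)

# Split each line on runs of two or more spaces, so that single spaces
# inside a cell (such as "Export function") are kept.
cells <- strsplit(trimws(tbl_text), " {2,}")

# The first line is the header; bind the remaining rows into a data frame.
body <- do.call(rbind, cells[-1])
tbl <- as.data.frame(body, stringsAsFactors = FALSE)
names(tbl) <- cells[[1]]

print(tbl)
```

This approach breaks down when a row has empty cells or wrapped lines, as in Tables 1 and 4, because every row must split into the same number of pieces. Those cases need fixed-width parsing or manual cleanup.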
subset
See Ooms (2019).
# fs::dir_ls(regexp = "pdf$")
# extract some pages
pdf_subset('1403.2805.pdf',
pages = 1:3, output = "subset.pdf")
# Should say 3
pdf_length("subset.pdf")
combine
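# Generate another pdf
pdf("test.pdf")
plot(mtcars)
# Close the graphics device so that test.pdf is written to disk
dev.off()
# Optionally preview the result in RStudio
# rstudioapi::viewer("test.pdf")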
# Combine them with the other one
pdf_combine(c("test.pdf", "subset.pdf"), output = "joined.pdf")
# Should say 4
pdf_length("joined.pdf")
Ooms, Jeroen. 2016. “Introducing pdftools - A Fast and Portable PDF Extractor.” rOpenSci. https://ropensci.org/blog/2016/03/01/pdftools-and-jeroen/.
Ooms, Jeroen. 2019. “Join, Split, and Compress PDF Files with pdftools.” rOpenSci. https://ropensci.org/technotes/2019/04/24/pdftools-22/.
Wikipedia contributors. 2018. “Metadata — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=Metadata&oldid=864629869.