R - Error trying to read a PDF using readPDF from the tm package
(Windows 7 / R version 3.0.1)
The commands below result in this error:
> library(tm)
> pdf <- readPDF(PdftotextOptions = "-layout")
> dat <- pdf(elem = list(uri = "17214.pdf"), language = "de", id = "id1")
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  cannot open file 'C:\Users\Raffael\AppData\Local\Temp\rtmps8uql1\pdfinfo167c2bc159f8': No such file or directory

How can I solve this issue?
Edit I
(as suggested by Ben and described here)
I downloaded Xpdf and copied the 32-bit version to C:\Program Files (x86)\xpdf32 and the 64-bit version to C:\Program Files\xpdf64.
The environment variables pdfinfo and pdftotext refer to the respective executables, either 32-bit (tested with 32-bit R) or 64-bit (tested with 64-bit R).
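A quick way to check whether this setup is visible to R: the tm code quoted in this post calls the tools by bare name through system2("pdfinfo", ...) and system2("pdftotext", ...), so they need to be findable on the search path. As far as I know, environment variables that merely carry those names are not consulted. The sketch below uses Sys.which() for the lookup; the variable xpdf_dir and the session-only PATH change are assumptions for illustration, using the 64-bit folder named above.

```r
# Ask R where it would find the Xpdf command line tools.
# An empty string means "not found on the search path".
found <- Sys.which(c("pdfinfo", "pdftotext"))
print(found)

# Assumed install folder (taken from the post); adjust for 32-bit R.
xpdf_dir <- "C:/Program Files/xpdf64"

# If a tool is missing and the folder exists, prepend it to PATH.
# This only affects the current R session.
if (any(found == "") && file.exists(xpdf_dir)) {
  Sys.setenv(PATH = paste(normalizePath(xpdf_dir),
                          Sys.getenv("PATH"),
                          sep = .Platform$path.sep))
  print(Sys.which(c("pdfinfo", "pdftotext")))
}
```

If both entries come back non-empty, the lookup itself is probably not the cause and the problem lies elsewhere.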
Edit II
One confusing observation: starting from a fresh session (tm not loaded), the last command alone produces the same error:
> dat <- pdf(elem = list(uri = "17214.pdf"), language = "de", id = "id1")
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  cannot open file 'C:\Users\Raffael\AppData\Local\Temp\rtmpki5gnl\pdfinfode8283c422f': No such file or directory

I don't understand this at all, because the function variable has not yet been defined by tm.readPDF. Below is the function that pdf refers to "naturally", followed by the one returned by tm.readPDF:
> pdf
function (elem, language, id)
{
    meta <- tm:::pdfinfo(elem$uri)
    content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri),
        "-"), stdout = TRUE)
    PlainTextDocument(content, meta$Author, meta$CreationDate,
        meta$Subject, meta$Title, id, meta$Creator, language)
}
<environment: 0x0674bd8c>

> library(tm)
> pdf <- readPDF(PdftotextOptions = "-layout")
> pdf
function (elem, language, id)
{
    meta <- tm:::pdfinfo(elem$uri)
    content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri),
        "-"), stdout = TRUE)
    PlainTextDocument(content, meta$Author, meta$CreationDate,
        meta$Subject, meta$Title, id, meta$Creator, language)
}
<environment: 0x0c3d7364>

Apparently there is no difference, so why use readPDF at all?
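On that closing question: two printouts can look identical and still behave differently, because a function returned by another function keeps the arguments it was created with in its own environment (note the differing <environment: ...> lines above, and that PdftotextOptions is used in the body without being a parameter). Here is a minimal sketch of that mechanism. The factory make_reader is hypothetical and only stands in for readPDF; it builds the argument vector for pdftotext and runs nothing.

```r
# Hypothetical stand-in for readPDF(): a factory that captures its options.
make_reader <- function(PdftotextOptions = character()) {
  function(uri) {
    # Arguments that would be handed to pdftotext; nothing is executed.
    c(PdftotextOptions, shQuote(uri), "-")
  }
}

reader_layout <- make_reader("-layout")
reader_plain  <- make_reader()

# The function bodies are the same, so the printed code looks alike ...
print(identical(body(reader_layout), body(reader_plain)))

# ... but each closure carries its own captured option value.
print(get("PdftotextOptions", environment(reader_layout)))
print(get("PdftotextOptions", environment(reader_plain)))

# The difference shows up in the arguments that get built.
print(reader_layout("17214.pdf"))
print(reader_plain("17214.pdf"))
```

Seen this way, calling readPDF is what binds the chosen options to the reader. A pdf object that already exists in a fresh session would have to come from somewhere else, for example a restored workspace, though the post does not confirm that.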
Edit III
The PDF file is located here: C:\Users\Raffael\Documents
> getwd()
[1] "C:/Users/Raffael/Documents"

Edit IV
The first instruction in the pdf() call is tm:::pdfinfo(), and the error is caused there within the first few lines:
> outfile <- tempfile("pdfinfo")
> on.exit(unlink(outfile))
> status <- system2("pdfinfo", shQuote(normalizePath("C:/Users/Raffael/Documents/17214.pdf")),
+                   stdout = outfile)
> tags <- c("Title", "Subject", "Keywords", "Author", "Creator",
+           "Producer", "CreationDate", "ModDate", "Tagged", "Form",
+           "Pages", "Encrypted", "Page size", "File size", "Optimized",
+           "PDF version")
> re <- sprintf("^(%s)", paste(sprintf("%-16s", sprintf("%s:",
+                                                       tags)), collapse = "|"))
> lines <- readLines(outfile, warn = FALSE)
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  cannot open file 'C:\Users\Raffael\AppData\Local\Temp\rtmpquryx6\pdfinfo8d419174450':
  No such file or directory

Apparently tempfile() doesn't create a file.
> outfile <- tempfile("pdfinfo")
> outfile
[1] "C:\\Users\\Raffael\\AppData\\Local\\Temp\\rtmpquryx6\\pdfinfo8d437bd65d9"

The folder C:\Users\Raffael\AppData\Local\Temp\rtmpquryx6 exists and holds some files, but none is named pdfinfo8d437bd65d9.
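That observation matches how tempfile() is documented: it only returns a file name that is not currently in use, and something else has to write to it. In the snippet above that would be the pdfinfo call via stdout = outfile, so a missing file suggests that the external command never produced any output. A small sketch of the tempfile() behaviour on its own, with a made-up line of content:

```r
# tempfile() hands back a fresh name; it does not create anything on disk.
outfile <- tempfile("pdfinfo")
existed_before <- file.exists(outfile)
print(existed_before)

# The file only appears once something writes to that name.
writeLines("Title:          example", outfile)
existed_after <- file.exists(outfile)
print(existed_after)

# Clean up, as the on.exit(unlink(outfile)) call in the post would do.
unlink(outfile)
```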
Interesting, on my machine after a fresh start pdf is the function that converts an image to a PDF:
> getAnywhere(pdf)
A single object matching ‘pdf’ was found
It was found in the following places
  package:grDevices
  namespace:grDevices
[etc.]

But back to the problem of reading PDF files in as text: fiddling with the PATH is a bit hit-and-miss (and annoying if you work across several different computers), so I think the simplest and safest method is to call pdftotext using system, as Tony Breyal describes here.
In your case it would be (note the two sets of quotes):
system(paste('"C:/Program Files/xpdf64/pdftotext.exe"',
             '"C:/Users/Raffael/Documents/17214.pdf"'), wait = FALSE)
This could be extended with an *apply function or a loop if you have many PDF files.
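A rough sketch of that extension, assuming the same executable and folder as above. The names pdftotext_exe, pdf_dir and build_cmd are made up for illustration, and shQuote(type = "cmd") is used to produce the two sets of double quotes. Nothing is run unless the executable actually exists at the assumed location.

```r
# Assumed locations, taken from the post; adjust to your machine.
pdftotext_exe <- "C:/Program Files/xpdf64/pdftotext.exe"
pdf_dir <- "C:/Users/Raffael/Documents"

# All PDF files in the folder, with full paths.
pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$",
                        full.names = TRUE, ignore.case = TRUE)

# Build one command line per file, quoting both paths.
build_cmd <- function(pdf_file) {
  paste(shQuote(pdftotext_exe, type = "cmd"),
        shQuote(pdf_file, type = "cmd"))
}

# Convert each file in turn; pdftotext writes a .txt next to each PDF.
if (file.exists(pdftotext_exe)) {
  invisible(lapply(pdf_files, function(f) system(build_cmd(f), wait = TRUE)))
}
```

Here wait = TRUE is used instead of wait = FALSE so that the conversions run one after another and do not all start at once; that is a design choice, not a requirement.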