
Problems with PDF

Posted: Sun Sep 27, 2020 9:36 am
by webfork
While I've become very good at dealing with difficult PDF files over the years, this article offers a solid breakdown of why the original design created so many problems:

What's so hard about PDF text extraction? https://filingdb.com/b/pdf-text-extraction (from: https://news.ycombinator.com/item?id=22473263)

In addition to notes about invisible text, added spaces for text separation, and off-page text (because, you know, why not), the author points out that the format was always intended as an OUTPUT format, never an input.
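To make the invisible and off-page text problem concrete, here is a minimal Python sketch of the kind of filtering an extractor ends up doing. The `Span` type, its fields, and the sample values are hypothetical and not taken from the article or any real library. The one detail borrowed from the PDF format itself is that text rendering mode 3 means the glyphs are neither filled nor stroked, so they are never painted.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical text run as an extractor might see it (not a real library type)."""
    text: str
    x: float              # origin of the run, in page units
    y: float
    render_mode: int = 0  # PDF text rendering mode; 3 = neither filled nor stroked


def visible_on_page(spans, page_width, page_height):
    """Keep runs whose origin is inside the page box and that are actually painted."""
    kept = []
    for s in spans:
        on_page = 0 <= s.x <= page_width and 0 <= s.y <= page_height
        painted = s.render_mode != 3
        if on_page and painted:
            kept.append(s)
    return kept


# Made-up sample data for a US Letter page (612 x 792 units).
spans = [
    Span("Revenue", 72, 700),
    Span("hidden watermark", 72, 700, render_mode=3),  # invisible text
    Span("stray note", -500, 700),                     # placed off the page
]
print([s.text for s in visible_on_page(spans, 612, 792)])  # prints ['Revenue']
```

A real extractor has more to worry about (clipping paths, white-on-white text, runs that start on the page and run off it), so treat this as an illustration of the problem rather than a fix.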

Setting aside the issues mentioned above along with a host of security issues, poor support for mobile users, compression, and compatibility, it's frustrating to say that in my world PDF is still the source. I still look for information on company intranets in PDF format first, because it tends to be more polished and not vague or in-progress junk files. It is still my destination format even though (as the article points out) it's far from ideal.

Some suggested freeware workarounds for problems the article details:

Re: Problems with PDF

Posted: Sun Sep 27, 2020 2:57 pm
by vevy
Did you try Xpdf tools?

Code:

REM Table mode; the output file name is optional
pdftotext.exe -table input.pdf output.txt
REM Layout mode; with no output name given, the text goes to input.txt
pdftotext.exe -layout input.pdf
or

Code:

pdftohtml.exe input.pdf OutputDir
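
For a whole folder of PDFs, the same pdftotext call can be wrapped in a short script. This is a rough Python sketch, assuming the Xpdf tools are installed and `pdftotext` is on the PATH. The function names `build_command` and `convert_folder` are made up for illustration, and only the `-layout` and `-table` modes shown above are handled.

```python
from pathlib import Path
import subprocess

# Assumption: the Xpdf command-line tools are installed and "pdftotext" is on PATH.
PDFTOTEXT = "pdftotext"


def build_command(pdf, out=None, mode="layout"):
    """Build the pdftotext argument list; mode is "layout" or "table"."""
    if mode not in ("layout", "table"):
        raise ValueError(f"unsupported mode: {mode}")
    cmd = [PDFTOTEXT, f"-{mode}", str(pdf)]
    if out is not None:
        # The output name is optional; without it pdftotext derives it from the PDF name.
        cmd.append(str(out))
    return cmd


def convert_folder(folder, mode="layout"):
    """Run pdftotext on every .pdf in folder and return the PDFs that failed."""
    failed = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        result = subprocess.run(build_command(pdf, mode=mode))
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Calling `convert_folder("C:/reports", mode="table")` (the path is only an example) would leave a .txt file next to each PDF and return a list of the files that did not convert, which is handy for spotting the complex ones.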

Re: Problems with PDF

Posted: Sun Sep 27, 2020 10:44 pm
by webfork
vevy wrote:
Sun Sep 27, 2020 2:57 pm
Did you try Xpdf tools?
They've been discussed in part on the site and yes, I was very impressed.

---

Some related threads

How to pull text out of a PDF and have it wrap correctly
viewtopic.php?p=91372#p91372

Combine batch of images into a PDF
viewtopic.php?p=97239#p97239

Re: Problems with PDF

Posted: Mon Sep 28, 2020 1:39 pm
by vevy
webfork wrote:
Sun Sep 27, 2020 10:44 pm
vevy wrote:
Sun Sep 27, 2020 2:57 pm
Did you try Xpdf tools?
They've been discussed in part on the site and yes, I was very impressed.
These are different tools from the one in the link, and they are still maintained. ;) YMMV with complex PDFs, though.
Although, as lintalist noted here (from your second link above), they are related.