Problems with PDF

webfork
Posts: 9523
Joined: Wed Apr 11, 2007 8:06 pm
Location: US, Texas

Problems with PDF

#1 Post by webfork » Sun Sep 27, 2020 9:36 am

While I've become very good at dealing with difficult PDF files over the years, this article offers a very good breakdown of why the original design created so many problems:

What's so hard about PDF text extraction? https://filingdb.com/b/pdf-text-extraction (from: https://news.ycombinator.com/item?id=22473263)

In addition to notes about invisible text, added spaces for text separation, and off-page text (because, you know, why not), the author points out that the format was always intended as an OUTPUT format, never an input.
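To make the invisible-text and off-page-text problems concrete, here is a minimal Python sketch. It assumes an extractor has already produced text fragments with a position and a visibility flag; the `Fragment` class, its fields, and `keep_on_page` are hypothetical names for illustration and do not come from the article or from any particular PDF library.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical extracted text run: its content, position, and visibility."""
    text: str
    x: float
    y: float
    visible: bool = True


def keep_on_page(fragments, page_width, page_height):
    """Drop fragments that are invisible or positioned outside the page box."""
    return [
        f for f in fragments
        if f.visible and 0 <= f.x <= page_width and 0 <= f.y <= page_height
    ]


# Illustrative data only: one normal run, one invisible run, one off-page run.
fragments = [
    Fragment("Quarterly report", 72, 700),
    Fragment("hidden layer", 72, 700, visible=False),
    Fragment("stray text", -500, 900),
]
kept = keep_on_page(fragments, page_width=612, page_height=792)
print([f.text for f in kept])
```

Whether invisible text should really be dropped depends on the file: scanned PDFs with an OCR layer store the real text invisibly, so a filter like this would need to be optional.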

Setting aside the issues mentioned above along with a host of security issues, poor support for mobile users, compression, and compatibility, it's frustrating to say that in my world PDF is still the source. I still look for information on company intranets in PDF format first, because it tends to be more polished and not vague or in-progress junk files. It is still my destination format even though (as the article points out) it's far from ideal.

Some suggested freeware workarounds for problems the article details:

vevy
Posts: 678
Joined: Tue Sep 10, 2019 11:17 am

Re: Problems with PDF

#2 Post by vevy » Sun Sep 27, 2020 2:57 pm

Did you try Xpdf tools?

Code:

REM The output file name is optional.
pdftotext.exe -table input.pdf output.txt
pdftotext.exe -layout input.pdf
or

Code:

pdftohtml.exe input.pdf OutputDir
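
If you need to run this over a whole folder, a small Python wrapper might look like the sketch below. It only uses the `-layout` and `-table` options shown above; the function names are made up for illustration, and it assumes the Xpdf `pdftotext` executable is on your PATH.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf, mode="layout", exe="pdftotext"):
    """Build the argument list for one pdftotext call.

    `mode` is limited to the two options shown in the post above.
    `exe` is assumed to be reachable on PATH (an assumption, adjust as needed).
    """
    if mode not in ("layout", "table"):
        raise ValueError(f"unsupported mode: {mode}")
    pdf = Path(pdf)
    # Write the text file next to the PDF, with the same base name.
    return [exe, f"-{mode}", str(pdf), str(pdf.with_suffix(".txt"))]


def convert_folder(folder, mode="layout", exe="pdftotext"):
    """Run pdftotext on every .pdf file directly inside `folder`."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # check=True raises CalledProcessError if pdftotext reports a failure.
        subprocess.run(pdftotext_command(pdf, mode, exe), check=True)
```

Calling `convert_folder("reports", mode="table")` would then convert each PDF in a (hypothetical) `reports` folder; results on complex layouts will still vary.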
"Is there a Windows-included tool for this task?"
"I only want open-source tools"
"I want a tool that is still actively developed"
"So many to choose from!"
and many more!
Support easy-to-do filters and badges!

Re: Problems with PDF

#3 Post by webfork » Sun Sep 27, 2020 10:44 pm

vevy wrote (Sun Sep 27, 2020 2:57 pm): "Did you try Xpdf tools?"

They've been discussed in part on the site and yes, I was very impressed.

---

Some related threads

How to pull text out of a PDF and have it wrap correctly
viewtopic.php?p=91372#p91372

Combine batch of images into a PDF
viewtopic.php?p=97239#p97239

Re: Problems with PDF

#4 Post by vevy » Mon Sep 28, 2020 1:39 pm

webfork wrote (Sun Sep 27, 2020 10:44 pm):
vevy wrote (Sun Sep 27, 2020 2:57 pm): "Did you try Xpdf tools?"
"They've been discussed in part on the site and yes, I was very impressed."

These are different tools from the one in the link, and they are still maintained. :wink: YMMV with complex PDFs, though.
Although, as lintalist noted here (from your second link above), they are related.