What's so hard about PDF text extraction? https://filingdb.com/b/pdf-text-extraction (from: https://news.ycombinator.com/item?id=22473263)
In addition to notes about invisible text, added spaces for text separation, and off-page text (because, you know, why not), the author points out that the format was always intended as an OUTPUT format, never an input.
Setting aside the issues mentioned above, along with a host of security issues, poor support for mobile users, compression, and compatibility, it's frustrating to admit that in my world PDF is still the source. I still look for information on company intranets in PDF format first, because it tends to be more polished than the vague or in-progress junk files around it. It is still my destination format, even though (as the article points out) it's far from ideal.
Some suggested freeware workarounds for problems the article details:
- ByteScout PDF Multitool has a table-detection function, which is at least one workaround for a common PDF issue
- For columns, there's a trick using Excel that could very easily be adapted to PDF:
https://www.bluereefadvisor.com/how-to- ... nto-excel/
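
For anyone who would rather script the column problem than go through a spreadsheet, here is a rough Python sketch of the general idea. To be clear, this is not the trick from the link above and not anything the article prescribes. It assumes you already have each word's position on the page from some extractor; the Word tuple, split_columns, column_text, and the min_gutter threshold are all names I made up for illustration. Words are sorted by left edge, and a new column starts whenever the horizontal gap to everything seen so far is wider than the gutter threshold.

```python
from typing import Dict, List, NamedTuple


class Word(NamedTuple):
    """Hypothetical shape for one extracted word (not from any specific library)."""
    x0: float   # left edge
    x1: float   # right edge
    y: float    # vertical position of the line, increasing down the page
    text: str


def split_columns(words: List[Word], min_gutter: float = 20.0) -> List[List[Word]]:
    """Group words into columns, left to right.

    Assumption: columns are separated by a vertical strip of empty space
    wider than min_gutter (same units as the coordinates).
    """
    if not words:
        return []
    ordered = sorted(words, key=lambda w: w.x0)
    columns: List[List[Word]] = [[ordered[0]]]
    right_edge = ordered[0].x1  # furthest right edge seen so far
    for word in ordered[1:]:
        if word.x0 - right_edge > min_gutter:
            columns.append([word])      # wide gap: start a new column
        else:
            columns[-1].append(word)    # still inside the current column
        right_edge = max(right_edge, word.x1)
    return columns


def column_text(column: List[Word]) -> str:
    """Rebuild reading order inside one column: top to bottom, then left to right."""
    lines: Dict[float, List[str]] = {}
    for word in sorted(column, key=lambda w: (w.y, w.x0)):
        # Assumes words on the same line share exactly the same y value.
        lines.setdefault(word.y, []).append(word.text)
    return "\n".join(" ".join(parts) for _, parts in sorted(lines.items()))


if __name__ == "__main__":
    page = [
        Word(50, 80, 100, "Left"), Word(85, 120, 100, "one"),
        Word(300, 340, 100, "Right"), Word(345, 370, 100, "one"),
        Word(50, 90, 115, "left"), Word(95, 120, 115, "two"),
        Word(300, 340, 115, "right"), Word(345, 372, 115, "two"),
    ]
    for col in split_columns(page):
        print(column_text(col))
        print("---")
```

Real pages are messier than this: y values on one line rarely match exactly, and headers that span both columns will bridge the gutter, so treat the threshold and the exact-y grouping as placeholders to tune.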