How to pull text out of a PDF and have it wrap correctly

webfork
Posts: 7921
Joined: Wed Apr 11, 2007 8:06 pm
Location: US, Texas

#1 Post by webfork » Sat Aug 04, 2018 7:44 am

This process uses the CuteMarkEd program, which is not currently here on the site but is probably portable as it's hosted by PortableApps.

In the screen below, I've already copied some legal text from a random PDF file online, and I've set up CuteMarkEd above and LibreOffice (with paragraph markings) below:

[Image: screenshot with CuteMarkEd above and LibreOffice below]

How it works:

CuteMarkEd doesn't treat a standard end-of-line as a line break unless there are two in a row. This allows the program to quickly and easily remove bad line wraps but preserve paragraph breaks. Unfortunately, you have to select ALL the text, right-click, and then copy; CTRL+C doesn't work.
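
For anyone who would rather script it, the same rule (a single end-of-line is ignored, two in a row mark a paragraph) can be sketched in a few lines of Python. This is only an illustration of the idea, not what CuteMarkEd does internally; the function name `unwrap_paragraphs` is made up for this example.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines while keeping blank-line paragraph breaks."""
    # Normalize Windows and old Mac line endings to "\n" first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A paragraph break is two newlines, possibly with whitespace between them.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside a paragraph, collapse every run of whitespace
    # (including single newlines) into one space.
    joined = (" ".join(p.split()) for p in paragraphs)
    return "\n\n".join(p for p in joined if p)


if __name__ == "__main__":
    sample = (
        "This agreement is made\n"
        "between the parties\n"
        "named below.\n"
        "\n"
        "The second paragraph\n"
        "starts here."
    )
    print(unwrap_paragraphs(sample))
```

Note that this does not rejoin words that were hyphenated across a line break, and it will also flatten anything that relies on single line breaks, such as lists or addresses.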

I've tried several different systems, which I can detail here if anyone is curious, but this one is the fastest.

---

EDIT: I realized shortly after posting that ghostwriter has similar functionality and probably works better. See smaragdus' review of ghostwriter.
Supporting Net Neutrality - BattleForTheNet | Why this matters | More from EFF.org

juverax
Posts: 5
Joined: Mon Jun 11, 2018 5:19 am

Re: How to pull text out of a PDF and have it wrap correctly

#2 Post by juverax » Sat Aug 04, 2018 9:06 am

Sorry if I missed the point... but if I had to pull some text out of a PDF, I would open that PDF with Sumatra, select with the mouse (left button) the part of the text I want to paste in a text editor (e.g. MS Notepad), and done!
In Notepad, the text is wrapped exactly the way it is in the original pdf.

lintalist
Posts: 212
Joined: Sat Apr 19, 2014 12:52 am

Re: How to pull text out of a PDF and have it wrap correctly

#3 Post by lintalist » Wed Aug 08, 2018 11:21 am

If I need to grab the text of the entire PDF or specific pages, I use the command-line program pdftotext from the xpdftools toolset ( https://www.xpdfreader.com/download.html ), but it preserves the layout of the text, e.g. a page in two columns of text -> text in two columns of text.
Xpdf is an open source viewer for Portable Document Format (PDF) files. (These are also sometimes called 'Acrobat' files, from the name of Adobe's PDF software.) The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities. The Xpdf viewer uses the Qt cross-platform GUI toolkit. The other command line utilities do not require Qt.
Included in the toolset:
  • xpdf -- viewer
  • pdftops -- generate a PostScript file
  • pdftotext -- generate a plain text file
  • pdftohtml -- converts a PDF file to HTML
  • pdfinfo -- dumps a PDF file's Info dictionary
  • pdffonts -- lists the fonts used in a PDF file along with various information for each font
  • pdfdetach -- lists or extracts embedded files (attachments) from a PDF file
  • pdftoppm -- converts a PDF file to a series of PPM/PGM/PBM-format bitmaps
  • pdftopng -- converts a PDF file to a series of PNG image files
  • pdfimages -- extracts the images from a PDF file
Re pdftotext: there are various options, but the ones you will want to experiment with are:
-layout
Maintain (as best as possible) the original physical layout of the text. The default is to 'undo' physical layout (columns, hyphenation, etc.) and output the text in reading order. If the -fixed option is given, character spacing within each line will be determined by the specified character pitch.

-simple
Similar to -layout, but optimized for simple one-column pages. This mode will do a better job of maintaining horizontal spacing, but it will only work properly with a single column of text.

-table
Table mode is similar to physical layout mode, but optimized for tabular data, with the goal of keeping rows and columns aligned (at the expense of inserting extra whitespace). If the -fixed option is given, character spacing within each line will be determined by the specified character pitch.
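
To try those options from a script, a minimal Python sketch might look like the one below. It assumes pdftotext is on your PATH, and the helper names and file names (`build_pdftotext_command`, `extract_text`, `input.pdf`, `output.txt`) are invented for this example. The `-f` and `-l` options select the first and last page to convert. `-simple` and `-table` come from the Xpdf documentation quoted above and may not exist in other builds of pdftotext.

```python
import subprocess
from typing import List, Optional

# Layout options described in the quoted Xpdf documentation.
LAYOUT_MODES = {"layout", "simple", "table"}


def build_pdftotext_command(
    pdf_path: str,
    txt_path: str,
    mode: Optional[str] = None,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
) -> List[str]:
    """Build the argument list for a pdftotext call (hypothetical helper)."""
    cmd = ["pdftotext"]
    if mode is not None:
        if mode not in LAYOUT_MODES:
            raise ValueError(f"unknown layout mode: {mode!r}")
        cmd.append(f"-{mode}")
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]  # last page to convert
    cmd += [pdf_path, txt_path]
    return cmd


def extract_text(pdf_path: str, txt_path: str, **options) -> None:
    """Run pdftotext; raises CalledProcessError if the tool reports failure."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, **options), check=True)


if __name__ == "__main__":
    # Only prints the command; call extract_text(...) to actually run it.
    print(build_pdftotext_command("input.pdf", "output.txt", mode="layout", first_page=2, last_page=3))
```

Leaving `mode` unset gives the default behaviour described above (reading order, with physical layout undone), which is probably the better starting point if the goal is text that re-wraps cleanly.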

Re: How to pull text out of a PDF and have it wrap correctly

#4 Post by webfork » Mon Aug 13, 2018 10:06 am

lintalist wrote:
Wed Aug 08, 2018 11:21 am
The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities.
Very interesting, thanks for that. Reminds me very much of pdf2htmlEX.
juverax wrote:
Sat Aug 04, 2018 9:06 am
Sorry if I missed the point .... but if I had to pull out some text from a pdf, I would open that pdf with Sumatra
True: some text is wrapped correctly in some PDFs, but I think you'll find most PDFs in even a cursory search on the web don't wrap properly. Including when using Sumatra.

Re: How to pull text out of a PDF and have it wrap correctly

#5 Post by lintalist » Mon Aug 13, 2018 10:19 am

webfork wrote:
Mon Aug 13, 2018 10:06 am
Very interesting, thanks for that. Reminds me very much of pdf2htmlEX.
Which uses poppler, which in turn uses code from the xpdf toolset - six degrees of separation :-)

Re: How to pull text out of a PDF and have it wrap correctly

#6 Post by webfork » Thu Aug 16, 2018 9:26 am

lintalist wrote:
Mon Aug 13, 2018 10:19 am
webfork wrote:
Mon Aug 13, 2018 10:06 am
Very interesting, thanks for that. Reminds me very much of pdf2htmlEX.
Which uses poppler, which in turn uses code from the xpdf toolset - six degrees of separation :-)
I should have guessed :)