
How to pull text out of a PDF and have it wrap correctly

Posted: Sat Aug 04, 2018 7:44 am
by webfork
This process uses the CuteMarkEd program, which is not currently here on the site but is probably portable as it's hosted by PortableApps.

In the screen below, I've already copied some legal text from a random PDF file online, and I've set up CuteMarkEd above and LibreOffice (with paragraph markings) below:

Image

How it works:

CuteMarkEd doesn't recognize standard end-of-line values unless there are two in a row. This allows the program to quickly and easily break down bad line wraps but preserve paragraph breaks. Unfortunately, you have to select ALL the text, right-click, and then copy. CTRL+C doesn't work.
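
For anyone who would rather script this step, here is a minimal Python sketch of the same idea: single line breaks are treated as plain spaces, while two or more in a row are kept as a paragraph break. This is not CuteMarkEd's code, only an imitation of the behavior described above, and the function name `unwrap` is made up for illustration. It does not try to rejoin words hyphenated across lines.

```python
import re


def unwrap(text: str) -> str:
    """Join hard-wrapped lines but keep blank-line paragraph breaks."""
    # Normalise Windows (\r\n) and old Mac (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A paragraph break is two line breaks in a row, possibly with
    # stray whitespace on the otherwise empty line between them.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside a paragraph, collapse every run of whitespace
    # (including single line breaks) into one space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


sample = (
    "This agreement is made\n"
    "between the parties\n"
    "named below.\n"
    "\n"
    "Second paragraph\n"
    "starts here."
)
print(unwrap(sample))
```

Paste the copied PDF text in place of `sample` and the output should wrap cleanly in any editor with word wrap turned on.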

I've tried several different systems, which I can detail here if anyone is curious, but this one is the fastest.

---

EDIT: I realized shortly after posting that ghostwriter has similar functionality and probably works better. See smaragdus' review of ghostwriter.

Re: How to pull text out of a PDF and have it wrap correctly

Posted: Sat Aug 04, 2018 9:06 am
by juverax
Sorry if I missed the point .... but if I had to pull out some text from a pdf, I would open that pdf with Sumatra, select with the mouse (left button) the part of the text I want to paste in a text editor (e.g. MS-Notepad), and done ... !
In Notepad, the text is wrapped exactly the way it is in the original pdf.

Re: How to pull text out of a PDF and have it wrap correctly

Posted: Wed Aug 08, 2018 11:21 am
by lintalist
If I need to grab the text of an entire PDF or of specific pages, I use the command-line program pdftotext from the xpdftools toolset ( https://www.xpdfreader.com/download.html ), but it preserves the layout of the text, e.g. a page with two columns of text comes out as two columns of text.
Xpdf is an open source viewer for Portable Document Format (PDF) files. (These are also sometimes called 'Acrobat' files, from the name of Adobe's PDF software.) The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities. The Xpdf viewer uses the Qt cross-platform GUI toolkit. The other command line utilities do not require Qt.
Included in the toolset:
  • xpdf -- viewer
  • pdftops -- generate a PostScript file
  • pdftotext -- generate a plain text file
  • pdftohtml -- converts a PDF file to HTML
  • pdfinfo -- dumps a PDF file's Info dictionary
  • pdffonts -- lists the fonts used in a PDF file along with various information for each font
  • pdfdetach -- lists or extracts embedded files (attachments) from a PDF file
  • pdftoppm -- converts a PDF file to a series of PPM/PGM/PBM-format bitmaps
  • pdftopng -- converts a PDF file to a series of PNG image files
  • pdfimages -- extracts the images from a PDF file
Re pdftotext: there are various options, but the ones you will want to experiment with are:
-layout
Maintain (as best as possible) the original physical layout of the text. The default is to 'undo' physical layout (columns, hyphenation, etc.) and output the text in reading order. If the -fixed option is given, character spacing within each line will be determined by the specified character pitch.

-simple
Similar to -layout, but optimized for simple one-column pages. This mode will do a better job of maintaining horizontal spacing, but it will only work properly with a single column of text.

-table
Table mode is similar to physical layout mode, but optimized for tabular data, with the goal of keeping rows and columns aligned (at the expense of inserting extra whitespace). If the -fixed option is given, character spacing within each line will be determined by the specified character pitch.
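
To try the three modes side by side from a script, something like the following Python sketch could work. It assumes `pdftotext` is on your PATH. The helper names `build_command` and `extract_text` and the file name `report.pdf` are made up for illustration. Besides the options quoted above, it uses `-f`/`-l` for the first and last page, `-enc UTF-8` for the output encoding, and `-` as the output file to write to stdout. Check `pdftotext -h` on your build, since Xpdf and poppler versions differ slightly in which options they accept.

```python
import subprocess

# The three layout options quoted above; None means the default
# (reading-order) mode.
MODES = {"layout", "simple", "table"}


def build_command(pdf_path, mode=None, first_page=None, last_page=None):
    """Build the pdftotext argument list without running anything."""
    cmd = ["pdftotext"]
    if mode is not None:
        if mode not in MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        cmd.append(f"-{mode}")
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    # Ask for UTF-8 output and send the text to stdout ("-").
    cmd += ["-enc", "UTF-8", pdf_path, "-"]
    return cmd


def extract_text(pdf_path, **options):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_command(pdf_path, **options),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")


# Show the command that would be run for pages 2-3 in -layout mode.
print(build_command("report.pdf", mode="layout", first_page=2, last_page=3))
```

Calling `extract_text("report.pdf", mode="simple")` would then return the text for that mode, which makes it easy to compare the results for a given PDF.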

Re: How to pull text out of a PDF and have it wrap correctly

Posted: Mon Aug 13, 2018 10:06 am
by webfork
lintalist wrote:
Wed Aug 08, 2018 11:21 am
The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities.
Very interesting, thanks for that. Reminds me very much of pdf2htmlEX.
juverax wrote:
Sat Aug 04, 2018 9:06 am
Sorry if I missed the point .... but if I had to pull out some text from a pdf, I would open that pdf with Sumatra
True: some text is wrapped correctly in some PDFs, but I think you'll find that most PDFs turned up by even a cursory web search don't wrap properly, including when using Sumatra.

Re: How to pull text out of a PDF and have it wrap correctly

Posted: Mon Aug 13, 2018 10:19 am
by lintalist
webfork wrote:
Mon Aug 13, 2018 10:06 am
Very interesting, thanks for that. Reminds me very much of pdf2htmlEX.
Which uses poppler, which in turn uses code from the xpdf toolset - six degrees of separation :-)

Re: How to pull text out of a PDF and have it wrap correctly

Posted: Thu Aug 16, 2018 9:26 am
by webfork
lintalist wrote:
Mon Aug 13, 2018 10:19 am
webfork wrote:
Mon Aug 13, 2018 10:06 am
Very interesting, thanks for that. Reminds me very much of pdf2htmlEX.
Which uses poppler, which in turn uses code from the xpdf toolset - six degrees of separation :-)
I should have guessed :)