How to pull text out of a PDF and have it wrap correctly

webfork
Posts: 10818
Joined: Wed Apr 11, 2007 8:06 pm
Location: US, Texas

#1 Post by webfork »

This process uses the CuteMarkEd program, which is not currently here on the site but is probably portable as it's hosted by PortableApps.

In the screen below, I've already copied some legal text from a random PDF file online, and I've set up CuteMarkEd above and LibreOffice (with paragraph markings) below:

Image

How it works:

CuteMarkEd doesn't recognize standard end-of-line values unless there are two in a row. This allows the program to quickly and easily break down bad line wraps but preserve paragraph breaks. Unfortunately, you have to select ALL the text, right-click, and then copy. CTRL+C doesn't work.
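
For anyone who would rather script the same idea, here is a minimal Python sketch of the rule described above: a single line break is treated as a bad wrap and joined, while two or more in a row are kept as a paragraph break. The function name `unwrap_paragraphs` is made up for illustration; this is not how CuteMarkEd itself is implemented.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines but keep blank-line paragraph breaks."""
    # Normalise Windows (CRLF) and old Mac (CR) line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A line break followed by optional whitespace and another line break
    # counts as a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside a paragraph, collapse every run of whitespace
    # (including single line breaks) into one space.
    joined = [" ".join(p.split()) for p in paragraphs]
    # Drop empty chunks and separate paragraphs with one blank line.
    return "\n\n".join(p for p in joined if p)


if __name__ == "__main__":
    sample = "The quick brown\nfox jumps over\r\nthe lazy dog.\n\nSecond paragraph\nstarts here.\n"
    print(unwrap_paragraphs(sample))
```

Note that this sketch does not rejoin words that were hyphenated across a line break.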

I've tried several different systems, which I can detail here if anyone is curious, but this one is the fastest.

---

EDIT: I realized shortly after posting that ghostwriter has similar functionality and probably works better. See smaragdus' review of ghostwriter.

juverax
Posts: 355
Joined: Mon Jun 11, 2018 5:19 am

Re: How to pull text out of a PDF and have it wrap correctly

#2 Post by juverax »

Sorry if I missed the point... but if I had to pull some text out of a PDF, I would open it in Sumatra, select the part I want with the mouse (left button), paste it into a text editor (e.g. MS Notepad), and done!
In Notepad, the text is wrapped exactly the way it is in the original PDF.

lintalist
Posts: 434
Joined: Sat Apr 19, 2014 12:52 am

Re: How to pull text out of a PDF and have it wrap correctly

#3 Post by lintalist »

If I need to grab the text of an entire PDF or of specific pages, I use the command-line program pdftotext from the Xpdf tools ( https://www.xpdfreader.com/download.html ), but it preserves the layout of the text, e.g. a page with two columns of text becomes text in two columns.
Xpdf is an open source viewer for Portable Document Format (PDF) files. (These are sometimes also called 'Acrobat' files, from the name of Adobe's PDF software.) The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities. The Xpdf viewer uses the Qt cross-platform GUI toolkit. The other command line utilities do not require Qt.
Included in the toolset:
  • xpdf -- viewer
  • pdftops -- generate a PostScript file
  • pdftotext -- generate a plain text file
  • pdftohtml -- converts a PDF file to HTML
  • pdfinfo -- dumps a PDF file's Info dictionary
  • pdffonts -- lists the fonts used in a PDF file along with various information for each font
  • pdfdetach -- lists or extracts embedded files (attachments) from a PDF file
  • pdftoppm -- converts a PDF file to a series of PPM/PGM/PBM-format bitmaps
  • pdftopng -- converts a PDF file to a series of PNG image files
  • pdfimages -- extracts the images from a PDF file
Re pdftotext: there are various options, but the ones you will want to experiment with are:
-layout
Maintain (as best as possible) the original physical layout of the text. The default is to 'undo' physical layout (columns, hyphenation, etc.) and output the text in reading order. If the -fixed option is given, character spacing within each line will be determined by the specified character pitch.

-simple
Similar to -layout, but optimized for simple one-column pages. This mode will do a better job of maintaining horizontal spacing, but it will only work properly with a single column of text.

-table
Table mode is similar to physical layout mode, but optimized for tabular data, with the goal of keeping rows and columns aligned (at the expense of inserting extra whitespace). If the -fixed option is given, character spacing within each line will be determined by the specified character pitch.
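
As a rough sketch of how these options could be driven from a script, the Python below builds a pdftotext command line and runs it. It assumes Xpdf's pdftotext is on the PATH; the helper names `build_pdftotext_command` and `extract_text` and the file names are invented for illustration. The `-f` and `-l` flags (first and last page) are not quoted above but are part of pdftotext's documented options.

```python
import subprocess
from typing import List, Optional


def build_pdftotext_command(
    pdf_path: str,
    txt_path: str,
    mode: Optional[str] = None,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
) -> List[str]:
    """Build an argument list for Xpdf's pdftotext (hypothetical helper)."""
    # Only the three modes quoted in this thread are accepted here.
    if mode not in (None, "layout", "simple", "table"):
        raise ValueError(f"unsupported mode: {mode!r}")
    cmd = ["pdftotext"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]  # last page to convert
    if mode is not None:
        cmd.append(f"-{mode}")  # e.g. "-layout"
    cmd += [pdf_path, txt_path]
    return cmd


def extract_text(pdf_path: str, txt_path: str, **options) -> None:
    """Run pdftotext; raises CalledProcessError if the tool reports failure."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, **options), check=True)


if __name__ == "__main__":
    # Placeholder file names; shows the command without running it.
    print(build_pdftotext_command("input.pdf", "output.txt", mode="layout", first_page=1, last_page=3))
```

Leaving `mode` unset gives the default reading-order output, which is usually the better starting point when the goal is text that re-wraps cleanly.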

Re: How to pull text out of a PDF and have it wrap correctly

#4 Post by webfork »

lintalist wrote: Wed Aug 08, 2018 11:21 am The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities.
Very interesting, thanks for that. Reminds me very much of pdf2htmlEX.
juverax wrote: Sat Aug 04, 2018 9:06 am Sorry if I missed the point .... but if I had to pull out some text from a pdf, I would open that pdf with Sumatra
True: some text is wrapped correctly in some PDFs, but I think you'll find that most PDFs turned up by even a cursory web search don't wrap properly, including when using Sumatra.

Re: How to pull text out of a PDF and have it wrap correctly

#5 Post by lintalist »

webfork wrote: Mon Aug 13, 2018 10:06 am Very interesting, thanks for that. Reminds me very much of pdf2htmlEX.
Which uses Poppler, which in turn uses code from the Xpdf toolset - six degrees of separation :-)

Re: How to pull text out of a PDF and have it wrap correctly

#6 Post by webfork »

lintalist wrote: Mon Aug 13, 2018 10:19 am
webfork wrote: Mon Aug 13, 2018 10:06 am Very interesting, thanks for that. Reminds me very much of pdf2htmlEX.
Which uses Poppler, which in turn uses code from the Xpdf toolset - six degrees of separation :-)
I should have guessed :)

Re: How to pull text out of a PDF and have it wrap correctly

#7 Post by juverax »

(@lintalist, Aug. 2018)

I just tried XpdfReader ( https://www.xpdfreader.com/ ) and Pdf Image Extractor ( https://www.softpedia.com/get/Office-to ... ctor.shtml ).
In both cases the extracted images have a very low resolution (74 and 96 dpi, which is the resolution of a screen capture). I don't have time to investigate these utilities in depth, but it could be that they simply perform a screen capture of each PDF page rather than extracting the actual images contained in the PDF file (not sure, though).
Then I used Pdf-Xchange Editor ( https://www.portablefreeware.com/index.php?id=2832 ) to extract ONE image from the same PDF file. (When I choose to extract ALL the images, the program closes itself, and I have no idea where the extracted images, if any, end up.) The extracted image has a resolution of 150 dpi. AhAh!

Cornflower
Posts: 244
Joined: Fri Aug 31, 2007 7:58 am
Location: Canada's capital

Re: How to pull text out of a PDF and have it wrap correctly

#8 Post by Cornflower »

@lintalist, @juverax:

I rarely have to convert PDF to images (as opposed to extracting existing images), but this works for me (at 300 dpi; no patience to go higher):
  • Use Windows 10 Microsoft Print to PDF
  • Use PDF-XChange Editor to export the "printed" PDF to images

On the main topic--pulling text out of a PDF and wrapping it:

I read multiple PDF articles every single day and very often extract text from them. While it isn't perfect, the following is my process, using Pdf-Xchange Editor ( https://www.portablefreeware.com/index.php?id=2832 ) and a quick-n-dirty AutoHotkey ( https://www.portablefreeware.com/index.php?id=217 ) script:

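Code:

!q::				; Alt+Q: paste clipboard with lines joined
send ^c				; send Ctrl+C to the active window
StringReplace, clipboard, clipboard, -`r`n,, All	; remove hyphen + line break (rejoins words split across lines)
StringReplace, clipboard, clipboard, `r`n, %a_space%, All	; replace each remaining line break with a space
send ^v				; paste the cleaned-up text
send {Enter}{Enter}		; personal preference: finish with two Enters
return
My process is to use Select Text in the editor, move to where I want to paste it (usually a MemPad document), and press Alt-Q. This removes line wraps and hyphenations and (personal preference) adds two carriage returns at the end.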
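
For text that is already in a file rather than on the clipboard, the same two replacements can be sketched in Python. This is only an illustration of the technique, with a made-up function name, and it assumes Windows-style CRLF line breaks as the script above does.

```python
def join_wrapped_lines(text: str) -> str:
    """Mirror the two replacements from the AutoHotkey script above."""
    # First drop a hyphen that is immediately followed by a line break,
    # so a word split across two lines is rejoined.
    text = text.replace("-\r\n", "")
    # Then turn every remaining line break into a single space.
    return text.replace("\r\n", " ")


if __name__ == "__main__":
    print(join_wrapped_lines("extrac-\r\ntion of text\r\nfrom a PDF"))
```

Like the script, this also removes hyphens that legitimately fall at the end of a line (e.g. in a compound word) and flattens paragraph breaks into spaces, so it works best on one paragraph at a time.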

Midas
Posts: 6705
Joined: Mon Dec 07, 2009 7:09 am
Location: Sol3

Re: How to pull text out of a PDF and have it wrap correctly

#9 Post by Midas »

webfork wrote: This process uses the CuteMarkEd program, which is not currently here on the site but is probably portable as it's hosted by PortableApps.

Quick note to link to a CuteMarkEd topic you created at viewtopic.php?t=24648.

Re: How to pull text out of a PDF and have it wrap correctly

#10 Post by webfork »

Midas wrote: Mon Feb 15, 2021 7:01 am Quick note to link to a CuteMarkEd topic you created at viewtopic.php?t=24648.
Thanks for that

Re: How to pull text out of a PDF and have it wrap correctly

#11 Post by webfork »

Old thread update:

Calibre keeps adding features year after year, and one option in its conversion tools is "heuristic processing", which handles line wrapping (among other conversion issues) very well for PDFs and other formats:

Image
