mutool is a stand-alone CLI with a LOT of options. Just about the easiest is the extract command (images and font resources).
I pulled out all the images (including both larger and smaller versions where the PDF had both) and non-text pages very quickly with:
Code:
mutool extract filespec.pdf
In the current directory it created (at a rate of some 5-6 pages per second) image-xxxx files as jpg, png, and pam (Portable Arbitrary Map, a Netpbm bitmap format that I have no idea how to use), plus the fonts.
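Since everything lands in the current directory, it can help to run the command from a scratch folder. Below is a rough sketch of that, assuming a POSIX shell with mutool on the PATH. The function names extract_into and count_by_ext are made up for illustration and are not part of mutool.

```shell
#!/bin/sh
# Sketch only: extract_into and count_by_ext are hypothetical helper names.

# Run "mutool extract" inside a dedicated directory so the extracted
# files do not clutter the folder that holds the PDF.
extract_into() {
    pdf=$1
    outdir=$2
    # Resolve the PDF to an absolute path before changing directory.
    pdf_abs=$(cd "$(dirname "$pdf")" && pwd)/$(basename "$pdf")
    mkdir -p "$outdir" || return 1
    # The subshell keeps the caller's working directory unchanged.
    (cd "$outdir" && mutool extract "$pdf_abs")
}

# Tally the files in a directory by extension (jpg, png, pam, ...).
count_by_ext() {
    for f in "$1"/*.*; do
        [ -f "$f" ] && printf '%s\n' "${f##*.}"
    done | sort | uniq -c
}
```

Something like `extract_into filespec.pdf extracted && count_by_ext extracted` would then leave the results in ./extracted and print how many files of each type came out.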
I tested a book that had been OCR'd with PDF X-Change Editor, and it still extracted the page images, just as it did before OCR. On a document whose text was already selectable without OCR, it picked out only images and fonts.
Definitely a process I will use again.