pdf2htmlEX Windows builds (CLI)

shnbwmn
Posts: 265
Joined: Sat Jul 11, 2015 12:59 am

#1 Post by shnbwmn »

Original project: https://github.com/coolwanglu/pdf2htmlEX
Github Pages: https://coolwanglu.github.io/pdf2htmlEX/
PDF/HTML Demos: https://github.com/coolwanglu/pdf2htmlEX#-pdf2htmlex

Windows builds: http://soft.rubypdf.com/software/pdf2ht ... ws-version (other PDF programs there too)

Quick background on pdf2htmlEX:
  • The authors wrote: pdf2htmlEX renders PDF files to HTML, utilizing modern Web technologies. It aims to provide an accurate rendering, while keeping optimized for Web display.

    pdf2htmlEX is best for text-based PDF files, for example scientific papers with complicated formulas and figures. Text, fonts and formats are natively preserved in HTML such that you can still search and copy. The generated HTML file is static, with optional features powered by JavaScript.

    Features:
    • Precise and native text in HTML
    • Flexible Output
    • Moderate Size
    • More PDF stuffs that you love: links, outlines & printing
Last edited by shnbwmn on Wed Jul 13, 2016 5:03 pm, edited 3 times in total.

webfork
Posts: 10821
Joined: Wed Apr 11, 2007 8:06 pm
Location: US, Texas

#2 Post by webfork »

I did a bunch of testing on this and the short version is that I'm really impressed. More to come ...

Midas
Posts: 6724
Joined: Mon Dec 07, 2009 7:09 am
Location: Sol3

#3 Post by Midas »

:o Eagerly waiting...

webfork
Posts: 10821
Joined: Wed Apr 11, 2007 8:06 pm
Location: US, Texas

#4 Post by webfork »

This is a very impressive resource. Here is a side by side view of the PDF vs. HTML output, borrowing from an Intel PDF that came with my machine (before is left, after is right): http://i.imgur.com/ZWC2VGa.png

It maintains the index, hyperlinks, formatting, images, and margins. Wow.

This program might blur the lines between slides, printouts, and web pages, and even improve on PDF typography, which isn't always great. It also might make some companies that are drunk on PDF (Oracle) let go a little bit. While LibreOffice has a pretty solid PDF viewer and HTML export, pdf2htmlEX left it in the dust, at least on this particular operation.

--

Note that the HTML file is about 10x larger than the PDF, for 3 reasons (possibly 4):
  1. PDF has integrated compression. Most web servers compress files in transit to save on download time, but the HTML file itself is stored uncompressed.
  2. Some of the images aren't saved as their actual size, as in the case below where you can see the Intel icon is saved with a lot of whitespace around it: http://i.imgur.com/hmdl2YV.png
  3. It looks like images are being embedded in the HTML as text (similar to how RTF files store images). Essentially if you pull it up in a text editor, it looks like this: http://i.imgur.com/rwxE5F0.png

    This can be changed with command-line switches. I used:

    Code:

       pdf2htmlex.exe --embed-image 0 --dest-dir output filename.pdf 
  4. The images used in PDFs are usually JPEG-compressed, but the output here is PNG files. That might also be an issue.
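
For anyone scripting this, here is a minimal Python sketch that wraps the command from point 3. It only uses the two switches quoted in the post (`--embed-image 0` and `--dest-dir`); the function names, the default `output` folder, and the `pdf2htmlex.exe` executable name on the PATH are assumptions for illustration, not part of pdf2htmlEX itself.

```python
import subprocess
from pathlib import Path


def build_pdf2htmlex_command(pdf_path, dest_dir="output", exe="pdf2htmlex.exe"):
    """Return the argument list for the command quoted in the post."""
    return [
        exe,                       # assumed to be on PATH or given as a full path
        "--embed-image", "0",      # write images as separate files, not inline text
        "--dest-dir", str(dest_dir),
        str(pdf_path),
    ]


def convert(pdf_path, dest_dir="output", exe="pdf2htmlex.exe"):
    """Run the conversion; raises CalledProcessError on a non-zero exit code."""
    # Creating the folder up front is harmless if it already exists.
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_pdf2htmlex_command(pdf_path, dest_dir, exe), check=True)


# Show the command without running it.
print(build_pdf2htmlex_command("filename.pdf"))
```

Passing the arguments as a list rather than one string avoids quoting problems with file names that contain spaces.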

VT analysis of pdf2htmlEX.exe (clean; 0/57)
https://www.virustotal.com/en/file/7A1C ... /analysis/

---

The author also has some impressive programs on his website. Two that caught my eye:
  • A very nice PDF Cropper page that auto-cuts margins. I have little doubt that this could save an amazing amount of data if used on just a small percentage of the scanned documents out there.
  • A PDFTK version that supports Chinese, Japanese, and Korean paths (which I didn't realize was an issue).

shnbwmn
Posts: 265
Joined: Sat Jul 11, 2015 12:59 am

#5 Post by shnbwmn »

Thanks for your detailed notes, webfork. And also for mentioning LibreOffice. Looking a bit deeper into that, I was surprised to learn that LO has some rudimentary CLI support. Relevant here is --convert-to:

Code:

--convert-to output_file_extension[:output_filter_name] [--outdir output_dir] files

Batch convert files:

If --outdir is not specified, then current working directory is used as output_dir.

Eg.
 --convert-to pdf *.doc
 --convert-to pdf:writer_pdf_Export --outdir /home/user *.doc 
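
A rough Python sketch of how the `--convert-to` usage above could be driven from a script. The `--convert-to` and `--outdir` switches come from the help text quoted above; the `soffice` executable name and the `--headless` switch (to suppress the GUI) are assumptions about a typical LibreOffice install, and the function names are made up for this example. Wildcards are expanded in Python so the script does not rely on the shell doing it.

```python
import glob
import subprocess


def build_convert_command(patterns, target="pdf", outdir=None, exe="soffice"):
    """Build a LibreOffice --convert-to command for files matching the patterns."""
    files = []
    for pattern in patterns:
        # Expand wildcards here instead of relying on the shell.
        files.extend(sorted(glob.glob(pattern)))
    cmd = [exe, "--headless", "--convert-to", target]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    return cmd + files


def run_conversion(patterns, target="pdf", outdir=None, exe="soffice"):
    """Run the batch conversion; does nothing if no input files matched."""
    cmd = build_convert_command(patterns, target, outdir, exe)
    if cmd[-1] in (target, outdir):
        return None  # no files were appended, so there is nothing to convert
    return subprocess.run(cmd, check=True)


# Mirrors the second example from the help text (prints the command only).
print(build_convert_command(["*.doc"], "pdf:writer_pdf_Export", "/home/user"))
```

If `--outdir` is left out, the help text above says the current working directory is used.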

shnbwmn
Posts: 265
Joined: Sat Jul 11, 2015 12:59 am

#6 Post by shnbwmn »

Recent use case ...

I had a (simple) PDF book that I wanted to convert to HTML. The content was mostly typography, with a smattering of diagrams. Using Acrobat Pro DC, the result was readable and the typography correct, however the format was completely disjointed. All graphics/diagrams were aligned flush-left, and none of the margins/spacing of the PDF were preserved. Quite disappointing Adobe ...

So I gave pdf2htmlEX a try - the resulting HTML was formatted correctly, exactly as in the PDF. The only downside was the size of the output ...

Code:

PDF:   ~15 MB
HTML: ~114 MB

Also, the pdf2htmlEX output was a single HTML file, while the Acrobat output stored the graphics in an accompanying folder.
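
To put those numbers in perspective, a small Python sketch of the arithmetic. It assumes the graphics inside the single HTML file are stored as base64 text, which this thread does not confirm; the function names are made up for the example. Base64 turns every 3 bytes into 4 characters, so on its own it would add roughly a third to the image data, far less than the growth reported here.

```python
import base64


def size_ratio(html_mb, pdf_mb):
    """How many times larger the HTML output is than the source PDF."""
    return html_mb / pdf_mb


def base64_overhead(raw_bytes):
    """Ratio of base64-encoded length to raw length for a byte string."""
    return len(base64.b64encode(raw_bytes)) / len(raw_bytes)


# Sizes reported in the post above.
print(round(size_ratio(114, 15), 1))            # 7.6

# 3000 zero bytes stand in for an arbitrary image payload.
print(round(base64_overhead(bytes(3000)), 2))   # 1.33
```

Under that assumption, most of the growth would have to come from other factors, such as the uncompressed output or re-encoded images mentioned earlier in the thread.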

TP109
Posts: 571
Joined: Sat Apr 08, 2006 7:12 pm
Location: Midwestern US

#7 Post by TP109 »

Anybody get the pdfcropper tool to work? If so, what version of GS was used and what is the command-line? Instructions are vague. Tried GS 8.54 and 9.19 so far but no luck.

webfork
Posts: 10821
Joined: Wed Apr 11, 2007 8:06 pm
Location: US, Texas

#8 Post by webfork »

shnbwmn wrote: Looking a bit deeper into that I was surprised to learn that LO has some rudimentary CLI support.
I wanted to test this out before replying, but it's taking too long, so suffice it to say that's great news. Thanks for that.

TP109 wrote: Anybody get the pdfcropper tool to work? If so, what version of GS was used and what is the command-line? Instructions are vague. Tried GS 8.54 and 9.19 so far but no luck.
Another one I'm hoping to test sometime soon. Hopefully I'll have some feedback in the weeks ahead.

shnbwmn wrote: Using Acrobat Pro DC, the result was readable and the typography correct, however the format was completely disjointed. All graphics/diagrams were aligned flush-left, and none of the margins/spacing of the PDF were preserved. Quite disappointing Adobe ...
I haven't had good experiences with Adobe's converters. What's strange is how far behind several free options they are. HTML isn't exactly a new format.

webfork
Posts: 10821
Joined: Wed Apr 11, 2007 8:06 pm
Location: US, Texas

#9 Post by webfork »

This is turning into a bit of a thread hijack but hopefully not ...
shnbwmn wrote:
I couldn't find anything about batch adding or removing of passwords for LibreOffice files similar to ExcelPass. Any guess about how that might be done?
