pdf2htmlEX Windows builds (CLI)

#1 Post by **shnbwmn** » Tue Jul 05, 2016 1:01 am

Original project: https://github.com/coolwanglu/pdf2htmlEX
Github Pages: https://coolwanglu.github.io/pdf2htmlEX/
PDF/HTML Demos: https://github.com/coolwanglu/pdf2htmlEX#-pdf2htmlex

Windows builds: http://soft.rubypdf.com/software/pdf2ht ... ws-version (other PDF programs there too)

Quick background on pdf2htmlEX:

The authors wrote: pdf2htmlEX renders PDF files to HTML, utilizing modern Web technologies. It aims to provide an accurate rendering, while keeping optimized for Web display.

pdf2htmlEX is best for text-based PDF files, for example scientific papers with complicated formulas and figures. Text, fonts and formats are natively preserved in HTML such that you can still search and copy. The generated HTML file is static, with optional features powered by JavaScript.

Features:
- Precise and native text in HTML
- Flexible Output
- Moderate Size
- More PDF stuffs that you love: links, outlines & printing

#2 Post by **webfork** » Tue Jul 05, 2016 2:13 pm

I did a bunch of testing on this and the short version is that I'm really impressed. More to come ...

#3 Post by **Midas** » Wed Jul 06, 2016 10:04 am

Eagerly waiting...

#4 Post by **webfork** » Tue Jul 12, 2016 4:01 pm

This is a very impressive resource. Here is a side by side view of the PDF vs. HTML output, borrowing from an Intel PDF that came with my machine (before is left, after is right): http://i.imgur.com/ZWC2VGa.png

It maintains the index, hyperlinks, formatting, images, and margins. Wow.

This program might blur the lines between slides, printouts, and web pages, and could even improve on PDF typography, which isn't always great. It might also make some companies that are drunk on PDF (Oracle) let go a little bit. While LibreOffice has a pretty solid PDF viewer and HTML export, pdf2htmlEX left it in the dust, at least on this particular operation.

--

Note that the HTML file is about 10x larger than the PDF, for 3 reasons (possibly 4):

- PDF has integrated compression; the HTML file does not. Most web servers compress on the fly to save download time, but the file itself is stored uncompressed.
- Some of the images aren't saved at their actual size, as in the case below, where the Intel icon is saved with a lot of whitespace around it: http://i.imgur.com/hmdl2YV.png
- It looks like images are being embedded in the HTML as text (similar to how RTF files store images). If you pull it up in a text editor, it looks like this: http://i.imgur.com/rwxE5F0.png
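To put a number on the "images as text" point: assuming the embedded images are Base64 data URIs (this is what the screenshot appears to show, but it is an assumption, not something confirmed in the thread), every 3 bytes of image data become 4 characters of text. A minimal Python sketch of that overhead, with hypothetical helper names:

```python
import base64


def base64_overhead(raw: bytes) -> float:
    """Return encoded size divided by raw size for a Base64-encoded payload."""
    return len(base64.b64encode(raw)) / len(raw)


def as_data_uri(raw: bytes, mime: str = "image/png") -> str:
    """Build the kind of inline data URI assumed to be used for embedded images."""
    return f"data:{mime};base64,{base64.b64encode(raw).decode('ascii')}"


# A 30,000-byte image grows by one third when inlined as Base64 text.
sample = bytes(30_000)
ratio = base64_overhead(sample)
```

So inlining alone would explain roughly a 1.33x growth for the image data, which is only a small part of the 10x figure.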

Image embedding can be changed with command-line switches. I used:
```
pdf2htmlex.exe --embed-image 0 --dest-dir output filename.pdf
```
The images used in PDFs are usually JPEG compressed, but the output here is PNG files. That might also contribute to the size.
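For a whole folder of PDFs, the same command can be wrapped in a small script. This is only a sketch: it uses the two switches shown above and nothing else, and it assumes `pdf2htmlex.exe` is on the PATH. The function names are made up for illustration.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def build_command(pdf: Path, dest_dir: Path, exe: str = "pdf2htmlex.exe") -> list[str]:
    """Return the argument list for one conversion with image embedding off."""
    return [exe, "--embed-image", "0", "--dest-dir", str(dest_dir), str(pdf)]


def convert_all(folder: Path, dest_dir: Path, exe: str = "pdf2htmlex.exe") -> int:
    """Convert every *.pdf in folder; return how many files were processed."""
    if shutil.which(exe) is None:
        # Assumption: the executable must be reachable via PATH.
        print(f"{exe} not found on PATH, nothing converted")
        return 0
    count = 0
    for pdf in sorted(folder.glob("*.pdf")):
        # check=True raises CalledProcessError if a conversion fails.
        subprocess.run(build_command(pdf, dest_dir, exe), check=True)
        count += 1
    return count
```

Calling `convert_all(Path("."), Path("output"))` would run the command above once for each PDF in the current directory.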

VT analysis of pdf2htmlEX.exe (clean; 0/57)
https://www.virustotal.com/en/file/7A1C ... /analysis/

---

Also from the author: there are some impressive programs on his website. Two that caught my eye:

- A very nice PDF Cropper page that auto-cuts margins. I have little doubt that this could save an amazing amount of data if used on just a small percentage of the scanned documents out there.
- A PDFTK version that supports Chinese, Japanese, and Korean paths (which I didn't realize was an issue).

#5 Post by **shnbwmn** » Wed Jul 13, 2016 1:48 pm

Thanks for your detailed notes webfork. And also for mentioning LibreOffice. Looking a bit deeper into that I was surprised to learn that LO has some rudimentary CLI support. Relevant here is convert-to:

```
--convert-to output_file_extension[:output_filter_name] [--outdir output_dir] files

Batch convert files:

If --outdir is not specified, then current working directory is used as output_dir.

Eg.
 --convert-to pdf *.doc
 --convert-to pdf:writer_pdf_Export --outdir /home/user *.doc
```

LibreOffice CLI parameters: https://help.libreoffice.org/Common/Sta ... Parameters
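The `--convert-to` syntax quoted above can be assembled programmatically. A rough sketch, assuming the LibreOffice launcher is called `soffice` and is on the PATH (on Windows it may need a full path), and adding `--headless` so no window opens. The helper name is hypothetical.

```python
from __future__ import annotations

import glob
import subprocess


def build_convert_command(
    files: list[str],
    extension: str,
    filter_name: str | None = None,
    outdir: str | None = None,
    exe: str = "soffice",
) -> list[str]:
    """Assemble: soffice --headless --convert-to ext[:filter] [--outdir dir] files"""
    target = extension if filter_name is None else f"{extension}:{filter_name}"
    cmd = [exe, "--headless", "--convert-to", target]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    return cmd + list(files)


def convert_docs_to_pdf(pattern: str = "*.doc", outdir: str | None = None) -> None:
    """Expand the wildcard in Python, since subprocess does not do shell globbing."""
    files = sorted(glob.glob(pattern))
    if files:
        subprocess.run(build_convert_command(files, "pdf", outdir=outdir), check=True)
```

For example, `build_convert_command(["a.doc"], "pdf", "writer_pdf_Export", "/home/user")` reproduces the second example from the help text.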

#6 Post by **shnbwmn** » Sat Jul 30, 2016 8:18 am

Recent use case ...

I had a (simple) PDF book that I wanted to convert to HTML. The content was mostly typography, with a smattering of diagrams. Using Acrobat Pro DC, the result was readable and the typography correct; however, the format was completely disjointed. All graphics/diagrams were aligned flush-left, and none of the margins/spacing of the PDF were preserved. Quite disappointing, Adobe ...

So I gave pdf2htmlEX a try - the resulting HTML was formatted correctly, exactly as in the PDF. The only downside was the size of the output ...

```
PDF:   ~15 MB
HTML: ~114 MB
```

Also, the pdf2htmlEX output was a single HTML, while the Acrobat output stored the graphics in an accompanying folder.
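The sizes quoted above work out to a growth factor of about 7.6x. Since a web server would normally compress the HTML on transfer, the on-disk size overstates what a visitor downloads. A small sketch for checking both numbers on your own files. The helper names are hypothetical, and gzip is used here only as a stand-in for whatever compression a given server applies.

```python
import gzip
from pathlib import Path


def growth_factor(pdf_size: float, html_size: float) -> float:
    """How many times larger the HTML is than the PDF (same units for both)."""
    return html_size / pdf_size


def gzipped_size(path: Path) -> int:
    """Size in bytes of the file after gzip compression (read fully into memory)."""
    return len(gzip.compress(path.read_bytes()))


# Using the approximate figures from the post: 114 / 15 is about 7.6.
factor = growth_factor(15, 114)
```

Comparing `gzipped_size(Path("book.html"))` with the PDF's size gives a fairer picture of the download cost than the raw file sizes do (`book.html` is a placeholder name).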

#7 Post by **TP109** » Sat Jul 30, 2016 2:12 pm

Anybody get the pdfcropper tool to work? If so, what version of GS was used and what is the command-line? Instructions are vague. Tried GS 8.54 and 9.19 so far but no luck.

#8 Post by **webfork** » Sat Jul 30, 2016 9:04 pm

shnbwmn wrote: Looking a bit deeper into that I was surprised to learn that LO has some rudimentary CLI support.

I was wanting to test this out before I replied but it's taking too long so suffice it to say that's great news. Thanks for that.

TP109 wrote: Anybody get the pdfcropper tool to work? If so, what version of GS was used and what is the command-line? Instructions are vague. Tried GS 8.54 and 9.19 so far but no luck.

That's another one I'm hoping to test sometime soon. Hopefully I'll have some feedback in the weeks ahead.

shnbwmn wrote: Using Acrobat Pro DC, the result was readable and the typography correct; however, the format was completely disjointed. All graphics/diagrams were aligned flush-left, and none of the margins/spacing of the PDF were preserved. Quite disappointing, Adobe ...

I haven't had good experience with Adobe's converters. What's strange is how they're so far behind several free options. HTML isn't exactly a new format.

#9 Post by **webfork** » Wed Aug 03, 2016 5:23 pm

This is turning into a bit of a thread hijack but hopefully not ...

shnbwmn wrote:
LibreOffice CLI parameters: https://help.libreoffice.org/Common/Sta ... Parameters

I couldn't find anything about batch adding or removing of passwords for LibreOffice files similar to ExcelPass. Any guess about how that might be done?

The Portable Freeware Collection Forums
