This is a very impressive resource. Here is a side-by-side view of the PDF vs. HTML output, borrowing from an Intel PDF that came with my machine (PDF on the left, HTML on the right):
http://i.imgur.com/ZWC2VGa.png
It maintains the index, hyperlinks, formatting, images, and margins. Wow.
This program might blur the lines between slides, printouts, and web pages, and even improve PDF typography, which isn't always great. It also might make some companies that are drunk on PDF (Oracle) let go a little bit. While LibreOffice has a pretty solid PDF viewer and HTML export, pdf2htmlex left it in the dust, at least on this particular operation.
--
Note that the HTML file is about 10x larger than the PDF, but that's for 3 reasons (possibly 4):
- Integrated PDF compression. A PDF compresses its contents internally, while the HTML output is plain uncompressed text. Most webservers compress on the fly to save on download time, but the file itself stays uncompressed on disk.
- Some of the images aren't saved at their actual size, as in the case below, where the Intel icon is saved with a lot of whitespace around it: http://i.imgur.com/hmdl2YV.png
- It looks like images are being embedded in the HTML as text (similar to how RTF files store images). Essentially if you pull it up in a text editor, it looks like this: http://i.imgur.com/rwxE5F0.png
This can be changed with command-line switches. I used:
pdf2htmlex.exe --embed-image 0 --dest-dir output filename.pdf
- The images used in PDFs are usually JPEG compressed, but the output here is PNG files. That might also be an issue.
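To put a rough number on the embedded-image overhead mentioned above: assuming the images are stored as base64 text (which is what the text-editor screenshot suggests; this is an assumption, not something confirmed from the pdf2htmlEX source), every 3 bytes of image data become 4 characters in the HTML. A minimal Python sketch, using random bytes as a stand-in for an already-compressed image; `embedded_size` is just an illustrative helper, not part of pdf2htmlEX:

```python
import base64
import gzip
import os


def embedded_size(raw: bytes) -> int:
    """Number of bytes `raw` occupies once it is base64-encoded as text."""
    return len(base64.b64encode(raw))


# Stand-in for an already-compressed image (random bytes, like JPEG/PNG
# payloads, do not shrink much when compressed again).
raw_image = os.urandom(30_000)

encoded = base64.b64encode(raw_image)
gzipped = gzip.compress(encoded)  # roughly what a gzip-enabled webserver would send

print("raw image bytes:   ", len(raw_image))
print("base64 text bytes: ", len(encoded))  # 4 characters for every 3 input bytes
print("gzipped text bytes:", len(gzipped))
```

So embedding alone adds about a third to every image. Transfer compression on the webserver typically wins much of that back in transit, but the file sitting on disk stays at the larger size.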
VT analysis of pdf2htmlEX.exe (clean; 0/57)
https://www.virustotal.com/en/file/7A1C ... /analysis/
---
The author also has some impressive programs on his website. Two that caught my eye:
- A very nice PDF Cropper page that auto-cuts margins. I have little doubt that this could save an amazing amount of data if used on just a small percentage of the scanned documents out there.
- A PDFTK version that supports Chinese, Japanese, and Korean paths (which I didn't realize was an issue).