Convert a PDF to XML

The pdftohtml program is good at converting PDF files that already contain text (i.e. no OCR needed) into text-based formats such as HTML or XML.

Convert PDF to XML (pdf2xml)

pdftohtml -c -xml document.pdf prefix

This will produce an XML file like this:

<page number="7" position="absolute" top="0" left="0" height="594" width="396">
  <fontspec id="7" size="16" family="Times" color="#0000ff"/>
  <fontspec id="8" size="8" family="Times" color="#000000"/>
  <text top="97" left="122" width="153" height="18" font="7"><b>INTRODUCTION</b></text>
  <text top="210" left="36" width="324" height="11" font="8">Lorum ipsum</text>
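Once you have the XML, the positioned text items are easy to pull out with a script. Here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. It assumes a single page element shaped like the excerpt above, with a closing </page> added so the sample parses. The SAMPLE string and the extract_text_items helper are names made up for this illustration, not part of pdftohtml.

```python
import xml.etree.ElementTree as ET

# Sample page modelled on the excerpt above; the closing </page> tag is
# added here so that the fragment is well-formed XML.
SAMPLE = """<page number="7" position="absolute" top="0" left="0" height="594" width="396">
<fontspec id="7" size="16" family="Times" color="#0000ff"/>
<fontspec id="8" size="8" family="Times" color="#000000"/>
<text top="97" left="122" width="153" height="18" font="7"><b>INTRODUCTION</b></text>
<text top="210" left="36" width="324" height="11" font="8">Lorum ipsum</text>
</page>"""


def extract_text_items(page):
    """Return one dict per <text> element with its text, position and font size."""
    # Map fontspec ids to their elements so each text item can look up its font.
    fonts = {spec.get("id"): spec for spec in page.iter("fontspec")}
    items = []
    for node in page.iter("text"):
        font = fonts.get(node.get("font"))
        items.append({
            # itertext() also collects text nested inside <b> or <i> children.
            "text": "".join(node.itertext()),
            "top": int(node.get("top")),
            "left": int(node.get("left")),
            "width": int(node.get("width")),
            "height": int(node.get("height")),
            "font_size": int(font.get("size")) if font is not None else None,
        })
    return items


page = ET.fromstring(SAMPLE)
items = extract_text_items(page)
for item in items:
    print(item)
```

For a real output file you would parse the whole document and loop over each page element, but the per-page logic should look much the same.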

Convert PDF to HTML with images

pdftohtml -c document.pdf prefix

This produces a series of PNG images and HTML pages. You can open each HTML page in your browser.
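If you convert many files, both invocations (with and without -xml) can be wrapped in a small script. The following is only a sketch in Python: build_command and convert are hypothetical helper names, and the only pdftohtml flags used are the -c and -xml flags shown in the commands on this page. It assumes pdftohtml is on your PATH.

```python
import shutil
import subprocess


def build_command(pdf_path, prefix, xml=False):
    """Build the pdftohtml argument list; xml=True adds the -xml flag."""
    cmd = ["pdftohtml", "-c"]
    if xml:
        cmd.append("-xml")
    cmd += [pdf_path, prefix]
    return cmd


def convert(pdf_path, prefix, xml=False):
    """Run pdftohtml, raising an error if it is missing or exits non-zero."""
    if shutil.which("pdftohtml") is None:
        raise FileNotFoundError("pdftohtml not found on PATH")
    subprocess.run(build_command(pdf_path, prefix, xml=xml), check=True)


# Show the two command lines without actually running them.
print(build_command("document.pdf", "prefix"))
print(build_command("document.pdf", "prefix", xml=True))
```

Passing the arguments as a list rather than a shell string means file names with spaces do not need quoting.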

Feature wishes

As great as pdftohtml is, there are some things that I wish it would do. These would make it easier to edit PDF files using standard Unix text tools.

  1. It would be cool if one could take the generated XML or HTML and convert it back to a PDF file. This would allow you to edit the HTML/XML using one of the millions of tools and then convert the result back to a PDF.
  2. There is an inconsistency between the XML and HTML output. The XML output contains the width and height of text items, but the HTML doesn't. The HTML contains the background images, but the XML has no images, only text.
