Skip to main content

Generating PDF/A From HTML in Meteor


My live-chat app was a folk of project Rocket.Chat which was built with Meteor. The app had a feature that administrative users were able to export the conversations into PDF files. And, they wanted to archive these files for a long time.

I happened to know that PDF/A documents were good for this purpose. It was really frustrated to find a solution with free libraries. Actually, it took me more than two weeks to find a possible approach.

TL, DR;
Using Puppeteer to generate a normal PDF and using PDFBox to load and converting the generated PDF into PDF/A compliance.

What is PDF/A?

Here is a definition from Wikipedia:
PDF/A is an ISO-standardized version of the Portable Document Format (PDF) specialized for use in the archiving and long-term preservation of electronic documents. PDF/A differs from PDF by prohibiting features unsuitable for long-term archiving, such as font linking (as opposed to font embedding) and encryption. The ISO requirements for PDF/A file viewers include color management guidelines, support for embedded fonts, and a user interface for reading embedded annotations.

Why PDF/A documents matter

In my point of view, it’s a standard file for archiving thanks to key features:
  • The fonts of document’s content will always be displayed the same on all places, not depending on any devices.
  • Documents are safe. For example, no executable file launches are allowed or no external content references are allowed.
(Find more features at https://en.wikipedia.org/wiki/PDF/A#Description)

Several approaches for creating a PDF/A document

At the first glance, I found some approaches as follows:

(1). Paid APIs
  • Aspose PDF: Java-based APIs. It seemed the price was too expensive.
  • PdfTron: Javascript-based APIs. I tried to contact them to know the price but they requested me some information even I could not estimate to answer it.
(2). Free APIs
  • Apache FOP: Java-based APIs. I needed to defined the templates by its own ways (FOP files) instead of HTML and CSS. (I created a hello world app here)
  • PDFBox: Java-based APIs. It looked promising that I could convert a normal PDF document to PDF/A one.
(3). Forking a open source project/building my own PDF/A generating engine
  • pdfkit: Javascript-based project. I needed to know the specification of PDF/A compliance, and to build my own lib for generating PDF/A documents.
Basing on these insights, I decided to go with approach (2).

Apache FOP tryout

I decided to stop this approach. Apache FOP strictly required to provide templates with XSL-FO format which is not supported wisely. Ref: https://stackoverflow.com/questions/10641667/use-of-xsl-fo-css3-instead-of-css2-to-create-paginated-documents-like-pdf/21345708#21345708
  • Due to limitation of formatting support, it’s hard to format content of a PDF if its template is complex such as including images, tables, etc..
  • There seemed to be no tools for converting from HTML, CSS into XSL-FO. But my templates were built on HTML, CSS and Meteor and data was bound dynamically.

PDFBox tryout

Basing on the example of creating PDF/A here, I tried to convert an existing normal PDF file into another PDF/A. There were 3 parts to do:
  1. embed/load fonts
  2. include XMP metadata
  3. include color profile
I used these following tools to verify the result:
With filling some mandatory information, I passed the parts 2 and 3 successfully. But, I have really got stuck at part 1 for a long time.

Eventually, I found the root cause by delving deeper into source code of some libraries PDFBox, PDFBox Preflight, node-html-pdf, PhantomJS, etc.. as follows:

I could not successfully generate PDF/A document by converting the existing normal PDFs generated from Meteor because the PDF generated did not embedded the font fully.

My project used node-html-pdf which in turn used PhantomJS to render HTML into PDF. PhantomJS used Qt for writing PDF. The library always embedded the fonts using Subsetting (meaning only the characters were used in the document were embedded). This did not comply with the PDF/A specification.

I couldn’t do anything here because PhantomJS used a very old version of Qt (only Qt v5.x onwards supported PDA/A). The PhantomJS was also discontinued since Mar 2018.

Besides, I could not use PDFBox to embed fonts fully either. There was a method PDType0Font.load for loading fonts but it also embedded the fonts using Subsetting.

Yay! It worked

I discovered that by replacing the process rendering HTML to PDF with the fonts fully embedded.
Using a different library to replace PhantomJS, I found Google Chrome’s Puppeteer.

Then, I could use PDFBox to convert the normal PDF to PDF/A with including XMP metadata and color profile.


References:
[1]. https://www.youtube.com/watch?v=EqII7ilmY8o&feature=youtu.be
[2]. https://stackoverflow.com/questions/38737219/how-to-convert-pdf-to-pdf-a-in-java
[3].http://svn.apache.org/repos/asf/pdfbox/trunk/preflight/src/main/java/org/apache/pdfbox/preflight/PreflightConstants.java
[4]. http://www.pdf-tools.com/public/downloads/whitepapers/whitepaper-pdfa.pdf
[5]. https://medium.com/@raphaelstbler/advanced-pdf-generation-for-node-js-using-puppeteer-e168253e159c

Comments

Popular posts from this blog

Attribute 'for' of label component with id xxxx is not defined

I got the warning in the log file when I have used the tag <h:outputLabel> without attribute " for " in xhtml file. It was really polluting my server log files. The logged information actually makes sense anyway! We could find an answer as the following: "Having h:outputLabel without a "for" attribute is meaningless. If you are not attaching the label, you should be using h:outputText instead of h:outputLabel." However, these solutions are not possible just for my situation. Instead of using h:outputText for only displaying text, my team has used h:outputLabel too many places. We were nearly in our release time (next day) so it is quite risky and takes much efforts if we try to correct it. Because the style (with CSS) is already done with h:ouputLabel . The alternative by adding attribute " for " the existing h:outputLabel is not reasonable either. I really need to find another solution. Fortunately, I came across a way if I cha...

Coding Exercise, Episode 1

I have received the following exercise from an interviewer, he didn't give the name of the problem. Honestly, I have no idea how to solve this problem even I have tried to read it three times before. Since I used to be a person who always tells myself "I am not the one good at algorithms", but giving up something too soon which I feel that I didn't spend enough effort to overcome is not my way. Then, I have sticked on it for 24 hours. According to the given image on the problem, I tried to get more clues by searching. Thanks to Google, I found a similar problem on Hackerrank (attached link below). My target here was trying my best to just understand the problem and was trying to solve it accordingly by the Editorial on Hackerrank. Due to this circumstance, it turns me to love solving algorithms from now on (laugh). Check it out! Problem You are given a very organized square of size N (1-based index) and a list of S commands The i th command will follow t...

Google I/O '18

Here are my top 5 impressions on this conference. Gmail - live sentence suggestion It looked like the way developers use intelligent code completion in an IDE when coding. Google Photos: converting a photo has a document to pdf I have a paid app on my iPad called "Scanner for Me" but now I can use Google Photos instead. ;) Google Assistant: bot makes a real call to book a restaurant dinner or a hair salon Wow! This feature, for me, is really a big innovation. My team is working so hard on building our bot which is able to have a continued conversation. Google is so good! Google Maps: using Phone's camera to watch the direction When I saw a fox as a cicerone on the demo, I was thinking of Pokémon GO. Google Lens: extract text from images I have heard a story from a friend of me that he had to use his app about "optical character recognition (OCR)" to scan and translate the texts into English whenever he saw texts in China. Google Lens would be ...

Regex - Check a text without special characters but German, French

Special characters such as square brackets ([ ]) can cause an exception " java.util.regex.PatternSyntaxException " or something like this if we don't handle them correctly. I had met this issue. In my case, my customers want our application should allow some characters in German and French even not allow some special characters. The solution is that we limit the allowed characters by showing the validation message on GUI. For an instance, the message looks like the following: "This field can't contain any special characters; only letters, numbers, underscores (_), spaces and single quotes (') are allowed." I used Regular Expression to check it. For entering Germany and French, I actually don't have this type of keyboard, so I referred these sites: * German characters: http://german.typeit.org/ * French characters: http://french.typeit.org/ Here is my code: package vn.nvanhuong.practice; import java.util.regex.Matcher; import java.util...

4 Remarkable Notes for JSF Apps Using HTML5

In the previous post , I've already shared with you how my team consults clients by using a HTML prototype. This post is about the used technologies for reusing the provided HTML template and communicating with backend. What is the issue when using HTML elements with Primefaces components? Primefaces is a great extension for developing JSF web apps. However, it is really frustrating in case we have to make it work with an existing HTML template. Why? - Primefaces has its own theme for styling. - Primefaces changes the HTML structure. Therefore, that would be a huge effort to use the Primefaces' components to replicate the elements of the HTML template; especially it is impossible for images drawing by " canvas " tag. That requires us to find a better approach. Since Java EE 7 (introducing JSF 2.2 included), it supports to use HTML5 elements . The idea is that JSF components don't effect the style and HTML structure, so we can easily reuse the provided HTM...