Skip to main content

Generating PDF/A From HTML in Meteor


My live-chat app was a folk of project Rocket.Chat which was built with Meteor. The app had a feature that administrative users were able to export the conversations into PDF files. And, they wanted to archive these files for a long time.

I happened to know that PDF/A documents were good for this purpose. It was really frustrated to find a solution with free libraries. Actually, it took me more than two weeks to find a possible approach.

TL, DR;
Using Puppeteer to generate a normal PDF and using PDFBox to load and converting the generated PDF into PDF/A compliance.

What is PDF/A?

Here is a definition from Wikipedia:
PDF/A is an ISO-standardized version of the Portable Document Format (PDF) specialized for use in the archiving and long-term preservation of electronic documents. PDF/A differs from PDF by prohibiting features unsuitable for long-term archiving, such as font linking (as opposed to font embedding) and encryption. The ISO requirements for PDF/A file viewers include color management guidelines, support for embedded fonts, and a user interface for reading embedded annotations.

Why PDF/A documents matter

In my point of view, it’s a standard file for archiving thanks to key features:
  • The fonts of document’s content will always be displayed the same on all places, not depending on any devices.
  • Documents are safe. For example, no executable file launches are allowed or no external content references are allowed.
(Find more features at https://en.wikipedia.org/wiki/PDF/A#Description)

Several approaches for creating a PDF/A document

At the first glance, I found some approaches as follows:

(1). Paid APIs
  • Aspose PDF: Java-based APIs. It seemed the price was too expensive.
  • PdfTron: Javascript-based APIs. I tried to contact them to know the price but they requested me some information even I could not estimate to answer it.
(2). Free APIs
  • Apache FOP: Java-based APIs. I needed to defined the templates by its own ways (FOP files) instead of HTML and CSS. (I created a hello world app here)
  • PDFBox: Java-based APIs. It looked promising that I could convert a normal PDF document to PDF/A one.
(3). Forking a open source project/building my own PDF/A generating engine
  • pdfkit: Javascript-based project. I needed to know the specification of PDF/A compliance, and to build my own lib for generating PDF/A documents.
Basing on these insights, I decided to go with approach (2).

Apache FOP tryout

I decided to stop this approach. Apache FOP strictly required to provide templates with XSL-FO format which is not supported wisely. Ref: https://stackoverflow.com/questions/10641667/use-of-xsl-fo-css3-instead-of-css2-to-create-paginated-documents-like-pdf/21345708#21345708
  • Due to limitation of formatting support, it’s hard to format content of a PDF if its template is complex such as including images, tables, etc..
  • There seemed to be no tools for converting from HTML, CSS into XSL-FO. But my templates were built on HTML, CSS and Meteor and data was bound dynamically.

PDFBox tryout

Basing on the example of creating PDF/A here, I tried to convert an existing normal PDF file into another PDF/A. There were 3 parts to do:
  1. embed/load fonts
  2. include XMP metadata
  3. include color profile
I used these following tools to verify the result:
With filling some mandatory information, I passed the parts 2 and 3 successfully. But, I have really got stuck at part 1 for a long time.

Eventually, I found the root cause by delving deeper into source code of some libraries PDFBox, PDFBox Preflight, node-html-pdf, PhantomJS, etc.. as follows:

I could not successfully generate PDF/A document by converting the existing normal PDFs generated from Meteor because the PDF generated did not embedded the font fully.

My project used node-html-pdf which in turn used PhantomJS to render HTML into PDF. PhantomJS used Qt for writing PDF. The library always embedded the fonts using Subsetting (meaning only the characters were used in the document were embedded). This did not comply with the PDF/A specification.

I couldn’t do anything here because PhantomJS used a very old version of Qt (only Qt v5.x onwards supported PDA/A). The PhantomJS was also discontinued since Mar 2018.

Besides, I could not use PDFBox to embed fonts fully either. There was a method PDType0Font.load for loading fonts but it also embedded the fonts using Subsetting.

Yay! It worked

I discovered that by replacing the process rendering HTML to PDF with the fonts fully embedded.
Using a different library to replace PhantomJS, I found Google Chrome’s Puppeteer.

Then, I could use PDFBox to convert the normal PDF to PDF/A with including XMP metadata and color profile.


References:
[1]. https://www.youtube.com/watch?v=EqII7ilmY8o&feature=youtu.be
[2]. https://stackoverflow.com/questions/38737219/how-to-convert-pdf-to-pdf-a-in-java
[3].http://svn.apache.org/repos/asf/pdfbox/trunk/preflight/src/main/java/org/apache/pdfbox/preflight/PreflightConstants.java
[4]. http://www.pdf-tools.com/public/downloads/whitepapers/whitepaper-pdfa.pdf
[5]. https://medium.com/@raphaelstbler/advanced-pdf-generation-for-node-js-using-puppeteer-e168253e159c

Comments

Popular posts from this blog

Junit - Test fails on French or German string assertion

In my previous post about building a regex to check a text without special characters but allow German and French . I met a problem that the unit test works fine on my machine using Eclipse, but it was fail when running on Jenkins' build job. Here is my test: @Test public void shouldAllowFrenchAndGermanCharacters(){ String source = "ÄäÖöÜüß áÁàÀâÂéÉèÈêÊîÎçÇ"; assertFalse(SpecialCharactersUtils.isExistSpecialCharater(source)); } Production code: public static boolean isExistNotAllowedCharacters(String source){ Pattern regex = Pattern.compile("^[a-zA-Z_0-9_ÄäÖöÜüß áÁàÀâÂéÉèÈêÊîÎçÇ]*$"); Matcher matcher = regex.matcher(source); return !matcher.matches(); } The result likes the following: Failed tests: SpecialCharactersUtilsTest.shouldAllowFrenchAndGermanCharacters:32 null A guy from stackoverflow.com says: "This is probably due to the default encoding used for your Java source files. The ö in the string literal in the J...

How to convert time between timezone in Java, Primefaces?

I use the calendar Primefaces component with timeOnly and timeZone attributes for using only hour format (HH:mm). Like this: <p:calendar id="xabsOvertimeTimeFrom" pattern="HH:mm" timeOnly="true" value="#{data.dateFrom}" timeZone="#{data.timeZone}"/> We can convert the value of #{data.dateFrom} from GMT/UTC time zone to local, conversely, from local time zone to GMT/UTC time zone. Here is my functions: package vn.nvanhuong.timezoneconverter; import java.text.ParseException; import java.text.SimpleDateFormat; import java.util.Calendar; import java.util.Date; import java.util.TimeZone; public class TimeZoneConverter { /** * convert a date with hour format (HH:mm) from local time zone to UTC time zone */ public static Date convertHourToUTCTimeZone(Date inputDate) throws ParseException { if(inputDate == null){ return null; } Calendar calendar = Calendar.getInstance(); calendar.setTime(inputDate); int ...

4 Remarkable Notes for JSF Apps Using HTML5

In the previous post , I've already shared with you how my team consults clients by using a HTML prototype. This post is about the used technologies for reusing the provided HTML template and communicating with backend. What is the issue when using HTML elements with Primefaces components? Primefaces is a great extension for developing JSF web apps. However, it is really frustrating in case we have to make it work with an existing HTML template. Why? - Primefaces has its own theme for styling. - Primefaces changes the HTML structure. Therefore, that would be a huge effort to use the Primefaces' components to replicate the elements of the HTML template; especially it is impossible for images drawing by " canvas " tag. That requires us to find a better approach. Since Java EE 7 (introducing JSF 2.2 included), it supports to use HTML5 elements . The idea is that JSF components don't effect the style and HTML structure, so we can easily reuse the provided HTM...

AngularJS - Build a custom validation directive for using multiple emails in textarea

AngularJS already supports the built-in validation with text input with type email. Something simple likes the following: <input name="input" ng-model="email.text" required="" type="email" /> <span class="error" ng-show="myForm.input.$error.email"> Not valid email!</span> However, I used a text area and I wanted to enter some email addresses that's saparated by a comma (,). I had a short research and it looked like AngualarJS has not supported this functionality so far. Therefore, I needed to build a custom directive that I could add my own validation functions. My validation was done only on client side, so I used the $validators object. Note that, there is the $asyncValidators object which handles asynchronous validation, such as making an $http request to the backend. This is just my implementation on my project. In order to understand that, I supposed you already had experiences with ...

Updates to the Scrum Guide - The 5 Scrum Values

This article is available at blog.scrum.org , here I just quote my favorite points and give my comment at the end of this post. Ken Schwaber and Jeff Sutherland, the creators of Scrum delivered a webinar on their latest update to the Scrum Guide .  The update was a simple one, adding the 5 values of Scrum to the Guide: These values sound easy? Well, there are many misunderstandings and common problems when applying these values. Here are some example: Value Misunderstanding Getting the value right Commitment Committing to something that you don’t understand because you are told to by your boss. Committing yourself to the team and Sprint Goal. Focus Focusing on keeping the customer happy Being focused on the sprint and its goal. Openness Telling everyone everything about all your work Highlighting when you have challenges and problems that are stopping you from success Respect Thinking you are helping the team by being a hero Helping peop...