
Generating PDF/A From HTML in Meteor


My live-chat app was a fork of the Rocket.Chat project, which is built with Meteor. One feature let administrative users export conversations to PDF files, and they wanted to archive these files for a long time.

I happened to know that PDF/A documents are good for this purpose. Finding a solution built on free libraries was really frustrating, though: it took me more than two weeks to find a workable approach.

TL;DR
Use Puppeteer to generate a normal PDF, then use PDFBox to load it and convert it into a PDF/A-compliant file.

What is PDF/A?

Here is a definition from Wikipedia:
PDF/A is an ISO-standardized version of the Portable Document Format (PDF) specialized for use in the archiving and long-term preservation of electronic documents. PDF/A differs from PDF by prohibiting features unsuitable for long-term archiving, such as font linking (as opposed to font embedding) and encryption. The ISO requirements for PDF/A file viewers include color management guidelines, support for embedded fonts, and a user interface for reading embedded annotations.

Why PDF/A documents matter

From my point of view, it is a standard format for archiving thanks to these key features:
  • Fonts are embedded, so the document’s content is displayed the same everywhere, regardless of device.
  • Documents are safe. For example, launching executable files and referencing external content are not allowed.
(Find more features at https://en.wikipedia.org/wiki/PDF/A#Description)

Several approaches for creating a PDF/A document

At first glance, I found the following approaches:

(1). Paid APIs
  • Aspose PDF: Java-based APIs. The price seemed too high.
  • PdfTron: JavaScript-based APIs. I tried to contact them about pricing, but they asked for information that I could not even estimate.
(2). Free APIs
  • Apache FOP: Java-based APIs. I would need to define the templates in its own format (XSL-FO files) instead of HTML and CSS. (I created a hello-world app to try it.)
  • PDFBox: Java-based APIs. It looked promising because I could convert a normal PDF document into a PDF/A one.
(3). Forking an open-source project or building my own PDF/A generation engine
  • pdfkit: JavaScript-based project. I would need to learn the PDF/A specification and build my own library for generating PDF/A documents.
Based on these insights, I decided to go with approach (2).

Apache FOP tryout

I decided to drop this approach. Apache FOP strictly requires templates in XSL-FO format, which is not widely supported. Ref: https://stackoverflow.com/questions/10641667/use-of-xsl-fo-css3-instead-of-css2-to-create-paginated-documents-like-pdf/21345708#21345708
  • Because formatting support is limited, it is hard to lay out a PDF whose template is complex, for example one that includes images and tables.
  • There seemed to be no tool for converting HTML and CSS into XSL-FO, but my templates were built with HTML, CSS and Meteor, with data bound dynamically.

PDFBox tryout

Based on an example of creating a PDF/A document with PDFBox, I tried to convert an existing normal PDF file into a PDF/A file. There were 3 parts to do:
  1. embed/load fonts
  2. include XMP metadata
  3. include color profile
I used validation tools to verify the result.
After filling in some mandatory information, I passed parts 2 and 3 successfully. But I was stuck on part 1 for a long time.
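
For part 2, what marks a file as PDF/A is a small identification entry in the XMP metadata. The sketch below builds such a packet as a plain string in Java so that the structure is visible. The class and method names are hypothetical, part 1 with conformance B (PDF/A-1b) is an assumed target level, and in a real conversion the packet would be produced and attached through PDFBox rather than written by hand:

```java
public class PdfaXmpSketch {

    /**
     * Builds a minimal XMP packet that carries only the PDF/A
     * identification schema. The values are inserted as-is, without
     * validation; part=1 and conformance="B" would mean PDF/A-1b
     * (an assumed target, not taken from the original project).
     */
    public static String buildPdfaIdXmp(int part, String conformance) {
        return String.join("\n",
            // Packet header; U+FEFF and the fixed id are required by XMP.
            "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>",
            "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">",
            " <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">",
            "  <rdf:Description rdf:about=\"\"",
            "      xmlns:pdfaid=\"http://www.aiim.org/pdfa/ns/id/\">",
            "   <pdfaid:part>" + part + "</pdfaid:part>",
            "   <pdfaid:conformance>" + conformance + "</pdfaid:conformance>",
            "  </rdf:Description>",
            " </rdf:RDF>",
            "</x:xmpmeta>",
            "<?xpacket end=\"w\"?>");
    }

    public static void main(String[] args) {
        // Print the packet for the assumed PDF/A-1b target.
        System.out.println(buildPdfaIdXmp(1, "B"));
    }
}
```

A full packet usually also carries document properties such as the title and creator; they are left out here to keep the identification part readable.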

Eventually, I found the root cause by digging into the source code of several libraries (PDFBox, PDFBox Preflight, node-html-pdf, PhantomJS, etc.):

I could not generate a PDF/A document by converting the normal PDFs produced by my Meteor app because those PDFs did not embed the fonts fully.

My project used node-html-pdf, which in turn used PhantomJS to render HTML into PDF. PhantomJS used Qt to write the PDF. Qt always embedded fonts as subsets (meaning only the characters used in the document were embedded). In my validation runs, fonts embedded this way did not pass the PDF/A checks.
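
Subset-embedded fonts are easy to spot when inspecting a PDF: the PDF format prefixes the name of a subset font with a tag of six uppercase letters and a plus sign (for example ABCDEF+Roboto-Regular). Here is a minimal Java sketch of that check, using only the standard library. The class and method names are hypothetical, and reading the font names out of the PDF (for example with PDFBox) is left out:

```java
import java.util.regex.Pattern;

public class SubsetFontCheck {

    // A subset font name is a six-uppercase-letter tag, a "+", then the real name.
    private static final Pattern SUBSET_TAG = Pattern.compile("[A-Z]{6}\\+.+");

    /** Returns true if the given font name carries a subset tag. */
    public static boolean isSubsetFontName(String baseFontName) {
        return baseFontName != null && SUBSET_TAG.matcher(baseFontName).matches();
    }

    public static void main(String[] args) {
        // The font names below are made-up examples.
        System.out.println(isSubsetFontName("ABCDEF+Roboto-Regular"));
        System.out.println(isSubsetFontName("Roboto-Regular"));
    }
}
```

Running a check like this over the fonts of a generated PDF shows quickly whether the renderer embedded subsets or whole fonts.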

I couldn’t do anything about this because PhantomJS used a very old version of Qt (only Qt v5.x onwards supports PDF/A). PhantomJS had also been discontinued since March 2018.

Besides, I could not use PDFBox to embed the fonts fully either. There was a method, PDType0Font.load, for loading fonts, but it also embedded them as subsets.

Yay! It worked

I solved it by replacing the HTML-to-PDF rendering step with one that embeds the fonts fully.
Looking for a library to replace PhantomJS, I found Google Chrome’s Puppeteer.

Then I could use PDFBox to convert the normal PDF into PDF/A by adding the XMP metadata and the color profile.


References:
[1]. https://www.youtube.com/watch?v=EqII7ilmY8o&feature=youtu.be
[2]. https://stackoverflow.com/questions/38737219/how-to-convert-pdf-to-pdf-a-in-java
[3]. http://svn.apache.org/repos/asf/pdfbox/trunk/preflight/src/main/java/org/apache/pdfbox/preflight/PreflightConstants.java
[4]. http://www.pdf-tools.com/public/downloads/whitepapers/whitepaper-pdfa.pdf
[5]. https://medium.com/@raphaelstbler/advanced-pdf-generation-for-node-js-using-puppeteer-e168253e159c
