Generating PDF/A From HTML in Meteor


My live-chat app was a fork of the Rocket.Chat project, which is built with Meteor. The app had a feature that let administrative users export conversations to PDF files, and they wanted to archive these files for a long time.

I happened to know that PDF/A documents were good for this purpose. Finding a solution built on free libraries was really frustrating: it took me more than two weeks to find a workable approach.

TL;DR
Use Puppeteer to generate a normal PDF, then use PDFBox to load that PDF and convert it into a PDF/A-compliant document.

What is PDF/A?

Here is a definition from Wikipedia:
PDF/A is an ISO-standardized version of the Portable Document Format (PDF) specialized for use in the archiving and long-term preservation of electronic documents. PDF/A differs from PDF by prohibiting features unsuitable for long-term archiving, such as font linking (as opposed to font embedding) and encryption. The ISO requirements for PDF/A file viewers include color management guidelines, support for embedded fonts, and a user interface for reading embedded annotations.

Why PDF/A documents matter

In my view, PDF/A is a standard format for archiving thanks to these key features:
  • Fonts are embedded, so the document’s text is displayed the same everywhere, regardless of device.
  • Documents are safe. For example, launching executable files and referencing external content are not allowed.
(Find more features at https://en.wikipedia.org/wiki/PDF/A#Description)

Several approaches for creating a PDF/A document

At first glance, I found the following approaches:

(1). Paid APIs
  • Aspose PDF: Java-based APIs. The price seemed too high.
  • PdfTron: JavaScript-based APIs. I tried to contact them about pricing, but they asked for information that I could not estimate well enough to answer.
(2). Free APIs
  • Apache FOP: Java-based APIs. I needed to define the templates in its own format (FOP files) instead of HTML and CSS. (I created a hello world app here)
  • PDFBox: Java-based APIs. It looked promising because I could convert a normal PDF document into a PDF/A one.
(3). Forking an open source project/building my own PDF/A generating engine
  • pdfkit: JavaScript-based project. I would have needed to learn the PDF/A specification and build my own library for generating PDF/A documents.
Based on these insights, I decided to go with approach (2).

Apache FOP tryout

I decided to abandon this approach. Apache FOP strictly requires templates in the XSL-FO format, which is not widely supported. Ref: https://stackoverflow.com/questions/10641667/use-of-xsl-fo-css3-instead-of-css2-to-create-paginated-documents-like-pdf/21345708#21345708
  • Due to the limited formatting support, it’s hard to format the content of a PDF if its template is complex, for example when it includes images and tables.
  • There seemed to be no tools for converting HTML and CSS into XSL-FO, but my templates were built with HTML, CSS and Meteor, and their data was bound dynamically.

PDFBox tryout

Based on the example of creating PDF/A here, I tried to convert an existing normal PDF file into a PDF/A file. There were 3 parts to do:
  1. embed/load fonts
  2. include XMP metadata
  3. include color profile
I used the following tools to verify the result:
After filling in some mandatory information, I passed parts 2 and 3 successfully. But I was stuck on part 1 for a long time.
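
To illustrate part 2: the XMP metadata of a PDF/A file carries a PDF/A identification schema (prefix pdfaid) that states the part and conformance level the file claims. The sketch below builds such a packet with plain Java strings. The class name, the method name, and the choice of part 1 with conformance B are assumptions made for illustration. In a real conversion a library such as PDFBox would generate the packet, and the packet would also need the usual document properties (title, creator, dates) kept consistent with the PDF's info dictionary.

```java
// Hypothetical helper: builds a minimal XMP packet that carries only the
// PDF/A identification schema. Names and the 1/B values are assumptions.
final class PdfaXmpSketch {

    // Namespace URI of the PDF/A identification schema (prefix "pdfaid").
    static final String PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/";

    // Returns the packet as a string; part and conformance are inserted as-is.
    static String buildPacket(int part, String conformance) {
        return String.join("\n",
            // Standard XMP packet header; the begin attribute holds a BOM.
            "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>",
            "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">",
            "  <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">",
            "    <rdf:Description rdf:about=\"\" xmlns:pdfaid=\"" + PDFAID_NS + "\">",
            "      <pdfaid:part>" + part + "</pdfaid:part>",
            "      <pdfaid:conformance>" + conformance + "</pdfaid:conformance>",
            "    </rdf:Description>",
            "  </rdf:RDF>",
            "</x:xmpmeta>",
            "<?xpacket end=\"w\"?>");
    }

    public static void main(String[] args) {
        // Print a packet that claims PDF/A-1b.
        System.out.println(buildPacket(1, "B"));
    }
}
```

A validator reads these two values to decide which rule set to check the file against, so they must match what the rest of the file actually satisfies.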

Eventually, I found the root cause by delving deeper into the source code of several libraries (PDFBox, PDFBox Preflight, node-html-pdf, PhantomJS, etc.):

I could not generate a PDF/A document by converting the existing normal PDFs generated from Meteor because those PDFs did not embed the fonts fully.

My project used node-html-pdf, which in turn used PhantomJS to render HTML into PDF. PhantomJS used Qt for writing the PDF. The library always embedded fonts using subsetting (meaning only the characters used in the document were embedded). From what I could tell, the fonts embedded this way did not pass PDF/A validation.
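
A quick way to spot a subset font when inspecting a PDF: the PDF specification marks a subset by prefixing the font's BaseFont name with a tag of six uppercase letters followed by a plus sign. The sketch below checks a name for that tag using plain Java. The class and method names are hypothetical and the sample font names are made up; reading the BaseFont names out of a file is left to a PDF library.

```java
// Hypothetical helper: detects the subset tag that the PDF specification
// prescribes for subset fonts, e.g. "ABCDEF+Roboto-Regular".
final class SubsetTagSketch {

    // True when the name starts with six uppercase letters, then '+',
    // then at least one more character (the actual font name).
    static boolean isSubsetFontName(String baseFont) {
        if (baseFont == null || baseFont.length() < 8 || baseFont.charAt(6) != '+') {
            return false;
        }
        for (int i = 0; i < 6; i++) {
            char c = baseFont.charAt(i);
            if (c < 'A' || c > 'Z') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSubsetFontName("ABCDEF+Roboto-Regular")); // true
        System.out.println(isSubsetFontName("Roboto-Regular"));        // false
    }
}
```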

I couldn’t do anything about this because PhantomJS used a very old version of Qt (only Qt v5.x onwards supported PDF/A). PhantomJS had also been discontinued in March 2018.

Besides, I could not use PDFBox to embed the fonts fully either. There was a method, PDType0Font.load, for loading fonts, but it also embedded the fonts using subsetting.

Yay! It worked

I discovered that the fix was to replace the process that rendered HTML to PDF with one that embedded the fonts fully.
Looking for a different library to replace PhantomJS, I found Google Chrome’s Puppeteer.

Then I could use PDFBox to convert the normal PDF to PDF/A by including the XMP metadata and color profile.


References:
[1]. https://www.youtube.com/watch?v=EqII7ilmY8o&feature=youtu.be
[2]. https://stackoverflow.com/questions/38737219/how-to-convert-pdf-to-pdf-a-in-java
[3]. http://svn.apache.org/repos/asf/pdfbox/trunk/preflight/src/main/java/org/apache/pdfbox/preflight/PreflightConstants.java
[4]. http://www.pdf-tools.com/public/downloads/whitepapers/whitepaper-pdfa.pdf
[5]. https://medium.com/@raphaelstbler/advanced-pdf-generation-for-node-js-using-puppeteer-e168253e159c
