Skip to main content

Generating PDF/A From HTML in Meteor


My live-chat app was a folk of project Rocket.Chat which was built with Meteor. The app had a feature that administrative users were able to export the conversations into PDF files. And, they wanted to archive these files for a long time.

I happened to know that PDF/A documents were good for this purpose. It was really frustrated to find a solution with free libraries. Actually, it took me more than two weeks to find a possible approach.

TL, DR;
Using Puppeteer to generate a normal PDF and using PDFBox to load and converting the generated PDF into PDF/A compliance.

What is PDF/A?

Here is a definition from Wikipedia:
PDF/A is an ISO-standardized version of the Portable Document Format (PDF) specialized for use in the archiving and long-term preservation of electronic documents. PDF/A differs from PDF by prohibiting features unsuitable for long-term archiving, such as font linking (as opposed to font embedding) and encryption. The ISO requirements for PDF/A file viewers include color management guidelines, support for embedded fonts, and a user interface for reading embedded annotations.

Why PDF/A documents matter

In my point of view, it’s a standard file for archiving thanks to key features:
  • The fonts of document’s content will always be displayed the same on all places, not depending on any devices.
  • Documents are safe. For example, no executable file launches are allowed or no external content references are allowed.
(Find more features at https://en.wikipedia.org/wiki/PDF/A#Description)

Several approaches for creating a PDF/A document

At the first glance, I found some approaches as follows:

(1). Paid APIs
  • Aspose PDF: Java-based APIs. It seemed the price was too expensive.
  • PdfTron: Javascript-based APIs. I tried to contact them to know the price but they requested me some information even I could not estimate to answer it.
(2). Free APIs
  • Apache FOP: Java-based APIs. I needed to defined the templates by its own ways (FOP files) instead of HTML and CSS. (I created a hello world app here)
  • PDFBox: Java-based APIs. It looked promising that I could convert a normal PDF document to PDF/A one.
(3). Forking a open source project/building my own PDF/A generating engine
  • pdfkit: Javascript-based project. I needed to know the specification of PDF/A compliance, and to build my own lib for generating PDF/A documents.
Basing on these insights, I decided to go with approach (2).

Apache FOP tryout

I decided to stop this approach. Apache FOP strictly required to provide templates with XSL-FO format which is not supported wisely. Ref: https://stackoverflow.com/questions/10641667/use-of-xsl-fo-css3-instead-of-css2-to-create-paginated-documents-like-pdf/21345708#21345708
  • Due to limitation of formatting support, it’s hard to format content of a PDF if its template is complex such as including images, tables, etc..
  • There seemed to be no tools for converting from HTML, CSS into XSL-FO. But my templates were built on HTML, CSS and Meteor and data was bound dynamically.

PDFBox tryout

Basing on the example of creating PDF/A here, I tried to convert an existing normal PDF file into another PDF/A. There were 3 parts to do:
  1. embed/load fonts
  2. include XMP metadata
  3. include color profile
I used these following tools to verify the result:
With filling some mandatory information, I passed the parts 2 and 3 successfully. But, I have really got stuck at part 1 for a long time.

Eventually, I found the root cause by delving deeper into source code of some libraries PDFBox, PDFBox Preflight, node-html-pdf, PhantomJS, etc.. as follows:

I could not successfully generate PDF/A document by converting the existing normal PDFs generated from Meteor because the PDF generated did not embedded the font fully.

My project used node-html-pdf which in turn used PhantomJS to render HTML into PDF. PhantomJS used Qt for writing PDF. The library always embedded the fonts using Subsetting (meaning only the characters were used in the document were embedded). This did not comply with the PDF/A specification.

I couldn’t do anything here because PhantomJS used a very old version of Qt (only Qt v5.x onwards supported PDA/A). The PhantomJS was also discontinued since Mar 2018.

Besides, I could not use PDFBox to embed fonts fully either. There was a method PDType0Font.load for loading fonts but it also embedded the fonts using Subsetting.

Yay! It worked

I discovered that by replacing the process rendering HTML to PDF with the fonts fully embedded.
Using a different library to replace PhantomJS, I found Google Chrome’s Puppeteer.

Then, I could use PDFBox to convert the normal PDF to PDF/A with including XMP metadata and color profile.


References:
[1]. https://www.youtube.com/watch?v=EqII7ilmY8o&feature=youtu.be
[2]. https://stackoverflow.com/questions/38737219/how-to-convert-pdf-to-pdf-a-in-java
[3].http://svn.apache.org/repos/asf/pdfbox/trunk/preflight/src/main/java/org/apache/pdfbox/preflight/PreflightConstants.java
[4]. http://www.pdf-tools.com/public/downloads/whitepapers/whitepaper-pdfa.pdf
[5]. https://medium.com/@raphaelstbler/advanced-pdf-generation-for-node-js-using-puppeteer-e168253e159c

Comments

Popular posts from this blog

Set up a web server for learning HTTP headers

Motivation We all follow the client-server model using the HTTP protocol for most of our web apps today. In development, we simply may have a backend API server and a frontend (web pages or mobile apps) only. However, it seemed that a proxy server is always required for production. In fact, most of the hardest issues in production come from integration. The requests and responses might be modified by the proxy server. Therefore, the understanding of HTTP protocol is one of the key skills to resolve those issues. I wanted to dive deep into HTTP with some core concepts such as caching, cookies, and CORS. I didn't intend to go quickly rather than moved slowly to have a well understanding of what I do. Prepare a server The easiest way is to use my laptop as a server then I can just use "localhost". I can also use ngrok to make my web server online. Finally, I use an online tool such as RedBot to check the HTTP headers. To make it more excited though, I deployed the app on A...

The power of acceptance test

User Story is the place PO gives his ideas about features so that developers are able to know what requirements are. Acceptance tests are these show the most valuable things of the features represented by some specific cases. Usually PO defines them, but not always. Therefore, refining existing acceptance tests – even defining new ones that cover all features of the User Story must be a worth task. Acceptance test with Given When Then pattern If we understand what we are going to do, we can complete it by 50% I have worked with some members those just start implementing the features one by one and from top to down of the User Story description. Be honest, I am the one used to be. What a risky approach! Because it might meet a case that is very easy to miss requirements or needs to re-work after finding any misunderstood things. I have also worked with some members those accept spending a long time to clarify the User Story. Reading carefully of whole User Story by defining...

What the heck is Meteor DDP?

I was using Meteor for my messenger project. I was so curious about the real time connection. I wanted to know how exactly this mechanism works. In this post, I will go through the DDP Specification, an overview of WebSocket, and a simple demo about how to subscribe a publication of Rocket.Chat (containing a DDP server) from an external webpage. At a glance, I knew that Meteor invented a protocol called DDP which uses for handling real time connection. So then, what is DDP? "DDP (Distributed Data Protocol) is the stateful WebSocket protocol that Meteor uses to communicate between the client and the server." [1] All right! Why does DDP matter? "DDP is a standard way to solve the biggest problem facing client-side JavaScript developers: querying a server-side database, sending the results down to the client, and then pushing changes to the client whenever anything changes in the database" . [2] In order to understand deeply the protocol, I decided ...

DevOps Toolchain Enhancement

 Historically, our company ubitec had started with a customer project. Agile/Scrum was our proposal for working with customers. Time by time, Agile/Scrum also became our culture for software development. To be successful with this development approach, we somehow needed to have a fast release for customers (i.e. every one week). Back then, we had a build tool Jenkins which was responsible for having sprint release packages for our customers. The build job pipelines contain some steps such as gathering the artifacts, checking the code convention, running the tests, building docker images, and packaging an archived file (a zip file). The set of tools involved in a pipeline is roughly called a toolchain. It is just a part of a bigger process called the DevOps toolchain. Source: https://www.ibm.com/blogs/cloud-archive/2016/11/devops-architecture-available-on-bluemix-garage-method-site/ DevOps is a proven method that fits Agile. Today,  it is even treated as a mandatory factor...

Solving your data visualization needs with open source reporting

Most of applications have some types of data visualization needs: - Gather the data. - Perform calculation, sort, group, aggregate, total,.. - Present information professionally. and meeting user demand is crucial to the success of an application. To solve this problem, there are some different approaches: - Buy a closed-source commercial product (for example, Crystal Reports, JReport,..), we must to pay for a lot of features but maybe more of features we don't need. - Build a custom-developed solution, so we need a team to develop our solution but the problem is how much time and money that we need to spend for that. Nowaday, open source creates new choices. Firstly, we can leverage open source in a customer solution by plug-in it to our solution. Secondly, we can build open-source-based products by using open source code. There are many open source reporting tools for use in the enterprise such as BIRT, iReport, JasperReports,... In this post, I wou...