
Generating PDF/A From HTML in Meteor


My live-chat app was a fork of the Rocket.Chat project, which is built with Meteor. One feature let administrative users export conversations to PDF files, and they wanted to archive these files for a long time.

I happened to know that PDF/A documents are good for this purpose. Finding a solution built on free libraries was really frustrating, though. It took me more than two weeks to find a workable approach.

TL;DR
Use Puppeteer to generate a normal PDF, then use PDFBox to load it and convert it into a PDF/A-compliant document.

What is PDF/A?

Here is a definition from Wikipedia:
PDF/A is an ISO-standardized version of the Portable Document Format (PDF) specialized for use in the archiving and long-term preservation of electronic documents. PDF/A differs from PDF by prohibiting features unsuitable for long-term archiving, such as font linking (as opposed to font embedding) and encryption. The ISO requirements for PDF/A file viewers include color management guidelines, support for embedded fonts, and a user interface for reading embedded annotations.

Why PDF/A documents matter

In my view, it is a standard format for archiving thanks to these key features:
  • The fonts are embedded, so the document’s content is displayed the same way everywhere, regardless of the device.
  • Documents are safe. For example, launching executable files and referencing external content are not allowed.
(Find more features at https://en.wikipedia.org/wiki/PDF/A#Description)

Several approaches for creating a PDF/A document

At first glance, I found the following approaches:

(1). Paid APIs
  • Aspose PDF: Java-based APIs. The price seemed too high.
  • PdfTron: JavaScript-based APIs. I tried to contact them about pricing, but they asked for information that I could not estimate well enough to answer.
(2). Free APIs
  • Apache FOP: Java-based APIs. I needed to define the templates in its own format (FOP files) instead of HTML and CSS. (I created a hello-world app to try it.)
  • PDFBox: Java-based APIs. It looked promising because I could convert a normal PDF document into a PDF/A one.
(3). Forking an open source project or building my own PDF/A generating engine
  • pdfkit: JavaScript-based project. I would need to learn the PDF/A specification and build my own library for generating PDF/A documents.
Based on these insights, I decided to go with approach (2).

Apache FOP tryout

I decided to stop pursuing this approach. Apache FOP strictly requires templates in the XSL-FO format, which is not widely supported. Ref: https://stackoverflow.com/questions/10641667/use-of-xsl-fo-css3-instead-of-css2-to-create-paginated-documents-like-pdf/21345708#21345708
  • Because of the limited formatting support, it is hard to lay out a PDF when its template is complex, for example when it includes images and tables.
  • There seemed to be no tools for converting HTML and CSS into XSL-FO, but my templates were built with HTML, CSS, and Meteor, with data bound dynamically.

PDFBox tryout

Based on an example of creating a PDF/A document, I tried to convert an existing normal PDF file into a PDF/A one. There were 3 parts to do:
  1. embed/load fonts
  2. include XMP metadata
  3. include color profile
I used validation tools to verify the result. After filling in some mandatory information, I passed parts 2 and 3. But I was stuck on part 1 for a long time.
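
For part 2, the mandatory information includes the PDF/A identification entry in the XMP packet. The sketch below builds such a packet with plain Java strings, only to show roughly what ends up in the file. It is not how PDFBox produces it (PDFBox has its own XMP classes), and the class and method names are made up for illustration. It assumes PDF/A-1b, hence part 1 and conformance B.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: builds a minimal XMP packet
// that carries only the PDF/A identification properties.
final class PdfAIdXmpSketch {

    // Namespace of the PDF/A identification schema (prefix "pdfaid").
    static final String PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/";

    static String buildPacket(int part, String conformance) {
        return String.join("\n",
            "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>",
            "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">",
            " <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">",
            "  <rdf:Description rdf:about=\"\" xmlns:pdfaid=\"" + PDFAID_NS + "\">",
            "   <pdfaid:part>" + part + "</pdfaid:part>",
            "   <pdfaid:conformance>" + conformance + "</pdfaid:conformance>",
            "  </rdf:Description>",
            " </rdf:RDF>",
            "</x:xmpmeta>",
            "<?xpacket end=\"w\"?>");
    }

    public static void main(String[] args) {
        // Assumption: PDF/A-1b, i.e. part 1 with conformance level B.
        String packet = buildPacket(1, "B");
        System.out.println(packet);
        System.out.println(packet.getBytes(StandardCharsets.UTF_8).length + " bytes as UTF-8");
    }
}
```

A real document usually needs more than this in its metadata (title, dates and so on, consistent with the document information), so treat the packet above as the smallest recognizable piece rather than a complete one.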

Eventually, I found the root cause by digging into the source code of several libraries (PDFBox, PDFBox Preflight, node-html-pdf, PhantomJS, etc.):

I could not generate a valid PDF/A document by converting the normal PDFs generated from Meteor because those PDFs did not embed the fonts fully.

My project used node-html-pdf, which in turn used PhantomJS to render HTML into PDF. PhantomJS used Qt to write the PDF, and Qt always embedded the fonts as subsets (meaning only the characters actually used in the document were embedded). In my tests, this output did not comply with the PDF/A specification.
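
One way to spot subsetting is the font name itself: in a PDF, a subset font's name starts with a tag of six uppercase letters followed by a plus sign. Here is a minimal sketch of that check in plain Java. The class name and the sample font names are invented for illustration, and a name without the tag does not prove the font is fully embedded (it may not be embedded at all).

```java
import java.util.regex.Pattern;

// Hypothetical helper: recognizes the subset tag in a PDF font name.
final class SubsetFontNameSketch {

    // Six uppercase letters, a plus sign, then the actual font name.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+.+$");

    // Accepts the name with or without the leading slash of a PDF name object.
    private static String normalize(String baseFontName) {
        return baseFontName.startsWith("/") ? baseFontName.substring(1) : baseFontName;
    }

    static boolean looksLikeSubset(String baseFontName) {
        return SUBSET_TAG.matcher(normalize(baseFontName)).matches();
    }

    // Returns the font name without the subset tag, or the name unchanged.
    static String stripSubsetTag(String baseFontName) {
        String name = normalize(baseFontName);
        return looksLikeSubset(name) ? name.substring(7) : name;
    }

    public static void main(String[] args) {
        // Sample names are made up, not taken from my generated files.
        for (String name : new String[] {"/ABCDEF+Helvetica", "Helvetica"}) {
            System.out.println(name + " subset? " + looksLikeSubset(name));
        }
    }
}
```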

I couldn’t do anything about it because PhantomJS used a very old version of Qt (only Qt v5.x onwards supports PDF/A). PhantomJS has also been discontinued since March 2018.

Besides, I could not use PDFBox to embed the fonts fully either. There was a method, PDType0Font.load, for loading fonts, but the way I used it, it also embedded the fonts as subsets.

Yay! It worked

The fix was to replace the step that rendered HTML to PDF with one that embedded the fonts fully.
Looking for a library to replace PhantomJS, I found Google Chrome’s Puppeteer.

Then I could use PDFBox to convert the normal PDF to PDF/A by adding the XMP metadata and the color profile.
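
The color profile here is an ICC profile that the PDF/A file carries as its output intent. A small sketch, assuming sRGB is the intended profile and using only the JDK's java.awt.color classes rather than PDFBox: it obtains the sRGB profile that ships with the JDK and reads two header fields as a sanity check. Whether these exact bytes satisfy a given PDF/A validator is an assumption, not something the sketch proves, and the class name is illustrative.

```java
import java.awt.color.ColorSpace;
import java.awt.color.ICC_Profile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: fetches the JDK's built-in sRGB ICC profile
// and inspects its header.
final class SrgbProfileSketch {

    // Raw ICC profile bytes for sRGB, as provided by the JDK.
    static byte[] srgbProfileBytes() {
        return ICC_Profile.getInstance(ColorSpace.CS_sRGB).getData();
    }

    // Bytes 36..39 of an ICC header hold the profile file signature.
    static String signature(byte[] icc) {
        return new String(icc, 36, 4, StandardCharsets.US_ASCII);
    }

    // Bytes 16..19 of an ICC header hold the data color space.
    static String colorSpace(byte[] icc) {
        return new String(icc, 16, 4, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] icc = srgbProfileBytes();
        System.out.println("profile size: " + icc.length + " bytes");
        System.out.println("signature: " + signature(icc));
        System.out.println("color space: " + colorSpace(icc));
    }
}
```

The bytes would then be handed to whatever writes the output intent into the document; in my setup that was PDFBox.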


References:
[1]. https://www.youtube.com/watch?v=EqII7ilmY8o&feature=youtu.be
[2]. https://stackoverflow.com/questions/38737219/how-to-convert-pdf-to-pdf-a-in-java
[3]. http://svn.apache.org/repos/asf/pdfbox/trunk/preflight/src/main/java/org/apache/pdfbox/preflight/PreflightConstants.java
[4]. http://www.pdf-tools.com/public/downloads/whitepapers/whitepaper-pdfa.pdf
[5]. https://medium.com/@raphaelstbler/advanced-pdf-generation-for-node-js-using-puppeteer-e168253e159c
