Converting HTML to PDF in C# under .NET

 

Friday, May 13, 2022

So, converting HTML to PDF using C#: how hard can it be? Well, the answer is ‘deceptively hard’. To explain why, and to show how ABCpdf .NET can do this kind of conversion so easily with a little bit of C#, we have written this: the definitive guide to HTML to PDF C# conversion.

 

The ABCpdf .NET library has enabled this kind of HTML to PDF C# conversion since the release of ABCpdf .NET Version 4 back in January 2003. So yes, that is nineteen years of experience in web page to PDF conversion! We wrote the book on this one.

As you might imagine the technological background has not remained constant during this time. There have been a number of solutions to getting web content into PDFs and as browser technology has changed, some technologies like CSS and SVG have come to the fore, while others like Flash and VML have faded away.

It is the intermesh of different technologies which provides the core challenge here. What you may see as a simple web page or chunk of HTML is usually a mix of media and techniques, all of which are crucial to the appearance of the page.

Yes there is HTML but also JavaScript, there is some CSS for styling and some SVG for your Font Awesome typefaces. But perhaps not all your fonts? Perhaps some of them work via WOFF or EOT. Then maybe you have a little jQuery component which unbeknownst to you uses VML for its drawing. Then there is Flash and WebGL and many more – the possibilities are large.

It is the exceptionally wide range of technologies which makes this difficult. The idea that one could just dust off Visual Studio and write an HTML to PDF C# conversion tool from scratch is grossly unrealistic. The biggest companies in the world, thousands of developers, decades of work, hundreds of millions of dollars: this is what you are looking at. Yet people still try to do this. In terms of real-world web page conversion it is a total failure. The only way this kind of approach can work is if you write your web pages specially to be handled by these restricted web browsers.

So how does one make something more real-world work? Well, essentially you have to leverage a real-world browser. You need to be careful doing this because browsers are not designed to do automated conversion. However it is possible to make it work exceptionally well, provided you use a tool like the ABCpdf .NET C# library: a tool which has been refining these techniques for nearly two decades.

History of HTML to PDF C# Conversion

 

So what does ABCpdf .NET do and how does it work? We will take a historical perspective as it has varied somewhat over the years as technologies have waxed and waned.

 

2003 - IE Engine

The original ABCpdf .NET HTML conversion engine was based around MSHTML, which is a COM component at the heart of Internet Explorer. We invented a unique technique to leverage the technology. This was a first: no-one had ever done this before.

We hosted MSHTML in a virtual window and then asked it to display the desired web page. As it was drawing to the window we would intercept and pick up the drawing commands, saving them for later use. Being virtual, the window could be tens or hundreds of feet tall. So no matter how big the web page, we could save it all up.

Of course a web page a hundred feet tall is not what people want to see in a PDF. So we needed to chop it up into pages, selecting carefully where breaks should occur and whether elements should be repeated or avoided. By doing this we were able to get a very realistic view of what IE would show a user, and we were also able to implement support for CSS page break styles which would not normally be understood by IE.

There were a lot of experiments, fixes, special cases and technologies required to allow this all to function in highly demanding multi-threaded environments. But with time and experience this became a spectacularly successful and popular option for HTML to PDF C# conversion.

2011 - Firefox Engine

In 2011 ABCpdf .NET Version 8 was released, containing our new ABCGecko engine based around the Firefox Gecko HTML rendering module. This provided a number of advantages over the older MSHTML method.

Firstly Firefox was rather more standards compliant than MSHTML which meant that developers could use the kinds of HTML mark-up they wanted rather than adapting it to Internet Explorer. It also brought in support for standards like HTML5 and SVG.

Secondly we had complete control over the software. One of the problems with MSHTML had always been that it came as part of Windows. This meant that Microsoft updates might change the way that things worked. In general the updates were positive – but not always.

Our ABCGecko engine was completely self-contained. We selected a good and stable version of Firefox, we compiled it up, linked into the Gecko engine and made it do our bidding.

Running in a process garden it was exceptionally stable and it was completely constant. It brought a whole raft of frequently requested features with it – most notably repeated table headers (THEAD) and footers (TFOOT) and it let you select different media types such as screen and print via CSS.

2017 - Chrome Engine

In 2017 we made available the ABCChrome engine, which is based around the Chrome engine. The Firefox port had been very successful so we leveraged the same techniques again. This engine is again more standards compliant and it is also, in many cases, rather faster than the other two engines.

The ABCChrome engine has been incredibly popular for C# HTML to PDF conversion. Because of the way the Chrome architecture works and because of the experience we had had with Firefox, we were able to really get inside it and extend it in a much more flexible way than before.

As a result we were able to use some ingenious techniques to solve some of the long running conundrums that people had had using the other engines. In particular we were able to deal with AJAX updates in a much more sensible way than previously. Of which more later.

It should be noted that not all Chrome releases are equal. While each one may pass the Chrome automated testing, and to a user they may seem pretty much identical, it is only too obvious to us which ones are better. Trying a browser out as a user is very different from integrating it into something which needs high volume, high scalability and high reliability.

We upgrade the engine by starting at the most current release and working back through the releases until we find one that passes our speed, quality and stability tests. Typically, we will need to go through about five releases until we find one we are happy with.

2021 - Chrome and WebKit Engines

With the release of ABCpdf 12 in early 2021 we introduced a new updated ABCChrome engine bringing enhanced security such as site isolation and enhanced cross-origin policies, as well as more current conformance and better CSS support.

Spring 2021 brought ABCpdf 12.1 and a completely new WebKit based engine. This has the great advantage of being much less OS dependent than the other engines - albeit less sophisticated. From a practical point of view this means it can be easily used in stripped down hosting environments such as Azure web sites or Azure App Services.

Example Scenario using C# and ABCpdf

 

So enough of history. How does one convert a web page to a PDF and what are the standard issues one comes across? Here we use ABCpdf .NET with C# for our HTML to PDF conversion example.

 

One Page

The base process is pretty simple. This is all the C# you need to convert your web page into a PDF.

Doc doc = new Doc();
doc.AddImageUrl("http://www.google.com/");
doc.Save(@"c:\web_page.pdf");

Many Pages

However that is only one page. If you want more pages you need to add them.

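Doc doc = new Doc();
// Convert the first page and keep its ID so the rest can be chained on
int id = doc.AddImageUrl("http://www.google.com/");
// While content remains, add a page and continue the chain onto it
while (doc.Chainable(id)) {
    doc.Page = doc.AddPage();
    id = doc.AddImageToChain(id);
}
doc.Save(@"c:\web_page.pdf");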

Page Breaking

So now you have an HTML web page that has been converted to a multi-page PDF document. But when you look at it, it is not quite right. Perhaps there is a table that has been split across two pages? Perhaps an image has ended up on one page and the by-line on another? In general you find the pages haven’t quite broken at the points you felt were best.

The way to deal with this is with the CSS page break styles: page-break-before, page-break-after and page-break-inside. The first breaks before the item, the second breaks after it and the last allows you to specify elements which should not be broken across pages.

For example to break before this div you might use the following construct,

<div style="page-break-before:always">&nbsp;</div>

Or to avoid splitting a table you might use the following,

<table style="page-break-inside: avoid;">...</table>

These things aren’t always easy to visualize so for debugging purposes, it is best to apply a visible style at the point you apply the page breaks – just so you can see where things ought – and ought not – to break. Just a simple horizontal or vertical border is sufficient to solve most page break difficulties.
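As a rough sketch of that debugging tip, the helper below generates the page break div with an optional visible border. The class and method names are hypothetical, and the red top border is just one arbitrary choice of marker; the markup itself is the same construct shown above.

```csharp
using System;

static class PageBreakMarkup
{
    // Hypothetical helper: returns an empty div that forces a page break
    // before it. With debug on, a red top border shows where the break sits.
    public static string BreakBefore(bool debug)
    {
        string style = "page-break-before:always";
        if (debug)
            style += "; border-top:1px solid red";
        return "<div style=\"" + style + "\">&nbsp;</div>";
    }
}

class Program
{
    static void Main()
    {
        // Emit the marker with and without the debugging border.
        Console.WriteLine(PageBreakMarkup.BreakBefore(debug: true));
        Console.WriteLine(PageBreakMarkup.BreakBefore(debug: false));
    }
}
```

Once the breaks fall where you expect, switching the debug flag off leaves the output otherwise unchanged.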

Custom Styles for Paged Media

At this point it is all looking good, but do you think that some of the CSS styles are working a bit better on screen than they are in your PDF? The way one deals with this kind of thing is to specify a different print stylesheet.

Print stylesheets are often somewhat lackluster so ABCpdf .NET allows you to decide whether you want to use the print or the screen stylesheet. It is a simple option you specify after creating your Doc object.

Doc doc = new Doc();
doc.HtmlOptions.Media = MediaType.Print;
...

Coping with AJAX - Dynamic Pages, Charts and Maps

This is fabulous - the HTML to PDF C# code is working brilliantly. You try it a bit more and then you notice that as your C# creates one PDF after another, one of your animated charts seems to be varying a bit. Normally it looks all right in the PDF, but sometimes it looks like it is half way through a start-up animation.

This is a common and difficult issue with dynamic pages and AJAX and most often occurs when dealing with charts and maps. The whole concept behind AJAX is that a web page may not have a particular point at which it has finished loading: it loads progressively and may continue to change. To a person it is obvious when a page is ready, but a computer has no such intuition.

But no longer is it difficult! The new ABCChrome engine will literally watch the page until it stops changing. Once it does that it takes a snapshot. Obviously it cannot wait for ever but it is fairly easy to set a sensible set of options. So to say that the page is ready when it doesn’t change for 500ms you could use the following options.

Doc doc = new Doc();
doc.HtmlOptions.RepaintDelay = 500;
...
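To make the "ready when nothing has changed for 500ms" rule concrete, here is a small model of it. This is an illustration only and not how ABCChrome is implemented: the function name, the list of repaint times and the overall timeout parameter are all assumptions made for the sketch.

```csharp
using System;
using System.Collections.Generic;

static class QuietPeriod
{
    // Illustrative model only. Given the times (ms since load, ascending) at
    // which a page repainted, return when a snapshot would be taken: as soon
    // as quietMs pass with no repaint, or at timeoutMs if that never happens.
    public static int SnapshotTimeMs(IEnumerable<int> repaintTimesMs, int quietMs, int timeoutMs)
    {
        int lastChange = 0; // treat the initial load as the first change
        foreach (int t in repaintTimesMs)
        {
            if (t - lastChange >= quietMs)
                break;      // the page was already quiet for long enough
            lastChange = t; // still changing, restart the quiet-period clock
        }
        return Math.Min(lastChange + quietMs, timeoutMs);
    }
}

class Program
{
    static void Main()
    {
        // A chart animating at 100, 300 and 700 ms, then a late update at 2000 ms.
        var repaints = new[] { 100, 300, 700, 2000 };
        Console.WriteLine(QuietPeriod.SnapshotTimeMs(repaints, 500, 10000)); // prints 1200
        Console.WriteLine(QuietPeriod.SnapshotTimeMs(repaints, 500, 1000));  // prints 1000
    }
}
```

In this model the late update at 2000 ms is missed with a 500 ms quiet period, which shows the trade-off: a longer delay is safer for slow pages but makes every conversion wait longer.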

Splitting Tables over Pages

So is the final output right? Well not quite. On looking at a table on the second page you note it has been split over two pages. It is a large table so splitting is acceptable, but it is not ideal that the table header and footer don’t get repeated on all the pages.

The solution here is to use the thead and tfoot tags which specify table headers and footers that should be repeated on paged media. Something along these lines.

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Age</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Bill</td>
      <td>40</td>
    </tr>
    <tr>
      <td>Zak</td>
      <td>20</td>
    </tr>
  </tbody>
  <tfoot>
    <tr>
      <td>...</td>
      <td> </td>
    </tr>
  </tfoot>
</table>

For this functionality you’ll need to be using the ABCChrome or ABCGecko HTML engine. While both are good, ABCGecko is still superior for complex nested tables.

Converting Links

So all looks perfect. But you add a second web page to the PDF and suddenly a problem appears: the links between the two pages take you to the pages on the web site rather than the pages embedded in the PDF.

Similarly your fragment-based links are linking to pages at the original URL rather than pages in the document. What you really want is for links that pointed to pages on the site to point instead to those same pages now in the PDF.

The solution here is pretty simple. All you need to do is add all the pages to your PDF document and then at the end make a single call to LinkPages.

Doc doc = new Doc();
doc.HtmlOptions.AddLinks = true;
doc.AddImageUrl("http://www.mysite.com/support.htm");
doc.Page = doc.AddPage();
doc.AddImageUrl("http://www.mysite.com/purchase.htm");
doc.HtmlOptions.LinkPages();
...

ABCpdf .NET will look at all the HTML as it appears in the PDF together with the original URLs used for conversion. Where it is possible to link within the document rather than to the original URL, it will redirect the links. So your menus, pages and fragments will target locations inside the PDF.
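The redirection idea can be pictured with a small model. This is a conceptual sketch rather than the way LinkPages works internally: the class name, the URL-to-page dictionary and the contact.htm URL are assumptions for illustration, and fragments are simply dropped instead of being mapped to a position on the page.

```csharp
using System;
using System.Collections.Generic;

static class LinkModel
{
    // Returns the PDF page a link should jump to, or null if the target
    // was not converted and the link should keep pointing at the web.
    public static int? ResolveLink(IDictionary<string, int> pageByUrl, string href)
    {
        // Ignore any #fragment; a fuller model would map it to a position on the page.
        int hash = href.IndexOf('#');
        string url = hash >= 0 ? href.Substring(0, hash) : href;
        return pageByUrl.TryGetValue(url, out int page) ? page : (int?)null;
    }
}

class Program
{
    static void Main()
    {
        // The two URLs converted in the example above, keyed to their PDF pages.
        var pages = new Dictionary<string, int>
        {
            { "http://www.mysite.com/support.htm", 1 },
            { "http://www.mysite.com/purchase.htm", 2 },
        };
        int? inside = LinkModel.ResolveLink(pages, "http://www.mysite.com/purchase.htm#pricing");
        int? outside = LinkModel.ResolveLink(pages, "http://www.mysite.com/contact.htm");
        Console.WriteLine(inside?.ToString() ?? "external");  // prints 2
        Console.WriteLine(outside?.ToString() ?? "external"); // prints external
    }
}
```

The useful point of the model is that redirection is only possible once every page is in the document, which is why the call comes at the end.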

Advanced C# and JavaScript

 

So we have been through the more common problems and how one deals with them. However there is also a more advanced set of HTML to PDF C# conversion techniques one can use. They can be more difficult to code but they are incredibly powerful.

 

Dynamically Changing the HTML DOM

Sometimes one has control over both code and web pages. But all too often control over the two is divorced. So you have control over the C# you are writing but the web page you are converting may not be something you can change.

Perhaps it is on a different site? Perhaps it might be something another team is dealing with? The effect is the same - you cannot add in the tweaks that you need to get the HTML to convert to the perfect PDF.

Have no fear: there are still ways to make this work. In your code you can inject JavaScript into the page to modify it before ABCpdf .NET sees it. Something as simple as this can be used to change the background color.

Doc doc = new Doc();
doc.HtmlOptions.OnLoadScript = @"document.body.style.backgroundColor = 'rgb(243, 249, 172)';";
...

This is a simple example but obviously this technique is enormously powerful as it allows virtually unlimited control. You can insert as much JavaScript as you like, to do anything you like, before it gets converted to PDF.
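As a slightly larger sketch of the same idea, the helper below builds a script that hides unwanted elements, such as navigation bars, before conversion. The helper name and the example selectors are hypothetical; the only ABCpdf feature assumed is the OnLoadScript property shown above, to which the resulting string would be assigned.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class ScriptBuilder
{
    // Hypothetical helper: builds JavaScript that hides every element
    // matching the given CSS selectors.
    public static string HideElements(IEnumerable<string> selectors)
    {
        var sb = new StringBuilder("(function(){");
        foreach (string selector in selectors)
        {
            // Escape backslashes and single quotes so the selector can sit
            // inside a single-quoted JavaScript string literal.
            string escaped = selector.Replace("\\", "\\\\").Replace("'", "\\'");
            sb.Append("var n=document.querySelectorAll('").Append(escaped).Append("');");
            sb.Append("for(var i=0;i<n.length;i++){n[i].style.display='none';}");
        }
        sb.Append("})();");
        return sb.ToString();
    }
}

class Program
{
    static void Main()
    {
        // Example selectors only; the printed string is what you would
        // assign to doc.HtmlOptions.OnLoadScript.
        string script = ScriptBuilder.HideElements(new[] { "nav", ".cookie-banner" });
        Console.WriteLine(script);
    }
}
```

Generating the script from a list keeps the C# readable as the number of tweaks grows, and the escaping step avoids broken scripts when a selector contains quotes.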

Getting Information Out of the Web Page

So you now know how to dynamically change the HTML DOM. However sometimes what one needs is the ability to know what is going on inside the page rather than simply the ability to change it.

For example suppose you want to know how big to make your PDF so that you can fit the entire HTML page on it when it is converted? To do this you really need to know what is going on inside the web page.

To do this you can insert JavaScript to collect and return information via the special 'document.documentElement.abcpdf' property.

Doc doc = new Doc();
doc.HtmlOptions.OnLoadScript
    = "(function(){"
    + " var aspectRatio = document.body.offsetWidth / document.body.scrollHeight;"
    + " document.documentElement.abcpdf = aspectRatio;"
    + "})();";
int id = doc.AddImageUrl(_url);
var aspectRatio = doc.HtmlOptions.GetScriptReturn(id);
...

Using code like this you can access the aspect ratio of the page and then adjust the PDF page size as appropriate.
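The arithmetic for turning that ratio into a page size might look like the sketch below. It assumes the script return value arrives as a string holding a JavaScript number, which the snippet above does not confirm; the helper name and the 612 point page width are also assumptions for illustration.

```csharp
using System;
using System.Globalization;

static class PageSizing
{
    // scriptReturn is assumed to hold width / height as text, as produced
    // by the script above. Returns the page height that preserves that
    // ratio for the given page width.
    public static double HeightForWidth(double pageWidth, string scriptReturn)
    {
        // Invariant culture: JavaScript always uses '.' as the decimal separator.
        double aspectRatio = double.Parse(scriptReturn, CultureInfo.InvariantCulture);
        if (!(aspectRatio > 0))
            throw new ArgumentException("Aspect ratio must be positive.", nameof(scriptReturn));
        return pageWidth / aspectRatio;
    }
}

class Program
{
    static void Main()
    {
        // A page twice as tall as it is wide, fitted to a 612 point wide PDF page.
        Console.WriteLine(PageSizing.HeightForWidth(612, "0.5")); // prints 1224
    }
}
```

The resulting height could then be used when setting the PDF page size so that the whole web page fits on a single PDF page.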

Tagging Content

Suppose your HTML to PDF C# conversion is not the end of the story - your HTML page is not the same as the output you need? For example suppose it contains placeholder signature images. In your output PDF you want to identify these and replace them with signature fields so that the final document can be signed.

In this case what you need to do is to tag the elements you are interested in and then work out where they appear in the converted PDF. To do this you can use the ABCpdf tagged area functionality.

The first thing to do is to mark up the images in your HTML so that ABCpdf knows that they need to be tagged. To do this you use a special 'abcpdf-tag-visible' attribute. Something like the following,

<img src="../img/sig.jpg" id="sig1" abcpdf-tag-visible="" />

Then after the HTML has been converted to PDF you just need to go through the document, identifying the tagged items and inserting Signature Fields.

Doc doc = new Doc();
doc.HtmlOptions.AddTags = true;
int id = doc.AddImageUrl(url);
string[] ids = doc.HtmlOptions.GetTagIDs(id);
XRect[] rects = doc.HtmlOptions.GetTagRects(id);
for (int i = 0; i < ids.Length; ++i) {
    // get id from ids[i] and rect from rects[i]
    // add signature field - see ABCpdf docs for details
}
...

Adding Form Fields

A common request is for the conversion of an HTML form to a PDF form. By default ABCpdf .NET does not convert form fields into active PDF fields: they will look like form fields but you cannot enter data into them.

However simply by setting one property you can automatically convert your HTML fields into PDF fields.

Doc doc = new Doc();
doc.HtmlOptions.AddForms = true;
int id = doc.AddImageUrl(url);
...

The structure of your form will be preserved as far as possible but there are limitations arising from the differences between HTML and PDF. So the HTML to PDF C# conversion will convert your fields the way you want but you should not expect that your JavaScript validation will be preserved.

Cracking down on Security

ABCpdf .NET contains a special custom security technology called FireShield, which allows you to dynamically assign file permissions at runtime. This ensures that the HTML engine can only access the locations you want.

We put this in place because browsers are complicated and can contain bugs with security implications. As such we felt it was important to put ABCChrome inside our FireShield sandbox which tightly controls access to the outside world.

The default rights are sufficient for most people. However if you are running a sensitive operation you may wish to provide rules so that each request can be assigned custom rights and permissions, dependent on the type of conversion.

For example Chrome will often attempt to load sound drivers from the Windows folder. We normally deny this but if you wish to allow it you can do so using the following code.

Doc doc = new Doc();
doc.HtmlOptions.Engine = EngineType.Chrome;
doc.HtmlOptions.FireShield.Rules.Add(
    new XHtmlFireShieldRule(@"C:\Windows\", "*.drv", XHtmlFireShieldRule.AccessType.Allow));
int id = doc.AddImageUrl(url);
...

Let's Call it a Wrap!

 

Well, that covers all the common gotchas that people find when converting HTML to PDF in C#, along with all the solutions you need to get past them.

If you are using ABCpdf .NET this should provide you with the basics you need to get going. The C# snippets included here should be sufficient as starter code. For more examples related to HTML to PDF C# conversion, see the documentation which comes with ABCpdf .NET.

If you are not using ABCpdf then consider if any of the common problems listed above are going to apply to your project. If they are then this would be a good time to ensure that your chosen component has an equivalent solution.

Bear in mind that with some nineteen years of experience here we are well placed to offer definitive advice on all aspects of this type of solution.

Should you feel that perhaps you would like to download a copy of ABCpdf .NET - Welcome to the party!




