Intro
 

ABCpdf fully supports HTML and CSS.

You can render individual pages of HTML using the AddImageUrl method.

You can page HTML over multiple PDF pages using the AddImageUrl method in combination with the AddImageToChain method.

ABCpdf allows you to treat HTML like any other media so you can even page your HTML across multiple columns of multiple pages of your PDF.

 

   

Method
 

ABCpdf can use the Chromium engine (as used in Google Chrome), the MSHTML engine (as used in Microsoft Internet Explorer) or the ABCGecko engine (as used in Mozilla Firefox) to parse and preprocess the HTML for insertion into your PDF. To switch between them, simply change the XHtmlOptions.Engine property. It's only one line of code.

The use of sandboxed web browsers provides an extremely accurate rendition of the HTML. However due to the differences in behavior and capabilities of the underlying rendering engine, you should expect some differences in behavior and output when switching between them.

Using Chrome 59 in print preview, you should see something similar to the output you get using the ABCChrome engine. However because we default to screen rather than print media stylesheets, colors may be more like you see on screen. The view you see of your HTML using the IE version on the machine should be pretty similar to the MSHTML engine on that same machine. Firefox 38 in print preview should be pretty much identical to the ABCGecko engine. None will be exactly the same because PDF is different from HTML and some things need to be represented slightly differently. However any differences should be subtle.

The ABCChrome engine is typically about three times as fast as the MSHTML engine, which itself typically about twice the speed of ABCGecko. However this does depend heavily on the type of input. For some inputs we have seen the ABCGecko engine five times the speed of the MSHTML one. The ABCChrome and Gecko engines have a fairly large startup overhead so if you are testing for speed you should be careful to discount the first conversion.

In terms of commonly requested features, the ABCChrome and Gecko engines support repeated table headers and footers, SVG, and are pretty good on AJAX for charts and maps. The MSHTML engine tends to provide a more literal screen based output. ABCChrome and ABCGecko generally support HTML5 specific features better than MSHTML. ABCChrome supports web fonts while the other two engines do not.

You should be expecting that there are differences in PDF representation between the engines and in particular the decision to reference or embed fonts can be quite different. File sizes are likely to be noticeably but not startlingly different. The MSHTML tends to produce the largest outputs, followed by ABCGecko and with ABCChrome producing the smallest files.

Please refer to Engine, ForChrome, ForMSHtml, and ForGecko for further elaborations on the engines' distinct characteristics.

 

   

Cache
 

ABCpdf holds a cache of recently requested URLs and it's only after five minutes or so that these pages expire from the cache.

This results in a considerable degree of optimization for many common operations. However, if you wish to bypass the cache, you can do so by setting the DisableCache parameter to true when you call AddImageUrl or AddImageHtml.

Occasionally, you may find that your page is being cached elsewhere. There are all kinds of places this can happen. For example, Windows sometimes caches individual page resources. Proxy servers may cache entire pages.

The standard reason that content gets cached is that pages are sending HTTP header information which indicates that it is acceptable to cache this content. If you are using the Internet Explorer HTML engine, sometimes it will insist to cache certain Web pages. In that case, your first step should be to use a tool like IEWatch to view the content expiration headers. Indeed, you may find that simply adjusting the content expiration settings found in the IIS Management Console will resolve the issue.

If you want to be totally sure that your URLs are rendered afresh each time, you need to vary the URL. For example:

http://www.microsoft.com/?dummy=1
http://www.microsoft.com/?dummy=2
http://www.microsoft.com/?dummy=3

These will all render the same page (www.microsoft.com) but because the URL is varying, you can be sure that they will be rendered afresh each time.

 

   

Speed
 
 

Obvious things will impact the speed of HTML conversion. So if you want to optimize the process look at retrieval times for your http requests, the size of your HTML and any related resources, the complexity of HTML, the speed of your computer. Tweaking these can make a big difference.

However there are also some small and simple things you may be able to do without getting into the complexity of system wide optimization.

The MSHTML rendering engine is, by default, set up for accuracy and quality. In order to ensure that the output is always good we have to enable every setting that might ever affect the output quality. This is the case even for situations in which you are not using those features.

So if your HTML does not contain features which require these settings then you can disable them. Doing so can result in significant speed improvements.

The setting which typically makes the biggest difference is HostWebBrowser but DoMarkup and AdjustLayout are also worth looking at. The actual speed increases depend very much on the input HTML, but in our tests, disabling these features for simple HTML, increased the speed of processing by about 30% for HostWebBrowser, another 10% for the DoMarkup property and another 7% for the AdjustLayout property.

Another property which needs examination is the UseScript one. By default this is set to false but many people enable it in their ABCpdf code. As long as your JavaScript is good and sensible then there is no problem. However JavaScript is often coded poorly and as such it may have an unpredictable effect on speed. Consider disabling this feature if you do not actively need it.

Setting the BrowserWidth to a predefined value means that ABCpdf does not have to compute one. This can result in an increase of speed or perhaps 10% or so.

By default the ABCpdf MSHTML engine operates a two stage process for obtaining web pages. It starts with the most efficient method (EngineTransfer) and, in the event that security prohibits this, falls back to a less demanding but slower method (NetWebRequest). If you are operating in a restricted permission environment such as ASP.NET you are unlikely to notice this fallback and thus the potential speed impact. By setting the TransferModule to EngineTransfer you can force ABCpdf to use the faster method and if permissions prohibit this then an error will be raised. At that point you can tackle the permissions issue to ensure that the faster method is used. See the TransferModule property for details.

 

   

Caveats
 
 

You can render any page you can supply a URL for.

When you render a page the page has to be reloaded by ABCpdf. This is because you - as a client - are looking at the page from your current machine. ABCpdf lives on the server and so it exists in a different session.

So, you cannot generally rely on cookies, session state or form submission in your page. The page must be reliant only on the URL you supply.

If you have to rely on session state, you could use cookie-less sessions (which will give you a URL for your session) or you could save the session information under a specific unique ID then pass the ID via the URL and pick up the information via your server-side code.

Problems which appear to be related to SSL or HTTPS connections are often authentication issues simply solved by providing a user name and password. See the LogonName property for details. If that is not convenient then, for situations in which the HTTPS resource is on the same machine as ABCpdf, it is often possible to access the files directly using a 'file://' protocol URL - bypassing SSL issues can be simpler than solving them. In addition, people often tweak security without considering the consequences, so if you have to solve an SSL problem there is always the possibility that someone might unsolve the problem in the future.

 

   

Size
 
 

Screen resolution is typically 96 DPI. So, when you view an HTML page on your monitor, Windows will display it at 96 DPI.

The disparity between the screen resolution and the PDF 72 DPI resolution means that HTML appears larger in print documents than it does on screen.

You will need to apply a scale of 72/96 (0.75) to compensate for this if you want both to appear the same size.

For example, if you are rendering a web page supplying a value of 800 for the Width parameter, you will need to set the width of your Rect to 600 if you want both to appear the same size.

 

   

DPI
 
 

PDF documents are predominantly vector based. As such, they do not really have a DPI because they are resolution independent. The only portions of PDFs which are raster based are images.

Most elements of HTML - text, lines - are vector based. So, they are resolution independent.

The resolution at which images in your web pages are rendered is complicated. Suppose you have a 300 square image referenced by an image tag. If the width of your Doc.Rect is the same as the width you pass to AddImageUrl, this will be rendered at 72 DPI. However, by changing the ratios between these two values, the image will be scaled and hence the resolution will be changed.

And... if your 300 square is in an img tag with a width and height of 150, the default resolution will be doubled.

 

   

Breaks
 
 

ABCpdf uses a sophisticated set of heuristics to determine where to break pages. For greater control over page breaking, you can use the page-break-before, page-break-after and page-break-inside CSS styles.

You must ensure that the element to which you apply your page breaking style is visible. For example:

<div style="page-break-before:always">&nbsp;</div>

... will break but ...

<div style="page-break-before:always"></div>

... will not.

Useful Tip. Debugging page break styles.

Sometimes, your page breaks don't work in they way you think they should. Because these kinds of tags are invisible, it's very difficult for you to know whether you've applied them correctly or not. One simple solution is to debug your HTML using a visible style.

For example, when you apply your "page-break-inside: avoid" style, apply a right border style at the same time. That way, you can see exactly where your elements are. If the borders don't appear in the right places, then you know there's something wrong with your HTML.

The page break styles in the Gecko engine are not always applied as intuitively as they are in MSHTML. The root of this is the CSS specification that which says that break styles must be applicable to block-level elements within the "normal flow of the root element". It allows for these styles to be applied to other elements but does not mandate it.

The upshot of this, within the Gecko engine, is that page break styles cannot be applied within tables, to elements such as table rows. If you are unsure about whether something is likely to work just try Print Preview from within Firefox 38.0 as a simple sanity check.

 

   

Snapshot
 
 

You may wish to take a snapshot of the current URL.

In many circumstances, you should be able to derive a URL for the current page using the value of the SERVER_NAME, URL and QUERY_STRING Server Variables. You should be able to derive a URL for the previous page using the HTTP_REFERER (sic) Server Variable.

Alternatively, you can obtain the HTML of the current page using the HttpResponse.Filter property or by overriding the Render method of the page. You can then present this HTML to ABCpdf using AddImageHtml. If your HTML references resources using relative references, you may wish to insert a <BASE> tag into the HTML before presentation to ABCpdf. If your server is serving one site and you are rendering a URL on it, something as simple as <BASE HREF="http://localhost/"> is often all you need.

When you perform this kind of operation, be careful not to recursively call ABCpdf. If you do this, you will get into a hall-of-mirrors type situation and the software will not be able to return you a sensible image.

 

   

Security
 
 

In previous versions of ABCpdf, JavaScript was disabled by default and had to be manually enabled. However JavaScript has become critical to the display of most sites and From Version 11 onwards, it is enabled by default. See the HtmlOptions.UseScript property for details.

If you are working with trusted web pages or HTML - your web pages and your HTML - then there is little to worry about in terms of security.

However if you allow untrusted clients to provide you with web URLs or raw HTML then you need to consider if a malicious user could leverage specially crafted pages, to create effects you had not considered. The two core ways in which this may happen are related to redirection and base tags.

Each web engine operates inside a sandbox which limits the access that web pages have. However the extent to which the web pages are limited, is largely determined by where they are and what they wish to access. In general web pages are tied into the local site from which they come. So any attempt to move outside this location will normally be denied. You can read more about this on the web by searching for details of the same-origin policy and how it is used to prevent cross-site scripting.

Redirection is much more relevant when JavaScript is enabled as this is the primary mechanism by which it is accomplished. There are other mechanisms such as meta-refresh tags, but they are uncommon and fragile. If you pass a URL to ABCpdf, it cannot tell the difference between intended redirection and malicious redirection. So it is important to ensure that the types of redirection that are allowed is controlled.

Base tags are similar but do not use JavaScript - they allow a web page in one location to reference resources in another. Most typically this is used to allow page on one site to reference images or style sheets on a different web site. Obviously to strictly enforce a same-origin policy here would rather defeat the concept of the base tag. However there are security measures in place to limit the extent to which they can be manipulated.

The main thing to consider is that the origin of files on the local machine is the local machine. So, while the origin of web pages strictly controls the access they can have, the origin of files inherently limits the level of control that is possible. Allowing untrusted users to provide web page URLs provides limited exposure. Allowing them to provide files to be be passed directly to AddImageUrl via a 'file:///' protocol URL, provides much more exposure.

As an alternative, if you have files containing HTML, read the content and pass it in via the AddImageHtml method. Both the ABCChrome and Gecko engines have barriers in place to limit the access of web pages which are provided this way. The origin of the HTML is not the local machine and a higher level of defense is automatically in place. However it may also mean that images on your local machine, referenced in your HTML, will not be able to be loaded. See table below for details.

The ABCChrome FireScreen allows you to specify dynamic permissions associated with file access. If you need to work with files on the local machine, this is an excellent way of ensuring that the HTML render process has access only to those files that it requires. However this is a feature specific to the ABCChrome engine. It does not exist for the ABCGecko or MSHTML engines.

The file permissions associated with a web server process, provide yet another layer of defense. But it is best not to to have more layers rather than fewer - try to avoid providing untrusted users with mechanisms to allow them to insert unvetted HTML or JavaScript.

There are some variations between engines and what they allow. See below for details.

ABChrome
ABCGecko
Mshtml
Redirection from one page to another within the current web site
Allowed
Allowed
Allowed
Redirection from a page on one web site to a page on a different one
Denied
Denied
Denied
Redirection from a page on a web site to a local file
Denied
Denied
Denied
Redirection from one local file to another local file
Allowed
Allowed
Allowed
Redirection from a local file to a page on a web site
Denied
Allowed
Denied
Base tag in a local file referencing resources in a local directory
Allowed
Allowed
Allowed
Base tag in a local file referencing resources on a remote web site
Allowed
Allowed
Allowed
Base tag in a page on a remote web site referencing resources in a local directory
Denied
Denied
Allowed
Image in a local directory, referenced in HTML, provided via AddImageHtml method.
Denied
Denied
Allowed
Image in a local directory, referenced in HTML, provided via AddImageUrl and a 'file:///' scheme URL.
Allowed*
Denied
Allowed

* If you enable the FireShield engine, the default rules will deny access. Access can be re-enabled by adding FireShield rules to explicitly allow it.