Simple HTML to PDF conversion

Formatted content, introduced in XFINIUM.PDF 4.4, makes it possible to implement simple HTML to PDF conversion with the XFINIUM.PDF library.

Formatted content lets you create complex text layouts on a PDF page by combining paragraphs, text blocks with various fonts and colors, links and bullet lists. However, creating a complex layout can require a lot of code.
Wouldn’t it be simpler to describe the content using a markup language such as HTML?

This article shows how to parse an HTML fragment (actually XHTML since it uses the XML parser included in .NET), create the corresponding formatted content objects and draw them on the page. The sample implements only a few HTML tags for basic text formatting, but more tags can be added (full HTML to PDF conversion is not possible because not all HTML tags can be translated into formatted content objects).

The main method of the sample is
public PdfFixedDocument Convert(Stream html)
which takes the HTML in the given stream and converts it to a PdfFixedDocument object.

This method has two parts: the conversion of the HTML content to a PdfFormattedContent object, and the rendering of that object on the document’s pages.

public PdfFixedDocument Convert(Stream html)
{
	PdfFixedDocument document = new PdfFixedDocument();

	PdfFormattedContent fc = ConvertHtmlToFormattedContent(html);
	DrawFormattedContent(document, fc);

	return document;
}

The same method in VB.NET:

Public Function Convert(html As Stream) As PdfFixedDocument
	Dim document As New PdfFixedDocument()

	Dim fc As PdfFormattedContent = ConvertHtmlToFormattedContent(html)
	DrawFormattedContent(document, fc)

	Return document
End Function

The ConvertHtmlToFormattedContent method uses the XmlReader class to parse the HTML content. For each supported tag, the corresponding objects are created or properties are set. A stack of fonts and colors keeps track of the current font and color. The sample supports the tags p, font, a, b, strong, i, em, u, ul and li, but it can be extended with other tags (h1, h2, code, span, etc.).
The source code of this method is too long to post here, but the sample project is available for download.
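
To give a rough idea of the stack technique without reproducing the sample, here is a minimal sketch that uses only .NET classes. It is not the code from the sample project: StyledRun, Style and HtmlStyleWalker are hypothetical names, the style is reduced to bold, italic and a color name, and a real implementation would create XFINIUM.PDF formatted content objects instead of StyledRun values.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical stand-in for a formatted text block: the text plus the style in effect.
public sealed record StyledRun(string Text, bool Bold, bool Italic, string Color);

public static class HtmlStyleWalker
{
    // The style that is current at one nesting level of the markup.
    private sealed record Style(bool Bold, bool Italic, string Color);

    public static List<StyledRun> Parse(TextReader xhtml)
    {
        var runs = new List<StyledRun>();
        var styles = new Stack<Style>();
        styles.Push(new Style(false, false, "black")); // default style

        // Fragment conformance allows input without a single root element.
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        using XmlReader reader = XmlReader.Create(xhtml, settings);
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    // Empty elements have no EndElement node, so nothing is pushed for them.
                    if (reader.IsEmptyElement) break;
                    Style current = styles.Peek();
                    switch (reader.Name.ToLowerInvariant())
                    {
                        case "b":
                        case "strong":
                            current = current with { Bold = true };
                            break;
                        case "i":
                        case "em":
                            current = current with { Italic = true };
                            break;
                        case "font":
                            current = current with { Color = reader.GetAttribute("color") ?? current.Color };
                            break;
                    }
                    // Every opening tag pushes a style, even unsupported ones,
                    // so that every closing tag can simply pop.
                    styles.Push(current);
                    break;

                case XmlNodeType.EndElement:
                    styles.Pop(); // restore the style of the enclosing element
                    break;

                case XmlNodeType.Text:
                    Style s = styles.Peek();
                    runs.Add(new StyledRun(reader.Value, s.Bold, s.Italic, s.Color));
                    break;
            }
        }
        return runs;
    }
}
```

For the input `<p>Hello <b>bold <i>both</i></b></p>`, Parse returns three runs: "Hello " with the default style, "bold " in bold, and "both" in bold italic. Pushing on every opening tag and popping on every closing tag is what lets nested tags inherit and then restore the enclosing style.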

The DrawFormattedContent method splits the formatted content into page-sized fragments and draws each fragment on its own page.

private void DrawFormattedContent(PdfFixedDocument document, PdfFormattedContent fc)
{
	// Page margins in points (36 points = half an inch).
	double leftMargin, topMargin, rightMargin, bottomMargin;
	leftMargin = topMargin = rightMargin = bottomMargin = 36;

	PdfPage page = document.Pages.Add();
	// Extract the part of the content that fits inside the page margins.
	PdfFormattedContent fragment = fc.SplitByBox(page.Width - leftMargin - rightMargin, page.Height - topMargin - bottomMargin);
	while (fragment != null)
	{
		// Draw the extracted part inside the margins of the current page.
		page.Graphics.DrawFormattedContent(fragment, 
			leftMargin, topMargin, page.Width - leftMargin - rightMargin, page.Height - topMargin - bottomMargin);
		page.Graphics.CompressAndClose();

		// Extract the next part; a new page is added only if there is content left to draw.
		fragment = fc.SplitByBox(page.Width - leftMargin - rightMargin, page.Height - topMargin - bottomMargin);
		if (fragment != null)
		{
			page = document.Pages.Add();
		}
	}
}

The same method in VB.NET:

Private Sub DrawFormattedContent(document As PdfFixedDocument, fc As PdfFormattedContent)
	' Page margins in points (36 points = half an inch).
	Dim leftMargin As Double = 36
	Dim topMargin As Double = 36
	Dim rightMargin As Double = 36
	Dim bottomMargin As Double = 36

	Dim page As PdfPage = document.Pages.Add()
	' Extract the part of the content that fits inside the page margins.
	Dim fragment As PdfFormattedContent = fc.SplitByBox(page.Width - leftMargin - rightMargin, page.Height - topMargin - bottomMargin)
	While fragment IsNot Nothing
		' Draw the extracted part inside the margins of the current page.
		page.Graphics.DrawFormattedContent(fragment, leftMargin, topMargin, page.Width - leftMargin - rightMargin, page.Height - topMargin - bottomMargin)
		page.Graphics.CompressAndClose()

		' Extract the next part; a new page is added only if there is content left to draw.
		fragment = fc.SplitByBox(page.Width - leftMargin - rightMargin, page.Height - topMargin - bottomMargin)
		If fragment IsNot Nothing Then
			page = document.Pages.Add()
		End If
	End While
End Sub

The page margins are set to 36 points, which is half an inch. The part of the formatted content that fits the given box is extracted from the initial content and drawn on the page. The procedure is repeated until no more formatted content is available.
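
The extract-and-repeat procedure can be modelled without the library. The sketch below is only an illustration under simplifying assumptions: the content is reduced to a list of line heights in points, and PagingSketch, TryTakeFitting and Paginate are hypothetical names that are not part of XFINIUM.PDF. TryTakeFitting loosely plays the role that SplitByBox plays in the sample.

```csharp
using System;
using System.Collections.Generic;

public static class PagingSketch
{
    // Removes the leading lines that fit in boxHeight from the queue and returns them.
    // Returns false when no content is left.
    public static bool TryTakeFitting(Queue<double> lineHeights, double boxHeight, out List<double> fragment)
    {
        fragment = new List<double>();
        if (lineHeights.Count == 0) return false;

        double used = 0;
        while (lineHeights.Count > 0 && used + lineHeights.Peek() <= boxHeight)
        {
            used += lineHeights.Peek();
            fragment.Add(lineHeights.Dequeue());
        }

        // A line taller than the box can never fit; fail instead of looping forever.
        if (fragment.Count == 0)
            throw new ArgumentException("A line is taller than the box.");
        return true;
    }

    // Splits the lines into pages, using the same margin at the top and the bottom.
    public static List<List<double>> Paginate(IEnumerable<double> lineHeights, double pageHeight, double margin)
    {
        double boxHeight = pageHeight - 2 * margin;
        var remaining = new Queue<double>(lineHeights);
        var pages = new List<List<double>>();

        // Each iteration corresponds to one page: take what fits, "draw" it, repeat.
        while (TryTakeFitting(remaining, boxHeight, out List<double> fragment))
        {
            pages.Add(fragment);
        }
        return pages;
    }
}
```

Assuming a page height of 792 points (US Letter) and 36-point margins, the box is 792 - 2 * 36 = 720 points high. With 100 lines of 16 points each, Paginate returns three pages holding 45, 45 and 10 lines.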

The full sample project can be downloaded here. It is a Windows console application, but the SimpleHtmlToPdf.cs file, which contains all the conversion logic, can be compiled on any supported platform.

25 thoughts on “Simple HTML to PDF conversion”

  1. Trying this on Xamarin for Android. There seems to be a problem in the SplitByBox method. The height I give the method seems to split too soon…if I multiply the height by a factor of 1.8 it appears to work.

    1. Please send us a sample project. It will help us investigate the problem because it depends very much on the HTML text you use and the values for the split box.

  2. I want to try rendering an HTML table, but I can’t understand how to add lines to a PdfFormattedContent object. Are there any examples available?

    1. The PdfFormattedContent object cannot draw lines. In theory you would have to handle each cell as a PdfFormattedContent object and draw each one separately. Support for tables will be available during the following months.

    1. The code shown in the article also works in Xamarin.Forms, the XFINIUM.PDF API is the same across all supported platforms.
      The article shows how to implement conversion of simple HTML tags to PDF, it is not intended to convert any HTML page to PDF.

  3. hello,

    I have to draw a long string on my PdfPage. Then I also have to draw a box around this text. My problem is: when I use PdfFormattedTextBlock & PdfFormattedParagraph to draw the text (by setting the right font and color), the method .SplitByBox() does not work. The text is truncated on the screen.
    Here is my code:

    //I have a PdfFixedDocument and a PdfPage added to that document
    //PdfFixedDocument pdfDoc, PdfPage currentPage

    var fc = new PdfFormattedContent();
    var paragraph = new PdfFormattedParagraph ();
    fc.Paragraphs.Add (paragraph);

    //add textblock
    var textFont = new PdfStandardFont();
    textFont.Size = 20;

    string text = "a very long string here …";

    var textBlock = new PdfFormattedTextBlock (text, textFont);
    paragraph.Blocks.Add (textBlock); //add textblock to paragraph

    PdfFormattedContent fragment = fc.SplitByBox(300,20); //here, the fragment is not null but fragment.Paragraphs is empty

    //display the first fragment (just for testing).
    currentPage.Graphics.DrawFormattedContent (fragment, 40, 20); //I see nothing in the pdf file.

      1. I don’t know how to upload file to your site. I have the .zip of my sample project. Its size is < 200KB
        Or do you have any email to receive this file?

  4. Hi,
    I use the SplitByBox method to split the formatted content on to several pages. Is it possible to get some lines to stay on the same page? I have name on one line and title on the next, and I don’t want these lines to split on different pages.

    I now create one paragraph with one textblock inside for both name and title and add the paragraphs to the formatted content.

  5. Thank you for your answer!

    I have another problem. I want to save my pdf-document as PdfAFormat.PdfA1b. When I do, I get the message “{“Page 0: Page content uses CMYK colors but the document Output Profile is not set to CMYK.”}”. I tried to change to Rgb, but I got the same exception.

    How/where can I set the output profile for my document? I have tried the example code I found here: http://www.xfiniumpdf.com/samples/xfinium-pdf-samples-explorer-aspnet-mvc/ pdf/a, but Adobe will not open the generated document, so something must be wrong.

    1. The PDF/A sample shows how to set an output profile on a document. The profile used in the sample is RGB, so you also have to use RGB colors in the document. If you could send us (support@xfiniumpdf.com) a sample project that we can run, it would help us identify the problem and give you a solution.

    1. The code in the article is simple and it uses only the PdfFormattedContent object, which does not support tables. We’re working to update the code to use the FlowDocument API, which supports a more flexible layout including tables.

  6. Is there a chance for a simple Convert method which will take HTML (including CSS, tables, images and anything possible in HTML) and produce a nice PDF document? Right now conversion is limited to simple tags.

    1. The conversion code is provided as source code so that it can be extended as needed. We plan to update it in the future to support tables and other tags but full HTML to PDF conversion is a long road.
