Extract text and images from PDF pages

Extracting text and images from PDF pages for additional processing is a common requirement in many software projects. The XFINIUM.PDF library can extract text, images and vector graphics from PDF files at various levels, from low-level PDF operators to high-level visual objects.

The main class for extracting text, images and vector graphics from a PDF page is the PdfContentExtractor class. The page from which the content is extracted is passed as a parameter to the PdfContentExtractor constructor.
The following methods for extracting content are available:

ExtractText

public string ExtractText(PdfContentExtractionContext context)

Extracts the text from a PDF page as a string object.
The context parameter affects performance when extracting text from multiple pages of the same document. It acts as a cache for objects shared between pages, which speeds up the extraction process.

The code below extracts the text from a PDF file:
C#:

PdfFixedDocument doc = new PdfFixedDocument("sample.pdf");
PdfContentExtractionContext ctx = new PdfContentExtractionContext();
for (int i = 0; i < doc.Pages.Count; i++)
{
    PdfContentExtractor ce = new PdfContentExtractor(doc.Pages[i]);
    string pageText = ce.ExtractText(ctx);
}

VB.NET:

Dim doc As New PdfFixedDocument("sample.pdf")
Dim ctx As New PdfContentExtractionContext()
For i As Integer = 0 To doc.Pages.Count - 1
    Dim ce As New PdfContentExtractor(doc.Pages(i))
    Dim pageText As String = ce.ExtractText(ctx)
Next

ExtractTextFragments

public PdfTextFragmentCollection ExtractTextFragments(PdfContentExtractionContext context)

Extracts the text from a PDF page as a collection of text fragment objects. A text fragment is a piece of text painted by a single ‘showtext’ operator. The text can be a letter, a word or an entire phrase, depending on the application that generated the PDF file.
A text fragment object includes the following information: the text being shown, the name of the font used to display the text, the font size, the positions of the fragment’s 4 corners (the fragment can be rotated and skewed, so it cannot always be represented as a rectangle), the pen and brush used to style the text, and a collection of glyphs describing each glyph that composes the text.
The context parameter affects performance when extracting text fragments from multiple pages of the same document. It acts as a cache for objects shared between pages, which speeds up the extraction process.

The code below shows how to extract the text fragments from a page and highlight them:

C#:

PdfRgbColor penColor = new PdfRgbColor();
PdfPen pen = new PdfPen(penColor, 0.5);
Random rnd = new Random();
byte[] rgb = new byte[3];

PdfFixedDocument document = new PdfFixedDocument("sample.pdf");
PdfContentExtractor ce = new PdfContentExtractor(document.Pages[0]);
PdfTextFragmentCollection tfc = ce.ExtractTextFragments();
for (int i = 0; i < tfc.Count; i++)
{
    rnd.NextBytes(rgb);
    penColor.R = rgb[0];
    penColor.G = rgb[1];
    penColor.B = rgb[2];

    PdfPath boundingPath = new PdfPath();
    boundingPath.StartSubpath(tfc[i].FragmentCorners[0].X, tfc[i].FragmentCorners[0].Y);
    boundingPath.AddLineTo(tfc[i].FragmentCorners[1].X, tfc[i].FragmentCorners[1].Y);
    boundingPath.AddLineTo(tfc[i].FragmentCorners[2].X, tfc[i].FragmentCorners[2].Y);
    boundingPath.AddLineTo(tfc[i].FragmentCorners[3].X, tfc[i].FragmentCorners[3].Y);
    boundingPath.CloseSubpath();

    document.Pages[0].Graphics.DrawPath(pen, boundingPath);
}
document.Save("sample-updated.pdf");

 

VB.NET:

Dim penColor As New PdfRgbColor()
Dim pen As New PdfPen(penColor, 0.5)
Dim rnd As New Random()
Dim rgb As Byte() = New Byte(2) {}

Dim document As New PdfFixedDocument("sample.pdf")
Dim ce As New PdfContentExtractor(document.Pages(0))
Dim tfc As PdfTextFragmentCollection = ce.ExtractTextFragments()
For i As Integer = 0 To tfc.Count - 1
    rnd.NextBytes(rgb)
    penColor.R = rgb(0)
    penColor.G = rgb(1)
    penColor.B = rgb(2)

    Dim boundingPath As New PdfPath()
    boundingPath.StartSubpath(tfc(i).FragmentCorners(0).X, tfc(i).FragmentCorners(0).Y)
    boundingPath.AddLineTo(tfc(i).FragmentCorners(1).X, tfc(i).FragmentCorners(1).Y)
    boundingPath.AddLineTo(tfc(i).FragmentCorners(2).X, tfc(i).FragmentCorners(2).Y)
    boundingPath.AddLineTo(tfc(i).FragmentCorners(3).X, tfc(i).FragmentCorners(3).Y)
    boundingPath.CloseSubpath()

    document.Pages(0).Graphics.DrawPath(pen, boundingPath)
Next
document.Save("sample-updated.pdf")
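
Because a fragment is described by four corners rather than by a rectangle, code that needs a plain rectangle (for hit testing or cropping, for example) has to derive one. The following is a minimal sketch of computing the axis-aligned bounding box of four corner points. The Corner struct and the QuadBounds class are hypothetical helpers introduced only for this illustration and are not part of XFINIUM.PDF; in real code the coordinates would be copied from the X and Y values of the FragmentCorners entries used above.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for a corner point; not a library type.
public struct Corner
{
    public double X;
    public double Y;
    public Corner(double x, double y) { X = x; Y = y; }
}

public static class QuadBounds
{
    // Returns the smallest axis-aligned rectangle that contains all four corners,
    // expressed as its minimum X, minimum Y, width and height.
    public static (double MinX, double MinY, double Width, double Height) FromCorners(Corner[] corners)
    {
        if (corners == null || corners.Length != 4)
            throw new ArgumentException("Exactly four corners are required.", nameof(corners));

        double minX = corners.Min(c => c.X);
        double maxX = corners.Max(c => c.X);
        double minY = corners.Min(c => c.Y);
        double maxY = corners.Max(c => c.Y);
        return (minX, minY, maxX - minX, maxY - minY);
    }
}
```

For a rotated fragment the resulting rectangle is larger than the fragment itself, so it is suitable for coarse checks but not for drawing an exact outline; the path-based approach shown above remains the accurate way to outline the fragment.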

ExtractWords

public PdfTextWordCollection ExtractWords(PdfContentExtractionContext context)

Extracts the text from a PDF page as a collection of word objects. Each word object consists of the text representing the word and a collection of text fragments that are combined together to create the word.
The context parameter affects performance when extracting words from multiple pages of the same document. It acts as a cache for objects shared between pages, which speeds up the extraction process.
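
As a rough illustration, the sketch below extracts the words of every page while reusing a single context object, following the same pattern as the ExtractText example. It relies on assumptions that the text above does not confirm: the namespace names in the using directives, and that PdfTextWordCollection exposes a Count property like the other collections shown on this page. The class name ExtractWordsSketch is made up for the example.

```csharp
using System;
using Xfinium.Pdf;          // assumed namespace for PdfFixedDocument
using Xfinium.Pdf.Content;  // assumed namespace for the extraction classes

class ExtractWordsSketch
{
    static void Main()
    {
        PdfFixedDocument doc = new PdfFixedDocument("sample.pdf");
        // One context shared by all pages so cached objects can be reused.
        PdfContentExtractionContext ctx = new PdfContentExtractionContext();
        for (int i = 0; i < doc.Pages.Count; i++)
        {
            PdfContentExtractor ce = new PdfContentExtractor(doc.Pages[i]);
            PdfTextWordCollection words = ce.ExtractWords(ctx);
            // Count is an assumption; check the API reference for the exact members.
            Console.WriteLine("Page {0}: {1} words", i + 1, words.Count);
        }
    }
}
```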

ExtractImages

public PdfVisualImageCollection ExtractImages(bool includeImageData)

Extracts the images from a PDF page. The method parses the page content and returns a collection of visual images, where each visual image represents one drawing instance of an image object. For example, if a page contains a single image object in its resources but that image is drawn 5 times on the page, the method returns a collection of 5 visual image objects. Each visual image object specifies the position of the image’s 4 corners (the image can be rotated and skewed, so it cannot always be represented as a rectangle), its vertical and horizontal resolution, the image size in pixels, the image colorspace and the bits per component.
The includeImageData parameter specifies how the image data should be handled. If true, the images are decoded and the actual image data is included in the image object, but the method takes longer to complete. If false, the images are not decoded and the method executes faster.
If you need to save the images to external storage, set this parameter to true. If you need only information about the image, such as position on the page, size and resolution, set this parameter to false.

The code below shows how to extract information about the images displayed on a page:

C#:

PdfPen pen = new PdfPen(new PdfRgbColor(255, 0, 192), 0.5);
PdfBrush brush = new PdfBrush(new PdfRgbColor(0, 0, 0));
PdfStandardFont helvetica = new PdfStandardFont(PdfStandardFontFace.Helvetica, 8);
PdfStringAppearanceOptions sao = new PdfStringAppearanceOptions();
sao.Brush = brush;
sao.Font = helvetica;
PdfStringLayoutOptions slo = new PdfStringLayoutOptions();
slo.Width = 1000;

PdfFixedDocument document = new PdfFixedDocument("sample.pdf");
PdfContentExtractor ce = new PdfContentExtractor(document.Pages[0]);
PdfVisualImageCollection eic = ce.ExtractImages(false);
for (int i = 0; i < eic.Count; i++)
{
    string imageProperties = string.Format("Image ID: {0}\nPixel width: {1} pixels\nPixel height: {2} pixels\n" +
        "Display width: {3} points\nDisplay height: {4} points\nHorizontal Resolution: {5} dpi\nVertical Resolution: {6} dpi",
        eic[i].ImageID, eic[i].Width, eic[i].Height, eic[i].DisplayWidth, eic[i].DisplayHeight, eic[i].DpiX, eic[i].DpiY);

    PdfPath boundingPath = new PdfPath();
    boundingPath.StartSubpath(eic[i].ImageCorners[0].X, eic[i].ImageCorners[0].Y);
    boundingPath.AddLineTo(eic[i].ImageCorners[1].X, eic[i].ImageCorners[1].Y);
    boundingPath.AddLineTo(eic[i].ImageCorners[2].X, eic[i].ImageCorners[2].Y);
    boundingPath.AddLineTo(eic[i].ImageCorners[3].X, eic[i].ImageCorners[3].Y);
    boundingPath.CloseSubpath();

    document.Pages[0].Graphics.DrawPath(pen, boundingPath);
    slo.X = eic[i].ImageCorners[3].X + 1;
    slo.Y = eic[i].ImageCorners[3].Y + 1;
    document.Pages[0].Graphics.DrawString(imageProperties, sao, slo);
}
document.Save("sample-imagesinfo.pdf");

VB.NET:

Dim pen As New PdfPen(New PdfRgbColor(255, 0, 192), 0.5)
Dim brush As New PdfBrush(New PdfRgbColor(0, 0, 0))
Dim helvetica As New PdfStandardFont(PdfStandardFontFace.Helvetica, 8)
Dim sao As New PdfStringAppearanceOptions()
sao.Brush = brush
sao.Font = helvetica
Dim slo As New PdfStringLayoutOptions()
slo.Width = 1000

Dim document As New PdfFixedDocument("sample.pdf")
Dim ce As New PdfContentExtractor(document.Pages(0))
Dim eic As PdfVisualImageCollection = ce.ExtractImages(False)
For i As Integer = 0 To eic.Count - 1
    Dim imageProperties As String = String.Format("Image ID: {0}" & vbLf & "Pixel width: {1} pixels" & vbLf & "Pixel height: {2} pixels" & vbLf & "Display width: {3} points" & vbLf & "Display height: {4} points" & vbLf & "Horizontal Resolution: {5} dpi" & vbLf & "Vertical Resolution: {6} dpi", _
        eic(i).ImageID, eic(i).Width, eic(i).Height, eic(i).DisplayWidth, eic(i).DisplayHeight, eic(i).DpiX, eic(i).DpiY)

    Dim boundingPath As New PdfPath()
    boundingPath.StartSubpath(eic(i).ImageCorners(0).X, eic(i).ImageCorners(0).Y)
    boundingPath.AddLineTo(eic(i).ImageCorners(1).X, eic(i).ImageCorners(1).Y)
    boundingPath.AddLineTo(eic(i).ImageCorners(2).X, eic(i).ImageCorners(2).Y)
    boundingPath.AddLineTo(eic(i).ImageCorners(3).X, eic(i).ImageCorners(3).Y)
    boundingPath.CloseSubpath()

    document.Pages(0).Graphics.DrawPath(pen, boundingPath)
    slo.X = eic(i).ImageCorners(3).X + 1
    slo.Y = eic(i).ImageCorners(3).Y + 1
    document.Pages(0).Graphics.DrawString(imageProperties, sao, slo)
Next
document.Save("sample-imagesinfo.pdf")

ExtractVisualObjects

public PdfVisualObjectCollection ExtractVisualObjects(
                                     bool includeImageData, 
                                     bool keepGraphicContainers, 
                                     PdfContentExtractionContext context)

Extracts the content of the page as a collection of visual objects. A visual object can be a path, a text fragment, an image, a shading or a form XObject.
The method parameters let you control the result:
– includeImageData – if true, the images will be decoded and the actual image data will be included in the image object but the method will take longer to complete. If false the images will not be decoded and the method will execute faster. If you need to save the images to external storage then set this parameter to true. If you need only information about the image, such as position on the page, size, resolution then set this parameter to false.
– keepGraphicContainers – the page content can be extracted as a flat list of visual objects or as a grouped list where the grouping item is a form XObject. If true, the form XObjects are extracted as standalone objects and their content appears in a separate collection as a child of the form XObject. If false, the form XObjects do not appear in the result collection and their content is included directly in the page content.
– context – a PdfContentExtractionContext object. This parameter affects performance when extracting content from multiple pages of the same document. It acts as a cache for objects shared between pages, which speeds up the extraction process.
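
To make the parameter combination concrete, here is a minimal sketch that extracts the visual objects of the first page without decoding image data and with form XObjects kept as containers. The method signature is the one listed above; the namespace names in the using directives, the Count property on PdfVisualObjectCollection and the class name ExtractVisualObjectsSketch are assumptions made for illustration.

```csharp
using System;
using Xfinium.Pdf;          // assumed namespace for PdfFixedDocument
using Xfinium.Pdf.Content;  // assumed namespace for the extraction classes

class ExtractVisualObjectsSketch
{
    static void Main()
    {
        PdfFixedDocument document = new PdfFixedDocument("sample.pdf");
        PdfContentExtractionContext ctx = new PdfContentExtractionContext();
        PdfContentExtractor ce = new PdfContentExtractor(document.Pages[0]);

        // includeImageData = false: only image metadata is needed, so skip decoding.
        // keepGraphicContainers = true: form XObjects stay as grouping items.
        PdfVisualObjectCollection voc = ce.ExtractVisualObjects(false, true, ctx);

        // Count is an assumption; check the API reference for the exact members.
        Console.WriteLine("Top-level visual objects: {0}", voc.Count);
    }
}
```

Passing false for keepGraphicContainers instead would be expected to yield a larger flat collection, since the content of each form XObject is then merged into the page-level list.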

ExtractOptionalContentGroup

public PdfPageOptionalContent ExtractOptionalContentGroup(string ocgName)

Extracts the content of the specified optional content group as a reusable drawing object. The returned PdfPageOptionalContent object can be later drawn on a page using the Graphics’ DrawFormXObject method.

ExtractContentStreamOperators

public PdfContentStreamOperatorCollection ExtractContentStreamOperators()

Extracts the content of a PDF page as a collection of content stream operators. Each operator in the page content stream is represented by an operator object. The collection of operators and their operands can be used for a low level custom analysis of page content.
