Search text in PDF files

Searching for text is a common operation on PDF files, and the XFINIUM.PDF library fully supports it.

When a PDF document is searched for a string of characters, each page must be searched separately because the content is stored at page level. Each search operation returns a collection of search results; each result specifies the text that was searched for and the collection of text fragments that make up the match.

XFINIUM.PDF provides several options when searching text in a PDF file. By default the search is case insensitive. Search options let you request a case sensitive search or a whole word search, and these two options can be combined. Another option is regular expression search; if it is combined with other options, those other options are ignored.
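To make the interaction between these options concrete, here is a small stand-alone model built on .NET's Regex class. It is only a sketch of the rules described above, not XFINIUM.PDF's implementation: the ModelSearchOptions enum and the SearchModel.Build method are hypothetical names, and treating a "whole word" as a \b word boundary is an assumption.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the library's search options (illustration only).
[Flags]
public enum ModelSearchOptions
{
    None = 0,
    CaseSensitive = 1,
    WholeWord = 2,
    RegEx = 4
}

public static class SearchModel
{
    // Builds a Regex that follows the rules described in the text.
    public static Regex Build(string text, ModelSearchOptions options)
    {
        // Regular expression search wins: every other option is ignored.
        if ((options & ModelSearchOptions.RegEx) != 0)
        {
            return new Regex(text);
        }

        // Plain text search: escape the text so it is matched literally.
        string pattern = Regex.Escape(text);

        // Assumption: a whole word is delimited by regex word boundaries.
        if ((options & ModelSearchOptions.WholeWord) != 0)
        {
            pattern = @"\b" + pattern + @"\b";
        }

        // The default is case insensitive unless CaseSensitive is requested.
        RegexOptions regexOptions =
            (options & ModelSearchOptions.CaseSensitive) != 0
                ? RegexOptions.None
                : RegexOptions.IgnoreCase;

        return new Regex(pattern, regexOptions);
    }
}
```

With this model, Build("and", ModelSearchOptions.None) matches "AND", while combining WholeWord and CaseSensitive matches "salt and pepper" but neither "sand" nor "And".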

The code below shows how to search text in a PDF page using various options. The search results are highlighted on the page by drawing a rectangle around the text.

public void SearchText(Stream pdfInputStream, Stream pdfOutputStream)
{
	// Load the document from the input stream.
	PdfFixedDocument document = new PdfFixedDocument(pdfInputStream);
	// Create a content extractor for the page being searched.
	PdfContentExtractor ce = new PdfContentExtractor(document.Pages[0]);

	// Simple case insensitive search.
	PdfTextSearchResultCollection searchResults = ce.SearchText("at");
	HighlightSearchResults(document.Pages[0], searchResults, PdfRgbColor.Red);

	// Whole words search combined with case sensitive search.
	searchResults = ce.SearchText("and", PdfTextSearchOptions.WholeWordSearch | PdfTextSearchOptions.CaseSensitiveSearch);
	HighlightSearchResults(document.Pages[0], searchResults, PdfRgbColor.Green);

	// Regular expression search, find all words that start with uppercase.
	searchResults = ce.SearchText("[A-Z][a-z]*", PdfTextSearchOptions.RegExSearch);
	HighlightSearchResults(document.Pages[0], searchResults, PdfRgbColor.Blue);

	document.Save(pdfOutputStream);
}

private void HighlightSearchResults(PdfPage page, PdfTextSearchResultCollection searchResults, PdfColor color)
{
	PdfPen pen = new PdfPen(color, 0.5);

	for (int i = 0; i < searchResults.Count; i++)
	{
		PdfTextFragmentCollection tfc = searchResults[i].TextFragments;
		for (int j = 0; j < tfc.Count; j++)
		{
			PdfPath path = new PdfPath();

			path.StartSubpath(tfc[j].FragmentCorners[0].X, tfc[j].FragmentCorners[0].Y);
			path.AddPolygon(tfc[j].FragmentCorners);

			page.Graphics.DrawPath(pen, path);
		}
	}
}

The same sample in VB.NET:

Public Sub SearchText(pdfInputStream As Stream, pdfOutputStream As Stream)
	' Load the document from the input stream.
	Dim document As New PdfFixedDocument(pdfInputStream)
	' Create a content extractor for the page being searched.
	Dim ce As New PdfContentExtractor(document.Pages(0))

	' Simple case insensitive search.
	Dim searchResults As PdfTextSearchResultCollection = ce.SearchText("at")
	HighlightSearchResults(document.Pages(0), searchResults, PdfRgbColor.Red)

	' Whole words search combined with case sensitive search.
	searchResults = ce.SearchText("and", PdfTextSearchOptions.WholeWordSearch Or PdfTextSearchOptions.CaseSensitiveSearch)
	HighlightSearchResults(document.Pages(0), searchResults, PdfRgbColor.Green)

	' Regular expression search, find all words that start with uppercase.
	searchResults = ce.SearchText("[A-Z][a-z]*", PdfTextSearchOptions.RegExSearch)
	HighlightSearchResults(document.Pages(0), searchResults, PdfRgbColor.Blue)

	document.Save(pdfOutputStream)
End Sub

Private Sub HighlightSearchResults(page As PdfPage, searchResults As PdfTextSearchResultCollection, color As PdfColor)
	Dim pen As New PdfPen(color, 0.5)

	For i As Integer = 0 To searchResults.Count - 1
		Dim tfc As PdfTextFragmentCollection = searchResults(i).TextFragments
		For j As Integer = 0 To tfc.Count - 1
			Dim path As New PdfPath()

			path.StartSubpath(tfc(j).FragmentCorners(0).X, tfc(j).FragmentCorners(0).Y)
			path.AddPolygon(tfc(j).FragmentCorners)

			page.Graphics.DrawPath(pen, path)
		Next
	Next
End Sub
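
The samples above search only the first page. Because content is stored at page level, searching a whole document means repeating the search for every page. The following is a minimal sketch of that loop, not an official sample: the DocumentSearch class and CountMatches method are hypothetical names, and it assumes that document.Pages exposes a Count property and that the types live in the Xfinium.Pdf and Xfinium.Pdf.Content namespaces.

```csharp
using System.IO;
using Xfinium.Pdf;          // assumed namespace for PdfFixedDocument
using Xfinium.Pdf.Content;  // assumed namespace for PdfContentExtractor

public static class DocumentSearch
{
    // Searches every page of the document and returns the total number of matches.
    public static int CountMatches(Stream pdfInputStream, string text)
    {
        PdfFixedDocument document = new PdfFixedDocument(pdfInputStream);
        int total = 0;

        // Assumption: the page collection has a Count property next to its indexer.
        for (int i = 0; i < document.Pages.Count; i++)
        {
            // One content extractor per page, since search works at page level.
            PdfContentExtractor ce = new PdfContentExtractor(document.Pages[i]);

            // Default search, which is case insensitive.
            PdfTextSearchResultCollection searchResults = ce.SearchText(text);
            total += searchResults.Count;
        }

        return total;
    }
}
```

The same loop could call HighlightSearchResults for each page instead of counting, if the matches should be marked throughout the document.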

Download the XFINIUM.PDF library and give it a try.

6 thoughts on “Search text in PDF files”

  1. This works fine, I have my results as PdfTextFragmentCollection. Is there any way to change/modify the text and save/write it back to the same document?

    searchResults[0].TextFragments[0].Text is read only.

    1. At this moment text cannot be replaced directly. It can be implemented using a page transform that inspects each text fragment in the page content, but it would work only for basic situations.

  2. Hi,

    Is it possible to replace text now? If not, what’s the best way to achieve this?

    Many Thanks.

    1. Text replace is not available at this moment. What you could do is search for the text, perform a redaction at its location and then draw the new text at the same location. The problem is that if the new text is longer than the old text, the content that follows it might be overwritten.
