Extract content from the current page in a specified format.

 

   

Syntax
 

[C#]
virtual string GetText(string type)

[Visual Basic]
Overridable Function GetText(type As String) As String

 

   

Params
 
Name Description
type

The format in which to return the content.

return The returned content.

 

   

Notes
 

This function allows you to extract the content from a page.

There are four formats supported - "Text", "SVG", "SVG+" and "SVG+2".

Text is in layout order which may not be the same as reading order. For example - what to a user may look like a space - may simply be two items of text positioned apart from each other - or it may not. ABCpdf will make sensible assumptions on how items of text should be combined but many situations are ambiguous.

SVG is an XML based format for representing vector graphics. Because SVG is standard XML it's easy to parse and gives you the precise position of each item of text on the page. The way that ABCpdf constructs the SVG should make it easy to extract any information you require. ABCpdf currently supports SVG text, paths and image placeholders.

For example a simple "Hello World" PDF might produce the following content:

<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="612" height="792" x="0" y="0">
<text x="0" y="76.8" font-size="96" font-family="Times-Roman" >Hello World</text>
</svg>

SVG+ and SVG+2 are annotated forms of SVG which include details of the PDF operators and how they relate to the items of content in the SVG. They can be very useful if you are trying to deconstruct a page and determine how objects in the PDF relate to objects in the SVG. In SVG+, SVG elements appear before the pdf elements for their generating operators, and the pdf elements for the Do operator on Form XObjects are not generated. In SVG+2, SVG elements appear after the pdf elements of their generating operators, and the pdf elements for the Do operator on Form XObjects are generated.

For example you could use SVG+ to identify the section of a PDF stream that relates to a particular word on a page. You could then replace the text show operator for that word with another one. Effectively you'd be performing a low-level Search/Replace on the PDF document. However you should note that this would not mean that your layout would automatically adjust if - for example - you were to replace a short word with a long one.

There is no official standard for SVG+ but if you are familiar with the PDF specification it should be easy enough to understand.

For example a simple "Hello World" PDF might produce the following content:

<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="612" height="792" x="0" y="0">
<pdf pdf_Op="q" pdf_StreamID="5" pdf_StreamOffset="0" pdf_StreamLength="1" />
<pdf pdf_Op="BT" pdf_StreamID="5" pdf_StreamOffset="3" pdf_StreamLength="2" />
<pdf pdf_Op="0 Tr" pdf_StreamID="5" pdf_StreamOffset="7" pdf_StreamLength="4" />
<pdf pdf_Op="/Fabc6 96 Tf" pdf_StreamID="5" pdf_StreamOffset="13" pdf_StreamLength="12" />
<pdf pdf_Op="0 0 0 rg" pdf_StreamID="5" pdf_StreamOffset="27" pdf_StreamLength="8" />
<pdf pdf_Op="1 0 0 1 0 715.2 Tm" pdf_StreamID="5" pdf_StreamOffset="37" pdf_StreamLength="18" />
<pdf pdf_Op="0 Ts" pdf_StreamID="5" pdf_StreamOffset="57" pdf_StreamLength="4" />
<text x="0" y="76.8" font-size="96" font-family="Times-Roman" pdf_CTM="1 0 0 1 0 0" pdf_TM="1 0 0 1 0 715.2" pdf_Trm="96 0 0 96 0 715.2" pdf_Tf="Fabc6" pdf_Tz="100" pdf_Ts="0" pdf_w1000="5027" pdf_Op="(Hello World) Tj" pdf_StreamID="5" pdf_StreamOffset="63" pdf_StreamLength="16" >Hello World</text>
<pdf />
<pdf pdf_Op="ET" pdf_StreamID="5" pdf_StreamOffset="81" pdf_StreamLength="2" />
<pdf pdf_Op="Q" pdf_StreamID="5" pdf_StreamOffset="85" pdf_StreamLength="1" />
</svg>

The operators within the PDF stream are detailed in the SVG. For example the first 'q' operator is located in Object ID 5 at offset 0 and has a length of 1 byte. The 'Tj' operator which shows "Hello World" is at offset 63 and has length 16. The Current Transformation Matrix (CTM) the Text Matrix (TM) and other important PDF state values are shown.

 

   

Example
 

None.