Extracts content from the current page in a specified format.

Syntax
 

Text = Doc.GetText(Type)

Params
 
Name    Type      Description
Type    String    The format in which to return the content.
Text    String    The returned content.

Notes
 

This method allows you to extract the content from a page.

Three formats are supported: "Text", "SVG" and "SVG+".

Text is returned in layout order, which may differ from reading order. For example, what looks like a space to a user may be a real space character, or it may simply be two items of text positioned apart from each other. ABCpdf makes sensible assumptions about how items of text should be combined, but many situations are ambiguous.
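To illustrate why combining items of text is ambiguous, here is a minimal Python sketch of one possible gap heuristic. It is not ABCpdf's algorithm. The function name join_items, the tuple layout, and the gap_factor threshold are all assumptions made for this example.

```python
def join_items(items, gap_factor=0.25):
    """Join text items that sit on one line into a single string.

    items: list of (x, width, font_size, text) tuples sorted by x.
    A space is inserted when the horizontal gap between two items exceeds
    gap_factor * font_size. The threshold is arbitrary (an assumption of
    this sketch), which is exactly why the result can be ambiguous.
    """
    out = ""
    prev_end = None
    for x, width, font_size, text in items:
        if prev_end is not None and x - prev_end > gap_factor * font_size:
            out += " "
        out += text
        prev_end = x + width
    return out


# Two items positioned 2 units apart in a 12 unit font.
items = [(0, 30, 12, "Hello"), (32, 30, 12, "World")]
print(join_items(items))                  # threshold 3.0: no space inserted
print(join_items(items, gap_factor=0.1))  # threshold 1.2: space inserted
```

The same two items produce different text depending on the threshold chosen, so any fixed rule will be wrong for some documents.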

SVG is an XML-based format for representing vector graphics. Because SVG is standard XML, it is easy to parse and gives you the precise position of each item of text on the page. The way that ABCpdf constructs the SVG should make it easy to extract any information you require. ABCpdf currently supports SVG text and paths.

For example, a simple "Hello World" PDF might produce the following SVG content:

<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="612" height="792" x="0" y="0">
<text x="0" y="76.8" font-size="96" font-family="Times-Roman" >Hello World</text>
</svg>
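
As a rough illustration of how the SVG output can be parsed, the sketch below uses Python's standard xml.etree.ElementTree module to pull the position and font of each item of text from the sample above. The function name text_items is made up for this example, and the sketch assumes the output has the same shape as the sample.

```python
import xml.etree.ElementTree as ET

# The "Hello World" sample from above, as returned for the "SVG" format.
SVG = '''<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="612" height="792" x="0" y="0">
<text x="0" y="76.8" font-size="96" font-family="Times-Roman" >Hello World</text>
</svg>'''


def text_items(svg):
    """Return one dict per <text> element with its content and position."""
    root = ET.fromstring(svg)
    items = []
    for el in root.iter():
        # Strip any XML namespace so the check works with or without one.
        if el.tag.rsplit("}", 1)[-1] != "text":
            continue
        items.append({
            "text": el.text or "",
            "x": float(el.get("x")),
            "y": float(el.get("y")),
            "font_size": float(el.get("font-size")),
            "font_family": el.get("font-family"),
        })
    return items


for item in text_items(SVG):
    print(item)
```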

SVG+ is an annotated form of SVG which includes details of the PDF operators and how they relate to the items of content in the SVG. It can be very useful if you are trying to deconstruct a page and determine how objects in the PDF relate to objects in the SVG.

For example, you could use SVG+ to identify the section of a PDF stream that relates to a particular word on a page. You could then replace the text show operator for that word with another one. Effectively, you'd be performing a low-level Search/Replace on the PDF document.

There is no official standard for SVG+, but if you are familiar with the PDF specification it should be easy enough to understand.

For example, in SVG+ format a simple "Hello World" PDF might produce the following content:

<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="612" height="792" x="0" y="0">
<pdf pdf_Op="q" pdf_StreamID="5" pdf_StreamOffset="0" pdf_StreamLength="1" />
<pdf pdf_Op="BT" pdf_StreamID="5" pdf_StreamOffset="3" pdf_StreamLength="2" />
<pdf pdf_Op="0 Tr" pdf_StreamID="5" pdf_StreamOffset="7" pdf_StreamLength="4" />
<pdf pdf_Op="/Fabc6 96 Tf" pdf_StreamID="5" pdf_StreamOffset="13" pdf_StreamLength="12" />
<pdf pdf_Op="0 0 0 rg" pdf_StreamID="5" pdf_StreamOffset="27" pdf_StreamLength="8" />
<pdf pdf_Op="1 0 0 1 0 715.2 Tm" pdf_StreamID="5" pdf_StreamOffset="37" pdf_StreamLength="18" />
<pdf pdf_Op="0 Ts" pdf_StreamID="5" pdf_StreamOffset="57" pdf_StreamLength="4" />
<text x="0" y="76.8" font-size="96" font-family="Times-Roman" pdf_CTM="1 0 0 1 0 0" pdf_TM="1 0 0 1 0 715.2" pdf_Trm="96 0 0 96 0 715.2" pdf_Tf="Fabc6" pdf_Tz="100" pdf_Ts="0" pdf_w1000="5027" pdf_Op="(Hello World) Tj" pdf_StreamID="5" pdf_StreamOffset="63" pdf_StreamLength="16" >Hello World</text>
<pdf />
<pdf pdf_Op="ET" pdf_StreamID="5" pdf_StreamOffset="81" pdf_StreamLength="2" />
<pdf pdf_Op="Q" pdf_StreamID="5" pdf_StreamOffset="85" pdf_StreamLength="1" />
</svg>

The SVG details each operator within the PDF stream. For example, the first 'q' operator is located in the stream with Object ID 5 (pdf_StreamID) at offset 0 (pdf_StreamOffset) and has a length of 1 byte (pdf_StreamLength). The 'Tj' operator that shows "Hello World" is at offset 63 and has a length of 16 bytes. The Current Transformation Matrix (pdf_CTM), the Text Matrix (pdf_TM), and other important PDF state values are shown as attributes of the text element.
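
The low-level Search/Replace idea described earlier can be sketched in Python as follows. This is only an illustration under stated assumptions: the SVG+ sample is shortened to its text element, the stream content is reconstructed by joining the operators with CRLF (which is consistent with the offsets in the sample, but is still a guess), and the helper names find_text_op and splice are made up. Reading the stream from the PDF and writing it back are not shown.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the SVG+ sample above (only the text element is kept).
SVG_PLUS = '''<svg width="612" height="792" x="0" y="0">
<text x="0" y="76.8" font-size="96" font-family="Times-Roman" pdf_Op="(Hello World) Tj" pdf_StreamID="5" pdf_StreamOffset="63" pdf_StreamLength="16" >Hello World</text>
</svg>'''

# Assumed content of stream 5: the operators listed in the sample,
# separated by CRLF so that each one lands at its reported offset.
STREAM = b"\r\n".join([
    b"q", b"BT", b"0 Tr", b"/Fabc6 96 Tf", b"0 0 0 rg",
    b"1 0 0 1 0 715.2 Tm", b"0 Ts", b"(Hello World) Tj", b"ET", b"Q",
])


def find_text_op(svg_plus, word):
    """Return (stream_id, offset, length) of the operator that shows word."""
    root = ET.fromstring(svg_plus)
    for el in root.iter("text"):
        if (el.text or "").strip() == word:
            return (int(el.get("pdf_StreamID")),
                    int(el.get("pdf_StreamOffset")),
                    int(el.get("pdf_StreamLength")))
    return None


def splice(stream, offset, length, new_op):
    """Replace length bytes at offset with new_op."""
    return stream[:offset] + new_op + stream[offset + length:]


stream_id, offset, length = find_text_op(SVG_PLUS, "Hello World")
patched = splice(STREAM, offset, length, b"(Goodbye World) Tj")
print(stream_id, offset, length)
print(patched.decode("ascii"))
```

A replacement like this only works when the stream is uncompressed and the new characters can be encoded in the font named by the Tf operator. Because the replacement changes the stream length, the offsets of every later operator shift, so the SVG+ should be extracted again before any further edits.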

Example
 

None.