|
This function extracts the content of a page.
Three formats are supported: "Text", "SVG"
and "SVG+".
Text is returned in layout order, which may not be the same as reading order.
For example, what looks like a space to a user may simply
be two items of text positioned apart from each other, or it may
be an actual space character. ABCpdf makes sensible assumptions about how
items of text should be combined, but many situations are ambiguous.
SVG is an XML-based format for representing vector graphics. Because
SVG is standard XML, it is easy to parse and gives you the precise
position of each item of text on the page. The way that ABCpdf constructs
the SVG should make it easy to extract any information you require.
ABCpdf currently supports SVG text, paths and image placeholders.
For example, a simple "Hello World" PDF might produce
the following content:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="612" height="792" x="0"
y="0">
<text x="0" y="76.8" font-size="96"
font-family="Times-Roman" >Hello World</text>
</svg>
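Because the output is plain XML, any XML parser can recover the text items and their positions. The following is a minimal sketch in Python using only the standard library, not ABCpdf itself. The helper name `text_items` is hypothetical, and the code assumes the SVG carries no XML namespace, as in the sample above.

```python
import xml.etree.ElementTree as ET

# The "Hello World" sample from above, with wrapped lines joined.
SVG = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="612" height="792" x="0" y="0">
<text x="0" y="76.8" font-size="96" font-family="Times-Roman" >Hello World</text>
</svg>"""


def text_items(svg):
    """Return one dict per <text> element: its string, position and font."""
    root = ET.fromstring(svg)
    items = []
    # Assumption: elements are not namespaced, so the tag is plain "text".
    for node in root.iter("text"):
        items.append({
            "text": node.text or "",
            "x": float(node.get("x", "0")),
            "y": float(node.get("y", "0")),
            "font_size": float(node.get("font-size", "0")),
            "font_family": node.get("font-family", ""),
        })
    return items


for item in text_items(SVG):
    print(item["text"], item["x"], item["y"], item["font_family"])
```

Sorting the returned items by `y` and then `x` is one possible starting point for turning layout order into an approximate reading order.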
SVG+ is an annotated form of SVG which includes details of the
PDF operators and how they relate to the items of content in the
SVG. It can be very useful if you are trying to deconstruct a page
and determine how objects in the PDF relate to objects in the SVG.
For example, you could use SVG+ to identify the section of a PDF
stream that relates to a particular word on a page. You could then
replace the text show operator for that word with another one. Effectively,
you would be performing a low-level Search/Replace on the PDF document.
There is no official standard for SVG+, but if you are familiar
with the PDF specification it should be easy enough to understand.
For example, a simple "Hello World" PDF might produce
the following content:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="612" height="792" x="0"
y="0">
<pdf pdf_Op="q" pdf_StreamID="5" pdf_StreamOffset="0"
pdf_StreamLength="1" />
<pdf pdf_Op="BT" pdf_StreamID="5" pdf_StreamOffset="3"
pdf_StreamLength="2" />
<pdf pdf_Op="0 Tr" pdf_StreamID="5" pdf_StreamOffset="7"
pdf_StreamLength="4" />
<pdf pdf_Op="/Fabc6 96 Tf" pdf_StreamID="5"
pdf_StreamOffset="13" pdf_StreamLength="12"
/>
<pdf pdf_Op="0 0 0 rg" pdf_StreamID="5" pdf_StreamOffset="27"
pdf_StreamLength="8" />
<pdf pdf_Op="1 0 0 1 0 715.2 Tm" pdf_StreamID="5"
pdf_StreamOffset="37" pdf_StreamLength="18"
/>
<pdf pdf_Op="0 Ts" pdf_StreamID="5" pdf_StreamOffset="57"
pdf_StreamLength="4" />
<text x="0" y="76.8" font-size="96"
font-family="Times-Roman" pdf_CTM="1 0 0 1 0 0"
pdf_TM="1 0 0 1 0 715.2" pdf_Trm="96 0 0 96 0 715.2"
pdf_Tf="Fabc6" pdf_Tz="100" pdf_Ts="0"
pdf_w1000="5027" pdf_Op="(Hello World) Tj" pdf_StreamID="5"
pdf_StreamOffset="63" pdf_StreamLength="16"
>Hello World</text>
<pdf />
<pdf pdf_Op="ET" pdf_StreamID="5" pdf_StreamOffset="81"
pdf_StreamLength="2" />
<pdf pdf_Op="Q" pdf_StreamID="5" pdf_StreamOffset="85"
pdf_StreamLength="1" />
</svg>
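The low-level Search/Replace idea can be sketched using the offsets recorded in this example. The Python below is illustrative only: `locate` and `splice` are hypothetical helpers, the stream bytes are rebuilt from the `pdf_Op` values, and the CRLF separators are an assumption inferred from the two-byte gaps between the recorded offsets. Reading the stream from the document and writing it back are not shown.

```python
import xml.etree.ElementTree as ET

# Stream 5 rebuilt from the pdf_Op values above. The CRLF separators are an
# assumption inferred from the two-byte gaps between consecutive offsets.
STREAM = b"\r\n".join([
    b"q", b"BT", b"0 Tr", b"/Fabc6 96 Tf", b"0 0 0 rg",
    b"1 0 0 1 0 715.2 Tm", b"0 Ts", b"(Hello World) Tj", b"ET", b"Q",
])

# Trimmed SVG+ fragment holding only the text item from the example.
SVG_PLUS = """<svg width="612" height="792" x="0" y="0">
<text x="0" y="76.8" pdf_Op="(Hello World) Tj" pdf_StreamID="5"
 pdf_StreamOffset="63" pdf_StreamLength="16">Hello World</text>
</svg>"""


def locate(svg_plus, wanted):
    """Return (stream id, offset, length) for each text item equal to wanted."""
    root = ET.fromstring(svg_plus)
    return [
        (int(node.get("pdf_StreamID")),
         int(node.get("pdf_StreamOffset")),
         int(node.get("pdf_StreamLength")))
        for node in root.iter("text")
        if node.text == wanted
    ]


def splice(stream, offset, length, new_op):
    """Return a copy of stream with one operator's bytes swapped for new_op."""
    return stream[:offset] + new_op + stream[offset + length:]


stream_id, offset, length = locate(SVG_PLUS, "Hello World")[0]
# Sanity check: the recorded span really is the text show operator.
assert STREAM[offset:offset + length] == b"(Hello World) Tj"
patched = splice(STREAM, offset, length, b"(Goodbye World) Tj")
print(patched.decode("ascii"))
```

A replacement of a different length shifts every later offset in the same stream, so when several operators are replaced it is safer to work from the highest offset down. A literal string like this only suits simple fonts; other encodings would need the replacement text encoded to match the font.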
The SVG+ output records where each operator sits within the PDF stream. For
example, the first 'q' operator is located in the stream with Object ID 5 at
offset 0 and has a length of 1 byte. The 'Tj' operator which shows "Hello
World" is at offset 63 and has a length of 16 bytes. The Current Transformation
Matrix (CTM), the Text Matrix (TM) and other important PDF state
values are also shown as attributes of the text item.
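The state values in the example are consistent with the text rendering matrix defined in the PDF specification, Trm = [Tfs × Th, 0, 0, Tfs, 0, Trise] × Tm × CTM. The sketch below recomputes `pdf_Trm` from the other attributes as a cross-check. The function names are hypothetical, and the mapping from the PDF y coordinate to the SVG y coordinate (page height minus y) is an inference from the sample values, not a documented rule.

```python
def parse_matrix(value):
    """Turn an attribute such as "1 0 0 1 0 715.2" into a 6-tuple of floats."""
    return tuple(float(v) for v in value.split())


def multiply(m, n):
    """Multiply two PDF matrices written as [a b c d e f] (m applied first)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2,
    )


def text_rendering_matrix(font_size, tz, ts, tm, ctm):
    """Trm = [Tfs*Th 0 0 Tfs 0 Trise] x Tm x CTM, with Th = Tz / 100."""
    params = (font_size * tz / 100.0, 0.0, 0.0, font_size, 0.0, ts)
    return multiply(multiply(params, tm), ctm)


# Values taken from the text item in the example above.
tm = parse_matrix("1 0 0 1 0 715.2")   # pdf_TM
ctm = parse_matrix("1 0 0 1 0 0")      # pdf_CTM
trm = text_rendering_matrix(96, 100, 0, tm, ctm)
print(trm)

# Assumption: SVG y is measured from the top of the 792 point high page.
svg_y = 792 - trm[5]
print(round(svg_y, 1))
```

With these inputs the computed matrix equals the `pdf_Trm` attribute in the example, and the derived y value agrees with the `y="76.8"` attribute on the text item.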
|