http://www.getit.de/2008/indexing/pdf

parsePdf (xlink : xlink, [fromDataSource : boolean]) : nodeset

The method »parsePdf« returns the textual content of a .pdf file, page by page, together with any meta information that is stored in the file.

The parameter »xlink« is either an XLink to the binary file itself or an XLink to the method that renders the PDF, that is, a method whose result is effectively the file »XY.pdf«.

The second parameter »fromDataSource« is optional. If it is omitted or is »true()«, the extension expects an XLink to the binary data itself. If it is »false()«, an XLink to the rendering method is expected.

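<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:pdf="http://www.getit.de/2008/indexing/pdf">
<xsl:template match="/">
<!-- binary data, second parameter omitted -->
<xsl:copy-of select="pdf:parsePdf('onion://data/objects/1234.4321')" />
<!-- binary data, second parameter given explicitly -->
<xsl:copy-of select="pdf:parsePdf('onion://data/objects/1234.4321', true())" />
<!-- XLink to a rendering method -->
<xsl:copy-of select="pdf:parsePdf(c.xlink('onion://data/objects/5678'), false())" />
</xsl:template>
</xsl:stylesheet>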
Three examples for calling the method “parsePdf()”

The example shows three different calls of the method »parsePdf()«.

The first two calls return the same result, because an omitted second parameter is treated like »true()«. Both call the method on the binary data at »onion://data/objects/1234.4321«, which has evidently been loaded into an object.

The third call, however, passes as its first parameter an XLink to the »default()« method of an object that evidently contains a binary .pdf document. This method returns the PDF itself as its result, so the second parameter is »false()«.
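The two calling modes can also be wrapped in a single named template. The following is only a minimal sketch: the template name »extract-pdf«, its parameters »link« and »isBinary«, and the namespace binding for the prefix »pdf« are assumptions made for illustration and are not part of the extension.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:pdf="http://www.getit.de/2008/indexing/pdf">

  <!-- Hypothetical helper: forwards its parameters to parsePdf(). -->
  <xsl:template name="extract-pdf">
    <xsl:param name="link" />
    <!-- Defaults to true(), mirroring the behaviour when the second parameter is omitted. -->
    <xsl:param name="isBinary" select="true()" />
    <xsl:copy-of select="pdf:parsePdf($link, $isBinary)" />
  </xsl:template>

  <xsl:template match="/">
    <!-- XLink to the binary data itself; isBinary keeps its default. -->
    <xsl:call-template name="extract-pdf">
      <xsl:with-param name="link" select="'onion://data/objects/1234.4321'" />
    </xsl:call-template>
  </xsl:template>
</xsl:stylesheet>
```

To address a rendering method instead, the caller would additionally pass »isBinary« with the value »false()«.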

The result XML of the method »parsePdf()« looks as follows:

<pdf numberOfPages="3">
<information author="Mustermann" creationDate="2008-01-10T08:23:09.0000000+01:00" creator="" modificationDate="" producer="Acrobat Distiller 6.0.0 for Macintosh" title="Document title" />
<textExtraction>
<text page="1">Some text on page one...</text>
<text page="2">Some further text on page two...</text>
<text page="3">Some further text on page three...</text>
</textExtraction>
</pdf>
Result XML of the method “parsePdf()”
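The result can be processed further in the same stylesheet. The sketch below stores the node-set in a variable and emits one element per page. The output element names »document« and »page« are hypothetical. It is also an assumption that the returned node-set is either the »pdf« element itself or a root node containing it, which is why the element is selected with »descendant-or-self::pdf«.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:pdf="http://www.getit.de/2008/indexing/pdf">

  <xsl:template match="/">
    <!-- Call the extension once and keep the returned node-set. -->
    <xsl:variable name="parsed" select="pdf:parsePdf('onion://data/objects/1234.4321')" />
    <!-- Works whether the node-set is the pdf element or a root node above it. -->
    <xsl:variable name="doc" select="$parsed/descendant-or-self::pdf" />

    <document pages="{$doc/@numberOfPages}" title="{$doc/information/@title}">
      <!-- One output element per extracted page, in document order. -->
      <xsl:for-each select="$doc/textExtraction/text">
        <page number="{@page}">
          <xsl:value-of select="normalize-space(.)" />
        </page>
      </xsl:for-each>
    </document>
  </xsl:template>
</xsl:stylesheet>
```

Meta attributes that are not set in the file, such as »creator« and »modificationDate« in the result above, arrive as empty strings, so a stylesheet that depends on them should test for an empty value first.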