How can I extract highlighted text from a PDF file using PHP?

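I want to build a web application that extracts highlighted text from PDF files. I have used FPDF and PDFlib for many purposes, but I don't think they help here. Please tell me how to do this, or at least which PHP libraries or frameworks support it. I would also like to know whether there is an API that can be used for this purpose. I would greatly appreciate your help.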

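You can do this with the SetaPDF Extractor component (our commercial product!). It gives you access to the highlight annotations, which you can use to create specific filters for the extraction process. A simple example script might look like this: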

<?php
// load and register the autoload function
require_once('library/SetaPDF/Autoload.php');
// create a document instance
$document = SetaPDF_Core_Document::loadByFilename('path/to/the/highlighted.pdf');
// initiate an extractor instance
$extractor = new SetaPDF_Extractor($document);
// get the document's pages object
$pages = $document->getCatalog()->getPages();
// we are going to save the results in this variable
$results = array();
// iterate over all pages
for ($pageNo = 1, $pageCount = $pages->count(); $pageNo <= $pageCount; $pageNo++) {
    // get the page object
    $page = $pages->getPage($pageNo);
    // get the highlight annotations
    $annotations = $page->getAnnotations()->getAll(SetaPDF_Core_Document_Page_Annotation::TYPE_HIGHLIGHT);
    // create a strategy instance
    $strategy = new SetaPDF_Extractor_Strategy_Word();
    // create a multi filter instance
    $filter = new SetaPDF_Extractor_Filter_Multi();
    // and pass it to the strategy
    $strategy->setFilter($filter);
    // iterate over all highlight annotations
    foreach ($annotations AS $annotation) {
        /**
         * @var SetaPDF_Core_Document_Page_Annotation_Highlight $annotation
         */
        $name = $annotation->getName();
        // iterate over the quad points to setup our filter instances
        $quadpoints = $annotation->getQuadPoints();
        for ($pos = 0, $c = count($quadpoints); $pos < $c; $pos += 8) {
            $llx = min($quadpoints[$pos + 0], $quadpoints[$pos + 2], $quadpoints[$pos + 4], $quadpoints[$pos + 6]);
            $urx = max($quadpoints[$pos + 0], $quadpoints[$pos + 2], $quadpoints[$pos + 4], $quadpoints[$pos + 6]);
            $lly = min($quadpoints[$pos + 1], $quadpoints[$pos + 3], $quadpoints[$pos + 5], $quadpoints[$pos + 7]);
            $ury = max($quadpoints[$pos + 1], $quadpoints[$pos + 3], $quadpoints[$pos + 5], $quadpoints[$pos + 7]);
            // Add a new rectangle filter to the multi filter instance
            $filter->addFilter(
                new SetaPDF_Extractor_Filter_Rectangle(
                    new SetaPDF_Core_Geometry_Rectangle($llx, $lly, $urx, $ury),
                    SetaPDF_Extractor_Filter_Rectangle::MODE_CONTACT,
                    $name
                )
            );
        }
    }
    // if no filters for this page defined, ignore it
    if (0 === count($filter->getFilters())) {
        continue;
    }
    // pass the strategy to the extractor instance
    $extractor->setStrategy($strategy);
    // and get the results by the current page number
    $pageResult = $extractor->getResultByPageNumber($pageNo);
    // group the resulting words in a result array
    foreach ($pageResult AS $word) {
        $results[$pageNo][$word->getFilterId()][] = $word->getString();
    }
}
// debug output
echo '<pre>';
foreach ($results AS $pageNo => $annotationResults) {
    echo 'Page No #' . $pageNo . "\n";
    foreach ($annotationResults AS $name => $words) {
        echo '  Annotation name: ' . $name . "\n";
        echo '    Result: ' . join(' ', $words). "\n";
        echo '<br />';
    }
}
echo '</pre>';

The output is a simple dump of all words found for each highlight annotation.
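
If the quad-point arithmetic in the inner loop looks opaque: a highlight annotation stores its marked areas as a flat list of numbers, eight per quadrilateral (four x/y pairs), and the script reduces each quadrilateral to an axis-aligned bounding rectangle before building a filter from it. Here is a minimal sketch of just that step in plain PHP, without any SetaPDF calls. The function name `quadPointsToBoundingBoxes` and the sample coordinates are made up for illustration and are not part of the library.

```php
<?php
/**
 * Hypothetical helper (not part of SetaPDF): turn a flat quad-points
 * list into one axis-aligned bounding box per quadrilateral.
 *
 * @param array $quadPoints flat list of numbers: x1, y1, x2, y2, x3, y3, x4, y4, ...
 * @return array list of boxes with the keys llx, lly, urx, ury
 */
function quadPointsToBoundingBoxes(array $quadPoints): array
{
    $boxes = [];
    // each quadrilateral uses 8 numbers; array_chunk() re-indexes every chunk from 0
    foreach (array_chunk(array_values($quadPoints), 8) as $quad) {
        if (count($quad) !== 8) {
            continue; // skip a trailing, incomplete quadrilateral
        }
        // even offsets hold x coordinates, odd offsets hold y coordinates
        $xs = [$quad[0], $quad[2], $quad[4], $quad[6]];
        $ys = [$quad[1], $quad[3], $quad[5], $quad[7]];
        $boxes[] = [
            'llx' => min($xs), // lower-left x
            'lly' => min($ys), // lower-left y
            'urx' => max($xs), // upper-right x
            'ury' => max($ys), // upper-right y
        ];
    }
    return $boxes;
}

// Example: one highlight that spans two lines of text (made-up coordinates)
$quadPoints = [
    72, 710, 300, 710, 72, 698, 300, 698,
    72, 696, 180, 696, 72, 684, 180, 684,
];
print_r(quadPointsToBoundingBoxes($quadPoints));
```

Taking the min/max over all four corners rather than trusting a fixed corner order keeps the result correct even when a PDF writer lists the corners in an unexpected order. Each resulting box corresponds to the four values that the script passes to `SetaPDF_Core_Geometry_Rectangle`.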