How can I skip or remove a list of html tags from my crawler object using Symfony DomCrawler Component and Goutte for Laravel 4?

This was my first attempt, but it did not work.

$this->crawler = $client->request('GET', $this->url);
$document = new \DOMDocument('1.0', 'UTF-8');
$root = $document->appendChild($document->createElement('_root'));
$this->crawler->rewind();
$root->appendChild($document->importNode($this->crawler->current(), true));
$selectorsToRemove = ['script','p'];
foreach ($selectorsToRemove as $selector) {
   $crawlerInverse = $this->crawler->filter($selector);
   foreach ($crawlerInverse as $elementToRemove) {
      $parent = $elementToRemove->parentNode;
      $parent->removeChild($elementToRemove);
    }
}
$this->crawler->clear();
$this->crawler->add($document);

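I want to get the "p" tags from this page http://www.amazon.com/dp/B00IOY8XWQ/ref=fs_kv and it seems there is some JS inside the paragraph, so when I call $node->text(); it returns both the text and the JS from the "script" inside the "p". The structure looks like this: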

<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut    labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
<script>
 "JS CODE"
</script>
</p>

So I only want the Lorem ipsum text.

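I had a look at DomCrawler and did not see much purpose in it. It seems to be just a wrapper around the DOM extension, which is already easy to use, so I am going to take a shortcut and use that directly.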

The example is short and simple, and you should be able to adapt it more or less as-is. You already have a DOMDocument ready.


Example:

$html = <<<'HTML'
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut    labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
<script>
 "JS CODE"
</script>
</p>
HTML;
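$dom = new DOMDocument();
// loadXML() works here because the snippet is well-formed XML;
// for a real-world HTML page, loadHTML() is the more forgiving choice.
$dom->loadXML($html);
$xpath = new DOMXPath($dom);
// Select every <script> that is a direct child of a <p> and detach it.
foreach ($xpath->query('//p/script') as $node) {
    $node->parentNode->removeChild($node);
}
echo $dom->saveXML();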

Output:

<?xml version="1.0"?>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut    labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
</p>
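
The question asks about removing a whole list of tags from a real HTML page, and the same DOM approach can be generalized to that. The following is a minimal sketch under stated assumptions, not a drop-in solution: `removeTags()` is a hypothetical helper name, the sample markup is invented for illustration, and `loadHTML()` is used on the assumption that a scraped page will not be well-formed XML.

```php
<?php
// Hypothetical helper (not part of DOM, DomCrawler, or Goutte):
// removes every element whose tag name appears in $tags.
function removeTags(DOMDocument $dom, array $tags)
{
    $xpath = new DOMXPath($dom);
    foreach ($tags as $tag) {
        // Copy the result into an array first so that removing nodes
        // cannot interfere with the iteration.
        foreach (iterator_to_array($xpath->query('//' . $tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }
}

// Invented sample markup standing in for the scraped page.
$html = '<html><head><style>p { color: red; }</style></head>'
      . '<body><p>Lorem ipsum <script>var x = 1;</script></p></body></html>';

$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

removeTags($dom, array('script', 'style'));

// With the <script> gone, textContent holds only the paragraph text.
$text = trim($dom->getElementsByTagName('p')->item(0)->textContent);
echo $text, "\n";
```

The tag names are concatenated directly into the XPath expression, so this sketch is only suitable for a fixed list of names that you control, such as the `$selectorsToRemove` array from the question.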