How to scrape a page using Simple HTML DOM and PHP?

I am trying to fetch the data inside <div id="listing-page-cart-inner">, the text of <div id="description">, and <div id="tags">, but I am finding it hard to dig the data out.

Can anybody point me in the right direction? I can scrape the first div I mentioned, but not the others. The second foreach loop also takes noticeably longer to run.

<?php
include_once('simple_html_dom.php');
$html = file_get_html('https://etsy.com/listing/107492702/');
//$val =  $html->find('div[id=listing-page-cart-inner]');

function scraping_etsy() {
    // create HTML DOM
    $html = file_get_html('https://etsy.com/listing/107492702/');
    foreach ($html->find('div[id=listing-page-cart-inner]') as $article) {
        // get title
        //$item['title'] = trim($article->find('h3', 0)->plaintext);
        // get details
        $item['details'] = trim($article->find('span', 0)->plaintext);
        // get intro
        //$lists = $articles->find('div[id=item-overview]');
        $item['list1'] = trim($article->find('li',0)->plaintext);
        $item['list2'] = trim($article->find('li',1)->plaintext);
        $item['list3'] = trim($article->find('li',2)->plaintext);
        $item['list4'] = trim($article->find('li',3)->plaintext);
        $item['list5'] = trim($article->find('li',4)->plaintext);
        /*foreach($article->find('li') as $al){
            $item['lists'] =trim($al->find('li')->plaintext);
        }*/
        $ret[] = $item;
    }

    foreach($html->find('div[id=description]') as $content){
        var_dump($content->find('text'));
        // $item['content'] = trim($content->find('div[id=description]')->plaintext);
        // $ret[] = $item;
    }
    // clean up memory
    $html->clear();
    unset($html);
    return $ret;
}
$ret = scraping_etsy();
var_dump($ret);
/*foreach($ret as $v) {
    echo $v['title'].'<br>';
    echo '<ul>';
    echo '<li>'.$v['details'].'</li>';
    echo '<li>Diggs: '.$v['diggs'].'</li>';
    echo '</ul>';
}*/
?>

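As for getting the children of those divs: once you have found the parent element, always pass an index, as in ->find('<the selector here>', 0), so that you actually point at that element. Without the index, find() returns an array of matches rather than a single element.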

include_once('simple_html_dom.php');
$html = file_get_html('https://etsy.com/listing/107492702/');
// listings with description
$div = $html->find('div#listing-page-cart-inner', 0); // here index zero
$main_description = $div->find('h1', 0)->innertext;
echo $main_description . '<br/><br/>';
$div_item_overview = $div->find('div#item-overview ul.properties li');
foreach ($div_item_overview as $overview) {
    echo $overview->innertext . '<br/>';
}
// tags
$div_tag = $html->find('div#tags', 0); // here index zero pointing to that element
$tags = array();
foreach($div_tag->find('ul li') as $li) {
    $tags[] = $li->find('a', 0)->innertext;
}
echo '<pre>', print_r($tags, 1), '</pre>';
// description
$div_description = $html->find('div#description', 0)->plaintext; // here pointing to index zero
echo $div_description;
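
To make the index rule easier to try out without hitting the live page, here is a minimal sketch that runs against an inline HTML string. The markup and the tag names in it are hypothetical stand-ins for the real listing, and the script assumes simple_html_dom.php sits next to it, as in the question. It also guards against a missing element, since find() with an index returns null when nothing matches.

```php
<?php
// Assumption: simple_html_dom.php is in the same directory, as in the question.
include_once('simple_html_dom.php');

// Hypothetical markup standing in for the real listing page.
$markup = <<<'HTML'
<div id="tags">
    <ul>
        <li><a href="#">vintage</a></li>
        <li><a href="#">brass</a></li>
    </ul>
</div>
HTML;

$html = str_get_html($markup);

// No index: find() returns an array of matching elements (possibly empty).
$all = $html->find('div#tags');
var_dump(is_array($all));

// Index 0: find() returns the first matching element, or null if there is none.
$div_tag = $html->find('div#tags', 0);
$missing = $html->find('div#does-not-exist', 0);

$tags = array();
if ($div_tag !== null) {
    // Search only inside the parent element that was found above.
    foreach ($div_tag->find('ul li a') as $a) {
        $tags[] = trim($a->plaintext);
    }
}
print_r($tags);
var_dump($missing === null);

// clean up memory
$html->clear();
unset($html);
```

Checking for null before chaining ->find() or ->plaintext avoids a fatal "member function on null" error when the page layout changes or the selector does not match.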

The easiest way to get started is usually to use a third-party library, e.g. Symfony DomCrawler.

Using it is as simple as:

use Symfony\Component\DomCrawler\Crawler;
$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
        <p class="message">Hello World!</p>
        <p>Hello Crawler!</p>
    </body>
</html>
HTML;
$crawler = new Crawler($html);
foreach ($crawler as $domElement) {
    print $domElement->nodeName;
}

You can use a filter like this:

$crawler = $crawler->filter('body > p');
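
As a rough sketch of what you can do with the filtered result, the snippet below pulls the text out of each matched node. It assumes symfony/dom-crawler and symfony/css-selector are installed through Composer (filter() needs the CssSelector component) and that vendor/autoload.php is reachable from the script; the HTML is the same sample as above.

```php
<?php
// Assumption: both Symfony components were installed with Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
        <p class="message">Hello World!</p>
        <p>Hello Crawler!</p>
    </body>
</html>
HTML;

$crawler = new Crawler($html);

// filter() takes a CSS selector and returns a new Crawler over the matches.
$paragraphs = $crawler->filter('body > p');

// each() runs the closure once per matched node and collects the return values.
$texts = $paragraphs->each(function (Crawler $node, $i) {
    return $node->text();
});
print_r($texts);

// text() throws on an empty node list, so check count() before calling it.
$message = $crawler->filter('p.message');
$messageText = null;
if ($message->count() > 0) {
    $messageText = $message->text();
    echo $messageText, "\n";
}
```

The count() check plays the same role as the null check with Simple HTML DOM: it keeps the script from failing when a selector matches nothing.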