Trouble with scraping data in PHP

I am trying to scrape data from a website. The page source is at:

view-source:http://www.pakdukaan.com/75-computer-cases

The code I am using to scrape the data is as follows:

<?php
$html = file_get_contents('http://www.pakdukaan.com/75-computer-cases');
$pk_doc = new DOMDocument();
$pk_list = array();
libxml_use_internal_errors(TRUE); // collect parser warnings instead of printing them
if (!empty($html)) {
    $pk_doc->loadHTML($html);
    libxml_clear_errors();
    $pk_xpath = new DOMXPath($pk_doc);
    $pk_and_price = $pk_xpath->query('//div[@class="product_list list row "]');
    if ($pk_and_price->length > 0) {
        foreach ($pk_and_price as $pat) {
            // only the first matching name is kept here
            $name = $pk_xpath->query('//h5[@class="name"]', $pat)->item(0)->nodeValue;
            $pkmn_types = array();
            $types = $pk_xpath->query('//span[@class="price product-price"]', $pat);
            foreach ($types as $type) {
                $pkmn_types[] = $type->nodeValue;
            }
            $pk_list[] = array('name' => $name, 'price' => $pkmn_types);
        }
    }
}
//output what we have
echo "<pre>";
echo print_r($pk_list); // echo also prints print_r's return value, hence the trailing 1
echo "</pre>";
?>

But instead of the names of all the cases, I get the name of only one case, and I get every price twice.

Output

Array
(
[0] => Array
    (
        [name] => 
                Thermaltake V2 Plus + 350W Power Supply

        [price] => Array
            (
                [0] => 
                        Rs.  4,099                      
                [1] => 
                        Rs.  4,099                      
                [2] => 
                        Rs.  5,899                      
                [3] => 
                        Rs.  5,899                      
                [4] => 
                        Rs.  8,499                      
                [5] => 
                        Rs.  8,499                      
                [6] => 
                        Rs.  9,499                      
                [7] => 
                        Rs.  9,499                      
                [8] => 
                        Rs.  10,350                     
                [9] => 
                        Rs.  10,350                     
                [10] => 
                        Rs.  12,999                     
                [11] => 
                        Rs.  12,999                     
                [12] => 
                        Rs.  17,799                     
                [13] => 
                        Rs.  17,799                     
                [14] => 
                        Rs.  16,199                     
                [15] => 
                        Rs.  16,199                     
                [16] => 
                        Rs.  17,299                     
                [17] => 
                        Rs.  17,299                     
                [18] => 
                        Rs.  16,500                     
                [19] => 
                        Rs.  16,500                     
                [20] => 
                        Rs.  5,899                      
                [21] => 
                        Rs.  5,899                      
                [22] => 
                        Rs.  8,399                      
                [23] => 
                        Rs.  8,399                      
                [24] => 
                        Rs.  4,999                      
                [25] => 
                        Rs.  4,999                      
                [26] => 
                        Rs.  7,599                      
                [27] => 
                        Rs.  7,599                      
                [28] => 
                        Rs.  9,999                      
                [29] => 
                        Rs.  9,999                      
           )
    )
)
1
Can anyone help me solve this? I have tried many different div classes from the site's source, but I could not get the right result.

Well, let's go through your mistakes:

First: you run the query $pk_xpath->query('//h5[@class="name"]', $pat) and then take only item(0).

This means you skip all the other DOMNodes that the XPath query returns. But if you do this:

$names = $pk_xpath->query('//h5[@class="name"]', $pat);
foreach ($names as $n) {
    echo $n->nodeValue . PHP_EOL;
}

you will see all the names on the page.
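A related detail that may explain why every name comes back from a single container: in XPath, an expression starting with // is evaluated from the document root even when a context node is passed as the second argument to query(). Prefixing it with a dot (.//) makes it relative to the context node. Below is a minimal sketch of the difference. The markup and the "item" class are made up for illustration and are not taken from the real page.

```php
<?php
// Hypothetical markup: two containers, each holding one product name.
$html = '<div class="item"><h5 class="name">Case A</h5></div>'
      . '<div class="item"><h5 class="name">Case B</h5></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$absolute = array();
$relative = array();
foreach ($xpath->query('//div[@class="item"]') as $item) {
    // '//' ignores $item and searches the whole document.
    $absolute[] = $xpath->query('//h5[@class="name"]', $item)->length;
    // './/' searches only among the descendants of $item.
    $relative[] = $xpath->query('.//h5[@class="name"]', $item)->length;
}

print_r($absolute); // number of names found per container with '//'
print_r($relative); // number of names found per container with './/'
```

With this markup the absolute query finds both names for each container, while the relative query finds exactly one.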

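Second: the prices. If you inspect the HTML of the scraped page, you will see that span[@class="price product-price"] appears twice for each item. One span is visible; the second sits in a popup block that is currently hidden.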

So you need a different XPath query. For example, you can find all the .product-meta items and then search for price product-price inside each of them.
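That suggestion could look roughly like the sketch below. It runs against a small inline HTML string rather than the live site, and the nesting it assumes (a name and one visible price inside each .product-meta block, with the hidden popup price outside it) is a guess at the page structure, so the queries may need adjusting against the real markup.

```php
<?php
// Hypothetical markup standing in for the real page (structure is assumed).
$html = <<<HTML
<div class="product_list">
  <div class="product">
    <div class="product-meta">
      <h5 class="name">Case A</h5>
      <span class="price product-price">Rs. 4,099</span>
    </div>
    <div class="popup"><span class="price product-price">Rs. 4,099</span></div>
  </div>
  <div class="product">
    <div class="product-meta">
      <h5 class="name">Case B</h5>
      <span class="price product-price">Rs. 5,899</span>
    </div>
    <div class="popup"><span class="price product-price">Rs. 5,899</span></div>
  </div>
</div>
HTML;

libxml_use_internal_errors(true); // real-world HTML is rarely clean
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Match "product-meta" as one class token among possibly several.
$metaQuery = '//div[contains(concat(" ", normalize-space(@class), " "), " product-meta ")]';

$products = array();
foreach ($xpath->query($metaQuery) as $meta) {
    // Relative queries (.//) stay inside the current product-meta block.
    $nameNode  = $xpath->query('.//h5[@class="name"]', $meta)->item(0);
    $priceNode = $xpath->query('.//span[@class="price product-price"]', $meta)->item(0);
    if ($nameNode === null || $priceNode === null) {
        continue; // skip blocks missing either field
    }
    $products[] = array(
        'name'  => trim($nameNode->textContent),
        'price' => trim($priceNode->textContent),
    );
}

print_r($products);
```

Because each product is read from its own block, every name stays paired with its own price, and the duplicate price in the popup is never visited.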