Scraping HTML data using PHP

How can I parse this HTML data into a PHP array?

The page data:

<div class="test">
    <strong>ID</strong>
    <a href="a.html" title="a html">123456</a><br>
    <label class='label'>Occupation </label>    
    House wife      <br>
    <label>Language?</label>    
    English     <br>
    <label style="width:50%">Basic Language Knowledge of?</label>   
    Hindi       <br>
    <label>Start date</label>
    Nov 2013        <br>
    <label>Other Info</label>
    yes     <br>
    <label>age</label>
    19      <br>
    <label>Gender</label>   
    Female      <br>
    <strong>Address</strong>
    India       <br><br>
    <p>Hi, <br>
Lorem ipsum doner inut</p>
</div>

I have tried:

<?php
    $html = 'the HTML shown above';
    preg_match_all('/<label\s*(.*)>(.*)<\/label>/U', $html, $m);
    print_r($m);
    // This gives only the label contents, but I need pairs of
    // label text and the value shown after it.
?>

The output should look like:

Array('ID'=>123456,'link'=>'a.html','Occupation'=>'House wife','Language?'=>'English','Basic Language Knowledge of?'=>'Hindi','Start date'=>'Nov 2013','Other Info'=>'yes','age'=>'19','Gender'=>'Female','Address'=>'India','Description'=>'Hi, Lorem ipsum doner inut');

Yes, I forgot to mention that I am using ganon for scraping.

Use DOMDocument to parse the HTML.

$doc = new DOMDocument();
$doc->loadHTML($html);

Then use DOMXPath to get all the labels:

$xpath = new DOMXPath($doc);
$allLabels = $xpath->query('//label');
foreach ($allLabels as $label) {
    // Dump the node and its plain text content
    var_dump($label, $label->nodeValue);
    /* or, to collect the label's inner HTML: */
    // 'node()' is evaluated relative to $label and includes text nodes
    $labelNodes = $xpath->query('node()', $label);
    $innerHTML = '';
    foreach ($labelNodes as $node) {
        $innerHTML .= $doc->saveHTML($node);
    }
    var_dump($innerHTML);
}
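
The loop above only reaches the label text, while the question asks for each label paired with the value that follows it. One possible way to get there with the same DOMXPath object is to walk the siblings after each key element until the next <br>. This is only a sketch: it assumes the values always sit between the key element and a <br>, it treats <strong> as a key as well, and the 'link' and 'Description' key names are taken from the desired output in the question. The sample markup is shortened.

```php
<?php
// Shortened copy of the markup from the question.
$html = <<<'HTML'
<div class="test">
    <strong>ID</strong>
    <a href="a.html" title="a html">123456</a><br>
    <label class='label'>Occupation </label>
    House wife      <br>
    <label>Language?</label>
    English     <br>
    <strong>Address</strong>
    India       <br><br>
    <p>Hi, <br>
Lorem ipsum doner inut</p>
</div>
HTML;

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$result = [];
$stopTags = ['br', 'label', 'strong'];

// Every <label> or <strong> directly inside the div is treated as a key.
foreach ($xpath->query('//div[@class="test"]/*[self::label or self::strong]') as $keyNode) {
    $value = '';
    // Collect the text of following siblings until a <br> or the next key.
    for ($n = $keyNode->nextSibling; $n !== null; $n = $n->nextSibling) {
        if ($n instanceof DOMElement && in_array($n->nodeName, $stopTags, true)) {
            break;
        }
        $value .= $n->textContent;
    }
    $result[trim($keyNode->textContent)] = trim($value);
}

// The link and the description are not label/value pairs, so fetch them directly.
$result['link'] = $xpath->evaluate('string(//div[@class="test"]/a/@href)');
$desc = $xpath->evaluate('string(//div[@class="test"]/p)');
$result['Description'] = trim(preg_replace('/\s+/', ' ', $desc));

print_r($result);
```

Walking siblings rather than matching text with a regular expression means attributes on the labels (class, style) do not need special handling.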

A simpler solution.

Use QueryPath:

foreach(qp($html, 'label') as $label){
  echo $label->text();
}

It works just like jQuery.

I am using ganon, so I don't want to use DOMDocument. I tried a few things, and this worked:

// for description
echo $desc = $html('div.right_div p', 0)->getInnerText();
$s = $html('div.right_div', 0)->getInnerText();
// for occupation
$r = '/<label>\s*Occupation\s*<\/label>\s*(.*)\s*<br\s*[\/]>/i';
preg_match($r, $s, $ma);   // preg_match, so that $ma[1] is a string
echo $occupation = trim($ma[1]);
// for address
$r = '/<strong>\s*Address\s*<\/strong>\s*(.*)\s*<br\s*[\/]>/i';
preg_match($r, $s, $ma);
echo $address = trim($ma[1]);
// for id
echo $id = $html('div.right_div a', 0)->getInnerText();

And so on...
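
Rather than writing one pattern per field, the same idea could probably be folded into a single preg_match_all call that captures every key/value pair at once. The sketch below is an assumption-heavy illustration: the $s string is a hypothetical stand-in for what getInnerText() might return, and the pattern assumes each value ends at a <br> or <br /> and that keys are wrapped in <label> or <strong>.

```php
<?php
// Hypothetical stand-in for the string returned by getInnerText().
$s = <<<'HTML'
<strong>ID</strong>
<a href="a.html">123456</a><br />
<label>Occupation </label>
 House wife <br />
<label>Language?</label>
 English <br />
<strong>Address</strong>
 India <br /><br />
HTML;

// Group 1: tag name, group 2: key text, group 3: everything up to the next <br>.
// The "s" flag lets "." span line breaks; the lazy quantifiers keep each match short.
$pattern = '/<(label|strong)[^>]*>\s*(.*?)\s*<\/\1>\s*(.*?)\s*<br\s*\/?>/is';
preg_match_all($pattern, $s, $matches, PREG_SET_ORDER);

$fields = [];
foreach ($matches as $m) {
    // strip_tags() removes wrappers such as the <a> around the ID.
    $fields[$m[2]] = trim(strip_tags($m[3]));
}

print_r($fields);
```

A pattern like this is fragile if the markup nests tags inside the values or drops the <br>, so it is best treated as a shortcut for this particular page layout.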