How to parse HTML data into a PHP array
Web page data:
<div class="test">
<strong>ID</strong>
<a href="a.html" title="a html">123456</a><br>
<label class='label'>Occupation </label>
House wife <br>
<label>Language?</label>
English <br>
<label style="width:50%">Basic Language Knowledge of?</label>
Hindi <br>
<label>Start date</label>
Nov 2013 <br>
<label>Other Info</label>
yes <br>
<label>age</label>
19 <br>
<label>Gender</label>
Female <br>
<strong>Address</strong>
India <br><br>
<p>Hi, <br>
Lorem ipsum doner inut</p>
</div>
I tried:
<?php
$html='Let above html to parse';
preg_match_all('/<label\s(.*)>(.*)<\/label>/U',$html,$m);
print_r($m);
// gives all label contents only but I need pair of label text
// and value showing after it
?>
The output I need looks like:
Array('ID'=>123456,'link'=>'a.html','Occupation'=>'House wife','Language?'=>'English','Basic Language Knowledge of?'=>'Hindi','Start date'=>'Nov 2013','Other Info'=>'yes','age'=>'19','Gender'=>'Female','Address'=>'India','Description'=>'Hi, Lorem ipsum doner inut');
Yes, I forgot to mention:
I am using ganon for scraping.
Use DOMDocument to parse the HTML.
$doc = new DOMDocument();
$doc->loadHTML($html);
Then use DOMXPath to get all the labels:
$xpath = new DOMXPath($doc);
$allLabels = $xpath->query('//label');
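foreach ($allLabels as $label) {
    var_dump($label, $label->nodeValue);
    /* or, to rebuild the label's inner HTML: */
    // './node()' is relative to $label; a leading '/' would select from the document root
    $labelNodes = $xpath->query('./node()', $label);
    $innerHTML = '';
    foreach ($labelNodes as $node) {
        // $doc is the DOMDocument created above
        $innerHTML .= $doc->saveHTML($node);
    }
    var_dump($innerHTML);
}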
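The loop above only reads the labels themselves, while the question asks for label/value pairs. A minimal sketch of one way to get them with DOMXPath follows: for each `<label>` or `<strong>` inside the `div.test` container, walk the following siblings up to the next `<br>` and collect their text. The helper name `parseLabelPairs()` and the `link` / `Description` keys are assumptions taken from the desired output in the question, not part of any library.

```php
<?php
// Hypothetical helper: pairs each <label>/<strong> with the content that follows it.
function parseLabelPairs(string $html): array
{
    $previous = libxml_use_internal_errors(true); // keep warnings about loose markup quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath  = new DOMXPath($doc);
    $result = [];

    // Direct children of div.test that act as keys, in document order.
    $keys = $xpath->query('//div[@class="test"]/*[self::label or self::strong]');
    foreach ($keys as $key) {
        $value = '';
        $link  = null;
        // Collect everything between the key element and the next <br>.
        for ($node = $key->nextSibling; $node !== null; $node = $node->nextSibling) {
            if ($node->nodeName === 'br') {
                break;
            }
            if ($node->nodeName === 'a') {
                $link = $node->getAttribute('href');
            }
            $value .= $node->textContent;
        }
        $result[trim($key->textContent)] = trim($value);
        if ($link !== null) {
            $result['link'] = $link; // assumed key name, taken from the desired output
        }
    }

    // The trailing paragraph becomes the description, with whitespace collapsed.
    $p = $xpath->query('//div[@class="test"]/p')->item(0);
    if ($p !== null) {
        $result['Description'] = trim(preg_replace('/\s+/', ' ', $p->textContent));
    }
    return $result;
}

// Usage with a shortened version of the sample markup.
$sample = '<div class="test"><strong>ID</strong> <a href="a.html">123456</a><br>'
        . '<label>age</label> 19 <br><p>Hi, <br> Lorem ipsum</p></div>';
print_r(parseLabelPairs($sample));
```

This relies on every value ending at a `<br>`, as in the sample; markup that nests values differently would need a different stopping rule.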
A simpler solution.
Use QueryPath:
foreach(qp($html, 'label') as $label){
echo $label->text();
}
Just like jQuery.
I am using ganon,
so I don't want to use DOMDocument.
I tried a few things that worked,
like:
// for description
echo $desc=$html('div.right_div p',0)->getInnerText();
$s=$html('div.right_div',0)->getInnerText();
// for occupation
$r='/<label>\s*Occupation\s*<\/label>\s*(.*)\s*<br\s*[\/]>/i';
preg_match_all($r,$s,$ma);
echo $occupation=$ma[1][0]; // $ma[1] is an array of captures; take the first one
// for address
$r='/<strong>\s*Address\s*<\/strong>\s*(.*)\s*<br\s*[\/]>/i';
preg_match_all($r,$s,$ma);
echo $address=$ma[1][0];
// for id
echo $id=$html('div.right_div a',0)->getInnerText();
and so on...
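Writing one regex per field gets repetitive. As a rough sketch, the same idea can be folded into a single pattern that captures every `<label>` or `<strong>` together with the text up to the next `<br>`. The helper name `regexLabelPairs()` is hypothetical, and the pattern assumes the input string looks like the sample markup in the question (each value terminated by `<br>` or `<br />`); regexes remain brittle against arbitrary HTML.

```php
<?php
// Hypothetical helper: builds key => value pairs from label-like tags using one regex.
function regexLabelPairs(string $html): array
{
    // 1: tag name, 2: key text, 3: everything up to the next <br> or <br />
    $pattern = '/<(label|strong)\b[^>]*>(.*?)<\/\1>(.*?)<br\s*\/?>/is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $result = [];
    foreach ($matches as $m) {
        $key = trim(strip_tags($m[2]));
        // strip_tags() drops wrappers such as the <a> around the ID value
        $result[$key] = trim(strip_tags($m[3]));
    }
    return $result;
}

// Usage with a shortened version of the sample markup.
$sample = "<label class='label'>Occupation </label>\nHouse wife <br>\n"
        . "<strong>Address</strong>\nIndia <br><br>";
print_r(regexLabelPairs($sample));
```

The description paragraph and the link `href` are not covered by this pattern and would still need their own selectors, as in the snippet above.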