How to extract pairs of attributes from an HTML document using XPath?

The given HTML document contains a form like this:

<form>
    <div controlType="yyy1" xmlTag="zzz1">...</div>
    <div controlType="yyy2" xmlTag="zzz2">...</div>
</form>

I need to collect this data:

$div[0]      = array('yyy1', 'zzz1');
$div[1]      = array('yyy2', 'zzz2');

For each div element, the required pair of attributes is controlType and xmlTag.

Evaluate these two XPath expressions:

/form/div[$k]/@controlType

and:

/form/div[$k]/@xmlTag

to populate $div[$k - 1],

where $k must be substituted with each of the numbers 1, 2, ..., count(/form/div).

One might be tempted to combine the two expressions above into a single XPath expression:

/form/div[$k]/@*

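However, an XPath implementation is allowed to return attributes in any order (XPath defines no ordering among the attributes of an element), so it is not clear which of the two attributes comes first in the selected nodes and which comes second.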
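Although the order of the nodes selected by @* is not defined, each attribute node still carries its own name, so the result can be keyed by name instead of by position. A minimal sketch, assuming PHP's DOMDocument and DOMXPath and assuming the sample fragment is loaded as XML so that attribute names keep their original case:

```php
<?php
// Sample fragment from the question, loaded as XML so that attribute
// names keep their original case (an assumption made for this sketch).
$xml = '<form>
    <div controlType="yyy1" xmlTag="zzz1">...</div>
    <div controlType="yyy2" xmlTag="zzz2">...</div>
</form>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$div   = array();
// count() comes back as a float, so cast it before using it as a loop bound.
$count = (int) $xpath->evaluate('count(/form/div)');
for ($k = 1; $k <= $count; $k++) {
    $attrs = array();
    // Key each attribute by its name, so the order in which the
    // implementation returns the nodes no longer matters.
    foreach ($xpath->query("/form/div[$k]/@*") as $attr) {
        $attrs[$attr->nodeName] = $attr->nodeValue;
    }
    $div[$k - 1] = array($attrs['controlType'], $attrs['xmlTag']);
}

print_r($div);
```

This still runs one query per div; it only removes the dependence on attribute order.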

My two cents, if it helps:

var doc = '<form xmltag="xxx"><div controltype="yyy1" xmltag="zzz1">...</div><div controltype="yyy2" xmltag="zzz2">...</div></form>';
var result = [];
$(doc).children().each(function () {
    var ctrl = $(this);
    if (ctrl.is('div')) {
        result.push([ctrl.attr('controlType'), ctrl.attr('xmlTag')]);
    }
});

My final solution, based on the excellent proposal by @dimitre novachev:

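$res          = $xpath->query("//form//div/@xmltag"); // NB: xmltag, not xmlTag
$total_fields = $res->length;
for ($i = 1; $i <= $total_fields; $i++)
{
    // The parentheses make [$i] index the whole node-set in document order;
    // without them, div[$i] means "the $i-th div child of its parent".
    $r           = $xpath->query("(//form//div[@xmltag])[$i]/@xmltag");
    $xmltag      = $r->item(0)->value;
    $r           = $xpath->query("(//form//div[@xmltag])[$i]/@controltype");
    // A div without a controltype attribute yields an empty node list.
    $controltype = $r->length ? $r->item(0)->value : '';
    $div[$i - 1] = array(
        'xmltag'      => $xmltag,
        'controltype' => $controltype
    );
}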

Sample output:

array (
  0 => 
  array (
    'xmltag' => 'Case_Number',
    'controltype' => '',
  ),
  1 => 
  array (
    'xmltag' => 'Plaintiff',
    'controltype' => 'RadioButtons',
  ),
  2 => 
  array (
    'xmltag' => 'Plaintiff_Name',
    'controltype' => '',
  ),

Nice!
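The same pairs can probably be collected in a single pass: select the div elements once and read both attributes from each node, rather than running two queries per index. A minimal sketch, assuming the page is parsed with DOMDocument::loadHTML (whose HTML parser lowercases attribute names, which is why xmltag is used rather than xmlTag); the markup below is the sample from the question, not the real page:

```php
<?php
// Sample markup from the question; the real page is not shown in the thread.
$html = '<form>
    <div controlType="yyy1" xmlTag="zzz1">...</div>
    <div controlType="yyy2" xmlTag="zzz2">...</div>
</form>';

$dom = new DOMDocument();
$dom->loadHTML($html);          // the HTML parser lowercases attribute names
$xpath = new DOMXPath($dom);

$div = array();
// One query selects every div that has an xmltag attribute, in document order.
foreach ($xpath->query('//form//div[@xmltag]') as $node) {
    $div[] = array(
        // getAttribute() returns '' when the attribute is absent.
        'xmltag'      => $node->getAttribute('xmltag'),
        'controltype' => $node->getAttribute('controltype'),
    );
}

var_export($div);
```

Reading the attributes with getAttribute() sidesteps the attribute-order question entirely, since each value is requested by name.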

$url      = "http://XXX.xom";
$path     = "//div[@class='sb_tlst']//a";
// get_contents() is the poster's own helper (not shown here); it is not a PHP built-in.
$contents = get_contents($url, $path);
foreach ($contents as $value)
{
    /* do something */
}