How to break down and parse specific Wikipedia text

I have the following working example, which retrieves a specific Wikipedia page and returns it as a SimpleXMLElement object:

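// Wikipedia's API expects a User-Agent that identifies the client
ini_set('user_agent', 'michael@example.com');
$doc = new DOMDocument();
// Fetch the rendered page as XML from the MediaWiki parse API
$doc->load('http://en.wikipedia.org/w/api.php?action=parse&page=Main%20Page&format=xml');
$xml = simplexml_import_dom($doc);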
print '<pre>';
print_r($xml);
print '</pre>';

Which returns:

SimpleXMLElement Object
(
    [parse] => SimpleXMLElement Object
        (
            [@attributes] => Array
                (
                    [title] => Main Page
                    [revid] => 472210092
                    [displaytitle] => Main Page
                )
            [text] => <body><table id="mp-topbanner" style="width: 100%;"...

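Silly question, my mind has gone blank. What I want to do is capture the $xml->parse->text element and then parse that in turn. So ultimately I would like to get back the following object; how can I achieve this?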

SimpleXMLElement Object
(
    [body] => SimpleXMLElement Object
        (
            [table] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [id] => mp-topbanner
                            [style] => width:100% ...

After a fresh cup of tea and a banana, I came up with a solution:

ini_set('user_agent', 'michael@example.com');
$doc = new DOMDocument();
$doc->load('http://en.wikipedia.org/w/api.php?action=parse&page=Main%20Page&format=xml');
// The <text> element holds the rendered page as an HTML string
$nodes = $doc->getElementsByTagName('text');
$str = $nodes->item(0)->nodeValue;
// Parse that HTML string into a second document
$html = new DOMDocument();
$html->loadHTML($str);

This lets me read the value of an element, which is what I wanted. For example:

echo "Some value: ";
echo $html->getElementById('someid')->nodeValue;
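
To get back to the SimpleXMLElement shape from the question, the second DOMDocument can probably just be handed to simplexml_import_dom() again. Below is a minimal sketch of that idea. It assumes a small hand-written $apiXml string as a stand-in for the real API response (the sample markup and the cell text are made up for illustration), so it runs without a network call. It also switches libxml to internal error handling, on the assumption that real Wikipedia HTML will trigger parser warnings.

```php
<?php
// Hypothetical stand-in for the API response; the real one comes from
// the api.php URL used above. The HTML arrives escaped inside <text>.
$apiXml = <<<'XML'
<api>
  <parse title="Main Page" displaytitle="Main Page">
    <text>&lt;table id="mp-topbanner" style="width: 100%;"&gt;&lt;tr&gt;&lt;td&gt;Welcome to Wikipedia&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</text>
  </parse>
</api>
XML;

$doc = new DOMDocument();
$doc->loadXML($apiXml);

// Same step as in the solution: pull the HTML string out of <text>.
$str = $doc->getElementsByTagName('text')->item(0)->nodeValue;

// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$html = new DOMDocument();
$html->loadHTML($str);
libxml_clear_errors();

// loadHTML() wraps the fragment in <html><body>, so the imported
// SimpleXMLElement exposes a body child, as in the desired output.
$sxml = simplexml_import_dom($html);
$table = $sxml->body->table;

echo $table['id'], "\n";
echo $html->getElementById('mp-topbanner')->nodeValue, "\n";
```

With this, print_r($sxml) should show the nested body/table structure, and both the SimpleXML and the DOM view of the same document remain usable. One caveat: loadHTML() treats input as ISO-8859-1 unless the markup declares a charset, so pages with non-ASCII text may need an encoding hint.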