PHP: DOM get URLs and anchor text (but not IMG)

I want to collect all the URLs from an HTML page into an array. For example, given:

This is a webpage <a href="http://somesite.com/link1.php">with</a> 
different kinds of <a href="http://somesite.com/link1.php"><img src="someimg.png"></a>

The output I want is:

with => http://somesite.com/link1.php

What I get now is:

<img src="someimg.png"> => http://somesite.com/link1.php
with => http://somesite.com/link1.php

I don't want the URLs/links that contain an image between the opening and closing <a> tags. Only the ones with text.

My current code is:

<?php
function innerHTML($node) {
    $ret = '';
    foreach ($node->childNodes as $child) {
        $ret .= $child->ownerDocument->saveHTML($child);
    }
    return $ret;
}
$html = file_get_contents('http://somesite.com/'.$_GET['apt']);
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ suppresses the warnings raised by malformed HTML
$links = $dom->getElementsByTagName('a');
$result = array();
foreach ($links as $link) {
    //$node = $link->nodeValue;
    $node = innerHTML($link);
    $href = $link->getAttribute('href');
    if (preg_match('/\.pdf$/i', $href)) {
        $result[$node] = $href;
    }
}
print_r($result);
?>

Add a second preg_match to your condition:

if (preg_match('/\.pdf$/i', $href) && !preg_match('/<img .*>/i', $node)) $result[$node] = $href;
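
If you would rather not match HTML with a regex, the same filter can probably be done on the DOM itself by skipping any <a> that has an <img> descendant. The following is only a sketch under a few assumptions: the function name textOnlyPdfLinks and the sample markup are made up for illustration, it uses the link's textContent as the array key instead of your innerHTML() helper, and it uses libxml_use_internal_errors() in place of the @ operator.

```php
<?php
// Hypothetical helper (not from the original post): collect PDF links
// whose anchor contains no <img> element.
function textOnlyPdfLinks(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        // Skip anchors that contain an <img> anywhere inside them.
        if ($link->getElementsByTagName('img')->length > 0) {
            continue;
        }
        $href = $link->getAttribute('href');
        if (preg_match('/\.pdf$/i', $href)) {
            // Assumption: the plain link text is an acceptable array key.
            $result[trim($link->textContent)] = $href;
        }
    }
    return $result;
}

// Made-up sample markup, for illustration only.
$sample = '<p>Read <a href="http://somesite.com/doc1.pdf">with</a> or '
        . '<a href="http://somesite.com/doc2.pdf"><img src="someimg.png"></a></p>';
print_r(textOnlyPdfLinks($sample));
```

Checking for an <img> descendant also catches images nested deeper inside the link (for example inside a <span>), and it does not depend on how the tag happens to be serialized. Note that two links with the same text would overwrite each other in the array, just as in the original code.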