How to scrape H2 and H3 tags from an HTML page in PHP?

I need to get the h2 and h3 tags from the HTML code below as a $var in PHP:

<div class="main-info">
   <img class="iphone-img" alt="" src="https://www.myweb.com/securedImage.jsp">
        <div class="sub-info">
                <h2 class="model">iPhone 4S</h2>
                <h3 class="capacity color">16GB Black</h3>
          </div>
</div>

I want this result:

echo $model; // Should echo:  'iPhone 4S'
echo $capacitycolour; // Should echo: '16GB Black'

I have tried preg_match, preg_match_all, and getElementsByTagName, but no luck so far.

Here is the code I tried:

$pattern = '/[^\n]h2*[^\n]*/';
preg_match_all($pattern,$data, $matches, PREG_OFFSET_CAPTURE);
var_dump($matches);

and:

$doc = new DOMDocument();
$doc->loadHTML($data);
$tags = $doc->getElementsByTagName('sub-info');
$root = $doc->documentElement;
foreach($root->childNodes as $node){
    $attributes[$node->nodeName] = $node->nodeValue;
}
var_dump($attributes);

sub-info is a class, not a tag name, so your use of DOMDocument is flawed. You would be better off using an XPath query.

$strhtml='<div class="main-info">
            <img class="iphone-img" alt="" src="https://www.myweb.com/securedImage.jsp?configcode=DTF9&size=120x120">
            <div class="sub-info">
                <h2 class="model">
                        iPhone 4S
                </h2>
                <h3 class="capacity color">
                    16GB Black 
                </h3>
            </div>
        </div>';

$doc = new DOMDocument();
$doc->loadHTML( $strhtml );
$xpath=new DOMXPath( $doc );
$col=$xpath->query('//div[@class="sub-info"]/h2|//div[@class="sub-info"]/h3');
if( $col ){
    /* You could store results from query in an array */
    $tags=array();
    foreach( $col as $node ) {
        /* Simplest form to display results on separate lines, use br tag */
        echo $node->nodeValue . '<br />';
        /* Add tags to array - a rethink would be required if there are multiple h2 and h3 tags! */
        $tags[ $node->tagName ]=$node->nodeValue;
    }
    /* echo back results from array */
    echo $tags['h2'];
    echo '<br />';
    echo $tags['h3'];
}
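The comment in the code above notes that keying the array by tag name breaks down once there are several h2 and h3 tags. One possible way around that, sketched below, is to query each sub-info block separately and collect one entry per block. The two-phone input, the `$phones` array, and its `model` / `capacity` keys are assumptions made for illustration and are not part of the original question.

```php
<?php
// Hypothetical input: two product blocks instead of one.
$html = '<div class="main-info">
  <div class="sub-info">
    <h2 class="model"> iPhone 4S </h2>
    <h3 class="capacity color"> 16GB Black </h3>
  </div>
  <div class="sub-info extra">
    <h2 class="model">iPhone 5</h2>
    <h3 class="capacity color">32GB White</h3>
  </div>
</div>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match "sub-info" as a single class token, so class="sub-info extra" also matches.
$blocks = $xpath->query('//div[contains(concat(" ", normalize-space(@class), " "), " sub-info ")]');

$phones = array();
foreach ($blocks as $block) {
    // Relative queries (note the leading dot) stay inside the current block.
    $h2 = $xpath->query('./h2', $block)->item(0);
    $h3 = $xpath->query('./h3', $block)->item(0);
    $phones[] = array(
        'model'    => $h2 ? trim($h2->nodeValue) : null,
        'capacity' => $h3 ? trim($h3->nodeValue) : null,
    );
}

print_r($phones);
```

With a single block, `$phones[0]['model']` and `$phones[0]['capacity']` correspond to the `$model` and `$capacitycolour` values asked for in the question.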

For the future, just try an online regex tester to validate your expressions.

For the H2 tag, the following will work, although it is not optimal: .*<h2.*>[\n\s]*(.*)
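As a rough sketch of how a somewhat tighter pattern could be used from PHP, the example below captures the text of both h2 and h3 in one pass. The pattern, the `$data` sample, and the `$found` array are assumptions for illustration, not the expression given above, and it only suits small, predictable snippets like this one.

```php
<?php
$data = '<div class="sub-info">
    <h2 class="model">
        iPhone 4S
    </h2>
    <h3 class="capacity color">16GB Black</h3>
</div>';

// [^>]* keeps the match inside the opening tag, the s flag lets . span newlines,
// the lazy (.*?) stops at the first closing tag, and \1 repeats the captured digit (2 or 3).
preg_match_all('/<h([23])[^>]*>\s*(.*?)\s*<\/h\1>/is', $data, $matches, PREG_SET_ORDER);

$found = array();
foreach ($matches as $m) {
    // $m[1] is the heading level, $m[2] is the text with surrounding whitespace dropped.
    $found['h' . $m[1]] = $m[2];
}

echo $found['h2'], "\n"; // iPhone 4S
echo $found['h3'], "\n"; // 16GB Black
```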

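I have used simple_html_dom.php on many occasions and it works very well. It lets you use CSS-like selectors once the document is loaded. You can also parse from a string, a local file, or a URL! The following will give you an array of elements: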

// $html is assumed to come from str_get_html($data) or file_get_html($url)
$div = $html->find('div.sub-info');
$ret = $div[0]->find('h2, h3');

API reference: here

Warning: do not use regex to parse HTML. If you do, see here what will happen :)
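To make the warning concrete, here is a small sketch of one way a regex tied to the exact markup can fail while a DOM parser still finds the tag. The altered markup in `$data` (single-quoted attributes, an extra id and class) is a made-up variation of the question's HTML, used only for illustration.

```php
<?php
// Same content as in the question, but with single quotes and extra attributes.
$data = "<div class='sub-info'><h2 id='m' class='model big'>iPhone 4S</h2></div>";

// A regex written against the exact attribute text no longer matches.
$regexHit = preg_match('/<h2 class="model">(.*?)<\/h2>/i', $data, $m);

// A DOM parser does not depend on quoting style or attribute order.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($data);
libxml_clear_errors();
$h2 = $doc->getElementsByTagName('h2')->item(0);
$model = $h2 ? trim($h2->nodeValue) : null;

var_dump($regexHit); // int(0)
echo $model, "\n";   // iPhone 4S
```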

Is that you, Cyberboki?

Check this out.

$strhtml='<div class="main-info">
        <img class="iphone-img" alt="" src="https://www.myweb.com/securedImage.jsp?configcode=DTF9&size=120x120">
        <div class="sub-info">
            <h2 class="model">
                    iPhone 4S
            </h2>
            <h3 class="capacity color">
                16GB Black 
            </h3>
        </div>
    </div>';
$new = preg_replace("/\s+/", ' ', $strhtml);
preg_match('/<h2 class="model">(.*?)<\/h2>/i', $new, $h2);
preg_match('/<h3 class="capacity color">(.*?)<\/h3>/i', $new, $h3);
echo "option 1";
echo "<br/>";
echo $h2[1];
echo "<br/>";
echo $h3[1];
echo "<br/>";
echo "<br/>";
$rr = array();
$ex = explode("\n", strip_tags($strhtml));
foreach ($ex as $key) {
    // Collapse whitespace and skip lines left empty once the tags are stripped
    $line_out = preg_replace('/\s+/', ' ', trim($key));
    if (strlen($line_out) > 0) {
        $rr[] = $line_out;
    }
}
echo "option 2";
echo "<br/>";       
echo $rr[0];
echo "<br/>";
echo $rr[1];        
result:
option 1
iPhone 4S
16GB Black
option 2
iPhone 4S
16GB Black 

Best regards, iPhone Yeta