Regex: How to Extract HTML Heading Tags

Extract all heading tags (h1, h2, h3, ...) together with their content. For example:

<h1 id="title">This is the title</h1>
<h2 id="subtitle">This is the subtitle</h2>
<p>And this is the paragraph</p>

should be extracted as:

<h1 id="title">This is the title</h1><h2 id="subtitle">This is the subtitle</h2>

I am using PHP and, as the title says, regular expressions.

I would suggest using the right tool for the job: an HTML parser (DOMDocument with DOMXPath) rather than a regex.

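// Instantiate first: calling DOMDocument::loadHTML() statically is an error in PHP 8.
$doc = new DOMDocument();
$doc->loadHTML('
    <h1 id="title">This is the title</h1>
    <h2 id="subtitle">This is the subtitle</h2>
    <p>And this is the paragraph</p>
    <p>another tag</p>
');
$xpath = new DOMXPath($doc);
// Select every heading element, h1 through h6.
$heads = $xpath->query('//h1|//h2|//h3|//h4|//h5|//h6');
foreach ($heads as $tag) {
    // Serialize each matched node back to HTML, one per line.
    echo $doc->saveHTML($tag), "\n";
}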

Output:

<h1 id="title">This is the title</h1>
<h2 id="subtitle">This is the subtitle</h2>
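
If you need the headings as one concatenated string, as in the question, the same approach can be wrapped in a small function. This is only a sketch: `extractHeadings` is a name I made up for illustration, and the `libxml_use_internal_errors()` calls are my addition to keep warnings about imperfect markup out of the output.

```php
<?php
// Hypothetical helper (not part of the original answer): returns every
// h1-h6 element found in $html, serialized and joined into one string.
function extractHeadings(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of printing them,
    // then restore whatever setting was active before.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $out = '';
    foreach ($xpath->query('//h1|//h2|//h3|//h4|//h5|//h6') as $tag) {
        // trim() drops any surrounding whitespace the serializer may emit.
        $out .= trim($doc->saveHTML($tag));
    }
    return $out;
}

$html = '<h1 id="title">This is the title</h1>
<h2 id="subtitle">This is the subtitle</h2>
<p>And this is the paragraph</p>';

echo extractHeadings($html), "\n";
```

For the sample input this should give the single-line form asked for in the question, with the `<p>` element left out. Note that `loadHTML()` rejects an empty string, so check for that first if the input may be empty.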