Regex: How to Extract HTML Heading Tags


I want to extract all heading tags (h1, h2, h3, ...) together with their content. For example:

<h1 id="title">This is the title</h1>
<h2 id="subtitle">This is the subtitle</h2>
<p>And this is the paragraph</p>

should be extracted as:

<h1 id="title">This is the title</h1> and <h2 id="subtitle">This is the subtitle</h2>

I am using PHP and, as the title says, I would like to do this with a regular expression.

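I suggest using the right tool for the job: an HTML parser (here DOMDocument with XPath) rather than a regular expression.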

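// loadHTML() is an instance method, so create the document first.
$doc = new DOMDocument();
$doc->loadHTML('
    <h1 id="title">This is the title</h1>
    <h2 id="subtitle">This is the subtitle</h2>
    <p>And this is the paragraph</p>
    <p>another tag</p>
');
$xpath = new DOMXPath($doc);
// Select every heading element, h1 through h6, in document order.
$heads = $xpath->query('//h1|//h2|//h3|//h4|//h5|//h6');
foreach ($heads as $tag) {
    // saveHTML($node) serializes just that node, including its markup.
    echo $doc->saveHTML($tag), "\n";
}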

Output:

<h1 id="title">This is the title</h1>
<h2 id="subtitle">This is the subtitle</h2>
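
If a regular expression is still required, as the question asks, the following is a minimal sketch rather than a robust solution. It assumes simple, well-formed markup: headings are not nested inside one another, and no attribute value contains a `>` character. The variable names `$html`, `$pattern` and `$matches` are illustrative, and the sample input is the one from the question.

```php
<?php
// Sample input copied from the question.
$html = <<<'HTML'
<h1 id="title">This is the title</h1>
<h2 id="subtitle">This is the subtitle</h2>
<p>And this is the paragraph</p>
HTML;

// "<h" plus a captured digit 1-6, optional attributes, lazily matched
// content, then a closing tag with the same digit (backreference \1).
// Flags: i = case-insensitive, s = "." also matches newlines.
// Assumption: no ">" inside attribute values, no nested headings.
$pattern = '/<h([1-6])\b[^>]*>.*?<\/h\1>/is';

preg_match_all($pattern, $html, $matches);

// $matches[0] holds the full matches, one per heading.
foreach ($matches[0] as $heading) {
    echo $heading, "\n";
}
```

On the sample input this prints the same two heading lines as the DOMDocument version. On real-world HTML (comments, malformed tags, unusual attributes) the pattern can miss or mis-match headings, which is why the parser-based approach remains the safer choice.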