Regex to strip anything that isn't an HTML comment

Keywords: html, comment, strip, regex | Updated: 2023-09-27

I know that parsing HTML with regular expressions is generally a bad idea, but I'm not after anything clever...

Take, for example:

<div><!--<b>Test</b>-->Test</div>
<div><!--<b>Test2</b>-->Test2</div>

I want to strip everything that isn't between <!-- and --> to get:

<b>Test</b><b>Test2</b>

The comment tags are guaranteed to match correctly (no unclosed or nested comments).

What regex do I need to use?

Replace the pattern:

(?s)((?!-->).)*<!--|-->((?!<!--).)*

with an empty string.

A short explanation:

(?s)              # enable DOT-ALL
((?!-->).)*<!--   # match anything except '-->' ending with '<!--'
|                 # OR
-->((?!<!--).)*   # match '-->' followed by anything except '<!--'

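Be careful when processing (X)HTML with regex. Whenever part of a comment delimiter appears inside a tag attribute or a CDATA block, the result will be wrong.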
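To make that caveat concrete, here is a minimal JavaScript sketch in which a stray "-->" sits inside an attribute value. The input string and the function name stripOutsideComments are made up for illustration; the pattern is the one above, with [\s\S] standing in for (?s) and the dot.

```javascript
// Hypothetical helper: removes everything outside <!-- ... --> pairs.
function stripOutsideComments(html) {
  return html.replace(/((?!-->)[\s\S])*<!--|-->((?!<!--)[\s\S])*/g, "");
}

// Illustrative input: the attribute value contains a stray "-->".
const trickyInput = '<a title="-->">x</a><!--<b>Test</b>-->y';

// The regex treats the "-->" in the attribute as a real comment end,
// so the text before it is never matched and leaks into the result.
console.log(stripOutsideComments(trickyInput));
// prints: <a title="<b>Test</b>
```

The well-formed part of the input is handled as intended; only the fragment in front of the stray delimiter survives, which is the kind of breakage the warning refers to.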

EDIT

Seeing that your most active tag is JavaScript, here's a JS demo:

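// print() is available in shells such as Rhino or SpiderMonkey (as on Ideone);
// use console.log() in a browser or in Node.js.
print(
  "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>"
  .replace(
    /((?!-->)[\s\S])*<!--|-->((?!<!--)[\s\S])*/g,
    ""
  )
);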

which prints:

<b>Test</b><b>Test2</b>

Note that since JS doesn't support the inline (?s) flag, I used the equivalent [\s\S], which matches any character (including line breaks).
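On newer engines there is another option: ES2018 added an s (dotAll) flag on the regex literal, which makes the dot match line breaks too. The sketch below assumes a runtime with that flag; the function name stripWithDotAll is made up for illustration.

```javascript
// Assumption: the runtime supports the ES2018 `s` (dotAll) flag.
const sampleHtml =
  "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>";

// Same pattern as the PCRE version; the trailing `s` plays the role of (?s).
function stripWithDotAll(html) {
  return html.replace(/((?!-->).)*<!--|-->((?!<!--).)*/gs, "");
}

console.log(stripWithDotAll(sampleHtml));
// prints: <b>Test</b><b>Test2</b>
```

The [\s\S] form remains the safer choice if the code has to run on older engines.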

Tested on Ideone: http://ideone.com/6yQaK

EDIT II

A PHP demo would look like:

<?php
$s = "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>";
// Replace everything outside the comments with an empty string.
echo preg_replace('/(?s)((?!-->).)*<!--|-->((?!<!--).)*/', '', $s);
?>

which also prints:

<b>Test</b><b>Test2</b>

As can be seen on Ideone: http://ideone.com/Bm2uJ

Another possibility is this:

.*?<!--(.*?)-->.*?(?=<!--|$)

and replace with

$1

See it on Regexr.

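If you read the string line by line, this matches everything up to the first comment, puts the content of that comment into group 1, and then matches everything up to the end of the line or the start of the next comment.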
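A minimal JavaScript sketch of that line-by-line use follows. The function name keepCommentsPerLine and the decision to join the lines without a separator are assumptions made for illustration, not part of the answer.

```javascript
// Pattern from the answer: group 1 captures the comment body.
const commentPattern = /.*?<!--(.*?)-->.*?(?=<!--|$)/g;

// Hypothetical helper: apply the replacement to each line separately,
// then join the results without a separator.
function keepCommentsPerLine(text) {
  return text
    .split("\n")
    .map((line) => line.replace(commentPattern, "$1"))
    .join("");
}

const twoLines =
  "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>";
console.log(keepCommentsPerLine(twoLines));
// prints: <b>Test</b><b>Test2</b>
```

Note that a line containing no comment at all does not match the pattern and is therefore left untouched, so this approach fits input where every line holds at least one comment.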

s/-->.*?<!--//g strips off anything between "-->" and the next "<!--"
s/^.*?<!--// strips off from the beginning to the first occurrence of "<!--"
s/-->.*?$// strips off from the last occurrence of "-->" to the end

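.* matches any number of characters, while .*? matches as few characters as possible such that the whole pattern still matches.

^ stands for the beginning of the string, $ for the end.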
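The three substitutions can be sketched in JavaScript as chained replace calls, taking "<!--" as the opening delimiter. The function name stripInThreeSteps is made up for illustration, and [\s\S] is used instead of the dot so that a match can span line breaks, which is an assumption about how the input is supplied.

```javascript
// Hypothetical helper that applies the three substitutions in order.
function stripInThreeSteps(html) {
  return html
    // 1. Drop everything between a "-->" and the next "<!--".
    .replace(/-->[\s\S]*?<!--/g, "")
    // 2. Drop everything from the start up to the first "<!--".
    .replace(/^[\s\S]*?<!--/, "")
    // 3. Drop everything from the remaining "-->" to the end.
    .replace(/-->[\s\S]*?$/, "");
}

const sample =
  "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>";
console.log(stripInThreeSteps(sample));
// prints: <b>Test</b><b>Test2</b>
```

The order matters: after step 1 only one "-->" is left, which is why step 3 can safely remove from that point to the end of the string.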