Strip HTML tags from within the title and alt attributes of an image tag

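In some of our articles, images have links mistakenly hard-coded into the title/alt attributes of the image tag, which breaks the display of the image. For example: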

<img src="/imgs/my-image.jpg" title="This is a picture of a <a href="/blob.html">blob</a>." />

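I have tried using preg_replace_callback, but because the link's own quotes collide with the quotes that delimit the attribute, it is hard to match the complete title.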

I'd like to be able to do this programmatically on any string to ensure correct output. Ideas?

You can try this pattern:

$pattern = <<<'EOD'
~
(?:
    \G(?!\A)                 # second entry point
    (?:                        # content up to the next alt/title attribute (optional)
        [^><"]* "                 # end of the previous attribute
        (?> [^><"]* " [^"]* " )*? # other attributes (optional)
        [^><"]*                   # spaces or attributes without values (optional)
        \b(?:alt|title)\s*=\s*"   # the next alt/title attribute
    )?+                        # make all the group optional
  |
    <img\s[^>]*?             # first entry point
    \b(?:alt|title)\s*=\s*"
)
[^<"]*+\K
(?:              # two possibilities:
    </?a[^>]*>     # an "a" tag (opening or closing)
  |                # OR
    (?=")          # followed by the closing quote
)
~x
EOD;
$result = preg_replace($pattern, '', $html);


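This pattern relies on the contiguity of successive matches, enforced by the \G anchor: \G matches only at the position where the previous match ended, and (?!\A) rules out the start of the string. The \K then discards everything matched so far, so only the "a" tag itself is replaced.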
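To see the \G mechanism in isolation, here is a minimal sketch on a much simpler problem: replacing commas only when they sit inside parentheses. The sample string and the variable names ($text, $demo) are made up for illustration and are not part of the answer's pattern. The structure is the same, though: one branch opens a chain of matches, and the \G branch lets each following match continue from where the previous one stopped.

```php
<?php
// Hypothetical sample input: only the commas between the parentheses should change.
$text = 'a,b (c,d,e) f,g';

$demo = '~
(?:
    \G(?!\A)   # continue exactly where the previous match ended
  |
    \(         # or start a new chain at an opening parenthesis
)
[^,)]*+ \K     # skip text inside the parentheses, then reset the match start
,              # the comma to replace
~x';

$result = preg_replace($demo, ';', $text);
echo $result, PHP_EOL; // a,b (c;d;e) f,g
```

Once the scan reaches the closing parenthesis, the \G branch can no longer reach a comma, so the chain breaks and the commas outside the parentheses are left alone. The image-tag pattern works the same way, with the alt/title attribute playing the role of the parentheses.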