Regex to replace webpage meta description apostrophe using preg_match

I have this data:

<meta name="description" content="Access Kenya is Kenya's leading corporate Internet service provider and is a technology solutions provider in Kenya with IT and network solutions for your business.Welcome to the Yellow Network.Kenya's leading Corporate and Residential ISP" />;

I am using this regex:

<meta +name *=['"']?description['"']? *content=['"']?([^<>''"]+)['"']?

It gets the webpage description and works fine, except that the match stops wherever there is an apostrophe.

How can I escape that?

Your regular expression considers the following three forms of the <meta> node:

<meta name="description" content="Some Content" />
<meta name='description' content='Some Content' />
<meta name=description content=Some Content />

The third form is not valid HTML, but anything can happen, so... you are right to allow for it.
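Because the pattern allows for single-quoted attributes, its capture group [^<>''"]+ excludes both quote characters, so an apostrophe inside a double-quoted value ends the capture early. A minimal sketch reproducing this with a shortened version of the question's data (the variable names are illustrative, and ~ is assumed as the delimiter):

```php
<?php
// Original pattern from the question, kept verbatim inside a nowdoc so the
// quote characters need no escaping.
$original = <<<'REGEX'
~<meta +name *=['"']?description['"']? *content=['"']?([^<>''"]+)['"']?~
REGEX;

$html = '<meta name="description" content="Kenya\'s leading Corporate and Residential ISP" />';

// The capture group stops at the first ' or ", so only the text before the
// apostrophe ends up in $m[1].
preg_match($original, $html, $m);
var_dump($m[1]); // string(5) "Kenya"
```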

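The simple fix is to change the end of your original regex so that it matches the closing of the tag, and to capture the content with the non-greedy ? operator: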

<meta +name *=['"']?description['"']? *content=['"']?(.*?)['"']? */?>
                                                      └─┘       └───┘
          search zero-or-more characters except following       closing tag characters

regex101 demo
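As a rough check, the revised pattern can be run with preg_match against the three forms listed above, using a shortened version of the question's data for the double-quoted case. This is only a sketch: the ~ delimiter and the variable names are assumptions, not part of the original answer.

```php
<?php
// Revised pattern from the answer, kept verbatim inside a nowdoc; '~' is used
// as the delimiter because the pattern itself contains '/'.
$pattern = <<<'REGEX'
~<meta +name *=['"']?description['"']? *content=['"']?(.*?)['"']? */?>~
REGEX;

$samples = [
    '<meta name="description" content="Kenya\'s leading Corporate and Residential ISP" />',
    "<meta name='description' content='Some Content' />",
    '<meta name=description content=Some Content />',
];

$results = [];
foreach ($samples as $html) {
    // preg_match() returns 1 on a match and stores the captured content in $m[1]
    $results[] = preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}
print_r($results);
```

The lazy group grows one character at a time until the rest of the pattern (optional quote, optional spaces, optional slash, then >) can match, which is why the apostrophe no longer ends the capture.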

But what happens if you have this meta tag, with the attributes in the opposite order?

<meta content="Some Content" name="description" />

Your regex will fail.
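The failure is easy to confirm: the pattern requires name to come directly after <meta, so a tag that starts with content never matches. A short sketch, again assuming ~ as the delimiter and illustrative variable names:

```php
<?php
// Same revised pattern as above, kept verbatim inside a nowdoc.
$pattern = <<<'REGEX'
~<meta +name *=['"']?description['"']? *content=['"']?(.*?)['"']? */?>~
REGEX;

$html = '<meta content="Some Content" name="description" />';

// preg_match() returns 0 when the pattern does not match anywhere in the subject.
$matched = preg_match($pattern, $html, $m);
var_dump($matched); // int(0)
```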

To really match HTML nodes, you have to use a parser:

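$dom = new DOMDocument();
// Collect libxml parse warnings internally instead of emitting them
libxml_use_internal_errors(true);
$dom->loadHTML( $yourHtmlString );
$xpath = new DOMXPath( $dom );
// Select the content attribute of every <meta name="description"> node
$description = $xpath->query( '//meta[@name="description"]/@content' );
echo $description->item(0)->nodeValue;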

This will output:

Some Content

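Yes, it is more code than a one-line regex, but with this method you will match any <meta name="description">, whatever the attribute order, and even if it contains a third, invalid attribute.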
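The parser approach could be wrapped in a small function, for instance. In the sketch below, extractMetaDescription is a hypothetical helper name, not something from the original answer, and the sample tags (including the extra lang attribute) are made up for illustration:

```php
<?php
// Hypothetical helper: returns the content of the first
// <meta name="description"> node, or null if none is found.
function extractMetaDescription(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }
    // Collect parse warnings internally, then restore the previous setting
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//meta[@name="description"]/@content');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->nodeValue;
}

$samples = [
    '<meta name="description" content="Kenya\'s leading ISP" />',  // apostrophe
    '<meta content="Some Content" name="description" />',          // reversed order
    '<meta name="description" lang="en" content="Some Content">',  // extra attribute
];
foreach ($samples as $html) {
    var_dump(extractMetaDescription($html));
}
```

Returning null when nothing is found avoids calling item(0)->nodeValue on an empty result, which is worth guarding against when the input pages are not under your control.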

Read more about DOMDocument
Read more about DOMXPath
Read why you can't parse [X]HTML with regex