Remove closing paragraph tag after email address with regex

I am using

public function __construct()
{
    $this->EE =& get_instance();
    $regex = '/(\S+@\S+\.\S+)/';
    $replace = '<a href="mailto:$1">$1</a>';

    $this->return_data = preg_replace($regex, $replace, ee()->TMPL->tagdata);
}

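to look for plain-text email addresses and change them into mailto links. However, the WYSIWYG editor puts the closing paragraph tag right after the address, so the regex captures the closing tag and puts it inside the mailto link. I need my regex to exclude anything that comes after .com, .net, or whatever the domain ending is. How would I do that?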

Right now it is returning mailto:email@domain.com</p>, and I need to exclude any and all tags after .com. Here is a portion of the dump showing what is being output:

<br />
Preston Newbill<br />
Manager<br />
pnewbill@domain.com</p>
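
The behaviour described above can be reproduced in isolation. This is a minimal sketch, assuming a one-line sample string modeled on the dump (the `$input` value is made up for illustration); it shows that `\S` matches any non-whitespace character, so the closing tag ends up inside the capture group.

```php
<?php
// The pattern from the question: \S matches any non-whitespace character,
// which includes '<', '/', 'p' and '>'.
$regex   = '/(\S+@\S+\.\S+)/';
$replace = '<a href="mailto:$1">$1</a>';

// Hypothetical sample modeled on the last line of the dump above.
$input  = 'pnewbill@domain.com</p>';
$output = preg_replace($regex, $replace, $input);

echo $output, PHP_EOL;
// <a href="mailto:pnewbill@domain.com</p>">pnewbill@domain.com</p></a>
```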

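A very basic regex that grabs the email address without matching any HTML tags:

[\w\.]+@[\w\.\-]+

Explanation:

  • \w: stands for "word character", usually [A-Za-z0-9_]. Note that it includes the underscore and digits
  • \.: an escaped (literal) dot
  • [\w\.]+: matches one or more word characters or dots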

Unfortunately, this does not match every possible email address. See this question for more details.
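
As a rough illustration of the basic pattern in use, the sketch below plugs it into `preg_replace`. The sample `$input` is an assumption modeled on the dump in the question, and `$0` refers to the whole match because the pattern has no capture group.

```php
<?php
// Basic pattern: word characters and dots, '@', then word characters,
// dots and hyphens. '<' and '>' are not in either class, so tags are excluded.
$regex = '/[\w\.]+@[\w\.\-]+/';

// Hypothetical sample modeled on the dump in the question.
$input  = "Manager<br />\npnewbill@domain.com</p>";
$output = preg_replace($regex, '<a href="mailto:$0">$0</a>', $input);

echo $output, PHP_EOL;
// Manager<br />
// <a href="mailto:pnewbill@domain.com">pnewbill@domain.com</a></p>
```

Because the dot is part of the second character class, a sentence-ending period directly after an address would be captured as well.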

A fully RFC-822 compliant regex (source) would be:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
 \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
 \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
 \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)
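
A pattern of that size is hard to maintain by hand. One possible alternative, sketched here and not part of the original answer, is to find candidates with a loose pattern and let PHP's built-in `filter_var` with `FILTER_VALIDATE_EMAIL` decide whether each one is plausible. The helper name `linkValidEmails` and the sample strings are assumptions made for illustration.

```php
<?php
// Hypothetical helper: link only the candidates that PHP's validator accepts.
function linkValidEmails(string $text): string
{
    return preg_replace_callback(
        '/[\w\.\-]+@[\w\.\-]+/',               // loose candidate pattern
        function (array $m): string {
            $candidate = $m[0];
            // filter_var returns false when the candidate is not a valid address.
            if (filter_var($candidate, FILTER_VALIDATE_EMAIL) === false) {
                return $candidate;              // leave the text untouched
            }
            return '<a href="mailto:' . $candidate . '">' . $candidate . '</a>';
        },
        $text
    );
}

$linked    = linkValidEmails('pnewbill@domain.com</p>');
$untouched = linkValidEmails('Party 9/15/2013@10:00pm!');

echo $linked, PHP_EOL;
// <a href="mailto:pnewbill@domain.com">pnewbill@domain.com</a></p>
echo $untouched, PHP_EOL;
// Party 9/15/2013@10:00pm!
```

The second sample stays unchanged because the candidate `2013@10` has a dotless domain, which the PHP manual lists as unsupported by this filter.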

You can try changing your regex to the following:

/(\S+@\S+\.[^\<]+)/

This will stop capturing when it hits the first < after the top-level domain.
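
A quick check of that pattern, as a sketch with made-up sample strings: the first call shows the capture stopping at the closing tag, and the second shows a limitation, since `[^\<]+` also accepts spaces and will keep any text that sits between the address and the next tag.

```php
<?php
$regex   = '/(\S+@\S+\.[^\<]+)/';
$replace = '<a href="mailto:$1">$1</a>';

// Hypothetical samples for illustration.
$clean  = preg_replace($regex, $replace, 'pnewbill@domain.com</p>');
$greedy = preg_replace($regex, $replace, 'pnewbill@domain.com or call me</p>');

echo $clean, PHP_EOL;
// <a href="mailto:pnewbill@domain.com">pnewbill@domain.com</a></p>
echo $greedy, PHP_EOL;
// <a href="mailto:pnewbill@domain.com or call me">pnewbill@domain.com or call me</a></p>
```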

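@ukliviu suggests a stricter approach that will produce fewer false positives than simply excluding the HTML tags.

Broadly speaking, trying to mix HTML markup with regex is a bad idea. Your results will vary, and by too much for a reliable script. If you need to parse HTML, use the HTML parser available in PHP, DomDocument.

Getting rid of the HTML is even simpler. You can use strip_tags to remove any and all HTML from a string, even broken markup. Your code could simply be: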

$this->return_data = strip_tags(ee()->TMPL->tagdata);

Proof of concept:

$sample1 = 'mailto:email@domain.com</p>';
echo 'dirty: '.htmlentities($sample1).', clean: '.htmlentities(strip_tags($sample1));
// output: dirty: mailto:email@domain.com</p>, clean: mailto:email@domain.com 

See it in action here: http://codepad.viper-7.com/KHsIr0

One function call, and no crazy regex to maintain.


Here is an example of how to do this with DomDocument:

// create a new DomDocument object
$doc = new DOMDocument();
// load the HTML into the DomDocument object (this would be your source HTML)
libxml_use_internal_errors(true);
$doc->loadHTML('
    <p>
        <br>
        Preston Newbill<br>
        Manager<br>
        pnewbill@domain.com<br>
        <a href="mailto:noob@aol.com">also email me @ noob@aol.com</a><br>
        Party 9/15/2013@10:00pm!
');
libxml_clear_errors();
// grab the body, recursively check for child nodes. Turn any email addresses into links
$body = $doc->getElementsByTagName('body')->item(0);
checkDomNodeForEmailAddress($body);
// strip off the html,head, and body
$doc->removeChild($doc->firstChild);            
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);
die('<hr>final product:'.htmlentities($doc->saveHtml()));
function checkDomNodeForEmailAddress(DOMNode $domNode) {
    foreach ($domNode->childNodes as $node) {
        if($node->hasChildNodes()) {
            if (strtolower($node->nodeName) != 'a')
                checkDomNodeForEmailAddress($node);
        } else {
            $node->nodeValue = preg_replace('/(\S+@\S+\.[^\<]+)/', '<a href="mailto:$1">$1</a>', $node->nodeValue);
        }
    }    
}

Try it here: http://codepad.viper-7.com/EpdBKx
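
One caveat with the example above: assigning markup to nodeValue stores it as text, so, as far as I can tell, the serializer escapes the angle brackets and no real link is produced. The following is a sketch of a variation, not the original answer's code. It builds actual <a> elements with `createElement` and collects the text nodes with `DOMXPath` before changing the tree. The helper name `linkEmailsInTextNode`, the stricter pattern, and the shortened sample markup are assumptions made for illustration.

```php
<?php
// Hypothetical helper: replace e-mail addresses inside one text node with
// real <a> elements, keeping the surrounding text as text nodes.
function linkEmailsInTextNode(DOMText $text): void
{
    // With PREG_SPLIT_DELIM_CAPTURE the captured addresses land on odd indexes.
    $parts = preg_split('/([\w\.\-]+@[\w\.\-]+\.\w+)/', $text->data, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false || count($parts) < 2) {
        return; // no address in this node
    }
    $doc      = $text->ownerDocument;
    $fragment = $doc->createDocumentFragment();
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            $a = $doc->createElement('a');
            $a->setAttribute('href', 'mailto:' . $part);
            $a->appendChild($doc->createTextNode($part));
            $fragment->appendChild($a);
        } elseif ($part !== '') {
            $fragment->appendChild($doc->createTextNode($part));
        }
    }
    $text->parentNode->replaceChild($fragment, $text);
}

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('<p>Manager<br>pnewbill@domain.com<br><a href="mailto:noob@aol.com">noob@aol.com</a></p>');
libxml_clear_errors();

// Collect the text nodes first (skipping any already inside a link),
// so the tree is not modified while it is being traversed.
$xpath = new DOMXPath($doc);
$nodes = iterator_to_array($xpath->query('//text()[not(ancestor::a)]'));
foreach ($nodes as $node) {
    linkEmailsInTextNode($node);
}

// Serialize only the paragraph, without the html and body wrappers.
$html = $doc->saveHTML($doc->getElementsByTagName('p')->item(0));
echo $html, PHP_EOL;
```

The existing link is left alone because the XPath query skips text that already has an a element as an ancestor.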

Documentation

  • strip_tags - http://php.net/manual/en/function.strip-tags.php
  • DomDocument - http://php.net/manual/en/class.domdocument.php