Making a URL regex global

I've been searching for a regex to replace plain-text URLs in a string (the string can contain more than one URL) with:

 <a href="url">url</a>

I found: http://mathiasbynens.be/demo/url-regex

I'd like to use diegoperini's regex (which, according to the tests, is the best):

_^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$_iuS

But I want to make it global, so that it replaces every URL in a string. When I use this:

/_(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?_iuS/g

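it doesn't work. How can I make this regex global, and what do the underscore at the start and the "_iuS" at the end mean?

I'd like to use it with PHP, so I'm using: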

preg_replace($regex, '<a href="$0">$0</a>', $examplestring);

The underscores are the regex delimiters, and i, u and S are pattern modifiers:

i (PCRE_CASELESS)

If this modifier is set, letters in the pattern match both upper and lower 
case letters.

u (PCRE_UTF8)

This modifier turns on additional functionality of PCRE that is incompatible 
with Perl. Pattern and subject strings are treated as UTF-8.

(Not to be confused with uppercase U (PCRE_UNGREEDY), which inverts the 
"greediness" of the quantifiers. The pattern above uses lowercase u.)

S

When a pattern is going to be used several times, it is worth spending more 
time analyzing it in order to speed up the time taken for matching. If this 
modifier is set, then this extra analysis is performed. At present, studying 
a pattern is useful only for non-anchored patterns that do not have a single 
fixed starting character.

For more information, see http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

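When you added /.../g, you wrapped the pattern in a second pair of delimiters and added the modifier g, which does not exist in PCRE. That is why it doesn't work. preg_replace() already replaces every match by default, so once the ^ and $ anchors are removed, the pattern with its original _ delimiters and iuS modifiers is all you need.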
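A minimal sketch of that point, assuming a deliberately simplified stand-in pattern (not diegoperini's full regex) and a made-up sample string. It only illustrates that preg_replace() handles every match without any g modifier:

```php
<?php
// Simplified stand-in pattern, for illustration only; the full pattern is used the same way.
// "_" is the delimiter, "i" and "u" are modifiers. There are no ^/$ anchors,
// so the pattern can match anywhere inside a longer string.
$regex = '_(?:https?|ftp)://[^\s]+_iu';

// Hypothetical sample input containing two URLs.
$examplestring = 'See http://example.com/a and https://example.org/b for details';

// preg_replace() replaces all matches by default (its $limit parameter defaults to -1).
$linked = preg_replace($regex, '<a href="$0">$0</a>', $examplestring);

echo $linked, "\n";
```

Both URLs in the sample string end up wrapped in their own anchor tag.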

I agree with @verdesmarad, and I used this pattern in the following function:

$string = preg_replace_callback(
        // Single-quoted so PHP passes the backslashes through to PCRE untouched.
        '_(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?_iuS',
        // A closure replaces create_function(), which was removed in PHP 8.0.
        function ($match) {
            // Build a short display label: lowercase, no scheme, no "www.".
            $m = trim(strtolower($match[0]));
            $m = str_replace("http://", "", $m);
            $m = str_replace("https://", "", $m);
            $m = str_replace("ftp://", "", $m);
            $m = str_replace("www.", "", $m);
            // Truncate long labels to 25 characters.
            if (strlen($m) > 25)
            {
                $m = substr($m, 0, 25) . "...";
            }
            // Link to the URL exactly as matched, but show the short label.
            return "<a href=\"$match[0]\">$m</a>";
        }, $string);
    return $string;

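It seems to do the trick, and it solved a problem I was having. As @verdesmarad said, removing the ^ and $ characters lets the pattern work, even in my preg_replace_callback().

The only thing that worries me is how efficient this pattern is. If it is used in a busy, high-traffic web application, could it become a bottleneck?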
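One way to get a feel for the cost is to time the replacement on representative input. The sketch below is only that: the pattern is a simplified stand-in and the text is synthetic, so both are assumptions to be swapped for the real pattern and real data before drawing any conclusion.

```php
<?php
// Hypothetical stand-in pattern and synthetic input; substitute the real ones.
$regex = '_(?:https?|ftp)://[^\s]+_iu';
$text  = str_repeat('Lorem ipsum http://example.com/some/path dolor sit amet. ', 1000);

// Time a single preg_replace() call; $count receives the number of replacements made.
$start     = microtime(true);
$result    = preg_replace($regex, '<a href="$0">$0</a>', $text, -1, $count);
$elapsedMs = (microtime(true) - $start) * 1000;

printf("%d replacements in %.3f ms\n", $count, $elapsedMs);
```

A single run is noisy, so repeating the measurement several times with realistic text lengths gives a more useful picture.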

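Update

The regex pattern above breaks if there is a trailing dot at the end of the path part of the URL, as in http://www.mydomain.com/page. (with the final dot). To fix this, I modified the last part of the pattern by adding ^. so that it looks like [^\s^.]. As I read it: do not match a trailing whitespace or dot.

So far, in my tests, it seems to work well.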
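One caveat worth checking: inside a character class the second ^ is a literal caret, and [^\s^.] excludes a dot anywhere in the path, not only at the end. The sketch below compares the modification with a possible alternative, a negative lookbehind that only forbids the path from ending in a dot. The scheme and host part ($prefix), the helper firstUrl() and the sample strings are simplified assumptions for illustration, not part of the original pattern.

```php
<?php
// Hypothetical simplified scheme + host; only the path part mirrors the pattern discussed above.
$prefix = '(?:https?|ftp)://[a-z0-9.-]+';

$original   = '_' . $prefix . '(?:/[^\s]*)?_i';         // path as in the original pattern
$charClass  = '_' . $prefix . '(?:/[^\s^.]*)?_i';       // the modification described above
$lookbehind = '_' . $prefix . '(?:/[^\s]*(?<!\.))?_i';  // alternative: path must not end in a dot

// Return the first match of $regex in $text, or null if there is none.
function firstUrl($regex, $text)
{
    return preg_match($regex, $text, $m) ? $m[0] : null;
}

$trailing = 'Visit http://www.mydomain.com/page. Thanks';
$dotted   = 'Visit http://www.mydomain.com/page.html today';

var_dump(firstUrl($original, $trailing));    // keeps the trailing dot
var_dump(firstUrl($charClass, $trailing));   // trailing dot dropped
var_dump(firstUrl($charClass, $dotted));     // but stops at the dot before "html"
var_dump(firstUrl($lookbehind, $trailing));  // trailing dot dropped
var_dump(firstUrl($lookbehind, $dotted));    // ".html" kept
```

With the lookbehind variant the regex engine backtracks over trailing dots only, so paths such as /page.html stay intact.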