PHP regex match specific URL and strip others

I wrote this function to convert every URL on one specific site (mywebsite.com) into a link and to replace all other URLs with @@@spam@@@.

function get_global_convert_all_urls($content) {
  $content = strtolower($content);
  $replace = "/(?:http|https)?(?:\:\/\/)?(?:www.)?(([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.[A-Za-z]+)(?:\/.*)?/im";
  preg_match_all($replace, $content, $search);
  $total = count($search[0]);
  for($i=0; $i < $total; $i++) {
    $url = $search[0][$i];
    if(preg_match('/mywebsite.com/i', $url)) {
      $content = str_replace($url, '<a href="'.$url.'">'.$url.'</a>', $content);            
    } else {
      $content = str_replace($url, '@@@spam@@@', $content); 
    }
  } 
  return $content;
}

The one problem I can't solve: when there are two URLs on the same line, the regex match does not stop at the space between them.

$content = "http://www.mywebsite.com/index.html http://www.others.com/index.html";

Result:

<a href="http://www.mywebsite.com/index.html http://www.others.com/index.html">http://www.mywebsite.com/index.html http://www.others.com/index.html</a>

How can I get the following result instead?

<a href="http://www.mywebsite.com/index.html">http://www.mywebsite.com/index.html</a> @@@spam@@@   

I tried adding (\s|$) to the end of the regex, but it didn't work:

/(?:http|https)?(?:\:\/\/)?(?:www.)?(([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.[A-Za-z]+)(?:\/.*)?(\s|$)/im

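Edited to reflect the changes in the question.

The problem is the .* at the end of the regex, so my suggestion is to replace it with a more precise expression. I put this together quickly, so you will want to run some tests to verify it for your case.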

$matches = null;
$returnValue = preg_match_all('!(?:http|https)?(?:\\:\\/\\/)?(?:www.)?(([A-Za-z0-9-]+\\.)*[A-Za-z0-9-]+\\.[A-Za-z]+)(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\\-\\._\\?\\,\\\'/\\\\\\+&%\\$#\\=~])*[^\\.\\,\\)\\(]!', 'mywebsite.com/index.html others.com/index.html', $matches);

Result:

array (
  0 => 
  array (
    0 => 'mywebsite.com/index.html ',
    1 => 'others.com/index.html',
  ),
  1 => 
  array (
    0 => 'mywebsite.com',
    1 => 'others.com',
  ),
  2 => 
  array (
    0 => '',
    1 => '',
  ),
  3 => 
  array (
    0 => '',
    1 => '',
  ),
  4 => 
  array (
    0 => 'l',
    1 => 'm',
  ),
)
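
Note that the first full match in the result above ends with a space: the final [^\.\,\)\(] accepts any character that is not one of those four, and a space qualifies. A minimal sketch of one way to handle that, assuming it is acceptable to trim trailing whitespace from each match before using it (the $trimmed variable is only for illustration):

```php
<?php
// Same pattern and sample string as in the answer above.
$pattern = '!(?:http|https)?(?:\\:\\/\\/)?(?:www.)?(([A-Za-z0-9-]+\\.)*[A-Za-z0-9-]+\\.[A-Za-z]+)(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\\-\\._\\?\\,\\\'/\\\\\\+&%\\$#\\=~])*[^\\.\\,\\)\\(]!';
$subject = 'mywebsite.com/index.html others.com/index.html';

preg_match_all($pattern, $subject, $matches);

// The last element of the pattern can swallow the separating space,
// so strip trailing whitespace from every full match.
$trimmed = array_map('rtrim', $matches[0]);

print_r($trimmed);
```

If you later replace the matches inside the original text, replace the trimmed value rather than the raw match, otherwise the space between the two URLs disappears.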

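Change the last element of the regex, (?:\/.*)?, to \S*

The .* in your regex matches every character up to the end of the line, including spaces, whereas \S* matches only non-whitespace characters.

You can also simplify the whole regex to: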
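
To see the difference, here is a small sketch that runs both endings against the sample string from the question. The variable names are only for illustration; the patterns are the question's regex with and without the suggested change.

```php
<?php
// Sample input taken from the question.
$content = "http://www.mywebsite.com/index.html http://www.others.com/index.html";

// Original ending: (?:\/.*)? lets .* run on to the end of the line.
$greedy = '/(?:http|https)?(?:\:\/\/)?(?:www.)?(([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.[A-Za-z]+)(?:\/.*)?/im';

// Suggested ending: \S* stops at the first whitespace character.
$bounded = '/(?:http|https)?(?:\:\/\/)?(?:www.)?(([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.[A-Za-z]+)\S*/im';

preg_match_all($greedy, $content, $greedyMatches);
preg_match_all($bounded, $content, $boundedMatches);

print_r($greedyMatches[0]);   // a single match covering both URLs
print_r($boundedMatches[0]);  // two matches, one per URL
```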

$replace = "~(?:https?://)?(?:www\.)?(([A-Z0-9-]+\.)*[A-Z0-9-]+\.[A-Z]+)\S*~im";
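
As a rough sketch of how the simplified pattern could be used, the version below swaps the preg_match_all plus str_replace loop for preg_replace_callback, so each URL is replaced where it was found and a URL that appears twice is not rewritten again inside an already generated link. The function name convert_urls_sketch is hypothetical, and two things differ from the question's function and are assumptions on my part: only the exact host mywebsite.com (with or without www.) is treated as yours, and the text is not lowercased.

```php
<?php
// Hypothetical helper: a sketch, not the original poster's function.
function convert_urls_sketch(string $content): string
{
    // Simplified pattern from the answer above.
    $pattern = '~(?:https?://)?(?:www\.)?(([A-Z0-9-]+\.)*[A-Z0-9-]+\.[A-Z]+)\S*~im';

    $result = preg_replace_callback($pattern, function (array $m): string {
        // $m[0] is the full URL; $m[1] is the host without a leading "www.".
        // Assumption: only the exact host mywebsite.com is kept as a link.
        if (strtolower($m[1]) === 'mywebsite.com') {
            // Escape the URL before placing it inside HTML.
            $url = htmlspecialchars($m[0], ENT_QUOTES);
            return '<a href="' . $url . '">' . $url . '</a>';
        }
        return '@@@spam@@@';
    }, $content);

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $content;
}

echo convert_urls_sketch("http://www.mywebsite.com/index.html http://www.others.com/index.html");
```

Comparing the captured host instead of running preg_match('/mywebsite.com/i', $url) also avoids accepting URLs that merely contain that text somewhere, such as in a path or in a longer domain name.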

Change the regexp pattern so that it captures the last part of the URL (/index.html, /index.php):

/(?:http|https)?(?:\:\/\/)?(?:www.)?(([A-Za-z0-9-]+?\.)?[A-Za-z0-9-]+?\.?[A-Za-z]*?(\/\w+?\.\w+?)?)\b/im

Then change the body of your function as shown below:

$content = "http://www.mywebsite.com/index.html http://www.others.com/index.html";
function get_global_convert_all_urls($content) {
  $content = strtolower($content);
  $replace = "/(?:http|https)?(?:\:\/\/)?(?:www.)?(([A-Za-z0-9-]+?\.)?[A-Za-z0-9-]+?\.?[A-Za-z]*?(\/\w+?\.\w+?)?)\b/im";
  preg_match_all($replace, $content, $search);
  foreach ($search[0] as $url) {
    if(preg_match('/mywebsite.com/i', $url)) {
      $content = str_replace($url, '<a href="'.$url.'">'.$url.'</a>', $content);         
    } else {
      $content = str_replace($url, '@@@spam@@@', $content); 
    }
  } 
  return $content;
}
var_dump(get_global_convert_all_urls($content)); 

Output:

string '<a href="http://www.mywebsite.com/index.html">http://www.mywebsite.com/index.html</a> @@@spam@@@'