How to fetch all the URLs which are not linked using regex

I need to fetch all the unlinked URLs (URLs that are not wrapped in an anchor tag) from a given string.

I know this regex, which fetches all the URLs from a given string:

(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?

Input:

<div class='test'>
<p>Heading</p>
<a href='http://www.google.com'>google</a>
www.yahoo.com
http://www.rediff.com
<a href='http://www.overflow.com'>www.overflow.com</a> 
</div>

Output:

www.yahoo.com
http://www.rediff.com

Any advice would be appreciated.

Use a library to parse the HTML into a DOM tree and get all the links from it. For example, you can use Simple HTML DOM: http://simplehtmldom.sourceforge.net/

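// Load the library (adjust the path to wherever simple_html_dom.php lives)
include 'simple_html_dom.php';
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all links and print the target of each one
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}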
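
The loop above lists the URLs that are linked. To get the ones that are not, one possible approach is to drop every anchor element from the tree and search the remaining text. Below is a minimal sketch of that idea, assuming PHP's built-in DOMDocument rather than Simple HTML DOM. The function name find_unlinked_urls and the URL pattern (which also accepts a bare "www." host, as in the sample input) are illustrative assumptions, not part of either library.

```php
<?php
// Sketch: collect URLs that appear as plain text, i.e. outside any <a> element.
// find_unlinked_urls() is a hypothetical helper name.
function find_unlinked_urls(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings about imperfect markup instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Copy the node list into an array first: removing nodes from a live
    // DOMNodeList while iterating over it would skip elements.
    $anchors = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($anchors as $anchor) {
        $anchor->parentNode->removeChild($anchor);
    }

    // What remains is text that was never inside an anchor.
    $text = $doc->documentElement->textContent;

    // Assumed pattern: a scheme-prefixed URL or a bare "www." host.
    $pattern = '!(?:(?:https?|ftp)://|www\.)[\w-]+(?:\.[\w-]+)+'
             . '(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?!i';
    preg_match_all($pattern, $text, $matches);

    return $matches[0];
}

// Usage with the sample input from the question.
$sample = "<div class='test'>
<p>Heading</p>
<a href='http://www.google.com'>google</a>
www.yahoo.com
http://www.rediff.com
<a href='http://www.overflow.com'>www.overflow.com</a>
</div>";

print_r(find_unlinked_urls($sample));
// Contains: www.yahoo.com and http://www.rediff.com
```

Removing the anchors before matching also discards link text such as www.overflow.com, which a regex applied to the raw string would otherwise pick up.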

More simply, this regex will capture the href values:

href='(.+?)'
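
As a rough illustration, the pattern can be applied with preg_match_all. The helper name extract_hrefs is hypothetical. Note that the pattern as written only handles single-quoted attributes, and that it returns the linked URLs, so it could at most serve as a list of URLs to exclude.

```php
<?php
// Sketch: return the href values of all single-quoted href attributes.
// extract_hrefs() is a hypothetical helper name.
function extract_hrefs(string $html): array
{
    // Group 1 lazily captures everything between the quotes.
    preg_match_all("/href='(.+?)'/", $html, $matches);
    return $matches[1];
}

$html = "<a href='http://www.google.com'>google</a>\n"
      . "www.yahoo.com\n"
      . "<a href='http://www.overflow.com'>www.overflow.com</a>";

print_r(extract_hrefs($html));
// Contains: http://www.google.com and http://www.overflow.com
```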