I use this loop to count all the links (a href) on a page:
foreach($html->find('a[href!="#"]') as $ahref) {
$ahrefs++;
}
I'd like to do something like this:
foreach($html->find('a[href!="#"]') as $ahref) {
if(isexternal($ahref->href)) { // pass the href string, not the element object
$external++;
}
$ahrefs++;
}
where isexternal is a function:
function isexternal($url) {
// FOO...
// Test if link is internal/external
if(/*condition is true*/) {
return true;
}
else {
return false;
}
}
Help!
Use parse_url and compare the host with your local host (usually, but not always, the same as $_SERVER['HTTP_HOST']):
function isexternal($url) {
$components = parse_url($url);
return !empty($components['host']) && strcasecmp($components['host'], 'example.com'); // empty host will indicate url like '/relative.php'
}
However, this version treats www.example.com and example.com as different hosts. If you want every subdomain to count as a local link, the function gets a bit bigger:
function isexternal($url) {
$components = parse_url($url);
if ( empty($components['host']) ) return false; // we will treat url like '/relative.php' as relative
if ( strcasecmp($components['host'], 'example.com') === 0 ) return false; // url host looks exactly like the local host
return strrpos(strtolower($components['host']), '.example.com') !== strlen($components['host']) - strlen('.example.com'); // check if the url host is a subdomain
}
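For reference, here is a minimal sketch of the same subdomain-aware idea with the local host passed in as a parameter instead of hard-coded. The name is_external_host and the $localHost parameter are my own, not part of the answer above; it assumes a PHP version where parse_url() handles protocol-relative URLs.

```php
<?php
// Hypothetical helper: returns true when $url points at a host other than
// $localHost or one of its subdomains.
function is_external_host($url, $localHost) {
    // PHP_URL_HOST yields null when there is no host, false for malformed URLs
    $host = parse_url($url, PHP_URL_HOST);
    if (empty($host)) {
        return false; // relative URLs such as '/relative.php' count as internal
    }
    $host = strtolower($host);
    $localHost = strtolower($localHost);
    if ($host === $localHost) {
        return false; // exactly the local host
    }
    // External unless the host ends with ".<localHost>" (i.e. is a subdomain)
    $suffix = '.' . $localHost;
    return substr($host, -strlen($suffix)) !== $suffix;
}

var_dump(is_external_host('/relative.php', 'example.com'));             // bool(false)
var_dump(is_external_host('https://www.example.com/a', 'example.com')); // bool(false)
var_dump(is_external_host('https://notexample.com/', 'example.com'));   // bool(true)
```

In the loop from the question this would be called with the link's href string, e.g. is_external_host($ahref->href, 'example.com').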
Here is a simple way to detect external URLs:
$url = 'https://my-domain.com/demo/';
$domain = 'my-domain.com';
$internal = (
false !== stripos( $url, '//' . $domain ) || // include "//my-domain.com" and "http://my-domain.com"
stripos( $url, '.' . $domain ) || // include subdomains, like "www.my-domain.com". DANGEROUS (see below)!
(
0 !== strpos( $url, '//' ) && // exclude protocol relative URLs, like "//example.com"
0 === strpos( $url, '/' ) // include root-relative URLs, like "/demo"
)
);
The check above treats both www.my-domain.com and my-domain.com as "internal".
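Why this rule is dangerous:
The subdomain logic introduces a weakness that can be exploited: when an external URL contains your domain in its path, for example https://external.com/www.my-domain.com, it is treated as internal!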
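To make the weakness concrete, here is a small sketch that wraps the substring check in a function and feeds it such a URL. The wrapper name looks_internal is hypothetical, and the subdomain test is written with an explicit false comparison; otherwise the logic is the same as the check above.

```php
<?php
// Hypothetical wrapper around the substring-based check discussed above.
function looks_internal($url, $domain) {
    return (
        false !== stripos($url, '//' . $domain) ||  // "//my-domain.com", "http://my-domain.com"
        false !== stripos($url, '.' . $domain) ||   // subdomains: the risky part
        (
            0 !== strpos($url, '//') &&             // not protocol-relative
            0 === strpos($url, '/')                 // root-relative, like "/demo"
        )
    );
}

$bad = 'https://external.com/www.my-domain.com';

var_dump(looks_internal($bad, 'my-domain.com')); // bool(true): misclassified as internal
var_dump(parse_url($bad, PHP_URL_HOST));         // string(12) "external.com"
```

The second line shows that the actual host is external.com; the match came from the path.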
Safer code:
This problem can be eliminated by dropping subdomain support (which is what I recommend):
$url = 'https://my-domain.com/demo/';
$domain = 'my-domain.com';
$internal = (
false !== stripos( $url, '//' . $domain ) || // include "//my-domain.com" and "http://my-domain.com"
(
0 !== strpos( $url, '//' ) && // exclude protocol relative URLs, like "//example.com"
0 === strpos( $url, '/' ) // include root-relative URLs, like "/demo"
)
);
I know this post is old, but I have just written my own function for this. Maybe someone else needs it too.
function IsResourceLocal($url){
if( empty( $url ) ){ return false; }
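$urlParsed = parse_url( $url );
/* parse_url() omits 'host' for relative links and returns false for malformed URLs */
$host = isset( $urlParsed['host'] ) ? $urlParsed['host'] : '';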
if( empty( $host ) ){
/* maybe we have a relative link like: /wp-content/uploads/image.jpg */
/* add absolute path to begin and check if file exists */
$doc_root = $_SERVER['DOCUMENT_ROOT'];
$maybefile = $doc_root.$url;
/* Check if file exists */
$fileexists = file_exists ( $maybefile );
if( $fileexists ){
/* maybe you want to convert to full url? */
return true;
}
}
/* strip a leading www. if present (str_replace would also remove "www." in the middle of a host) */
$host = preg_replace( '/^www\./i', '', $host );
$thishost = $_SERVER['HTTP_HOST'];
/* strip a leading www. if present */
$thishost = preg_replace( '/^www\./i', '', $thishost );
if( $host == $thishost ){
return true;
}
return false;
}
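As a side note on the host comparison in that function, here is a minimal sketch of normalizing a host before comparing it. The helper name normalize_host is hypothetical; it assumes you want to ignore letter case, a trailing ":port" (which $_SERVER['HTTP_HOST'] can contain), and a single leading "www.".

```php
<?php
// Hypothetical helper: normalize a host name for comparison.
function normalize_host($host) {
    $host = strtolower($host);
    $host = preg_replace('/:\d+$/', '', $host); // drop a trailing ":port"
    return preg_replace('/^www\./', '', $host); // drop only a leading "www."
}

echo normalize_host('WWW.Example.com:8080'), "\n";        // example.com
echo normalize_host('awww.example.com'), "\n";            // awww.example.com
echo str_replace('www.', '', 'awww.example.com'), "\n";   // aexample.com
```

The last line shows why an unanchored str_replace is risky here: it also removes "www." from the middle of a host name.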
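function isexternal($url) {
// Test if link is internal/external
// Internal: the URL contains our domain, or it is root-relative (starts with a single "/").
// strpos() returns an int, so compare with === 0, not with the string '0'.
if(stripos($url,'domainname.com') !== false || (strpos($url,'/') === 0 && strpos($url,'//') !== 0))
{
return false;
}
else
{
return true;
}
}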
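You probably just need to check whether the link is on the same domain. This only works reliably if all href attributes are absolute and contain the domain. Relative paths like /test/file.html are tricky, because there could be a folder with the same name as the domain. So, if you have the full URL in every link: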
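function isexternal($url) {
// Test if link is internal/external
// Internal: the URL contains our domain (case-insensitive) or is root-relative.
// Use === 0 here: with ==, a "not found" result (false) would also equal '0'.
if(stristr($url, "myDomain.com") !== false || (strpos($url,"/") === 0 && strpos($url,"//") !== 0))
return false;
else
return true;
}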