How to use Python/PHP to remove redundancy in a URL link?

Many websites add tags to their URLs for tracking purposes, for example:

http://www.washingtonpost.com/blogs/answer-sheet/post/report-we-still-dont-know-much-about-charter-schools/2012/01/13/gIQAxMIeyP_blog.html?wprss=linkset&tid=sm_twitter_washingtonpost

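If we remove the suffix "?wprss=linkset&tid=sm_twitter_washingtonpost", the link still leads to the same page. Is there a general approach for removing such redundant elements? Any comments would be helpful.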

Thanks!

To remove the query and fragment parts from a URL

In Python, using urlparse (the Python 2 module name):

import urlparse
 
url = urlparse.urlsplit(URL)               # parse url
print urlparse.urlunsplit(url[:3]+('','')) # remove query, fragment parts
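
The snippet above targets Python 2. In Python 3 the same functions live in urllib.parse, so a minimal sketch of the equivalent could look like this. The helper name strip_query_and_fragment is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url):
    """Return url without its query string and fragment (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep scheme, network location and path; blank out query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


url = ("http://www.washingtonpost.com/blogs/answer-sheet/post/"
       "report-we-still-dont-know-much-about-charter-schools/2012/01/13/"
       "gIQAxMIeyP_blog.html?wprss=linkset&tid=sm_twitter_washingtonpost")
print(strip_query_and_fragment(url))
```

This should print the URL up to and including "gIQAxMIeyP_blog.html", with the tracking parameters gone.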

Or a more lightweight, but possibly less general, approach:

print URL.partition('?')[0]

According to RFC 3986, a URI can be parsed with this regular expression:

/^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/

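So if there is no fragment identifier (the last part of the regex above), or if a query component is present (the second-to-last part), then URL.partition('?')[0] should work. Otherwise, splitting the URL on "?" fails, for example: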

http://example.com/path#here-?-ereh

But the urlparse answer still works.
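
To see the difference on that example, here is a small Python 3 sketch that applies the RFC 3986 regular expression (written without the escaped slashes, which Python does not need) and then compares the two stripping approaches. The variable names are illustrative only.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

url = "http://example.com/path#here-?-ereh"

# Groups 2, 4, 5, 7 and 9 are scheme, authority, path, query and fragment.
scheme, authority, path, query, fragment = URI_RE.match(url).group(2, 4, 5, 7, 9)
print(scheme, authority, path, query, fragment)

# Splitting on "?" cuts inside the fragment and leaves part of it behind.
naive = url.partition("?")[0]
# urlsplit separates the fragment first, so the "?" inside it is not a query.
robust = urlunsplit(urlsplit(url)[:3] + ("", ""))

print(naive)   # http://example.com/path#here-
print(robust)  # http://example.com/path
```

The query group is None here because the only "?" sits after the "#", which makes it part of the fragment.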

Check whether the page can be accessed via the URL

In Python:

import urllib2
try:
    resp = urllib2.urlopen(URL)
except IOError, e:
    print "error: can't open %s, reason: %s" % (URL, e)
else:
    print "success, status code: %s, info:\n%s" % (resp.code, resp.info()),

resp.read() can be used to read the contents of the page.
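
The urllib2 code above is Python 2. A rough Python 3 counterpart, assuming urllib.request is acceptable, might look like the sketch below. The function name check_url and the 10-second default timeout are assumptions, not part of the original answer.

```python
import urllib.request


def check_url(url, timeout=10):
    """Try to open url and return (ok, detail). Hypothetical helper."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # HTTP responses carry a numeric status; other schemes may not.
            return True, getattr(resp, "status", None)
    except (OSError, ValueError) as exc:
        # URLError and HTTPError are OSError subclasses;
        # ValueError is raised for strings that are not URLs at all.
        return False, str(exc)
```

Calling check_url(URL) on a reachable page should return True together with the status code, while a 404 or a connection failure returns False and the reason as text.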

Remove the query string from the URL:

<?php
$url = 'http://www.washingtonpost.com/blogs/answer-sheet/post/report-we-still-dont-know-much-about-charter-schools/2012/01/13/gIQAxMIeyP_blog.html?wprss=linkset&tid=sm_twitter_washingtonpost';
$url = explode('?',$url);
$url = $url[0];
//check output
echo $url;
?>

To check whether the URL is valid:

You can use the PHP function get_headers($url). Example:

<?php
//$url_o = 'http://www.washingtonpost.com/blogs/answer-sheet/post/report-we-still-dont-know-much-about-charter-schools/2012/01/13/gIQAxMIeyP_blog.html?wprss=linkset&tid=sm_twitter_washingtonpost';
$url_o = 'http://mobile.nytimes.com/article?a=893626&f=21';
$url = explode('?',$url_o);
$url = $url[0];
$header = get_headers($url);
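// Fall back to the original URL if the request failed or the stripped URL is not found
if($header === false || strpos($header[0],'Not Found') !== false)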
{
    $url = $url_o;
}
//check output
echo $url; 
?>
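
The nytimes URL above shows that some query parameters are essential, so dropping the whole query string is not always safe. One possible alternative, sketched here in Python 3, is to remove only parameters that look like tracking tags and keep the rest. The block list is an assumption: "wprss" and "tid" come from the question's URL, and the "utm_" prefix is a common analytics convention. A real list would need to be maintained per site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed block list, for illustration only.
TRACKING_NAMES = {"wprss", "tid"}
TRACKING_PREFIXES = ("utm_",)


def strip_tracking(url):
    """Drop query parameters that look like tracking tags (hypothetical helper)."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in TRACKING_NAMES and not name.startswith(TRACKING_PREFIXES)
    ]
    # Rebuild the URL with only the parameters that were kept.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(strip_tracking("http://mobile.nytimes.com/article?a=893626&f=21&utm_source=feed"))
```

Note that urlencode re-encodes the remaining values, so unusual escaping in the original query may come out in a different but equivalent form.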

You can use a regular expression:

$yourUrl = preg_replace("/[?].*/","",$yourUrl);

This means: "replace the question mark and everything after it with an empty string".

You can write a small URL parser that cuts off everything from the "?" onward:

<?php
$pos = strpos($yourUrl, '?'); // First, find the index of "?" (false if there is none)
// Then keep only the characters before the "?"; leave the URL unchanged if it has no "?"
$newUrl = ($pos === false) ? $yourUrl : substr($yourUrl, 0, $pos);
echo ($newUrl);
?>