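I wrote this function, which basically scrapes Technorati for blog posts and the URLs of those posts. By the way, I searched hard for an API but could not find one. I do feel ashamed of this scraper, but there should be an API! Anyway: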
function get_technorati_bposts($kwd) {
//
global $user, $settings;
$user_id = $user->id;
if (!$user OR $user->verified != 1 OR $user->suspend != 0) {echo "no permission"; return;}
//
$items_max = $settings['scraper_num_technorati'];
$i = 0;
$p = 1;
$posts = array();
//
while ($i < $items_max) {
$url = "http://technorati.com/search?q=". urlencode($kwd) ."&return=posts&sort=relevance&topic=overall&source=blogs&authority=high&page=". $p ."";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13");
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: text/xml; charset=utf-8")); // CURLOPT_HTTPHEADER expects an array of header lines
$output = curl_exec($ch);
curl_close($ch);
//
$html = "";
$html = str_get_html($output);
foreach ($html->find(".search-results li") as $key => $elm) {
foreach ($elm->find(".offsite") as $url) {
//
$href = $url->href;
$parse = parse_url($href);
$domain = $parse['host'];
$match = 0;
foreach ($posts as $item) {
$href_b = $item['Url'];
$parse_b = parse_url($href_b);
$domain_b = $parse_b['host'];
if ($domain == $domain_b) {$match++;}
}
if ($match > 0) {continue;}
//
$posts[$i]['Url'] = $href;
$posts[$i]['Thumb'] = "http://api.snapito.com/web/".$settings['scraper_snapito_key']."/sc/" . $href . "?fast";
$posts[$i]['Title'] = $url->title;
//
$i++;
}
if ($items_max == $i) {break;}
}
//
$p++;
}
print_r(json_encode($posts));
//
}
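The problem is that, from time to time, I get Internal Server Error 500.
The log file shows:
PHP Fatal error: Call to a member function find() on a non-object in /Library/WebServer/Documents/words/lib/raper-functions.PHP on line 432
Is this because cURL timed out? Is there anything I can do to avoid it? For example, if cURL returns nothing, could I call the function again so that it gets the content at some point?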
(General advice) Always check return values for errors:
$output = curl_exec($ch);
if($output === FALSE) {
// when output is false it can't be used in str_get_html()
// output a proper error message in such cases
die(curl_error($ch));
}
Whenever a function fails, read the manual. :) Every function's page has a section called "Return Values".
By the way, why do you initialize $html = ""; to an empty string if you assign it again on the very next line?
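The return-value check can be combined with the retry the question asks about. The following is only a minimal sketch, not a definitive implementation: the helper names `fetch_with_retry()` and `curl_get()`, the default of three attempts, and the optional delay are all assumptions made for illustration and do not appear in the original code.

```php
<?php
/**
 * Hypothetical helper: call $fetch up to $max_tries times until it returns
 * a non-empty string. $fetch is any callable that returns the response body,
 * or false on failure (the same contract as curl_exec() when
 * CURLOPT_RETURNTRANSFER is enabled).
 * Returns the body on success, or false once every attempt has failed.
 */
function fetch_with_retry(callable $fetch, $max_tries = 3, $delay_seconds = 0)
{
    for ($attempt = 1; $attempt <= $max_tries; $attempt++) {
        $body = $fetch();
        if (is_string($body) && $body !== '') {
            return $body; // usable content, stop retrying
        }
        if ($attempt < $max_tries && $delay_seconds > 0) {
            sleep($delay_seconds); // brief pause before the next attempt
        }
    }
    return false; // give up after $max_tries failed attempts
}

/**
 * Hypothetical helper: one cURL GET request.
 * Returns the body as a string, or false on any cURL error.
 */
function curl_get($url, $timeout = 20)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    $output = curl_exec($ch);
    if ($output === false) {
        // Read the error before the handle is closed
        error_log('cURL error: ' . curl_error($ch));
    }
    curl_close($ch);
    return $output;
}
```

Inside the while loop, a call such as `fetch_with_retry(function () use ($url) { return curl_get($url); })` could replace the inline cURL calls, and the loop should stop (or skip that page) when the result is `false`. The value returned by `str_get_html()` deserves the same check before `find()` is called on it, since Simple HTML DOM returns `false` rather than an object when it is given an empty string.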