cURL times out when web scraping: "PHP Fatal error: Call to a member function find() on a non-object"

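I created this function, which basically scrapes Technorati for blog posts and the URLs of those posts. By the way, I tried hard to find an API for this and couldn't. I'm rather ashamed of this scraper, but there ought to be an API! Anyway: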

function get_technorati_bposts($kwd) {
        //
        global $user, $settings;
        $user_id = $user->id;
        if (!$user OR $user->verified != 1 OR $user->suspend != 0) {echo "no permission"; return;}
        //
        $items_max = $settings['scraper_num_technorati'];
        $i = 0;
        $p = 1;
        $posts = array();
        //
        while ($i < $items_max) {
            $url = "http://technorati.com/search?q=". urlencode($kwd) ."&return=posts&sort=relevance&topic=overall&source=blogs&authority=high&page=". $p ."";
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_URL, $url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
            curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
            curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13");
            curl_setopt($ch, CURLOPT_HEADER, FALSE);
            curl_setopt($ch, CURLOPT_TIMEOUT, 20);
            // CURLOPT_HTTPHEADER expects an array of header lines, not a string
            curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: text/xml; charset=utf-8"));
            $output = curl_exec($ch);
            curl_close($ch);
            //
            $html = "";
            $html = str_get_html($output);
            foreach ($html->find(".search-results li") as $key => $elm) {
                foreach ($elm->find(".offsite") as $url) {
                    //
                    $href = $url->href;
                    $parse = parse_url($href);
                    $domain = $parse['host'];
                    $match = 0;
                    foreach ($posts as $item) {
                        $href_b = $item['Url'];
                        $parse_b = parse_url($href_b);
                        $domain_b = $parse_b['host'];
                        if ($domain == $domain_b) {$match++;}
                    }
                    if ($match > 0) {continue;}
                    //
                    $posts[$i]['Url'] = $href;
                    $posts[$i]['Thumb'] = "http://api.snapito.com/web/".$settings['scraper_snapito_key']."/sc/" . $href . "?fast";
                    $posts[$i]['Title'] = $url->title;
                    //
                    $i++;
                }
                if ($items_max == $i) {break;}
            }
            //
            $p++;
        }
        print_r(json_encode($posts));
        //
    }

The problem is that every now and then I get an Internal Server Error 500.

The log file shows:

PHP Fatal error: Call to a member function find() on a non-object in /Library/WebServer/Documents/words/lib/raper-functions.PHP on line 432

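Is this because cURL timed out? Is there anything I can do to avoid it? For example, if cURL returns nothing, could I call the function again so that the content is eventually fetched?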

(General advice) Always check return values for errors:

$output = curl_exec($ch);
if($output === FALSE) {
    // when output is false it can't be used in str_get_html()
    // output a proper error message in such cases
    die(curl_error($ch));
}

Always read the manual when a function fails. :) Every function's page has a section called "Return Values".
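As for the "call it again" part of the question, here is a minimal retry sketch. `fetch_with_retry()` and `curl_fetch()` are hypothetical helper names, not part of cURL, and the attempt count and delay are arbitrary assumptions. The fetch step is passed in as a callable so the retry logic stays separate from the cURL call.

```php
<?php
/**
 * Hypothetical helper: calls $fetch until it returns a non-empty string,
 * giving up after $maxAttempts tries. Returns the string, or false on failure.
 */
function fetch_with_retry(callable $fetch, $maxAttempts = 3, $delayMs = 0)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $body = $fetch();
        if (is_string($body) && $body !== '') {
            return $body;
        }
        // Optional pause between attempts (assumption: a fixed delay is enough).
        if ($attempt < $maxAttempts && $delayMs > 0) {
            usleep($delayMs * 1000);
        }
    }
    return false;
}

/**
 * Hypothetical wrapper around the cURL calls from the question.
 * Returns the response body, or false when curl_exec() fails (e.g. a timeout).
 */
function curl_fetch($url, $timeout = 20)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    $output = curl_exec($ch);
    if ($output === false) {
        // Read the error before closing the handle.
        error_log('cURL error: ' . curl_error($ch));
    }
    curl_close($ch);
    return $output;
}
```

Inside the loop you could then write `$output = fetch_with_retry(function () use ($url) { return curl_fetch($url); });` and, when the result is `false`, skip that page while still incrementing `$p`. It is probably also worth capping the number of pages, otherwise the `while` loop can spin indefinitely when every request fails or the search returns no results.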


By the way, why initialize `$html = "";` as an empty string if you assign it again on the very next line?
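Those two lines are also where the fatal error originates: as far as I know, Simple HTML DOM's `str_get_html()` returns `false` for empty input, so a failed `curl_exec()` ends with `find()` being called on `false`. A small guard avoids that. This is only a sketch, and `safe_find()` is a hypothetical helper name, not part of the library:

```php
<?php
/**
 * Hypothetical guard: returns the matches for $selector, or an empty array
 * when $dom is not an object that offers a find() method.
 */
function safe_find($dom, $selector)
{
    if (!is_object($dom) || !method_exists($dom, 'find')) {
        return array();
    }
    return $dom->find($selector);
}
```

With that in place, `foreach (safe_find(str_get_html($output), ".search-results li") as $key => $elm)` simply iterates zero times on a failed page instead of raising a fatal error.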