PHP cURL web-scraper intermittently returns error "Recv failure: Connection was reset"

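I have written a very basic web scraper in PHP using cURL and DOM. I run it locally on Windows 10 with XAMPP (Apache & MySQL). It scrapes about 5 values from each of 400 pages on one particular website (about 2,000 values in total). The job usually finishes in under 120 seconds, but intermittently (roughly one run in five) it stops at around the 60-second mark with the following error: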

Recv failure: Connection was reset

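Probably irrelevant, but all of my scraped data goes into a MySQL table, and a separate .php file styles and presents that data. That part works fine. It is cURL that raises the error. Here is my (heavily trimmed) code: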

$html = file_get_html('http://IPAddressOfSiteIAmScraping/subpage/listofitems.html');
//Some code that creates my SQL table.
//Finds all subpages on the site - this part works like a charm.
foreach($html->find('a[href^=/subpage/]') as $uniqueItems){
   //3 array variables defined here, which I didn't include in this example.
   $path = $uniqueItems->href;
   $url = 'http://IPAddressOfSiteIAmScraping' . $path;
//Here's the cURL part - I suspect this is the problem. I am an amateur!
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, trim($url));
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); //An attempt to fix it - didn't work.
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); //An attempt to fix it - didn't work.
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($curl, CURLOPT_TIMEOUT, 1200); //Amount of time I let cURL execute for.
$page = curl_exec($curl);
//This is the part that throws up the connection reset error.
if(curl_errno($curl)) {
    echo 'Scraping error: ' . curl_error($curl);
    exit; }
curl_close($curl);
//Here we use DOM to begin collecting specific cURLed values we want in our SQL table.
$dom = new DOMDocument;
$dom->encoding = 'utf-8'; //Allows the DOM to display html entities for special characters like ö.
@$dom->loadHTML(utf8_decode($page)); //Loads the HTML of the cURLed page.
$xpath = new DOMXpath($dom); //Allows us to use Xpath values.
//Xpaths that I've set - this is for the SQL part. Probably irrelevant.
$header = $xpath->query('(//div[@id="wrapper"]//p)[@class="header"][1]');
$price = $xpath->query('//tr[@class="price_tr"]/td[2]');
$currency = $xpath->query('//tr[@class="price_tr"]/td[3]'); 
$league = $xpath->query('//td[@class="left-column"]/p[1]');
//Here we collect specifically the item name from the DOM.
foreach($header as $e) {
    $temp = new DOMDocument();
    $temp->appendChild($temp->importNode($e,TRUE));
    $val = $temp->saveHTML();
    $val = strip_tags($val); //Removes the <p> tag from the data that goes into SQL.
    $val = mb_convert_encoding($val, 'html-entities', 'utf-8'); //Allows the HTML entity for special characters to be handled.
    $val = html_entity_decode($val); //Converts HTML entities for special characters to the actual character value.
    $final = mysqli_real_escape_string($conn, trim($val)); //Defense against SQL injection attacks by canceling out single apostrophes in item names.
    $item['title'] = $final; //Here's the item name, ready for the SQL table.
}
//Here's a bunch of code where I write to my SQL table. Again, this part works great!
}

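If I need to abandon DOM, I'm not opposed to switching to regex, but I lurked for three days before choosing DOM over regex. I have spent a lot of time researching this problem, but everything I find is about "Recv failure: Connection reset by peer", which is not the message I'm getting. I'm really frustrated that I have to ask for help. I had been doing well so far, just learning as I go. This is the first thing I have ever written in PHP.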

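TL;DR: I wrote a cURL web scraper that works brilliantly only 80% of the time. The other 20% of the time, for an unknown reason, it fails with "Recv failure: Connection was reset".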

Hopefully someone can help me! :) Thanks for reading, even if you can't!

P.S. If you would like to see my full code, it's here: http://pastebin.com/vf4s0d5L.

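After a lot of research (I had already been looking into this for days before posting the question), I gave in and accepted that this error is probably related to the site I'm trying to scrape, and is therefore out of my control.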

I did manage to work around it, though, so I'll drop my workaround here...

$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, trim($url));
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($curl, CURLOPT_TIMEOUT, 1200); //Amount of time I let cURL execute for.
$page = curl_exec($curl);
if (curl_errno($curl)) {
    echo 'Scraping error: ' . curl_error($curl) . '<br>';
    echo 'Dropping table...<br>';
    $sql = "DROP TABLE table_item_info";
    if (mysqli_query($conn, $sql)) {
        echo 'Table has been dropped. Restarting.<br>';
    } else {
        echo 'Could not drop table: ' . mysqli_error($conn) . '<br>';
    }
    mysqli_close($conn);
    goto start; //Jumps back to the "start:" label at the top of the file.
}
curl_close($curl);

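Basically, what I did was implement error checking. If curl_errno($curl) reports an error, I assume it is the connection reset error. In that case I drop my SQL table and then use "goto start" to jump back to the beginning of the script. At the top of my file I have the matching label, "start:".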
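The check above treats every cURL error as a connection reset. As a rough sketch of a narrower test, the error number can be compared against the libcurl codes that usually indicate a transient network problem; 56 (CURLE_RECV_ERROR) is the code behind "Recv failure". The helper name isTransientCurlError and the choice of which codes count as transient are assumptions made for illustration, not part of the original script.

```php
<?php
// libcurl error numbers that usually point to a transient network problem.
// PHP's curl extension exposes the same values as CURLE_* constants.
const TRANSIENT_CURL_ERRORS = [
    7,  // CURLE_COULDNT_CONNECT
    28, // CURLE_OPERATION_TIMEDOUT
    55, // CURLE_SEND_ERROR
    56, // CURLE_RECV_ERROR, e.g. "Recv failure: Connection was reset"
];

// Hypothetical helper: true when the error number looks worth retrying.
function isTransientCurlError(int $errno): bool
{
    return in_array($errno, TRANSIENT_CURL_ERRORS, true);
}
```

It could be used as if (isTransientCurlError(curl_errno($curl))) { ... } so that other failures, such as a malformed URL, are reported instead of triggering a restart.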

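This solved my problem! Now I don't need to worry about whether the connection gets reset. My code is smart enough to detect that on its own and, if it happens, restart the script.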
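A less drastic variation, sketched here under assumptions, is to retry only the page that failed rather than dropping the table and restarting the whole run. The names fetchWithRetry and curlFetch, the attempt limit, and the delay values are all hypothetical and would need tuning for the site being scraped.

```php
<?php
// Hypothetical helper: call $fetch until it succeeds or the attempts run out.
// $fetch must return the page body, or throw RuntimeException on failure.
function fetchWithRetry(callable $fetch, string $url, int $maxAttempts = 3, int $baseDelayMs = 500): string
{
    for ($attempt = 1; ; $attempt++) {
        try {
            return $fetch($url);
        } catch (RuntimeException $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // Out of attempts: let the caller decide what to do.
            }
            // Wait longer after each failure (base, 2x base, 4x base, ...).
            usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
        }
    }
}

// Hypothetical cURL-based fetcher using a subset of the options shown above.
function curlFetch(string $url): string
{
    $curl = curl_init(trim($url));
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, 1200);
    $page = curl_exec($curl);
    if ($page === false) {
        $message = curl_error($curl);
        curl_close($curl);
        throw new RuntimeException($message);
    }
    curl_close($curl);
    return $page;
}
```

Inside the foreach loop, $page = fetchWithRetry('curlFetch', $url); would then stand in for the curl_init ... curl_close block, and rows already written to the table stay in place when a single request is reset.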

Hope this helps!