如果我将$number<=放入for循环25或<=50或<=75,这个脚本的工作方式正是我想要的,如果我输入100或更高,它会抛出一个错误:
致命错误:允许的内存大小67108864字节已用完(尝试分配24字节)/public_html/phpscraper/simplehtmldom_1_5/simple_html_dom.php在线1074
有更好的编码方法吗?从网站上获取数据似乎一点也不奇怪。我需要分配或初始化更多内存吗(没有经验)。它看起来并不是一个复杂的php任务。我也不知道这个网页是什么,我只是为一家公司做这件事,他们选择了这个目录。
感谢阅读。这是代码
<?php
include_once 'simple_html_dom.php';
echo 'Category, AvnetPartNumber, Manufacturer, Price, Availability,';
//number is the value of which item the page starts with
for($number = 0; $number <= 100; $number = $number + 25){
// Create DOM from URL or file
$url = "http://avnetexpress.avnet.com/store/em/EMController/_/N-?Ns=PartNumber|0&action=excess_inventory&catalogId=&cutTape=&inStock=&langId=-1&myCatalog=&npi=&proto=®ionalStock=&rohs=&storeId=500201&term=&topSellers=&No=".$number;
$html = file_get_html($url);
$i = 0;
// Find all images
foreach($html->find('td[class=small dataTd]') as $element) {
if($i == 1 || $i == 3 || $i == 4 || $i == 7 || $i == 8){
echo $element->plaintext . ',' ;
}
if($i == 8){
$i =0;
}
else{
$i++;
}
}
}
?>
我会做一些不同的事情,首先我会使用curl(2个原因是它更快,通过设置useragent,你可以看起来像一个普通的浏览器),最后不用担心simple_html_dom
,你可以用PHP内置的domDocument做什么。
此外,您不希望重置$i
&8,因为每行有10列,这会使您的结果发生偏差,所以在9上重置将按预期创建新行,在我的示例中,我将所有数据放在一个数组中,但您应该将其放在数据库等中,正如您在4页中看到的那样,其峰值内存使用量为1.40MB,希望这会有所帮助。
<?php
$url = 'http://avnetexpress.avnet.com/store/em/EMController/_/N-?Nn=50&Ns=PartNumber|0&action=excess_inventory&catalogId=&cutTape=&inStock=&langId=-1&myCatalog=&npi=&proto=®ionalStock=&rohs=&storeId=500201&term=&topSellers=&No=';
//4 pages
$result = run_scrap($url,100,25);
//Memory usage
$memory = array();
$memory['used'] = getReadableFileSize(memory_get_peak_usage());
$memory['total'] = ini_get("memory_limit").'B';
print_r($result);
print_r($memory); //Array ( [used] => 1.40 MB [total] => 128MB )
/** Result
* Array
(
[0] => Array
(
[title] => Logic and Timing - Crystals
[partnum] => ##BP11DCRK430
[manufactuere] => TOKO America
[price] => $0.3149
[availability] => 4500 Stock
)
[1] => Array
(
[title] => Inductor - Inductor Leaded
[partnum] => #187LY-471J
[manufactuere] => TOKO America
[price] => $0.3149
[availability] => 100 Stock
)
...
*/
function run_scrap($url,$total_items=100,$step=25){
$range = range(0,$total_items,$step);
$result = array();
foreach($range as $page){
$src = curl_get($url.$page);
$result = array_merge($result,process($src));
}
return $result;
}
function process($src){
$return = array();
$dom = new DOMDocument("1.0","UTF-8");
@$dom->loadHTML($src);
$dom->preserveWhiteSpace = false;
$return = array();
$i=0;
$r=0;
foreach($dom->getElementsByTagName('td') as $ret) {
if($ret->getAttribute('class') == 'small dataTd'){
switch($i){
case 1:
$return[$r]['title'] = trim($ret->nodeValue);
break;
case 3:
$return[$r]['partnum'] = trim($ret->nodeValue);
break;
case 4:
$return[$r]['manufactuere'] = trim($ret->nodeValue);
break;
case 7:
$return[$r]['price'] = trim($ret->nodeValue);
break;
case 8:
$return[$r]['availability'] = trim($ret->nodeValue);
break;
default:
break;
}
//Reset after col 9
if($i == 9){
$i = 0;
$r++;
}else{
$i++;
}
}
}
return $return;
}
function curl_get($url){
$return = '';
(function_exists('curl_init')) ? '' : die('cURL Must be installed!');
$curl = curl_init();
$header[0] = "Accept: text/xml,application/xml,application/json,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9.0.2) Gecko/2008092313 Ubuntu/9.25 (jaunty) Firefox/3.8');
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
curl_setopt($curl, CURLOPT_HEADER, 0);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_TIMEOUT, 30);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
$result = curl_exec($curl);
curl_close($curl);
return $result;
}
//Debug Function - not related to the scrapper
function getReadableFileSize($size, $retstring = null) {
$sizes = array('bytes', 'kB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB');
if ($retstring === null) { $retstring = '%01.2f %s'; }
$lastsizestring = end($sizes);
foreach ($sizes as $sizestring) {
if ($size < 1024) { break; }
if ($sizestring != $lastsizestring) { $size /= 1024; }
}
if ($sizestring == $sizes[0]) { $retstring = '%01d %s'; }
return sprintf($retstring, $size, $sizestring);
}
?>
<?
require_once("SimpleHtmlDom/simple_html_dom.php");
$_htmlDom = new simple_html_dom();
echo 'Category, AvnetPartNumber, Manufacturer, Price, Availability,';
//number is the value of which item the page starts with
for($number = 0; $number <= 200; $number += 25){
// Create DOM from URL or file
$url = "http://avnetexpress.avnet.com/store/em/EMController/_/N-?Ns=PartNumber|0&action=excess_inventory&catalogId=&cutTape=&inStock=&langId=-1&myCatalog=&npi=&proto=®ionalStock=&rohs=&storeId=500201&term=&topSellers=&No=".$number;
$html = file_get_contents($url);
$_htmlDom->load($html);
$i = 0;
$elementList = $_htmlDom->find('td[class=small dataTd]');
// Find all images
foreach($elementList as $element) {
if($i == 1 || $i == 3 || $i == 4 || $i == 7 || $i == 8){
echo $element->plaintext . ',' ;
}
if($i == 8){
$i = 0;
}else{
$i++;
}
flush();
}
}
?>
这个版本在128MB RAM NAS中测试(实际上它小于80MB RAM),它是有效的
我只是修改了一些东西:
- 更改为OO
- 使用`$elementList`变量存储`'td[class=small dataId]'`
- 测试数量最多200
- 已添加`flush()`
问题似乎与您的代码无关。当您增加$number
时,您正在增加(看起来是)搜索返回的结果数。
返回的结果越多,生成的网页就越大,因此最终会得到更多的DOM节点和链接。问题是,当您尝试调用$html->find()
时,内存(在PHP中)将耗尽。我不确定这个解析器是如何工作的,但当您加载脚本时,它可能会将所有节点解析到内存中。
解决方案是提高PHP的内存限制,或者每次请求的结果少于100个,因为这似乎是内存耗尽的时候。
您可以通过在脚本开头调用ini_set('memory_limit', '128M');
来增加内存限制。注:我刚刚从哪里选了128M
。将其设置为您认为需要的任何值。