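I'm trying to strip the tags from this page so that I can get a list of page numbers. That way I can work out the highest page number the curl program should keep scraping up to. Right now I can strip the tags down to the point where I get a number, but I don't know how to separate the individual numbers so that I can see which page number is the highest.
The current return value I'm getting is: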
12
Here is my code:
<?php
// Defining the basic pruning function
function scrape_between($data, $start, $end){
$data = stristr($data, $start); // Stripping all data from before $start
$data = substr($data, strlen($start)); // Stripping $start
$stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
$data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
return $data; // Returning the scraped data from the function
}
ob_start();
?>
<span class="current">1</span><a href="javascript:__doPostBack('ctl00$phCenterColumn$motoSearchResults$gvCatalog$ctl01$ctl03','')">2</a><a href="javascript:__doPostBack('ctl00$phCenterColumn$motoSearchResults$gvCatalog$ctl01$ctl04','')">
<?php
$variable = ob_get_clean();
$startend5 = Array('">' => '</a>');
foreach($startend5 as $o => $p){
$data = scrape_between($variable, $o, $p);
}
$data = strip_tags($data);
echo $data;
?>
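FYI, ob_start(); and ob_get_clean(); are only there for this example. I didn't want to make the code longer than necessary by including all the curl commands.
May I recommend the PHP Simple HTML DOM library? You can access the values like this: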
<?php
include 'simple_html_dom.php';
$html = str_get_html("<span class=\"current\">1</span><a href=\"javascript:__doPostBack('ctl00\$phCenterColumn\$motoSearchResults\$gvCatalog\$ctl01\$ctl03','')\">2</a><a href=\"javascript:__doPostBack('ctl00\$phCenterColumn\$motoSearchResults\$gvCatalog\$ctl01\$ctl04','')\">");
$currentPage = $html->find('span.current');
foreach($currentPage as $page)
{
echo 'current page: ' . $page->plaintext . '<br />';
}
$otherPages = $html->find('a');
echo 'other pages: ';
foreach($otherPages as $otherPage)
{
echo $otherPage->plaintext . ' ';
}
?>
This gives me:
current page: 1
other pages: 2
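As for the original "12": scrape_between() keeps everything from the first `">` to the first `</a>`, which here is `1</span><a href="...">2`, and strip_tags() then glues the two text nodes together. To get the highest page number, it may be easier to collect each number separately and take the maximum. Below is a minimal sketch using PHP's built-in DOMDocument instead of simple_html_dom. The function name highest_page_number is hypothetical, and the sketch assumes the input contains only the pager fragment, so every numeric `<span>` or `<a>` is a page number.

```php
<?php
// Hypothetical helper (not from the original post): returns the highest
// page number found in the pager markup, or null if there is none.
function highest_page_number(string $html): ?int
{
    $doc = new DOMDocument();
    // The fragment is not a complete, valid document, so collect libxml
    // warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $numbers = [];
    // Assumption: page numbers are the text of <span> and <a> elements.
    foreach (['span', 'a'] as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $text = trim($node->textContent);
            // Keep only purely numeric labels; skips empty or "Next" links.
            if (preg_match('/^\d+$/', $text)) {
                $numbers[] = (int) $text;
            }
        }
    }

    return $numbers ? max($numbers) : null;
}

// The pager fragment from the question (single-quoted, so $ is literal).
$pager = '<span class="current">1</span>'
    . '<a href="javascript:__doPostBack(\'ctl00$phCenterColumn$motoSearchResults$gvCatalog$ctl01$ctl03\',\'\')">2</a>'
    . '<a href="javascript:__doPostBack(\'ctl00$phCenterColumn$motoSearchResults$gvCatalog$ctl01$ctl04\',\'\')">';

echo highest_page_number($pager); // prints 2
```

In a real scrape you would probably pass in only the pager's container element rather than the whole page, since any other numeric `<span>` or `<a>` would otherwise be counted as well.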