PHP cURL code to skip lines when scraping

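I am using cURL to scrape an HTML page. It scrapes the data between the pre tags perfectly. However, I want to skip the first five lines. Is there anything I can add to the code to do this? Here is my code: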

<?php
function curl_download($Url){
    if (!function_exists('curl_init')){
        die('cURL is not installed. Install and try again.');
    }
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $output = curl_exec($ch);
    $start = strpos($output, '<pre>');
    $end = strpos($output, '</pre>', $start);
    $length = $end-$start;
    $output = substr($output, $start, $length);

    curl_close($ch);
    return $output;
}
print curl_download('http://athleticsnews.co.za/results/20140207BOLALeague3/140207F006.htm');
?>

The input HTML looks like this:

<pre>
AllTrax Timing - Contractor License                     4/22/2014 - 8:31 AM
                Boland Athletics League 3 - 2/7/2014                    
                        Hosted by Maties AC                             
                     Coetzenburg, Stellenbosch                          
Event 6  Girls 14-15 200 Meter Sprint

So I am trying to exclude the first four lines plus the blank line, and start scraping from the line that begins with Event 6.

You can use a regular expression to split the text into lines and select the ones you want:

$str = curl_download('http://.../140207F006.htm');
$re = "/([^\n\r]+)/m";
preg_match_all($re, $str, $matches);
print_r($matches[1]);

Result:

Array
(
    [0] =>  AllTrax Timing - Contractor License                     4/22/2014 - 8:31 AM
    [1] =>                     Boland Athletics League 3 - 2/7/2014                    
    [2] =>                             Hosted by Maties AC                             
    [3] =>                          Coetzenburg, Stellenbosch                          
    [4] =>  
    [5] => Event 6  Girls 14-15 200 Meter Sprint
    [6] => ============================================================================
    [7] =>     Name                     Age Team                    Finals  Wind Points
    [8] => ============================================================================
    [9] => Finals                                                                      
    [10] =>   1 Shan Fourie                  Bola                     29.03   NWI  10  
)
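
One caveat: a pattern that matches runs of non-newline characters skips any line that is completely empty, which shifts the indices. In the sample the "blank" line contains whitespace, which is why it still appears as element 4. A minimal sketch of an alternative, assuming you want array indices that map one-to-one to line numbers, is to split on line breaks with preg_split instead. The helper name split_lines and the sample text are made up for illustration.

```php
<?php
// Hypothetical helper: split text into lines, keeping empty lines so that
// array index N always corresponds to line N + 1 of the input.
function split_lines(string $text): array
{
    // Try \r\n first so a Windows line break is not split into two lines.
    return preg_split('/\r\n|\r|\n/', $text);
}

// Made-up sample; the third line is completely empty.
$sample = "Header A\nHeader B\n\nEvent 6  Girls 14-15 200 Meter Sprint";
$lines = split_lines($sample);

echo count($lines), PHP_EOL; // the empty line is counted as well
echo $lines[3], PHP_EOL;     // the "Event 6" line
```

With this approach the empty line is kept as an empty string, so "skip the first N lines" means the same thing regardless of how many of those lines are blank.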

To print only the lines from index 5 onward (starting at the Event 6 line), you can do:

$matches = $matches[1];
$str = "";
for($i = 5; $i <= 10; $i++) {
    $str .= $matches[$i] . PHP_EOL; // Preserve the new line
}
echo $str;

Result:

Event 6  Girls 14-15 200 Meter Sprint
============================================================================
    Name                     Age Team                    Finals  Wind Points
============================================================================
Finals                                                                      
  1 Shan Fourie                  Bola                     29.03   NWI  10  

Demo: http://ideone.com/ijPiP6
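
The loop above hard-codes the last index (10), so it would need adjusting for pages with more results. As a sketch of an alternative, assuming the lines are already in a zero-indexed array like $matches[1], array_slice can drop a fixed number of leading lines, or the start can be located by looking for the first line that begins with "Event". The function names skip_lines and from_first_event are hypothetical, and the sample array is a shortened copy of the output shown earlier.

```php
<?php
// Hypothetical helper: drop the first $count lines and join the rest.
function skip_lines(array $lines, int $count): string
{
    return implode(PHP_EOL, array_slice($lines, $count));
}

// Hypothetical helper: keep everything from the first line that starts with
// "Event" (assumption: every results page contains such a line, and $lines
// is a zero-indexed list so keys equal positions).
function from_first_event(array $lines): string
{
    foreach ($lines as $i => $line) {
        if (strpos(ltrim($line), 'Event') === 0) {
            return implode(PHP_EOL, array_slice($lines, $i));
        }
    }
    return ''; // no "Event" line found
}

// Shortened sample based on the output shown earlier.
$lines = [
    ' AllTrax Timing - Contractor License',
    '   Boland Athletics League 3 - 2/7/2014',
    '   Hosted by Maties AC',
    '   Coetzenburg, Stellenbosch',
    ' ',
    'Event 6  Girls 14-15 200 Meter Sprint',
    'Finals',
];

echo skip_lines($lines, 5), PHP_EOL;
echo from_first_event($lines), PHP_EOL;
```

Searching for the "Event" line is likely the more robust of the two if the number of header lines varies between pages, while the fixed offset is simpler when the layout is known to be stable.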