Screen scrape using cURL only gets header and footer

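I've been asked to screen scrape this website. The script works on many other sites, but for some reason it only gets the header and footer of this one (http://www.coast-stores.com/SOPHIE-DRESS/Dresses/coast/fcp-product/2224724715)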

function get_url($url, $javascript_loop = 0) {
    $curl = curl_init();
    $header = array();
    $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,"
              . "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: ";
    // The cookie file must be writable by PHP; '/cookies.txt' pointed at the filesystem root.
    $cookie = __DIR__ . '/cookies.txt';
    $timeout = 30;

    curl_setopt($curl, CURLOPT_URL,             $url);
    curl_setopt($curl, CURLOPT_USERAGENT,       'Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7 (.NET CLR 3.5.30729)');
    curl_setopt($curl, CURLOPT_HTTPHEADER,      $header);
    curl_setopt($curl, CURLOPT_ENCODING,        'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER,     true);
    curl_setopt($curl, CURLOPT_REFERER,         'http://google.co.uk/');
    curl_setopt($curl, CURLOPT_TIMEOUT,         20);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT,  $timeout);
    curl_setopt($curl, CURLOPT_COOKIEJAR,       $cookie);
    curl_setopt($curl, CURLOPT_COOKIEFILE,      $cookie);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER,  true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION,  false);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER,  false);    // disables certificate checks (insecure); not actually required for https urls
    curl_setopt($curl, CURLOPT_MAXREDIRS,       30);       // only used when CURLOPT_FOLLOWLOCATION is true
    $responseHTML   = curl_exec($curl);
    $response       = curl_getinfo($curl);
    curl_close($curl); // close the connection

    // Follow HTTP redirects by hand, since CURLOPT_FOLLOWLOCATION is off above.
    // Note: get_headers() makes a separate request that does not use the cURL cookie jar.
    if ($response['http_code'] == 301 || $response['http_code'] == 302)
    {
        ini_set("user_agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1");
        if ( $headers = get_headers($response['url']) )
        {
            foreach( $headers as $value )
            {
                if ( substr( strtolower($value), 0, 9 ) == "location:" )
                    return get_url( trim( substr( $value, 9 ) ), $javascript_loop );
            }
        }
    }
    // Follow JavaScript redirects found in the response body, at most 5 times.
    if (
        (preg_match("/>[[:space:]]+window\.location\.replace\('(.*)'\)/i", $responseHTML, $value)
        || preg_match("/>[[:space:]]+window\.location\=\"(.*)\"/i", $responseHTML, $value))
        && $javascript_loop < 5
    )
    {
        return get_url( $value[1], $javascript_loop + 1 );
    }
    else
    {
        return $responseHTML; //array( $responseHTML, $response );
    }
}
// uses the function and displays the text off the website
$text = get_url($_GET['url']);
echo $text;

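Any idea why it doesn't fetch the main content? Could the content be delivered after the HTML is displayed?

Example of the script running here: http://www.mattfacer.com/scraping/scraping2.php?url=http://www.coast-stores.com/SOPHIE-DRESS/Dresses/coast/fcp-product/2224724715

Try it on any other site and it seems to work!

Thanks for any help!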

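I checked out the URL, but it redirects back to the home page. Can you give us the steps to reach the page you are looking for? Either the page has been removed, or the site uses cookies to allow access to certain parts of the site.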
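
One way to check the redirect and cookie theory is to let cURL follow redirects itself, so that any cookies set along the way are sent back on each hop, and then look at where the request actually ended up. The following is only a rough sketch, not a confirmed fix for this site: `fetch_following_redirects()`, `describe_fetch()` and the cookie file path are hypothetical names chosen for illustration, and the timeout and redirect limits are arbitrary.

```php
<?php
// Sketch only: function names, limits and the cookie file location are assumptions.

// Fetch a URL, letting cURL follow redirects and keep cookies between hops.
function fetch_following_redirects(string $url, string $cookieFile): array
{
    $curl = curl_init($url);
    curl_setopt_array($curl, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,        // follow 301/302 on the same handle
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_COOKIEFILE     => $cookieFile, // enables the cookie engine and loads saved cookies
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are saved here when the handle is released
        CURLOPT_ENCODING       => '',          // accept every encoding this cURL build supports
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body = curl_exec($curl);                  // false on failure
    $info = curl_getinfo($curl);

    // The handle is released when this function returns, which writes the cookie jar.
    return ['body' => $body, 'info' => $info];
}

// Turn the curl_getinfo() array into a one-line summary of what happened.
function describe_fetch(string $requestedUrl, array $info): string
{
    $finalUrl  = $info['url'] ?? '';
    $redirects = $info['redirect_count'] ?? 0;
    $status    = $info['http_code'] ?? 0;

    if ($redirects > 0 && $finalUrl !== $requestedUrl) {
        return "HTTP $status after $redirects redirect(s); ended at $finalUrl";
    }
    return "HTTP $status with no redirect; served $finalUrl";
}
```

For example, `$result = fetch_following_redirects($url, __DIR__ . '/cookies.txt'); echo describe_fetch($url, $result['info']);` should show whether the product URL is being bounced to the home page. If the summary reports no redirect and the body still lacks the main content, the content is probably added by JavaScript after the page loads, which cURL will not execute.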