Screen scrape using PHP and fopen

Possible duplicate:
Screen scraping using file_get_contents in PHP

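Can anyone help me? I'm trying to scrape hotel reviews from LateRooms.com. Please don't tell me this is a bad idea, because I already have permission as an affiliate member.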

My code:

<?php
header('content-type: text/plain');
$contents = file_get_contents('http://www.laterooms.com/en/hotel-reviews/238902_the-westfield-bb-sandown.aspx');
// Collapse runs of whitespace into a single space
$contents = preg_replace('/\s{1,}/', ' ', $contents);
print $contents . "\n";
$records = preg_split('/<div id="review/', $contents);
for ($ix = 1; $ix < count($records); $ix++) {
    $tmp = $records[$ix];
    preg_match('/id="review"/', $tmp, $match_reviews);
    print_r($match_reviews);
    exit();
}
?>

This mostly works. The only problem is that it pulls in the markup of the whole page and does not match the div with id "review".

Thanks in advance.
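
The mismatch described above can be reproduced on a small string. This is only a sketch: the `$sample` markup and its `review1`/`review2` ids are hypothetical, since the real page structure is not shown in the post. It illustrates two things: `preg_split()` discards the delimiter, so the chunks no longer contain `<div id="review`, and the pattern `id="review"` would not match an id with a suffix anyway.

```php
<?php
// Hypothetical sample markup; the real LateRooms page structure is an assumption.
$sample = '<p>intro</p><div id="review1">Great stay</div><div id="review2">Too noisy</div>';

// preg_split() removes the delimiter text from the resulting chunks.
$records = preg_split('/<div id="review/', $sample);

// $records[0] is everything before the first review block,
// $records[1] starts right AFTER '<div id="review'.
$found = preg_match('/id="review"/', $records[1], $match);

var_dump($found);   // 0: the text being searched for was consumed by the split
print_r($records);
```

Under these assumptions, each chunk after the first begins in the middle of the opening tag, which is why the later `preg_match()` call returns nothing.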

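<?php
// Fetch a URL with cURL and return the response body (false on failure).
function file_get_contents_curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);          // leave response headers out of the result
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);  // follow HTTP redirects
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}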
// Return the inner HTML of a DOM element by serialising each child node.
function DOMinnerHTML($element) {
    $innerHTML = "";
    $children = $element->childNodes;
    foreach ($children as $child) {
        $tmp_dom = new DOMDocument();
        $tmp_dom->appendChild($tmp_dom->importNode($child, true));
        $innerHTML .= trim($tmp_dom->saveHTML());
    }
    return $innerHTML;
}
$url  = 'http://www.laterooms.com/en/hotel-reviews/238902_the-westfield-bb-sandown.aspx';
$html = file_get_contents_curl($url);

// Parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);   // @ suppresses warnings caused by malformed real-world HTML
$div_elements = $doc->getElementsByTagName('div');

$reviews = array();       // initialise so print_r() works even when nothing matches
foreach ($div_elements as $div_element) {
    // Exact match on the class attribute of the review blocks
    if ($div_element->getAttribute('class') == 'review newReview') {
        $reviews[] = DOMinnerHTML($div_element);
    }
}
print_r($reviews);

Try this; it will return all the reviews. You can refine the extracted content as needed.
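
As one possible refinement, the loop over every div can be replaced with a `DOMXPath` query. This is a minimal sketch, not the method used above: the inline `$html` string is hypothetical sample markup standing in for the downloaded page, and it assumes the review blocks carry exactly the class `review newReview`, as in the code above.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$html = '<html><body>'
      . '<div class="review newReview"><p>Great stay</p></div>'
      . '<div class="sidebar">Ads</div>'
      . '<div class="review newReview"><p>Too noisy</p></div>'
      . '</body></html>';

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select only the divs whose class attribute is exactly "review newReview".
$xpath = new DOMXPath($doc);
$reviews = [];
foreach ($xpath->query('//div[@class="review newReview"]') as $div) {
    $reviews[] = trim($div->textContent);   // plain text, tags stripped
}
print_r($reviews);
```

Here `textContent` yields the review text without markup. If the inner HTML is needed instead, the `DOMinnerHTML()` helper could be applied to each matched node.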