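I am trying to extract the main image from the product pages of two different Chinese shopping sites.
Site 1 product page link: http://www.aliexpress.com/item/100FT-7-Core-Strand-550-Parachute-Cord-Nylon-Lanyard-Desert-Paracord-Survival-Kits-For-Climbing-Camping/541809415.html
Site 2 product page link: http://detail.china.alibaba.com/offer/1235158006.html
My code works fine for site #1, but for site #2 I get a strangely short HTML string.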
Here is my code:
<?php
require_once('./includes/simple_html_dom/simple_html_dom.php');
// Site 1: read the main image URL from the og:image meta tag
$url="http://www.aliexpress.com/item/100FT-7-Core-Strand-550-Parachute-Cord-Nylon-Lanyard-Desert-Paracord-Survival-Kits-For-Climbing-Camping/541809415.html";
$html=file_get_html($url);
echo "html length : ".strlen($html)."<br>";
foreach($html->find('meta[property=og:image]') as $element) {
echo("result : ".$element->content);
}
echo "<br>-------------------------------------------------------------------<br>";
// Site 2: print the contents of the J_DetailInside div
$url="http://detail.china.alibaba.com/offer/1235158006.html";
$html=file_get_html($url);
echo "html length : ".strlen($html)."<br>";
foreach($html->find('div[id=J_DetailInside]') as $element) {
echo("result : ".$element->innertext);
}
?>
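I have been struggling to get this working, without success.
The reason is that the second site redirects to 127.0.0.1 when the request carries no user agent. You have to set a user agent with cURL, like this: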
<?php
require_once('simple_html_dom.php');
$url="http://www.aliexpress.com/item/100FT-7-Core-Strand-550-Parachute-Cord-Nylon-Lanyard-Desert-Paracord-Survival-Kits-For-Climbing-Camping/541809415.html";
$html=file_get_html($url);
echo "html length : ".strlen($html)."<br>";
foreach($html->find('meta[property=og:image]') as $element) {
echo("result : ".$element->content);
}
echo "<br>-------------------------------------------------------------------<br>";
$url="http://detail.china.alibaba.com/offer/1235158006.html";
// Fetch the second page with cURL so that a User-Agent header is sent
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13');
$page = curl_exec($curl);
curl_close($curl);
if ($page === false) {
    die("cURL request failed");
}
$html=str_get_html($page);
echo "html length : ".strlen($html)."<br>";
foreach($html->find('div[id=J_DetailInside]') as $element) {
echo("result : ".$element->innertext);
}
?>
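If more pages need the same treatment, the cURL step could be wrapped in a small helper. This is only a sketch: the function name `fetch_with_user_agent`, the timeout values, and the decision to follow redirects are my own assumptions, not something the site or simple_html_dom requires.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): fetch a URL while
// sending an explicit User-Agent header.
// Returns the response body as a string, or null if the request fails.
function fetch_with_user_agent(string $url, string $userAgent): ?string
{
    $curl = curl_init($url);
    if ($curl === false) {
        return null;
    }
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($curl, CURLOPT_USERAGENT, $userAgent); // header the second site appears to check
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);  // assumption: ordinary redirects should be followed
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);    // assumed timeouts, in seconds
    curl_setopt($curl, CURLOPT_TIMEOUT, 20);
    $body = curl_exec($curl);                          // false on failure
    return is_string($body) ? $body : null;
}
```

The returned string can then be passed to `str_get_html()` as in the code above, after checking that it is not `null`.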
By the way, div[id=J_DetailInside] seems to capture too much.
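One way to narrow the result is to select only the first image inside that div rather than its whole inner HTML. The sketch below uses PHP's built-in DOMDocument and DOMXPath on made-up markup, since the real page structure is not shown in this thread; the sample HTML, the example.com URLs, and the function name `first_image_src` are all hypothetical. It also assumes the container id contains no double quote.

```php
<?php
// Hypothetical markup standing in for the real product page.
$sample = '<html><body><div id="J_DetailInside">'
        . '<p>Long description text</p>'
        . '<img src="http://example.com/main.jpg" alt="main">'
        . '<img src="http://example.com/second.jpg" alt="second">'
        . '</div></body></html>';

// Return the src of the first <img> found anywhere inside the div with the
// given id, or null if there is no such div or image.
function first_image_src(string $html, string $containerId): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings from messy real-world HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // The parentheses make [1] pick the first match overall,
    // not the first <img> under each parent element.
    $nodes = $xpath->query(sprintf('(//div[@id="%s"]//img)[1]', $containerId));
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    $img = $nodes->item(0);
    if ($img instanceof DOMElement && $img->hasAttribute('src')) {
        return $img->getAttribute('src');
    }
    return null;
}
```

Calling `first_image_src($sample, 'J_DetailInside')` on the sample above yields `http://example.com/main.jpg`. With simple_html_dom itself, a descendant selector such as `$html->find('div[id=J_DetailInside] img', 0)` should serve the same purpose, though whether the first image is really the main one depends on the actual page.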