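I am about to scrape a website that has several tabs. Clicking a tab sends an AJAX request to the server, which returns the data displayed in that tab.
Since I need this data, I inspected the HTTP requests and replayed them on "hurl.it" (a website), manipulating the headers to check the responses. There I get the correct result, but when I set up my cURL session with the same headers, the response is different and unreadable.
Using the Live HTTP Headers add-on I was able to extract the AJAX URL: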
Request
POST http://xxxx.xxx.xx/Organisation/AjaxScopeQualification/0e69a479-63e3-4d64-9340-f2e9cc8d84df?tabIndex=3
Headers
Content-Type: application/xml
X-Requested-With: XMLHttpRequest
Referer: http://xxxx.xxx.xx/Organisation/Details/41283
Via hurl.it
Response 200 OK 646 bytes 547 ms
Headers
Cache-Control: private
Content-Encoding: gzip
Content-Length: 382
Content-Type: application/json; charset=utf-8
Date: Fri, 29 Jan 2016 01:36:42 GMT
Server: Microsoft-IIS/7.5
Set-Cookie: .ASPXANONYMOUS=fsbx3gX1CykkKL2OIvPFH9GcPj97KEPkK-6WVTA24eI87k0F3gjpt0fyVA2P90S8heeaoqjUps9-UFtzgm8mRAiPqnbS50kytk_NY5K4yHPwa-5l0kCqNzPAo0yjBsPmbisbg3N7P7h6Oz5EdRaN8Fkr0y3G6wdIILI8yMQBj1S1X4GULf9rpQ8IvvSo13KB0; expires=Fri, 29-Jan-2016 03:36:42 GMT; path=/; HttpOnly
X-AspNet-Version: 4.0.30319
X-AspNetMvc-Version: 3.0
X-Powered-By: ASP.NET
Body
{"data":[{"Id":"9fe29051-31e6-4bfa-a2f1-194d70c0aab9","NrtId":"930ec525-2199-44a9-bc27-c1b28524c9bf","RtoId":"0e69a479-63e3-4d64-9340-f2e9cc8d84df","TrainingComponentType":2,"Code":"TLI41210","Title":"Certificate IV in Transport and Logistics (Road Transport - Car Driving Instruction)","IsImplicit":false,"ExtentId":"01","Extent":"Deliver and assess","StartDate":new Date(2011,11,7,0,0,0),"EndDate":new Date(2016,11,6,0,0,0),"DeliveryNsw":true,"DeliveryVic":true,"DeliveryQld":true,"DeliverySa":true,"DeliveryWa":true,"DeliveryTas":true,"DeliveryNt":true,"DeliveryAct":true,"ScopeDecisionType":0,"ScopeDecision":"Deliver and assess"}],"total":1}
Response from cURL - var_dump()
string(382) " m j 0 _E蔀 |+ = B Kz(= q8 ICȻWζiq t { y r; r D @ P t Ǚ. Z ZaX ; N z ~( [Jor 7F H1h E~ ! aJ# '䭮> Mg Vr Ǚ ŊK S A &�L evu Sl3; ᱴd] 4 pR . ] 1 @ ' X ? ty p 8 1 R= t(S 6 [ +- Vr9 # f 4 2# Ew їѯ r FGZ O '' .䲰 7 f^ W [ ;Z"
Is this a charset problem, or am I setting the wrong cURL options?
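One quick way to tell the two cases apart: a gzip stream always starts with the bytes 0x1f 0x8b, and the hurl.it response reports Content-Encoding: gzip with Content-Length: 382, the same length var_dump() prints for the unreadable string. The following is only a minimal sketch, assuming the zlib extension is loaded; looksGzipped() and decodeBody() are hypothetical helper names that are not part of the original script, and the compressed sample is generated locally instead of coming from the real server.

```php
<?php
// Hypothetical helpers, not part of the original script.

// A gzip stream always begins with the magic bytes 0x1f 0x8b.
function looksGzipped($body) {
    return strlen($body) >= 2 && substr($body, 0, 2) === "\x1f\x8b";
}

// Return the body as text, decompressing it first when it is gzip data.
function decodeBody($body) {
    if (!looksGzipped($body)) {
        return $body;
    }
    $plain = gzdecode($body); // needs the zlib extension (PHP >= 5.4)
    return $plain === false ? $body : $plain;
}

// Locally generated stand-in for a compressed server response.
$raw = gzencode('{"total":1}');
$data = json_decode(decodeBody($raw));
echo $data->total, "\n";
```

If looksGzipped() returns true for a string coming out of curl_multi_getcontent(), the problem is compression rather than the character set.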
cURL
$url = 'http://xxxx.xxx.xx/Organisation/AjaxDetailsLoadScope/e11d03e7-37e7-49e8-be54-0bed8eb1c247?_=1454029562507&tabIndex=3';
$header = array(
    'Accept: */*',
    'Accept-Encoding: gzip, deflate',
    'Content-Length: 0',
    'Content-Type: application/xml',
    'X-Requested-With: XMLHttpRequest',
    "Referer: http://xxxx.xxx.xx/Organisation/Details/$this->code"
);
//..
// $header and $url are saved in arrays and then passed to curlMulti()
function curlMulti($urls, $headers = false) {
    $mh = curl_multi_init();
    $ch = array();
    $results = array();
    // Create one handle for each of the URLs in the array
    foreach ($urls as $id => $d) {
        $ch[$id] = curl_init();
        $url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
        // When headers are supplied for this URL, send it as a POST with those headers
        if (is_array($headers) && !empty($headers[$id])) {
            curl_setopt($ch[$id], CURLOPT_POST, 1);
            curl_setopt($ch[$id], CURLOPT_HTTPHEADER, $headers[$id]);
        }
        curl_setopt($ch[$id], CURLOPT_URL, $url);
        curl_setopt($ch[$id], CURLOPT_RETURNTRANSFER, TRUE);
        curl_multi_add_handle($mh, $ch[$id]);
    }
    $running = NULL;
    // Run all transfers until none of them is still active
    do {
        curl_multi_exec($mh, $running);
        if ($running > 0) {
            curl_multi_select($mh); // wait for activity instead of busy-looping
        }
    } while ($running > 0);
    // Collect the response bodies and release the handles
    foreach ($ch as $id => $content) {
        $results[$id] = curl_multi_getcontent($content);
        curl_multi_remove_handle($mh, $content);
    }
    curl_multi_close($mh);
    return $results;
}
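I played around with the headers and got it working.
I had to remove 'Accept: */*' and 'Accept-Encoding: gzip, deflate' from the headers. Because I had set Accept-Encoding by hand, the server gzip-compressed the body, and cURL does not decompress a response unless CURLOPT_ENCODING is set, so var_dump() showed the raw compressed bytes.
$header = array(
    'Content-Length: 0',
    'Content-Type: application/xml',
    'X-Requested-With: XMLHttpRequest',
    "Referer: http://xxxx.xxx.xx/Organisation/Details/$this->code"
);
Works like a charm: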
stdClass Object
(
    [data] => Array
        (
            [0] => stdClass Object
                (
                    [Id] => 9fe29051-31e6-4bfa-a2f1-194d70c0aab9
                    [NrtId] => 930ec525-2199-44a9-bc27-c1b28524c9bf
                    [RtoId] => 0e69a479-63e3-4d64-9340-f2e9cc8d84df
                    [TrainingComponentType] => 2
                    [Code] => TLI41210
                    [Title] => Certificate IV in Transport and Logistics (Road Transport - Car Driving Instruction)
                    [IsImplicit] =>
                    [ExtentId] => 01
                    [Extent] => Deliver and assess
                    [DeliveryNsw] => 1
                    [DeliveryVic] => 1
                    [DeliveryQld] => 1
                    [DeliverySa] => 1
                    [DeliveryWa] => 1
                    [DeliveryTas] => 1
                    [DeliveryNt] => 1
                    [DeliveryAct] => 1
                    [ScopeDecisionType] => 0
                    [ScopeDecision] => Deliver and assess
                )
        )
    [total] => 1
)
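Removing the Accept-Encoding header is not the only way out. cURL can negotiate compression itself and hand back the decompressed body when CURLOPT_ENCODING is set. The following is a sketch of that alternative, not the code that was actually used above: buildAjaxHandle() is a hypothetical helper that mirrors the options set inside curlMulti(), the URL is the placeholder from the question, and the handle is only built here, not executed.

```php
<?php
// Hypothetical helper: configures one handle the way curlMulti() does,
// plus CURLOPT_ENCODING so that cURL handles compression itself.
function buildAjaxHandle($url, array $headers) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // An empty string makes cURL send an Accept-Encoding header listing every
    // encoding it supports and decompress the response body automatically.
    curl_setopt($ch, CURLOPT_ENCODING, '');
    return $ch;
}

// Placeholder URL and Referer taken from the question; no Accept-Encoding
// line in the header array, because cURL adds it on its own.
$ch = buildAjaxHandle(
    'http://xxxx.xxx.xx/Organisation/AjaxDetailsLoadScope/e11d03e7-37e7-49e8-be54-0bed8eb1c247?tabIndex=3',
    array(
        'Content-Length: 0',
        'Content-Type: application/xml',
        'X-Requested-With: XMLHttpRequest',
        'Referer: http://xxxx.xxx.xx/Organisation/Details/41283',
    )
);
```

Inside curlMulti() the same effect should come from adding the single CURLOPT_ENCODING line next to CURLOPT_RETURNTRANSFER, which keeps the transfer compressed on the wire while curl_multi_getcontent() returns readable JSON.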