I have the following simple web form, called login.php, which contains:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta name="robots" content="noindex,nofollow">
<meta http-equiv="x-dns-prefetch-control" content="off">
</head>
<body>
<form action="action.php" method="post">
<!-- Input: Input box -->
Name: <input name="userName" type="text"/>
<br>
Password: <input name="userPassword" type="password"/>
<br>
<!-- Submit form -->
<input type="submit"/> <input type="reset"/>
</form>
</body>
</html>
Then I have a very simple file, action.php, which handles the data passed to it via POST. Here is the code:
<?php
print_r ($_POST);
?>
This works fine: if I log in as user "foo" with password "bar", I get:
Array ( [userName] => foo [userPassword] => bar )
What I want is to be able to send the POST content directly to action.php via curl. So I have a third file, named scraper.php, with the following code:
<?php
// SLIGHTLY MODIFIED CODE FROM: http://www.phpcodester.com/2011/01/scraping-a-password-protected-website-with-curl/
$ch = login('http://localhost/scraper_post/action.php', 'userName=foo&userPassword=bar');
$html = downloadUrl('http://localhost/scraper_post/action.php', $ch);
echo $html;

function downloadUrl($Url, $ch) {
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_POST, 0);
    curl_setopt($ch, CURLOPT_REFERER, "http://localhost/scraper_post/login.php");
    curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $output = curl_exec($ch);
    return $output;
}

// ALSO TRIED WITH $postData ON SEPARATE LINES AS IT IS IN ORIGINAL TUTORIAL
function login($url, $postData) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_POST, 1);
    // ALSO TRIED WITH FOLLOWING, AS SUGGESTED IN ORIGINAL TUTORIAL COMMENTS: curl_setopt ($ch, CURLOPT_POSTFIELDS, urlencode($postData));
    curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
    curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $store = curl_exec($ch);
    return $ch;
}
?>
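The problem is that when I call scraper.php, I get an empty $_POST variable in action.php. In other words, scraper.php does not send any POST data to action.php, and I can't figure out why. This is only the very beginning of writing a larger web scraper for pages that require a login, but as you can see, I'm stuck right at the start. Thanks.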
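You don't need the downloadUrl() function; your login() function already logs in and fetches the content. The second request, made in downloadUrl(), sets CURLOPT_POST to 0, so it most likely goes out as a plain GET with no form data, and that second response is the one you echo. Instead, return $store; from login(): it will contain the HTML returned by the site.
My suggestion for the code: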
<?php
$html = login('http://localhost/scraper_post/action.php', 'userName=foo&userPassword=bar');
echo $html;

// Sends $postData to $url as a POST request and returns the response body.
function login($url, $postData) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_POST, true);
    // Pass the already-formatted string as is: urlencode() on the whole
    // string would also escape the "=" and "&" separators.
    curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
    curl_setopt($ch, CURLOPT_REFERER, "http://localhost/scraper_post/login.php");
    curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}
?>
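A note on encoding the POST body: calling urlencode() on the whole string also escapes the = and & separators, so the server no longer sees two separate fields. Below is a minimal sketch of the difference, assuming you would rather build the body from an array. The helper name buildPostBody() is hypothetical, and parse_str() is used only to approximate how PHP fills $_POST from a request body.

```php
<?php
// Hypothetical helper: builds an application/x-www-form-urlencoded body
// from an associative array, encoding each key and value separately.
function buildPostBody(array $fields): string
{
    return http_build_query($fields);
}

// A value containing "&" is escaped safely when encoded per field.
$good = buildPostBody(['userName' => 'foo', 'userPassword' => 'b&r']);

// Encoding the whole string escapes the "=" and "&" separators as well.
$bad = urlencode('userName=foo&userPassword=bar');

// parse_str() approximates how PHP populates $_POST from the body.
parse_str($good, $parsedGood);
parse_str($bad, $parsedBad);

print_r($parsedGood); // two fields: userName and userPassword
print_r($parsedBad);  // a single field; userName and userPassword are not set
```

The string returned by buildPostBody() can be passed to CURLOPT_POSTFIELDS as is. Passing the array itself to CURLOPT_POSTFIELDS also works, but cURL then sends the body as multipart/form-data rather than application/x-www-form-urlencoded.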