Set session to scrape page


Keywords: session setup | Updated: 2023-09-27

URL1:https://duapp3.drexel.edu/webtms_du/

URL2:https://duapp3.drexel.edu/webtms_du/Colleges.asp?Term=201125&univ=DREX

URL3:https://duapp3.drexel.edu/webtms_du/Courses.asp?SubjCode=CS&CollCode=E&univ=DREX

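As a personal programming project, I want to scrape my university's course catalog and expose it as a RESTful API.

However, I've run into the following problem.

The page I need to scrape is URL3. But URL3 only returns meaningful information after I visit URL2 (which sets the academic term via Colleges.asp?Term=201125), and URL2 can only be visited after visiting URL1.

I tried monitoring the HTTP traffic with Fiddler, but I don't think they are using cookies. Closing the browser immediately resets everything, so I suspect they are using a session.

How can I scrape URL3? I tried visiting URL1 and URL2 programmatically and then calling file_get_contents(url3), but that doesn't work (probably because each request registers as a separate session).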

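A session also needs a mechanism to identify you. Common approaches are cookies and a session ID in the URL.

Running curl -v on URL1 shows that a session cookie is indeed being set.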

Set-Cookie: ASPSESSIONIDASBRRCCS=LKLLPGGDFBGGNFJBKKHMPCDA; path=/

You need to send this cookie back to the server on every subsequent request to keep the session alive.
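
To make "sending the cookie back" concrete, here is a minimal sketch of turning Set-Cookie response lines into the value of a Cookie request header. The helper name cookie_header_from_response is hypothetical (it is not a PHP built-in), and the sample header is the one shown by curl -v above; a real session would use whatever value the server hands out at that moment.

```php
<?php
// Hypothetical helper (not part of PHP): collect the "name=value" pairs from
// Set-Cookie header lines and join them into a Cookie request header value.
function cookie_header_from_response(array $headerLines): string
{
    $pairs = [];
    foreach ($headerLines as $line) {
        // Header names are case-insensitive, so match with stripos
        if (stripos($line, 'Set-Cookie:') === 0) {
            $value = trim(substr($line, strlen('Set-Cookie:')));
            // Keep only the part before the first ';' (drops path, expires, ...)
            $pairs[] = trim(explode(';', $value, 2)[0]);
        }
    }
    return implode('; ', $pairs);
}

// Sample response headers, using the Set-Cookie line from the curl -v output
$headers = [
    'HTTP/1.1 200 OK',
    'Set-Cookie: ASPSESSIONIDASBRRCCS=LKLLPGGDFBGGNFJBKKHMPCDA; path=/',
];
echo cookie_header_from_response($headers), "\n";
// prints: ASPSESSIONIDASBRRCCS=LKLLPGGDFBGGNFJBKKHMPCDA
```

Attributes such as path are instructions for the client and are not sent back; only the name=value pair goes into the Cookie header.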

If you want to use file_get_contents, you need to manually create a context for it with stream_context_create so that the request includes the cookie.
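
A minimal sketch of that approach follows, assuming the cookie string has already been read from the Set-Cookie header of an earlier response. The helper names context_with_cookie and fetch_with_cookie are hypothetical, and the cookie value is only the example from the curl -v output, so it will not match a live session.

```php
<?php
// Hypothetical helper: build an HTTP stream context that sends the given
// cookie string (e.g. "NAME=value") with the request.
function context_with_cookie(string $cookie)
{
    return stream_context_create([
        'http' => [
            'method' => 'GET',
            'header' => "Cookie: $cookie\r\n",
        ],
    ]);
}

// Hypothetical helper: fetch a URL with the cookie attached.
// Returns the response body, or false on failure.
function fetch_with_cookie(string $url, string $cookie)
{
    return file_get_contents($url, false, context_with_cookie($cookie));
}

// Inspect the context without touching the network
$ctx = context_with_cookie('ASPSESSIONIDASBRRCCS=LKLLPGGDFBGGNFJBKKHMPCDA');
$opts = stream_context_get_options($ctx);
echo $opts['http']['header'];
```

With a fresh cookie taken from the response to URL1, calling fetch_with_cookie on URL2 and then on URL3 would keep all three requests in the same session.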

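An alternative (which I personally prefer) is to use the curl functions that PHP provides. (They can even handle the cookie traffic for you!) But that's just my preference.

Edit:

Here is a working example that scrapes the path from your question.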

$scrape = array(
    "https://duapp3.drexel.edu/webtms_du/",
    "https://duapp3.drexel.edu/webtms_du/Colleges.asp?Term=201125&univ=DREX",
    "https://duapp3.drexel.edu/webtms_du/Courses.asp?SubjCode=CS&CollCode=E&univ=DREX"
);
$data = '';
$ch = curl_init();
// Point the cookie jar at a temporary file. We don't need the file itself,
// but without a cookie jar (or cookie file) curl does not keep cookies or
// include them in subsequent requests on this handle
curl_setopt($ch, CURLOPT_COOKIEJAR, tempnam(sys_get_temp_dir(), 'curl'));
// We don't want direct output by curl
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Then run along the scrape path
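foreach ($scrape as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    if ($data === false) {
        // Stop early: later pages depend on the session set up by earlier ones
        exit('Request failed for ' . $url . ': ' . curl_error($ch));
    }
}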
curl_close($ch);
echo $data;