Web scraping in PHP - working with some URLs but fails with others

Updated: 2023-09-27

I am using curl to scrape LinkedIn profile pages. Extracting data from this public URL (http://in.linkedin.com/in/ratneshdwivedi) works. But when I am logged in to LinkedIn and try to collect data from this URL (http://www.linkedin.com/profile/view?id=77597832&locale=en_US&trk=tyah2&trkInfo=tas%3Aravi%20kant%20mishra%2Cidx%3A1-1-1), it does not work and returns blank data instead.

Here is my source code:

    $html = $this->_getScrapingData('http://in.linkedin.com/in/ratneshdwivedi', 10);
    preg_match("/<span class=\"full-name\">(.*)<\/span>/i", $html, $match);

    private function _getScrapingData($url, $timeout) {
        $ch = curl_init($url); // initialize curl with the given URL
        curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // set the user agent
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response as a string
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects, if any
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to wait while connecting
        curl_setopt($ch, CURLOPT_FAILONERROR, 1); // fail on HTTP status codes >= 400
        return @curl_exec($ch);
    }

Thanks in advance.

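Your script is not using the same cookies as your browser. You need to have the script go through the login form first.

Use

CURLOPT_COOKIEJAR
CURLOPT_COOKIEFILE

to persist cookies across your requests.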
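As a minimal sketch of how those two options might be set, the helper below builds a cURL handle that reads cookies from a file and writes them back to the same file. The function name `createCookieAwareHandle` and the cookie file path are assumptions made for illustration; they do not come from the question.

```php
<?php
// Hypothetical helper: build a cURL handle that keeps cookies in $cookieFile.
function createCookieAwareHandle(string $url, string $cookieFile, int $timeout = 10)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,        // return the response as a string
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects, if any
        CURLOPT_CONNECTTIMEOUT => $timeout,    // max. seconds to wait while connecting
        CURLOPT_COOKIEFILE     => $cookieFile, // send cookies stored in this file
        CURLOPT_COOKIEJAR      => $cookieFile, // save received cookies here when the handle is closed or freed
    ]);
    return $ch;
}

// Example usage (assumed path; the request itself is left commented out):
// $ch = createCookieAwareHandle('http://in.linkedin.com/in/ratneshdwivedi', sys_get_temp_dir() . '/cookies.txt');
// $html = curl_exec($ch);
```

Pointing both options at the same file means a later request made with the same file can send the cookies that an earlier request received.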

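Does your script authenticate?

The page you linked can only be viewed after logging in. That would explain why your script returns empty data: you are redirected to the login page, and the full-name span class does not exist there.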
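One way to check for this case is to look at where cURL ended up after following redirects, and to treat a missing span as "no match" rather than as an empty name. The sketch below is an illustration only: `fetchWithFinalUrl` and `extractFullName` are hypothetical names, and the pattern assumes the `full-name` span markup quoted in the question.

```php
<?php
// Fetch a page and also report the URL cURL ended up at after redirects.
function fetchWithFinalUrl(string $url, int $timeout = 10): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => $timeout,
    ]);
    $body = curl_exec($ch); // false on failure
    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    return ['body' => $body, 'finalUrl' => $finalUrl];
}

// Return the text of the full-name span, or null when the span is absent
// (for example, because the response is a login page).
function extractFullName(string $html): ?string
{
    if (preg_match('/<span class="full-name"[^>]*>(.*?)<\/span>/is', $html, $m) === 1) {
        return trim(html_entity_decode(strip_tags($m[1])));
    }
    return null;
}

// Example usage (performs a real HTTP request, so it is left commented out):
// $result = fetchWithFinalUrl('http://in.linkedin.com/in/ratneshdwivedi');
// if ($result['body'] !== false && extractFullName($result['body']) === null) {
//     echo "No full-name span found; final URL was {$result['finalUrl']}\n";
// }
```

If the final URL differs from the one requested and the span is missing, a redirect to a login page is a likely cause.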

You might also want to check http://developer.linkedin.com/documents/profile-api, because there are better ways to achieve this than scraping pages.

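I think the problem is that you are logged in in your browser (I guess your browser has a cookie with some session ID), but when you call curl, it knows nothing about your cookies.

The solution is to first send a login request with your credentials and save the cookies you receive from LinkedIn. Then send all the requests you want with the appropriate cookies. Just google how to send cookies with PHP curl; I am sure this has been asked before.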
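The two steps described above could be sketched roughly as follows, reusing one cURL handle so that cookies set by the login response are sent with the next request. Everything site-specific here is an assumption: `loginAndFetch` is a hypothetical helper, and the login URL and form field names must be taken from the actual login form, which may also require hidden fields such as a CSRF token.

```php
<?php
// Sketch: log in and fetch a protected page on the same cURL handle.
// $loginUrl and the keys of $loginFields are placeholders, not a real site's form.
function loginAndFetch(string $loginUrl, array $loginFields, string $targetUrl, int $timeout = 10)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => $timeout,
        CURLOPT_COOKIEFILE     => '', // empty string: enable cURL's in-memory cookie handling
    ]);

    // Step 1: POST the credentials; cookies from the response stay on the handle.
    curl_setopt($ch, CURLOPT_URL, $loginUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($loginFields));
    if (curl_exec($ch) === false) {
        return false; // login request failed at the transport level
    }

    // Step 2: switch back to GET and request the protected page with those cookies.
    curl_setopt($ch, CURLOPT_URL, $targetUrl);
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    return curl_exec($ch); // response body, or false on failure
}
```

A successful transfer in step 1 does not prove that the login was accepted, so the returned HTML still needs to be checked for the expected content.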

By the way, I think LinkedIn has some APIs you can use.