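I'm new to PHP and am building a web scraper for a project. From this site, https://www.bloglovin.com/en/blogs/1/2/all, I'm scraping the blog title, blog URL, and image URL, and concatenating a follow link for later use. As you can see on the page, there are several fields of information for each blogger.
Here is my PHP code so far: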
<?php
// Function to make GET request using cURL
function curlGet($url) {
$ch = curl_init(); // Initialising cURL session
// Setting cURL options
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_URL, $url);
$results = curl_exec($ch); // Executing cURL session
curl_close($ch); // Closing cURL session
return $results; // Return the results
}
$blogStats = array();
function returnXPathObject($item) {
$xmlPageDom = new DomDocument();
@$xmlPageDom->loadHTML($item);
$xmlPageXPath = new DOMXPath($xmlPageDom);
return $xmlPageXPath;
}
$blPage = curlGet('https://www.bloglovin.com/en/blogs/1/2/all');
$blPageXpath = returnXPathObject($blPage);
$title = $blPageXpath->query('//*[@id="content"]//div/a/h2/span[1]');
if ($title->length > 0) {
$blogStats['title'] = $title->item(0)->nodeValue;
}
$url = $blPageXpath->query('//*[@id="content"]//div/a/h2/span[2]');
if ($url->length > 0) {
$blogStats['url'] = $url->item(0)->nodeValue;
}
$img = $blPageXpath->query('//*[@id="content"]//div/a/div/@href');
if ($img->length > 0) {
$blogStats['img'] = $img->item(0)->nodeValue;
}
$followLink = $blPageXpath->query('//*[@id="content"]/div[1]/div/a/@href');
if ($followLink->length > 0) {
$blogStats['followLink'] = 'http://www.bloglovin.com' . $followLink->item($i)->nodeValue;
}
print_r($blogStats);
/*$data = $blogStats;
header('Content-Type: application/json');
echo json_encode($data);*/
?>
Currently, this returns only:
Array ( [title] => Fashion Toast [url] => fashiontoast.com [followLink] => http://www.bloglovin.com/blog/4735/fashion-toast )
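My question is: what is the best way to loop through each result? I've been searching Stack Overflow and am struggling to find an answer, and my head is in a bit of a spin! If anyone could give me some advice or point me in the right direction, that would be fantastic.
Thanks.
Update: I'm sure this is wrong, and I'm getting errors!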
<?php
// Function to make GET request using cURL
function curlGet($url) {
$ch = curl_init(); // Initialising cURL session
// Setting cURL options
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_URL, $url);
$results = curl_exec($ch); // Executing cURL session
curl_close($ch); // Closing cURL session
return $results; // Return the results
}
$blogStats = array();
function returnXPathObject($item) {
$xmlPageDom = new DomDocument();
@$xmlPageDom->loadHTML($item);
$xmlPageXPath = new DOMXPath($xmlPageDom);
return $xmlPageXPath;
}
$blogPage = curlGet('https://www.bloglovin.com/en/blogs/1/2/all');
$blogPageXpath = returnXPathObject($blogPage);
$blogger = $blogPageXpath->query('//*[@id="content"]/div/@data-blog-id');
if ($blogger->length > 0) {
$blogStats[] = $blogger->item(0)->nodeValue;
}
foreach($blogger as $id) {
$blPage = curlGet('https://www.bloglovin.com/en/blogs/1/2/all');
$blPageXpath = returnXPathObject($blPage);
$title = $blPageXpath->query('//*[@id="content"]//div/a/h2/span[1]');
if ($title->length > 0) {
$blogStats[$id]['title'] = $title->item(0)->nodeValue;
}
$url = $blPageXpath->query('//*[@id="content"]//div/a/h2/span[2]');
if ($url->length > 0) {
$blogStats[$id]['url'] = $url->item(0)->nodeValue;
}
$img = $blPageXpath->query('//*[@id="content"]//div/a/div/@href');
if ($img->length > 0) {
$blogStats[$id]['img'] = $img->item(0)->nodeValue;
}
$followLink = $blPageXpath->query('//*[@id="content"]/div[1]/div/a/@href');
if ($followLink->length > 0) {
$blogStats[$id]['followLink'] = 'http://www.bloglovin.com' . $followLink->item($i)->nodeValue;
}
}
print_r($blogStats);
/*$data = $blogStats;
header('Content-Type: application/json');
echo json_encode($data);*/ ?>
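You may want to actually add a dimension to the array. I'm guessing each blogger has a unique id or some similar identifier.
Also, your code seems to execute only once? It probably needs to go inside something like a foreach.
I can't do this part for you, but you need an array containing every blogger, or some way to drive a while or for loop! You'll have to work out for yourself how to iterate over the different blogs :)
Here is an example of a bloggers array: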
[14] => 'bloggerOne'
[15] => 'bloggerTwo'
[16] => 'bloggerThree'
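One way such a list could be built from the page itself is sketched below. Note that iterating a DOMNodeList yields node objects rather than strings, so each value has to be read through nodeValue before it can serve as an array key or value. The inline HTML is a hypothetical stand-in for the real page, and the data-blog-id attribute is taken from the XPath in the question's update; the real markup may differ.

```php
<?php
// Hypothetical markup: a stand-in for the fetched page, not the real structure.
$html = '<div id="content">'
      . '<div data-blog-id="14"></div>'
      . '<div data-blog-id="15"></div>'
      . '<div data-blog-id="16"></div>'
      . '</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Each $attr is a DOMAttr object; nodeValue gives its string content.
$bloggerIds = array();
foreach ($xpath->query('//*[@id="content"]/div/@data-blog-id') as $attr) {
    $bloggerIds[] = $attr->nodeValue;
}

print_r($bloggerIds);
```

Using the node object itself as an array key, as in `$blogStats[$id]` in the question's update, is a likely source of the reported errors, since array keys must be integers or strings.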
foreach ($blogger as $id => $name)
{
$blPage = curlGet('https://www.bloglovin.com/en/blogs/1/2/' . $name);
// You need to change something here, such as the URL, so that $blPage is actually different on each iteration
$blPageXpath = returnXPathObject($blPage);
$title = $blPageXpath->query('//*[@id="content"]//div/a/h2/span[1]');
if ($title->length > 0) {
$blogStats[$id]['title'] = $title->item(0)->nodeValue;
}
$url = $blPageXpath->query('//*[@id="content"]//div/a/h2/span[2]');
if ($url->length > 0) {
$blogStats[$id]['url'] = $url->item(0)->nodeValue;
}
$img = $blPageXpath->query('//*[@id="content"]//div/a/div/@href');
if ($img->length > 0) {
$blogStats[$id]['img'] = $img->item(0)->nodeValue;
}
$followLink = $blPageXpath->query('//*[@id="content"]/div[1]/div/a/@href');
if ($followLink->length > 0) {
$blogStats[$id]['followLink'] = 'http://www.bloglovin.com' . $followLink->item(0)->nodeValue; // $i was never defined; use index 0 like the other fields
}
}
So after the foreach, your array might look like:
['12345']['title'] = whatever
         ['url'] = url
         ['img'] = foo
         ['followLink'] = bar
['4141']['title'] = other
        ['url'] = no url
        ['img'] = foo
        ['followLink'] = bar
['7415']['title'] = still
        ['url'] = url4
        ['img'] = foo
        ['followLink'] = bar
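If all the bloggers are listed on the same page, an alternative to fetching one URL per blogger is to select each blogger's container once and run the field queries relative to it, by passing the container as the second argument to DOMXPath::query(). The following is only a sketch under assumptions: the inline HTML is hypothetical, modelled loosely on the XPath expressions in the question, and the second blog entry is invented for illustration. The image query is left out because the correct attribute cannot be determined from the question.

```php
<?php
// Hypothetical markup standing in for the listing page; the real page may differ.
$html = <<<HTML
<div id="content">
  <div data-blog-id="4735">
    <a href="/blog/4735/fashion-toast">
      <h2><span>Fashion Toast</span><span>fashiontoast.com</span></h2>
    </a>
  </div>
  <div data-blog-id="4141">
    <a href="/blog/4141/other-blog">
      <h2><span>Other Blog</span><span>other.example</span></h2>
    </a>
  </div>
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$blogStats = array();
// Outer query: one DOMElement per blogger container.
foreach ($xpath->query('//*[@id="content"]/div[@data-blog-id]') as $blog) {
    $id = $blog->getAttribute('data-blog-id');

    // Inner queries start with "." so they only search inside $blog.
    $title = $xpath->query('.//a/h2/span[1]', $blog);
    $url   = $xpath->query('.//a/h2/span[2]', $blog);
    $link  = $xpath->query('.//a/@href', $blog);

    $blogStats[$id] = array(
        'title'      => $title->length > 0 ? trim($title->item(0)->nodeValue) : null,
        'url'        => $url->length > 0 ? trim($url->item(0)->nodeValue) : null,
        'followLink' => $link->length > 0
            ? 'http://www.bloglovin.com' . $link->item(0)->nodeValue
            : null,
    );
}

print_r($blogStats);
```

With this approach the page is downloaded and parsed only once, and each entry in $blogStats is keyed by that blogger's id, matching the shape shown above.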