PHP Scrape - Create multidimensional arrays from results - current code only returning one result

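I'm new to PHP and am building a web scraper for a project. From this site, https://www.bloglovin.com/en/blogs/1/2/all, I am scraping the blog title, blog URL, and image URL, and concatenating a follow link for later use. As you can see on the page, there are several fields of information for each blogger.

Here is my PHP code so far: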

<?php
    // Function to make GET request using cURL
    function curlGet($url) {
        $ch = curl_init(); // Initialising cURL session
        // Setting cURL options
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_URL, $url);
        $results = curl_exec($ch); // Executing cURL session
        curl_close($ch); // Closing cURL session
        return $results; // Return the results
    }

    $blogStats = array();

    function returnXPathObject($item) {
        $xmlPageDom = new DomDocument();
        @$xmlPageDom->loadHTML($item);
        $xmlPageXPath = new DOMXPath($xmlPageDom);
        return $xmlPageXPath;
    }

    $blPage = curlGet('https://www.bloglovin.com/en/blogs/1/2/all');
    $blPageXpath = returnXPathObject($blPage);

    $title = $blPageXpath->query('//*[@id="content"]//div/a/h2/span[1]');
    if ($title->length > 0) {
        $blogStats['title'] = $title->item(0)->nodeValue;
    }

    $url = $blPageXpath->query('//*[@id="content"]//div/a/h2/span[2]');
    if ($url->length > 0) {
        $blogStats['url'] = $url->item(0)->nodeValue;
    }

    $img = $blPageXpath->query('//*[@id="content"]//div/a/div/@href');
    if ($img->length > 0) {
        $blogStats['img'] = $img->item(0)->nodeValue;
    }

    $followLink = $blPageXpath->query('//*[@id="content"]/div[1]/div/a/@href');
    if ($followLink->length > 0) {
        $blogStats['followLink'] = 'http://www.bloglovin.com' . $followLink->item($i)->nodeValue;
    }

    print_r($blogStats);

    /*$data = $blogStats;
    header('Content-Type: application/json');
    echo json_encode($data);*/
?>

Currently, this only returns:

Array ( [title] => Fashion Toast [url] => fashiontoast.com [followLink] => http://www.bloglovin.com/blog/4735/fashion-toast )

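My question is: what is the best way to loop through every result? I've been looking through Stack Overflow and am struggling to find an answer, and my head is a bit frazzled! If anyone could advise me or point me in the right direction, that would be great.

Thanks.

Update: I'm sure this is wrong, and I'm getting errors!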

<?php
    // Function to make GET request using cURL
    function curlGet($url) {
        $ch = curl_init(); // Initialising cURL session
        // Setting cURL options
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_URL, $url);
        $results = curl_exec($ch); // Executing cURL session
        curl_close($ch); // Closing cURL session
        return $results; // Return the results
    }

    $blogStats = array();

    function returnXPathObject($item) {
        $xmlPageDom = new DomDocument();
        @$xmlPageDom->loadHTML($item);
        $xmlPageXPath = new DOMXPath($xmlPageDom);
        return $xmlPageXPath;
    }

    $blogPage = curlGet('https://www.bloglovin.com/en/blogs/1/2/all');
    $blogPageXpath = returnXPathObject($blogPage);
    $blogger = $blogPageXpath->query('//*[@id="content"]/div/@data-blog-id');
    if ($blogger->length > 0) {
        $blogStats[] = $blogger->item(0)->nodeValue;
    }

    foreach ($blogger as $id) {
        $blPage = curlGet('https://www.bloglovin.com/en/blogs/1/2/all');
        $blPageXpath = returnXPathObject($blPage);

        $title = $blPageXpath->query('//*[@id="content"]//div/a/h2/span[1]');
        if ($title->length > 0) {
            $blogStats[$id]['title'] = $title->item(0)->nodeValue;
        }

        $url = $blPageXpath->query('//*[@id="content"]//div/a/h2/span[2]');
        if ($url->length > 0) {
            $blogStats[$id]['url'] = $url->item(0)->nodeValue;
        }

        $img = $blPageXpath->query('//*[@id="content"]//div/a/div/@href');
        if ($img->length > 0) {
            $blogStats[$id]['img'] = $img->item(0)->nodeValue;
        }

        $followLink = $blPageXpath->query('//*[@id="content"]/div[1]/div/a/@href');
        if ($followLink->length > 0) {
            $blogStats[$id]['followLink'] = 'http://www.bloglovin.com' . $followLink->item($i)->nodeValue;
        }
    }

    print_r($blogStats);

    /*$data = $blogStats;
    header('Content-Type: application/json');
    echo json_encode($data);*/
?>

Maybe you actually want to add a dimension to your array. I'm guessing each blog author has a unique id or some similar identifier.
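As a minimal sketch of adding that dimension, assume each blog entry is a div carrying a data-blog-id attribute, as the XPath in the question's update suggests. The inline HTML below is made up for illustration and is not the real Bloglovin markup. One thing worth noting: a DOMXPath query returns node objects, so the id string has to be read from nodeValue before it can serve as an array key.

```php
<?php
// Hypothetical markup standing in for the real page; only the
// data-blog-id attribute name is taken from the question's XPath.
$html = '<div id="content">'
      . '<div data-blog-id="4735"></div>'
      . '<div data-blog-id="1234"></div>'
      . '</div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$blogStats = array();
foreach ($xpath->query('//*[@id="content"]/div/@data-blog-id') as $attr) {
    // $attr is a DOMAttr object, which cannot be used as an array key;
    // its string value can.
    $id = $attr->nodeValue;
    $blogStats[$id] = array(); // one sub-array per blog, to be filled in later
}

print_r(array_keys($blogStats));
```

PHP stores numeric string keys such as '4735' as integers, so $blogStats['4735'] and $blogStats[4735] refer to the same element.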

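Also, your code seems to run only once? It probably needs to sit inside something like a foreach.

I can't do this part for you, but you need an array containing every blogger, or some way to drive a while or for loop. You'll have to work out how to iterate over the different blogs yourself :)

Here is an example with an array of bloggers: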

[14] ['bloggerOne']
[15] ['bloggerTwo']
[16] ['bloggerThree']

foreach ($blogger as $id => $name) {
    // Something has to change here so that $blPage is actually different
    // on each iteration, for example the URL.
    $blPage = curlGet('https://www.bloglovin.com/en/blogs/1/2/' . $name);
    $blPageXpath = returnXPathObject($blPage);

    $title = $blPageXpath->query('//*[@id="content"]//div/a/h2/span[1]');
    if ($title->length > 0) {
        $blogStats[$id]['title'] = $title->item(0)->nodeValue;
    }

    $url = $blPageXpath->query('//*[@id="content"]//div/a/h2/span[2]');
    if ($url->length > 0) {
        $blogStats[$id]['url'] = $url->item(0)->nodeValue;
    }

    $img = $blPageXpath->query('//*[@id="content"]//div/a/div/@href');
    if ($img->length > 0) {
        $blogStats[$id]['img'] = $img->item(0)->nodeValue;
    }

    $followLink = $blPageXpath->query('//*[@id="content"]/div[1]/div/a/@href');
    if ($followLink->length > 0) {
        // item(0) rather than item($i): $i is never defined in this code
        $blogStats[$id]['followLink'] = 'http://www.bloglovin.com' . $followLink->item(0)->nodeValue;
    }
}

So after the foreach, your array might look like:

['12345']
    ['title'] = whatever
    ['url'] = url
    ['img'] = foo
    ['followLink'] = bar
['4141']
    ['title'] = other
    ['url'] = no url
    ['img'] = foo
    ['followLink'] = bar
['7415']
    ['title'] = still
    ['url'] = url4
    ['img'] = foo
    ['followLink'] = bar
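
Since the listing page already contains every blog, another option may be to fetch it once and loop over the entry nodes, running each field query relative to the current entry. The following is only a sketch under assumptions: the sample HTML is invented to mirror the XPath expressions in the question (one div per blog with a data-blog-id attribute, and an a element wrapping an h2 with two spans), so the real page structure may well differ and the expressions would need adjusting.

```php
<?php
// Invented sample markup shaped after the question's XPath expressions;
// the real page may be structured differently.
$html = <<<'HTML'
<div id="content">
  <div data-blog-id="4735">
    <a href="/blog/4735/fashion-toast"><h2><span>Fashion Toast</span><span>fashiontoast.com</span></h2></a>
  </div>
  <div data-blog-id="1234">
    <a href="/blog/1234/example-blog"><h2><span>Example Blog</span><span>example.com</span></h2></a>
  </div>
</div>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$blogStats = array();

// One iteration per blog entry, instead of one query for the whole page.
foreach ($xpath->query('//*[@id="content"]/div[@data-blog-id]') as $entry) {
    $id = $entry->getAttribute('data-blog-id');

    // Passing $entry as the second argument makes each query relative to it.
    $title = $xpath->query('.//a/h2/span[1]', $entry);
    $url   = $xpath->query('.//a/h2/span[2]', $entry);
    $link  = $xpath->query('.//a/@href', $entry);

    $blogStats[$id] = array(
        'title'      => $title->length > 0 ? $title->item(0)->nodeValue : null,
        'url'        => $url->length > 0 ? $url->item(0)->nodeValue : null,
        'followLink' => $link->length > 0
            ? 'http://www.bloglovin.com' . $link->item(0)->nodeValue
            : null,
    );
}

print_r($blogStats);
```

The leading `.` in each inner expression is what keeps the query scoped to the current entry; without it, `//a/h2/span[1]` would search the whole document again and every blog would get the first blog's values.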