PHP crawler for wiki getting error

Keywords: getting error, wiki, crawler, php | Updated: 2023-09-27

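In the code below, I am trying to extract content from a website using PHP. The code works fine when I call getElementByIdAsString('ww.abebooks.com/9780143418764/Love-Story-Singh-Ravinder-0143418769/plp', 'profile');

But when I use the same code to extract content from Wikipedia, it does not work: getElementByIdAsString('https://en.wikipedia.org/wiki/A_Brief_History_of_Time', 'Summary');

Below are my code and the exception I get when I use the latter call. Can someone correct my code so that it extracts Wikipedia content based on the id?

Thanks in advance.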

<?php

function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $result = curl_exec($ch);

    // var_dump($doc->loadHTMLFile($url)); die;
    error_reporting(E_ERROR | E_PARSE);
    if(!$result) {
        throw new Exception("Failed to load $url");
    }
    $doc->loadHTML($result);
    // Obtain the element
    $element = $doc->getElementById($id);
    if(!$element) {
        throw new Exception("An element with id $id was not found");
    }
    if($pretty) {
        $doc->formatOutput = true;
    }
    // Return the string representation of the element
    return $doc->saveXML($element);
}
// Here I am displaying the output in bold text
echo getElementByIdAsString('https://en.wikipedia.org/wiki/A_Brief_History_of_Time', 'Summary');
?>

Exception

Fatal error: Uncaught exception 'Exception' with message 'Failed to load http://en.wikipedia.org/wiki/A_Brief_History_of_Time' in C:\xampp\htdocs\example2.php:18 Stack trace: #0 C:\xampp\htdocs\example2.php(40): getElementByIdAsString() #1 {main} thrown in C:\xampp\htdocs\example2.php on line 18

Your help would be greatly appreciated :-)

Try adding:

curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
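
The exception in the question only says "Failed to load", which hides the reason cURL gave up. The following is a minimal sketch, not part of the original answer, of a helper that surfaces cURL's own error code and message, so that a redirect problem can be told apart from a certificate problem before any SSL checks are switched off. The function name fetchHtml and the user-agent string are hypothetical placeholders.

```php
<?php
// Hypothetical helper (not from the original post): fetch a page and report
// cURL's own error instead of a generic "Failed to load" message.
function fetchHtml(string $url): string {
    $ch = curl_init($url);
    if ($ch === false) {
        throw new RuntimeException("Could not initialise cURL for $url");
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects, e.g. http to https
    curl_setopt($ch, CURLOPT_USERAGENT, 'example-crawler/0.1'); // placeholder value

    $result = curl_exec($ch);
    if ($result === false) {
        // curl_errno() and curl_error() describe why the transfer failed
        throw new RuntimeException(
            sprintf('Failed to load %s: [%d] %s', $url, curl_errno($ch), curl_error($ch))
        );
    }
    return $result;
}

// Usage (performs a network request, so it is left commented out here):
// $html = fetchHtml('https://en.wikipedia.org/wiki/A_Brief_History_of_Time');
```

If the reported message turns out to be about certificates, one alternative to disabling CURLOPT_SSL_VERIFYPEER is to point the curl.cainfo setting in php.ini at a CA bundle file. Whether that applies here is an assumption, since the question does not show the underlying cURL error.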

Update after the discussion in the comments:

<?php
function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $result = curl_exec($ch);
    error_reporting(E_ERROR | E_PARSE);
    if(!$result) {
        throw new Exception("Failed to load $url");
    }
    $doc->loadHTML($result);
    // Obtain the element
    $element = $doc->getElementById($id);
    if(!$element) {
        throw new Exception("An element with id $id was not found");
    }
    if($pretty) {
        $doc->formatOutput = true;
    }
    $output = '';
    $node = $element->parentNode;
    while(true) {
        $node = $node->nextSibling;
        if(!$node) {
            break;
        }
        if($node->nodeName == 'p') {
            $output .= $node->nodeValue;
        }
        if($node->nodeName == 'h2') {
            break;
        }
    }
    return $output;
}
// Dump the extracted text of the section
var_dump(getElementByIdAsString('https://en.wikipedia.org/wiki/A_Brief_History_of_Time', 'Summary'));

You could probably also use XPath, or just take the whole response and use a regex to cut out whatever you want.
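
As a rough illustration of the XPath route, the sketch below selects every p element whose nearest preceding h2 sibling contains the given id. It runs against a small inline sample rather than the live page. The sample markup, the function name extractSectionText, and the assumption that the heading and its paragraphs are siblings (as the loop in the updated code also assumes) are all hypothetical, and real pages may be structured differently.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = <<<'HTML'
<html><body>
<h2><span id="Summary">Summary</span></h2>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<h2><span id="Editions">Editions</span></h2>
<p>Unrelated paragraph.</p>
</body></html>
HTML;

// Hypothetical helper: return the text of the paragraphs under the heading
// that carries $id. Assumes $id contains no double quote character.
function extractSectionText(string $html, string $id): string {
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence HTML parse warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // p elements whose closest preceding h2 sibling holds the requested id
    $query = sprintf('//p[preceding-sibling::h2[1][.//@id="%s"]]', $id);
    $nodes = $xpath->query($query);

    $parts = [];
    if ($nodes !== false) {
        foreach ($nodes as $node) {
            $parts[] = trim($node->textContent);
        }
    }
    return implode("\n", $parts);
}

echo extractSectionText($html, 'Summary'), "\n";
```

Compared with walking nextSibling by hand, the query keeps the "stop at the next h2" rule inside the expression itself. An id that is not present simply yields an empty string in this sketch.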