PHP crawler exception

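Below is my code, which should output the content under the Plot tab of a wiki page. I am using getElementById, and it throws the exception pasted below. Can someone fix it? Thanks in advance.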

<?php
/**
 * Downloads a web page from $url, selects the element by $id
 * and returns its XML string representation.
 */
// Taking input
if(isset($_POST['submit'])) /* i.e. the PHP code is executed only when someone presses the Submit button in the HTML form given below */
{
    $var = $_POST['var'];   // Here $var is the input taken from the user.
}
function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();
    @$doc->loadHTMLFile($url);
    if(!$doc) {
        throw new Exception("Failed to load $url");
    }
    // Obtain the element
    $element = $doc->getElementById($id);
    if(!$element) {
        throw new Exception("An element with id $id was not found");
    }
    if($pretty) {
        $doc->formatOutput = true;
    }
    // Return the string representation of the element
    return $doc->saveXML($element);
}
// call it:
echo getElementByIdAsString('https://en.wikipedia.org/wiki/I_Too_Had_a_Love_Story', 'Plot');
?>

The exception is:

Fatal error: Uncaught exception 'Exception' with message 'An element with id Plot was not found' in C:\xampp\htdocs\example2.php:23 Stack trace: #0 C:\xampp\htdocs\example2.php(35): getElementByIdAsString() #1 {main} thrown in C:\xampp\htdocs\example2.php on line 23

I tried your code, and it works fine and returns <span class="mw-headline" id="Plot">Plot</span>. I think your problem is in how you call DOMDocument::loadHTMLFile with the @ operator:

@$doc->loadHTMLFile($url);

because this method returns

bool true on success or bool false on failure

Sometimes it returns false (for example, a 403 from Wikipedia after many requests), and your DOM document is then empty. In that case, $element = $doc->getElementById($id); cannot find the element.
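To illustrate the point above, here is a minimal sketch that runs offline. It shows that a DOMDocument with nothing loaded returns null from getElementById, and it collects parser warnings with libxml_use_internal_errors instead of hiding them with @. The helper name findByIdInHtml and the sample markup are assumptions made for this illustration; they are not part of the original code.

```php
<?php
// Hypothetical helper (not from the original post): returns the serialized
// element with the given id, or null when the document has no such element.
function findByIdInHtml(string $html, string $id): ?string
{
    // Collect libxml parse warnings instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($loaded === false) {
        throw new RuntimeException('The HTML could not be parsed');
    }

    $element = $doc->getElementById($id);
    return $element === null ? null : $doc->saveXML($element);
}

// A DOMDocument with nothing loaded has no elements, so the lookup yields null.
$empty = new DOMDocument();
var_dump($empty->getElementById('Plot'));

// Sample markup is an assumption, modelled on the heading quoted in the answer.
$sample = '<html><body><h2><span class="mw-headline" id="Plot">Plot</span></h2></body></html>';
echo findByIdInHtml($sample, 'Plot'), PHP_EOL;
```

Checking the return value of the load call separately from the result of getElementById makes it possible to tell "the page did not load" apart from "the page loaded but has no such id".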

Try changing your code to:

<?php
/**
 * Downloads a web page from $url, selects the element by $id
 * and returns its XML string representation.
 */
//Taking input
if(isset($_POST['submit'])) /* i.e. the PHP code is executed only when someone presses Submit button in the below given HTML Form */
{
    $var = $_POST['var'];   // Here $var is the input taken from user.
}
function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();
    $loadResult = @$doc->loadHTMLFile($url);
    if(!$doc || !$loadResult) {
        throw new Exception("Failed to load $url");
    }
    // Obtain the element
    $element = $doc->getElementById($id);
    if(!$element) {
        throw new Exception("An element with id $id was not found");
    }
    if($pretty) {
        $doc->formatOutput = true;
    }
    // Return the string representation of the element
    return $doc->saveXML($element);
}
// call it:
echo getElementByIdAsString('https://en.wikipedia.org/wiki/I_Too_Had_a_Love_Story', 'Plot');
?>

Wikipedia may not be responding to your script (some sites block parser scripts). Try using curl to get the status code of the response:

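$url = 'https://en.wikipedia.org/wiki/I_Too_Had_a_Love_Story'; // include the scheme explicitly
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
$html = curl_exec($ch);
$status_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
echo $status_code; // e.g. 200 on success, 403 if the request was blocked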
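
The status check can be wrapped in a function so that a blocked or failed request is reported before any parsing happens. This is only a sketch: fetchHtml is a hypothetical name, and sending a descriptive User-Agent is an assumption about one possible cause of a 403, not something the original post confirms.

```php
<?php
// Hypothetical helper (not from the original post): fetches $url with curl and
// returns the body, throwing when the transfer fails or the status is not 200.
function fetchHtml(string $url, string $userAgent): string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects such as http -> https
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent); // assumption: the site expects a User-Agent

    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }

    $statusCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($statusCode !== 200) {
        throw new RuntimeException("Unexpected HTTP status $statusCode for $url");
    }

    return $html;
}

// Usage (needs network access, so it is left commented out here):
// $html = fetchHtml('https://en.wikipedia.org/wiki/I_Too_Had_a_Love_Story', 'MyCrawler/0.1 (example)');
// $doc = new DOMDocument();
// @$doc->loadHTML($html);
// $element = $doc->getElementById('Plot');
```

Fetching with curl and then parsing the string with DOMDocument::loadHTML keeps the HTTP failure (status code) separate from the lookup failure (missing id), which loadHTMLFile alone does not expose.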