Problem using PHP DOM to parse HTML

I'm using PHP's DOMDocument to parse HTML source (fetched with cURL). cURL works fine, but a problem appears when I parse the result with DOM. Here is the code:

    <?php
    $url = "http://www.google.com.vn/advanced_search?hl=en";
    $ch = curl_init($url);
    $header = array();
    $header[0]  = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[]   = "Cache-Control: max-age=0";
    $header[]   = "Connection: keep-alive";
    $header[]   = "Keep-Alive: 300";
    $header[]   = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[]   = "Accept-Language: en-us,en;q=0.5";
    $header[]   = "Pragma: "; // browsers keep this blank.
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; vi; rv:1.9.2.3)  Gecko/20100401 Firefox/3.6.3 FirePHP/0.5');
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    //curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    //curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLINFO_HEADER_OUT, 1);
    $html = curl_exec($ch); 
    /*
     * If I do:
     * echo $html;
     * exit;    // <-- this works fine:
     *  the number of <td> tags equals the number of </td> tags
     */
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $html = $dom->saveHTML();
    echo $html; // <-- the output HTML is not well formed: there are more <td> tags than </td> tags.
    ?>

Is this a programming error on my part, or a DOMDocument bug?

When you remove the error suppression operator (`@`), you will see that DOMDocument emits warnings like these:

Warning: DOMDocument::loadHTML(): Opening and ending tag mismatch: form and tr
Warning: DOMDocument::loadHTML(): Opening and ending tag mismatch: div and tr
Warning: DOMDocument::loadHTML(): Opening and ending tag mismatch: td and tr

To parse the markup into a DOM tree, loadHTML tries to repair as much as it can, which is probably why you think this is a bug. It isn't. Google's markup is invalid.
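If you want to see what the parser complains about without the warnings cluttering your output, one option is to collect libxml's errors rather than silence them with `@`. The following is only a minimal sketch: the `$html` string is a made-up malformed snippet standing in for the page you fetched, and the exact messages and repaired output depend on your libxml version.

```php
<?php
// Hypothetical malformed markup standing in for the fetched page:
// <td>, <form> and <div> are never closed before </tr>.
$html = '<table><tr><td><form><div>text</tr></table>';

// Buffer libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Fetch what the parser reported, then clear the buffer and
// restore the previous error-handling setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Count elements in the parsed tree rather than counting tags in the string.
printf("td elements in the tree: %d\n", $dom->getElementsByTagName('td')->length);

// The serialized markup is the parser's repaired version, not the original input.
echo $dom->saveHTML();
```

Counting nodes with `getElementsByTagName()` tells you what the tree actually contains, which is more reliable than comparing the number of `<td>` and `</td>` strings in the serialized output.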

Side note: why are you scraping that page? Google has an API for search.