I am using PHP's DOMDocument to parse HTML source (fetched via cURL). cURL works fine, but a problem appears when I parse the result with DOM. Here is the code:
<?php
$url = "http://www.google.com.vn/advanced_search?hl=en";
$ch = curl_init($url);
$header = array();
$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Pragma: "; // browsers keep this blank.
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; vi; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 FirePHP/0.5');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
//curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLINFO_HEADER_OUT, 1);
$html = curl_exec($ch);
/*
* If I do:
* echo $html;
* exit; // <-- this works fine:
* the number of <td> tags equals the number of </td> tags
*/
$dom = new DOMDocument();
@$dom->loadHTML($html);
$html = $dom->saveHTML();
echo $html; // <-- the output HTML is malformed: there are more <td> tags than </td> tags
?>
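Is this a programming mistake on my part, or a bug in DOMDocument?
When you remove the error suppression (the @ operator), you will see that DOMDocument emits warnings like these: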
Warning: DOMDocument::loadHTML(): Opening and ending tag mismatch: form and tr
Warning: DOMDocument::loadHTML(): Opening and ending tag mismatch: div and tr
Warning: DOMDocument::loadHTML(): Opening and ending tag mismatch: td and tr
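In order to parse the markup into a DOM tree, loadHTML will try to repair as much as it can, which is probably why you think this is a bug. It is not. Google's markup is invalid.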
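If you want to inspect these parser complaints rather than hide them with @, one option is libxml's internal error buffer. The following is only a minimal sketch: the $sample string is a made-up fragment of mismatched markup (an assumption for illustration, not Google's actual page), and the exact messages and the repaired output depend on the libxml2 version PHP was built with.

```php
<?php
// Hypothetical sample markup: <form> and <div> are left open when </td> arrives.
$sample = '<table><tr><td><form><div>text</td></tr></table>';

// Collect parser errors in a buffer instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($sample);

// Fetch whatever the parser reported, then empty the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore the previous error-handling mode.
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each entry is a LibXMLError object with line, column and message.
    printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
}

// Serialize the repaired tree and compare opening and closing <td> counts.
$fixed = $dom->saveHTML();
printf("<td>: %d, </td>: %d\n", substr_count($fixed, '<td'), substr_count($fixed, '</td>'));
```

Running the same check on the raw cURL response and on the saveHTML() output shows what the parser changed while repairing the tree.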
Side note: why are you scraping that page? Google has an API for search.