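I have the PHP code below, which loads an HTML file, extracts a table from it, parses the table, and returns the cell data as shown under Current Output. I am also trying to capture the href attribute, as shown under Desired Output. I can't see how to target just the href within a cell when one exists; I only seem to be able to get the node value. Any help is much appreciated.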
Current Output
Array
(
    [0] => Array
        (
            [id] => 213
            [url] => Website
        )
)
Desired Output
Array
(
    [0] => Array
        (
            [id] => 213
            [url] => Website
            [link] => example.com/page/1/
        )
)
HTML
<table>
    <tr>
        <td>213</td>
        <td><a href="example.com/page/1/">Website</a></td>
    </tr>
</table>
PHP
$dom = new DOMDocument();
$html = $dom->loadHTMLFile($url);
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(0)->getElementsByTagName('tr');
$cols = $rows->item(0)->getElementsByTagName('th');
$row_headers = null;
foreach($cols AS $node) {
    $row_headers[] = $node->nodeValue;
}
$table = array();
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach($rows AS $row) {
    $cols = $row->getElementsByTagName('td');
    $row = array();
    $i = 0;
    foreach($cols AS $node) {
        if ($row_headers != null) {
            $row[$row_headers[$i]] = $node->nodeValue;
        }
        $i++;
    }
    if (!empty($row)) {
        $table[] = $row;
    }
}
I tried $row['link'] = $node->getAttribute('href'); inside the nested foreach($cols AS $node) loop, but that doesn't seem to work either.
See the code below and its inline comments.
$html = '<table>
<tr>
<td>213</td>
<td><a href="example.com/page/1/">Website</a></td>
</tr>
<tr>
<td>444</td>
<td><a href="example.org/page/1/">not a website</a></td>
</tr>
</table>';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false; // set parser options before loading the markup
$dom->loadHTML($html); // returns a bool, so don't assign it back to $html
$result = array(); // initialise so print_r works even when no rows are found
$rows = $dom->getElementsByTagName("tr");
foreach ($rows as $row) {
    $cols = $row->getElementsByTagName('td');
    $id = $cols->item(0)->nodeValue; // get the id, the first td element, index=0
    $anchor = $cols->item(1)->nodeValue; // get the anchor text, the second td element, index=1
    // href is an attribute of the <a> inside the second td, not of the td itself
    $url = $cols->item(1)->getElementsByTagName('a')->item(0)->getAttribute('href');
    $result[] = array(
        'id' => $id,
        'anchor' => $anchor,
        'url' => $url
    );
}
print_r($result);
This should output:
Array
(
    [0] => Array
        (
            [id] => 213
            [anchor] => Website
            [url] => example.com/page/1/
        )
    [1] => Array
        (
            [id] => 444
            [anchor] => not a website
            [url] => example.org/page/1/
        )
)
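The attempt in the question, calling getAttribute('href') on $node, returns an empty string because $node is the td element, while the href belongs to the a element nested inside it. The code above also assumes every row has an a in its second cell; if one is missing, item(0) returns null and the getAttribute call fails. A minimal sketch of a null-safe variant that uses the key names from the Desired Output (id, url, link) might look like the following. The second sample row without a link is an assumption added purely to show the fallback.

```php
<?php
// Sample markup: the first row is from the question; the second row
// (no <a> element) is a hypothetical case added to show the fallback.
$html = '<table>
<tr><td>213</td><td><a href="example.com/page/1/">Website</a></td></tr>
<tr><td>444</td><td>No link here</td></tr>
</table>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$table = [];
foreach ($dom->getElementsByTagName('tr') as $tr) {
    $cols = $tr->getElementsByTagName('td');
    if ($cols->length < 2) {
        continue; // skip header rows (th only) and malformed rows
    }
    // item(0) is null when the cell contains no <a> element
    $anchor = $cols->item(1)->getElementsByTagName('a')->item(0);
    $table[] = [
        'id'   => trim($cols->item(0)->nodeValue),
        'url'  => trim($cols->item(1)->nodeValue),
        'link' => $anchor !== null ? $anchor->getAttribute('href') : null,
    ];
}
print_r($table);
```

Rows without a link get link set to null rather than raising an error, so the caller can decide how to treat them.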