How do you access an HTML node using DOMDocument while retaining the inner HTML formatting?

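I'm trying to access the contents of a spreadsheet cell from Google Docs using DOMDocument in PHP.

I can access the node, but its content comes back as plain text, with the HTML formatting stripped.

Here is the sample link I'm using; it contains bold, italic, and underlined text.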

https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml

Here is the PHP code I'm using:

    $url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $htmlData = curl_exec($curl);
    curl_close($curl);
    $dom        = new DOMDocument();
    $html       = $dom->loadHTML($htmlData); 
    $dom->preserveWhiteSpace = false;
    $tables     = $dom->getElementsByTagName('table');
    $rows       = $tables->item(0)->getElementsByTagName('tr');
    $cols       = $rows->item(1)->getElementsByTagName('td');  
    $rowHeaders = array();
    foreach ($cols as $i => $node) {
        if($i >= 0 ) $rowHeaders[] = $node->textContent;
    }
    foreach ($rows as $i => $row){
        if($i == 0 ) continue;
        $cols = $row->getElementsByTagName('td');
        $row = array();
        foreach ($cols as $j => $node) {
            $row[$rowHeaders[$j]] = $node->textContent;
        }
        $table[] = $row;
    }
    print_r($table); // print_r() already echoes the array, so it need not be passed to die()
    exit;

My output is missing the inner HTML formatting:

    [1] => Array
        (
            [Variable] => html_body
            [Data] => Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
        )

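textContent returns only the concatenated text of the node's descendants, so the markup is dropped. Instead of textContent, serialize each child node: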

    foreach ($cols as $j => $node) {
        //$row[$rowHeaders[$j]] = $node->textContent;
        // Serialize each child node, tags included, to rebuild the cell's inner HTML.
        $innerHTML = '';
        $children = $node->childNodes;
        foreach ($children as $child) {
            $innerHTML .= $child->ownerDocument->saveXML($child);
        }
        $row[$rowHeaders[$j]] = $innerHTML;
    }
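
For anyone who wants to try this without fetching the spreadsheet, here is a minimal, self-contained sketch of the same idea. The helper name innerHTML() and the sample table markup are made up for illustration; they are not part of DOMDocument or of the original page. It also swaps saveXML() for saveHTML($child), which should serialize the node with HTML rather than XML rules; either call accepts a node argument.

```php
<?php
// Hypothetical helper (not a DOMDocument method): returns the markup inside $node.
function innerHTML(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // Passing a node to saveHTML() serializes only that node, tags included.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Made-up sample standing in for the published spreadsheet markup.
$sample = '<table><tr><td>Variable</td><td>Data</td></tr>'
        . '<tr><td>html_body</td><td>Lorem <b>ipsum</b> <i>dolor</i></td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($sample);

// The fourth <td> (index 3) is the formatted cell.
$cell   = $dom->getElementsByTagName('td')->item(3);
$plain  = $cell->textContent;  // text only, formatting tags are lost
$markup = innerHTML($cell);    // text plus the <b> and <i> tags

echo $plain, "\n";
echo $markup, "\n";
```

In the loop above, the assignment would then become $row[$rowHeaders[$j]] = innerHTML($node);, assuming the helper is defined earlier in the script.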