Consider this example, test.php:
<?php
$mystr = "<p>Hello, με काचं ça øy jeść</p>";
var_dump($mystr);
$domdoc = new DOMDocument('1.0', 'utf-8'); //DOMDocument();
$domdoc->loadHTML($mystr); // already here corrupt UTF-8?
var_dump($domdoc);
?>
If I run this with PHP 5.5.9 (cli), I get the following in the terminal:
$ php test.php
string(50) "<p>Hello, με काचं ça øy jeść</p>"
object(DOMDocument)#1 (34) {
["doctype"]=>
string(22) "(object value omitted)"
...
["actualEncoding"]=>
NULL
["encoding"]=>
NULL
["xmlEncoding"]=>
NULL
...
["textContent"]=>
string(70) "Hello, με à¤à¤¾à¤à¤ ça øy jeÅÄ"
}
Clearly the original string is valid UTF-8, but the DOMDocument's textContent is incorrectly encoded.
So how can I get correct UTF-8 content into a DOMDocument?
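The DOM extension is built on libxml2, whose HTML parser was written for HTML 4, where the default encoding is ISO-8859-1. Unless it encounters an appropriate meta tag or XML declaration, loadHTML() will assume the content is ISO-8859-1.
Specifying the encoding when you create the DOMDocument does not affect what the parser does: loading HTML (or XML) replaces both the XML version and the encoding that were passed to the constructor.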
Workarounds:
First, you can use mb_convert_encoding() to convert everything above the ASCII range into its HTML entity equivalent.
$domdoc->loadHTML(mb_convert_encoding($mystr, 'HTML-ENTITIES', 'UTF-8'));
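One caveat on the line above: as of PHP 8.2 the 'HTML-ENTITIES' pseudo-encoding is deprecated in mbstring. A minimal sketch of the same idea using mb_encode_numericentity() follows. The helper name encode_non_ascii() is hypothetical (it is not part of PHP or of the original answer), and the conversion map is an assumption that covers every code point from U+0080 upward.

```php
<?php
// Hypothetical helper (not from the original answer): turn every character
// above the ASCII range into a numeric entity. Markup characters such as
// < > & are below 0x80 and are left untouched.
function encode_non_ascii(string $html): string
{
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($html, $map, 'UTF-8');
}

$mystr = "<p>Hello, με काचं ça øy jeść</p>";

$domdoc = new DOMDocument();
// The parser now only sees ASCII, so its encoding guess no longer matters.
$domdoc->loadHTML(encode_non_ascii($mystr));

// Read the text back from the <p> element and compare it with the input text.
$p = $domdoc->getElementsByTagName('p')->item(0);
var_dump($p->textContent === "Hello, με काचं ça øy jeść");
```

Since the input handed to loadHTML() is pure ASCII, this should behave the same regardless of which default encoding the parser assumes.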
Or hack in a meta tag or an XML declaration that specifies UTF-8.
$domdoc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />' . $mystr);
$domdoc->loadHTML('<?xml encoding="UTF-8">' . $mystr);
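If you would rather not prepend something to a bare fragment, the meta tag variant can also be written by wrapping the fragment in a complete document. This is only a sketch: load_utf8_fragment() is a hypothetical helper name, and it assumes the input is an HTML fragment that belongs inside <body>.

```php
<?php
// Hypothetical helper (not from the original answer): wrap a UTF-8 fragment
// in a full document whose <head> declares the charset, so the parser sees
// the declaration before it reaches any non-ASCII bytes.
function load_utf8_fragment(string $fragment): DOMDocument
{
    $doc = new DOMDocument();
    $doc->loadHTML(
        '<html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '</head><body>' . $fragment . '</body></html>'
    );
    return $doc;
}

$doc = load_utf8_fragment("<p>Hello, με काचं ça øy jeść</p>");

// Read the paragraph text back out of the parsed tree.
$p = $doc->getElementsByTagName('p')->item(0);
var_dump($p->textContent);
```

The meta element stays in the parsed tree, inside <head>, so nothing has to be removed from the body afterwards.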
I just wanted to post the OP's code together with the fix that worked for me:
<?php
$mystr = "<p>Hello, με काचं ça øy jeść</p>";
var_dump($mystr);
$domdoc = new DOMDocument('1.0', 'UTF-8'); //DOMDocument();
$domdoc->substituteEntities = true; // no effect if hack is done
//~ $domdoc->actualEncoding = 'UTF-8'; // Cannot write property
$domdoc->encoding = 'UTF-8'; // no effect
//~ $domdoc->xmlEncoding = 'UTF-8'; // Cannot write property
//~ $domdoc->loadHTML($mystr); // already here corrupt UTF-8?
//~ $domdoc->loadHTML(utf8_decode($mystr)); // this gets to <p>Hello, ?? ????? ça øy je??</p>, so not all
//~ $domdoc->loadHTML( mb_convert_encoding($mystr, 'utf-8', mb_detect_encoding($mystr)) ); // no dice
$domdoc->loadHTML('<?xml encoding="UTF-8">' . $mystr); // hack, http://php.net/manual/en/domdocument.loadhtml.php#95251
// dirty fix
foreach ($domdoc->childNodes as $item)
if ($item->nodeType == XML_PI_NODE)
$domdoc->removeChild($item); // remove hack
$domdoc->encoding = 'UTF-8'; // insert proper (sets all three)
var_dump($domdoc);
print $domdoc->saveXML(); // without ->encoding = 'UTF-8': Hello, με काचं else OK
//~ print mb_convert_encoding($domdoc->saveXML(), 'UTF-8', 'HTML-ENTITIES'); // if without ->encoding = 'UTF-8', this is then OK: <p>Hello, με काचं ça øy jeść</p>
?>
This outputs:
$ php test.php
string(50) "<p>Hello, με काचं ça øy jeść</p>"
object(DOMDocument)#1 (34) {
["doctype"]=>
string(22) "(object value omitted)"
...
["actualEncoding"]=>
string(5) "UTF-8"
["encoding"]=>
string(5) "UTF-8"
["xmlEncoding"]=>
string(5) "UTF-8"
...
["textContent"]=>
string(43) "Hello, με काचं ça øy jeść"
}
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Hello, με काचं ça øy jeść</p></body></html>
Now everything looks fine :)