Which character entities does DOMDocument output?

PHP's DOMDocument class mangles UTF-8 input unless you prepare the input first.

For example, this code

<?php
echo mb_internal_encoding()."\n\n";
$str = '’'; // U+2019, stored as UTF-8 in this source file
$dom = new DOMDocument;
$dom->loadHTML($str); // no charset hint is given to the parser
echo $dom->saveHTML();

produces this output

UTF-8
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>&acirc;&#128;&#153;</p></body></html>

&acirc;&#128;&#153; should be &rsquo;.
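To see where those three entities come from, it may help to look at the raw bytes. The sketch below assumes the quote character is stored as UTF-8; the helper name byte_values() is made up for illustration. The values it returns match the output above, which is consistent with the parser treating each byte as one character of a single-byte encoding such as ISO-8859-1.

```php
<?php
// Hypothetical helper: returns the numeric value of every byte in a string.
function byte_values(string $s): array
{
    return array_map('ord', str_split($s));
}

$str = "\xE2\x80\x99"; // U+2019 (right single quotation mark) as UTF-8 bytes
echo bin2hex($str), "\n"; // e28099

// 226 is the ISO-8859-1 code for "â" (&acirc;); 128 and 153 have no
// named entity, so they show up as &#128; and &#153;.
print_r(byte_values($str)); // 226, 128, 153
```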

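I'd like to know all the character entities, such as &acirc;, that DOMDocument can produce if you don't apply the fix. Is there a list somewhere? Is it in the PHP source code? The libxml source code?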
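The question does not say which fix it has in mind. As one possibility only, and purely as an assumption about what "preparing the input" could look like, non-ASCII characters can be rewritten as numeric character references before parsing. The helper name to_numeric_entities() is hypothetical; mb_encode_numericentity() is from the mbstring extension.

```php
<?php
// Hypothetical helper, not something the question itself defines.
// Rewrites every non-ASCII character in a UTF-8 string as a decimal
// numeric character reference and leaves ASCII untouched.
function to_numeric_entities(string $utf8): string
{
    // Convmap entries: [first code point, last code point, offset, mask].
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
}

$str = "\xE2\x80\x99"; // U+2019 as UTF-8 bytes
echo to_numeric_entities($str), "\n"; // &#8217;
```

The resulting string is pure ASCII, so passing it to loadHTML() leaves the parser nothing to guess about the byte encoding.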

I thought of a way to find out without reading any reference material or source code:

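<?php
// Feed every single-byte value through DOMDocument and see what comes out.
$str = '';
for ($i = 1; $i < 256; $i++) {
    $str .= chr($i)."\n";
}
$str .= chr(0)."\n"; // the NUL byte goes last
$dom = new DOMDocument;
$dom->loadHTML($str);
echo $dom->saveHTML();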

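If you need an accurate list, I suggest running this on your own system to generate your own, in case the result differs between versions of PHP, libxml and so on.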

Expect a lot of warning messages, but no errors.
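If the warnings get in the way, they can be collected instead of printed. This is a minimal sketch using libxml_use_internal_errors(); the helper name load_html_quietly() is made up for illustration, and the test string here leaves out the NUL byte.

```php
<?php
// Hypothetical helper: parses $html into $dom and returns libxml's
// messages as strings instead of letting PHP print them as warnings.
function load_html_quietly(DOMDocument $dom, string $html): array
{
    $previous = libxml_use_internal_errors(true); // buffer messages internally
    $dom->loadHTML($html);
    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting
    return $messages;
}

// Same idea as the script above: one line per byte value from 1 to 255.
$str = '';
for ($i = 1; $i < 256; $i++) {
    $str .= chr($i) . "\n";
}
$dom = new DOMDocument;
$messages = load_html_quietly($dom, $str);
echo count($messages), " messages collected\n";
echo $dom->saveHTML();
```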

Here is the output I got, except that I used a text editor to remove everything that was not a character entity:

&amp;
&#128;
&#129;
&#130;
&#131;
&#132;
&#133;
&#134;
&#135;
&#136;
&#137;
&#138;
&#139;
&#140;
&#141;
&#142;
&#143;
&#144;
&#145;
&#146;
&#147;
&#148;
&#149;
&#150;
&#151;
&#152;
&#153;
&#154;
&#155;
&#156;
&#157;
&#158;
&#159;
&nbsp;
&iexcl;
&cent;
&pound;
&curren;
&yen;
&brvbar;
&sect;
&uml;
&copy;
&ordf;
&laquo;
&not;
&shy;
&reg;
&macr;
&deg;
&plusmn;
&sup2;
&sup3;
&acute;
&micro;
&para;
&middot;
&cedil;
&sup1;
&ordm;
&raquo;
&frac14;
&frac12;
&frac34;
&iquest;
&Agrave;
&Aacute;
&Acirc;
&Atilde;
&Auml;
&Aring;
&AElig;
&Ccedil;
&Egrave;
&Eacute;
&Ecirc;
&Euml;
&Igrave;
&Iacute;
&Icirc;
&Iuml;
&ETH;
&Ntilde;
&Ograve;
&Oacute;
&Ocirc;
&Otilde;
&Ouml;
&times;
&Oslash;
&Ugrave;
&Uacute;
&Ucirc;
&Uuml;
&Yacute;
&THORN;
&szlig;
&agrave;
&aacute;
&acirc;
&atilde;
&auml;
&aring;
&aelig;
&ccedil;
&egrave;
&eacute;
&ecirc;
&euml;
&igrave;
&iacute;
&icirc;
&iuml;
&eth;
&ntilde;
&ograve;
&oacute;
&ocirc;
&otilde;
&ouml;
&divide;
&oslash;
&ugrave;
&uacute;
&ucirc;
&uuml;
&yacute;
&thorn;
&yuml;
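The text-editor cleanup could also be scripted. The sketch below is one way to pull just the entities out of saved HTML; extract_entities() is a hypothetical helper, and the sample input is the output line from the first example rather than a fresh run.

```php
<?php
// Hypothetical helper: returns the distinct character entities found in
// a string of HTML, in order of first appearance.
function extract_entities(string $html): array
{
    // Matches decimal (&#128;), hexadecimal (&#x80;) and named (&acirc;) forms.
    $pattern = '/&(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);/';
    preg_match_all($pattern, $html, $matches);
    return array_values(array_unique($matches[0]));
}

$html = '<html><body><p>&acirc;&#128;&#153;</p></body></html>';
echo implode("\n", extract_entities($html)), "\n";
// &acirc;
// &#128;
// &#153;
```

Passing the string returned by $dom->saveHTML() in place of $html should give a list in the same shape as the one above.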