How to validate CDATA section for an XML in PHP

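I am creating XML from user input. One of the XML nodes has a CDATA section. If one of the characters inserted into the CDATA section is "special" (a control character, I think), the entire XML becomes invalid.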

Example:

$dom = new DOMDocument('1.0', 'utf-8');
$dom->appendChild($dom->createElement('root'))
    ->appendChild($dom->createCDATASection(
        "This is some text with a SOH char \x01."
    ));
$test = new DOMDocument;
$test->loadXml($dom->saveXML());
echo $test->saveXml();

This gives:

Warning: DOMDocument::loadXML(): CData section not finished
This is some text with a SOH cha in Entity, line: 2 in /newfile.php on line 17
Warning: DOMDocument::loadXML(): PCDATA invalid Char value 1 in Entity, line: 2 in /newfile.php on line 17
Warning: DOMDocument::loadXML(): Sequence ']]>' not allowed in content in Entity, line: 2 in /newfile.php on line 17
Warning: DOMDocument::loadXML(): Sequence ']]>' not allowed in content in Entity, line: 2 in /newfile.php on line 17
Warning: DOMDocument::loadXML(): internal errorExtra content at the end of the document in Entity, line: 2 in /newfile.php on line 17
<?xml version="1.0"?>

Is there a good way in PHP to make sure the CDATA section is valid?

The allowed range of characters for a CDATA section is

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

So you have to sanitize your string so that it contains only those characters.
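
One way to check a string before building the CDATA section is to test it against that character range. The following is only a minimal sketch: the function name `is_valid_xml_text` is hypothetical, and it assumes the input is meant to be UTF-8.

```php
<?php
/**
 * Returns true if $text contains only characters from the allowed range.
 * Hypothetical helper name; assumes UTF-8 input.
 */
function is_valid_xml_text(string $text): bool
{
    // Match the whole string against the allowed ranges. The /u modifier makes
    // PCRE treat pattern and subject as UTF-8; /D anchors $ at the very end.
    $allowed = '/^[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*$/uD';

    // preg_match() returns false for invalid UTF-8, which also fails the check.
    return preg_match($allowed, $text) === 1;
}

var_dump(is_valid_xml_text("plain text"));         // bool(true)
var_dump(is_valid_xml_text("SOH char \x01 here")); // bool(false)
```

A check like this lets you reject bad input outright instead of silently changing it; whether to reject or to strip depends on where the text comes from.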

That is because "\x01" is not a printable character. You can work around the problem like this:

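$dom = new DOMDocument('1.0', 'utf-8');
$dom->appendChild($dom->createElement('root'))
    ->appendChild($dom->createCDATASection(
        // urlencode() turns the control character into the plain ASCII sequence %01
        urlencode("This is some text with a SOH char \x01.")
    ));
$test = new DOMDocument;
$test->loadXml($dom->saveXML());
// Decoding the whole serialized document is for display only:
// the decoded output contains the raw \x01 again
echo urldecode($test->saveXml());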

Based on Gordon's answer, I made:

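/**
 * Removes characters that are invalid in XML from an HTML string
 *
 * @param string $content UTF-8 encoded input
 *
 * @return string
 */
function sanitize_html($content) {
  if ($content === null || $content === '') return '';
  // Negated allowed range; the \x{...} syntax and the /u modifier are needed
  // for code points above 0xFF
  $invalid_characters = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
  // preg_replace() returns null if $content is not valid UTF-8
  return (string) preg_replace($invalid_characters, '', $content);
}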
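
To show how stripping fits into the example from the question, here is a hedged round-trip sketch. The helper is restated under the hypothetical name `strip_invalid_xml_chars` so that the snippet runs on its own, and it assumes UTF-8 input.

```php
<?php
// Hypothetical helper: drops every character outside the allowed XML range.
function strip_invalid_xml_chars(string $text): string
{
    $invalid = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    // preg_replace() returns null on invalid UTF-8; fall back to an empty string.
    return preg_replace($invalid, '', $text) ?? '';
}

$clean = strip_invalid_xml_chars("This is some text with a SOH char \x01.");

$dom = new DOMDocument('1.0', 'utf-8');
$dom->appendChild($dom->createElement('root'))
    ->appendChild($dom->createCDATASection($clean));

// The serialized document can now be parsed again.
$test = new DOMDocument;
$loaded = $test->loadXML($dom->saveXML());

var_dump($loaded);                              // bool(true)
echo $test->documentElement->textContent, "\n"; // This is some text with a SOH char .
```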

Have a look at simplexml_load_file (http://php.net/manual/en/function.simplexml-load-file.php) and the LIBXML_NOCDATA option (http://www.php.net/manual/en/libxml.constants.php). That may well answer your question.
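
For reference, a small sketch of what LIBXML_NOCDATA does. It uses simplexml_load_string instead of simplexml_load_file so that it needs no file on disk; the sample documents are made up for illustration. As far as I can tell, the option only changes how CDATA is exposed after parsing (it is merged into ordinary text nodes) and does not make invalid characters acceptable.

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// LIBXML_NOCDATA merges CDATA sections into ordinary text nodes.
$ok = simplexml_load_string(
    '<root><item><![CDATA[a < b]]></item></root>',
    'SimpleXMLElement',
    LIBXML_NOCDATA
);
echo (string) $ok->item, "\n"; // a < b

// A raw \x01 inside the CDATA section still makes parsing fail.
$bad = simplexml_load_string(
    "<root><![CDATA[SOH char \x01.]]></root>",
    'SimpleXMLElement',
    LIBXML_NOCDATA
);
var_dump($bad); // bool(false)
libxml_clear_errors();
```

So the input still has to be validated or sanitized before the CDATA section is created.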