parser error : XML declaration allowed only at the start of the document

I have an XML file that contains multiple declarations, like the following:

<?xml version="1.0" encoding="UTF-8"?>
<root>
 <node>
  <element1>Stefan</element1>
  <element2>42</element2>
  <element3>Shirt</element3>
  <element4>3000</element4>  
</node>
</root>
<?xml version="1.0" encoding="UTF-8"?>
<root>
 <node>
  <element1>Damon</element1>
  <element2>32</element2>
  <element3>Jeans</element3>
  <element4>4000</element4>  
</node>
</root>

When I try to load the XML with

$data = simplexml_load_file("testdoc.xml") or die("Error: Cannot create object");

it gives me the following error:

Warning: simplexml_load_file(): testdoc.xml:11: parser error : XML declaration allowed only at the start of the document in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): <?xml version="1.0" encoding="UTF-8"?> in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): ^ in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): testdoc.xml:12: parser error : Extra content at the end of the document in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): <root> in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): ^ in C:\xampp\htdocs\crea\services\testxml.php on line 3
Error: Cannot create object

Please let me know how to parse this XML, or how to split it into several XML files, so that I can read it. The file is about 1 GB in size.

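The second declaration

<?xml version="1.0" encoding="UTF-8"?>

needs to be removed. A file may contain only one XML declaration, and it must come at the very start of the document.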

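Strictly speaking, you also need a single root element (though I have seen lenient parsers). Simply wrap the content in a dummy tag, so that your file looks like this: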

<?xml version="1.0" encoding="UTF-8"?>
<metaroot><!-- synthetic unique root, no semantics attached -->
    <root>
        <!-- ... -->
    </root>
    <root>
        <!-- ... -->
    </root>
    <!-- ... -->
</metaroot>

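Solution for (very) large files:

With sed you can remove the offending XML declarations, and with printf you can add a single XML declaration and a unique root element. The bash command sequence is:

  printf "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<metaroot>\n" >out.xml
  sed '/<[?]xml /d' in.xml >>out.xml
  printf "\n</metaroot>\n" >>out.xml

in.xml is the original file, out.xml the cleaned-up result.

printf writes the single XML declaration and the opening/closing tags. sed is a tool that edits a file line by line, performing actions based on regular expression matches. The pattern to match is the start of an XML declaration (<?xml followed by a space), and the action is to delete that line. This assumes each declaration sits on a line of its own, as in the sample.

Notes:

The backslashes in the printf commands escape characters that would otherwise have a special meaning (\" is a literal double quote, \n a newline). In the sed pattern the ? is written as [?] so that every sed flavor treats it literally; GNU sed would read \? as "optional".
sed is also available for Windows/macOS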
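
To try the three steps above without touching the real 1 GB file, here is a minimal sketch that runs them on a two-document sample. The scratch directory, the sample contents, and the variable names are assumptions made for illustration; only in.xml and out.xml follow the naming used above.

```shell
# Work in a throwaway directory (an assumption for this demo).
workdir=$(mktemp -d)
cd "$workdir"

# Build a tiny in.xml with two concatenated documents, like the file in the question.
cat > in.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<root><node><element1>Stefan</element1></node></root>
<?xml version="1.0" encoding="UTF-8"?>
<root><node><element1>Damon</element1></node></root>
EOF

# Write the new header, strip the old declarations, close the synthetic root.
printf '<?xml version="1.0" encoding="UTF-8"?>\n<metaroot>\n' > out.xml
sed '/<[?]xml /d' in.xml >> out.xml
printf '\n</metaroot>\n' >> out.xml

# Count what is left in the result.
decl_count=$(grep -c '<[?]xml ' out.xml)
root_count=$(grep -c '<root>' out.xml)
echo "declarations: $decl_count, root elements: $root_count"
# prints "declarations: 1, root elements: 2"
```

Single quotes are used for the printf format strings here, so the double quotes inside the declaration need no escaping.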

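Alternative solution

Another option is to split the file into individual well-formed files (adapted from another SO answer):

csplit -z -f out -b '%03d.xml' in.xml '/<[?]xml /' '{*}'

This produces files named out000.xml, out001.xml, ... The file name is the -f prefix followed by the -b suffix format. You should know at least the order of magnitude of the number of documents concatenated in the input file, so that the automatic numbering has enough digits (to be on the safe side you could use -b '%09d.xml' in the command above, taking the number of digits in the input file's byte size as the width).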
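
A small sketch of the split on a two-document sample, assuming GNU csplit (as far as I know, -z, -b and the {*} repeat count are GNU extensions, so the stock macOS csplit may reject them). The scratch directory, the sample contents, the variable names, and the added -s flag (which only silences the byte counts csplit normally prints) are choices made for this illustration.

```shell
# Work in a throwaway directory (an assumption for this demo).
splitdir=$(mktemp -d)
cd "$splitdir"

# Two concatenated documents, as in the question.
cat > in.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<root><node><element1>Stefan</element1></node></root>
<?xml version="1.0" encoding="UTF-8"?>
<root><node><element1>Damon</element1></node></root>
EOF

# Start a new output file at every line that begins an XML declaration.
# -z drops the empty piece that precedes the first declaration.
csplit -s -z -f out -b '%03d.xml' in.xml '/<[?]xml /' '{*}'

# Each piece should now be a well-formed document of its own.
piece_count=$(ls out*.xml | wc -l)
echo "pieces: $piece_count"
```

With this sample the split should leave out000.xml (Stefan) and out001.xml (Damon), each starting with its own declaration.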

That is not valid XML. You will need to split it with string functions or, better, read it piece by piece.

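$filename = 'testdoc.xml';
$xmlDeclaration = '<?xml version="1.0" encoding="UTF-8"?>';

// Read the file line by line so the whole 1 GB never sits in memory.
$file = new SplFileObject($filename, 'r');
$file->setFlags(SplFileObject::SKIP_EMPTY);

$buffer = '';
foreach ($file as $line) {
  if (FALSE === strpos($line, $xmlDeclaration)) {
    // Still inside the current document: keep collecting lines.
    $buffer .= $line;
  } else {
    // A declaration starts the next document: process the finished one.
    outputBuffer($buffer);
    $buffer = $line;
  }
}
// Process the last document.
outputBuffer($buffer);

// Parse one complete document and print the text of its first element1.
function outputBuffer($buffer) {
  if (!empty($buffer)) {
    $dom = new DOMDocument();
    $dom->loadXML($buffer);
    $xpath = new DOMXPath($dom);
    echo $xpath->evaluate('string(//element1)'), "\n";
  }
}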

Output:

Stefan
Damon