Parse Large XML File in PHP Efficiently to Generate SQL

I'm trying to parse a large XML file and load it into MySQL. I used SimpleXML to parse it, and it works fine, but it is slow for a file this large. Now I'm trying to use XMLReader instead.

Here is a sample of the XML:

<?xml version="1.0" encoding="UTF-8"?>
<drug type="biotech" created="2005-06-13" updated="2015-02-23">
<drugbank-id primary="true">DB00001</drugbank-id>
<drugbank-id>BIOD00024</drugbank-id>
<drugbank-id>BTD00024</drugbank-id>
<name>Lepirudin</name>
<description>Lepirudin is identical </description>
<cas-number>120993-53-5</cas-number>
<groups>
  <group>approved</group>
</groups>
<pathways>
<pathway>
  <smpdb-id>SMP00278</smpdb-id>
  <name>Lepirudin Action Pathway</name>
  <drugs>
    <drug>
      <drugbank-id>DB00001</drugbank-id>
      <name>Lepirudin</name>
    </drug>
    <drug>
      <drugbank-id>DB01373</drugbank-id>
      <name>Calcium</name>
    </drug>
  </drugs>
...
</drug>
<drug type="biotech" created="2005-06-15" updated="2015-02-25">
...
</drug>

Here is my approach using SimpleXML:

<?php
$xml = simplexml_load_file('drugbank.xml');
$servername = "localhost"; // Example : localhost
$username   = "root";
$password   = "pass";
$dbname     = "dbname";
// Create connection
$conn = new mysqli($servername, $username, $password, $dbname);
// Check connection
if ($conn->connect_error) {
    die("Connection failed: " . $conn->connect_error);
} 
$xmlObject_count  = $xml->drug->count();
for ($i=0; $i < $xmlObject_count; $i++) {
    $name = $xml->drug[$i]->name;
    $description  = $xml->drug[$i]->description;
    $casnumber = $xml->drug[$i]->{'cas-number'};
    // ...
    $created = $xml->drug[$i]['created'];
    $updated = $xml->drug[$i]['updated'];
    $type = $xml->drug[$i]['type'];

    $sql = "INSERT INTO `drug` (name, description,cas_number,created,updated,type) 
VALUES ('$name', '$description','$casnumber','$created','$updated','$type')";
    if ($conn->query($sql) === TRUE) {
        $last_id = $conn->insert_id;
    } else {
        echo "outer else Error: " . $sql . "<br>" . $conn->error. "<br>" ;
    }
}
$conn->close();

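This works fine and gives me 7789 rows. However, I want to parse the file with XMLReader, and the problem is that the XMLReader version gives me more than 35,000 rows.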

If you look at the XML, you can see that each <drug /> node contains further <drugs><drug> child nodes. How can I get around this?

Here is my approach using XMLReader:

<?php
$servername = "localhost"; // Example : localhost
$username   = "root";
$password   = "pass";
$dbname     = "dbname";
// Create connection
$conn = new mysqli($servername, $username, $password, $dbname);
// Check connection
if ($conn->connect_error) {
    die("Connection failed: " . $conn->connect_error);
} 
$reader = new XMLReader();
$reader->open('drugbank.xml');
while ($reader->read())
{
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'drug')
    {
        $doc = new DOMDocument('1.0', 'UTF-8');
        $xml = simplexml_import_dom($doc->importNode($reader->expand(),true));
        $name = $xml->name;
        $description  = $xml->description;
        $casnumber = $xml->{'cas-number'};
        // ...
        $sql = "INSERT INTO `drug` (name, description,cas_number,created,updated,type) 
VALUES ('$name', '$description','$casnumber','$created','$updated','$type')";
        if ($conn->query($sql) === TRUE) {
            $last_id = $conn->insert_id;
        } else {
            echo "outer else Error: " . $sql . "<br>" . $conn->error. "<br>" ;
        }
    }
}
$conn->close();

With this version, I end up with more than 35,000 rows.

OK, I have a working example for you that is a big improvement in execution speed, memory usage, and database load:

<?php
define('INSERT_BATCH_SIZE', 500);
define('DRUG_XML_FILE', 'drugbank.xml');
$servername = "localhost"; // Example : localhost
$username   = "root";
$password   = "pass";
$dbname     = "dbname";
function parseXml($mysql)
{
    $drugs = array();
    $xmlReader = new XMLReader();
    $xmlReader->open(DRUG_XML_FILE);
    // Move our pointer to the first <drug /> element.
    while ($xmlReader->read() && $xmlReader->name !== 'drug') ;
    $drugCount = 0;
    $totalDrugs = 0;
    // Iterate over the outer <drug /> elements.
    while ($xmlReader->name == 'drug')
    {
        // Convert the node into a SimpleXMLElement for ease of use.
        $item = new SimpleXMLElement($xmlReader->readOuterXML());
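        // Cast each value to string and escape it so quotes in the data cannot break the SQL.
        $name = $mysql->real_escape_string((string) $item->name);
        $description = $mysql->real_escape_string((string) $item->description);
        $casNumber = $mysql->real_escape_string((string) $item->{'cas-number'});
        $created = $mysql->real_escape_string((string) $item['created']);
        $updated = $mysql->real_escape_string((string) $item['updated']);
        $type = $mysql->real_escape_string((string) $item['type']);
        $drugs[] = "('$name', '$description','$casNumber','$created','$updated','$type')";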
        $drugCount++;
        $totalDrugs++;
        // Once we've reached the desired batch size, insert the batch and reset the counter.
        if ($drugCount >= INSERT_BATCH_SIZE)
        {
            batchInsertDrugs($mysql, $drugs);
            $drugCount = 0;
        }
        // Go to next <drug />.
        $xmlReader->next('drug');
    }
    $xmlReader->close();
    // Insert the leftovers from the last batch.
    batchInsertDrugs($mysql, $drugs);
    echo "Inserted $totalDrugs total drugs.";
}
function batchInsertDrugs($mysql, &$drugs)
{
    // Nothing buffered (e.g. the total was an exact multiple of the batch size).
    if (!$drugs) {
        return;
    }
    // Generate a batched INSERT statement.
    $statement = "INSERT INTO `drug` (name, description, cas_number, created, updated, type) VALUES";
    $statement = $statement . ' ' . implode(",\n", $drugs);
    echo $statement, "\n";
    // Run the batch INSERT.
    if ($mysql->query($statement))
    {
        echo "Inserted " . count($drugs) . " drugs.";
    }
    else
    {
        echo "INSERT Error: " . $statement . "<br>" . $mysql->error. "<br>" ;
    }
    // Clear the buffer.
    $drugs = array();
}
// Create MySQL connection.
$mysql = new mysqli($servername, $username, $password, $dbname);
if ($mysql->connect_error)
{
    die("Connection failed: " . $mysql->connect_error);
}
parseXml($mysql);

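I tested this example with the same dataset. Using SimpleXML the way you are now parses the entire document into memory, which is slow and memory-hungry. This approach uses XMLReader, which is a fast pull parser. Because the loop advances with next('drug'), it jumps from one outer <drug /> to the next sibling and never descends into the nested <drugs><drug> elements, which is where your extra rows came from. You could probably make this faster still with PHP's SAX XML parser, but that pattern is a bit more complex, and the example above will be noticeably better than what you started with.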
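To make the difference between read() and next() concrete, here is a minimal sketch run against a tiny inline document. The <drugbank> root element, the second drug's name, and the helper names countAllDrugElements and collectOuterDrugNames are assumptions made up for this illustration; they are not taken from your file.

```php
<?php
// Hypothetical miniature of the file layout; the <drugbank> root is an assumption.
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<drugbank>
  <drug type="biotech">
    <name>Lepirudin</name>
    <pathways><pathway><drugs>
      <drug><name>Lepirudin</name></drug>
      <drug><name>Calcium</name></drug>
    </drugs></pathway></pathways>
  </drug>
  <drug type="biotech">
    <name>Cetuximab</name>
  </drug>
</drugbank>
XML;

// Visit every node with read(): nested <drug> elements are counted too.
function countAllDrugElements(string $xml): int
{
    $reader = new XMLReader();
    $reader->XML($xml);
    $count = 0;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'drug') {
            $count++;
        }
    }
    $reader->close();
    return $count;
}

// Find the first <drug>, then hop between siblings with next(),
// which skips the current element's whole subtree.
function collectOuterDrugNames(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml);
    while ($reader->read() && $reader->name !== 'drug');
    $names = [];
    while ($reader->name === 'drug') {
        $item = new SimpleXMLElement($reader->readOuterXml());
        $names[] = (string) $item->name;
        $reader->next('drug');
    }
    $reader->close();
    return $names;
}
```

On this sample the read() loop sees four <drug> elements, while the next() loop only sees the two outer ones, which mirrors the 35,000 versus 7789 discrepancy in the question.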

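The other important change in my example is that it uses MySQL batch inserts, so we only hit the database once for every 500 (configurable) items we process. You can tune this number for better performance. Past a certain point the query becomes too large for MySQL to handle (the statement size is capped by settings such as max_allowed_packet), but you can probably do far more than 500 at a time.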
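If you would rather not build the VALUES list by string interpolation, the same batching idea can be combined with placeholders. The following is only a sketch under my own assumptions: buildBatchInsert is a hypothetical helper, the sample rows are invented, and the batch size of 2 is there purely to make the chunking visible.

```php
<?php
// Hypothetical helper: builds one multi-row INSERT with "?" placeholders
// plus the flat list of parameters to bind to it.
function buildBatchInsert(string $table, array $columns, array $rows): array
{
    if ($rows === []) {
        return [null, []]; // Nothing to insert; the caller should skip the query.
    }
    $rowPlaceholder = '(' . implode(', ', array_fill(0, count($columns), '?')) . ')';
    $sql = sprintf(
        'INSERT INTO `%s` (%s) VALUES %s',
        $table,
        implode(', ', $columns),
        implode(', ', array_fill(0, count($rows), $rowPlaceholder))
    );
    // Flatten the rows in column order so the values line up with the placeholders.
    $params = [];
    foreach ($rows as $row) {
        foreach ($columns as $column) {
            $params[] = $row[$column];
        }
    }
    return [$sql, $params];
}

// Invented sample rows, including a value with a quote in it.
$rows = [
    ['name' => 'Lepirudin', 'type' => 'biotech'],
    ['name' => "Name with 'quotes'", 'type' => 'biotech'],
    ['name' => 'Calcium', 'type' => 'small molecule'],
];

// Split the buffered rows into batches and build one statement per batch.
$batches = [];
foreach (array_chunk($rows, 2) as $chunk) {
    $batches[] = buildBatchInsert('drug', ['name', 'type'], $chunk);
}
```

With mysqli, each [$sql, $params] pair could then be run through prepare(), bind_param(str_repeat('s', count($params)), ...$params) and execute(). Keep in mind that MySQL prepared statements allow at most 65,535 placeholders, so the batch size times the column count has to stay below that.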

If you'd like me to explain any part of this further, or if you run into any problems with it, let me know in the comments! :)