合并3个数据相似但ID不同的MySQL数据库';s


Merging 3 MySQL Databases with similar data but different ID's?

我们正在合并一些在不同阶段建立的独立公司数据库,它们都有大致相同的数据,但顺序/ID不同。

以下是3个数据库的伪覆盖图,每个数据库有2个示例表: +------------+-----------+---------------+ | DATABASE 1 | | | +------------+-----------+---------------+ | fruit_id | name | | | 1 | orange | | | 2 | apple | | | 3 | banana | | | | | | | sales_id | fruit_ids | | | 1924 | 2,3 | apple,banana | | 1925 | 1,3 | orange,apple | | | | | | DATABASE 2 | | | | fruit_id | name | | | 1 | apple | | | 2 | orange | | | 3 | banana | | | | | | | sales_id | fruit_ids | | | 1924 | 2,3 | orange,banana | | 1925 | 1,3 | apple,banana | | | | | | DATABASE 3 | | | | fruit_id | name | | | 1 | banana | | | 2 | apple | | | 3 | orange | | | | | | | sales_id | fruit_ids | | | 1950 | 2,3 | apple,orange | | 1951 | 1,3 | banana,orange | +------------+-----------+---------------+

您将看到一些数据库sales_id实际上是重复的,甚至fruit_id与每个表中的不同项相关。

我们正在使用的真实数据库在很多地方都会有fruit_id分布在数据库周围,有些数据库包含1m以上的行,所以手动MySQL查询有点不可能。

我们需要组合的实际数据库比上面的伪表复杂得多,但我想看看是否有任何逻辑/工具/软件可以帮助将MySQL数据库与这种类似的数据组合在一起?

到目前为止没有太多答案。以下是一些初步的努力,展示在一些具有相同格式的数据库之间生成公共密钥:

// Source databases or companies
// -----------------------------
$corp = array();
$corp[0] = 'XYZ Corp';
$corp[1] = 'ABC Corp';
$corp[2] = '123 Corp';
// Source fruit lists
// ------------------
$fruit = array();
$fruit[0]['orange'] = 1;
$fruit[0]['apple']  = 2;
$fruit[0]['banana'] = 3;
$fruit[0]['kiwi']   = 4;
$fruit[1]['apple']  = 1;
$fruit[1]['orange'] = 2;
$fruit[1]['banana'] = 3;
$fruit[1]['pear']   = 4;
$fruit[2]['banana'] = 1;
$fruit[2]['apple']  = 2;
$fruit[2]['orange'] = 3;
$fruit[2]['grape']  = 4;
// Master fruit list
// -----------------
echo "Generating common fruit key list...<br>'n";
// Generate common keys in sorted item order
// -----------------------------------------
$common = array();
foreach ($corp as $corpid => $name)
{
  echo "<br>'n";
  echo "... examining fruit keys for " . $corp[$corpid] . "<br>'n";
  foreach ($fruit[$corpid] as $name => $id)
  {
    $common[$name] = 0;
    echo "... ... fruit $name has key $id<br>'n";
  }
}
ksort($common);
$i = 0;
foreach ($common as $name => $dummy )
   $common[$name] = ++$i;
echo "<br>'n";   
print_r($common);
echo "<br><br>'n";
// Demonstrate index conversions
// -----------------------------
echo "Demonstrating conversions to common indexes per company...<br>'n";
foreach ($corp as $corpid => $name)
{
  echo "<br>'n";
  echo "... examining key conversions for " . $corp[$corpid] . "<br>'n";
  foreach ($fruit[$corpid] as $name => $id)
  {
    $new = $common[$name];
    $old = $id;
    echo "... ... fruit $name key changes from $old to $new<br>'n";
  }
}

你会得到类似于以下的结果:

Generating common fruit key list...
... examining fruit keys for XYZ Corp
... ... fruit orange has key 1
... ... fruit apple has key 2
... ... fruit banana has key 3
... ... fruit kiwi has key 4
... examining fruit keys for ABC Corp
... ... fruit apple has key 1
... ... fruit orange has key 2
... ... fruit banana has key 3
... ... fruit pear has key 4
... examining fruit keys for 123 Corp
... ... fruit banana has key 1
... ... fruit apple has key 2
... ... fruit orange has key 3
... ... fruit grape has key 4
Array ( [apple] => 1 [banana] => 2 [grape] => 3 [kiwi] => 4 [orange] => 5 [pear] => 6 ) 
Demonstrating conversions to common indexes per company...
... examining key conversions for XYZ Corp
... ... fruit orange key changes from 1 to 5
... ... fruit apple key changes from 2 to 1
... ... fruit banana key changes from 3 to 2
... ... fruit kiwi key changes from 4 to 4
... examining key conversions for ABC Corp
... ... fruit apple key changes from 1 to 1
... ... fruit orange key changes from 2 to 5
... ... fruit banana key changes from 3 to 2
... ... fruit pear key changes from 4 to 6
... examining key conversions for 123 Corp
... ... fruit banana key changes from 1 to 2
... ... fruit apple key changes from 2 to 1
... ... fruit orange key changes from 3 to 5
... ... fruit grape key changes from 4 to 3

需要考虑的一些问题:

  • 我认为您无法安全地更新表,因为如上所述,键会发生冲突。如果您愿意,可以通过调整新的键值使其不与任何现有键值重叠来避免这种情况。例如,在这种情况下,您可以向所有新的公共密钥添加一个整数,例如100000。这可能允许对先前的键值进行"就地"更新——这可能对早期测试或替代转换路径有用
  • 你需要一个明确的表列表来应用转换。如果ID总是使用相同的名称,这可以通过扫描数据库表格式(可能通过描述表名)来实现
  • 如果您在数据库中确实有一个逗号分隔的ID值列表,如图所示,您将不得不做额外的工作来识别它们,解析它们,并适当地更新它们
  • 你会注意到,我为每家公司添加了一个独特的库存项目(即猕猴桃、梨、葡萄),因为它们可能不都有完全相同的产品集

这应该为思考这个问题提供一个起点。将所有源数据库处理到一个新数据库中,然后验证合并结果可能需要多长时间,这很难说。