MySQL - 如何规范化包含分隔符分隔 ID 的列


MySQL - How to normalize column containing delimiter-separated IDs

我正在尝试规范化一个表,以前的开发人员设计了一个包含管道分隔ID的列,这些ID链接到同一表中的其他行。

客户表

id    |    aliases (VARCHAR)
----------------------------
1     |    |4|58|76
2     |    
3     |
4     |    |1|58|76
...   |    
58    |    |1|4|76
...   |
76    |    |1|4|58

所以客户 1、4、58 和 76 都是彼此的"别名"。 客户 2 和 3 没有别名,因此该字段包含一个空字符串。

我想取消整个"别名"系统,并对数据进行规范化,以便我可以将这些其他客户全部映射到一条记录。因此,我希望客户 1、4、58 和 76 的相关表数据都仅映射到客户 1。

我想我会填充一个新表,稍后我可以加入并在其他表上执行更新。

联接表

id  |  customer_id  |  alias_id
-------------------------------
1   |  1            |  4
2   |  1            |  58
3   |  1            |  76

如何将第一个表中的数据转换为上述格式?如果这在纯SQL中绝对是一场噩梦,我将编写一个PHP脚本来尝试完成这项工作并插入数据。

当我开始回答这个问题时,我认为它会快速而简单,因为我曾经在SQL Server中做过非常相似的事情,但是证明翻译的概念迅速发展到这个完整的解决方案中。

从您的问题中不清楚的一个警告是,您是否有条件声明主 id 与别名 id。例如,此解决方案将允许 1 的别名为 4,4 的别名为 1,这与简化示例问题中提供的数据一致。

为了设置此示例的数据,我使用了以下结构:

CREATE TABLE notnormal_customers (
  id INT NOT NULL PRIMARY KEY,
  aliases VARCHAR(10)
);
INSERT INTO notnormal_customers (id,aliases)
VALUES
(1,'|4|58|76'),
(2,''),
(3,''),
(4,'|1|58|76'),
(58,'|1|4|76'),
(76,'|1|4|58');

首先,为了表示一个客户到多个别名的一对多关系,我创建了这个表:

CREATE TABLE customer_aliases (
    primary_id INT NOT NULL,
    alias_id INT NOT NULL,
    FOREIGN KEY (primary_id) REFERENCES notnormal_customers(id),
    FOREIGN KEY (alias_id)   REFERENCES notnormal_customers(id),
    /* clustered primary key prevents duplicates */
    PRIMARY KEY (primary_id,alias_id)
)

最重要的是,我们将使用自定义SPLIT_STR函数:

CREATE FUNCTION SPLIT_STR(
  x VARCHAR(255),
  delim VARCHAR(12),
  pos INT
)
RETURNS VARCHAR(255)
RETURN REPLACE(SUBSTRING(SUBSTRING_INDEX(x, delim, pos),
       LENGTH(SUBSTRING_INDEX(x, delim, pos -1)) + 1),
       delim, '');

然后,我们将创建一个存储过程来完成所有工作。代码使用对源引用的注释进行批注。

DELIMITER $$
CREATE PROCEDURE normalize_customers()
BEGIN
  DECLARE cust_id INT DEFAULT 0;
  DECLARE al_id INT UNSIGNED DEFAULT 0;
  DECLARE alias_str VARCHAR(10) DEFAULT '';
  /* set the value of the string delimiter */
  DECLARE string_delim CHAR(1) DEFAULT '|';
  DECLARE count_aliases INT DEFAULT 0;
  DECLARE i INT DEFAULT 1;
  /*
    use cursor to iterate through all customer records
    http://burnignorance.com/mysql-tips/how-to-loop-through-a-result-set-in-mysql-strored-procedure/
  */
  DECLARE done INT DEFAULT 0;
  DECLARE cur CURSOR FOR
      SELECT `id`, `aliases`
      FROM `notnormal_customers`;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
  OPEN cur;
  read_loop: LOOP
    /*
      Fetch one record from CURSOR and set to customer id and alias string.
      If not found then `done` will be set to 1 by continue handler.
    */
    FETCH cur INTO cust_id, alias_str;
    IF done THEN
        /* If done set to 1 then exit the loop, else continue. */
        LEAVE read_loop;
    END IF;
    /* skip to next record if no aliases */
    IF alias_str = '' THEN
      ITERATE read_loop;
    END IF;
    /*
      get number of aliases
      https://pisceansheart.wordpress.com/2008/04/15/count-occurrence-of-character-in-a-string-using-mysql/
    */
    SET count_aliases = LENGTH(alias_str) - LENGTH(REPLACE(alias_str, string_delim, ''));
    /* strip off the first pipe to make it compatible with our SPLIT_STR function */
    SET alias_str = SUBSTR(alias_str, 2);
    /*
      iterate and get each alias from custom split string function
      https://stackoverflow.com/questions/18304857/split-delimited-string-value-into-rows
    */
    WHILE i <= count_aliases DO
      /* get the next alias id */
      SET al_id = CAST(SPLIT_STR(alias_str, string_delim, i) AS UNSIGNED);
      /* REPLACE existing values instead of insert to prevent errors on primary key */
      REPLACE INTO customer_aliases (primary_id,alias_id) VALUES (cust_id,al_id);
      SET i = i+1;
    END WHILE;
    SET i = 1;
  END LOOP;
  CLOSE cur;
END$$
DELIMITER ;

最后,您可以通过调用以下命令简单地运行它:

CALL normalize_customers();

然后,您可以在控制台中检查数据:

mysql> select * from customer_aliases;
+------------+----------+
| primary_id | alias_id |
+------------+----------+
|          4 |        1 |
|         58 |        1 |
|         76 |        1 |
|          1 |        4 |
|         58 |        4 |
|         76 |        4 |
|          1 |       58 |
|          4 |       58 |
|         76 |       58 |
|          1 |       76 |
|          4 |       76 |
|         58 |       76 |
+------------+----------+
12 rows in set (0.00 sec)

更新 2(单查询解决方案(

假设别名列表始终排序,则只需一个查询即可获得结果:

CREATE TABLE aliases (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  customer_id INT UNSIGNED NOT NULL,
  alias_id INT UNSIGNED NOT NULL
) AS
  SELECT NULL AS id, c1.id AS customer_id, c2.id AS alias_id
  FROM customers c1
  JOIN customers c2 
    ON c2.aliases LIKE CONCAT('|', c1.id , '|%') -- c1.id is the first alias of c2.id
  WHERE c1.id < (SUBSTRING(c1.aliases,2)+0) -- c1.id is smaller than the first alias of c2.id

如果对aliases列编制索引,它也会快得多,因此范围搜索将支持 JOIN。

SQLfiddle

原答案

如果将管道替换为逗号,则可以使用 FIND_IN_SET 函数。

我将首先创建一个临时表(不需要是技术上的临时表(来存储逗号分隔的别名列表:

CREATE TABLE tmp (`id` int, `aliases` varchar(50));
INSERT INTO tmp(`id`, `aliases`)
  SELECT id, REPLACE(aliases, '|', ',')  AS aliases
  FROM customers;

然后使用 JOINs ON 子句中的FIND_IN_SET填充规范化表:

CREATE TABLE aliases (`id` int, `customer_id` int, `alias_id` int) AS
  SELECT t.id as customer_id, c.id AS alias_id
  FROM tmp t
  JOIN customers c ON find_in_set(c.id, t.aliases);

如果需要 - 删除具有较高customer_id的重复项(仅保留最低(:

DELETE FROM aliases 
WHERE customer_id IN (SELECT * FROM(
  SELECT DISTINCT a1.customer_id
  FROM aliases a1
  JOIN aliases a2
    ON  a2.customer_id = a1.alias_id
    AND a1.customer_id = a2.alias_id
    AND a1.customer_id > a1.alias_id
)derived);

如果需要 - 创建AUTO_INCREMENT id:

ALTER TABLE aliases ADD column id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;

aliases表现在将如下所示:

| id | customer_id | alias_id |
|----|-------------|----------|
|  1 |           1 |        4 |
|  2 |           1 |       58 |
|  3 |           1 |       76 |

SQLfiddle

不要忘记定义正确的索引。

更新 1

您可以跳过创建临时表并使用 LIKE 而不是FIND_IN_SET填充aliases表:

CREATE TABLE aliases (`customer_id` int, `alias_id` int) AS
  SELECT c2.id as customer_id, c1.id AS alias_id
  FROM customers c1
  JOIN customers c2 
    ON CONCAT(c1.aliases, '|') LIKE CONCAT('%|', c2.id , '|%');

SQLfiddle

使用整数表 (0-9( - 尽管您可以使用(SELECT 0 i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3...etc.)实现相同的目标...

 SELECT DISTINCT id old_id /* the technique below inevitably creates duplicates. */
                          /* DISTINCT discards them. */
              , SUBSTRING_INDEX(
                  SUBSTRING_INDEX(SUBSTR(aliases,2),'|',i+1) /* isolate text between */
                        ,'|',-1) x                           /* each pipe and the next */
           FROM customers
              , ints      /* do this for the first 10 pipes in each string */
          ORDER
             BY id,x+0    /* implicit CASTING */
+--------+------+
| old_id | x    |
+--------+------+
|      1 | 4    |
|      1 | 58   |
|      1 | 76   |
|      2 | NULL |
|      3 | NULL |
|      4 | 1    |
|      4 | 58   |
|      4 | 76   |
|     58 | 1    |
|     58 | 4    |
|     58 | 76   |
|     76 | 1    |
|     76 | 4    |
|     76 | 58   |
+--------+------+

(编辑:添加了行内注释(