Delete all duplicates


8

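I am trying to delete all duplicates while keeping a single record per email (the one with the smaller id). The following query does delete duplicates, but it takes many iterations to remove all the copies and keep only the original.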

DELETE FROM emailTable WHERE id IN (
 SELECT * FROM (
    SELECT id FROM emailTable GROUP BY email HAVING ( COUNT(email) > 1 )
 ) AS q
)

It's MySQL.
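
The query needs several passes because the inner GROUP BY yields only one id per duplicated email, so each run removes a single row from every group that still has duplicates. A minimal sketch of that behaviour, assuming Python's sqlite3 module as a stand-in for MySQL (SQLite also accepts the bare id column in a GROUP BY query); the count_passes helper and the example.test addresses are made up for illustration:

```python
import sqlite3

# The question's query, unchanged.
ONE_PER_GROUP_DELETE = """
DELETE FROM emailTable WHERE id IN (
 SELECT * FROM (
    SELECT id FROM emailTable GROUP BY email HAVING ( COUNT(email) > 1 )
 ) AS q
)
"""

def count_passes(emails):
    """Re-run the delete until it removes nothing; return (passes, rows left)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE emailTable (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO emailTable (email) VALUES (?)", [(e,) for e in emails]
    )
    passes = 0
    while True:
        removed = conn.execute(ONE_PER_GROUP_DELETE).rowcount
        if removed == 0:
            break  # no email has more than one row any more
        passes += 1
    remaining = conn.execute("SELECT COUNT(*) FROM emailTable").fetchone()[0]
    conn.close()
    return passes, remaining

# Hypothetical data: one email three times, one twice, one once.
passes, remaining = count_passes(
    ["a@example.test"] * 3 + ["b@example.test"] * 2 + ["c@example.test"]
)
print(passes, remaining)
```

The number of passes equals the size of the largest duplicate group minus one, which is why a table with heavily repeated emails needs the query to be run many times. Which id survives in each group is not guaranteed by this form of the query.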

Edit #1: DDL

CREATE TABLE `emailTable` (
 `id` mediumint(9) NOT NULL auto_increment,
 `email` varchar(200) NOT NULL default '',
 PRIMARY KEY  (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=298872 DEFAULT CHARSET=latin1

Edit #2: This worked like a charm, following @DTest's lead:

DELETE FROM emailTable WHERE NOT EXISTS (
 SELECT * FROM (
    SELECT MIN(id) minID FROM emailTable    
    GROUP BY email HAVING COUNT(*) > 0
  ) AS q
  WHERE minID=id
)

Answers:


8

Try this:

DELETE FROM emailTable WHERE NOT EXISTS (
 SELECT * FROM (
    SELECT MIN(id) minID FROM emailTable    
    GROUP BY email HAVING COUNT(*) > 0
  ) AS q
  WHERE minID=id
)

The above worked for me in a test with 50 emails (5 distinct emails, each repeated 10 times).
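
A test of that shape can be sketched as follows. This is only an illustration under stated assumptions: it runs the same statement against an in-memory SQLite database through Python's sqlite3 module rather than against MySQL, and the run_dedupe_test helper and the example.test addresses are hypothetical.

```python
import sqlite3

# The DELETE from the answer, unchanged.
DEDUPE_SQL = """
DELETE FROM emailTable WHERE NOT EXISTS (
 SELECT * FROM (
    SELECT MIN(id) minID FROM emailTable
    GROUP BY email HAVING COUNT(*) > 0
  ) AS q
  WHERE minID=id
)
"""

def run_dedupe_test(distinct=5, copies=10):
    """Load distinct*copies rows, dedupe, return (rows deleted, ids kept)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE emailTable ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " email TEXT NOT NULL DEFAULT '')"
    )
    # Interleave the addresses so duplicates are not adjacent.
    emails = [f"user{i}@example.test" for i in range(distinct)] * copies
    conn.executemany(
        "INSERT INTO emailTable (email) VALUES (?)", [(e,) for e in emails]
    )
    deleted = conn.execute(DEDUPE_SQL).rowcount
    kept_ids = [r[0] for r in conn.execute("SELECT id FROM emailTable ORDER BY id")]
    conn.close()
    return deleted, kept_ids

deleted, kept_ids = run_dedupe_test()
print(deleted, kept_ids)
```

With 5 addresses repeated 10 times, a single run removes 45 rows and leaves the lowest id of each address, so no repeated passes are needed.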

You may want to add an index on the email column:

ALTER TABLE emailTable ADD INDEX ind_email (email);

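With about 250,000 rows this may be a bit slow. It was slow for me on a table with 1.5 million rows (properly indexed), which is how I came up with this strategy: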

/* CREATE MEMORY TABLE TO HOUSE IDs of the MIN */
CREATE TABLE email_min (minID INT, PRIMARY KEY(minID)) ENGINE=Memory;

/* INSERT THE MINIMUM ID OF EACH EMAIL */
INSERT INTO email_min SELECT MIN(id) FROM emailTable
    GROUP BY email;

/* MAKE SURE YOU HAVE RIGHT INFO (THESE ARE THE ROWS THAT WILL BE DELETED) */
SELECT * FROM emailTable
 WHERE NOT EXISTS (SELECT * FROM email_min WHERE minID=id);

/* DELETE THE DUPLICATES FROM emailTable */
DELETE FROM emailTable
 WHERE NOT EXISTS (SELECT * FROM email_min WHERE minID=id);

/* IF ALL IS WELL, DROP MEMORY TABLE */
DROP TABLE email_min;

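The benefit of the memory table is that an index is used (the primary key on minID), which speeds up the process compared with a normal temporary table.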
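
The helper-table strategy can be tried end to end with a small sketch. It assumes Python's sqlite3 module in place of MySQL, so a TEMP table with an INTEGER PRIMARY KEY stands in for the ENGINE=Memory table, and it targets the question's emailTable; the example.test rows are made up.

```python
import sqlite3

# Autocommit mode keeps the DDL/DML sequence simple for this sketch.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE emailTable (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO emailTable (email) VALUES (?)",
    [(e,) for e in [
        "a@example.test", "a@example.test", "b@example.test",
        "a@example.test", "b@example.test", "c@example.test",
    ]],
)

# 1. Helper table holding the lowest id per email; its primary key
#    gives the NOT EXISTS lookups below an index to use.
conn.execute("CREATE TEMP TABLE email_min (minID INTEGER PRIMARY KEY)")
conn.execute("INSERT INTO email_min SELECT MIN(id) FROM emailTable GROUP BY email")

# 2. Preview the rows that are about to be removed.
NOT_MIN = "NOT EXISTS (SELECT 1 FROM email_min WHERE minID = emailTable.id)"
doomed = [r[0] for r in conn.execute(
    f"SELECT id FROM emailTable WHERE {NOT_MIN} ORDER BY id"
)]

# 3. Delete them, then drop the helper table.
conn.execute(f"DELETE FROM emailTable WHERE {NOT_MIN}")
conn.execute("DROP TABLE email_min")

kept = [r[0] for r in conn.execute("SELECT id FROM emailTable ORDER BY id")]
print(doomed, kept)
```

Running the preview SELECT and the DELETE with the same predicate means the rows inspected in step 2 are exactly the rows removed in step 3.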


4

Here is a more streamlined deletion process:

CREATE TABLE emailUnique LIKE emailTable;
ALTER TABLE emailUnique ADD UNIQUE INDEX (email);
INSERT IGNORE INTO emailUnique SELECT * FROM emailTable;
SELECT * FROM emailUnique;
ALTER TABLE emailTable  RENAME emailTable_old;
ALTER TABLE emailUnique RENAME emailTable;
DROP TABLE emailTable_old;

Here is some sample data:

use test
DROP TABLE IF EXISTS emailTable;
CREATE TABLE `emailTable` (
 `id` mediumint(9) NOT NULL auto_increment,
 `email` varchar(200) NOT NULL default '',
 PRIMARY KEY  (`id`)
) ENGINE=MyISAM;
INSERT INTO emailTable (email) VALUES
('redwards@gmail.com'),
('redwards@gmail.com'),
('redwards@gmail.com'),
('redwards@gmail.com'),
('rolandoedwards@gmail.com'),
('rolandoedwards@gmail.com'),
('rolandoedwards@gmail.com'),
('red@gmail.com'),
('red@gmail.com'),
('red@gmail.com'),
('rolandoedwards@gmail.com'),
('rolandoedwards@gmail.com'),
('rolandoedwards@comcast.net'),
('rolandoedwards@comcast.net'),
('rolandoedwards@comcast.net');
SELECT * FROM emailTable;

I ran it. Here are the results:

mysql> use test
Database changed
mysql> DROP TABLE IF EXISTS emailTable;
Query OK, 0 rows affected (0.01 sec)

mysql> CREATE TABLE `emailTable` (
    ->  `id` mediumint(9) NOT NULL auto_increment,
    ->  `email` varchar(200) NOT NULL default '',
    ->  PRIMARY KEY  (`id`)
    -> ) ENGINE=MyISAM;
Query OK, 0 rows affected (0.05 sec)

mysql> INSERT INTO emailTable (email) VALUES
    -> ('redwards@gmail.com'),
    -> ('redwards@gmail.com'),
    -> ('redwards@gmail.com'),
    -> ('redwards@gmail.com'),
    -> ('rolandoedwards@gmail.com'),
    -> ('rolandoedwards@gmail.com'),
    -> ('rolandoedwards@gmail.com'),
    -> ('red@gmail.com'),
    -> ('red@gmail.com'),
    -> ('red@gmail.com'),
    -> ('rolandoedwards@gmail.com'),
    -> ('rolandoedwards@gmail.com'),
    -> ('rolandoedwards@comcast.net'),
    -> ('rolandoedwards@comcast.net'),
    -> ('rolandoedwards@comcast.net');
Query OK, 15 rows affected (0.00 sec)
Records: 15  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM emailTable;
+----+----------------------------+
| id | email                      |
+----+----------------------------+
|  1 | redwards@gmail.com         |
|  2 | redwards@gmail.com         |
|  3 | redwards@gmail.com         |
|  4 | redwards@gmail.com         |
|  5 | rolandoedwards@gmail.com   |
|  6 | rolandoedwards@gmail.com   |
|  7 | rolandoedwards@gmail.com   |
|  8 | red@gmail.com              |
|  9 | red@gmail.com              |
| 10 | red@gmail.com              |
| 11 | rolandoedwards@gmail.com   |
| 12 | rolandoedwards@gmail.com   |
| 13 | rolandoedwards@comcast.net |
| 14 | rolandoedwards@comcast.net |
| 15 | rolandoedwards@comcast.net |
+----+----------------------------+
15 rows in set (0.00 sec)

mysql> CREATE TABLE emailUnique LIKE emailTable;
Query OK, 0 rows affected (0.04 sec)

mysql> ALTER TABLE emailUnique ADD UNIQUE INDEX (email);
Query OK, 0 rows affected (0.06 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> INSERT IGNORE INTO emailUnique SELECT * FROM emailTable;
Query OK, 4 rows affected (0.01 sec)
Records: 15  Duplicates: 11  Warnings: 0

mysql> SELECT * FROM emailUnique;
+----+----------------------------+
| id | email                      |
+----+----------------------------+
|  1 | redwards@gmail.com         |
|  5 | rolandoedwards@gmail.com   |
|  8 | red@gmail.com              |
| 13 | rolandoedwards@comcast.net |
+----+----------------------------+
4 rows in set (0.00 sec)

mysql> ALTER TABLE emailTable  RENAME emailTable_old;
Query OK, 0 rows affected (0.03 sec)

mysql> ALTER TABLE emailUnique RENAME emailTable;
Query OK, 0 rows affected (0.00 sec)

mysql> DROP TABLE emailTable_old;
Query OK, 0 rows affected (0.00 sec)

mysql>

As shown, emailTable now contains the first occurrence of each email address together with its original id. For this example:

  • IDs 1-4 have redwards@gmail.com, but only 1 was kept.
  • IDs 5-7, 11 and 12 have rolandoedwards@gmail.com, but only 5 was kept.
  • IDs 8-10 have red@gmail.com, but only 8 was kept.
  • IDs 13-15 have rolandoedwards@comcast.net, but only 13 was kept.
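
The copy-and-swap steps can also be checked outside MySQL. The sketch below is an approximation under stated assumptions: it uses Python's sqlite3 module, where INSERT OR IGNORE plays the role of MySQL's INSERT IGNORE and the UNIQUE constraint is declared in the CREATE TABLE instead of via CREATE TABLE ... LIKE plus ALTER TABLE. The explicit ORDER BY id is an addition that makes "first occurrence wins" deterministic.

```python
import sqlite3

# Same 15 addresses, in the same order, as the sample data above.
SAMPLE = (
    ["redwards@gmail.com"] * 4
    + ["rolandoedwards@gmail.com"] * 3
    + ["red@gmail.com"] * 3
    + ["rolandoedwards@gmail.com"] * 2
    + ["rolandoedwards@comcast.net"] * 3
)

conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit
conn.execute("CREATE TABLE emailTable (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.executemany("INSERT INTO emailTable (email) VALUES (?)", [(e,) for e in SAMPLE])

# Empty copy of the table with a unique constraint on email.
conn.execute(
    "CREATE TABLE emailUnique (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)"
)
# Rows whose email is already present are silently skipped.
conn.execute(
    "INSERT OR IGNORE INTO emailUnique SELECT id, email FROM emailTable ORDER BY id"
)

# Swap the deduplicated table into place and discard the original.
conn.execute("ALTER TABLE emailTable RENAME TO emailTable_old")
conn.execute("ALTER TABLE emailUnique RENAME TO emailTable")
conn.execute("DROP TABLE emailTable_old")

kept = conn.execute("SELECT id, email FROM emailTable ORDER BY id").fetchall()
for row in kept:
    print(row)
```

One side effect worth noting: the swapped-in table keeps the unique constraint on email, so later inserts of an existing address are rejected (or skipped, when the ignore form is used).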

Caveat: I have answered a similar question about deleting rows from a table via a temp table.

Give it a try!!!


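I edited the question with the query I found that works. That query is simple, but I assume that, technically, your solution would be better on a large table?
Gary Lindahl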

2
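@DTest's answer is similar (it uses an external table), but it uses a MEMORY temp table, whose keys are stored in a HASH index instead of a BTREE. It may work faster. As for data size, as long as there is enough RAM to hold the keys, it is a good solution. Nice one, DTest.
RolandoMySQLDBA, 2011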

2

Here is a really quick Itzik solution. It will work in SQL Server 2005 and later.

WITH Dups AS
(
  SELECT *,
    ROW_NUMBER()
      OVER(PARTITION BY email ORDER BY id) AS rn
  FROM dbo.emailTable
)
DELETE FROM Dups
WHERE rn > 1;

The OP asked for MySQL.
Derek Downey

2
Yes, I just realized that; d'oh! Well, it is a great solution for MS SQL :)
Delux

It's good to know the MS SQL way too :p, but at the moment I'm looking for a MySQL solution.
Gary Lindahl
Licensed under cc by-sa 3.0 with attribution required.