Finding duplicate rows in SQL Server


231

I have a SQL Server database of organizations, and there are many duplicate rows. I want to run a select statement to grab all of these along with the count of duplicates, and also return the IDs that are associated with each organization.

A statement such as:

SELECT     orgName, COUNT(*) AS dupes  
FROM         organizations  
GROUP BY orgName  
HAVING      (COUNT(*) > 1)

would return something like

orgName        | dupes  
ABC Corp       | 7  
Foo Federation | 5  
Widget Company | 2 

But I would also like to grab their IDs. Is there a way to do this? Maybe something like

orgName        | dupeCount | id  
ABC Corp       | 1         | 34  
ABC Corp       | 2         | 5  
...  
Widget Company | 1         | 10  
Widget Company | 2         | 2  

The reason is that there is also a separate table of users that links to these organizations, and I would like to unify them (so the duplicates are removed and the users link to the same organization instead of to duplicate orgs). But I would like to do that part by hand so I don't mess anything up; I still need a statement returning the IDs of all the duplicated organizations so I can go through the list of users.
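
For context, the manual cleanup step I have in mind would look roughly like the sketch below; the users table's orgId column is an assumption about my schema, and the IDs are placeholders I would fill in by hand after reviewing the duplicate list.

-- Hypothetical sketch of merging one duplicate into the org we keep
-- (@dupeId / @keepId are filled in manually per duplicate pair)
DECLARE @dupeId int = 34, @keepId int = 5;

-- repoint users from the duplicate org to the surviving org
UPDATE users
SET orgId = @keepId
WHERE orgId = @dupeId;

-- then remove the now-unreferenced duplicate org
DELETE FROM organizations
WHERE id = @dupeId;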

Answers:


313
select o.orgName, oc.dupeCount, o.id
from organizations o
inner join (
    SELECT orgName, COUNT(*) AS dupeCount
    FROM organizations
    GROUP BY orgName
    HAVING COUNT(*) > 1
) oc on o.orgName = oc.orgName

4
Are there any limitations to this query, for example if the number of records exceeds 10 million?
Steam

3
@Steam You are correct: this answer is ineffective on large databases with millions of records. The GroupBy/Having answer submitted by Aykut is recommended instead, since the database can optimize it better. One exception: I would suggest Count(0) rather than Count(*), to keep things simple.
Mike Christian

1
@Mike - why Count(0) vs Count(*)?
KornMuffin 2015

2
@KornMuffin In retrospect, my comment about Count() doesn't hold up. Using a non-null evaluation inside Count() is only useful when you want to count the non-null results returned by an outer join. Otherwise, use Count(*). A good explanation can be found here.
Mike Christian
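
To see the difference in practice, here is a small sketch (the users table with id and orgId columns is assumed for illustration, not taken from the question): COUNT(*) counts every row the outer join returns, while COUNT(u.id) counts only the rows where a matching user was found.

-- COUNT(*) vs COUNT(column) on an outer join (users table is hypothetical)
SELECT o.orgName,
       COUNT(*)    AS joinedRows,
       COUNT(u.id) AS usersFound
FROM organizations o
LEFT JOIN users u ON u.orgId = o.id
GROUP BY o.orgName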

Use isnull() in the ON clause for columns that can be null
Arif Ulusoy

92

You can run the following query to find the duplicates with max(id) and then delete those rows.

SELECT orgName, COUNT(*) AS dupes, MAX(ID) AS maxId
FROM organizations
GROUP BY orgName
HAVING (COUNT(*) > 1)

But you have to run this query multiple times.
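
If you would rather not re-run it by hand, one possible sketch is to wrap the same MAX(id) delete in a loop until no orgName has more than one row:

-- keep deleting the newest copy of each duplicated orgName until none remain
WHILE EXISTS (SELECT orgName FROM organizations GROUP BY orgName HAVING COUNT(*) > 1)
BEGIN
    DELETE FROM organizations
    WHERE id IN (
        SELECT MAX(id)
        FROM organizations
        GROUP BY orgName
        HAVING COUNT(*) > 1
    );
END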


You would have to run it exactly MAX( COUNT(*) ) - 1 times, which is probably still feasible.
DerMike '16

1
Hi, is there any way to get all of the IDs rather than just the max id? I can use max and min, but what about more than 2? @DerMike
Arijit Mukherjee
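
One sketch of an answer to that comment (not from the original answer): on SQL Server 2017 and later, STRING_AGG can collect every duplicate id per orgName into a single comma-separated column.

-- all ids of each duplicated orgName in one row (requires SQL Server 2017+)
SELECT orgName,
       COUNT(*) AS dupeCount,
       STRING_AGG(CAST(id AS varchar(20)), ',') AS ids
FROM organizations
GROUP BY orgName
HAVING COUNT(*) > 1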

31

You can do this:

SELECT
    o.id, o.orgName, d.intCount
FROM (
     SELECT orgName, COUNT(*) as intCount
     FROM organizations
     GROUP BY orgName
     HAVING COUNT(*) > 1
) AS d
    INNER JOIN organizations o ON o.orgName = d.orgName

If you want to return just the records that could be deleted (keeping one of each), you can use:

SELECT
    id, orgName
FROM (
     SELECT 
         orgName, id,
         ROW_NUMBER() OVER (PARTITION BY orgName ORDER BY id) AS intRow
     FROM organizations
) AS d
WHERE intRow != 1

Edit: SQL Server 2000 does not have the ROW_NUMBER() function. Instead you can use:

SELECT
    o.id, o.orgName, d.intCount
FROM (
     SELECT orgName, COUNT(*) as intCount, MIN(id) AS minId
     FROM organizations
     GROUP BY orgName
     HAVING COUNT(*) > 1
) AS d
    INNER JOIN organizations o ON o.orgName = d.orgName
WHERE d.minId != o.id

The first statement works, but the second one doesn't seem to.
xtine 2010

SQL Server doesn't seem to recognize row_number()?
xtine 2010

Ah... do you have an older version of SQL Server? I believe it was introduced in SQL Server 2005.
Paul

3
Thanks again, every time I need to do this I come back here and love it.

9

The solution marked as correct didn't work for me, but I found this answer very useful for getting a list of the duplicate rows in MySQL:

SELECT n1.* 
FROM myTable n1
INNER JOIN myTable n2 
ON n2.repeatedCol = n1.repeatedCol
WHERE n1.id <> n2.id

There will be a lot of duplicates in the result set, so you have to deal with those as well.
Renan

1
If the IDs are numeric, checking n1.id > n2.id will prevent each pair from showing up twice.
2016
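
Applied to the query above, that suggestion would look roughly like this:

-- each duplicate pair is reported only once (assumes numeric, unique id values)
SELECT n1.*
FROM myTable n1
INNER JOIN myTable n2
    ON n2.repeatedCol = n1.repeatedCol
   AND n1.id > n2.id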

9

You can try this; it should work best for you:

 WITH CTE AS
    (
    SELECT *,RN=ROW_NUMBER() OVER (PARTITION BY orgName ORDER BY orgName DESC) FROM organizations 
    )
    select * from CTE where RN>1
    go

Is there any way to get all the IDs, comma-separated or in separate columns?
Arijit Mukherjee 2016

6

If you want to delete the duplicates:

WITH CTE AS(
   SELECT orgName,id,
       RN = ROW_NUMBER()OVER(PARTITION BY orgName ORDER BY Id)
   FROM organizations
)
DELETE FROM CTE WHERE RN > 1

6
select * from [Employees]

Find duplicate records: 1) Using a CTE

with mycte
as
(
select Name,EmailId,ROW_NUMBER() over(partition by Name,EmailId order by id) as Duplicate from [Employees]
)
select * from mycte

2) Using GROUP BY

select Name,EmailId,COUNT(name) as Duplicate from  [Employees] group by Name,EmailId 

This is the fastest solution when selecting from more than 10m rows of data. Thanks.
Fandango68

4
Select * from (Select orgName,id,
ROW_NUMBER() OVER(Partition By OrgName ORDER by id DESC) Rownum
From organizations )tbl Where Rownum>1

So the records with Rownum > 1 will be the duplicate records in the table. "Partition By" first groups the records and then serializes them by assigning sequence numbers, so the rows with Rownum > 1 are the duplicate records that can be deleted.


I like this one because it lets you easily add more columns to the inner select. So if you want to return other columns from the organizations table, you don't have to GROUP BY those columns.
Gwasshoppa


2
select a.orgName, b.duplicate, a.id
from organizations a
inner join (
    SELECT orgName, COUNT(*) AS duplicate
    FROM organizations
    GROUP BY orgName
    HAVING COUNT(*) > 1
) b on a.orgName = b.orgName
group by a.orgName, b.duplicate, a.id

1
select orgname, count(*) as dupes, id 
from organizations
where orgname in (
    select orgname
    from organizations
    group by orgname
    having (count(*) > 1)
)
group by orgname, id

1

There are several ways to select duplicate rows.

For my solution, first consider this table:

CREATE TABLE #Employee
(
ID          INT,
FIRST_NAME  NVARCHAR(100),
LAST_NAME   NVARCHAR(300)
)

INSERT INTO #Employee VALUES ( 1, 'Ardalan', 'Shahgholi' );
INSERT INTO #Employee VALUES ( 2, 'name1', 'lname1' );
INSERT INTO #Employee VALUES ( 3, 'name2', 'lname2' );
INSERT INTO #Employee VALUES ( 2, 'name1', 'lname1' );
INSERT INTO #Employee VALUES ( 3, 'name2', 'lname2' );
INSERT INTO #Employee VALUES ( 4, 'name3', 'lname3' );

First solution:

SELECT DISTINCT *
FROM   #Employee;

WITH #DeleteEmployee AS (
                     SELECT ROW_NUMBER()
                            OVER(PARTITION BY ID, First_Name, Last_Name ORDER BY ID) AS
                            RNUM
                     FROM   #Employee
                 )

SELECT *
FROM   #DeleteEmployee
WHERE  RNUM > 1

SELECT DISTINCT *
FROM   #Employee

Second solution: use an identity field

SELECT DISTINCT *
FROM   #Employee;

ALTER TABLE #Employee ADD UNIQ_ID INT IDENTITY(1, 1)

SELECT *
FROM   #Employee
WHERE  UNIQ_ID < (
    SELECT MAX(UNIQ_ID)
    FROM   #Employee a2
    WHERE  #Employee.ID = a2.ID
           AND #Employee.FIRST_NAME = a2.FIRST_NAME
           AND #Employee.LAST_NAME = a2.LAST_NAME
)

ALTER TABLE #Employee DROP COLUMN UNIQ_ID

SELECT DISTINCT *
FROM   #Employee

Finish all of the solutions with this command:

DROP TABLE #Employee

0

I think I know what you need. I needed to mix between the answers, and I think I got the solution he wanted:

select o.id,o.orgName, oc.dupeCount, oc.id,oc.orgName
from organizations o
inner join (
    SELECT MAX(id) as id, orgName, COUNT(*) AS dupeCount
    FROM organizations
    GROUP BY orgName
    HAVING COUNT(*) > 1
) oc on o.orgName = oc.orgName

Having the max id gives you both the id of the duplicate and the original id he asked for:

id, org name, duplicate count (left out in this case)
id of duplicate, duplicate org name, duplicate count (left out again because it does not help in this case)

You will just get it out in this form:

id , name , dubid , name

Hope it still helps.


0

Suppose we have a table 'Student' with 2 columns:

  • student_id int
  • student_name varchar

    Records:
    +------------+---------------------+
    | student_id | student_name        |
    +------------+---------------------+
    |        101 | usman               |
    |        101 | usman               |
    |        101 | usman               |
    |        102 | usmanyaqoob         |
    |        103 | muhammadusmanyaqoob |
    |        103 | muhammadusmanyaqoob |
    +------------+---------------------+

Now if we want to see the duplicate records, use this query:

select student_name, student_id, count(*) as c from student group by student_id, student_name having count(*) > 1;

+---------------------+------------+---+
| student_name        | student_id | c |
+---------------------+------------+---+
| usman               |        101 | 3 |
| muhammadusmanyaqoob |        103 | 2 |
+---------------------+------------+---+

0

I have a better option for getting the duplicate records in a table:

SELECT x.studid, y.stdname, y.dupecount
FROM student AS x
INNER JOIN (
    SELECT a.stdname, COUNT(*) AS dupecount
    FROM student AS a
    INNER JOIN studmisc AS b ON a.studid = b.studid
    WHERE (a.studid LIKE '2018%') AND (b.studstatus = 4)
    GROUP BY a.stdname
    HAVING (COUNT(*) > 1)
) AS y ON x.stdname = y.stdname
INNER JOIN studmisc AS z ON x.studid = z.studid
WHERE (x.studid LIKE '2018%') AND (z.studstatus = 4)
ORDER BY x.stdname

The result of the above query shows all of the duplicated names, which have unique student IDs, along with the number of times each one repeats.



0
/* To get duplicate data in the table */

SELECT COUNT(EmpCode), EmpCode FROM tbl_Employees WHERE Status=1
GROUP BY EmpCode HAVING COUNT(EmpCode) > 1

0

I use two methods to find duplicate rows. The first is the most well-known approach; the second uses a CTE - Common Table Expression.

As @RedFilter mentioned, that approach is also correct. Many times I have found the CTE method useful for me too.

WITH TempOrg (orgName,RepeatCount)
AS
(
SELECT orgName,ROW_NUMBER() OVER(PARTITION by orgName ORDER BY orgName) 
AS RepeatCount
FROM dbo.organizations
)
select t.*, e.id from organizations e
inner join TempOrg t on t.orgName= e.orgName
where t.RepeatCount>1

In the example above, we gather the results by using ROW_NUMBER and PARTITION BY to find the repeated occurrences. Then we apply the WHERE clause to select only the rows whose repeat count is greater than 1. All results are collected in the CTE table and joined to the organizations table.

Source: CodoBee


-2

Try

SELECT orgName, id, count(*) as dupes
FROM organizations
GROUP BY orgName, id
HAVING count(*) > 1;