Finding parent rows that have identical sets of child rows

Suppose I have a structure like this:

Recipes table

RecipeID
Name
Description

RecipeIngredients table

RecipeID
IngredientID
Quantity
UOM

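The key on RecipeIngredients is (RecipeID, IngredientID).

What are some good ways to find duplicate recipes? A duplicate recipe is defined as one having exactly the same set of ingredients, with the same quantity for each ingredient.

I've thought of using FOR XML PATH to combine the ingredients into a single column. I haven't fully explored this, but it should work if I make sure the ingredients/UOMs/quantities are sorted in the same order and separated by a proper delimiter. Are there better approaches?

There are 48K recipes and 200K ingredient rows.

— poke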

Answers:

For the following assumed schema and example data

CREATE TABLE dbo.RecipeIngredients
    (
      RecipeId INT NOT NULL ,
      IngredientID INT NOT NULL ,
      Quantity INT NOT NULL ,
      UOM INT NOT NULL ,
      CONSTRAINT RecipeIngredients_PK 
          PRIMARY KEY ( RecipeId, IngredientID ) WITH (IGNORE_DUP_KEY = ON)
    ) ;

INSERT INTO dbo.RecipeIngredients
SELECT TOP (210000) ABS(CRYPT_GEN_RANDOM(8)/50000),
                     ABS(CRYPT_GEN_RANDOM(8) % 100),
                     ABS(CRYPT_GEN_RANDOM(8) % 10),
                     ABS(CRYPT_GEN_RANDOM(8) % 5)
FROM master..spt_values v1,                     
     master..spt_values v2


SELECT DISTINCT RecipeId, 'X' AS Name
INTO Recipes 
FROM  dbo.RecipeIngredients

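This populated 205,009 ingredient rows and 42,613 recipes. The numbers will differ slightly on each run because of the random element.

This assumes relatively few duplicates (the output after an example run was 217 duplicate recipe groups, with two or three recipes per group). Based on the figures in the question, the most pathological case would be 48,000 exact duplicates.

A script to set that up is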

DROP TABLE dbo.RecipeIngredients,Recipes
GO

CREATE TABLE Recipes(
RecipeId INT IDENTITY,
Name VARCHAR(1))

INSERT INTO Recipes 
SELECT TOP 48000 'X'
FROM master..spt_values v1,                     
     master..spt_values v2

CREATE TABLE dbo.RecipeIngredients
    (
      RecipeId INT NOT NULL ,
      IngredientID INT NOT NULL ,
      Quantity INT NOT NULL ,
      UOM INT NOT NULL ,
      CONSTRAINT RecipeIngredients_PK 
          PRIMARY KEY ( RecipeId, IngredientID )) ;

INSERT INTO dbo.RecipeIngredients
SELECT RecipeId,IngredientID,Quantity,UOM
FROM Recipes
CROSS JOIN (SELECT 1,1,1 UNION ALL SELECT 2,2,2 UNION ALL  SELECT 3,3,3 UNION ALL SELECT 4,4,4) I(IngredientID,Quantity,UOM)

For both cases, the following completed in less than a second on my machine.

CREATE TABLE #Concat
  (
     RecipeId     INT,
     concatenated VARCHAR(8000),
     PRIMARY KEY (concatenated, RecipeId)
  )

INSERT INTO #Concat
SELECT R.RecipeId,
       ISNULL(concatenated, '')
FROM   Recipes R
       CROSS APPLY (SELECT CAST(IngredientID AS VARCHAR(10)) + ',' + CAST(Quantity AS VARCHAR(10)) + ',' + CAST(UOM AS VARCHAR(10)) + ','
                    FROM   dbo.RecipeIngredients RI
                    WHERE  R.RecipeId = RecipeId
                    ORDER  BY IngredientID
                    FOR XML PATH('')) X (concatenated);

WITH C1
     AS (SELECT DISTINCT concatenated
         FROM   #Concat)
SELECT STUFF(Recipes, 1, 1, '')
FROM   C1
       CROSS APPLY (SELECT ',' + CAST(RecipeId AS VARCHAR(10))
                    FROM   #Concat C2
                    WHERE  C1.concatenated = C2.concatenated
                    ORDER  BY RecipeId
                    FOR XML PATH('')) R(Recipes)
WHERE  Recipes LIKE '%,%,%'

DROP TABLE #Concat

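One caveat

I assumed that the length of the concatenated string won't exceed 896 bytes. If it does, this will raise an error at run time rather than failing silently. You will then need to remove the primary key (and the implicitly created index) from the #temp table. The maximum length of the concatenated string in my test setup was 125 characters.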
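
If the concatenated strings might exceed the index key limit, one possible workaround (not part of the original answer) is to index a fixed-width hash of the string and still compare the full strings, so that a hash collision cannot produce a wrong match. A minimal sketch, assuming SQL Server 2012 or later for the SHA2_256 algorithm; the table #ConcatHash, the column hash_key, and the sample rows are hypothetical:

```sql
-- Hypothetical variant of #Concat: the index key is a 32-byte hash, not the string itself.
CREATE TABLE #ConcatHash
  (
     RecipeId     INT NOT NULL,
     concatenated VARCHAR(8000) NOT NULL,
     hash_key     AS CAST(HASHBYTES('SHA2_256', concatenated) AS BINARY(32)) PERSISTED
  );

CREATE CLUSTERED INDEX IX_ConcatHash ON #ConcatHash (hash_key, RecipeId);

-- Hand-made sample rows: recipes 1 and 2 share the same ingredient string.
INSERT INTO #ConcatHash (RecipeId, concatenated)
VALUES (1, '1,1,1,2,2,2,'),
       (2, '1,1,1,2,2,2,'),
       (3, '1,1,1,3,3,3,');

-- Match on the hash first, then compare the full strings to rule out collisions.
SELECT C1.RecipeId AS RecipeId1,
       C2.RecipeId AS RecipeId2
FROM   #ConcatHash C1
       JOIN #ConcatHash C2
         ON C1.hash_key = C2.hash_key
            AND C1.concatenated = C2.concatenated
            AND C1.RecipeId < C2.RecipeId;

DROP TABLE #ConcatHash;
```

This returns pairs rather than comma-separated groups, so the final grouping step would still need adapting.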

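If the concatenated string is too long to index, the performance of the final XML PATH query that consolidates the identical recipes could well be poor. Installing and using a custom CLR string aggregate would be one solution, as it could do the concatenation in a single pass over the data rather than through a non-indexed self join.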

SELECT YourClrAggregate(RecipeId)
FROM #Concat
GROUP BY concatenated
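
On SQL Server 2017 or later, the built-in STRING_AGG function can play the role of the custom CLR aggregate. The following is only a sketch under that version assumption; #ConcatDemo and its rows are hypothetical stand-ins for the #Concat table above:

```sql
-- Assumes SQL Server 2017+, where STRING_AGG is available.
-- #ConcatDemo is a hypothetical stand-in for #Concat with hand-made rows.
CREATE TABLE #ConcatDemo
  (
     RecipeId     INT NOT NULL,
     concatenated VARCHAR(8000) NOT NULL
  );

INSERT INTO #ConcatDemo (RecipeId, concatenated)
VALUES (1, '1,1,1,2,2,2,'),
       (2, '1,1,1,2,2,2,'),
       (3, '1,1,1,3,3,3,');

-- Group identical ingredient strings and keep only groups with two or more recipes.
SELECT STRING_AGG(CAST(RecipeId AS VARCHAR(10)), ',')
         WITHIN GROUP (ORDER BY RecipeId) AS Recipes
FROM   #ConcatDemo
GROUP  BY concatenated
HAVING COUNT(*) > 1;

DROP TABLE #ConcatDemo;
```

With these sample rows, the only group returned should be the one containing recipes 1 and 2. The HAVING clause replaces the LIKE '%,%,%' filter used in the XML PATH version.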

I also tried

WITH Agg
     AS (SELECT RecipeId,
                MAX(IngredientID)          AS MaxIngredientID,
                MIN(IngredientID)          AS MinIngredientID,
                SUM(IngredientID)          AS SumIngredientID,
                COUNT(IngredientID)        AS CountIngredientID,
                CHECKSUM_AGG(IngredientID) AS ChkIngredientID,
                MAX(Quantity)              AS MaxQuantity,
                MIN(Quantity)              AS MinQuantity,
                SUM(Quantity)              AS SumQuantity,
                COUNT(Quantity)            AS CountQuantity,
                CHECKSUM_AGG(Quantity)     AS ChkQuantity,
                MAX(UOM)                   AS MaxUOM,
                MIN(UOM)                   AS MinUOM,
                SUM(UOM)                   AS SumUOM,
                COUNT(UOM)                 AS CountUOM,
                CHECKSUM_AGG(UOM)          AS ChkUOM
         FROM   dbo.RecipeIngredients
         GROUP  BY RecipeId)
SELECT  A1.RecipeId AS RecipeId1,
        A2.RecipeId AS RecipeId2
FROM   Agg A1
       JOIN Agg A2
         ON A1.MaxIngredientID = A2.MaxIngredientID
            AND A1.MinIngredientID = A2.MinIngredientID
            AND A1.SumIngredientID = A2.SumIngredientID
            AND A1.CountIngredientID = A2.CountIngredientID
            AND A1.ChkIngredientID = A2.ChkIngredientID
            AND A1.MaxQuantity = A2.MaxQuantity
            AND A1.MinQuantity = A2.MinQuantity
            AND A1.SumQuantity = A2.SumQuantity
            AND A1.CountQuantity = A2.CountQuantity
            AND A1.ChkQuantity = A2.ChkQuantity
            AND A1.MaxUOM = A2.MaxUOM
            AND A1.MinUOM = A2.MinUOM
            AND A1.SumUOM = A2.SumUOM
            AND A1.CountUOM = A2.CountUOM
            AND A1.ChkUOM = A2.ChkUOM
            AND A1.RecipeId <> A2.RecipeId
WHERE  NOT EXISTS (SELECT *
                   FROM   (SELECT *
                           FROM   RecipeIngredients
                           WHERE  RecipeId = A1.RecipeId) R1
                          FULL OUTER JOIN (SELECT *
                                           FROM   RecipeIngredients
                                           WHERE  RecipeId = A2.RecipeId) R2
                            ON R1.IngredientID = R2.IngredientID
                               AND R1.Quantity = R2.Quantity
                               AND R1.UOM = R2.UOM
                   WHERE  R1.RecipeId IS NULL
                           OR R2.RecipeId IS NULL)

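This works acceptably when there are relatively few duplicates (less than a second for the first example data) but performs badly in the pathological case: the initial aggregation returns exactly the same results for every RecipeID, so it does not reduce the number of comparisons at all.

— Martin Smith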

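I'm not sure whether it makes much sense to compare "empty" recipes, but I did change my query to that effect before finally posting it, because that is what @ypercube's solutions do.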

— Andriy M

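@AndriyM - Joe Celko in his relational division article

— Martin Smith

This is a generalization of the relational division problem. I have no idea how efficient this will be: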

; WITH cte AS
( SELECT RecipeID_1 = r1.RecipeID, Name_1 = r1.Name,
         RecipeID_2 = r2.RecipeID, Name_2 = r2.Name  
  FROM Recipes AS r1
    JOIN Recipes AS r2
      ON r1.RecipeID <> r2.RecipeID
  WHERE NOT EXISTS
        ( SELECT 1
          FROM RecipeIngredients AS ri1
          WHERE ri1.RecipeID = r1.RecipeID 
            AND NOT EXISTS
                ( SELECT 1
                  FROM RecipeIngredients AS ri2
                  WHERE ri2.RecipeID = r2.RecipeID 
                    AND ri1.IngredientID = ri2.IngredientID
                    AND ri1.Quantity = ri2.Quantity
                    AND ri1.UOM = ri2.UOM
                )
         )
)
SELECT c1.*
FROM cte AS c1
  JOIN cte AS c2
    ON  c1.RecipeID_1 = c2.RecipeID_2
    AND c1.RecipeID_2 = c2.RecipeID_1
    AND c1.RecipeID_1 < c1.RecipeID_2;

Another (similar) approach:

SELECT RecipeID_1 = r1.RecipeID, Name_1 = r1.Name,
       RecipeID_2 = r2.RecipeID, Name_2 = r2.Name 
FROM Recipes AS r1
  JOIN Recipes AS r2
    ON  r1.RecipeID < r2.RecipeID 
    AND NOT EXISTS
        ( SELECT IngredientID, Quantity, UOM
          FROM RecipeIngredients AS ri1
          WHERE ri1.RecipeID = r1.RecipeID
        EXCEPT 
          SELECT IngredientID, Quantity, UOM
          FROM RecipeIngredients AS ri2
          WHERE ri2.RecipeID = r2.RecipeID
        )
    AND NOT EXISTS
        ( SELECT IngredientID, Quantity, UOM
          FROM RecipeIngredients AS ri2
          WHERE ri2.RecipeID = r2.RecipeID
        EXCEPT 
          SELECT IngredientID, Quantity, UOM
          FROM RecipeIngredients AS ri1
          WHERE ri1.RecipeID = r1.RecipeID
        ) ;

And another, different one:

; WITH cte AS
( SELECT RecipeID_1 = r.RecipeID, RecipeID_2 = ri.RecipeID, 
          ri.IngredientID, ri.Quantity, ri.UOM
  FROM Recipes AS r
    CROSS JOIN RecipeIngredients AS ri
)
, cte2 AS
( SELECT RecipeID_1, RecipeID_2,
         IngredientID, Quantity, UOM
  FROM cte
EXCEPT
  SELECT RecipeID_2, RecipeID_1,
         IngredientID, Quantity, UOM
  FROM cte
)

  SELECT RecipeID_1 = r1.RecipeID, RecipeID_2 = r2.RecipeID
  FROM Recipes AS r1
    JOIN Recipes AS r2
      ON r1.RecipeID < r2.RecipeID
EXCEPT 
  SELECT RecipeID_1, RecipeID_2
  FROM cte2
EXCEPT 
  SELECT RecipeID_2, RecipeID_1
  FROM cte2 ;

Tested at SQL-Fiddle

Using the CHECKSUM() and CHECKSUM_AGG() functions, tested at SQL-Fiddle-2:
(ignore this one, as it may produce false positives)

ALTER TABLE RecipeIngredients
  ADD ck AS CHECKSUM( IngredientID, Quantity, UOM )
    PERSISTED ;

CREATE INDEX ckecksum_IX
  ON RecipeIngredients
    ( RecipeID, ck ) ;

; WITH cte AS
( SELECT RecipeID,
         cka = CHECKSUM_AGG(ck)
  FROM RecipeIngredients AS ri
  GROUP BY RecipeID
)
SELECT RecipeID_1 = c1.RecipeID, RecipeID_2 = c2.RecipeID
FROM cte AS c1
  JOIN cte AS c2
    ON  c1.cka = c2.cka
    AND c1.RecipeID < c2.RecipeID ;

— ypercubeᵀᴹ

The execution plans are kind of scary.

— ypercubeᵀᴹ

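This gets to the heart of my question, which is how to do this at all. For my particular situation, though, the execution plan might be a deal breaker.

— poke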

CHECKSUM and CHECKSUM_AGG still leave you needing to check for false positives.

— Martin Smith
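
To illustrate the point about false positives: the checksum can be used only as a cheap pre-filter, with each candidate pair then verified exactly. This is a sketch rather than a tested solution from the thread; the temp table #RI and its rows are hypothetical, and the verification reuses the two-way EXCEPT idea from the answer above:

```sql
-- Hypothetical demo table mirroring the RecipeIngredients schema.
CREATE TABLE #RI
  (
     RecipeID     INT NOT NULL,
     IngredientID INT NOT NULL,
     Quantity     INT NOT NULL,
     UOM          INT NOT NULL,
     PRIMARY KEY (RecipeID, IngredientID)
  );

-- Recipes 1 and 2 are identical; recipe 3 differs in one quantity.
INSERT INTO #RI (RecipeID, IngredientID, Quantity, UOM)
VALUES (1, 1, 1, 1), (1, 2, 2, 2),
       (2, 1, 1, 1), (2, 2, 2, 2),
       (3, 1, 1, 1), (3, 2, 5, 2);

WITH cte AS
( SELECT RecipeID,
         cka = CHECKSUM_AGG(CHECKSUM(IngredientID, Quantity, UOM))
  FROM #RI
  GROUP BY RecipeID
)
SELECT RecipeID_1 = c1.RecipeID, RecipeID_2 = c2.RecipeID
FROM cte AS c1
  JOIN cte AS c2
    ON  c1.cka = c2.cka              -- cheap pre-filter, may match non-duplicates
    AND c1.RecipeID < c2.RecipeID
WHERE NOT EXISTS                      -- exact check: nothing in c1 is missing from c2
      ( SELECT IngredientID, Quantity, UOM FROM #RI WHERE RecipeID = c1.RecipeID
        EXCEPT
        SELECT IngredientID, Quantity, UOM FROM #RI WHERE RecipeID = c2.RecipeID
      )
  AND NOT EXISTS                      -- and nothing in c2 is missing from c1
      ( SELECT IngredientID, Quantity, UOM FROM #RI WHERE RecipeID = c2.RecipeID
        EXCEPT
        SELECT IngredientID, Quantity, UOM FROM #RI WHERE RecipeID = c1.RecipeID
      );

DROP TABLE #RI;
```

As with the aggregate-based query in the first answer, this pre-filter does nothing to help in the pathological case where every recipe has the same checksum.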

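For a cut-down version of the example data in my answer, with 470 recipes and 2057 ingredient rows, query 1 has Table 'RecipeIngredients'. Scan count 220514, logical reads 443643 and query 2 has Table 'RecipeIngredients'. Scan count 110218, logical reads 441214. The third query seems to have lower reads than those two, but against the full sample data I still cancelled it after 8 minutes.

— Martin Smith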

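You should be able to speed this up by comparing counts first. Basically, a pair of recipes cannot have exactly the same set of ingredients if their ingredient counts differ.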

— TomTom, 2013
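
The count-first suggestion could look roughly like the sketch below. The temp table #RI2 and its rows are hypothetical. Because (RecipeID, IngredientID) is the key, a recipe has no repeated rows, so when two recipes have equal counts a single EXCEPT in one direction is enough to establish equality. Note that recipes with no ingredient rows never appear in this result.

```sql
-- Hypothetical demo table mirroring the RecipeIngredients schema.
CREATE TABLE #RI2
  (
     RecipeID     INT NOT NULL,
     IngredientID INT NOT NULL,
     Quantity     INT NOT NULL,
     UOM          INT NOT NULL,
     PRIMARY KEY (RecipeID, IngredientID)
  );

-- Recipes 1 and 2 are identical; recipe 3 has an extra ingredient.
INSERT INTO #RI2 (RecipeID, IngredientID, Quantity, UOM)
VALUES (1, 1, 1, 1), (1, 2, 2, 2),
       (2, 1, 1, 1), (2, 2, 2, 2),
       (3, 1, 1, 1), (3, 2, 2, 2), (3, 3, 3, 3);

WITH cnt AS
( SELECT RecipeID, n = COUNT(*)
  FROM #RI2
  GROUP BY RecipeID
)
SELECT RecipeID_1 = c1.RecipeID, RecipeID_2 = c2.RecipeID
FROM cnt AS c1
  JOIN cnt AS c2
    ON  c1.n = c2.n                  -- only pairs with equal ingredient counts
    AND c1.RecipeID < c2.RecipeID
WHERE NOT EXISTS                      -- equal counts + subset implies identical sets
      ( SELECT IngredientID, Quantity, UOM FROM #RI2 WHERE RecipeID = c1.RecipeID
        EXCEPT
        SELECT IngredientID, Quantity, UOM FROM #RI2 WHERE RecipeID = c2.RecipeID
      );

DROP TABLE #RI2;
```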