查找具有相同的子行集的父行


9

假设我有一个这样的结构:

食谱表

RecipeID
Name
Description

RecipeIngredients表

RecipeID
IngredientID
Quantity
UOM

关键RecipeIngredients(RecipeID, IngredientID)

查找重复食谱的一些好方法是什么?重复配方定义为具有完全相同的一组配料以及每种配料的数量。

我曾经考虑过使用FOR XML PATH将成分合并到一个单独的列中。我尚未对此进行全面探讨,但是如果我确保成分/ UOM /数量按相同顺序排序并具有适当的分隔符,那么它应该可以工作。有更好的方法吗?

有48K食谱和200K成分行。

Answers:


7

对于以下假定的模式和示例数据

CREATE TABLE dbo.RecipeIngredients
    (
      RecipeId INT NOT NULL ,
      IngredientID INT NOT NULL ,
      Quantity INT NOT NULL ,
      UOM INT NOT NULL ,
      CONSTRAINT RecipeIngredients_PK 
          PRIMARY KEY ( RecipeId, IngredientID ) WITH (IGNORE_DUP_KEY = ON)
    ) ;

INSERT INTO dbo.RecipeIngredients
SELECT TOP (210000) ABS(CRYPT_GEN_RANDOM(8)/50000),
                     ABS(CRYPT_GEN_RANDOM(8) % 100),
                     ABS(CRYPT_GEN_RANDOM(8) % 10),
                     ABS(CRYPT_GEN_RANDOM(8) % 5)
FROM master..spt_values v1,                     
     master..spt_values v2


SELECT DISTINCT RecipeId, 'X' AS Name
INTO Recipes 
FROM  dbo.RecipeIngredients 

这将填充205,009个配料行和42,613个配方。由于随机元素,每次都将略有不同。

假定相对较少的重复(示例运行后的输出为217个重复配方组,每组两个或三个配方)。根据《任择议定书》中的数字,最病理的情况将是48,000个精确重复。

设置脚本是

DROP TABLE dbo.RecipeIngredients,Recipes
GO

CREATE TABLE Recipes(
RecipeId INT IDENTITY,
Name VARCHAR(1))

INSERT INTO Recipes 
SELECT TOP 48000 'X'
FROM master..spt_values v1,                     
     master..spt_values v2

CREATE TABLE dbo.RecipeIngredients
    (
      RecipeId INT NOT NULL ,
      IngredientID INT NOT NULL ,
      Quantity INT NOT NULL ,
      UOM INT NOT NULL ,
      CONSTRAINT RecipeIngredients_PK 
          PRIMARY KEY ( RecipeId, IngredientID )) ;

INSERT INTO dbo.RecipeIngredients
SELECT RecipeId,IngredientID,Quantity,UOM
FROM Recipes
CROSS JOIN (SELECT 1,1,1 UNION ALL SELECT 2,2,2 UNION ALL  SELECT 3,3,3 UNION ALL SELECT 4,4,4) I(IngredientID,Quantity,UOM)

对于这两种情况,以下操作均在我的计算机上不到一秒钟完成。

CREATE TABLE #Concat
  (
     RecipeId     INT,
     concatenated VARCHAR(8000),
     PRIMARY KEY (concatenated, RecipeId)
  )

INSERT INTO #Concat
SELECT R.RecipeId,
       ISNULL(concatenated, '')
FROM   Recipes R
       CROSS APPLY (SELECT CAST(IngredientID AS VARCHAR(10)) + ',' + CAST(Quantity AS VARCHAR(10)) + ',' + CAST(UOM AS VARCHAR(10)) + ','
                    FROM   dbo.RecipeIngredients RI
                    WHERE  R.RecipeId = RecipeId
                    ORDER  BY IngredientID
                    FOR XML PATH('')) X (concatenated);

WITH C1
     AS (SELECT DISTINCT concatenated
         FROM   #Concat)
SELECT STUFF(Recipes, 1, 1, '')
FROM   C1
       CROSS APPLY (SELECT ',' + CAST(RecipeId AS VARCHAR(10))
                    FROM   #Concat C2
                    WHERE  C1.concatenated = C2.concatenated
                    ORDER  BY RecipeId
                    FOR XML PATH('')) R(Recipes)
WHERE  Recipes LIKE '%,%,%'

DROP TABLE #Concat 

一个警告

我假设串联字符串的长度不会超过896个字节。如果这样做,则会在运行时引发错误,而不是无提示地失败。您将需要从#temp表中删除主键(以及隐式创建的索引)。在我的测试设置中,串联字符串的最大长度为125个字符。

如果串联的字符串太长而无法建立索引,那么XML PATH合并相同配方的最终查询的性能可能会很差。安装和使用自定义CLR字符串聚合将是一种解决方案,因为它可以通过一次数据传递进行连接,而不是使用非索引自连接。

SELECT YourClrAggregate(RecipeId)
FROM #Concat
GROUP BY concatenated

我也试过

WITH Agg
     AS (SELECT RecipeId,
                MAX(IngredientID)          AS MaxIngredientID,
                MIN(IngredientID)          AS MinIngredientID,
                SUM(IngredientID)          AS SumIngredientID,
                COUNT(IngredientID)        AS CountIngredientID,
                CHECKSUM_AGG(IngredientID) AS ChkIngredientID,
                MAX(Quantity)              AS MaxQuantity,
                MIN(Quantity)              AS MinQuantity,
                SUM(Quantity)              AS SumQuantity,
                COUNT(Quantity)            AS CountQuantity,
                CHECKSUM_AGG(Quantity)     AS ChkQuantity,
                MAX(UOM)                   AS MaxUOM,
                MIN(UOM)                   AS MinUOM,
                SUM(UOM)                   AS SumUOM,
                COUNT(UOM)                 AS CountUOM,
                CHECKSUM_AGG(UOM)          AS ChkUOM
         FROM   dbo.RecipeIngredients
         GROUP  BY RecipeId)
SELECT  A1.RecipeId AS RecipeId1,
        A2.RecipeId AS RecipeId2
FROM   Agg A1
       JOIN Agg A2
         ON A1.MaxIngredientID = A2.MaxIngredientID
            AND A1.MinIngredientID = A2.MinIngredientID
            AND A1.SumIngredientID = A2.SumIngredientID
            AND A1.CountIngredientID = A2.CountIngredientID
            AND A1.ChkIngredientID = A2.ChkIngredientID
            AND A1.MaxQuantity = A2.MaxQuantity
            AND A1.MinQuantity = A2.MinQuantity
            AND A1.SumQuantity = A2.SumQuantity
            AND A1.CountQuantity = A2.CountQuantity
            AND A1.ChkQuantity = A2.ChkQuantity
            AND A1.MaxUOM = A2.MaxUOM
            AND A1.MinUOM = A2.MinUOM
            AND A1.SumUOM = A2.SumUOM
            AND A1.CountUOM = A2.CountUOM
            AND A1.ChkUOM = A2.ChkUOM
            AND A1.RecipeId <> A2.RecipeId
WHERE  NOT EXISTS (SELECT *
                   FROM   (SELECT *
                           FROM   RecipeIngredients
                           WHERE  RecipeId = A1.RecipeId) R1
                          FULL OUTER JOIN (SELECT *
                                           FROM   RecipeIngredients
                                           WHERE  RecipeId = A2.RecipeId) R2
                            ON R1.IngredientID = R2.IngredientID
                               AND R1.Quantity = R2.Quantity
                               AND R1.UOM = R2.UOM
                   WHERE  R1.RecipeId IS NULL
                           OR R2.RecipeId IS NULL) 

当重复项相对较少时(第一个示例数据少于一秒),此方法可以接受,但在病理情况下效果不佳,因为初始聚合对每个结果均返回完全相同的结果RecipeID,因此无法减少重复数的数量。比较。


我不确定比较“空”食谱是否有意义,但是在最终发布之前,我也确实将查询更改为这种效果,因为这就是@ypercube解决方案所做的。
Andriy M

@AndriyM-乔·塞尔科(Joel Celko)在他的关系部门文章
马丁·史密斯

10

这是关系划分问题的概括。不知道这将有多有效:

; WITH cte AS
( SELECT RecipeID_1 = r1.RecipeID, Name_1 = r1.Name,
         RecipeID_2 = r2.RecipeID, Name_2 = r2.Name  
  FROM Recipes AS r1
    JOIN Recipes AS r2
      ON r1.RecipeID <> r2.RecipeID
  WHERE NOT EXISTS
        ( SELECT 1
          FROM RecipeIngredients AS ri1
          WHERE ri1.RecipeID = r1.RecipeID 
            AND NOT EXISTS
                ( SELECT 1
                  FROM RecipeIngredients AS ri2
                  WHERE ri2.RecipeID = r2.RecipeID 
                    AND ri1.IngredientID = ri2.IngredientID
                    AND ri1.Quantity = ri2.Quantity
                    AND ri1.UOM = ri2.UOM
                )
         )
)
SELECT c1.*
FROM cte AS c1
  JOIN cte AS c2
    ON  c1.RecipeID_1 = c2.RecipeID_2
    AND c1.RecipeID_2 = c2.RecipeID_1
    AND c1.RecipeID_1 < c1.RecipeID_2;

另一种(类似)方法:

SELECT RecipeID_1 = r1.RecipeID, Name_1 = r1.Name,
       RecipeID_2 = r2.RecipeID, Name_2 = r2.Name 
FROM Recipes AS r1
  JOIN Recipes AS r2
    ON  r1.RecipeID < r2.RecipeID 
    AND NOT EXISTS
        ( SELECT IngredientID, Quantity, UOM
          FROM RecipeIngredients AS ri1
          WHERE ri1.RecipeID = r1.RecipeID
        EXCEPT 
          SELECT IngredientID, Quantity, UOM
          FROM RecipeIngredients AS ri2
          WHERE ri2.RecipeID = r2.RecipeID
        )
    AND NOT EXISTS
        ( SELECT IngredientID, Quantity, UOM
          FROM RecipeIngredients AS ri2
          WHERE ri2.RecipeID = r2.RecipeID
        EXCEPT 
          SELECT IngredientID, Quantity, UOM
          FROM RecipeIngredients AS ri1
          WHERE ri1.RecipeID = r1.RecipeID
        ) ;

还有另一个不同的:

; WITH cte AS
( SELECT RecipeID_1 = r.RecipeID, RecipeID_2 = ri.RecipeID, 
          ri.IngredientID, ri.Quantity, ri.UOM
  FROM Recipes AS r
    CROSS JOIN RecipeIngredients AS ri
)
, cte2 AS
( SELECT RecipeID_1, RecipeID_2,
         IngredientID, Quantity, UOM
  FROM cte
EXCEPT
  SELECT RecipeID_2, RecipeID_1,
         IngredientID, Quantity, UOM
  FROM cte
)

  SELECT RecipeID_1 = r1.RecipeID, RecipeID_2 = r2.RecipeID
  FROM Recipes AS r1
    JOIN Recipes AS r2
      ON r1.RecipeID < r2.RecipeID
EXCEPT 
  SELECT RecipeID_1, RecipeID_2
  FROM cte2
EXCEPT 
  SELECT RecipeID_2, RecipeID_1
  FROM cte2 ;

SQL-Fiddle上测试


使用CHECKSUM()CHECKSUM_AGG()功能测试在SQL-小提琴-2 :
忽略这个,因为它可能产生假阳性

ALTER TABLE RecipeIngredients
  ADD ck AS CHECKSUM( IngredientID, Quantity, UOM )
    PERSISTED ;

CREATE INDEX ckecksum_IX
  ON RecipeIngredients
    ( RecipeID, ck ) ;

; WITH cte AS
( SELECT RecipeID,
         cka = CHECKSUM_AGG(ck)
  FROM RecipeIngredients AS ri
  GROUP BY RecipeID
)
SELECT RecipeID_1 = c1.RecipeID, RecipeID_2 = c2.RecipeID
FROM cte AS c1
  JOIN cte AS c2
    ON  c1.cka = c2.cka
    AND c1.RecipeID < c2.RecipeID  ;


执行计划有点吓人。
ypercubeᵀᴹ

这是我的问题的核心,即如何做到这一点。但是,对于我的特定情况,执行计划可能会破坏交易。

1
CHECKSUM而且CHECKSUM_AGG仍然需要检查误报。
马丁·史密斯

对于我的答案中包含470个配方和2057个成分行的示例数据的精简版本,查询1具有Table 'RecipeIngredients'. Scan count 220514, logical reads 443643和查询2 Table 'RecipeIngredients'. Scan count 110218, logical reads 441214。第三个读数似乎比这两个读数低,但仍然针对完整的示例数据,我在8分钟后取消了查询。
马丁·史密斯

您应该能够通过首先比较计数来加快速度。如果配料的数量不相同,则基本上一对食谱不能具有完全相同的配料。
TomTom 2013年
By using our site, you acknowledge that you have read and understand our Cookie Policy and Privacy Policy.
Licensed under cc by-sa 3.0 with attribution required.