This problem is about the links between items, which puts it in the domain of graphs and graph processing. Specifically, the whole dataset forms a graph, and we are looking for the components of that graph. This can be illustrated by plotting the question's sample data.
The question says we can follow either GroupKey or RecordKey to find other rows that share that value, so we can treat both as vertices in a graph. The question goes on to explain how GroupKeys 1-3 have the same SupergroupKey; this corresponds to the cluster on the left, joined by thin lines. The plot also shows two other components (SupergroupKeys) formed from the original data.
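The grouping described above can be found programmatically with a disjoint-set (union-find) structure: treat every GroupKey and every RecordKey as a vertex and every row as an edge, then count the distinct roots. A minimal Python sketch, with made-up rows standing in for the question's sample data:

```python
class DisjointSet:
    """Union-find with path compression for grouping connected vertices."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Hypothetical (GroupKey, RecordKey) rows, not the OP's real data.
rows = [(1, 'aaa'), (2, 'aaa'), (3, 'bbb'), (2, 'bbb'),  # one component
        (4, 'ccc'), (5, 'ccc'),                          # a second
        (6, 'ddd')]                                      # a third

ds = DisjointSet()
for g, r in rows:
    ds.union(('G', g), ('R', r))  # prefix so group 1 cannot collide with record '1'

components = {ds.find(v) for v in list(ds.parent)}
print(len(components))  # 3 distinct SupergroupKeys
```

Each distinct root corresponds to one SupergroupKey; the igraph solution below does the same job, just far faster at scale.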
SQL Server has some graph processing capability built into T-SQL. At present it is quite limited, and of no help with this problem. SQL Server can also call out to R and Python, which have rich and robust suites of packages available. One such package is igraph, written for the "fast handling of large graphs, with millions of vertices and edges."
Using R and igraph I was able to process one million rows in 2 minutes 22 seconds in local testing.[1] This is how it compares to the current best solution:
Record Keys     Paul White   R
------------    ----------   --------
Per question    15ms         ~220ms
100             80ms         ~270ms
1,000           250ms        430ms
10,000          1.4s         1.7s
100,000         14s          14s
1M              2m29s        2m22s
1M              n/a          1m40s  process only, no display
The first column is the number of distinct RecordKey values. The number of rows
in the table will be 8 x this number.
For the 1M-row run, 1m40s was spent loading and processing the graph and updating the table; a further 42s was needed to populate the SSMS results grid with the output.
Watching Task Manager while the 1M rows were processed showed that about 3GB of working memory was required, which was available on this system without paging.
I can confirm Ypercube's assessment of the recursive CTE approach: with a few hundred record keys it consumed 100% CPU and all available RAM, and eventually tempdb grew to more than 80GB before the SPID crashed.
I used Paul's table with the SupergroupKey column, so there is a fair comparison between the solutions.
For some reason R objected to the accent in Poincaré; changing it to a plain "e" allowed it to run. I did not investigate, since it is not germane to the problem at hand, but I'm sure a solution exists.
Here is the code:
-- This captures the output from R so the base table can be updated.
drop table if exists #Results;
create table #Results
(
Component int not NULL,
Vertex varchar(12) not NULL primary key
);
truncate table #Results; -- facilitates re-execution
declare @Start time = sysdatetimeoffset(); -- for a 'total elapsed' calculation.
insert #Results(Component, Vertex)
exec sp_execute_external_script
@language = N'R',
@input_data_1 = N'select GroupKey, RecordKey from dbo.Example',
@script = N'
library(igraph)
df.g <- graph.data.frame(d = InputDataSet, directed = FALSE)
cpts <- components(df.g, mode = c("weak"))
OutputDataSet <- data.frame(cpts$membership)
OutputDataSet$VertexName <- V(df.g)$name
';
-- Write SuperGroupKey to the base table, as other solutions do
update e
set
SupergroupKey = r.Component
from dbo.Example as e
inner join #Results as r
on r.Vertex = e.RecordKey;
-- Return all rows, as other solutions do
select
e.SupergroupKey,
e.GroupKey,
e.RecordKey
from dbo.Example as e;
-- Calculate the elapsed
declare @End time = sysdatetimeoffset();
select Elapse_ms = DATEDIFF(MILLISECOND, @Start, @End);
This is what the R code does:
@input_data_1
is how SQL Server transfers the data from a table to the R code, converting it into an R data frame named InputDataSet.
library(igraph)
imports the library into the R execution environment.
df.g <- graph.data.frame(d = InputDataSet, directed = FALSE)
loads the data into an igraph object. The graph is undirected, since we can follow links from group to record just as well as from record to group. InputDataSet is the default name SQL Server gives to the dataset it sends to R.
cpts <- components(df.g, mode = c("weak"))
processes the graph to find the discrete sub-graphs (components), among other measures.
OutputDataSet <- data.frame(cpts$membership)
SQL Server expects a data frame back from R; its default name is OutputDataSet. The components are held in a vector called "membership". This statement converts that vector into a data frame.
OutputDataSet$VertexName <- V(df.g)$name
V() is the vector of vertices in the graph — the list of GroupKeys and RecordKeys. This copies them into the output data frame, creating a new column called VertexName. This is the key used to match against the source table when updating SupergroupKey.
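Taken together, the R steps build an undirected graph from the two input columns, label every vertex with its weak-component number, and hand back (component, vertex) pairs. For readers without R, here is a rough pure-Python equivalent of that pipeline — a breadth-first search standing in for igraph, over an invented edge list:

```python
from collections import defaultdict, deque

def weak_components(edges):
    """Return {vertex: component_id}, mirroring cpts$membership in the R script."""
    adj = defaultdict(set)
    for g, r in edges:
        adj[g].add(r)
        adj[r].add(g)
    membership, comp = {}, 0
    for start in adj:
        if start in membership:
            continue
        comp += 1                      # new component found
        membership[start] = comp
        queue = deque([start])
        while queue:                   # flood-fill this component
            v = queue.popleft()
            for nbr in adj[v]:
                if nbr not in membership:
                    membership[nbr] = comp
                    queue.append(nbr)
    return membership

# Invented edges in (GroupKey, RecordKey) form; prefixes avoid name clashes.
edges = [('g1', 'r1'), ('g2', 'r1'), ('g3', 'r2')]
result = weak_components(edges)
# g1, g2 and r1 share a component; g3 and r2 form a second one
```

The component id plays the role of SupergroupKey, and the vertex name is what the final UPDATE joins on.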
I am no R expert, so it is likely this could be optimised.
Test Data
The OP's data was used for validation. For the scale tests I used the following script.
drop table if exists Records;
drop table if exists Groups;
create table Groups(GroupKey int NOT NULL primary key);
create table Records(RecordKey varchar(12) NOT NULL primary key);
go
set nocount on;
-- Set @RecordCount to the number of distinct RecordKey values desired.
-- The number of rows in dbo.Example will be 8 * @RecordCount.
declare @RecordCount int = 1000000;
-- @Multiplier was determined by experiment.
-- It gives the OP's "8 RecordKeys per GroupKey and 4 GroupKeys per RecordKey"
-- and allows for clashes of the chosen random values.
declare @Multiplier numeric(4, 2) = 2.7;
-- The number of groups required to reproduce the OP's distribution.
declare @GroupCount int = FLOOR(@RecordCount * @Multiplier);
-- This is a poor man's numbers table.
insert Groups(GroupKey)
select top(@GroupCount)
ROW_NUMBER() over (order by (select NULL))
from sys.objects as a
cross join sys.objects as b
--cross join sys.objects as c -- include if needed
declare @c int = 0
while @c < @RecordCount
begin
-- Can't use a set-based method since RAND() gives the same value for all rows.
-- There are better ways to do this, but it works well enough.
-- RecordKeys will be 10 letters, a-z.
insert Records(RecordKey)
select
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND()));
set @c += 1;
end
-- Process each RecordKey in alphabetical order.
-- For each choose 8 GroupKeys to pair with it.
declare @RecordKey varchar(12) = '';
declare @Groups table (GroupKey int not null);
truncate table dbo.Example;
select top(1) @RecordKey = RecordKey
from Records
where RecordKey > @RecordKey
order by RecordKey;
while @@ROWCOUNT > 0
begin
print @Recordkey;
delete @Groups;
insert @Groups(GroupKey)
select distinct C
from
(
-- Hard-code * from OP's statistics
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
) as T(C);
insert dbo.Example(GroupKey, RecordKey)
select
GroupKey, @RecordKey
from @Groups;
select top(1) @RecordKey = RecordKey
from Records
where RecordKey > @RecordKey
order by RecordKey;
end
-- Rebuild the indexes to have a consistent environment
alter index iExample on dbo.Example rebuild partition = all
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);
-- Check what we ended up with:
select COUNT(*) from dbo.Example; -- Should be @RecordCount * 8
-- Often a little less due to random clashes
select
ByGroup = AVG(C)
from
(
select CONVERT(float, COUNT(1) over(partition by GroupKey))
from dbo.Example
) as T(C);
select
ByRecord = AVG(C)
from
(
select CONVERT(float, COUNT(1) over(partition by RecordKey))
from dbo.Example
) as T(C);
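Two details of the generation script are worth sketching outside T-SQL: RAND() drives both the ten-letter record keys and the eight group draws per record, and the `select distinct` over those draws is why the final row count is often a little under 8 × @RecordCount. A hedged Python rendering of both steps — seeded here for repeatability, unlike the unseeded T-SQL RAND(), and with a tiny stand-in for @GroupCount:

```python
import random
import string

rng = random.Random(42)  # assumption: any seed will do; T-SQL RAND() is unseeded

def random_record_key(length=10):
    """Ten random lowercase letters, like CHAR(97 + 26*RAND()) repeated."""
    return ''.join(rng.choice(string.ascii_lowercase) for _ in range(length))

def random_groups(group_count, draws=8):
    """Eight draws with replacement, deduplicated like 'select distinct C'."""
    return {rng.randrange(group_count) for _ in range(draws)}

key = random_record_key()
groups = random_groups(group_count=50)  # tiny stand-in for the real @GroupCount
# len(groups) can be < 8 whenever the same group is drawn more than once
```

With the real @GroupCount in the millions, duplicate draws for a single record are rare but still occur, hence "often a little less".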
I have just realised that I got the ratios backwards from the OP's definition. I do not believe this affects the timings, though: records and groups are symmetric to this process. To the algorithm they are all just nodes in a graph.
In testing, the data always formed a single component. I believe this is due to the uniform distribution of the data. If, instead of the static 1:8 ratio hard-coded into the generation routine, I had allowed the ratio to vary, there would more likely have been further components.
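That intuition can be checked with a scaled-down simulation: build a random bipartite edge list with the same eight-groups-per-record shape and count components with a union-find. All parameters below are invented stand-ins, far smaller than the real run:

```python
import random

def count_components(edges):
    """Union-find component count over an undirected edge list."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(v) for v in list(parent)})

rng = random.Random(1)
records, multiplier, per_record = 1000, 2.7, 8  # scaled-down stand-ins
groups = int(records * multiplier)
edges = [(('R', r), ('G', rng.randrange(groups)))
         for r in range(records)
         for _ in range(per_record)]
n_components = count_components(edges)
# With uniform draws this tends strongly toward one giant component,
# possibly plus a handful of small ones at this tiny scale.
```

Skewing the draws (for example, Zipf-distributed group choices) would be the natural way to provoke additional components.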
[1] Machine spec: Microsoft SQL Server 2017 (RTM-CU12), Developer Edition (64-bit), Windows 10 Home. 16GB RAM, SSD, 4-core hyper-threaded i7, 2.8GHz nominal. Apart from normal system activity (about 4% CPU), the tests were the only thing running at the time.