This problem is about the links between items, which puts it in the domain of graphs and graph processing. Specifically, the whole dataset forms a graph, and we are looking for the components of that graph. This can be illustrated by plotting the question's sample data.
The question says we can follow either GroupKey or RecordKey to find other rows that share that value, so we can treat both as vertices in a graph. The question goes on to explain how GroupKeys 1-3 have the same SupergroupKey; this corresponds to the cluster on the left, joined by thin lines. The plot also shows two other components (SupergroupKeys) formed from the original data.
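The grouping described above can be found programmatically with a disjoint-set (union-find) structure: treat every GroupKey and every RecordKey as a vertex and every row as an edge, then count the distinct roots. A minimal Python sketch, with made-up rows standing in for the question's sample data:

```python
class DisjointSet:
    """Union-find with path compression for grouping connected vertices."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Hypothetical (GroupKey, RecordKey) rows, not the OP's real data.
rows = [(1, 'aaa'), (2, 'aaa'), (3, 'bbb'), (2, 'bbb'),  # one component
        (4, 'ccc'), (5, 'ccc'),                          # a second
        (6, 'ddd')]                                      # a third

ds = DisjointSet()
for g, r in rows:
    ds.union(('G', g), ('R', r))  # prefix so group 1 cannot collide with record '1'

components = {ds.find(v) for v in list(ds.parent)}
print(len(components))  # 3 distinct SupergroupKeys
```

Each distinct root corresponds to one SupergroupKey; the igraph solution below does the same job, just far faster at scale.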
SQL Server has some graph processing capability built into T-SQL. At present it is quite limited, and of no help with this problem. SQL Server can also call out to R and Python, which have rich and robust suites of packages available. One such package is igraph, written for the "fast handling of large graphs, with millions of vertices and edges."
Using R and igraph I was able to process one million rows in 2 minutes 22 seconds in local testing.[1] This is how it compares to the current best solution:
Record Keys     Paul White   R
------------    ----------   --------
Per question    15ms         ~220ms
100             80ms         ~270ms
1,000           250ms        430ms
10,000          1.4s         1.7s
100,000         14s          14s
1M              2m29s        2m22s
1M              n/a          1m40s  process only, no display
The first column is the number of distinct RecordKey values. The number of rows
in the table will be 8 x this number.
For the 1M-row run, 1m40s was spent loading and processing the graph and updating the table; a further 42s was needed to populate the SSMS results grid with the output.
Watching Task Manager while the 1M rows were processed showed that about 3GB of working memory was required, which was available on this system without paging.
I can confirm Ypercube's assessment of the recursive CTE approach: with a few hundred record keys it consumed 100% CPU and all available RAM, and eventually tempdb grew to more than 80GB before the SPID crashed.
I used Paul's table with the SupergroupKey column, so there is a fair comparison between the solutions.
For some reason R objected to the accent in Poincaré; changing it to a plain "e" allowed it to run. I did not investigate, since it is not germane to the problem at hand, but I'm sure a solution exists.
Here is the code:
-- This captures the output from R so the base table can be updated.
drop table if exists #Results;
create table #Results
(
Component int not NULL,
Vertex varchar(12) not NULL primary key
);
truncate table #Results; -- facilitates re-execution
declare @Start time = sysdatetimeoffset(); -- for a 'total elapsed' calculation.
insert #Results(Component, Vertex)
exec sp_execute_external_script
@language = N'R',
@input_data_1 = N'select GroupKey, RecordKey from dbo.Example',
@script = N'
library(igraph)
df.g <- graph.data.frame(d = InputDataSet, directed = FALSE)
cpts <- components(df.g, mode = c("weak"))
OutputDataSet <- data.frame(cpts$membership)
OutputDataSet$VertexName <- V(df.g)$name
';
-- Write SuperGroupKey to the base table, as other solutions do
update e
set
SupergroupKey = r.Component
from dbo.Example as e
inner join #Results as r
on r.Vertex = e.RecordKey;
-- Return all rows, as other solutions do
select
e.SupergroupKey,
e.GroupKey,
e.RecordKey
from dbo.Example as e;
-- Calculate the elapsed
declare @End time = sysdatetimeoffset();
select Elapse_ms = DATEDIFF(MILLISECOND, @Start, @End);
This is what the R code does:
@input_data_1
is how SQL Server transfers the data from a table to the R code, converting it into an R data frame named InputDataSet.
library(igraph)
imports the library into the R execution environment.
df.g <- graph.data.frame(d = InputDataSet, directed = FALSE)
loads the data into an igraph object. The graph is undirected, since we can follow links from group to record just as well as from record to group. InputDataSet is the default name SQL Server gives to the dataset it sends to R.
cpts <- components(df.g, mode = c("weak"))
processes the graph to find the discrete sub-graphs (components), among other measures.
OutputDataSet <- data.frame(cpts$membership)
SQL Server expects a data frame back from R; its default name is OutputDataSet. The components are held in a vector called "membership". This statement converts that vector into a data frame.
OutputDataSet$VertexName <- V(df.g)$name
V() is the vector of vertices in the graph — the list of GroupKeys and RecordKeys. This copies them into the output data frame, creating a new column called VertexName. This is the key used to match against the source table when updating SupergroupKey.
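Taken together, the R steps build an undirected graph from the two input columns, label every vertex with its weak-component number, and hand back (component, vertex) pairs. For readers without R, here is a rough pure-Python equivalent of that pipeline — a breadth-first search standing in for igraph, over an invented edge list:

```python
from collections import defaultdict, deque

def weak_components(edges):
    """Return {vertex: component_id}, mirroring cpts$membership in the R script."""
    adj = defaultdict(set)
    for g, r in edges:
        adj[g].add(r)
        adj[r].add(g)
    membership, comp = {}, 0
    for start in adj:
        if start in membership:
            continue
        comp += 1                      # new component found
        membership[start] = comp
        queue = deque([start])
        while queue:                   # flood-fill this component
            v = queue.popleft()
            for nbr in adj[v]:
                if nbr not in membership:
                    membership[nbr] = comp
                    queue.append(nbr)
    return membership

# Invented edges in (GroupKey, RecordKey) form; prefixes avoid name clashes.
edges = [('g1', 'r1'), ('g2', 'r1'), ('g3', 'r2')]
result = weak_components(edges)
# g1, g2 and r1 share a component; g3 and r2 form a second one
```

The component id plays the role of SupergroupKey, and the vertex name is what the final UPDATE joins on.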
I am no R expert, so it is likely this could be optimised.
Test Data
The OP's data was used for validation. For the scale tests I used the following script.
drop table if exists Records;
drop table if exists Groups;
create table Groups(GroupKey int NOT NULL primary key);
create table Records(RecordKey varchar(12) NOT NULL primary key);
go
set nocount on;
-- Set @RecordCount to the number of distinct RecordKey values desired.
-- The number of rows in dbo.Example will be 8 * @RecordCount.
declare @RecordCount int = 1000000;
-- @Multiplier was determined by experiment.
-- It gives the OP's "8 RecordKeys per GroupKey and 4 GroupKeys per RecordKey"
-- and allows for clashes of the chosen random values.
declare @Multiplier numeric(4, 2) = 2.7;
-- The number of groups required to reproduce the OP's distribution.
declare @GroupCount int = FLOOR(@RecordCount * @Multiplier);
-- This is a poor man's numbers table.
insert Groups(GroupKey)
select top(@GroupCount)
ROW_NUMBER() over (order by (select NULL))
from sys.objects as a
cross join sys.objects as b
--cross join sys.objects as c -- include if needed
declare @c int = 0
while @c < @RecordCount
begin
-- Can't use a set-based method since RAND() gives the same value for all rows.
-- There are better ways to do this, but it works well enough.
-- RecordKeys will be 10 letters, a-z.
insert Records(RecordKey)
select
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND()));
set @c += 1;
end
-- Process each RecordKey in alphabetical order.
-- For each choose 8 GroupKeys to pair with it.
declare @RecordKey varchar(12) = '';
declare @Groups table (GroupKey int not null);
truncate table dbo.Example;
select top(1) @RecordKey = RecordKey
from Records
where RecordKey > @RecordKey
order by RecordKey;
while @@ROWCOUNT > 0
begin
print @Recordkey;
delete @Groups;
insert @Groups(GroupKey)
select distinct C
from
(
-- Hard-code * from OP's statistics
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
) as T(C);
insert dbo.Example(GroupKey, RecordKey)
select
GroupKey, @RecordKey
from @Groups;
select top(1) @RecordKey = RecordKey
from Records
where RecordKey > @RecordKey
order by RecordKey;
end
-- Rebuild the indexes to have a consistent environment
alter index iExample on dbo.Example rebuild partition = all
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);
-- Check what we ended up with:
select COUNT(*) from dbo.Example; -- Should be @RecordCount * 8
-- Often a little less due to random clashes
select
ByGroup = AVG(C)
from
(
select CONVERT(float, COUNT(1) over(partition by GroupKey))
from dbo.Example
) as T(C);
select
ByRecord = AVG(C)
from
(
select CONVERT(float, COUNT(1) over(partition by RecordKey))
from dbo.Example
) as T(C);
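Two details of the generation script are worth sketching outside T-SQL: RAND() drives both the ten-letter record keys and the eight group draws per record, and the `select distinct` over those draws is why the final row count is often a little under 8 × @RecordCount. A hedged Python rendering of both steps — seeded here for repeatability, unlike the unseeded T-SQL RAND(), and with a tiny stand-in for @GroupCount:

```python
import random
import string

rng = random.Random(42)  # assumption: any seed will do; T-SQL RAND() is unseeded

def random_record_key(length=10):
    """Ten random lowercase letters, like CHAR(97 + 26*RAND()) repeated."""
    return ''.join(rng.choice(string.ascii_lowercase) for _ in range(length))

def random_groups(group_count, draws=8):
    """Eight draws with replacement, deduplicated like 'select distinct C'."""
    return {rng.randrange(group_count) for _ in range(draws)}

key = random_record_key()
groups = random_groups(group_count=50)  # tiny stand-in for the real @GroupCount
# len(groups) can be < 8 whenever the same group is drawn more than once
```

With the real @GroupCount in the millions, duplicate draws for a single record are rare but still occur, hence "often a little less".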
I have just realised that I got the ratios backwards from the OP's definition. I do not believe this affects the timings, though: records and groups are symmetric to this process. To the algorithm they are all just nodes in a graph.
In testing, the data always formed a single component. I believe this is due to the uniform distribution of the data. If, instead of the static 1:8 ratio hard-coded into the generation routine, I had allowed the ratio to vary, there would more likely have been further components.
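That intuition can be checked with a scaled-down simulation: build a random bipartite edge list with the same eight-groups-per-record shape and count components with a union-find. All parameters below are invented stand-ins, far smaller than the real run:

```python
import random

def count_components(edges):
    """Union-find component count over an undirected edge list."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(v) for v in list(parent)})

rng = random.Random(1)
records, multiplier, per_record = 1000, 2.7, 8  # scaled-down stand-ins
groups = int(records * multiplier)
edges = [(('R', r), ('G', rng.randrange(groups)))
         for r in range(records)
         for _ in range(per_record)]
n_components = count_components(edges)
# With uniform draws this tends strongly toward one giant component,
# possibly plus a handful of small ones at this tiny scale.
```

Skewing the draws (for example, Zipf-distributed group choices) would be the natural way to provoke additional components.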
[1] Machine spec: Microsoft SQL Server 2017 (RTM-CU12), Developer Edition (64-bit), Windows 10 Home. 16GB RAM, SSD, 4-core hyper-threaded i7, 2.8GHz nominal. Apart from normal system activity (about 4% CPU), the tests were the only thing running at the time.