我的部分工作量使用CLR函数,该函数实现了怪异的哈希算法来比较行以查看是否有任何列值已更改。CLR函数将二进制字符串作为输入,因此我需要一种快速的方法将行转换为二进制字符串。我希望在整个工作负载中散列大约100亿行,因此我希望这段代码尽可能快。
我有大约300个具有不同架构的表。出于这个问题的目的,请假设一个简单的表结构包含32个可为空的INT
列。在此问题的底部,我提供了示例数据以及一种基准测试结果的方法。
如果所有列值都相同,则行必须转换为相同的二进制字符串。如果任何列值不同,则行必须转换为不同的二进制字符串。例如,以下简单的代码将不起作用:
CAST(COL1 AS BINARY(4)) + CAST(COL2 AS BINARY(4)) + ..
它不能正确处理NULL。如果COL1
第1行为COL2
NULL,第2行为NULL,则两行都将转换为NULL字符串。我相信正确处理NULL是正确转换整个行的最难的部分。INT列的所有允许值都是可能的。
要优先考虑一些问题:
- 如果这很重要,那么在大多数情况下(90%+),列将不会为NULL。
- 我必须使用CLR。
- 我必须对这么多行进行哈希处理。我无法保留这些哈希值。
- 我相信由于CLR功能的存在,我无法使用批处理模式进行转换。
将32个可INT
为空的列转换为BINARY(X)
或VARBINARY(X)
字符串的最快方法是什么?
样本数据和代码如下:
-- create sample data
DROP TABLE IF EXISTS dbo.TABLE_OF_32_INTS;
CREATE TABLE dbo.TABLE_OF_32_INTS (
COL1 INT NULL,
COL2 INT NULL,
COL3 INT NULL,
COL4 INT NULL,
COL5 INT NULL,
COL6 INT NULL,
COL7 INT NULL,
COL8 INT NULL,
COL9 INT NULL,
COL10 INT NULL,
COL11 INT NULL,
COL12 INT NULL,
COL13 INT NULL,
COL14 INT NULL,
COL15 INT NULL,
COL16 INT NULL,
COL17 INT NULL,
COL18 INT NULL,
COL19 INT NULL,
COL20 INT NULL,
COL21 INT NULL,
COL22 INT NULL,
COL23 INT NULL,
COL24 INT NULL,
COL25 INT NULL,
COL26 INT NULL,
COL27 INT NULL,
COL28 INT NULL,
COL29 INT NULL,
COL30 INT NULL,
COL31 INT NULL,
COL32 INT NULL
);
INSERT INTO dbo.TABLE_OF_32_INTS WITH (TABLOCK)
SELECT 0, 123, 12345, 1234567, 123456789
, 0, 123, 12345, 1234567, 123456789
, 0, 123, 12345, 1234567, 123456789
, 0, 123, 12345, 1234567, 123456789
, 0, 123, 12345, 1234567, 123456789
, 0, 123, 12345, 1234567, 123456789
, NULL, -876545321
FROM
(
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
) q
OPTION (MAXDOP 1);
GO
-- procedure to test performance
CREATE OR ALTER PROCEDURE #p AS
BEGIN
SET NOCOUNT ON;
DECLARE
@counter INT = 0,
@dummy VARBINARY(8000);
WHILE @counter < 10
BEGIN
SELECT @dummy = -- this code is clearly incomplete as it does not handle NULLs
CAST(COL1 AS BINARY(4)) +
CAST(COL2 AS BINARY(4)) +
CAST(COL3 AS BINARY(4)) +
CAST(COL4 AS BINARY(4)) +
CAST(COL5 AS BINARY(4)) +
CAST(COL6 AS BINARY(4)) +
CAST(COL7 AS BINARY(4)) +
CAST(COL8 AS BINARY(4)) +
CAST(COL9 AS BINARY(4)) +
CAST(COL10 AS BINARY(4)) +
CAST(COL11 AS BINARY(4)) +
CAST(COL12 AS BINARY(4)) +
CAST(COL13 AS BINARY(4)) +
CAST(COL14 AS BINARY(4)) +
CAST(COL15 AS BINARY(4)) +
CAST(COL16 AS BINARY(4)) +
CAST(COL17 AS BINARY(4)) +
CAST(COL18 AS BINARY(4)) +
CAST(COL19 AS BINARY(4)) +
CAST(COL20 AS BINARY(4)) +
CAST(COL21 AS BINARY(4)) +
CAST(COL22 AS BINARY(4)) +
CAST(COL23 AS BINARY(4)) +
CAST(COL24 AS BINARY(4)) +
CAST(COL25 AS BINARY(4)) +
CAST(COL26 AS BINARY(4)) +
CAST(COL27 AS BINARY(4)) +
CAST(COL28 AS BINARY(4)) +
CAST(COL29 AS BINARY(4)) +
CAST(COL30 AS BINARY(4)) +
CAST(COL31 AS BINARY(4)) +
CAST(COL32 AS BINARY(4))
FROM dbo.TABLE_OF_32_INTS
OPTION (MAXDOP 1);
SET @counter = @counter + 1;
END;
SELECT cpu_time
FROM sys.dm_exec_requests
WHERE session_id = @@SPID;
END;
GO
-- run procedure
EXEC #p;
(我仍将在此二进制结果上使用怪异的哈希。工作量使用哈希联接,并且哈希值用于哈希构建之一。我不希望哈希构建中使用长二进制值,因为它需要太多记忆。)