将许多可空整数1:1转换为二进制字符串的最快方法是什么?


14

我的部分工作量使用CLR函数,该函数实现了怪异的哈希算法来比较行以查看是否有任何列值已更改。CLR函数将二进制字符串作为输入,因此我需要一种快速的方法将行转换为二进制字符串。我希望在整个工作负载中散列大约100亿行,因此我希望这段代码尽可能快。

我有大约300个具有不同架构的表。出于这个问题的目的,请假设一个简单的表结构包含32个可为空的INT列。在此问题的底部,我提供了示例数据以及一种基准测试结果的方法。

如果所有列值都相同,则行必须转换为相同的二进制字符串。如果任何列值不同,则行必须转换为不同的二进制字符串。例如,以下简单的代码将不起作用:

CAST(COL1 AS BINARY(4)) + CAST(COL2 AS BINARY(4)) + ..

它不能正确处理NULL。如果COL1第1行为COL2NULL,第2行为NULL,则两行都将转换为NULL字符串。我相信正确处理NULL是正确转换整个行的最难的部分。INT列的所有允许值都是可能的。

要优先考虑一些问题:

  • 如果这很重要,那么在大多数情况下(90%+),列将不会为NULL。
  • 我必须使用CLR。
  • 我必须对这么多行进行哈希处理。我无法保留这些哈希值。
  • 我相信由于CLR功能的存在,我无法使用批处理模式进行转换。

将32个可INT为空的列转换为BINARY(X)VARBINARY(X)字符串的最快方法是什么?

样本数据和代码如下:

-- create sample data
DROP TABLE IF EXISTS dbo.TABLE_OF_32_INTS;

CREATE TABLE dbo.TABLE_OF_32_INTS (
    COL1 INT NULL,
    COL2 INT NULL,
    COL3 INT NULL,
    COL4 INT NULL,
    COL5 INT NULL,
    COL6 INT NULL,
    COL7 INT NULL,
    COL8 INT NULL,
    COL9 INT NULL,
    COL10 INT NULL,
    COL11 INT NULL,
    COL12 INT NULL,
    COL13 INT NULL,
    COL14 INT NULL,
    COL15 INT NULL,
    COL16 INT NULL,
    COL17 INT NULL,
    COL18 INT NULL,
    COL19 INT NULL,
    COL20 INT NULL,
    COL21 INT NULL,
    COL22 INT NULL,
    COL23 INT NULL,
    COL24 INT NULL,
    COL25 INT NULL,
    COL26 INT NULL,
    COL27 INT NULL,
    COL28 INT NULL,
    COL29 INT NULL,
    COL30 INT NULL,
    COL31 INT NULL,
    COL32 INT NULL
);

INSERT INTO dbo.TABLE_OF_32_INTS WITH (TABLOCK)
SELECT 0, 123, 12345, 1234567, 123456789
, 0, 123, 12345, 1234567, 123456789
, 0, 123, 12345, 1234567, 123456789
, 0, 123, 12345, 1234567, 123456789
, 0, 123, 12345, 1234567, 123456789
, 0, 123, 12345, 1234567, 123456789
, NULL, -876545321
FROM
(
    SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) q
OPTION (MAXDOP 1);


GO


-- procedure to test performance
CREATE OR ALTER PROCEDURE #p AS 
BEGIN

SET NOCOUNT ON;

DECLARE
@counter INT = 0,
@dummy VARBINARY(8000);

WHILE @counter < 10
BEGIN
    SELECT @dummy = -- this code is clearly incomplete as it does not handle NULLs
        CAST(COL1 AS BINARY(4)) + 
        CAST(COL2 AS BINARY(4)) + 
        CAST(COL3 AS BINARY(4)) + 
        CAST(COL4 AS BINARY(4)) + 
        CAST(COL5 AS BINARY(4)) + 
        CAST(COL6 AS BINARY(4)) + 
        CAST(COL7 AS BINARY(4)) + 
        CAST(COL8 AS BINARY(4)) + 
        CAST(COL9 AS BINARY(4)) + 
        CAST(COL10 AS BINARY(4)) + 
        CAST(COL11 AS BINARY(4)) + 
        CAST(COL12 AS BINARY(4)) + 
        CAST(COL13 AS BINARY(4)) + 
        CAST(COL14 AS BINARY(4)) + 
        CAST(COL15 AS BINARY(4)) + 
        CAST(COL16 AS BINARY(4)) + 
        CAST(COL17 AS BINARY(4)) + 
        CAST(COL18 AS BINARY(4)) + 
        CAST(COL19 AS BINARY(4)) + 
        CAST(COL20 AS BINARY(4)) + 
        CAST(COL21 AS BINARY(4)) + 
        CAST(COL22 AS BINARY(4)) + 
        CAST(COL23 AS BINARY(4)) + 
        CAST(COL24 AS BINARY(4)) + 
        CAST(COL25 AS BINARY(4)) + 
        CAST(COL26 AS BINARY(4)) + 
        CAST(COL27 AS BINARY(4)) + 
        CAST(COL28 AS BINARY(4)) + 
        CAST(COL29 AS BINARY(4)) + 
        CAST(COL30 AS BINARY(4)) + 
        CAST(COL31 AS BINARY(4)) + 
        CAST(COL32 AS BINARY(4))
    FROM dbo.TABLE_OF_32_INTS
    OPTION (MAXDOP 1);

    SET @counter = @counter + 1;
END;

SELECT cpu_time
FROM sys.dm_exec_requests
WHERE session_id = @@SPID;

END;

GO

-- run procedure
EXEC #p;

(我仍将在此二进制结果上使用怪异的哈希。工作量使用哈希联接,并且哈希值用于哈希构建之一。我不希望哈希构建中使用长二进制值,因为它需要太多记忆。)

Answers:


6

使用BINARY(5)和将NULL转换为INT超出范围的内容该怎么办:

SELECT @dummy =
    ISNULL(CAST(COL1  AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL2  AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL3  AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL4  AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL5  AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL6  AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL7  AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL8  AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL9  AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL10 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL11 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL12 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL13 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL14 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL15 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL16 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL17 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL18 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL19 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL20 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL21 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL22 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL23 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL24 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL25 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL26 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL27 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL28 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL29 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL30 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL31 AS BINARY(5)), 0x0100000000) + 
    ISNULL(CAST(COL32 AS BINARY(5)), 0x0100000000)
FROM dbo.TABLE_OF_32_INTS
OPTION (MAXDOP 1);

11

在我的机器(SQL Server 2017)上,以下C#SQLCLR函数的运行速度比binary(5)想法快30%,比想法快35%CONCAT_WS,并且是自我解答时间的一半。

它需要UNSAFE权限并使用指针。该实现与测试数据紧密相关。

出于测试目的,使此不安全程序集正常工作的最简单方法是将数据库设置为,TRUSTWORTHY并在必要时禁用clr strict安全配置选项。

编译代码

为了方便起见,已CREATE ASSEMBLY编译的位位于https://gist.github.com/SQLKiwi/72d01b661c74485900e7ebcfdc63ab8e

T-SQL函数存根

CREATE FUNCTION dbo.NullableIntsToBinary
(
    @Col01 int, @Col02 int, @Col03 int, @Col04 int, @Col05 int, @Col06 int, @Col07 int, @Col08 int, 
    @Col09 int, @Col10 int, @Col11 int, @Col12 int, @Col13 int, @Col14 int, @Col15 int, @Col16 int, 
    @Col17 int, @Col18 int, @Col19 int, @Col20 int, @Col21 int, @Col22 int, @Col23 int, @Col24 int, 
    @Col25 int, @Col26 int, @Col27 int, @Col28 int, @Col29 int, @Col30 int, @Col31 int, @Col32 int
)
RETURNS binary(132) 
WITH EXECUTE AS CALLER
AS EXTERNAL NAME Obbish.UserDefinedFunctions.NullableIntsToBinary;

源代码

C#源位于https://gist.github.com/SQLKiwi/64f320fe7fd802a68a3a644aa8b8af9f

如果自己编译,则必须使用类库(.dll)作为目标项目类型,并选中“ 允许不安全代码”构建选项。

组合解决方案

由于您最终想要计算上面返回的二进制数据的SpookyHash,因此可以在CLR函数中调用SpookyHash并返回16字节的哈希。

基于混合了列数据类型的表的示例实现位于https://gist.github.com/SQLKiwi/6f82582a4ad1920c372fac118ec82460。这包括从Jon Hanna的SpookilySharp派生的Spooky Hash算法的不安全内联版本以及Bob Jenkins 的原始公共领域C源代码


7

INT列有四个字节的允许值,这些值与a的大小完全匹配BINARY(4)。换句话说,BINARY(4)的每个可能值都与INT列的可能值匹配。因此,除非该INT列中不允许使用某个值,否则无法安全替换NULL。列是否为NULL必须分别编码。它根本无法容纳BINARY(4)

一种方法是使用NULL位图。考虑以下代码:

CAST(       
    CASE WHEN COL1 IS NOT NULL THEN 0 ELSE 1 END | 
    CASE WHEN COL2 IS NOT NULL THEN 0 ELSE 2 END | 
    CASE WHEN COL3 IS NOT NULL THEN 0 ELSE 4 END | 
    CASE WHEN COL4 IS NOT NULL THEN 0 ELSE 8 END | 
    CASE WHEN COL5 IS NOT NULL THEN 0 ELSE 16 END | 
    CASE WHEN COL6 IS NOT NULL THEN 0 ELSE 32 END | 
    CASE WHEN COL7 IS NOT NULL THEN 0 ELSE 64 END | 
    CASE WHEN COL8 IS NOT NULL THEN 0 ELSE 128 END
AS BINARY(1))

八列是否为NULL适合单个字节。可以在行之间比较这些表达式,以检查所有相同的列是否为NULL。有了这些附加信息,就可以安全地将NULL列值替换为非NULL的任何值。我发现CAST(ISNULL(COL1, 0) AS BINARY(4))它是最快的,尽管ISNULL(CAST(COL1 AS VARBINARY(4)), 0x)也可能有其他变化。

绝对很难证明任何事情,但是我发现以下细节是最快的:

  • 在位图中使用0表示NOT NULL,因为我知道大多数列值都不会为NULL
  • 对位图使用按位或代替加法
  • 检查列值是否为NULL而不是转换后的二进制值

在我的计算机上,基准测试大约需要27.5 CPU秒。不幸的是,NULL位图步骤花费了大约三分之一的时间。如果有更快的方法可以做到这一点。

这是完整的解决方案:

SELECT
    CAST(ISNULL(COL1, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL2, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL3, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL4, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL5, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL6, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL7, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL8, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL9, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL10, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL11, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL12, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL13, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL14, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL15, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL16, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL17, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL18, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL19, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL20, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL21, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL22, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL23, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL24, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL25, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL26, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL27, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL28, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL29, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL30, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL31, 0) AS BINARY(4)) + 
    CAST(ISNULL(COL32, 0) AS BINARY(4)) + 
    CAST(       
        CASE WHEN COL1 IS NOT NULL THEN 0 ELSE 1 END | 
        CASE WHEN COL2 IS NOT NULL THEN 0 ELSE 2 END | 
        CASE WHEN COL3 IS NOT NULL THEN 0 ELSE 4 END | 
        CASE WHEN COL4 IS NOT NULL THEN 0 ELSE 8 END | 
        CASE WHEN COL5 IS NOT NULL THEN 0 ELSE 16 END | 
        CASE WHEN COL6 IS NOT NULL THEN 0 ELSE 32 END | 
        CASE WHEN COL7 IS NOT NULL THEN 0 ELSE 64 END | 
        CASE WHEN COL8 IS NOT NULL THEN 0 ELSE 128 END
    AS BINARY(1)) + 
    CAST(   
        CASE WHEN COL9  IS NOT NULL THEN 0 ELSE 1 END | 
        CASE WHEN COL10 IS NOT NULL THEN 0 ELSE 2 END | 
        CASE WHEN COL11 IS NOT NULL THEN 0 ELSE 4 END | 
        CASE WHEN COL12 IS NOT NULL THEN 0 ELSE 8 END | 
        CASE WHEN COL13 IS NOT NULL THEN 0 ELSE 16 END | 
        CASE WHEN COL14 IS NOT NULL THEN 0 ELSE 32 END | 
        CASE WHEN COL15 IS NOT NULL THEN 0 ELSE 64 END | 
        CASE WHEN COL16 IS NOT NULL THEN 0 ELSE 128 END
    AS BINARY(1)) + 
    CAST(   
        CASE WHEN COL17 IS NOT NULL THEN 0 ELSE 1 END | 
        CASE WHEN COL18 IS NOT NULL THEN 0 ELSE 2 END | 
        CASE WHEN COL19 IS NOT NULL THEN 0 ELSE 4 END | 
        CASE WHEN COL20 IS NOT NULL THEN 0 ELSE 8 END | 
        CASE WHEN COL21 IS NOT NULL THEN 0 ELSE 16 END | 
        CASE WHEN COL22 IS NOT NULL THEN 0 ELSE 32 END | 
        CASE WHEN COL23 IS NOT NULL THEN 0 ELSE 64 END | 
        CASE WHEN COL24 IS NOT NULL THEN 0 ELSE 128 END
    AS BINARY(1)) + 
    CAST(   
        CASE WHEN COL25 IS NOT NULL THEN 0 ELSE 1 END | 
        CASE WHEN COL26 IS NOT NULL THEN 0 ELSE 2 END | 
        CASE WHEN COL27 IS NOT NULL THEN 0 ELSE 4 END | 
        CASE WHEN COL28 IS NOT NULL THEN 0 ELSE 8 END | 
        CASE WHEN COL29 IS NOT NULL THEN 0 ELSE 16 END | 
        CASE WHEN COL30 IS NOT NULL THEN 0 ELSE 32 END | 
        CASE WHEN COL31 IS NOT NULL THEN 0 ELSE 64 END | 
        CASE WHEN COL32 IS NOT NULL THEN 0 ELSE 128 END
    AS BINARY(1))
FROM dbo.TABLE_OF_32_INTS
OPTION (MAXDOP 1);

5

在我的测试中,concat_ws比空位图解决方案(26秒)要快一点(18秒)。将会有更多的数据需要洗净,因此您可能会在其他地方看到性能下降,并且如果要将其与字符列混合使用,则必须明智地选择定界符。

select @dummy = cast(concat_ws('|',
         isnull(cast(T.COL1  as varchar(11)), ''),
         isnull(cast(T.COL2  as varchar(11)), ''),
         isnull(cast(T.COL3  as varchar(11)), ''),
         isnull(cast(T.COL4  as varchar(11)), ''),
         isnull(cast(T.COL5  as varchar(11)), ''),
         isnull(cast(T.COL6  as varchar(11)), ''),
         isnull(cast(T.COL7  as varchar(11)), ''),
         isnull(cast(T.COL8  as varchar(11)), ''),
         isnull(cast(T.COL9  as varchar(11)), ''),
         isnull(cast(T.COL10 as varchar(11)), ''),
         isnull(cast(T.COL11 as varchar(11)), ''),
         isnull(cast(T.COL12 as varchar(11)), ''),
         isnull(cast(T.COL13 as varchar(11)), ''),
         isnull(cast(T.COL14 as varchar(11)), ''),
         isnull(cast(T.COL15 as varchar(11)), ''),
         isnull(cast(T.COL16 as varchar(11)), ''),
         isnull(cast(T.COL17 as varchar(11)), ''),
         isnull(cast(T.COL18 as varchar(11)), ''),
         isnull(cast(T.COL19 as varchar(11)), ''),
         isnull(cast(T.COL20 as varchar(11)), ''),
         isnull(cast(T.COL21 as varchar(11)), ''),
         isnull(cast(T.COL22 as varchar(11)), ''),
         isnull(cast(T.COL23 as varchar(11)), ''),
         isnull(cast(T.COL24 as varchar(11)), ''),
         isnull(cast(T.COL25 as varchar(11)), ''),
         isnull(cast(T.COL26 as varchar(11)), ''),
         isnull(cast(T.COL27 as varchar(11)), ''),
         isnull(cast(T.COL28 as varchar(11)), ''),
         isnull(cast(T.COL29 as varchar(11)), ''),
         isnull(cast(T.COL30 as varchar(11)), ''),
         isnull(cast(T.COL31 as varchar(11)), ''),
         isnull(cast(T.COL32 as varchar(11)), ''))
       as varbinary(8000))
from dbo.TABLE_OF_32_INTS as T
option (maxdop 1)

2
谢谢你 它对我来说也更快,但是它确实有一个缺点,就是它会生成更长的二进制字符串(191个字符,而样本数据为132个字符)。我们一定会考虑的。
Joe Obbish

-1

如果您可以提前确保不存储某些特定的int,-2,147,483,648则可以执行以下操作:

 SELECT @dummy = 
        coalesce(CAST(COL1 AS BINARY(4)),  0x80000000) + 
        coalesce(CAST(COL2 AS BINARY(4)),  0x80000000) + 
        coalesce(CAST(COL3 AS BINARY(4)),  0x80000000) + 
        coalesce(CAST(COL4 AS BINARY(4)),  0x80000000) + 
        coalesce(CAST(COL5 AS BINARY(4)),  0x80000000) + 
        coalesce(CAST(COL6 AS BINARY(4)),  0x80000000) + 
        coalesce(CAST(COL7 AS BINARY(4)),  0x80000000) + 
        coalesce(CAST(COL8 AS BINARY(4)),  0x80000000) + 
        coalesce(CAST(COL9 AS BINARY(4)),  0x80000000) + 
        coalesce(CAST(COL10 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL11 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL12 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL13 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL14 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL15 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL16 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL17 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL18 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL19 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL20 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL21 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL22 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL23 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL24 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL25 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL26 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL27 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL28 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL29 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL30 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL31 AS BINARY(4)), 0x80000000) + 
        coalesce(CAST(COL32 AS BINARY(4)), 0x80000000) 
    FROM dbo.TABLE_OF_32_INTS

谢谢你的回答。INT列的所有允许值都是可能的。
Joe Obbish
By using our site, you acknowledge that you have read and understand our Cookie Policy and Privacy Policy.
Licensed under cc by-sa 3.0 with attribution required.