使用窗口功能优化子查询


8

由于我的性能调优技能似乎永远不足,我一直想知道是否可以针对某些查询执行更多优化。这个问题涉及的情况是嵌套在子查询中的Windowed MAX函数。

我正在挖掘的数据是在较大组的各个不同组上的一系列事务。我有4个重要字段,一个事务的唯一ID,一批事务的Group ID以及与相应的唯一事务或一组事务相关的日期。多数情况下,组日期与批次的最大唯一交易日期相匹配,但是有时会通过我们的系统进行手动调整,并且在捕获组交易日期之后会进行唯一日期操作。此手动编辑不会通过设计来调整组日期。

我在此查询中识别的是那些唯一日期落在组日期之后的记录。以下示例查询建立了与我的场景大致相当的内容,并且SELECT语句返回了我要查找的记录,但是,我是否以最有效的方式来实现此解决方案?在我的事实表加载期间,这需要花一些时间,因为我的记录计数前9位数字,但是主要是我对子查询的鄙视使我怀疑这里是否有更好的方法。我不太担心任何索引,因为我相信它们已经存在。我正在寻找的是一种替代查询方法,该方法可以实现相同的功能,但效率更高。欢迎任何反馈。

CREATE TABLE #Example
(
    UniqueID INT IDENTITY(1,1)
  , GroupID INT
  , GroupDate DATETIME
  , UniqueDate DATETIME
)

CREATE CLUSTERED INDEX [CX_1] ON [#Example]
(
    [UniqueID] ASC
)


SET NOCOUNT ON

--Populate some test data
DECLARE @i INT = 0, @j INT = 5, @UniqueDate DATETIME, @GroupDate DATETIME

WHILE @i < 10000
BEGIN

    IF((@i + @j)%173 = 0)
    BEGIN
        SET @UniqueDate = GETDATE()+@i+5
    END
    ELSE
    BEGIN
        SET @UniqueDate = GETDATE()+@i
    END

    SET @GroupDate = GETDATE()+(@j-1)

    INSERT INTO #Example (GroupID, GroupDate, UniqueDate)
    VALUES (@j, @GroupDate, @UniqueDate)

    SET @i = @i + 1

    IF (@i % 5 = 0)
    BEGIN
        SET @j = @j+5
    END
END
SET NOCOUNT OFF

CREATE NONCLUSTERED INDEX [IX_2_4_3] ON [#Example]
(
    [GroupID] ASC,
    [UniqueDate] ASC,
    [GroupDate] ASC
)
INCLUDE ([UniqueID])

-- Identify any UniqueDates that are greater than the GroupDate within their GroupID
SELECT UniqueID
     , GroupID
     , GroupDate
     , UniqueDate
FROM (
    SELECT UniqueID
         , GroupID
         , GroupDate
         , UniqueDate
         , MAX(UniqueDate) OVER (PARTITION BY GroupID) AS maxUniqueDate
    FROM #Example
    ) calc_maxUD
WHERE maxUniqueDate > GroupDate
    AND maxUniqueDate = UniqueDate

DROP TABLE #Example

dbfiddle 在这里


2
如果您想对查询进行性能优化,则表上的索引非常重要。
Daniel Hutmacher

@DanielHutmacher我完全同意,尽管我不会为DWH和暂存区域转储架构,所以这是我在合理范围内可以做的最好的事情。
John Eisbrener

Answers:


9

我假设没有索引,因为您没有提供任何索引。

立即使用以下索引将消除您计划中的Sort运算符,否则该运算符可能会消耗大量内存:

CREATE INDEX IX ON #Example (GroupID, UniqueDate) INCLUDE (UniqueID, GroupDate);

在这种情况下,子查询不是性能问题。如果有的话,我将研究消除窗口函数(MAX ... OVER)的方法,以避免嵌套循环和表假脱机构造。

在具有相同索引的情况下,以下查询乍一看可能看起来效率较低,并且确实在基表上进行了从两次扫描到三次扫描,但是由于内部缺少Spool运算符,因此它在内部消除了大量读取。我猜它仍然会更好,特别是如果您的服务器上有足够的CPU内核和IO性能:

SELECT e.UniqueID
     , e.GroupID
     , e.GroupDate
     , e.UniqueDate
FROM (
    SELECT GroupID, MAX(UniqueDate) AS maxUniqueDate
    FROM #Example
    GROUP BY GroupID) AS agg
INNER JOIN #Example AS e ON agg.GroupID=e.GroupID
WHERE agg.maxUniqueDate > e.GroupDate
    AND agg.maxUniqueDate = e.UniqueDate
OPTION (MERGE JOIN);

(注意:我添加了一个MERGE JOIN查询提示,但是如果您的统计信息正确,则该提示可能会自动发生。最佳做法是,如果可能的话,将此类提示排除在外。)


6
丑陋的,但执行计划是漂亮。这就是像T-SQL这样的声明性语言的魔力。
Daniel Hutmacher

11

当并且能够从SQL Server 2012升级到SQL Server 2016时,您也许可以利用新的批处理模式Window Aggregate运算符提供的性能大大提高(尤其是无框架窗口聚合)​​。

几乎所有大型数据处理方案与列存储相比,与列存储一起使用的效果都更好。即使不更改基表的列存储,您仍然可以通过在其中一个基表上创建一个空的非聚集列存储过滤索引来获得新的2016年运营商和批处理模式的好处,或者通过冗余外部连接到按列存储组织表。

使用第二个选项,查询变为:

-- Just to get batch mode processing and the window aggregate operator
CREATE TABLE #Dummy (a integer NOT NULL, INDEX DummyCC CLUSTERED COLUMNSTORE);

-- Identify any UniqueDates that are greater than the GroupDate within their GroupID
SELECT
    calc_maxUD.UniqueID,
    calc_maxUD.GroupID,
    calc_maxUD.GroupDate,
    calc_maxUD.UniqueDate
FROM 
(
    SELECT
        E.UniqueID,
        E.GroupID,
        E.GroupDate,
        E.UniqueDate,
        maxUniqueDate = MAX(UniqueDate) OVER (
            PARTITION BY GroupID)
    FROM #Example AS E
    LEFT JOIN #Dummy AS D -- The only change to the original query
        ON 1 = 0
) AS calc_maxUD
WHERE 
    calc_maxUD.maxUniqueDate > calc_maxUD.GroupDate
    AND calc_maxUD.maxUniqueDate = calc_maxUD.UniqueDate;

db <>小提琴

请注意,对原始查询的唯一更改是创建一个空的临时表并添加左联接。执行计划是:

批处理模式窗口汇总计划

(58 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0
Table '#Example'. Scan count 1, logical reads 40, physical reads 0, read-ahead reads 0

有关更多信息和选项,请参阅Itzik Ben-Gan的精彩系列文章,关于SQL Server 2016中的批处理模式窗口聚合运算符,您需要了解的知识(分为三部分)。


7

我只是将ol'Cross Apply扔在那里:

SELECT e.*
    FROM #Example AS e
    CROSS APPLY ( SELECT TOP 1 e2.UniqueDate AS maxUniqueDate
                    FROM #Example AS e2
                    WHERE e2.GroupID = e.GroupID 
                    ORDER BY e2.UniqueDate DESC
                    ) AS ca
    WHERE ca.maxUniqueDate > e.GroupDate
        AND ca.maxUniqueDate = e.UniqueDate;

无论使用哪种索引,它都可以很好地完成工作。

CREATE CLUSTERED INDEX cx_whatever ON #Example (GroupID)

CREATE UNIQUE NONCLUSTERED INDEX ix_whatever ON #Example (GroupID, UniqueDate DESC, GroupDate)

统计时间和io如下所示(您的查询是第一个结果)

Table 'Worktable'. Scan count 3, logical reads 28004, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#Example'. Scan count 1, logical reads 51, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 15 ms,  elapsed time = 20 ms.

Table '#Example'. Scan count 10001, logical reads 21336, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 16 ms,  elapsed time = 11 ms.

查询计划在这里(再次,您的是第一个):

https://www.brentozar.com/pastetheplan/?id=BJYJvqAal

为什么我更喜欢这个版本?我避开线轴。如果这些内容开始溢出到磁盘上,它将变得难看。

但您也可能想尝试一下。

SELECT e.*
    FROM #Example AS e
    CROSS APPLY ( SELECT e2.UniqueDate AS maxUniqueDate
                    FROM #Example AS e2
                    WHERE e2.GroupID = e.GroupID 
                    ) AS ca
    WHERE ca.maxUniqueDate > e.GroupDate
        AND ca.maxUniqueDate = e.UniqueDate;

如果这是一个较大的DW,则您可能更喜欢哈希联接,并在联接中进行行过滤,而不是在TOP 1查询末尾作为过滤运算符。

计划在这里:https : //www.brentozar.com/pastetheplan/?id=BkUF55ATx

在这里统计时间和io:

Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#Example'. Scan count 2, logical reads 84, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 16 ms,  elapsed time = 5 ms.

希望这可以帮助!

根据@ypercube的想法进行一次编辑,并创建一个新索引。

CREATE NONCLUSTERED INDEX ix_meh ON #Example (UniqueDate,GroupDate) INCLUDE (UniqueID,GroupID);

WITH t1 AS 
(
    SELECT DISTINCT
    e.GroupID ,
    MAX(UniqueDate) AS MaxUniqueDate
    FROM #Example AS e
    GROUP BY e.GroupID
)
SELECT *
FROM #Example AS e
CROSS APPLY (
SELECT *
FROM t1
    WHERE t1.MaxUniqueDate > e.GroupDate
        AND t1.MaxUniqueDate = e.UniqueDate
        AND t1.GroupID = e.GroupID
) ca

这是统计时间和io:

Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#Example'. Scan count 2, logical reads 91, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 4 ms.

这是计划:

https://www.brentozar.com/pastetheplan/?id=SJv8foR6g


似乎我的示例有点太干净了,因为在某些情况下,我的实际环境中可以有多个比“组日期”大的“唯一日期”。这种情况将使您的第二个“交叉应用”查询无效,但是其他方法都可以正常工作。感谢您提供更多选择!
John Eisbrener

4

我来看看 top with ties

如果GroupDate相同,GroupId则:

select top 1 with ties 
   UniqueID
 , GroupID
 , GroupDate
 , UniqueDate
from #Example
where UniqueDate > GroupDate
order by row_number() over (partition by GroupId order by UniqueDate desc)

其他:top with ties通用表表达式中使用

with cte as (
  select top 1 with ties 
      UniqueID
    , GroupID
    , GroupDate
    , UniqueDate
  from #Example
  order by row_number() over (partition by GroupId order by UniqueDate desc)
)
select *
from cte
where UniqueDate > GroupDate

dbfiddle:http ://dbfiddle.uk/?rdbms=sqlserver_2016&fiddle=c058994c2f5f3d99b212f06e1dae9fd3

原始查询

Table 'Worktable'. Scan count 3, logical reads 28001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#Example____________________________________________________________________________________________________________0000000000CB'. Scan count 1, logical reads 43, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 31 ms,  elapsed time = 31 ms.

vs top with ties通用表表达式中

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#Example____________________________________________________________________________________________________________0000000000CB'. Scan count 1, logical reads 43, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 16 ms,  elapsed time = 15 ms.

4

因此,我对到目前为止发布的各种方法进行了一些分析,在我的环境中,看起来Daniel的方法在执行时间上始终胜出。令我惊讶的是,sp_BlitzErik的第三个CROSS APPLY方法并没有落后。如果有人感兴趣,这里是输出,但感谢TON提供的所有其他方法。通过挖掘这个问题的答案,我学到的东西比我多了!

Windowed Function - baseline metric

(10406 row(s) affected)
Table 'DateDim'. Scan count 9, logical reads 791, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TableFact'. Scan count 9, logical reads 140181, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 89815, logical reads 42553550, physical reads 0, read-ahead reads 84586, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table01Dim'. Scan count 9, logical reads 7688, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table02Dim'. Scan count 9, logical reads 7819, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 87753 ms,  elapsed time = 13031 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.


Basic Aggregated Subquery - Daniel Hutmacher

(10406 row(s) affected)
Table 'DateDim'. Scan count 18, logical reads 1194, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TableFact'. Scan count 18, logical reads 280362, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 48, logical reads 82408, physical reads 9629, read-ahead reads 72779, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 89791, logical reads 6861425, physical reads 0, read-ahead reads 14565, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table01Dim'. Scan count 9, logical reads 7688, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table02Dim'. Scan count 18, logical reads 15726, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 40527 ms,  elapsed time = 6182 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.


CROSS APPLY Operation A - sp_BlitzErik

(10406 row(s) affected)
Table 'DateDim'. Scan count 9, logical reads 6199331, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TableFact'. Scan count 3099273, logical reads 12844012, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table01Dim'. Scan count 3109676, logical reads 9350502, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table02Dim'. Scan count 3109676, logical reads 9482456, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 132632 ms,  elapsed time = 20955 ms.


CROSS APPLY Operation C - sp_BlitzErik

(10406 row(s) affected)
Table 'DateDim'. Scan count 18, logical reads 1194, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TableFact'. Scan count 18, logical reads 280362, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 56, logical reads 92800, physical reads 10872, read-ahead reads 81928, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 89791, logical reads 6861425, physical reads 0, read-ahead reads 14563, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table01Dim'. Scan count 18, logical reads 15376, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table02Dim'. Scan count 18, logical reads 15726, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 46082 ms,  elapsed time = 6804 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.


TOP 1 WITH TIES - B - SqlZim

(10406 row(s) affected)
Table 'DateDim'. Scan count 9, logical reads 791, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TableFact'. Scan count 9, logical reads 140181, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 89791, logical reads 6866304, physical reads 0, read-ahead reads 93468, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table01Dim'. Scan count 9, logical reads 7688, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table02Dim'. Scan count 9, logical reads 7835, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 79406 ms,  elapsed time = 15852 ms.

我只是在查看如果将您的示例扩展到10万行并添加了每个人的索引建议,发布的选项将如何堆叠。似乎也可以很好地代表您的实际结果。看来我的top with ties带扣版本很多。dbfiddle.uk/…–
SqlZim
By using our site, you acknowledge that you have read and understand our Cookie Policy and Privacy Policy.
Licensed under cc by-sa 3.0 with attribution required.