Can you use COUNT DISTINCT with an OVER clause?



I'm trying to improve the performance of the following query:

        UPDATE  [#TempTable]
        SET     Received = r.Number
        FROM    [#TempTable] 
        INNER JOIN (SELECT  AgentID,
                            RuleID,
                            COUNT(DISTINCT (GroupId)) Number
                    FROM    [#TempTable]
                    WHERE   Passed = 1
                    GROUP BY AgentID,
                            RuleID
                   ) r ON r.RuleID = [#TempTable].RuleID AND
                          r.AgentID = [#TempTable].AgentID                            

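Currently, with my test data, it takes about a minute. I have limited input into changes to the stored procedure this query lives in, but I can probably get them to modify this one query, or to add an index. I tried adding the following index: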

CREATE CLUSTERED INDEX ix_test ON #TempTable(AgentID, RuleId, GroupId, Passed)

It actually doubled the time the query takes. I get the same effect with a nonclustered index.

I tried rewriting it as follows, with no effect.

        WITH r AS (SELECT  AgentID,
                            RuleID,
                            COUNT(DISTINCT (GroupId)) Number
                    FROM    [#TempTable]
                    WHERE   Passed = 1
                    GROUP BY AgentID,
                            RuleID
            ) 
        UPDATE  [#TempTable]
        SET     Received = r.Number
        FROM    [#TempTable] 
        INNER JOIN r 
            ON r.RuleID = [#TempTable].RuleID AND
               r.AgentID = [#TempTable].AgentID                            

Next, I tried to use a window function like this.

        UPDATE  [#TempTable]
        SET     Received = COUNT(DISTINCT (CASE WHEN Passed=1 THEN GroupId ELSE NULL END)) 
                    OVER (PARTITION BY AgentId, RuleId)
        FROM    [#TempTable] 

At this point I started getting the error

Msg 102, Level 15, State 1, Line 2
Incorrect syntax near 'distinct'.

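So I have two questions. First, is it impossible to do a COUNT DISTINCT with the OVER clause, or have I just written it incorrectly? Second, can anyone suggest an improvement I haven't already tried? FYI, this is a SQL Server 2008 R2 Enterprise instance.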

EDIT: Here is a link to the original execution plan. I should also note that my big problem is that this query is being run 30-50 times.

https://onedrive.live.com/redir?resid=4C359AF42063BD98%21772

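EDIT2: Here is the full loop that the statement is in, as requested in the comments. I'm checking with the person who works with this on a regular basis as to the purpose of the loop.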

DECLARE @Counting INT              
SELECT  @Counting = 1              

--  BEGIN:  Cascading Rule check --           
WHILE @Counting <= 30              
    BEGIN      

        UPDATE  w1
        SET     Passed = 1
        FROM    [#TempTable] w1,
                [#TempTable] w3
        WHERE   w3.AgentID = w1.AgentID AND
                w3.RuleID = w1.CascadeRuleID AND
                w3.RulePassed = 1 AND
                w1.Passed = 0 AND
                w1.NotFlag = 0      

        UPDATE  w1
        SET     Passed = 1
        FROM    [#TempTable] w1,
                [#TempTable] w3
        WHERE   w3.AgentID = w1.AgentID AND
                w3.RuleID = w1.CascadeRuleID AND
                w3.RulePassed = 0 AND
                w1.Passed = 0 AND
                w1.NotFlag = 1        

        UPDATE  [#TempTable]
        SET     Received = r.Number
        FROM    [#TempTable] 
        INNER JOIN (SELECT  AgentID,
                            RuleID,
                            COUNT(DISTINCT (GroupID)) Number
                    FROM    [#TempTable]
                    WHERE   Passed = 1
                    GROUP BY AgentID,
                            RuleID
                   ) r ON r.RuleID = [#TempTable].RuleID AND
                          r.AgentID = [#TempTable].AgentID                            

        UPDATE  [#TempTable]
        SET     RulePassed = 1
        WHERE   TotalNeeded = Received              

        SELECT  @Counting = @Counting + 1              
    END

Answers:



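This construct is not currently supported in SQL Server. It could (and should, in my opinion) be implemented in a future version.

Applying one of the workarounds listed in the feedback item reporting this deficiency, your query could be rewritten as: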

WITH UpdateSet AS
(
    SELECT 
        AgentID, 
        RuleID, 
        Received, 
        Calc = SUM(CASE WHEN rn = 1 THEN 1 ELSE 0 END) OVER (
            PARTITION BY AgentID, RuleID) 
    FROM 
    (
        SELECT  
            AgentID,
            RuleID,
            Received,
            rn = ROW_NUMBER() OVER (
                PARTITION BY AgentID, RuleID, GroupID 
                ORDER BY GroupID)
        FROM    #TempTable
        WHERE   Passed = 1
    ) AS X
)
UPDATE UpdateSet
SET Received = Calc;

The resulting execution plan is:

[Execution plan image: Plan]

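This has the advantage of avoiding an Eager Table Spool for Halloween Protection (due to the self-join), but it introduces a sort (for the window) and an often-inefficient Lazy Table Spool construction to calculate the SUM OVER (PARTITION BY) result and apply it to all rows in the window. How it performs in practice is an exercise only you can perform.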
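To see the counting trick in isolation, here is a minimal sketch on made-up data. The table variable `@T`, its column types, and the sample rows are assumptions for illustration only; they are not taken from the question's real `#TempTable`.

    -- Hypothetical sample data; column types are assumptions.
    DECLARE @T TABLE (AgentID int, RuleID int, GroupID int, Passed bit);

    INSERT INTO @T (AgentID, RuleID, GroupID, Passed)
    VALUES (1, 10, 100, 1),
           (1, 10, 100, 1),   -- duplicate GroupID, must be counted once
           (1, 10, 101, 1),
           (1, 10, 102, 0),   -- filtered out by Passed = 1
           (2, 10, 100, 1);

    SELECT  AgentID,
            RuleID,
            GroupID,
            -- Exactly one row per (AgentID, RuleID, GroupID) has rn = 1,
            -- so summing those rows counts the distinct GroupID values.
            DistinctGroups = SUM(CASE WHEN rn = 1 THEN 1 ELSE 0 END) OVER (
                PARTITION BY AgentID, RuleID)
    FROM
    (
        SELECT  AgentID,
                RuleID,
                GroupID,
                rn = ROW_NUMBER() OVER (
                    PARTITION BY AgentID, RuleID, GroupID
                    ORDER BY GroupID)
        FROM    @T
        WHERE   Passed = 1
    ) AS X;
    -- AgentID 1 / RuleID 10: three rows, each with DistinctGroups = 2
    -- AgentID 2 / RuleID 10: one row with DistinctGroups = 1

One difference from COUNT(DISTINCT GroupID) to be aware of: ROW_NUMBER treats NULL as a partition value of its own, so if GroupID can be NULL the CASE would need an extra `AND GroupID IS NOT NULL` to match.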

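The overall approach is a difficult one to make perform well. Applying updates (especially ones based on a self-join) recursively to a large structure may be good for debugging, but it is a recipe for poor performance. Repeated large scans, memory spills, and Halloween issues are only some of the problems. Indexing and (more) temporary tables can help, but very careful analysis is needed, especially if the index is updated by other statements in the process (maintaining indexes affects query plan choices and adds I/O).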
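As one possible shape for the "(more) temporary tables" idea, the distinct counts could be materialized once and then joined back. This is only a sketch: `#DemoTable` is a hypothetical stand-in for the question's `#TempTable`, `#Counts` and `cx_Counts` are made-up names, the column types are assumptions, and whether it is actually faster has to be measured against the real data.

    -- Hypothetical stand-in for #TempTable; column types are assumptions.
    CREATE TABLE #DemoTable
        (AgentID int, RuleID int, GroupID int, Passed bit, Received int NULL);

    INSERT INTO #DemoTable (AgentID, RuleID, GroupID, Passed)
    VALUES (1, 10, 100, 1), (1, 10, 100, 1), (1, 10, 101, 1),
           (1, 10, 102, 0), (2, 10, 100, 1);

    -- Step 1: compute the distinct counts once, into a separate table.
    SELECT  AgentID,
            RuleID,
            COUNT(DISTINCT GroupID) AS Number
    INTO    #Counts
    FROM    #DemoTable
    WHERE   Passed = 1
    GROUP BY AgentID, RuleID;

    -- Step 2: GROUP BY guarantees one row per (AgentID, RuleID),
    -- so the index can be declared unique.
    CREATE UNIQUE CLUSTERED INDEX cx_Counts ON #Counts (AgentID, RuleID);

    -- Step 3: the counts now come from #Counts rather than from a
    -- second read of the table being updated.
    UPDATE  t
    SET     Received = c.Number
    FROM    #DemoTable AS t
    JOIN    #Counts AS c
            ON  c.AgentID = t.AgentID
            AND c.RuleID  = t.RuleID;

    SELECT AgentID, RuleID, GroupID, Passed, Received FROM #DemoTable;

    DROP TABLE #Counts;
    DROP TABLE #DemoTable;

Inside the WHILE loop from the question, `#Counts` would have to be dropped (or truncated and refilled) on every iteration, which adds its own cost.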

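Ultimately, solving the underlying problem would make for interesting consultancy work, but it is too much for this site. I hope this answer addresses the surface question, though.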


Alternative interpretation of the original query (results in updating more rows):

WITH UpdateSet AS
(
    SELECT 
        AgentID, 
        RuleID, 
        Received, 
        Calc = SUM(CASE WHEN Passed = 1 AND rn = 1 THEN 1 ELSE 0 END) OVER (
            PARTITION BY AgentID, RuleID) 
    FROM 
    (
        SELECT  
            AgentID,
            RuleID,
            Received,
            Passed,
            rn = ROW_NUMBER() OVER (
                PARTITION BY AgentID, RuleID, Passed, GroupID
                ORDER BY GroupID)
        FROM    #TempTable
    ) AS X
)
UPDATE UpdateSet
SET Received = Calc
WHERE Calc > 0;

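[Execution plan image: Plan 2]

Note: eliminating the sort (e.g. by providing an index) might reintroduce the need for an Eager Spool or something else to provide the necessary Halloween Protection. Sort is a blocking operator, so it provides full phase separation.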



Necromancing:

It's relatively simple to emulate a COUNT DISTINCT over PARTITION BY with DENSE_RANK:

;WITH baseTable AS
(
              SELECT 'RM1' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM1' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR2' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR2' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR3' AS ADR
    UNION ALL SELECT 'RM3' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM3' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM3' AS RM, 'ADR2' AS ADR
)
,CTE AS
(
    SELECT RM, ADR, DENSE_RANK() OVER(PARTITION BY RM ORDER BY ADR) AS dr 
    FROM baseTable
)
SELECT
     RM
    ,ADR

    ,COUNT(CTE.ADR) OVER (PARTITION BY CTE.RM ORDER BY ADR) AS cnt1 
    ,COUNT(CTE.ADR) OVER (PARTITION BY CTE.RM) AS cnt2 
    -- Doesn't work 
    --,COUNT(DISTINCT CTE.ADR) OVER (PARTITION BY CTE.RM ORDER BY CTE.ADR) AS cntDist
    -- The highest dense rank in the partition equals the number of distinct ADR values
    ,MAX(CTE.dr) OVER (PARTITION BY CTE.RM) AS cntDistEmu 
FROM CTE

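The semantics of this are not the same as COUNT if the column is nullable. If it contains any NULLs, you need to subtract 1.
Martin Smith

@Martin Smith: Nice catch. Obviously you need to add WHERE ADR IS NOT NULL if there are NULL values.
Quandary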
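
The NULL adjustment from the comments could look like the sketch below. Rather than filtering NULL rows out with WHERE (which also removes them from the result), it subtracts 1 in any partition that contains a NULL, as the first comment suggests. The sample rows and the CTE name `ranked` are made up for illustration.

    WITH baseTable AS
    (
                  SELECT 'RM1' AS RM, 'ADR1' AS ADR
        UNION ALL SELECT 'RM1', 'ADR1'
        UNION ALL SELECT 'RM1', NULL
        UNION ALL SELECT 'RM2', 'ADR1'
        UNION ALL SELECT 'RM2', 'ADR2'
        UNION ALL SELECT 'RM3', NULL
    )
    ,ranked AS
    (
        -- DENSE_RANK gives NULL a rank of its own, so it is counted as a value here.
        SELECT RM, ADR, DENSE_RANK() OVER (PARTITION BY RM ORDER BY ADR) AS dr
        FROM baseTable
    )
    SELECT
         RM
        ,ADR
        -- Highest rank in the partition, minus 1 if the partition contains a NULL.
        ,MAX(dr) OVER (PARTITION BY RM)
            - MAX(CASE WHEN ADR IS NULL THEN 1 ELSE 0 END) OVER (PARTITION BY RM)
            AS cntDistEmu
    FROM ranked;
    -- RM1 -> 1, RM2 -> 2, RM3 -> 0, matching COUNT(DISTINCT ADR) per RM

Both windowed aggregates use PARTITION BY only, with no ORDER BY inside the aggregate's OVER clause, so this form should also run on the SQL Server 2008 R2 instance mentioned in the question.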
Licensed under cc by-sa 3.0 with attribution required.