Reset running total based on another column


10

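I am trying to calculate a running total. However, it should reset when the cumulative sum is greater than the value in another column.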

create table #reset_runn_total
(
id int identity(1,1),
val int, 
reset_val int,
grp int
)

insert into #reset_runn_total
values 
(1,10,1),
(8,12,1),(6,14,1),(5,10,1),(6,13,1),(3,11,1),(9,8,1),(10,12,1)


SELECT Row_number()OVER(partition BY grp ORDER BY id)AS rn,*
INTO   #test
FROM   #reset_runn_total

Index details:

CREATE UNIQUE CLUSTERED INDEX ix_load_reset_runn_total
  ON #test(rn, grp) 

Sample data

+----+-----+-----------+-----+
| id | val | reset_val | Grp |
+----+-----+-----------+-----+
|  1 |   1 |        10 | 1   |
|  2 |   8 |        12 | 1   |
|  3 |   6 |        14 | 1   |
|  4 |   5 |        10 | 1   |
|  5 |   6 |        13 | 1   |
|  6 |   3 |        11 | 1   |
|  7 |   9 |         8 | 1   |
|  8 |  10 |        12 | 1   |
+----+-----+-----------+-----+ 

Expected result

+----+-----+-----------------+-------------+
| id | val |    reset_val    | Running_tot |
+----+-----+-----------------+-------------+
|  1 |   1 | 10              |       1     |  
|  2 |   8 | 12              |       9     |  --1+8
|  3 |   6 | 14              |       15    |  --1+8+6 -- greater than reset val
|  4 |   5 | 10              |       5     |  --reset 
|  5 |   6 | 13              |       11    |  --5+6
|  6 |   3 | 11              |       14    |  --5+6+3 -- greater than reset val
|  7 |   9 | 8               |       9     |  --reset -- greater than reset val 
|  8 |  10 | 12              |      10     |  --reset
+----+-----+-----------------+-------------+

Query:

I got the result using a Recursive CTE. The original question is here: /programming/42085404/reset-running-total-based-on-another-column

;WITH cte
     AS (SELECT rn,id,
                val,
                reset_val,
                grp,
                val                   AS running_total,
                Iif (val > reset_val, 1, 0) AS flag
         FROM   #test
         WHERE  rn = 1
         UNION ALL
         SELECT r.*,
                Iif(c.flag = 1, r.val, c.running_total + r.val),
                Iif(Iif(c.flag = 1, r.val, c.running_total + r.val) > r.reset_val, 1, 0)
         FROM   cte c
                JOIN #test r
                  ON r.grp = c.grp
                     AND r.rn = c.rn + 1)
SELECT *
FROM   cte 

Is there a better alternative in T-SQL, without using CLR?


Better in what way? Does this query perform poorly? By what metrics?
Aaron Bertrand

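@AaronBertrand For better understanding I posted sample data for just one group. I have to do the same for around 50000 groups with 60 Ids each, so the total record count will be around 3000000. The Recursive CTE will certainly not scale well on 3000000 rows. I will update the metrics when I get back to the office. Can we achieve this using sum()Over(Order by), as you did in this article: sqlperformance.com/2012/07/t-sql-queries/running-totals
Pரதீப், '17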

A cursor might do better than a recursive CTE.
paparazzo

Answers:


6

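I have looked at similar problems and have never been able to find a window function solution that makes a single pass over the data. I don't think it's possible. Window functions need to be able to be applied to all of the values in a column. That makes reset calculations like this one very difficult, because a single reset changes the value for all of the rows that follow.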

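One way to think about the problem is that you can get the end result you want from a basic running total, as long as you can subtract the running total of the correct previous row. For example, in your sample data the value for id 4 is the running total of row 4 minus the running total of row 3. The value for id 6 is the running total of row 6 minus the running total of row 3, because no reset has happened in between. The value for id 7 is the running total of row 7 minus the running total of row 6, and so on.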
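To make the subtraction concrete, here is a minimal sketch on the eight sample rows from the question. The table variable @sample and the alias plain_running_total are names made up for this illustration; they are not part of the solution further down.

```sql
-- Illustration only: @sample and plain_running_total are hypothetical names.
DECLARE @sample TABLE (id int, val int, reset_val int);

INSERT INTO @sample (id, val, reset_val)
VALUES (1,1,10),(2,8,12),(3,6,14),(4,5,10),(5,6,13),(6,3,11),(7,9,8),(8,10,12);

-- Ordinary running total, with no reset logic at all.
SELECT id, val, reset_val,
       SUM(val) OVER (ORDER BY id
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS plain_running_total
FROM @sample
ORDER BY id;

-- plain_running_total for ids 1..8: 1, 9, 15, 20, 26, 29, 38, 48
-- Resets happen after ids 3, 6 and 7, so the wanted values are:
--   id 4: 20 - 15 = 5     id 5: 26 - 15 = 11    id 6: 29 - 15 = 14
--   id 7: 38 - 29 = 9     id 8: 48 - 38 = 10
```

The hard part, which the rest of this answer deals with, is finding which previous row to subtract without already knowing where the resets fall.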

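I would approach this with T-SQL in a loop. I got a little carried away and think I have a full solution. For 3 million rows and 500 groups the code finished in 24 seconds on my desktop. I'm testing with SQL Server 2016 Developer edition with 6 vCPUs. I'm taking advantage of parallel inserts and parallel execution in general, so you may need to change the code if you're on an older version or have DOP limitations.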

Below is the code that I used to generate the data. The ranges of VAL and RESET_VAL should be similar to your sample data.

drop table if exists reset_runn_total;

create table reset_runn_total
(
id int identity(1,1),
val int, 
reset_val int,
grp int
);

DECLARE 
@group_num INT,
@row_num INT;
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;

    SET @group_num = 1;
    WHILE @group_num <= 50000 
    BEGIN
        SET @row_num = 1;
        WHILE @row_num <= 60
        BEGIN
            INSERT INTO reset_runn_total WITH (TABLOCK)
            SELECT 1 + ABS(CHECKSUM(NewId())) % 10, 8 + ABS(CHECKSUM(NewId())) % 8, @group_num;

            SET @row_num = @row_num + 1;
        END;
        SET @group_num = @group_num + 1;
    END;
    COMMIT TRANSACTION;
END;

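The algorithm is as follows:

1) Start by inserting all rows, with a standard running total, into a temp table.

2) In a loop:

2a) For each group, find the first row remaining in the table whose running total is above its reset_value, and store the id, the running total that was too large, and the previous running total that was too large in a temp table.

2b) Delete rows from the first temp table into a results temp table where the ID is less than or equal to the ID in the second temp table. Use the other columns to adjust the running total as needed.

3) Once the delete no longer processes any rows, run one more DELETE OUTPUT into the results table. This is for the rows at the end of a group that never exceed the reset value.

I will go through one implementation of the above algorithm in T-SQL step by step.

Start by creating a few temp tables. #initial_results holds the original data with the standard running total, #group_bookkeeping is updated on each loop to figure out which rows can be moved, and #final_results contains the results with the running total adjusted for resets.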

CREATE TABLE #initial_results (
id int,
val int, 
reset_val int,
grp int,
initial_running_total int
);

CREATE TABLE #group_bookkeeping (
grp int,
max_id_to_move int,
running_total_to_subtract_this_loop int,
running_total_to_subtract_next_loop int,
grp_done bit, 
PRIMARY KEY (grp)
);

CREATE TABLE #final_results (
id int,
val int, 
reset_val int,
grp int,
running_total int
);

INSERT INTO #initial_results WITH (TABLOCK)
SELECT ID, VAL, RESET_VAL, GRP, SUM(VAL) OVER (PARTITION BY GRP ORDER BY ID) RUNNING_TOTAL
FROM reset_runn_total;

CREATE CLUSTERED INDEX i1 ON #initial_results (grp, id);

INSERT INTO #group_bookkeeping WITH (TABLOCK)
SELECT DISTINCT GRP, 0, 0, 0, 0
FROM reset_runn_total;

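I create the clustered index on the temp table afterwards so that the insert and the index build can both be done in parallel. That made a big difference on my machine but may not on yours. Creating an index on the source table didn't seem to help, but it could help on your machine.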

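The code below runs in the loop and updates the bookkeeping table. For each group, we need to find the maximum ID which should be moved into the results table. We need the running total from that row so we can subtract it from the initial running total. The grp_done column is set to 1 when there isn't any more work to do for a grp.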

WITH UPD_CTE AS (
    SELECT 
    #group_bookkeeping.GRP
    -- first ID in the group whose adjusted running total exceeds RESET_VAL
    , MIN(CASE WHEN initial_running_total - #group_bookkeeping.running_total_to_subtract_next_loop > RESET_VAL THEN ID ELSE NULL END) max_id_to_update
    -- amount to subtract from the rows moved in this iteration
    , MIN(#group_bookkeeping.running_total_to_subtract_next_loop) running_total_to_subtract_this_loop
    -- running total at the reset row, to be subtracted in the next iteration
    , MIN(CASE WHEN initial_running_total - #group_bookkeeping.running_total_to_subtract_next_loop > RESET_VAL THEN initial_running_total ELSE NULL END) additional_value_next_loop
    -- no row exceeds RESET_VAL any more: the group is finished
    , CASE WHEN MIN(CASE WHEN initial_running_total - #group_bookkeeping.running_total_to_subtract_next_loop > RESET_VAL THEN ID ELSE NULL END) IS NULL THEN 1 ELSE 0 END grp_done
    FROM #group_bookkeeping 
    INNER JOIN #initial_results ir ON #group_bookkeeping.grp = ir.grp
    WHERE #group_bookkeeping.grp_done = 0
    GROUP BY #group_bookkeeping.GRP
)
UPDATE #group_bookkeeping
SET #group_bookkeeping.max_id_to_move = uv.max_id_to_update
, #group_bookkeeping.running_total_to_subtract_this_loop = uv.running_total_to_subtract_this_loop
, #group_bookkeeping.running_total_to_subtract_next_loop = uv.additional_value_next_loop
, #group_bookkeeping.grp_done = uv.grp_done
FROM UPD_CTE uv
WHERE uv.GRP = #group_bookkeeping.grp
OPTION (LOOP JOIN);

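I'm really not a fan of the LOOP JOIN hint in general, but this is a simple query and it was the quickest way to get what I wanted. To really optimize for response time I wanted parallel nested loop joins instead of DOP 1 merge joins.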

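The code below runs in the loop and moves data from the initial table into the final results table. Note the adjustment to the initial running total.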

DELETE ir
OUTPUT DELETED.id,  
    DELETED.VAL,  
    DELETED.RESET_VAL,  
    DELETED.GRP ,
    DELETED.initial_running_total - tb.running_total_to_subtract_this_loop
INTO #final_results
FROM #initial_results ir
INNER JOIN #group_bookkeeping tb ON ir.GRP = tb.GRP AND ir.ID <= tb.max_id_to_move
WHERE tb.grp_done = 0;

For your convenience, below is the full code:

DECLARE @RC INT;
BEGIN
SET NOCOUNT ON;

CREATE TABLE #initial_results (
id int,
val int, 
reset_val int,
grp int,
initial_running_total int
);

CREATE TABLE #group_bookkeeping (
grp int,
max_id_to_move int,
running_total_to_subtract_this_loop int,
running_total_to_subtract_next_loop int,
grp_done bit, 
PRIMARY KEY (grp)
);

CREATE TABLE #final_results (
id int,
val int, 
reset_val int,
grp int,
running_total int
);

INSERT INTO #initial_results WITH (TABLOCK)
SELECT ID, VAL, RESET_VAL, GRP, SUM(VAL) OVER (PARTITION BY GRP ORDER BY ID) RUNNING_TOTAL
FROM reset_runn_total;

CREATE CLUSTERED INDEX i1 ON #initial_results (grp, id);

INSERT INTO #group_bookkeeping WITH (TABLOCK)
SELECT DISTINCT GRP, 0, 0, 0, 0
FROM reset_runn_total;

SET @RC = 1;
WHILE @RC > 0 
BEGIN
    WITH UPD_CTE AS (
        SELECT 
        #group_bookkeeping.GRP
        , MIN(CASE WHEN initial_running_total - #group_bookkeeping.running_total_to_subtract_next_loop > RESET_VAL THEN ID ELSE NULL END) max_id_to_move
        , MIN(#group_bookkeeping.running_total_to_subtract_next_loop) running_total_to_subtract_this_loop
        , MIN(CASE WHEN initial_running_total - #group_bookkeeping.running_total_to_subtract_next_loop > RESET_VAL THEN initial_running_total ELSE NULL END) additional_value_next_loop
        , CASE WHEN MIN(CASE WHEN initial_running_total - #group_bookkeeping.running_total_to_subtract_next_loop > RESET_VAL THEN ID ELSE NULL END) IS NULL THEN 1 ELSE 0 END grp_done
        FROM #group_bookkeeping 
        CROSS APPLY (SELECT ID, RESET_VAL, initial_running_total FROM #initial_results ir WHERE #group_bookkeeping.grp = ir.grp ) ir
        WHERE #group_bookkeeping.grp_done = 0
        GROUP BY #group_bookkeeping.GRP
    )
    UPDATE #group_bookkeeping
    SET #group_bookkeeping.max_id_to_move = uv.max_id_to_move
    , #group_bookkeeping.running_total_to_subtract_this_loop = uv.running_total_to_subtract_this_loop
    , #group_bookkeeping.running_total_to_subtract_next_loop = uv.additional_value_next_loop
    , #group_bookkeeping.grp_done = uv.grp_done
    FROM UPD_CTE uv
    WHERE uv.GRP = #group_bookkeeping.grp
    OPTION (LOOP JOIN);

    DELETE ir
    OUTPUT DELETED.id,  
        DELETED.VAL,  
        DELETED.RESET_VAL,  
        DELETED.GRP ,
        DELETED.initial_running_total - tb.running_total_to_subtract_this_loop
    INTO #final_results
    FROM #initial_results ir
    INNER JOIN #group_bookkeeping tb ON ir.GRP = tb.GRP AND ir.ID <= tb.max_id_to_move
    WHERE tb.grp_done = 0;

    SET @RC = @@ROWCOUNT;
END;

DELETE ir 
OUTPUT DELETED.id,  
    DELETED.VAL,  
    DELETED.RESET_VAL,  
    DELETED.GRP ,
    DELETED.initial_running_total - tb.running_total_to_subtract_this_loop
    INTO #final_results
FROM #initial_results ir
INNER JOIN #group_bookkeeping tb ON ir.GRP = tb.GRP;

CREATE CLUSTERED INDEX f1 ON #final_results (grp, id);

/* -- do something with the data
SELECT *
FROM #final_results
ORDER BY grp, id;
*/

DROP TABLE #final_results;
DROP TABLE #initial_results;
DROP TABLE #group_bookkeeping;

END;

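Simply awesome, I will reward you for it.
Pரதீப்

On our server, for 50000 grp and 60 id, your code took 1 minute and 10 seconds. The Recursive CTE took 2 minutes and 15 seconds.
Pரதீப், '17

I tested both pieces of code with the same data. Yours was awesome. Can it be improved further?
Pரதீப், 14:48

I mean, I ran your code on our real data and tested that. In my real procedure the calculation is done in temp tables, and most probably it should be tightly packed. It would be good if it could be brought down to around 30 seconds.
Pரதீப்

@Prdp I tried a quick approach that used an update, but it seemed to be worse. I won't be able to look into this more for a while. Try logging how long each operation takes so you can figure out which part is running slowest on your server. It's definitely possible that there's a way to speed up this code, or a better algorithm in general.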
Joe Obbish
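
One simple way to follow that logging advice is to wrap each statement in a timestamp pair. This is only a sketch: the variable @step_start and the message text are made up for illustration, and CONCAT assumes SQL Server 2012 or later.

```sql
-- Hypothetical timing wrapper; repeat it around each statement you want to measure.
DECLARE @step_start datetime2(3) = SYSDATETIME();

-- ... run one step here, for example the UPDATE of #group_bookkeeping ...

PRINT CONCAT('bookkeeping update: ',
             DATEDIFF(MILLISECOND, @step_start, SYSDATETIME()),
             ' ms');
```

Inside the WHILE loop this prints one line per iteration, so it may be easier to add the elapsed milliseconds to a per-step total and print the totals once after the loop.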

4

Using a cursor:

ALTER TABLE #reset_runn_total ADD RunningTotal int;

DECLARE @id int, @val int, @reset int, @acm int, @grp int, @last_grp int;
SET @acm = 0;

DECLARE curRes CURSOR FAST_FORWARD FOR 
SELECT id, val, reset_val, grp
FROM #reset_runn_total
ORDER BY grp, id;

OPEN curRes;
FETCH NEXT FROM curRes INTO @id, @val, @reset, @grp;
SET @last_grp = @grp;

WHILE @@FETCH_STATUS = 0  
BEGIN
    IF @grp <> @last_grp SET @acm = 0;
    SET @last_grp = @grp;
    SET @acm = @acm + @val;
    UPDATE #reset_runn_total
    SET RunningTotal = @acm
    WHERE id = @id;
    IF @acm > @reset SET @acm = 0;
    FETCH NEXT FROM curRes INTO @id, @val, @reset, @grp;
END

CLOSE curRes;
DEALLOCATE curRes;

+----+-----+-----------+-------------+
| id | val | reset_val | RunningTotal|
+----+-----+-----------+-------------+
| 1  | 1   | 10        |     1       |
+----+-----+-----------+-------------+
| 2  | 8   | 12        |     9       |
+----+-----+-----------+-------------+
| 3  | 6   | 14        |     15      |
+----+-----+-----------+-------------+
| 4  | 5   | 10        |     5       |
+----+-----+-----------+-------------+
| 5  | 6   | 13        |     11      |
+----+-----+-----------+-------------+
| 6  | 3   | 11        |     14      |
+----+-----+-----------+-------------+
| 7  | 9   | 8         |     9       |
+----+-----+-----------+-------------+
| 8  | 10  | 12        |     10      |
+----+-----+-----------+-------------+

Check it here: http://rextester.com/WSPLO95303


3

No window functions, but a pure SQL version:

WITH x AS (
    SELECT TOP 1 id,
           val,
           reset_val,
           val AS running_total,
           1 AS level 
      FROM reset_runn_total
    UNION ALL
    SELECT r.id,
           r.val,
           r.reset_val,
           CASE WHEN x.running_total <= x.reset_val THEN x.running_total + r.val ELSE r.val END,
           level = level + 1
      FROM x JOIN reset_runn_total AS r ON (r.id > x.id)
) SELECT
  *
FROM x
WHERE NOT EXISTS (
        SELECT 1
        FROM x AS x2
        WHERE x2.id = x.id
        AND x2.level > x.level
    )
ORDER BY id, level DESC
;

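I am not a specialist in the SQL Server dialect. This is the initial version for PostgreSQL (if I understand correctly, I cannot use LIMIT 1 / TOP 1 in the recursive part in SQL Server):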

WITH RECURSIVE x AS (
    (SELECT id, val, reset_val, val AS running_total
       FROM reset_runn_total
      ORDER BY id
      LIMIT 1)
    UNION
    (SELECT r.id, r.val, r.reset_val,
            CASE WHEN x.running_total <= x.reset_val THEN x.running_total + r.val ELSE r.val END
       FROM x JOIN reset_runn_total AS r ON (r.id > x.id)
      ORDER BY id
      LIMIT 1)
) SELECT * FROM x;

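@JoeObbish To be honest, that is not entirely clear from the question. The expected results, for example, show no grp column.
ypercubeᵀᴹ

@JoeObbish That is what I understood, too. Still, the question could benefit from an explicit statement about it. The code in the question (with the CTE) doesn't use it either, and even has differently named columns. It should be obvious to anyone who reads the question; they wouldn't, and shouldn't, have to read the other answers or comments.
ypercubeᵀᴹ

@ypercubeᵀᴹ Added the required information to the question.
Pரதீப், '17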

1

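It seems you have several queries/methods for solving the problem, but you haven't told us about (or even considered?) the indexes on the table.

What indexes are there on the table? Is it a heap, or does it have a clustered index?

I would try the various suggested solutions after adding this index: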

(grp, id) INCLUDE (val, reset_val)

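Or just change (or create) the clustered index so that it is on (grp, id)

Having an index that targets the specific query should improve efficiency for most, if not all, of the methods.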
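
Spelled out as DDL, the two options could look like the sketch below. It assumes the source table is named reset_runn_total, as in the data-generation script in the first answer, and the index names are invented for this example. Pick one of the two statements rather than creating both.

```sql
-- Option 1: covering nonclustered index (index name is hypothetical).
CREATE NONCLUSTERED INDEX ix_reset_runn_total_grp_id
    ON reset_runn_total (grp, id)
    INCLUDE (val, reset_val);

-- Option 2: make (grp, id) the clustered index instead (index name is hypothetical).
-- If the table already has a clustered index, that one must be dropped or rebuilt first.
CREATE CLUSTERED INDEX cx_reset_runn_total_grp_id
    ON reset_runn_total (grp, id);
```

Either way, rows are stored in (grp, id) order, which is the order that the window function, the recursive CTE join and the cursor all read them in.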


Added the required information to the question.
Pரதீப், '17
Licensed under cc by-sa 3.0 with attribution required.