Selecting data into groups evenly distributed by value


8

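I would like to select the data from a table into 4 groups, so that the sums of the values in the groups are as evenly distributed as possible. I'm sure I'm not explaining it clearly enough, so I'll try to give an example.

Here I use NTILE(4) to create the 4 groups: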

SELECT Time, NTILE(4) OVER (ORDER BY Time DESC) AS N FROM TableX

Time -  N
-------------
10  -   1
 9  -   2
 8  -   3
 7  -   4
 6  -   1
 5  -   2
 4  -   3
 3  -   4
 2  -   1
 1  -   2

For brevity, the other columns have been omitted from the query and result above.

So you can also view the groups as follows:

  1    2    3    4
---  ---  ---  ---
 10    9    8    7
  6    5    4    3
  2    1    
---  ---  ---  ---
 18   15   12   10  Sum Totals of Time

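Notice that the Sum Totals of Time using NTile are not really balanced between the groups. A better distribution of the Time values would be, for example: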

  1    2    3    4
---  ---  ---  ---
 10    9    8    7
  3    5    4    6
  1         2
---  ---  ---  ---
 14   14   14   13  Sum Totals of Time

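Here the Sum Totals of Time are more evenly distributed over the 4 groups.

How can I do this with a T-SQL statement?

I should also mention that I'm using SQL Server 2012. If you have something that can help me, let me know.

Have a nice day.

Stan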


Are your values always integers? If so, are they in a continuous series, or are there gaps? Unique values?
Daniel Hutmacher '16

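Hi, yes, they are integers, and no, they are not continuous; some may be duplicated, and there are certainly gaps between them. Think of them as the time needed to perform an operation on that particular item (the item being one of the omitted columns).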
iStan

Answers:


14

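Here's a stab at an algorithm. It's not perfect, and depending on how much time you want to spend refining it, there are probably some further small gains to be made.

Let's assume you have a table of tasks to be performed by four queues. You know the amount of work associated with each task, and you want all four queues to get an almost equal amount of work, so that all queues finish at about the same time.

First off, I'd partition the tasks using a modulus, ordered by their size, from small to large.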

SELECT [time], ROW_NUMBER() OVER (ORDER BY [time])%4 AS grp, 0

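ROW_NUMBER() orders every row by size, then assigns a row number, starting at 1. Each row number is then assigned a "group" (the grp column) on a round-robin basis. The first row is group 1, the second row is group 2, then 3, the fourth gets group 0, and so on.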

time ROW_NUMBER() grp
---- ------------ ---
   1            1   1
  10            2   2
  12            3   3
  15            4   0
  19            5   1
  22            6   2
...
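
To make the round-robin assignment concrete, here is a minimal sketch that reproduces the six rows shown above. The inline VALUES list and the alias t are assumptions for illustration only; in practice the rows would come from your own table.

```sql
-- Hypothetical inline data: the six [time] values from the table above.
SELECT [time],
       ROW_NUMBER() OVER (ORDER BY [time])     AS rn,   -- 1, 2, 3, 4, 5, 6
       ROW_NUMBER() OVER (ORDER BY [time]) % 4 AS grp   -- 1, 2, 3, 0, 1, 2
FROM (VALUES (1), (10), (12), (15), (19), (22)) AS t([time])
ORDER BY [time];
```

Because the rows are sorted by size before the modulus is applied, every group receives a mix of small and large tasks, which is a reasonable starting point for the balancing step.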

For ease of use, I'm storing the time and grp columns in a table variable called @work.

Now, we can perform a few calculations on this data:

WITH cte AS (
    SELECT *, SUM([time]) OVER (PARTITION BY grp)
             -SUM([time]) OVER (PARTITION BY (SELECT NULL))/4 AS _grpoffset
    FROM @work)
...

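The column _grpoffset is how much the total time per grp differs from the "ideal" average. If the total time of all tasks is 1000 and there are four groups, there should ideally be a total of 250 in each group. If a group contains a total of 268, that group's _grpoffset=18.

The idea is to identify the two best rows, one in a "positive" group (with too much work) and one in a "negative" group (with too little work). If we can swap groups on those two rows, we can reduce the absolute _grpoffset of both groups.

Example: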

time grp total _grpoffset
---- --- ----- ----------
   3   1   222         40
  46   1   222         40
  73   1   222         40
 100   1   222         40
   6   2   134        -48
  52   2   134        -48
  76   2   134        -48
  11   3   163        -19
  66   3   163        -19
  86   3   163        -19
  45   0   208         26
  71   0   208         26
  92   0   208         26
----
=727

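With a grand total of 727, each group should have a score of about 182 (727/4 = 181.75) for the distribution to be perfect. The difference between a group's score and 182 is what we're putting in the _grpoffset column.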
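
The group totals and offsets in this example can be checked with a short query. This is only a sketch: the table variable is loaded with the thirteen sample rows from the example, and the offset is computed with the same integer division by 4 that the answer's code uses.

```sql
-- Sample rows copied from the example table: ([time], grp).
DECLARE @work TABLE ([time] int NOT NULL, grp int NOT NULL);

INSERT INTO @work ([time], grp)
VALUES (3, 1), (46, 1), (73, 1), (100, 1),
       (6, 2), (52, 2), (76, 2),
       (11, 3), (66, 3), (86, 3),
       (45, 0), (71, 0), (92, 0);

-- One row per group: its total and its distance from the average.
-- int / int is integer division in T-SQL, so 727 / 4 evaluates to 181.
SELECT grp,
       SUM([time]) AS total,
       SUM([time]) - (SELECT SUM([time]) FROM @work) / 4 AS _grpoffset
FROM @work
GROUP BY grp
ORDER BY grp;
```

Since the integer division truncates 181.75 to 181 rather than rounding it to 182, this query reports offsets of 27, 41, -47 and -18 for groups 0, 1, 2 and 3. The signs, which are what the swap logic depends on, are unaffected.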

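As you can see now, in the best of worlds, we should move rows worth around 40 points from group 1 to group 2, and around 20 points from group 0 to group 3.

Here's the code that identifies those candidate rows: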

    SELECT TOP 1 pos._row AS _pos_row, pos.grp AS _pos_grp,
                 neg._row AS _neg_row, neg.grp AS _neg_grp
    FROM cte AS pos
    INNER JOIN cte AS neg ON
        pos._grpoffset>0 AND
        neg._grpoffset<0 AND
        --- To prevent infinite recursion:
        pos.moved<4 AND
        neg.moved<4
    WHERE --- must improve positive side's offset:
          ABS(pos._grpoffset-pos.[time]+neg.[time])<=pos._grpoffset AND
          --- must improve negative side's offset:
          ABS(neg._grpoffset-neg.[time]+pos.[time])<=ABS(neg._grpoffset)
    --- Largest changes first:
    ORDER BY ABS(pos.[time]-neg.[time]) DESC

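I'm self-joining the common table expression that we created before, cte: on one side, groups with a positive _grpoffset, on the other side, groups with a negative one. To further filter out which rows should be matched with each other, swapping the positive and negative sides' rows must improve _grpoffset for both, i.e. bring it closer to 0.

The TOP 1 and ORDER BY select the "best" match to swap first.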
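
To see which swaps qualify on the example data, the following sketch lists every candidate pair instead of only the TOP 1. It makes a few assumptions for illustration: the sample rows are loaded into a fresh table variable, the moved guard is left out because nothing has been swapped yet, and OVER () is used for the grand total, which should be equivalent to PARTITION BY (SELECT NULL).

```sql
-- Sample rows copied from the example table; _row is a surrogate key
-- like the one in the answer's @work table.
DECLARE @work TABLE (
    _row   int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    [time] int NOT NULL,
    grp    int NOT NULL
);

INSERT INTO @work ([time], grp)
VALUES (3, 1), (46, 1), (73, 1), (100, 1),
       (6, 2), (52, 2), (76, 2),
       (11, 3), (66, 3), (86, 3),
       (45, 0), (71, 0), (92, 0);

WITH cte AS (
    SELECT *, SUM([time]) OVER (PARTITION BY grp)
             -SUM([time]) OVER ()/4 AS _grpoffset
    FROM @work)

-- Every (pos, neg) pair whose swap brings both group offsets closer to 0.
SELECT pos.[time] AS pos_time, pos.grp AS pos_grp,
       neg.[time] AS neg_time, neg.grp AS neg_grp,
       pos.[time]-neg.[time] AS work_moved
FROM cte AS pos
INNER JOIN cte AS neg ON
    pos._grpoffset>0 AND
    neg._grpoffset<0
WHERE ABS(pos._grpoffset-pos.[time]+neg.[time])<=pos._grpoffset AND
      ABS(neg._grpoffset-neg.[time]+pos.[time])<=ABS(neg._grpoffset)
--- Largest changes first:
ORDER BY ABS(pos.[time]-neg.[time]) DESC;
```

The first row of this result is the pair that TOP 1 would pick. Each swap changes the offsets, so the candidates have to be recomputed after every move, which is why the full solution runs the statement in a loop.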

Now, all we need to do is add an UPDATE, and loop it until there are no more optimizations to be found.

TL;DR - here's the query

Here's the complete code:

DECLARE @work TABLE (
    _row    int IDENTITY(1, 1) NOT NULL,
    [time]  int NOT NULL,
    grp     int NOT NULL,
    moved   tinyint NOT NULL,
    PRIMARY KEY CLUSTERED ([time], _row)
);

WITH cte AS (
    SELECT 0 AS n, CAST(1+100*RAND(CHECKSUM(NEWID())) AS int) AS [time]
    UNION ALL
    SELECT n+1,    CAST(1+100*RAND(CHECKSUM(NEWID())) AS int) AS [time]
    FROM cte WHERE n<100)

INSERT INTO @work ([time], grp, moved)
SELECT [time], ROW_NUMBER() OVER (ORDER BY [time])%4 AS grp, 0
FROM cte;



WHILE (@@ROWCOUNT!=0)
    WITH cte AS (
        SELECT *, SUM([time]) OVER (PARTITION BY grp)
                 -SUM([time]) OVER (PARTITION BY (SELECT NULL))/4 AS _grpoffset
        FROM @work)

    UPDATE w
    SET w.grp=(CASE w._row
               WHEN x._pos_row THEN x._neg_grp
               ELSE x._pos_grp END),
        w.moved=w.moved+1
    FROM @work AS w
    INNER JOIN (
        SELECT TOP 1 pos._row AS _pos_row, pos.grp AS _pos_grp,
                     neg._row AS _neg_row, neg.grp AS _neg_grp
        FROM cte AS pos
        INNER JOIN cte AS neg ON
            pos._grpoffset>0 AND
            neg._grpoffset<0 AND
            --- To prevent infinite recursion:
            pos.moved<4 AND
            neg.moved<4
        WHERE --- must improve positive side's offset:
              ABS(pos._grpoffset-pos.[time]+neg.[time])<=pos._grpoffset AND
              --- must improve negative side's offset:
              ABS(neg._grpoffset-neg.[time]+pos.[time])<=ABS(neg._grpoffset)
        --- Largest changes first:
        ORDER BY ABS(pos.[time]-neg.[time]) DESC
        ) AS x ON w._row IN (x._pos_row, x._neg_row);
Licensed under cc by-sa 3.0 with attribution required.