Grouping or Window

13

I have a situation that I think can be solved with a window function, but I'm not sure.

Imagine the following table:

CREATE TABLE tmp
  ( date timestamp,        
    id_type integer
  ) ;

INSERT INTO tmp 
    ( date, id_type )
VALUES
    ( '2017-01-10 07:19:21.0', 3 ),
    ( '2017-01-10 07:19:22.0', 3 ),
    ( '2017-01-10 07:19:23.1', 3 ),
    ( '2017-01-10 07:19:24.1', 3 ),
    ( '2017-01-10 07:19:25.0', 3 ),
    ( '2017-01-10 07:19:26.0', 5 ),
    ( '2017-01-10 07:19:27.1', 3 ),
    ( '2017-01-10 07:19:28.0', 5 ),
    ( '2017-01-10 07:19:29.0', 5 ),
    ( '2017-01-10 07:19:30.1', 3 ),
    ( '2017-01-10 07:19:31.0', 5 ),
    ( '2017-01-10 07:19:32.0', 3 ),
    ( '2017-01-10 07:19:33.1', 5 ),
    ( '2017-01-10 07:19:35.0', 5 ),
    ( '2017-01-10 07:19:36.1', 5 ),
    ( '2017-01-10 07:19:37.1', 5 )
  ;

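I'd like to start a new group at every change in the id_type column. E.g. the first group runs from 7:19:21 to 7:19:25, the second starts and ends at 7:19:26, and so on.
Once that works, I want to include more criteria to define groups.

At the moment, using the query below...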

SELECT distinct 
    min(min(date)) over w as begin, 
    max(max(date)) over w as end,   
    id_type
from tmp
GROUP BY id_type
WINDOW w as (PARTITION BY id_type)
order by  begin;

I get the following result:

begin                   end                     id_type
2017-01-10 07:19:21.0   2017-01-10 07:19:32.0   3
2017-01-10 07:19:26.0   2017-01-10 07:19:37.1   5

While I'd like:

begin                   end                     id_type
2017-01-10 07:19:21.0   2017-01-10 07:19:25.0   3
2017-01-10 07:19:26.0   2017-01-10 07:19:26.0   5
2017-01-10 07:19:27.1   2017-01-10 07:19:27.1   3
2017-01-10 07:19:28.0   2017-01-10 07:19:29.0   5
2017-01-10 07:19:30.1   2017-01-10 07:19:30.1   3
2017-01-10 07:19:31.0   2017-01-10 07:19:31.0   5
2017-01-10 07:19:32.0   2017-01-10 07:19:32.0   3
2017-01-10 07:19:33.1   2017-01-10 07:19:37.1   5

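After I solve this first step, I'll add more columns to use as rules for breaking groups, and those other columns will be nullable.

Postgres version: 8.4 (we have Postgres with PostGIS, so it is not easy to upgrade. PostGIS functions change names and there are other problems, but hopefully we are already rewriting everything, and the new version will use a newer 9.x release with PostGIS 2.x)

— Lelo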

2

General solution: dba.stackexchange.com/questions/35380/...

— Erwin Brandstetter

4

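A few points:

Don't call a non-temporary table tmp; that just gets confusing.
Don't use text for timestamps (you're doing that in your example; we can tell because the timestamp wasn't truncated and has .0).
Don't call a field that holds a time date. If it has both a date and a time, it's a timestamp (and store it as one).

Better to use a window function.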

SELECT id_type, grp, min(date), max(date)
FROM (
  SELECT date, id_type, count(is_reset) OVER (ORDER BY date) AS grp
  FROM (
    SELECT date, id_type, CASE WHEN lag(id_type) OVER (ORDER BY date) <> id_type THEN 1 END AS is_reset
    FROM tmp
  ) AS t
) AS g
GROUP BY id_type, grp
ORDER BY min(date);

Outputs

 id_type | grp |          min          |          max          
---------+-----+-----------------------+-----------------------
       3 |   0 | 2017-01-10 07:19:21.0 | 2017-01-10 07:19:25.0
       5 |   1 | 2017-01-10 07:19:26.0 | 2017-01-10 07:19:26.0
       3 |   2 | 2017-01-10 07:19:27.1 | 2017-01-10 07:19:27.1
       5 |   3 | 2017-01-10 07:19:28.0 | 2017-01-10 07:19:29.0
       3 |   4 | 2017-01-10 07:19:30.1 | 2017-01-10 07:19:30.1
       5 |   5 | 2017-01-10 07:19:31.0 | 2017-01-10 07:19:31.0
       3 |   6 | 2017-01-10 07:19:32.0 | 2017-01-10 07:19:32.0
       5 |   7 | 2017-01-10 07:19:33.1 | 2017-01-10 07:19:37.1
(8 rows)

Explanation

First we need the resets. We generate them with lag():

SELECT date, id_type, CASE WHEN lag(id_type) OVER (ORDER BY date) <> id_type THEN 1 END AS is_reset
FROM tmp
ORDER BY date;

         date          | id_type | is_reset 
-----------------------+---------+----------
 2017-01-10 07:19:21.0 |       3 |         
 2017-01-10 07:19:22.0 |       3 |         
 2017-01-10 07:19:23.1 |       3 |         
 2017-01-10 07:19:24.1 |       3 |         
 2017-01-10 07:19:25.0 |       3 |         
 2017-01-10 07:19:26.0 |       5 |        1
 2017-01-10 07:19:27.1 |       3 |        1
 2017-01-10 07:19:28.0 |       5 |        1
 2017-01-10 07:19:29.0 |       5 |         
 2017-01-10 07:19:30.1 |       3 |        1
 2017-01-10 07:19:31.0 |       5 |        1
 2017-01-10 07:19:32.0 |       3 |        1
 2017-01-10 07:19:33.1 |       5 |        1
 2017-01-10 07:19:35.0 |       5 |         
 2017-01-10 07:19:36.1 |       5 |         
 2017-01-10 07:19:37.1 |       5 |         
(16 rows)

Then we count to get the groups.

SELECT date, id_type, count(is_reset) OVER (ORDER BY date) AS grp
FROM (
  SELECT date, id_type, CASE WHEN lag(id_type) OVER (ORDER BY date) <> id_type THEN 1 END AS is_reset
  FROM tmp
  ORDER BY date
) AS t
ORDER BY date

         date          | id_type | grp 
-----------------------+---------+-----
 2017-01-10 07:19:21.0 |       3 |   0
 2017-01-10 07:19:22.0 |       3 |   0
 2017-01-10 07:19:23.1 |       3 |   0
 2017-01-10 07:19:24.1 |       3 |   0
 2017-01-10 07:19:25.0 |       3 |   0
 2017-01-10 07:19:26.0 |       5 |   1
 2017-01-10 07:19:27.1 |       3 |   2
 2017-01-10 07:19:28.0 |       5 |   3
 2017-01-10 07:19:29.0 |       5 |   3
 2017-01-10 07:19:30.1 |       3 |   4
 2017-01-10 07:19:31.0 |       5 |   5
 2017-01-10 07:19:32.0 |       3 |   6
 2017-01-10 07:19:33.1 |       5 |   7
 2017-01-10 07:19:35.0 |       5 |   7
 2017-01-10 07:19:36.1 |       5 |   7
 2017-01-10 07:19:37.1 |       5 |   7
(16 rows)

Then we wrap that in a subselect, GROUP BY and ORDER BY, and select the min and max (the range).

SELECT id_type, grp, min(date), max(date)
FROM (
  .. stuff
) AS g
GROUP BY id_type, grp
ORDER BY min(date);
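
To see the three steps outside the database, here is a minimal Python sketch of the same logic. It is not part of the answer's SQL: the function name `islands` and the shortened time strings are made up for illustration, and the rows are assumed to be sorted by date already.

```python
from itertools import accumulate

# Sample data from the question as (time of day, id_type), already ordered by date.
rows = [
    ("07:19:21.0", 3), ("07:19:22.0", 3), ("07:19:23.1", 3), ("07:19:24.1", 3),
    ("07:19:25.0", 3), ("07:19:26.0", 5), ("07:19:27.1", 3), ("07:19:28.0", 5),
    ("07:19:29.0", 5), ("07:19:30.1", 3), ("07:19:31.0", 5), ("07:19:32.0", 3),
    ("07:19:33.1", 5), ("07:19:35.0", 5), ("07:19:36.1", 5), ("07:19:37.1", 5),
]

def islands(rows):
    types = [t for _, t in rows]
    # Step 1: like lag(id_type) <> id_type; the first row has no predecessor.
    is_reset = [0] + [int(prev != cur) for prev, cur in zip(types, types[1:])]
    # Step 2: like count(is_reset) OVER (ORDER BY date), a running total.
    grp = list(accumulate(is_reset))
    # Step 3: like GROUP BY id_type, grp with min(date) and max(date).
    # Rows are sorted, so the first date seen is the min and the latest is the max.
    result = {}
    for (date, id_type), g in zip(rows, grp):
        begin, _ = result.get((g, id_type), (date, date))
        result[(g, id_type)] = (begin, date)
    return [(id_type, g, begin, end)
            for (g, id_type), (begin, end) in sorted(result.items())]

for row in islands(rows):
    print(row)
```

For the question's data this produces eight groups numbered 0 to 7, matching the grp column in the output above.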

— Evan Carroll

16

1. Window functions plus subqueries

Count the steps to form groups, similar to Evan's idea, with modifications and fixes:

SELECT id_type
     , min(date) AS begin
     , max(date) AS end
     , count(*)  AS row_ct  -- optional addition
FROM  (
   SELECT date, id_type, count(step OR NULL) OVER (ORDER BY date) AS grp
   FROM  (
      SELECT date, id_type
           , lag(id_type, 1, id_type) OVER (ORDER BY date) <> id_type AS step
      FROM   tmp
      ) sub1
   ) sub2
GROUP  BY id_type, grp
ORDER  BY min(date);

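This assumes the columns involved are NOT NULL. Otherwise you need to do more.

It also assumes date is defined UNIQUE; otherwise you need to add a tiebreaker to the ORDER BY clauses to get deterministic results. Like: ORDER BY date, id.

Detailed explanation (answer to a very similar question):

Select longest continuous sequence

Note in particular:

In related cases, lag() with 3 parameters can be essential to cover the corner case of the first (or last) row elegantly. (The third parameter is used as the default if there is no previous (next) row.)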
```
lag(id_type, 1, id_type) OVER ()
```
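Since we are only interested in an actual change of id_type (TRUE), it does not matter in this particular case. NULL and FALSE both don't count as a step.
count(step OR NULL) OVER (ORDER BY date) is the shortest syntax that also works in Postgres 9.3 or older. count() only counts non-null values...

In modern Postgres (9.4 or later), the cleaner, equivalent syntax would be: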
```
count(step) FILTER (WHERE step) OVER (ORDER BY date)
```
Details:
- For absolute performance, is SUM faster or COUNT?
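
The `step OR NULL` trick can be hard to picture, so here is a small Python sketch of SQL's three-valued OR and of how count() skips NULL. The helpers `sql_or` and `sql_count` and the sample `steps` column are hypothetical, with `None` standing in for NULL.

```python
def sql_or(a, b):
    """Three-valued SQL OR over True / False / None (None stands for NULL)."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def sql_count(values):
    """count(expr): counts every value that is not NULL, FALSE included."""
    return sum(v is not None for v in values)

steps = [None, False, True, False, True]    # hypothetical "step" column

print(sql_count(steps))                               # 4: FALSE is counted too
print(sql_count([sql_or(s, None) for s in steps]))    # 2: only TRUE survives
```

FALSE OR NULL evaluates to NULL while TRUE OR NULL stays TRUE, so the expression turns every non-step into a value that count() ignores.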

2. Subtract two window functions, one subquery

Similar to Erik's idea, with modifications:

SELECT min(date) AS begin
     , max(date) AS end
     , id_type
FROM  (
   SELECT date, id_type
        , row_number() OVER (ORDER BY date)
        - row_number() OVER (PARTITION BY id_type ORDER BY date) AS grp
   FROM   tmp
   ) sub
GROUP  BY id_type, grp
ORDER  BY min(date);

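If date is defined UNIQUE, as I mentioned above (you never clarified), dense_rank() would be pointless, since the result is the same as for row_number() and the latter is substantially cheaper.

If date is not defined UNIQUE (and we don't know that the only duplicates are on (date, id_type)), all of these queries are pointless, since the result is arbitrary.

Also, a subquery is typically cheaper than a CTE in Postgres. Only use CTEs when you need them.

3. Top performance with a plpgsql function

Since this question has become unexpectedly popular, I'll add another solution to demonstrate top performance.

SQL has many sophisticated tools for creating solutions with short and elegant syntax. But a declarative language has its limits for more complex requirements that involve procedural elements.

A server-side procedural function is faster for this than anything posted so far, because it needs only a single sequential scan over the table and a single sort operation. If a fitting index is available, even just a single index-only scan.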

CREATE OR REPLACE FUNCTION f_tmp_groups()
  RETURNS TABLE (id_type int, grp_begin timestamp, grp_end timestamp) AS
$func$
DECLARE
   _row  tmp;                       -- use table type for row variable
BEGIN
   FOR _row IN
      TABLE tmp ORDER BY date       -- add more columns to make order deterministic
   LOOP
      CASE _row.id_type = id_type 
      WHEN TRUE THEN                -- same group continues
         grp_end := _row.date;      -- remember last date so far
      WHEN FALSE THEN               -- next group starts
         RETURN NEXT;               -- return result for last group
         id_type   := _row.id_type;
         grp_begin := _row.date;
         grp_end   := _row.date;
      ELSE                          -- NULL for 1st row
         id_type   := _row.id_type; -- remember row data for starters
         grp_begin := _row.date;
         grp_end   := _row.date;
      END CASE;
   END LOOP;

   RETURN NEXT;                     -- return last result row      
END
$func$ LANGUAGE plpgsql;

Call:

SELECT * FROM f_tmp_groups();
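
As a rough illustration of why a single ordered pass is enough, here is the same loop sketched as a Python generator. The name `tmp_groups` and the integer "dates" are hypothetical, and the handling of empty input (yielding nothing) is my own choice rather than something taken from the function above.

```python
def tmp_groups(rows):
    """Yield (id_type, grp_begin, grp_end) from rows sorted by date, in one pass."""
    current = None                      # no group is open before the first row
    for date, id_type in rows:
        if current is not None and current[0] == id_type:
            current[2] = date           # same group continues: extend its end
        else:
            if current is not None:
                yield tuple(current)    # next group starts: emit the finished one
            current = [id_type, date, date]
    if current is not None:
        yield tuple(current)            # emit the last group

rows = [(1, 3), (2, 3), (3, 5), (4, 3), (5, 3), (6, 5)]   # (date, id_type)
print(list(tmp_groups(rows)))
# [(3, 1, 2), (5, 3, 3), (3, 4, 5), (5, 6, 6)]
```

Each row is visited once and only the currently open group is kept in memory, which is the property the plpgsql function relies on.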

Test with:

EXPLAIN (ANALYZE, TIMING OFF)  -- to focus on total performance
SELECT * FROM  f_tmp_groups();

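You could make the function generic with polymorphic types and pass the table type and column names. Details:

Refactor a PL/pgSQL function to return the output of various SELECT queries

If you don't want to or cannot persist a function for this, it would even pay to create a temporary function on the fly. It costs a few ms.

How to create a temporary function in PostgreSQL?

dbfiddle for Postgres 9.6, comparing the performance of all three. Building on Jack's test case, modified.

dbfiddle for Postgres 8.4, where the performance differences are even bigger.

— Erwin Brandstetter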

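Read this a few times; I'm still not sure what you're talking about with the three-argument lag, or when you would have to use count(x or null), or even what it's doing there. Perhaps you could show a sample where it is required, because it's not required here. And what would trigger the requirement to cover those corner cases? BTW, I changed my downvote to an upvote just for the pl/pgsql example. That's really cool. (But generally I am against answers that summarize other answers or that cover corner cases, though I hate to call this a corner case because I don't understand it.)

— Evan Carroll

I would put them in two separate self-answered questions, because I'm sure I'm not the only one who wonders what count(x or null) does. I'll be happy to ask both questions if you'd rather.

— Evan Carroll

This is one question: In what cases is a count(x or null) needed in Gaps and Islands?

— Evan Carroll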

7

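You can do this with a simple subtraction of ROW_NUMBER() operations (or, if your dates are not unique, though still unique per id_type, you can use DENSE_RANK() instead, though it will be a more expensive query):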

WITH IdTypes AS (
   SELECT
      date,
      id_type,
      Row_Number() OVER (ORDER BY date)
         - Row_Number() OVER (PARTITION BY id_type ORDER BY date)
         AS Seq
   FROM
      tmp
)
SELECT
   Min(date) AS begin,
   Max(date) AS end,
   id_type
FROM IdTypes
GROUP BY id_type, Seq
ORDER BY begin
;

See this work at DB Fiddle (or see the DENSE_RANK version)

Result:

begin                  end                    id_type
---------------------  ---------------------  -------
2017-01-10 07:19:21    2017-01-10 07:19:25    3
2017-01-10 07:19:26    2017-01-10 07:19:26    5
2017-01-10 07:19:27.1  2017-01-10 07:19:27.1  3
2017-01-10 07:19:28    2017-01-10 07:19:29    5
2017-01-10 07:19:30.1  2017-01-10 07:19:30.1  3
2017-01-10 07:19:31    2017-01-10 07:19:31    5
2017-01-10 07:19:32    2017-01-10 07:19:32    3
2017-01-10 07:19:33.1  2017-01-10 07:19:37.1  5

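Logically, you can think of this as a simple DENSE_RANK() with a PREORDER BY: you want the DENSE_RANK of all the items that are ranked together, and you want them ordered by date; you just have to deal with the pesky problem that DENSE_RANK increments at each change in the date. You do that with the expression shown above. Imagine if you had this syntax: DENSE_RANK() OVER (PREORDER BY date, ORDER BY id_type), where the PREORDER is excluded from the ranking calculation and only the ORDER BY is counted.

Note that it's important to GROUP BY both the generated Seq column and the id_type column. Seq is NOT unique by itself; there can be overlaps, so you must also group by id_type.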
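
A small Python sketch makes the overlap visible. It recomputes the Seq expression for a hypothetical three-row input; the function name `seq_values` is made up, and the input is assumed to be ordered by date.

```python
from collections import defaultdict

def seq_values(id_types):
    """Row_Number() OVER (ORDER BY date) minus
    Row_Number() OVER (PARTITION BY id_type ORDER BY date), per row."""
    per_type = defaultdict(int)         # running row number inside each id_type
    out = []
    for overall_rn, id_type in enumerate(id_types, start=1):
        per_type[id_type] += 1
        out.append(overall_rn - per_type[id_type])
    return out

id_types = [3, 5, 3]            # hypothetical rows, already ordered by date
print(seq_values(id_types))     # [0, 1, 1]
```

The second and third rows belong to different islands but share Seq = 1, so grouping by Seq alone would merge them. Adding id_type to the GROUP BY keeps them apart.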

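For further reading on this topic:

Detect changes between row values: read the "See It For Yourself" section.
Or this simpler explanation

That first link gives you some code you can use if you want the begin or end date to be the same as the previous or next period's end/begin date (so there are no gaps), plus other versions that could assist you in your query. They have to be translated from SQL Server syntax, though...

— Erik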

6

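On Postgres 8.4 you can use a RECURSIVE function (that is, a recursive CTE).

How it works

The recursive function adds a level for each different id_type, selecting dates one by one in descending order.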

       date           | id_type | lv
--------------------------------------
2017-01-10 07:19:21.0      3       8
2017-01-10 07:19:22.0      3       8
2017-01-10 07:19:23.1      3       8
2017-01-10 07:19:24.1      3       8
2017-01-10 07:19:25.0      3       8
2017-01-10 07:19:26.0      5       7
2017-01-10 07:19:27.1      3       6
2017-01-10 07:19:28.0      5       5
2017-01-10 07:19:29.0      5       5
2017-01-10 07:19:30.1      3       4
2017-01-10 07:19:31.0      5       3
2017-01-10 07:19:32.0      3       2
2017-01-10 07:19:33.1      5       1
2017-01-10 07:19:35.0      5       1
2017-01-10 07:19:36.1      5       1
2017-01-10 07:19:37.1      5       1

Then use MAX(date) and MIN(date), grouping by level and id_type, to get the desired result.

with RECURSIVE rdates as 
(
    (select   date, id_type, 1 lv 
     from     yourTable
     order by date desc
     limit 1
    )
    union
    (select    d.date, d.id_type,
               case when r.id_type = d.id_type 
                    then r.lv 
                    else r.lv + 1 
               end lv    
    from       yourTable d
    inner join rdates r
    on         d.date < r.date
    order by   date desc
    limit      1)
)
select   min(date) StartDate,
         max(date) EndDate,
         id_type
from     rdates
group by lv, id_type
;

+---------------------+---------------------+---------+
| startdate           |       enddate       | id_type |
+---------------------+---------------------+---------+
| 10.01.2017 07:19:21 | 10.01.2017 07:19:25 |    3    |
| 10.01.2017 07:19:26 | 10.01.2017 07:19:26 |    5    |
| 10.01.2017 07:19:27 | 10.01.2017 07:19:27 |    3    |
| 10.01.2017 07:19:28 | 10.01.2017 07:19:29 |    5    |
| 10.01.2017 07:19:30 | 10.01.2017 07:19:30 |    3    |
| 10.01.2017 07:19:31 | 10.01.2017 07:19:31 |    5    |
| 10.01.2017 07:19:32 | 10.01.2017 07:19:32 |    3    |
| 10.01.2017 07:19:33 | 10.01.2017 07:19:37 |    5    |
+---------------------+---------------------+---------+

Check it: http://rextester.com/WCOYFP6623

— McNets

5

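Here is another method, which is similar to Evan's and Erwin's in that it uses LAG to determine islands. It differs from those solutions in that it uses only one level of nesting, no grouping, and considerably more window functions: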

SELECT
  id_type,
  date AS begin,
  COALESCE(
    LEAD(prev_date) OVER (ORDER BY date ASC),
    last_date
  ) AS end
FROM
  (
    SELECT
      id_type,
      date,
      LAG(date) OVER (ORDER BY date ASC) AS prev_date,
      MAX(date) OVER () AS last_date,
      CASE id_type
        WHEN LAG(id_type) OVER (ORDER BY date ASC)
        THEN 0
        ELSE 1
      END AS is_start
    FROM
      tmp
  ) AS derived
WHERE
  is_start = 1
ORDER BY
  date ASC
;

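The is_start computed column in the nested SELECT marks the beginning of each island. Additionally, the nested SELECT exposes each row's previous date and the dataset's last date.

For rows that are the beginnings of their respective islands, the previous date is effectively the previous island's ending date. That is what the main SELECT uses it as. It picks only the rows matching the is_start = 1 condition, and for each returned row it shows the row's own date as begin and the following row's prev_date as end. As the last row does not have a following row, LEAD(prev_date) returns a null for it, for which the COALESCE function substitutes the dataset's last date.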
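
The same bookkeeping can be sketched in Python to show how each island's end is borrowed from the next island's start row. The function name `islands_via_lead` and the integer "dates" are hypothetical, and rows are assumed to be sorted by date.

```python
def islands_via_lead(rows):
    """rows: list of (date, id_type) sorted by date. Returns (id_type, begin, end)."""
    if not rows:
        return []
    last_date = rows[-1][0]                           # MAX(date) OVER ()
    starts = []                                       # rows where is_start = 1
    for i, (date, id_type) in enumerate(rows):
        prev_date = rows[i - 1][0] if i > 0 else None     # LAG(date)
        if i == 0 or id_type != rows[i - 1][1]:           # differs from LAG(id_type)
            starts.append((id_type, date, prev_date))
    result = []
    for j, (id_type, begin, _) in enumerate(starts):
        # LEAD(prev_date) over the filtered rows; the last island has no
        # following row, so it falls back to last_date like the COALESCE.
        end = starts[j + 1][2] if j + 1 < len(starts) else last_date
        result.append((id_type, begin, end))
    return result

rows = [(1, 3), (2, 3), (3, 5), (4, 3), (5, 3), (6, 5)]   # (date, id_type)
print(islands_via_lead(rows))
# [(3, 1, 2), (5, 3, 3), (3, 4, 5), (5, 6, 6)]
```

Note that the lookup of the next start row happens after the filtering, just as the outer query's LEAD sees only the rows that passed the WHERE clause.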

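You can play with this solution at dbfiddle.

When introducing additional columns that identify the islands, you will probably want to add a PARTITION BY subclause to each window function's OVER clause. For instance, if you want to detect the islands within groups defined by a parent_id, the above query will probably need to look like this: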

SELECT
  parent_id,
  id_type,
  date AS begin,
  COALESCE(
    LEAD(prev_date) OVER (PARTITION BY parent_id ORDER BY date ASC),
    last_date
  ) AS end
FROM
  (
    SELECT
      parent_id,
      id_type,
      date,
      LAG(date) OVER (PARTITION BY parent_id ORDER BY date ASC) AS prev_date,
      MAX(date) OVER (PARTITION BY parent_id) AS last_date,
      CASE id_type
        WHEN LAG(id_type) OVER (PARTITION BY parent_id ORDER BY date ASC)
        THEN 0
        ELSE 1
      END AS is_start
    FROM
      tmp
  ) AS derived
WHERE
  is_start = 1
ORDER BY
  date ASC
;

And if you decide to go with either Erwin's or Evan's solution, I believe a similar change will need to be made to it as well.

— Andriy M

5

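More out of academic interest than as a practical solution, you can also achieve this with a user-defined aggregate. Like the other solutions, this works even on Postgres 8.4, but as others have commented, please upgrade if you can.

The aggregate handles null as if it were a distinct foo_type, so runs of nulls are given the same grp. That may or may not be what you want.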

create function grp_sfunc(integer[],integer) returns integer[] language sql as $$
  select array[$1[1]+($1[2] is distinct from $2 or $1[3]=0)::integer,$2,1];
$$;

create function grp_finalfunc(integer[]) returns integer language sql as $$
  select $1[1];
$$;

create aggregate grp(integer)(
  sfunc = grp_sfunc
, stype = integer[]
, finalfunc = grp_finalfunc
, initcond = '{0,0,0}'
);

select min(foo_at) begin_at, max(foo_at) end_at, foo_type
from (select *, grp(foo_type) over (order by foo_at) from foo) z
group by grp, foo_type
order by 1;

begin_at              | end_at                | foo_type
:-------------------- | :-------------------- | -------:
2017-01-10 07:19:21   | 2017-01-10 07:19:25   |        3
2017-01-10 07:19:26   | 2017-01-10 07:19:26   |        5
2017-01-10 07:19:27.1 | 2017-01-10 07:19:27.1 |        3
2017-01-10 07:19:28   | 2017-01-10 07:19:29   |        5
2017-01-10 07:19:30.1 | 2017-01-10 07:19:30.1 |        3
2017-01-10 07:19:31   | 2017-01-10 07:19:31   |        5
2017-01-10 07:19:32   | 2017-01-10 07:19:32   |        3
2017-01-10 07:19:33.1 | 2017-01-10 07:19:37.1 |        5
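
To follow what the aggregate's state array does from row to row, here is a Python sketch of the state transition. It mirrors grp_sfunc under the assumption that Python's `!=` on `None` behaves like `is distinct from` for this purpose; `running_grp` is a hypothetical helper that plays the role of the windowed aggregate.

```python
def grp_sfunc(state, value):
    """State is [grp, last_value, seen], like the integer[] in the SQL version."""
    grp, last, seen = state
    # The seen flag forces a new group on the very first row, even if its
    # value happens to equal the 0 used as the initial last_value.
    changed = (last != value) or seen == 0
    return [grp + int(changed), value, 1]

def running_grp(values):
    state = [0, 0, 0]                   # initcond '{0,0,0}'
    out = []
    for v in values:
        state = grp_sfunc(state, v)
        out.append(state[0])            # grp_finalfunc returns the first element
    return out

print(running_grp([3, 3, 5, None, None, 3]))
# [1, 1, 2, 3, 3, 4]
```

The two consecutive `None` values land in the same group, which is the null behaviour described above.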

dbfiddle here

— Jack says try topanswers.xyz

4

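This can be done with a RECURSIVE CTE that passes the "start time" from one row to the next, plus some extra (convenience) preparations.

This query returns the result you want: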

WITH RECURSIVE q AS
(
    SELECT
        id_type,
        "date",
        /* We compute next id_type for convenience, plus row_number */
        row_number()  OVER (w) AS rn,
        lead(id_type) OVER (w) AS next_id_type
    FROM
        t
    WINDOW
        w AS (ORDER BY "date") 
)

After the preparation... the recursive part

, rec AS 
(
    /* Anchor */
    SELECT
        q.rn,
        q."date" AS "begin",
        /* Look also at the **next** row: when next_id_type is different, we need to mark an end */
        case when q.id_type is distinct from q.next_id_type then q."date" END AS "end",
        q.id_type
    FROM
        q
    WHERE
        rn = 1

    UNION ALL

    /* Loop */
    SELECT
        q.rn,
        /* We keep copying 'begin' from one row to the next while type doesn't change */
        case when q.id_type = rec.id_type then rec.begin else q."date" end AS "begin",
        case when q.id_type is distinct from q.next_id_type then q."date" end AS "end",
        q.id_type
    FROM
        rec
        JOIN q ON q.rn = rec.rn+1
)
-- We filter the rows where "end" is not null, and project only needed columns
SELECT
    "begin", "end", id_type
FROM
    rec
WHERE
    "end" is not null ;

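You can check this at http://rextester.com/POYM83542

This method doesn't scale well. For an 8,641-row table it takes 7 s; for a table twice that size, it takes 28 s. A few more samples show execution times that look like O(n^2).

Evan Carroll's method takes less than 1 s (i.e., go for it!) and looks like O(n). Recursive queries are absolutely inefficient and should be considered a last resort.

— joanolo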