Efficiently select beginning and end of multiple contiguous ranges in PostgreSQL query


19

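I've got about a billion rows of data in a table with a name and an integer in the range 1-288. For a given name, every int is unique, and not every possible integer in the range is present, so there are gaps.

This query generates an example case: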

--what I have:
SELECT *
FROM ( VALUES ('foo', 2),
              ('foo', 3),
              ('foo', 4),
              ('foo', 10),
              ('foo', 11),
              ('foo', 13),
              ('bar', 1),
              ('bar', 2),
              ('bar', 3)
     ) AS baz ("name", "int")

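I'd like to generate a lookup table with one row for each name and each sequence of contiguous integers. Each such row would contain:

name - the value of the name column
start - the first integer in the contiguous sequence
end - the final value in the contiguous sequence
span - end - start + 1

This query generates example output for the above example: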

--what I need:
SELECT * 
FROM ( VALUES ('foo', 2, 4, 3),
              ('foo', 10, 11, 2),
              ('foo', 13, 13, 1),
              ('bar', 1, 3, 3)
     ) AS contiguous_ranges ("name", "start", "end", span)

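Because I have so many rows, more efficient is better. That said, I only have to run this query once, so it's not an absolute requirement.

Thanks in advance!

Edit:

I should add that PL/pgSQL solutions are welcome (please explain any fancy tricks; I'm still new to PL/pgSQL).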


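I would find a way to process the table in small enough chunks (maybe by hashing the "name" into N buckets, or by taking the first/last letter of the name) so that a sort fits in memory. Scanning the table several times is likely to be faster than letting a sort spill to disk. Once you have that, I would go with windowing functions. Also, don't forget to exploit patterns in the data. Maybe most "name" values actually have all 288 values, in which case you can exclude those from the main process. End of random ramble :)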
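The bucketing idea in this comment could look roughly like the sketch below. It is written in Python only for illustration; `bucket_of`, `split_into_buckets`, the choice of crc32 as the hash, and the bucket count of 4 are all assumptions, not something the commenter specified.

```python
import zlib
from collections import defaultdict

def bucket_of(name: str, n_buckets: int) -> int:
    """Stable bucket number for a name (crc32 is an arbitrary choice)."""
    return zlib.crc32(name.encode("utf-8")) % n_buckets

def split_into_buckets(rows, n_buckets):
    """Group (name, value) rows so that each name lands in exactly one bucket."""
    buckets = defaultdict(list)
    for name, value in rows:
        buckets[bucket_of(name, n_buckets)].append((name, value))
    return buckets

rows = [("foo", 2), ("foo", 3), ("foo", 4), ("foo", 10), ("foo", 11),
        ("foo", 13), ("bar", 1), ("bar", 2), ("bar", 3)]

buckets = split_into_buckets(rows, n_buckets=4)
for bucket_id in sorted(buckets):
    # Each chunk is meant to be small enough to sort in memory.
    chunk = sorted(buckets[bucket_id])
    print(bucket_id, chunk)
```

Because all rows of a name share a bucket, each chunk can be sorted and scanned for contiguous ranges independently of the others.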

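Great, and welcome to the site. Did you have any luck with the solutions provided?
Jack Douglas

Thank you. I actually changed projects shortly after posting this question (and shortly after that, I changed jobs), so I never had a chance to test these solutions. What should I do about selecting an answer in such a case?
2012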

Answers:


9

How about using with recursive

Test view:

create view v as 
select *
from ( values ('foo', 2),
              ('foo', 3),
              ('foo', 4),
              ('foo', 10),
              ('foo', 11),
              ('foo', 13),
              ('bar', 1),
              ('bar', 2),
              ('bar', 3)
     ) as baz ("name", "int");

Query:

with recursive t("name", "int") as ( select "name", "int", 1 as span from v
                                     union all
                                     select "name", v."int", t.span+1 as span
                                     from v join t using ("name")
                                     where v."int"=t."int"+1 )
select "name", "start", "start"+span-1 as "end", span
from( select "name", ("int"-span+1) as "start", max(span) as span
      from ( select "name", "int", max(span) as span 
             from t
             group by "name", "int" ) z
      group by "name", ("int"-span+1) ) z;

Result:

 name | start | end | span
------+-------+-----+------
 foo  |     2 |   4 |    3
 foo  |    13 |  13 |    1
 bar  |     1 |   3 |    3
 foo  |    10 |  11 |    2
(4 rows)

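I'd be interested to know how that performs on your billion row table.


If performance is an issue, playing with the settings for work_mem might help improve performance.
Frank Heikens 2012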

7

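You can do it with windowing functions. The basic idea is to use the lead and lag windowing functions to pull in the rows ahead of and behind the current row. Then we can compute whether we have the start or the end of a sequence: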

create temp view temp_view as
    select
        n,
        val,
        (lead <> val + 1 or lead is null) as islast,
        (lag <> val - 1 or lag is null) as isfirst,
        (lead <> val + 1 or lead is null) and (lag <> val - 1 or lag is null) as orphan
    from
    (
        select
            n,
            lead(val, 1) over( partition by n order by n, val),
            lag(val, 1) over(partition by n order by n, val ),
            val
        from test
        order by n, val
    ) as t
;  
select * from temp_view;
 n  | val | islast | isfirst | orphan 
-----+-----+--------+---------+--------
 bar |   1 | f      | t       | f
 bar |   2 | f      | f       | f
 bar |   3 | t      | f       | f
 bar |  24 | t      | t       | t
 bar |  42 | t      | t       | t
 foo |   2 | f      | t       | f
 foo |   3 | f      | f       | f
 foo |   4 | t      | f       | f
 foo |  10 | f      | t       | f
 foo |  11 | t      | f       | f
 foo |  13 | t      | t       | t
(11 rows)

(I used a view so the logic below is easier to follow.) So now we know whether a row is a start or an end. We have to collapse that into rows:

select
    n as "name",
    first,
    coalesce (last, first) as last,
    coalesce (last - first + 1, 1) as span
from
(
    select
    n,
    val as first,
    -- this will not be excellent perf. since we're calling the view
    -- for each row sequence found. Changing view into temp table 
    -- will probably help with lots of values.
    (
        select min(val)
        from temp_view as last
        where islast = true
        -- <= (not <) so that an orphan, which has isfirst = true and
        -- islast = true, matches itself instead of a later range's end
        and first.val <= last.val
        and first.n = last.n
    ) as last
    from
        (select * from temp_view where isfirst = true) as first
) as t
;

 name | first | last | span 
------+-------+------+------
 bar  |     1 |    3 |    3
 bar  |    24 |   24 |    1
 bar  |    42 |   42 |    1
 foo  |     2 |    4 |    3
 foo  |    10 |   11 |    2
 foo  |    13 |   13 |    1
(6 rows)

Looks correct to me :)
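For comparison, here is a variation on the same lag idea, not the query from this answer: it flags range starts with lag and then numbers the ranges with a running sum of those flags, which avoids the correlated subquery. It is run through Python's sqlite3 module purely so the sketch is self-contained (an assumption; window functions need SQLite 3.25 or later), and the contents of the `test` table are inferred from the output shown above. It has not been tried on PostgreSQL or on large data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (n TEXT NOT NULL, val INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [("foo", 2), ("foo", 3), ("foo", 4), ("foo", 10), ("foo", 11), ("foo", 13),
     ("bar", 1), ("bar", 2), ("bar", 3), ("bar", 24), ("bar", 42)],
)

query = """
WITH flagged AS (
    -- isfirst is 1 when the previous value is missing or not val - 1
    SELECT n, val,
           CASE WHEN lag(val) OVER (PARTITION BY n ORDER BY val) = val - 1
                THEN 0 ELSE 1 END AS isfirst
    FROM test
),
numbered AS (
    -- a running sum of the start flags gives every range its own number
    SELECT n, val,
           sum(isfirst) OVER (PARTITION BY n ORDER BY val) AS range_no
    FROM flagged
)
SELECT n, min(val), max(val), max(val) - min(val) + 1
FROM numbered
GROUP BY n, range_no
ORDER BY n, min(val)
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The rows come back as (name, first, last, span) tuples and should line up with the table above.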


3

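Another window function solution. I have no idea about its efficiency; I've added the execution plan at the end (although with so few rows it is probably not worth much). If you want to play around: SQL-Fiddle test

Table and data: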

CREATE TABLE baz
( name VARCHAR(10) NOT NULL
, i INT  NOT NULL
, UNIQUE  (name, i)
) ;

INSERT INTO baz
  VALUES 
    ('foo', 2),
    ('foo', 3),
    ('foo', 4),
    ('foo', 10),
    ('foo', 11),
    ('foo', 13),
    ('bar', 1),
    ('bar', 2),
    ('bar', 3)
  ;

Query:

SELECT a.name     AS name
     , a.i        AS start
     , b.i        AS "end"
     , b.i-a.i+1  AS span
FROM
      ( SELECT name, i
             , ROW_NUMBER() OVER (PARTITION BY name ORDER BY i) AS rn
        FROM baz AS a
        WHERE NOT EXISTS
              ( SELECT * 
                FROM baz AS prev
                WHERE prev.name = a.name
                  AND prev.i = a.i - 1
              ) 
      ) AS a
    JOIN
      ( SELECT name, i 
             , ROW_NUMBER() OVER (PARTITION BY name ORDER BY i) AS rn
        FROM baz AS a
        WHERE NOT EXISTS
              ( SELECT * 
                FROM baz AS next
                WHERE next.name = a.name
                  AND next.i = a.i + 1
              )
      ) AS b
    ON  b.name = a.name
    AND b.rn  = a.rn
 ; 

Query plan

Merge Join (cost=442.74..558.76 rows=18 width=46)
  Merge Cond: ((a.name)::text = (a.name)::text)
  Join Filter: ((row_number() OVER (?)) = (row_number() OVER (?)))
  -> WindowAgg (cost=221.37..238.33 rows=848 width=42)
       -> Sort (cost=221.37..223.49 rows=848 width=42)
            Sort Key: a.name, a.i
            -> Merge Anti Join (cost=157.21..180.13 rows=848 width=42)
                 Merge Cond: (((a.name)::text = (prev.name)::text) AND (((a.i - 1)) = prev.i))
                 -> Sort (cost=78.60..81.43 rows=1130 width=42)
                      Sort Key: a.name, ((a.i - 1))
                      -> Seq Scan on baz a (cost=0.00..21.30 rows=1130 width=42)
                 -> Sort (cost=78.60..81.43 rows=1130 width=42)
                      Sort Key: prev.name, prev.i
                      -> Seq Scan on baz prev (cost=0.00..21.30 rows=1130 width=42)
  -> Materialize (cost=221.37..248.93 rows=848 width=50)
       -> WindowAgg (cost=221.37..238.33 rows=848 width=42)
            -> Sort (cost=221.37..223.49 rows=848 width=42)
                 Sort Key: a.name, a.i
                 -> Merge Anti Join (cost=157.21..180.13 rows=848 width=42)
                      Merge Cond: (((a.name)::text = (next.name)::text) AND (((a.i + 1)) = next.i))
                      -> Sort (cost=78.60..81.43 rows=1130 width=42)
                           Sort Key: a.name, ((a.i + 1))
                           -> Seq Scan on baz a (cost=0.00..21.30 rows=1130 width=42)
                      -> Sort (cost=78.60..81.43 rows=1130 width=42)
                           Sort Key: next.name, next.i
                           -> Seq Scan on baz next (cost=0.00..21.30 rows=1130 width=42)

3

On SQL Server, I would add one more column named previousInt:

SELECT *
FROM ( VALUES ('foo', 2, NULL),
              ('foo', 3, 2),
              ('foo', 4, 3),
              ('foo', 10, 4),
              ('foo', 11, 10),
              ('foo', 13, 11),
              ('bar', 1, NULL),
              ('bar', 2, 1),
              ('bar', 3, 2)
     ) AS baz ("name", "int", "previousInt")

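I would use a CHECK constraint to make sure that previousInt < int, an FK constraint on (name, previousInt) referring to (name, int), and a couple more constraints to ensure watertight data integrity. That done, selecting gaps is trivial: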

SELECT NAME, PreviousInt, Int from YourTable WHERE PreviousInt < Int - 1;

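To speed it up, I might create a filtered index that includes only the gaps. This means that all your gaps are precomputed, so selects are very fast, and the constraints ensure the integrity of the precomputed data. I use solutions like this a lot; they are all over my system.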
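A rough sketch of this design, using SQLite through Python's sqlite3 module instead of SQL Server so that it runs standalone (an assumption; SQLite calls a filtered index a partial index). The column names `i` and `previous_i` and the index name `baz_gaps` are made up for the example, and only the two constraints the answer names are included, not the unspecified extra ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.execute("""
    CREATE TABLE baz (
        name       TEXT    NOT NULL,
        i          INTEGER NOT NULL,
        previous_i INTEGER,            -- NULL for the first value of a name
        PRIMARY KEY (name, i),
        CHECK (previous_i < i),
        FOREIGN KEY (name, previous_i) REFERENCES baz (name, i)
    )
""")
# Partial index holding only the rows that sit right after a gap.
conn.execute(
    "CREATE INDEX baz_gaps ON baz (name, i) WHERE previous_i < i - 1"
)

conn.executemany(
    "INSERT INTO baz VALUES (?, ?, ?)",
    [("foo", 2, None), ("foo", 3, 2), ("foo", 4, 3), ("foo", 10, 4),
     ("foo", 11, 10), ("foo", 13, 11),
     ("bar", 1, None), ("bar", 2, 1), ("bar", 3, 2)],
)

gaps = conn.execute(
    "SELECT name, previous_i, i FROM baz "
    "WHERE previous_i < i - 1 ORDER BY name, i"
).fetchall()
print(gaps)
```

Each returned row marks a gap: the last value before it and the first value after it. Turning those into start/end ranges is a further step this sketch does not cover.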


1

You can look up the Tabibitosan method:

https://community.oracle.com/docs/DOC-915680
http://rwijk.blogspot.com/2014/01/tabibitosan.html
https://www.xaprb.com/blog/2006/03/22/find-contiguous-ranges-with-sql/

Basically:

SQL> create table mytable (nr)
  2  as
  3  select 1 from dual union all
  4  select 2 from dual union all
  5  select 3 from dual union all
  6  select 6 from dual union all
  7  select 7 from dual union all
  8  select 11 from dual union all
  9  select 18 from dual union all
 10  select 19 from dual union all
 11  select 20 from dual union all
 12  select 21 from dual union all
 13  select 22 from dual union all
 14  select 25 from dual
 15  /

 Table created.

 SQL> with tabibitosan as
 2  ( select nr
 3         , nr - row_number() over (order by nr) grp
 4      from mytable
 5  )
 6  select min(nr)
 7       , max(nr)
 8    from tabibitosan
 9   group by grp
10   order by grp
11  /

   MIN(NR)    MAX(NR)
---------- ----------
         1          3
         6          7
        11         11
        18         22
        25         25

5 rows selected.

I think this performs better:

SQL> r
  1  select min(nr) as range_start
  2    ,max(nr) as range_end
  3  from (-- our previous query
  4    select nr
  5      ,rownum
  6      ,nr - rownum grp
  7    from  (select nr
  8       from   mytable
  9       order by 1
 10      )
 11   )
 12  group by grp
 13* order by 1

RANGE_START  RANGE_END
----------- ----------
      1      3
      6      7
     11     11
     18     22
     25     25
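Applied to the question's data, the same trick needs a PARTITION BY on the name, so that the row numbers restart for each name. The sketch below runs it on SQLite through Python's sqlite3 module only to keep it self-contained (an assumption; window functions need SQLite 3.25 or later). The table layout is borrowed from another answer in this thread, and the SELECT should carry over to PostgreSQL, though it has not been tested there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE baz (name TEXT NOT NULL, i INTEGER NOT NULL, UNIQUE (name, i))"
)
conn.executemany(
    "INSERT INTO baz VALUES (?, ?)",
    [("foo", 2), ("foo", 3), ("foo", 4), ("foo", 10), ("foo", 11), ("foo", 13),
     ("bar", 1), ("bar", 2), ("bar", 3)],
)

query = """
SELECT name,
       min(i)              AS "start",
       max(i)              AS "end",
       max(i) - min(i) + 1 AS span
FROM ( -- i minus its row number is constant inside a contiguous run
       SELECT name, i,
              i - row_number() OVER (PARTITION BY name ORDER BY i) AS grp
       FROM baz ) AS t
GROUP BY name, grp
ORDER BY name, min(i)
"""

ranges = conn.execute(query).fetchall()
for row in ranges:
    print(row)
```

For 'foo' the differences are 1, 1, 1, 6, 6, 7, so the three runs fall into three groups; a single sort per partition is all the query needs.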

0

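A rough plan:

  • Select the minimum for each name (group by name).
  • Select minimum2 for each name, where min2 > min1 and not exists (subquery: SEL min2-1).
  • Sel max val1 > min val1 where max val1 < min val2.

Repeat from step 2 until no more updates happen. From there it gets complicated, Gordian, with grouping over the max of mins and the min of maxes. I guess I would go for a programming language.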

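P.S.: A nice sample table with a few sample values would be good, one that everybody could use, so that not everybody has to create test data from scratch.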
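Following up on the "go for a programming language" remark, a single pass over rows sorted by (name, value) is enough to emit the ranges. This is only a minimal Python sketch; `contiguous_ranges` is a hypothetical helper, and it assumes the rows arrive already sorted (for example from an ORDER BY name, int on the database side).

```python
from itertools import groupby

def contiguous_ranges(rows):
    """Yield (name, start, end, span) for rows sorted by (name, value)."""
    for name, group in groupby(rows, key=lambda r: r[0]):
        start = prev = None
        for _, value in group:
            if prev is not None and value == prev + 1:
                prev = value          # still inside the current run
                continue
            if start is not None:     # a gap closes the previous run
                yield (name, start, prev, prev - start + 1)
            start = prev = value      # open a new run
        if start is not None:         # flush the last run of this name
            yield (name, start, prev, prev - start + 1)

rows = sorted([("foo", 2), ("foo", 3), ("foo", 4), ("foo", 10), ("foo", 11),
               ("foo", 13), ("bar", 1), ("bar", 2), ("bar", 3)])

for r in contiguous_ranges(rows):
    print(r)
```

The generator keeps only the current run in memory, so it can consume a streamed, sorted result set without loading the whole table.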


0

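This solution is inspired by nate c's answer using windowing functions and the OVER clause. Interestingly enough, that answer falls back on a subquery with outer references. The row consolidation can instead be completed with another level of windowing functions. It may not look too pretty, but I assume it is more efficient, since it uses the built-in logic of the windowing functions.

I realized from nate's solution that the initial set of rows already carries the flags needed to 1) select the starting and ending range values and 2) eliminate the extra rows in between. The query has subqueries nested two deep only because of limitations of windowing functions, which restrict how column aliases can be used. Logically, I could have produced the results with only one nested subquery.

A few other notes: the following is code for SQLite3. The SQLite dialect is derived from postgresql, so it is very similar and might even work unaltered. I added frame restrictions to the OVER clauses, since the lag() and lead() functions each need only a single-row window, before and after respectively (so there is no need to keep the default set of all preceding rows). I also opted for the names first and last, since the word end is reserved.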

create temp view test as 
with cte(name, int) AS (
select * from ( values ('foo', 2),
              ('foo', 3),
              ('foo', 4),
              ('foo', 10),
              ('foo', 11),
              ('foo', 13),
              ('bar', 1),
              ('bar', 2),
              ('bar', 3) ))
select * from cte;


SELECT name,
       int AS first, 
       endpoint AS last,
       (endpoint - int + 1) AS span
FROM ( SELECT name, 
             int, 
             CASE WHEN prev <> 1 AND next <> -1 -- orphan
                  THEN int
                WHEN next = -1 -- start of range
                  THEN lead(int) OVER (PARTITION BY name 
                                       ORDER BY int 
                                       ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING)
                ELSE null END
             AS endpoint
        FROM ( SELECT name, 
                   int,
                   coalesce(int - lag(int) OVER (PARTITION BY name 
                                                 ORDER BY int 
                                                 ROWS BETWEEN 1 PRECEDING AND CURRENT ROW), 
                            0) AS prev,
                   coalesce(int - lead(int) OVER (PARTITION BY name 
                                                  ORDER BY int 
                                                  ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING),
                            0) AS next
              FROM test
            ) AS mark_boundaries
        WHERE NOT (prev = 1 AND next = -1) -- discard values within range
      ) as raw_ranges
WHERE endpoint IS NOT null
ORDER BY name, first

As one would expect, the results are the same as in the other answers:

 name | first | last | span
------+-------+------+------
 bar  |     1 |    3 |   3
 foo  |     2 |    4 |   3
 foo  |    10 |   11 |   2
 foo  |    13 |   13 |   1
Licensed under cc by-sa 3.0 with attribution required.