How to make DISTINCT ON faster in PostgreSQL?

13

I have a table station_logs in a PostgreSQL 9.6 database:

    Column     |            Type             | Modifiers
---------------+-----------------------------+-----------
 id            | bigint                      | bigserial
 station_id    | integer                     | not null
 submitted_at  | timestamp without time zone | 
 level_sensor  | double precision            | 
Indexes:
    "station_logs_pkey" PRIMARY KEY, btree (id)
    "uniq_sid_sat" UNIQUE CONSTRAINT, btree (station_id, submitted_at)

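I'm trying to get the last level_sensor value, based on submitted_at, for each station_id. There are around 400 unique station_id values, and around 20k rows per day per station_id.

Before creating the index: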

EXPLAIN ANALYZE
SELECT DISTINCT ON(station_id) station_id, submitted_at, level_sensor
FROM station_logs ORDER BY station_id, submitted_at DESC;

 Unique  (cost=4347852.14..4450301.72 rows=89 width=20) (actual time=22202.080..27619.167 rows=98 loops=1)
   ->  Sort  (cost=4347852.14..4399076.93 rows=20489916 width=20) (actual time=22202.077..26540.827 rows=20489812 loops=1)
         Sort Key: station_id, submitted_at DESC
         Sort Method: external merge  Disk: 681040kB
         ->  Seq Scan on station_logs  (cost=0.00..598895.16 rows=20489916 width=20) (actual time=0.023..3443.587 rows=20489812 loops=$
 Planning time: 0.072 ms
 Execution time: 27690.644 ms

Creating the index:

CREATE INDEX station_id__submitted_at ON station_logs(station_id, submitted_at DESC);

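After creating the index, for the same query:

 Unique  (cost=0.56..2156367.51 rows=89 width=20) (actual time=0.184..16263.413 rows=98 loops=1)
   ->  Index Scan using station_id__submitted_at on station_logs  (cost=0.56..2105142.98 rows=20489812 width=20) (actual time=0.181..1$
 Planning time: 0.206 ms
 Execution time: 16263.490 ms

Is there any way to make this query faster? Something like 1 second; 16 seconds is still too much.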

— 角izz

2

How many distinct station IDs are there, i.e. how many rows does the query return? And which version of Postgres?

— ypercubeᵀᴹ

Postgres 9.6, about 400 unique station_id values.

— 角izz

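That query returns *a* "last level_sensor value based on submitted_at, for each station_id". DISTINCT ON involves an arbitrary choice among rows that tie in the ORDER BY, except in cases where no such choice is needed.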

— philipxy

18

For only 400 stations, this query will be massively faster:

SELECT s.station_id, l.submitted_at, l.level_sensor
FROM   station s
CROSS  JOIN LATERAL (
   SELECT submitted_at, level_sensor
   FROM   station_logs
   WHERE  station_id = s.station_id
   ORDER  BY submitted_at DESC NULLS LAST
   LIMIT  1
   ) l;

dbfiddle here
(comparing plans for this query, Abelisto's alternative and your original)

EXPLAIN ANALYZE result, as provided by the OP:

 Nested Loop  (cost=0.56..356.65 rows=102 width=20) (actual time=0.034..0.979 rows=98 loops=1)
   ->  Seq Scan on station s  (cost=0.00..3.02 rows=102 width=4) (actual time=0.009..0.016 rows=102 loops=1)
   ->  Limit  (cost=0.56..3.45 rows=1 width=16) (actual time=0.009..0.009 rows=1 loops=102)
         ->  Index Scan using station_id__submitted_at on station_logs  (cost=0.56..664062.38 rows=230223 width=16) (actual time=0.009$
               Index Cond: (station_id = s.id)
 Planning time: 0.542 ms
 Execution time: 1.013 ms   -- !!

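The only index you need is the one you created: station_id__submitted_at. The index behind the UNIQUE constraint uniq_sid_sat basically does the job as well. Maintaining both seems like a waste of disk space and write performance.

I added NULLS LAST to the ORDER BY in the query because submitted_at isn't defined NOT NULL. Ideally (if applicable!), add a NOT NULL constraint to the column submitted_at, drop the additional index and remove NULLS LAST from the query.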
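If the NOT NULL route applies, the cleanup could look roughly like the following. This is a sketch rather than part of the original answer: it assumes no existing row has a NULL submitted_at (otherwise SET NOT NULL fails) and that no other query relies on the additional index. Note that SET NOT NULL scans the whole table under an exclusive lock.

```sql
-- Assumption: no existing row has submitted_at IS NULL; otherwise this fails.
ALTER TABLE station_logs ALTER COLUMN submitted_at SET NOT NULL;

-- The index behind uniq_sid_sat (station_id, submitted_at) can be scanned
-- backwards, so the additional index is no longer needed.
DROP INDEX station_id__submitted_at;

-- Same query as above, without NULLS LAST.
SELECT s.station_id, l.submitted_at, l.level_sensor
FROM   station s
CROSS  JOIN LATERAL (
   SELECT submitted_at, level_sensor
   FROM   station_logs
   WHERE  station_id = s.station_id
   ORDER  BY submitted_at DESC
   LIMIT  1
   ) l;
```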

If submitted_at can be NULL, create this UNIQUE index to replace both your current index and the unique constraint:

CREATE UNIQUE INDEX station_logs_uni ON station_logs(station_id, submitted_at DESC NULLS LAST);

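The query above assumes a separate table station with one row per relevant station_id (typically the PK), which you should have either way. If you don't have it, create it. Again, this is very fast with the following rCTE technique: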

CREATE TABLE station AS
WITH RECURSIVE cte AS (
   (
   SELECT station_id
   FROM   station_logs
   ORDER  BY station_id
   LIMIT  1
   )
   UNION ALL
   SELECT l.station_id
   FROM   cte c
   ,      LATERAL (   
      SELECT station_id
      FROM   station_logs
      WHERE  station_id > c.station_id
      ORDER  BY station_id
      LIMIT  1
      ) l
   )
TABLE cte;

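I use that in the fiddle as well. You could use a similar query to solve your task directly, without a station table, if you can't be convinced to create it.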
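A sketch of what such a direct query might look like, not taken from the answer itself: the same recursive CTE, but carrying submitted_at and level_sensor along, so each step fetches the latest row of the next station_id from the index. It assumes an index on (station_id, submitted_at DESC NULLS LAST) as discussed above.

```sql
WITH RECURSIVE cte AS (
   (  -- latest row of the smallest station_id
   SELECT station_id, submitted_at, level_sensor
   FROM   station_logs
   ORDER  BY station_id, submitted_at DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   ,      LATERAL (  -- latest row of the next greater station_id
      SELECT station_id, submitted_at, level_sensor
      FROM   station_logs
      WHERE  station_id > c.station_id
      ORDER  BY station_id, submitted_at DESC NULLS LAST
      LIMIT  1
      ) l
   )
TABLE cte;
```

The recursion stops by itself once no greater station_id exists, because the LATERAL subquery then returns no row.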

Optimizing the index

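Your query should be very fast now. Only read on if you still need to optimize read performance ...

It might make sense to add level_sensor as the last column of the index to allow index-only scans, as joanolo commented.
Con: it makes the index bigger, which adds a little cost to all queries using it.
Pro: if you actually get index-only scans out of it, the query at hand does not have to visit heap pages at all, which makes it about twice as fast. But that may be an insubstantial gain for a query that is already very fast.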
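For reference, such an index could be written as follows. The index name is made up for illustration; the column list simply extends the existing index by level_sensor.

```sql
-- Hypothetical index name. level_sensor is appended only so that the query
-- can be answered from the index alone (index-only scan).
CREATE INDEX station_logs_covering_idx
ON station_logs (station_id, submitted_at DESC NULLS LAST, level_sensor);
```

Whether it pays off can be checked with EXPLAIN ANALYZE: the plan should show "Index Only Scan" with a low "Heap Fetches" count.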

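However, I don't expect that to work for your case. You mentioned:

... around 20k rows per day per station_id.

Typically, that would indicate an unceasing write load (about one row per station_id every 5 seconds). And you are interested in the latest row. Index-only scans only work for heap pages that are visible to all transactions (the bit in the visibility map is set). You would have to run extremely aggressive VACUUM settings for the table to keep up with the write load, and it would still not work most of the time. If my assumptions are correct, index-only scans are out; do not add level_sensor to the index.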

OTOH, if my assumptions hold and your table is growing very big, a BRIN index might help. Related:

Speed up creation of Postgres partial index
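A minimal sketch of a BRIN index (available since Postgres 9.5), assuming rows are appended roughly in submitted_at order; the index name is made up. A BRIN index cannot return rows in sorted order, so it would serve range filters on submitted_at rather than the ORDER BY ... LIMIT 1 lookup itself.

```sql
-- Hypothetical index name. BRIN stores only a min/max summary per block range,
-- so it stays tiny, but it only helps while submitted_at correlates with
-- the physical row order (rows appended in time order).
CREATE INDEX station_logs_submitted_at_brin
ON station_logs USING brin (submitted_at);
```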

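Or, even more specialized and more efficient: a partial index covering only the latest additions, to cut off the bulk of irrelevant rows: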

CREATE INDEX station_id__submitted_at_recent_idx ON station_logs(station_id, submitted_at DESC NULLS LAST)
WHERE submitted_at > '2017-06-24 00:00';

Choose a timestamp for which you know that younger rows must exist. You have to add a matching WHERE condition to all queries, like:

...
WHERE  station_id = s.station_id
AND    submitted_at > '2017-06-24 00:00'
...
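Put together with the LATERAL query from the top of this answer, the complete statement might read as below. The cutoff literal is the one from the index definition above and is only an example.

```sql
SELECT s.station_id, l.submitted_at, l.level_sensor
FROM   station s
CROSS  JOIN LATERAL (
   SELECT submitted_at, level_sensor
   FROM   station_logs
   WHERE  station_id = s.station_id
   AND    submitted_at > '2017-06-24 00:00'  -- must match the index predicate
   ORDER  BY submitted_at DESC NULLS LAST
   LIMIT  1
   ) l;
```

One caveat with this form: a station with no rows newer than the cutoff drops out of the result. LEFT JOIN LATERAL (...) l ON true would keep such stations, with NULL values.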

You have to adapt index and query from time to time.

— Erwin Brandstetter

Any time I know I want a nested loop (which is often), using LATERAL is a performance boost in a number of situations.

— Paul Draper

6

Try the classic way:

create index idx_station_logs__station_id on station_logs(station_id);
create index idx_station_logs__submitted_at on station_logs(submitted_at);

analyse station_logs;

with t as (
  select station_id, max(submitted_at) submitted_at 
  from station_logs 
  group by station_id)
select * 
from t join station_logs l on (
  l.station_id = t.station_id and l.submitted_at = t.submitted_at);

dbfiddle

EXPLAIN ANALYZE, as run by the thread starter:

 Nested Loop  (cost=701344.63..702110.58 rows=4 width=155) (actual time=6253.062..6253.544 rows=98 loops=1)
   CTE t
     ->  HashAggregate  (cost=701343.18..701344.07 rows=89 width=12) (actual time=6253.042..6253.069 rows=98 loops=1)
           Group Key: station_logs.station_id
           ->  Seq Scan on station_logs  (cost=0.00..598894.12 rows=20489812 width=12) (actual time=0.034..1841.848 rows=20489812 loop$
   ->  CTE Scan on t  (cost=0.00..1.78 rows=89 width=12) (actual time=6253.047..6253.085 rows=98 loops=1)
   ->  Index Scan using station_id__submitted_at on station_logs l  (cost=0.56..8.58 rows=1 width=143) (actual time=0.004..0.004 rows=$
         Index Cond: ((station_id = t.station_id) AND (submitted_at = t.submitted_at))
 Planning time: 0.542 ms
 Execution time: 6253.701 ms

— Abelisto