Efficient query to get the greatest value per group from a big table


14

Given the table:

    Column    |            Type             
 id           | integer                     
 latitude     | numeric(9,6)                
 longitude    | numeric(9,6)                
 speed        | integer                     
 equipment_id | integer                     
 created_at   | timestamp without time zone
Indexes:
    "geoposition_records_pkey" PRIMARY KEY, btree (id)

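The table has 20 million records, which is not a big number, relatively speaking. But it is enough to make sequential scans slow.

How do I get the last record (max(created_at)) for each equipment_id?

I have tried both of the following queries, along with several variants taken from the many answers I have read on this topic: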

select max(created_at),equipment_id from geoposition_records group by equipment_id;

select distinct on (equipment_id) equipment_id,created_at 
  from geoposition_records order by equipment_id, created_at desc;

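I have also tried creating a btree index on (equipment_id, created_at), but Postgres finds that a seq scan is faster. Forcing enable_seqscan = off does not help either, since reading the index is as slow as the seq scan, probably worse.

The query has to run periodically and must always return the latest value.

Using Postgres 9.3.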

Explain/analyze (with 1.7 million records):

set enable_seqscan=true;
explain analyze select max(created_at),equipment_id from geoposition_records group by equipment_id;
"HashAggregate  (cost=47803.77..47804.34 rows=57 width=12) (actual time=1935.536..1935.556 rows=58 loops=1)"
"  ->  Seq Scan on geoposition_records  (cost=0.00..39544.51 rows=1651851 width=12) (actual time=0.029..494.296 rows=1651851 loops=1)"
"Total runtime: 1935.632 ms"

set enable_seqscan=false;
explain analyze select max(created_at),equipment_id from geoposition_records group by equipment_id;
"GroupAggregate  (cost=0.00..2995933.57 rows=57 width=12) (actual time=222.034..11305.073 rows=58 loops=1)"
"  ->  Index Scan using geoposition_records_equipment_id_created_at_idx on geoposition_records  (cost=0.00..2987673.75 rows=1651851 width=12) (actual time=0.062..10248.703 rows=1651851 loops=1)"
"Total runtime: 11305.161 ms"

Well, the last time I checked there were no NULL values in equipment_id; the expected percentage is below 0.1%.
2013

Answers:


10

A plain multicolumn b-tree index should work after all:

CREATE INDEX foo_idx
ON geoposition_records (equipment_id, created_at DESC NULLS LAST);

Why DESC NULLS LAST?
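In short: Postgres sorts NULL values first in descending order by default, so a plain ORDER BY created_at DESC LIMIT 1 would return a NULL timestamp if a group contained one. NULLS LAST avoids that, and declaring the index the same way lets it match the sort order used by the queries below. A minimal, self-contained sketch of the sorting behaviour (the sample values are made up for illustration):

```sql
-- Made-up sample values, not data from the question.
-- Default DESC ordering puts NULL first, so this returns NULL:
SELECT ts
FROM  (VALUES (timestamp '2013-01-01 00:00'),
              (NULL),
              (timestamp '2013-06-01 00:00')) AS t(ts)
ORDER  BY ts DESC
LIMIT  1;

-- With NULLS LAST, the latest actual timestamp (2013-06-01 00:00:00) is returned:
SELECT ts
FROM  (VALUES (timestamp '2013-01-01 00:00'),
              (NULL),
              (timestamp '2013-06-01 00:00')) AS t(ts)
ORDER  BY ts DESC NULLS LAST
LIMIT  1;
```

If created_at were declared NOT NULL, the distinction would not matter for results, but the table definition in the question does not say so.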

Function

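If you can't talk sense into the query planner, a function looping through the equipment table should do the trick. Looking up one equipment_id at a time will use the index. For a small number of devices (57 by the planner's estimate in your EXPLAIN ANALYZE output, 58 actual rows), that is fast.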
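A minimal sketch of the single lookup that the function below repeats per device. The literal 42 is a hypothetical equipment_id, and the multicolumn index from above is assumed to exist:

```sql
-- 42 is a placeholder equipment_id, not a value from the question.
SELECT created_at
FROM   geoposition_records
WHERE  equipment_id = 42
ORDER  BY created_at DESC NULLS LAST
LIMIT  1;
```

With the index in place, this should be able to read the first matching index entry and stop instead of scanning the table. Running EXPLAIN on your own data will show whether the planner agrees.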
Is it safe to assume you have an equipment table?

CREATE OR REPLACE FUNCTION f_latest_equip()
  RETURNS TABLE (equipment_id int, latest timestamp) AS
$func$
BEGIN
FOR equipment_id IN
   SELECT e.equipment_id FROM equipment e ORDER BY 1
LOOP
   SELECT g.created_at
   FROM   geoposition_records g
   WHERE  g.equipment_id = f_latest_equip.equipment_id
                           -- prepend function name to disambiguate
   ORDER  BY g.created_at DESC NULLS LAST
   LIMIT  1
   INTO   latest;

   RETURN NEXT;
END LOOP;
END  
$func$  LANGUAGE plpgsql STABLE;

It makes for a nice call, too:

SELECT * FROM f_latest_equip();

Correlated subqueries

Come to think of it, with this equipment table you can do the dirty work with lowly correlated subqueries to great effect:

SELECT equipment_id
     ,(SELECT created_at
       FROM   geoposition_records
       WHERE  equipment_id = eq.equipment_id
       ORDER  BY created_at DESC NULLS LAST
       LIMIT  1) AS latest
FROM   equipment eq;

Performance is very good.

LATERAL join in Postgres 9.3+

SELECT eq.equipment_id, r.latest
FROM   equipment eq
LEFT   JOIN LATERAL (
   SELECT created_at
   FROM   geoposition_records
   WHERE  equipment_id = eq.equipment_id
   ORDER  BY created_at DESC NULLS LAST
   LIMIT  1
   ) r(latest) ON true;

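Detailed explanation:

Performance is similar to that of the correlated subquery. The performance of max(), DISTINCT ON, the function, the correlated subquery, and LATERAL is compared here: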

SQL Fiddle
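To repeat the comparison on your own data, each variant can be wrapped in EXPLAIN. A sketch using the LATERAL query from above; timings depend on data distribution and cache state, so no numbers are claimed here:

```sql
-- EXPLAIN ANALYZE actually executes the query and reports real timings;
-- BUFFERS adds how many pages were read from cache or disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT eq.equipment_id, r.latest
FROM   equipment eq
LEFT   JOIN LATERAL (
   SELECT created_at
   FROM   geoposition_records
   WHERE  equipment_id = eq.equipment_id
   ORDER  BY created_at DESC NULLS LAST
   LIMIT  1
   ) r(latest) ON true;
```

Running each variant more than once helps separate cold-cache from warm-cache timings.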


1
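@ErwinBrandstetter This is what I tried after Colin's answer, but I can't stop thinking of it as a workaround that uses a kind of database-side n+1 queries (not sure whether that counts as the anti-pattern, since there is no connection overhead)... I am now wondering why group by exists at all if it can't handle a few million records properly... It just doesn't make sense, does it? We must be missing something. Finally, the question has changed slightly and we are now assuming that an equipment table exists... I would like to know whether there is actually another way.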
Feyd

3

Attempt 1

If

  1. I have a separate equipment table, and
  2. I have an index on geoposition_records(equipment_id, created_at desc)

then the following works for me:

select id as equipment_id, (select max(created_at)
                            from geoposition_records
                            where equipment_id = equipment.id
                           ) as max_created_at
from equipment;

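I wasn't able to force PG to run a fast query that determines both the list of equipment_ids and the related max(created_at). But I'm going to try again tomorrow!

Attempt 2

I found this link: http://zogovic.com/post/44856908222/optimizing-postgresql-query-for-distinct-values. Combining this technique with my query from attempt 1, I get: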

WITH RECURSIVE equipment(id) AS (
    SELECT MIN(equipment_id) FROM geoposition_records
  UNION
    SELECT (
      SELECT equipment_id
      FROM geoposition_records
      WHERE equipment_id > equipment.id
      ORDER BY equipment_id
      LIMIT 1
    )
    FROM equipment WHERE id IS NOT NULL
)
SELECT id AS equipment_id, (SELECT MAX(created_at)
                            FROM geoposition_records
                            WHERE equipment_id = equipment.id
                           ) AS max_created_at
FROM equipment;

And this works fast! But you need

  1. this ultra-contorted query form, and
  2. an index on geoposition_records(equipment_id, created_at desc).
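For reference, a sketch of the index both attempts rely on. The index name is hypothetical; only the column list comes from the text above:

```sql
-- Index name is made up; pick any name that fits your conventions.
CREATE INDEX geoposition_records_equipment_created_idx
ON geoposition_records (equipment_id, created_at DESC);
```

The leading equipment_id column is what lets the recursive step jump to the next distinct value, and the created_at ordering serves the max(created_at) lookup for each device.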
Licensed under cc by-sa 3.0 with attribution required.