我有一个带有一些数字和一些其他数据的PostgreSQL 9.3表:
CREATE TABLE mytable (
myid BIGINT,
somedata BYTEA
)
该表当前有约10M条记录,并占用1GB磁盘空间。myid
不连续。
我想计算100000个连续数字的每个块中有多少行:
SELECT myid/100000 AS block, count(*) AS total FROM mytable GROUP BY myid/100000;
这将返回大约3500行。
我注意到,即使查询计划根本没有提及某个索引,该索引的存在也会显着加快此查询的速度。没有索引的查询计划:
db=> EXPLAIN (ANALYZE TRUE, VERBOSE TRUE) SELECT myid/100000 AS block, count(*) AS total FROM mytable GROUP BY myid/100000;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=1636639.92..1709958.65 rows=496942 width=8) (actual time=6783.763..8888.841 rows=3460 loops=1)
Output: ((myid / 100000)), count(*)
-> Sort (cost=1636639.92..1659008.91 rows=8947594 width=8) (actual time=6783.752..8005.831 rows=8947557 loops=1)
Output: ((myid / 100000))
Sort Key: ((mytable.myid / 100000))
Sort Method: external merge Disk: 157440kB
-> Seq Scan on public.mytable (cost=0.00..236506.92 rows=8947594 width=8) (actual time=0.020..1674.838 rows=8947557 loops=1)
Output: (myid / 100000)
Total runtime: 8914.780 ms
(9 rows)
索引:
db=> CREATE INDEX myindex ON mytable ((myid/100000));
db=> VACUUM ANALYZE;
新的查询计划:
db=> EXPLAIN (ANALYZE TRUE, VERBOSE TRUE) SELECT myid/100000 AS block, count(*) AS total FROM mytable GROUP BY myid/100000;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=281242.99..281285.97 rows=3439 width=8) (actual time=3190.189..3190.800 rows=3460 loops=1)
Output: ((myid / 100000)), count(*)
-> Seq Scan on public.mytable (cost=0.00..236505.56 rows=8947485 width=8) (actual time=0.026..1659.571 rows=8947557 loops=1)
Output: (myid / 100000)
Total runtime: 3190.975 ms
(5 rows)
因此,查询计划和运行时相差很大(几乎三倍),但都没有提到索引。此行为在我的开发机上是完全可重现的:我经历了删除索引,测试查询多次,重新创建索引,再次测试查询的几个周期。这里发生了什么事?
explain (analyze true, verbose true) ...
?
HashAggregate
方法(并且不需要排序),因此您可以获得更好的性能。为什么在计划中未提及该索引,我却一无所知。