PostgreSQL sequential scan instead of index scan: why?

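Hi all, I have a problem with a PostgreSQL query and am wondering if anyone can help. In some scenarios my query seems to ignore the index I created for joining the two tables data and data_area. When that happens, it uses a sequential scan instead, and the query is much slower.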

Sequential scan (about 5 minutes)

Unique  (cost=15368261.82..15369053.96 rows=200 width=1942) (actual time=301266.832..301346.936 rows=153812 loops=1)
   CTE data
     ->  Bitmap Heap Scan on data  (cost=6086.77..610089.54 rows=321976 width=297) (actual time=26.286..197.625 rows=335130 loops=1)
           Recheck Cond: (datasetid = 1)
           Filter: ((readingdatetime >= '1920-01-01 00:00:00'::timestamp without time zone) AND (readingdatetime <= '2013-03-11 00:00:00'::timestamp without time zone) AND (depth >= 0::double precision) AND (depth <= 99999::double precision))
           ->  Bitmap Index Scan on data_datasetid_index  (cost=0.00..6006.27 rows=324789 width=0) (actual time=25.462..25.462 rows=335130 loops=1)
                 Index Cond: (datasetid = 1)
   ->  Sort  (cost=15368261.82..15368657.89 rows=158427 width=1942) (actual time=301266.829..301287.110 rows=155194 loops=1)
         Sort Key: data.id
         Sort Method: quicksort  Memory: 81999kB
         ->  Hash Left Join  (cost=15174943.29..15354578.91 rows=158427 width=1942) (actual time=300068.588..301052.832 rows=155194 loops=1)
               Hash Cond: (data_area.area_id = area.id)
               ->  Hash Join  (cost=15174792.93..15351854.12 rows=158427 width=684) (actual time=300066.288..300971.644 rows=155194 loops=1)
                     Hash Cond: (data.id = data_area.data_id)
                     ->  CTE Scan on data  (cost=0.00..6439.52 rows=321976 width=676) (actual time=26.290..313.842 rows=335130 loops=1)
                     ->  Hash  (cost=14857017.62..14857017.62 rows=25422025 width=8) (actual time=300028.260..300028.260 rows=26709939 loops=1)
                           Buckets: 4194304  Batches: 1  Memory Usage: 1043357kB
                           ->  Seq Scan on data_area  (cost=0.00..14857017.62 rows=25422025 width=8) (actual time=182921.056..291687.996 rows=26709939 loops=1)
                                 Filter: (area_id = ANY ('{28,29,30,31,32,33,25,26,27,18,19,20,21,12,13,14,15,16,17,34,35,1,2,3,4,5,6,22,23,24,7,8,9,10,11}'::integer[]))
               ->  Hash  (cost=108.49..108.49 rows=3349 width=1258) (actual time=2.256..2.256 rows=3349 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 584kB
                     ->  Seq Scan on area  (cost=0.00..108.49 rows=3349 width=1258) (actual time=0.007..0.666 rows=3349 loops=1)
 Total runtime: 301493.379 ms

Index scan (about 3 seconds) (on explain.depesz.com)

Unique  (cost=17352256.47..17353067.50 rows=200 width=1942) (actual time=3603.303..3681.619 rows=153812 loops=1)
   CTE data
     ->  Bitmap Heap Scan on data  (cost=6284.60..619979.56 rows=332340 width=297) (actual time=26.201..262.314 rows=335130 loops=1)
           Recheck Cond: (datasetid = 1)
           Filter: ((readingdatetime >= '1920-01-01 00:00:00'::timestamp without time zone) AND (readingdatetime <= '2013-03-11 00:00:00'::timestamp without time zone) AND (depth >= 0::double precision) AND (depth <= 99999::double precision))
           ->  Bitmap Index Scan on data_datasetid_index  (cost=0.00..6201.51 rows=335354 width=0) (actual time=25.381..25.381 rows=335130 loops=1)
                 Index Cond: (datasetid = 1)
   ->  Sort  (cost=17352256.47..17352661.98 rows=162206 width=1942) (actual time=3603.302..3623.113 rows=155194 loops=1)
         Sort Key: data.id
         Sort Method: quicksort  Memory: 81999kB
         ->  Hash Left Join  (cost=1296.08..17338219.59 rows=162206 width=1942) (actual time=29.980..3375.921 rows=155194 loops=1)
               Hash Cond: (data_area.area_id = area.id)
               ->  Nested Loop  (cost=0.00..17334287.66 rows=162206 width=684) (actual time=26.903..3268.674 rows=155194 loops=1)
                     ->  CTE Scan on data  (cost=0.00..6646.80 rows=332340 width=676) (actual time=26.205..421.858 rows=335130 loops=1)
                     ->  Index Scan using data_area_pkey on data_area  (cost=0.00..52.13 rows=1 width=8) (actual time=0.006..0.008 rows=0 loops=335130)
                           Index Cond: (data_id = data.id)
                           Filter: (area_id = ANY ('{28,29,30,31,32,33,25,26,27,18,19,20,21,12,13,14,15,16,17,34,35,1,2,3,4,5,6,22,23,24,7,8,9,10,11}'::integer[]))
               ->  Hash  (cost=1254.22..1254.22 rows=3349 width=1258) (actual time=3.057..3.057 rows=3349 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 584kB
                     ->  Index Scan using area_primary_key on area  (cost=0.00..1254.22 rows=3349 width=1258) (actual time=0.012..1.429 rows=3349 loops=1)
 Total runtime: 3706.630 ms

Table structure

This is the table structure for the data_area table. I can provide the other tables if needed.

CREATE TABLE data_area
(
  data_id integer NOT NULL,
  area_id integer NOT NULL,
  CONSTRAINT data_area_pkey PRIMARY KEY (data_id , area_id ),
  CONSTRAINT data_area_area_id_fk FOREIGN KEY (area_id)
      REFERENCES area (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT data_area_data_id_fk FOREIGN KEY (data_id)
      REFERENCES data (id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE CASCADE
);

Query

WITH data AS (
    SELECT * 
    FROM data 
    WHERE 
        datasetid IN (1) 
        AND (readingdatetime BETWEEN '1920-01-01' AND '2013-03-11') 
        AND depth BETWEEN 0 AND 99999
)
SELECT * 
FROM ( 
    SELECT DISTINCT ON (data.id) data.id, * 
    FROM 
        data, 
        data_area 
        LEFT JOIN area ON area_id = area.id 
    WHERE 
        data_id = data.id 
        AND area_id IN (28,29,30,31,32,33,25,26,27,18,19,20,21,12,13,14,15,16,17,34,35,1,2,3,4,5,6,22,23,24,7,8,9,10,11) 
) as s;

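This returns 153812 rows. I ran set enable_seqscan = false; to disable sequential scans and obtain the index-scan result.

I have tried running ANALYSE on the database and increasing the statistics gathered on the columns used in the query, but nothing seems to help.

Can anyone shed some light on this and offer some advice?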
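
For readers unfamiliar with these steps, they typically look something like the sketch below. The exact columns and statistics target used are not stated in the post, so the column choice and the value 1000 are assumptions for illustration only.

```sql
-- Session-level planner toggle used to obtain the index-scan plan.
SET enable_seqscan = false;
-- ... run EXPLAIN ANALYZE on the query here ...
RESET enable_seqscan;

-- Raise the per-column statistics target, then refresh the statistics.
-- The target of 1000 is an illustrative value, not the one used in the post.
ALTER TABLE data_area ALTER COLUMN data_id SET STATISTICS 1000;
ALTER TABLE data_area ALTER COLUMN area_id SET STATISTICS 1000;
ANALYZE data_area;
```

Note that enable_seqscan does not forbid sequential scans outright; it only makes the planner treat them as very expensive, which is why it is useful for diagnosis rather than as a permanent setting.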

postgresql query-performance execution-plan

— Mark Davidson

It would help if you included the query that produced each execution plan.

— Mike Sherrill'Cat Recall'13

Is there a difference of two orders of magnitude between the estimated and the actual row counts? Am I reading that right?

— Mike Sherrill'Cat Recall'13

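@Catcall I have added the query (rather fundamental to being able to work out what is going on). When you refer to the estimated rows, do you mean the estimate of 200 versus the 153812 actually returned?

— Mark Davidson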

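Yes, 200 vs. 150k looks odd at first glance. Is there a compelling reason to mix a left join with a Cartesian product (FROM data, data_area)? Also, at first glance, using DISTINCT ON without an ORDER BY clause seems like a bad idea.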

— Mike Sherrill'Cat Recall'13
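
The DISTINCT ON remark can be made concrete with a short sketch. Without ORDER BY, which data_area row survives for each data.id is unpredictable. The version below uses explicit joins and assumes, purely for illustration, that the row with the lowest area_id should be kept; the thread does not say which row is actually wanted.

```sql
-- DISTINCT ON keeps the first row of each d.id group, so the ORDER BY
-- decides which row that is. Here: the lowest area_id (an assumption).
SELECT DISTINCT ON (d.id)
       d.id,
       da.area_id
FROM data d
JOIN data_area da ON da.data_id = d.id
ORDER BY d.id, da.area_id;
```

The DISTINCT ON expression has to match the leftmost ORDER BY expression, which is why d.id comes first in the sort list.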

explain.depesz.com/s/Uzin may be informative.

— Craig Ringer 2013

Answers:

Note this line:

->  Index Scan using data_area_pkey on data_area  (cost=0.00..52.13 rows=1 width=8) 
    (actual time=0.006..0.008 rows=0 loops=335130)

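If you compute the total cost taking the loops into account, it is 52.3 * 335130 = 17527299. That is larger than the 14857017.62 of the seq_scan alternative, which is why the planner does not use the index.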
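
The comparison can be reproduced as plain arithmetic. The sketch below uses the per-loop upper cost and loop count exactly as printed in the plan line quoted above (52.13 and 335130) and the sequential scan cost from the slow plan. It is only a rough approximation of how a nested loop is charged, not the planner's full cost formula.

```sql
-- Figures copied from the two plans in the question.
SELECT 52.13 * 335130               AS index_scan_cost_all_loops,
       14857017.62                  AS seq_scan_cost,
       52.13 * 335130 > 14857017.62 AS index_path_looks_costlier;
```

The product comes to roughly 17.47 million against roughly 14.86 million for the sequential scan, so on paper the sequential scan wins even though it is far slower in practice.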

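So the optimizer is overestimating the cost of the index scan. I would guess that your data is sorted on the index (either because of a clustered index or because of how it was loaded), and/or that you have plenty of cache memory and/or fast disks. As a result, very little random I/O actually takes place.

You should also check the correlation in pg_stats, which the optimizer uses to assess clustering when computing the index cost, and finally try changing random_page_cost and cpu_index_tuple_cost to match your system.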
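
A minimal sketch of both checks follows. The table and column names come from the question; the new cost values are illustrative assumptions, not recommendations, and should be tuned against your own hardware.

```sql
-- Physical-order correlation of the join columns. Values near 1 or -1 mean
-- the table is stored in nearly the same order as the column values.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'data_area'
  AND attname IN ('data_id', 'area_id');

-- Session-only experiment with the planner cost constants.
SHOW random_page_cost;              -- PostgreSQL default is 4.0
SET random_page_cost = 2.0;         -- assumed value for a well-cached system
SET cpu_index_tuple_cost = 0.001;   -- default is 0.005; assumed value
-- ... re-run EXPLAIN ANALYZE on the query here and compare the plans ...
RESET random_page_cost;
RESET cpu_index_tuple_cost;
```

Using SET rather than editing postgresql.conf keeps the change local to the current session, so a bad guess cannot affect other queries.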

— jop

Unless I am missing something, I think @jop meant 52.13 rather than 52.3, which gives 17470326.9 (still larger than the seq_scan cost).

— BotNet

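Your CTE does nothing more than "outsource" some of the WHERE conditions, most of which look equivalent to WHERE TRUE. Because CTEs normally sit behind an optimization fence (meaning each one is optimized on its own), they can help a great deal with certain queries. In this case, however, I would expect exactly the opposite effect.

What I would try is rewriting the query to be as simple as possible: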

SELECT d.id, * 
FROM 
    data d 
    JOIN data_area da ON da.data_id = d.id
    LEFT JOIN area a ON da.area_id = a.id 
WHERE 
    d.datasetid IN (1) 
    AND da.area_id IN (28,29,30,31,32,33,25,26,27,18,19,20,21,12,13,14,15,16,17,34,35,1,2,3,4,5,6,22,23,24,7,8,9,10,11) 
    AND (readingdatetime BETWEEN '1920-01-01' AND '2013-03-11') -- this and the next condition don't do anything, I think
    AND depth BETWEEN 0 AND 99999
;

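Then check whether the index is used. It is also quite possible that you do not need all of the output columns (at the very least, the two columns of the junction table are redundant).

Please report back, and tell us which PostgreSQL version you are using.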
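
One way to run that check is sketched below. The explicit column list and the shortened IN list are assumptions chosen to keep the example brief; substitute the columns and area ids you actually need.

```sql
-- EXPLAIN (ANALYZE, BUFFERS) executes the query and reports the chosen plan,
-- actual timings, and how many pages came from cache versus disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT d.id,
       d.readingdatetime,
       d.depth,
       a.id AS area_id
FROM data d
JOIN data_area da ON da.data_id = d.id
LEFT JOIN area a ON da.area_id = a.id
WHERE d.datasetid = 1
  AND da.area_id IN (1, 2, 3);   -- shortened list, for illustration only
```

Look for "Index Scan using data_area_pkey" versus "Seq Scan on data_area" in the output to see which access path was chosen.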

— dezso

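Thank you for your suggestion, and apologies for the late reply; I have been working on another project. With your suggestion the query now seems to use the index reliably for all queries, but I am still not getting the performance I would expect. I have run an analysis on a query that involves much more data (.depesz.com/s/1yu): it takes about 4 minutes, with 95% of the time spent in the index scan.

— Mark Davidson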

I forgot to mention that I am using version 9.1.4.

— Mark Davidson

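Basically, the index scan itself is quite fast; the problem is that it is repeated several million times. What do you get if you SET enable_nestloop=off before running the query?

— dezso 2013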
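
A sketch of that experiment, scoped to a single transaction so the setting cannot leak into other work. The selected columns and the shortened IN list are assumptions for brevity.

```sql
BEGIN;
-- SET LOCAL lasts only until the end of this transaction.
SET LOCAL enable_nestloop = off;

EXPLAIN ANALYZE
SELECT d.id, da.area_id
FROM data d
JOIN data_area da ON da.data_id = d.id
WHERE d.datasetid = 1
  AND da.area_id IN (1, 2, 3);   -- shortened list, for illustration only

ROLLBACK;
```

With nested loops discouraged, the planner has to pick a hash or merge join, which shows whether the repeated index probes are really the bottleneck.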

-1

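For followers: I ran into a similar problem with a query like

select * from table where bigint_column between x and y and mod(bigint_column, 10000) = z

The problem was that I had an index on bigint_column that served the "between x and y" part, but my range covered basically all rows in the table, so the planner did not use that index [it would have had to read the whole table anyway] and did a seq_scan sequential scan instead. The fix for me was to create a new index for the "mod" side of the condition, so that an index on the expression could be used.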
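
A minimal sketch of such an expression index. The table name my_table, the index name, and the literal values are hypothetical; only bigint_column and the mod(..., 10000) expression come from the post.

```sql
-- Index the expression itself so the planner can use it for the mod() filter.
CREATE INDEX my_table_bigint_mod_idx
    ON my_table (mod(bigint_column, 10000));

-- Refresh statistics so the planner knows the expression's distribution.
ANALYZE my_table;

-- The WHERE clause must use the same expression as the index definition.
EXPLAIN
SELECT *
FROM my_table
WHERE bigint_column BETWEEN 1 AND 1000000
  AND mod(bigint_column, 10000) = 42;
```

The index only helps when the mod() condition is selective; if it also matches most of the table, a sequential scan remains the cheaper plan.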

— rogerdpack

Downvoters, feel free to leave a comment :)

— rogerdpack