Postgres is performing a sequential scan instead of an index scan

9

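I have a table with about 10 million rows and an index on a date field. When I try to extract the unique values of the indexed field, Postgres runs a sequential scan even though the result set has only 26 items. Why does the optimizer choose this plan? And what can I do to avoid it?

From other answers, I suspect this has as much to do with the query as with the index.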

explain select "labelDate" from pages group by "labelDate";
                              QUERY PLAN
-----------------------------------------------------------------------
 HashAggregate  (cost=524616.78..524617.04 rows=26 width=4)
   Group Key: "labelDate"
   ->  Seq Scan on pages  (cost=0.00..499082.42 rows=10213742 width=4)
(3 rows)

Table structure:

http=# \d pages
                                       Table "public.pages"
     Column      |          Type          |        Modifiers
-----------------+------------------------+----------------------------------
 pageid          | integer                | not null default nextval('...
 createDate      | integer                | not null
 archive         | character varying(16)  | not null
 label           | character varying(32)  | not null
 wptid           | character varying(64)  | not null
 wptrun          | integer                | not null
 url             | text                   |
 urlShort        | character varying(255) |
 startedDateTime | integer                |
 renderStart     | integer                |
 onContentLoaded | integer                |
 onLoad          | integer                |
 PageSpeed       | integer                |
 rank            | integer                |
 reqTotal        | integer                | not null
 reqHTML         | integer                | not null
 reqJS           | integer                | not null
 reqCSS          | integer                | not null
 reqImg          | integer                | not null
 reqFlash        | integer                | not null
 reqJSON         | integer                | not null
 reqOther        | integer                | not null
 bytesTotal      | integer                | not null
 bytesHTML       | integer                | not null
 bytesJS         | integer                | not null
 bytesCSS        | integer                | not null
 bytesImg        | integer                | not null
 bytesFlash      | integer                | not null
 bytesJSON       | integer                | not null
 bytesOther      | integer                | not null
 numDomains      | integer                | not null
 labelDate       | date                   |
 TTFB            | integer                |
 reqGIF          | smallint               | not null
 reqJPG          | smallint               | not null
 reqPNG          | smallint               | not null
 reqFont         | smallint               | not null
 bytesGIF        | integer                | not null
 bytesJPG        | integer                | not null
 bytesPNG        | integer                | not null
 bytesFont       | integer                | not null
 maxageMore      | smallint               | not null
 maxage365       | smallint               | not null
 maxage30        | smallint               | not null
 maxage1         | smallint               | not null
 maxage0         | smallint               | not null
 maxageNull      | smallint               | not null
 numDomElements  | integer                | not null
 numCompressed   | smallint               | not null
 numHTTPS        | smallint               | not null
 numGlibs        | smallint               | not null
 numErrors       | smallint               | not null
 numRedirects    | smallint               | not null
 maxDomainReqs   | smallint               | not null
 bytesHTMLDoc    | integer                | not null
 fullyLoaded     | integer                |
 cdn             | character varying(64)  |
 SpeedIndex      | integer                |
 visualComplete  | integer                |
 gzipTotal       | integer                | not null
 gzipSavings     | integer                | not null
 siteid          | numeric                |
Indexes:
    "pages_pkey" PRIMARY KEY, btree (pageid)
    "pages_date_url" UNIQUE CONSTRAINT, btree ("urlShort", "labelDate")
    "idx_pages_cdn" btree (cdn)
    "idx_pages_labeldate" btree ("labelDate") CLUSTER
    "idx_pages_urlshort" btree ("urlShort")
Triggers:
    pages_label_date BEFORE INSERT OR UPDATE ON pages
      FOR EACH ROW EXECUTE PROCEDURE fix_label_date()

— Charlie Clark

8

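This is a known issue with Postgres optimization. If there are only a few distinct values, as in your case, and you are on version 8.4+, a very fast workaround using a recursive query is described here: Loose Indexscan.

Your query could be rewritten as follows (LATERAL requires version 9.3+):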

-- "labelDate" is a mixed-case column name, so it must be double-quoted.
WITH RECURSIVE pa AS 
( ( SELECT "labelDate" FROM pages ORDER BY "labelDate" LIMIT 1 ) 
  UNION ALL
    SELECT n."labelDate" 
    FROM pa AS p
         , LATERAL 
              ( SELECT "labelDate" 
                FROM pages 
                WHERE "labelDate" > p."labelDate" 
                ORDER BY "labelDate" 
                LIMIT 1
              ) AS n
) 
SELECT "labelDate" 
FROM pa ;

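Erwin Brandstetter gives a thorough explanation and several variants of the query in this answer (to a related but different question): Optimize GROUP BY query to retrieve latest record per user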

— ypercube

6

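The best query depends very much on the data distribution.

You have many rows per date; that has been established. Since your case boils down to only 26 values in the result, all of the following solutions will be very fast as soon as the index is used.
(With more distinct values, the case would get more interesting.)

There is no need to involve pageid at all (as you commented).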

Index

All you need is a simple btree index on "labelDate".
With more than a few NULL values in the column, a partial index helps some more (and is smaller):

CREATE INDEX pages_labeldate_nonull_idx ON pages ("labelDate")
WHERE  "labelDate" IS NOT NULL;

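You later clarified:

0% NULL, but only after fixing things up when importing.

The partial index may still make sense to rule out intermediate states of rows with NULL values. That would avoid needless updates to the index (and the resulting bloat).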
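
To check whether the planner can actually use such a partial index, one could inspect the plan of a query that repeats the index predicate. This is only a sketch, assuming a partial index like the one above has been created on pages; the plan you get depends on the Postgres version, statistics, and cost settings.

```sql
-- The WHERE clause has to imply the index predicate ("labelDate" IS NOT NULL);
-- otherwise the planner cannot consider the partial index at all.
EXPLAIN
SELECT "labelDate"
FROM   pages
WHERE  "labelDate" IS NOT NULL
ORDER  BY "labelDate"
LIMIT  1;
```

If the plan shows an index scan on the partial index, queries written in this form are in a position to benefit from it.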

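Query

Based on a provisional range

If your dates appear in a continuous range without too many gaps, we can use the nature of the data type date to our advantage. There is only a finite, countable number of values between any two given values. If the gaps are few, this will be fastest: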

SELECT d."labelDate"
FROM  (
   SELECT generate_series(min("labelDate")::timestamp
                        , max("labelDate")::timestamp
                        , interval '1 day')::date AS "labelDate"
   FROM   pages
   ) d
WHERE  EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");

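Why the cast to timestamp in generate_series()? See:

Generating time series between two dates in PostgreSQL

Min and max can be picked from the index cheaply. If you know the minimum and/or maximum possible date, it gets a bit cheaper still. Example: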

SELECT d."labelDate"
FROM  (SELECT date '2011-01-01' + g AS "labelDate"
       FROM   generate_series(0, now()::date - date '2011-01-01' - 1) g) d
WHERE  EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");

Or, for an immutable interval:

SELECT d."labelDate"
FROM  (SELECT date '2011-01-01' + g AS "labelDate"
       FROM generate_series(0, 363) g) d
WHERE  EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");

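Loose index scan

This performs very well with any distribution of dates (as long as there are many rows per date). It is basically what @ypercube already provided, but there are some fine points, and we need to make sure our preferred index can be used everywhere.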

WITH RECURSIVE p AS (
   ( -- parentheses required for LIMIT
   SELECT "labelDate"
   FROM   pages
   WHERE  "labelDate" IS NOT NULL
   ORDER  BY "labelDate"
   LIMIT  1
   ) 
   UNION ALL
   SELECT (SELECT "labelDate" 
           FROM   pages 
           WHERE  "labelDate" > p."labelDate" 
           ORDER  BY "labelDate" 
           LIMIT  1)
   FROM   p
   WHERE  "labelDate" IS NOT NULL
   ) 
SELECT "labelDate" 
FROM   p
WHERE  "labelDate" IS NOT NULL;

The first (non-recursive) term of CTE p is effectively the same as:
```
SELECT min("labelDate") FROM pages
```
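But the verbose form makes sure our partial index is used. Plus, this form is typically a bit faster in my experience (and in my tests).
For just a single column, a correlated subquery in the recursive term of the rCTE should be a bit faster. This requires excluding rows where the subquery returns NULL for "labelDate". See:
Optimize GROUP BY query to retrieve latest record per user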
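
To see how the recursion steps from one distinct value to the next and where it stops, the same pattern can be tried on a small throwaway table. This is only a sketch: demo_pages, its index name, and the generated dates are made up for illustration and are not part of the original schema.

```sql
-- demo_pages is a hypothetical stand-in for the real pages table.
CREATE TEMP TABLE demo_pages ("labelDate" date);

-- 30000 rows, but only three distinct dates (offsets of 0, 15 and 30 days).
INSERT INTO demo_pages ("labelDate")
SELECT date '2015-01-01' + (g % 3) * 15
FROM   generate_series(1, 30000) g;

CREATE INDEX demo_pages_labeldate_idx ON demo_pages ("labelDate");
ANALYZE demo_pages;

WITH RECURSIVE p AS (
   (  -- non-recursive term: the smallest date
   SELECT "labelDate"
   FROM   demo_pages
   WHERE  "labelDate" IS NOT NULL
   ORDER  BY "labelDate"
   LIMIT  1
   )
   UNION ALL
   -- recursive term: the next larger date; the subquery yields NULL once none is left
   SELECT (SELECT "labelDate"
           FROM   demo_pages
           WHERE  "labelDate" > p."labelDate"
           ORDER  BY "labelDate"
           LIMIT  1)
   FROM   p
   WHERE  "labelDate" IS NOT NULL   -- stops the recursion after the NULL row
   )
SELECT "labelDate"
FROM   p
WHERE  "labelDate" IS NOT NULL;     -- drops the trailing NULL row
```

On this demo data the query should return the three distinct dates 2015-01-01, 2015-01-16 and 2015-01-31, one per recursion step, regardless of how many rows share each date.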

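Asides

Unquoted, legal, lower-case identifiers make your life easier.
Ordering the columns in your table definition favorably would save some disk space:

Calculating and saving space in PostgreSQL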
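
As a rough illustration of why column order can matter, pg_column_size() can be applied to two row values that hold the same data in a different order. Treat this as a sketch: the literal values and column aliases are arbitrary, and sizes of ad-hoc row values include a fixed header, so only the difference between the two numbers is of interest.

```sql
-- Same three values, different order. An integer is aligned to a multiple of
-- 4 bytes, so a smallint placed directly before it may be followed by padding.
SELECT pg_column_size(ROW(1::smallint, 1::integer, 1::smallint)) AS interleaved
     , pg_column_size(ROW(1::integer, 1::smallint, 1::smallint)) AS grouped;
```

In the pages table, the smallint and integer columns are mixed in several places, so grouping columns of the same type together is where savings might come from.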

— Erwin Brandstetter

-2

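From the PostgreSQL documentation:

CLUSTER can re-sort the table using either an index scan on the specified index, or (if the index is a b-tree) a sequential scan followed by sorting. It will attempt to choose the method that will be faster, based on planner cost parameters and available statistical information.

Your index on labelDate is a btree.

Reference: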

http://www.postgresql.org/docs/9.1/static/sql-cluster.html

— Fabrizio Mazzoni

Even with a condition such as WHERE "labelDate" BETWEEN '2000-01-01' AND '2020-01-01', the query still involves a sequential scan.

— Charlie Clark

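I am clustering at the moment (although the data was entered roughly in order). But that still doesn't really explain the query planner's decision not to use the index, even with a WHERE clause.

— Charlie Clark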

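Did you also try disabling sequential scans for the session? set enable_seqscan=off. In any case, the documentation is clear: if you cluster, it will perform a sequential scan.

— Fabrizio Mazzoni, 2015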
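
For anyone who wants to repeat this experiment without changing the setting for the rest of the session, one option is to scope it to a single transaction. This is a diagnostic sketch only. Note that enable_seqscan = off does not forbid sequential scans; it just makes the planner treat them as very expensive, so a Seq Scan can still show up when the planner finds no usable alternative.

```sql
BEGIN;
-- SET LOCAL only lasts until the end of the current transaction.
SET LOCAL enable_seqscan = off;

EXPLAIN
SELECT "labelDate" FROM pages GROUP BY "labelDate";

ROLLBACK;
```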

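Yes, I tried disabling sequential scans, but it didn't make much difference. The speed of this query isn't actually that important, because I use it to create a lookup table, which can then be used for JOINs in the real queries.

— Charlie Clark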