Improve performance of COUNT / GROUP BY on a large PostgreSQL table?

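I am running PostgreSQL 9.2 and have a relation with 12 columns and about 6,700,000 rows. It contains nodes in a 3D space, each one referencing a user (the one who created it). To query which user has created how many nodes, I do the following (with EXPLAIN ANALYZE added for more information):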

EXPLAIN ANALYZE SELECT user_id, count(user_id) FROM treenode WHERE project_id=1 GROUP BY user_id;
                                                    QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=253668.70..253669.07 rows=37 width=8) (actual time=1747.620..1747.623 rows=38 loops=1)
   ->  Seq Scan on treenode  (cost=0.00..220278.79 rows=6677983 width=8) (actual time=0.019..886.803 rows=6677983 loops=1)
         Filter: (project_id = 1)
 Total runtime: 1747.653 ms

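As you can see, this takes about 1.7 seconds. That is not too bad considering the amount of data, but I wonder whether it can be improved. I tried adding a BTree index on the user column, but this didn't help in any way.

Do you have alternative suggestions?

For the sake of completeness, this is the complete table definition with all its indexes (without foreign key constraints, references and triggers):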

    Column     |           Type           |                      Modifiers                    
---------------+--------------------------+------------------------------------------------------
 id            | bigint                   | not null default nextval('concept_id_seq'::regclass)
 user_id       | bigint                   | not null
 creation_time | timestamp with time zone | not null default now()
 edition_time  | timestamp with time zone | not null default now()
 project_id    | bigint                   | not null
 location      | double3d                 | not null
 reviewer_id   | integer                  | not null default (-1)
 review_time   | timestamp with time zone |
 editor_id     | integer                  |
 parent_id     | bigint                   |
 radius        | double precision         | not null default 0
 confidence    | integer                  | not null default 5
 skeleton_id   | bigint                   |
Indexes:
    "treenode_pkey" PRIMARY KEY, btree (id)
    "treenode_id_key" UNIQUE CONSTRAINT, btree (id)
    "skeleton_id_treenode_index" btree (skeleton_id)
    "treenode_editor_index" btree (editor_id)
    "treenode_location_x_index" btree (((location).x))
    "treenode_location_y_index" btree (((location).y))
    "treenode_location_z_index" btree (((location).z))
    "treenode_parent_id" btree (parent_id)
    "treenode_user_index" btree (user_id)

Edit: This is the result when I use the query (and index) proposed by @ypercube (the query takes about 5.3 seconds without EXPLAIN ANALYZE):

EXPLAIN ANALYZE SELECT u.id, ( SELECT COUNT(*) FROM treenode AS t WHERE t.project_id=1 AND t.user_id = u.id ) AS number_of_nodes FROM auth_user As u;
                                                                        QUERY PLAN                                                                     
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on auth_user u  (cost=0.00..6987937.85 rows=46 width=4) (actual time=29.934..5556.147 rows=46 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=151911.65..151911.66 rows=1 width=0) (actual time=120.780..120.780 rows=1 loops=46)
           ->  Bitmap Heap Scan on treenode t  (cost=4634.41..151460.44 rows=180486 width=0) (actual time=13.785..114.021 rows=145174 loops=46)
                 Recheck Cond: ((project_id = 1) AND (user_id = u.id))
                 Rows Removed by Index Recheck: 461076
                 ->  Bitmap Index Scan on treenode_user_index  (cost=0.00..4589.29 rows=180486 width=0) (actual time=13.082..13.082 rows=145174 loops=46)
                       Index Cond: ((project_id = 1) AND (user_id = u.id))
 Total runtime: 5556.190 ms
(9 rows)

Time: 5556.804 ms

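Edit 2: This is the result when I use an index on (project_id, user_id), as suggested by @erwin-brandstetter, but without the schema optimization yet (the query runs at the same speed as my original query, about 1.5 seconds):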

EXPLAIN ANALYZE SELECT user_id, count(user_id) as ct FROM treenode WHERE project_id=1 GROUP BY user_id;
                                                        QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=253670.88..253671.24 rows=37 width=8) (actual time=1807.334..1807.339 rows=38 loops=1)
   ->  Seq Scan on treenode  (cost=0.00..220280.62 rows=6678050 width=8) (actual time=0.183..893.491 rows=6678050 loops=1)
         Filter: (project_id = 1)
 Total runtime: 1807.368 ms
(4 rows)

— tomka

Do you also have a Users table with user_id as the primary key?

— ypercubeᵀᴹ

I just saw that there is a third-party columnstore add-on for Postgres. Also, I just wanted to post from the new iOS app.

— swasheck

Thank you for the good, clear, complete question: versions, table definitions, etc.

— Craig Ringer

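@ypercube Yes, I do have a Users table.

— tomka 2014

How many different project_id and user_id are there? Is the table updated continuously, or could you work with a materialized view (for some time)?

— Erwin Brandstetter 2014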

Answers:

The main problem is the missing index. But there is more.

SELECT user_id, count(*) AS ct
FROM   treenode
WHERE  project_id = 1
GROUP  BY user_id;

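You have many bigint columns. That is probably overkill. Typically, integer is more than enough for columns like project_id and user_id. This would also help the next item.
While optimizing the table definition, consider this related answer, with an emphasis on data alignment and padding. Most of the rest applies as well:
- Configuring PostgreSQL for read performance
The elephant in the room: there is no index on project_id. Create one. This is more important than the rest of this answer.
While you are at it, make it a multicolumn index: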
```
CREATE INDEX treenode_project_id_user_id_index ON treenode (project_id, user_id);
```
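If you followed my advice, integer would be perfect here:
- Is a composite index also good for queries on the first field?
user_id is defined NOT NULL, so count(user_id) is equivalent to count(*), but the latter is a bit shorter and faster. (In this specific query, this would even apply without user_id being defined NOT NULL.)
id is already the primary key, so the additional UNIQUE constraint is useless ballast. Drop it: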
```
"treenode_pkey" PRIMARY KEY, btree (id)
"treenode_id_key" UNIQUE CONSTRAINT, btree (id)
```
Aside: I would not use id as a column name. Use something descriptive like treenode_id.
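
The schema suggestions above could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes every existing project_id and user_id value fits into a 4-byte integer, and that no dependent objects (views, foreign keys on other tables, application code) rely on the bigint type or on the treenode_id_key constraint. Try it on a copy of the table first.

```sql
-- Sketch only. Each ALTER TABLE below takes an exclusive lock on treenode.

-- Drop the redundant UNIQUE constraint; the primary key already
-- enforces uniqueness of id.
ALTER TABLE treenode DROP CONSTRAINT treenode_id_key;

-- Assumption: all stored values are within the integer range,
-- otherwise this statement fails with an out-of-range error.
-- Changing the column type rewrites the table and rebuilds its indexes.
ALTER TABLE treenode
    ALTER COLUMN project_id TYPE integer,
    ALTER COLUMN user_id    TYPE integer;
```

Doing both type changes in a single ALTER TABLE means the table is rewritten once rather than twice.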

Added information

Q: How many different project_id and user_id?
A: Not more than five different project_id.

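That means Postgres has to read about 20% of the whole table to satisfy your query. Unless it can use an index-only scan, a sequential scan on the table will be faster than any plan involving an index. There is no more performance to gain here, except by optimizing your table and server settings.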

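As for the index-only scan: to see how effective it can be, run VACUUM ANALYZE if you can afford it. (Note that only VACUUM FULL locks the table exclusively; plain VACUUM does not block normal reads and writes.) Then try your query again. It should now be moderately faster using only the index. Read this related answer first:

Optimize simple query using ORDER BY date and text

As well as the manual page on index-only scans added with Postgres 9.6, and the Postgres Wiki page on index-only scans.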
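
A minimal sketch of that check, assuming the multicolumn index on (project_id, user_id) from above already exists. The BUFFERS option is an addition here, not part of the original answer; it only makes the amount of I/O visible.

```sql
-- Refresh planner statistics and the visibility map,
-- which index-only scans depend on.
VACUUM ANALYZE treenode;

-- Re-run the query and inspect the plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT user_id, count(*) AS ct
FROM   treenode
WHERE  project_id = 1
GROUP  BY user_id;
```

If the plan shows an Index Only Scan node, its "Heap Fetches" line tells how many rows still had to be looked up in the table; a number close to zero means the visibility map is doing its job.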

— Erwin Brandstetter

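Erwin, thank you for your suggestions. You are right, integer should be more than enough for user_id and project_id. Using count(*) instead of count(user_id) saves about 70 ms here, which is good to know. I have added to my first post the EXPLAIN ANALYZE output of the query after creating the index you suggested. It doesn't improve the performance, though (but it doesn't hurt either). It seems the index isn't used at all. I will test the schema optimizations soon.

— tomka 2014

If I disable seqscan, the index is used (Index Only Scan using treenode_project_id_user_id_index on treenode), but the query then takes about 2.5 seconds (about 1 second longer than with seqscan).

— tomka 2014

Thanks for your update. Those missing bits should have been part of my question, that's right. I just wasn't aware of their impact. I will optimize my schema as you suggested and see what I can gain from that. Thanks for your explanation, it makes sense to me, and therefore I will mark your answer as the accepted one.

— tomka 2014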

I would first add an index on (project_id, user_id) and then, in version 9.3, try this query:

SELECT u.user_id, c.number_of_nodes 
FROM users AS u
   , LATERAL
     ( SELECT COUNT(*) AS number_of_nodes 
       FROM treenode AS t
       WHERE t.project_id = 1 
         AND t.user_id = u.user_id
     ) c 
-- WHERE c.number_of_nodes > 0 ;   -- you probably want this as well
                                   -- to show only relevant users

In 9.2, try this one:

SELECT u.user_id, 
       ( SELECT COUNT(*) 
         FROM treenode AS t
         WHERE t.project_id = 1 
           AND t.user_id = u.user_id
       ) AS number_of_nodes  
FROM users AS u ;

I assume you have a users table. If not, replace users with:
(SELECT DISTINCT user_id FROM treenode)
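
Spelled out for the 9.2 variant, the substitution might look like the sketch below. The names users and user_id follow this answer and are not taken from the asker's actual schema. Note that the DISTINCT subquery itself has to scan treenode (or an index on it), so this form is not expected to be cheaper than using a real users table.

```sql
-- Sketch: the 9.2 query with the users table replaced by a derived table
SELECT u.user_id,
       ( SELECT count(*)
         FROM   treenode AS t
         WHERE  t.project_id = 1
         AND    t.user_id = u.user_id
       ) AS number_of_nodes
FROM  (SELECT DISTINCT user_id FROM treenode) AS u;
```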

— ypercubeᵀᴹ

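Thank you very much for your answer. You are right, I have a users table. However, using your query in 9.2, it takes about 5 seconds to get the result, regardless of whether the index exists or not. I created the index like this: CREATE INDEX treenode_user_index ON treenode USING btree (project_id, user_id);, but I also tried it without the USING clause. Am I missing something?

— tomka 2014

How many rows are there in the users table, and how many rows does the query return (that is, how many users have project_id=1)? Can you show the explain output of this query after adding the index?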

— ypercubeᵀᴹ

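First of all, my first comment was wrong. Without the suggested index, it takes about 40 seconds (!) to retrieve the result. With the index in place it takes about 5 seconds. Sorry for the confusion. My users table has 46 entries. The query returns only 9 rows. Surprisingly, SELECT DISTINCT user_id FROM treenode WHERE project_id=1; returns 38 rows. I have added the explain output to my first post. And to avoid confusion: my users table is actually called auth_user.

— tomka

I wonder how SELECT DISTINCT user_id FROM treenode WHERE project_id=1; can return 38 rows while the query returns only 9.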

— ypercubeᵀᴹ

Can you try this? SET enable_seqscan = OFF; (Query); SET enable_seqscan = ON;

— ypercubeᵀᴹ 2014
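
One way to run that experiment without leaving the setting changed for the rest of the session is SET LOCAL inside a transaction. This is a sketch of the same idea, not what the commenter wrote; the query shown is the one from the question.

```sql
BEGIN;

-- Discourage sequential scans for this transaction only;
-- the setting reverts automatically at COMMIT or ROLLBACK.
SET LOCAL enable_seqscan = off;

EXPLAIN ANALYZE
SELECT user_id, count(*) AS ct
FROM   treenode
WHERE  project_id = 1
GROUP  BY user_id;

ROLLBACK;
```

enable_seqscan = off does not forbid sequential scans outright; it only makes the planner treat them as very expensive, which is useful for comparing plans but not something to leave on in production.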