A simple database structure (for an online forum):
CREATE TABLE users (
id integer NOT NULL PRIMARY KEY,
username text
);
CREATE INDEX ON users (username);
CREATE TABLE posts (
id integer NOT NULL PRIMARY KEY,
thread_id integer NOT NULL REFERENCES threads (id),
user_id integer NOT NULL REFERENCES users (id),
date timestamp without time zone NOT NULL,
content text
);
CREATE INDEX ON posts (thread_id);
CREATE INDEX ON posts (user_id);
The users table has around 80,000 rows and the posts table has 2.6 million. This simple query, which gets the top 100 users by post count, takes 2.4 seconds:
EXPLAIN ANALYZE SELECT u.id, u.username, COUNT(p.id) AS PostCount FROM users u
INNER JOIN posts p on p.user_id = u.id
WHERE u.username IS NOT NULL
GROUP BY u.id
ORDER BY PostCount DESC LIMIT 100;
Limit (cost=316926.14..316926.39 rows=100 width=20) (actual time=2326.812..2326.830 rows=100 loops=1)
-> Sort (cost=316926.14..317014.83 rows=35476 width=20) (actual time=2326.809..2326.820 rows=100 loops=1)
Sort Key: (count(p.id)) DESC
Sort Method: top-N heapsort Memory: 32kB
-> HashAggregate (cost=315215.51..315570.27 rows=35476 width=20) (actual time=2311.296..2321.739 rows=34608 loops=1)
Group Key: u.id
-> Hash Join (cost=1176.89..308201.88 rows=1402727 width=16) (actual time=16.538..1784.546 rows=1910831 loops=1)
Hash Cond: (p.user_id = u.id)
-> Seq Scan on posts p (cost=0.00..286185.34 rows=1816634 width=8) (actual time=0.103..1144.681 rows=2173916 loops=1)
-> Hash (cost=733.44..733.44 rows=35476 width=12) (actual time=15.763..15.763 rows=34609 loops=1)
Buckets: 65536 Batches: 1 Memory Usage: 2021kB
-> Seq Scan on users u (cost=0.00..733.44 rows=35476 width=12) (actual time=0.033..6.521 rows=34609 loops=1)
Filter: (username IS NOT NULL)
Rows Removed by Filter: 11335
Execution time: 2301.357 ms
With set enable_seqscan = false it is even worse:
Limit (cost=1160881.74..1160881.99 rows=100 width=20) (actual time=2758.086..2758.107 rows=100 loops=1)
-> Sort (cost=1160881.74..1160970.43 rows=35476 width=20) (actual time=2758.084..2758.098 rows=100 loops=1)
Sort Key: (count(p.id)) DESC
Sort Method: top-N heapsort Memory: 32kB
-> GroupAggregate (cost=0.79..1159525.87 rows=35476 width=20) (actual time=0.095..2749.859 rows=34608 loops=1)
Group Key: u.id
-> Merge Join (cost=0.79..1152157.48 rows=1402727 width=16) (actual time=0.036..2537.064 rows=1910831 loops=1)
Merge Cond: (u.id = p.user_id)
-> Index Scan using users_pkey on users u (cost=0.29..2404.83 rows=35476 width=12) (actual time=0.016..41.163 rows=34609 loops=1)
Filter: (username IS NOT NULL)
Rows Removed by Filter: 11335
-> Index Scan using posts_user_id_index on posts p (cost=0.43..1131472.19 rows=1816634 width=8) (actual time=0.012..2191.856 rows=2173916 loops=1)
Planning time: 1.281 ms
Execution time: 2758.187 ms
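For anyone repeating this experiment, a planner switch such as enable_seqscan can be limited to a single transaction so that it does not leak into the rest of the session. A minimal sketch, assuming the same query as above; the BEGIN/ROLLBACK wrapper is an addition for illustration, not part of the original test:

```sql
BEGIN;
-- SET LOCAL lasts only until the end of this transaction.
SET LOCAL enable_seqscan = off;

EXPLAIN ANALYZE
SELECT u.id, u.username, COUNT(p.id) AS PostCount
FROM users u
INNER JOIN posts p ON p.user_id = u.id
WHERE u.username IS NOT NULL
GROUP BY u.id
ORDER BY PostCount DESC
LIMIT 100;

ROLLBACK;  -- the planner setting reverts here
```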
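username is missing from the GROUP BY in Postgres because it is not required there (SQL Server says I have to group by username if I want to select username). Grouping by username as well adds a few ms to the execution time on Postgres, or makes no difference.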
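For science, I installed Microsoft SQL Server on the same server (running Arch Linux, 8-core Xeon, 24 GB RAM, SSD) and migrated all the data from Postgres: same table structure, same indexes, same data. The same query to get the top 100 posters runs in 0.3 seconds: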
SELECT TOP 100 u.id, u.username, COUNT(p.id) AS PostCount FROM dbo.users u
INNER JOIN dbo.posts p on p.user_id = u.id
WHERE u.username IS NOT NULL
GROUP BY u.id, u.username
ORDER BY PostCount DESC
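It yields the same results from the same data, but 8 times faster. And this is a beta version of MS SQL on Linux; I guess that on its "home" OS, Windows Server, it could be faster still.
Is my PostgreSQL query completely wrong, or is PostgreSQL just slow?
Additional info
The version is almost the newest (9.6.1; the latest is currently 9.6.2, but Arch Linux packages are outdated and slow to update). Config: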
max_connections = 75
shared_buffers = 3584MB
effective_cache_size = 10752MB
work_mem = 24466kB
maintenance_work_mem = 896MB
dynamic_shared_memory_type = posix
min_wal_size = 1GB
max_wal_size = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
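EXPLAIN ANALYZE output: https://pastebin.com/HxucRgnk
I tried all kinds of indexes, even GIN and GiST; the fastest way for PostgreSQL (and Googling confirms this for queries over many rows) is a sequential scan.
MS SQL Server 14.0.405.200-1, default configuration.
I use this query in an API (as a plain SELECT, without EXPLAIN ANALYZE), and when I call that API endpoint from Chrome it reports about 2500 ms. Allowing 50 ms for HTTP and web server overhead (the API and SQL run on the same server), that is the same result. I don't care about 100 ms here or there; what I care about is two whole seconds.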
explain analyze SELECT user_id, count(9) FROM posts group by user_id;
takes 700 ms. The posts table is 2154 MB in size.
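Figures like the table size above can be read from PostgreSQL's built-in size functions. The query below is a rough sketch of one way to check them; the column aliases are made up for illustration, and the bytes-per-row value is only an approximation because it divides the heap size by a full row count:

```sql
SELECT
    -- Main heap only (no indexes, no TOAST).
    pg_size_pretty(pg_relation_size('posts'))       AS heap_size,
    -- Heap plus indexes plus TOAST.
    pg_size_pretty(pg_total_relation_size('posts')) AS total_size,
    -- Approximate average heap bytes per row; NULLIF avoids division by zero.
    pg_relation_size('posts')
        / NULLIF((SELECT count(*) FROM posts), 0)   AS approx_bytes_per_row;
```

With roughly 2.2 million rows in about 2154 MB, this works out to something on the order of 1 kB per row.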
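Can you change GROUP BY u.id to GROUP BY p.user_id and try that? My guess is that Postgres does the join first and the GROUP BY second because you are grouping by the users table identifier, even though you only need the posts user_id to get the top N rows.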
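One way to act on this suggestion is to aggregate posts on its own first and only then join to users. Simply swapping the GROUP BY column would not be enough, since Postgres would reject u.id and u.username in the select list once the grouping column is p.user_id. The following is a sketch, not a measured fix: the alias pc and the column name post_count are hypothetical, and COUNT(*) is used in place of COUNT(p.id), which is equivalent here because posts.id is NOT NULL.

```sql
SELECT u.id, u.username, pc.post_count
FROM (
    -- Aggregate posts alone, before touching users.
    SELECT user_id, COUNT(*) AS post_count
    FROM posts
    GROUP BY user_id
) AS pc
INNER JOIN users u ON u.id = pc.user_id
WHERE u.username IS NOT NULL
ORDER BY pc.post_count DESC
LIMIT 100;
```

Since the aggregate over posts alone was reported to take 700 ms, this form may land closer to that figure than to 2.4 seconds, but that would need to be confirmed with EXPLAIN ANALYZE on the real data.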
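It may make sense to separate the post content from the rest of the posts table, using a table like this:
CREATE TABLE post_content (post_id integer PRIMARY KEY REFERENCES posts (id), content text);
That way, most of the I/O that is "wasted" on this type of query could be spared. If the posts are smaller than this, a VACUUM FULL on posts can help.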
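A possible migration along these lines is sketched below. It assumes post_id is an integer (to match posts.id), that the application is updated to read content from the new table, and that a maintenance window is available, because VACUUM FULL takes an exclusive lock on the table while it rewrites it.

```sql
BEGIN;

CREATE TABLE post_content (
    post_id integer PRIMARY KEY REFERENCES posts (id),
    content text
);

-- Copy the existing bodies into the side table.
INSERT INTO post_content (post_id, content)
SELECT id, content
FROM posts;

-- Dropping the column only hides it; the space is not freed yet.
ALTER TABLE posts DROP COLUMN content;

COMMIT;

-- Rewrite the table to reclaim the space (cannot run inside a transaction block).
VACUUM FULL posts;
```

After this, a sequential scan over posts only has to read the narrow id, thread_id, user_id and date columns.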