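I've done a lot of experimenting, and here are my findings.
GIN and sorting
GIN indexes currently (as of version 9.4) cannot assist with ordering. From the PostgreSQL documentation:
Of the index types currently supported by PostgreSQL, only B-tree can produce sorted output; the other index types return matching rows in an unspecified, implementation-dependent order.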
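The plans further down refer to an index named idx_products_category_ids_gin and a primary key products_pkey, but the schema itself is not shown in this answer. A minimal sketch of what it might look like; the column types other than category_ids (an integer[] according to the plans) are assumptions:

```sql
-- Hypothetical schema: only the names that appear in the plans are taken
-- from the answer, the column types are assumed for illustration.
CREATE TABLE products (
    id           serial PRIMARY KEY,            -- creates the index products_pkey
    title        text NOT NULL,                 -- assumed type
    score        integer NOT NULL,              -- assumed type
    category_ids integer[] NOT NULL,
    published    boolean NOT NULL DEFAULT true  -- assumed type
);

-- GIN index that serves the containment operator @> on the array column.
CREATE INDEX idx_products_category_ids_gin
    ON products USING gin (category_ids);
```

The GIN index can answer category_ids @> ARRAY[...] quickly, but the matching rows still have to be sorted in a separate step.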
work_mem
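Thanks to Chris for pointing out this configuration parameter. It defaults to 4MB; if your record set is larger, increasing work_mem to an appropriate value (which can be derived from the EXPLAIN ANALYSE output) can speed up sort operations significantly.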
ALTER SYSTEM SET work_mem TO '32MB';
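Reload the server configuration for the change to take effect (for example with SELECT pg_reload_conf(); work_mem does not require a full restart), then double-check: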
SHOW work_mem;
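work_mem can also be raised for a single session or transaction instead of server-wide, which is handy when experimenting with sort sizes. A minimal sketch (the 64MB value is only an example, not a recommendation):

```sql
-- Session-level override: lasts until RESET or the end of the session.
SET work_mem = '64MB';
SHOW work_mem;
RESET work_mem;

-- Transaction-level override: reverts automatically at COMMIT or ROLLBACK.
BEGIN;
SET LOCAL work_mem = '64MB';
-- run the sort-heavy query here
COMMIT;
```

Keep in mind that work_mem applies per sort or hash operation, so a single query can use several multiples of it.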
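Original query
I populated my database with 650k products, with some categories holding up to 40k products. I simplified the query a bit by removing the published clause: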
SELECT * FROM products WHERE category_ids @> ARRAY [248688]
ORDER BY score DESC, title LIMIT 10 OFFSET 30000;
Limit (cost=2435.62..2435.62 rows=1 width=1390) (actual time=1141.254..1141.256 rows=10 loops=1)
-> Sort (cost=2434.00..2435.62 rows=646 width=1390) (actual time=1115.706..1140.513 rows=30010 loops=1)
Sort Key: score, title
Sort Method: external merge Disk: 29656kB
-> Bitmap Heap Scan on products (cost=17.01..2403.85 rows=646 width=1390) (actual time=11.831..25.646 rows=41666 loops=1)
Recheck Cond: (category_ids @> '{248688}'::integer[])
Heap Blocks: exact=6471
-> Bitmap Index Scan on idx_products_category_ids_gin (cost=0.00..16.85 rows=646 width=0) (actual time=10.140..10.140 rows=41666 loops=1)
Index Cond: (category_ids @> '{248688}'::integer[])
Planning time: 0.288 ms
Execution time: 1146.322 ms
As we can see, work_mem was not enough: the plan shows Sort Method: external merge Disk: 29656kB (the number here is approximate; an in-memory quicksort needs slightly more than 32MB).
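Reduce memory footprint
Don't select full records for sorting. Select only the ids, apply the sort, offset and limit, and then load just the 10 records we need: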
SELECT * FROM products WHERE id in (
SELECT id FROM products WHERE category_ids @> ARRAY[248688]
ORDER BY score DESC, title LIMIT 10 OFFSET 30000
) ORDER BY score DESC, title;
Sort (cost=2444.10..2444.11 rows=1 width=1390) (actual time=707.861..707.862 rows=10 loops=1)
Sort Key: products.score, products.title
Sort Method: quicksort Memory: 35kB
-> Nested Loop (cost=2436.05..2444.09 rows=1 width=1390) (actual time=707.764..707.803 rows=10 loops=1)
-> HashAggregate (cost=2435.63..2435.64 rows=1 width=4) (actual time=707.744..707.746 rows=10 loops=1)
Group Key: products_1.id
-> Limit (cost=2435.62..2435.62 rows=1 width=72) (actual time=707.732..707.734 rows=10 loops=1)
-> Sort (cost=2434.00..2435.62 rows=646 width=72) (actual time=704.163..706.955 rows=30010 loops=1)
Sort Key: products_1.score, products_1.title
Sort Method: quicksort Memory: 7396kB
-> Bitmap Heap Scan on products products_1 (cost=17.01..2403.85 rows=646 width=72) (actual time=11.587..35.076 rows=41666 loops=1)
Recheck Cond: (category_ids @> '{248688}'::integer[])
Heap Blocks: exact=6471
-> Bitmap Index Scan on idx_products_category_ids_gin (cost=0.00..16.85 rows=646 width=0) (actual time=9.883..9.883 rows=41666 loops=1)
Index Cond: (category_ids @> '{248688}'::integer[])
-> Index Scan using products_pkey on products (cost=0.42..8.45 rows=1 width=1390) (actual time=0.004..0.004 rows=1 loops=10)
Index Cond: (id = products_1.id)
Planning time: 0.682 ms
Execution time: 707.973 ms
Note Sort Method: quicksort Memory: 7396kB. The result is much better.
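To see how much narrower the rows being sorted become, the average row sizes can be measured directly. This is a sketch and not part of the original measurements; pg_column_size reports the stored size, so compressed or TOASTed values may appear smaller than they are in memory:

```sql
-- Compare the average size of a full row with the size of just the
-- columns the inner query needs (id plus the two sort columns).
SELECT avg(pg_column_size(p.*)) AS avg_full_row_bytes,
       avg(pg_column_size(p.id)
         + pg_column_size(p.score)
         + pg_column_size(p.title)) AS avg_sort_row_bytes
FROM products p
WHERE p.category_ids @> ARRAY[248688];
```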
JOIN and an additional B-tree index
As Chris advised, I created an additional index:
CREATE INDEX idx_test7 ON products (score DESC, title);
First I tried joining like this:
SELECT * FROM products NATURAL JOIN
(SELECT id FROM products WHERE category_ids @> ARRAY[248688]
ORDER BY score DESC, title LIMIT 10 OFFSET 30000) c
ORDER BY score DESC, title;
The query plan differs slightly, but the result is the same:
Sort (cost=2444.10..2444.11 rows=1 width=1390) (actual time=700.747..700.747 rows=10 loops=1)
Sort Key: products.score, products.title
Sort Method: quicksort Memory: 35kB
-> Nested Loop (cost=2436.05..2444.09 rows=1 width=1390) (actual time=700.651..700.690 rows=10 loops=1)
-> HashAggregate (cost=2435.63..2435.64 rows=1 width=4) (actual time=700.630..700.630 rows=10 loops=1)
Group Key: products_1.id
-> Limit (cost=2435.62..2435.62 rows=1 width=72) (actual time=700.619..700.619 rows=10 loops=1)
-> Sort (cost=2434.00..2435.62 rows=646 width=72) (actual time=697.304..699.868 rows=30010 loops=1)
Sort Key: products_1.score, products_1.title
Sort Method: quicksort Memory: 7396kB
-> Bitmap Heap Scan on products products_1 (cost=17.01..2403.85 rows=646 width=72) (actual time=10.796..32.258 rows=41666 loops=1)
Recheck Cond: (category_ids @> '{248688}'::integer[])
Heap Blocks: exact=6471
-> Bitmap Index Scan on idx_products_category_ids_gin (cost=0.00..16.85 rows=646 width=0) (actual time=9.234..9.234 rows=41666 loops=1)
Index Cond: (category_ids @> '{248688}'::integer[])
-> Index Scan using products_pkey on products (cost=0.42..8.45 rows=1 width=1390) (actual time=0.004..0.004 rows=1 loops=10)
Index Cond: (id = products_1.id)
Planning time: 1.015 ms
Execution time: 700.918 ms
Playing with various offsets and product counts, I could not make PostgreSQL use the additional B-tree index.
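One way to check whether the planner is able to use idx_test7 at all is to temporarily discourage the plan it prefers and look at what it falls back to. A diagnostic sketch for a test session only (these settings are not meant for production use):

```sql
BEGIN;
-- Make bitmap scans and explicit sorts look very expensive to the planner,
-- for this transaction only.
SET LOCAL enable_bitmapscan = off;
SET LOCAL enable_sort = off;

EXPLAIN ANALYZE
SELECT id FROM products WHERE category_ids @> ARRAY[248688]
ORDER BY score DESC, title LIMIT 10 OFFSET 30000;

ROLLBACK;  -- discards the SET LOCAL overrides
```

If the resulting plan walks idx_test7 and filters on category_ids, comparing its actual time with the GIN plan gives an idea of why the planner avoided it.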
So I went the classic way and created a junction table:
CREATE TABLE prodcats AS SELECT id AS product_id, unnest(category_ids) AS category_id FROM products;
CREATE INDEX idx_prodcats_cat_prod_id ON prodcats (category_id, product_id);
SELECT p.* FROM products p JOIN prodcats c ON (p.id=c.product_id)
WHERE c.category_id=248688
ORDER BY p.score DESC, p.title LIMIT 10 OFFSET 30000;
Limit (cost=122480.06..122480.09 rows=10 width=1390) (actual time=1290.360..1290.362 rows=10 loops=1)
-> Sort (cost=122405.06..122509.00 rows=41574 width=1390) (actual time=1264.250..1289.575 rows=30010 loops=1)
Sort Key: p.score, p.title
Sort Method: external merge Disk: 29656kB
-> Merge Join (cost=50.46..94061.13 rows=41574 width=1390) (actual time=117.746..182.048 rows=41666 loops=1)
Merge Cond: (p.id = c.product_id)
-> Index Scan using products_pkey on products p (cost=0.42..90738.43 rows=646067 width=1390) (actual time=0.034..116.313 rows=210283 loops=1)
-> Index Only Scan using idx_prodcats_cat_prod_id on prodcats c (cost=0.43..1187.98 rows=41574 width=4) (actual time=0.022..7.137 rows=41666 loops=1)
Index Cond: (category_id = 248688)
Heap Fetches: 0
Planning time: 0.873 ms
Execution time: 1294.826 ms
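The B-tree index is still not used, and the result set does not fit in work_mem, hence the poor result.
But under some circumstances, with a large number of products and a small offset, PostgreSQL now decides to use the B-tree index: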
SELECT p.* FROM products p JOIN prodcats c ON (p.id=c.product_id)
WHERE c.category_id=248688
ORDER BY p.score DESC, p.title LIMIT 10 OFFSET 300;
Limit (cost=3986.65..4119.51 rows=10 width=1390) (actual time=264.176..264.574 rows=10 loops=1)
-> Nested Loop (cost=0.98..552334.77 rows=41574 width=1390) (actual time=250.378..264.558 rows=310 loops=1)
-> Index Scan using idx_test7 on products p (cost=0.55..194665.62 rows=646067 width=1390) (actual time=0.030..83.026 rows=108037 loops=1)
-> Index Only Scan using idx_prodcats_cat_prod_id on prodcats c (cost=0.43..0.54 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=108037)
Index Cond: ((category_id = 248688) AND (product_id = p.id))
Heap Fetches: 0
Planning time: 0.585 ms
Execution time: 264.664 ms
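This is in fact quite logical: the B-tree index does not produce the result directly here. It is only used as a guide for walking the products in sort order, with the junction table probed for each row (108037 probes in the plan above).
Let's compare with the GIN query: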
SELECT * FROM products WHERE id in (
SELECT id FROM products WHERE category_ids @> ARRAY[248688]
ORDER BY score DESC, title LIMIT 10 OFFSET 300
) ORDER BY score DESC, title;
Sort (cost=2519.53..2519.55 rows=10 width=1390) (actual time=143.809..143.809 rows=10 loops=1)
Sort Key: products.score, products.title
Sort Method: quicksort Memory: 35kB
-> Nested Loop (cost=2435.14..2519.36 rows=10 width=1390) (actual time=143.693..143.736 rows=10 loops=1)
-> HashAggregate (cost=2434.71..2434.81 rows=10 width=4) (actual time=143.678..143.680 rows=10 loops=1)
Group Key: products_1.id
-> Limit (cost=2434.56..2434.59 rows=10 width=72) (actual time=143.668..143.670 rows=10 loops=1)
-> Sort (cost=2433.81..2435.43 rows=646 width=72) (actual time=143.642..143.653 rows=310 loops=1)
Sort Key: products_1.score, products_1.title
Sort Method: top-N heapsort Memory: 68kB
-> Bitmap Heap Scan on products products_1 (cost=17.01..2403.85 rows=646 width=72) (actual time=11.625..31.868 rows=41666 loops=1)
Recheck Cond: (category_ids @> '{248688}'::integer[])
Heap Blocks: exact=6471
-> Bitmap Index Scan on idx_products_category_ids_gin (cost=0.00..16.85 rows=646 width=0) (actual time=9.916..9.916 rows=41666 loops=1)
Index Cond: (category_ids @> '{248688}'::integer[])
-> Index Scan using products_pkey on products (cost=0.42..8.45 rows=1 width=1390) (actual time=0.004..0.004 rows=1 loops=10)
Index Cond: (id = products_1.id)
Planning time: 0.630 ms
Execution time: 143.921 ms
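The GIN result is much better. I checked various combinations of product count and offset, and under no circumstances was the junction table approach any better.
The power of a real index
For PostgreSQL to fully utilize an index for sorting, all of the query's WHERE parameters as well as its ORDER BY parameters must reside in a single B-tree index. To achieve this, I copied the sort fields from products to the junction table: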
DROP TABLE IF EXISTS prodcats; -- remove the two-column version created earlier
CREATE TABLE prodcats AS SELECT id AS product_id, unnest(category_ids) AS category_id, score, title FROM products;
CREATE INDEX idx_prodcats_1 ON prodcats (category_id, score DESC, title, product_id);
SELECT * FROM products WHERE id in (SELECT product_id FROM prodcats WHERE category_id=248688 ORDER BY score DESC, title LIMIT 10 OFFSET 30000) ORDER BY score DESC, title;
Sort (cost=2149.65..2149.67 rows=10 width=1390) (actual time=7.011..7.011 rows=10 loops=1)
Sort Key: products.score, products.title
Sort Method: quicksort Memory: 35kB
-> Nested Loop (cost=2065.26..2149.48 rows=10 width=1390) (actual time=6.916..6.950 rows=10 loops=1)
-> HashAggregate (cost=2064.83..2064.93 rows=10 width=4) (actual time=6.902..6.904 rows=10 loops=1)
Group Key: prodcats.product_id
-> Limit (cost=2064.02..2064.71 rows=10 width=74) (actual time=6.893..6.895 rows=10 loops=1)
-> Index Only Scan using idx_prodcats_1 on prodcats (cost=0.56..2860.10 rows=41574 width=74) (actual time=0.010..6.173 rows=30010 loops=1)
Index Cond: (category_id = 248688)
Heap Fetches: 0
-> Index Scan using products_pkey on products (cost=0.42..8.45 rows=1 width=1390) (actual time=0.003..0.003 rows=1 loops=10)
Index Cond: (id = prodcats.product_id)
Planning time: 0.318 ms
Execution time: 7.066 ms
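And this is the worst-case scenario, with a large number of products in the chosen category and a large offset. With offset = 300, the execution time is just 0.5 ms.
Unfortunately, maintaining such a junction table requires extra effort. It could be accomplished with an indexed materialized view, but that is only useful when your data is updated rarely, because refreshing such a materialized view is quite a heavy operation.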
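For reference, the materialized view variant mentioned above might look roughly like this. The names prodcats_mv and idx_prodcats_mv_1 are hypothetical, and the paging query would have to read from the view instead of prodcats:

```sql
-- Hypothetical materialized view mirroring the denormalized junction table.
CREATE MATERIALIZED VIEW prodcats_mv AS
SELECT id AS product_id, unnest(category_ids) AS category_id, score, title
FROM products;

-- Same column order as idx_prodcats_1, so WHERE and ORDER BY are both covered.
CREATE INDEX idx_prodcats_mv_1
    ON prodcats_mv (category_id, score DESC, title, product_id);

-- Recomputes the whole view from products; this is the heavy step.
REFRESH MATERIALIZED VIEW prodcats_mv;
```

REFRESH MATERIALIZED VIEW rebuilds the entire view each time it runs, which is why this only suits data that changes rarely.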
So for now I am staying with the GIN index, combined with an increased work_mem and the reduced-memory-footprint query.