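I'm looking to select rows based on whether a column is contained in a large list of values that I pass as an integer array.
Here's the query I currently use: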
SELECT item_id, other_stuff, ...
FROM (
    SELECT
        -- Partitioned row number as we only want N rows per id
        ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY start_date) AS r,
        item_id, other_stuff, ...
    FROM mytable
    WHERE
        item_id = ANY ($1) -- Integer array
        AND end_date > $2
    ORDER BY item_id ASC, start_date ASC, allowed ASC
) x
WHERE x.r <= 12
The table is structured as follows:
Column | Type | Collation | Nullable | Default
---------------+-----------------------------+-----------+----------+---------
item_id | integer | | not null |
allowed | boolean | | not null |
start_date | timestamp without time zone | | not null |
end_date | timestamp without time zone | | not null |
...
Indexes:
"idx_dtr_query" btree (item_id, start_date, allowed, end_date)
...
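I came up with this index after trying different ones and running EXPLAIN on the query. It was the most efficient one for both querying and sorting. Here is the explain analyze of the query: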
Subquery Scan on x (cost=0.56..368945.41 rows=302230 width=73) (actual time=0.021..276.476 rows=168395 loops=1)
  Filter: (x.r <= 12)
  Rows Removed by Filter: 90275
  -> WindowAgg (cost=0.56..357611.80 rows=906689 width=73) (actual time=0.019..248.267 rows=258670 loops=1)
        -> Index Scan using idx_dtr_query on mytable (cost=0.56..339478.02 rows=906689 width=73) (actual time=0.013..130.362 rows=258670 loops=1)
              Index Cond: ((item_id = ANY ('{/* 15,000 integers */}'::integer[])) AND (end_date > '2018-03-30 12:08:00'::timestamp without time zone))
Planning time: 30.349 ms
Execution time: 284.619 ms
The issue is that the int array can contain up to 15,000 elements, and the query gets quite slow in that case (about 800 ms on my laptop, a recent Dell XPS).
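I thought passing the int array as a parameter might be slow, and since the list of IDs can be stored in the database beforehand, I tried doing that. I stored them in an array in another table and used item_id = ANY (SELECT UNNEST(item_ids) FROM ...), which was slower than my current approach. I also tried storing them row by row and using item_id IN (SELECT item_id FROM ...), which was slower still, even with only the rows relevant to my test case in the table.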
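For illustration, the two stored-list variants described above have roughly the following shape. This is only a sketch: the tables item_lists and item_list_rows and the column list_id are hypothetical names made up for the example, not the ones actually used, and the selected columns are reduced to ones shown in the table definition. $1 here stands for the hypothetical list ID, not the integer array from the original query; $2 is the end_date bound as before.

```sql
-- Hypothetical helper tables, for illustration only (not part of the real schema).
CREATE TABLE item_lists (
    list_id  integer PRIMARY KEY,
    item_ids integer[] NOT NULL        -- variant 1: the whole ID list in one array
);

CREATE TABLE item_list_rows (
    list_id integer NOT NULL,
    item_id integer NOT NULL,          -- variant 2: one row per ID
    PRIMARY KEY (list_id, item_id)
);

-- Variant 1: unnest the stored array inside a subquery.
SELECT item_id, start_date, end_date
FROM mytable
WHERE item_id = ANY (SELECT unnest(item_ids) FROM item_lists WHERE list_id = $1)
  AND end_date > $2;

-- Variant 2: IN over the row-per-ID table.
SELECT item_id, start_date, end_date
FROM mytable
WHERE item_id IN (SELECT item_id FROM item_list_rows WHERE list_id = $1)
  AND end_date > $2;
```

In both variants the filter on mytable is the same as in the original query; only the source of the ID list changes.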
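Is there a better way of doing this?
Update: following Evan's comments, I tried another approach. Each item is part of several groups, so instead of passing the group's item IDs, I tried adding the group IDs to mytable: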
Column | Type | Collation | Nullable | Default
---------------+-----------------------------+-----------+----------+---------
item_id | integer | | not null |
allowed | boolean | | not null |
start_date | timestamp without time zone | | not null |
end_date | timestamp without time zone | | not null |
group_ids | integer[] | | not null |
...
Indexes:
"idx_dtr_query" btree (item_id, start_date, allowed, end_date)
"idx_dtr_group_ids" gin (group_ids)
...
New query ($1 is the targeted group ID):
SELECT item_id, other_stuff, ...
FROM (
    SELECT
        -- Partitioned row number as we only want N rows per id
        ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY start_date) AS r,
        item_id, other_stuff, ...
    FROM mytable
    WHERE
        $1 = ANY (group_ids)
        AND end_date > $2
    ORDER BY item_id ASC, start_date ASC, allowed ASC
) x
WHERE x.r <= 12
Explain analyze:
Subquery Scan on x (cost=123356.60..137112.58 rows=131009 width=74) (actual time=811.337..1087.880 rows=172023 loops=1)
  Filter: (x.r <= 12)
  Rows Removed by Filter: 219726
  -> WindowAgg (cost=123356.60..132199.73 rows=393028 width=74) (actual time=811.330..1040.121 rows=391749 loops=1)
        -> Sort (cost=123356.60..124339.17 rows=393028 width=74) (actual time=811.311..868.127 rows=391749 loops=1)
              Sort Key: item_id, start_date, allowed
              Sort Method: external sort Disk: 29176kB
              -> Seq Scan on mytable (cost=0.00..69370.90 rows=393028 width=74) (actual time=0.105..464.126 rows=391749 loops=1)
                    Filter: ((end_date > '2018-04-06 12:00:00'::timestamp without time zone) AND (2928 = ANY (group_ids)))
                    Rows Removed by Filter: 1482567
Planning time: 0.756 ms
Execution time: 1098.348 ms
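There might be room for improvement with the indexes, but I'm having a hard time understanding how Postgres uses them, so I'm not sure what to change.
How many rows are in "mytable"? How many distinct "item_id" values are there?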
—
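Nick
Also, shouldn't you have a uniqueness constraint (probably a unique index that is not defined yet) on item_id in mytable? ... Edit: oh, I see the "PARTITION BY item_id", so the question becomes: "What is the natural, real key for your data? What should form a unique index there?"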
—
Nick
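About 12 million rows in mytable, with about 500k distinct item_id values. There is no real natural unique key for this table; it is automatically generated data for repeating events. I guess item_id + start_date + name (a field not shown here) could constitute some kind of key.
—
Jukurrpa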
Could you post the execution plan you are getting?
—
Colin 't Hart
Sure, I added the explain analyze to the question.
—
Jukurrpa