PostgreSQL - Working with arrays of thousands of elements

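I'm looking to select rows based on whether a column's value is contained in a large list of values that I pass as an integer array.

Here's the query I currently use: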

SELECT item_id, other_stuff, ...
FROM (
    SELECT
        -- Partitioned row number as we only want N rows per id
        ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY start_date) AS r,
        item_id, other_stuff, ...
    FROM mytable
    WHERE
        item_id = ANY ($1) -- Integer array
        AND end_date > $2
    ORDER BY item_id ASC, start_date ASC, allowed ASC
) x
WHERE x.r <= 12

The table is structured as follows:

    Column     |            Type             | Collation | Nullable | Default 
---------------+-----------------------------+-----------+----------+---------
 item_id       | integer                     |           | not null | 
 allowed       | boolean                     |           | not null | 
 start_date    | timestamp without time zone |           | not null | 
 end_date      | timestamp without time zone |           | not null | 
 ...


 Indexes:
    "idx_dtr_query" btree (item_id, start_date, allowed, end_date)
    ...

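I came up with this index after trying different ones and running EXPLAIN on the query. It was the most efficient one for both querying and sorting. Here is the EXPLAIN ANALYZE output for the query: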

Subquery Scan on x  (cost=0.56..368945.41 rows=302230 width=73) (actual time=0.021..276.476 rows=168395 loops=1)
  Filter: (x.r <= 12)
  Rows Removed by Filter: 90275
  ->  WindowAgg  (cost=0.56..357611.80 rows=906689 width=73) (actual time=0.019..248.267 rows=258670 loops=1)
        ->  Index Scan using idx_dtr_query on mytable  (cost=0.56..339478.02 rows=906689 width=73) (actual time=0.013..130.362 rows=258670 loops=1)
              Index Cond: ((item_id = ANY ('{/* 15,000 integers */}'::integer[])) AND (end_date > '2018-03-30 12:08:00'::timestamp without time zone))
Planning time: 30.349 ms
Execution time: 284.619 ms

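The issue is that the int array can contain up to 15,000 elements, and the query gets quite slow in that case (about 800 ms on my laptop, a recent Dell XPS).

I thought passing the int array as a parameter might be slow, and since the list of IDs can be stored in the database beforehand, I tried doing that. I stored them in an array in another table and used item_id = ANY (SELECT UNNEST(item_ids) FROM ...), which was slower than my current approach. I also tried storing them row by row and using item_id IN (SELECT item_id FROM ...), which was even slower, even with only the rows relevant to my test case in the table.

Is there a better way of doing this?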

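Update: following Evan's comments, I tried another approach. Each item is part of several groups, so instead of passing the group's item IDs, I tried adding the group IDs to mytable: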

    Column     |            Type             | Collation | Nullable | Default 
---------------+-----------------------------+-----------+----------+---------
 item_id       | integer                     |           | not null | 
 allowed       | boolean                     |           | not null | 
 start_date    | timestamp without time zone |           | not null | 
 end_date      | timestamp without time zone |           | not null | 
 group_ids     | integer[]                   |           | not null | 
 ...

 Indexes:
    "idx_dtr_query" btree (item_id, start_date, allowed, end_date)
    "idx_dtr_group_ids" gin (group_ids)
    ...

New query ($1 is the target group ID):

SELECT item_id, other_stuff, ...
FROM (
    SELECT
        -- Partitioned row number as we only want N rows per id
        ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY start_date) AS r,
        item_id, other_stuff, ...
    FROM mytable
    WHERE
        $1 = ANY (group_ids)
        AND end_date > $2
    ORDER BY item_id ASC, start_date ASC, allowed ASC
) x
WHERE x.r <= 12

EXPLAIN ANALYZE:

Subquery Scan on x  (cost=123356.60..137112.58 rows=131009 width=74) (actual time=811.337..1087.880 rows=172023 loops=1)
  Filter: (x.r <= 12)
  Rows Removed by Filter: 219726
  ->  WindowAgg  (cost=123356.60..132199.73 rows=393028 width=74) (actual time=811.330..1040.121 rows=391749 loops=1)
        ->  Sort  (cost=123356.60..124339.17 rows=393028 width=74) (actual time=811.311..868.127 rows=391749 loops=1)
              Sort Key: item_id, start_date, allowed
              Sort Method: external sort  Disk: 29176kB
              ->  Seq Scan on mytable (cost=0.00..69370.90 rows=393028 width=74) (actual time=0.105..464.126 rows=391749 loops=1)
                    Filter: ((end_date > '2018-04-06 12:00:00'::timestamp without time zone) AND (2928 = ANY (group_ids)))
                    Rows Removed by Filter: 1482567
Planning time: 0.756 ms
Execution time: 1098.348 ms

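There might be room for improvement with the indexes, but I'm having a hard time understanding how Postgres uses them, so I'm not sure what to change.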

postgresql postgresql-performance

— Jukurrpa

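How many rows are in "mytable"? How many different "item_id" values are there?

— Nick

Also, shouldn't you have a uniqueness constraint (probably a unique index that isn't defined yet) on item_id in mytable? ... Edit: oh, I see "PARTITION BY item_id", so this question becomes "What is the natural, real key for your data? What should form the unique index there?"

— Nick

About 12 million rows in mytable, with about 500k different item_id values. There is no real natural unique key for this table; it's data that is generated automatically for repeating events. I guess item_id + start_date + name (a field not shown here) could constitute some kind of key.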

— Jukurrpa '18

Can you post the execution plan you're getting?

— Colin 't Hart

Sure, I added the EXPLAIN ANALYZE output to the question.

— Jukurrpa

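"Is there a better way of doing this?"

Yes, use a temp table. There is nothing wrong with creating an indexed temp table when your query is that insane.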

BEGIN;
  CREATE TEMP TABLE myitems ( item_id int PRIMARY KEY );
  INSERT INTO myitems(item_id) VALUES (1), (2); -- and on and on
  -- No separate CREATE INDEX is needed: PRIMARY KEY already creates a unique index on item_id
COMMIT;

ANALYZE myitems;

SELECT item_id, other_stuff, ...
FROM (
  SELECT
      -- Partitioned row number as we only want N rows per id
      ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY start_date) AS r,
      item_id, other_stuff, ...
  FROM mytable
  INNER JOIN myitems USING (item_id)
  WHERE end_date > $2
  ORDER BY item_id ASC, start_date ASC, allowed ASC
) x
WHERE x.r <= 12;
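
Spelling out thousands of IDs in a VALUES list is impractical. One way to fill myitems, sketched here under the assumption that $1 is the same integer-array parameter the question's query takes, is to unnest the array inside the INSERT. The DISTINCT is there only so that a duplicate ID in the array cannot violate the primary key.

```sql
-- Assumption: $1 is bound to the integer[] of item IDs, as in the question.
INSERT INTO myitems (item_id)
SELECT DISTINCT u.item_id
FROM unnest($1::int[]) AS u(item_id);
```

This would stand in for the INSERT ... VALUES statement above; the rest of the transaction stays the same.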

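But even better than that...

"500k different item_id" ... "int array can contain up to 15,000 elements"

You're selecting 3% of your database, ID by ID. I have to wonder whether you would be better off creating groups/tags etc. in the schema itself. I have never personally had to send 15,000 different IDs into a query.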
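
As a rough illustration of what groups in the schema could look like, here is a minimal sketch using a junction table. The table name item_groups and its columns are hypothetical, not something from the question, and the selected column list is cut down to the columns the question actually shows.

```sql
-- Hypothetical junction table: one row per (group, item) membership.
CREATE TABLE item_groups (
    group_id int NOT NULL,
    item_id  int NOT NULL,
    PRIMARY KEY (group_id, item_id)
);

-- $1 is now a single group ID instead of an array of item IDs.
SELECT m.item_id, m.start_date, m.end_date, m.allowed
FROM item_groups g
JOIN mytable m ON m.item_id = g.item_id
WHERE g.group_id = $1
  AND m.end_date > $2;
```

The primary key leads with group_id so that looking up all members of one group is an index range scan. Whether this beats passing the array depends on the data and would need to be measured.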

— Evan Carroll

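I just tried using the temporary table and it's slower, at least in the 15,000 IDs case. As for creating groups in the schema itself, do you mean a table with the IDs I pass as an argument? I tried something like that, but the performance was similar to or worse than my current approach. I'll update the question with more details.

— Jukurrpa, 2018

No, I mean this: if you have 15,000 IDs, normally you're storing something in the ID, like whether or not the item is a kitchen product, and rather than storing the group_id that corresponds to "kitchen product", you're trying to find all kitchen products by their IDs. (That's bad for all kinds of reasons.) What do those 15,000 IDs represent? Why isn't it stored on the row itself?

— Evan Carroll

Each item belongs to multiple groups (usually 15-20 of them), so I tried storing them as an int array in mytable but couldn't figure out how to index this properly. I updated the question with all the details.

— Jukurrpa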
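
A note on the indexing question raised in the last comment. PostgreSQL's built-in GIN operator class for arrays supports the array operators @>, <@, && and =, but not the scalar form $1 = ANY (group_ids). That is consistent with the Seq Scan in the second plan. Below is a sketch of the same filter written with the containment operator, so that idx_dtr_group_ids becomes usable. The column list is limited to the columns shown in the question. The planner may still prefer a sequential scan here, because the group in that plan matches roughly 390k of the 1.87 million rows scanned, so this is something to test rather than a guaranteed improvement.

```sql
-- Same filter as "$1 = ANY (group_ids)", written so a GIN index can serve it.
SELECT item_id, start_date, end_date, allowed
FROM mytable
WHERE group_ids @> ARRAY[$1::int]
  AND end_date > $2;

-- The second plan also shows "Sort Method: external sort  Disk: 29176kB",
-- meaning the sort spilled to disk. Raising work_mem for the session may
-- keep it in memory; 64MB is an arbitrary value chosen for illustration.
SET work_mem = '64MB';
```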