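We added two pg_trgm indexes to a table to enable fuzzy search by email address or by name, since we need to find users by name, or by email addresses that were misspelled during sign-up (e.g. "@gmail.con"). ANALYZE was run after the indexes were created.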
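However, a ranked search on either of these indexes is very slow in the vast majority of cases. With an increased timeout, a query may return in about 60 seconds, and on very rare occasions in as little as 15 seconds, but usually the queries time out.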
pg_trgm.similarity_threshold is at its default value of 0.3, but raising it to 0.8 did not seem to make a difference.
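For reference, a minimal sketch of how the threshold can be inspected and raised for a single session, assuming pg_trgm 1.3 or later (where the pg_trgm.similarity_threshold setting exists). These are not necessarily the exact commands we ran:

```sql
-- Show the threshold currently used by the % operator (default 0.3).
SHOW pg_trgm.similarity_threshold;

-- Raise it for the current session only; later queries using % in this
-- session will require a similarity of at least 0.8 to match.
SET pg_trgm.similarity_threshold = 0.8;

-- Revert to the configured default when done.
RESET pg_trgm.similarity_threshold;
```

The setting only affects which rows the % operator matches; the ORDER BY ... <-> part of the query is unchanged by it.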
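This particular table has over 25 million rows and is constantly being queried, updated and inserted into (the mean time for each of these is under 2 ms). The setup is PostgreSQL 9.6.6, running on an RDS db.m4.large instance with general-purpose SSD storage and more or less default parameters. The pg_trgm extension is version 1.3.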
Queries:
SELECT * FROM users WHERE email % 'chris@example.com' ORDER BY email <-> 'chris@example.com' LIMIT 10;
SELECT * FROM users WHERE (first_name || ' ' || last_name) % 'chris orr' ORDER BY (first_name || ' ' || last_name) <-> 'chris orr' LIMIT 10;
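These queries do not need to run very often (dozens of times a day), but they should be based on the current table state, and ideally return within about 10 seconds.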
Schema:
=> \d+ users
Table "public.users"
Column | Type | Collation | Nullable | Default | Storage
-------------------+-----------------------------+-----------+----------+---------+----------
id | uuid | | not null | | plain
email | citext | | not null | | extended
email_is_verified | boolean | | not null | | plain
first_name | text | | not null | | extended
last_name | text | | not null | | extended
created_at | timestamp without time zone | | | now() | plain
updated_at | timestamp without time zone | | | now() | plain
… | boolean | | not null | false | plain
… | character varying(60) | | | | extended
… | character varying(6) | | | | extended
… | character varying(6) | | | | extended
… | boolean | | | | plain
Indexes:
"users_pkey" PRIMARY KEY, btree (id)
"users_email_key" UNIQUE, btree (email)
"users_search_email_idx" gist (email gist_trgm_ops)
"users_search_name_idx" gist (((first_name || ' '::text) || last_name) gist_trgm_ops)
"users_updated_at_idx" btree (updated_at)
Triggers:
update_users BEFORE UPDATE ON users FOR EACH ROW EXECUTE PROCEDURE update_modified_column()
Options: autovacuum_analyze_scale_factor=0.01, autovacuum_vacuum_scale_factor=0.05
(I know that we should probably also add unaccent() to users_search_name_idx and to the name query...)
Explain:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE (first_name || ' ' || last_name) % 'chris orr' ORDER BY (first_name || ' ' || last_name) <-> 'chris orr' LIMIT 10;
Limit (cost=0.42..40.28 rows=10 width=152) (actual time=58671.973..58676.193 rows=10 loops=1)
Buffers: shared hit=66227 read=231821
-> Index Scan using users_search_name_idx on users (cost=0.42..100264.13 rows=25153 width=152) (actual time=58671.970..58676.180 rows=10 loops=1)
Index Cond: (((first_name || ' '::text) || last_name) % 'chris orr'::text)
Order By: (((first_name || ' '::text) || last_name) <-> 'chris orr'::text)
Buffers: shared hit=66227 read=231821
Planning time: 0.125 ms
Execution time: 58676.265 ms
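The email search is more likely to time out than the name search, but that is presumably because email addresses are so similar to one another (e.g. lots of @gmail.com addresses).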
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE email % 'chris@example.com' ORDER BY email <-> 'chris@example.com' LIMIT 10;
Limit (cost=0.42..40.43 rows=10 width=152) (actual time=58851.719..62181.128 rows=10 loops=1)
Buffers: shared hit=83 read=428918
-> Index Scan using users_search_email_idx on users (cost=0.42..100646.36 rows=25153 width=152) (actual time=58851.716..62181.113 rows=10 loops=1)
Index Cond: ((email)::text % 'chris@example.com'::text)
Order By: ((email)::text <-> 'chris@example.com'::text)
Buffers: shared hit=83 read=428918
Planning time: 0.100 ms
Execution time: 62181.186 ms
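What could be the reason for the slow query times? Is it something to do with the number of buffers being read? I could not find much information about optimising this particular kind of query, and in any case the queries are very similar to the ones in the pg_trgm documentation.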
Is this something we can optimise or implement better in Postgres, or would something like Elasticsearch be a better fit for this particular use case?
Can you reproduce any of the ranked queries without the <-> operator using the index?
Is pg_trgm at least version 1.3? You can check with "\dx" in psql.
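If psql meta-commands are not convenient, a sketch of an equivalent check in plain SQL, reading the standard pg_extension system catalog:

```sql
-- List the installed version of the pg_trgm extension in the current database.
-- Returns no rows if the extension is not installed here.
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pg_trgm';
```

This reports the version installed in the database you are connected to, which can lag behind the version shipped with the server until ALTER EXTENSION ... UPDATE is run.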