Best index for similarity function


8

So I have this table with 6.2 million records, and I have to run similarity search queries against a column. A query can look like this:

 SELECT  "lca_test".* FROM "lca_test"
 WHERE (similarity(job_title, 'sales executive') > 0.6)
 AND worksite_city = 'los angeles' 
 ORDER BY salary ASC LIMIT 50 OFFSET 0

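More conditions can be added to the WHERE clause (year = X, worksite_state = N, status = 'certified', visa_class = Z).

Running some of these queries can take a really long time, over 30 seconds. Sometimes more than a minute.

EXPLAIN ANALYZE of the query above gives me this: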

Limit  (cost=0.43..42523.04 rows=50 width=254) (actual time=9070.268..33487.734 rows=2 loops=1)
  ->  Index Scan using index_lca_test_on_salary on lca_test  (cost=0.43..23922368.16 rows=28129 width=254) (actual time=9070.265..33487.727 rows=2 loops=1)
        Filter: (((worksite_city)::text = 'los angeles'::text) AND (similarity((job_title)::text, 'sales executive'::text) > 0.6::double precision))
        Rows Removed by Filter: 6330130
Total runtime: 33487.802 ms

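I don't know how to index my columns to make this blazing fast.

EDIT: Here is the Postgres version:

PostgreSQL 9.3.5 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit

Here is the table definition: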

                                                         Table "public.lca_test"
         Column         |       Type        |                       Modifiers                       | Storage  | Stats target | Description
------------------------+-------------------+-------------------------------------------------------+----------+--------------+-------------
 id                     | integer           | not null default nextval('lca_test_id_seq'::regclass) | plain    |              |
 raw_id                 | integer           |                                                       | plain    |              |
 year                   | integer           |                                                       | plain    |              |
 company_id             | integer           |                                                       | plain    |              |
 visa_class             | character varying |                                                       | extended |              |
 employement_start_date | character varying |                                                       | extended |              |
 employement_end_date   | character varying |                                                       | extended |              |
 employer_name          | character varying |                                                       | extended |              |
 employer_address1      | character varying |                                                       | extended |              |
 employer_address2      | character varying |                                                       | extended |              |
 employer_city          | character varying |                                                       | extended |              |
 employer_state         | character varying |                                                       | extended |              |
 employer_postal_code   | character varying |                                                       | extended |              |
 employer_phone         | character varying |                                                       | extended |              |
 employer_phone_ext     | character varying |                                                       | extended |              |
 job_title              | character varying |                                                       | extended |              |
 soc_code               | character varying |                                                       | extended |              |
 naic_code              | character varying |                                                       | extended |              |
 prevailing_wage        | character varying |                                                       | extended |              |
 pw_unit_of_pay         | character varying |                                                       | extended |              |
 wage_unit_of_pay       | character varying |                                                       | extended |              |
 worksite_city          | character varying |                                                       | extended |              |
 worksite_state         | character varying |                                                       | extended |              |
 worksite_postal_code   | character varying |                                                       | extended |              |
 total_workers          | integer           |                                                       | plain    |              |
 case_status            | character varying |                                                       | extended |              |
 case_no                | character varying |                                                       | extended |              |
 salary                 | real              |                                                       | plain    |              |
 salary_max             | real              |                                                       | plain    |              |
 prevailing_wage_second | real              |                                                       | plain    |              |
 lawyer_id              | integer           |                                                       | plain    |              |
 citizenship            | character varying |                                                       | extended |              |
 class_of_admission     | character varying |                                                       | extended |              |
Indexes:
    "lca_test_pkey" PRIMARY KEY, btree (id)
    "index_lca_test_on_id_and_salary" btree (id, salary)
    "index_lca_test_on_id_and_salary_and_year" btree (id, salary, year)
    "index_lca_test_on_id_and_salary_and_year_and_wage_unit_of_pay" btree (id, salary, year, wage_unit_of_pay)
    "index_lca_test_on_id_and_visa_class" btree (id, visa_class)
    "index_lca_test_on_id_and_worksite_state" btree (id, worksite_state)
    "index_lca_test_on_lawyer_id" btree (lawyer_id)
    "index_lca_test_on_lawyer_id_and_company_id" btree (lawyer_id, company_id)
    "index_lca_test_on_raw_id_and_visa_and_pw_second" btree (raw_id, visa_class, prevailing_wage_second)
    "index_lca_test_on_raw_id_and_visa_class" btree (raw_id, visa_class)
    "index_lca_test_on_salary" btree (salary)
    "index_lca_test_on_visa_class" btree (visa_class)
    "index_lca_test_on_wage_unit_of_pay" btree (wage_unit_of_pay)
    "index_lca_test_on_worksite_state" btree (worksite_state)
    "index_lca_test_on_year_and_company_id" btree (year, company_id)
    "index_lca_test_on_year_and_company_id_and_case_status" btree (year, company_id, case_status)
    "index_lcas_job_title_trigram" gin (job_title gin_trgm_ops)
    "lca_test_company_id" btree (company_id)
    "lca_test_employer_name" btree (employer_name)
    "lca_test_id" btree (id)
    "lca_test_on_year_and_companyid_and_wage_unit_and_salary" btree (year, company_id, wage_unit_of_pay, salary)
Foreign-key constraints:
    "fk_rails_8a90090fe0" FOREIGN KEY (lawyer_id) REFERENCES lawyers(id)
Has OIDs: no

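It should be obvious to include at least the table definition (with exact data types and constraints) and your Postgres version. Consider the instructions in the tag info for postgresql-performance. Also clarify whether there is always an equality condition on worksite_city.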
Erwin Brandstetter

Thanks, I edited my post to include that information. Yes, there is always an equality condition on worksite_city, worksite_state, year and/or status.
BL0B

Answers:


14

You forgot to mention that you installed the additional module pg_trgm, which provides the similarity() function.
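For readers following along, a minimal sketch of installing the module and confirming it is present. This is not from the original post; it assumes you have the privileges to create extensions in the current database.

```sql
-- Install pg_trgm in the current database (no-op if already installed)
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Confirm the extension and its version
SELECT extname, extversion
FROM   pg_extension
WHERE  extname = 'pg_trgm';
```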

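Similarity operator %

First of all, whatever else you do, use the similarity operator % instead of the expression (similarity(job_title, 'sales executive') > 0.6). It is much cheaper. Index support in Postgres is bound to operators, not to functions.

To get the desired minimum similarity of 0.6, run: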

SELECT set_limit(0.6);

The setting stays in effect for the rest of the session unless it is reset to something else. Check with:

SELECT show_limit();

This is a bit clumsy, but great for performance.
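As a quick sanity check, the operator and the function can be compared side by side. This is only a sketch: the second string is an illustrative value, not data from the question. On PostgreSQL 9.6 or later (newer than the 9.3.5 in the question), the same threshold can also be set through the configuration parameter pg_trgm.similarity_threshold.

```sql
SELECT set_limit(0.6);  -- once per session

-- Compare the raw similarity score with the result of the % operator
SELECT similarity('sales executive', 'senior sales executive') AS sim,
       'sales executive' % 'senior sales executive'            AS above_limit;
```

above_limit should be true exactly when sim clears the limit reported by show_limit().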

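Simple case

If you only wanted the best matches in column job_title for the string 'sales executive', this would be a simple case of "nearest neighbor" search, and it could be solved with a GiST index using the trigram operator class gist_trgm_ops (but not with a GIN index):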

CREATE INDEX trgm_idx ON lca_test USING gist (job_title gist_trgm_ops);

To also include the equality condition on worksite_city, you need the additional module btree_gist. Run (once per database):

CREATE EXTENSION btree_gist;

Then:

CREATE INDEX lcas_trgm_gist_idx ON lca_test USING gist (worksite_city, job_title gist_trgm_ops);

Query:

SELECT set_limit(0.6);  -- once per session

SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles' 
ORDER  BY (job_title <-> 'sales executive')
LIMIT  50;

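<-> is the "distance" operator: one minus the similarity() value.

Postgres can also combine two separate indexes, a plain btree index on worksite_city and a separate GiST index on job_title, but the multicolumn index should be fastest if you regularly combine these two columns in queries.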
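A small sketch of both points, under stated assumptions: the compared string is an illustrative value, and the two index names are hypothetical, not from the original post.

```sql
-- The distance operator should equal one minus the similarity
SELECT 'sales executive' <-> 'sales exec'               AS distance,
       1 - similarity('sales executive', 'sales exec')  AS one_minus_similarity;

-- Alternative to the multicolumn index: two single-column indexes
-- that the planner may combine (index names are hypothetical)
CREATE INDEX lca_test_worksite_city_idx  ON lca_test (worksite_city);
CREATE INDEX lca_test_job_title_gist_idx ON lca_test USING gist (job_title gist_trgm_ops);
```

The two-index variant needs no btree_gist, at the cost of the planner having to combine two index scans.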

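Your case

However, your query sorts by salary, not by distance / similarity, which changes the nature of the game completely. Now we can use either a GIN or a GiST index, and GIN will be faster (GIN indexes were improved considerably in Postgres 9.4, hint!)

It is a similar story for the additional equality check on worksite_city: install the additional module btree_gin. Run (once per database):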

CREATE EXTENSION btree_gin;

Then:

CREATE INDEX lcas_trgm_gin_idx ON lca_test USING gin (worksite_city, job_title gin_trgm_ops);

Query:

SELECT set_limit(0.6);  -- once per session

SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles' 
ORDER  BY salary 
LIMIT  50;  -- OFFSET 0

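Again, this should also work (less efficiently) with the simple index you already have ("index_lcas_job_title_trigram"), possibly in combination with other indexes. The best solution depends on the complete picture.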
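To see which index the planner actually picks, you can run the query under EXPLAIN. This is a sketch, not output from the original post. Since GIN indexes only support bitmap scans, a plan that uses the trigram index should show a Bitmap Index Scan feeding a Bitmap Heap Scan, followed by a sort on salary.

```sql
SELECT set_limit(0.6);  -- once per session

-- ANALYZE executes the query; BUFFERS adds shared-buffer hit/read counts
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles'
ORDER  BY salary
LIMIT  50;
```

If the plan still walks index_lca_test_on_salary and filters millions of rows, the % condition is not being served by a trigram index.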

Asides

  • You have a lot of indexes. Are you sure they are all in use and worth their maintenance cost?

  • You have some dubious data types:

    employement_start_date | character varying
    employement_end_date   | character varying

    These look like they should be date. Etc.
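Both asides can be followed up with a short sketch. The first query reads the standard statistics view pg_stat_user_indexes; indexes with idx_scan at or near zero since the last statistics reset are candidates for removal. Note that, per the table definition in the question, "lca_test_id" btree (id) duplicates the primary key index. The ALTER TABLE assumes every non-empty value in the column casts cleanly to date under the session's DateStyle, which is not confirmed by the post. It also rewrites the table under an exclusive lock, so try it on a copy first.

```sql
-- Indexes on lca_test, least-used first, with their on-disk size
SELECT indexrelname,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_stat_user_indexes
WHERE  relname = 'lca_test'
ORDER  BY idx_scan, indexrelname;

-- Convert a text column to date (assumption: values are valid dates or empty)
ALTER TABLE lca_test
  ALTER COLUMN employement_start_date TYPE date
  USING NULLIF(employement_start_date, '')::date;
```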



I do have "index_lcas_job_title_trigram" gin (job_title gin_trgm_ops). I read somewhere that GIN is faster than GiST. Is that true?
bl0b, 2015

1
@bl0b, GIN doesn't support similarity at all, so it isn't faster for that purpose.
jjanes

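@bl0b: While jjanes is right (and that was my first thought, too), your case is different and you can use a GIN index after all. I added a lot more to the answer.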
Erwin Brandstetter

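@ErwinBrandstetter thank you so much for your answer! Quick question: you say GIN is faster and that I should install btree_gin. But then, to create the index, you say to run: CREATE INDEX lcas_trgm_gin_idx ON lcas USING gist (worksite_city, job_title gist_trgm_ops); Is that just a typo?
bl0b, 2015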

1
@ErwinBrandstetter it went from 30 seconds down to 6 seconds. Huge improvement! Thank you so much!
bl0b