我正在通过Heroku使用Postgres 9.3。
我有一个表“ traffic”,其中有1M +条记录,每天都有许多插入和更新。我需要在此表上的不同时间范围内执行SUM操作,这些调用最多可能需要40秒钟,并且希望听到有关如何改进该建议的建议。
我在此表上有以下索引:
CREATE INDEX idx_traffic_partner_only ON traffic (dt_created) WHERE campaign_id IS NULL AND uuid_self <> uuid_partner;
这是一个示例SELECT语句:
SELECT SUM("clicks") AS clicks, SUM("impressions") AS impressions
FROM "traffic"
WHERE "uuid_self" != "uuid_partner"
AND "campaign_id" is NULL
AND "dt_created" >= 'Sun, 29 Mar 2015 00:00:00 +0000'
AND "dt_created" <= 'Mon, 27 Apr 2015 23:59:59 +0000'
这是EXPLAIN ANALYZE:
Aggregate (cost=21625.91..21625.92 rows=1 width=16) (actual time=41804.754..41804.754 rows=1 loops=1)
-> Index Scan using idx_traffic_partner_only on traffic (cost=0.09..20085.11 rows=308159 width=16) (actual time=1.409..41617.976 rows=302392 loops=1)
Index Cond: ((dt_created >= '2015-03-29'::date) AND (dt_created <= '2015-04-27'::date))
Total runtime: 41804.893 ms
http://explain.depesz.com/s/gGA
这个问题与SE上的另一个问题非常相似,但是这个问题使用了跨越两个列时间戳范围的索引,并且该查询的索引计划器的估算值相差甚远。主要建议是创建一个排序的多列索引,但是对于单列索引没有太大影响。其他建议是使用CLUSTER / pg_repack和GIST索引,但是我还没有尝试过,因为我想看看使用常规索引是否有更好的解决方案。
作为参考,我尝试了以下数据库未使用的索引:
INDEX idx_traffic_2 ON traffic (campaign_id, uuid_self, uuid_partner, dt_created);
INDEX idx_traffic_3 ON traffic (dt_created);
INDEX idx_traffic_4 ON traffic (uuid_self);
INDEX idx_traffic_5 ON traffic (uuid_partner);
编辑:Ran解释(分析,详细,成本,缓冲),这些是结果:
Aggregate (cost=20538.62..20538.62 rows=1 width=8) (actual time=526.778..526.778 rows=1 loops=1)
Output: sum(clicks), sum(impressions)
Buffers: shared hit=47783 read=29803 dirtied=4
I/O Timings: read=184.936
-> Index Scan using idx_traffic_partner_only on public.traffic (cost=0.09..20224.74 rows=313881 width=8) (actual time=0.049..431.501 rows=302405 loops=1)
Output: id, uuid_self, uuid_partner, impressions, clicks, dt_created... (other fields redacted)
Index Cond: ((traffic.dt_created >= '2015-03-29'::date) AND (traffic.dt_created <= '2015-04-27'::date))
Buffers: shared hit=47783 read=29803 dirtied=4
I/O Timings: read=184.936
Total runtime: 526.881 ms
http://explain.depesz.com/s/7Gu6
表定义:
CREATE TABLE traffic (
id serial,
uuid_self uuid not null,
uuid_partner uuid not null,
impressions integer NOT NULL DEFAULT 1,
clicks integer NOT NULL DEFAULT 0,
campaign_id integer,
dt_created DATE DEFAULT CURRENT_DATE NOT NULL,
dt_updated TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
)
id是主键,而uuid_self,uuid_partner和campaign_id都是外键。dt_updated字段使用postgres函数进行更新。
traffic
。另外:为什么第二秒EXPLAIN
显示从42秒下降到0.5秒?第一次运行的是冷缓存吗?
id
。还有其他限制吗?我看到两列可以为NULL。每个中NULL值的百分比是多少?你得到什么呢?SELECT count(*) AS ct, count(campaign_id)/ count(*) AS camp_pct, count(dt_updated)/count(*) AS upd_pct FROM traffic;
explain (buffers, analyze, verbose) ...
可能会揭示更多。