我有一张表,其中包含两列整数数组的排列/组合,第三列包含一个值,如下所示:
CREATE TABLE foo
(
perm integer[] NOT NULL,
combo integer[] NOT NULL,
value numeric NOT NULL DEFAULT 0
);
INSERT INTO foo
VALUES
( '{3,1,2}', '{1,2,3}', '1.1400' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0.9280' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,2,1}', '{1,2,3}', '0' ),
( '{3,2,1}', '{1,2,3}', '0.8000' )
我想找出每个排列以及每个组合的平均偏差和标准偏差。我可以使用以下查询来做到这一点:
SELECT
f1.perm,
f2.combo,
f1.perm_average_value,
f2.combo_average_value,
f1.perm_stddev,
f2.combo_stddev,
f1.perm_count,
f2.combo_count
FROM
(
SELECT
perm,
combo,
avg( value ) AS perm_average_value,
stddev_pop( value ) AS perm_stddev,
count( * ) AS perm_count
FROM foo
GROUP BY perm, combo
) AS f1
JOIN
(
SELECT
combo,
avg( value ) AS combo_average_value,
stddev_pop( value ) AS combo_stddev,
count( * ) AS combo_count
FROM foo
GROUP BY combo
) AS f2 ON ( f1.combo = f2.combo );
但是,当我有大量数据时,该查询会变得非常慢,因为“ foo”表(实际上由14个分区组成,每个分区大约有400万行)需要扫描两次。
最近,我了解到Postgres支持“窗口函数”,这对于特定列基本上像GROUP BY。我修改了查询以如下方式使用它们:
SELECT
perm,
combo,
avg( value ) as perm_average_value,
avg( avg( value ) ) over w_combo AS combo_average_value,
stddev_pop( value ) as perm_stddev,
stddev_pop( avg( value ) ) over w_combo as combo_stddev,
count( * ) as perm_count,
sum( count( * ) ) over w_combo AS combo_count
FROM foo
GROUP BY perm, combo
WINDOW w_combo AS ( PARTITION BY combo );
尽管这对于“ combo_count”列有效,但“ combo_average_value”和“ combo_stddev”列不再准确。似乎是对每个排列取平均值,然后对每个组合第二次取平均值,这是不正确的。
我怎样才能解决这个问题?窗口功能甚至可以用作优化吗?