Get incremental counts of an aggregated value from a joined table

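I have two tables in a MySQL 5.7.22 database: posts and reasons. Each post row has and belongs to many reason rows. Each reason has a weight associated with it, so each post has a total weight associated with it.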

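For each increment of 10 points of weight (i.e. 0, 10, 20, 30, etc.), I want to get a count of posts whose total weight is less than or equal to that increment. I'd expect the results to look something like this: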

 weight | post_count
--------+------------
      0 | 0
     10 | 5
     20 | 12
     30 | 18
    ... | ...
    280 | 20918
    290 | 21102
    ... | ...
   1250 | 118005
   1260 | 118039
   1270 | 118040
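To pin down the intended semantics, here is a minimal Python sketch of the count being asked for, run on invented totals. The function name `cumulative_counts` and the sample values are hypothetical and are not taken from the dataset.

```python
from bisect import bisect_right

def cumulative_counts(total_weights, step=10):
    """Return (threshold, count) pairs: posts with total weight <= threshold."""
    ordered = sorted(total_weights)
    if not ordered:
        return []
    # Last threshold: the first multiple of `step` at or above the maximum.
    top = -(-ordered[-1] // step) * step
    # bisect_right gives the number of values <= t in a sorted list.
    return [(t, bisect_right(ordered, t)) for t in range(0, top + step, step)]

sample = [3, 7, 10, 12, 25, 30, 31]  # invented per-post total weights
print(cumulative_counts(sample))
```

Each threshold includes posts whose total equals it, which is the "less than or equal" reading used above.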

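The total weights are roughly normally distributed, with a few very low values and a few very high values (the current maximum is 1277), but most lie in the middle. There are about 120,000 rows in posts and about 120 rows in reasons. Each post has 5 or 6 reasons on average.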

The relevant parts of the tables look like this:

CREATE TABLE `posts` (
  id BIGINT PRIMARY KEY
);

CREATE TABLE `reasons` (
  id BIGINT PRIMARY KEY,
  weight INT(11) NOT NULL
);

CREATE TABLE `posts_reasons` (
  post_id BIGINT NOT NULL,
  reason_id BIGINT NOT NULL,
  CONSTRAINT fk_posts_reasons_posts FOREIGN KEY (post_id) REFERENCES posts(id),
  CONSTRAINT fk_posts_reasons_reasons FOREIGN KEY (reason_id) REFERENCES reasons(id)
);

So far, I've tried putting the post ID and total weight into a view, then joining that view to itself to get the cumulative count:

CREATE VIEW `post_weights` AS (
    SELECT 
        posts.id,
        SUM(reasons.weight) AS reason_weight
    FROM posts
    INNER JOIN posts_reasons ON posts.id = posts_reasons.post_id
    INNER JOIN reasons ON posts_reasons.reason_id = reasons.id
    GROUP BY posts.id
);

SELECT
    FLOOR(p1.reason_weight / 10) AS weight,
    COUNT(DISTINCT p2.id) AS cumulative
FROM post_weights AS p1
INNER JOIN post_weights AS p2 ON FLOOR(p2.reason_weight / 10) <= FLOOR(p1.reason_weight / 10)
GROUP BY FLOOR(p1.reason_weight / 10)
ORDER BY FLOOR(p1.reason_weight / 10) ASC;

However, this is unusably slow: I let it run for 15 minutes without it finishing, which I can't do in production.

Is there a more efficient way to do this?

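If you're interested in testing against the entire dataset, it can be downloaded here. The file is about 60MB and expands to about 250MB. Alternatively, there is a GitHub gist with 12,000 rows here.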

mysql aggregate mysql-5.7

— ArtOfCode

Answers:

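Using functions or expressions in JOIN conditions is usually a bad idea. I say "usually" because some optimizers can handle it fairly well and use indexes anyway. I would suggest creating a table for the weights. Something like: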

CREATE TABLE weights
( weight int not null primary key 
);

INSERT INTO weights (weight) VALUES (0),(10),(20),...(1270);
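The elided VALUES list does not have to be typed by hand. A small sketch that builds the statement, assuming thresholds from 0 to 1270 in steps of 10 as in the question; the helper name `weights_insert` is made up for illustration.

```python
def weights_insert(max_weight=1270, step=10):
    """Build one multi-row INSERT covering 0, step, 2*step, ... up to max_weight."""
    values = ",".join(f"({w})" for w in range(0, max_weight + step, step))
    return f"INSERT INTO weights (weight) VALUES {values};"

stmt = weights_insert()
print(stmt[:60], "...")
```

Passing a larger `max_weight` extends the list if the maximum total weight grows.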

Make sure you have indexes on posts_reasons:

CREATE UNIQUE INDEX ... ON posts_reasons (reason_id, post_id);

The query looks like this:

SELECT w.weight
     , COUNT(1) as post_count
FROM weights w
JOIN ( SELECT pr.post_id, SUM(r.weight) as sum_weight     
       FROM reasons r
       JOIN posts_reasons pr
             ON r.id = pr.reason_id
       GROUP BY pr.post_id
     ) as x
    ON w.weight > x.sum_weight
GROUP BY w.weight;
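As a sanity check of the join-against-a-weights-table idea, the sketch below runs the same shape of query on an in-memory SQLite database, used here only as a stand-in for MySQL, with invented rows. Two things are my own assumptions rather than part of the query above: the comparison is `w.weight >= x.sum_weight`, so that each threshold counts posts at or below it as the question describes, and a LEFT JOIN is used so that thresholds with no matching posts still appear with a count of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reasons (id INTEGER PRIMARY KEY, weight INTEGER NOT NULL);
CREATE TABLE posts_reasons (post_id INTEGER NOT NULL, reason_id INTEGER NOT NULL);
CREATE TABLE weights (weight INTEGER NOT NULL PRIMARY KEY);
INSERT INTO reasons VALUES (1, 5), (2, 10), (3, 20);
-- post 1 totals 5, post 2 totals 15, post 3 totals 30
INSERT INTO posts_reasons VALUES (1, 1), (2, 1), (2, 2), (3, 2), (3, 3);
INSERT INTO weights VALUES (0), (10), (20), (30);
""")

rows = conn.execute("""
SELECT w.weight, COUNT(x.post_id) AS post_count
FROM weights w
LEFT JOIN (SELECT pr.post_id, SUM(r.weight) AS sum_weight
           FROM reasons r
           JOIN posts_reasons pr ON r.id = pr.reason_id
           GROUP BY pr.post_id) AS x
       ON w.weight >= x.sum_weight   -- inclusive threshold (assumption)
GROUP BY w.weight
ORDER BY w.weight
""").fetchall()
print(rows)
```

With a strict `>` instead, a post whose total is exactly 30 would not be counted at threshold 30.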

My home machine is probably 5-6 years old; it has an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz and 8GB of RAM.

uname -a: Linux 咬伤 4.16.6-302.fc28.x86_64 #1 SMP Wed May 2 00:07:06 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I tested against:

https://drive.google.com/open?id=1q3HZXW_qIZ01gU-Krms7qMJW3GCsOUP5

MariaDB [test3]> select @@version;
+-----------------+
| @@version       |
+-----------------+
| 10.2.14-MariaDB |
+-----------------+
1 row in set (0.00 sec)


SELECT w.weight
     , COUNT(1) as post_count
FROM weights w
JOIN ( SELECT pr.post_id, SUM(r.weight) as sum_weight     
       FROM reasons r
       JOIN posts_reasons pr
             ON r.id = pr.reason_id
       GROUP BY pr.post_id
     ) as x
    ON w.weight > x.sum_weight
GROUP BY w.weight;

+--------+------------+
| weight | post_count |
+--------+------------+
|      0 |          1 |
|     10 |       2591 |
|     20 |       4264 |
|     30 |       4386 |
|     40 |       5415 |
|     50 |       7499 |
[...]   
|   1270 |     119283 |
|   1320 |     119286 |
|   1330 |     119286 |
[...]
|   2590 |     119286 |
+--------+------------+
256 rows in set (9.89 sec)

If performance is critical and nothing else helps, you could create a summary table for:

SELECT pr.post_id, SUM(r.weight) as sum_weight     
FROM reasons r
JOIN posts_reasons pr
    ON r.id = pr.reason_id
GROUP BY pr.post_id

You can maintain this table with triggers.
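A rough sketch of what trigger maintenance could look like, again using in-memory SQLite from Python as a stand-in. MySQL's trigger syntax differs (for example, it needs `FOR EACH ROW` and has no `INSERT OR IGNORE`), and the table and trigger names here are hypothetical. Changes to `reasons.weight` itself are not handled in this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reasons (id INTEGER PRIMARY KEY, weight INTEGER NOT NULL);
CREATE TABLE posts_reasons (post_id INTEGER NOT NULL, reason_id INTEGER NOT NULL);
-- Hypothetical summary table: one row per post with its current total.
CREATE TABLE post_weight_summary (post_id INTEGER PRIMARY KEY,
                                  sum_weight INTEGER NOT NULL);

-- Linking a reason to a post adds that reason's weight to the post's total.
CREATE TRIGGER posts_reasons_ai AFTER INSERT ON posts_reasons
BEGIN
    INSERT OR IGNORE INTO post_weight_summary VALUES (NEW.post_id, 0);
    UPDATE post_weight_summary
       SET sum_weight = sum_weight
                      + (SELECT weight FROM reasons WHERE id = NEW.reason_id)
     WHERE post_id = NEW.post_id;
END;

-- Unlinking a reason subtracts its weight again.
CREATE TRIGGER posts_reasons_ad AFTER DELETE ON posts_reasons
BEGIN
    UPDATE post_weight_summary
       SET sum_weight = sum_weight
                      - (SELECT weight FROM reasons WHERE id = OLD.reason_id)
     WHERE post_id = OLD.post_id;
END;

INSERT INTO reasons VALUES (1, 5), (2, 10);
INSERT INTO posts_reasons VALUES (1, 1), (1, 2), (2, 2);
DELETE FROM posts_reasons WHERE post_id = 1 AND reason_id = 1;
""")

summary = conn.execute(
    "SELECT post_id, sum_weight FROM post_weight_summary ORDER BY post_id"
).fetchall()
print(summary)
```

The cumulative query can then read from the summary table instead of re-aggregating posts_reasons on every run.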

Since a certain amount of work has to be done for each row in weights, it may help to limit this table.

    ON w.weight > x.sum_weight 
WHERE w.weight <= (select MAX(sum_weights) 
                   from (SELECT SUM(weight) as sum_weights 
                   FROM reasons r        
                   JOIN posts_reasons pr
                       ON r.id = pr.reason_id 
                   GROUP BY pr.post_id) a
                  ) 
GROUP BY w.weight

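Since I had a lot of unnecessary rows in my weights table (up to 2590), the restriction above cut the execution time from 9 seconds to 4 seconds.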

— Lennart

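Clarification: this looks like it's counting reasons with a weight lower than w.weight. Is that right? I'm looking to count posts whose total weight (the sum of the weights of their associated reason rows) is less than or equal to w.weight.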

— ArtOfCode

Ah, sorry. I'll rewrite the query.

— Lennart

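That got me most of the way there, though, thanks! I just needed to select from the existing post_weights view that I'd already created instead of from reasons.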

— ArtOfCode

@ArtOfCode, did I get it right with the revised query? By the way, thanks for an excellent question: clear, concise, and with plenty of sample data. Bravo.

— Lennart '18

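In MySQL, variables can be used in queries both to be calculated from column values and to be used in the expressions for new, calculated columns. In this case, using a variable results in an efficient query: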

SELECT
  weight,
  @cumulative := @cumulative + post_count AS post_count
FROM
  (SELECT @cumulative := 0) AS x,
  (
    SELECT
      FLOOR(reason_weight / 10) * 10 AS weight,
      COUNT(*)                       AS post_count
    FROM
      (
        SELECT 
          p.id,
          SUM(r.weight) AS reason_weight
        FROM
          posts AS p
          INNER JOIN posts_reasons AS pr ON p.id = pr.post_id
          INNER JOIN reasons AS r ON pr.reason_id = r.id
        GROUP BY
          p.id
      ) AS d
    GROUP BY
      FLOOR(reason_weight / 10)
    ORDER BY
      FLOOR(reason_weight / 10) ASC
  ) AS derived
;

The d derived table is effectively your post_weights view. So if you plan to keep the view, you can use it in place of the derived table:

SELECT
  weight,
  @cumulative := @cumulative + post_count AS post_count
FROM
  (SELECT @cumulative := 0) AS x,
  (
    SELECT
      FLOOR(reason_weight / 10) * 10 AS weight,
      COUNT(*)                       AS post_count
    FROM
      post_weights
    GROUP BY
      FLOOR(reason_weight / 10)
    ORDER BY
      FLOOR(reason_weight / 10) ASC
  ) AS derived
;
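The variable-based queries above boil down to two steps: count posts per 10-point bucket, then keep a running sum in bucket order. A Python sketch of those two steps on invented totals (the function name and sample data are hypothetical):

```python
from collections import Counter
from itertools import accumulate

def bucketed_running_total(total_weights, step=10):
    # Step 1: posts per bucket, like GROUP BY FLOOR(reason_weight / 10).
    buckets = Counter((w // step) * step for w in total_weights)
    labels = sorted(buckets)
    # Step 2: running sum in bucket order, like
    # @cumulative := @cumulative + post_count.
    running = accumulate(buckets[b] for b in labels)
    return list(zip(labels, running))

print(bucketed_running_total([3, 7, 10, 12, 25, 30, 31]))
```

Note that a bucket labelled w holds totals from w up to w + 9, so the running figure at w counts posts with a total below w + 10 rather than posts at or below w. Buckets with no posts are also absent from the result.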

A demo of this solution, which uses a reduced version of your setup, can be found and played with at SQL Fiddle.

— Andriy M

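I tried your query with the full dataset. I'm not sure why (the query looks fine to me), but MariaDB complains with ERROR 1055 (42000): 'd.reason_weight' isn't in GROUP BY if ONLY_FULL_GROUP_BY is in @@sql_mode. With that disabled, I noticed that your query is slower than mine on the first run (~11 seconds). Once the data is cached, it is faster (~1 second). My query takes about 4 seconds on every run.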

— Lennart

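@Lennart: That's because it wasn't the actual query. I corrected it in the fiddle but forgot to update the answer. Updating it now, thanks for the heads-up.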

— Andriy M

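@Lennart: As for performance, I may have a misconception about this type of query. I thought it should work efficiently because the calculation would be done in a single pass over the table. With derived tables, particularly ones that use aggregation, that may not necessarily be the case. I'm afraid I have neither a proper MySQL installation nor enough expertise to analyse it in more depth.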

— Andriy M

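@Andriy_M, it seems to be a bug in my MariaDB version. It doesn't like GROUP BY FLOOR(reason_weight / 10) but accepts GROUP BY reason_weight. As for performance, I'm certainly no expert when it comes to MySQL either; it's just an observation on my crappy machine. Since I ran my query first, all the data should already have been cached, so I don't know why yours was slower the first time it ran.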

— Lennart