Get incremental counts of an aggregated value in a joined table


10

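I have two tables in a MySQL 5.7.22 database: posts and reasons. Each post row has and belongs to many reason rows. Each reason has a weight associated with it, so each post has a total aggregated weight associated with it.

For each increment of 10 points of weight (i.e. 0, 10, 20, 30, etc.), I want to get a count of posts whose total weight is less than or equal to that increment. I'd expect the results to look something like this: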

 weight | post_count
--------+------------
      0 | 0
     10 | 5
     20 | 12
     30 | 18
    ... | ...
    280 | 20918
    290 | 21102
    ... | ...
   1250 | 118005
   1260 | 118039
   1270 | 118040
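
To pin down what each row of that table means, here is a minimal Python sketch of the intended result, computed from a list of per-post total weights. The function name cumulative_counts and the sample totals are hypothetical, for illustration only; the sketch also assumes a non-empty list and rounds the top increment up past the maximum, which the table above may not do.

```python
from bisect import bisect_right

def cumulative_counts(post_totals, step=10):
    """Return (weight, post_count) pairs: posts whose total weight is <= weight."""
    ordered = sorted(post_totals)
    # Round the maximum up to the next multiple of step (an assumption of this sketch).
    top = -(-max(ordered) // step) * step
    # bisect_right counts how many sorted totals are <= w.
    return [(w, bisect_right(ordered, w)) for w in range(0, top + step, step)]

# Hypothetical per-post totals, not taken from the real dataset.
sample_totals = [3, 10, 12, 25, 30, 31]
rows = cumulative_counts(sample_totals)
for weight, post_count in rows:
    print(weight, post_count)
```

Sorting once and doing a binary search per increment avoids comparing every post against every other post.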

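Total weights are approximately normally distributed, with a few very low values and a few very high values (the maximum is currently 1277), but the majority in the middle. There are just under 120,000 rows in posts and around 120 in reasons. Each post has 5 or 6 reasons on average.

The relevant parts of the tables look like this: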

CREATE TABLE `posts` (
  id BIGINT PRIMARY KEY
);

CREATE TABLE `reasons` (
  id BIGINT PRIMARY KEY,
  weight INT(11) NOT NULL
);

CREATE TABLE `posts_reasons` (
  post_id BIGINT NOT NULL,
  reason_id BIGINT NOT NULL,
  CONSTRAINT fk_posts_reasons_posts FOREIGN KEY (post_id) REFERENCES posts(id),
  CONSTRAINT fk_posts_reasons_reasons FOREIGN KEY (reason_id) REFERENCES reasons(id)
);

So far, I've tried putting the post ID and total weight into a view, then joining that view to itself to get an aggregated count:

CREATE VIEW `post_weights` AS (
    SELECT 
        posts.id,
        SUM(reasons.weight) AS reason_weight
    FROM posts
    INNER JOIN posts_reasons ON posts.id = posts_reasons.post_id
    INNER JOIN reasons ON posts_reasons.reason_id = reasons.id
    GROUP BY posts.id
);

SELECT
    FLOOR(p1.reason_weight / 10) AS weight,
    COUNT(DISTINCT p2.id) AS cumulative
FROM post_weights AS p1
INNER JOIN post_weights AS p2 ON FLOOR(p2.reason_weight / 10) <= FLOOR(p1.reason_weight / 10)
GROUP BY FLOOR(p1.reason_weight / 10)
ORDER BY FLOOR(p1.reason_weight / 10) ASC;

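However, that's unusably slow: I let it run for 15 minutes without it finishing, which I can't do in production.

Is there a more efficient way to do this?

If you're interested in testing against the entire dataset, it's downloadable here. The file is around 60MB and expands to around 250MB. Alternatively, there are 12,000 rows in a GitHub gist here.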

Answers:


8

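Using functions or expressions in JOIN conditions is usually a bad idea. I say usually because some optimizers handle it fairly well and can use indexes anyway. I would suggest creating a table for the weights. Something like: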

CREATE TABLE weights
( weight int not null primary key 
);

INSERT INTO weights (weight) VALUES (0),(10),(20),...(1270);

Make sure you have an index on posts_reasons:

CREATE UNIQUE INDEX ... ON posts_reasons (reason_id, post_id);

Then a query like:

SELECT w.weight
     , COUNT(1) as post_count
FROM weights w
JOIN ( SELECT pr.post_id, SUM(r.weight) as sum_weight     
       FROM reasons r
       JOIN posts_reasons pr
             ON r.id = pr.reason_id
       GROUP BY pr.post_id
     ) as x
    ON w.weight > x.sum_weight
GROUP BY w.weight;
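
To try the lookup-table approach on a tiny dataset, here is a rough sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The sample rows are hypothetical. Note two deliberate variations that are assumptions of this sketch and not part of the query above: it compares with >= so that the count is inclusive ("less than or equal to"), and it uses a LEFT JOIN with COUNT(x.post_id) so that increments with no matching posts still appear with a count of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reasons (id INTEGER PRIMARY KEY, weight INTEGER NOT NULL);
    CREATE TABLE posts_reasons (post_id INTEGER NOT NULL, reason_id INTEGER NOT NULL);
    CREATE TABLE weights (weight INTEGER NOT NULL PRIMARY KEY);

    -- Hypothetical sample data, not taken from the real dataset.
    INSERT INTO reasons VALUES (1, 5), (2, 10), (3, 20);
    INSERT INTO posts_reasons VALUES (1, 1), (2, 1), (2, 2), (3, 2), (3, 3), (4, 3);
""")
# Fill the lookup table with the increments 0, 10, 20, 30.
conn.executemany("INSERT INTO weights VALUES (?)", [(w,) for w in range(0, 40, 10)])

rows = conn.execute("""
    SELECT w.weight, COUNT(x.post_id) AS post_count
    FROM weights w
    LEFT JOIN (SELECT pr.post_id, SUM(r.weight) AS sum_weight
               FROM reasons r
               JOIN posts_reasons pr ON r.id = pr.reason_id
               GROUP BY pr.post_id) AS x
        ON w.weight >= x.sum_weight
    GROUP BY w.weight
    ORDER BY w.weight
""").fetchall()
print(rows)
```

In this sample the per-post totals are 5, 15, 30 and 20, so each increment counts the posts at or below it.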

My machine at home is probably 5-6 years old. It has an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz and 8GB of RAM.

uname -a: Linux <hostname> 4.16.6-302.fc28.x86_64 #1 SMP Wed May 2 00:07:06 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I tested against:

https://drive.google.com/open?id=1q3HZXW_qIZ01gU-Krms7qMJW3GCsOUP5

MariaDB [test3]> select @@version;
+-----------------+
| @@version       |
+-----------------+
| 10.2.14-MariaDB |
+-----------------+
1 row in set (0.00 sec)


SELECT w.weight
     , COUNT(1) as post_count
FROM weights w
JOIN ( SELECT pr.post_id, SUM(r.weight) as sum_weight     
       FROM reasons r
       JOIN posts_reasons pr
             ON r.id = pr.reason_id
       GROUP BY pr.post_id
     ) as x
    ON w.weight > x.sum_weight
GROUP BY w.weight;

+--------+------------+
| weight | post_count |
+--------+------------+
|      0 |          1 |
|     10 |       2591 |
|     20 |       4264 |
|     30 |       4386 |
|     40 |       5415 |
|     50 |       7499 |
[...]   
|   1270 |     119283 |
|   1320 |     119286 |
|   1330 |     119286 |
[...]
|   2590 |     119286 |
+--------+------------+
256 rows in set (9.89 sec)

If performance is critical and nothing else helps, you could create a summary table for:

SELECT pr.post_id, SUM(r.weight) as sum_weight     
FROM reasons r
JOIN posts_reasons pr
    ON r.id = pr.reason_id
GROUP BY pr.post_id

You can maintain this table via triggers.
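
As a rough illustration of the trigger idea, the sketch below keeps a per-post total up to date as link rows are inserted and deleted. It uses Python's built-in sqlite3 module and SQLite trigger syntax as a stand-in; MySQL triggers are written differently (for example with FOR EACH ROW). The table name post_weight_summary and the trigger names are hypothetical, and the sketch does not handle a reason's weight being changed later, which would need an additional trigger on reasons.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reasons (id INTEGER PRIMARY KEY, weight INTEGER NOT NULL);
    CREATE TABLE posts_reasons (post_id INTEGER NOT NULL, reason_id INTEGER NOT NULL);

    -- Hypothetical summary table: one row per post with its current total.
    CREATE TABLE post_weight_summary (
        post_id INTEGER PRIMARY KEY,
        sum_weight INTEGER NOT NULL DEFAULT 0
    );

    -- Add the reason's weight when a link row is inserted.
    CREATE TRIGGER posts_reasons_ai AFTER INSERT ON posts_reasons
    BEGIN
        INSERT OR IGNORE INTO post_weight_summary (post_id) VALUES (NEW.post_id);
        UPDATE post_weight_summary
        SET sum_weight = sum_weight + (SELECT weight FROM reasons WHERE id = NEW.reason_id)
        WHERE post_id = NEW.post_id;
    END;

    -- Subtract it again when a link row is deleted.
    CREATE TRIGGER posts_reasons_ad AFTER DELETE ON posts_reasons
    BEGIN
        UPDATE post_weight_summary
        SET sum_weight = sum_weight - (SELECT weight FROM reasons WHERE id = OLD.reason_id)
        WHERE post_id = OLD.post_id;
    END;

    -- Hypothetical sample data to exercise both triggers.
    INSERT INTO reasons VALUES (1, 5), (2, 10);
    INSERT INTO posts_reasons VALUES (1, 1), (1, 2), (2, 2);
    DELETE FROM posts_reasons WHERE post_id = 1 AND reason_id = 1;
""")

summary = conn.execute(
    "SELECT post_id, sum_weight FROM post_weight_summary ORDER BY post_id"
).fetchall()
print(summary)
```

With a table like this in place, the weights query can join against the stored totals and skip the aggregation over posts_reasons on every run.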

Since a certain amount of work has to be done for each weight in weights, it may be beneficial to limit this table.

    ON w.weight > x.sum_weight 
WHERE w.weight <= (select MAX(sum_weights) 
                   from (SELECT SUM(weight) as sum_weights 
                   FROM reasons r        
                   JOIN posts_reasons pr
                       ON r.id = pr.reason_id 
                   GROUP BY pr.post_id) a
                  ) 
GROUP BY w.weight

Since I had a lot of unnecessary rows in my weights table (max 2590), the restriction above cut the execution time from 9 to 4 seconds.


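Clarification: this looks like it's counting reasons with a weight lower than w.weight, is that right? I'm looking to count posts with a total weight (the sum of the weights of their associated reason rows) that is lte w.weight.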
ArtOfCode

Ah, sorry. I'll rewrite the query.
Lennart

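That got me the rest of the way, though, so thank you! I just needed to select from the existing post_weights view I'd already created instead of from reasons.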
ArtOfCode

@ArtOfCode, did I get it right with the revised query? BTW, thanks for an excellent question. Clear, concise and with lots of sample data. Bravo.
Lennart '18

7

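In MySQL, variables can be used in queries both to be calculated from values in columns and to be used in expressions for new, calculated columns. In this case, using a variable results in an efficient query: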

SELECT
  weight,
  @cumulative := @cumulative + post_count AS post_count
FROM
  (SELECT @cumulative := 0) AS x,
  (
    SELECT
      FLOOR(reason_weight / 10) * 10 AS weight,
      COUNT(*)                       AS post_count
    FROM
      (
        SELECT 
          p.id,
          SUM(r.weight) AS reason_weight
        FROM
          posts AS p
          INNER JOIN posts_reasons AS pr ON p.id = pr.post_id
          INNER JOIN reasons AS r ON pr.reason_id = r.id
        GROUP BY
          p.id
      ) AS d
    GROUP BY
      FLOOR(reason_weight / 10)
    ORDER BY
      FLOOR(reason_weight / 10) ASC
  ) AS derived
;
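
The logic of the query above can be sketched outside SQL as two steps: count posts per 10-point bucket, then walk the buckets in order while carrying a running total, which is the role @cumulative plays. The Python below is only an illustration of that logic with hypothetical totals, not a replacement for the query.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical per-post totals, standing in for the rows of post_weights.
reason_weights = [3, 10, 12, 25, 30, 31]

# Step 1: bucket each post like FLOOR(reason_weight / 10) * 10 and count per bucket.
per_bucket = Counter((w // 10) * 10 for w in reason_weights)
buckets = sorted(per_bucket)

# Step 2: one ordered pass, carrying a running total like @cumulative does.
running = list(accumulate(per_bucket[b] for b in buckets))
result = list(zip(buckets, running))
print(result)
```

One thing this makes visible: with FLOOR bucketing, the row labelled 10 covers totals from 10 to 19, so its cumulative count is "posts with a total below 20", which is not quite the "less than or equal to 10" definition in the question. Buckets that contain no posts also produce no row.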

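The d derived table is actually your post_weights view. Therefore, if you are planning on keeping the view, you can use it instead of the derived table: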

SELECT
  weight,
  @cumulative := @cumulative + post_count AS post_count
FROM
  (SELECT @cumulative := 0) AS x,
  (
    SELECT
      FLOOR(reason_weight / 10) * 10 AS weight,
      COUNT(*)                       AS post_count
    FROM
      post_weights
    GROUP BY
      FLOOR(reason_weight / 10)
    ORDER BY
      FLOOR(reason_weight / 10) ASC
  ) AS derived
;

A demo of this solution, which uses a cut-down version of your setup, can be found and played with at SQL Fiddle.


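I tried your query with the entire dataset. I'm not sure why (the query looks OK to me), but MariaDB complains with ERROR 1055 (42000): 'd.reason_weight' isn't in GROUP BY if ONLY_FULL_GROUP_BY is in @@sql_mode. After disabling it, I noticed that your query is slower than mine the first time it is run (~11 sec). Once the data is cached it is faster (~1 sec). My query runs in approximately 4 sec every time.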
Lennart

1
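@Lennart: That's because it's not the actual query. I corrected it in the fiddle but forgot to update the answer. Updating it now, thanks for the heads-up.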
Andriy M

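@Lennart: As for performance, I may have a misconception about this type of query. I thought it should work efficiently because the calculations would be completed in one pass over the table. Perhaps that isn't necessarily the case with derived tables, in particular ones that use aggregation. I'm afraid I have neither a proper MySQL installation nor sufficient expertise to analyse it any deeper.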
Andriy M

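@Andriy_M, it appears to be a bug in my MariaDB version. It doesn't like GROUP BY FLOOR(reason_weight / 10) but accepts GROUP BY reason_weight. As for performance, I'm certainly no expert on MySQL either; it was just an observation on my crappy machine. Since I ran my query first, all the data should already have been cached, so I don't know why yours was slower the first time it ran.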
Lennart
Licensed under cc by-sa 3.0 with attribution required.