Matching a single column against multiple values without self-joining the table in MySQL


14

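We have a table that stores answers to questions. We need to be able to find users who gave specific answers to specific questions. So, if our table contains the following data: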

user_id     question_id     answer_value  
Sally        1               Pooch  
Sally        2               Peach  
John         1               Pooch  
John         2               Duke

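and we want to find users who answered 'Pooch' to question 1 and 'Peach' to question 2, the following SQL will (obviously) not work, because no single row can have both question_id values at once: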

select user_id 
from answers 
where question_id=1 
  and answer_value = 'Pooch'
  and question_id=2
  and answer_value='Peach'

My first thought was to self-join the table once for each answer we are looking for:

select a.user_id 
from answers a, answers b 
where a.user_id = b.user_id
  and a.question_id=1
  and a.answer_value = 'Pooch'
  and b.question_id=2
  and b.answer_value='Peach'

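This works, but since we allow an arbitrary number of search filters, we need to find something more efficient. My next solution was something like this: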

select user_id, count(question_id) 
from answers 
where (
       (question_id=2 and answer_value = 'Peach') 
    or (question_id=1 and answer_value = 'Pooch')
      )
group by user_id 
having count(question_id)>1

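However, we want users to be able to take the same questionnaire twice, so they could have two answers to question 1 in the answers table, and counting rows would then produce false matches.

So now I'm at a loss. What is the best way to approach this? Thanks!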

Answers:


8

I have found a clever way to do this query without a self join.

I ran these commands in MySQL 5.5.8 for Windows and got the following results:

use test
DROP TABLE IF EXISTS answers;
CREATE TABLE answers (user_id VARCHAR(10),question_id INT,answer_value VARCHAR(20));
INSERT INTO answers VALUES
('Sally',1,'Pouch'),
('Sally',2,'Peach'),
('John',1,'Pooch'),
('John',2,'Duke');
INSERT INTO answers VALUES
('Sally',1,'Pooch'),
('Sally',2,'Peach'),
('John',1,'Pooch'),
('John',2,'Duck');

SELECT user_id,question_id,GROUP_CONCAT(DISTINCT answer_value) given_answers
FROM answers GROUP BY user_id,question_id;

+---------+-------------+---------------+
| user_id | question_id | given_answers |
+---------+-------------+---------------+
| John    |           1 | Pooch         |
| John    |           2 | Duke,Duck     |
| Sally   |           1 | Pouch,Pooch   |
| Sally   |           2 | Peach         |
+---------+-------------+---------------+

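This output shows that John gave two different answers to question 2 and Sally gave two different answers to question 1.

To find which questions each user answered differently, place the query above in a subquery and count the commas in the list of given answers to get the number of distinct answers, as follows: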

SELECT user_id,question_id,given_answers,
(LENGTH(given_answers) - LENGTH(REPLACE(given_answers,',','')))+1 multianswer_count
FROM (SELECT user_id,question_id,GROUP_CONCAT(DISTINCT answer_value) given_answers
FROM answers GROUP BY user_id,question_id) A;

I got this:

+---------+-------------+---------------+-------------------+
| user_id | question_id | given_answers | multianswer_count |
+---------+-------------+---------------+-------------------+
| John    |           1 | Pooch         |                 1 |
| John    |           2 | Duke,Duck     |                 2 |
| Sally   |           1 | Pouch,Pooch   |                 2 |
| Sally   |           2 | Peach         |                 1 |
+---------+-------------+---------------+-------------------+

Now filter out the rows where multianswer_count = 1 using another subquery:

SELECT * FROM (SELECT user_id,question_id,given_answers,
(LENGTH(given_answers) - LENGTH(REPLACE(given_answers,',','')))+1 multianswer_count
FROM (SELECT user_id,question_id,GROUP_CONCAT(DISTINCT answer_value) given_answers
FROM answers GROUP BY user_id,question_id) A) AA WHERE multianswer_count > 1;

This is what I got:

+---------+-------------+---------------+-------------------+
| user_id | question_id | given_answers | multianswer_count |
+---------+-------------+---------------+-------------------+
| John    |           2 | Duke,Duck     |                 2 |
| Sally   |           1 | Pouch,Pooch   |                 2 |
+---------+-------------+---------------+-------------------+
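As a possible variation on the comma-counting step, the distinct answers can be counted directly with COUNT(DISTINCT answer_value) and filtered in HAVING. This avoids the string arithmetic, and the count is not thrown off by an answer that itself contains a comma. The sketch below is an illustration under assumptions, not part of the original answer: it runs against in-memory SQLite through Python's sqlite3 module purely so that the snippet is self-contained, although the query uses syntax that MySQL also accepts.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (an assumption made for portability).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE answers (user_id VARCHAR(10), question_id INT, answer_value VARCHAR(20))"
)
rows = [
    ("Sally", 1, "Pouch"), ("Sally", 2, "Peach"),
    ("John", 1, "Pooch"), ("John", 2, "Duke"),
    ("Sally", 1, "Pooch"), ("Sally", 2, "Peach"),
    ("John", 1, "Pooch"), ("John", 2, "Duck"),
]
conn.executemany("INSERT INTO answers VALUES (?, ?, ?)", rows)

# Count distinct answers per (user, question) and keep groups with more than one.
multi_answers = conn.execute(
    """
    SELECT user_id, question_id, COUNT(DISTINCT answer_value) AS multianswer_count
    FROM answers
    GROUP BY user_id, question_id
    HAVING COUNT(DISTINCT answer_value) > 1
    ORDER BY user_id, question_id
    """
).fetchall()

for row in multi_answers:
    print(row)
```

On the sample data this keeps the same two groups as the nested-subquery version (John on question 2 and Sally on question 1), without returning the concatenated answer list.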

Essentially, I performed three table scans: one on the main table and two on the small subqueries. No joins!

Give it a try!


1
I always appreciate the level of effort you put into your answers.
randomx, 2011

7

I like the join method myself:

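SELECT a.user_id FROM answers a
-- each alias is tied to the same user; without these conditions the joins become cross products
INNER JOIN answers a1 ON a1.user_id=a.user_id AND a1.question_id=1 AND a1.answer_value='Pooch'
INNER JOIN answers a2 ON a2.user_id=a.user_id AND a2.question_id=2 AND a2.answer_value='Peach'
GROUP BY a.user_id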

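Update: After testing with a larger table (~1 million rows), this method took significantly longer than the simple OR method mentioned in the original question.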


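Thanks for the reply. The problem is that this could be a very large table, and having to join it 5 or 6 times might mean a huge performance hit, correct?
Christopher Armstrong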

Good question. I'm writing a test case to check, because I don't know... I will post the results when it's done.
Derek Downey

1
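So I inserted 1 million rows with random user and question/answer pairs. The join is still running at 557 seconds, and your OR query finished in 1.84 seconds... I'm sitting in the corner now.
Derek Downey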

Do you have indexes on the test table? If you scan a table with millions of rows several times, it will certainly be slow :-).
Marian

@Marian Yes, I added an index on (question_id, answer_value). The problem is that the cardinality is extremely low, so it doesn't help much (each join scanned 100-200k rows).
Derek Downey

5

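We were joining on the user_id from the answers table in a chain of joins to get data from other tables, but isolating the answers table SQL and writing it in such simple terms helped me spot the solution: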

SELECT user_id, COUNT(question_id) 
FROM answers 
WHERE
  (question_id = 2 AND answer_value = 'Peach') 
  OR (question_id = 1 AND answer_value = 'Pooch')
GROUP by user_id 
HAVING COUNT(question_id) > 1

We were needlessly using a second subquery.
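One caveat with the query above: COUNT(question_id) counts rows, so a user who submitted the same answer to question 1 twice could satisfy the HAVING clause without ever answering question 2. A possible adjustment is to compare COUNT(DISTINCT question_id) with the number of questions being filtered on. The sketch below is an illustration under assumptions rather than anyone's posted solution: build_match_query is a hypothetical helper, the sample data adds a repeated row for John, and the SQL runs against in-memory SQLite through Python's sqlite3 module only so that the snippet is self-contained (the `?` placeholders are sqlite3's style; MySQL drivers typically use a different marker).

```python
import sqlite3


def build_match_query(filters):
    """Build the OR / GROUP BY query for a list of (question_id, answer_value) pairs.

    Hypothetical helper, written for illustration only.
    """
    if not filters:
        raise ValueError("at least one filter is required")
    # One "(question_id = ? AND answer_value = ?)" clause per requested filter.
    clause = " OR ".join("(question_id = ? AND answer_value = ?)" for _ in filters)
    sql = (
        "SELECT user_id FROM answers "
        f"WHERE {clause} "
        "GROUP BY user_id "
        # DISTINCT keeps repeated answers to one question from inflating the count.
        "HAVING COUNT(DISTINCT question_id) = ? "
        "ORDER BY user_id"
    )
    params = [value for pair in filters for value in pair]
    # The user must match every distinct question that appears in the filters.
    params.append(len({question_id for question_id, _ in filters}))
    return sql, params


# In-memory SQLite stands in for MySQL here (an assumption made for portability).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE answers (user_id VARCHAR(10), question_id INT, answer_value VARCHAR(20))"
)
conn.executemany(
    "INSERT INTO answers VALUES (?, ?, ?)",
    [
        ("Sally", 1, "Pooch"),
        ("Sally", 2, "Peach"),
        ("John", 1, "Pooch"),
        ("John", 2, "Duke"),
        ("John", 1, "Pooch"),  # John takes the questionnaire a second time
    ],
)

sql, params = build_match_query([(1, "Pooch"), (2, "Peach")])
matches = conn.execute(sql, params).fetchall()
print(matches)  # prints [('Sally',)]

# For comparison, the row-counting version also returns John.
naive = conn.execute(
    "SELECT user_id FROM answers "
    "WHERE (question_id = 2 AND answer_value = 'Peach') "
    "   OR (question_id = 1 AND answer_value = 'Pooch') "
    "GROUP BY user_id HAVING COUNT(question_id) > 1 ORDER BY user_id"
).fetchall()
print(naive)  # prints [('John',), ('Sally',)]
```

Because the clause list is generated from the filter list, the same single-pass query shape covers any number of search filters without adding joins.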


I like your answer.
Kisspa, 2015

4

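If you have a large set of data, I would create two indexes:

  • question_id, answer_value, user_id; and
  • user_id, question_id, answer_value.

Because of the way the data is organized, you will have to join multiple times. If you know which answer to which question is the least common, you might be able to speed up the query a bit, but the optimizer should do this for you.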

Try the query as:

SELECT a1.user_id
FROM answers a1
INNER JOIN answers a2
   ON a2.question_id = 2
   AND a2.answer_value = 'Peach'
   AND a1.user_id = a2.user_id
WHERE a1.question_id = 1 AND a1.answer_value = 'Pooch'

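Table a1 should use the first index. Depending on the data distribution, the optimizer could use either index for a2. The entire query should be satisfied from the indexes alone.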
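To make the indexing suggestion concrete, here is a minimal sketch that creates both composite indexes and runs the two-filter join on the sample data from the question. Several things here are assumptions for illustration: the index names are made up, and the statements run against in-memory SQLite through Python's sqlite3 module so the snippet is self-contained. The plan printed at the end is therefore SQLite's plan, not MySQL's; on MySQL, prefixing the query with EXPLAIN would be the way to check which index the optimizer picks.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (an assumption made for portability).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE answers (user_id VARCHAR(10), question_id INT, answer_value VARCHAR(20))"
)
conn.executemany(
    "INSERT INTO answers VALUES (?, ?, ?)",
    [
        ("Sally", 1, "Pooch"),
        ("Sally", 2, "Peach"),
        ("John", 1, "Pooch"),
        ("John", 2, "Duke"),
    ],
)

# The two composite indexes described above; the index names are hypothetical.
conn.execute(
    "CREATE INDEX idx_question_answer_user "
    "ON answers (question_id, answer_value, user_id)"
)
conn.execute(
    "CREATE INDEX idx_user_question_answer "
    "ON answers (user_id, question_id, answer_value)"
)

join_sql = """
SELECT a1.user_id
FROM answers a1
INNER JOIN answers a2
   ON a2.user_id = a1.user_id
  AND a2.question_id = 2
  AND a2.answer_value = 'Peach'
WHERE a1.question_id = 1
  AND a1.answer_value = 'Pooch'
"""
matches = conn.execute(join_sql).fetchall()
print(matches)  # prints [('Sally',)]

# Show which indexes SQLite chooses; the exact wording of the plan varies by version.
for step in conn.execute("EXPLAIN QUERY PLAN " + join_sql):
    print(step)
```

Both indexes contain all three columns, which is what lets the query be answered from index entries without reading the table rows.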


2

One way to solve this is to get a subset of user_id values and test those for the second match:

SELECT user_id 
FROM answers 
WHERE question_id = 1 
AND answer_value = 'Pooch'
AND user_id IN (SELECT user_id FROM answers WHERE question_id=2 AND answer_value = 'Peach');

Using Rolando's structure:

CREATE TABLE answers (user_id VARCHAR(10),question_id INT,answer_value VARCHAR(20));
INSERT INTO answers VALUES
('Sally',1,'Pouch'),
('Sally',2,'Peach'),
('John',1,'Pooch'),
('John',2,'Duke');
INSERT INTO answers VALUES
('Sally',1,'Pooch'),
('Sally',2,'Peach'),
('John',1,'Pooch'),
('John',2,'Duck');

Yields:

mysql> SELECT user_id FROM answers WHERE question_id = 1 AND answer_value = 'Pooch' AND user_id IN (SELECT user_id FROM answers WHERE question_id=2 AND answer_value = 'Peach');
+---------+
| user_id |
+---------+
| Sally   |
+---------+
1 row in set (0.00 sec)
Licensed under cc by-sa 3.0 with attribution required.