Answers:
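2019 UPDATE: In the 10 years since I wrote this answer, more solutions have been uncovered that may yield better results. SQL Server releases since then (especially SQL 2012) have also introduced new T-SQL features that can be used to calculate medians, and the query optimizer has improved, which may affect the performance of the various median solutions. On balance, my original 2009 post is still OK, but there may be better solutions for modern SQL Server apps. Take a look at this article from 2012, which is a great resource: https://sqlperformance.com/2012/08/t-sql-queries/median
The article found the following pattern to be much, much faster than all the other alternatives, at least on the simple schema they tested. This solution was 373x faster (!!!) than the slowest solution tested (PERCENTILE_CONT). Note that this trick requires two separate queries, which may not be practical in all cases. It also requires SQL 2012 or later.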
DECLARE @c BIGINT = (SELECT COUNT(*) FROM dbo.EvenRows);
SELECT AVG(1.0 * val)
FROM (
SELECT val FROM dbo.EvenRows
ORDER BY val
OFFSET (@c - 1) / 2 ROWS
FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
) AS x;
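Of course, just because one test on one schema in 2012 yielded great results, your mileage may vary, especially if you are on SQL Server 2014 or later. If performance matters for your median calculation, I'd strongly suggest trying and perf-testing several of the options recommended in that article to make sure you have found the best one for your schema.
I'd also be especially careful with the PERCENTILE_CONT function (new in SQL Server 2012) that is recommended in one of the other answers to this question, because the article linked above found this built-in function to be 373x slower than the fastest solution. It's possible that this disparity has narrowed in the 7 years since, but personally I wouldn't use this function on a large table until I had verified its performance against the other solutions.
The original 2009 post follows:
There are lots of ways to do this, with dramatically varying performance. Here's one particularly well-optimized solution, from "Medians, ROW_NUMBERs, and performance". It is particularly good in terms of the actual I/O generated during execution: it looks more costly than other solutions, but it is actually much faster.
That page also contains a discussion of other solutions and performance testing details. Note the use of a unique column as a disambiguator in case multiple rows have the same value in the median column.
As with all database performance scenarios, always try to test a solution with real data on real hardware. You never know when a change to SQL Server's optimizer or a peculiarity in your environment will make a normally speedy solution slower.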
SELECT
CustomerId,
AVG(TotalDue)
FROM
(
SELECT
CustomerId,
TotalDue,
-- SalesOrderId in the ORDER BY is a disambiguator to break ties
ROW_NUMBER() OVER (
PARTITION BY CustomerId
ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY CustomerId
ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
FROM Sales.SalesOrderHeader SOH
) x
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CustomerId
ORDER BY CustomerId;
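To make the row-number trick easier to follow, here is a rough Python sketch of what the query does per customer. It is only an illustration, not the article's code: the function name and the tuple layout are my own assumptions, and it relies on the descending numbering being an exact mirror of the ascending one, which is what the unique tie-breaker guarantees.

```python
from collections import defaultdict

def median_by_row_numbers(rows):
    """rows: iterable of (customer_id, total_due, order_id) tuples (hypothetical layout)."""
    groups = defaultdict(list)
    for customer_id, total_due, order_id in rows:
        groups[customer_id].append((total_due, order_id))

    medians = {}
    for customer_id, items in groups.items():
        # Sorting on (total_due, order_id) mirrors ORDER BY TotalDue, SalesOrderId;
        # the unique order_id keeps the order deterministic when totals tie.
        items.sort()
        n = len(items)
        middle = []
        for i, (total_due, _) in enumerate(items):
            row_asc = i + 1    # ROW_NUMBER() ... ORDER BY ... ASC
            row_desc = n - i   # ROW_NUMBER() ... ORDER BY ... DESC
            if row_asc in (row_desc, row_desc - 1, row_desc + 1):
                middle.append(total_due)
        medians[customer_id] = sum(middle) / len(middle)
    return medians
```

With an odd number of rows only the single middle row satisfies the filter; with an even number, the two middle rows do, and the average of those is the median.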
If you're using SQL 2005 or later, this is a nice, fairly simple median calculation for a single column in a table:
SELECT
(
(SELECT MAX(Score) FROM
(SELECT TOP 50 PERCENT Score FROM Posts ORDER BY Score) AS BottomHalf)
+
(SELECT MIN(Score) FROM
(SELECT TOP 50 PERCENT Score FROM Posts ORDER BY Score DESC) AS TopHalf)
) / 2 AS Median
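As a sanity check on why the two TOP 50 PERCENT halves meet in the middle, here is a small Python sketch of the same idea. The function name is made up for illustration, and it assumes a non-empty list. TOP n PERCENT rounds a fractional row count up, which is why the two halves overlap on the middle row when the count is odd.

```python
import math

def median_top_50_percent(scores):
    ordered = sorted(scores)
    n = len(ordered)
    # TOP 50 PERCENT rounds a fractional row count up to the next integer.
    half = math.ceil(n / 2)
    bottom_half = ordered[:half]      # TOP 50 PERCENT ... ORDER BY Score
    top_half = ordered[n - half:]     # TOP 50 PERCENT ... ORDER BY Score DESC
    return (max(bottom_half) + min(top_half)) / 2
```

One difference to keep in mind: Python's `/` is true division, whereas in T-SQL `/ 2` on an integer column performs integer division, so the query may need something like `/ 2.0` if a fractional median is expected.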
select gid, median(score) from T group by gid. Would you need a correlated subquery for that?
In SQL Server 2012 you should use PERCENTILE_CONT:
SELECT SalesOrderID, OrderQty,
PERCENTILE_CONT(0.5)
WITHIN GROUP (ORDER BY OrderQty)
OVER (PARTITION BY SalesOrderID) AS MedianCont
FROM Sales.SalesOrderDetail
WHERE SalesOrderID IN (43670, 43669, 43667, 43663)
ORDER BY SalesOrderID DESC
See also: http://blog.sqlauthority.com/2011/11/20/sql-server-introduction-to-percentile_cont-analytic-functions-introduced-in-sql-server-2012/
DISTINCT or GROUP BY SalesOrderID? Otherwise you will get a lot of duplicate rows.
PERCENTILE_DISC
My original quick answer was:
select max(my_column) as [my_column], quartile
from (select my_column, ntile(4) over (order by my_column) as [quartile]
from my_table) i
--where quartile = 2
group by quartile
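This gives you the median and the interquartile range in one fell swoop. If you really only want the one row that is the median, uncomment the where clause.
When you put that into an explain plan, 60% of the work is sorting the data, which is unavoidable when calculating position-dependent statistics like this.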
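For anyone wondering what the NTILE(4) query actually returns, here is a rough Python model of it. The helper names are my own, and the bucket sizing follows NTILE's rule that leftover rows go to the earliest buckets.

```python
def ntile(n_rows, buckets):
    """Return the 1-based bucket number for each row position, NTILE-style."""
    base, extra = divmod(n_rows, buckets)
    labels = []
    for bucket in range(1, buckets + 1):
        # The first `extra` buckets each receive one additional row.
        size = base + (1 if bucket <= extra else 0)
        labels.extend([bucket] * size)
    return labels

def quartile_maxima(values):
    ordered = sorted(values)
    maxima = {}
    for value, bucket in zip(ordered, ntile(len(ordered), 4)):
        maxima[bucket] = value  # input is sorted, so the last value seen is the max
    return maxima
```

For the eight values 1 to 8 this returns 4 for quartile 2, while the interpolated median is 4.5. So with an even number of rows the query yields one of the two middle values rather than their average.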
I've amended the answer to follow the excellent suggestion from Robert Ševčík-Robajz in the comments below:
;with PartitionedData as
(select my_column, ntile(10) over (order by my_column) as [percentile]
from my_table),
MinimaAndMaxima as
(select min(my_column) as [low], max(my_column) as [high], percentile
from PartitionedData
group by percentile)
select
case
when b.percentile = 10 then cast(b.high as decimal(18,2))
else cast((a.low + b.high) as decimal(18,2)) / 2
end as [value], --b.high, a.low,
b.percentile
from MinimaAndMaxima a
join MinimaAndMaxima b on (a.percentile -1 = b.percentile) or (a.percentile = 10 and b.percentile = 10)
--where b.percentile = 5
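This should calculate the correct median and percentile values when you have an even number of data items. Again, uncomment the final where clause if you only want the median and not the entire percentile distribution.
Even better: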
SELECT @Median = AVG(1.0 * val)
FROM
(
SELECT o.val, rn = ROW_NUMBER() OVER (ORDER BY o.val), c.c
FROM dbo.EvenRows AS o
CROSS JOIN (SELECT c = COUNT(*) FROM dbo.EvenRows) AS c
) AS x
WHERE rn IN ((c + 1)/2, (c + 2)/2);
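The filter on rn is the whole trick here, so a tiny Python sketch of the arithmetic may help. The function name is illustrative, and it assumes a non-empty input.

```python
def median_middle_rows(values):
    ordered = sorted(values)
    c = len(ordered)
    # Integer division, as in T-SQL: both expressions name the same row when c is odd
    # and the two middle rows when c is even.
    wanted = {(c + 1) // 2, (c + 2) // 2}
    picked = [v for rn, v in enumerate(ordered, start=1) if rn in wanted]
    return sum(picked) / len(picked)
```

For c = 5 both expressions evaluate to 3; for c = 4 they evaluate to 2 and 3.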
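From the guru himself, Itzik Ben-Gan!
MS SQL Server 2012 (and later) has the PERCENTILE_DISC function, which computes a specific percentile for sorted values. PERCENTILE_DISC(0.5) will compute the median: https://msdn.microsoft.com/en-us/library/hh231327.aspx
If you want to use CREATE AGGREGATE in SQL Server, this is how to do it. Doing it this way has the benefit of letting you write clean queries. Note that this process could be adapted fairly easily to calculate a percentile value.
Create a new Visual Studio project and set the target framework to .NET 3.5 (this is for SQL 2008; it may be different in SQL 2012). Then create a class file and put in the following code, or the C# equivalent: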
Imports Microsoft.SqlServer.Server
Imports System.Data.SqlTypes
Imports System.IO
<Serializable>
<SqlUserDefinedAggregate(Format.UserDefined, IsInvariantToNulls:=True, IsInvariantToDuplicates:=False, _
IsInvariantToOrder:=True, MaxByteSize:=-1, IsNullIfEmpty:=True)>
Public Class Median
Implements IBinarySerialize
Private _items As List(Of Decimal)
Public Sub Init()
_items = New List(Of Decimal)()
End Sub
Public Sub Accumulate(value As SqlDecimal)
If Not value.IsNull Then
_items.Add(value.Value)
End If
End Sub
Public Sub Merge(other As Median)
If other._items IsNot Nothing Then
_items.AddRange(other._items)
End If
End Sub
Public Function Terminate() As SqlDecimal
If _items.Count <> 0 Then
Dim result As Decimal
_items = _items.OrderBy(Function(i) i).ToList()
If _items.Count Mod 2 = 0 Then
result = ((_items((_items.Count / 2) - 1)) + (_items(_items.Count / 2))) / 2@
Else
result = _items((_items.Count - 1) / 2)
End If
Return New SqlDecimal(result)
Else
Return New SqlDecimal()
End If
End Function
Public Sub Read(r As BinaryReader) Implements IBinarySerialize.Read
'deserialize it from a string
Dim list = r.ReadString()
_items = New List(Of Decimal)
For Each value In list.Split(","c)
Dim number As Decimal
If Decimal.TryParse(value, number) Then
_items.Add(number)
End If
Next
End Sub
Public Sub Write(w As BinaryWriter) Implements IBinarySerialize.Write
'serialize the list to a string
Dim list = ""
For Each item In _items
If list <> "" Then
list += ","
End If
list += item.ToString()
Next
w.Write(list)
End Sub
End Class
Then compile it, copy the DLL and PDB file to your SQL Server machine, and run the following commands in SQL Server:
CREATE ASSEMBLY CustomAggregate FROM '{path to your DLL}'
WITH PERMISSION_SET=SAFE;
GO
CREATE AGGREGATE Median(@value decimal(9, 3))
RETURNS decimal(9, 3)
EXTERNAL NAME [CustomAggregate].[{namespace of your DLL}.Median];
GO
You can then write a query to calculate the median like this: SELECT dbo.Median(Field) FROM Table
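If the Init / Accumulate / Merge / Terminate contract is unfamiliar, the following Python sketch mimics the life cycle of the aggregate above. It is only a model of the control flow, not CLR code: the class and method names are mine, and the way the rows are split across two partial aggregates is an assumption standing in for how SQL Server might combine partial results.

```python
from decimal import Decimal

class MedianAggregate:
    def init(self):
        # Called once before any rows are accumulated.
        self.items = []

    def accumulate(self, value):
        # Called once per row; NULLs (None here) are ignored.
        if value is not None:
            self.items.append(Decimal(str(value)))

    def merge(self, other):
        # Combines the state of another partial aggregate into this one.
        self.items.extend(other.items)

    def terminate(self):
        # Called at the end; the sort happens only once, here.
        if not self.items:
            return None
        ordered = sorted(self.items)
        mid = len(ordered) // 2
        if len(ordered) % 2 == 0:
            return (ordered[mid - 1] + ordered[mid]) / 2
        return ordered[mid]
```

The design point is that the aggregate has to keep every value until Terminate, which is why the real implementation needs user-defined serialization and an unlimited MaxByteSize.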
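I just came across this page while looking for a set-based solution to the median. After looking at some of the solutions here, I came up with the following. Hope it helps/works.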
DECLARE @test TABLE(
i int identity(1,1),
id int,
score float
)
INSERT INTO @test (id,score) VALUES (1,10)
INSERT INTO @test (id,score) VALUES (1,11)
INSERT INTO @test (id,score) VALUES (1,15)
INSERT INTO @test (id,score) VALUES (1,19)
INSERT INTO @test (id,score) VALUES (1,20)
INSERT INTO @test (id,score) VALUES (2,20)
INSERT INTO @test (id,score) VALUES (2,21)
INSERT INTO @test (id,score) VALUES (2,25)
INSERT INTO @test (id,score) VALUES (2,29)
INSERT INTO @test (id,score) VALUES (2,30)
INSERT INTO @test (id,score) VALUES (3,20)
INSERT INTO @test (id,score) VALUES (3,21)
INSERT INTO @test (id,score) VALUES (3,25)
INSERT INTO @test (id,score) VALUES (3,29)
DECLARE @counts TABLE(
id int,
cnt int
)
INSERT INTO @counts (
id,
cnt
)
SELECT
id,
COUNT(*)
FROM
@test
GROUP BY
id
SELECT
drv.id,
drv.start,
AVG(t.score)
FROM
(
SELECT
MIN(t.i)-1 AS start,
t.id
FROM
@test t
GROUP BY
t.id
) drv
INNER JOIN @test t ON drv.id = t.id
INNER JOIN @counts c ON t.id = c.id
WHERE
t.i = ((c.cnt+1)/2)+drv.start
OR (
t.i = (((c.cnt+1)%2) * ((c.cnt+2)/2))+drv.start
AND ((c.cnt+1)%2) * ((c.cnt+2)/2) <> 0
)
GROUP BY
drv.id,
drv.start
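Although Justin Grant's solution appears solid, I found that when there are many duplicate values within a given partition key, the row numbers for the duplicate values in the ASC ordering end up out of sequence, so they do not line up properly with the DESC ordering.
Here is a fragment of my result: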
KEY VALUE ROWA ROWD
13 2 22 182
13 1 6 183
13 1 7 184
13 1 8 185
13 1 9 186
13 1 10 187
13 1 11 188
13 1 12 189
13 0 1 190
13 0 2 191
13 0 3 192
13 0 4 193
13 0 5 194
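I used Justin's code as the basis for this solution. Although it is not as efficient, given the use of multiple derived tables, it does resolve the row-ordering problem I encountered. Any improvements would be welcome, as I am not that experienced in T-SQL.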
SELECT PKEY, cast(AVG(VALUE)as decimal(5,2)) as MEDIANVALUE
FROM
(
SELECT PKEY,VALUE,ROWA,ROWD,
'FLAG' = (CASE WHEN ROWA IN (ROWD,ROWD-1,ROWD+1) THEN 1 ELSE 0 END)
FROM
(
SELECT
PKEY,
cast(VALUE as decimal(5,2)) as VALUE,
ROWA,
ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY ROWA DESC) as ROWD
FROM
(
SELECT
PKEY,
VALUE,
ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY VALUE ASC,PKEY ASC ) as ROWA
FROM [MTEST]
)T1
)T2
)T3
WHERE FLAG = '1'
GROUP BY PKEY
ORDER BY PKEY
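Justin's example above is very good, but the need for a primary key should be stated very clearly. I have seen that code in the wild without the key, and the results are bad.
The complaint I have about Percentile_Cont is that it won't give you an actual value from the dataset. To get a "median" that is an actual value from the dataset, use Percentile_Disc.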
SELECT SalesOrderID, OrderQty,
PERCENTILE_DISC(0.5)
WITHIN GROUP (ORDER BY OrderQty)
OVER (PARTITION BY SalesOrderID) AS MedianDisc
FROM Sales.SalesOrderDetail
WHERE SalesOrderID IN (43670, 43669, 43667, 43663)
ORDER BY SalesOrderID DESC
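To show the difference between the two functions concretely, here is a Python sketch of how I understand them: PERCENTILE_CONT interpolates linearly between the neighbouring rows, and PERCENTILE_DISC returns the smallest value whose cumulative distribution is at least the requested fraction. The function names just mirror the SQL ones; this is not the engine's implementation, and non-empty input is assumed.

```python
import math

def percentile_cont(values, p):
    ordered = sorted(values)
    n = len(ordered)
    rn = p * (n - 1)                      # 0-based fractional row position
    lo, hi = math.floor(rn), math.ceil(rn)
    if lo == hi:
        return float(ordered[lo])
    # Linear interpolation between the two neighbouring rows.
    return ordered[lo] * (hi - rn) + ordered[hi] * (rn - lo)

def percentile_disc(values, p):
    ordered = sorted(values)
    n = len(ordered)
    for i, value in enumerate(ordered, start=1):
        # i / n is the share of rows at or before this position.
        if i / n >= p:
            return value
```

For the quantities 1, 2, 3, 4 the continuous version gives 2.5, which is not in the data, while the discrete version gives 2.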
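In a UDF, write:
-- "Table" and medianSortColumn are placeholders for your own table and column.
-- Returns the lower of the two middle values when the row count is even.
Select Top 1 medianSortColumn from Table T
Where (Select Count(*) from Table
Where medianSortColumn <= T.medianSortColumn)
>= ((Select Count(*) From Table) + 1) / 2
Order By medianSortColumn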
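Finding the median
This is the simplest way to find the median of an attribute.
Select round(S.salary,4) median from employee S where (select count(salary) from employee where salary < S.salary ) = (select count(salary) from employee where salary > S.salary)
See other solutions for median calculation in SQL here: "Simple way to calculate median with MySQL" (the solutions are mostly vendor-independent).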
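Using a COUNT aggregate, you can first count how many rows there are and store the result in a variable called @cnt. Then you can compute the parameters for the OFFSET-FETCH filter to specify, based on quantity ordering, how many rows to skip (the offset value) and how many to return (the fetch value).
The number of rows to skip is (@cnt - 1) / 2. It's clear that this calculation is correct for an odd count, because you first subtract 1 for the single middle value before dividing by 2.
It is also correct for an even count, because the division used in the expression is integer division. When you subtract 1 from an even count, you are left with an odd value.
When that odd value is divided by 2, the fractional part of the result (.5) is truncated. The number of rows to fetch is 2 - (@cnt % 2). The idea is that when the count is odd, the result of the modulo operation is 1 and you need to fetch 1 row. When the count is even, the result of the modulo operation is 0 and you need to fetch 2 rows. By subtracting the 1 or 0 result of the modulo operation from 2, you get the desired 1 or 2, respectively. Finally, to compute the median quantity, take the one or two result quantities and apply an average after converting the input integer value to a numeric one, as follows: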
DECLARE @cnt AS INT = (SELECT COUNT(*) FROM [Sales].[production].[stocks]);
SELECT AVG(1.0 * quantity) AS median
FROM ( SELECT quantity
FROM [Sales].[production].[stocks]
ORDER BY quantity
OFFSET (@cnt - 1) / 2 ROWS FETCH NEXT 2 - @cnt % 2 ROWS ONLY ) AS D;
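The offset and fetch arithmetic described above can be checked with a few lines of Python. This is just a sketch with made-up function names; list slicing stands in for OFFSET ... FETCH, and the list is assumed to be non-empty.

```python
def offset_and_fetch(cnt):
    offset = (cnt - 1) // 2   # rows to skip (integer division truncates the .5)
    fetch = 2 - cnt % 2       # 1 row for odd counts, 2 rows for even counts
    return offset, fetch

def offset_fetch_median(quantities):
    ordered = sorted(quantities)
    offset, fetch = offset_and_fetch(len(ordered))
    window = ordered[offset:offset + fetch]
    return sum(window) / len(window)
```

For counts 4 and 5 this gives (1, 2) and (2, 1) respectively: skip one row and average the next two, or skip two rows and take the single middle one.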
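I wanted to work out a solution by myself, but my brain tripped and fell on the way. I think it works, but don't ask me to explain it in the morning. :P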
DECLARE @table AS TABLE
(
Number int not null
);
insert into @table select 2;
insert into @table select 4;
insert into @table select 9;
insert into @table select 15;
insert into @table select 22;
insert into @table select 26;
insert into @table select 37;
insert into @table select 49;
DECLARE @Count AS INT
SELECT @Count = COUNT(*) FROM @table;
WITH MyResults(RowNo, Number) AS
(
SELECT RowNo, Number FROM
(SELECT ROW_NUMBER() OVER (ORDER BY Number) AS RowNo, Number FROM @table) AS Foo
)
SELECT AVG(Number) FROM MyResults WHERE RowNo = (@Count+1)/2 OR RowNo = ((@Count+1)%2) * ((@Count+2)/2)
--Create Temp Table to Store Results in
DECLARE @results AS TABLE
(
[Month] datetime not null
,[Median] int not null
);
--This variable will determine the date
DECLARE @IntDate as int
set @IntDate = -13
WHILE (@IntDate < 0)
BEGIN
--Create Temp Table
DECLARE @table AS TABLE
(
[Rank] int not null
,[Days Open] int not null
);
--Insert records into Temp Table
insert into @table
SELECT
rank() OVER (ORDER BY DATEADD(mm, DATEDIFF(mm, 0, DATEADD(ss, SVR.close_date, '1970')), 0), DATEDIFF(day,DATEADD(ss, SVR.open_date, '1970'),DATEADD(ss, SVR.close_date, '1970')),[SVR].[ref_num]) as [Rank]
,DATEDIFF(day,DATEADD(ss, SVR.open_date, '1970'),DATEADD(ss, SVR.close_date, '1970')) as [Days Open]
FROM
mdbrpt.dbo.View_Request SVR
LEFT OUTER JOIN dbo.dtv_apps_systems vapp
on SVR.category = vapp.persid
LEFT OUTER JOIN dbo.prob_ctg pctg
on SVR.category = pctg.persid
Left Outer Join [mdbrpt].[dbo].[rootcause] as [Root Cause]
on [SVR].[rootcause]=[Root Cause].[id]
Left Outer Join [mdbrpt].[dbo].[cr_stat] as [Status]
on [SVR].[status]=[Status].[code]
LEFT OUTER JOIN [mdbrpt].[dbo].[net_res] as [net]
on [net].[id]=SVR.[affected_rc]
WHERE
SVR.Type IN ('P')
AND
SVR.close_date IS NOT NULL
AND
[Status].[SYM] = 'Closed'
AND
SVR.parent is null
AND
[Root Cause].[sym] in ( 'RC - Application','RC - Hardware', 'RC - Operational', 'RC - Unknown')
AND
(
[vapp].[appl_name] in ('3PI','Billing Rpts/Files','Collabrent','Reports','STMS','STMS 2','Telco','Comergent','OOM','C3-BAU','C3-DD','DIRECTV','DIRECTV Sales','DIRECTV Self Care','Dealer Website','EI Servlet','Enterprise Integration','ET','ICAN','ODS','SB-SCM','SeeBeyond','Digital Dashboard','IVR','OMS','Order Services','Retail Services','OSCAR','SAP','CTI','RIO','RIO Call Center','RIO Field Services','FSS-RIO3','TAOS','TCS')
OR
pctg.sym in ('Systems.Release Health Dashboard.Problem','DTV QA Test.Enterprise Release.Deferred Defect Log')
AND
[Net].[nr_desc] in ('3PI','Billing Rpts/Files','Collabrent','Reports','STMS','STMS 2','Telco','Comergent','OOM','C3-BAU','C3-DD','DIRECTV','DIRECTV Sales','DIRECTV Self Care','Dealer Website','EI Servlet','Enterprise Integration','ET','ICAN','ODS','SB-SCM','SeeBeyond','Digital Dashboard','IVR','OMS','Order Services','Retail Services','OSCAR','SAP','CTI','RIO','RIO Call Center','RIO Field Services','FSS-RIO3','TAOS','TCS')
)
AND
DATEADD(mm, DATEDIFF(mm, 0, DATEADD(ss, SVR.close_date, '1970')), 0) = DATEADD(mm, DATEDIFF(mm,0,DATEADD(mm,@IntDate,getdate())), 0)
ORDER BY [Days Open]
DECLARE @Count AS INT
SELECT @Count = COUNT(*) FROM @table;
WITH MyResults(RowNo, [Days Open]) AS
(
SELECT RowNo, [Days Open] FROM
(SELECT ROW_NUMBER() OVER (ORDER BY [Days Open]) AS RowNo, [Days Open] FROM @table) AS Foo
)
insert into @results
SELECT
DATEADD(mm, DATEDIFF(mm,0,DATEADD(mm,@IntDate,getdate())), 0) as [Month]
,AVG([Days Open])as [Median] FROM MyResults WHERE RowNo = (@Count+1)/2 OR RowNo = ((@Count+1)%2) * ((@Count+2)/2)
set @IntDate = @IntDate+1
DELETE FROM @table
END
select *
from @results
order by [Month]
This works with SQL 2000:
DECLARE @testTable TABLE
(
VALUE INT
)
--INSERT INTO @testTable -- Even Test
--SELECT 3 UNION ALL
--SELECT 5 UNION ALL
--SELECT 7 UNION ALL
--SELECT 12 UNION ALL
--SELECT 13 UNION ALL
--SELECT 14 UNION ALL
--SELECT 21 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 29 UNION ALL
--SELECT 40 UNION ALL
--SELECT 56
--
--INSERT INTO @testTable -- Odd Test
--SELECT 3 UNION ALL
--SELECT 5 UNION ALL
--SELECT 7 UNION ALL
--SELECT 12 UNION ALL
--SELECT 13 UNION ALL
--SELECT 14 UNION ALL
--SELECT 21 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 29 UNION ALL
--SELECT 39 UNION ALL
--SELECT 40 UNION ALL
--SELECT 56
DECLARE @RowAsc TABLE
(
ID INT IDENTITY,
Amount INT
)
INSERT INTO @RowAsc
SELECT VALUE
FROM @testTable
ORDER BY VALUE ASC
SELECT AVG(amount)
FROM @RowAsc ra
WHERE ra.id IN
(
SELECT ID
FROM @RowAsc
WHERE ra.id -
(
SELECT MAX(id) / 2.0
FROM @RowAsc
) BETWEEN 0 AND 1
)
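The BETWEEN 0 AND 1 filter is a little cryptic, so here is a Python sketch of what it selects. The function name is mine, and it assumes the IDENTITY column runs from 1 to the row count without gaps, in sorted order.

```python
def median_identity_window(values):
    ordered = sorted(values)
    n = len(ordered)  # plays the role of MAX(id)
    # Keep the rows whose id lies between n/2 and n/2 + 1, inclusive.
    picked = [v for row_id, v in enumerate(ordered, start=1)
              if 0 <= row_id - n / 2.0 <= 1]
    return sum(picked) / len(picked)
```

With the commented-out even test data (14 rows), ids 7 and 8 qualify; with the odd test data (15 rows), only id 8 does. Note that the sketch uses true division, whereas AVG over an INT column in T-SQL returns an integer, so the query may need a cast if the two middle values do not average to a whole number.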
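For newbies like me who are learning the very basics, I personally find this example easier to follow, because it is easier to see exactly what is happening and where the median values come from...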
select
( max(a.[Value1]) + min(a.[Value1]) ) / 2 as [Median Value1]
,( max(a.[Value2]) + min(a.[Value2]) ) / 2 as [Median Value2]
from (select
datediff(dd,startdate,enddate) as [Value1]
,xxxxxxxxxxxxxx as [Value2]
from dbo.table1
)a
In absolute awe of some of the code above, though!!!
The following solution works under these assumptions:
Code:
IF OBJECT_ID('dbo.R', 'U') IS NOT NULL
DROP TABLE dbo.R
CREATE TABLE R (
A FLOAT NOT NULL);
INSERT INTO R VALUES (1);
INSERT INTO R VALUES (2);
INSERT INTO R VALUES (3);
INSERT INTO R VALUES (4);
INSERT INTO R VALUES (5);
INSERT INTO R VALUES (6);
-- Returns Median(R)
select SUM(A) / CAST(COUNT(A) AS FLOAT)
from R R1
where ((select count(A) from R R2 where R1.A > R2.A) =
(select count(A) from R R2 where R1.A < R2.A)) OR
((select count(A) from R R2 where R1.A > R2.A) + 1 =
(select count(A) from R R2 where R1.A < R2.A)) OR
((select count(A) from R R2 where R1.A > R2.A) =
(select count(A) from R R2 where R1.A < R2.A) + 1) ;
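The three OR branches boil down to "the number of smaller rows and the number of larger rows differ by at most one". A Python sketch of that test follows. The function name is illustrative, and because the list of assumptions did not survive in the text, the sketch simply assumes distinct, non-NULL values, which is my own assumption.

```python
def median_by_counting(values):
    picked = []
    for a in values:
        smaller = sum(1 for b in values if b < a)
        larger = sum(1 for b in values if b > a)
        # Equal counts: the single middle row. Off by one: one of the two middle rows.
        if abs(smaller - larger) <= 1:
            picked.append(a)
    return sum(picked) / len(picked)
```

For the six sample rows 1 to 6, the values 3 and 4 qualify, giving 3.5. Since every row is compared with every other row, the work grows quadratically with the table size.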
DECLARE @Obs int
DECLARE @RowAsc table
(
ID INT IDENTITY,
Observation FLOAT
)
INSERT INTO @RowAsc
SELECT Observations FROM MyTable
ORDER BY 1
SELECT @Obs=COUNT(*)/2 FROM @RowAsc
SELECT Observation AS Median FROM @RowAsc WHERE ID=@Obs
I tried several alternatives, but because my data has repeated values, the ROW_NUMBER versions did not work for me. So here is the query I used (a version with NTILE):
SELECT distinct
CustomerId,
(
MAX(CASE WHEN Percent50_Asc=1 THEN TotalDue END) OVER (PARTITION BY CustomerId) +
MIN(CASE WHEN Percent50_desc=1 THEN TotalDue END) OVER (PARTITION BY CustomerId)
)/2 MEDIAN
FROM
(
SELECT
CustomerId,
TotalDue,
NTILE(2) OVER (
PARTITION BY CustomerId
ORDER BY TotalDue ASC) AS Percent50_Asc,
NTILE(2) OVER (
PARTITION BY CustomerId
ORDER BY TotalDue DESC) AS Percent50_desc
FROM Sales.SalesOrderHeader SOH
) x
ORDER BY CustomerId;
Building on Jeff Atwood's answer above, here it is with GROUP BY and a correlated subquery to get the median for each group.
SELECT TestID,
(
(SELECT MAX(Score) FROM
(SELECT TOP 50 PERCENT Score FROM Posts WHERE TestID = Posts_parent.TestID ORDER BY Score) AS BottomHalf)
+
(SELECT MIN(Score) FROM
(SELECT TOP 50 PERCENT Score FROM Posts WHERE TestID = Posts_parent.TestID ORDER BY Score DESC) AS TopHalf)
) / 2 AS MedianScore,
AVG(Score) AS AvgScore, MIN(Score) AS MinScore, MAX(Score) AS MaxScore
FROM Posts_parent
GROUP BY Posts_parent.TestID
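Frequently we need to calculate the median not just for the whole table, but for aggregates with respect to some ID. In other words, calculate the median for each ID in our table, where each ID has many records. (Based on the solution edited by @gdoron: good performance, and it works in many SQL dialects.)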
SELECT our_id, AVG(1.0 * our_val) as Median
FROM
( SELECT our_id, our_val,
COUNT(*) OVER (PARTITION BY our_id) AS cnt,
ROW_NUMBER() OVER (PARTITION BY our_id ORDER BY our_val) AS rnk
FROM our_table
) AS x
WHERE rnk IN ((cnt + 1)/2, (cnt + 2)/2) GROUP BY our_id;
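Hope it helps.
For your question, Jeff Atwood has already given a simple and effective solution. But if you are looking for an alternative approach to calculating the median, the SQL code below will help you.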
create table employees(salary int);
insert into employees values(8); insert into employees values(23); insert into employees values(45); insert into employees values(123); insert into employees values(93); insert into employees values(2342); insert into employees values(2238);
select * from employees;
declare @odd_even int; declare @cnt int; declare @middle_no int;
set @cnt=(select count(*) from employees); set @middle_no=(@cnt/2)+1; select @odd_even=case when (@cnt%2=0) THEN -1 ELse 0 END ;
-- No GROUP BY here: duplicates must keep their own row numbers so that rno lines up with @cnt.
select AVG(1.0 * tbl.salary) from (select salary, ROW_NUMBER() over (order by salary) as rno from employees) tbl where tbl.rno=@middle_no or tbl.rno=@middle_no+@odd_even;
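If you are looking to calculate the median in MySQL, this github link will be useful.
This is the most efficient solution for finding medians that I can think of. The names in the example are based on Justin's example. Make sure an index exists on the table Sales.SalesOrderHeader with the index columns CustomerId and TotalDue, in that order.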
SELECT
sohCount.CustomerId,
AVG(sohMid.TotalDue) as TotalDueMedian
FROM
(SELECT
soh.CustomerId,
COUNT(*) as NumberOfRows
FROM
Sales.SalesOrderHeader soh
GROUP BY soh.CustomerId) As sohCount
CROSS APPLY
(Select
soh.TotalDue
FROM
Sales.SalesOrderHeader soh
WHERE soh.CustomerId = sohCount.CustomerId
ORDER BY soh.TotalDue
OFFSET sohCount.NumberOfRows / 2 - ((sohCount.NumberOfRows + 1) % 2) ROWS
FETCH NEXT 1 + ((sohCount.NumberOfRows + 1) % 2) ROWS ONLY
) As sohMid
GROUP BY sohCount.CustomerId
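UPDATE
I was a bit unsure about which method performs best, so I compared my method with Justin Grant's and Jeff Atwood's by running queries based on all three methods in one batch. The batch cost of each query was:
Without index:
And with index:
I tried to see how well the queries scale when the index exists by creating more data from the roughly 14,000 rows, multiplying by factors of 2 up to 512, which means around 7.2 million rows in the end. Note that I made sure the CustomerId field was unique for each single copy, so the proportion of rows to unique CustomerId values stayed constant. While doing this I ran executions where I rebuilt the index afterwards, and I noticed that with my data the results stabilized at around a factor of 128 at these values:
I wondered how performance would be affected by scaling the number of rows while keeping the number of unique CustomerId values constant, so I set up a new test that did just that. Now, instead of stabilizing, the batch cost ratio kept diverging. Also, instead of an average of about 20 rows per CustomerId, I ended up with about 10,000 rows per unique ID. The numbers were:
I made sure I had implemented each method correctly by comparing the results. My conclusion is that the method I used is generally faster as long as the index exists. I also noticed that this method is the one recommended for this particular problem in this article: https://www.microsoftpressstore.com/articles/article.aspx?p=2314819&seqNum=5
One way to further improve the performance of subsequent calls to this query is to persist the count information in an auxiliary table. You could even maintain it with a trigger that updates and holds the count of SalesOrderHeader rows per CustomerId. Of course, you could then simply store the median as well.
For large-scale datasets, you can try this gist: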
https://gist.github.com/chrisknoll/1b38761ce8c5016ec5b2
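It works by aggregating the distinct values you would find in your set (such as ages or years of birth), and uses SQL window functions to locate any percentile position you specify in the query.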
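To give a feel for the general idea, here is a Python sketch. It is not the gist's code, and the function name and the discrete-percentile rule are my own assumptions: collapse the data to (value, count) pairs, then walk a running total until it reaches the requested fraction of the rows.

```python
from collections import Counter

def percentile_from_counts(values, p):
    counts = Counter(values)          # one entry per distinct value
    total = sum(counts.values())
    running = 0
    for value in sorted(counts):
        # Running total of rows, comparable to SUM(n) OVER (ORDER BY value).
        running += counts[value]
        if running >= p * total:
            return value
```

The appeal for large tables is that the sort and the running total operate on the distinct values only, which can be far fewer than the rows.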