Query optimization: time intervals



In general, I have two kinds of time intervals:

presence time
absence time

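absence time can be of different types (e.g. breaks, absences, special days and so on), and time intervals may overlap and/or intersect.

It is not guaranteed that the raw data contains only plausible combinations of intervals. For example, overlapping presence intervals make no sense, but they may exist. I have now tried several ways to determine the resulting presence intervals; the most comfortable one for me seems to be the following.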

;with "timestamps"
as
(
    select
        "id" = row_number() over ( order by "empId", "timestamp", "opening", "type" )
        , "empId"
        , "timestamp"
        , "type"
        , "opening"
    from
    (
        select "empId", "timestamp", "type", case when "types" = 'starttime' then 1 else -1 end as "opening" from
        ( select "empId", "starttime", "endtime", 1 as "type" from "worktime" ) as data
        unpivot ( "timestamp" for "types" in ( "starttime", "endtime" ) ) as pvt
        union all
        select "empId", "timestamp", "type", case when "types" = 'starttime' then 1 else -1 end as "opening" from
        ( select "empId", "starttime", "endtime", 2 as "type" from "break" ) as data
        unpivot ( "timestamp" for "types" in ( "starttime", "endtime" ) ) as pvt
        union all
        select "empId", "timestamp", "type", case when "types" = 'starttime' then 1 else -1 end as "opening" from
        ( select "empId", "starttime", "endtime", 3 as "type" from "absence" ) as data
        unpivot ( "timestamp" for "types" in ( "starttime", "endtime" ) ) as pvt
    ) as data
)
select 
      T1."empId"
    , "starttime"   = T1."timestamp"
    , "endtime"     = T2."timestamp"
from 
    "timestamps" as T1
    left join "timestamps" as T2
        on T2."empId" = T1."empId"
        and T2."id" = T1."id" + 1
    left join "timestamps" as RS
        on RS."empId" = T2."empId"
        and RS."id" <= T1."id"      
group by
    T1."empId", T1."timestamp", T2."timestamp"
having
    (sum( power( 2, RS."type" ) * RS."opening" ) = 2)
order by 
    T1."empId", T1."timestamp";

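For some demo data, see the SQL Fiddle.

The raw data exists in different tables, either in the form "starttime" - "endtime" or in the form "starttime" - "duration".

The idea is to get an ordered list of every timestamp, together with a "bitmasked" rolling sum of the intervals that are open at that point in time, and to use that sum to estimate the presence time.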
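To make the bitmask concrete: each event contributes power( 2, "type" ), with a positive sign when an interval opens and a negative sign when it closes, so work counts as 2, break as 4 and absence as 8. The sketch below decodes a few running sums. The literal values and the column aliases are made up for illustration and are not taken from the demo data. It assumes that at most one interval of each type is open at any moment; if two work intervals overlap, the sum becomes 4 and can no longer be told apart from an open break.

```sql
-- Illustrative running sums only; not taken from the demo data.
select
      s."rolSum"
    , "workOpen"    = case when ( s."rolSum" & 2 ) <> 0 then 1 else 0 end
    , "breakOpen"   = case when ( s."rolSum" & 4 ) <> 0 then 1 else 0 end
    , "absenceOpen" = case when ( s."rolSum" & 8 ) <> 0 then 1 else 0 end
    -- present means: work is the only open interval, i.e. the sum is exactly 2
    , "present"     = case when s."rolSum" = 2 then 1 else 0 end
from
    ( values ( 0 ), ( 2 ), ( 6 ), ( 10 ), ( 4 ) ) as s ( "rolSum" );
```

This is the same test the HAVING clause of the query applies: only a sum of exactly 2 counts as presence.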

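The fiddle works and gives the expected results even if the start times of different intervals are equal. No indexes are used in this example.

Is this the right way to accomplish the task in question, or is there a more elegant way?

If it is relevant for answering: the amount of data is at most ten thousand records per employee per table. SQL Server 2012 is not available, so I cannot calculate a rolling sum over the preceding rows inline in the aggregate.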
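For readers who do have SQL Server 2012 or later, the rolling sum can be computed inline with a windowed aggregate instead of a self-join. This is only a sketch: the table variable @events and its four sample rows are assumptions made for illustration, standing in for the unpivoted rows of the "timestamps" CTE.

```sql
-- Hypothetical sample events: one work interval with a break inside it.
declare @events table ( "empId" int, "timestamp" datetime, "type" int, "opening" int );

insert into @events ( "empId", "timestamp", "type", "opening" ) values
      ( 1, '2013-01-01T08:00:00', 1,  1 )   -- work starts
    , ( 1, '2013-01-01T12:00:00', 2,  1 )   -- break starts
    , ( 1, '2013-01-01T12:30:00', 2, -1 )   -- break ends
    , ( 1, '2013-01-01T17:00:00', 1, -1 );  -- work ends

-- Requires SQL Server 2012+: running total per employee in timestamp order.
select
      "empId"
    , "timestamp"
    , "rolSum" = sum( power( 2, "type" ) * "opening" )
                 over ( partition by "empId"
                        order by "timestamp", "opening", "type"
                        rows unbounded preceding )
from
    @events
order by
    "empId", "timestamp";
```

For these four rows the running sums are 2, 6, 2 and 0, so the employee counts as present from 08:00 to 12:00 and from 12:30 to 17:00.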


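Edit:

I just executed the query against larger amounts of test data (1,000, 10,000, 100,000 and 1 million rows) and can see that the runtime increases exponentially. Obviously a warning sign, right?

I changed the query and removed the rolling-sum aggregation by using a quirky update.

I added an auxiliary table: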

create table timestamps
(
  "id" int
  , "empId" int
  , "timestamp" datetime
  , "type" int
  , "opening" int
  , "rolSum" int
)

create nonclustered index "idx" on "timestamps" ( "rolSum" ) include ( "id", "empId", "timestamp" )

I moved the calculation of the rolling sum to this place:

declare @rolSum int = 0
update "timestamps" set @rolSum = "rolSum" = @rolSum + power( 2, "type" ) * "opening" from "timestamps"
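The variable-assignment update relies on the rows being processed in "id" order, which SQL Server neither documents nor guarantees. Commonly cited precautions are a clustered index on the ordering column, an exclusive table lock and no parallelism. The sketch below applies them to a temporary table. The table #ts and its sample rows are assumptions for illustration, and even with these precautions the ordering remains undocumented behavior, so the result is worth spot-checking.

```sql
-- Hypothetical stand-in for the auxiliary table, clustered on the ordering column.
create table #ts
(
      "id" int not null primary key clustered
    , "empId" int
    , "timestamp" datetime
    , "type" int
    , "opening" int
    , "rolSum" int
);

insert into #ts ( "id", "empId", "timestamp", "type", "opening" ) values
      ( 1, 1, '2013-01-01T08:00:00', 1,  1 )
    , ( 2, 1, '2013-01-01T12:00:00', 2,  1 )
    , ( 3, 1, '2013-01-01T12:30:00', 2, -1 )
    , ( 4, 1, '2013-01-01T17:00:00', 1, -1 );

declare @rolSum int = 0;

-- Exclusive lock and a serial plan reduce, but do not eliminate, the ordering risk.
update #ts with ( tablockx )
set @rolSum = "rolSum" = @rolSum + power( 2, "type" ) * "opening"
option ( maxdop 1 );

select "id", "timestamp", "rolSum" from #ts order by "id";

drop table #ts;
```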

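See the SQL Fiddle here.

With 1 million entries in the "worktime" table, the runtime decreased to 3 seconds.

The question remains the same: what is the most efficient way to solve this?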


I'm sure this will be contested, but you could try not doing this in a CTE. Use temp tables instead and see whether it is faster.
rottengeek

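Just a style question: I have never seen anyone put all column and table names in double quotes. Is this the practice across your whole company? I definitely find it uncomfortable. I don't think it is necessary, so it just adds noise to the signal...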
ErikE

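@ErikE The method above is part of a huge add-on. Some objects are created dynamically and depend on selections entered by the end user. So, for example, spaces may appear in table or view names. The double quotes keep the query from crashing in those cases!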
Nico

@Nico In my world that is usually done with square brackets, like [this]. I guess I just like them better than double quotes.
ErikE

@ErikE Square brackets are T-SQL. Double quotes are the standard! Anyway, that is how I learned it, so I'm used to it!
Nico

Answers:



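I can't answer your question about the absolute best method. But I can offer a different way of solving the problem, which may or may not be better. It has a reasonable execution plan, and I think it will perform well. (I'm eager to know, so please share the results!)

I apologize for using my own syntax style instead of yours; the query wizardry comes to me more easily when everything lines up in its usual place.

The query is available in a SqlFiddle. I threw in an overlap for EmpID 1 to make sure that case is covered. If you eventually find that overlaps cannot occur in the presence data, you can remove the final query and the Dense_Rank calculations.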

WITH Points AS (
  SELECT DISTINCT
    T.EmpID,
    P.TimePoint
  FROM
    (
      SELECT * FROM dbo.WorkTime
      UNION SELECT * FROM dbo.BreakTime
      UNION SELECT * FROM dbo.Absence
    ) T
    CROSS APPLY (VALUES (StartTime), (EndTime)) P (TimePoint)
), Groups AS (
  SELECT
    P.EmpID,
    P.TimePoint,
    Grp =
      Row_Number()
      OVER (PARTITION BY P.EmpID ORDER BY P.TimePoint, X.Which) / 2
  FROM
    Points P
    CROSS JOIN (VALUES (1), (2)) X (Which)
), Ranges AS (
  SELECT
    G.EmpID,
    StartTime = Min(G.TimePoint),
    EndTime = Max(G.TimePoint)
  FROM Groups G
  GROUP BY
    G.EmpID,
    G.Grp
  HAVING Count(*) = 2
), Presences AS (
  SELECT
    R.*,
    P.Present,
    Grp =
       Dense_Rank() OVER (PARTITION BY R.EmpID ORDER BY R.StartTime)
       - Dense_Rank() OVER (PARTITION BY R.EmpID, P.Present ORDER BY R.StartTime)
  FROM
    Ranges R
    CROSS APPLY (
      SELECT
        CASE WHEN EXISTS (
          SELECT *
          FROM dbo.WorkTime W
          WHERE
            R.EmpID = W.EmpID
            AND R.StartTime < W.EndTime
            AND W.StartTime < R.EndTime
        ) AND NOT EXISTS (
          SELECT *
          FROM dbo.BreakTime B
          WHERE
            R.EmpID = B.EmpID
            AND R.StartTime < B.EndTime
            AND B.StartTime < R.EndTime
        ) AND NOT EXISTS (
          SELECT *
          FROM dbo.Absence A
          WHERE
            R.EmpID = A.EmpID
            AND R.StartTime < A.EndTime
            AND A.StartTime < R.EndTime
        ) THEN 1 ELSE 0 END
    ) P (Present)
)
SELECT
  EmpID,
  StartTime = Min(StartTime),
  EndTime = Max(EndTime)
FROM Presences
WHERE Present = 1
GROUP BY
  EmpID,
  Grp
ORDER BY
  EmpID,
  StartTime;

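Note: the performance of this query would improve if you combined the three tables into one and added a column to indicate what kind of time each row is: work, break, or absence.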
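One possible shape for such a combined table is sketched below. The table name dbo.TimeRange, the TimeType column with its codes, and the index are all assumptions. The sketch also assumes that the three source tables expose EmpID, StartTime and EndTime, as the query above implies. Note that UNION ALL keeps duplicate rows, whereas the UNION in the query above removes them.

```sql
-- Hypothetical combined table; names and type codes are illustrative only.
CREATE TABLE dbo.TimeRange
(
  EmpID     int      NOT NULL,
  StartTime datetime NOT NULL,
  EndTime   datetime NOT NULL,
  TimeType  tinyint  NOT NULL  -- 1 = work, 2 = break, 3 = absence (assumed codes)
);

-- Load it from the three existing tables, tagging each row with its kind.
INSERT INTO dbo.TimeRange (EmpID, StartTime, EndTime, TimeType)
SELECT EmpID, StartTime, EndTime, 1 FROM dbo.WorkTime
UNION ALL
SELECT EmpID, StartTime, EndTime, 2 FROM dbo.BreakTime
UNION ALL
SELECT EmpID, StartTime, EndTime, 3 FROM dbo.Absence;

-- An index that supports the per-employee range probes.
CREATE CLUSTERED INDEX CIX_TimeRange
  ON dbo.TimeRange (EmpID, StartTime, EndTime);
```

With a table like this, the three EXISTS probes could read from a single source and filter on TimeType. Whether that actually runs faster would need to be measured.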

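Why all the CTEs, you ask? Because each one is forced by what I need to do to the data. There is an aggregate, or I need to put a WHERE condition on a window function, or I need to use it in a clause where window functions are not allowed.

Now I'm going to go off and see whether I can think of another strategy for accomplishing this. :)

For amusement, I include here the "diagram" I made to help solve the problem: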

------------
   -----------------
                ---------------
                           -----------

    ---    ------   ------       ------------

----   ----      ---      -------

The three sets of dashes (separated by blank lines) represent, in order: presence data, absence data, and the desired result.


Thanks for this approach. I'll check it when I'm back in the office and give you runtime results on a larger database.
Nico

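The runtime is definitely much higher than with the first approach. I haven't had time to check whether further indexes could reduce it. Will check as soon as possible!
Nico 2013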

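I have another idea that I haven't had time to work on. For what it's worth, your query returns incorrect results when there are overlapping ranges in all the tables.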
ErikE

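I checked this again; see this fiddle, which has completely overlapping intervals in all three tables. As far as I can see, it returns correct results. Can you provide a case in which wrong results are returned? Feel free to adjust the demo data in the fiddle!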
Nico

All right, I get your point. If intervals intersect within one table, the results go crazy. Will check this.
Nico
Licensed under cc by-sa 3.0 with attribution required.