计算总访问量


12

我正在尝试编写一个查询,其中我必须通过照顾重叠的日子来计算客户的访问次数。假设itemID 2009的开始日期为23日,结束日期为26日,因此项目20010在这几天之间,我们将不将此购买日期添加到我们的总数中。

示例场景:

Item ID Start Date   End Date   Number of days     Number of days Candidate for visit count
20009   2015-01-23  2015-01-26     4                      4
20010   2015-01-24  2015-01-24     1                      0
20011   2015-01-23  2015-01-26     4                      0
20012   2015-01-23  2015-01-27     5                      1
20013   2015-01-23  2015-01-27     5                      0
20014   2015-01-29  2015-01-30     2                      2

输出应为7 VisitDays

输入表:

CREATE TABLE #Items    
(
CustID INT,
ItemID INT,
StartDate DATETIME,
EndDate DATETIME
)           


INSERT INTO #Items
SELECT 11205, 20009, '2015-01-23',  '2015-01-26'  
UNION ALL 
SELECT 11205, 20010, '2015-01-24',  '2015-01-24'    
UNION ALL  
SELECT 11205, 20011, '2015-01-23',  '2015-01-26' 
UNION ALL  
SELECT 11205, 20012, '2015-01-23',  '2015-01-27'  
UNION ALL  
SELECT 11205, 20012, '2015-01-23',  '2015-01-27'   
UNION ALL  
SELECT 11205, 20012, '2015-01-28',  '2015-01-29'  

到目前为止,我已经尝试过:

CREATE TABLE #VisitsTable
    (
      StartDate DATETIME,
      EndDate DATETIME
    )

INSERT  INTO #VisitsTable
        SELECT DISTINCT
                StartDate,
                EndDate
        FROM    #Items items
        WHERE   CustID = 11205
        ORDER BY StartDate ASC

IF EXISTS (SELECT TOP 1 1 FROM #VisitsTable) 
BEGIN 


SELECT  ISNULL(SUM(VisitDays),1)
FROM    ( SELECT DISTINCT
                    abc.StartDate,
                    abc.EndDate,
                    DATEDIFF(DD, abc.StartDate, abc.EndDate) + 1 VisitDays
          FROM      #VisitsTable abc
                    INNER JOIN #VisitsTable bc ON bc.StartDate NOT BETWEEN abc.StartDate AND abc.EndDate      
        ) Visits

END



--DROP TABLE #Items 
--DROP TABLE #VisitsTable      

Answers:


5

第一个查询创建不重叠的不同开始日期和结束日期范围。

注意:

  • 您的样本(id=0)与来自Ypercube(id=1)的样本混合
  • 对于每个id或大量id,此解决方案可能无法很好地扩展海量数据。这样做的好处是不需要数字表。对于大型数据集,数字表很可能会带来更好的性能。

查询:

SELECT DISTINCT its.id
    , Start_Date = its.Start_Date 
    , End_Date = COALESCE(DATEADD(day, -1, itmax.End_Date), CASE WHEN itmin.Start_Date > its.End_Date THEN itmin.Start_Date ELSE its.End_Date END)
    --, x1=itmax.End_Date, x2=itmin.Start_Date, x3=its.End_Date
FROM @Items its
OUTER APPLY (
    SELECT Start_Date = MAX(End_Date) FROM @Items std
    WHERE std.Item_ID <> its.Item_ID AND std.Start_Date < its.Start_Date AND std.End_Date > its.Start_Date
) itmin
OUTER APPLY (
    SELECT End_Date = MIN(Start_Date) FROM @Items std
    WHERE std.Item_ID <> its.Item_ID+1000 AND std.Start_Date > its.Start_Date AND std.Start_Date < its.End_Date
) itmax;

输出:

id  | Start_Date                    | End_Date                      
0   | 2015-01-23 00:00:00.0000000   | 2015-01-23 00:00:00.0000000   => 1
0   | 2015-01-24 00:00:00.0000000   | 2015-01-27 00:00:00.0000000   => 4
0   | 2015-01-29 00:00:00.0000000   | 2015-01-30 00:00:00.0000000   => 2
1   | 2016-01-20 00:00:00.0000000   | 2016-01-22 00:00:00.0000000   => 3
1   | 2016-01-23 00:00:00.0000000   | 2016-01-24 00:00:00.0000000   => 2
1   | 2016-01-25 00:00:00.0000000   | 2016-01-29 00:00:00.0000000   => 5

如果您将这些开始日期和结束日期与DATEDIFF一起使用:

SELECT DATEDIFF(day
    , its.Start_Date 
    , End_Date = COALESCE(DATEADD(day, -1, itmax.End_Date), CASE WHEN itmin.Start_Date > its.End_Date THEN itmin.Start_Date ELSE its.End_Date END)
) + 1
...

输出(重复)为:

  • 1、4和2的ID为0(您的样本=> SUM=7
  • ID为1的3、2和5(Ypercube样本=> SUM=10

然后,您只需将所有内容与SUM和组合在一起GROUP BY

SELECT id 
    , Days = SUM(
        DATEDIFF(day, Start_Date, End_Date)+1
    )
FROM (
    SELECT DISTINCT its.id
         , Start_Date = its.Start_Date 
        , End_Date = COALESCE(DATEADD(day, -1, itmax.End_Date), CASE WHEN itmin.Start_Date > its.End_Date THEN itmin.Start_Date ELSE its.End_Date END)
    FROM @Items its
    OUTER APPLY (
        SELECT Start_Date = MAX(End_Date) FROM @Items std
        WHERE std.Item_ID <> its.Item_ID AND std.Start_Date < its.Start_Date AND std.End_Date > its.Start_Date
    ) itmin
    OUTER APPLY (
        SELECT End_Date = MIN(Start_Date) FROM @Items std
        WHERE std.Item_ID <> its.Item_ID AND std.Start_Date > its.Start_Date AND std.Start_Date < its.End_Date
    ) itmax
) as d
GROUP BY id;

输出:

id  Days
0   7
1   10

具有2个不同ID的数据:

INSERT INTO @Items
    (id, Item_ID, Start_Date, End_Date)
VALUES 
    (0, 20009, '2015-01-23', '2015-01-26'),
    (0, 20010, '2015-01-24', '2015-01-24'),
    (0, 20011, '2015-01-23', '2015-01-26'),
    (0, 20012, '2015-01-23', '2015-01-27'),
    (0, 20013, '2015-01-23', '2015-01-27'),
    (0, 20014, '2015-01-29', '2015-01-30'),

    (1, 20009, '2016-01-20', '2016-01-24'),
    (1, 20010, '2016-01-23', '2016-01-26'),
    (1, 20011, '2016-01-25', '2016-01-29')

8

关于打包时间间隔有很多问题和文章。例如,Itzik Ben-Gan的“ 包装间隔”

您可以打包给定用户的时间间隔。一旦打包,就不会有重叠,因此您可以简单地总结打包间隔的持续时间。


如果您的时间间隔是没有时间的日期,我将使用Calendar表格。该表仅列出了几十年的日期。如果您没有日历表,只需创建一个:

CREATE TABLE [dbo].[Calendar](
    [dt] [date] NOT NULL,
CONSTRAINT [PK_Calendar] PRIMARY KEY CLUSTERED 
(
    [dt] ASC
));

很多方法可以填充这样的表

例如,从1900-01-01开始的100K行(〜270年):

INSERT INTO dbo.Calendar (dt)
SELECT TOP (100000) 
    DATEADD(day, ROW_NUMBER() OVER (ORDER BY s1.[object_id])-1, '19000101') AS dt
FROM sys.all_objects AS s1 CROSS JOIN sys.all_objects AS s2
OPTION (MAXDOP 1);

另请参见为什么数字表“无价”?

有了Calendar表格后,这里是如何使用它的方法。

每个原始行与参加Calendar表格,因为有之间的日期返回尽可能多的行StartDateEndDate

然后,我们计算不同的日期,从而删除重叠的日期。

SELECT COUNT(DISTINCT CA.dt) AS TotalCount
FROM
    #Items AS T
    CROSS APPLY
    (
        SELECT dbo.Calendar.dt
        FROM dbo.Calendar
        WHERE
            dbo.Calendar.dt >= T.StartDate
            AND dbo.Calendar.dt <= T.EndDate
    ) AS CA
WHERE T.CustID = 11205
;

结果

TotalCount
7

7

我非常同意a Numbers和一个Calendar表非常有用,如果使用Calendar表可以大大简化此问题。

不过,我会建议另一种解决方案(不需要日历表或窗口汇总-就像Itzik的链接文章中的某些答案一样)。它可能并非在所有情况下都是最有效的(或在所有情况下都可能是最差的!),但是我认为测试没有害处。

它的工作原理是首先找到不与其他间隔重叠的开始日期和结束日期,然后将它们放在两行中(分别是开始日期和结束日期),以便为其分配行号,最后将第一个开始日期与第一个结束日期进行匹配,第二个和第二个,依此类推:

WITH 
  start_dates AS
    ( SELECT CustID, StartDate,
             Rn = ROW_NUMBER() OVER (PARTITION BY CustID 
                                     ORDER BY StartDate)
      FROM items AS i
      WHERE NOT EXISTS
            ( SELECT *
              FROM Items AS j
              WHERE j.CustID = i.CustID
                AND j.StartDate < i.StartDate AND i.StartDate <= j.EndDate 
            )
      GROUP BY CustID, StartDate
    ),
  end_dates AS
    ( SELECT CustID, EndDate,
             Rn = ROW_NUMBER() OVER (PARTITION BY CustID 
                                     ORDER BY EndDate) 
      FROM items AS i
      WHERE NOT EXISTS
            ( SELECT *
              FROM Items AS j
              WHERE j.CustID = i.CustID
                AND j.StartDate <= i.EndDate AND i.EndDate < j.EndDate 
            )
      GROUP BY CustID, EndDate
    )
SELECT s.CustID, 
       Result = SUM( DATEDIFF(day, s.StartDate, e.EndDate) + 1 )
FROM start_dates AS s
  JOIN end_dates AS e
    ON  s.CustID = e.CustID
    AND s.Rn = e.Rn 
GROUP BY s.CustID ;

on (CustID, StartDate, EndDate)和on 两个索引(CustID, EndDate, StartDate)对于提高查询性能很有用。

与日历(也许是唯一的日历)相比,它的优势在于,它可以轻松地调整为使用datetime值并以不同的精度(更大(几周,几年)或更小(小时,分钟或秒,毫秒等),而不仅仅是计算日期。一分钟或几秒精度的日历表会很大,将它交叉(交叉)到一个大表将是一种非常有趣的体验,但可能不是最有效的。

(感谢弗拉基米尔·巴拉诺夫(Vladimir Baranov)):很难对性能进行适当的比较,因为不同方法的性能可能取决于数据分布。1)间隔多长时间-间隔越短,日历表的性能就越好,因为长间隔会产生很多中间行2)间隔多久重叠一次-大多数是不重叠的间隔,而大多数间隔都覆盖相同的范围。我认为Itzik解决方案的性能取决于此。可能还有其他方法可以使数据倾斜,并且很难说出各种方法的效率会受到影响。


1
我看到2份。或者,如果我们将反半联接算为2个一半,则可能是3个;)
ypercubeᵀᴹ16/

1
@wBob如果您进行了性能测试,请在答案中添加它们。我很高兴见到他们,当然还有很多人。这是该网站如何工作..
ypercubeᵀᴹ

3
@wBob不必那么好斗-没有人对性能表示担忧。如果您有自己的疑虑,欢迎运行自己的测试。您对答案有多复杂的主观评估并不是拒绝的原因。您如何执行自己的测试并扩展自己的答案,而不是降低另一个答案呢?如果需要,使自己的答案更值得赞成,但不要反对其他合法的答案。
Monkpit,

1
大声笑没有战斗在这里@Monkpit。完全有效的理由,以及对绩效的认真讨论。
wBob

2
@wBob,很难对性能进行正确的比较,因为不同方法的性能可能取决于数据分布。1)间隔多长时间-间隔越短,日历表的性能越好,因为长间隔会产生很多中间行2)间隔多长时间重叠一次-大多数是不重叠的间隔,而大多数间隔都覆盖相同的范围。我认为Itzik解决方案的性能取决于此。可能还有其他方式倾斜数据,这些只是我想到的几种方法。
弗拉基米尔·巴拉诺夫

2

我认为使用日历表可以很简单,例如:

SELECT i.CustID, COUNT( DISTINCT c.calendarDate ) days
FROM #Items i
    INNER JOIN calendar.main c ON c.calendarDate Between i.StartDate And i.EndDate
GROUP BY i.CustID

试验台

USE tempdb
GO

-- Cutdown calendar script
IF OBJECT_ID('dbo.calendar') IS NULL
BEGIN

    CREATE TABLE dbo.calendar (
        calendarId      INT IDENTITY(1,1) NOT NULL,
        calendarDate    DATE NOT NULL,

        CONSTRAINT PK_calendar__main PRIMARY KEY ( calendarDate ASC ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY],
        CONSTRAINT UK_calendar__main UNIQUE NONCLUSTERED ( calendarId ASC ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
END
GO


-- Populate calendar table once only
IF NOT EXISTS ( SELECT * FROM dbo.calendar )
BEGIN

    -- Populate calendar table
    WITH cte AS
    (
    SELECT 0 x
    UNION ALL
    SELECT x + 1
    FROM cte
    WHERE x < 11323 -- Do from year 1 Jan 2000 until 31 Dec 2030 (extend if required)
    )
    INSERT INTO dbo.calendar ( calendarDate )
    SELECT
        calendarDate
    FROM
        (
        SELECT 
            DATEADD( day, x, '1 Jan 2010' ) calendarDate,
            DATEADD( month, -7, DATEADD( day, x, '1 Jan 2010' ) ) academicDate
        FROM cte
        ) x
    WHERE calendarDate < '1 Jan 2031'
    OPTION ( MAXRECURSION 0 )

    ALTER INDEX ALL ON dbo.calendar REBUILD

END
GO





IF OBJECT_ID('tempdb..Items') IS NOT NULL DROP TABLE Items
GO

CREATE TABLE dbo.Items
    (
    CustID INT NOT NULL,
    ItemID INT NOT NULL,
    StartDate DATE NOT NULL,
    EndDate DATE NOT NULL,

    INDEX _cdx_Items CLUSTERED ( CustID, StartDate, EndDate )
    )
GO

INSERT INTO Items ( CustID, ItemID, StartDate, EndDate )
SELECT 11205, 20009, '2015-01-23',  '2015-01-26'  
UNION ALL 
SELECT 11205, 20010, '2015-01-24',  '2015-01-24'    
UNION ALL  
SELECT 11205, 20011, '2015-01-23',  '2015-01-26' 
UNION ALL  
SELECT 11205, 20012, '2015-01-23',  '2015-01-27'  
UNION ALL  
SELECT 11205, 20012, '2015-01-23',  '2015-01-27'   
UNION ALL  
SELECT 11205, 20012, '2015-01-28',  '2015-01-29'
GO


-- Scale up : )
;WITH cte AS (
SELECT TOP 1000000 ROW_NUMBER() OVER ( ORDER BY ( SELECT 1 ) ) rn
FROM master.sys.columns c1
    CROSS JOIN master.sys.columns c2
    CROSS JOIN master.sys.columns c3
)
INSERT INTO Items ( CustID, ItemID, StartDate, EndDate )
SELECT 11206 + rn % 999, 20012 + rn, DATEADD( day, rn % 333, '1 Jan 2015' ), DATEADD( day, ( rn % 333 ) + rn % 7, '1 Jan 2015' )
FROM cte
GO
--:exit



-- My query: Pros: simple, one copy of items, easy to understand and maintain.  Scales well to 1 million + rows.
-- Cons: requires calendar table.  Others?
SELECT i.CustID, COUNT( DISTINCT c.calendarDate ) days
FROM dbo.Items i
    INNER JOIN dbo.calendar c ON c.calendarDate Between i.StartDate And i.EndDate
GROUP BY i.CustID
--ORDER BY i.CustID
GO


-- Vladimir query: Pros: Effectively same as above
-- Cons: I wouldn't use CROSS APPLY where it's not necessary.  Fortunately optimizer simplifies avoiding RBAR (I think).
-- Point of style maybe, but in terms of queries being self-documenting I prefer number 1.
SELECT T.CustID, COUNT( DISTINCT CA.calendarDate ) AS TotalCount
FROM
    Items AS T
    CROSS APPLY
    (
        SELECT c.calendarDate
        FROM dbo.calendar c
        WHERE
            c.calendarDate >= T.StartDate
            AND c.calendarDate <= T.EndDate
    ) AS CA
GROUP BY T.CustID
--ORDER BY T.CustID
--WHERE T.CustID = 11205
GO


/*  WARNING!! This is commented out as it can't compete in the scale test.  Will finish at scale 100, 1,000, 10,000, eventually.  I got 38 mins for 10,0000.  Pegs CPU.  

-- Julian:  Pros; does not require calendar table.
-- Cons: over-complicated (eg versus Query 1 in terms of number of lines of code, clauses etc); three copies of dbo.Items table (we have already shown
-- this query is possible with one); does not scale (even at 100,000 rows query ran for 38 minutes on my test rig versus sub-second for first two queries).  <<-- this is serious.
-- Indexing could help.
SELECT DISTINCT
    CustID,
     StartDate = CASE WHEN itmin.StartDate < its.StartDate THEN itmin.StartDate ELSE its.StartDate END
    , EndDate = CASE WHEN itmax.EndDate > its.EndDate THEN itmax.EndDate ELSE its.EndDate END
FROM Items its
OUTER APPLY (
    SELECT StartDate = MIN(StartDate) FROM Items std
    WHERE std.ItemID <> its.ItemID AND (
        (std.StartDate <= its.StartDate AND std.EndDate >= its.StartDate)
        OR (std.StartDate >= its.StartDate AND std.StartDate <= its.EndDate)
    )
) itmin
OUTER APPLY (
    SELECT EndDate = MAX(EndDate) FROM Items std
    WHERE std.ItemID <> its.ItemID AND (
        (std.EndDate >= its.StartDate AND std.EndDate <= its.EndDate)
        OR (std.StartDate <= its.EndDate AND std.EndDate >= its.EndDate)
    )
) itmax
GO
*/

-- ypercube:  Pros; does not require calendar table.
-- Cons: over-complicated (eg versus Query 1 in terms of number of lines of code, clauses etc); four copies of dbo.Items table (we have already shown
-- this query is possible with one); does not scale well; at 1,000,000 rows query ran for 2:20 minutes on my test rig versus sub-second for first two queries.
WITH 
  start_dates AS
    ( SELECT CustID, StartDate,
             Rn = ROW_NUMBER() OVER (PARTITION BY CustID 
                                     ORDER BY StartDate)
      FROM items AS i
      WHERE NOT EXISTS
            ( SELECT *
              FROM Items AS j
              WHERE j.CustID = i.CustID
                AND j.StartDate < i.StartDate AND i.StartDate <= j.EndDate 
            )
      GROUP BY CustID, StartDate
    ),
  end_dates AS
    ( SELECT CustID, EndDate,
             Rn = ROW_NUMBER() OVER (PARTITION BY CustID 
                                     ORDER BY EndDate) 
      FROM items AS i
      WHERE NOT EXISTS
            ( SELECT *
              FROM Items AS j
              WHERE j.CustID = i.CustID
                AND j.StartDate <= i.EndDate AND i.EndDate < j.EndDate 
            )
      GROUP BY CustID, EndDate
    )
SELECT s.CustID, 
       Result = SUM( DATEDIFF(day, s.StartDate, e.EndDate) + 1 )
FROM start_dates AS s
  JOIN end_dates AS e
    ON  s.CustID = e.CustID
    AND s.Rn = e.Rn 
GROUP BY s.CustID ;

2
虽然效果很好,但您应该阅读一下以下不良习惯:错误处理日期/范围查询:摘要2. 避免针对DATETIME,SMALLDATETIME,DATETIME2和DATETIMEOFFSET 进行范围查询
Julien Vavasseur
By using our site, you acknowledge that you have read and understand our Cookie Policy and Privacy Policy.
Licensed under cc by-sa 3.0 with attribution required.