Fast extraction of a time range from a syslog log file?


12

I have a log file in standard syslog format. It looks like this, except with hundreds of lines per second:

Jan 11 07:48:46 blahblahblah...
Jan 11 07:49:00 blahblahblah...
Jan 11 07:50:13 blahblahblah...
Jan 11 07:51:22 blahblahblah...
Jan 11 07:58:04 blahblahblah...

It doesn't roll over exactly at midnight, but it never spans more than two days.

I regularly have to extract timeslices from this file. I'd like to write a general-purpose script for this that I can call like:

$ timegrep 22:30-02:00 /logs/something.log

...and have it pull out the lines starting at 22:30, across the midnight boundary, until 02:00 the next morning.

There are a few caveats:

  • I don't want to bother typing the date(s) on the command line, just the times. The program should be smart enough to figure them out.
  • The log date format doesn't include the year, so it should guess based on the current year, yet still do the right thing around New Year's (a sketch of one way to do that follows this list).
  • I want it to be fast: it should use the fact that the lines are in time order to seek around in the file and do a binary search.
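
Regarding the second caveat, here is a minimal sketch (just an illustration, not an existing tool) of how the year could be guessed for a timestamp that lacks one; guess_year is a hypothetical helper name:

from datetime import datetime

# Hypothetical helper: attach a year to a "Mon DD"-style stamp, assuming log
# entries lie in the recent past, so that "Dec 31" parsed on Jan 1 maps to the
# previous year rather than the future.
def guess_year(month_day, now=None):
    now = now or datetime.now()
    candidate = datetime.strptime(month_day, "%b %d").replace(year=now.year)
    return candidate.year if candidate <= now else now.year - 1

print(guess_year("Dec 31", now=datetime(2010, 1, 1)))   # -> 2009
print(guess_year("Jan 11", now=datetime(2010, 1, 13)))  # -> 2010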

Before I spend serious time writing this myself, does it already exist?

Answers:


9

Update: I've replaced the original code with an updated version that has numerous improvements. Let's call this one (actual?) alpha quality.

This version includes:

  • command line option handling
  • validation of the command line time format
  • some try blocks
  • a line-reading function

Original:

Well, what do you know? "Seek" and ye shall find! Here is a Python program that seeks around in the file and uses a more-or-less binary search. It's considerably faster than the AWK script in the other answer.

This is (pre?) alpha quality. It ought to have try blocks, input validation, and lots of testing, and it could no doubt be more Pythonic. But here it is for your amusement. Oh, and it's written for Python 2.6.

New code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# timegrep.py by Dennis Williamson 20100113
# in response to http://serverfault.com/questions/101744/fast-extraction-of-a-time-range-from-syslog-logfile

# thanks to serverfault user http://serverfault.com/users/1545/mike
# for the inspiration

# Perform a binary search through a log file to find a range of times
# and print the corresponding lines

# tested with Python 2.6

# TODO: Make sure that it works if the seek falls in the middle of
#       the first or last line
# TODO: Make sure it's not blind to a line where the sync read falls
#       exactly at the beginning of the line being searched for and
#       then gets skipped by the second read
# TODO: accept arbitrary date

# done: add -l long and -s short options
# done: test time format

version = "0.01a"

import os, sys
from stat import *
from datetime import date, datetime
import re
from optparse import OptionParser

# Function to read lines from file and extract the date and time
def getdata():
    """Read a line from a file

    Return a tuple containing:
        the date/time in a format such as 'Jan 15 20:14:01'
        the line itself

    The last colon and seconds are optional and
    not handled specially

    """
    try:
        line = handle.readline(bufsize)
    except:
        print("File I/O Error")
        exit(1)
    if line == '':
        print("EOF reached")
        exit(1)
    if line[-1] == '\n':
        line = line.rstrip('\n')
    else:
        if len(line) >= bufsize:
            print("Line length exceeds buffer size")
        else:
            print("Missing newline")
        exit(1)
    words = line.split(' ')
    if len(words) >= 3:
        linedate = words[0] + " " + words[1] + " " + words[2]
    else:
        linedate = ''
    return (linedate, line)
# End function getdata()

# Set up option handling
parser = OptionParser(version = "%prog " + version)

parser.usage = "\n\t%prog [options] start-time end-time filename\n\n\
\twhere times are in the form hh:mm[:ss]"

parser.description = "Search a log file for a range of times occurring yesterday \
and/or today using the current time to intelligently select the start and end. \
A date may be specified instead. Seconds are optional in time arguments."

parser.add_option("-d", "--date", action = "store", dest = "date",
                default = "",
                help = "NOT YET IMPLEMENTED. Use the supplied date instead of today.")

parser.add_option("-l", "--long", action = "store_true", dest = "longout",
                default = False,
                help = "Span the longest possible time range.")

parser.add_option("-s", "--short", action = "store_true", dest = "shortout",
                default = False,
                help = "Span the shortest possible time range.")

parser.add_option("-D", "--debug", action = "store", dest = "debug",
                default = 0, type = "int",
                help = "Output debugging information.\t\t\t\t\tNone (default) = %default, Some = 1, More = 2")

(options, args) = parser.parse_args()

if not 0 <= options.debug <= 2:
    parser.error("debug level out of range")
else:
    debug = options.debug    # 1 = print some debug output, 2 = print a little more, 0 = none

if options.longout and options.shortout:
    parser.error("options -l and -s are mutually exclusive")

if options.date:
    parser.error("date option not yet implemented")

if len(args) != 3:
    parser.error("invalid number of arguments")

start = args[0]
end   = args[1]
file  = args[2]

# test for times to be properly formatted, allow hh:mm or hh:mm:ss
p = re.compile(r'(^[2][0-3]|[0-1][0-9]):[0-5][0-9](:[0-5][0-9])?$')

if not p.match(start) or not p.match(end):
    print("Invalid time specification")
    exit(1)

# Determine Time Range
yesterday = date.fromordinal(date.today().toordinal()-1).strftime("%b %d")
today     = datetime.now().strftime("%b %d")
now       = datetime.now().strftime("%R")

if start > now or start > end or options.longout or options.shortout:
    searchstart = yesterday
else:
    searchstart = today

if (end > start > now and not options.longout) or options.shortout:
    searchend = yesterday
else:
    searchend = today

searchstart = searchstart + " " + start
searchend = searchend + " " + end

try:
    handle = open(file,'r')
except:
    print("File Open Error")
    exit(1)

# Set some initial values
bufsize = 4096  # handle long lines, but put a limit on them
rewind  =  100  # arbitrary, the optimal value is highly dependent on the structure of the file
limit   =   75  # arbitrary, allow for a VERY large file, but stop it if it runs away
count   =    0
size    =    os.stat(file)[ST_SIZE]
beginrange   = 0
midrange     = size / 2
oldmidrange  = midrange
endrange     = size
linedate     = ''

pos1 = pos2  = 0

if debug > 0: print("File: '{0}' Size: {1} Today: '{2}' Now: {3} Start: '{4}' End: '{5}'".format(file, size, today, now, searchstart, searchend))

# Seek using binary search
while pos1 != endrange and oldmidrange != 0 and linedate != searchstart:
    handle.seek(midrange)
    linedate, line = getdata()    # sync to line ending
    pos1 = handle.tell()
    if midrange > 0:             # if not BOF, discard first read
        if debug > 1: print("...partial: (len: {0}) '{1}'".format((len(line)), line))
        linedate, line = getdata()

    pos2 = handle.tell()
    count += 1
    if debug > 0: print("#{0} Beg: {1} Mid: {2} End: {3} P1: {4} P2: {5} Timestamp: '{6}'".format(count, beginrange, midrange, endrange, pos1, pos2, linedate))
    if  searchstart > linedate:
        beginrange = midrange
    else:
        endrange = midrange
    oldmidrange = midrange
    midrange = (beginrange + endrange) / 2
    if count > limit:
        print("ERROR: ITERATION LIMIT EXCEEDED")
        exit(1)

if debug > 0: print("...stopping: '{0}'".format(line))

# Rewind a bit to make sure we didn't miss any
seek = oldmidrange
while linedate >= searchstart and seek > 0:
    if seek < rewind:
        seek = 0
    else:
        seek = seek - rewind
    if debug > 0: print("...rewinding")
    handle.seek(seek)

    linedate, line = getdata()    # sync to line ending
    if debug > 1: print("...junk: '{0}'".format(line))

    linedate, line = getdata()
    if debug > 0: print("...comparing: '{0}'".format(linedate))

# Scan forward
while linedate < searchstart:
    if debug > 0: print("...skipping: '{0}'".format(linedate))
    linedate, line = getdata()

if debug > 0: print("...found: '{0}'".format(line))

if debug > 0: print("Beg: {0} Mid: {1} End: {2} P1: {3} P2: {4} Timestamp: '{5}'".format(beginrange, midrange, endrange, pos1, pos2, linedate))

# Now that the preliminaries are out of the way, we just loop,
#     reading lines and printing them until they are
#     beyond the end of the range we want

while linedate <= searchend:
    print line
    linedate, line = getdata()

if debug > 0: print("Start: '{0}' End: '{1}'".format(searchstart, searchend))
handle.close()
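
For reference, assuming the script above is saved as timegrep.py and made executable, an invocation matching the question might look like this (the output redirection is just an example):

$ ./timegrep.py 22:30 02:00 /logs/something.log > slice.log
$ ./timegrep.py -l 22:30 02:00 /logs/something.log    # span the longest possible range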

Wow. I really need to learn Python...
Stefan Lasiewski 2010

@Dennis Williamson: I see a line if debug > 0: print("File: '{0}' Size: {1} Today: '{2}' Now: {3} Start: '{4}' End: '{5}'".format(file, size, today, now, searchstar$. Should searchstar end with a $, or is that a typo? I get a syntax error on this line (line 159).
Stefan Lasiewski 2010

@Stefan: I would replace that with ))
Bill Weiss 2010

@Stefan: Thanks. That was a typo, which I've fixed. For quick reference, the $ should be t, searchend)) so that it reads ... searchstart, searchend))
Paused until further notice.

@Stefan: Sorry about that. I think it's OK now.
Paused until further notice.

0

From a quick search on the web, there are things that extract lines based on keywords (such as FIRE or whatnot :), but nothing that extracts a date range from a file.

It doesn't seem hard to do what you're suggesting:

  1. Search for the start time.
  2. Print that line.
  3. If the end time is < the start time, stop when a line's time is > the end and < the start.
  4. If the end time is > the start time, stop when a line's time is > the end.

Seems simple enough; I could write it for you if you don't mind Ruby :)
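
To make those steps concrete, here is a rough sketch of that linear scan, in Python rather than Ruby to match the rest of the thread. It assumes standard "Mon DD HH:MM:SS" syslog lines and HH:MM:SS arguments; the path is only an example.

# Rough sketch of the four steps above; it reads every line, so it is not
# tuned for very large files.
def linear_timegrep(path, start, end):
    crosses_midnight = end < start
    printing = False
    with open(path) as handle:
        for line in handle:
            stamp = line[7:15]                  # the HH:MM:SS field of "Mon DD HH:MM:SS"
            if not printing and stamp >= start:
                printing = True                 # steps 1 and 2: found the start, print from here
            if printing:
                if crosses_midnight and end < stamp < start:
                    break                       # step 3: past the end, back before the start
                if not crosses_midnight and stamp > end:
                    break                       # step 4: past the end
                print(line.rstrip('\n'))

linear_timegrep('/logs/something.log', '22:30:00', '02:00:00')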


I don't mind Ruby, but #1 isn't simple if you want to do it efficiently in a large file: you need to seek() to a midpoint, find the nearest line, see how it starts, then repeat with a new midpoint. Looking at every line is too inefficient.
mike 2010
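
A bare-bones sketch of the seek-and-resync step mike describes (the same idea the Python answer above implements with its "discard first read"); the offset and path are only examples:

def first_line_after(handle, offset):
    handle.seek(offset)
    if offset > 0:
        handle.readline()        # discard the partial line we probably landed in
    return handle.readline()     # the first complete line at or after the offset

with open('/logs/something.log') as handle:
    print(first_line_after(handle, 15 * 2**30).rstrip('\n'))   # a line near the 15 GB mark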

You say large, but don't give an actual size. How large is large? Worse, if many days are involved, it would be easy to find the wrong day using times alone; after all, if you cross a day boundary, the date the script runs on will always differ from the date of the start time. Would these files fit into memory via mmap()?
Michael Graff 2010

About 30 GB, on a network-mounted disk.
mike 2010

0

This prints the range of entries between a start time and an end time, based on how they relate to the current time ("now").

Usage:

timegrep [-l] start end filename

Example:

$ timegrep 18:47 03:22 /some/log/file

The -l (long) option produces the longest possible output. The start time is interpreted as yesterday if its HH:MM value is later than the current time or later than the end time; otherwise it is today. The end time is interpreted as yesterday only when both the start and end HH:MM values are later than "now" and the end comes after the start; otherwise it is today.

Assuming "now" is "Jan 11 19:00", here is how various example start and end times would be interpreted (without -l except where noted):

start  end    range start  range end
19:01  23:59  Jan 10       Jan 10
19:01  00:00  Jan 10       Jan 11
00:00  18:59  Jan 11       Jan 11
18:59  18:58  Jan 10       Jan 10
19:01  23:59  Jan 10       Jan 11    # -l
00:00  18:59  Jan 10       Jan 11    # -l
18:59  19:01  Jan 10       Jan 11    # -l

Almost all of the script is setup; the last two lines do all the work.

Caveat: no argument validation or error checking is done, and edge cases haven't been thoroughly tested. This was written with gawk; other versions of AWK may not behave the same.

#!/usr/bin/awk -f
BEGIN {
    arg=1
    if ( ARGV[arg] == "-l" ) {
        long = 1
        ARGV[arg++] = ""
    }
    start = ARGV[arg]
    ARGV[arg++] = ""
    end = ARGV[arg]
    ARGV[arg++] = ""

    yesterday = strftime("%b %d", mktime(strftime("%Y %m %d -24 00 00")))
    today = strftime("%b %d")
    now = strftime("%R")

    if ( start > now || start > end || long )
        startdate = yesterday
    else
        startdate = today

    if ( end > now && end > start && start > now && ! long )
        enddate = yesterday
    else
        enddate = today

    startdate = startdate " " start
    enddate = enddate " " end
}

$1 " " $2 " " $3 > enddate {exit}
$1 " " $2 " " $3 >= startdate {print}

I believe AWK is quite efficient at searching through files. I don't think anything else is necessarily going to be faster at searching an unindexed text file.


It looks like you ignored my third bullet point. The log is around 30 GB; if the first line of the file is at 7:00 and the last line is at 23:00, and I want the slice between 22:00 and 22:01, I don't want the script to look at every line between 7:00 and 22:00. I want it to estimate where that point is, seek to it, and then make a new estimate, repeating until it finds it.
mike 2010

I didn't ignore it. I gave my opinion on it in the last paragraph.
Paused until further notice.

0

A C++ program that uses a binary search; it would need some simple modifications (i.e., calling strptime) to handle textual dates.

http://gitorious.org/bs_grep/

A previous version had support for textual dates, but it was still too slow for our log files: profiling showed that more than 90% of the time was spent in strptime. So we just modified the log format to include a numeric Unix timestamp as well.
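
As a hedged illustration of why that helps (this is not the gitorious program itself, and the line layout shown is hypothetical): a leading epoch field can be compared numerically with no strptime call at all.

def line_epoch(line):
    return float(line.split(None, 1)[0])    # hypothetical "<epoch> <syslog line>" layout

start_epoch, end_epoch = 1263196126.0, 1263199999.0     # example range bounds
line = "1263196200 Jan 11 07:50:00 host daemon: blahblahblah"
print(start_epoch <= line_epoch(line) <= end_epoch)     # -> True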


0

Even though this answer comes far too late, it may be useful to some.

I've converted @Dennis Williamson's code into a Python class that can be used in other Python code.

I've added support for multiple date formats.

import os
from stat import *
from datetime import date, datetime
import re

# @TODO Support for rotated log files - currently using the current year for 'Jan 01' dates.
class LogFileTimeParser(object):
    """
    Extracts parts of a log file based on a start and enddate
    Uses binary search logic to speed up searching

    Common usage: validate log files during testing

    Faster than awk parsing for big log files
    """
    version = "0.01a"

    # Set some initial values
    BUF_SIZE = 4096  # self.handle long lines, but put a limit to them
    REWIND = 100  # arbitrary, the optimal value is highly dependent on the structure of the file
    LIMIT = 75  # arbitrary, allow for a VERY large file, but stop it if it runs away

    line_date = ''
    line = None
    opened_file = None

    @staticmethod
    def parse_date(text, validate=True):
        # Supports e.g. 'Aug 16 14:59:01', '2016-08-16 09:23:09', 'Jun 1 2005 1:33:06PM' (with or without seconds/milliseconds)
        for fmt in ('%Y-%m-%d %H:%M:%S %f', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d %H:%M',
                    '%b %d %H:%M:%S %f', '%b %d %H:%M', '%b %d %H:%M:%S',
                    '%b %d %Y %H:%M:%S %f', '%b %d %Y %H:%M', '%b %d %Y %H:%M:%S',
                    '%b %d %Y %I:%M:%S%p', '%b %d %Y %I:%M%p', '%b %d %Y %I:%M:%S%p %f'):
            try:
                if fmt in ['%b %d %H:%M:%S %f', '%b %d %H:%M', '%b %d %H:%M:%S']:

                    return datetime.strptime(text, fmt).replace(year=datetime.now().year)
                return datetime.strptime(text, fmt)
            except ValueError:
                pass
        if validate:
            raise ValueError("No valid date format found for '{0}'".format(text))
        else:
            # Cannot use NoneType to compare datetimes. Using minimum instead
            return datetime.min

    # Function to read lines from file and extract the date and time
    def read_lines(self):
        """
        Read a line from a file
        Return a tuple containing:
            the date/time in a format supported by parse_date, and the line itself
        """
        try:
            self.line = self.opened_file.readline(self.BUF_SIZE)
        except:
            raise IOError("File I/O Error")
        if self.line == '':
            raise EOFError("EOF reached")
        # Remove \n from read lines.
        if self.line[-1] == '\n':
            self.line = self.line.rstrip('\n')
        else:
            if len(self.line) >= self.BUF_SIZE:
                raise ValueError("Line length exceeds buffer size")
            else:
                raise ValueError("Missing newline")
        words = self.line.split(' ')
        # This results into Jan 1 01:01:01 000000 or 1970-01-01 01:01:01 000000
        if len(words) >= 3:
            self.line_date = self.parse_date(words[0] + " " + words[1] + " " + words[2],False)
        else:
            self.line_date = self.parse_date('', False)
        return self.line_date, self.line

    def get_lines_between_timestamps(self, start, end, path_to_file, debug=False):
        # Set some initial values
        count = 0
        size = os.stat(path_to_file)[ST_SIZE]
        begin_range = 0
        mid_range = size / 2
        old_mid_range = mid_range
        end_range = size
        pos1 = pos2 = 0

        # If only hours are supplied
        # test for times to be properly formatted, allow hh:mm or hh:mm:ss
        p = re.compile(r'(^[2][0-3]|[0-1][0-9]):[0-5][0-9](:[0-5][0-9])?$')
        if p.match(start) or p.match(end):
            # Determine Time Range
            yesterday = date.fromordinal(date.today().toordinal() - 1).strftime("%Y-%m-%d")
            today = datetime.now().strftime("%Y-%m-%d")
            now = datetime.now().strftime("%R")
            if start > now or start > end:
                search_start = yesterday
            else:
                search_start = today
            if end > start > now:
                search_end = yesterday
            else:
                search_end = today
            search_start = self.parse_date(search_start + " " + start)
            search_end = self.parse_date(search_end + " " + end)
        else:
            # Set dates
            search_start = self.parse_date(start)
            search_end = self.parse_date(end)
        try:
            self.opened_file = open(path_to_file, 'r')
        except:
            raise IOError("File Open Error")
        if debug:
            print("File: '{0}' Size: {1} Start: '{2}' End: '{3}'"
                  .format(path_to_file, size, search_start, search_end))

        # Seek using binary search -- ONLY WORKS ON FILES WHO ARE SORTED BY DATES (should be true for log files)
        try:
            while pos1 != end_range and old_mid_range != 0 and self.line_date != search_start:
                self.opened_file.seek(mid_range)
                # sync to self.line ending
                self.line_date, self.line = self.read_lines()
                pos1 = self.opened_file.tell()
                # if not beginning of file, discard first read
                if mid_range > 0:
                    if debug:
                        print("...partial: (len: {0}) '{1}'".format((len(self.line)), self.line))
                    self.line_date, self.line = self.read_lines()
                pos2 = self.opened_file.tell()
                count += 1
                if debug:
                    print("#{0} Beginning: {1} Mid: {2} End: {3} P1: {4} P2: {5} Timestamp: '{6}'".
                          format(count, begin_range, mid_range, end_range, pos1, pos2, self.line_date))
                if search_start > self.line_date:
                    begin_range = mid_range
                else:
                    end_range = mid_range
                old_mid_range = mid_range
                mid_range = (begin_range + end_range) / 2
                if count > self.LIMIT:
                    raise IndexError("ERROR: ITERATION LIMIT EXCEEDED")
            if debug:
                print("...stopping: '{0}'".format(self.line))
            # Rewind a bit to make sure we didn't miss any
            seek = old_mid_range
            while self.line_date >= search_start and seek > 0:
                if seek < self.REWIND:
                    seek = 0
                else:
                    seek -= self.REWIND
                if debug:
                    print("...rewinding")
                self.opened_file.seek(seek)
                # sync to self.line ending
                self.line_date, self.line = self.read_lines()
                if debug:
                    print("...junk: '{0}'".format(self.line))
                self.line_date, self.line = self.read_lines()
                if debug:
                    print("...comparing: '{0}'".format(self.line_date))
            # Scan forward
            while self.line_date < search_start:
                if debug:
                    print("...skipping: '{0}'".format(self.line_date))
                self.line_date, self.line = self.read_lines()
            if debug:
                print("...found: '{0}'".format(self.line))
            if debug:
                print("Beginning: {0} Mid: {1} End: {2} P1: {3} P2: {4} Timestamp: '{5}'".
                      format(begin_range, mid_range, end_range, pos1, pos2, self.line_date))
            # Now that the preliminaries are out of the way, we just loop,
            # reading lines and printing them until they are beyond the end of the range we want
            while self.line_date <= search_end:
                # Exclude our 'Nonetype' values
                if not self.line_date == datetime.min:
                    print self.line
                self.line_date, self.line = self.read_lines()
            if debug:
                print("Start: '{0}' End: '{1}'".format(search_start, search_end))
            self.opened_file.close()
        # Do not display EOFErrors:
        except EOFError as e:
            pass
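
A minimal usage sketch, assuming the class above has been imported; the log path is only an example:

log_parser = LogFileTimeParser()
# Hour-only arguments, as in the original question; the dates are inferred from "now".
log_parser.get_lines_between_timestamps('22:30', '02:00', '/var/log/syslog', debug=False)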