Logstash: parsing an XML document containing multiple log entries



I am currently evaluating whether Logstash and Elasticsearch would be useful for our use case. What I have is a log file containing multiple entries in the following format:

<root>
    <entry>
        <fieldx>...</fieldx>
        <fieldy>...</fieldy>
        <fieldz>...</fieldz>
        ...
        <fieldarray>
            <fielda>...</fielda>
            <fielda>...</fielda>
            ...
        </fieldarray>
    </entry>
    <entry>
    ...
    </entry>
    ...
</root>

Each entry element will contain one log event. (In case you are interested, the file is actually a Tempo Timesheets (Atlassian JIRA add-on) work-log export.)

Is it possible to transform such a file into multiple log events without writing my own codec?
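For reference, the per-entry splitting the question asks about can be sketched outside Logstash with Python's standard library. This is a minimal illustration of the desired outcome, not part of either answer; the element and field names are taken from the sample above, and the inline XML string is a made-up stand-in for the real export:

```python
import xml.etree.ElementTree as ET

# Parse the whole export and emit one dict ("log event") per <entry>.
# The document here is a tiny hypothetical stand-in for the real file.
root = ET.fromstring(
    "<root>"
    "  <entry><fieldx>1</fieldx><fieldy>a</fieldy></entry>"
    "  <entry><fieldx>2</fieldx><fieldy>b</fieldy></entry>"
    "</root>"
)
events = [{child.tag: child.text for child in entry} for entry in root.findall("entry")]
print(events)  # [{'fieldx': '1', 'fieldy': 'a'}, {'fieldx': '2', 'fieldy': 'b'}]
```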

Answers:



Well, I found a solution that works for me. The biggest problem with this solution is that the XML plugin is, while not exactly unstable, either poorly documented and buggy, or poorly and incorrectly documented.

TL;DR

Bash command line:

gzcat -d file.xml.gz | tr -d "\n\r" | xmllint --format - | logstash -f logstash-csv.conf

Logstash config:

input {
    stdin {}
}

filter {
    # add all lines that have more indentation than double-space to the previous line
    multiline {
        pattern => "^\s\s(\s\s|\<\/entry\>)"
        what => previous
    }
    # multiline filter adds the tag "multiline" only to lines spanning multiple lines
    # We _only_ want those here.
    if "multiline" in [tags] {
        # Add the encoding line here. Could in theory extract this from the
        # first line with a clever filter. Not worth the effort at the moment.
        mutate {
            replace => ["message",'<?xml version="1.0" encoding="UTF-8" ?>%{message}']
        }
        # This filter exports the hierarchy into the field "entry". This will
        # create a very deep structure that elasticsearch does not really like.
        # Which is why I used add_field to flatten it.
        xml {
            target => entry
            source => message
            add_field => {
                fieldx         => "%{[entry][fieldx]}"
                fieldy         => "%{[entry][fieldy]}"
                fieldz         => "%{[entry][fieldz]}"
                # With deeper nested fields, the xml converter actually creates
                # an array containing hashes, which is why you need the [0]
                # -- took me ages to find out.
                fielda         => "%{[entry][fieldarray][0][fielda]}"
                fieldb         => "%{[entry][fieldarray][0][fieldb]}"
                fieldc         => "%{[entry][fieldarray][0][fieldc]}"
            }
        }
        # Remove the intermediate fields before output. "message" contains the
        # original message (XML). You may or may-not want to keep that.
        mutate {
            remove_field => ["message"]
            remove_field => ["entry"]
        }
    }
}

output {
    ...
}

Details

My solution works because, at least down to the entry level, my XML input is very uniform and can therefore be handled by some kind of pattern matching.

Since the export is basically one very long line of XML, and the Logstash xml plugin essentially only works with fields (read: columns in lines) that contain XML data, I had to change the data into a more useful format first.

Shell: preparing the file

  • gzcat -d file.xml.gz |: Simply too much data; obviously you can skip this step.
  • tr -d "\n\r" |: Remove line breaks inside XML elements: some of the elements can contain line breaks as character data. The next step requires that these are removed or encoded in some way. Even though this means you end up with all the XML code on one massive line, it does not matter that this command also removes whitespace between elements.

  • xmllint --format - |: Format the XML with xmllint (which ships with libxml)

    Here, the single big spaghetti line of XML (<root><entry><fieldx>...</fieldx></entry></root>) is properly formatted:

    <root>
      <entry>
        <fieldx>...</fieldx>
        <fieldy>...</fieldy>
        <fieldz>...</fieldz>
        <fieldarray>
          <fielda>...</fielda>
          <fieldb>...</fieldb>
          ...
        </fieldarray>
      </entry>
      <entry>
        ...
      </entry>
      ...
    </root>
    
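The normalize-then-pretty-print effect of the tr and xmllint steps can also be sketched with Python's standard library. This is an illustration of what the pipeline does, not what the author ran; the raw string is a hypothetical example:

```python
import xml.dom.minidom

# A hypothetical export fragment with a newline inside character data.
raw = "<root>\n<entry><fieldx>a\nb</fieldx></entry></root>"

# Strip embedded newlines (the tr -d "\n\r" step) ...
one_line = raw.replace("\n", "").replace("\r", "")
# ... then re-indent so each element sits on its own line (the xmllint step).
pretty = xml.dom.minidom.parseString(one_line).toprettyxml(indent="  ")
print(pretty)
```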

Logstash

logstash -f logstash-csv.conf

(See the "TL;DR" section for the full contents of the .conf file.)

Here, the multiline filter does the trick. It can merge multiple lines into a single log message, and this is why the formatting with xmllint was necessary:

filter {
    # add all lines that have more indentation than double-space to the previous line
    multiline {
        pattern => "^\s\s(\s\s|\<\/entry\>)"
        what => previous
    }
}

Basically, this says that every line indented by more than two spaces (or that is </entry>; xmllint indents with two spaces by default) belongs to the previous line. This also means that character data must not contain newlines (stripped with tr in the shell) and that the XML must be normalised (with xmllint).


Hi, did you manage to make this work? I am curious because I have the same need, and neither the multiline solution nor split works for me. Thanks for your feedback
2015

@viz This works, but we never used it in production. Multiline only works if you have a very regular XML structure and have formatted it with indentation first (see the answer, section "Preparing the file")
2015


I had a similar case. To parse this XML:

<ROOT number="34">
  <EVENTLIST>
    <EVENT name="hey"/>
    <EVENT name="you"/>
  </EVENTLIST>
</ROOT>

I used this configuration for Logstash:

input {
  file {
    path => "/path/events.xml"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "<ROOT"
      negate => "true"
      what => "previous"
      auto_flush_interval => 1
    }
  }
}
filter {
  xml {
    source => "message"
    target => "xml_content"
  }
  split {
    field => "xml_content[EVENTLIST]"
  }
  split {
    field => "xml_content[EVENTLIST][EVENT]"
  }
  mutate {
    add_field => { "number" => "%{xml_content[number]}" }
    add_field => { "name" => "%{xml_content[EVENTLIST][EVENT][name]}" }
    remove_field => ['xml_content', 'message', 'path']
  }
}
output {
  stdout {
    codec => rubydebug
  }
}
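The flat events that the split and mutate filters above produce can be sketched in Python. This is a rough, hypothetical equivalent for illustration only, not how Logstash implements it:

```python
import xml.etree.ElementTree as ET

# The sample document from this answer.
doc = ET.fromstring(
    '<ROOT number="34">'
    '  <EVENTLIST>'
    '    <EVENT name="hey"/>'
    '    <EVENT name="you"/>'
    '  </EVENTLIST>'
    '</ROOT>'
)
# One flat event per <EVENT>, carrying the ROOT attribute along --
# the same shape the split + mutate chain emits.
events = [
    {"number": doc.get("number"), "name": event.get("name")}
    for event in doc.iter("EVENT")
]
print(events)  # [{'number': '34', 'name': 'hey'}, {'number': '34', 'name': 'you'}]
```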

I hope this helps someone. It took me a long time to get there.

Licensed under cc by-sa 3.0 with attribution required.