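OK, I found a solution that works for me. The biggest problem with it is that the XML plugin is not exactly unstable, but it is either poorly documented and buggy, or poorly and incorrectly documented.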
TLDR
Bash command line:
gzcat -d file.xml.gz | tr -d "\n\r" | xmllint --format - | logstash -f logstash-csv.conf
Logstash config:
input {
    stdin {}
}
filter {
    # add all lines that have more indentation than double-space to the previous line
    multiline {
        pattern => "^\s\s(\s\s|\<\/entry\>)"
        what => previous
    }
    # multiline filter adds the tag "multiline" only to lines spanning multiple lines
    # We _only_ want those here.
    if "multiline" in [tags] {
        # Add the encoding line here. Could in theory extract this from the
        # first line with a clever filter. Not worth the effort at the moment.
        mutate {
            replace => ["message",'<?xml version="1.0" encoding="UTF-8" ?>%{message}']
        }
        # This filter exports the hierarchy into the field "entry". This will
        # create a very deep structure that elasticsearch does not really like.
        # Which is why I used add_field to flatten it.
        xml {
            target => entry
            source => message
            add_field => {
                fieldx => "%{[entry][fieldx]}"
                fieldy => "%{[entry][fieldy]}"
                fieldz => "%{[entry][fieldz]}"
                # With deeper nested fields, the xml converter actually creates
                # an array containing hashes, which is why you need the [0]
                # -- took me ages to find out.
                fielda => "%{[entry][fieldarray][0][fielda]}"
                fieldb => "%{[entry][fieldarray][0][fieldb]}"
                fieldc => "%{[entry][fieldarray][0][fieldc]}"
            }
        }
        # Remove the intermediate fields before output. "message" contains the
        # original message (XML). You may or may not want to keep that.
        mutate {
            remove_field => ["message"]
            remove_field => ["entry"]
        }
    }
}
output {
    ...
}
Detailed
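My solution works because my XML input is very uniform, at least down to the entry level, and can therefore be handled by some kind of pattern matching.
Since the export is basically one really long line of XML, and the logstash xml plugin essentially works only on fields (read: columns in lines) that contain XML data, I had to convert the data into a more useful format.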
Shell: Preparing the file
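gzcat -d file.xml.gz |
    There was simply too much data, hence the gzipped file. You can obviously skip this step.
tr -d "\n\r" |
    Remove line breaks inside XML elements. Some of the elements can contain line breaks as character data, and the next step requires that these are removed or encoded in some way. The export is assumed to be one massive line of XML at this point anyway, so it does not matter that this command also removes any line breaks between elements.
xmllint --format - |
    Format the XML with xmllint (ships with libxml2).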
Here the single huge spaghetti line of XML (<root><entry><fieldx>...</fieldx></entry></root>) gets properly formatted:
<root>
  <entry>
    <fieldx>...</fieldx>
    <fieldy>...</fieldy>
    <fieldz>...</fieldz>
    <fieldarray>
      <fielda>...</fielda>
      <fieldb>...</fieldb>
      ...
    </fieldarray>
  </entry>
  <entry>
    ...
  </entry>
  ...
</root>
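To see what the tr step in the preparation pipeline does to character data, here is a minimal sketch. The sample element and the function names sample and flatten are made up for illustration; only the tr invocation itself is taken from the pipeline.

```shell
# Hypothetical sample: one element whose character data spans two lines
# (with a Windows-style CR LF line break in the middle).
sample() {
  printf '<entry><fieldx>first\r\nsecond</fieldx></entry>\n'
}

# Same tr invocation as in the pipeline: delete every CR and LF byte.
flatten() {
  tr -d "\n\r"
}

sample | flatten
echo   # tr also removed the trailing newline, so add one back for the terminal
```

Note that the line break is deleted rather than replaced, so "first" and "second" end up directly next to each other. If that matters for your data, replacing the line breaks with a space instead may be the better choice.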
Logstash
logstash -f logstash-csv.conf
(See the full content of the .conf file in the TLDR section.)
Here, the multiline filter does the trick. It can merge multiple lines into a single log message, and this is why the formatting with xmllint was necessary:
filter {
    # add all lines that have more indentation than double-space to the previous line
    multiline {
        pattern => "^\s\s(\s\s|\<\/entry\>)"
        what => previous
    }
}
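This basically says that every line indented by more than two spaces (xmllint indents with two spaces per level by default), as well as the closing </entry> line, belongs to the previous line. Lines indented by exactly two spaces, i.e. the opening <entry> tags, therefore start a new message. This also means that character data must not contain newlines (stripped with tr in the shell) and that the XML must be normalised (xmllint).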
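The grouping rule can be tried out without Logstash. The following is a rough awk emulation of the pattern with what => previous, assuming xmllint's two-space indentation. It is only a sketch of the idea, not how Logstash implements the filter; the sample data and the function names sample_xml and group_entries are hypothetical.

```shell
# Hypothetical sample in the shape produced by xmllint --format.
sample_xml() {
  cat <<'EOF'
<root>
  <entry>
    <fieldx>1</fieldx>
  </entry>
  <entry>
    <fieldx>2</fieldx>
  </entry>
</root>
EOF
}

# A line matching the pattern is appended to the previous event; any other
# line starts a new event. Only events that absorbed at least one line are
# printed, mirroring the 'if "multiline" in [tags]' check in the config.
# The sketch drops the line breaks between merged lines so that each event
# prints as a single line.
group_entries() {
  awk '
    function flush() { if (merged) print event }
    /^  (  |<\/entry>)/ { event = event $0; merged = 1; next }
    { flush(); event = $0; merged = 0 }
    END { flush() }
  '
}

sample_xml | group_entries
```

With this sample, the <root> and </root> lines never absorb another line and are dropped, while each <entry> block comes out as one merged line.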