Making libmagic / file detect .docx files


17

I have read elsewhere that docx, xlsx and pptx files are ZIP archives. When they are uploaded to my web application, file (via libmagic / python-magic) detects them as ZIP.

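I store the contents of the files as blobs in a database, but naturally I do not want to trust the user about which file type it is. So I would like to trust file instead and generate a filename automatically at download time.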
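To make the idea concrete, here is a minimal sketch of generating a download name from a detected MIME type. The function name `download_name`, the `file-<id>` naming scheme and the `.bin` fallback are assumptions for illustration only; the MIME string is assumed to come from something like python-magic's `magic.from_buffer(blob, mime=True)`, which is not called here so that the snippet runs with the standard library alone.

```python
# Hypothetical mapping from detected MIME type to a file extension.
EXTENSIONS = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": ".pptx",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": ".xlsx",
    "application/zip": ".zip",
}


def download_name(blob_id: int, mime: str) -> str:
    """Build a filename from the blob id and the detected MIME type.

    Unknown types fall back to a neutral ".bin" extension (an assumption).
    """
    return f"file-{blob_id}{EXTENSIONS.get(mime, '.bin')}"
```

With stock magic rules that only recognise ZIP, every Office document would come out as `file-<id>.zip`, which is the problem described here.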

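I know that /etc/magic can be modified, but the format (magic(5)) is too complex for me. I found a bug report about this issue on Debian bugs, but since it dates from 2008, it does not look like it will be fixed any time soon.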

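I guess my only option is to trust the user after all (while still storing the contents as a blob) and to check only the file extension taken from the filename. That way I can forbid some extensions and allow others. When a user downloads his own file again, he gets it back exactly as he uploaded it. However, this solution is insecure as soon as files are shared with other users, because anyone can simply rename a file to get the upload accepted.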
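A short sketch of that extension-only check shows why it is weak. The allowlist contents and the name `extension_allowed` are made up for this example.

```python
import os

# Hypothetical allowlist of upload extensions.
ALLOWED = {".docx", ".xlsx", ".pptx", ".pdf"}


def extension_allowed(filename: str) -> bool:
    """Accept or reject an upload by looking at the filename only."""
    return os.path.splitext(filename)[1].lower() in ALLOWED


# The check never looks at the content, so renaming is enough to pass it.
print(extension_allowed("setup.exe"))    # rejected
print(extension_allowed("setup.docx"))   # the same bytes, renamed, accepted
```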

Any ideas?

Finally, I found a list of magic numbers for docx etc., but I was not able to convert it to the magic(5) format.

Answers:


17

Based on the information you provided, you can use

0       string  PK\x03\x04\x14\x00\x06\x00      Microsoft Office Open XML Format

in /etc/magic to identify the general file type.

(However, this may not be universal: PK\x03\x04\x00\x14\x08\x08 has been observed at the start of XLSX files generated by LibreOffice.)
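The reason an 8-byte prefix is fragile is that only the first four bytes are the ZIP local file header signature; in the ZIP format the next two little-endian 16-bit fields are the "version needed to extract" and the general-purpose flags, both of which depend on the program that wrote the archive. A small sketch that decodes them (the helper name `local_header_fields` is an assumption):

```python
import struct


def local_header_fields(data: bytes) -> dict:
    """Decode the first 10 bytes of a ZIP local file header."""
    signature, version, flags, method = struct.unpack_from("<4sHHH", data, 0)
    if signature != b"PK\x03\x04":
        raise ValueError("not a ZIP local file header")
    return {"version_needed": version, "flags": flags, "method": method}


# The 8 bytes from the rule above, followed by a deflate method field (8).
print(local_header_fields(b"PK\x03\x04\x14\x00\x06\x00\x08\x00"))
```

For the rule above this gives version 20 with flags 0x0006; a writer that sets different flag bits produces a different prefix even though the file is just as valid.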

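Later versions of Ubuntu identify .docx, .pptx and .xlsx files correctly. Digging through the source code of the file utility, I found that ~/file-5.09/magic/Magdir/msooxml is the file used for the identification. You can grab a copy of that file and add its contents to your /etc/magic file.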


Here is a copy of the file, updated to v 1.5:


# $File: msooxml,v 1.5 2014/08/05 07:38:45 christos Exp $
# msooxml:  file(1) magic for Microsoft Office XML
# From: Ralf Brown <ralf.brown@gmail.com>

# .docx, .pptx, and .xlsx are XML plus other files inside a ZIP
#   archive.  The first member file is normally "[Content_Types].xml".
#   but some libreoffice generated files put this later. Perhaps skip
#   the "[Content_Types].xml" test?
# Since MSOOXML doesn't have anything like the uncompressed "mimetype"
#   file of ePub or OpenDocument, we'll have to scan for a filename
#   which can distinguish between the three types

# start by checking for ZIP local file header signature
0       string      PK\003\004
!:strength +10
# make sure the first file is correct
>0x1E       regex       \\[Content_Types\\]\\.xml|_rels/\\.rels
# skip to the second local file header
# since some documents include a 520-byte extra field following the file
# header, we need to scan for the next header
>>(18.l+49) search/2000 PK\003\004
# now skip to the *third* local file header; again, we need to scan due to a
# 520-byte extra field following the file header
>>>&26      search/1000 PK\003\004
# and check the subdirectory name to determine which type of OOXML
# file we have.  Correct the mimetype with the registered ones:
# http://technet.microsoft.com/en-us/library/cc179224.aspx
>>>>&26     string      word/       Microsoft Word 2007+
!:mime application/vnd.openxmlformats-officedocument.wordprocessingml.document
>>>>&26     string      ppt/        Microsoft PowerPoint 2007+
!:mime application/vnd.openxmlformats-officedocument.presentationml.presentation
>>>>&26     string      xl/     Microsoft Excel 2007+
!:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
>>>>&26     default     x       Microsoft OOXML
---

I am leaving v 1.2 below for posterity.

The copy included here may become outdated as the file package is updated.

#------------------------------------------------------------------------------
# $File: msooxml,v 1.2 2013/01/25 23:04:37 christos Exp $
# msooxml:  file(1) magic for Microsoft Office XML
# From: Ralf Brown <ralf.brown@gmail.com>

# .docx, .pptx, and .xlsx are XML plus other files inside a ZIP
#   archive.  The first member file is normally "[Content_Types].xml".
# Since MSOOXML doesn't have anything like the uncompressed "mimetype"
#   file of ePub or OpenDocument, we'll have to scan for a filename
#   which can distinguish between the three types

# start by checking for ZIP local file header signature
0               string          PK\003\004
# make sure the first file is correct
>0x1E           string          [Content_Types].xml
# skip to the second local file header
#   since some documents include a 520-byte extra field following the file
#   header,  we need to scan for the next header
>>(18.l+49)     search/2000     PK\003\004
# now skip to the *third* local file header; again, we need to scan due to a
#   520-byte extra field following the file header
>>>&26          search/1000     PK\003\004
# and check the subdirectory name to determine which type of OOXML
#   file we have
#   Correct the mimetype with the registered ones:
#     http://technet.microsoft.com/en-us/library/cc179224.aspx
>>>>&26         string          word/           Microsoft Word 2007+
!:mime application/vnd.openxmlformats-officedocument.wordprocessingml.document
>>>>&26         string          ppt/            Microsoft PowerPoint 2007+
!:mime application/vnd.openxmlformats-officedocument.presentationml.presentation
>>>>&26         string          xl/             Microsoft Excel 2007+
!:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
>>>>&26         default         x               Microsoft OOXML
!:strength +10
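For readers who find magic(5) hard to follow, the offset arithmetic in the rules above can be emulated in Python. This is only a sketch under stated assumptions: `find_sig`, `classify` and `build_zip` are names invented here, `search/N` is approximated with `bytes.find`, and the first-member test is simplified to a prefix comparison (v 1.5 accepts either name, v 1.2 only "[Content_Types].xml"). The constant 49 is the 30-byte local header plus the 19 bytes of "[Content_Types].xml", and `&26` works because the name field starts 26 bytes after the 4-byte signature.

```python
import io
import struct
import zipfile

SIG = b"PK\x03\x04"  # ZIP local file header signature


def find_sig(data, start, window):
    """Approximate `search/N`: first SIG starting in data[start:start+window]."""
    return data.find(SIG, start, start + window + len(SIG) - 1)


def classify(data):
    if not data.startswith(SIG):                       # 0 string PK\003\004
        return None
    # >0x1E: name of the first member (simplified prefix test)
    if not data.startswith((b"[Content_Types].xml", b"_rels/.rels"), 30):
        return None
    # >>(18.l+49): little-endian compressed size at offset 18, plus 49
    (compressed_size,) = struct.unpack_from("<I", data, 18)
    second = find_sig(data, compressed_size + 49, 2000)
    if second < 0:
        return None
    # >>>&26: continue from the second member's name, look for the third header
    third = find_sig(data, second + len(SIG) + 26, 1000)
    if third < 0:
        return None
    name = data[third + len(SIG) + 26:]                # >>>>&26: third member's name
    for prefix, label in ((b"word/", "Microsoft Word 2007+"),
                          (b"ppt/", "Microsoft PowerPoint 2007+"),
                          (b"xl/", "Microsoft Excel 2007+")):
        if name.startswith(prefix):
            return label
    return "Microsoft OOXML"


def build_zip(names):
    """Hypothetical test helper: an in-memory ZIP with tiny stored members."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"<x/>")
    return buf.getvalue()


print(classify(build_zip(["[Content_Types].xml", "_rels/.rels", "word/document.xml"])))
```

The sketch makes the limitation visible: only the third member's directory decides the result, so an archive whose third member lives elsewhere falls through to the generic "Microsoft OOXML" label.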

1
I added the contents of that file (msooxml) to /etc/magic (on Debian) and it works.
Jay K

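This worked for me too, although I made the mistake of using the ~/file-5.11/magic/Magdir/msooxml source, which did not work for some PowerPoint sample files I was using. The version in file-5.17 worked well, though (maybe something to do with tabs or ... dunno).
dsummersl 2014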

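FWIW, I tried this on Scientific Linux 6, which is apparently still on file 5.04; as @stanley-c notes, it truncates the MIME type tag to 64 characters (but it does warn you). I also tried Mac OS X Mavericks and could not get it to apply the rules (although it did warn me about not escaping the [ and . in the second rule).
jwadsack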

Note that "Microsoft OOXML" can also be a .docx file, not only "Microsoft Word 2007+".
golimar

4

Prior to version 5.13, file truncates the MIME type to 64 characters. With the msooxml contents, the MIME type reported by the file -bi command therefore becomes "application/vnd.openxmlformats-officedocument.wordprocessingml.d; charset=binary".
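If you are stuck on such an old version, one possible workaround on the application side is to map the truncated value back to the full type. The sketch below assumes the 64-character limit described above; `expand_truncated` is a hypothetical helper, not part of file or python-magic.

```python
FULL_TYPES = (
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
)
LIMIT = 64  # truncation length reported for file < 5.13


def expand_truncated(mime: str) -> str:
    """Return the full OOXML MIME type if `mime` looks like a truncated one."""
    if len(mime) == LIMIT:
        for full in FULL_TYPES:
            if full.startswith(mime):
                return full
    return mime


truncated = FULL_TYPES[0][:LIMIT]
print(truncated)
print(expand_truncated(truncated))
```

All three registered types are longer than 64 characters and differ before the cut-off, so the truncated strings stay unambiguous.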


0

If you are dealing with docx files produced by LibreOffice, you can add the following to /etc/magic:

# start by checking for ZIP local file header signature
0               string          PK\003\004
!:strength +10
>1104           search/300      PK\003\004
# and check the subdirectory name to determine which type of OOXML
# file we have.  Correct the mimetype with the registered ones:
# http://technet.microsoft.com/en-us/library/cc179224.aspx
>>&26           string          word/           Microsoft Word 2007+
!:mime application/vnd.openxmlformats-officedocument.wordprocessingml.document
>>&26         string          ppt/            Microsoft PowerPoint 2007+
!:mime application/vnd.openxmlformats-officedocument.presentationml.presentation
>>&26         string          xl/             Microsoft Excel 2007+
!:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
>>&26         default         x               Microsoft OOXML

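I tried this. It causes some xlsx files that were previously not detected to be detected correctly, but it also causes some xlsx files that were previously detected correctly to no longer be detected.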
Motin
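Fixed offsets such as 1104 depend on how a particular program lays out the archive, which may explain results like this. One alternative outside of libmagic, and explicitly not what file(1) does, is to read the member names from the ZIP central directory with Python's standard `zipfile` module. The function `ooxml_kind`, its return labels and the helper `build_zip` are assumptions for illustration; the check is a heuristic on member names and does not validate the document itself.

```python
import io
import zipfile


def ooxml_kind(data: bytes):
    """Guess the OOXML flavour from ZIP member names, or None if not OOXML."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = zf.namelist()
    except zipfile.BadZipFile:
        return None
    if "[Content_Types].xml" not in names:
        return None
    for prefix, kind in (("word/", "docx"), ("ppt/", "pptx"), ("xl/", "xlsx")):
        if any(name.startswith(prefix) for name in names):
            return kind
    return "ooxml"


def build_zip(names):
    """Hypothetical test helper: an in-memory ZIP with tiny members."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"<x/>")
    return buf.getvalue()


# Member order no longer matters, unlike with the offset-based rules.
print(ooxml_kind(build_zip(["xl/workbook.xml", "_rels/.rels", "[Content_Types].xml"])))
```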
Licensed under cc by-sa 3.0 with attribution required.