如果大多数程序都包含NULL字符,则认为该文件是二进制文件(该文件不是“面向行的”文件)。
这是用Python实现的perl版本的pp_fttext()
(pp_sys.c
):
import sys
PY3 = sys.version_info[0] == 3
# A function that takes an integer in the 8-bit range and returns
# a single-character byte object in py3 / a single-character string
# in py2.
#
int2byte = (lambda x: bytes((x,))) if PY3 else chr
_text_characters = (
b''.join(int2byte(i) for i in range(32, 127)) +
b'\n\r\t\f\b')
def istextfile(fileobj, blocksize=512):
""" Uses heuristics to guess whether the given file is text or binary,
by reading a single block of bytes from the file.
If more than 30% of the chars in the block are non-text, or there
are NUL ('\x00') bytes in the block, assume this is a binary file.
"""
block = fileobj.read(blocksize)
if b'\x00' in block:
# Files with null bytes are binary
return False
elif not block:
# An empty file is considered a valid text file
return True
# Use translate's 'deletechars' argument to efficiently remove all
# occurrences of _text_characters from the block
nontext = block.translate(None, _text_characters)
return float(len(nontext)) / len(block) <= 0.30
还要注意,此代码被编写为可以在Python 2和Python 3上运行而无需更改。
资料来源:Perl的“猜测文件是文本还是二进制”,用Python实现