How do I read a single character at a time from a file in Python?


Answers:


89
with open(filename) as f:
    while True:
        c = f.read(1)
        if not c:
            print("End of file")
            break
        print("Read a character:", c)

41
Since this reads one byte at a time, won't it fail for non-ASCII encodings?
David Chouinard

3
The question and the answers here conflate characters and bytes. If the file uses a single-byte-per-character encoding such as ASCII, then yes, by reading single-byte chunks you are reading single characters; but if the encoding needs more than one byte per character, you are only reading single bytes, not single characters.
Basel Shishani
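
A minimal sketch of that distinction, using a hypothetical demo file: in UTF-8 the character 'é' takes two bytes, so read(1) in binary mode returns only half of it, while read(1) in text mode returns the whole character.

with open("demo.txt", "w", encoding="utf-8") as f:   # hypothetical demo file
    f.write("é")

with open("demo.txt", "rb") as f:
    print(f.read(1))   # b'\xc3' -- one byte, only half of the character

with open("demo.txt", encoding="utf-8") as f:
    print(f.read(1))   # 'é' -- one full character, decoded from two bytes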

2
That's right. That's why I often just do result = open(filename).read() and then walk through result character by character.
Shiva

3
Regarding David Chouinard's question: this snippet works correctly in Python 3 with a UTF-8 file. If you have, say, a Windows-1250 encoded file, just change the first line to with open(filename, encoding='Windows-1250') as f:
SergO, 2016

1
Adding to SergO: open(filename, "r") vs. open(filename, "rb") can lead to a different number of iterations (at least in Python 3). In "r" (text) mode a single read(1) may consume several bytes in order to return one character c when it hits a multi-byte character.
dcc310
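
A minimal sketch of that difference, reusing a hypothetical UTF-8 demo file: counting how many times read(1) returns something before EOF gives one result per character in text mode and one per byte in binary mode.

def count_reads(path, mode, **kwargs):
    """Count how many non-empty results read(1) yields before EOF."""
    n = 0
    with open(path, mode, **kwargs) as f:
        while f.read(1):
            n += 1
    return n

with open("demo.txt", "w", encoding="utf-8") as f:   # hypothetical demo file
    f.write("héj")                                    # 3 characters, 4 bytes in UTF-8

print(count_reads("demo.txt", "r", encoding="utf-8"))  # 3 (characters)
print(count_reads("demo.txt", "rb"))                   # 4 (bytes)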

40

First, open a file:

with open("filename") as fileobj:
    for line in fileobj:  
       for ch in line: 
           print ch

Agreed, this seems like the more Pythonic approach. But doesn't it still fail to handle non-ASCII encodings?
Ron

16
One reason you might read a file one character at a time is that the file is too large to fit in memory. But the answer above assumes every individual line fits in memory.
CS
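
A minimal sketch of a variant that does not depend on line length, using the two-argument form of iter() with functools.partial (the filename is hypothetical):

from functools import partial

with open("big.txt", encoding="utf-8") as f:   # hypothetical file
    # iter(callable, sentinel) keeps calling f.read(1) until it returns ''
    for ch in iter(partial(f.read, 1), ''):
        print(ch)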

Edited to match Python 3

16

I like the accepted answer: it is simple and gets the job done. I would also like to offer an alternative implementation:

def chunks(filename, buffer_size=4096):
    """Reads `filename` in chunks of `buffer_size` bytes and yields each chunk
    until no more characters can be read; the last chunk will most likely have
    less than `buffer_size` bytes.

    :param str filename: Path to the file
    :param int buffer_size: Buffer size, in bytes (default is 4096)
    :return: Yields chunks of `buffer_size` size until exhausting the file
    :rtype: str

    """
    with open(filename, "rb") as fp:
        chunk = fp.read(buffer_size)
        while chunk:
            yield chunk
            chunk = fp.read(buffer_size)

def chars(filename, buffer_size=4096):
    """Yields the contents of file `filename` character-by-character. Warning:
    will only work for encodings where one character is encoded as one byte.

    :param str filename: Path to the file
    :param int buffer_size: Buffer size for the underlying chunks,
    in bytes (default is 4096)
    :return: Yields the contents of `filename` character-by-character.
    :rtype: str

    """
    for chunk in chunks(filename, buffer_size):
        # Slice instead of indexing so this also works under Python 3,
        # where indexing a bytes object yields ints rather than 1-byte strings.
        for i in range(len(chunk)):
            yield chunk[i:i + 1]

def main(buffersize, filenames):
    """Reads several files character by character and redirects their contents
    to `/dev/null`.

    """
    for filename in filenames:
        with open("/dev/null", "wb") as fp:
            for char in chars(filename, buffersize):
                fp.write(char)

if __name__ == "__main__":
    # Try reading several files varying the buffer size
    import sys
    buffersize = int(sys.argv[1])
    filenames  = sys.argv[2:]
    sys.exit(main(buffersize, filenames))

The code I suggest boils down to the same idea as your accepted answer: read a given number of bytes from the file. The difference is that it first reads a good-sized chunk of data (4096 is a good default for x86, but you may want to try 1024 or 8192; any multiple of your page size), and then yields the characters in that chunk one by one.

For large files the code I propose may be faster. Take the full text of Tolstoy's War and Peace as an example. These are my timing results (MacBook Pro running OS X 10.7.4; so.py is the name I gave the pasted code):

$ time python so.py 1 2600.txt.utf-8
python so.py 1 2600.txt.utf-8  3.79s user 0.01s system 99% cpu 3.808 total
$ time python so.py 4096 2600.txt.utf-8
python so.py 4096 2600.txt.utf-8  1.31s user 0.01s system 99% cpu 1.318 total

Now: do not take a buffer size of 4096 as universal truth; look at the results I get for different sizes (buffer size in bytes vs. wall-clock time in seconds):

   2 2.726 
   4 1.948 
   8 1.693 
  16 1.534 
  32 1.525 
  64 1.398 
 128 1.432 
 256 1.377 
 512 1.347 
1024 1.442 
2048 1.316 
4096 1.318 

As you can see, you start seeing gains quite early (and my timings are likely far from precise); buffer size is a trade-off between performance and memory. The default of 4096 is just a reasonable choice, but, as always, measure first.
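
A minimal sketch of such a measurement, reusing the chars() generator above (replace the filename with whatever file you want to test):

import time

for size in (2, 64, 1024, 4096, 8192):
    start = time.time()
    n = sum(1 for _ in chars("2600.txt.utf-8", size))   # hypothetical test file
    print("buffer %5d: %d chars in %.3fs" % (size, n, time.time() - start))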


9

Python itself can help you here, interactively:

>>> help(file.read)
Help on method_descriptor:

read(...)
    read([size]) -> read at most size bytes, returned as a string.

    If the size argument is negative or omitted, read until EOF is reached.
    Notice that when in non-blocking mode, less data than what was requested
    may be returned, even if no size parameter was given.

6
I agree with the sentiment, but perhaps this would be more suitable as a comment on the OP?
Mike Boers, 2010

2
It might be, but I think all that text would look pretty messy in a comment.
Mattias Nilsson, 2010

8

Just:

myfile = open(filename)
one_character = myfile.read(1)




2

This will also work:

with open("filename") as fileObj:
    for line in fileObj:  
        for ch in line:
            print(ch)

It goes through every line in the file and every character in every line.


0
f = open('hi.txt', 'w')
f.write('0123456789abcdef')
f.close()
f = open('hi.txt', 'r')
f.seek(12)
print(f.read(1))  # This will read just "c"

3
Welcome to Stack Overflow! You should elaborate a bit - why is this an answer?
davidkonrad, 2015

0

As a supplement: if you are reading a file that contains veeeery long lines that could blow up your memory, you might consider reading the file into a buffer and then yielding each character from it:

def read_char(inputfile, buffersize=10240):
    with open(inputfile, 'r') as f:
        while True:
            buf = f.read(buffersize)
            if not buf:
                break
            for char in buf:
                yield char
        yield ''  # handle the case where the file is empty

if __name__ == "__main__":
    for char in read_char('./very_large_file.txt'):
        process(char)  # process() stands in for whatever you do with each character

0
# reading out the file at once into a list and then printing one-by-one
f = open('file.txt')
for i in list(f.read()):
    print(i)

Although this may answer the author's question, it lacks some explanatory words and links to documentation. A raw code snippet is not very helpful without a few sentences around it. You may also find How do I write a good answer helpful. Please edit your answer.
hellow

You don't need the cast to list.
user240515