Python import csv to list

192

I have a CSV file with about 2000 records.

Each record has a string and a category:

This is the first line,Line1
This is the second line,Line2
This is the third line,Line3

I need to read this file into a list that looks like this:

data = [('This is the first line', 'Line1'),
        ('This is the second line', 'Line2'),
        ('This is the third line', 'Line3')]

How can I import this CSV into the list I need using Python?

python csv

— MorganTN

2

Then use the csv module: docs.python.org/2/library/csv.html

— furas

4

If any answer suits your question, please accept it.

— Maciej Gol, 2015

1

How do I read and write CSV files with Python?

— Martin Thoma

304

Use the csv module:

import csv

with open('file.csv', newline='') as f:
    reader = csv.reader(f)
    data = list(reader)

print(data)

Output:

[['This is the first line', 'Line1'], ['This is the second line', 'Line2'], ['This is the third line', 'Line3']]

If you need tuples:

import csv

with open('file.csv', newline='') as f:
    reader = csv.reader(f)
    data = [tuple(row) for row in reader]

print(data)

Output:

[('This is the first line', 'Line1'), ('This is the second line', 'Line2'), ('This is the third line', 'Line3')]

Old Python 2 answer, also using the csv module:

import csv
with open('file.csv', 'rb') as f:
    reader = csv.reader(f)
    your_list = list(reader)

print your_list
# [['This is the first line', 'Line1'],
#  ['This is the second line', 'Line2'],
#  ['This is the third line', 'Line3']]

— Maciej Gol

4

Why do you use 'rb' instead of 'r'?

— imrek, 2015

5

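@DrunkenMaster, b causes the file to be opened in binary mode rather than text mode. On some systems, text mode means that \n is converted to the platform-specific newline when reading or writing. See the docs.

— Maciej Gol, 2015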

7

This doesn't work in Python 3.x: "csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)". See below for the answer that works in Python 3.x.

— Gilbert
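The error quoted in the comment above can be reproduced without touching the disk. The following is a minimal sketch, assuming Python 3; the in-memory streams and the names `raw`, `binary_error`, and `rows` are illustrative and not part of the original answer.

```python
import csv
import io

raw = b"This is the first line,Line1\r\nThis is the second line,Line2\r\n"

# Binary stream: iterating yields bytes, which csv.reader rejects on Python 3.
try:
    list(csv.reader(io.BytesIO(raw)))
    binary_error = None
except csv.Error as exc:
    binary_error = str(exc)

# Text stream: newline='' leaves the line endings for csv.reader to handle.
text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="")
rows = list(csv.reader(text_stream))

print(binary_error)
print(rows)
```

With a real file, the text-mode equivalent is the `open('file.csv', newline='')` call used in the Python 3 version of the answer.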

2

To save a few seconds of debugging time, you should probably add a note to the first solution, such as "Python 2.x version".

— paradite

How can I use the first solution, but with only some of the columns from the CSV file?

— Sigur
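One possible way to keep only some columns, as asked in the comment above, is to index into each row while reading. This is a minimal sketch rather than part of the original answer; the three-column sample data and the `wanted` tuple of column indexes are assumptions made for illustration.

```python
import csv
import io

# Stand-in for open('file.csv', newline=''); the third column is hypothetical.
sample = io.StringIO(
    "This is the first line,Line1,extra1\n"
    "This is the second line,Line2,extra2\n"
)

wanted = (0, 1)  # hypothetical: indexes of the columns to keep

# Build one tuple per row containing only the wanted columns.
data = [tuple(row[i] for i in wanted) for row in csv.reader(sample)]

print(data)
```

A row shorter than the largest index in `wanted` would raise an IndexError, so this assumes every row has at least that many fields.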

54

Updated for Python 3:

import csv

with open('file.csv', newline='') as f:
    reader = csv.reader(f)
    your_list = list(reader)

print(your_list)

Output:

[['This is the first line', 'Line1'], ['This is the second line', 'Line2'], ['This is the third line', 'Line3']]

— seokhoonlee

'r' is the default mode, so there is no need to specify it. The docs also mention that if csvfile is a file object, it should be opened with newline=''.

— AMC
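The effect of newline='' is easiest to see with a quoted field that contains a line break. The sketch below assumes Python 3 and writes a small temporary file; the file name `demo.csv` and the sample record are hypothetical.

```python
import csv
import os
import tempfile

# One record whose first field contains an embedded Windows-style line break.
payload = b'"line one\r\nline two",Line1\r\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(payload)

    # Recommended: newline='' passes line endings through untouched.
    with open(path, newline="") as f:
        kept = list(csv.reader(f))

    # Default text mode: universal newlines rewrites '\r\n' to '\n' first.
    with open(path) as f:
        translated = list(csv.reader(f))

print(kept)
print(translated)
```

Both calls return a single row, but only the first preserves the embedded line break exactly as it was written.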

43

Pandas is pretty good at dealing with data. Here is an example of how to use it:

import pandas as pd

# Read the CSV into a pandas data frame (df)
#   With a df you can do many things
#   most important: visualize data with Seaborn
df = pd.read_csv('filename.csv', delimiter=',')

# Or export it in many ways, e.g. a list of tuples
tuples = [tuple(x) for x in df.values]

# or export it as a list of dicts (one dict per row)
dicts = df.to_dict('records')

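One big advantage is that pandas deals with header rows automatically.

If you haven't heard of Seaborn, I recommend having a look at it.

See also: How do I read and write CSV files with Python?

Pandas #2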

import pandas as pd

# Get data - reading the CSV file
import mpu.pd
df = mpu.pd.example_df()

# Convert
dicts = df.to_dict('records')

The content of df is:

     country   population population_time    EUR
0    Germany   82521653.0      2016-12-01   True
1     France   66991000.0      2017-01-01   True
2  Indonesia  255461700.0      2017-01-01  False
3    Ireland    4761865.0             NaT   True
4      Spain   46549045.0      2017-06-01   True
5    Vatican          NaN             NaT   True

The content of dicts is:

[{'country': 'Germany', 'population': 82521653.0, 'population_time': Timestamp('2016-12-01 00:00:00'), 'EUR': True},
 {'country': 'France', 'population': 66991000.0, 'population_time': Timestamp('2017-01-01 00:00:00'), 'EUR': True},
 {'country': 'Indonesia', 'population': 255461700.0, 'population_time': Timestamp('2017-01-01 00:00:00'), 'EUR': False},
 {'country': 'Ireland', 'population': 4761865.0, 'population_time': NaT, 'EUR': True},
 {'country': 'Spain', 'population': 46549045.0, 'population_time': Timestamp('2017-06-01 00:00:00'), 'EUR': True},
 {'country': 'Vatican', 'population': nan, 'population_time': NaT, 'EUR': True}]

Pandas #3

import pandas as pd

# Get data - reading the CSV file
import mpu.pd
df = mpu.pd.example_df()

# Convert
lists = [[row[col] for col in df.columns] for row in df.to_dict('records')]

The content of lists is:

[['Germany', 82521653.0, Timestamp('2016-12-01 00:00:00'), True],
 ['France', 66991000.0, Timestamp('2017-01-01 00:00:00'), True],
 ['Indonesia', 255461700.0, Timestamp('2017-01-01 00:00:00'), False],
 ['Ireland', 4761865.0, NaT, True],
 ['Spain', 46549045.0, Timestamp('2017-06-01 00:00:00'), True],
 ['Vatican', nan, NaT, True]]

— Martin Thoma

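tuples = [tuple(x) for x in df.values] can be written as tuples = list(df.itertuples(index=False)). Note that the Pandas docs discourage .values in favour of .to_numpy(). The third example confuses me. First, because the variable is named tuples, which implies a list of tuples, whereas it is actually a list of lists. Second, because as far as I can tell the whole expression can be replaced with df.to_list(). I also don't know whether the second example is really relevant here.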

— AMC

9

Python 3 update:

import csv
from pprint import pprint

with open('text.csv', newline='') as file:
    reader = csv.reader(file)
    res = list(map(tuple, reader))

pprint(res)

Output:

[('This is the first line', ' Line1'),
 ('This is the second line', ' Line2'),
 ('This is the third line', ' Line3')]

If csvfile is a file object, it should be opened with newline=''.
csv module

— Calculus

Why use list(map()) instead of a list comprehension? Also, notice the whitespace at the beginning of each element of the second column.

— AMC

5

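If you are sure there are no commas in your input other than the one separating the category, you can read the file line by line, split on ",", and then push the result to a list.

That said, it looks like you are dealing with a CSV file, so you might consider using the modules for it.

— Miquel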

4

result = []
for line in text.splitlines():
    result.append(tuple(line.split(",")))

— Acid_Snake

1

Could you please add some explanation to this post? Code alone is (sometimes) good, but code plus explanation is (most of the time) better.

— Barranka

3

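I know Barranka's comment is over a year old, but for anyone who stumbles upon this and can't figure it out: for line in text.splitlines(): puts each individual line into the temporary variable "line". line.split(",") creates a list of the strings separated by commas. tuple(~) puts that list into a tuple, and append(~) adds it to the result. After the loop, result is a list of tuples, with one tuple per line and each tuple element being one element of the csv file.

— Louis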

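In addition to what @Louis said, you don't need to use .read().splitlines(); you can iterate over each line of the file directly: for line in in_file: res.append(tuple(line.rstrip().split(","))) Also, note that using .split(',') means that each element of the second column will begin with extra whitespace.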

— AMC

Addendum to the code I shared above: line.rstrip() -> line.rstrip('\n').

— AMC
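Putting the two comments above together gives roughly the following loop. This is a sketch under stated assumptions: `in_file` is an in-memory stand-in for an open file, and the sample lines have a space after each comma, which is the situation the comment describes.

```python
import io

# Stand-in for an open file object; the sample text is illustrative.
in_file = io.StringIO(
    "This is the first line, Line1\n"
    "This is the second line, Line2\n"
)

res = []
for line in in_file:
    # Strip only the trailing newline, then split on the comma.
    res.append(tuple(line.rstrip("\n").split(",")))

print(res)
```

The second element of each tuple keeps its leading space, because a plain split on "," does not trim whitespace.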

3

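As already said in the comments, you can use the csv library in Python. csv means comma-separated values, which seems to be exactly your case: a label and a value separated by a comma.

Since the data is a category and a value, I would rather use a dictionary than a list of tuples.

Anyway, the code below shows both ways: d is the dictionary and l is the list of tuples.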

import csv

file_name = "test.txt"
try:
    csvfile = open(file_name, 'rt')
except:
    print("File not found")
csvReader = csv.reader(csvfile, delimiter=",")
d = dict()
l =  list()
for row in csvReader:
    d[row[1]] = row[0]
    l.append((row[0], row[1]))
print(d)
print(l)

— Francesco Boi

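Why aren't you using a context manager to handle the file? Why are you mixing two different variable naming conventions? Isn't (row[0], row[1]) weaker and more error-prone than just using tuple(row)?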

— AMC

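Why do you think doing tuple(row) is less error-prone? Which variable naming convention are you referring to? Please link an official Python naming convention. As far as I know, try-except is a good way to handle files: what do you mean by context handler?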

— Francesco Boi

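"Why do you think doing tuple(row) is less error-prone?" Because it doesn't require you to write out every single index manually. If you make a mistake, or the number of elements changes, you have to go back and change your code. The try-except is fine; context managers are the with statement. You can find plenty of resources on the subject, such as this one.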

— AMC

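I don't see how the context manager would be better than the good ol' try-except block. On the other hand, the positive aspect is that you type less code. As for the rest, if the number of elements (I guess you mean the number of columns) changes, mine is better because it extracts only the desired values, while the other extracts every column. Without a specific requirement you cannot say which is better, so arguing about it is a waste of time: in this case both are valid.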

— Francesco Boi

"I don't see how the context manager would be better than the good ol' try-except block." See my previous comment: context managers do not replace try-except.

— AMC
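To illustrate the point that the two constructs are complementary, here is a minimal sketch that uses both: try-except reports a missing file and the with statement closes the file. The function name `read_rows` and the path `no_such_file.csv` are hypothetical and chosen only for this example.

```python
import csv


def read_rows(file_name):
    """Return the CSV rows as a list of tuples, or None if the file is missing."""
    try:
        # The with statement closes the file even if parsing raises.
        with open(file_name, newline="") as csvfile:
            return [tuple(row) for row in csv.reader(csvfile)]
    except FileNotFoundError:
        print("File not found")
        return None


rows = read_rows("no_such_file.csv")  # hypothetical path, assumed not to exist
print(rows)
```

Catching FileNotFoundError specifically, rather than using a bare except, lets unrelated errors surface instead of being reported as a missing file.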

2

A simple loop would suffice:

lines = []
with open('test.txt', 'r') as f:
    for line in f.readlines():
        l,name = line.strip().split(',')
        lines.append((l,name))

print(lines)

— Hunter McMillen

1

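What if some of the entries contain commas?

— Tony Ennis

@TonyEnnis Then you would need a more advanced processing loop. The answer by Maciej above shows how to use the csv parser that comes with Python to do this. That parser most likely has all of the logic you need.

— Hunter McMillen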
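The difference raised in the comments above shows up as soon as a field is quoted because it contains a comma. This is a small sketch with a made-up record; the names `naive` and `parsed` are illustrative.

```python
import csv
import io

# Hypothetical record whose text field contains a comma, so it is quoted.
text = '"Hello, world",Line1\n'

# Plain string splitting cuts the quoted field in two.
naive = [tuple(line.rstrip("\n").split(",")) for line in io.StringIO(text)]

# csv.reader understands the quoting and returns two fields.
parsed = [tuple(row) for row in csv.reader(io.StringIO(text))]

print(naive)
print(parsed)
```

The split-based version yields three pieces with stray quote characters, while csv.reader returns the two intended fields.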

1

Unfortunately, I find none of the existing answers particularly satisfying.

Here is a straightforward and complete Python 3 solution using the csv module.

import csv

with open('../resources/temp_in.csv', newline='') as f:
    reader = csv.reader(f, skipinitialspace=True)
    rows = list(reader)

print(rows)

Notice the skipinitialspace=True argument. It is necessary because, unfortunately, OP's CSV contains whitespace after each comma.

Output:

[['This is the first line', 'Line1'], ['This is the second line', 'Line2'], ['This is the third line', 'Line3']]

— AMC
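For comparison, the following sketch reads the same text with and without skipinitialspace=True. It uses an in-memory string in place of the file, and the sample lines are an assumption modelled on the whitespace this answer describes.

```python
import csv
import io

# Sample text with a space after each comma (illustrative data).
text = "This is the first line, Line1\nThis is the second line, Line2\n"

default_rows = list(csv.reader(io.StringIO(text)))
trimmed_rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))

print(default_rows)
print(trimmed_rows)
```

Only whitespace immediately following the delimiter is skipped; trailing spaces inside a field are left alone.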

0

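Extending your requirements a bit, and assuming you do not care about the order of the lines and want them grouped by category, the following solution may work for you: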

>>> fname = "lines.txt"
>>> from collections import defaultdict
>>> dct = defaultdict(list)
>>> with open(fname) as f:
...     for line in f:
...         text, cat = line.rstrip("\n").split(",", 1)
...         dct[cat].append(text)
...
>>> dct
defaultdict(<type 'list'>, {' CatA': ['This is the first line', 'This is the another line'], ' CatC': ['This is the third line'], ' CatB': ['This is the second line', 'This is the last line']})

This way, all relevant lines are available in the dictionary under the key for their category.

— Jan Vlcinsky
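The same grouping can also be done with the csv module, which avoids the leading space in the category keys. This is a sketch that swaps csv.reader in for the manual split; the in-memory sample and its category names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Stand-in for open(fname, newline=''); the sample rows are illustrative.
sample = io.StringIO(
    "This is the first line, CatA\n"
    "This is the second line, CatB\n"
    "This is the another line, CatA\n"
)

dct = defaultdict(list)
# Each row is assumed to have exactly two fields: text and category.
for text, cat in csv.reader(sample, skipinitialspace=True):
    dct[cat].append(text)

print(dict(dct))
```

A row with more or fewer than two fields would raise a ValueError during unpacking, so this assumes well-formed input.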

0

This is the easiest way in Python 3.x to import a CSV into a multidimensional array, and it takes only 4 lines of code without importing anything!

#pull a CSV into a multidimensional array in 4 lines!

L=[]                            #Create an empty list for the main array
for line in open('log.txt'):    #Open the file and read all the lines
    x=line.rstrip()             #Strip the \n from each line
    L.append(x.split(','))      #Split each line into a list and add it to the
                                #Multidimensional array
print(L)

— Jason Boucher

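Careful, it's a list, not an array! Why not use a context manager to handle the file object properly? Note that this solution leaves extra whitespace on the second item of each row, and that it will fail if any of the data contains a comma.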

— AMC

-1

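Next is a piece of code that uses the csv module but extracts the contents of file.csv into a list of dicts, using the first line (the header of the csv table) for the keys: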

import csv

def csv2dicts(filename):
  # Text mode with newline='' is what the csv module expects on Python 3
  with open(filename, newline='') as f:
    reader = csv.reader(f)
    lines = list(reader)
    # Need a header row plus at least one data row
    if len(lines) < 2: return None
    names = lines[0]
    if len(names) < 1: return None
    dicts = []
    for values in lines[1:]:
      # Give up on rows whose length does not match the header
      if len(values) != len(names): return None
      d = {}
      for i,_ in enumerate(names):
        d[names[i]] = values[i]
      dicts.append(d)
    return dicts
  return None

if __name__ == '__main__':
  your_list = csv2dicts('file.csv')
  print(your_list)

— Alexey Antonenko

1

Why not just use csv.DictReader?

— AMC
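A minimal sketch of the csv.DictReader approach suggested in the comment above. The header names `text` and `category` and the in-memory sample are assumptions for illustration; with a real file you would pass the object returned by open('file.csv', newline='').

```python
import csv
import io

# Stand-in for an open file; the header row is hypothetical.
sample = io.StringIO(
    "text,category\n"
    "This is the first line,Line1\n"
    "This is the second line,Line2\n"
)

# DictReader takes the keys from the first row and yields one mapping per record.
dicts = [dict(row) for row in csv.DictReader(sample)]

print(dicts)
```

Unlike the hand-written function, DictReader does not return None for rows of the wrong length; short rows are padded with the restval value and extra fields are collected under the restkey key.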