Why is the comma a bad record separator/delimiter in CSV files?


32

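I was reading this article and I am curious about the correct answer to this question.

The only thing that comes to mind is that in some countries the decimal separator is a comma, which could cause problems when sharing data in CSV, but I am not sure of my answer.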


6
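Almost any delimiter is better than a comma. The reason is that when a comma-delimited file is read into some data-parsing tools, the delimiting commas can be confused with commas used as punctuation, which corrupts the "layout" of the fields or columns.
Mike Hunter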

33
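A cynic would point out that the article is a puff piece for SAS, and might suggest that SAS may have trouble handling CSV files that contain commas :-).
whuber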

3
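@whuber - SAS (in my experience) can have difficulty handling CSV files (with or without commas) and requires a lot of hand coding for every oddity SAS doesn't like.
Jeremy Miles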

8
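The desperate search for ever more obscure delimiters (pipes, thorns, and the like) suggests that agreeing on and following a standard really is the only safe way to exchange data in delimited text files. A general-purpose standard must allow any text string to be represented (as RFC 4180 does) rather than rely on the assumption that certain text strings will never need to be stored.
Scortchi - Reinstate Monica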

2
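(a) I routinely import .csv files successfully. (b) I advise people not to use .csv if their data contain commas. These do not contradict each other. Unfortunately, (b) seems to need explaining in some quarters.
Nick Cox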

Answers:


33

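The CSV format specification is defined in RFC 4180. The specification was published because

"there is no formal specification in existence, which allows for a wide variety of interpretations of CSV files"

Unfortunately, nothing has changed since 2005 (the date the RFC was published). We still have a wide variety of implementations. The general approach defined in RFC 4180 is to enclose fields containing special characters (such as commas) in double quotes, but not all software follows this recommendation.

The problem is that in various European locales the comma serves as the decimal point, so you write 0,005 rather than 0.005. In other locales, commas are used instead of spaces to mark digit groups, as in 4,000,000.00 (see here). In both cases, using a comma can lead to errors when reading data from a CSV file, because your software does not really know whether 0,005, 0,1 is two numbers or four different numbers (see the example here).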
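To make the ambiguity concrete, here is a minimal sketch in Python (my choice of language for illustration; the thread itself only mentions R, SAS, Stata and Perl). It uses only the standard `csv` module and the sample values 0,005 and 0,1 from the paragraph above. The final conversion step assumes the fields use a decimal comma and no thousands separators.

```python
import csv
import io

line = "0,005, 0,1"

# Naive split on every comma: four tokens, and the decimal commas are lost.
naive = [token.strip() for token in line.split(",")]
print(naive)  # ['0', '005', '0', '1']

# The same two values written with RFC 4180-style quoting.
quoted = '"0,005","0,1"\r\n'
row = next(csv.reader(io.StringIO(quoted, newline="")))
print(row)  # ['0,005', '0,1']

# Convert decimal-comma strings to floats
# (assumption: no thousands separators in the data).
values = [float(field.replace(",", ".")) for field in row]
print(values)  # [0.005, 0.1]
```

Without the quotes, nothing in the file itself tells the reader which interpretation was intended.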

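Last but not least, if you store text in your data file, commas are far more common in text than, say, semicolons, so if your text is not enclosed in quotes, such data can easily be read incorrectly.

As long as CSV files are written as RFC 4180 recommends, which avoids the problems described above, the comma is no better and no worse than any other field separator. However, if there is a risk that a simplified CSV format (with fields not enclosed in quotes) will be used, or that the recommendation will be applied inconsistently, then another separator (for example, a semicolon) seems the safer choice.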


6
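Any software that implements the actual CSV standard as defined in RFC 4180 will know exactly how to interpret any given string. Quoting values instead of using a rare separator does bloat the data, but it is correct. Apparently all these people think they know how CSV works when they really don't.
Voo 2015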

2
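@Voo Yes, but because "csv" files are used in such messy ways, it is safer to avoid the comma and use another separator (for example, a semicolon) instead. That is the answer to the OP's question. There is nothing "better" about the semicolon (or any other non-comma) compared with the comma; in many cases it is simply the safer choice.
Tim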

2
@Voo +1 for your comment. However, nobody who uses CSV really cares about bloated data files!
whuber

17

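Technically, a comma is as good as any other character used as a separator. The name of the format says directly that the values are comma separated (Comma Separated Values).

The description of the CSV format uses the comma as the separator.

Any field containing a comma should be enclosed in double quotes, so it causes no problem when the data are read back in. See point 6 of the description:

  6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.

For example, the functions read.csv and write.csv from R use a comma as the separator by default.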
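As an illustration of that quoting rule in a language other than R, here is a small Python sketch using the standard `csv` module, whose default dialect quotes only the fields that need it. The sample rows are made up for the example.

```python
import csv
import io

# Hypothetical sample data: one field with a comma, one with a line break.
rows = [
    ["id", "note"],
    ["1", "plain text"],
    ["2", "contains, a comma"],
    ["3", "line one\r\nline two"],
]

# Write with the default dialect: delimiter="," and quoting=csv.QUOTE_MINIMAL,
# so only fields containing special characters are wrapped in double quotes.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(text)

# Read it back: the comma and the embedded line break survive intact.
parsed = list(csv.reader(io.StringIO(text, newline="")))
print(parsed == rows)  # True
```

The plain fields are written unquoted, so quoting adds overhead only where the data actually contain a comma, a quote, or a line break.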


4
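This is the best answer, because it points out that the values are comma separated. Others mention European number formatting, which is not a problem for the csv standard, as you correctly cite in point 6 above. With any data format there is a difference between the format and "using it properly". The key is: know your data. Others mention tab delimiting, but when you deal with user-entered data (perhaps via a form and captured by a database), tabs can have the same problem as commas. I have had to wrangle free-text input fields where people fat-fingered a tab... it sucks.
Adrian Torrie 2015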

Tim's answer has now been edited to include the information @djhurio provided.
Adrian Torrie

11

In addition to being a digit separator in numbers, the comma also forms part of addresses (such as customer addresses) in many countries. While some countries have short, well-defined addresses, many others have long-winded addresses that sometimes include two commas on the same line. Good CSV files enclose all such data in double quotes, but over-simplistic, poorly written parsers don't provide for reading and distinguishing such fields. (Then there is the problem of double quotes appearing as part of the data, such as a quotation from a poem.)
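A short Python sketch of the address problem, using the standard `csv` module. The address is made up for illustration. By default the module escapes an embedded double quote by doubling it, which is the RFC 4180 convention, while a naive split on commas breaks the record apart.

```python
import csv
import io

# Made-up address containing both commas and double quotes.
address = 'Flat 2, "Rose Cottage", Mill Lane'

buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(["42", address])
line = buffer.getvalue()
print(line)  # 42,"Flat 2, ""Rose Cottage"", Mill Lane"

# An over-simplistic parser: split on every comma.
naive = line.strip().split(",")
print(len(naive))  # 4 pieces instead of 2 fields

# A quote-aware parser recovers the original two fields.
parsed = next(csv.reader(io.StringIO(line, newline="")))
print(parsed == ["42", address])  # True
```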


2
(+1) The standard provides for use of double quotes as part of the data by insisting on doubling them again: "Belloc", "Tarantella", """the fleas that tease in the High Pyrenees""". In England it's not uncommon to find address fields containing the name of a house in quotes, thus: "Chatsworth", Melton Road, Leamington. (It's not clear why: Fowler grumbled that "the implication seems to be: living in the house that sensible people call '164 Melton Road', but one fool likes to call 'Chatsworth'".)
Scortchi - Reinstate Monica

1
@Scortchi It seems that we learned the same poems at age 12 (+/- error). I fear that what I read as unfortunate early 20th century English snobbery of the upper middle-class for the habits of the lower middle-class obscures your last example, which will not be transparent beyond a small group.
Nick Cox

@NickCox: Twelve sounds about right. Funny that I can't remember whether I've read any poems this year, let alone recall any lines from them. Though Fowler's point was about the effect on the reader of unnecessary quotation marks (see unnecessaryquotes.com), I think you're right to see the influence of snobbery in his choice of example. At any rate, I hope the rather minor point that it's something to watch out for if you're ever sent a CSV file containing English addresses is clear to all despite my divagations.
Scortchi - Reinstate Monica

1
In India, it is common for people who build their first homes (not apartments) to give them an inventive, flowery name, often in a vernacular language or a Sanskrit phrase, and those names are written in double quotes, such as "Guru Kripa". Names like Genelia D'Souza and Derek O'Brien are common too. Then there are addresses that say " Old Door No. nnn / New Door No. mm/c ", due to government renumbering, which complicate address storage even further by putting slashes and single quotes in unexpected corners.
Whirl Mind

@WhirlMind: That's interesting - I've noticed a lot of - well, more than I'd expect - Scottish Gaelic & Welsh house names in England, which is perhaps the nearest equivalent to picking a vernacular language in which to name your home.
Scortchi - Reinstate Monica

9

While @Tim's answer is correct, I would like to add that "csv" as a whole has no common standard. In particular, the escaping rules are not defined at all, leading to "formats" which are readable in one program but not in another. This is exacerbated by the fact that every "programmer" under the sun just thinks "oooh csv, I will build my own parser!" and then misses all of the edge cases.

Moreover, csv totally lacks the ability to store metadata or even the data type of a column, so you often have to read several accompanying documents to understand the data.
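The missing type information is easy to see with Python's standard `csv` module: every field comes back as a string, and the types have to be supplied from outside the file. The column names and the `schema` mapping below are hypothetical, invented for this sketch.

```python
import csv
import io

text = "id,price,active\r\n007,1.50,true\r\n"

# Every value is read as a plain string; the file carries no type information.
row = next(csv.DictReader(io.StringIO(text, newline="")))
print(row)  # {'id': '007', 'price': '1.50', 'active': 'true'}

# Hypothetical out-of-band schema: the kind of knowledge that otherwise lives
# in separate documentation. Note that "id" must stay a string to keep "007".
schema = {
    "id": str,
    "price": float,
    "active": lambda s: s == "true",
}
typed = {name: schema[name](value) for name, value in row.items()}
print(typed)  # {'id': '007', 'price': 1.5, 'active': True}
```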


5
Yes, there is a standard: tools.ietf.org/html/rfc4180. Many other formats do not store any metadata either; csv is just not designed for storing metadata. .txt files also do not store metadata about text documents...
Tim

4
Tim, that standard is ignored more often than not, making it a non-standard,,,
Christian Sauer

8
The great thing about standards is that there are so many to choose from. (Variously mutated and attributed.)
Nick Cox

4

If you can ditch the comma delimiter and use a tab character you will have much better success. You can leave the file named .CSV and importing into most programs is usually not a problem. Just specify TAB delimited rather than comma when you import your file. If there are commas in your data you WILL have a problem when specifying comma delimited as you are well aware.
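For what it's worth, here is how the tab-delimited approach might look in Python with the standard `csv` module (a sketch with made-up data, not tied to any particular import tool). The commas in the data need no quoting at all, and a field that does contain a tab still gets quoted by the writer.

```python
import csv
import io

# Made-up rows whose fields contain commas but no tabs.
rows = [["name", "price"], ["Widgets, large", "9,99"]]

buffer = io.StringIO(newline="")
csv.writer(buffer, delimiter="\t").writerows(rows)
text = buffer.getvalue()
print(repr(text))  # 'name\tprice\r\nWidgets, large\t9,99\r\n'

# Specify the tab delimiter again when reading the file back.
parsed = list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))
print(parsed == rows)  # True

# A field that itself contains a tab is quoted, just as commas are in CSV.
tabbed = io.StringIO(newline="")
csv.writer(tabbed, delimiter="\t").writerow(["a\tb", "c"])
print(repr(tabbed.getvalue()))  # '"a\tb"\tc\r\n'
```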


5
If there are tabs in your data, the converse applies. It's just, at least in my experience, less likely.
Nick Cox

@Nick and Gorilla: I've had good results with | as a delimiter in home-brewed csv-like text files of records (with book titles and other document metadata). | never occurs in the data I work with, so I can just write perl scripts that simply split/join without checking for quoting of any kind. This was for a one-off project that just involves processing metadata saved from an MS Access database. For any larger project, or if you plan to keep data in this file-format long-term, pick something more robust! I could always tweak something if this month's batch broke something.
Peter Cordes

@PeterCordes I believe you, and whatever works. But clearly the cost of idiosyncratic separators may be the need to explain those to others and it is key that they can import such data files without difficulty. Faced with an unusual file format, it is necessary to have access to some routine, function or command that can split strings on arbitrary separators.
Nick Cox

@PeterCordes When I wrote a split command for Stata I looked at, among other things, the Perl equivalent to see what it did and didn't do. Not the source code, just the functionality offered.
Nick Cox

1
@NickCox: A lot of perl's functions are quite well designed, IMO. They get the job done without a lot of special limitations like you find in awk (which is often good), or esp. other Unix tools like cut, sort, and uniq.
Peter Cordes

4

ASCII provides us with four "separator" characters, as shown below in a snippet from the ascii(7) *nix man page:

   Oct   Dec   Hex   Char
   ----------------------
   034   28    1C    FS  (file separator)
   035   29    1D    GS  (group separator)
   036   30    1E    RS  (record separator)
   037   31    1F    US  (unit separator)

This answer provides a decent overview of their intended usage.

Of course, these control codes lack the human-friendliness (readability and input) of more popular delimiters, but are acceptable choices for internal and/or ephemeral exchange of data between programs.
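As a rough sketch of how two of these characters could be used between programs, here is a Python example. The helper names `dumps` and `loads` are my own, not from any library. It uses US between fields and RS between records, and it simply refuses data that already contains either character, since there is no quoting mechanism here.

```python
US = "\x1f"  # unit separator (between fields)
RS = "\x1e"  # record separator (between records)


def dumps(records):
    """Join records with RS and fields with US; reject data containing either."""
    for fields in records:
        for field in fields:
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(fields) for fields in records)


def loads(blob):
    """Inverse of dumps for non-empty input."""
    return [record.split(US) for record in blob.split(RS)]


# Commas, double quotes and newlines need no special handling.
records = [["1", 'He said "hi", then left'], ["2", "two\nlines"]]
blob = dumps(records)
print(loads(blob) == records)  # True
```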


2
Interesting. I don't think I've ever seen these used in the wild though...
Matt Krause

4

The problem is not the comma; the problem is quoting. Regardless of which record and field delimiters you use, you need to be prepared for meeting them in the content. So you need a quoting mechanism. AND THEN you need a way for the quoting character(s) to appear too.

Following the RFC 4180 standard makes everything simpler for everybody.

I have personally had to write a script to probably fix the output from a program that got this wrong, so I am a bit militant about it. "probably fix" means that it worked for MY data, but I can see situations where it would fail. (In that program's defense, it was written before the standard.)

Licensed under cc by-sa 3.0 with attribution required.