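I was reading this article and am curious about the correct answer to this question.
The only thing that comes to mind is that in some countries the decimal separator is a comma, which could cause problems when sharing data in CSV, but I am not sure about my answer.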
Answers:
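The CSV format is specified in RFC 4180. That specification was published because "there is no formal specification in existence, which allows for a wide variety of interpretations of CSV files".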
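Unfortunately, nothing has changed since 2005, when the RFC was published: there is still a wide variety of implementations. The general approach defined in RFC 4180 is to enclose fields that contain special characters (such as commas) in quotation marks, but other software does not always follow this recommendation.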
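The problem is that in various European locales the comma serves as the decimal mark, so you write `0,005` rather than `0.005`. In other cases, commas are used instead of spaces to group digits, as in `4,000,000.00` (see here). In both cases, using commas can lead to errors when reading data from CSV files, because your software cannot really know whether `0,005, 0,1` is two numbers or four different numbers (see the example here).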
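The ambiguity is easy to reproduce. The following is a minimal sketch in Python, assuming the standard-library `csv` module; the sample strings and variable names are illustrative only:

```python
import csv
import io

# Unquoted line with decimal commas: a naive split cannot tell the
# decimal commas apart from the field separators.
unquoted = "0,005, 0,1"
naive_fields = unquoted.split(",")

# The same two values with each field quoted, as RFC 4180 recommends.
quoted = '"0,005","0,1"'
quoted_fields = next(csv.reader(io.StringIO(quoted)))

print(naive_fields)   # four fragments
print(quoted_fields)  # two fields
```

Without quoting, the reader has no information to recover the intended two values; with quoting, the commas inside each field are treated as data.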
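Last but not least, if you store text in your data file, commas are far more common in text than, for example, semicolons, so if the text is not enclosed in quotation marks, such data can also easily be read incorrectly.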
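As long as CSV files are used as recommended by RFC 4180, which avoids the problems above, the comma is no better or worse a field separator than any other. However, if there is a risk that a simplified CSV format (one that does not enclose fields in quotation marks) will be used, or that the recommendation will be applied inconsistently, then another separator (such as the semicolon) appears to be the safer choice.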
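As a rough illustration of the semicolon variant, here is a sketch in Python that reads semicolon-delimited data with decimal commas. The sample data and the helper `parse_decimal_comma` are hypothetical, and the conversion assumes the numbers contain no thousands separators:

```python
import csv
import io

# Hypothetical sample: semicolon-separated fields, comma as decimal mark.
data = "name;value\na;0,005\nb;0,1\n"

def parse_decimal_comma(text):
    """Convert a string such as '0,005' to a float.

    Assumes the comma is the decimal mark and that there are no
    thousands separators in the input.
    """
    return float(text.replace(",", "."))

reader = csv.reader(io.StringIO(data), delimiter=";")
header = next(reader)  # skip the header row
rows = [(name, parse_decimal_comma(value)) for name, value in reader]
print(rows)
```

This only moves the problem: a semicolon inside a text field would still need quoting.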
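True, using `,` instead of a rare delimiter bloats the data, but it is the correct approach. Evidently, all these people think they know how CSV works, but they actually do not.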
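Technically, the comma is as good as any other character used as a separator. The name of the format says directly that the values are separated by commas (Comma-Separated Values).
The description of the CSV format uses the comma as the separator.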
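Any field containing a comma should be enclosed in double quotes, so commas cause no problem when the data is read back. See point 6 of the specification:
- Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.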
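The quoting rule can be sketched with Python's standard-library `csv` module, whose default dialect quotes only the fields that need it. The sample row is made up for illustration:

```python
import csv
import io

# One field contains a comma, one is plain, one contains a line break.
row = ["Smith, John", "plain", "line one\r\nline two"]

buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(row)  # default dialect quotes only where needed
encoded = buffer.getvalue()

# Reading the text back recovers the original three fields.
decoded = next(csv.reader(io.StringIO(encoded, newline="")))

print(repr(encoded))
print(decoded == row)
```

The fields with a comma or a line break are wrapped in double quotes, the plain field is left alone, and the round trip returns the original values.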
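The values are comma separated. For those who mention European formatting of numbers: that is not a problem for the CSV standard, provided you quote fields correctly, as in point 6 above. With any data format there is a gap between the format itself and "correct usage". The key is: know your data. Others mention `tab` or `;` delimiting, but when you are dealing with user-entered data (perhaps entered through a form and captured in a database), those can have the same problems as the comma. I have had to wrangle free-text input fields in which people had fat-fingered a `tab` ... it sucks.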
Besides being a digit separator in numbers, the comma also forms part of addresses (customer addresses and the like) in many countries. While some countries have short, well-defined addresses, many others have long-winded addresses that sometimes include two commas on the same line. Good CSV files enclose all such data in double quotes, but over-simplistic, poorly written parsers do not handle quoted fields correctly. (Then there is the problem of double quotes appearing as part of the data, such as a quotation from a poem.)
While @Tim's answer is correct, I would like to add that "csv" as a whole has no common standard. In particular, the escaping rules are not defined at all, leading to "formats" that are readable in one program but not in another. This is exacerbated by the fact that every "programmer" under the sun just thinks "oooh csv, I will build my own parser!" and then misses all of the edge cases.
Moreover, csv totally lacks the ability to store metadata or even the data type of a column, so you often have to read several additional documents to understand the data.
If you can ditch the comma delimiter and use a tab character you will have much better success. You can leave the file named .CSV and importing into most programs is usually not a problem. Just specify TAB delimited rather than comma when you import your file. If there are commas in your data you WILL have a problem when specifying comma delimited as you are well aware.
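For example, a tab-delimited file can be read in Python by passing `delimiter="\t"` to the standard-library `csv.reader`. This is only a sketch with made-up sample data:

```python
import csv
import io

# Hypothetical tab-delimited content; the commas are ordinary data here.
text = "title\tprice\nWar, Peace, and More\t1,000\n"

rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
print(rows)
```

The commas pass through untouched because they are no longer the delimiter. A tab inside a field would of course raise the same issue again.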
I use `|` as a delimiter in home-brewed csv-like text files of records (with book titles and other document metadata). `|` never occurs in the data I work with, so I can just write Perl scripts that simply split/join without checking for quoting of any kind. This was for a one-off project that just involved processing metadata saved from an MS Access database. For any larger project, or if you plan to keep data in this file format long-term, pick something more robust! I could always tweak something if this month's batch broke something.
`split` command for Stata I looked at, among other things, the Perl equivalent to see what it did and didn't do. Not the source code, just the functionality offered.
`cut`, `sort`, and `uniq`.
ASCII provides us with four "separator" characters, as shown below in a snippet from the ascii(7) *nix man page:
Oct   Dec   Hex   Char
------------------------------------
034   28    1C    FS (file separator)
035   29    1D    GS (group separator)
036   30    1E    RS (record separator)
037   31    1F    US (unit separator)
This answer provides a decent overview of their intended usage.
Of course, these control codes lack the human-friendliness (readability and input) of more popular delimiters, but are acceptable choices for internal and/or ephemeral exchange of data between programs.
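A minimal sketch of such an exchange in Python, using US between fields and RS between records. The helper names `encode` and `decode` are hypothetical, and the sketch assumes the data itself never contains these two control characters:

```python
UNIT_SEP = "\x1f"    # US: separates fields within a record
RECORD_SEP = "\x1e"  # RS: separates records

def encode(records):
    """Join a list of records (each a list of strings) into one string."""
    return RECORD_SEP.join(UNIT_SEP.join(fields) for fields in records)

def decode(blob):
    """Split a string produced by encode() back into records."""
    return [record.split(UNIT_SEP) for record in blob.split(RECORD_SEP)]

# Commas, quotes, line breaks, semicolons, and pipes are all plain data.
records = [["Smith, John", 'said "hi"'], ["multi\nline", "a;b|c"]]
blob = encode(records)
print(decode(blob) == records)
```

No quoting is needed here only because of the stated assumption; if the control characters could appear in the data, an escaping scheme would be required after all.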
The problem is not the comma; the problem is quoting. Regardless of which record and field delimiters you use, you need to be prepared for meeting them in the content. So you need a quoting mechanism. AND THEN you need a way for the quoting character(s) to appear too.
Following the RFC 4180 standard makes everything simpler for everybody.
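RFC 4180 handles the last requirement by doubling any double quote that appears inside a quoted field. The following Python sketch shows this with the standard-library `csv` module; the sample field is invented for illustration:

```python
import csv
import io

# A field that contains both the delimiter and the quote character.
field = 'He said "yes, no, maybe"'

buffer = io.StringIO(newline="")
csv.writer(buffer).writerow([field, "x"])
line = buffer.getvalue()

# Embedded double quotes are doubled and the field is wrapped in quotes.
print(line)

# A conforming reader undoes both steps.
round_trip = next(csv.reader(io.StringIO(line, newline="")))
print(round_trip)
```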
I have personally had to write a script to probably fix the output from a program that got this wrong, so I am a bit militant about it. "probably fix" means that it worked for MY data, but I can see situations where it would fail. (In that program's defense, it was written before the standard.)