Answers:
这是到目前为止我发现的最快的版本,比readLines快6倍。在150MB的日志文件上,这需要0.35秒,而使用readLines()则需要2.40秒。只是为了好玩,Linux的wc -l命令需要0.15秒。
public static int countLinesOld(String filename) throws IOException {
InputStream is = new BufferedInputStream(new FileInputStream(filename));
try {
byte[] c = new byte[1024];
int count = 0;
int readChars = 0;
boolean empty = true;
while ((readChars = is.read(c)) != -1) {
empty = false;
for (int i = 0; i < readChars; ++i) {
if (c[i] == '\n') {
++count;
}
}
}
return (count == 0 && !empty) ? 1 : count;
} finally {
is.close();
}
}
编辑,在9 1/2年后:我几乎没有Java经验,但是无论如何我都尝试根据LineNumberReader
下面的解决方案对该代码进行基准测试,因为它困扰着我没有人做。似乎特别是对于大文件,我的解决方案更快。尽管似乎要花一些时间才能使优化程序完成不错的工作。我已经玩了一些代码,并产生了一个始终最快的新版本:
public static int countLinesNew(String filename) throws IOException {
InputStream is = new BufferedInputStream(new FileInputStream(filename));
try {
byte[] c = new byte[1024];
int readChars = is.read(c);
if (readChars == -1) {
// bail out if nothing to read
return 0;
}
// make it easy for the optimizer to tune this loop
int count = 0;
while (readChars == 1024) {
for (int i=0; i<1024;) {
if (c[i++] == '\n') {
++count;
}
}
readChars = is.read(c);
}
// count remaining characters
while (readChars != -1) {
System.out.println(readChars);
for (int i=0; i<readChars; ++i) {
if (c[i] == '\n') {
++count;
}
}
readChars = is.read(c);
}
return count == 0 ? 1 : count;
} finally {
is.close();
}
}
1.3GB文本文件的基准结果,y轴以秒为单位。我用相同的文件执行了100次运行,并用System.nanoTime()
。您会看到它countLinesOld
有一些异常值,并且countLinesNew
没有异常值,虽然速度更快一点,但是差异在统计上是显着的。LineNumberReader
显然慢一些。
我已经实现了该问题的另一种解决方案,我发现它在计数行时更有效:
try
(
FileReader input = new FileReader("input.txt");
LineNumberReader count = new LineNumberReader(input);
)
{
while (count.skip(Long.MAX_VALUE) > 0)
{
// Loop just in case the file is > Long.MAX_VALUE or skip() decides to not read the entire file
}
result = count.getLineNumber() + 1; // +1 because line index starts at 0
}
LineNumberReader
的lineNumber
字段是整数...会不会只包装长度超过Integer.MAX_VALUE的文件?为什么要在这里长时间跳过呢?
wc -l
计算文件中换行符的数量。这行得通,因为每一行都以换行符结尾,包括文件中的最后一行。每行都有一个换行符,包括空行,因此换行符数==文件中的行数。现在,lineNumber
in中的变量FileNumberReader
也表示看到的换行符的数量。它从零开始,直到找到任何换行符为止,并且随着看到的每个换行符而增加。因此,请不要在行号中添加一个。
wc -l
也是报告这种文件的方式。另请参阅stackoverflow.com/questions/729692/…–
wc -l
将返回1。
对于不以换行符结尾的多行文件,可接受的答案有一个错误。一个没有换行符结束的一行文件将返回1,但是一个没有换行符结束的两行文件也将返回1。这是解决此问题的公认解决方案的实现。除了最后的读取,这些endsWithoutNewLine检查会浪费一切,但与整个函数相比,在时间上应该是微不足道的。
public int count(String filename) throws IOException {
InputStream is = new BufferedInputStream(new FileInputStream(filename));
try {
byte[] c = new byte[1024];
int count = 0;
int readChars = 0;
boolean endsWithoutNewLine = false;
while ((readChars = is.read(c)) != -1) {
for (int i = 0; i < readChars; ++i) {
if (c[i] == '\n')
++count;
}
endsWithoutNewLine = (c[readChars - 1] != '\n');
}
if(endsWithoutNewLine) {
++count;
}
return count;
} finally {
is.close();
}
}
用 Java-8,您可以使用流:
try (Stream<String> lines = Files.lines(path, Charset.defaultCharset())) {
long numOfLines = lines.count();
...
}
如果文件的末尾没有换行符,则上述方法count()的答案给了我行错误计数-它无法计算文件中的最后一行。
这种方法对我来说更好:
public int countLines(String filename) throws IOException {
LineNumberReader reader = new LineNumberReader(new FileReader(filename));
int cnt = 0;
String lineRead = "";
while ((lineRead = reader.readLine()) != null) {}
cnt = reader.getLineNumber();
reader.close();
return cnt;
}
cnt
。
我知道这是一个老问题,但是公认的解决方案与我需要它做的并不完全匹配。因此,我对其进行了改进,使其可以接受各种行终止符(而不仅仅是换行符),并使用指定的字符编码(而不是ISO-8859- n)。一机多用(适当地重构):
public static long getLinesCount(String fileName, String encodingName) throws IOException {
long linesCount = 0;
File file = new File(fileName);
FileInputStream fileIn = new FileInputStream(file);
try {
Charset encoding = Charset.forName(encodingName);
Reader fileReader = new InputStreamReader(fileIn, encoding);
int bufferSize = 4096;
Reader reader = new BufferedReader(fileReader, bufferSize);
char[] buffer = new char[bufferSize];
int prevChar = -1;
int readCount = reader.read(buffer);
while (readCount != -1) {
for (int i = 0; i < readCount; i++) {
int nextChar = buffer[i];
switch (nextChar) {
case '\r': {
// The current line is terminated by a carriage return or by a carriage return immediately followed by a line feed.
linesCount++;
break;
}
case '\n': {
if (prevChar == '\r') {
// The current line is terminated by a carriage return immediately followed by a line feed.
// The line has already been counted.
} else {
// The current line is terminated by a line feed.
linesCount++;
}
break;
}
}
prevChar = nextChar;
}
readCount = reader.read(buffer);
}
if (prevCh != -1) {
switch (prevCh) {
case '\r':
case '\n': {
// The last line is terminated by a line terminator.
// The last line has already been counted.
break;
}
default: {
// The last line is terminated by end-of-file.
linesCount++;
}
}
}
} finally {
fileIn.close();
}
return linesCount;
}
该解决方案的速度与公认的解决方案相当,但在我的测试中却慢了约4%(尽管众所周知,Java中的时序测试并不可靠)。
/**
* Count file rows.
*
* @param file file
* @return file row count
* @throws IOException
*/
public static long getLineCount(File file) throws IOException {
try (Stream<String> lines = Files.lines(file.toPath())) {
return lines.count();
}
}
在JDK8_u31上测试。但是与这种方法相比,确实性能较慢:
/**
* Count file rows.
*
* @param file file
* @return file row count
* @throws IOException
*/
public static long getLineCount(File file) throws IOException {
try (BufferedInputStream is = new BufferedInputStream(new FileInputStream(file), 1024)) {
byte[] c = new byte[1024];
boolean empty = true,
lastEmpty = false;
long count = 0;
int read;
while ((read = is.read(c)) != -1) {
for (int i = 0; i < read; i++) {
if (c[i] == '\n') {
count++;
lastEmpty = true;
} else if (lastEmpty) {
lastEmpty = false;
}
}
empty = false;
}
if (!empty) {
if (count == 0) {
count = 1;
} else if (!lastEmpty) {
count++;
}
}
return count;
}
}
经过测试,非常快。
Stream<String> - Time consumed: 122796351 Stream<String> - Num lines: 109808 Method - Time consumed: 12838000 Method - Num lines: 1
行数甚至也错了
BufferedInputStream
则无论如何都不要读入自己的缓冲区。此外,即使您的方法可能在性能上稍有优势,但由于它不再支持\r
单行终止符(旧的MacOS)并且不再支持每种编码,因此也会失去灵活性。
使用Scanner的直接方式
static void lineCounter (String path) throws IOException {
int lineCount = 0, commentsCount = 0;
Scanner input = new Scanner(new File(path));
while (input.hasNextLine()) {
String data = input.nextLine();
if (data.startsWith("//")) commentsCount++;
lineCount++;
}
System.out.println("Line Count: " + lineCount + "\t Comments Count: " + commentsCount);
}
我得出结论wc -l
::s计数换行符的方法很好,但是在最后一行不以换行符结尾的文件上返回非直观结果。
基于LineNumberReader的@ er.vikas解决方案但在行数中添加了一个,则在最后一行以换行符结尾的文件上返回了非直观的结果。
因此,我做了一个算法,其处理如下:
@Test
public void empty() throws IOException {
assertEquals(0, count(""));
}
@Test
public void singleNewline() throws IOException {
assertEquals(1, count("\n"));
}
@Test
public void dataWithoutNewline() throws IOException {
assertEquals(1, count("one"));
}
@Test
public void oneCompleteLine() throws IOException {
assertEquals(1, count("one\n"));
}
@Test
public void twoCompleteLines() throws IOException {
assertEquals(2, count("one\ntwo\n"));
}
@Test
public void twoLinesWithoutNewlineAtEnd() throws IOException {
assertEquals(2, count("one\ntwo"));
}
@Test
public void aFewLines() throws IOException {
assertEquals(5, count("one\ntwo\nthree\nfour\nfive\n"));
}
它看起来像这样:
static long countLines(InputStream is) throws IOException {
try(LineNumberReader lnr = new LineNumberReader(new InputStreamReader(is))) {
char[] buf = new char[8192];
int n, previousN = -1;
//Read will return at least one byte, no need to buffer more
while((n = lnr.read(buf)) != -1) {
previousN = n;
}
int ln = lnr.getLineNumber();
if (previousN == -1) {
//No data read at all, i.e file was empty
return 0;
} else {
char lastChar = buf[previousN - 1];
if (lastChar == '\n' || lastChar == '\r') {
//Ending with newline, deduct one
return ln;
}
}
//normal case, return line number + 1
return ln + 1;
}
}
如果您想获得直观的结果,则可以使用它。如果只需要wc -l
兼容性,请简单使用@ er.vikas解决方案,但不要在结果中添加一个并重试跳过:
try(LineNumberReader lnr = new LineNumberReader(new FileReader(new File("File1")))) {
while(lnr.skip(Long.MAX_VALUE) > 0){};
return lnr.getLineNumber();
}
如何在Java代码中使用Process类?然后读取命令的输出。
Process p = Runtime.getRuntime().exec("wc -l " + yourfilename);
p.waitFor();
BufferedReader b = new BufferedReader(new InputStreamReader(p.getInputStream()));
String line = "";
int lineCount = 0;
while ((line = b.readLine()) != null) {
System.out.println(line);
lineCount = Integer.parseInt(line);
}
需要尝试一下。将发布结果。
如果您没有任何索引结构,那么您将无法读取整个文件。但是您可以通过避免逐行读取并使用正则表达式匹配所有行终止符来对其进行优化。
EOF处没有换行符('\ n')的多行文件的最佳优化代码。
/**
*
* @param filename
* @return
* @throws IOException
*/
public static int countLines(String filename) throws IOException {
int count = 0;
boolean empty = true;
FileInputStream fis = null;
InputStream is = null;
try {
fis = new FileInputStream(filename);
is = new BufferedInputStream(fis);
byte[] c = new byte[1024];
int readChars = 0;
boolean isLine = false;
while ((readChars = is.read(c)) != -1) {
empty = false;
for (int i = 0; i < readChars; ++i) {
if ( c[i] == '\n' ) {
isLine = false;
++count;
}else if(!isLine && c[i] != '\n' && c[i] != '\r'){ //Case to handle line count where no New Line character present at EOF
isLine = true;
}
}
}
if(isLine){
++count;
}
}catch(IOException e){
e.printStackTrace();
}finally {
if(is != null){
is.close();
}
if(fis != null){
fis.close();
}
}
LOG.info("count: "+count);
return (count == 0 && !empty) ? 1 : count;
}
带有正则表达式的扫描器:
public int getLineCount() {
Scanner fileScanner = null;
int lineCount = 0;
Pattern lineEndPattern = Pattern.compile("(?m)$");
try {
fileScanner = new Scanner(new File(filename)).useDelimiter(lineEndPattern);
while (fileScanner.hasNext()) {
fileScanner.next();
++lineCount;
}
}catch(FileNotFoundException e) {
e.printStackTrace();
return lineCount;
}
fileScanner.close();
return lineCount;
}
还没计时。
如果你用这个
public int countLines(String filename) throws IOException {
LineNumberReader reader = new LineNumberReader(new FileReader(filename));
int cnt = 0;
String lineRead = "";
while ((lineRead = reader.readLine()) != null) {}
cnt = reader.getLineNumber();
reader.close();
return cnt;
}
您无法运行到大行,例如10万行,因为从reader.getLineNumber返回的是int。您需要长类型的数据来处理最多的行。
int
最多可以容纳20亿个值。如果加载的文件超过20亿行,则存在溢出问题。就是说,如果加载的未索引文本文件超过20亿行,则可能还有其他问题。