Fastest way to write huge data in a text file in Java



I have to write huge data to a text [csv] file. I used BufferedWriter to write the data, and it took around 40 seconds to write 174 MB of data. Is this the fastest speed Java can offer?

bufferedWriter = new BufferedWriter(new FileWriter("fileName.csv"));

Note: These 40 seconds include the time for iterating over and fetching the records from the ResultSet as well. :) The 174 MB is for 400,000 rows in the ResultSet.


You wouldn't happen to have antivirus software active on the machine you run this code on?
Thorbjørn Ravn Andersen

Answers:



You might try removing the BufferedWriter and just using the FileWriter directly. On a modern system there's a good chance you're just writing to the drive's cache memory anyway.

It takes me in the range of 4-5 seconds to write 175 MB (4 million strings) -- this is on a dual-core 2.4 GHz Dell running Windows XP with an 80 GB, 7200-RPM Hitachi disk.

Can you isolate how much of the time is record retrieval and how much is file writing?

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

public class FileWritingPerfTest {
    

private static final int ITERATIONS = 5;
private static final double MEG = (Math.pow(1024, 2));
private static final int RECORD_COUNT = 4000000;
private static final String RECORD = "Help I am trapped in a fortune cookie factory\n";
private static final int RECSIZE = RECORD.getBytes().length;

public static void main(String[] args) throws Exception {
    List<String> records = new ArrayList<String>(RECORD_COUNT);
    int size = 0;
    for (int i = 0; i < RECORD_COUNT; i++) {
        records.add(RECORD);
        size += RECSIZE;
    }
    System.out.println(records.size() + " 'records'");
    System.out.println(size / MEG + " MB");
    
    for (int i = 0; i < ITERATIONS; i++) {
        System.out.println("\nIteration " + i);
        
        writeRaw(records);
        writeBuffered(records, 8192);
        writeBuffered(records, (int) MEG);
        writeBuffered(records, 4 * (int) MEG);
    }
}

private static void writeRaw(List<String> records) throws IOException {
    File file = File.createTempFile("foo", ".txt");
    try {
        FileWriter writer = new FileWriter(file);
        System.out.print("Writing raw... ");
        write(records, writer);
    } finally {
        // comment this out if you want to inspect the files afterward
        file.delete();
    }
}

private static void writeBuffered(List<String> records, int bufSize) throws IOException {
    File file = File.createTempFile("foo", ".txt");
    try {
        FileWriter writer = new FileWriter(file);
        BufferedWriter bufferedWriter = new BufferedWriter(writer, bufSize);
    
        System.out.print("Writing buffered (buffer size: " + bufSize + ")... ");
        write(records, bufferedWriter);
    } finally {
        // comment this out if you want to inspect the files afterward
        file.delete();
    }
}

private static void write(List<String> records, Writer writer) throws IOException {
    long start = System.currentTimeMillis();
    for (String record: records) {
        writer.write(record);
    }
    // writer.flush(); // close() should take care of this
    writer.close(); 
    long end = System.currentTimeMillis();
    System.out.println((end - start) / 1000f + " seconds");
}
}

@rozario Each write call should only produce about 175 MB and then delete itself. If not, you'd end up with 175 MB x 4 different write calls x 5 iterations = 3.5 GB of data. You could check the return value of file.delete() and throw an exception if it's false.
David Moles

Note that writer.flush() is not necessary here, because writer.close() flushes the stream implicitly. BTW: best practice recommends using try-with-resources instead of calling close() explicitly.
patryk.beza (2015)
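
For illustration, a minimal try-with-resources version of the buffered write (the file name is just the placeholder from the question); both writers are flushed and closed automatically, even if an exception is thrown:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class TryWithResourcesWrite {
    public static void main(String[] args) throws IOException {
        // Resources are closed (and thereby flushed) automatically,
        // in reverse declaration order, when the try block exits.
        try (FileWriter fw = new FileWriter("fileName.csv");
             BufferedWriter bw = new BufferedWriter(fw)) {
            bw.write("a,b,c\n");
        }
    }
}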

FWIW, this was written for Java 5, which at least wasn't documented to flush on close, and which didn't have try-with-resources. It could probably use updating.
David Moles

I just checked the Java 1.1 documentation: Writer.close() says "Close the stream, flushing it first." So calling flush() before close() has never been needed. By the way, one of the reasons a BufferedWriter may be pointless is that FileWriter, a specialization of OutputStreamWriter, has to have its own buffering anyway for when it converts char sequences to byte sequences in the target encoding. Having more buffers at the front end doesn't help when the charset encoder has to flush its smaller byte buffer at a higher rate.
Holger

Indeed, but the practical implications of the extra buffering, and how to decide whether to use it, are not well covered in the documentation or the tutorials (as far as I know). Note that the NIO API doesn't even have Buffered… counterparts for the channel types at all.
Holger
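
As a hedged sketch of the more modern API alluded to above: java.nio.file.Files.newBufferedWriter already returns a buffered, charset-aware writer, so no extra BufferedWriter wrapping is needed (the path and charset here are placeholders, not from the original answer):

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class NioBufferedWrite {
    public static void main(String[] args) throws IOException {
        Path out = Paths.get("fileName.csv"); // placeholder path
        // Files.newBufferedWriter returns an already-buffered writer
        // with an explicit charset, closed automatically here.
        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            writer.write("Help I am trapped in a fortune cookie factory\n");
        }
    }
}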


Try memory-mapped files (takes about 300 ms to write 174 MB on my machine, Core 2 Duo, 2.5 GB RAM):

byte[] buffer = "Help I am trapped in a fortune cookie factory\n".getBytes();
int number_of_lines = 400000;

FileChannel rwChannel = new RandomAccessFile("textfile.txt", "rw").getChannel();
// Map one region big enough for all the lines; the long cast avoids int overflow for larger inputs.
ByteBuffer wrBuf = rwChannel.map(FileChannel.MapMode.READ_WRITE, 0, (long) buffer.length * number_of_lines);
for (int i = 0; i < number_of_lines; i++)
{
    wrBuf.put(buffer);
}
rwChannel.close();

What does aMessage.length() represent when instantiating the ByteBuffer?
Hotel

Just FYI, on a MacBook Pro (Late 2013), 2.6 GHz Core i7 with an Apple 1 TB SSD, it takes around 140 ms

@JerylCook Mapped memory is useful when you know the exact size. Here we reserve buffer-length * number-of-lines worth of space in advance.
Deepak Agarwal

Thanks! Can I use it for files over 2 GB? MappedByteBuffer map(MapMode var1, long var2, long var4): if (var4 > 2147483647L) { throw new IllegalArgumentException("Size exceeds Integer.MAX_VALUE") }
Mikhail Ionkin
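
A single MappedByteBuffer is indeed capped at Integer.MAX_VALUE bytes, but a larger file can be mapped and written in windows. A minimal sketch of that idea (the file name, total size, and chunk size are arbitrary, not from the original answer):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class ChunkedMappedWrite {
    public static void main(String[] args) throws Exception {
        byte[] record = "Help I am trapped in a fortune cookie factory\n".getBytes();
        // ~3 GB total, rounded down to a whole number of records
        long totalSize = 3L * 1024 * 1024 * 1024 / record.length * record.length;
        // map 512 MB windows, comfortably under Integer.MAX_VALUE
        long chunkSize = 512L * 1024 * 1024;

        try (RandomAccessFile raf = new RandomAccessFile("bigfile.txt", "rw");
             FileChannel channel = raf.getChannel()) {
            long position = 0;
            while (position < totalSize) {
                // Round each window down to a whole number of records so none is split.
                long window = Math.min(chunkSize / record.length * record.length,
                                       totalSize - position);
                MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, position, window);
                while (buf.remaining() >= record.length) {
                    buf.put(record);
                }
                position += window;
            }
        }
    }
}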

105 ms on a Dell Core i5 (1.6, 2.3) GHz
FSm


Just for the sake of statistics:

The machine is an old Dell with a new SSD

CPU: Intel Pentium D 2.8 GHz

SSD: Patriot Inferno 120 GB SSD

4000000 'records'
175.47607421875 MB

Iteration 0
Writing raw... 3.547 seconds
Writing buffered (buffer size: 8192)... 2.625 seconds
Writing buffered (buffer size: 1048576)... 2.203 seconds
Writing buffered (buffer size: 4194304)... 2.312 seconds

Iteration 1
Writing raw... 2.922 seconds
Writing buffered (buffer size: 8192)... 2.406 seconds
Writing buffered (buffer size: 1048576)... 2.015 seconds
Writing buffered (buffer size: 4194304)... 2.282 seconds

Iteration 2
Writing raw... 2.828 seconds
Writing buffered (buffer size: 8192)... 2.109 seconds
Writing buffered (buffer size: 1048576)... 2.078 seconds
Writing buffered (buffer size: 4194304)... 2.015 seconds

Iteration 3
Writing raw... 3.187 seconds
Writing buffered (buffer size: 8192)... 2.109 seconds
Writing buffered (buffer size: 1048576)... 2.094 seconds
Writing buffered (buffer size: 4194304)... 2.031 seconds

Iteration 4
Writing raw... 3.093 seconds
Writing buffered (buffer size: 8192)... 2.141 seconds
Writing buffered (buffer size: 1048576)... 2.063 seconds
Writing buffered (buffer size: 4194304)... 2.016 seconds

As we can see, the raw method is slower than the buffered ones.


However, the buffered method becomes slower as the text size gets bigger.
FSm


Java is probably not limiting your transfer speed. Instead, I would suspect (in no particular order):

  1. the speed of transfer from the database
  2. the speed of transfer to the disk

If you read the complete dataset and then write it out to disk, that will take longer, since the JVM has to allocate the memory, and the db read / disk write will happen sequentially. Instead, I would write out to the buffered writer for every read you make from the db, so the operation will be closer to a concurrent one (I don't know if you're doing that or not); a sketch of this pattern follows below.
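
A minimal sketch of that row-at-a-time pattern, assuming a hypothetical my_table with two text columns and an H2 in-memory database URL as a stand-in for the real connection details:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamResultSetToCsv {
    public static void main(String[] args) throws Exception {
        // Placeholder connection URL and query; adapt to your database.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:test");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT col1, col2 FROM my_table");
             BufferedWriter writer = new BufferedWriter(new FileWriter("fileName.csv"))) {
            // Write each row as soon as it is fetched instead of collecting
            // everything in memory first, so db reads and disk writes overlap.
            while (rs.next()) {
                writer.write(rs.getString(1));
                writer.write(',');
                writer.write(rs.getString(2));
                writer.write('\n');
            }
        }
    }
}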




package all.is.well;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import junit.framework.TestCase;

/**
 * @author Naresh Bhabat
 * 
The following implementation helps to deal with extra-large files in Java.
This program has been tested with a 2GB input file.
There are some points where extra logic can be added in the future.


Please note: if we want to deal with a binary input file, then instead of reading lines we need to read bytes from the file object.



It uses RandomAccessFile, which is almost like a streaming API.


 * ****************************************
Notes regarding the executor framework and its measurements.
Please note: ExecutorService executor = Executors.newFixedThreadPool(10);

 *         For 10 threads: total time required for reading and writing the text:
 *         349.317 seconds
 * 
 *         For 100: total time required for reading and writing the text: 464.042 seconds
 * 
 *         For 1000: total time required for reading and writing the text: 466.538 seconds
 *         For 10000: total time required for reading and writing: 479.701 seconds
 *
 * 
 */
public class DealWithHugeRecordsinFile extends TestCase {

	static final String FILEPATH = "C:\\springbatch\\bigfile1.txt.txt";
	static final String FILEPATH_WRITE = "C:\\springbatch\\writinghere.txt";
	static volatile RandomAccessFile fileToWrite;
	static volatile RandomAccessFile file;
	static volatile String fileContentsIter;
	static volatile int position = 0;

	public static void main(String[] args) throws IOException, InterruptedException {
		long currentTimeMillis = System.currentTimeMillis();

		try {
			fileToWrite = new RandomAccessFile(FILEPATH_WRITE, "rw");//for random write,independent of thread obstacles 
			file = new RandomAccessFile(FILEPATH, "r");//for random read,independent of thread obstacles 
			seriouslyReadProcessAndWriteAsynch();

		} catch (IOException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}
		Thread currentThread = Thread.currentThread();
		System.out.println(currentThread.getName());
		long currentTimeMillis2 = System.currentTimeMillis();
		double time_seconds = (currentTimeMillis2 - currentTimeMillis) / 1000.0;
		System.out.println("Total time required for reading the text in seconds " + time_seconds);

	}

	/**
	 * @throws IOException
	 * Something  asynchronously serious
	 */
	public static void seriouslyReadProcessAndWriteAsynch() throws IOException, InterruptedException {
		ExecutorService executor = Executors.newFixedThreadPool(10);//pls see for explanation in comments section of the class
		while (true) {
			String readLine = file.readLine();
			if (readLine == null) {
				break;
			}
			Runnable genuineWorker = new Runnable() {
				@Override
				public void run() {
					// do hard processing here in this thread,i have consumed
					// some time and eat some exception in write method.
					writeToFile(FILEPATH_WRITE, readLine);
					// System.out.println(" :" +
					// Thread.currentThread().getName());

				}
			};
			executor.execute(genuineWorker);
		}
		executor.shutdown();
		// Block until all submitted tasks finish instead of busy-waiting on isTerminated().
		executor.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
		System.out.println("Finished all threads");
		file.close();
		fileToWrite.close();
	}

	/**
	 * @param filePath
	 * @param data
	 */
	private static void writeToFile(String filePath, String data) {
		try {
			// fileToWrite.seek(position);
			data = "\n" + data;
			if (!data.contains("Randomization")) {
				return;
			}
			System.out.println("Let us do something time consuming to make this thread busy"+(position++) + "   :" + data);
			System.out.println("Lets consume through this loop");
			int i=1000;
			while(i>0){
			
				i--;
			}
			fileToWrite.write(data.getBytes());
			throw new Exception();
		} catch (Exception exception) {
			System.out.println("exception was thrown but still we are able to proceeed further"
					+ " \n This can be used for marking failure of the records");
			//exception.printStackTrace();

		}

	}
}


Please add some text explaining why this answer is better than the other answers. It is not sufficient to have comments in the code.
Benjamin Lowry

The reason it may be better: it is a real-world scenario and it is in a working state. Its other benefits: it handles reading, processing and writing asynchronously... It uses the efficient Java API RandomAccessFile, which is thread-safe, so multiple threads can read and write on it simultaneously. It doesn't cause memory overhead at runtime, nor does it crash the system... It is a multipurpose solution for dealing with record-processing failures, which can be tracked in the respective threads. Please let me know if I can help more.
RAM

Thanks, that's the information your post needed. Maybe consider adding it to the post body :)
Benjamin Lowry

If it takes 349.317 seconds to write 2 GB of data with 10 threads, it may well be the slowest way to write massive data (unless you meant milliseconds).
Deepak Agarwal


For those who want to improve the time for retrieval of records and dumping into the file (i.e. with no processing on the records): instead of putting them into an ArrayList, append those records to a StringBuffer. Apply the toString() function to get a single String and write it to the file all at once.

For me, retrieval time was reduced from 22 seconds to 17 seconds.
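
A minimal sketch of that approach, assuming the same dummy record as the benchmark above (StringBuilder, the unsynchronized equivalent of StringBuffer, should be at least as fast for single-threaded use):

import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

public class SingleWriteDemo {
    public static void main(String[] args) throws IOException {
        String record = "Help I am trapped in a fortune cookie factory\n";
        int recordCount = 400000;

        // Accumulate all records in one pre-sized builder...
        StringBuilder sb = new StringBuilder(record.length() * recordCount);
        for (int i = 0; i < recordCount; i++) {
            sb.append(record);
        }

        // ...then hand the whole String to the writer in a single call.
        try (Writer writer = new FileWriter("fileName.csv")) {
            writer.write(sb.toString());
        }
    }
}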


That was just an example of creating some fake "records" -- I'd assume in the real world the records come from somewhere else (a database, in the OP's case). But yes, if you need to read everything into memory first, a StringBuffer may well be faster. A raw String array (String[]) might also be faster.
David Moles