Parallel File Reading: Python vs Java

Given a set of files, I wanted to see how Python and Java would perform in both single- and multi-threaded environments. As a simple task, I chose to count the number of bytes in each file by manually iterating over them one byte at a time: essentially, an intentionally non-optimal method of calculating the file size. Java is usually faster than Python, but I was surprised to see that for this task, Python was significantly faster.

[Figure: Java vs Python file IO performance]

My test was to read approximately 185MB of data spread across 18 files on my 2012 MacBook Pro (Intel i7, 2.9GHz). Both programs performed approximately 40% better when utilizing multiple threads, and Python was about 70% faster overall.

In Java, we can use a SimpleFileVisitor to walk the directory tree, and ask an ExecutorService to execute a list of Callables.

import java.util.concurrent.*;
import java.util.List;
import java.util.ArrayList;
import java.io.IOException;
import java.io.File;
import java.io.FileReader;
import java.io.FileInputStream;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class FileProc {
	
	public static void main(String[] args) throws Exception {
		List<FileProcessor> todo = new ArrayList<>();
		Path rootDir = Paths.get("/Users/phillip/Dropbox/eBooks");
		
		Files.walkFileTree(rootDir, new SimpleFileVisitor<Path>() {
			@Override
			public FileVisitResult visitFile(Path path, BasicFileAttributes attr) throws IOException {
				// Queue one task per regular file found under rootDir.
				todo.add(new FileProcessor(path.toFile()));
				return FileVisitResult.CONTINUE;
			}
		});

		// Choose one executor; swap the comment markers for the single-threaded run.
		//ExecutorService executor = Executors.newSingleThreadExecutor();
		ExecutorService executor = Executors.newFixedThreadPool(8);
		List<Future<FileDetails>> futures = executor.invokeAll(todo);
		for(Future<FileDetails> future : futures) {
			System.out.println(future.get());
		}
		executor.shutdown();
	}
}

class FileProcessor implements Callable<FileDetails> {
	private File file;

	public FileProcessor(File file){
		this.file = file;
	}

	@Override
	public FileDetails call() throws IOException {
		// Runs on a worker thread: count the bytes, then package the result.
		String fileName = file.getName();
		int fileSize = getSizeManually();
		return new FileDetails(fileName, fileSize);
	}

	private int getSizeManually() throws IOException {
		// Deliberately slow: read() returns a single byte per call, or -1 at end of file.
		int sum = 0;
		try(FileInputStream fis = new FileInputStream(file)) {
			while(fis.read() != -1){
				sum++;
			}
		}

		return sum;
	}
}

class FileDetails {
	private String fileName;
	private int fileSize;

	public FileDetails(String fileName, int fileSize) {
		this.fileName = fileName;
		this.fileSize = fileSize;
	}

	@Override
	public String toString(){
		return fileName + " .... " + fileSize + " bytes";
	}
}

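In Python, the code is much more concise. We use os.walk to walk the directory tree and map a function over a multiprocessing Pool. Note that Pool runs worker processes rather than threads, and using it in a with statement requires Python 3.3 or later: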

from multiprocessing import Pool
import os

def main():
    with Pool(processes=8) as pool:
        print(pool.map(process_file, get_files()))

def get_files():
    files = []
    root = "/Users/phillip/Dropbox/eBooks"
    for (dirpath, dirnames, filenames) in os.walk(root):
        for f in filenames:
            files.append(os.path.join(dirpath, f))

    return files

def process_file(file):
    # Count bytes one at a time; deliberately slow, for benchmarking only.
    count = 0
    with open(file, 'rb') as to_process:
        while to_process.read(1):
            count += 1

    return {"fileName": file, "fileSize": count}

if __name__=="__main__":
    main()
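
Since multiprocessing.Pool starts separate worker processes, the Python run above is not quite the same experiment as the Java thread pool. For a closer comparison, here is a minimal sketch of a thread-based variant using multiprocessing.pool.ThreadPool, which offers the same map call. The names count_bytes, count_all and demo are mine, and the scratch files are made up for the demonstration; I have not benchmarked this version.

```python
import os
import tempfile
from multiprocessing.pool import ThreadPool


def count_bytes(path):
    """Count bytes one at a time, mirroring process_file above."""
    count = 0
    with open(path, "rb") as handle:
        while handle.read(1):
            count += 1
    return {"fileName": path, "fileSize": count}


def count_all(paths, workers=8):
    # ThreadPool has the same map() API as Pool, but its workers are
    # threads inside a single process, so they share one interpreter.
    with ThreadPool(processes=workers) as pool:
        return pool.map(count_bytes, paths)


def demo():
    # Hypothetical scratch files, created only for this demonstration.
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i, size in enumerate([0, 10, 1000]):
            path = os.path.join(tmp, "file%d.bin" % i)
            with open(path, "wb") as out:
                out.write(b"x" * size)
            paths.append(path)
        # map() returns results in the same order as the input paths.
        return [result["fileSize"] for result in count_all(paths)]


if __name__ == "__main__":
    print(demo())
```

Because the counting loop is pure Python bytecode, I would expect the threads here to contend for the interpreter lock, so this variant may well scale worse than the process-based one.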

I don’t know much about the Python internals, so let me know if you have any insight as to why it was so much faster than the Java code!
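
One guess, which I have not measured on the setup above: Python's open(file, 'rb') hands back a buffered reader by default, so most read(1) calls are served from memory, whereas Java's FileInputStream.read() is unbuffered and asks the operating system for every single byte. The sketch below only illustrates the Python side of that idea by opening the same file with and without buffering. The function names and the scratch file are assumptions made for the demonstration.

```python
import os
import tempfile
import time


def open_kinds(path):
    """Return the class names open() yields with and without buffering."""
    with open(path, "rb") as buffered, open(path, "rb", buffering=0) as raw:
        return type(buffered).__name__, type(raw).__name__


def timed_count(path, buffering):
    """Count bytes with read(1); return (byte_count, elapsed_seconds)."""
    start = time.perf_counter()
    count = 0
    with open(path, "rb", buffering=buffering) as handle:
        while handle.read(1):
            count += 1
    return count, time.perf_counter() - start


def demo_buffering(size=100000):
    # Hypothetical scratch file, created only for this demonstration.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(b"x" * size)
        kinds = open_kinds(path)
        buffered = timed_count(path, buffering=-1)  # -1 is the default policy
        raw = timed_count(path, buffering=0)        # one OS-level read per byte
        return kinds, buffered, raw
    finally:
        os.remove(path)


if __name__ == "__main__":
    kinds, buffered, raw = demo_buffering()
    print(kinds)
    print("buffered: %d bytes in %.3fs" % buffered)
    print("raw:      %d bytes in %.3fs" % raw)
```

If this guess is right, wrapping the Java FileInputStream in a BufferedInputStream would be the fairer comparison, since it gives the Java loop the same kind of in-memory buffer.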

4 thoughts on “Parallel File Reading: Python vs Java”

  1. samuel french

    I’m worried your Java code isn’t really doing a whole lot in parallel. You should make everything static and use the stream interface for collections; it’s a lot faster than 1.7’s fork/join.

    Also, the OS you are using should have the file size cached. But if you’re using a rotational hard drive, reads are going to be moving only one head over the disk platter anyway, so forking a bunch of threads just to get the file size by reading files one byte at a time is time-consuming.

    After the invokeAll call, you might try adding an isReady flag, so if an item has finished its read operation, you can print it immediately.

    I’ve had good luck with the http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html

    Or read using a character buffer of size 255 on each worker, and then run the workers in a parallel stream. You can have them return their results to a buffer via the ConcurrentHashMap class and a parallel stream.

    That is a good way to read files for parsing. For just finding the length, there’s gotta be a better way.

    Don’t know shit about Python though.

    1. Phillip

      Thanks for the feedback! Hopefully no one reading this thinks this is the correct way to get the size of a file; that definitely wasn’t my intention. As you point out, there are much better ways to do that. Rather, the point was to use “an intentionally non-optimal method” that can easily be replicated in both Java and Python for benchmarking purposes only.

  2. testou

    Traceback (most recent call last):
      File "readpara.py", line 28, in <module>
        main()
      File "readpara.py", line 5, in main
        with Pool(processes=8) as pool:
    AttributeError: __exit__

    It ends with an error code…
