Parallel File Reading: Python vs Java

Given a set of files, I wanted to see how Python and Java would perform in both single- and multi-threaded environments. As a simple task, I chose to count the number of bytes in each file by manually iterating over the bytes: essentially, an intentionally non-optimal method of calculating the file size. Java is usually faster than Python, but I was surprised to see that for this task, Python was significantly faster. My test was to read approximately 185MB of data spread across 18 files on my 2012 MacBook Pro Intel i7 (2.9GHz). Both programs performed approximately 40% better when utilizing multiple threads, and Python was overall about 70% faster.

In Java, we can use a SimpleFileVisitor to walk the directory tree, and ask an ExecutorService to execute a list of Callables.

import java.util.concurrent.*;
import java.util.List;
import java.util.ArrayList;
import java.io.IOException;
import java.io.File;
import java.io.FileReader;
import java.io.FileInputStream;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class FileProc {
	
	public static void main(String[] args) throws Exception {
		List<FileProcessor> todo = new ArrayList<>();
		Path rootDir = Paths.get("/Users/phillip/Dropbox/eBooks");
		
		Files.walkFileTree(rootDir, new SimpleFileVisitor<Path>() {
			@Override
			public FileVisitResult visitFile(Path path, BasicFileAttributes attr) throws IOException {
				todo.add(new FileProcessor(path.toFile()));
				return FileVisitResult.CONTINUE;
			}
		});

		// Choose one (exactly one of these lines must be active for the code to compile):
		// ExecutorService executor = Executors.newSingleThreadExecutor();
		ExecutorService executor = Executors.newFixedThreadPool(8);
		List<Future<FileDetails>> futures = executor.invokeAll(todo);
		for(Future<FileDetails> future : futures) {
			System.out.println(future.get());
		}
		executor.shutdown();
	}
}

class FileProcessor implements Callable<FileDetails> {
	private File file;

	public FileProcessor(File file){
		this.file = file;
	}

	public FileDetails call() throws IOException {
		String fileName = file.getName();
		int fileSize = getSizeManually();
		return new FileDetails(fileName, fileSize);
	}

	private int getSizeManually() throws IOException {
		int sum = 0;
		try(FileInputStream fis = new FileInputStream(file)) {
			while(fis.read() != -1){
				sum++;
			}
		}

		return sum;
	}
}

class FileDetails {
	private String fileName;
	private int fileSize;

	public FileDetails(String fileName, int fileSize) {
		this.fileName = fileName;
		this.fileSize = fileSize;
	}

	@Override
	public String toString(){
		return fileName + " .... " + fileSize + " bytes";
	}
}

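In Python, the code is much more concise. We use os.walk to walk the directory tree and map a function over a multiprocessing Pool. Note that, unlike the Java thread pool, this is a pool of worker processes rather than threads: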

from multiprocessing import Pool
import os

def main():
    with Pool(processes=8) as pool:
        print(pool.map(process_file, get_files()))

def get_files():
    files = []
    root = "/Users/phillip/Dropbox/eBooks"
    for dirpath, dirnames, filenames in os.walk(root):
        for f in filenames:
            files.append(os.path.join(dirpath, f))

    return files

def process_file(file):
    # Count bytes one at a time (deliberately slow, for benchmarking only).
    size = 0  # avoids shadowing the built-in sum()
    with open(file, 'rb') as to_process:
        while to_process.read(1):
            size += 1

    return {"fileName": file, "fileSize": size}

if __name__=="__main__":
    main()
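
The timings above are easiest to reproduce with a small harness. Here is a minimal sketch, not the script used for the numbers in this post: the helper names (count_bytes, time_serial, time_pool, make_sample_files) and the throwaway sample files are made up for illustration, and the worker count of 8 simply mirrors the code above.

```python
import os
import tempfile
import time
from multiprocessing import Pool


def count_bytes(path):
    """Count bytes one at a time, like process_file above."""
    total = 0
    with open(path, "rb") as handle:
        while handle.read(1):
            total += 1
    return total


def time_serial(paths):
    """Time a plain loop in a single process; returns (seconds, sizes)."""
    start = time.perf_counter()
    sizes = [count_bytes(p) for p in paths]
    return time.perf_counter() - start, sizes


def time_pool(paths, workers=8):
    """Time the same work on a pool of worker processes (Python 3.3+)."""
    start = time.perf_counter()
    with Pool(processes=workers) as pool:
        sizes = pool.map(count_bytes, paths)
    return time.perf_counter() - start, sizes


def make_sample_files(directory, count=4, size=100_000):
    """Write hypothetical sample files so the sketch runs anywhere."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, "sample_%d.bin" % i)
        with open(path, "wb") as handle:
            handle.write(os.urandom(size))
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = make_sample_files(tmp)
        serial_seconds, serial_sizes = time_serial(paths)
        pool_seconds, pool_sizes = time_pool(paths)
        assert serial_sizes == pool_sizes
        print("serial: %.3fs" % serial_seconds)
        print("pool:   %.3fs" % pool_seconds)
```

Passing the list returned by get_files() instead of the sample files gives timings for the real directory. It is probably worth running each variant more than once, since a second pass over the same files is often served from the operating system's cache.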

I don’t know much about the Python internals, so let me know if you have any insight as to why it was so much faster than the Java code!
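
One difference that may be worth ruling out: multiprocessing.Pool runs its workers in separate processes, while the Java version uses threads. For a closer like-for-like comparison, here is a sketch that swaps in multiprocessing.pool.ThreadPool from the standard library. This variant was not part of the benchmark above, and count_bytes and count_all are illustrative names. Threads in standard CPython share the global interpreter lock, so the numbers may well differ from the process-based version.

```python
import os
import tempfile
from multiprocessing.pool import ThreadPool


def count_bytes(path):
    """Count bytes one at a time, mirroring process_file above."""
    total = 0
    with open(path, "rb") as handle:
        while handle.read(1):
            total += 1
    return {"fileName": path, "fileSize": total}


def count_all(paths, workers=8):
    """Map count_bytes over a pool of threads instead of processes."""
    with ThreadPool(processes=workers) as pool:
        return pool.map(count_bytes, paths)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.bin")  # hypothetical sample file
        with open(path, "wb") as handle:
            handle.write(b"x" * 2048)
        print(count_all([path]))
```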

Tagged on: concurrency, file, io, java, parallel, performance, python

4 thoughts on “Parallel File Reading: Python vs Java”

samuel french May 20, 2015 at 12:52 am

I’m worried your Java code isn’t really doing a whole lot in parallel. You should make everything static and use the stream interface for collections; it’s a lot faster than 1.7’s fork/join.

Also, the OS you are using should have the file size cached. But if you’re using a rotational hard drive, reads are going to be moving only one head over the disk platter anyway, so forking a bunch of threads, which is itself time-consuming, just to get the file size by reading files one byte at a time won’t buy much.

After the invokeAll call, you might try adding an isReady flag, so if the item has finished its read operation, you can print it immediately.

I’ve had good luck with http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html

or reading with a character buffer of size 255 on the worker, and then running the workers in a parallel stream. You can have them return their results to a buffer via the ConcurrentHashMap class and a parallelStream.

That is a good way to read files for parsing. For just finding the length, there’s gotta be a better way.

don’t no shit bout python doe

1. Phillip May 20, 2015 at 10:41 am
  
  Thanks for the feedback! Hopefully no one reading this thinks this is the correct way to get the size of a file; that definitely wasn’t my intention. As you point out, there are much better ways to do that. Rather, the point was to use “an intentionally non-optimal method” that can easily be replicated in both Java and Python for benchmarking purposes only.
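
  For reference, the direct route asks the file system for the size instead of reading the contents. A minimal sketch using os.path.getsize from the standard library; file_size is an illustrative name and the temporary file is just a stand-in:

```python
import os
import tempfile


def file_size(path):
    """Return the size in bytes as reported by the file system."""
    return os.path.getsize(path)


if __name__ == "__main__":
    # Hypothetical sample file, created only so the sketch is runnable.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(b"hello world")
        sample = handle.name
    try:
        print(file_size(sample))
    finally:
        os.remove(sample)
```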
  
testou February 24, 2017 at 8:23 am

Traceback (most recent call last):
  File "readpara.py", line 28, in <module>
    main()
  File "readpara.py", line 5, in main
    with Pool(processes=8) as pool:
AttributeError: __exit__

It ends with an error code…

1. Phillip Johnson (post author) February 24, 2017 at 9:15 am
  
  I just checked and it works for me on Python 3.5.2. What version are you using?
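
  For context: Pool only supports the with statement from Python 3.3 onward, and on older interpreters the with line fails with an AttributeError: __exit__ like the one above. Here is a sketch of a variant that avoids the with block by closing the pool explicitly. The names square and run are placeholders, not part of the original script, and the pool_factory parameter is an assumption added only so a different pool class can be swapped in.

```python
from multiprocessing import Pool


def square(n):
    """Placeholder work function standing in for process_file."""
    return n * n


def run(values, workers=8, pool_factory=Pool):
    """Map square over a pool without relying on the with statement."""
    pool = pool_factory(processes=workers)
    try:
        return pool.map(square, values)
    finally:
        pool.close()  # stop accepting new work
        pool.join()   # wait for the workers to exit


if __name__ == "__main__":
    print(run([1, 2, 3]))
```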
  
