Given a set of files, I wanted to see how Python and Java would perform in both single- and multi- threaded environments. As a simple task, I chose to just count up the number of bytes in a given file by manually iterating over the bytes. Essentially–an intentionally non-optimal method of calculating the file size. Java is usually faster than Python, but I was surprised to see that for this task, Python significantly faster.My test for this was to read approximately 185MB worth of data spread across 18 files on my 2012 MacBook Pro Intel i7 (2.9GHz). Both programs performed approximately 40% better when utilizing multiple threads and Python was overall about 70% faster.
In Java, we can use a SimpleFileVisitor
to walk the directory tree, and ask an ExecutorService
to execute a list of Callable
s.
import java.util.concurrent.*; import java.util.List; import java.util.ArrayList; import java.io.IOException; import java.io.File; import java.io.FileReader; import java.io.FileInputStream; import java.nio.file.*; import java.nio.file.attribute.BasicFileAttributes; public class FileProc { public static void main(String[] args) throws Exception { List<FileProcessor> todo = new ArrayList<>(); Path rootDir = Paths.get("/Users/phillip/Dropbox/eBooks"); Files.walkFileTree(rootDir, new SimpleFileVisitor<Path>() { @Override public FileVisitResult visitFile(Path path, BasicFileAttributes attr) throws IOException { String location = path.toString(); todo.add(new FileProcessor(new File(location))); return FileVisitResult.CONTINUE; } }); //Choose one: //ExecutorService executor = Executors.newSingleThreadExecutor(); //ExecutorService executor = Executors.newFixedThreadPool(8); List<Future<FileDetails>> futures = executor.invokeAll(todo); for(Future<FileDetails> future : futures) { System.out.println(future.get()); } executor.shutdown(); } } class FileProcessor implements Callable<FileDetails> { private File file; public FileProcessor(File file){ this.file = file; } public FileDetails call() throws IOException { String fileName = file.getName(); Path path = Paths.get(file.getPath()); int fileSize = getSizeManually(); return new FileDetails(fileName, fileSize); } private int getSizeManually() throws IOException { int sum = 0; try(FileInputStream fis = new FileInputStream(file)) { while(fis.read() != -1){ sum++; } } return sum; } } class FileDetails { private String fileName; private int fileSize; public FileDetails(String fileName, int fileSize) { this.fileName = fileName; this.fileSize = fileSize; } @Override public String toString(){ return fileName + " .... " + fileSize + " bytes"; } }
In Python, the code is much more concise. We use os.walk
to walk the directory tree and map a function to a (thread) Pool
:
from multiprocessing import Pool import os def main(): with Pool(processes=8) as pool: print(pool.map(process_file, get_files())) def get_files(): files = [] root = "/Users/phillip/Dropbox/eBooks" for (dirpath, dirnames, filenames) in os.walk(root): for f in filenames: files.append(dirpath + "/" + f) return files def process_file(file): with open(file,'rb') as to_process: sum = 0 byte = to_process.read(1) while byte: byte = to_process.read(1) sum += 1 return {"fileName":to_process.name,"fileSize":sum} if __name__=="__main__": main()
I don’t know much about the Python internals, so let me know if you have any insight as to why it was so much faster than the Java code!
I’m worried your java code isn’t really doing a whole lot in parallel, you should to make everything static and use the stream interface for collections it’s a lot faster than 1.7’s fork/join.
also the os you are using should have file size cached. But if your using a rotational hard drive, reads are going to be moving only one head over the disk platter anyways, so creating all the threads just to get file size by reading files one byte at a time after forking a bunch of threads which is time consuming.
After the invokeAll call, you might try adding a isReady flag, so if the item has finished it’s read operation, you can print it immediately.
i’ve had good luck with the http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html
or reading using a character buffer run on the worker size 255 and then run the workers in a parallel stream. you can have them return their results to a buffer via the ConcurrentHashMap class and a parellelstream
that is a good way to read files for parsing. for just finding the length, there’s gotta be a better way
don’t no shit bout python doe
Thanks for the feedback! Hopefully no one reading this thinks this is the correct way to get the size of a file, that definitely wasn’t my intention. As you point out, there are much better ways to do that. Rather, the point was to use “an intentionally non-optimal method” that can easily be replicated in both Java and Python for bench-marking purposes only.
Traceback (most recent call last):
File “readpara.py”, line 28, in
main()
File “readpara.py”, line 5, in main
with Pool(processes=8) as pool:
AttributeError: __exit__
It ends with an error code…
I just checked and it works for me on Python 3.5.2. What version are you using?