findimagedupes should be parallelizable

Bug #502224 reported by gwern
This bug affects 1 person
Affects: findimagedupes (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Binary package hint: findimagedupes

An excellent feature for findimagedupes would be hashing/analyzing multiple images at once, in parallel. Each image can be analyzed independently, and file I/O makes up a minuscule share of the runtime, so the problem is embarrassingly parallel. Near-linear speedups should be perfectly possible.
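As a rough illustration of the idea (not findimagedupes' actual code, which is written in Perl), here is a minimal Python sketch of fingerprinting files with a worker pool. The names `fingerprint` and `fingerprint_all` are hypothetical, and the SHA-256 digest is only a placeholder for the real perceptual hash; the point is just that each file is processed independently and a single parent process collects the results.

```python
import hashlib
from concurrent.futures import ProcessPoolExecutor


def fingerprint(path):
    """Placeholder fingerprint: SHA-256 of the file's bytes.

    This is NOT the perceptual hash findimagedupes computes; it only
    stands in for an expensive, independent per-image computation.
    """
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def fingerprint_all(paths, workers=4, executor_cls=ProcessPoolExecutor):
    """Fingerprint every path using a pool of workers.

    Each file is handled independently, so the pool can spread the
    work over several cores. Only the calling process collects results.
    """
    paths = list(paths)
    with executor_cls(max_workers=workers) as pool:
        # map() yields results in input order, regardless of which
        # worker finishes first, so zip() pairs them up correctly.
        return dict(zip(paths, pool.map(fingerprint, paths)))
```

When run as a script, the call to `fingerprint_all` would normally sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Because only the parent builds the final dictionary, no two workers ever write to shared state.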

And the benefits are real: on large collections, the runtime can be many minutes or hours. I have 4 cores which are generally not doing much; why can't they all be used to cut the runtime by half or more?

I looked into running 4 findimagedupes processes concurrently and then using --merge to bring together their results, but this is deeply hacky and I worry about race conditions and data consistency in the ultimate fingerprint database; parallelism is something the application should be handling internally.
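To make the consistency concern concrete, here is a small Python sketch of the partition-then-merge pattern, under the assumption that each worker produces its own private `{path: fingerprint}` mapping and a single process combines them afterwards. The helpers `split_round_robin` and `merge_shards` are hypothetical names, not part of findimagedupes, and this does not model the tool's real database format.

```python
def split_round_robin(paths, shards):
    """Deal paths into `shards` disjoint lists so no file is hashed twice."""
    paths = list(paths)
    return [paths[i::shards] for i in range(shards)]


def merge_shards(shard_results):
    """Combine per-shard {path: fingerprint} dicts in one process.

    Raises ValueError if two shards disagree about the same path, which
    would indicate overlapping or inconsistent input.
    """
    merged = {}
    for shard in shard_results:
        for path, fp in shard.items():
            if path in merged and merged[path] != fp:
                raise ValueError("conflicting fingerprints for %s" % path)
            merged[path] = fp
    return merged
```

Since the shards are disjoint and the merge runs in one process, there is no concurrent writer to race against; the conflict check only guards against a file list that was accidentally split with overlap.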

gwern (gwern0) wrote :

It's possible that this has been fixed as of 2.18-3: I regularly see findimagedupes using 200-300% CPU in top, i.e. 2 or 3 of my 4 cores.

Jonathan H N Chin (jhnc) wrote :

Sorry, I just saw this as I don't monitor bug trackers.
I'm the author.

This is a good idea but the code would need to be refactored.
I'll have a think.

-jonathan
