Double File Scanner

deathcubek
Posts: 221
Joined: Thu Jul 14, 2011 9:42 am
Location: Island of Lost Minds

Double File Scanner

#1 Post by deathcubek »

This tool quickly detects duplicate files on your hard drive.

Image

The purpose of this tool is scanning the selected directory or directories for
duplicate files, i.e. files with identical content. Duplicate files are
identified by first calculating the SHA-1 digest of each file and then looking
for values that appear more than once. In particular, files with identical
content are guaranteed to have the same SHA-1 digest, while files with
differing content will have different SHA-1 values with very high certainty.

All computed SHA-1 values are stored in a hash table, so collisions are found
quickly and we do NOT need to compare every digest to every other one. Also,
the files are processed concurrently in multiple "worker" threads in order to
parallelize and speed up the SHA-1 computations on multi-core processors. On
our test machine it took ~15 minutes to analyse all the ~260,000 files on the
system drive (~63.5 GB). During this operation ~44,000 duplicates were found.

The list of identified duplicates can be exported to the XML and INI formats.
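
For readers who want to see the idea in code, here is a minimal Python sketch of the approach described above: hash every file with SHA-1 in a few worker threads, collect the digests in a hash table, and report the digests that occur more than once. This is not the program's actual source; the names sha1_of_file and find_duplicates and the default of four workers are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths, workers=4):
    """Group paths by digest and keep only digests seen more than once."""
    paths = list(paths)
    groups = defaultdict(list)  # the "hash table": digest -> list of paths
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields digests in the same order as the input paths.
        for path, digest in zip(paths, pool.map(sha1_of_file, paths)):
            groups[digest].append(path)
    return {d: sorted(p) for d, p in groups.items() if len(p) > 1}
```

Because each digest is a dictionary key, a duplicate is found with a single lookup instead of comparing every digest to every other one.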
Download:
https://github.com/lordmulder/DoubleFil ... ses/latest
Last edited by deathcubek on Sun Jun 22, 2014 4:43 pm, edited 1 time in total.

Re: Double File Scanner

#2 Post by deathcubek »

Double File Scanner v2.02
https://github.com/lordmulder/DoubleFil ... /tag/v2.02
Changes:
* Added automatic clean-up wizard
* Display the number of duplicates per group
* Display the size of each file
* Display tooltips when hovering tree view items
* Various minor fixes and improvements

Re: Double File Scanner

#3 Post by deathcubek »

Double File Scanner v2.03
https://github.com/lordmulder/DoubleFil ... /tag/v2.03
Changes:
- Performance optimizations
- Display file name and path in separate columns
- Further improved sorting of the results
- Various minor fixes and improvements

I am Baas
Posts: 4150
Joined: Thu Aug 07, 2008 4:51 am

Re: Double File Scanner

#4 Post by I am Baas »

Tested version 2.0.3.1 portable

webfork
Posts: 10821
Joined: Wed Apr 11, 2007 8:06 pm
Location: US, Texas

Re: Double File Scanner

#5 Post by webfork »

Dig the simplicity, the license, and the look-and-feel. Even the progress window looks good.

Wishlist:
  • Pause/cancel button
  • Configurations around criteria (file date, time, size, hash, etc). Usually I prefer programs with no settings, but a double file finder kinda needs it.
  • Some kind of indication of progress. I set this up on a 10 gig volume and it ran for 2 hours. I have no idea if it was working or what and nothing worked so I just did an End Task to shut it down.

Enternal
Posts: 89
Joined: Thu Jan 02, 2014 3:41 pm

Re: Double File Scanner

#6 Post by Enternal »

I LOOOVE CloneSpy. I use it all the time and it's definitely one of my favorites. But there is one issue with it that always drives me a bit nuts, and that is how UGLY it is :lol:

So this got my interest right away, and it's open source too, which makes it even nicer if you like to look at the code (in my case, for fun)! After playing around with it, it's very clear that I will continue to use CloneSpy, but this will also be in my toolbox. Each of them clearly has its purpose. CloneSpy gives you power in exchange for a steeper start and more complex options. This one is perfect for a "quick" scan of a folder and being done with it. Then again, I have a bad habit of keeping multiple tools with similar purposes on my USB drive.

Anyway, like webfork said, this should really come with a pause/stop button. That's probably the most important part for usability. A progress report is definitely a very nice thing to have, but it's not a total requirement.

Re: Double File Scanner

#7 Post by deathcubek »

webfork wrote:Pause/cancel button
Actually you can simply press ESC to abort. I think there's even a Tooltip for that. But maybe that should be more obvious :wink:

Anyway, I will think about a way to pause the process, although it's not as straightforward as one may think, because of how the various tasks are handled by the thread pool...

webfork wrote:Configurations around criteria (file date, time, size, hash, etc). Usually I prefer programs with no settings, but a double file finder kinda needs it.
So you want to match the files based on criteria other than the file's content (hash), such as the file's time and/or date?

Or do you want to stay with the same matching criteria (hash of file content) and just include/exclude files based on certain pre-filtering criteria?

The latter would be relatively easy to do, I think. The former, I don't know if it can be integrated nicely...

webfork wrote:Some kind of indication of progress. I set this up on a 10 gig volume and it ran for 2 hours. I have no idea if it was working or what and nothing worked so I just did an End Task to shut it down.
What's wrong with the current progress indicator :?:

I know that during the first phase, when we are still scanning the file system for files/directories, there is no real progress being displayed yet. But how should that be possible at all?

As long as we haven't generated the list of files yet, it is impossible to know how many files or sub-directories we will encounter in the next folder to be processed...

Think of it like you want to enumerate all nodes in a huge Tree. You start at the root node and you simply have no idea how many nodes there will be in each branch (sub-tree) unless you have actually processed it.

Re: Double File Scanner

#8 Post by webfork »

Actually you can simply press ESC to abort
Weird ... I should have tried that.
So you want to match the files based on criteria other than the file's content (hash), such as the file's time and/or date?
Sometimes I'm confident of the file's content and just want to check by name+date+filesize or name+filesize. In other words, it's not necessary to run a full checksum if you're confident of file integrity.

A great example is on media drives when checking lots of movies and music. Running checksums on the entire drive will take hours.
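
To make the request concrete, here is a rough Python sketch of the faster, less accurate matching I mean: group files by name and size only, without reading their content. It is not something Double File Scanner does today; group_by_name_and_size is a made-up name, and comparing names case-insensitively is my own assumption.

```python
import os
from collections import defaultdict


def group_by_name_and_size(root):
    """Group files under root by (lower-cased name, size in bytes)."""
    groups = defaultdict(list)
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # skip files that vanish or cannot be read
            groups[(name.lower(), size)].append(path)
    # Only keys shared by two or more files are duplicate candidates.
    return {key: sorted(paths) for key, paths in groups.items() if len(paths) > 1}
```

Two files with the same name and size can still differ in content, so the result is a list of candidates, not proof.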
I don't know if it can be integrated nicely...
DoubleKiller does a good job of this but is, unlike your programs, a mess of a UI. I'd say keep that as the default, maybe with an option to do a faster and less accurate version (with subsequent criteria checkboxes or whatever).

include/exclude files based on certain pre-filtering criteria?
That's also a great idea. I'd suggest putting .bak files as a template for other file types?
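
A pre-filter like that could be as small as the Python sketch below, which drops files whose names match an exclusion pattern before anything is hashed. The function name keep_file and the default "*.bak" pattern list are only illustrative assumptions, not an existing option of the program.

```python
from fnmatch import fnmatchcase


def keep_file(name, exclude_patterns=("*.bak",)):
    """Return True unless the file name matches an exclusion pattern."""
    lowered = name.lower()
    # Lower-case both sides so "OLD.BAK" is excluded as well.
    return not any(fnmatchcase(lowered, pattern.lower()) for pattern in exclude_patterns)
```

For example, keep_file("notes.txt") is True and keep_file("old.BAK") is False; more templates can be added by passing a longer pattern tuple.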

What's wrong with the current progress indicator ... I know that during the first phase
Yeah I never got out of the first phase so I dunno.

You start at the root node and you simply have no idea how many nodes there will be in each branch (sub-tree) unless you have actually processed it.
Can you run a quick check/scan at the beginning for total files? If total files is (Y) and the current checked number by the program is (X) maybe put that number up as "X files of Y checked"?

Re: Double File Scanner

#9 Post by deathcubek »

webfork wrote:A great example is on media drives when checking lots of movies and music. Running checksums on the entire drive will take hours.
This whole program was written with the idea of identifying files with identical content in mind. So matching files by other criteria than "content" cannot be added easily, but I'll think about a solution.

Anyway, if you just want to match files by their date (rather than by the actual content), you could probably just search for "*" in explorer and then sort the result by date...

Can you run a quick check/scan at the beginning for total files? If total files is (Y) and the current checked number by the program is (X) maybe put that number up as "X files of Y checked"?
That's exactly what happens during the first phase :o

In that phase we do nothing but determine the total number of files. The problem is, we start with some directory (the one the user has selected) and in that directory we find a certain number of files plus a certain number of sub-directories. So we remember those files (for the second phase) and we schedule those sub-directories to be scanned next. Then the same procedure is repeated in each of the pending sub-directories. Inside those sub-directories we will find even more files - and probably also a number of sub-sub-directories. So these sub-sub-directories need to be scheduled for processing as well. Then the sub-sub-sub-directories. And so on! Consequently, at any point, we know how many files and directories we have seen so far. But the total number of files isn't known until the first phase has been completed.
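
As a rough illustration only (the real program is built on Qt, and this is not its code), the first phase could be sketched in Python like this. The names enumerate_files and on_progress are made up for the example; the callback reports what is known so far, which is the only thing that can be displayed while the scan is still running.

```python
import os
from collections import deque


def enumerate_files(root, on_progress=None):
    """Collect all files below root using a queue of pending directories."""
    pending = deque([root])  # directories scheduled but not yet scanned
    files = []               # files remembered for the second phase
    dirs_seen = 0
    while pending:
        current = pending.popleft()
        dirs_seen += 1
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        pending.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        files.append(entry.path)
        except OSError:
            continue  # unreadable directory: skip it
        if on_progress:
            # Counts so far; the final total is unknown until pending is empty.
            on_progress(len(files), dirs_seen, len(pending))
    return files
```

The number of pending directories can grow or shrink at every step, so no meaningful percentage can be derived from it.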

Fortunately, the first phase actually is very fast. And it's the second phase, where the hash of each file is computed, that takes most of the time. Still, on a very large drive, even the first phase can take several minutes...

(BTW: Scanning my entire system drive, containing about 65 GB of data in about 260,000 files, took about 15 minutes overall)

Re: Double File Scanner

#10 Post by webfork »

Went back and checked ... pressing Esc works. No clue why I didn't try that.
That's exactly what happens during the first phase :o ... Fortunately, the first phase actually is very fast. And it's the second phase, where the hash of each file is computed, that takes most of the time. Still, on a very large drive, even the first phase can take several minutes...
Weird. There might be something with my setup. Let me be specific:
  1. Download and extract
  2. Click Start Scan
  3. Run on my TrueCrypt volume (with Recurse Directories checked)
  4. Click OK
  5. "Searching for files and directories, please be patient" appears
  6. Wait 10 mins
This is what I'm seeing: http://i.imgur.com/V7RVjuh.png

How long should it take to check the number of files in a volume? When I open the volume, select all files, right-click and select "properties", the number comes right up (in my case 12,000 files in 12.2 gigs).

Also, it's very CPU intensive during this first phase. What's up with that?

Re: Double File Scanner

#11 Post by deathcubek »

webfork wrote:Went back and checked ... pressing Esc works. No clue why I didn't try that.
That's exactly what happens during the first phase :o ... Fortunately, the first phase actually is very fast. And it's the second phase, where the hash of each file is computed, that takes most of the time. Still, on a very large drive, even the first phase can take several minutes...
Weird. There might be something with my setup. Let me be specific:
  1. Download and extract
  2. Click Start Scan
  3. Run on my TrueCrypt volume (with Recurse Directories checked)
  4. Click OK
  5. "Searching for files and directories, please be patient" appears
  6. Wait 10 mins
This is what I'm seeing: http://i.imgur.com/V7RVjuh.png
I didn't mean that the first phase necessarily takes a short time. Actually, it can take quite some time on a large volume! But it's certainly very fast compared to the second phase.

And running on a TrueCrypt volume, with the overhead of encryption, certainly doesn't make things faster :wink:

Anyway, you can run the Double File Scanner program with the "--console" switch in order to display some additional status information, if you want to.

webfork wrote:How long should it take to check the number of files in a volume? When I open the volume, select all files, right-click and select "properties", the number comes right up (in my case 12,000 files in 12.2 gigs).
You mean in Windows Explorer?

Well, either Windows Explorer directly uses some low-level Win32 API functions that are faster than Qt's QDirIterator class. Or it uses some smart caching strategy, so it doesn't actually need to scan the whole file system at the moment when you open the properties dialogue, but instead just grabs the info from its cache.

webfork wrote:Also, it's very CPU intensive during this first phase. What's up with that?
High CPU usage isn't a bad thing per se. Actually, the "CPU usage" you see in Task Manager is simply the fraction of time that the CPU has been working, as opposed to the time the CPU has been idle. So if, for example, you have 75% CPU usage, it means that, in the last time interval, the CPU has been working 75% of the time and it has been idle 25% of the time. In other words: 25% of the CPU cycles have been wasted unused! Okay, modern CPUs do not actually "waste" these CPU cycles (like old CPUs used to do), but fall into sleep state very quickly. But still these CPU cycles could have been used for something useful instead. So, from this perspective, you want the CPU usage to be as high as possible - in order to finish your task as quickly as possible.

Double File Scanner handles each directory as a separate task. It uses a thread pool to distribute these tasks on multiple threads and thus take advantage of multi-core processors. This way the overall process is much faster - and CPU usage will be higher (intentionally).
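
A simplified Python sketch of the "one task per directory" idea follows. It is not the actual implementation (which uses Qt's thread pool): the names scan_directory and parallel_scan are hypothetical, and for simplicity the sketch processes the tree one level at a time rather than scheduling each sub-directory the moment it is discovered.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def scan_directory(path):
    """One task: list the files and sub-directories of a single directory."""
    files, subdirs = [], []
    try:
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    subdirs.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    files.append(entry.path)
    except OSError:
        pass  # unreadable directory: report it as empty
    return files, subdirs


def parallel_scan(root, workers=4):
    """Scan a directory tree, running each directory task on a thread pool."""
    all_files = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        level = [root]
        while level:
            next_level = []
            # All directories of the current level are scanned concurrently.
            for files, subdirs in pool.map(scan_directory, level):
                all_files.extend(files)
                next_level.extend(subdirs)
            level = next_level
    return all_files
```

With several directories in flight at once, the worker threads keep more cores busy, which is where the higher CPU usage comes from.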

Enternal wrote:Anyway, like webfork said, this should really come with a pause/stop button.
This experimental version has suspend/resume support hacked in. Use the "Pause" button!

http://sourceforge.net/projects/mulders ... p/download

xrouge
Posts: 1
Joined: Wed Jan 04, 2017 11:05 am

Re: Double File Scanner

#12 Post by xrouge »

Is it possible to run the tool on the command line and generate a report without exporting it from the GUI?

smaragdus
Posts: 2120
Joined: Sat Jun 22, 2013 3:24 am
Location: Aeaea

Re: Double File Scanner

#13 Post by smaragdus »

Overall I like Double File Scanner, but I think it lacks a simple but essential feature: an option to delete selected files to the Recycle Bin. An illustration of what I mean is below:

Image