Grouping by Digest


Digests are essentially longer, more collision-resistant checksums. They are very useful when you want to be as certain as possible that zsDuplicateHunter will not falsely identify two different files as duplicates. A checksum is a 4-byte (32-bit) number, so there is some chance that two different files could share the same checksum (although the likelihood is very low). Digests range from 128 bits (MD5) and 160 bits (SHA-1) up to 512 bits (SHA-512). Collisions are theoretically possible even for digests (deliberately engineered collisions are known for MD5 and SHA-1, though an accidental collision between ordinary files remains extremely unlikely), while for the larger digests (SHA-256, SHA-384, and SHA-512) no collisions have ever been found and producing one is considered computationally infeasible.
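
To make the size difference concrete, here is a small Python sketch (illustrative only, not zsDuplicateHunter's own code) that computes a 32-bit checksum and several digests over the same data and prints how many bits each produces:

```python
import hashlib
import zlib

data = b"example file contents"

# CRC32 is a 32-bit (4-byte) checksum, so only 2**32 distinct values exist.
print(f"CRC32    {zlib.crc32(data):08x} (32 bits)")

# Digests are much longer, which makes accidental collisions far less likely.
for name in ("md5", "sha1", "sha256", "sha384", "sha512"):
    digest = hashlib.new(name, data).hexdigest()
    print(f"{name.upper():8} {digest} ({len(digest) * 4} bits)")
```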

 

If it is possible to have collisions using checksums and the smaller digests, why would you want to use them?

 

Calculating the larger digests takes considerably more time than calculating checksums or the smaller digests, and even with a checksum the chance of a false positive is very low. Ultimately, choosing a grouping method means balancing speed against accuracy to meet your needs.
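
Whichever algorithm you choose, the grouping step itself is the same: compute a value for every file and collect the files that share a value. The following Python sketch shows the general technique (a minimal illustration, not zsDuplicateHunter's actual implementation; the function names are hypothetical):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_digest(path, algorithm="sha256", chunk_size=64 * 1024):
    """Hash a file in fixed-size chunks so large files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def group_by_digest(folder, algorithm="sha256"):
    """Group files under `folder` by digest; any group with two or more
    files contains candidate duplicates."""
    groups = defaultdict(list)
    for path in Path(folder).rglob("*"):
        if path.is_file():
            groups[file_digest(path, algorithm)].append(path)
    return {d: paths for d, paths in groups.items() if len(paths) > 1}
```

For comparison purposes, here are some timings using various checksum and digest methods for grouping.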

Timing Example 1

Timings for a small folder containing approximately 3,800 files, most of them quite small.

 

Grouping Type      Time (seconds)
Adler Checksum          14.2
CRC32 Checksum          13.7
CRC32 and Size          14.0
MD5 Digest              14.8
SHA-1 Digest            15.2
SHA-256 Digest          15.9
SHA-384 Digest          18.2
SHA-512 Digest          18.2

 

In this case, the times correlate well with the strength of the algorithm (the stronger algorithms take longer to compute), although the differences are modest because the files are small. In all cases, the same duplicates were found, so none of the methods produced a false positive.

 

Timing Example 2

Timings for a large folder containing approximately 59,000 files. The files are larger than in the first test.

 

Grouping Type      Time (minutes)
Adler Checksum          3.4
CRC32 Checksum          3.7
MD5 Digest              6.1
SHA-1 Digest            6.6
SHA-512 Digest          7.9

 

Again, the times correlate well with the strength of the algorithm (the stronger algorithms take significantly longer). In this case, though, the digest algorithms are nearly twice as slow as the checksum algorithms. This shows that the time to calculate checksums and digests is closely related to the size of the files.
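
You can observe this size dependence directly by timing the algorithms on buffers of increasing size. The Python sketch below (illustrative only; absolute numbers will vary by machine) compares a 32-bit checksum with the largest digest:

```python
import hashlib
import time
import zlib

def average_time(fn, data, repeats=5):
    """Return the mean wall-clock time of fn(data) over several runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(data)
    return (time.perf_counter() - start) / repeats

for size_mb in (1, 16, 64):
    data = b"\x00" * (size_mb * 1024 * 1024)
    crc = average_time(zlib.crc32, data)
    sha = average_time(lambda d: hashlib.sha512(d).digest(), data)
    print(f"{size_mb:3d} MiB  CRC32 {crc * 1000:7.1f} ms  SHA-512 {sha * 1000:7.1f} ms")
```

On typical hardware both times grow roughly in step with the buffer size, with SHA-512 consistently slower than CRC32, which matches the folder timings above.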