Digests are essentially stronger, more accurate checksums. They are useful when you want to be certain that zsDuplicateHunter will not falsely identify two different files as duplicates. A checksum is only a 4-byte (32-bit) value, so there is a small chance that two different files will have the same checksum (the likelihood is very low, though). Digests range from 128 bits (MD5) and 160 bits (SHA-1) up to 512 bits (SHA-512). Collisions are theoretically possible (although extremely unlikely to occur by accident) for the smaller digests, while no collisions have ever been found for the larger digests (SHA-256, SHA-384, and SHA-512).
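As a rough illustration (this is not code from zsDuplicateHunter), the sketch below computes a 32-bit CRC-32 checksum, a 128-bit MD5 digest, and a 256-bit SHA-256 digest for the same file using the standard Java library; the file name and class name are only placeholders.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.zip.CRC32;

public class HashSizes {
    public static void main(String[] args) throws Exception {
        // Placeholder path; substitute any file you want to inspect.
        byte[] data = Files.readAllBytes(Paths.get("example.dat"));

        // CRC-32 checksum: a single 32-bit (4-byte) value.
        CRC32 crc = new CRC32();
        crc.update(data);
        System.out.printf("CRC-32 : %08x (32 bits)%n", crc.getValue());

        // MD5 digest: 128 bits (16 bytes).
        byte[] md5 = MessageDigest.getInstance("MD5").digest(data);
        System.out.printf("MD5    : %s (%d bits)%n", toHex(md5), md5.length * 8);

        // SHA-256 digest: 256 bits (32 bytes); far more collision-resistant.
        byte[] sha256 = MessageDigest.getInstance("SHA-256").digest(data);
        System.out.printf("SHA-256: %s (%d bits)%n", toHex(sha256), sha256.length * 8);
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
```

The longer the hash value, the smaller the chance that two different files produce the same value, which is why the larger digests give the greatest confidence when grouping duplicates.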
If it is possible to have collisions using checksums and the smaller digests, why would you want to use them?
Calculating the larger digests takes considerably more time than calculating checksums or the smaller digests, and the chance of getting a false positive with the faster methods is very low. Ultimately, choosing a grouping method means balancing speed against accuracy to meet your needs. For comparison, here are some timings using various checksum and digest methods for grouping.

Timing Example 1
Timings for a small folder containing approximately 3,800 files, most of them quite small.
In this case, the times correlate well with the strength of the algorithm (the stronger algorithms take significantly longer). In all cases the same number of duplicates was found, so none of the methods produced false positives.
Timing Example 2
Timings for a large folder containing approximately 59,000 files. The files are larger than in the first test.
Again, the times correlate well with the strength of the algorithm (the stronger algorithms take significantly longer). In this case, though, the digest algorithms are nearly twice as slow as the checksum algorithms. This shows that the time to calculate digests and checksums depends on the size of the files: with larger files, the extra per-byte work done by the stronger algorithms becomes much more noticeable.
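If you want to reproduce this kind of comparison on your own data, the following minimal sketch (again, not zsDuplicateHunter's own code) groups the files in a folder by hash value using several algorithms and prints how long each pass takes; the folder path is a placeholder.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.CRC32;

public class GroupingBenchmark {
    public static void main(String[] args) throws Exception {
        // Placeholder folder; point this at a directory with many files.
        Path folder = Paths.get("test-folder");
        List<Path> files;
        try (Stream<Path> walk = Files.walk(folder)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        // Compare a 32-bit checksum against digests of increasing strength.
        for (String algorithm : new String[] {"CRC-32", "MD5", "SHA-1", "SHA-256", "SHA-512"}) {
            long start = System.nanoTime();
            Map<String, Integer> groups = new HashMap<>();
            for (Path file : files) {
                String key = hash(algorithm, Files.readAllBytes(file));
                groups.merge(key, 1, Integer::sum);
            }
            long duplicateGroups = groups.values().stream().filter(n -> n > 1).count();
            System.out.printf("%-8s %6d groups with duplicates, %.1f s%n",
                    algorithm, duplicateGroups, (System.nanoTime() - start) / 1e9);
        }
    }

    // Hash the file contents with either CRC-32 or a MessageDigest algorithm.
    private static String hash(String algorithm, byte[] data) throws Exception {
        if (algorithm.equals("CRC-32")) {
            CRC32 crc = new CRC32();
            crc.update(data);
            return Long.toHexString(crc.getValue());
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : MessageDigest.getInstance(algorithm).digest(data)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

Because every algorithm has to read every byte of every file, the measured times grow with the total amount of data, and the gap between the fast checksums and the stronger digests widens as the files get larger.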