Deduplicating Near Duplicates
Any modification to an image file, even one barely visible to the human eye, changes its binary representation enough that standard deduplication methods fail.
Such modifications include:
- Recoding the image to another file format
- Changing the image size in pixels
- Changing the image size in bytes by changing the JPEG quality
- Tuning of color or brightness
- Minor cropping
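To make the failure mode concrete, here is a minimal sketch that assumes "standard deduplication" means comparing a hash of the raw file bytes (a common approach, though not the only one). The byte strings are stand-ins rather than real image files, and `file_digest` is a helper name made up for this illustration.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of raw file bytes."""
    return hashlib.sha256(data).hexdigest()


# Stand-in for the bytes of an image file (not a real image).
original = bytes(range(256)) * 4

# A byte-for-byte copy: exact deduplication catches this.
exact_copy = bytes(original)

# The same data with a single bit flipped, standing in for any small edit
# such as re-encoding or a slight brightness change.
tweaked = bytearray(original)
tweaked[100] ^= 0x01
modified = bytes(tweaked)

same = file_digest(original) == file_digest(exact_copy)
differs = file_digest(original) != file_digest(modified)

print("exact copy detected:", same)
print("modified copy missed:", differs)
```

A real re-encode or resize changes far more than one bit, so the hashes of near-duplicate images have nothing in common.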
To deduplicate a set of images that includes near-duplicates, one can use semantic search. The idea is to encode the images with an appropriate deep neural network and then compare the distances between the resulting vectors. Images that are close in this vector space are likely to be near-duplicates.
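The distance comparison can be sketched in a few lines. This example assumes the embeddings already exist (the three-dimensional vectors below are made up for illustration; real image embeddings have hundreds of dimensions) and assumes cosine distance as the metric, which is one common choice. The function names and the threshold value are likewise illustrative.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def near_duplicate_pairs(embeddings, threshold):
    """Return all pairs of ids whose vectors are closer than `threshold`."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        if cosine_distance(vec_a, vec_b) < threshold:
            pairs.append((id_a, id_b))
    return pairs


# Made-up embeddings: the first two vectors point in almost the same direction.
embeddings = {
    "cat.jpg": [0.90, 0.10, 0.00],
    "cat_resized.jpg": [0.89, 0.11, 0.01],
    "dog.jpg": [0.10, 0.90, 0.20],
}

pairs = near_duplicate_pairs(embeddings, threshold=0.01)
print(pairs)
```

Comparing every pair is quadratic in the number of images, which is why a search index that returns only the nearest neighbors is the practical choice for larger sets.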
Sounds complicated? It doesn't have to be! With a Nyckel Image Search function, this can be done with the following pseudo-code.
INPUT: set of images that need to be deduplicated
- Create a new image search function.
- Add all images to the function. Store the id for each image.
- Search the function for each image using sampleCount=2. Store each response.
- For each image IMG:
  - Get the search response corresponding to IMG.
  - Read out the second searchSample entry from the response. (The first entry corresponds to the self-match.)
  - If searchSample.distance < 1%, IMG and searchSample.sampleId are near-duplicates. Remove one of them.
- Optionally delete the function.
OUTPUT: set of deduplicated images
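The decision loop in the pseudo-code above can be sketched as follows. This is not Nyckel client code: `search` is a hypothetical callable standing in for the search request, assumed to return (sample id, distance) pairs ordered closest first, and the 1% threshold is written as 0.01 on the assumption that distances lie between 0 and 1. Rather than relying on the self-match being first, the sketch skips any entry whose id equals the query image.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical response shape: (sample id, distance) pairs, closest first.
SearchResponse = List[Tuple[str, float]]


def deduplicate(
    image_ids: List[str],
    search: Callable[[str], SearchResponse],
    threshold: float = 0.01,
) -> List[str]:
    """Return image_ids with one image of each near-duplicate pair removed."""
    removed: Set[str] = set()
    for img_id in image_ids:
        if img_id in removed:
            continue  # already dropped as a duplicate of an earlier image
        # Ignore the self-match and look at the closest other image.
        others = [entry for entry in search(img_id) if entry[0] != img_id]
        if not others:
            continue
        neighbor_id, distance = others[0]
        if distance < threshold and neighbor_id not in removed:
            removed.add(neighbor_id)  # keep img_id, drop its near-duplicate
    return [img_id for img_id in image_ids if img_id not in removed]


# Made-up search responses for three images, used in place of a real service.
fake_responses = {
    "a.jpg": [("a.jpg", 0.0), ("a_small.jpg", 0.004)],
    "a_small.jpg": [("a_small.jpg", 0.0), ("a.jpg", 0.004)],
    "b.jpg": [("b.jpg", 0.0), ("a.jpg", 0.35)],
}

kept = deduplicate(list(fake_responses), fake_responses.__getitem__)
print(kept)
```

Because each response holds only the single nearest neighbor, a group of three or more mutually similar images may need a second pass to be reduced to one representative.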
Two tips for speeding this up:
- Resize images so that the largest side is 200 pixels for faster uploads.
- Use multithreading when adding to and searching the function (Steps 2 & 3).
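Both tips can be sketched with the Python standard library alone. `scaled_size` only computes the target dimensions (the resize itself would be done with an imaging library of your choice), and `fake_upload` is a hypothetical placeholder for whatever function actually sends one image to the search function. The worker count of 8 is an arbitrary starting point.

```python
from concurrent.futures import ThreadPoolExecutor


def scaled_size(width, height, max_side=200):
    """Return (width, height) scaled so the largest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def run_in_threads(task, items, max_workers=8):
    """Apply task to every item using a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, items))


def fake_upload(name):
    """Hypothetical placeholder for a network call that returns a sample id."""
    return f"id-{name}"


sample_ids = run_in_threads(fake_upload, ["a.jpg", "b.jpg", "c.jpg"])
print(scaled_size(4000, 3000))
print(sample_ids)
```

Threads suit this workload because uploads and searches spend most of their time waiting on the network rather than on the CPU.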