Deduplicating Near Duplicates

Any modification to an image file, even one barely visible to the human eye, changes its binary representation enough that standard de-duplication methods fail.
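
To make this concrete, here is a minimal sketch that assumes "standard de-duplication" means comparing file hashes, which is one common approach. The file names and byte strings are made-up stand-ins for real image files.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def exact_duplicates(files: dict[str, bytes]) -> list[tuple[str, str]]:
    """Pair each file with the first earlier file that has identical bytes."""
    seen: dict[str, str] = {}
    pairs = []
    for name, data in files.items():
        digest = sha256_digest(data)
        if digest in seen:
            pairs.append((seen[digest], name))
        else:
            seen[digest] = name
    return pairs


# Stand-in byte strings (assumption); real image files behave the same way.
original = bytes(range(256)) * 4
copy = bytes(original)
tweaked = bytearray(original)
tweaked[100] ^= 1  # flip a single bit, far less than any visible edit

files = {"a.jpg": original, "b.jpg": copy, "c.jpg": bytes(tweaked)}
pairs = exact_duplicates(files)
print(pairs)
```

Only the byte-identical copy is paired with the original. The one-bit variant gets a completely different digest, so a hash comparison treats it as a new image.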

Example Near Duplicates

Such modifications include:

  • Recoding the image to another file format
  • Changing the image size in pixels
  • Changing the image size in bytes by changing the JPEG quality
  • Tuning of color or brightness
  • Minor cropping

To deduplicate a set of images that includes near-duplicates, one can use semantic search. The idea is to encode each image with an appropriate deep neural network and then compare the distances between the resulting vectors. Images that are close in this vector space are likely to be near-duplicates.
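
As a rough illustration of the distance comparison, the sketch below assumes cosine distance as the metric and uses tiny hand-written vectors in place of real network embeddings. Both the metric and the numbers are assumptions made for the example, not a description of what any particular search service does internally.

```python
import math


def cosine_distance(u: list[float], v: list[float]) -> float:
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def near_duplicate_pairs(
    embeddings: dict[str, list[float]], threshold: float
) -> list[tuple[str, str]]:
    """Return every pair of names whose embeddings are closer than threshold."""
    names = sorted(embeddings)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if cosine_distance(embeddings[first], embeddings[second]) < threshold:
                pairs.append((first, second))
    return pairs


# Made-up 3-dimensional "embeddings"; real ones have hundreds of dimensions.
embeddings = {
    "cat.jpg": [0.90, 0.10, 0.40],
    "cat_resized.jpg": [0.91, 0.09, 0.41],
    "dog.jpg": [0.10, 0.95, 0.20],
}
pairs = near_duplicate_pairs(embeddings, threshold=0.01)
print(pairs)
```

The all-pairs loop is quadratic in the number of images, which is fine for a toy example but is exactly the work a search index is meant to take off your hands.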

Sounds complicated? It doesn't have to be! Using a Nyckel Image Search function, this can be done with the following pseudo-code.


INPUT: set of images that need to be deduplicated

  1. Create a new image search function.
  2. Add all images to the function. Store the id for each image.
  3. Search the function for each image using sampleCount=2. Store each response.
  4. For each image IMG
    1. Get the search response corresponding to IMG
    2. Read out the second searchSample entry from the response. (The first entry corresponds to the self-match).
    3. If searchSample.distance < 1%, IMG and searchSample.sampleId are near-duplicates.
  5. Delete the function.
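
Step 4 above can be sketched in Python as follows. This is only an outline under stated assumptions: `search` is a hypothetical callable standing in for whatever client call returns the search response, each response is assumed to be a list of (sample id, distance) pairs sorted nearest first, and the 1% threshold is read as a distance of 0.01. None of these names come from the Nyckel API.

```python
from typing import Callable, Iterable

# Hypothetical response shape: (sample_id, distance) pairs, nearest first.
SearchResponse = list[tuple[str, float]]


def find_near_duplicates(
    image_ids: Iterable[str],
    search: Callable[[str], SearchResponse],
    threshold: float = 0.01,
) -> set[tuple[str, str]]:
    """Return unordered pairs of ids whose nearest non-self match is under threshold."""
    pairs = set()
    for image_id in image_ids:
        # Drop the self-match, then take the nearest remaining entry.
        # When the self-match comes first, this is the second entry.
        matches = [m for m in search(image_id) if m[0] != image_id]
        if not matches:
            continue
        neighbor_id, distance = matches[0]
        if distance < threshold:
            # Sort the ids so (a, b) and (b, a) collapse into one pair.
            pairs.add(tuple(sorted((image_id, neighbor_id))))
    return pairs


# Canned responses standing in for real search results (made-up numbers).
fake_index = {
    "img1": [("img1", 0.0), ("img2", 0.004)],
    "img2": [("img2", 0.0), ("img1", 0.004)],
    "img3": [("img3", 0.0), ("img1", 0.35)],
}
duplicates = find_near_duplicates(fake_index, fake_index.__getitem__)
print(duplicates)
```

Passing the search call in as a parameter keeps the pairing logic testable without network access. Because only one neighbor is inspected per image, a group of three or more near-duplicates is reported as a chain of pairs, not as a single cluster.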

Optional speed-ups:

  1. Resize images to 224x224 pixels for faster uploads.
  2. Use multithreading when adding to and searching the function (Steps 2 & 3).
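
One way to apply the multithreading speed-up is a thread pool from the Python standard library. In the sketch below, `fake_upload` is a hypothetical stand-in for the real add-image request, and `max_workers=8` is an arbitrary choice; an appropriate value depends on your network and on any rate limits that apply.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def run_in_threads(
    task: Callable[[str], str], items: Sequence[str], max_workers: int = 8
) -> list[str]:
    """Apply task to every item in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, items))


def fake_upload(path: str) -> str:
    """Hypothetical stand-in for a network call that returns a sample id."""
    return f"sample-{path}"


paths = ["a.jpg", "b.jpg", "c.jpg"]
# Keep the path -> id mapping so later search results can be traced back.
sample_ids = dict(zip(paths, run_in_threads(fake_upload, paths)))
print(sample_ids)
```

Threads suit this job because the work is dominated by waiting on network requests. The same helper can be reused for the search step by passing a different task.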

Code sample

A Python code sample is provided in our code samples repo.

python -m dedupe <nyckel_client_id> <nyckel_secret_id> <path_to_folder_with_image_files>