Running simages from the console

simages can be run locally in the terminal with simages-show.

Usage:

simages-show --data-dir .

See all the options for simages-show with simages-show --help:

Find similar pairs of images in a folder

usage: simages-show [-h] [--data-dir DATA_DIR] [--show-train]
                    [--epochs EPOCHS] [--num-channels NUM_CHANNELS]
                    [--pairs PAIRS] [--zdim ZDIM]

Named Arguments

--data-dir, -d
    Folder containing image data

--show-train, -t
    Show training of embedding extractor every epoch

--epochs, -e
    Number of passes of dataset through model for training. More is better but takes more time.
    Default: 2

--num-channels, -c
    Number of channels for data (1 for grayscale, 3 for color)
    Default: 3

--pairs, -p
    Number of pairs of images to show
    Default: 10

--zdim, -z
    Compression bits (bigger generally performs better but takes more time)
    Default: 8
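
The options above can also be assembled programmatically, for example when simages-show is driven from a script. The following is a minimal sketch, not part of simages: the helper name build_simages_show_command and the folder name "photos" are hypothetical, and only the flags listed above are used.

```python
import shlex


def build_simages_show_command(data_dir=".", epochs=2, num_channels=3,
                               pairs=10, zdim=8, show_train=False):
    """Return the argv list for a simages-show call.

    Hypothetical helper: it only strings together the documented flags,
    using the documented defaults.
    """
    command = [
        "simages-show",
        "--data-dir", str(data_dir),
        "--epochs", str(epochs),
        "--num-channels", str(num_channels),
        "--pairs", str(pairs),
        "--zdim", str(zdim),
    ]
    if show_train:
        # --show-train is a flag without a value
        command.append("--show-train")
    return command


if __name__ == "__main__":
    # "photos" is a placeholder folder name
    argv = build_simages_show_command(data_dir="photos", epochs=5, pairs=20)
    print(shlex.join(argv))
    # To actually run it (requires simages to be installed):
    # import subprocess; subprocess.run(argv, check=True)
```

Returning a list rather than a single string lets the result be passed to subprocess.run without shell quoting concerns.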

simages-show calls find_duplicates():

simages.main.find_duplicates(input: Union[str, numpy.ndarray], n: int = 5, num_epochs: int = 2, num_channels: int = 3, show: bool = False, show_train: bool = False, show_path: bool = True, z_dim: int = 8, db=None, **kwargs) → Tuple[numpy.ndarray, numpy.ndarray]

Find duplicates in a dataset. The input must be either an np.ndarray or a path to an image folder.

Parameters:
  • input (str or np.ndarray) – folder directory or N x C x H x W array
  • n (int) – number of closest pairs to identify
  • num_epochs (int) – how long to train the autoencoder (more is generally better)
  • show (bool) – display the closest pairs
  • show_train (bool) – show training output every epoch
  • show_path (bool) – show image paths of duplicates instead of index
  • z_dim (int) – size of compression (more is generally better, but slower)
  • kwargs (dict) – etc, passed to EmbeddingExtractor
Returns:

pairs (np.ndarray): indices for closest pairs of images
distances (np.ndarray): distances of each pair to each other

Return type:

Tuple[np.ndarray, np.ndarray]
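
As an illustration of the call and its two return values, here is a minimal sketch based only on the signature above. The helper names find_closest_pairs and describe_pairs are hypothetical, and the sketch assumes that each entry of pairs holds two image indices and that distances holds one value per pair; check this against the arrays you actually get back.

```python
def find_closest_pairs(image_dir, n=5):
    """Run find_duplicates on a folder and return (pairs, distances).

    Hypothetical wrapper. The import is done lazily so that
    describe_pairs below can be used without simages installed.
    """
    from simages.main import find_duplicates

    return find_duplicates(
        image_dir,
        n=n,             # number of closest pairs to identify
        num_epochs=2,    # documented default
        num_channels=3,  # color images
        show=False,      # do not open a display window
        z_dim=8,         # documented default
    )


def describe_pairs(pairs, distances):
    """Format each pair and its distance as one line of text.

    Assumes each entry of `pairs` unpacks into two items.
    """
    lines = []
    for pair, distance in zip(pairs, distances):
        first, second = pair
        lines.append(f"{first} <-> {second}: distance {float(distance):.4f}")
    return lines


if __name__ == "__main__":
    # "photos" is a placeholder folder name; requires simages to be installed
    pairs, distances = find_closest_pairs("photos", n=5)
    for line in describe_pairs(pairs, distances):
        print(line)
```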

Web Interface (optional)

Alternatively, simages can remove duplicate images from a dataset interactively through a web interface.

  • Install mongodb on your system.
  • Install additional dependencies (Flask and PyMongo) with pip install "simages[all]"
  • Add images to the database via simages add {image_folder_path}.
  • Find duplicates and run the web server with simages find {image_folder_path}.

Add your pictures to the database (this will take some time depending on the number of pictures):

simages add <images_folder_path>

Find duplicates; a webpage will come up with all of the similar or duplicate pictures:

simages find <images_folder_path>

Screenshot: https://raw.githubusercontent.com/justinshenk/simages/master/images/screenshot_server.png


Usage:
    simages add <path> ... [--db=<db_path>] [--parallel=<num_processes>]
    simages remove <path> ... [--db=<db_path>]
    simages clear [--db=<db_path>]
    simages show [--db=<db_path>]
    simages find <path> [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>] [--epochs=<epochs>]
    simages -h | --help
Options:
    -h, --help                Show this screen
    --db=<db_path>            The location of the database or a MongoDB URI. (default: ./db)
    --parallel=<num_processes> The number of parallel processes to run to hash the image
                               files (default: number of CPUs).
    find:
        --print               Only print duplicate files rather than displaying HTML file
        --delete              Move all found duplicate pictures to the trash. This option takes priority over --print.
        --match-time          Adds the extra constraint that duplicate images must have the
                              same capture times in order to be considered.
        --trash=<trash_path>  Where files will be put when they are deleted (default: ./Trash)
        --epochs=<epochs>     Epochs for training [default: 2]

simages calls cli().
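
The add-then-find workflow described above can be scripted as two consecutive commands. The following is a minimal sketch under stated assumptions: the helper name build_web_workflow and the folder name "photos" are hypothetical, and only the --db and --epochs options from the usage text are used.

```python
def build_web_workflow(image_dir, db=None, epochs=None):
    """Return two argv lists: `simages add` followed by `simages find`.

    Hypothetical helper; it only assembles options documented in the
    usage text and leaves everything else at the CLI defaults.
    """
    shared = []
    if db is not None:
        # --db is accepted by both subcommands
        shared.append(f"--db={db}")

    add_command = ["simages", "add", str(image_dir)] + shared
    find_command = ["simages", "find", str(image_dir)] + shared
    if epochs is not None:
        # --epochs is listed for `find` only
        find_command.append(f"--epochs={epochs}")
    return [add_command, find_command]


if __name__ == "__main__":
    # "photos" is a placeholder folder name
    for argv in build_web_workflow("photos", db="./db", epochs=3):
        print(" ".join(argv))
        # To actually run each step (requires simages and MongoDB):
        # import subprocess; subprocess.run(argv, check=True)
```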