Running simages from the console

simages can be run locally in the terminal with simages-show.

Usage:

simages-show --data-dir .

See all the options for simages-show with simages-show --help:

Find similar pairs of images in a folder

usage: simages-show [-h] [--data-dir DATA_DIR] [--show-train]
                    [--epochs EPOCHS] [--num-channels NUM_CHANNELS]
                    [--pairs PAIRS] [--zdim ZDIM]

Named Arguments

`--data-dir, -d`	Folder containing image data.
`--show-train, -t`	Show training of embedding extractor every epoch.
`--epochs, -e`	Number of passes of dataset through model for training. More is better but takes more time. Default: 2
`--num-channels, -c`	Number of channels for data (1 for grayscale, 3 for color). Default: 3
`--pairs, -p`	Number of pairs of images to show. Default: 10
`--zdim, -z`	Compression bits (bigger generally performs better but takes more time). Default: 8
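
The option list above maps naturally onto an argparse parser. The following is a minimal sketch that reconstructs the documented flags and defaults; it is not the simages source, and `build_parser` is a hypothetical name. `--data-dir` has no documented default, so the sketch leaves it as `None`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the documented simages-show options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="simages-show",
        description="Find similar pairs of images in a folder",
    )
    parser.add_argument("--data-dir", "-d", type=str,
                        help="Folder containing image data")
    parser.add_argument("--show-train", "-t", action="store_true",
                        help="Show training of embedding extractor every epoch")
    parser.add_argument("--epochs", "-e", type=int, default=2,
                        help="Number of passes of dataset through model")
    parser.add_argument("--num-channels", "-c", type=int, default=3,
                        help="1 for grayscale, 3 for color")
    parser.add_argument("--pairs", "-p", type=int, default=10,
                        help="Number of pairs of images to show")
    parser.add_argument("--zdim", "-z", type=int, default=8,
                        help="Compression bits")
    return parser


# Parse an explicit argument list instead of sys.argv, so the sketch can be
# run from any script or interactive session.
args = build_parser().parse_args(["--data-dir", ".", "--pairs", "5"])
print(vars(args))
```

Options that are not passed keep the defaults listed above, so `args.epochs` stays at 2 and `args.zdim` at 8 in this call.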

simages-show calls find_duplicates():

simages.main.find_duplicates(input: Union[str, numpy.ndarray], n: int = 5, num_epochs: int = 2, num_channels: int = 3, show: bool = False, show_train: bool = False, show_path: bool = True, z_dim: int = 8, db=None, **kwargs) → Tuple[numpy.ndarray, numpy.ndarray]

Find duplicates in a dataset. Either an np.ndarray or a path to an image folder must be specified as input.

Parameters:
	input (str or np.ndarray) – folder directory or N x C x H x W array
	n (int) – number of closest pairs to identify
	num_epochs (int) – how long to train the autoencoder (more is generally better)
	show (bool) – display the closest pairs
	show_train (bool) – show training output every epoch
	show_path (bool) – show image paths of duplicates instead of index
	z_dim (int) – size of compression (more is generally better, but slower)
	kwargs (dict) – etc, passed to EmbeddingExtractor
Returns:
	pairs (np.ndarray) – indices for closest pairs of images
	distances (np.ndarray) – distances of each pair to each other
Return type:	Tuple[numpy.ndarray, numpy.ndarray]
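
As a hedged illustration of the signature above, the sketch below wraps find_duplicates() and pairs each returned index pair with its distance. The wrapper name `closest_pairs` and its `finder` parameter are hypothetical; the hook exists only so the example can be exercised without training a model. The sketch also assumes that the two returned arrays are aligned element by element, as the return description suggests.

```python
from typing import Callable, List, Optional, Tuple


def closest_pairs(data, n: int = 5,
                  finder: Optional[Callable] = None) -> List[Tuple]:
    """Return [(pair, distance), ...] for the n closest image pairs.

    `data` is a folder path or an N x C x H x W array, as documented.
    `finder` is a hypothetical hook; by default simages is imported lazily.
    """
    if finder is None:
        # Imported here so this module loads even without simages installed.
        from simages.main import find_duplicates as finder

    # `n` and `show` are keyword arguments from the documented signature.
    pairs, distances = finder(data, n=n, show=False)

    # Assumption: pairs[i] is described by distances[i].
    return list(zip(pairs, distances))
```

With simages installed, a call such as `closest_pairs("path/to/images", n=10)` would train the embedding extractor before returning; the folder path here is a placeholder.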

Web Interface (optional)

Alternatively, simages provides a web interface for removing duplicate images from a dataset interactively. Setup takes four steps:

1. Install MongoDB on your system.
2. Install additional dependencies (Flask and PyMongo) with pip install "simages[all]".
3. Add images to the database via simages add <images_folder_path>.
4. Find duplicates and run the web server with simages find <images_folder_path>.

Add your pictures to the database (this will take some time, depending on the number of pictures):

simages add <images_folder_path>

Then find duplicates. A webpage will come up with all of the similar or duplicate pictures:

simages find <images_folder_path>

Screenshot of the web interface: https://raw.githubusercontent.com/justinshenk/simages/master/images/screenshot_server.png


Usage:
    simages add <path> ... [--db=<db_path>] [--parallel=<num_processes>]
    simages remove <path> ... [--db=<db_path>]
    simages clear [--db=<db_path>]
    simages show [--db=<db_path>]
    simages find <path> [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>] [--epochs=<epochs>]
    simages -h | --help
Options:
    -h, --help                Show this screen
    --db=<db_path>            The location of the database or a MongoDB URI. (default: ./db)
    --parallel=<num_processes> The number of parallel processes to run to hash the image
                               files (default: number of CPUs).
    find:
        --print               Only print duplicate files rather than displaying HTML file
        --delete              Move all found duplicate pictures to the trash. This option takes priority over --print.
        --match-time          Adds the extra constraint that duplicate images must have the
                              same capture times in order to be considered.
        --trash=<trash_path>  Where files will be put when they are deleted (default: ./Trash)
        --epochs=<epochs>     Epochs for training [default: 2]

simages calls cli().
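
For scripted use, the usage text above can be turned into argument lists for subprocess. The sketch below is a convenience built on assumptions and is not part of simages: `add_command`, `find_command`, and `run` are hypothetical helpers that only assemble flags listed in the usage text.

```python
import subprocess
from typing import List, Optional


def add_command(paths: List[str], db: Optional[str] = None,
                parallel: Optional[int] = None) -> List[str]:
    """Build the argument list for `simages add <path> ...`."""
    cmd = ["simages", "add", *paths]
    if db is not None:
        cmd.append(f"--db={db}")
    if parallel is not None:
        cmd.append(f"--parallel={parallel}")
    return cmd


def find_command(path: str, print_only: bool = False, delete: bool = False,
                 match_time: bool = False, trash: Optional[str] = None,
                 db: Optional[str] = None,
                 epochs: Optional[int] = None) -> List[str]:
    """Build the argument list for `simages find <path>`."""
    cmd = ["simages", "find", path]
    if print_only:
        cmd.append("--print")
    if delete:
        # Per the usage text, --delete takes priority over --print.
        cmd.append("--delete")
    if match_time:
        cmd.append("--match-time")
    if trash is not None:
        cmd.append(f"--trash={trash}")
    if db is not None:
        cmd.append(f"--db={db}")
    if epochs is not None:
        cmd.append(f"--epochs={epochs}")
    return cmd


def run(cmd: List[str]) -> subprocess.CompletedProcess:
    """Execute a command; requires simages to be installed and on PATH."""
    return subprocess.run(cmd, check=True)
```

Building the list separately from running it makes the command easy to inspect or log first, for example `find_command("photos", print_only=True)` before passing the result to `run`; the folder name is a placeholder.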