Running simages from the console
simages can be run locally in the terminal with simages-show.
Usage:
simages-show --data-dir .
See all the options for simages-show with simages-show --help:
Find similar pairs of images in a folder
usage: simages-show [-h] [--data-dir DATA_DIR] [--show-train]
[--epochs EPOCHS] [--num-channels NUM_CHANNELS]
[--pairs PAIRS] [--zdim ZDIM]
Named Arguments
--data-dir, -d | Folder containing image data |
--show-train, -t | Show training of embedding extractor every epoch |
--epochs, -e | Number of passes of dataset through model for training. More is better but takes more time. Default: 2 |
--num-channels, -c | Number of channels for data (1 for grayscale, 3 for color). Default: 3 |
--pairs, -p | Number of pairs of images to show. Default: 10 |
--zdim, -z | Compression bits (bigger generally performs better but takes more time). Default: 8 |
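The options above can be mirrored in a small argparse sketch, which makes the flag names, types, and defaults easy to inspect. This is a hypothetical reconstruction from the help text, not the package's own source; `build_parser` and `to_find_duplicates_kwargs` are illustrative names, and the mapping from flags to `find_duplicates()` keyword arguments is an assumption based on matching names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical mirror of the documented simages-show options."""
    parser = argparse.ArgumentParser(
        prog="simages-show",
        description="Find similar pairs of images in a folder",
    )
    parser.add_argument("--data-dir", "-d", help="Folder containing image data")
    parser.add_argument("--show-train", "-t", action="store_true",
                        help="Show training of embedding extractor every epoch")
    parser.add_argument("--epochs", "-e", type=int, default=2,
                        help="Number of passes of dataset through model")
    parser.add_argument("--num-channels", "-c", type=int, default=3,
                        help="1 for grayscale, 3 for color")
    parser.add_argument("--pairs", "-p", type=int, default=10,
                        help="Number of pairs of images to show")
    parser.add_argument("--zdim", "-z", type=int, default=8,
                        help="Compression bits")
    return parser


def to_find_duplicates_kwargs(args: argparse.Namespace) -> dict:
    """Assumed mapping from CLI flags to find_duplicates() keyword arguments."""
    return {
        "input": args.data_dir,
        "n": args.pairs,
        "num_epochs": args.epochs,
        "num_channels": args.num_channels,
        "show_train": args.show_train,
        "z_dim": args.zdim,
    }


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["--data-dir", ".", "--pairs", "4"])
print(to_find_duplicates_kwargs(args))
```

Passing an explicit list to `parse_args` keeps the example independent of how the script itself is launched.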
simages-show calls find_duplicates():
simages.main.find_duplicates(input: Union[str, numpy.ndarray], n: int = 5, num_epochs: int = 2, num_channels: int = 3, show: bool = False, show_train: bool = False, show_path: bool = True, z_dim: int = 8, db=None, **kwargs) → Tuple[numpy.ndarray, numpy.ndarray]
Find duplicates in dataset. Either an np.ndarray or a path to an image folder must be specified as input.
Parameters:
- input (str or np.ndarray) – folder directory or N x C x H x W array
- n (int) – number of closest pairs to identify
- num_epochs (int) – how long to train the autoencoder (more is generally better)
- show (bool) – display the closest pairs
- show_train (bool) – show output every epoch
- show_path (bool) – show image paths of duplicates instead of index
- z_dim (int) – size of compression (more is generally better, but slower)
- kwargs (dict) – etc, passed to EmbeddingExtractor
Returns:
- pairs (np.ndarray) – indices for closest pairs of images
- distances (np.ndarray) – distances of each pair to each other
Return type: Tuple[np.ndarray, np.ndarray]
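As a rough illustration of the array form of `input`, the sketch below builds a random N x C x H x W batch and wraps the documented call in a helper. The helper names are hypothetical, and the dtype and value range (float32 in [0, 1)) are assumptions, since this section does not state what the library expects. The import of `simages.main` happens inside the helper, so the batch construction runs even without simages installed.

```python
import numpy as np


def make_random_batch(n_images=16, channels=3, height=32, width=32, seed=0):
    """Build a random N x C x H x W float32 array (assumed input layout)."""
    rng = np.random.default_rng(seed)
    return rng.random((n_images, channels, height, width), dtype=np.float32)


def closest_pairs(batch, n=5):
    """Call find_duplicates() on an in-memory batch; requires simages."""
    # Imported lazily so make_random_batch() works without simages installed.
    from simages.main import find_duplicates

    pairs, distances = find_duplicates(
        batch,
        n=n,
        num_epochs=2,
        num_channels=batch.shape[1],  # channel count taken from the C axis
        show=False,
    )
    return pairs, distances


batch = make_random_batch()
print(batch.shape)
```

With simages installed, `closest_pairs(batch, n=5)` would return the two arrays described under Returns. Random noise has no real duplicates, so the result only demonstrates the call, not meaningful matches.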
Web Interface (optional)
Alternatively, duplicate images in a dataset can be removed interactively with the simages web interface.
- Install MongoDB on your system.
- Install additional dependencies (Flask and PyMongo) with pip install "simages[all]".
- Add images to the database via simages add {image_folder_path}.
- Find duplicates and run the web server with simages find {image_folder_path}.
Add your pictures to the database (this takes some time, depending on the number of pictures):
simages add <images_folder_path>
Find duplicates; a webpage will open showing all of the similar or duplicate pictures:
simages find <images_folder_path>
Usage:
simages add <path> ... [--db=<db_path>] [--parallel=<num_processes>]
simages remove <path> ... [--db=<db_path>]
simages clear [--db=<db_path>]
simages show [--db=<db_path>]
simages find <path> [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>] [--epochs=<epochs>]
simages -h | --help
Options:
-h, --help Show this screen
--db=<db_path> The location of the database or a MongoDB URI. (default: ./db)
--parallel=<num_processes> The number of parallel processes to run to hash the image
files (default: number of CPUs).
find:
--print Only print duplicate files rather than displaying HTML file
--delete Move all found duplicate pictures to the trash. This option takes priority over --print.
--match-time Adds the extra constraint that duplicate images must have the
same capture times in order to be considered.
--trash=<trash_path> Where files will be put when they are deleted (default: ./Trash)
--epochs=<epochs> Epochs for training [default: 2]
simages calls cli().
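When the add and find steps are scripted, it can help to assemble the argument lists in one place. The following is a minimal sketch that only builds the commands from the usage text above; `build_add_command` and `build_find_command` are hypothetical helpers, not part of simages, and nothing is executed.

```python
import shlex
from typing import List, Optional, Sequence


def build_add_command(paths: Sequence[str], db: Optional[str] = None,
                      parallel: Optional[int] = None) -> List[str]:
    """Argument list for: simages add <path> ... [--db] [--parallel]."""
    cmd = ["simages", "add", *paths]
    if db is not None:
        cmd.append(f"--db={db}")
    if parallel is not None:
        cmd.append(f"--parallel={parallel}")
    return cmd


def build_find_command(path: str, print_only: bool = False, delete: bool = False,
                       match_time: bool = False, trash: Optional[str] = None,
                       db: Optional[str] = None,
                       epochs: Optional[int] = None) -> List[str]:
    """Argument list for: simages find <path> with the documented options."""
    cmd = ["simages", "find", path]
    if print_only:
        cmd.append("--print")
    if delete:
        cmd.append("--delete")  # documented to take priority over --print
    if match_time:
        cmd.append("--match-time")
    if trash is not None:
        cmd.append(f"--trash={trash}")
    if db is not None:
        cmd.append(f"--db={db}")
    if epochs is not None:
        cmd.append(f"--epochs={epochs}")
    return cmd


add_cmd = build_add_command(["./photos"], db="./db")
find_cmd = build_find_command("./photos", print_only=True, epochs=2)
print(shlex.join(add_cmd))
print(shlex.join(find_cmd))
```

To actually run a command, the list can be passed to `subprocess.run(add_cmd, check=True)`, which assumes simages and MongoDB are installed as described in the steps above.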