Reference

Main methods

The following methods are available via simages.main:

A tool to find and remove duplicate pictures (CLI and webserver modified with permission from @philipbl’s https://github.com/philipbl/duplicate_images).

Command line:

Usage:
    simages add <path> ... [--db=<db_path>] [--parallel=<num_processes>]
    simages remove <path> ... [--db=<db_path>]
    simages clear [--db=<db_path>]
    simages show [--db=<db_path>]
    simages find <path> [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>] [--epochs=<epochs>]
    simages -h | --help
Options:

    -h, --help                Show this screen
    --db=<db_path>            The location of the database or a MongoDB URI. (default: ./db)
    --parallel=<num_processes> The number of parallel processes to run to hash the image
                               files (default: number of CPUs).
    find:
        --print               Only print duplicate files rather than displaying HTML file
        --delete              Move all found duplicate pictures to the trash. This option takes priority over --print.
        --match-time          Adds the extra constraint that duplicate images must have the
                              same capture times in order to be considered.
        --trash=<trash_path>  Where files will be put when they are deleted (default: ./Trash)
        --epochs=<epochs>     Epochs for training [default: 2]
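
The way the find options combine can be sketched in a few lines. This is only an illustration built on the standard-library argparse module; it is not the parser returned by simages.main.build_parser(), and the names build_find_parser and resolve_action are hypothetical. It mirrors the defaults listed above and the rule that --delete takes priority over --print.

```python
import argparse


def build_find_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the `find` sub-command (not the real simages parser)."""
    parser = argparse.ArgumentParser(prog="simages find")
    parser.add_argument("path", help="folder to scan for duplicates")
    parser.add_argument("--print", dest="print_only", action="store_true")
    parser.add_argument("--delete", action="store_true")
    parser.add_argument("--match-time", action="store_true")
    parser.add_argument("--trash", default="./Trash")
    parser.add_argument("--db", default="./db")
    parser.add_argument("--epochs", type=int, default=2)
    return parser


def resolve_action(args: argparse.Namespace) -> str:
    """Pick a single action; --delete wins over --print, as the option text states."""
    if args.delete:
        return "delete"
    if args.print_only:
        return "print"
    return "html"  # default behaviour: display the HTML file


args = build_find_parser().parse_args(["photos", "--print", "--delete", "--epochs", "5"])
print(resolve_action(args), args.trash, args.epochs)
```

Passing both flags resolves to the delete action, so --print is best left off when files should be moved to the trash.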
simages.main.build_parser()[source]
simages.main.parse_arguments(args)[source]
simages.main.find_duplicates(input: Union[str, numpy.ndarray], n: int = 5, num_epochs: int = 2, num_channels: int = 3, show: bool = False, show_train: bool = False, show_path: bool = True, z_dim: int = 8, db=None, **kwargs) → Tuple[numpy.ndarray, numpy.ndarray][source]

Find duplicates in dataset. Either np.ndarray or path to image folder must be specified as input.

Parameters:
  • input (str or np.ndarray) – folder directory or N x C x H x W array
  • n (int) – number of closest pairs to identify
  • num_epochs (int) – how long to train the autoencoder (more is generally better)
  • show (bool) – display the closest pairs
  • show_train (bool) – show intermediate output during training
  • show_path (bool) – show image paths of duplicates instead of index
  • z_dim (int) – size of compression (more is generally better, but slower)
  • kwargs (dict) – etc, passed to EmbeddingExtractor
Returns:
  • pairs (np.ndarray) – indices for the closest pairs of images
  • distances (np.ndarray) – distances of each pair to each other
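
To make the two return values concrete, here is a minimal sketch of what "the n closest pairs" means for a set of embeddings. It is a brute-force comparison over every pair using cosine distance, written with plain Python lists; it is not the search simages performs internally, and the helper names cosine_distance and closest_pairs are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity (assumes neither vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def closest_pairs(
    embeddings: Sequence[Sequence[float]], n: int = 5
) -> Tuple[List[Tuple[int, int]], List[float]]:
    """Score every index pair, then keep the n pairs with the smallest distance."""
    scored = sorted(
        (cosine_distance(embeddings[i], embeddings[j]), (i, j))
        for i, j in combinations(range(len(embeddings)), 2)
    )[:n]
    pairs = [pair for _, pair in scored]
    distances = [dist for dist, _ in scored]
    return pairs, distances


embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
pairs, distances = closest_pairs(embeddings, n=2)
print(pairs, distances)
```

Entry k of pairs holds two image indices and entry k of distances holds how far apart those two images are, so the two sequences are read side by side.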

simages.main.main()[source]

Main entry point for simages-show via command line.

simages.main.find_similar(db)[source]
simages.main.cli()[source]

Extractor methods

The following methods are available via simages.extractor.EmbeddingExtractor:

class simages.extractor.EmbeddingExtractor(input: Union[str, numpy.ndarray], num_channels: int = 3, num_epochs: int = 2, batch_size: int = 32, show: bool = False, show_path: bool = False, show_train: bool = False, z_dim: int = 8, metric: str = 'cosine', model: Optional[torch.nn.modules.module.Module] = None, db: Optional = None, **kwargs)[source]

Bases: object

Extract embeddings from data with models and allow visualization.

trainloader
Type:torch loader
evalloader
Type:torch loader
model
Type:torch.nn.Module
embeddings
Type:np.ndarray
__init__(input: Union[str, numpy.ndarray], num_channels: int = 3, num_epochs: int = 2, batch_size: int = 32, show: bool = False, show_path: bool = False, show_train: bool = False, z_dim: int = 8, metric: str = 'cosine', model: Optional[torch.nn.modules.module.Module] = None, db: Optional = None, **kwargs)[source]

Initializes EmbeddingExtractor with input (either str or np.ndarray), then performs training and validation.

Parameters:
  • input (np.ndarray or str) – data
  • num_channels (int) – grayscale = 1, color = 3
  • num_epochs (int) – more is better (generally)
  • batch_size (int) – number of images per batch
  • show (bool) – show closest pairs
  • show_path (bool) – show path of duplicates
  • show_train (bool) – show intermediate training results
  • z_dim (int) – compression size
  • metric (str) – distance metric for scipy.spatial.distance.cdist() (eg, euclidean, cosine, hamming, etc.)
  • model (torch.nn.Module, optional) – class implementing same methods as BasicAutoencoder
  • db_conn_string (str) – MongoDB connection string (the signature above names this parameter db)
  • kwargs (dict) –
get_image(index: int) → torch.Tensor[source]
train()[source]

Train autoencoder to build embeddings of dataset. Final embeddings are created in eval().

eval()[source]

Evaluate reconstruction of embeddings built in train.

duplicates(n: int = 10, quantile: float = None) → Tuple[numpy.ndarray, numpy.ndarray][source]

Identify the n closest pairs of images, or the closest quantile of all pairs (for example, the closest 0.05).

Parameters:
  • n (int) – number of pairs
  • quantile (float) – quantile of total combination (suggested range: 0.001 - 0.01)
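
The suggested quantile range is small because the number of possible pairs grows quadratically with the number of images. The arithmetic below is a sketch under one assumption: that quantile is a fraction of all N·(N−1)/2 unordered pairs. The function name pairs_for_quantile and the choice to round up are hypothetical, not taken from simages.

```python
import math


def pairs_for_quantile(num_images: int, quantile: float) -> int:
    """Number of pairs selected when keeping the closest `quantile` of all pairs."""
    total_pairs = math.comb(num_images, 2)  # unordered pairs of distinct images
    # Assumption: round up and always keep at least one pair.
    return max(1, math.ceil(total_pairs * quantile))


for num_images in (100, 1000):
    total = math.comb(num_images, 2)
    print(num_images, total, pairs_for_quantile(num_images, 0.01))
```

With 1000 images there are 499500 candidate pairs, so even a quantile of 0.01 selects several thousand pairs to inspect.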
static channels_last(img: numpy.ndarray) → numpy.ndarray[source]

Move channels from first to last by swapping axes.
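
As an illustration of the layout change, the sketch below converts a channels-first image (C x H x W) into a channels-last one (H x W x C) using nested lists. It assumes H x W x C is the intended result and is a stand-alone function, not the simages implementation; with NumPy the same reordering is usually written as a transpose of the axes.

```python
from typing import List

Image3D = List[List[List[float]]]


def channels_last(img: Image3D) -> Image3D:
    """Reorder a C x H x W nested list into H x W x C."""
    channels = len(img)
    height = len(img[0])
    width = len(img[0][0])
    return [
        [[img[c][h][w] for c in range(channels)] for w in range(width)]
        for h in range(height)
    ]


# Three channels, two rows, two columns.
img = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
]
out = channels_last(img)
print(len(out), len(out[0]), len(out[0][0]))
print(out[0][1])
```

Each innermost list of the result holds the channel values of one pixel, which is the layout plotting libraries generally expect.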

show(img: Union[torch.Tensor, numpy.ndarray], title: str = '', block: bool = True, y_labels=None, unnormalize=True)[source]

Plot img with title.

Parameters:
  • img (torch.Tensor or np.ndarray) – Image to plot
  • title (str) – plot title
  • block (bool) – block matplotlib plot until window closed
show_images(indices: Union[list, int], title='')[source]

Plot images (from validation data) at indices with title

image_paths(indices, short=True)[source]

Get path to image at index of eval/embedding

Parameters:
  • indices (int or list) – indices of embeddings in dataset
  • short (bool) – truncate filepath to 30 characters
Returns:

paths to images in image folder

Return type:

paths (str or list of str)
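
A shortened path of the kind short=True produces might look like the sketch below. The reference only states the 30-character limit; keeping the end of the path (where the file name sits) and prefixing an ellipsis are assumptions, and shorten_path is a hypothetical helper rather than part of simages.

```python
def shorten_path(path: str, limit: int = 30) -> str:
    """Return at most `limit` characters, keeping the tail of the path.

    Assumes limit > 3 so that there is room for the "..." prefix.
    """
    if len(path) <= limit:
        return path
    return "..." + path[-(limit - 3):]


long_path = "/home/user/pictures/holiday/2019/beach_sunset.jpg"
print(shorten_path(long_path))
print(shorten_path("cat.png"))
```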

show_duplicates(n=5, path=False) → Tuple[numpy.ndarray, numpy.ndarray][source]

Show duplicates from comparison of embeddings. Uses closely package to get pairs.

Parameters:
  • n (int) – how many closest pairs to identify
  • path (bool) – Plot pairs of images with abbreviated paths
Returns:
  • pairs (np.ndarray) – pairs as indices
  • distances (np.ndarray) – distances of pairs

unnormalize(image: torch.Tensor) → torch.Tensor[source]

Unnormalize an image.

Parameters:image (torch.Tensor) –
Returns:image (torch.Tensor)
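
Unnormalizing usually means reversing a per-channel (x − mean) / std transform so that pixel values return to a displayable range. The sketch below shows that round trip on a flat list of pixel values. The mean and standard deviation of 0.5 are placeholder assumptions, since this reference does not state which statistics simages uses, and both functions are stand-alone illustrations.

```python
from typing import List

MEAN = 0.5  # hypothetical mean used when the image was normalized
STD = 0.5   # hypothetical standard deviation used when the image was normalized


def normalize(pixels: List[float], mean: float = MEAN, std: float = STD) -> List[float]:
    """Shift and scale pixel values: (x - mean) / std."""
    return [(p - mean) / std for p in pixels]


def unnormalize(pixels: List[float], mean: float = MEAN, std: float = STD) -> List[float]:
    """Invert normalize(): x * std + mean."""
    return [p * std + mean for p in pixels]


original = [0.0, 0.25, 1.0]
restored = unnormalize(normalize(original))
print(restored)
```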
decode(embedding: Optional[numpy.ndarray] = None, index: Optional[int] = None, show: bool = False, astensor: bool = False) → numpy.ndarray[source]

Decode the embedding at index, or an embedding passed in directly.

Parameters:
  • embedding (np.ndarray, optional) – embedding of image
  • index (int) – index (of validation set / embeddings) to decode
  • show (bool) – plot the results
  • astensor (bool) – keep as torch.Tensor
Returns:

reconstructed image from embedding

Return type:

image (np.ndarray or torch.Tensor)

Embedding methods

The following methods are available via simages.embeddings.Embeddings:

class simages.embeddings.Embeddings(input: Union[numpy.ndarray, str], **kwargs)[source]

Bases: object

Create embeddings from input data by training an autoencoder.

Passes arguments for EmbeddingExtractor.

extractor

workhorse for extracting embeddings from dataset

Type:simages.EmbeddingExtractor
embeddings

embeddings

Type:np.ndarray
pairs

n closest pairs

Type:np.ndarray
distances

distances between n-closest pairs

Type:np.ndarray
__init__(input: Union[numpy.ndarray, str], **kwargs)[source]

Inits Embeddings with data.

array
duplicates(n: int = 10)[source]
show_duplicates(n=5)[source]

Convenience wrapper for EmbeddingExtractor.show_duplicates

images_to_embeddings(data_dir: str, **kwargs)[source]
array_to_embeddings(array: numpy.ndarray, **kwargs)[source]

Dataset methods

The following classes are available via simages.dataset:

Module contents

Find similar images in a dataset <https://github.com/justinshenk/simages>

class simages.Embeddings(input: Union[numpy.ndarray, str], **kwargs)[source]

Bases: object

Create embeddings from input data by training an autoencoder.

Passes arguments for EmbeddingExtractor.

extractor

workhorse for extracting embeddings from dataset

Type:simages.EmbeddingExtractor
embeddings

embeddings

Type:np.ndarray
pairs

n closest pairs

Type:np.ndarray
distances

distances between n-closest pairs

Type:np.ndarray
__init__(input: Union[numpy.ndarray, str], **kwargs)[source]

Inits Embeddings with data.

array
duplicates(n: int = 10)[source]
show_duplicates(n=5)[source]

Convenience wrapper for EmbeddingExtractor.show_duplicates

images_to_embeddings(data_dir: str, **kwargs)[source]
array_to_embeddings(array: numpy.ndarray, **kwargs)[source]
class simages.EmbeddingExtractor(input: Union[str, numpy.ndarray], num_channels: int = 3, num_epochs: int = 2, batch_size: int = 32, show: bool = False, show_path: bool = False, show_train: bool = False, z_dim: int = 8, metric: str = 'cosine', model: Optional[torch.nn.modules.module.Module] = None, db: Optional = None, **kwargs)[source]

Bases: object

Extract embeddings from data with models and allow visualization.

trainloader
Type:torch loader
evalloader
Type:torch loader
model
Type:torch.nn.Module
embeddings
Type:np.ndarray
__init__(input: Union[str, numpy.ndarray], num_channels: int = 3, num_epochs: int = 2, batch_size: int = 32, show: bool = False, show_path: bool = False, show_train: bool = False, z_dim: int = 8, metric: str = 'cosine', model: Optional[torch.nn.modules.module.Module] = None, db: Optional = None, **kwargs)[source]

Initializes EmbeddingExtractor with input (either str or np.ndarray), then performs training and validation.

Parameters:
  • input (np.ndarray or str) – data
  • num_channels (int) – grayscale = 1, color = 3
  • num_epochs (int) – more is better (generally)
  • batch_size (int) – number of images per batch
  • show (bool) – show closest pairs
  • show_path (bool) – show path of duplicates
  • show_train (bool) – show intermediate training results
  • z_dim (int) – compression size
  • metric (str) – distance metric for scipy.spatial.distance.cdist() (eg, euclidean, cosine, hamming, etc.)
  • model (torch.nn.Module, optional) – class implementing same methods as BasicAutoencoder
  • db_conn_string (str) – MongoDB connection string (the signature above names this parameter db)
  • kwargs (dict) –
get_image(index: int) → torch.Tensor[source]
train()[source]

Train autoencoder to build embeddings of dataset. Final embeddings are created in eval().

eval()[source]

Evaluate reconstruction of embeddings built in train.

duplicates(n: int = 10, quantile: float = None) → Tuple[numpy.ndarray, numpy.ndarray][source]

Identify the n closest pairs of images, or the closest quantile of all pairs (for example, the closest 0.05).

Parameters:
  • n (int) – number of pairs
  • quantile (float) – quantile of total combination (suggested range: 0.001 - 0.01)
static channels_last(img: numpy.ndarray) → numpy.ndarray[source]

Move channels from first to last by swapping axes.

show(img: Union[torch.Tensor, numpy.ndarray], title: str = '', block: bool = True, y_labels=None, unnormalize=True)[source]

Plot img with title.

Parameters:
  • img (torch.Tensor or np.ndarray) – Image to plot
  • title (str) – plot title
  • block (bool) – block matplotlib plot until window closed
show_images(indices: Union[list, int], title='')[source]

Plot images (from validation data) at indices with title

image_paths(indices, short=True)[source]

Get path to image at index of eval/embedding

Parameters:
  • indices (int or list) – indices of embeddings in dataset
  • short (bool) – truncate filepath to 30 characters
Returns:

paths to images in image folder

Return type:

paths (str or list of str)

show_duplicates(n=5, path=False) → Tuple[numpy.ndarray, numpy.ndarray][source]

Show duplicates from comparison of embeddings. Uses closely package to get pairs.

Parameters:
  • n (int) – how many closest pairs to identify
  • path (bool) – Plot pairs of images with abbreviated paths
Returns:
  • pairs (np.ndarray) – pairs as indices
  • distances (np.ndarray) – distances of pairs

unnormalize(image: torch.Tensor) → torch.Tensor[source]

Unnormalize an image.

Parameters:image (torch.Tensor) –
Returns:image (torch.Tensor)
decode(embedding: Optional[numpy.ndarray] = None, index: Optional[int] = None, show: bool = False, astensor: bool = False) → numpy.ndarray[source]

Decode the embedding at index, or an embedding passed in directly.

Parameters:
  • embedding (np.ndarray, optional) – embedding of image
  • index (int) – index (of validation set / embeddings) to decode
  • show (bool) – plot the results
  • astensor (bool) – keep as torch.Tensor
Returns:

reconstructed image from embedding

Return type:

image (np.ndarray or torch.Tensor)

simages.find_duplicates(input: Union[str, numpy.ndarray], n: int = 5, num_epochs: int = 2, num_channels: int = 3, show: bool = False, show_train: bool = False, show_path: bool = True, z_dim: int = 8, db=None, **kwargs) → Tuple[numpy.ndarray, numpy.ndarray][source]

Find duplicates in dataset. Either np.ndarray or path to image folder must be specified as input.

Parameters:
  • input (str or np.ndarray) – folder directory or N x C x H x W array
  • n (int) – number of closest pairs to identify
  • num_epochs (int) – how long to train the autoencoder (more is generally better)
  • show (bool) – display the closest pairs
  • show_train (bool) – show intermediate output during training
  • show_path (bool) – show image paths of duplicates instead of index
  • z_dim (int) – size of compression (more is generally better, but slower)
  • kwargs (dict) – etc, passed to EmbeddingExtractor
Returns:
  • pairs (np.ndarray) – indices for the closest pairs of images
  • distances (np.ndarray) – distances of each pair to each other

class simages.PILDataset(pil_list: list, transform: Optional[Callable] = None)[source]

Bases: torch.utils.data.dataset.Dataset

PIL dataset.

__init__(pil_list: list, transform: Optional[Callable] = None)[source]
Parameters:
  • pil_list (list of PIL images) –
  • transform (callable, optional) – Optional transform to be applied on a sample.
class simages.ImageFolder(root: str, loader: Callable = <function default_loader>, extensions: Optional[list] = None, transform: Optional[list] = None, is_valid_file: Optional[Callable] = None)[source]

Bases: torchvision.datasets.vision.VisionDataset

A generic data loader where the samples are arranged in this way:

root/xxx.ext
root/xxy.ext
root/xxz.ext
Parameters:
  • root (string) – Root directory path.
  • loader (callable) – A function to load a sample given its path.
  • extensions (tuple[string]) – A list of allowed extensions. Pass either extensions or is_valid_file, not both.
  • transform (callable, optional) – A function/transform that takes in a sample and returns a transformed version, e.g., transforms.RandomCrop for images.
  • is_valid_file (callable, optional) – A function that takes the path of an image file and checks whether it is a valid file (used to detect corrupt files). Pass either extensions or is_valid_file, not both.
__getitem__(index: int)[source]
Parameters:index (int) – Index
Returns:(sample, target) where target is class_index of the target class.
Return type:tuple
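
The flat root/xxx.ext layout and the rule that extensions and is_valid_file are mutually exclusive can be illustrated without any image libraries. The sketch below is a hypothetical helper, list_samples, that scans a folder the way such a loader might; it is not the simages.ImageFolder implementation, and accepting every file when neither filter is given is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Sequence


def list_samples(
    root: str,
    extensions: Optional[Sequence[str]] = None,
    is_valid_file: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """List files directly under root that pass the chosen filter."""
    if extensions is not None and is_valid_file is not None:
        raise ValueError("pass either extensions or is_valid_file, not both")

    allowed = tuple(ext.lower() for ext in extensions) if extensions is not None else None

    def accept(path: str) -> bool:
        if allowed is not None:
            return path.lower().endswith(allowed)  # case-insensitive extension match
        if is_valid_file is not None:
            return is_valid_file(path)
        return True  # assumption: with no filter, every file is accepted

    return sorted(str(p) for p in Path(root).iterdir() if p.is_file() and accept(str(p)))


# Build a throwaway folder with two images and one unrelated file.
with tempfile.TemporaryDirectory() as root:
    for name in ("xxx.jpg", "xxy.PNG", "notes.txt"):
        (Path(root) / name).touch()
    samples = list_samples(root, extensions=(".jpg", ".png"))
    print([Path(s).name for s in samples])
```

A custom is_valid_file callable is the natural place for a check that tries to open each file and rejects corrupt ones.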
class simages.BasicAutoencoder(num_channels: int = 1, z_dim: int = 8, hw=48)[source]

Bases: torch.nn.modules.module.Module

__init__(num_channels: int = 1, z_dim: int = 8, hw=48)[source]

Basic autoencoder - default for simages.

Parameters:
  • num_channels (int) – grayscale = 1, color = 3
  • z_dim (int) – number of embedding units to compress image to
  • hw (int) – height and width for input/output image
forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

decode(x)[source]
simages.linkageplot(embeddings: numpy.ndarray, ordered=True)[source]

Plot linkage between embeddings in hierarchical clustering of the distance matrix

Parameters:
  • embeddings (np.ndarray) – embeddings of images in dataset
  • ordered (bool) – order distance matrix before plotting

Inheritance diagram: simages.extractor.EmbeddingExtractor (base class: object)