Reference

Main methods

The following methods are available via simages.main:

A tool to find and remove duplicate pictures (CLI and webserver modified with permission from @philipbl’s https://github.com/philipbl/duplicate_images).

Command line:

Usage:
    simages add <path> ... [--db=<db_path>] [--parallel=<num_processes>]
    simages remove <path> ... [--db=<db_path>]
    simages clear [--db=<db_path>]
    simages show [--db=<db_path>]
    simages find <path> [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>] [--epochs=<epochs>]
    simages -h | --help
Options:

    -h, --help                Show this screen
    --db=<db_path>            The location of the database or a MongoDB URI. (default: ./db)
    --parallel=<num_processes> The number of parallel processes to run to hash the image
                               files (default: number of CPUs).
    find:
        --print               Only print duplicate files rather than displaying HTML file
        --delete              Move all found duplicate pictures to the trash. This option takes priority over --print.
        --match-time          Adds the extra constraint that duplicate images must have the
                              same capture times in order to be considered.
        --trash=<trash_path>  Where files will be put when they are deleted (default: ./Trash)
        --epochs=<epochs>     Epochs for training [default: 2]
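
The usage text above can be turned into an argument list programmatically. The following is a minimal sketch, not part of simages: `build_find_command` is a hypothetical helper, and the only things taken from this reference are the `find` subcommand and the flag names listed under Options.

```python
from typing import List, Optional


def build_find_command(
    path: str,
    print_only: bool = False,
    delete: bool = False,
    match_time: bool = False,
    trash: Optional[str] = None,
    db: Optional[str] = None,
    epochs: Optional[int] = None,
) -> List[str]:
    """Assemble argv for `simages find` (hypothetical helper, not part of simages)."""
    cmd = ["simages", "find", path]
    # Boolean switches are appended only when requested.
    if print_only:
        cmd.append("--print")
    if delete:
        cmd.append("--delete")
    if match_time:
        cmd.append("--match-time")
    # Options that take a value use the --name=value form shown in the usage text.
    if trash is not None:
        cmd.append(f"--trash={trash}")
    if db is not None:
        cmd.append(f"--db={db}")
    if epochs is not None:
        cmd.append(f"--epochs={epochs}")
    return cmd


command = build_find_command("photos/", print_only=True, epochs=5)
print(" ".join(command))
```

On a machine where simages is installed, the resulting list could be passed to `subprocess.run(command, check=True)`. Since the Options text says `--delete` takes priority over `--print`, passing both is presumably equivalent to passing `--delete` alone.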

simages.main.build_parser()[source]¶

simages.main.parse_arguments(args)[source]¶

simages.main.find_duplicates(input: Union[str, numpy.ndarray], n: int = 5, num_epochs: int = 2, num_channels: int = 3, show: bool = False, show_train: bool = False, show_path: bool = True, z_dim: int = 8, db=None, **kwargs) → Tuple[numpy.ndarray, numpy.ndarray][source]¶

Find duplicates in dataset. Either np.ndarray or path to image folder must be specified as input.

Parameters:

input (str or np.ndarray) – folder directory or N x C x H x W array
n (int) – number of closest pairs to identify
num_epochs (int) – how long to train the autoencoder (more is generally better)
show (bool) – display the closest pairs
show_train (bool) – show output every
show_path (bool) – show image paths of duplicates instead of index
z_dim (int) – size of compression (more is generally better, but slower)
kwargs (dict) – additional keyword arguments, passed to EmbeddingExtractor

Returns:

pairs (np.ndarray) – indices for closest pairs of images
distances (np.ndarray) – distances of each pair to each other

Return type:	Tuple[np.ndarray, np.ndarray]
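
To make the shape of the return value concrete, here is a dependency-free sketch of what "the n closest pairs and their distances" means. It is an illustration only: simages trains an autoencoder to obtain the embeddings and, per this reference, relies on the closely package for pairing, whereas `cosine_distance` and `closest_pairs` below are hypothetical helpers operating on hand-written vectors.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def closest_pairs(
    embeddings: Sequence[Sequence[float]], n: int = 5
) -> Tuple[List[Tuple[int, int]], List[float]]:
    """Brute-force search over all index pairs, keeping the n smallest distances."""
    scored = [
        (cosine_distance(embeddings[i], embeddings[j]), (i, j))
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    scored.sort()  # ascending distance: most similar pairs first
    top = scored[:n]
    return [pair for _, pair in top], [dist for dist, _ in top]


# Four toy "embeddings": 2 is nearly parallel to 0, and 3 is close to 1.
toy_embeddings = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.1],
    [0.2, 1.0],
]
pairs, distances = closest_pairs(toy_embeddings, n=2)
print(pairs, distances)
```

The brute-force search is quadratic in the number of images, which is fine for a sketch but is one reason a dedicated pairing routine is preferable on larger datasets.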

simages.main.main()[source]¶: Main entry point for simages-show via command line.

simages.main.find_similar(db)[source]¶

simages.main.cli()[source]¶

Extractor Methods

The following methods are available via simages.extractor.EmbeddingExtractor:

class simages.extractor.EmbeddingExtractor(input: Union[str, numpy.ndarray], num_channels: int = 3, num_epochs: int = 2, batch_size: int = 32, show: bool = False, show_path: bool = False, show_train: bool = False, z_dim: int = 8, metric: str = 'cosine', model: Optional[torch.nn.modules.module.Module] = None, db: Optional = None, **kwargs)[source]

Bases: object

Extract embeddings from data with models and allow visualization.

trainloader¶

Type:	torch loader

evalloader¶

Type:	torch loader

model¶

Type:	torch.nn.Module

embeddings¶

Type:	np.ndarray

__init__(input: Union[str, numpy.ndarray], num_channels: int = 3, num_epochs: int = 2, batch_size: int = 32, show: bool = False, show_path: bool = False, show_train: bool = False, z_dim: int = 8, metric: str = 'cosine', model: Optional[torch.nn.modules.module.Module] = None, db: Optional = None, **kwargs)[source]

Inits EmbeddingExtractor with input, either str or np.ndarray, performs training and validation.

Parameters:

input (np.ndarray or str) – data
num_channels (int) – grayscale = 1, color = 3
num_epochs (int) – more is better (generally)
batch_size (int) – number of images per batch
show (bool) – show closest pairs
show_path (bool) – show path of duplicates
show_train (bool) – show intermediate training results
z_dim (int) – compression size
metric (str) – distance metric for scipy.spatial.distance.cdist() (eg, euclidean, cosine, hamming, etc.)
model (torch.nn.Module, optional) – class implementing same methods as BasicAutoencoder
db_conn_string (str) – MongoDB connection string
kwargs (dict) –

get_image(index: int) → torch.Tensor[source]

train()[source]: Train autoencoder to build embeddings of dataset. Final embeddings are created in eval().

eval()[source]: Evaluate reconstruction of embeddings built in train.

duplicates(n: int = 10, quantile: float = None) → Tuple[numpy.ndarray, numpy.ndarray][source]

Identify n closest pairs of images, or quantile (for example, closest 0.05).

Parameters:

n (int) – number of pairs
quantile (float) – quantile of total combination (suggested range: 0.001 - 0.01)
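
The reference does not spell out how quantile is converted into a number of pairs. One plausible reading, sketched below under that assumption, is that it is a fraction of all N(N-1)/2 possible image pairs. `pairs_for_quantile` is a hypothetical helper, and the rounding rule (floor, with a minimum of one pair) is also an assumption.

```python
import math


def pairs_for_quantile(num_images: int, quantile: float) -> int:
    """Number of pairs selected if quantile is a fraction of all possible pairs."""
    total_pairs = num_images * (num_images - 1) // 2  # N choose 2
    # Assumed rounding rule: floor, but always return at least one pair.
    return max(1, math.floor(total_pairs * quantile))


for q in (0.001, 0.01, 0.05):
    print(q, pairs_for_quantile(1000, q))
```

Because the number of possible pairs grows quadratically, even a small quantile selects many pairs on a large dataset, which is consistent with the small suggested range.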

static channels_last(img: numpy.ndarray) → numpy.ndarray[source]: Move channels from first to last by swapping axes.
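
As an illustration of the axis move, the sketch below converts a C x H x W image into H x W x C using plain nested lists so that it runs without dependencies. It is not the library's code. With NumPy arrays the same reordering can be written as `np.moveaxis(img, 0, -1)`.

```python
from typing import List

Image3D = List[List[List[float]]]


def channels_last(img: Image3D) -> Image3D:
    """Convert a C x H x W nested list into H x W x C."""
    channels, height, width = len(img), len(img[0]), len(img[0][0])
    return [
        [[img[c][h][w] for c in range(channels)] for w in range(width)]
        for h in range(height)
    ]


# 3 channels, 1 row, 2 columns.
chw = [[[1, 2]], [[3, 4]], [[5, 6]]]
hwc = channels_last(chw)
print(hwc)
```

Plotting libraries such as matplotlib expect the channel axis last, which is why this conversion typically precedes display.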

show(img: Union[torch.Tensor, numpy.ndarray], title: str = '', block: bool = True, y_labels=None, unnormalize=True)[source]

Plot img with title.

Parameters:

img (torch.Tensor or np.ndarray) – image to plot
title (str) – plot title
block (bool) – block matplotlib plot until window closed

show_images(indices: Union[list, int], title='')[source]: Plot images (from validation data) at indices with title

image_paths(indices, short=True)[source]

Get path to image at index of eval/embedding

Parameters:

indices (int or list) – indices of embeddings in dataset
short (bool) – truncate filepath to 30 characters

Returns:	paths to images in image folder
Return type:	str or list of str

show_duplicates(n=5, path=False) → Tuple[numpy.ndarray, numpy.ndarray][source]

Show duplicates from comparison of embeddings. Uses closely package to get pairs.

Parameters:

n (int) – how many closest pairs to identify
path (bool) – plot pairs of images with abbreviated paths

Returns:

pairs (np.ndarray) – pairs as indices
distances (np.ndarray) – distances of pairs

Return type:	Tuple[np.ndarray, np.ndarray]

unnormalize(image: torch.Tensor) → torch.Tensor[source]

Unnormalize an image.

Parameters:	image (`torch.Tensor`) –
Returns:	image (`torch.Tensor`)
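
The reference does not state which normalization is being undone. Assuming the common per-channel form (x - mean) / std, unnormalizing is the inverse x * std + mean. The sketch below uses a hypothetical mean and standard deviation of 0.5 on a flat list of pixel values. The constants simages actually uses may differ.

```python
from typing import List

# Assumed normalization constants (hypothetical, for illustration only).
MEAN = 0.5
STD = 0.5


def normalize(pixels: List[float]) -> List[float]:
    """Map values from [0, 1] to [-1, 1] with the assumed constants."""
    return [(p - MEAN) / STD for p in pixels]


def unnormalize(pixels: List[float]) -> List[float]:
    """Invert normalize(): multiply by std, then add the mean back."""
    return [p * STD + MEAN for p in pixels]


original = [0.0, 0.25, 1.0]
restored = unnormalize(normalize(original))
print(restored)
```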

decode(embedding: Optional[numpy.ndarray] = None, index: Optional[int] = None, show: bool = False, astensor: bool = False) → numpy.ndarray[source]

Decode embeddings at index or pass embedding directly

Parameters:

embedding (np.ndarray, optional) – embedding of image
index (int) – index (of validation set / embeddings) to decode
show (bool) – plot the results
astensor (bool) – keep as torch.Tensor

Returns:	reconstructed image from embedding
Return type:	np.ndarray or torch.Tensor

Embedding methods

The following methods are available via simages.embeddings.Embeddings:

class simages.embeddings.Embeddings(input: Union[numpy.ndarray, str], **kwargs)[source]¶

Bases: object

Create embeddings from input data by training an autoencoder.

Passes arguments for EmbeddingExtractor.

extractor¶

workhorse for extracting embeddings from dataset

Type:	simages.EmbeddingExtractor

embeddings¶

embeddings

Type:	np.ndarray

pairs¶

n closest pairs

Type:	np.ndarray

distances¶

distances between n-closest pairs

Type:	np.ndarray

__init__(input: Union[numpy.ndarray, str], **kwargs)[source]¶: Inits Embeddings with data.

array¶

duplicates(n: int = 10)[source]¶

show_duplicates(n=5)[source]¶: Convenience wrapper for EmbeddingExtractor.show_duplicates

images_to_embeddings(data_dir: str, **kwargs)[source]¶

array_to_embeddings(array: numpy.ndarray, **kwargs)[source]¶

Dataset methods

The following classes are available via simages.dataset:

Module contents

Find similar images in a dataset <https://github.com/justinshenk/simages>

class simages.Embeddings(input: Union[numpy.ndarray, str], **kwargs)[source]¶

Bases: object

Create embeddings from input data by training an autoencoder.

Passes arguments for EmbeddingExtractor.

extractor¶

workhorse for extracting embeddings from dataset

Type:	simages.EmbeddingExtractor

embeddings¶

embeddings

Type:	np.ndarray

pairs¶

n closest pairs

Type:	np.ndarray

distances¶

distances between n-closest pairs

Type:	np.ndarray

__init__(input: Union[numpy.ndarray, str], **kwargs)[source]¶: Inits Embeddings with data.

array¶

duplicates(n: int = 10)[source]¶

show_duplicates(n=5)[source]¶: Convenience wrapper for EmbeddingExtractor.show_duplicates

images_to_embeddings(data_dir: str, **kwargs)[source]¶

array_to_embeddings(array: numpy.ndarray, **kwargs)[source]¶

class simages.EmbeddingExtractor(input: Union[str, numpy.ndarray], num_channels: int = 3, num_epochs: int = 2, batch_size: int = 32, show: bool = False, show_path: bool = False, show_train: bool = False, z_dim: int = 8, metric: str = 'cosine', model: Optional[torch.nn.modules.module.Module] = None, db: Optional = None, **kwargs)[source]¶

Bases: object

Extract embeddings from data with models and allow visualization.

trainloader¶

Type:	torch loader

evalloader¶

Type:	torch loader

model¶

Type:	torch.nn.Module

embeddings¶

Type:	np.ndarray

__init__(input: Union[str, numpy.ndarray], num_channels: int = 3, num_epochs: int = 2, batch_size: int = 32, show: bool = False, show_path: bool = False, show_train: bool = False, z_dim: int = 8, metric: str = 'cosine', model: Optional[torch.nn.modules.module.Module] = None, db: Optional = None, **kwargs)[source]¶

Inits EmbeddingExtractor with input, either str or np.ndarray, performs training and validation.

Parameters:

input (np.ndarray or str) – data
num_channels (int) – grayscale = 1, color = 3
num_epochs (int) – more is better (generally)
batch_size (int) – number of images per batch
show (bool) – show closest pairs
show_path (bool) – show path of duplicates
show_train (bool) – show intermediate training results
z_dim (int) – compression size
metric (str) – distance metric for scipy.spatial.distance.cdist() (eg, euclidean, cosine, hamming, etc.)
model (torch.nn.Module, optional) – class implementing same methods as BasicAutoencoder
db_conn_string (str) – MongoDB connection string
kwargs (dict) –

get_image(index: int) → torch.Tensor[source]¶

train()[source]¶: Train autoencoder to build embeddings of dataset. Final embeddings are created in eval().

eval()[source]¶: Evaluate reconstruction of embeddings built in train.

duplicates(n: int = 10, quantile: float = None) → Tuple[numpy.ndarray, numpy.ndarray][source]¶

Identify n closest pairs of images, or quantile (for example, closest 0.05).

Parameters:

n (int) – number of pairs
quantile (float) – quantile of total combination (suggested range: 0.001 - 0.01)

static channels_last(img: numpy.ndarray) → numpy.ndarray[source]¶: Move channels from first to last by swapping axes.

show(img: Union[torch.Tensor, numpy.ndarray], title: str = '', block: bool = True, y_labels=None, unnormalize=True)[source]¶

Plot img with title.

Parameters:

img (torch.Tensor or np.ndarray) – image to plot
title (str) – plot title
block (bool) – block matplotlib plot until window closed

show_images(indices: Union[list, int], title='')[source]¶: Plot images (from validation data) at indices with title

image_paths(indices, short=True)[source]¶

Get path to image at index of eval/embedding

Parameters:

indices (int or list) – indices of embeddings in dataset
short (bool) – truncate filepath to 30 characters

Returns:	paths to images in image folder
Return type:	str or list of str

show_duplicates(n=5, path=False) → Tuple[numpy.ndarray, numpy.ndarray][source]

Show duplicates from comparison of embeddings. Uses closely package to get pairs.

Parameters:

n (int) – how many closest pairs to identify
path (bool) – plot pairs of images with abbreviated paths

Returns:

pairs (np.ndarray) – pairs as indices
distances (np.ndarray) – distances of pairs

Return type:	Tuple[np.ndarray, np.ndarray]

unnormalize(image: torch.Tensor) → torch.Tensor[source]¶

Unnormalize an image.

Parameters:	image (`torch.Tensor`) –
Returns:	image (`torch.Tensor`)

decode(embedding: Optional[numpy.ndarray] = None, index: Optional[int] = None, show: bool = False, astensor: bool = False) → numpy.ndarray[source]¶

Decode embeddings at index or pass embedding directly

Parameters:

embedding (np.ndarray, optional) – embedding of image
index (int) – index (of validation set / embeddings) to decode
show (bool) – plot the results
astensor (bool) – keep as torch.Tensor

Returns:	reconstructed image from embedding
Return type:	np.ndarray or torch.Tensor

simages.find_duplicates(input: Union[str, numpy.ndarray], n: int = 5, num_epochs: int = 2, num_channels: int = 3, show: bool = False, show_train: bool = False, show_path: bool = True, z_dim: int = 8, db=None, **kwargs) → Tuple[numpy.ndarray, numpy.ndarray][source]¶

Find duplicates in dataset. Either np.ndarray or path to image folder must be specified as input.

Parameters:

input (str or np.ndarray) – folder directory or N x C x H x W array
n (int) – number of closest pairs to identify
num_epochs (int) – how long to train the autoencoder (more is generally better)
show (bool) – display the closest pairs
show_train (bool) – show output every
show_path (bool) – show image paths of duplicates instead of index
z_dim (int) – size of compression (more is generally better, but slower)
kwargs (dict) – additional keyword arguments, passed to EmbeddingExtractor

Returns:

pairs (np.ndarray) – indices for closest pairs of images
distances (np.ndarray) – distances of each pair to each other

Return type:	Tuple[np.ndarray, np.ndarray]

class simages.PILDataset(pil_list: list, transform: Optional[Callable] = None)[source]¶

Bases: torch.utils.data.dataset.Dataset

PIL dataset.

__init__(pil_list: list, transform: Optional[Callable] = None)[source]¶

Parameters:

pil_list (list of PIL images) –
transform (callable, optional) – optional transform to be applied on a sample
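
The pattern behind a list-backed dataset with an optional transform can be sketched without PyTorch or PIL. `ListDataset` below is a hypothetical stand-in that mirrors only the constructor arguments documented above. It uses strings in place of PIL images, and what PILDataset actually returns per item is not stated in this reference.

```python
from typing import Any, Callable, List, Optional


class ListDataset:
    """Hypothetical stand-in: wraps a list and applies an optional transform."""

    def __init__(self, items: List[Any], transform: Optional[Callable] = None):
        self.items = items
        self.transform = transform

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, index: int) -> Any:
        sample = self.items[index]
        # The transform is applied lazily, each time a sample is requested.
        if self.transform is not None:
            sample = self.transform(sample)
        return sample


dataset = ListDataset(["a.png", "b.png"], transform=str.upper)
print(len(dataset), dataset[1])
```

Implementing `__len__` and `__getitem__` is what allows an object like this to be indexed and batched by a data loader.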

class simages.ImageFolder(root: str, loader: Callable = <function default_loader>, extensions: Optional[list] = None, transform: Optional[list] = None, is_valid_file: Optional[Callable] = None)[source]¶

Bases: torchvision.datasets.vision.VisionDataset

A generic data loader where the samples are arranged in this way:

root/xxx.ext
root/xxy.ext
root/xxz.ext

Parameters:

root (string) – Root directory path.
loader (callable) – A function to load a sample given its path.
extensions (tuple[string]) – Allowed file extensions. Do not pass both extensions and is_valid_file.
transform (callable, optional) – A function/transform that takes in a sample and returns a transformed version, e.g. transforms.RandomCrop for images.
is_valid_file (callable, optional) – A function that takes the path of an image file and checks whether it is a valid file (used to detect corrupt files). Do not pass both extensions and is_valid_file.
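
The flat layout and the rule that extensions and is_valid_file are mutually exclusive can be sketched with the standard library alone. `list_samples` is a hypothetical helper, not the ImageFolder implementation, and the default extension set used when neither argument is given is an assumption.

```python
import os
import tempfile
from typing import Callable, List, Optional, Tuple

# Assumed fallback when neither argument is supplied (hypothetical).
DEFAULT_EXTENSIONS = (".jpg", ".jpeg", ".png")


def list_samples(
    root: str,
    extensions: Optional[Tuple[str, ...]] = None,
    is_valid_file: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Return sorted paths of accepted files directly under root."""
    if extensions is not None and is_valid_file is not None:
        raise ValueError("Pass either extensions or is_valid_file, not both")
    if is_valid_file is None:
        allowed = tuple(e.lower() for e in (extensions or DEFAULT_EXTENSIONS))

        def is_valid_file(path: str) -> bool:
            return path.lower().endswith(allowed)

    candidates = (os.path.join(root, name) for name in os.listdir(root))
    return sorted(p for p in candidates if os.path.isfile(p) and is_valid_file(p))


# Build a throwaway root/xxx.ext layout and list the image files in it.
with tempfile.TemporaryDirectory() as root:
    for name in ("xxx.png", "xxy.jpg", "notes.txt"):
        open(os.path.join(root, name), "w").close()
    found = [os.path.basename(p) for p in list_samples(root, extensions=(".png", ".jpg"))]

print(found)
```

A custom is_valid_file is the natural place for a corruption check, for example a function that attempts to open the image and returns False on failure.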

__getitem__(index: int)[source]¶

Parameters:	index (int) – Index
Returns:	(sample, target) where target is class_index of the target class.
Return type:	tuple

class simages.BasicAutoencoder(num_channels: int = 1, z_dim: int = 8, hw=48)[source]¶

Bases: torch.nn.modules.module.Module

__init__(num_channels: int = 1, z_dim: int = 8, hw=48)[source]¶

Basic autoencoder - default for simages.

Parameters:

num_channels (int) – grayscale = 1, color = 3
z_dim (int) – number of embedding units to compress image to
hw (int) – height and width for input/output image
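
The three parameters together determine how strongly an image is compressed: the input has num_channels × hw × hw values and the embedding has z_dim. The arithmetic below uses the default values from the signature. `compression_ratio` is a hypothetical helper for illustration.

```python
def compression_ratio(num_channels: int = 1, z_dim: int = 8, hw: int = 48) -> float:
    """Ratio of input values to embedding units."""
    input_values = num_channels * hw * hw
    return input_values / z_dim


grayscale_ratio = compression_ratio()              # 1 * 48 * 48 = 2304 inputs
color_ratio = compression_ratio(num_channels=3)    # 3 * 48 * 48 = 6912 inputs
print(grayscale_ratio, color_ratio)
```

Raising z_dim lowers this ratio, which matches the note elsewhere in this reference that a larger z_dim is generally better but slower.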

forward(x)[source]¶

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
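
The difference described in the note can be shown with a toy class. `TinyModule` below is a simplified stand-in, not torch.nn.Module: it only imitates the idea that calling the instance runs forward and then the registered hooks, whereas calling forward directly skips them.

```python
from typing import Callable, List


class TinyModule:
    """Toy stand-in for a module with forward hooks (not torch.nn.Module)."""

    def __init__(self) -> None:
        self._forward_hooks: List[Callable] = []

    def register_forward_hook(self, hook: Callable) -> None:
        self._forward_hooks.append(hook)

    def forward(self, x: int) -> int:
        return x * 2

    def __call__(self, x: int) -> int:
        output = self.forward(x)
        # Hooks run only on this path, after forward has produced its output.
        for hook in self._forward_hooks:
            hook(self, x, output)
        return output


seen: List[int] = []
module = TinyModule()
module.register_forward_hook(lambda mod, inp, out: seen.append(out))

via_call = module(3)             # runs forward and the hook
via_forward = module.forward(4)  # runs forward only; the hook is skipped
print(via_call, via_forward, seen)
```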

decode(x)[source]¶

simages.linkageplot(embeddings: numpy.ndarray, ordered=True)[source]¶

Plot linkage between embeddings in hierarchical clustering of the distance matrix

Parameters:

embeddings (np.ndarray) – embeddings of images in dataset
ordered (bool) – order distance matrix before plotting
