Reference
Main methods
The following methods are available via simages.main:
A tool to find and remove duplicate pictures (CLI and webserver modified with permission from @philipbl’s https://github.com/philipbl/duplicate_images).
Command line:
Usage:
    simages add <path> ... [--db=<db_path>] [--parallel=<num_processes>]
    simages remove <path> ... [--db=<db_path>]
    simages clear [--db=<db_path>]
    simages show [--db=<db_path>]
    simages find <path> [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>] [--epochs=<epochs>]
    simages -h | --help
Options:
    -h, --help                  Show this screen
    --db=<db_path>              The location of the database or a MongoDB URI. (default: ./db)
    --parallel=<num_processes>  The number of parallel processes to run to hash the image files (default: number of CPUs).
find:
    --print                     Only print duplicate files rather than displaying HTML file
    --delete                    Move all found duplicate pictures to the trash. This option takes priority over --print.
    --match-time                Adds the extra constraint that duplicate images must have the same capture times in order to be considered.
    --trash=<trash_path>        Where files will be put when they are deleted (default: ./Trash)
    --epochs=<epochs>           Epochs for training [default: 2]
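As a rough illustration of how the find options combine, the sketch below assembles an argument list for simages find from Python and leaves running it to the caller. The helper build_find_command is hypothetical and not part of simages; it uses only the flags listed in the usage text above and assumes the simages executable is on PATH.

```python
import shlex
from typing import List, Optional


def build_find_command(
    path: str,
    print_only: bool = False,
    delete: bool = False,
    match_time: bool = False,
    trash: Optional[str] = None,
    db: Optional[str] = None,
    epochs: Optional[int] = None,
) -> List[str]:
    """Assemble argv for `simages find` (hypothetical helper, not part of simages)."""
    cmd = ["simages", "find", path]
    if print_only:
        cmd.append("--print")
    if delete:
        cmd.append("--delete")  # per the usage text, takes priority over --print
    if match_time:
        cmd.append("--match-time")
    if trash is not None:
        cmd.append(f"--trash={trash}")
    if db is not None:
        cmd.append(f"--db={db}")
    if epochs is not None:
        cmd.append(f"--epochs={epochs}")
    return cmd


cmd = build_find_command("./photos", print_only=True, epochs=5)
print(shlex.join(cmd))  # simages find ./photos --print --epochs=5
```

To actually run the command, the list can be handed to subprocess.run(cmd, check=True).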
-
simages.main.find_duplicates(input: Union[str, numpy.ndarray], n: int = 5, num_epochs: int = 2, num_channels: int = 3, show: bool = False, show_train: bool = False, show_path: bool = True, z_dim: int = 8, db=None, **kwargs) → Tuple[numpy.ndarray, numpy.ndarray]
Find duplicates in dataset. Either np.ndarray or path to image folder must be specified as input.
Parameters: - input (str or np.ndarray) – folder directory or N x C x H x W array
- n (int) – number of closest pairs to identify
- num_epochs (int) – how long to train the autoencoder (more is generally better)
- show (bool) – display the closest pairs
- show_train (bool) – show intermediate output during training
- show_path (bool) – show image paths of duplicates instead of index
- z_dim (int) – size of compression (more is generally better, but slower)
- kwargs (dict) – etc, passed to EmbeddingExtractor
Returns: - pairs (np.ndarray) – indices for closest pairs of images
- distances (np.ndarray) – distances of each pair to each other
Return type: Tuple[np.ndarray, np.ndarray]
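A minimal usage sketch for the array input follows. It builds a random N x C x H x W array with one planted duplicate and wraps the call to find_duplicates in a function, so the snippet loads even where simages is not installed. The helper names, the toy data, and the 48 x 48 image size (chosen to match the hw=48 default of BasicAutoencoder) are assumptions made for illustration.

```python
import numpy as np


def make_toy_images(n_images: int = 16, channels: int = 3, size: int = 48, seed: int = 0) -> np.ndarray:
    """Random N x C x H x W array with one planted exact duplicate (illustrative data)."""
    rng = np.random.default_rng(seed)
    images = rng.random((n_images, channels, size, size))
    images[-1] = images[0]  # the last image duplicates the first
    return images


def run_find_duplicates(images: np.ndarray, n: int = 5):
    """Call find_duplicates on an array; simages is imported lazily inside the function."""
    from simages import find_duplicates

    pairs, distances = find_duplicates(images, n=n, num_epochs=2, show=False)
    return pairs, distances


images = make_toy_images()
print(images.shape)  # (16, 3, 48, 48)
```

Calling run_find_duplicates(images) trains the autoencoder and returns the pair indices together with their distances, as described under Returns.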
Extractor Methods
The following methods are available via simages.extractor.EmbeddingExtractor:
-
class simages.extractor.EmbeddingExtractor(input: Union[str, numpy.ndarray], num_channels: int = 3, num_epochs: int = 2, batch_size: int = 32, show: bool = False, show_path: bool = False, show_train: bool = False, z_dim: int = 8, metric: str = 'cosine', model: Optional[torch.nn.modules.module.Module] = None, db: Optional = None, **kwargs)
Bases: object
Extract embeddings from data with models and allow visualization.
-
trainloader
Type: torch loader
-
evalloader
Type: torch loader
-
model
Type: torch.nn.Module
-
embeddings
Type: np.ndarray
-
__init__(input: Union[str, numpy.ndarray], num_channels: int = 3, num_epochs: int = 2, batch_size: int = 32, show: bool = False, show_path: bool = False, show_train: bool = False, z_dim: int = 8, metric: str = 'cosine', model: Optional[torch.nn.modules.module.Module] = None, db: Optional = None, **kwargs)
Initializes EmbeddingExtractor with input (either str or np.ndarray), then performs training and validation.
Parameters: - input (np.ndarray or str) – data
- num_channels (int) – grayscale = 1, color = 3
- num_epochs (int) – more is better (generally)
- batch_size (int) – number of images per batch
- show (bool) – show closest pairs
- show_path (bool) – show path of duplicates
- show_train (bool) – show intermediate training results
- z_dim (int) – compression size
- metric (str) – distance metric for scipy.spatial.distance.cdist() (e.g., euclidean, cosine, hamming)
- model (torch.nn.Module, optional) – class implementing the same methods as BasicAutoencoder
- db_conn_string (str) – MongoDB connection string
- kwargs (dict) –
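To make the default metric concrete, here is a small NumPy sketch of the cosine distance, 1 - cos(u, v), between every pair of embedding rows. This is the quantity that scipy.spatial.distance.cdist computes for metric='cosine'; the function name and the sample vectors are illustrative and not taken from simages.

```python
import numpy as np


def cosine_distance_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance, 1 - cos(u, v), between rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms  # assumes no all-zero rows
    return 1.0 - unit @ unit.T


emb = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
dist = cosine_distance_matrix(emb)
# Rows 0 and 1 point the same way (distance 0); rows 0 and 2 are orthogonal (distance 1).
```

Because the cosine metric ignores vector length, two embeddings that differ only in scale count as identical, which is not the case for euclidean.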
-
get_image(index: int) → torch.Tensor
-
train()
Train autoencoder to build embeddings of dataset. Final embeddings are created in eval().
-
eval()
Evaluate reconstruction of embeddings built in train.
-
duplicates(n: int = 10, quantile: float = None) → Tuple[numpy.ndarray, numpy.ndarray]
Identify the n closest pairs of images, or the closest quantile of pairs (for example, the closest 0.05).
Parameters:
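The idea of picking the n closest pairs, or a quantile of them, can be sketched in plain NumPy as below. This is not the library's implementation (simages relies on the closely package for pair search); it uses Euclidean distance for simplicity and reads quantile as "the closest fraction of all pairs", which is an assumption.

```python
import numpy as np


def closest_pairs(embeddings: np.ndarray, n: int = 10, quantile: float = None):
    """Return (pairs, distances) for the closest pairs of rows, nearest first."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))        # Euclidean distance matrix
    i, j = np.triu_indices(len(embeddings), k=1)     # each unordered pair once
    d = dist[i, j]
    if quantile is not None:
        n = max(1, int(round(quantile * len(d))))    # closest fraction of all pairs
    order = np.argsort(d)[:n]
    pairs = np.stack([i[order], j[order]], axis=1)
    return pairs, d[order]


emb = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [9.0, 1.0]])
pairs, distances = closest_pairs(emb, n=1)
# pairs holds the single closest pair, rows 0 and 2, at distance 0.1
```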
-
static channels_last(img: numpy.ndarray) → numpy.ndarray
Move channels from first to last by swapping axes.
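For reference, a channels-first to channels-last conversion can be written with np.moveaxis as in the sketch below. It is a stand-alone illustration of the operation on a single C x H x W image and is not copied from the simages source, which may arrange the axes differently.

```python
import numpy as np


def channels_last(img: np.ndarray) -> np.ndarray:
    """Move the channel axis from first to last: C x H x W -> H x W x C."""
    return np.moveaxis(img, 0, -1)


img = np.zeros((3, 48, 32))
print(channels_last(img).shape)  # (48, 32, 3)
```

Plotting libraries such as matplotlib expect the H x W x C layout, which is why this conversion comes before display.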
-
show(img: Union[torch.Tensor, numpy.ndarray], title: str = '', block: bool = True, y_labels=None, unnormalize=True)
Plot img with title.
Parameters:
-
show_images(indices: Union[list, int], title='')
Plot images (from validation data) at indices with title.
-
image_paths(indices, short=True)
Get path to image at index of eval/embedding.
Parameters: - indices (Union[int, list]) – indices of embeddings in dataset
- short (bool) – truncate filepath to 30 characters
Returns: paths to images in image folder
Return type: paths (str or list of str)
-
show_duplicates(n=5, path=False) → (numpy.ndarray, numpy.ndarray)
Show duplicates from comparison of embeddings. Uses the closely package to get pairs.
Parameters:
Returns: - pairs (np.ndarray) – pairs as indices
- distances (np.ndarray) – distances of pairs
Return type: (np.ndarray, np.ndarray)
-
unnormalize(image: torch.Tensor) → torch.Tensor
Unnormalize an image.
Parameters: image (torch.Tensor) –
Returns: image (torch.Tensor)
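Unnormalizing generally means inverting the (x - mean) / std transform applied before training. The NumPy sketch below shows that inversion; the mean and std values of 0.5 are assumptions picked for the example and are not taken from simages, and the real method operates on torch.Tensor rather than np.ndarray.

```python
import numpy as np


def unnormalize(image: np.ndarray, mean: float = 0.5, std: float = 0.5) -> np.ndarray:
    """Invert (x - mean) / std; mean and std are assumed values, not taken from simages."""
    return image * std + mean


original = np.array([0.0, 0.5, 1.0])
normalized = (original - 0.5) / 0.5   # maps [0, 1] to [-1, 1]
restored = unnormalize(normalized)    # maps back to [0, 1]
```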
-
decode(embedding: Optional[numpy.ndarray] = None, index: Optional[int] = None, show: bool = False, astensor: bool = False) → numpy.ndarray
Decode the embedding at index, or an embedding passed in directly.
Parameters:
Returns: reconstructed image from embedding
Return type: image (np.ndarray or torch.Tensor)
-
Embedding methods
The following methods are available via simages.embeddings.Embeddings:
-
class simages.embeddings.Embeddings(input: Union[numpy.ndarray, str], **kwargs)
Bases: object
Create embeddings from input data by training an autoencoder.
Passes arguments for EmbeddingExtractor.
-
extractor
workhorse for extracting embeddings from dataset
Type: simages.EmbeddingExtractor
-
embeddings
embeddings
Type: np.ndarray
-
pairs
n closest pairs
Type: np.ndarray
-
distances
distances between n-closest pairs
Type: np.ndarray
-
array
-
Dataset methods
The following classes are available via simages.dataset:
-
simages.main.build_parser()
-
simages.main.parse_arguments(args)
-
simages.main.main()
Main entry point for simages-show via command line.
-
simages.main.find_similar(db)
-
simages.main.cli()
Module contents
Find similar images in a dataset <https://github.com/justinshenk/simages>
-
class simages.Embeddings(input: Union[numpy.ndarray, str], **kwargs)
Bases: object
Create embeddings from input data by training an autoencoder.
Passes arguments for EmbeddingExtractor.
-
-
class simages.EmbeddingExtractor(input: Union[str, numpy.ndarray], num_channels: int = 3, num_epochs: int = 2, batch_size: int = 32, show: bool = False, show_path: bool = False, show_train: bool = False, z_dim: int = 8, metric: str = 'cosine', model: Optional[torch.nn.modules.module.Module] = None, db: Optional = None, **kwargs)
Bases: object
Extract embeddings from data with models and allow visualization.
-
-
simages.find_duplicates(input: Union[str, numpy.ndarray], n: int = 5, num_epochs: int = 2, num_channels: int = 3, show: bool = False, show_train: bool = False, show_path: bool = True, z_dim: int = 8, db=None, **kwargs) → Tuple[numpy.ndarray, numpy.ndarray]
Find duplicates in dataset. Either np.ndarray or path to image folder must be specified as input.
-
class simages.PILDataset(pil_list: list, transform: Optional[Callable] = None)
Bases: torch.utils.data.dataset.Dataset
PIL dataset.
-
class simages.ImageFolder(root: str, loader: Callable = <function default_loader>, extensions: Optional[list] = None, transform: Optional[list] = None, is_valid_file: Optional[Callable] = None)
Bases: torchvision.datasets.vision.VisionDataset
A generic data loader where the samples are arranged in this way:
root/xxx.ext
root/xxy.ext
root/xxz.ext
Parameters: - root (string) – Root directory path.
- loader (callable) – A function to load a sample given its path.
- extensions (tuple[string]) – A list of allowed extensions. extensions and is_valid_file should not both be passed.
- transform (callable, optional) – A function/transform that takes in a sample and returns a transformed version, e.g. transforms.RandomCrop for images.
- is_valid_file – A function that takes the path of an image file and checks whether it is a valid file (used to check for corrupt files). extensions and is_valid_file should not both be passed.
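The flat root/xxx.ext layout and the either/or rule for extensions and is_valid_file can be illustrated with a standard-library sketch. The function list_samples is hypothetical and only mimics how such a loader might enumerate files; it does not load images and is not the ImageFolder implementation.

```python
import os
import tempfile
from typing import Callable, List, Optional, Tuple


def list_samples(
    root: str,
    extensions: Optional[Tuple[str, ...]] = None,
    is_valid_file: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """List files directly under `root`, filtered by extension or by a validity check."""
    if extensions is not None and is_valid_file is not None:
        raise ValueError("Pass either extensions or is_valid_file, not both")
    if extensions is not None:
        allowed = tuple(ext.lower() for ext in extensions)
        check = lambda p: p.lower().endswith(allowed)  # case-insensitive match
    elif is_valid_file is not None:
        check = is_valid_file
    else:
        check = lambda p: True  # no filter given: accept every file
    paths = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.isfile(path) and check(path):
            paths.append(path)
    return paths


# Build a throwaway folder with two images and one unrelated file.
with tempfile.TemporaryDirectory() as root:
    for name in ("a.jpg", "b.PNG", "notes.txt"):
        open(os.path.join(root, name), "w").close()
    found = [os.path.basename(p) for p in list_samples(root, extensions=(".jpg", ".png"))]

print(found)  # ['a.jpg', 'b.PNG']
```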
-
class simages.BasicAutoencoder(num_channels: int = 1, z_dim: int = 8, hw=48)
Bases: torch.nn.modules.module.Module
-
__init__(num_channels: int = 1, z_dim: int = 8, hw=48)
Basic autoencoder - default for simages.
Parameters:
-
forward(x)
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
simages.linkageplot(embeddings: numpy.ndarray, ordered=True)
Plot linkage between embeddings in hierarchical clustering of the distance matrix.
Parameters: - embeddings (np.ndarray) – embeddings of images in dataset
- ordered (bool) – order distance matrix before plotting