Loading Data

Data is loaded using the EmbeddingExtractor class.

EmbeddingExtractor is used to:

Train an autoencoder on the images
Identify similar images from the embeddings produced by the autoencoder
Plot and visualize the results
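
The second step above can be illustrated with a small, library-free sketch. It is not the simages implementation: it only shows what ranking pairs of embeddings by cosine distance (the default value of the metric parameter) looks like. The helper names cosine_distance and closest_pairs and the toy vectors are hypothetical, and the sketch assumes no embedding is the zero vector.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def closest_pairs(embeddings, n=5):
    """Score every index pair by cosine distance and return the n smallest."""
    scored = [
        ((i, j), cosine_distance(embeddings[i], embeddings[j]))
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    scored.sort(key=lambda item: item[1])
    return scored[:n]


# Hypothetical 3-dimensional embeddings; rows 0 and 2 point in the same direction.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [2.0, 0.0, 0.0],
]
top = closest_pairs(toy_embeddings, n=1)
```

Because cosine distance ignores vector length, rows 0 and 2 are treated as identical even though their magnitudes differ. Comparing every pair is quadratic in the number of images, so this brute-force form is only suitable for small examples.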

The dataset can be provided either as a NumPy array or as the path to a folder of images.

class simages.extractor.EmbeddingExtractor(input: Union[str, numpy.ndarray], num_channels: int = 3, num_epochs: int = 2, batch_size: int = 32, show: bool = False, show_path: bool = False, show_train: bool = False, z_dim: int = 8, metric: str = 'cosine', model: Optional[torch.nn.modules.module.Module] = None, db: Optional = None, **kwargs)

Extract embeddings from data with models and allow visualization.

trainloader

Type: torch loader

evalloader

Type: torch loader

model

Type: torch.nn.Module

embeddings

Type: np.ndarray

NumPy Array:

Load data with:

from simages import EmbeddingExtractor
import numpy as np

# Create 100 random grayscale (1-channel) samples of size 28x28
X = np.random.random((100, 28, 28))
extractor = EmbeddingExtractor(X, num_channels=1)

# Find duplicates
pairs, distances = extractor.find_duplicates()
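
The returned values can then be post-processed. The sketch below keeps only pairs under a distance cutoff. It assumes that pairs is a sequence of index pairs and that distances is a parallel sequence of numbers, which this page does not confirm. The helper filter_pairs, the example values, and the 0.25 cutoff are all hypothetical.

```python
def filter_pairs(pairs, distances, max_distance):
    """Keep only the (pair, distance) entries with distance <= max_distance."""
    return [
        (tuple(pair), dist)
        for pair, dist in zip(pairs, distances)
        if dist <= max_distance
    ]


# Hypothetical values standing in for the output of find_duplicates()
example_pairs = [(0, 7), (3, 4), (1, 9)]
example_distances = [0.01, 0.20, 0.45]

likely_duplicates = filter_pairs(example_pairs, example_distances, max_distance=0.25)
```

A suitable cutoff depends on the data and on the chosen metric, so it is worth inspecting a few pairs visually before settling on a value.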

Image Folder:

from simages import EmbeddingExtractor

# Point to Folder
data_dir = "downloads"
extractor = EmbeddingExtractor(data_dir)

# Find duplicates
pairs, distances = extractor.find_duplicates()

# Show duplicates
extractor.show_duplicates(n=5)

Duplicates can also be identified from the command line with the simages command, where {image_folder} is a placeholder for the path to your image folder:

$ simages add {image_folder}

$ simages find {image_folder}

Duplicates can be deleted through the web server, as described in Removing Duplicates.