Loading Data

Data is loaded using the EmbeddingExtractor class.

EmbeddingExtractor extracts embeddings and finds similar images in three steps:

  • Train an autoencoder on the images
  • Identify similar images from the embeddings of the autoencoder
  • Plot and visualize the results
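The second step can be illustrated with a minimal sketch that uses only the Python standard library. It is not the library's implementation: the helper names cosine_distance and closest_pairs are hypothetical, the embeddings are toy values, and cosine distance is used only because 'cosine' is the default metric in the signature below. The sketch assumes no embedding is the zero vector.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def closest_pairs(embeddings, n=1):
    """Return the n index pairs with the smallest cosine distance."""
    # Score every unordered pair (i, j) with i < j.
    scored = [
        (cosine_distance(embeddings[i], embeddings[j]), (i, j))
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    scored.sort()  # smallest distance first
    pairs = [pair for _, pair in scored[:n]]
    distances = [dist for dist, _ in scored[:n]]
    return pairs, distances


# Toy 2-D embeddings: items 0 and 2 point in almost the same direction.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.99, 0.05]]
toy_pairs, toy_distances = closest_pairs(toy_embeddings, n=1)
```

Comparing every pair is quadratic in the number of images, which is fine for a sketch but may be too slow for large collections.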

The dataset can be provided as a NumPy array or as a path to an image folder.

class simages.extractor.EmbeddingExtractor(input: Union[str, numpy.ndarray], num_channels: int = 3, num_epochs: int = 2, batch_size: int = 32, show: bool = False, show_path: bool = False, show_train: bool = False, z_dim: int = 8, metric: str = 'cosine', model: Optional[torch.nn.modules.module.Module] = None, db: Optional = None, **kwargs)

Extract embeddings from data with models and allow visualization.

trainloader
    Type: torch loader
evalloader
    Type: torch loader
model
    Type: torch.nn.Module
embeddings
    Type: np.ndarray

NumPy Array

Load data with:

from simages import EmbeddingExtractor
import numpy as np

# Create 100 random grayscale (1-channel) samples of 28x28 pixels
X = np.random.random((100, 28, 28))
extractor = EmbeddingExtractor(X, num_channels=1)

# Find duplicates
pairs, distances = extractor.find_duplicates()
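
Once pairs and distances are available, a common next step is to keep only the closest matches. The sketch below is an illustration under stated assumptions, not part of the library: it assumes pairs is a sequence of index pairs and distances is a parallel sequence of numbers, which this page does not confirm. The helper filter_by_distance, the stand-in values, and the 0.25 threshold are all hypothetical.

```python
def filter_by_distance(pairs, distances, max_distance):
    """Keep only the pairs whose distance is at most max_distance."""
    return [
        (tuple(pair), float(dist))
        for pair, dist in zip(pairs, distances)
        if dist <= max_distance
    ]


# Hypothetical stand-in values, not real library output.
example_pairs = [(0, 7), (3, 4), (2, 9)]
example_distances = [0.01, 0.20, 0.65]

# Assumed threshold; a suitable value depends on the data and the metric.
likely_duplicates = filter_by_distance(example_pairs, example_distances, 0.25)
```

A suitable threshold is best chosen by inspecting a few results visually, for example with show_duplicates.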

Image Folder

from simages import EmbeddingExtractor

# Point to a folder of images
data_dir = "downloads"
extractor = EmbeddingExtractor(data_dir)

# Find duplicates
pairs, distances = extractor.find_duplicates()

# Show duplicates
extractor.show_duplicates(n=5)

Duplicates can also be identified from the command line with the simages command:

$ simages add {image_folder}

$ simages find {image_folder}

Duplicates can be deleted on the webserver, as described in Removing Duplicates.