Loading Data
Data is loaded using the EmbeddingExtractor class.
EmbeddingExtractor extracts embeddings and works in three steps:
- Trains an autoencoder on the images
- Identifies similar images from the autoencoder's embeddings
- Plots and visualizes the results
The dataset can be provided as a NumPy array or as a path to an image folder.
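The second step, identifying similar images from embeddings, can be illustrated without the library. What follows is a minimal sketch, assuming the embeddings form an (N, z_dim) array and that similarity is measured by cosine distance (the default value of the metric parameter). The helper name closest_pairs is hypothetical and is not part of simages; the library's own search may work differently.

```python
import numpy as np

def closest_pairs(embeddings: np.ndarray, n: int = 5):
    """Return the n closest (i, j) index pairs and their cosine distances.

    Hypothetical helper for illustration; not part of simages.
    """
    # Normalize rows to unit length so dot products equal cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    # Cosine distance = 1 - cosine similarity
    dist = 1.0 - unit @ unit.T
    # Keep each unordered pair once (i < j), which also excludes self-matches
    i_idx, j_idx = np.triu_indices(len(embeddings), k=1)
    pair_dist = dist[i_idx, j_idx]
    # Sort pairs from most to least similar and keep the first n
    order = np.argsort(pair_dist)[:n]
    pairs = np.stack([i_idx[order], j_idx[order]], axis=1)
    return pairs, pair_dist[order]

# Toy embeddings: row 3 is a scaled copy of row 0, so they share a direction
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 8))
emb[3] = 2.0 * emb[0]
pairs, distances = closest_pairs(emb, n=3)
print(pairs[0], distances[0])
```

Because cosine distance ignores vector length, the scaled copy is reported as the closest pair with a distance of approximately zero.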
class simages.extractor.EmbeddingExtractor(input: Union[str, numpy.ndarray], num_channels: int = 3, num_epochs: int = 2, batch_size: int = 32, show: bool = False, show_path: bool = False, show_train: bool = False, z_dim: int = 8, metric: str = 'cosine', model: Optional[torch.nn.modules.module.Module] = None, db: Optional = None, **kwargs)
Extract embeddings from data with models and allow visualization.
Attributes:
- trainloader (type: torch loader)
- evalloader (type: torch loader)
- model (type: torch.nn.Module)
- embeddings (type: np.ndarray)
NumPy Array:
Load data with:
from simages import EmbeddingExtractor
import numpy as np
# Create grayscale (1-channel) samples
X = np.random.random((100,28,28))
extractor = EmbeddingExtractor(X, num_channels=1)
# Find duplicates
pairs, distances = extractor.find_duplicates()
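In practice the array usually comes from real images rather than random values. The following is a minimal sketch of preparing such an array, assuming the extractor accepts float arrays of shape (N, H, W) for single-channel data, as in the example above. The helper to_grayscale_batch is hypothetical, and scaling to the [0, 1] range is an assumption that mirrors np.random.random, not documented behavior.

```python
import numpy as np

def to_grayscale_batch(images):
    """Stack equally sized uint8 images into a float array of shape (N, H, W).

    RGB inputs of shape (H, W, 3) are averaged over the channel axis.
    Hypothetical helper for illustration; not part of simages.
    """
    batch = []
    for img in images:
        arr = np.asarray(img, dtype=np.float64)
        if arr.ndim == 3:
            # Collapse the color channels into a single intensity value
            arr = arr.mean(axis=-1)
        # Scale 0..255 intensities to 0..1 (assumed range)
        batch.append(arr / 255.0)
    return np.stack(batch)

# Two dummy 28x28 images: one white RGB image, one black grayscale image
rgb = np.full((28, 28, 3), 255, dtype=np.uint8)
gray = np.zeros((28, 28), dtype=np.uint8)
X = to_grayscale_batch([rgb, gray])
print(X.shape, X.min(), X.max())
```

The resulting X could then be passed to EmbeddingExtractor(X, num_channels=1) in the same way as the random samples above.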
Image Folder:
from simages import EmbeddingExtractor
# Point to Folder
data_dir = "downloads"
extractor = EmbeddingExtractor(data_dir)
# Find duplicates
pairs, distances = extractor.find_duplicates()
# Show duplicates
extractor.show_duplicates(n=5)
Duplicates can also be identified from the command line using the simages command:
$ simages add {image_folder}
$ simages find {image_folder}
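The same two commands can be driven from a script. This is a minimal sketch, assuming only the add and find subcommands shown above; the helper names simages_commands and run_simages are hypothetical, and no other flags or output formats are assumed.

```python
import shutil
import subprocess

def simages_commands(image_folder: str):
    """Build the two documented commands for a folder (hypothetical helper)."""
    return [
        ["simages", "add", image_folder],
        ["simages", "find", image_folder],
    ]

def run_simages(image_folder: str) -> bool:
    """Run both commands in order; return False if the CLI is not installed."""
    if shutil.which("simages") is None:
        return False
    for cmd in simages_commands(image_folder):
        # check=True raises CalledProcessError if a command exits with an error
        subprocess.run(cmd, check=True)
    return True

commands = simages_commands("downloads")
print(commands)
```

Passing each command as a list of arguments avoids shell quoting problems when the folder path contains spaces.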
Duplicates can then be deleted through the web server, as described in Removing Duplicates.