ocl.feature_extractors.timm
Module implementing support for timm models and some additional models based on timm.
The classes here additionally allow the extraction of features at multiple levels for both ViTs and CNNs.
Additional models
resnet34_savi
: ResNet34 as used in SAVi and SAVi++
resnet50_dino
: ResNet50 trained with DINO self-supervision
vit_small_patch16_224_mocov3
: ViT Small trained with MoCo v3 self-supervision
vit_base_patch16_224_mocov3
: ViT Base trained with MoCo v3 self-supervision
resnet50_mocov3
: ResNet50 trained with MoCo v3 self-supervision
vit_small_patch16_224_msn
: ViT Small trained with MSN self-supervision
vit_base_patch16_224_msn
: ViT Base trained with MSN self-supervision
vit_base_patch16_224_mae
: ViT Base trained with Masked Autoencoder self-supervision
TimmFeatureExtractor
Bases: ImageFeatureExtractor
Feature extractor implementation for timm models.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_name | str | Name of model. See | required |
feature_level | Optional[Union[int, str, List[Union[int, str]]]] | Level of features to return. For CNN-based models, a single integer. For ViT models, either a single feature descriptor or a list of feature descriptors. If a list is passed, multiple levels of features are extracted and concatenated. A ViT feature descriptor consists of the type of feature to extract, followed by an integer indicating the ViT block whose features to use. The type can be one of "block", "key", "query", "value", specifying that the block's output or its attention keys, queries, or values should be used. If omitted, the type defaults to "block". Example: "block1" or ["block1", "value2"]. | None |
aux_features | Optional[Union[int, str, List[Union[int, str]]]] | Features to store as auxiliary features. The format is the same as in the | None |
pretrained | bool | Whether to load pretrained weights. | False |
freeze | bool | Whether the weights of the feature extractor should be frozen (not trainable). | False |
n_blocks_to_unfreeze | int | Number of blocks that should be trainable, beginning from the last block. | 0 |
unfreeze_attention | bool | Whether weights of ViT attention layers should be trainable (only valid for ViT models). According to http://arxiv.org/abs/2203.09795, finetuning only the attention layers can yield better results in some cases, while being slightly cheaper in terms of computation and memory. | False |
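The `feature_level` descriptor format described above can be illustrated with a small parser. This is only a sketch of the documented format, not the code used in `ocl/feature_extractors/timm.py`; the function names `parse_feature_descriptor` and `parse_feature_level` are hypothetical and chosen for illustration.

```python
import re
from typing import List, Tuple, Union

# Hypothetical helper, not part of ocl: an optional feature type followed by a block index.
_DESCRIPTOR = re.compile(r"(block|key|query|value)?(\d+)")


def parse_feature_descriptor(descriptor: Union[int, str]) -> Tuple[str, int]:
    """Split a descriptor such as "value2" into ("value", 2)."""
    if isinstance(descriptor, int):
        # A bare integer carries no type, so fall back to the block output.
        return "block", descriptor
    match = _DESCRIPTOR.fullmatch(descriptor)
    if match is None:
        raise ValueError(f"Invalid feature descriptor: {descriptor!r}")
    feature_type, block_index = match.groups()
    # An omitted type defaults to "block", as stated in the parameter table.
    return feature_type or "block", int(block_index)


def parse_feature_level(
    feature_level: Union[int, str, List[Union[int, str]]]
) -> List[Tuple[str, int]]:
    """Normalize a single descriptor or a list of descriptors to a list of pairs."""
    if isinstance(feature_level, (int, str)):
        feature_level = [feature_level]
    return [parse_feature_descriptor(d) for d in feature_level]
```

With this sketch, `parse_feature_level(["block1", "value2"])` yields one `(type, block)` pair per requested level, which is the information needed to decide which ViT block to read from and which tensor of that block to take.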
Source code in ocl/feature_extractors/timm.py
resnet34_savi
ResNet34 as used in SAVi and SAVi++.
As of now, no official code including the ResNet was released, so we can only guess which of the numerous ResNet variants was used. This modifies the basic timm ResNet34 to have 1x1 strides in the stem, and replaces batch norm with group norm. It gives 16x16 feature maps with an input size of 224x224.
From SAVi:
For the modified SAVi (ResNet) model on MOVi++, we replace the convolutional backbone [...] with a ResNet-34 backbone. We use a modified ResNet root block without strides (i.e. 1×1 stride), resulting in 16×16 feature maps after the backbone [w. 128x128 images]. We further use group normalization throughout the ResNet backbone.
From SAVi++:
We used a ResNet-34 backbone with modified root convolutional layer that has 1×1 stride. For all layers, we replaced the batch normalization operation by group normalization.