ocl.feature_extractors.timm
Module implementing support for timm models and some additional models based on timm.
The classes here additionally allow the extraction of features at multiple levels for both ViTs and CNNs.
Additional models
resnet34_savi
: ResNet34 as used in SAVi and SAVi++

resnet50_dino
: ResNet50 trained with DINO self-supervision

vit_small_patch16_224_mocov3
: ViT Small trained with MoCo v3 self-supervision

vit_base_patch16_224_mocov3
: ViT Base trained with MoCo v3 self-supervision

resnet50_mocov3
: ResNet50 trained with MoCo v3 self-supervision

vit_small_patch16_224_msn
: ViT Small trained with MSN self-supervision

vit_base_patch16_224_msn
: ViT Base trained with MSN self-supervision

vit_base_patch16_224_mae
: ViT Base trained with Masked Autoencoder self-supervision
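Assuming these additional variants are registered with timm's model registry when the module is imported (the model names are taken from the list above; the registration-on-import behaviour is an assumption), they can be inspected and instantiated through the usual timm API. A minimal sketch:

```python
import timm

# Assumption: importing the module registers the extra model variants with timm.
import ocl.feature_extractors.timm  # noqa: F401

# List the self-supervised variants added on top of stock timm models.
print(timm.list_models("*dino*"))
print(timm.list_models("*mocov3*"))

# Create one of the additional models like any other timm model.
backbone = timm.create_model("vit_base_patch16_224_msn", pretrained=False)
```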
TimmFeatureExtractor
Bases: ImageFeatureExtractor
Feature extractor implementation for timm models.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_name | str | Name of model. See the timm documentation for available models. | required |
feature_level | Optional[Union[int, str, List[Union[int, str]]]] | Level of features to return. For CNN-based models, a single integer. For ViT models, either a single feature descriptor or a list of them. If a list is passed, multiple levels of features are extracted and concatenated. A ViT feature descriptor consists of the type of feature to extract, followed by an integer indicating the ViT block whose features to use. The type of features can be one of "block", "key", "query", "value", specifying that the block's output, attention keys, queries, or values should be used. If omitted, "block" is assumed as the type. Example: "block1" or ["block1", "value2"]. | None |
aux_features | Optional[Union[int, str, List[Union[int, str]]]] | Features to store as auxiliary features. The format is the same as for the feature_level argument. | None |
pretrained | bool | Whether to load pretrained weights. | False |
freeze | bool | Whether the weights of the feature extractor should be frozen (i.e. not trainable). | False |
n_blocks_to_unfreeze | int | Number of blocks that should be trainable, counted from the last block. | 0 |
unfreeze_attention | bool | Whether weights of ViT attention layers should be trainable (only valid for ViT models). According to http://arxiv.org/abs/2203.09795, finetuning only the attention layers can yield better results in some cases, while being slightly cheaper in terms of computation and memory. | False |
Source code in ocl/feature_extractors/timm.py
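A minimal instantiation sketch, assuming the constructor accepts exactly the keyword arguments listed in the table above; the chosen model name, feature descriptors, and block index are illustrative, not a recommended configuration:

```python
from ocl.feature_extractors.timm import TimmFeatureExtractor

# Sketch only: model name and descriptors are example choices. The block index
# follows the documented descriptor format ("block1", "value2", ...); the exact
# numbering convention is assumed, not confirmed by the docs.
extractor = TimmFeatureExtractor(
    model_name="vit_base_patch16_224_mocov3",
    feature_level=["block12", "key12"],  # concatenate block output and attention keys
    pretrained=True,
    freeze=True,
    n_blocks_to_unfreeze=0,
)
```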
resnet34_savi
ResNet34 as used in SAVi and SAVi++.
As of now, no official code including the ResNet was released, so we can only guess which of the numerous ResNet variants was used. This modifies the basic timm ResNet34 to have 1x1 strides in the stem, and replaces batch norm with group norm. It gives 16x16 feature maps with an input size of 224x224.
From SAVi:

> For the modified SAVi (ResNet) model on MOVi++, we replace the convolutional backbone [...] with a ResNet-34 backbone. We use a modified ResNet root block without strides (i.e. 1×1 stride), resulting in 16×16 feature maps after the backbone [w. 128x128 images]. We further use group normalization throughout the ResNet backbone.

From SAVi++:

> We used a ResNet-34 backbone with modified root convolutional layer that has 1×1 stride. For all layers, we replaced the batch normalization operation by group normalization.
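A hedged sketch of how such a variant could be assembled on top of timm's stock ResNet-34. This is not necessarily how the module implements it: the GroupNorm group count and the removal of the max-pool stride are assumptions made to match the SAVi description above.

```python
import timm
import torch
from torch import nn


def group_norm(num_channels: int, **kwargs) -> nn.GroupNorm:
    # SAVi/SAVi++ replace batch norm with group norm throughout the backbone.
    # The group count of 32 is an assumption.
    return nn.GroupNorm(num_groups=32, num_channels=num_channels, **kwargs)


# timm's ResNet factory accepts a custom norm_layer.
model = timm.create_model("resnet34", pretrained=False, norm_layer=group_norm)

# Remove the strides in the root block (1x1 stride instead of 2x2); whether the
# max-pool stride is also removed is an assumption based on the SAVi quote.
model.conv1.stride = (1, 1)
model.maxpool.stride = 1

# With the root-block strides removed, a 128x128 input yields 16x16 feature maps,
# matching the SAVi description.
features = model.forward_features(torch.randn(1, 3, 128, 128))
print(features.shape)  # torch.Size([1, 512, 16, 16])
```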