Support positional encoding interpolation in Zoo models #601
Hi! I am using this example to train a ConcatSiamese model. With the extractor from the example (vits16_dino) it runs fine, but with a different model, for example vitb32_unicom, I get an error. I have an example for reproducing it, and the last traceback item shows a size mismatch between the input tensor and what the positional encoding expects.

I think the problem is that vits16_dino interpolates the positional encoding before adding it, so the tensor is resampled to the expected shape (see open-metric-learning/oml/models/vit_dino/external/vision_transformer.py, lines 260 to 280 at 05842c8; a sketch of that interpolation is given below). But vitb32_unicom has no such interpolation, so the input tensor is 2 times bigger than the positional encoding expects, and we have to manually replace the layers that depend on the number of patches (model.pos_embed and model.feature[0]).

I think this information should be added to the docs, or perhaps handled in ViTUnicomExtractor or in ConcatSiamese with some kind of warning. The error also reproduces with all other ViTUnicomExtractor and ViTCLIPExtractor models.
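For reference, here is a minimal sketch of what that DINO-style interpolation does. It is a paraphrase, not the verbatim code from vision_transformer.py, and it assumes the DINO layout for the embedding: shape [1, 1 + N, dim] with the class token first and a square patch grid.

```python
import math

import torch
import torch.nn.functional as F


def interpolate_pos_encoding(pos_embed: torch.Tensor, n_patches: int) -> torch.Tensor:
    """Resample pos_embed ([1, 1 + N_trained, dim], class token first) to n_patches."""
    n_trained = pos_embed.shape[1] - 1  # number of patches the model was trained with
    if n_patches == n_trained:
        return pos_embed
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    old_size = int(math.sqrt(n_trained))  # assumes a square patch grid
    new_size = int(math.sqrt(n_patches))
    # [1, N, dim] -> [1, dim, H, W] so that F.interpolate can resample the grid
    patch_pos = patch_pos.reshape(1, old_size, old_size, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_size, new_size),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, -1, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)
```

With this step in the forward pass, a vits16_dino input of any (square) resolution ends up with a positional encoding of matching length; vitb32_unicom skips it, which is why the shapes diverge.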
Comments

Maintainer:
@korotaS, thank you for the report! You are absolutely right. Note that we haven't changed the original implementations (everything under the external folder is copy-pasted). The problem is indeed in the interpolation. As for a solution on the library side, I think it may be a reasonable compromise. There is also a problem we need to solve: where exactly the check should be placed.

Do you want to make this contribution to the library, @korotaS? I would greatly appreciate it :)

@korotaS:
Do you think that interpolating is better than changing some layers? I know it is not a good idea to replace trained layers with untrained ones, but they may learn quickly (especially the positional encoding); a sketch of that layer-replacement option follows below. As for the contribution: I think I can, maybe not today and not tomorrow, but I will let you know ASAP.
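A hedged sketch of that layer-replacement option: everything here is illustrative, the model is assumed to expose pos_embed in the DINO-style [1, 1 + N, dim] layout (unicom's layout may differ), and for vitb32_unicom the patch-dependent model.feature[0] would need the same treatment.

```python
import torch
import torch.nn as nn


def replace_pos_embed(model: nn.Module, n_patches_new: int) -> None:
    """Swap the trained positional embedding for a freshly initialized one."""
    dim = model.pos_embed.shape[-1]
    new_pos = nn.Parameter(torch.zeros(1, n_patches_new + 1, dim))  # +1 for the class token
    nn.init.trunc_normal_(new_pos, std=0.02)  # the usual ViT initialization
    model.pos_embed = new_pos  # this parameter now has to be (re-)trained
```

The trade-off is exactly the one raised above: the new parameter starts untrained, whereas interpolation keeps the learned values, only resampled.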
Maintainer:
In DINO they also trained on a fixed set of image sizes, but at inference time they allow interpolation. We have a simple way to check that everything is fine: just run validation on any of our benchmarks with the interpolated model and compare the metrics to the reported ones.
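For concreteness, here is one way to run that sanity check under the assumptions above, reusing the hypothetical interpolate_pos_encoding from the earlier sketch (this is not a library API, and `model` stands for a loaded extractor):

```python
import torch
import torch.nn as nn

# Resample the trained positional encoding once, at load time, for a larger
# input: e.g. 448px images with 32px patches -> (448 // 32) ** 2 = 196 patches.
with torch.no_grad():
    model.pos_embed = nn.Parameter(
        interpolate_pos_encoding(model.pos_embed, n_patches=(448 // 32) ** 2)
    )
# ...then run validation on a benchmark and compare the metrics to the reported ones.
```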