Content-based analysis of compositionality in Vision Transformers
Date
2023
Authors
Publisher
Tartu Ülikool
Abstract
While neural network models have achieved state-of-the-art results in various vision and
language tasks, there are still questions regarding their logical reasoning capabilities.
In particular, it is not clear whether these models can reason beyond analogy. For
example, an image captioning model can either learn to correlate a scene representation
with a caption, i.e. operate in text space, or it can learn to bind objects explicitly
and then utilise the explicit composition of their individual representations. The
inability of models to do the latter has been linked to their failure to generalise
to a wider range of scenarios across various tasks. Transformer-based models have achieved
high performance in many language and vision tasks, a success attributed to their ability
to model long-range relations within sequences. In vision transformers specifically, it
has been argued that the use of patches as tokens, and the interactions between them,
gives these models the ability to flexibly bind and model compositional relations between
objects at different distances, thereby exhibiting aspects of explicit compositional
ability. In this thesis, we perform experiments on the Vision Transformer (ViT) based
encoder of an image captioning model. In particular, we probe the internal representations
of the encoder at various layers to examine whether a single token captures the representation
of 1) an object, 2) related objects in the scene, and 3) the composition of two objects in
the scene. In our results we find some evidence hinting that object properties are bound
into a single token as the image is processed by the transformer. Further, this work
provides a set of methods to create and set up a dataset for studying internal
compositionality in Vision Transformer models, and suggests future lines of study to
expand this analysis.
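The probing setup described above can be sketched in code. The following is a minimal,
hypothetical illustration only: it assumes a HuggingFace ViT checkpoint
(google/vit-base-patch16-224) and a placeholder number of probe classes; the thesis's
actual dataset, probe targets, and encoder are not reproduced here.

    import torch
    from torch import nn
    from PIL import Image
    from transformers import ViTImageProcessor, ViTModel

    NUM_CLASSES = 10  # hypothetical number of probe targets (not from the thesis)

    # Frozen ViT encoder whose per-token hidden states are probed.
    model = ViTModel.from_pretrained("google/vit-base-patch16-224")
    processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
    model.eval()

    def token_representation(image: Image.Image, layer: int, token: int) -> torch.Tensor:
        """Hidden state of a single token at a chosen encoder layer."""
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # out.hidden_states is a tuple: the embedding output followed by one
        # entry per encoder layer, each of shape (1, num_tokens, hidden_dim).
        return out.hidden_states[layer][0, token]

    # Linear probe: if a single token linearly encodes the property of interest
    # (an object, a related object, or a composition of two objects), this
    # classifier should recover it from the frozen representation.
    probe = nn.Linear(model.config.hidden_size, NUM_CLASSES)
    optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

Training the probe on token representations from successive layers, and comparing its
accuracy across layers, is one way to test whether object and compositional information
becomes concentrated in single tokens as depth increases.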
Keywords
Transformers, Vision Transformer, compositionality, computer vision, neural networks, image captioning models