Content-based analysis of compositionality in Vision Transformers
Date
2023
Authors
Publisher
Tartu Ülikool
Abstract
While neural network models have achieved state-of-the-art results in various vision and
language tasks, there are still questions regarding their logical reasoning capabilities.
In particular, it is not clear whether these models can reason beyond analogy. For
example, an image captioning model can either learn to correlate a scene representation
with a caption, i.e. operate in text space, or it can learn to bind objects explicitly
and then utilise the explicit composition of their individual representations. The
inability of models to do the latter has been linked to their failure to generalise
to a wider range of scenarios across various tasks. Transformer-based models have achieved
high performance in many language and vision tasks, a success attributed to their ability
to model long-range relations within sequences. In vision transformers specifically, it
has been argued that the use of patches as tokens, and the interactions between them,
gives these models the ability to flexibly bind and model compositional relations between
objects at different distances, thereby exhibiting aspects of explicit compositional
ability. In this thesis, we perform experiments on the Vision Transformer (ViT) based
encoder of an image captioning model. In particular, we probe the internal representations
of the encoder at various layers to examine whether a single token captures the representation
of 1) an object, 2) related objects in the scene, and 3) the composition of two objects in
the scene. In our results we find some evidence hinting that object properties are bound
into a single token as the image is processed by the transformer. Further, this work
provides a set of methods to create and set up a dataset for studying internal
compositionality in Vision Transformer models, and suggests future lines of study to
expand this analysis.
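The probing setup described above can be sketched in code. The following is a minimal,
hypothetical illustration only: it assumes a HuggingFace ViT checkpoint
(google/vit-base-patch16-224) and a placeholder number of probe classes; the thesis's
actual dataset, probe targets, and encoder are not reproduced here.

    import torch
    from torch import nn
    from PIL import Image
    from transformers import ViTImageProcessor, ViTModel

    NUM_CLASSES = 10  # hypothetical number of probe targets (not from the thesis)

    # Frozen ViT encoder whose per-token hidden states are probed.
    model = ViTModel.from_pretrained("google/vit-base-patch16-224")
    processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
    model.eval()

    def token_representation(image: Image.Image, layer: int, token: int) -> torch.Tensor:
        """Hidden state of a single token at a chosen encoder layer."""
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # out.hidden_states is a tuple: the embedding output followed by one
        # entry per encoder layer, each of shape (1, num_tokens, hidden_dim).
        return out.hidden_states[layer][0, token]

    # Linear probe: if a single token linearly encodes the property of interest
    # (an object, a related object, or a composition of two objects), this
    # classifier should recover it from the frozen representation.
    probe = nn.Linear(model.config.hidden_size, NUM_CLASSES)
    optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

Training the probe on token representations from successive layers, and comparing its
accuracy across layers, is one way to test whether object and compositional information
becomes concentrated in single tokens as depth increases.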
Keywords
Transformers, Vision Transformer, compositionality, computer vision, neural networks, image captioning models