ALIGN: Scaling up visual and vision-language representation learning.
Sreedhu S S
Example image-text pairs randomly sampled from the training dataset of ALIGN. One clearly noisy text label is marked in italics | credits: Google AI Blog
Learning good visual and vision-language representations is essential for solving computer vision problems. Image retrieval and image classification enable products and tools such as Google Lens that change people's daily lives. To learn such representations, current state-of-the-art (SotA) visual and vision-language models rely on carefully curated training datasets that require expert knowledge and extensive labelling. For vision applications, representations are mostly learned on large-scale datasets with explicit class labels, such as ImageNet, OpenImages, and JFT-300M. For vision-language applications, popular pre-training datasets include Conceptual Captions and Visual Genome Dense Captions. These datasets require non-trivial data collection and cleaning steps, which limits their size and thus hinders the scale of the trained models. In contrast, Natural Language Processing (NLP) models have attained SotA performance on the GLUE and SuperGLUE benchmarks through large-scale pre-training on raw text without human labels.
Image search can be used to illustrate the quantitative results. The image retrieval system uses ALIGN-trained embeddings and displays the top-1 text-to-image retrieval results for a handful of text queries against a pool of 160M images. ALIGN can retrieve precise images given detailed descriptions of a scene, or fine-grained and instance-level concepts such as landmarks and artworks. The model aligns images and texts with similar semantics and can generalize to novel, complex concepts.
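As a rough illustration of how such retrieval works, the sketch below ranks a pool of pre-computed image embeddings against a text query by cosine similarity. The function name, the `encode_text` callable, and the array shapes are assumptions made for illustration, not the actual ALIGN code or API.

```python
import numpy as np

# Assumptions: `image_embs` is a pre-computed [N, D] matrix of L2-normalised
# image embeddings, and `encode_text` is a hypothetical ALIGN-style text
# encoder returning an L2-normalised [D] vector for a query string.
def top_k_images(query, encode_text, image_embs, k=1):
    """Return indices of the k images whose embeddings best match the text query."""
    q = encode_text(query)           # [D] query embedding
    scores = image_embs @ q          # cosine similarity via dot product (embeddings are normalised)
    return np.argsort(-scores)[:k]   # indices of the highest-scoring images, best first
```

With k=1 this reproduces the top-1 text-to-image retrieval setting described above; larger k simply returns a ranked shortlist.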
ALIGN (A Large-scale ImaGe and Noisy-text embedding) was proposed so that larger and more powerful models can be built easily. For this, the authors employ a simple dual-encoder architecture that learns to align visual and language representations of image-text pairs. The image and text encoders are trained via a contrastive loss (formulated as a normalized softmax) that pulls the embeddings of matched image-text pairs together while pushing those of non-matched pairs (within the same batch) apart. The large-scale dataset makes it feasible to scale the model up to EfficientNet-L2 (image encoder) and BERT-Large (text encoder), trained from scratch. The learned representations can be used for downstream visual and vision-language tasks.
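A minimal sketch of this normalized-softmax contrastive objective is shown below: a symmetric cross-entropy over the in-batch image-text similarity matrix, where the diagonal entries are the matched pairs. The NumPy formulation, function name, and temperature value are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def normalized_softmax_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs.

    img_embs, txt_embs: [B, D] arrays, assumed L2-normalised; row i of each
    forms a matched pair. Matched pairs are pulled together, all other
    in-batch pairings are pushed apart. The temperature value is illustrative.
    """
    logits = img_embs @ txt_embs.T / temperature   # [B, B] pairwise similarities
    labels = np.arange(len(logits))                # diagonal entries are the matches

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)       # for numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average of the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Because every other example in the batch serves as a negative, larger batches give more negatives per pair, which is one reason training at large scale helps this objective.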
The aligned visual and language representations also set new SotA results on the Flickr30K and MS-COCO benchmarks, even when compared with more sophisticated cross-attention models, and enable zero-shot image classification and cross-modality search with complex text and text+image queries.
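Zero-shot classification in this setting amounts to embedding one text prompt per class and picking the class whose prompt embedding is closest to the image embedding. The sketch below assumes a hypothetical text encoder and prompt template; the names are illustrative, not the authors' actual code.

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, encode_text, template="a photo of a {}"):
    """Return the class whose prompt embedding best matches the image embedding.

    image_emb: [D] L2-normalised image embedding. `encode_text` is a
    hypothetical text encoder returning L2-normalised [D] vectors; the
    prompt template is an illustrative assumption.
    """
    prompts = [template.format(name) for name in class_names]
    text_embs = np.stack([encode_text(p) for p in prompts])   # [C, D] class prompt embeddings
    scores = text_embs @ image_emb                             # cosine similarity per class
    return class_names[int(np.argmax(scores))]
```

No classifier head is trained here; the shared embedding space alone provides the class scores, which is what makes the classification zero-shot.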
ALIGN is proficient at cross-modal retrieval and notably outperforms SotA models. On vision-only downstream tasks, it is also comparable to or better than SotA models trained with large-scale labelled data.
