{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TorchMultimodal Tutorial: Finetuning FLAVA\n==========================================\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Multimodal AI has recently become very popular owing to its ubiquitous\nnature, from use cases like image captioning and visual search to more\nrecent applications like image generation from text. **TorchMultimodal\nis a library powered by Pytorch consisting of building blocks and end to\nend examples, aiming to enable and accelerate research in\nmultimodality**.\n\nIn this tutorial, we will demonstrate how to use a **pretrained SoTA\nmodel called** [FLAVA](https://arxiv.org/pdf/2112.04482.pdf) **from\nTorchMultimodal library to finetune on a multimodal task i.e. visual\nquestion answering** (VQA). The model consists of two unimodal\ntransformer based encoders for text and image and a multimodal encoder\nto combine the two embeddings. It is pretrained using contrastive, image\ntext matching and text, image and multimodal masking losses.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Installation\n============\n\nWe will use TextVQA dataset and `bert tokenizer` from Hugging Face for\nthis tutorial. So you need to install datasets and transformers in\naddition to TorchMultimodal.\n\n```{=html}\n
When running this tutorial in Google Colab, install the required packages by\ncreating a new cell and running the following commands (each command is\nprefixed with an exclamation mark, !, so that it runs as a shell command):</p>\n<pre>!pip install torchmultimodal-nightly\n!pip install datasets\n!pip install transformers</pre>\n</div>\n```\n
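\nAs an optional sanity check, you can confirm that the installation succeeded by\nimporting the libraries used in the rest of this tutorial. The exact imports\nbelow are only a suggestion, not a required step:\n\n```python\n# Optional check: these imports should all succeed after the installs above.\nimport torch  # PyTorch itself\nimport torchmultimodal  # TorchMultimodal building blocks\nfrom datasets import load_dataset  # Hugging Face datasets (used to load TextVQA)\nfrom transformers import BertTokenizer  # Hugging Face BERT tokenizer\n\nprint(torch.__version__)\n```\n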