Welcome to the multimodal documentation.

multimodal is a Python library providing tools for vision and language research. It provides visual features commonly used for captioning and Visual Question Answering (VQA) tasks, as well as datasets such as VQA.

Datasets

Visual Features

Bottom-Up Top-Down Object Features

Other

This library was developed by Corentin Dancette. If you have a feature request or want to report a bug, please open an issue on the GitHub tracker or submit a pull request.