Visual Question Answering

Visual Question Answering datasets are available in multimodal. Annotation data is automatically downloaded and processed when a dataset class is instantiated. Note that pre-processing can take several minutes.

class multimodal.datasets.VQA(*args: Any, **kwargs: Any)[source]

Pytorch Dataset implementation for the VQA v1 dataset (visual question answering). See https://visualqa.org/ for more details about it.

When this class is instantiated, data is downloaded into the directory specified by the dir_data parameter. Pre-processing of questions and answers takes several minutes.

When the features argument is specified, visual features are downloaded as well. About 60 GB is required to download and extract the features.

Parameters
  • dir_data (str) – directory for the multimodal cache (data will be downloaded in a vqa2/ folder inside this directory).

  • features (str|object) – which visual features to use. Choices: coco-bottomup or coco-bottomup-36. You can also pass a feature instance directly.

  • split (str) – which split to load, one of train, val or test.

  • dir_features (str) – directory to download features. If None, defaults to $dir_data/features

  • label (str) – either multilabel or best. With multilabel, the ground-truth scores for each question are the soft scores assigned by the VQA evaluation metric. With best, the ground truth is the label of the top answer.

  • tokenize_questions (bool) – If True, pre-processing will tokenize questions; the resulting tokens are stored in item["question_tokens"].

  • load (bool) – default True. If False, annotations and questions are not loaded into memory. This is useful if you only want to download and process the data.
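
For reference, here is a minimal usage sketch; the dir_data path below is a placeholder, not a library default:

    from multimodal.datasets import VQA

    # Download and pre-process VQA v1 (no visual features requested here).
    train_set = VQA(dir_data="/path/to/multimodal_cache", split="train", label="multilabel")

    item = train_set[0]  # see __getitem__ below for the available keys
    print(item["question"])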

__getitem__(index)[source]

Returns a dictionary with the following keys:

{
    'image_id',
    'question_id',
    'question',
    'answer_type',
    'multiple_choice_answer',
    'answers',
    'question_type',
    'scores',
    'label'   # ground truth label to be used for the loss
}

Additionally, if visual features are used, the keys from the features will be added.

static collate_fn(batch)[source]

Use this method to collate batches of data.
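
A sketch of how collate_fn can be passed to a standard PyTorch DataLoader; batch size and shuffling below are arbitrary choices for illustration:

    from torch.utils.data import DataLoader
    from multimodal.datasets import VQA

    train_set = VQA(dir_data="/path/to/multimodal_cache", split="train")
    loader = DataLoader(
        train_set,
        batch_size=32,              # arbitrary value
        shuffle=True,
        collate_fn=VQA.collate_fn,  # collates the item dictionaries into a batch
    )
    batch = next(iter(loader))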

evaluate(predictions)[source]

Evaluates a list of predictions, according to the VQA evaluation protocol. See https://visualqa.org/evaluation.html.

Parameters

predictions (list) – list of dictionaries containing question_id and answer keys. The answer must be given as a string.

Returns

A dict of floats containing scores for "overall", "yes/no", "number", and "other" questions.
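
A sketch of the expected prediction format; the constant "yes" answer below is a dummy used only to illustrate the structure:

    from multimodal.datasets import VQA

    val_set = VQA(dir_data="/path/to/multimodal_cache", split="val")

    # Each prediction is a dict with "question_id" and a free-form "answer" string.
    predictions = [
        {"question_id": item["question_id"], "answer": "yes"}
        for item in val_set
    ]

    scores = val_set.evaluate(predictions)
    print(scores["overall"], scores["yes/no"], scores["number"], scores["other"])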

class multimodal.datasets.VQA2(*args: Any, **kwargs: Any)[source]

Pytorch Dataset implementation for the VQA v2 dataset (visual question answering). See https://visualqa.org/ for more details about it.

When this class is instantiated, data is downloaded into the directory specified by the dir_data parameter. Pre-processing of questions and answers takes several minutes.

When the features argument is specified, visual features are downloaded as well. About 60 GB is required to download and extract the features.

Parameters
  • dir_data (str) – directory for the multimodal cache (data will be downloaded in a vqa2/ folder inside this directory).

  • features (str) – which visual features to use. Choices: coco-bottomup or coco-bottomup-36; see the sketch after this parameter list.

  • split (str) – which split to load, one of train, val or test.

  • dir_features (str) – directory to download features. If None, defaults to $dir_data/features

  • label (str) – either multilabel or best. With multilabel, the ground-truth scores for each question are the soft scores assigned by the VQA evaluation metric. With best, the ground truth is the label of the top answer.

  • tokenize_questions (bool) – If True, pre-processing will tokenize questions; the resulting tokens are stored in item["question_tokens"].
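
As a sketch, loading VQA v2 together with pre-extracted bottom-up features; the exact feature keys merged into each item depend on the chosen feature set, so they are not listed here:

    from torch.utils.data import DataLoader
    from multimodal.datasets import VQA2

    train_set = VQA2(
        dir_data="/path/to/multimodal_cache",  # placeholder path
        features="coco-bottomup-36",           # triggers the ~60 GB feature download
        split="train",
        label="multilabel",
        tokenize_questions=True,
    )

    item = train_set[0]
    print(item["question_tokens"])  # available because tokenize_questions=True
    # item also contains the keys added by the visual-feature loader

    loader = DataLoader(train_set, batch_size=32, collate_fn=VQA2.collate_fn)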

__getitem__(index)

Returns a dictionary with the following keys:

{
    'image_id',
    'question_id',
    'question',
    'answer_type',
    'multiple_choice_answer',
    'answers',
    'question_type',
    'scores',
    'label'   # ground truth label to be used for the loss
}

Additionally, if visual features are used, the keys from the features will be added.

static collate_fn(batch)

Use this method to collate batches of data.

evaluate(predictions)

Evaluates a list of predictions, according to the VQA evaluation protocol. See https://visualqa.org/evaluation.html.

Parameters

predictions (list) – list of dictionaries containing question_id and answer keys. The answer must be given as a string.

Returns

A dict of floats containing scores for "overall", "yes/no", "number", and "other" questions.

class multimodal.datasets.VQACP(*args: Any, **kwargs: Any)[source]

Pytorch Dataset implementation for the VQA-CP v1 dataset (visual question answering). See https://www.cc.gatech.edu/grads/a/aagrawal307/vqa-cp/ for more details about it.

When this class is instantiated, data is downloaded into the directory specified by the dir_data parameter. Pre-processing of questions and answers takes several minutes.

When the features argument is specified, visual features are downloaded as well. About 60 GB is required to download and extract the features.

Parameters
  • dir_data (str) – directory for the multimodal cache (data will be downloaded in a vqa2/ folder inside this directory).

  • features (str) – which visual features to use. Choices: coco-bottomup or coco-bottomup-36.

  • split (str) – which split to load, one of train, val or test.

  • dir_features (str) – directory to download features. If None, defaults to $dir_data/features

  • label (str) – either multilabel or best. With multilabel, the ground-truth scores for each question are the soft scores assigned by the VQA evaluation metric. With best, the ground truth is the label of the top answer.

  • tokenize_questions (bool) – If True, pre-processing will tokenize questions; the resulting tokens are stored in item["question_tokens"].
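
VQA-CP is constructed so that the answer distribution differs between splits; as a sketch, both splits can be loaded to measure robustness to these changing priors (the path is a placeholder):

    from multimodal.datasets import VQACP

    # Train and test splits have deliberately different answer distributions.
    train_set = VQACP(dir_data="/path/to/multimodal_cache", split="train", label="multilabel")
    test_set = VQACP(dir_data="/path/to/multimodal_cache", split="test", label="multilabel")

    # Predictions on the test split are scored with the usual VQA protocol:
    # scores = test_set.evaluate(predictions)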

__getitem__(index)

Returns a dictionary with the following keys:

{
    'image_id',
    'question_id',
    'question',
    'answer_type',
    'multiple_choice_answer',
    'answers',
    'question_type',
    'scores',
    'label'   # ground truth label to be used for the loss
}

Additionally, if visual features are used, the keys from the features will be added.

static collate_fn(batch)

Use this method to collate batches of data.

evaluate(predictions)

Evaluates a list of predictions, according to the VQA evaluation protocol. See https://visualqa.org/evaluation.html.

Parameters

predictions (list) – list of dictionaries containing question_id and answer keys. The answer must be given as a string.

Returns

A dict of floats containing scores for "overall", "yes/no", "number", and "other" questions.

class multimodal.datasets.VQACP2(*args: Any, **kwargs: Any)[source]

Pytorch Dataset implementation for the VQA-CP v2 dataset (visual question answering). See https://www.cc.gatech.edu/grads/a/aagrawal307/vqa-cp/ for more details about it.

When this class is instantiated, data is downloaded into the directory specified by the dir_data parameter. Pre-processing of questions and answers takes several minutes.

When the features argument is specified, visual features are downloaded as well. About 60 GB is required to download and extract the features.

Parameters
  • dir_data (str) – directory for the multimodal cache (data will be downloaded in a vqa2/ folder inside this directory).

  • features (str) – which visual features to use. Choices: coco-bottomup or coco-bottomup-36.

  • split (str) – which split to load, one of train, val or test.

  • dir_features (str) – directory to download features. If None, defaults to $dir_data/features

  • label (str) – either multilabel or best. With multilabel, the ground-truth scores for each question are the soft scores assigned by the VQA evaluation metric. With best, the ground truth is the label of the top answer; see the sketch after this parameter list.

  • tokenize_questions (bool) – If True, pre-processing will tokenize questions; the resulting tokens are stored in item["question_tokens"].
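
A sketch contrasting the two label modes; the exact types of item["label"] and item["scores"] are not specified here, so the loss functions mentioned in the comments are assumptions rather than library requirements:

    from multimodal.datasets import VQACP2

    # label="best": item["label"] is assumed to be a single class index,
    # suitable as a target for a cross-entropy loss.
    best_set = VQACP2(dir_data="/path/to/multimodal_cache", split="train", label="best")

    # label="multilabel": item["scores"] holds the soft VQA scores for the answers,
    # a natural target for a binary-cross-entropy-style multi-label loss.
    multi_set = VQACP2(dir_data="/path/to/multimodal_cache", split="train", label="multilabel")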

__getitem__(index)

Returns a dictionary with the following keys:

{
    'image_id',
    'question_id',
    'question',
    'answer_type',
    'multiple_choice_answer',
    'answers',
    'question_type',
    'scores',
    'label'   # ground truth label to be used for the loss
}

Additionally, if visual features are used, the keys from the features will be added.

static collate_fn(batch)

Use this method to collate batches of data.

evaluate(predictions)

Evaluates a list of predictions, according to the VQA evaluation protocol. See https://visualqa.org/evaluation.html.

Parameters

predictions (list) – list of dictionaries containing question_id and answer keys. The answer must be given as a string.

Returns

A dict of floats containing scores for "overall", "yes/no", "number", and "other" questions.

CLEVR

https://cs.stanford.edu/people/jcjohns/clevr/

class multimodal.datasets.CLEVR(*args: Any, **kwargs: Any)[source]

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning.

See https://cs.stanford.edu/people/jcjohns/clevr/

Warning: instantiating this class will download an 18 GB file to the multimodal data directory (by default, in your application data directory). You can specify the multimodal data directory with the dir_data argument, or by setting it in your path.

Parameters
  • dir_data (str) – directory for the multimodal cache (data will be downloaded in a clevr/ folder inside this directory).

  • split (str) – either train, val or test

  • transform – torchvision transform applied to the images. By default, only ToTensor is applied; see the sketch after this parameter list.
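
A sketch of loading CLEVR with a custom torchvision transform; the image size chosen here is arbitrary and the path is a placeholder:

    from torchvision import transforms
    from multimodal.datasets import CLEVR

    transform = transforms.Compose([
        transforms.Resize((224, 224)),  # arbitrary size, for illustration
        transforms.ToTensor(),
    ])

    clevr = CLEVR(dir_data="/path/to/multimodal_cache", split="val", transform=transform)

    item = clevr[0]
    print(item["question"], item["answer"], item["image"].shape)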

__getitem__(index: int)[source]

Returns a dictionary with the following keys:

{
    "index",
    "question",
    "answer",
    "question_family_index",
    "image_filename",
    "image_index",
    "image",
    "label",
}

Note that you can recover the program for an example by using the index:

index = item["index"][0]  #  first item of batch
program = clevr.questions[index]["program"]