Bottom-Up Top-Down Object Features

These visual features were introduced by Anderson et al. in the paper Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering.

They are extracted with a Faster R-CNN trained on the Visual Genome dataset to detect objects and their attributes (shapes, colors, …).

multimodal provides a class to download and use these features, pre-extracted on the COCO image dataset. They can be used for most Visual Question Answering and Captioning models that use this dataset for images.

from multimodal.features import COCOBottomUpFeatures

bottomup = COCOBottomUpFeatures(
    features="coco-bottomup-36",
    dir_data="/data/multimodal",
)
image_id = 13455
feats = bottomup[image_id]
print(feats.keys())
# ['image_w', 'image_h', 'num_boxes', 'boxes', 'features']
print(feats["features"].shape)  # numpy array
# (36, 2048)
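
As a hedged illustration of how these features can feed a model, the numpy arrays returned above can be converted to tensors; this is only a sketch assuming PyTorch is installed, and the downstream VQA or captioning model is not shown:

import torch

# Convert the numpy arrays to tensors before passing them to a model.
object_features = torch.from_numpy(feats["features"])  # shape (36, 2048)
boxes = torch.from_numpy(feats["boxes"])                # shape (36, 4)
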
class multimodal.features.COCOBottomUpFeatures(features: str, dir_data: Optional[str] = None)[source]

Bottom-up features for the COCO dataset.

Parameters
  • features (str) – one of [trainval2014_36, trainval2014, test2014_36, test2014, test2015_36, test2015]. Specifies the split and the number of detected objects. _36 means that 36 objects are detected in every image; otherwise, the number of objects varies per image, between 10 and 100, depending on a detection threshold (see the example below).

  • dir_data (str) – Directory where multimodal data will be downloaded. You need at least 60 GB of free space for downloading and extracting the features.
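
As an example of the parameters above, here is a hedged sketch instantiating the class with one of the listed splits (the data directory is arbitrary):

from multimodal.features import COCOBottomUpFeatures

# Train+val 2014 split, with a fixed 36 detected objects per image.
trainval_features = COCOBottomUpFeatures(
    features="trainval2014_36",
    dir_data="/data/multimodal",
)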

__getitem__(image_id: int)[source]

Get the features.

Parameters

image_id (str|int) – The id of the image in the COCO dataset.

Returns

A dictionary containing the following keys:

{
    'image_id': image id,
    'image_h': image height,
    'image_w': image width,
    'num_boxes': number of detected objects,
    'boxes': numpy array of shape (N, 4) containing bounding box coordinates,
    'features': numpy array of shape (N, 2048) containing object features,
}
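
As an illustration of how the returned keys fit together, the image size can be used to normalize the box coordinates, a common preprocessing step; this sketch assumes the boxes are absolute (x1, y1, x2, y2) pixel coordinates:

import numpy as np

feats = bottomup[image_id]
boxes = feats["boxes"]                          # (num_boxes, 4)
w, h = feats["image_w"], feats["image_h"]

# Scale coordinates to [0, 1] using the image width and height.
normalized_boxes = boxes / np.array([w, h, w, h], dtype=np.float32)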

keys()[source]

Returns

List of all available keys (image ids).

Return type

list
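
For example, keys() can be used to iterate over every image with precomputed features; a small sketch, assuming the keys are the COCO image ids accepted by __getitem__:

image_ids = bottomup.keys()
print(len(image_ids))                 # number of images with precomputed features
first_feats = bottomup[image_ids[0]]
print(first_feats["num_boxes"])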