Tokenizers

Base class

We implement a basic tokenizer class, built on the torchtext basic_english tokenizer. Its purpose is to transform text into a sequence of token ids (integers) that can then be fed to a word embedding.

class multimodal.text.BasicTokenizer(tokens: List[str] = [], sentences: List[str] = [], name: Optional[str] = None, pad_token='<pad>', unk_token='<unk>', padding_side='right', dir_data: Optional[str] = None)[source]

This class maps word tokens to token ids. Unknown tokens are mapped to a dedicated unknown token id. It will also pad the data.

Parameters
  • tokens (list) – Tokens to add to the dictionary. These can be tokens from pretrained vectors.

  • sentences (list) – List of sentences that need to be tokenized first, before building the vocab. Tokens from those sentences will be added to the vocabulary if they were not in it already.

  • name (str) – Name used to save the tokenizer. Use a different name when changing the tokens.

  • pad_token (str) – token used to pad the data.

  • unk_token (str) – token used for unknown words. The id is saved in the attribute unk_token_id.

  • padding_side (str) – Either “left” or “right”. Determines on which side of the sequence padding tokens are added.

  • dir_data (str) – directory to save multimodal data.
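The way the vocabulary is built from the `tokens` and `sentences` arguments can be sketched with plain Python dicts. This is an illustrative assumption about the behavior described above (pretrained tokens first, then new tokens discovered in sentences), not the library's actual implementation:

```python
# Illustrative sketch: build a vocabulary from pretrained tokens, then
# extend it with tokens found in the given sentences. Names and ordering
# here are assumptions for illustration only.
special_tokens = ["<pad>", "<unk>"]
tokens = ["hello", "world"]          # e.g. tokens from pretrained vectors
sentences = ["hello there"]          # sentences tokenized before building the vocab

vocab = {}
for tok in special_tokens + tokens:
    vocab.setdefault(tok, len(vocab))
for sent in sentences:
    for tok in sent.lower().split():  # stand-in for the basic_english tokenizer
        vocab.setdefault(tok, len(vocab))

print(vocab)
# {'<pad>': 0, '<unk>': 1, 'hello': 2, 'world': 3, 'there': 4}
```

Tokens already present in the vocabulary (here, “hello”) are not added twice, matching the description of the `sentences` parameter.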

tokenize(s, replace_unk=True, padding=True)[source]

This function will return the tokenized representation of the input. Example: tokenize(“Hello there”) will return [“hello”, “there”], assuming both words are in the vocabulary.

If a list of strings is given as input, this function adds padding tokens so that all outputs have the same length.

Parameters
  • s (str | List[str]) – Either a string or a list of strings to be tokenized.

  • replace_unk (bool) – If true, the tokenizer replaces unknown words with the unk token. Default: True

  • padding (bool) – Whether to add the padding token or not.
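The batch-padding behavior described above can be sketched in a few lines. The function name and the exact padding logic are assumptions for illustration; the library handles this internally:

```python
# Hypothetical sketch of padding a batch of tokenized sentences so that
# all sequences have the same length, on the configured side.
pad_token = "<pad>"

def pad_batch(batch, padding_side="right"):
    max_len = max(len(seq) for seq in batch)
    padded = []
    for seq in batch:
        pads = [pad_token] * (max_len - len(seq))
        # "right" appends padding; "left" prepends it.
        padded.append(seq + pads if padding_side == "right" else pads + seq)
    return padded

print(pad_batch([["what", "color"], ["hello"]]))
# [['what', 'color'], ['hello', '<pad>']]
```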

convert_tokens_to_ids(tokens)[source]

Converts a tokenized representation to token ids.

Parameters
  • tokens (list) – List of string tokens that will be converted to their token ids. If a token is missing from the vocabulary, it will be converted to self.unk_token_id. Padding tokens will be converted to self.pad_token_id.
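A minimal stand-in for this conversion, assuming a dict-backed vocabulary (an illustration, not the library's code): unknown tokens fall back to the unknown token id, and the pad token maps to its own id.

```python
# Hypothetical sketch of convert_tokens_to_ids: look each token up in the
# vocabulary, falling back to unk_token_id for out-of-vocabulary tokens.
vocab = {"<pad>": 0, "<unk>": 1, "hello": 2, "there": 3}
unk_token_id = vocab["<unk>"]

def convert_tokens_to_ids(tokens):
    return [vocab.get(tok, unk_token_id) for tok in tokens]

print(convert_tokens_to_ids(["hello", "there", "xyzzy", "<pad>"]))
# [2, 3, 1, 0]  ("xyzzy" is out of vocabulary, so it maps to unk_token_id)
```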

VQA v2

The pretrained tokenizer for VQA v2 is called pretrained-vqa2.

from multimodal.text import BasicTokenizer
tokenizer = BasicTokenizer.from_pretrained("pretrained-vqa2")
tokens = tokenizer("What color is the car?")
# feed tokens to model