Tokenizers

Base class

We implement a basic tokenizer class, built on the torchtext basic_english tokenizer. Its purpose is to transform text into a sequence of token ids (integers) that can then be fed to a word embedding.

class multimodal.text.BasicTokenizer(tokens: List[str] = [], sentences: List[str] = [], name: Optional[str] = None, pad_token='<pad>', unk_token='<unk>', padding_side='right', dir_data: Optional[str] = None)[source]

This class maps word tokens to token ids. Unknown tokens are mapped to a dedicated unknown token id. It will also pad the data.

Parameters
  • tokens (list) – Tokens to add to the dictionary. These can be tokens from pretrained vectors.

  • sentences (list) – List of sentences that need to be tokenized first, before building the vocab. Tokens from those sentences will be added to the vocabulary if they were not in it already.

  • name (str) – Name used to save the tokenizer. Use a different name when changing the tokens.

  • pad_token (str) – token used to pad the data.

  • unk_token (str) – token used for unknown words. The id is saved in the attribute unk_token_id.

  • padding_side (str) – Either “left” or “right”. Determines on which side of the sequence padding tokens are added.

  • dir_data (str) – directory to save multimodal data.
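The way the vocabulary is built from the `tokens` and `sentences` arguments can be sketched with plain Python dicts. This is an illustrative assumption about the behavior described above (pretrained tokens first, then new tokens discovered in sentences), not the library's actual implementation:

```python
# Illustrative sketch: build a vocabulary from pretrained tokens, then
# extend it with tokens found in the given sentences. Names and ordering
# here are assumptions for illustration only.
special_tokens = ["<pad>", "<unk>"]
tokens = ["hello", "world"]          # e.g. tokens from pretrained vectors
sentences = ["hello there"]          # sentences tokenized before building the vocab

vocab = {}
for tok in special_tokens + tokens:
    vocab.setdefault(tok, len(vocab))
for sent in sentences:
    for tok in sent.lower().split():  # stand-in for the basic_english tokenizer
        vocab.setdefault(tok, len(vocab))

print(vocab)
# {'<pad>': 0, '<unk>': 1, 'hello': 2, 'world': 3, 'there': 4}
```

Tokens already present in the vocabulary (here, “hello”) are not added twice, matching the description of the `sentences` parameter.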

tokenize(s, replace_unk=True, padding=True)[source]

This function will return the tokenized representation of the input. Example: tokenize(“Hello there”) will return [“hello”, “there”], assuming both words are in the vocabulary.

If a list of strings is given as input, this function adds padding tokens so that all outputs have the same length.

Parameters
  • s (str | List[str]) – Either a string or a list of strings to be tokenized.

  • replace_unk (bool) – If true, the tokenizer replaces unknown words with the unk token. Default: True

  • padding (bool) – Whether to add the padding token or not.
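The batch-padding behavior described above can be sketched in a few lines. The function name and the exact padding logic are assumptions for illustration; the library handles this internally:

```python
# Hypothetical sketch of padding a batch of tokenized sentences so that
# all sequences have the same length, on the configured side.
pad_token = "<pad>"

def pad_batch(batch, padding_side="right"):
    max_len = max(len(seq) for seq in batch)
    padded = []
    for seq in batch:
        pads = [pad_token] * (max_len - len(seq))
        # "right" appends padding; "left" prepends it.
        padded.append(seq + pads if padding_side == "right" else pads + seq)
    return padded

print(pad_batch([["what", "color"], ["hello"]]))
# [['what', 'color'], ['hello', '<pad>']]
```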

convert_tokens_to_ids(tokens)[source]

Converts a tokenized representation to token ids.

Parameters
  • tokens (list) – List of string tokens that will be converted to their token ids. If a token is missing from the vocabulary, it will be converted to self.unk_token_id. Padding tokens will be converted to self.pad_token_id.
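A minimal stand-in for this conversion, assuming a dict-backed vocabulary (an illustration, not the library's code): unknown tokens fall back to the unknown token id, and the pad token maps to its own id.

```python
# Hypothetical sketch of convert_tokens_to_ids: look each token up in the
# vocabulary, falling back to unk_token_id for out-of-vocabulary tokens.
vocab = {"<pad>": 0, "<unk>": 1, "hello": 2, "there": 3}
unk_token_id = vocab["<unk>"]

def convert_tokens_to_ids(tokens):
    return [vocab.get(tok, unk_token_id) for tok in tokens]

print(convert_tokens_to_ids(["hello", "there", "xyzzy", "<pad>"]))
# [2, 3, 1, 0]  ("xyzzy" is out of vocabulary, so it maps to unk_token_id)
```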

VQA v2

The pretrained tokenizer for VQA v2 is called pretrained-vqa2.

from multimodal.text import BasicTokenizer
tokenizer = BasicTokenizer.from_pretrained("pretrained-vqa2")
tokens = tokenizer("What color is the car?")
# feed tokens to model