The token indices are under the key `input_ids`:
```python
encoded_sequence = inputs["input_ids"]
print(encoded_sequence)
# Output:
# [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]
```
Note that the tokenizer automatically adds "special tokens" (if the associated model relies on them), which are reserved
IDs with a dedicated role for the model. Here they are the first and last IDs of the sequence, 101 and 102.
If we decode the previous sequence of ids,
```python
decoded_sequence = tokenizer.decode(encoded_sequence)
```
we will see
```python
print(decoded_sequence)
# Output:
# [CLS] A Titan RTX has 24GB of VRAM [SEP]
```
because this is how a `BertModel` expects its inputs.
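
To make the role of the special tokens concrete, here is a minimal pure-Python sketch of the wrapping step. It is only an illustration: `wrap_with_special_tokens` is a hypothetical helper, not part of the tokenizer's API, and the IDs 101 and 102 are simply read off the two ends of the encoded sequence above, where they decode to `[CLS]` and `[SEP]`.

```python
# Toy illustration only: `wrap_with_special_tokens` is a hypothetical helper,
# not a function provided by the tokenizer.
CLS_ID = 101  # first ID of the encoded sequence above, decoded as [CLS]
SEP_ID = 102  # last ID of the encoded sequence above, decoded as [SEP]


def wrap_with_special_tokens(token_ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Return a new list with the special-token IDs added at both ends."""
    return [cls_id] + list(token_ids) + [sep_id]


# IDs of the sentence itself: the encoded sequence above without its two ends.
sentence_ids = [138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107]

encoded = wrap_with_special_tokens(sentence_ids)
print(encoded)
```

The result matches the `input_ids` shown earlier, which is why decoding them yields the sentence surrounded by `[CLS]` and `[SEP]`.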
L
labels
The labels are an optional argument that can be passed so that the model computes the loss itself.
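
The following sketch shows the idea in plain Python. It is not the library's implementation: `toy_forward` is a hypothetical stand-in for a model's forward pass, and the choice of a classification task with mean cross-entropy loss is an assumption made for illustration.

```python
import math


def toy_forward(logits, labels=None):
    """Hypothetical stand-in for a model's forward pass (not a library API).

    `logits` holds one list of class scores per example. When `labels` is
    given, the mean cross-entropy loss is returned alongside the logits;
    otherwise the loss is None.
    """
    loss = None
    if labels is not None:
        total = 0.0
        for scores, label in zip(logits, labels):
            # Numerically stable log-sum-exp over the class scores.
            peak = max(scores)
            log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
            # Cross-entropy for one example: -log softmax(scores)[label].
            total += log_norm - scores[label]
        loss = total / len(labels)
    return loss, logits


logits = [[2.0, 0.5], [0.1, 1.5]]

loss_without_labels, _ = toy_forward(logits)
loss_with_labels, _ = toy_forward(logits, labels=[0, 1])

print(loss_without_labels)  # None
print(loss_with_labels)
```

Without labels the caller gets only the raw scores and must compute any loss separately; with labels the loss comes back from the same call.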