|
The |
|
token indices are under the key input_ids: |
|
thon |
|
|
|
encoded_sequence = inputs["input_ids"] |
|
print(encoded_sequence) |
|
[101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102] |
|
|
|
Note that the tokenizer automatically adds "special tokens" (if the associated model relies on them) which are special |
|
IDs the model sometimes uses. |
|
If we decode the previous sequence of ids, |
|
thon |
|
|
|
decoded_sequence = tokenizer.decode(encoded_sequence) |
|
|
|
we will see |
|
thon |
|
|
|
print(decoded_sequence) |
|
[CLS] A Titan RTX has 24GB of VRAM [SEP] |
|
|
|
because this is the way a [BertModel] is going to expect its inputs. |
|
L |
|
labels |
|
The labels are an optional argument which can be passed in order for the model to compute the loss itself. |