Here, for instance, "VRAM" wasn't in the model's vocabulary, so it's been split into "V", "RA" and "M". To indicate that those tokens are not separate words but parts of the same word, a double-hash prefix
is added to "RA" and "M":
```python
print(tokenized_sequence)
['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
```
These tokens can then be converted into IDs that the model can understand.
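
As a minimal sketch of that step, assuming the `bert-base-cased` checkpoint (which produces the WordPiece split shown above), the tokenizer's `convert_tokens_to_ids` method maps each token string to its index in the vocabulary:

```python
from transformers import AutoTokenizer

# Assumed checkpoint for illustration; any WordPiece-based model behaves similarly.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Split the sentence into tokens, with "##" marking word continuations.
tokenized_sequence = tokenizer.tokenize("A Titan RTX has 24GB of VRAM")

# Map each token string to its integer ID in the model's vocabulary.
ids = tokenizer.convert_tokens_to_ids(tokenized_sequence)
print(ids)
```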