Tokenizers documentation
Input Sequences
You are viewing the documentation for the main version, which requires installation from source. If you'd like
a regular pip install, check out the latest stable version (v0.20.3).
These types represent all the different kinds of sequences that can be used as input to a Tokenizer.
In general, any sequence can be either a string or a list of strings, depending on the operating
mode of the tokenizer: raw text or pre-tokenized.
TextInputSequence
tokenizers.TextInputSequence A str that represents an input sequence
PreTokenizedInputSequence
tokenizers.PreTokenizedInputSequence A pre-tokenized input sequence. Can be one of:
- A List of str
- A Tuple of str
alias of Union[List[str], Tuple[str]].
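As a rough illustration of the difference between the two aliases above, the sketch below recreates them locally with the standard `typing` module and adds a small runtime check. The helper `is_pretokenized_sequence` is hypothetical and not part of the tokenizers API; the aliases are local stand-ins that only mirror the documented definitions.

```python
from typing import List, Tuple, Union

# Local stand-ins that mirror the documented aliases.
# They are defined here for illustration, not imported from tokenizers.
TextInputSequence = str
PreTokenizedInputSequence = Union[List[str], Tuple[str]]


def is_pretokenized_sequence(sequence: object) -> bool:
    """Hypothetical helper: True if `sequence` is a list or tuple of str."""
    return isinstance(sequence, (list, tuple)) and all(
        isinstance(item, str) for item in sequence
    )


print(is_pretokenized_sequence(["Hello", "world"]))  # list of str
print(is_pretokenized_sequence(("Hello", "world")))  # tuple of str
print(is_pretokenized_sequence("Hello world"))       # raw text, a plain str
```

A plain `str` is deliberately rejected by the helper: it is a valid TextInputSequence, but not a pre-tokenized one.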
InputSequence
tokenizers.InputSequence Represents all the possible types of input sequences for encoding. Can be:
- When is_pretokenized=False: TextInputSequence
- When is_pretokenized=True: PreTokenizedInputSequence
alias of Union[str, List[str], Tuple[str]].
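The two modes can be sketched with a tiny in-memory tokenizer. This is a minimal example, assuming the `tokenizers` package is installed; the three-entry vocabulary and the choice of a `WordLevel` model with a `Whitespace` pre-tokenizer are made up for illustration and are not prescribed by this page.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Toy vocabulary, invented for this example only.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Raw text mode: the input is a TextInputSequence (a plain str).
raw = tokenizer.encode("hello world")

# Pre-tokenized mode: the input is a PreTokenizedInputSequence (a list of str here).
pre = tokenizer.encode(["hello", "world"], is_pretokenized=True)

print(raw.tokens, raw.ids)
print(pre.tokens, pre.ids)
```

With this toy setup both calls should produce the same tokens, since the raw string is split on whitespace into the same two words that the pre-tokenized list already provides.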