For example, the BERT model builds its two-sequence input as follows:
```python
[CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
```
We can use our tokenizer to automatically generate such an input by passing the two sequences to `tokenizer` as two separate arguments (and not as a list, like before):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-cased")

sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"

# Passing the two sequences as separate arguments builds a single paired input
encoded_dict = tokenizer(sequence_a, sequence_b)
decoded = tokenizer.decode(encoded_dict["input_ids"])
```
Printing the decoded string will return:
```python
print(decoded)
[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
```
This is enough for some models to understand where one sequence ends and where another begins.
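For BERT, the tokenizer also makes this segment information explicit through the `token_type_ids` it returns alongside the `input_ids`. As a quick illustration, continuing from `encoded_dict` in the example above:

```python
# token_type_ids mark which segment each token belongs to:
# 0 for sequence A (including [CLS] and the first [SEP]), 1 for sequence B
print(encoded_dict["token_type_ids"])
# e.g. [0, 0, ..., 0, 1, 1, ..., 1]
```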