shirayukikun committed on

Commit 5b88d12 · verified · 1 Parent(s): 825a110

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,3 +1,301 @@
- ---
- license: apache-2.0
- ---

---
license: apache-2.0
language:
- ja
---

(The English section follows the Japanese one.)

# byGPT-JP-multi-lm-head 6.5B alpha

バイト単位のtokenizerを採用した日本語言語モデルです。
一度に4 tokens (bytes) ずつ予測するために,複数のlmヘッドを持つアーキテクチャを採用しています。
また,multi-byte predictionに適した独自のUnicode encodingを採用しています。
現在開発段階のモデルであり,十分な性能には達していません。

## 利用方法

[transformers version 4.56.1](https://github.com/huggingface/transformers/releases/tag/v4.56.1) において、動作確認しています。
他のバージョンでは動作しない可能性があります。

```python
import argparse

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


SAMPLE_INPUT_TEXTS = [
    "日本三景一覧:\n1. 広島県, 宮島\n2. 京都府, 天橋立\n3. 宮城県, ",
    "原文: I like to play soccer. 訳文: 私はサッカーをするのが好きです。\n原文: She enjoys reading books. 訳文: 彼女は本を読むのが好きです。\n原文: They went to the park. 訳文:",
]

def main(args):
    torch.manual_seed(args.seed)
    device = torch.device("cuda")

    tokenizer = AutoTokenizer.from_pretrained(
        args.model_name_or_path,
        trust_remote_code=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        args.model_name_or_path,
        dtype=torch.bfloat16,
        trust_remote_code=True,
    )
    model.to(device)
    model.eval()

    input_texts = [f"{tokenizer.bos_token}{text}" for text in SAMPLE_INPUT_TEXTS]
    batch = tokenizer(
        input_texts, return_tensors="pt", padding="longest", add_special_tokens=False
    )
    batch = batch.to(device)
    decoded_ids = model.generate(
        input_ids=batch.input_ids,
        attention_mask=batch.attention_mask,
        eos_token_id=[tokenizer.encode("\n", add_special_tokens=False)],
        pad_token_id=tokenizer.pad_token_id,
        max_new_tokens=args.max_new_tokens,
        do_sample=False,
    )

    decoded_texts = tokenizer.batch_decode(decoded_ids, skip_special_tokens=False)
    for text in decoded_texts:
        print("===")
        print(f"Decoded: {text}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument(
        "--model_name_or_path",
        "-m",
        type=str,
        default="tohoku-nlp/bygpt-jp-multi-lm-head-6.5B-alpha",
        help="Path to the model or model identifier from huggingface.co/models."
    )
    parser.add_argument("--max_new_tokens", "-n", type=int, help="Maximum number of new tokens to generate.", default=160)
    parser.add_argument("--seed", "-s", type=int, help="Random seed", default=42)
    args = parser.parse_args()
    main(args)

```

### 利用上の注意点
本モデルは,一度に4 bytes (tokens) ずつ予測するため,特殊tokenも複数トークン (bytes) で構成されています。
そのため,例えば `tokenizer.eos_token` はlist of intです。
また,generate関数は `custom_generate` の機能により実装されており,利用可能な機能に制限があります。
また,このモデルはinstruction tuning等を実施していません。

## モデルアーキテクチャ

[Llama](https://arxiv.org/abs/2302.13971) アーキテクチャをベースとしています。
具体的には、以下のモジュールを採用しています。

- [SwiGLU](https://arxiv.org/abs/2002.05202)
- [Rotary Positional Embeddings (RoPE)](https://arxiv.org/abs/2104.09864)
- [Grouped Query Attention (GQA)](https://aclanthology.org/2023.emnlp-main.298/)

また,4 tokens (bytes) ずつ予測するため,以下を追加しています。

- 4つのlmヘッド
- 入力のembeddingを4 tokenごとにマージするモジュール

## 学習データ

[llm-jp-corpus-v3](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) の日本語コーパスのサブセット (ja_cc, ja_warp_html, ja_warp_pdf, ja_wiki, kaken) を使用しました。

### 学習設定

| | tohoku-nlp/bygpt-jp-multi-lm-head-6.5B-alpha |
| ---- | ---- |
| Training Steps | 208,000 |
| Batch Size (tokens) | 5,898,240 |
| Max Learning Rate | 5.0E-4 |
| Min Learning Rate | 1.0E-5 |
| Learning Rate Warmup Steps | 2,000 |
| Scheduler | cosine |
| Optimizer | AdamW |
| Optimizer Config | beta_1 = 0.9, beta_2 = 0.999, eps = 1.0E-8 |
| Weight Decay | 0.01 |
| Gradient Clipping | 1.0 |
| Sequence Length | 11,520 |

学習には[Megatron-LM](https://arxiv.org/abs/1909.08053)をベースに,独自の変更を加えたコードベースを使用しています。

## ライセンス

このモデルは Apache License 2.0 の下で配布しています。

# 免責事項

本モデルの作者は本モデルを作成するにあたって、その内容、機能等について細心の注意を払っておりますが、モデルの出力が正確であるかどうか、安全なものであるか等について保証をするものではなく、何らの責任を負うものではありません。
本モデルの利用により、万一、利用者に何らかの不都合や損害が発生したとしても、モデルやデータセットの作者や作者の所属組織は何らの責任を負うものではありません。

## 謝辞

このモデルの学習にあたり様々な面でご協力いただきました [Tohoku NLP Group](https://www.nlp.ecei.tohoku.ac.jp/) の皆様に感謝いたします。

## 作成者
- [Keito Kudo](https://x.com/k8kudo)
- [Go Kamoda](https://x.com/go2oo2)
- [Daiki Shiono](https://x.com/onely7_deep)
- [Jun Suzuki](https://x.com/drJunSuzuki)


<br>
<br>
<br>
<br>

# byGPT-JP-multi-lm-head 6.5B alpha

This is a Japanese language model that uses a byte-level tokenizer.
It adopts an architecture with multiple LM heads that predict 4 tokens (bytes) at once,
together with a custom Unicode encoding suited to multi-byte prediction.
The model is still at the development stage and has not yet reached sufficient performance.

## Usage

We have confirmed that the model works with [transformers version 4.56.1](https://github.com/huggingface/transformers/releases/tag/v4.56.1).
It may not work with other versions.

```python
import argparse

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


SAMPLE_INPUT_TEXTS = [
    "日本三景一覧:\n1. 広島県, 宮島\n2. 京都府, 天橋立\n3. 宮城県, ",
    "原文: I like to play soccer. 訳文: 私はサッカーをするのが好きです。\n原文: She enjoys reading books. 訳文: 彼女は本を読むのが好きです。\n原文: They went to the park. 訳文:",
]

def main(args):
    torch.manual_seed(args.seed)
    device = torch.device("cuda")

    tokenizer = AutoTokenizer.from_pretrained(
        args.model_name_or_path,
        trust_remote_code=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        args.model_name_or_path,
        dtype=torch.bfloat16,
        trust_remote_code=True,
    )
    model.to(device)
    model.eval()

    input_texts = [f"{tokenizer.bos_token}{text}" for text in SAMPLE_INPUT_TEXTS]
    batch = tokenizer(
        input_texts, return_tensors="pt", padding="longest", add_special_tokens=False
    )
    batch = batch.to(device)
    decoded_ids = model.generate(
        input_ids=batch.input_ids,
        attention_mask=batch.attention_mask,
        eos_token_id=[tokenizer.encode("\n", add_special_tokens=False)],
        pad_token_id=tokenizer.pad_token_id,
        max_new_tokens=args.max_new_tokens,
        do_sample=False,
    )

    decoded_texts = tokenizer.batch_decode(decoded_ids, skip_special_tokens=False)
    for text in decoded_texts:
        print("===")
        print(f"Decoded: {text}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument(
        "--model_name_or_path",
        "-m",
        type=str,
        default="tohoku-nlp/bygpt-jp-multi-lm-head-6.5B-alpha",
        help="Path to the model or model identifier from huggingface.co/models."
    )
    parser.add_argument("--max_new_tokens", "-n", type=int, help="Maximum number of new tokens to generate.", default=160)
    parser.add_argument("--seed", "-s", type=int, help="Random seed", default=42)
    args = parser.parse_args()
    main(args)

```

### Important Notes for Usage
Since this model predicts 4 bytes (tokens) at once, its special tokens are also composed of multiple tokens (bytes).
As a consequence, `tokenizer.eos_token`, for example, is a list of ints rather than a string.
The `generate` function is implemented through the `custom_generate` mechanism, so only a limited set of generation features is available.
Additionally, this model has not undergone instruction tuning.
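
To make the note above concrete, the sketch below shows one way to pass multi-byte stopping sequences to `generate`, mirroring the `eos_token_id=[tokenizer.encode("\n", add_special_tokens=False)]` call in the usage example. It assumes `tokenizer`, `model`, and `batch` have already been created as in that example; the stop strings are arbitrary illustrations, not recommendations.

```python
# Each stop sequence is a list of byte ids; generation halts once the output
# ends with any of them. This assumes the chosen strings encode to sequences
# of equal length (the model works in fixed 4-byte steps).
stop_strings = ["\n", "。"]
stop_sequences = [tokenizer.encode(s, add_special_tokens=False) for s in stop_strings]

decoded_ids = model.generate(
    input_ids=batch.input_ids,
    attention_mask=batch.attention_mask,
    eos_token_id=stop_sequences,         # multiple multi-byte stop sequences
    pad_token_id=tokenizer.pad_token_id,
    max_new_tokens=64,
    do_sample=False,
)
```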

## Model Architecture

The model is based on the [Llama](https://arxiv.org/abs/2302.13971) architecture.
Specifically, it adopts the following modules:

- [SwiGLU](https://arxiv.org/abs/2002.05202)
- [Rotary Positional Embeddings (RoPE)](https://arxiv.org/abs/2104.09864)
- [Grouped Query Attention (GQA)](https://aclanthology.org/2023.emnlp-main.298/)

In addition, to predict 4 tokens (bytes) at once, we have added (see the sketch below):

- 4 LM heads
- A module that merges the input embeddings of every 4 tokens
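
The sketch below illustrates the idea behind these two additions. It is not the repository's actual implementation (see `modeling_byllama_patch.py` for that); the shapes and the `"mean"` aggregation follow `config.json` (`hidden_size=4096`, `num_lm_heads=4`, `output_vocab_size=256`, `embedding_aggregator_type="mean"`).

```python
import torch
import torch.nn as nn

hidden_size, num_lm_heads, vocab_size = 4096, 4, 256

embed = nn.Embedding(vocab_size, hidden_size)  # byte-level input embeddings
lm_heads = nn.ModuleList(
    [nn.Linear(hidden_size, vocab_size, bias=False) for _ in range(num_lm_heads)]
)

byte_ids = torch.randint(0, vocab_size, (1, 16))  # 16 bytes -> 4 backbone positions

# 1) Merge the embeddings of every 4 consecutive bytes into one position
#    ("mean" aggregator), so the decoder runs at 1/4 of the byte length.
merged = embed(byte_ids).view(1, -1, num_lm_heads, hidden_size).mean(dim=2)

hidden = merged  # stand-in for the Llama-style decoder stack

# 2) Predict the next 4 bytes of each position with 4 separate LM heads.
logits = torch.stack([head(hidden) for head in lm_heads], dim=2)
print(logits.shape)  # torch.Size([1, 4, 4, 256]): positions x byte slots x byte vocab
```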

## Training Data

We used a subset of the Japanese corpus from [llm-jp-corpus-v3](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) (ja_cc, ja_warp_html, ja_warp_pdf, ja_wiki, kaken).

### Training Configuration

| | tohoku-nlp/bygpt-jp-multi-lm-head-6.5B-alpha |
| ---- | ---- |
| Training Steps | 170,000 |
| Batch Size (tokens) | 5,898,240 |
| Max Learning Rate | 5.0E-4 |
| Min Learning Rate | 1.0E-5 |
| Learning Rate Warmup Steps | 2,000 |
| Scheduler | cosine |
| Optimizer | AdamW |
| Optimizer Config | beta_1 = 0.9, beta_2 = 0.999, eps = 1.0E-8 |
| Weight Decay | 0.01 |
| Gradient Clipping | 1.0 |
| Sequence Length | 11,520 |
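
(For reference, 5,898,240 tokens per step at a sequence length of 11,520 corresponds to 512 sequences per optimization step.)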
281
+
282
+ For training, we used a codebase based on [Megatron-LM](https://arxiv.org/abs/1909.08053) with our own custom modifications.
283
+
284
+ ## License
285
+
286
+ This model is distributed under the Apache License 2.0.
287
+
288
+ # Disclaimer
289
+
290
+ While the authors of this model have paid careful attention to its content and functionality during creation, we do not guarantee that the model's outputs are accurate or safe, and we assume no responsibility for them.
291
+ Even if users experience any inconvenience or damage due to the use of this model, the authors of the model and dataset and their affiliated organizations assume no responsibility.
292
+
293
+ ## Acknowledgments
294
+
295
+ We thank all members of the [Tohoku NLP Group](https://www.nlp.ecei.tohoku.ac.jp/) who cooperated with us in various aspects of training this model.
296
+
297
+ ## Authors
298
+ - [Keito Kudo](https://x.com/k8kudo)
299
+ - [Go Kamoda](https://x.com/go2oo2)
300
+ - [Daiki Shiono](https://x.com/onely7_deep)
301
+ - [Jun Suzuki](https://x.com/drJunSuzuki)
added_tokens.json ADDED
@@ -0,0 +1,356 @@
1
+ {
2
+ "<|begin_of_text|>": [
3
+ 194,
4
+ 128,
5
+ 64,
6
+ 0
7
+ ],
8
+ "<|cls|>": [
9
+ 195,
10
+ 128,
11
+ 64,
12
+ 0
13
+ ],
14
+ "<|end_header_id|>": [
15
+ 202,
16
+ 128,
17
+ 64,
18
+ 0
19
+ ],
20
+ "<|end_of_role|>": [
21
+ 203,
22
+ 128,
23
+ 64,
24
+ 0
25
+ ],
26
+ "<|end_of_text|>": [
27
+ 193,
28
+ 128,
29
+ 64,
30
+ 0
31
+ ],
32
+ "<|extra_id_0|>": [
33
+ 204,
34
+ 128,
35
+ 64,
36
+ 0
37
+ ],
38
+ "<|extra_id_10|>": [
39
+ 214,
40
+ 128,
41
+ 64,
42
+ 0
43
+ ],
44
+ "<|extra_id_11|>": [
45
+ 215,
46
+ 128,
47
+ 64,
48
+ 0
49
+ ],
50
+ "<|extra_id_12|>": [
51
+ 216,
52
+ 128,
53
+ 64,
54
+ 0
55
+ ],
56
+ "<|extra_id_13|>": [
57
+ 217,
58
+ 128,
59
+ 64,
60
+ 0
61
+ ],
62
+ "<|extra_id_14|>": [
63
+ 218,
64
+ 128,
65
+ 64,
66
+ 0
67
+ ],
68
+ "<|extra_id_15|>": [
69
+ 219,
70
+ 128,
71
+ 64,
72
+ 0
73
+ ],
74
+ "<|extra_id_16|>": [
75
+ 220,
76
+ 128,
77
+ 64,
78
+ 0
79
+ ],
80
+ "<|extra_id_17|>": [
81
+ 221,
82
+ 128,
83
+ 64,
84
+ 0
85
+ ],
86
+ "<|extra_id_18|>": [
87
+ 222,
88
+ 128,
89
+ 64,
90
+ 0
91
+ ],
92
+ "<|extra_id_19|>": [
93
+ 223,
94
+ 128,
95
+ 64,
96
+ 0
97
+ ],
98
+ "<|extra_id_1|>": [
99
+ 205,
100
+ 128,
101
+ 64,
102
+ 0
103
+ ],
104
+ "<|extra_id_20|>": [
105
+ 224,
106
+ 128,
107
+ 64,
108
+ 0
109
+ ],
110
+ "<|extra_id_21|>": [
111
+ 225,
112
+ 128,
113
+ 64,
114
+ 0
115
+ ],
116
+ "<|extra_id_22|>": [
117
+ 226,
118
+ 128,
119
+ 64,
120
+ 0
121
+ ],
122
+ "<|extra_id_23|>": [
123
+ 227,
124
+ 128,
125
+ 64,
126
+ 0
127
+ ],
128
+ "<|extra_id_24|>": [
129
+ 228,
130
+ 128,
131
+ 64,
132
+ 0
133
+ ],
134
+ "<|extra_id_25|>": [
135
+ 229,
136
+ 128,
137
+ 64,
138
+ 0
139
+ ],
140
+ "<|extra_id_26|>": [
141
+ 230,
142
+ 128,
143
+ 64,
144
+ 0
145
+ ],
146
+ "<|extra_id_27|>": [
147
+ 231,
148
+ 128,
149
+ 64,
150
+ 0
151
+ ],
152
+ "<|extra_id_28|>": [
153
+ 232,
154
+ 128,
155
+ 64,
156
+ 0
157
+ ],
158
+ "<|extra_id_29|>": [
159
+ 233,
160
+ 128,
161
+ 64,
162
+ 0
163
+ ],
164
+ "<|extra_id_2|>": [
165
+ 206,
166
+ 128,
167
+ 64,
168
+ 0
169
+ ],
170
+ "<|extra_id_30|>": [
171
+ 234,
172
+ 128,
173
+ 64,
174
+ 0
175
+ ],
176
+ "<|extra_id_31|>": [
177
+ 235,
178
+ 128,
179
+ 64,
180
+ 0
181
+ ],
182
+ "<|extra_id_32|>": [
183
+ 236,
184
+ 128,
185
+ 64,
186
+ 0
187
+ ],
188
+ "<|extra_id_33|>": [
189
+ 237,
190
+ 128,
191
+ 64,
192
+ 0
193
+ ],
194
+ "<|extra_id_34|>": [
195
+ 238,
196
+ 128,
197
+ 64,
198
+ 0
199
+ ],
200
+ "<|extra_id_35|>": [
201
+ 239,
202
+ 128,
203
+ 64,
204
+ 0
205
+ ],
206
+ "<|extra_id_36|>": [
207
+ 240,
208
+ 128,
209
+ 64,
210
+ 0
211
+ ],
212
+ "<|extra_id_37|>": [
213
+ 241,
214
+ 128,
215
+ 64,
216
+ 0
217
+ ],
218
+ "<|extra_id_38|>": [
219
+ 242,
220
+ 128,
221
+ 64,
222
+ 0
223
+ ],
224
+ "<|extra_id_39|>": [
225
+ 243,
226
+ 128,
227
+ 64,
228
+ 0
229
+ ],
230
+ "<|extra_id_3|>": [
231
+ 207,
232
+ 128,
233
+ 64,
234
+ 0
235
+ ],
236
+ "<|extra_id_40|>": [
237
+ 244,
238
+ 128,
239
+ 64,
240
+ 0
241
+ ],
242
+ "<|extra_id_41|>": [
243
+ 245,
244
+ 128,
245
+ 64,
246
+ 0
247
+ ],
248
+ "<|extra_id_42|>": [
249
+ 246,
250
+ 128,
251
+ 64,
252
+ 0
253
+ ],
254
+ "<|extra_id_43|>": [
255
+ 247,
256
+ 128,
257
+ 64,
258
+ 0
259
+ ],
260
+ "<|extra_id_44|>": [
261
+ 248,
262
+ 128,
263
+ 64,
264
+ 0
265
+ ],
266
+ "<|extra_id_45|>": [
267
+ 249,
268
+ 128,
269
+ 64,
270
+ 0
271
+ ],
272
+ "<|extra_id_46|>": [
273
+ 250,
274
+ 128,
275
+ 64,
276
+ 0
277
+ ],
278
+ "<|extra_id_4|>": [
279
+ 208,
280
+ 128,
281
+ 64,
282
+ 0
283
+ ],
284
+ "<|extra_id_5|>": [
285
+ 209,
286
+ 128,
287
+ 64,
288
+ 0
289
+ ],
290
+ "<|extra_id_6|>": [
291
+ 210,
292
+ 128,
293
+ 64,
294
+ 0
295
+ ],
296
+ "<|extra_id_7|>": [
297
+ 211,
298
+ 128,
299
+ 64,
300
+ 0
301
+ ],
302
+ "<|extra_id_8|>": [
303
+ 212,
304
+ 128,
305
+ 64,
306
+ 0
307
+ ],
308
+ "<|extra_id_9|>": [
309
+ 213,
310
+ 128,
311
+ 64,
312
+ 0
313
+ ],
314
+ "<|mask|>": [
315
+ 197,
316
+ 128,
317
+ 64,
318
+ 0
319
+ ],
320
+ "<|pad|>": [
321
+ 192,
322
+ 128,
323
+ 64,
324
+ 0
325
+ ],
326
+ "<|sep|>": [
327
+ 196,
328
+ 128,
329
+ 64,
330
+ 0
331
+ ],
332
+ "<|start_header_id|>": [
333
+ 201,
334
+ 128,
335
+ 64,
336
+ 0
337
+ ],
338
+ "<|vision_br|>": [
339
+ 199,
340
+ 128,
341
+ 64,
342
+ 0
343
+ ],
344
+ "<|vision_end|>": [
345
+ 200,
346
+ 128,
347
+ 64,
348
+ 0
349
+ ],
350
+ "<|vision_start|>": [
351
+ 198,
352
+ 128,
353
+ 64,
354
+ 0
355
+ ]
356
+ }
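
All of the special-token entries above follow the same pattern: four byte ids of the form `[x, 128, 64, 0]`, with only the first byte differing between tokens. This appears consistent with `custom_generate/generate.py` below, which adds the offsets (192, 128, 64, 0) back onto the four byte slots of each predicted position. The snippet below is an observation-based sketch of that pattern, not documented behavior; the helper name is hypothetical.

```python
# Hypothetical helper: strip the per-slot offsets that the generation code adds
# back to raw byte predictions (next_tokens[:, 0::4] += 192, etc.).
SLOT_OFFSETS = (192, 128, 64, 0)

def strip_slot_offsets(token_ids):
    """Subtract the per-slot offsets from a multiple-of-4 byte-id sequence."""
    assert len(token_ids) % 4 == 0
    return [b - SLOT_OFFSETS[i % 4] for i, b in enumerate(token_ids)]

print(strip_slot_offsets([194, 128, 64, 0]))  # <|begin_of_text|> -> [2, 0, 0, 0]
```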
config.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "_name_or_path": "None",
3
+ "architectures": [
4
+ "ByLlamaPatchForCausalLM"
5
+ ],
6
+ "attention_bias": false,
7
+ "attention_dropout": 0.0,
8
+ "auto_map": {
9
+ "AutoConfig": "configuration_byllama_patch.ByLlamaPatchConfig",
10
+ "AutoModel": "modeling_byllama_patch.ByLlamaPatchModel",
11
+ "AutoModelForCausalLM": "modeling_byllama_patch.ByLlamaPatchForCausalLM"
12
+ },
13
+ "bos_token_id": 1,
14
+ "embedding_aggregator_type": "mean",
15
+ "eos_token_id": 2,
16
+ "head_dim": 128,
17
+ "hidden_act": "silu",
18
+ "hidden_size": 4096,
19
+ "initializer_range": 0.02,
20
+ "input_embedding_dim": 4096,
21
+ "intermediate_size": 13312,
22
+ "max_position_embeddings": 5760,
23
+ "mlp_bias": false,
24
+ "model_type": "byllama_patch",
25
+ "num_attention_heads": 32,
26
+ "num_hidden_layers": 32,
27
+ "num_key_value_heads": 8,
28
+ "num_lm_heads": 4,
29
+ "output_vocab_size": 256,
30
+ "pretraining_tp": 1,
31
+ "qkv_bias": false,
32
+ "rms_norm_eps": 1e-05,
33
+ "rope_scaling": null,
34
+ "rope_theta": 10000.0,
35
+ "tie_word_embeddings": false,
36
+ "torch_dtype": "float32",
37
+ "transformers_version": "4.46.3",
38
+ "use_cache": true,
39
+ "vocab_size": 256
40
+ }
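
Because of the `auto_map` entries above, the custom configuration class is resolved from this repository when loading with `trust_remote_code=True`. A minimal sketch (the printed values simply echo the config shown above):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "tohoku-nlp/bygpt-jp-multi-lm-head-6.5B-alpha",
    trust_remote_code=True,
)
print(config.model_type)     # "byllama_patch"
print(config.num_lm_heads)   # 4 byte-level LM heads per position
print(config.vocab_size)     # 256 -- one id per byte value
```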
configuration_byllama_patch.py ADDED
@@ -0,0 +1,218 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ """LLaMA model configuration"""
21
+
22
+ from transformers.configuration_utils import PretrainedConfig
23
+ from transformers.modeling_rope_utils import rope_config_validation
24
+
25
+
26
+ class ByLlamaPatchConfig(PretrainedConfig):
27
+ r"""
28
+ This is the configuration class to store the configuration of a [`LlamaModel`]. It is used to instantiate an LLaMA
29
+ model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
30
+ defaults will yield a similar configuration to that of the LLaMA-7B.
31
+
32
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
33
+ documentation from [`PretrainedConfig`] for more information.
34
+
35
+
36
+ Args:
37
+ vocab_size (`int`, *optional*, defaults to 32000):
38
+ Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
39
+ `inputs_ids` passed when calling [`LlamaModel`]
40
+ hidden_size (`int`, *optional*, defaults to 4096):
41
+ Dimension of the hidden representations.
42
+ intermediate_size (`int`, *optional*, defaults to 11008):
43
+ Dimension of the MLP representations.
44
+ num_hidden_layers (`int`, *optional*, defaults to 32):
45
+ Number of hidden layers in the Transformer decoder.
46
+ num_attention_heads (`int`, *optional*, defaults to 32):
47
+ Number of attention heads for each attention layer in the Transformer decoder.
48
+ num_key_value_heads (`int`, *optional*):
49
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
50
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
51
+ `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
52
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
53
+ by meanpooling all the original heads within that group. For more details checkout [this
54
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
55
+ `num_attention_heads`.
56
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
57
+ The non-linear activation function (function or string) in the decoder.
58
+ max_position_embeddings (`int`, *optional*, defaults to 2048):
59
+ The maximum sequence length that this model might ever be used with. Llama 1 supports up to 2048 tokens,
60
+ Llama 2 up to 4096, CodeLlama up to 16384.
61
+ initializer_range (`float`, *optional*, defaults to 0.02):
62
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
63
+ rms_norm_eps (`float`, *optional*, defaults to 1e-06):
64
+ The epsilon used by the rms normalization layers.
65
+ use_cache (`bool`, *optional*, defaults to `True`):
66
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
67
+ relevant if `config.is_decoder=True`.
68
+ pad_token_id (`int`, *optional*):
69
+ Padding token id.
70
+ bos_token_id (`int`, *optional*, defaults to 1):
71
+ Beginning of stream token id.
72
+ eos_token_id (`int`, *optional*, defaults to 2):
73
+ End of stream token id.
74
+ pretraining_tp (`int`, *optional*, defaults to 1):
75
+ Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
76
+ document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to
77
+ understand more about it. This value is necessary to ensure exact reproducibility of the pretraining
78
+ results. Please refer to [this issue](https://github.com/pytorch/pytorch/issues/76232).
79
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
80
+ Whether to tie weight embeddings
81
+ rope_theta (`float`, *optional*, defaults to 10000.0):
82
+ The base period of the RoPE embeddings.
83
+ rope_scaling (`Dict`, *optional*):
84
+ Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type
85
+ and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value
86
+ accordingly.
87
+ Expected contents:
88
+ `rope_type` (`str`):
89
+ The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
90
+ 'llama3'], with 'default' being the original RoPE implementation.
91
+ `factor` (`float`, *optional*):
92
+ Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
93
+ most scaling types, a `factor` of x will enable the model to handle sequences of length x *
94
+ original maximum pre-trained length.
95
+ `original_max_position_embeddings` (`int`, *optional*):
96
+ Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
97
+ pretraining.
98
+ `attention_factor` (`float`, *optional*):
99
+ Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
100
+ computation. If unspecified, it defaults to value recommended by the implementation, using the
101
+ `factor` field to infer the suggested value.
102
+ `beta_fast` (`float`, *optional*):
103
+ Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
104
+ ramp function. If unspecified, it defaults to 32.
105
+ `beta_slow` (`float`, *optional*):
106
+ Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
107
+ ramp function. If unspecified, it defaults to 1.
108
+ `short_factor` (`List[float]`, *optional*):
109
+ Only used with 'longrope'. The scaling factor to be applied to short contexts (<
110
+ `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
111
+ size divided by the number of attention heads divided by 2
112
+ `long_factor` (`List[float]`, *optional*):
113
+ Only used with 'longrope'. The scaling factor to be applied to long contexts (<
114
+ `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
115
+ size divided by the number of attention heads divided by 2
116
+ `low_freq_factor` (`float`, *optional*):
117
+ Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE
118
+ `high_freq_factor` (`float`, *optional*):
119
+ Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
120
+ attention_bias (`bool`, *optional*, defaults to `False`):
121
+ Whether to use a bias in the query, key, value and output projection layers during self-attention.
122
+ attention_dropout (`float`, *optional*, defaults to 0.0):
123
+ The dropout ratio for the attention probabilities.
124
+ mlp_bias (`bool`, *optional*, defaults to `False`):
125
+ Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers.
126
+ head_dim (`int`, *optional*):
127
+ The attention head dimension. If None, it will default to hidden_size // num_heads
128
+
129
+ ```python
130
+ >>> from transformers import LlamaModel, LlamaConfig
131
+
132
+ >>> # Initializing a LLaMA llama-7b style configuration
133
+ >>> configuration = LlamaConfig()
134
+
135
+ >>> # Initializing a model from the llama-7b style configuration
136
+ >>> model = LlamaModel(configuration)
137
+
138
+ >>> # Accessing the model configuration
139
+ >>> configuration = model.config
140
+ ```"""
141
+
142
+ model_type = "byllama_patch"
143
+ keys_to_ignore_at_inference = ["past_key_values"]
144
+
145
+ def __init__(
146
+ self,
147
+ vocab_size=32000,
148
+ hidden_size=4096,
149
+ intermediate_size=11008,
150
+ num_hidden_layers=32,
151
+ num_attention_heads=32,
152
+ num_key_value_heads=None,
153
+ hidden_act="silu",
154
+ max_position_embeddings=2048,
155
+ initializer_range=0.02,
156
+ rms_norm_eps=1e-6,
157
+ use_cache=True,
158
+ pad_token_id=None,
159
+ bos_token_id=1,
160
+ eos_token_id=2,
161
+ pretraining_tp=1,
162
+ tie_word_embeddings=False,
163
+ rope_theta=10000.0,
164
+ rope_scaling=None,
165
+ attention_bias=False,
166
+ qkv_bias=False,
167
+ attention_dropout=0.0,
168
+ mlp_bias=False,
169
+ head_dim=None,
170
+ num_lm_heads=4,
171
+ embedding_aggregator_type="linear",
172
+ input_embedding_dim=None,
173
+ output_vocab_size=None,
174
+ **kwargs,
175
+ ):
176
+ self.vocab_size = vocab_size
177
+ self.max_position_embeddings = max_position_embeddings
178
+ self.hidden_size = hidden_size
179
+ self.intermediate_size = intermediate_size
180
+ self.num_hidden_layers = num_hidden_layers
181
+ self.num_attention_heads = num_attention_heads
182
+
183
+ # for backward compatibility
184
+ if num_key_value_heads is None:
185
+ num_key_value_heads = num_attention_heads
186
+
187
+ self.num_key_value_heads = num_key_value_heads
188
+ self.hidden_act = hidden_act
189
+ self.initializer_range = initializer_range
190
+ self.rms_norm_eps = rms_norm_eps
191
+ self.pretraining_tp = pretraining_tp
192
+ self.use_cache = use_cache
193
+ self.rope_theta = rope_theta
194
+ self.rope_scaling = rope_scaling
195
+ self.attention_bias = attention_bias
196
+ self.qkv_bias = qkv_bias
197
+ self.attention_dropout = attention_dropout
198
+ self.mlp_bias = mlp_bias
199
+ self.head_dim = head_dim if head_dim is not None else self.hidden_size // self.num_attention_heads
200
+ # Validate the correctness of rotary position embeddings parameters
201
+ # BC: if there is a 'type' field, copy it to 'rope_type'.
202
+ if self.rope_scaling is not None and "type" in self.rope_scaling:
203
+ self.rope_scaling["rope_type"] = self.rope_scaling["type"]
204
+ rope_config_validation(self)
205
+
206
+ # Custom attribute
207
+ self.num_lm_heads = num_lm_heads
208
+ self.embedding_aggregator_type = embedding_aggregator_type
209
+ self.input_embedding_dim = input_embedding_dim if input_embedding_dim is not None else hidden_size
210
+ self.output_vocab_size = output_vocab_size if output_vocab_size is not None else vocab_size
211
+
212
+ super().__init__(
213
+ pad_token_id=pad_token_id,
214
+ bos_token_id=bos_token_id,
215
+ eos_token_id=eos_token_id,
216
+ tie_word_embeddings=tie_word_embeddings,
217
+ **kwargs,
218
+ )
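
On top of the standard Llama configuration fields, `ByLlamaPatchConfig` adds `num_lm_heads`, `embedding_aggregator_type`, `input_embedding_dim`, and `output_vocab_size`, with the latter two falling back to `hidden_size` and `vocab_size` when left unset. A small sketch, assuming the file is importable as a local module:

```python
from configuration_byllama_patch import ByLlamaPatchConfig

config = ByLlamaPatchConfig(
    vocab_size=256,                    # byte-level vocabulary
    num_lm_heads=4,                    # predict 4 bytes per position
    embedding_aggregator_type="mean",  # how the 4 byte embeddings are merged
)
# Unset custom fields fall back to the base values (hidden_size=4096 by default).
print(config.output_vocab_size, config.input_embedding_dim)  # 256 4096
```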
custom_generate/generate.py ADDED
@@ -0,0 +1,309 @@
1
+ from typing import Optional, Union, List
2
+ import inspect
3
+
4
+ import torch
5
+ from torch import nn
6
+ from transformers.cache_utils import Cache, DynamicCache
7
+ from transformers.generation.logits_process import LogitsProcessorList
8
+ from transformers.generation.stopping_criteria import (
9
+ StoppingCriteriaList,
10
+ EosTokenCriteria,
11
+ MaxLengthCriteria,
12
+ )
13
+ from transformers.generation.utils import logging
14
+
15
+ logger = logging.get_logger(__name__)
16
+
17
+
18
+ class EosTokenCriteriaForPatch(EosTokenCriteria):
19
+ def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> torch.BoolTensor:
20
+ is_done = torch.zeros(input_ids.shape[0], dtype=torch.bool, device=input_ids.device)
21
+ for eos_token_ids in self.eos_token_id:
22
+ eos_token_length = eos_token_ids.shape[-1]
23
+ suffix = input_ids[:, -eos_token_length:]
24
+ is_done |= torch.all(suffix == eos_token_ids, dim=-1)
25
+ return is_done
26
+
27
+
28
+
29
+ def get_initial_cache_position(model, input_ids, model_kwargs):
30
+ """Calculates `cache_position` for the pre-fill stage based on `input_ids` and optionally past length"""
31
+
32
+ assert input_ids.size(1) % model.config.num_lm_heads == 0, "Input length must be divisible by num_lm_heads"
33
+ seq_len = input_ids.size(1) // model.config.num_lm_heads
34
+ cache_position = torch.ones(seq_len, dtype=torch.int64).cumsum(0) - 1
35
+
36
+ past_length = 0
37
+ if model_kwargs.get("past_key_values") is not None:
38
+ cache = model_kwargs["past_key_values"]
39
+ past_length = 0
40
+ past_length = cache.get_seq_length()
41
+ cache_position = cache_position[past_length:]
42
+
43
+ model_kwargs["cache_position"] = cache_position
44
+ return model_kwargs
45
+
46
+
47
+
48
+
49
+
50
+ def prepare_inputs_for_generation(
51
+ model,
52
+ input_ids: torch.LongTensor,
53
+ past_key_values: Optional[Cache] = None,
54
+ attention_mask: Optional[torch.LongTensor] = None,
55
+ cache_position: Optional[torch.LongTensor] = None,
56
+ **kwargs,
57
+ ):
58
+ """
59
+ Prepare the model inputs for generation. In includes operations like computing the 4D attention mask or
60
+ slicing inputs given the existing cache.
61
+
62
+ See the forward pass in the model documentation for expected arguments (different models might have different
63
+ requirements for e.g. `past_key_values`). This function should work as is for most LLMs.
64
+ """
65
+
66
+ # 1. Handle BC:
67
+ model_inputs = {
68
+ "cache_position": cache_position
69
+ }
70
+ assert input_ids.size(1) % model.config.num_lm_heads == 0, "Input length must be divisible by num_lm_heads"
71
+
72
+ # 2. Generic cache-dependent input preparation
73
+ if past_key_values is not None:
74
+ model_inputs["past_key_values"] = past_key_values
75
+ if input_ids.shape[1] != cache_position.shape[0] * model.config.num_lm_heads:
76
+ indices = torch.arange(
77
+ cache_position[0] * model.config.num_lm_heads,
78
+ (cache_position[-1] + 1) * model.config.num_lm_heads,
79
+ device=cache_position.device,
80
+ )
81
+ input_ids = input_ids[:, indices]
82
+
83
+ # 3. Prepare base model inputs
84
+ input_ids_key = "input_ids"
85
+ # `clone` calls in this function ensure a consistent stride. See #32227
86
+ model_inputs[input_ids_key] = input_ids.clone(memory_format=torch.contiguous_format)
87
+ model_inputs["inputs_embeds"] = None
88
+
89
+ # 4. Create missing `position_ids` on the fly
90
+ if (
91
+ attention_mask is not None
92
+ and kwargs.get("position_ids") is None
93
+ and "position_ids" in set(inspect.signature(model.forward).parameters.keys())
94
+ ):
95
+ bsz = input_ids.size(0)
96
+ agregated_attention_mask = attention_mask.view(
97
+ bsz, -1, model.config.num_lm_heads
98
+ ).all(dim=-1)
99
+ position_ids = agregated_attention_mask.long().cumsum(-1) - 1
100
+ position_ids.masked_fill_(agregated_attention_mask == 0, 1)
101
+ kwargs["position_ids"] = position_ids
102
+
103
+ # 5. Slice model inputs if it's an input that should have the same length as `input_ids`
104
+ model_input = kwargs.get("position_ids")
105
+ if model_input is not None:
106
+ if past_key_values is not None:
107
+ current_input_length = (
108
+ model_inputs["inputs_embeds"].shape[1]
109
+ if model_inputs["inputs_embeds"] is not None
110
+ else model_inputs[input_ids_key].shape[1]
111
+ )
112
+ assert current_input_length % model.config.num_lm_heads == 0, "Input length must be divisible by num_lm_heads"
113
+ current_input_length //= model.config.num_lm_heads
114
+ model_input = model_input[:, -current_input_length:]
115
+ model_input = model_input.clone(memory_format=torch.contiguous_format)
116
+ model_inputs["position_ids"] = model_input
117
+
118
+ if attention_mask is not None:
119
+ model_inputs["attention_mask"] = attention_mask
120
+
121
+ # 6. Forward ALL kwargs that are uninitialized (e.g. `use_cache`).
122
+ for key, value in kwargs.items():
123
+ if key not in model_inputs:
124
+ model_inputs[key] = value
125
+
126
+ # 8. Remove unexpected `generate` inputs (TODO @joao: fix trainer and examples)
127
+ model_inputs.pop("labels", None)
128
+ return model_inputs
129
+
130
+
131
+
132
+ def extract_past_from_model_output(outputs):
133
+ past_key_values = None
134
+ cache_name = "past_key_values"
135
+ if "past_key_values" in outputs:
136
+ past_key_values = outputs.past_key_values
137
+ elif "mems" in outputs:
138
+ past_key_values = outputs.mems
139
+ elif "past_buckets_states" in outputs:
140
+ past_key_values = outputs.past_buckets_states
141
+ elif "cache_params" in outputs:
142
+ past_key_values = outputs.cache_params
143
+ cache_name = "cache_params"
144
+
145
+ return cache_name, past_key_values
146
+
147
+
148
+ def update_model_kwargs_for_generation(
149
+ model,
150
+ outputs,
151
+ model_kwargs,
152
+ ):
153
+ # update past_key_values keeping its naming used in model code
154
+ cache_name, cache = extract_past_from_model_output(outputs)
155
+ model_kwargs[cache_name] = cache
156
+ if getattr(outputs, "state", None) is not None:
157
+ model_kwargs["state"] = outputs.state
158
+
159
+ # update attention mask
160
+ if "attention_mask" in model_kwargs:
161
+ attention_mask = model_kwargs["attention_mask"]
162
+ model_kwargs["attention_mask"] = torch.cat(
163
+ [
164
+ attention_mask,
165
+ attention_mask.new_ones(
166
+ (attention_mask.shape[0], model.config.num_lm_heads)
167
+ ),
168
+ ],
169
+ dim=-1
170
+ )
171
+
172
+ if model_kwargs.get("use_cache", True):
173
+ model_kwargs["cache_position"] = model_kwargs["cache_position"][-1:] + 1
174
+ else:
175
+ past_positions = model_kwargs.pop("cache_position")
176
+ new_positions = torch.arange(
177
+ past_positions[-1] + 1, (past_positions[-1] + 1) + 1, dtype=past_positions.dtype
178
+ ).to(past_positions.device)
179
+ model_kwargs["cache_position"] = torch.cat((past_positions, new_positions))
180
+ return model_kwargs
181
+
182
+
183
+
184
+ @torch.no_grad()
185
+ def generate(
186
+ model,
187
+ input_ids: torch.LongTensor,
188
+ attention_mask: Optional[torch.LongTensor] = None,
189
+ max_new_tokens: Optional[int] = None,
190
+ max_length: Optional[int] = None,
191
+ eos_token_id: Optional[Union[List[int], List[List[int]]]] = None,
192
+ pad_token_id: Optional[List[int]] = None,
193
+ do_sample: bool = False,
194
+ logits_processor: Optional[LogitsProcessorList] = None,
195
+ stopping_criteria: Optional[StoppingCriteriaList] = None,
196
+ ):
197
+ if logits_processor is None:
198
+ logits_processor = LogitsProcessorList()
199
+
200
+
201
+ if stopping_criteria is None:
202
+ stopping_criteria = StoppingCriteriaList()
203
+
204
+ if eos_token_id is not None:
205
+ eos_ids_tensor = torch.tensor(
206
+ eos_token_id, device=input_ids.device, dtype=torch.long
207
+ )
208
+ if eos_ids_tensor.dim() == 1:
209
+ eos_ids_tensor = eos_ids_tensor.unsqueeze(0)
210
+ stopping_criteria.append(
211
+ EosTokenCriteriaForPatch(eos_token_id=eos_ids_tensor)
212
+ )
213
+ if pad_token_id is None:
214
+ pad_token_id = eos_ids_tensor[0].clone()
215
+ logger.warning_once(
216
+ f"Setting `pad_token_id` to `eos_token_id`: {eos_token_id[0]} for open-end generation."
217
+ )
218
+ else:
219
+ pad_token_id = torch.tensor(
220
+ pad_token_id,
221
+ device=input_ids.device,
222
+ dtype=torch.long
223
+ )
224
+
225
+
226
+
227
+ if max_new_tokens is not None:
228
+ if max_length is not None:
229
+ logger.warning_once(
230
+ "`max_length` is ignored when `max_new_tokens` is set."
231
+ )
232
+ max_length = input_ids.shape[-1] + max_new_tokens
233
+
234
+ if max_length is not None:
235
+ stopping_criteria.append(MaxLengthCriteria(max_length=max_length))
236
+
237
+ has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria)
238
+
239
+ model_kwargs = {}
240
+ if attention_mask is not None:
241
+ model_kwargs["attention_mask"] = attention_mask
242
+ else:
243
+ model_kwargs["attention_mask"] = torch.ones_like(input_ids)
244
+
245
+ dynamic_cache_kwargs = {"config": model.config.get_text_config(decoder=True)}
246
+ model_kwargs["past_key_values"] = DynamicCache(**dynamic_cache_kwargs)
247
+
248
+
249
+
250
+ batch_size, cur_len = input_ids.shape
251
+ scores = ()
252
+ this_peer_finished = False
253
+ unfinished_sequences = torch.ones(
254
+ batch_size, 1, dtype=torch.long, device=input_ids.device
255
+ )
256
+ model_kwargs = get_initial_cache_position(model, input_ids, model_kwargs)
257
+
258
+ while not this_peer_finished:
259
+ # prepare model inputs
260
+ model_inputs = prepare_inputs_for_generation(model, input_ids, **model_kwargs)
261
+
262
+ # forward pass to get next token
263
+ outputs = model(**model_inputs, return_dict=True)
264
+
265
+ # synced_gpus: don't waste resources running the code we don't need; kwargs must be updated before skipping
266
+ model_kwargs = update_model_kwargs_for_generation(
267
+ model,
268
+ outputs,
269
+ model_kwargs,
270
+ )
271
+
272
+ # Clone is needed to avoid keeping a hanging ref to outputs.logits which may be very large for first iteration
273
+ # (the clone itself is always small)
274
+ next_token_logits = outputs.logits.clone()[:, -model.config.num_lm_heads:, :].float()
275
+ next_token_logits = next_token_logits.to(input_ids.device)
276
+
277
+ # pre-process distribution
278
+ next_token_scores = logits_processor(input_ids, next_token_logits)
279
+ scores += (next_token_scores,)
280
+
281
+ # token selection
282
+ if do_sample:
283
+ probs = nn.functional.softmax(next_token_scores, dim=-1)
284
+ next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
285
+ else:
286
+ next_tokens = torch.argmax(next_token_scores, dim=-1)
287
+
288
+ next_tokens[:, 0::4] += 192
289
+ next_tokens[:, 1::4] += 128
290
+ next_tokens[:, 2::4] += 64
291
+
292
+ # finished sentences should have their next token be a padding token
293
+ if has_eos_stopping_criteria:
294
+ next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences)
295
+
296
+ # update generated ids, model inputs, and length for next step
297
+ input_ids = torch.cat([input_ids, next_tokens], dim=-1)
298
+
299
+ unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores).unsqueeze(-1)
300
+ this_peer_finished = unfinished_sequences.max() == 0
301
+ cur_len += model.config.num_lm_heads
302
+
303
+ # This is needed to properly delete outputs.logits which may be very large for first iteration
304
+ # Otherwise a reference to outputs is kept which keeps the logits alive in the next iteration
305
+ del outputs
306
+
307
+
308
+ return input_ids
309
+
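
To make the stopping logic above concrete, here is a small standalone rendition of the suffix check performed by `EosTokenCriteriaForPatch`: a sequence counts as finished once its generated ids end with any of the multi-byte EOS sequences. The byte values are arbitrary placeholders.

```python
import torch

eos_sequences = torch.tensor([[202, 138, 74, 10]])         # one 4-byte stop sequence
generated = torch.tensor([[1, 2, 3, 4, 202, 138, 74, 10],  # ends with the stop sequence
                          [1, 2, 3, 4, 5, 6, 7, 8]])       # does not

is_done = torch.zeros(generated.shape[0], dtype=torch.bool)
for eos in eos_sequences:
    suffix = generated[:, -eos.shape[-1]:]
    is_done |= torch.all(suffix == eos, dim=-1)
print(is_done)  # tensor([ True, False])
```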
generation_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": 2,
5
+ "transformers_version": "4.46.3"
6
+ }
model-00001-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bc100c11ee3bbc83898c3cde6ae4d4a7c7a3814f7c9fc2fd1e1b497082be0687
3
+ size 4936898720
model-00002-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c7ce10f08a3cf75b5663a41b5f461210f32be001a2bf92b2a3f57a83dc02e798
3
+ size 4999813312
model-00003-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a8c3bf3759af6cff566ea61a4fd4225e8be0c298ba861e02bff776cef9653dbe
3
+ size 4966259040
model-00004-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9017d7ad64ad5733ea3642a28034b006211f77232afbf6099ac5170df713c86a
3
+ size 4999813352
model-00005-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:26cd84a858ec63a0d4300659bd733b4c6651016e091b6f6f6680bc1ba72a293d
3
+ size 4932704376
model-00006-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:871f0c555486dbd142e189cc458a5cc3094ed885e22965eaab20f6dd3382c290
3
+ size 1480673040
model.safetensors.index.json ADDED
@@ -0,0 +1,298 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 26316128256
4
+ },
5
+ "weight_map": {
6
+ "lm_head.weight": "model-00006-of-00006.safetensors",
7
+ "model.embed_tokens.weight": "model-00001-of-00006.safetensors",
8
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00006.safetensors",
9
+ "model.layers.0.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
10
+ "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
11
+ "model.layers.0.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
12
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
13
+ "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00006.safetensors",
14
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
15
+ "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00006.safetensors",
16
+ "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00006.safetensors",
17
+ "model.layers.1.input_layernorm.weight": "model-00001-of-00006.safetensors",
18
+ "model.layers.1.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
19
+ "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
20
+ "model.layers.1.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
21
+ "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
22
+ "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00006.safetensors",
23
+ "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
24
+ "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00006.safetensors",
25
+ "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00006.safetensors",
26
+ "model.layers.10.input_layernorm.weight": "model-00002-of-00006.safetensors",
27
+ "model.layers.10.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
28
+ "model.layers.10.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
29
+ "model.layers.10.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
30
+ "model.layers.10.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
31
+ "model.layers.10.self_attn.k_proj.weight": "model-00002-of-00006.safetensors",
32
+ "model.layers.10.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
33
+ "model.layers.10.self_attn.q_proj.weight": "model-00002-of-00006.safetensors",
34
+ "model.layers.10.self_attn.v_proj.weight": "model-00002-of-00006.safetensors",
35
+ "model.layers.11.input_layernorm.weight": "model-00002-of-00006.safetensors",
36
+ "model.layers.11.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
37
+ "model.layers.11.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
38
+ "model.layers.11.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
39
+ "model.layers.11.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
40
+ "model.layers.11.self_attn.k_proj.weight": "model-00002-of-00006.safetensors",
41
+ "model.layers.11.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
42
+ "model.layers.11.self_attn.q_proj.weight": "model-00002-of-00006.safetensors",
43
+ "model.layers.11.self_attn.v_proj.weight": "model-00002-of-00006.safetensors",
44
+ "model.layers.12.input_layernorm.weight": "model-00003-of-00006.safetensors",
45
+ "model.layers.12.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
46
+ "model.layers.12.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
47
+ "model.layers.12.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
48
+ "model.layers.12.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
49
+ "model.layers.12.self_attn.k_proj.weight": "model-00003-of-00006.safetensors",
50
+ "model.layers.12.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
51
+ "model.layers.12.self_attn.q_proj.weight": "model-00002-of-00006.safetensors",
52
+ "model.layers.12.self_attn.v_proj.weight": "model-00003-of-00006.safetensors",
53
+ "model.layers.13.input_layernorm.weight": "model-00003-of-00006.safetensors",
54
+ "model.layers.13.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
55
+ "model.layers.13.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
56
+ "model.layers.13.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
57
+ "model.layers.13.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
58
+ "model.layers.13.self_attn.k_proj.weight": "model-00003-of-00006.safetensors",
59
+ "model.layers.13.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
60
+ "model.layers.13.self_attn.q_proj.weight": "model-00003-of-00006.safetensors",
61
+ "model.layers.13.self_attn.v_proj.weight": "model-00003-of-00006.safetensors",
62
+ "model.layers.14.input_layernorm.weight": "model-00003-of-00006.safetensors",
63
+ "model.layers.14.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
64
+ "model.layers.14.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
65
+ "model.layers.14.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
66
+ "model.layers.14.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
67
+ "model.layers.14.self_attn.k_proj.weight": "model-00003-of-00006.safetensors",
68
+ "model.layers.14.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
69
+ "model.layers.14.self_attn.q_proj.weight": "model-00003-of-00006.safetensors",
70
+ "model.layers.14.self_attn.v_proj.weight": "model-00003-of-00006.safetensors",
71
+ "model.layers.15.input_layernorm.weight": "model-00003-of-00006.safetensors",
72
+ "model.layers.15.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
73
+ "model.layers.15.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
74
+ "model.layers.15.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
75
+ "model.layers.15.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
76
+ "model.layers.15.self_attn.k_proj.weight": "model-00003-of-00006.safetensors",
77
+ "model.layers.15.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
78
+ "model.layers.15.self_attn.q_proj.weight": "model-00003-of-00006.safetensors",
79
+ "model.layers.15.self_attn.v_proj.weight": "model-00003-of-00006.safetensors",
80
+ "model.layers.16.input_layernorm.weight": "model-00003-of-00006.safetensors",
81
+ "model.layers.16.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
82
+ "model.layers.16.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
83
+ "model.layers.16.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
84
+ "model.layers.16.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
85
+ "model.layers.16.self_attn.k_proj.weight": "model-00003-of-00006.safetensors",
86
+ "model.layers.16.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
87
+ "model.layers.16.self_attn.q_proj.weight": "model-00003-of-00006.safetensors",
88
+ "model.layers.16.self_attn.v_proj.weight": "model-00003-of-00006.safetensors",
89
+ "model.layers.17.input_layernorm.weight": "model-00003-of-00006.safetensors",
90
+ "model.layers.17.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
91
+ "model.layers.17.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
92
+ "model.layers.17.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
93
+ "model.layers.17.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
94
+ "model.layers.17.self_attn.k_proj.weight": "model-00003-of-00006.safetensors",
95
+ "model.layers.17.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
96
+ "model.layers.17.self_attn.q_proj.weight": "model-00003-of-00006.safetensors",
97
+ "model.layers.17.self_attn.v_proj.weight": "model-00003-of-00006.safetensors",
98
+ "model.layers.18.input_layernorm.weight": "model-00004-of-00006.safetensors",
99
+ "model.layers.18.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
100
+ "model.layers.18.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
101
+ "model.layers.18.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
102
+ "model.layers.18.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
103
+ "model.layers.18.self_attn.k_proj.weight": "model-00003-of-00006.safetensors",
104
+ "model.layers.18.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
105
+ "model.layers.18.self_attn.q_proj.weight": "model-00003-of-00006.safetensors",
106
+ "model.layers.18.self_attn.v_proj.weight": "model-00003-of-00006.safetensors",
107
+ "model.layers.19.input_layernorm.weight": "model-00004-of-00006.safetensors",
108
+ "model.layers.19.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
109
+ "model.layers.19.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
110
+ "model.layers.19.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
111
+ "model.layers.19.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
112
+ "model.layers.19.self_attn.k_proj.weight": "model-00004-of-00006.safetensors",
113
+ "model.layers.19.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
114
+ "model.layers.19.self_attn.q_proj.weight": "model-00004-of-00006.safetensors",
115
+ "model.layers.19.self_attn.v_proj.weight": "model-00004-of-00006.safetensors",
116
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00006.safetensors",
117
+ "model.layers.2.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
118
+ "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
119
+ "model.layers.2.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
120
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
121
+ "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00006.safetensors",
122
+ "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
123
+ "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00006.safetensors",
124
+ "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00006.safetensors",
125
+ "model.layers.20.input_layernorm.weight": "model-00004-of-00006.safetensors",
126
+ "model.layers.20.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
127
+ "model.layers.20.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
128
+ "model.layers.20.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
129
+ "model.layers.20.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
130
+ "model.layers.20.self_attn.k_proj.weight": "model-00004-of-00006.safetensors",
131
+ "model.layers.20.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
132
+ "model.layers.20.self_attn.q_proj.weight": "model-00004-of-00006.safetensors",
133
+ "model.layers.20.self_attn.v_proj.weight": "model-00004-of-00006.safetensors",
134
+ "model.layers.21.input_layernorm.weight": "model-00004-of-00006.safetensors",
135
+ "model.layers.21.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
136
+ "model.layers.21.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
137
+ "model.layers.21.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
138
+ "model.layers.21.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
139
+ "model.layers.21.self_attn.k_proj.weight": "model-00004-of-00006.safetensors",
140
+ "model.layers.21.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
141
+ "model.layers.21.self_attn.q_proj.weight": "model-00004-of-00006.safetensors",
142
+ "model.layers.21.self_attn.v_proj.weight": "model-00004-of-00006.safetensors",
143
+ "model.layers.22.input_layernorm.weight": "model-00004-of-00006.safetensors",
144
+ "model.layers.22.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
145
+ "model.layers.22.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
146
+ "model.layers.22.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
147
+ "model.layers.22.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
148
+ "model.layers.22.self_attn.k_proj.weight": "model-00004-of-00006.safetensors",
149
+ "model.layers.22.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
150
+ "model.layers.22.self_attn.q_proj.weight": "model-00004-of-00006.safetensors",
151
+ "model.layers.22.self_attn.v_proj.weight": "model-00004-of-00006.safetensors",
152
+ "model.layers.23.input_layernorm.weight": "model-00004-of-00006.safetensors",
153
+ "model.layers.23.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
154
+ "model.layers.23.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
155
+ "model.layers.23.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
156
+ "model.layers.23.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
157
+ "model.layers.23.self_attn.k_proj.weight": "model-00004-of-00006.safetensors",
158
+ "model.layers.23.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
159
+ "model.layers.23.self_attn.q_proj.weight": "model-00004-of-00006.safetensors",
160
+ "model.layers.23.self_attn.v_proj.weight": "model-00004-of-00006.safetensors",
161
+ "model.layers.24.input_layernorm.weight": "model-00005-of-00006.safetensors",
162
+ "model.layers.24.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
163
+ "model.layers.24.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
164
+ "model.layers.24.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
165
+ "model.layers.24.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
166
+ "model.layers.24.self_attn.k_proj.weight": "model-00004-of-00006.safetensors",
167
+ "model.layers.24.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
168
+ "model.layers.24.self_attn.q_proj.weight": "model-00004-of-00006.safetensors",
169
+ "model.layers.24.self_attn.v_proj.weight": "model-00004-of-00006.safetensors",
170
+ "model.layers.25.input_layernorm.weight": "model-00005-of-00006.safetensors",
171
+ "model.layers.25.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
172
+ "model.layers.25.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
173
+ "model.layers.25.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
174
+ "model.layers.25.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
175
+ "model.layers.25.self_attn.k_proj.weight": "model-00005-of-00006.safetensors",
176
+ "model.layers.25.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
177
+ "model.layers.25.self_attn.q_proj.weight": "model-00005-of-00006.safetensors",
178
+ "model.layers.25.self_attn.v_proj.weight": "model-00005-of-00006.safetensors",
179
+ "model.layers.26.input_layernorm.weight": "model-00005-of-00006.safetensors",
180
+ "model.layers.26.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
181
+ "model.layers.26.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
182
+ "model.layers.26.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
183
+ "model.layers.26.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
184
+ "model.layers.26.self_attn.k_proj.weight": "model-00005-of-00006.safetensors",
185
+ "model.layers.26.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
186
+ "model.layers.26.self_attn.q_proj.weight": "model-00005-of-00006.safetensors",
187
+ "model.layers.26.self_attn.v_proj.weight": "model-00005-of-00006.safetensors",
188
+ "model.layers.27.input_layernorm.weight": "model-00005-of-00006.safetensors",
189
+ "model.layers.27.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
190
+ "model.layers.27.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
191
+ "model.layers.27.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
192
+ "model.layers.27.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
193
+ "model.layers.27.self_attn.k_proj.weight": "model-00005-of-00006.safetensors",
194
+ "model.layers.27.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
195
+ "model.layers.27.self_attn.q_proj.weight": "model-00005-of-00006.safetensors",
196
+ "model.layers.27.self_attn.v_proj.weight": "model-00005-of-00006.safetensors",
197
+ "model.layers.28.input_layernorm.weight": "model-00005-of-00006.safetensors",
198
+ "model.layers.28.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
199
+ "model.layers.28.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
200
+ "model.layers.28.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
201
+ "model.layers.28.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
202
+ "model.layers.28.self_attn.k_proj.weight": "model-00005-of-00006.safetensors",
203
+ "model.layers.28.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
204
+ "model.layers.28.self_attn.q_proj.weight": "model-00005-of-00006.safetensors",
205
+ "model.layers.28.self_attn.v_proj.weight": "model-00005-of-00006.safetensors",
206
+ "model.layers.29.input_layernorm.weight": "model-00005-of-00006.safetensors",
207
+ "model.layers.29.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
208
+ "model.layers.29.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
209
+ "model.layers.29.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
210
+ "model.layers.29.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
211
+ "model.layers.29.self_attn.k_proj.weight": "model-00005-of-00006.safetensors",
212
+ "model.layers.29.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
213
+ "model.layers.29.self_attn.q_proj.weight": "model-00005-of-00006.safetensors",
214
+ "model.layers.29.self_attn.v_proj.weight": "model-00005-of-00006.safetensors",
215
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00006.safetensors",
216
+ "model.layers.3.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
217
+ "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
218
+ "model.layers.3.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
219
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
220
+ "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00006.safetensors",
221
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
222
+ "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00006.safetensors",
223
+ "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00006.safetensors",
224
+ "model.layers.30.input_layernorm.weight": "model-00006-of-00006.safetensors",
225
+ "model.layers.30.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
226
+ "model.layers.30.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
227
+ "model.layers.30.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
228
+ "model.layers.30.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
229
+ "model.layers.30.self_attn.k_proj.weight": "model-00005-of-00006.safetensors",
230
+ "model.layers.30.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
231
+ "model.layers.30.self_attn.q_proj.weight": "model-00005-of-00006.safetensors",
232
+ "model.layers.30.self_attn.v_proj.weight": "model-00005-of-00006.safetensors",
233
+ "model.layers.31.input_layernorm.weight": "model-00006-of-00006.safetensors",
234
+ "model.layers.31.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
235
+ "model.layers.31.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
236
+ "model.layers.31.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
237
+ "model.layers.31.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
238
+ "model.layers.31.self_attn.k_proj.weight": "model-00006-of-00006.safetensors",
239
+ "model.layers.31.self_attn.o_proj.weight": "model-00006-of-00006.safetensors",
240
+ "model.layers.31.self_attn.q_proj.weight": "model-00006-of-00006.safetensors",
241
+ "model.layers.31.self_attn.v_proj.weight": "model-00006-of-00006.safetensors",
242
+ "model.layers.4.input_layernorm.weight": "model-00001-of-00006.safetensors",
243
+ "model.layers.4.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
244
+ "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
245
+ "model.layers.4.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
246
+ "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
247
+ "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00006.safetensors",
248
+ "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
249
+ "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00006.safetensors",
250
+ "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00006.safetensors",
251
+ "model.layers.5.input_layernorm.weight": "model-00001-of-00006.safetensors",
252
+ "model.layers.5.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
253
+ "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
254
+ "model.layers.5.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
255
+ "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
256
+ "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00006.safetensors",
257
+ "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
258
+ "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00006.safetensors",
259
+ "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00006.safetensors",
260
+ "model.layers.6.input_layernorm.weight": "model-00002-of-00006.safetensors",
261
+ "model.layers.6.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
262
+ "model.layers.6.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
263
+ "model.layers.6.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
264
+ "model.layers.6.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
265
+ "model.layers.6.self_attn.k_proj.weight": "model-00002-of-00006.safetensors",
266
+ "model.layers.6.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
267
+ "model.layers.6.self_attn.q_proj.weight": "model-00002-of-00006.safetensors",
268
+ "model.layers.6.self_attn.v_proj.weight": "model-00002-of-00006.safetensors",
269
+ "model.layers.7.input_layernorm.weight": "model-00002-of-00006.safetensors",
270
+ "model.layers.7.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
271
+ "model.layers.7.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
272
+ "model.layers.7.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
273
+ "model.layers.7.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
274
+ "model.layers.7.self_attn.k_proj.weight": "model-00002-of-00006.safetensors",
275
+ "model.layers.7.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
276
+ "model.layers.7.self_attn.q_proj.weight": "model-00002-of-00006.safetensors",
277
+ "model.layers.7.self_attn.v_proj.weight": "model-00002-of-00006.safetensors",
278
+ "model.layers.8.input_layernorm.weight": "model-00002-of-00006.safetensors",
279
+ "model.layers.8.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
280
+ "model.layers.8.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
281
+ "model.layers.8.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
282
+ "model.layers.8.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
283
+ "model.layers.8.self_attn.k_proj.weight": "model-00002-of-00006.safetensors",
284
+ "model.layers.8.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
285
+ "model.layers.8.self_attn.q_proj.weight": "model-00002-of-00006.safetensors",
286
+ "model.layers.8.self_attn.v_proj.weight": "model-00002-of-00006.safetensors",
287
+ "model.layers.9.input_layernorm.weight": "model-00002-of-00006.safetensors",
288
+ "model.layers.9.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
289
+ "model.layers.9.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
290
+ "model.layers.9.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
291
+ "model.layers.9.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
292
+ "model.layers.9.self_attn.k_proj.weight": "model-00002-of-00006.safetensors",
293
+ "model.layers.9.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
294
+ "model.layers.9.self_attn.q_proj.weight": "model-00002-of-00006.safetensors",
295
+ "model.layers.9.self_attn.v_proj.weight": "model-00002-of-00006.safetensors",
296
+ "model.norm.weight": "model-00006-of-00006.safetensors"
297
+ }
298
+ }
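The `weight_map` above assigns every parameter tensor to one of six safetensors shards, and a single layer can straddle two shards (for example, layer 24's attention projections live in `model-00004-of-00006.safetensors` while its layer norms and MLP live in `model-00005-of-00006.safetensors`). Below is a minimal sketch for sanity-checking the index, assuming the file has been downloaded from this repository into the working directory:

```python
import json
from collections import Counter

# Load the index shown in the diff above.
with open("model.safetensors.index.json", "r", encoding="utf-8") as f:
    index = json.load(f)

# weight_map maps each parameter name to the shard file that stores it.
shard_counts = Counter(index["weight_map"].values())
for shard, n_tensors in sorted(shard_counts.items()):
    print(f"{shard}: {n_tensors} tensors")
```

`from_pretrained` in `transformers` resolves this index automatically, so a check like this is only needed when inspecting or repacking the shards by hand.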
modeling_byllama_patch.py ADDED
@@ -0,0 +1,1299 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ import math
21
+ from typing import List, Optional, Tuple, Union
22
+
23
+ import torch
24
+ import torch.nn.functional as F
25
+ import torch.utils.checkpoint
26
+ from torch import nn
27
+
28
+ from transformers.activations import ACT2FN
29
+ from transformers.cache_utils import Cache, DynamicCache, StaticCache
30
+ from transformers.modeling_attn_mask_utils import AttentionMaskConverter
31
+ from transformers.modeling_flash_attention_utils import _flash_attention_forward
32
+ from transformers.modeling_outputs import (
33
+ BaseModelOutputWithPast,
34
+ CausalLMOutputWithPast,
35
+ )
36
+ from transformers.generation import GenerationMixin
37
+ from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS
38
+ from transformers.modeling_utils import PreTrainedModel
39
+ from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
40
+ from transformers.utils import (
41
+ add_start_docstrings,
42
+ add_start_docstrings_to_model_forward,
43
+ is_flash_attn_greater_or_equal_2_10,
44
+ logging,
45
+ replace_return_docstrings,
46
+ )
47
+
48
+ from .configuration_byllama_patch import ByLlamaPatchConfig
49
+
50
+ logger = logging.get_logger(__name__)
51
+
52
+ _CHECKPOINT_FOR_DOC = "meta-llama/Llama-2-7b-hf"
53
+ _CONFIG_FOR_DOC = "ByLlamaPatchConfig"
54
+
55
+
56
+
57
+ class MeanEmbeddingAggregator(nn.Module):
58
+ def __init__(self, config: ByLlamaPatchConfig):
59
+ super().__init__()
60
+ self.config = config
61
+ self.group_size = config.num_lm_heads
62
+ assert config.hidden_size % self.group_size == 0, "hidden_size must be divisible by group_size"
63
+ assert config.hidden_size == config.input_embedding_dim, "hidden_size must be equal to input_embedding_dim"
64
+
65
+ def forward(self, embeddings: torch.Tensor):
66
+ """
67
+ Args:
68
+ embeddings (:obj:`torch.Tensor`): The embeddings to aggregate. Should have shape `(batch_size, seq_len, hidden_size)`.
69
+ Output:
70
+ group_embeddings (:obj:`torch.Tensor`): The aggregated embeddings. Should have shape `(batch_size, seq_len // group_size, hidden_size)`.
71
+ """
72
+ batch_size, seq_len, hidden_size = embeddings.size()
73
+ group_embeddings = embeddings.view(batch_size, seq_len // self.group_size, self.group_size, hidden_size)
74
+ group_embeddings = group_embeddings.sum(dim=2)
75
+ return group_embeddings
76
+
77
+
78
+ class LinearEmbeddingAggregator(nn.Module):
79
+ def __init__(self, config: ByLlamaPatchConfig):
80
+ super().__init__()
81
+ self.config = config
82
+ self.group_size = config.num_lm_heads
83
+ self.linear = nn.Linear(
84
+ config.input_embedding_dim * self.group_size,
85
+ config.hidden_size,
86
+ bias=False
87
+ )
88
+ assert config.hidden_size % self.group_size == 0, "hidden_size must be divisible by group_size"
89
+
90
+ def forward(self, embeddings: torch.Tensor):
91
+ """
92
+ Args:
93
+ embeddings (:obj:`torch.Tensor`): The embeddings to aggregate. Should have shape `(batch_size, seq_len, input_embedding_dim)`.
94
+ Output:
95
+ group_embeddings (:obj:`torch.Tensor`): The aggregated embeddings. Should have shape `(batch_size, seq_len // group_size, hidden_size)`.
96
+ """
97
+ batch_size, seq_len, input_embedding_dim = embeddings.size()
98
+ assert seq_len % self.group_size == 0, "seq_len must be divisible by group_size"
99
+ group_embeddings = embeddings.view(batch_size, seq_len // self.group_size, self.group_size * input_embedding_dim)
100
+ return self.linear(group_embeddings)
101
+
102
+
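# Illustrative sketch (not from this file): with num_lm_heads = 4, as described in the model card,
# the aggregators above fold every 4 consecutive byte embeddings into a single patch vector before
# the decoder layers run. The toy sizes below are assumptions chosen only to show the reshaping.
import torch

batch, n_bytes, dim = 2, 12, 8
emb = torch.randn(batch, n_bytes, dim)
# MeanEmbeddingAggregator path: sum each group of 4 byte embeddings -> (2, 3, 8)
pooled = emb.view(batch, n_bytes // 4, 4, dim).sum(dim=2)
# LinearEmbeddingAggregator path: concatenate each group of 4 -> (2, 3, 32),
# which a Linear(4 * dim, hidden_size) then projects back to the model width.
concat = emb.view(batch, n_bytes // 4, 4 * dim)
print(pooled.shape, concat.shape)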
103
+ class LlamaRMSNorm(nn.Module):
104
+ def __init__(self, hidden_size, eps=1e-6):
105
+ """
106
+ LlamaRMSNorm is equivalent to T5LayerNorm
107
+ """
108
+ super().__init__()
109
+ self.weight = nn.Parameter(torch.ones(hidden_size))
110
+ self.variance_epsilon = eps
111
+
112
+ def forward(self, hidden_states):
113
+ input_dtype = hidden_states.dtype
114
+ hidden_states = hidden_states.to(torch.float32)
115
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
116
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
117
+ return self.weight * hidden_states.to(input_dtype)
118
+
119
+ def extra_repr(self):
120
+ return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
121
+
122
+
123
+ ALL_LAYERNORM_LAYERS.append(LlamaRMSNorm)
124
+
125
+
126
+ class LlamaRotaryEmbedding(nn.Module):
127
+ def __init__(
128
+ self,
129
+ dim=None,
130
+ max_position_embeddings=2048,
131
+ base=10000,
132
+ device=None,
133
+ scaling_factor=1.0,
134
+ rope_type="default",
135
+ config: Optional[ByLlamaPatchConfig] = None,
136
+ ):
137
+ super().__init__()
138
+ # TODO (joao): remove the `if` below, only used for BC
139
+ self.rope_kwargs = {}
140
+ if config is None:
141
+ logger.warning_once(
142
+ "`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the "
143
+ "`config` argument. All other arguments will be removed in v4.46"
144
+ )
145
+ self.rope_kwargs = {
146
+ "rope_type": rope_type,
147
+ "factor": scaling_factor,
148
+ "dim": dim,
149
+ "base": base,
150
+ "max_position_embeddings": max_position_embeddings,
151
+ }
152
+ self.rope_type = rope_type
153
+ self.max_seq_len_cached = max_position_embeddings
154
+ self.original_max_seq_len = max_position_embeddings
155
+ else:
156
+ # BC: "rope_type" was originally "type"
157
+ if config.rope_scaling is not None:
158
+ self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
159
+ else:
160
+ self.rope_type = "default"
161
+ self.max_seq_len_cached = config.max_position_embeddings
162
+ self.original_max_seq_len = config.max_position_embeddings
163
+
164
+ self.config = config
165
+ self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
166
+
167
+ inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device, **self.rope_kwargs)
168
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
169
+ self.original_inv_freq = self.inv_freq
170
+
171
+ def _dynamic_frequency_update(self, position_ids, device):
172
+ """
173
+ dynamic RoPE layers should recompute `inv_freq` in the following situations:
174
+ 1 - growing beyond the cached sequence length (allow scaling)
175
+ 2 - the current sequence length is in the original scale (avoid losing precision with small sequences)
176
+ """
177
+ seq_len = torch.max(position_ids) + 1
178
+ if seq_len > self.max_seq_len_cached: # growth
179
+ inv_freq, self.attention_scaling = self.rope_init_fn(
180
+ self.config, device, seq_len=seq_len, **self.rope_kwargs
181
+ )
182
+ self.register_buffer("inv_freq", inv_freq, persistent=False) # TODO joao: may break with compilation
183
+ self.max_seq_len_cached = seq_len
184
+
185
+ if seq_len < self.original_max_seq_len and self.max_seq_len_cached > self.original_max_seq_len: # reset
186
+ self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)
187
+ self.max_seq_len_cached = self.original_max_seq_len
188
+
189
+ @torch.no_grad()
190
+ def forward(self, x, position_ids):
191
+ if "dynamic" in self.rope_type:
192
+ self._dynamic_frequency_update(position_ids, device=x.device)
193
+
194
+ # Core RoPE block
195
+ inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
196
+ position_ids_expanded = position_ids[:, None, :].float()
197
+ # Force float32 (see https://github.com/huggingface/transformers/pull/29285)
198
+ device_type = x.device.type
199
+ device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
200
+ with torch.autocast(device_type=device_type, enabled=False):
201
+ freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
202
+ emb = torch.cat((freqs, freqs), dim=-1)
203
+ cos = emb.cos()
204
+ sin = emb.sin()
205
+
206
+ # Advanced RoPE types (e.g. yarn) apply a post-processing scaling factor, equivalent to scaling attention
207
+ cos = cos * self.attention_scaling
208
+ sin = sin * self.attention_scaling
209
+
210
+ return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
211
+
212
+
213
+ class LlamaLinearScalingRotaryEmbedding(LlamaRotaryEmbedding):
214
+ """LlamaRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
215
+
216
+ def __init__(self, *args, **kwargs):
217
+ logger.warning_once(
218
+ "`LlamaLinearScalingRotaryEmbedding` is deprecated an will be removed in v4.46. Please use "
219
+ "`LlamaRotaryEmbedding`, which now also does linear scaling (simply pass the model config to __init__)."
220
+ )
221
+ kwargs["rope_type"] = "linear"
222
+ super().__init__(*args, **kwargs)
223
+
224
+
225
+ class LlamaDynamicNTKScalingRotaryEmbedding(LlamaRotaryEmbedding):
226
+ """LlamaRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""
227
+
228
+ def __init__(self, *args, **kwargs):
229
+ logger.warning_once(
230
+ "`LlamaDynamicNTKScalingRotaryEmbedding` is deprecated an will be removed in v4.46. Please use "
231
+ "`LlamaRotaryEmbedding`, which now also does dynamic ntk scaling (simply pass the model config to "
232
+ "__init__)."
233
+ )
234
+ kwargs["rope_type"] = "dynamic"
235
+ super().__init__(*args, **kwargs)
236
+
237
+
238
+ def rotate_half(x):
239
+ """Rotates half the hidden dims of the input."""
240
+ x1 = x[..., : x.shape[-1] // 2]
241
+ x2 = x[..., x.shape[-1] // 2 :]
242
+ return torch.cat((-x2, x1), dim=-1)
243
+
244
+
245
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
246
+ """Applies Rotary Position Embedding to the query and key tensors.
247
+
248
+ Args:
249
+ q (`torch.Tensor`): The query tensor.
250
+ k (`torch.Tensor`): The key tensor.
251
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
252
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
253
+ position_ids (`torch.Tensor`, *optional*):
254
+ Deprecated and unused.
255
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
256
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
257
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
258
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
259
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
260
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
261
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
262
+ Returns:
263
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
264
+ """
265
+ cos = cos.unsqueeze(unsqueeze_dim)
266
+ sin = sin.unsqueeze(unsqueeze_dim)
267
+ q_embed = (q * cos) + (rotate_half(q) * sin)
268
+ k_embed = (k * cos) + (rotate_half(k) * sin)
269
+ return q_embed, k_embed
270
+
271
+
272
+ class LlamaMLP(nn.Module):
273
+ def __init__(self, config):
274
+ super().__init__()
275
+ self.config = config
276
+ self.hidden_size = config.hidden_size
277
+ self.intermediate_size = config.intermediate_size
278
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
279
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
280
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=config.mlp_bias)
281
+ self.act_fn = ACT2FN[config.hidden_act]
282
+
283
+ def forward(self, x):
284
+ if self.config.pretraining_tp > 1:
285
+ slice = self.intermediate_size // self.config.pretraining_tp
286
+ gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
287
+ up_proj_slices = self.up_proj.weight.split(slice, dim=0)
288
+ down_proj_slices = self.down_proj.weight.split(slice, dim=1)
289
+
290
+ gate_proj = torch.cat(
291
+ [F.linear(x, gate_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1
292
+ )
293
+ up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)
294
+
295
+ intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
296
+ down_proj = [
297
+ F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.config.pretraining_tp)
298
+ ]
299
+ down_proj = sum(down_proj)
300
+ else:
301
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
302
+
303
+ return down_proj
304
+
305
+
306
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
307
+ """
308
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
309
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
310
+ """
311
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
312
+ if n_rep == 1:
313
+ return hidden_states
314
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
315
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
316
+
317
+
318
+ class LlamaAttention(nn.Module):
319
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
320
+
321
+ def __init__(self, config: ByLlamaPatchConfig, layer_idx: Optional[int] = None):
322
+ super().__init__()
323
+ self.config = config
324
+ self.layer_idx = layer_idx
325
+ if layer_idx is None:
326
+ logger.warning_once(
327
+ f"Instantiating {self.__class__.__name__} without passing a `layer_idx` is not recommended and will "
328
+ "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
329
+ "when creating this class."
330
+ )
331
+
332
+ self.attention_dropout = config.attention_dropout
333
+ self.hidden_size = config.hidden_size
334
+ self.num_heads = config.num_attention_heads
335
+ self.head_dim = getattr(config, "head_dim", self.hidden_size // self.num_heads)
336
+ self.num_key_value_heads = config.num_key_value_heads
337
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
338
+ self.max_position_embeddings = config.max_position_embeddings
339
+ self.rope_theta = config.rope_theta
340
+ self.is_causal = True
341
+
342
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias or getattr(self.config, "qkv_bias", False))
343
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias or getattr(self.config, "qkv_bias", False))
344
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias or getattr(self.config, "qkv_bias", False))
345
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias)
346
+
347
+ # TODO (joao): remove in v4.46 (RoPE is computed in the model, not in the decoder layers)
348
+ self.rotary_emb = LlamaRotaryEmbedding(config=self.config)
349
+
350
+ def forward(
351
+ self,
352
+ hidden_states: torch.Tensor,
353
+ attention_mask: Optional[torch.Tensor] = None,
354
+ position_ids: Optional[torch.LongTensor] = None,
355
+ past_key_value: Optional[Cache] = None,
356
+ output_attentions: bool = False,
357
+ use_cache: bool = False,
358
+ cache_position: Optional[torch.LongTensor] = None,
359
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # will become mandatory in v4.46
360
+ **kwargs,
361
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
362
+ bsz, q_len, _ = hidden_states.size()
363
+
364
+ if self.config.pretraining_tp > 1:
365
+ key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.config.pretraining_tp
366
+ query_slices = self.q_proj.weight.split(
367
+ (self.num_heads * self.head_dim) // self.config.pretraining_tp, dim=0
368
+ )
369
+ key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
370
+ value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
371
+
372
+ query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.config.pretraining_tp)]
373
+ query_states = torch.cat(query_states, dim=-1)
374
+
375
+ key_states = [F.linear(hidden_states, key_slices[i]) for i in range(self.config.pretraining_tp)]
376
+ key_states = torch.cat(key_states, dim=-1)
377
+
378
+ value_states = [F.linear(hidden_states, value_slices[i]) for i in range(self.config.pretraining_tp)]
379
+ value_states = torch.cat(value_states, dim=-1)
380
+
381
+ else:
382
+ query_states = self.q_proj(hidden_states)
383
+ key_states = self.k_proj(hidden_states)
384
+ value_states = self.v_proj(hidden_states)
385
+
386
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
387
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
388
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
389
+
390
+ if position_embeddings is None:
391
+ logger.warning_once(
392
+ "The attention layers in this model are transitioning from computing the RoPE embeddings internally "
393
+ "through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed "
394
+ "`position_embeddings` (Tuple of tensors, containing cos and sin). In v4.46 `position_ids` will be "
395
+ "removed and `position_embeddings` will be mandatory."
396
+ )
397
+ cos, sin = self.rotary_emb(value_states, position_ids)
398
+ else:
399
+ cos, sin = position_embeddings
400
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
401
+
402
+ if past_key_value is not None:
403
+ # sin and cos are specific to RoPE models; cache_position needed for the static cache
404
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
405
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
406
+
407
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
408
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
409
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
410
+
411
+ if attention_mask is not None: # no matter the length, we just slice it
412
+ causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
413
+ attn_weights = attn_weights + causal_mask
414
+
415
+ # upcast attention to fp32
416
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
417
+ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
418
+ attn_output = torch.matmul(attn_weights, value_states)
419
+
420
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
421
+ raise ValueError(
422
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
423
+ f" {attn_output.size()}"
424
+ )
425
+
426
+ attn_output = attn_output.transpose(1, 2).contiguous()
427
+
428
+ attn_output = attn_output.reshape(bsz, q_len, -1)
429
+
430
+ if self.config.pretraining_tp > 1:
431
+ attn_output = attn_output.split(self.hidden_size // self.config.pretraining_tp, dim=2)
432
+ o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.config.pretraining_tp, dim=1)
433
+ attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.config.pretraining_tp)])
434
+ else:
435
+ attn_output = self.o_proj(attn_output)
436
+
437
+ if not output_attentions:
438
+ attn_weights = None
439
+
440
+ return attn_output, attn_weights, past_key_value
441
+
442
+
443
+ class LlamaFlashAttention2(LlamaAttention):
444
+ """
445
+ Llama flash attention module. This module inherits from `LlamaAttention` as the weights of the module stay
446
+ untouched. The only required change would be on the forward pass where it needs to correctly call the public API of
447
+ flash attention and deal with padding tokens in case the input contains any of them.
448
+ """
449
+
450
+ def __init__(self, *args, **kwargs):
451
+ super().__init__(*args, **kwargs)
452
+
453
+ # TODO: Should be removed once Flash Attention for RoCm is bumped to 2.1.
454
+ # flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignment, which was made default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
455
+ # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
456
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
457
+
458
+ def forward(
459
+ self,
460
+ hidden_states: torch.Tensor,
461
+ attention_mask: Optional[torch.LongTensor] = None,
462
+ position_ids: Optional[torch.LongTensor] = None,
463
+ past_key_value: Optional[Cache] = None,
464
+ output_attentions: bool = False,
465
+ use_cache: bool = False,
466
+ cache_position: Optional[torch.LongTensor] = None,
467
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # will become mandatory in v4.46
468
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
469
+ if isinstance(past_key_value, StaticCache):
470
+ raise ValueError(
471
+ "`static` cache implementation is not compatible with `attn_implementation==flash_attention_2` "
472
+ "make sure to use `sdpa` in the mean time, and open an issue at https://github.com/huggingface/transformers"
473
+ )
474
+
475
+ output_attentions = False
476
+
477
+ bsz, q_len, _ = hidden_states.size()
478
+
479
+ query_states = self.q_proj(hidden_states)
480
+ key_states = self.k_proj(hidden_states)
481
+ value_states = self.v_proj(hidden_states)
482
+
483
+ # Flash attention requires the input to have the shape
484
+ # batch_size x seq_length x num_heads x head_dim
485
+ # therefore we just need to keep the original shape
486
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
487
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
488
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
489
+
490
+ if position_embeddings is None:
491
+ logger.warning_once(
492
+ "The attention layers in this model are transitioning from computing the RoPE embeddings internally "
493
+ "through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed "
494
+ "`position_embeddings` (Tuple of tensors, containing cos and sin). In v4.46 `position_ids` will be "
495
+ "removed and `position_embeddings` will be mandatory."
496
+ )
497
+ cos, sin = self.rotary_emb(value_states, position_ids)
498
+ else:
499
+ cos, sin = position_embeddings
500
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
501
+
502
+ if past_key_value is not None:
503
+ # sin and cos are specific to RoPE models; cache_position needed for the static cache
504
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
505
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
506
+
507
+ # TODO: These transpose are quite inefficient but Flash Attention requires the layout [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
508
+ # to be able to avoid many of these transpose/reshape/view.
509
+ query_states = query_states.transpose(1, 2)
510
+ key_states = key_states.transpose(1, 2)
511
+ value_states = value_states.transpose(1, 2)
512
+
513
+ dropout_rate = self.attention_dropout if self.training else 0.0
514
+
515
+ # In PEFT, usually we cast the layer norms in float32 for training stability reasons
516
+ # therefore the input hidden states get silently cast in float32. Hence, we need to
517
+ # cast them back in the correct dtype just to be sure everything works as expected.
518
+ # This might slow down training & inference so it is recommended to not cast the LayerNorms
519
+ # in fp32. (LlamaRMSNorm handles it correctly)
520
+
521
+ input_dtype = query_states.dtype
522
+ if input_dtype == torch.float32:
523
+ if torch.is_autocast_enabled():
524
+ target_dtype = torch.get_autocast_gpu_dtype()
525
+ # Handle the case where the model is quantized
526
+ elif hasattr(self.config, "_pre_quantization_dtype"):
527
+ target_dtype = self.config._pre_quantization_dtype
528
+ else:
529
+ target_dtype = self.q_proj.weight.dtype
530
+
531
+ logger.warning_once(
532
+ f"The input hidden states seems to be silently casted in float32, this might be related to"
533
+ f" the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in"
534
+ f" {target_dtype}."
535
+ )
536
+
537
+ query_states = query_states.to(target_dtype)
538
+ key_states = key_states.to(target_dtype)
539
+ value_states = value_states.to(target_dtype)
540
+
541
+ attn_output = _flash_attention_forward(
542
+ query_states,
543
+ key_states,
544
+ value_states,
545
+ attention_mask,
546
+ q_len,
547
+ position_ids=position_ids,
548
+ dropout=dropout_rate,
549
+ sliding_window=getattr(self, "sliding_window", None),
550
+ use_top_left_mask=self._flash_attn_uses_top_left_mask,
551
+ is_causal=self.is_causal,
552
+ )
553
+
554
+ attn_output = attn_output.reshape(bsz, q_len, -1).contiguous()
555
+ attn_output = self.o_proj(attn_output)
556
+
557
+ if not output_attentions:
558
+ attn_weights = None
559
+
560
+ return attn_output, attn_weights, past_key_value
561
+
562
+
563
+ class LlamaSdpaAttention(LlamaAttention):
564
+ """
565
+ Llama attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
566
+ `LlamaAttention` as the weights of the module stay untouched. The only changes are on the forward pass to adapt to
567
+ SDPA API.
568
+ """
569
+
570
+ # Adapted from LlamaAttention.forward
571
+ def forward(
572
+ self,
573
+ hidden_states: torch.Tensor,
574
+ attention_mask: Optional[torch.Tensor] = None,
575
+ position_ids: Optional[torch.LongTensor] = None,
576
+ past_key_value: Optional[Cache] = None,
577
+ output_attentions: bool = False,
578
+ use_cache: bool = False,
579
+ cache_position: Optional[torch.LongTensor] = None,
580
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # will become mandatory in v4.46
581
+ **kwargs,
582
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
583
+ if output_attentions:
584
+ # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
585
+ logger.warning_once(
586
+ "LlamaModel is using LlamaSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
587
+ 'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
588
+ )
589
+ return super().forward(
590
+ hidden_states=hidden_states,
591
+ attention_mask=attention_mask,
592
+ position_ids=position_ids,
593
+ past_key_value=past_key_value,
594
+ output_attentions=output_attentions,
595
+ use_cache=use_cache,
596
+ cache_position=cache_position,
597
+ position_embeddings=position_embeddings,
598
+ )
599
+
600
+ bsz, q_len, _ = hidden_states.size()
601
+
602
+ query_states = self.q_proj(hidden_states)
603
+ key_states = self.k_proj(hidden_states)
604
+ value_states = self.v_proj(hidden_states)
605
+
606
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
607
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
608
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
609
+
610
+ if position_embeddings is None:
611
+ logger.warning_once(
612
+ "The attention layers in this model are transitioning from computing the RoPE embeddings internally "
613
+ "through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed "
614
+ "`position_embeddings` (Tuple of tensors, containing cos and sin). In v4.46 `position_ids` will be "
615
+ "removed and `position_embeddings` will be mandatory."
616
+ )
617
+ cos, sin = self.rotary_emb(value_states, position_ids)
618
+ else:
619
+ cos, sin = position_embeddings
620
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
621
+
622
+ if past_key_value is not None:
623
+ # sin and cos are specific to RoPE models; cache_position needed for the static cache
624
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
625
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
626
+
627
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
628
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
629
+
630
+ causal_mask = attention_mask
631
+ if attention_mask is not None:
632
+ causal_mask = causal_mask[:, :, :, : key_states.shape[-2]]
633
+
634
+ # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
635
+ # Reference: https://github.com/pytorch/pytorch/issues/112577.
636
+ if query_states.device.type == "cuda" and causal_mask is not None:
637
+ query_states = query_states.contiguous()
638
+ key_states = key_states.contiguous()
639
+ value_states = value_states.contiguous()
640
+
641
+ # We dispatch to SDPA's Flash Attention or Efficient kernels via this `is_causal` if statement instead of an inline conditional assignment
642
+ # in SDPA to support both torch.compile's dynamic shapes and full graph options. An inline conditional prevents dynamic shapes from compiling.
643
+ is_causal = True if causal_mask is None and q_len > 1 else False
644
+
645
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
646
+ query_states,
647
+ key_states,
648
+ value_states,
649
+ attn_mask=causal_mask,
650
+ dropout_p=self.attention_dropout if self.training else 0.0,
651
+ is_causal=is_causal,
652
+ )
653
+
654
+ attn_output = attn_output.transpose(1, 2).contiguous()
655
+ attn_output = attn_output.view(bsz, q_len, -1)
656
+
657
+ attn_output = self.o_proj(attn_output)
658
+
659
+ return attn_output, None, past_key_value
660
+
661
+
662
+ LLAMA_ATTENTION_CLASSES = {
663
+ "eager": LlamaAttention,
664
+ "flash_attention_2": LlamaFlashAttention2,
665
+ "sdpa": LlamaSdpaAttention,
666
+ }
667
+
668
+
669
+ class LlamaDecoderLayer(nn.Module):
670
+ def __init__(self, config: ByLlamaPatchConfig, layer_idx: int):
671
+ super().__init__()
672
+ self.hidden_size = config.hidden_size
673
+
674
+ self.self_attn = LLAMA_ATTENTION_CLASSES[config._attn_implementation](config=config, layer_idx=layer_idx)
675
+
676
+ self.mlp = LlamaMLP(config)
677
+ self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
678
+ self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
679
+
680
+ def forward(
681
+ self,
682
+ hidden_states: torch.Tensor,
683
+ attention_mask: Optional[torch.Tensor] = None,
684
+ position_ids: Optional[torch.LongTensor] = None,
685
+ past_key_value: Optional[Cache] = None,
686
+ output_attentions: Optional[bool] = False,
687
+ use_cache: Optional[bool] = False,
688
+ cache_position: Optional[torch.LongTensor] = None,
689
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # will become mandatory in v4.46
690
+ **kwargs,
691
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
692
+ """
693
+ Args:
694
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
695
+ attention_mask (`torch.FloatTensor`, *optional*):
696
+ attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
697
+ query_sequence_length, key_sequence_length)` if default attention is used.
698
+ output_attentions (`bool`, *optional*):
699
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
700
+ returned tensors for more detail.
701
+ use_cache (`bool`, *optional*):
702
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
703
+ (see `past_key_values`).
704
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
705
+ cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
706
+ Indices depicting the position of the input sequence tokens in the sequence
707
+ position_embeddings (`Tuple[torch.FloatTensor, torch.FloatTensor]`, *optional*):
708
+ Tuple containing the cosine and sine positional embeddings of shape `(batch_size, seq_len, head_dim)`,
709
+ with `head_dim` being the embedding dimension of each attention head.
710
+ kwargs (`dict`, *optional*):
711
+ Arbitrary kwargs to be ignored, used for FSDP and other methods that inject code
712
+ into the model
713
+ """
714
+ residual = hidden_states
715
+
716
+ hidden_states = self.input_layernorm(hidden_states)
717
+
718
+ # Self Attention
719
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
720
+ hidden_states=hidden_states,
721
+ attention_mask=attention_mask,
722
+ position_ids=position_ids,
723
+ past_key_value=past_key_value,
724
+ output_attentions=output_attentions,
725
+ use_cache=use_cache,
726
+ cache_position=cache_position,
727
+ position_embeddings=position_embeddings,
728
+ **kwargs,
729
+ )
730
+ hidden_states = residual + hidden_states
731
+
732
+ # Fully Connected
733
+ residual = hidden_states
734
+ hidden_states = self.post_attention_layernorm(hidden_states)
735
+ hidden_states = self.mlp(hidden_states)
736
+ hidden_states = residual + hidden_states
737
+
738
+ outputs = (hidden_states,)
739
+
740
+ if output_attentions:
741
+ outputs += (self_attn_weights,)
742
+
743
+ if use_cache:
744
+ outputs += (present_key_value,)
745
+
746
+ return outputs
747
+
748
+
749
+ LLAMA_START_DOCSTRING = r"""
750
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
751
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
752
+ etc.)
753
+
754
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
755
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
756
+ and behavior.
757
+
758
+ Parameters:
759
+ config ([`ByLlamaPatchConfig`]):
760
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
761
+ load the weights associated with the model, only the configuration. Check out the
762
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
763
+ """
764
+
765
+
766
+ @add_start_docstrings(
767
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
768
+ LLAMA_START_DOCSTRING,
769
+ )
770
+ class LlamaPreTrainedModel(PreTrainedModel):
771
+ config_class = ByLlamaPatchConfig
772
+ base_model_prefix = "model"
773
+ supports_gradient_checkpointing = True
774
+ _no_split_modules = ["LlamaDecoderLayer"]
775
+ _skip_keys_device_placement = ["past_key_values"]
776
+ _supports_flash_attn_2 = True
777
+ _supports_sdpa = True
778
+ _supports_cache_class = True
779
+ _supports_quantized_cache = True
780
+ _supports_static_cache = True
781
+
782
+ def _init_weights(self, module):
783
+ std = self.config.initializer_range
784
+ if isinstance(module, nn.Linear):
785
+ module.weight.data.normal_(mean=0.0, std=std)
786
+ if module.bias is not None:
787
+ module.bias.data.zero_()
788
+ elif isinstance(module, nn.Embedding):
789
+ module.weight.data.normal_(mean=0.0, std=std)
790
+ if module.padding_idx is not None:
791
+ module.weight.data[module.padding_idx].zero_()
792
+
793
+
794
+ LLAMA_INPUTS_DOCSTRING = r"""
795
+ Args:
796
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
797
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
798
+ it.
799
+
800
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
801
+ [`PreTrainedTokenizer.__call__`] for details.
802
+
803
+ [What are input IDs?](../glossary#input-ids)
804
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
805
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
806
+
807
+ - 1 for tokens that are **not masked**,
808
+ - 0 for tokens that are **masked**.
809
+
810
+ [What are attention masks?](../glossary#attention-mask)
811
+
812
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
813
+ [`PreTrainedTokenizer.__call__`] for details.
814
+
815
+ If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
816
+ `past_key_values`).
817
+
818
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
819
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
820
+ information on the default strategy.
821
+
822
+ - 1 indicates the head is **not masked**,
823
+ - 0 indicates the head is **masked**.
824
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
825
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
826
+ config.n_positions - 1]`.
827
+
828
+ [What are position IDs?](../glossary#position-ids)
829
+ past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
830
+ Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
831
+ blocks) that can be used to speed up sequential decoding. This typically consists of the `past_key_values`
832
+ returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
833
+
834
+ Two formats are allowed:
835
+ - a [`~cache_utils.Cache`] instance, see our
836
+ [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache);
837
+ - Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
838
+ shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`). This is also known as the legacy
839
+ cache format.
840
+
841
+ The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
842
+ legacy cache format will be returned.
843
+
844
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
845
+ have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
846
+ of shape `(batch_size, sequence_length)`.
847
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
848
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
849
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
850
+ model's internal embedding lookup matrix.
851
+ use_cache (`bool`, *optional*):
852
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
853
+ `past_key_values`).
854
+ output_attentions (`bool`, *optional*):
855
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
856
+ tensors for more detail.
857
+ output_hidden_states (`bool`, *optional*):
858
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
859
+ more detail.
860
+ return_dict (`bool`, *optional*):
861
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
862
+ cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
863
+ Indices depicting the position of the input sequence tokens in the sequence. Contrarily to `position_ids`,
864
+ this tensor is not affected by padding. It is used to update the cache in the correct position and to infer
865
+ the complete sequence length.
866
+ """
867
+
868
+
869
+ @add_start_docstrings(
870
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
871
+ LLAMA_START_DOCSTRING,
872
+ )
873
+ class ByLlamaPatchModel(LlamaPreTrainedModel):
874
+ """
875
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`LlamaDecoderLayer`]
876
+
877
+ Args:
878
+ config: ByLlamaPatchConfig
879
+ """
880
+
881
+ def __init__(self, config: ByLlamaPatchConfig):
882
+ super().__init__(config)
883
+ self.vocab_size = config.vocab_size
884
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.input_embedding_dim)
885
+
886
+ if config.embedding_aggregator_type == "mean":
887
+ self.embedding_aggregator = MeanEmbeddingAggregator(config)
888
+ elif config.embedding_aggregator_type == "linear":
889
+ self.embedding_aggregator = LinearEmbeddingAggregator(config)
890
+ else:
891
+ raise ValueError(
892
+ f"Invalid `embedding_aggregator_type` in config: {config.embedding_aggregator_type}. "
893
+ )
894
+
895
+ self.layers = nn.ModuleList(
896
+ [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
897
+ )
898
+ self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
899
+ self.rotary_emb = LlamaRotaryEmbedding(config=config)
900
+ self.gradient_checkpointing = False
901
+
902
+ # Initialize weights and apply final processing
903
+ self.post_init()
904
+
905
+ def get_input_embeddings(self):
906
+ return self.embed_tokens
907
+
908
+ def set_input_embeddings(self, value):
909
+ self.embed_tokens = value
910
+
911
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
912
+ def forward(
913
+ self,
914
+ input_ids: torch.LongTensor = None,
915
+ attention_mask: Optional[torch.Tensor] = None,
916
+ position_ids: Optional[torch.LongTensor] = None,
917
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
918
+ inputs_embeds: Optional[torch.FloatTensor] = None,
919
+ use_cache: Optional[bool] = None,
920
+ output_attentions: Optional[bool] = None,
921
+ output_hidden_states: Optional[bool] = None,
922
+ return_dict: Optional[bool] = None,
923
+ cache_position: Optional[torch.LongTensor] = None,
924
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
925
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
926
+ output_hidden_states = (
927
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
928
+ )
929
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
930
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
931
+ # Aggregate attention mask
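+ # (collapse to one entry per patch of `num_lm_heads` byte positions via `.all(dim=-1)`)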
932
+ if attention_mask is not None:
933
+ attention_mask = torch.ones_like(
934
+ input_ids, dtype=attention_mask.dtype, device=attention_mask.device
935
+ )
936
+ bsz = input_ids.size(0)
937
+ attention_mask = attention_mask.view(bsz, -1, self.config.num_lm_heads).all(dim=-1)
938
+
939
+
940
+ if (input_ids is None) ^ (inputs_embeds is not None):
941
+ raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
942
+
943
+ if self.gradient_checkpointing and self.training and use_cache:
944
+ logger.warning_once(
945
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
946
+ )
947
+ use_cache = False
948
+
949
+ if inputs_embeds is None:
950
+ inputs_embeds = self.embed_tokens(input_ids)
951
+ inputs_embeds = self.embedding_aggregator(inputs_embeds)
952
+
953
+ # kept for BC (non `Cache` `past_key_values` inputs)
954
+ return_legacy_cache = False
955
+ if use_cache and not isinstance(past_key_values, Cache):
956
+ return_legacy_cache = True
957
+ if past_key_values is None:
958
+ past_key_values = DynamicCache()
959
+ else:
960
+ past_key_values = DynamicCache.from_legacy_cache(past_key_values)
961
+ logger.warning_once(
962
+ "We detected that you are passing `past_key_values` as a tuple of tuples. This is deprecated and "
963
+ "will be removed in v4.47. Please convert your cache or use an appropriate `Cache` class "
964
+ "(https://huggingface.co/docs/transformers/kv_cache#legacy-cache-format)"
965
+ )
966
+
967
+ if cache_position is None:
968
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
969
+ cache_position = torch.arange(
970
+ past_seen_tokens,
971
+ past_seen_tokens + inputs_embeds.shape[1],
972
+ device=inputs_embeds.device
973
+ )
974
+ if position_ids is None:
975
+ position_ids = cache_position.unsqueeze(0)
976
+
977
+ causal_mask = self._update_causal_mask(
978
+ attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
979
+ )
980
+ hidden_states = inputs_embeds
981
+
982
+ # create position embeddings to be shared across the decoder layers
983
+ position_embeddings = self.rotary_emb(hidden_states, position_ids)
984
+
985
+ # decoder layers
986
+ all_hidden_states = () if output_hidden_states else None
987
+ all_self_attns = () if output_attentions else None
988
+ next_decoder_cache = None
989
+
990
+ for decoder_layer in self.layers:
991
+ if output_hidden_states:
992
+ all_hidden_states += (hidden_states,)
993
+
994
+ if self.gradient_checkpointing and self.training:
995
+ layer_outputs = self._gradient_checkpointing_func(
996
+ decoder_layer.__call__,
997
+ hidden_states,
998
+ causal_mask,
999
+ position_ids,
1000
+ past_key_values,
1001
+ output_attentions,
1002
+ use_cache,
1003
+ cache_position,
1004
+ position_embeddings,
1005
+ )
1006
+ else:
1007
+ layer_outputs = decoder_layer(
1008
+ hidden_states,
1009
+ attention_mask=causal_mask,
1010
+ position_ids=position_ids,
1011
+ past_key_value=past_key_values,
1012
+ output_attentions=output_attentions,
1013
+ use_cache=use_cache,
1014
+ cache_position=cache_position,
1015
+ position_embeddings=position_embeddings,
1016
+ )
1017
+
1018
+ hidden_states = layer_outputs[0]
1019
+
1020
+ if use_cache:
1021
+ next_decoder_cache = layer_outputs[2 if output_attentions else 1]
1022
+
1023
+ if output_attentions:
1024
+ all_self_attns += (layer_outputs[1],)
1025
+
1026
+ hidden_states = self.norm(hidden_states)
1027
+
1028
+ # add hidden states from the last decoder layer
1029
+ if output_hidden_states:
1030
+ all_hidden_states += (hidden_states,)
1031
+
1032
+ next_cache = next_decoder_cache if use_cache else None
1033
+ if return_legacy_cache:
1034
+ next_cache = next_cache.to_legacy_cache()
1035
+
1036
+ if not return_dict:
1037
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
1038
+ return BaseModelOutputWithPast(
1039
+ last_hidden_state=hidden_states,
1040
+ past_key_values=next_cache,
1041
+ hidden_states=all_hidden_states,
1042
+ attentions=all_self_attns,
1043
+ )
1044
+
1045
+ def _update_causal_mask(
1046
+ self,
1047
+ attention_mask: torch.Tensor,
1048
+ input_tensor: torch.Tensor,
1049
+ cache_position: torch.Tensor,
1050
+ past_key_values: Cache,
1051
+ output_attentions: bool,
1052
+ ):
1053
+ if self.config._attn_implementation == "flash_attention_2":
1054
+ if attention_mask is not None and 0.0 in attention_mask:
1055
+ return attention_mask
1056
+ return None
1057
+
1058
+ # For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
1059
+ # order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
1060
+ # to infer the attention mask.
1061
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
1062
+ using_static_cache = isinstance(past_key_values, StaticCache)
1063
+
1064
+ # When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
1065
+ if self.config._attn_implementation == "sdpa" and not using_static_cache and not output_attentions:
1066
+ if AttentionMaskConverter._ignore_causal_mask_sdpa(
1067
+ attention_mask,
1068
+ inputs_embeds=input_tensor,
1069
+ past_key_values_length=past_seen_tokens,
1070
+ is_training=self.training,
1071
+ ):
1072
+ return None
1073
+
1074
+ dtype, device = input_tensor.dtype, input_tensor.device
1075
+ sequence_length = input_tensor.shape[1]
1076
+ if using_static_cache:
1077
+ target_length = past_key_values.get_max_cache_shape()
1078
+ else:
1079
+ target_length = (
1080
+ attention_mask.shape[-1]
1081
+ if isinstance(attention_mask, torch.Tensor)
1082
+ else past_seen_tokens + sequence_length + 1
1083
+ )
1084
+
1085
+ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D).
1086
+ causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
1087
+ attention_mask,
1088
+ sequence_length=sequence_length,
1089
+ target_length=target_length,
1090
+ dtype=dtype,
1091
+ device=device,
1092
+ cache_position=cache_position,
1093
+ batch_size=input_tensor.shape[0],
1094
+ )
1095
+
1096
+ if (
1097
+ self.config._attn_implementation == "sdpa"
1098
+ and attention_mask is not None
1099
+ and attention_mask.device.type == "cuda"
1100
+ and not output_attentions
1101
+ ):
1102
+ # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
1103
+ # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
1104
+ # Details: https://github.com/pytorch/pytorch/issues/110213
1105
+ min_dtype = torch.finfo(dtype).min
1106
+ causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
1107
+
1108
+ return causal_mask
1109
+
1110
+ @staticmethod
1111
+ def _prepare_4d_causal_attention_mask_with_cache_position(
1112
+ attention_mask: torch.Tensor,
1113
+ sequence_length: int,
1114
+ target_length: int,
1115
+ dtype: torch.dtype,
1116
+ device: torch.device,
1117
+ cache_position: torch.Tensor,
1118
+ batch_size: int,
1119
+ **kwargs,
1120
+ ):
1121
+ """
1122
+ Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
1123
+ `(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.
1124
+
1125
+ Args:
1126
+ attention_mask (`torch.Tensor`):
1127
+ A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape
1128
+ `(batch_size, 1, query_length, key_value_length)`.
1129
+ sequence_length (`int`):
1130
+ The sequence length being processed.
1131
+ target_length (`int`):
1132
+ The target length: when generating with static cache, the mask should be as long as the static cache,
1133
+ to account for the 0 padding, the part of the cache that is not filled yet.
1134
+ dtype (`torch.dtype`):
1135
+ The dtype to use for the 4D attention mask.
1136
+ device (`torch.device`):
1137
+ The device to place the 4D attention mask on.
1138
+ cache_position (`torch.Tensor`):
1139
+ Indices depicting the position of the input sequence tokens in the sequence.
1140
+ batch_size (`int`):
1141
+ Batch size.
1142
+ """
1143
+ if attention_mask is not None and attention_mask.dim() == 4:
1144
+ # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
1145
+ causal_mask = attention_mask
1146
+ else:
1147
+ min_dtype = torch.finfo(dtype).min
1148
+ causal_mask = torch.full(
1149
+ (sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device
1150
+ )
1151
+ if sequence_length != 1:
1152
+ causal_mask = torch.triu(causal_mask, diagonal=1)
1153
+ causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
1154
+ causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
1155
+ if attention_mask is not None:
1156
+ causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit
1157
+ mask_length = attention_mask.shape[-1]
1158
+ padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
1159
+ padding_mask = padding_mask == 0
1160
+ causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
1161
+ padding_mask, min_dtype
1162
+ )
1163
+
1164
+ return causal_mask
1165
+
1166
+
1167
+ class ByLlamaPatchForCausalLM(LlamaPreTrainedModel, GenerationMixin):
1168
+ _tied_weights_keys = ["lm_head.weight"]
1169
+
1170
+ def __init__(self, config):
1171
+ super().__init__(config)
1172
+ self.model = ByLlamaPatchModel(config)
1173
+ self.vocab_size = config.vocab_size
1174
+ self.lm_head = nn.Linear(
1175
+ config.hidden_size, config.output_vocab_size, bias=False
1176
+ )
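+ # `output_vocab_size` is assumed to stack the `num_lm_heads` per-byte distributions side by
+ # side; `forward` reshapes the logits back to one byte distribution per predicted position.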
1177
+ # Initialize weights and apply final processing
1178
+ self.post_init()
1179
+
1180
+ def get_input_embeddings(self):
1181
+ return self.model.embed_tokens
1182
+
1183
+ def set_input_embeddings(self, value):
1184
+ self.model.embed_tokens = value
1185
+
1186
+ def get_output_embeddings(self):
1187
+ return self.lm_head
1188
+
1189
+ def set_output_embeddings(self, new_embeddings):
1190
+ self.lm_head = new_embeddings
1191
+
1192
+ def set_decoder(self, decoder):
1193
+ self.model = decoder
1194
+
1195
+ def get_decoder(self):
1196
+ return self.model
1197
+
1198
+ @classmethod
1199
+ def can_generate(cls) -> bool:
1200
+ return True
1201
+
1202
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
1203
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
1204
+ def forward(
1205
+ self,
1206
+ input_ids: torch.LongTensor = None,
1207
+ attention_mask: Optional[torch.Tensor] = None,
1208
+ position_ids: Optional[torch.LongTensor] = None,
1209
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
1210
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1211
+ labels: Optional[torch.LongTensor] = None,
1212
+ use_cache: Optional[bool] = None,
1213
+ output_attentions: Optional[bool] = None,
1214
+ output_hidden_states: Optional[bool] = None,
1215
+ return_dict: Optional[bool] = None,
1216
+ cache_position: Optional[torch.LongTensor] = None,
1217
+ num_logits_to_keep: int = 0,
1218
+ **loss_kwargs,
1219
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
1220
+ r"""
1221
+ Args:
1222
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1223
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
1224
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
1225
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
1226
+
1227
+ num_logits_to_keep (`int`, *optional*):
1228
+ Calculate logits for the last `num_logits_to_keep` tokens. If `0`, calculate logits for all
1229
+ `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
1230
+ token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
1231
+
1232
+ Returns:
1233
+
1234
+ Example:
1235
+
1236
+ ```python
1237
+ >>> from transformers import AutoTokenizer, LlamaForCausalLM
1238
+
1239
+ >>> model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
1240
+ >>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
1241
+
1242
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
1243
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
1244
+
1245
+ >>> # Generate
1246
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
1247
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
1248
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
1249
+ ```"""
1250
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1251
+ output_hidden_states = (
1252
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1253
+ )
1254
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1255
+
1256
+ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
1257
+ outputs = self.model(
1258
+ input_ids=input_ids,
1259
+ attention_mask=attention_mask,
1260
+ position_ids=position_ids,
1261
+ past_key_values=past_key_values,
1262
+ inputs_embeds=inputs_embeds,
1263
+ use_cache=use_cache,
1264
+ output_attentions=output_attentions,
1265
+ output_hidden_states=output_hidden_states,
1266
+ return_dict=return_dict,
1267
+ cache_position=cache_position,
1268
+ )
1269
+
1270
+ hidden_states = outputs[0]
1271
+ if self.config.pretraining_tp > 1:
1272
+ lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.config.pretraining_tp, dim=0)
1273
+ logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.config.pretraining_tp)]
1274
+ logits = torch.cat(logits, dim=-1)
1275
+ else:
1276
+ # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
1277
+ logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :])
1278
+
1279
+
1280
+
1281
+ # Split logits tensor by num_heads
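+ # (bsz, seq_len, output_vocab_size) -> (bsz, seq_len * num_lm_heads, output_vocab_size // num_lm_heads),
+ # i.e. each patch position yields `num_lm_heads` separate byte predictions.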
1282
+ bsz, seq_len, _ = logits.size()
1283
+ logits = logits.view(bsz, seq_len * self.config.num_lm_heads, -1).contiguous()
1284
+
1285
+ loss = None
1286
+ if labels is not None:
1287
+ loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **loss_kwargs)
1288
+
1289
+ if not return_dict:
1290
+ output = (logits,) + outputs[1:]
1291
+ return (loss,) + output if loss is not None else output
1292
+
1293
+ return CausalLMOutputWithPast(
1294
+ loss=loss,
1295
+ logits=logits,
1296
+ past_key_values=outputs.past_key_values,
1297
+ hidden_states=outputs.hidden_states,
1298
+ attentions=outputs.attentions,
1299
+ )
special_tokens_map.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<|begin_of_text|>",
4
+ "lstrip": false,
5
+ "normalized": true,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "cls_token": {
10
+ "content": "<|cls|>",
11
+ "lstrip": false,
12
+ "normalized": true,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "eos_token": {
17
+ "content": "<|end_of_text|>",
18
+ "lstrip": false,
19
+ "normalized": true,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "mask_token": {
24
+ "content": "<|mask|>",
25
+ "lstrip": false,
26
+ "normalized": true,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "pad_token": {
31
+ "content": "<|pad|>",
32
+ "lstrip": false,
33
+ "normalized": true,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ },
37
+ "sep_token": {
38
+ "content": "<|sep|>",
39
+ "lstrip": false,
40
+ "normalized": true,
41
+ "rstrip": false,
42
+ "single_word": false
43
+ }
44
+ }
tokenization_utf8_like_byte_v3.py ADDED
@@ -0,0 +1,1205 @@
1
+ # Copyright 2021 T5 Authors and HuggingFace Inc. team.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ """Tokenization class for model ByT5."""
15
+
16
+ import warnings
17
+ from typing import (
18
+ Dict,
19
+ List,
20
+ Optional,
21
+ Union,
22
+ Tuple
23
+ )
24
+ import json
25
+ import os
26
+ import copy
27
+ import ast
28
+
29
+ import torch
30
+ import numpy as np
31
+ from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer
32
+ from transformers.tokenization_utils_base import (
33
+ BatchEncoding,
34
+ EncodedInput,
35
+ PaddingStrategy,
36
+ TruncationStrategy
37
+ )
38
+ from transformers.utils import logging
39
+
40
+ logger = logging.get_logger(__name__)
41
+
42
+ SPECIAL_TOKENS_MAP_FILE = "special_tokens_map.json"
43
+ ADDED_TOKENS_FILE = "added_tokens.json"
44
+ TOKENIZER_CONFIG_FILE = "tokenizer_config.json"
45
+
46
+ LARGE_INTEGER = int(1e20)
47
+
48
+ def make_serializeable(obj):
49
+ if isinstance(obj, dict):
50
+ return {str(k): make_serializeable(v) for k, v in obj.items()}
51
+ if isinstance(obj, list):
52
+ return [make_serializeable(v) for v in obj]
53
+ if isinstance(obj, tuple):
54
+ return make_serializeable(list(obj))
55
+ return obj
56
+
57
+
58
+ class ByteLMTokenizerV3(PreTrainedTokenizer):
59
+ """Byte tokenizer with completely seperate space for special tokens.
60
+
61
+ tok.pad Parameters
62
+ ----------
63
+ PreTrainedTokenizer : _type_
64
+ _description_
65
+
66
+ Returns
67
+ -------
68
+ _type_
69
+ _description_
70
+
71
+ Raises
72
+ ------
73
+ ValueError
74
+ _description_
75
+ ValueError
76
+ _description_
77
+ """
78
+
79
+ model_input_names: list[str] = ["input_ids", "attention_mask"]
80
+ reserve_sizes: list[int] = [59, 0, 0, 0]
81
+ byte_head_ints: list[int] = [
82
+ int("11000000", base=2),
83
+ int("10000000", base=2),
84
+ int("01000000", base=2),
85
+ int("00000000", base=2),
86
+ ]
87
+ byte_n_free_bits: list[int] = [6, 6, 6, 6]
88
+ patch_padding: bool
89
+ reserve_token_list: list[tuple[int]]
90
+
91
+ def __init__(
92
+ self,
93
+ patch_padding=True,
94
+ pad_token="<|pad|>",
95
+ eos_token="<|end_of_text|>",
96
+ bos_token="<|begin_of_text|>",
97
+ cls_token="<|cls|>",
98
+ sep_token="<|sep|>",
99
+ mask_token="<|mask|>",
100
+ vision_start_token="<|vision_start|>", # for vlm
101
+ vision_br_token="<|vision_br|>", # for vlm
102
+ vision_end_token="<|vision_end|>", # for vlm
103
+ start_header_id_token="<|start_header_id|>", # for it
104
+ end_header_id_token="<|end_header_id|>", # for it
105
+ eor_id="<|end_of_role|>", # for it
106
+ extra_ids=47,
107
+ **kwargs,
108
+ ) -> None:
109
+ assert np.prod(
110
+ [
111
+ 2**n_free_bits - reserve_size
112
+ for reserve_size, n_free_bits in zip(
113
+ self.reserve_sizes, self.byte_n_free_bits
114
+ )
115
+ ]
116
+ ) >= int(
117
+ "110000", base=16
118
+ ), "Not enough positions for all unicode. Too many reserve size."
119
+
120
+ self.patch_padding = patch_padding
121
+
122
+ # list up all reserve tokens
123
+ self._list_up_reserve_tokens()
124
+
125
+ _bos_token = (
126
+ AddedToken(bos_token, lstrip=False, rstrip=False)
127
+ if isinstance(bos_token, str)
128
+ else bos_token
129
+ )
130
+ _eos_token = (
131
+ AddedToken(eos_token, lstrip=False, rstrip=False)
132
+ if isinstance(eos_token, str)
133
+ else eos_token
134
+ )
135
+ _pad_token = (
136
+ AddedToken(pad_token, lstrip=False, rstrip=False)
137
+ if isinstance(pad_token, str)
138
+ else pad_token
139
+ )
140
+ _cls_token = (
141
+ AddedToken(cls_token, lstrip=False, rstrip=False)
142
+ if isinstance(cls_token, str)
143
+ else cls_token
144
+ )
145
+ _sep_token = (
146
+ AddedToken(sep_token, lstrip=False, rstrip=False)
147
+ if isinstance(sep_token, str)
148
+ else sep_token
149
+ )
150
+ _mask_token = (
151
+ AddedToken(mask_token, lstrip=False, rstrip=False)
152
+ if isinstance(mask_token, str)
153
+ else mask_token
154
+ )
155
+ _vision_start_token = (
156
+ AddedToken(vision_start_token, lstrip=False, rstrip=False)
157
+ if isinstance(vision_start_token, str)
158
+ else vision_start_token
159
+ )
160
+ _vision_br_token = (
161
+ AddedToken(vision_br_token, lstrip=False, rstrip=False)
162
+ if isinstance(vision_br_token, str)
163
+ else vision_br_token
164
+ )
165
+ _vision_end_token = (
166
+ AddedToken(vision_end_token, lstrip=False, rstrip=False)
167
+ if isinstance(vision_end_token, str)
168
+ else vision_end_token
169
+ )
170
+ _start_header_id_token = (
171
+ AddedToken(start_header_id_token, lstrip=False, rstrip=False)
172
+ if isinstance(start_header_id_token, str)
173
+ else start_header_id_token
174
+ )
175
+ _end_header_id_token = (
176
+ AddedToken(end_header_id_token, lstrip=False, rstrip=False)
177
+ if isinstance(end_header_id_token, str)
178
+ else end_header_id_token
179
+ )
180
+ _eor_id = (
181
+ AddedToken(eor_id, lstrip=False, rstrip=False)
182
+ if isinstance(eor_id, str)
183
+ else eor_id
184
+ )
185
+
186
+ self.offset = 0
187
+ self._added_tokens_decoder = {
188
+ self.reserve_token_list[i]: special_token
189
+ for i, special_token in enumerate(
190
+ [
191
+ _pad_token,
192
+ _eos_token,
193
+ _bos_token,
194
+ _cls_token,
195
+ _sep_token,
196
+ _mask_token,
197
+ _vision_start_token,
198
+ _vision_br_token,
199
+ _vision_end_token,
200
+ _start_header_id_token,
201
+ _end_header_id_token,
202
+ _eor_id,
203
+ ]
204
+ )
205
+ }
206
+
207
+ offset = len(self._added_tokens_decoder)
208
+ extra_tokens = {
209
+ self.reserve_token_list[j + offset]: AddedToken(
210
+ f"<|extra_id_{i}|>", lstrip=False, rstrip=False
211
+ )
212
+ for j, i in enumerate(range(extra_ids))
213
+ }
214
+ self._added_tokens_decoder.update(extra_tokens)
215
+
216
+ super().__init__(
217
+ bos_token=_bos_token,
218
+ eos_token=_eos_token,
219
+ pad_token=_pad_token,
220
+ cls_token=_cls_token,
221
+ sep_token=_sep_token,
222
+ mask_token=_mask_token,
223
+ vision_start_token=_vision_start_token,
224
+ vision_br_token=_vision_br_token,
225
+ vision_end_token=_vision_end_token,
226
+ start_header_id_token=_start_header_id_token,
227
+ end_header_id_token=_end_header_id_token,
228
+ eor_id=_eor_id,
229
+ **kwargs,
230
+ )
231
+
232
+ self._vocab_size = len(self.get_vocab())
233
+
234
+ def _list_up_reserve_tokens(self):
235
+ self.reserve_token_list = [
236
+ (
237
+ i + self.byte_head_ints[0],
238
+ self.byte_head_ints[1],
239
+ self.byte_head_ints[2],
240
+ self.byte_head_ints[3],
241
+ )
242
+ for i in range(self.reserve_sizes[0])
243
+ ]
244
+
245
+ @property
246
+ def vocab_size(self):
247
+ return self._vocab_size
248
+
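+ # `create_tree` enumerates every combination of non-reserved byte values across the patch
+ # positions (recursively over `byte_options`); `get_vocab` uses it to list all byte tokens.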
249
+ def create_tree(
250
+ self, byte_options: list[list[int]], byte_index: int, max_byte_index: int
251
+ ) -> list[list[int]]:
252
+ if byte_index == max_byte_index:
253
+ return [[reserve_option] for reserve_option in byte_options[byte_index]]
254
+
255
+ concat_list = []
256
+ for byte_reserve_option in byte_options[byte_index]:
257
+ if byte_reserve_option is not None:
258
+ concat_list += [
259
+ [byte_reserve_option] + following_bytes
260
+ if following_bytes != [None]
261
+ else [byte_reserve_option]
262
+ for following_bytes in self.create_tree(
263
+ byte_options=byte_options,
264
+ byte_index=byte_index + 1,
265
+ max_byte_index=max_byte_index,
266
+ )
267
+ ]
268
+ else:
269
+ concat_list.append([None])
270
+ return concat_list
271
+
272
+ def get_vocab(self):
273
+ byte_options = [
274
+ list(range(reserve_size, 2**n_free_bits))
275
+ for reserve_size, n_free_bits in zip(
276
+ self.reserve_sizes, self.byte_n_free_bits
277
+ )
278
+ ]
279
+
280
+ if not self.patch_padding:
281
+ for i in range(len(byte_options) - 1):
282
+ byte_options[i] += [None]
283
+
284
+ byte_options.reverse()
285
+ byte_tokens = self.create_tree(
286
+ byte_options=byte_options, byte_index=0, max_byte_index=3
287
+ )
288
+
289
+ byte_tokens = sorted(
290
+ byte_tokens,
291
+ key=lambda lst: sum([e * (256**i) for i, e in enumerate(lst)])
292
+ + 256 ** len(lst),
293
+ )
294
+
295
+ for byte_token_index in range(len(byte_tokens)):
296
+ byte_tokens[byte_token_index].reverse()
297
+ for position in range(len(byte_tokens[byte_token_index])):
298
+ byte_tokens[byte_token_index][position] += self.byte_head_ints[position]
299
+ byte_tokens[byte_token_index] = tuple(byte_tokens[byte_token_index])
300
+
301
+ vocab = {self.convert_ids_to_tokens(tokens): tokens for tokens in byte_tokens}
302
+ vocab.pop("")
303
+ vocab.update(self.added_tokens_encoder)
304
+ return vocab
305
+
306
+
307
+ def _get_padding_truncation_strategies(
308
+ self, padding=False, truncation=None, max_length=None, pad_to_multiple_of=None, verbose=True, **kwargs
309
+ ):
310
+ """
311
+ Find the correct padding/truncation strategy
312
+ """
313
+
314
+ # Backward compatibility for previous behavior, maybe we should deprecate it:
315
+ # If you only set max_length, it activates truncation for max_length
316
+ if max_length is not None and padding is False and truncation is None:
317
+ if verbose:
318
+ if not self.deprecation_warnings.get("Truncation-not-explicitly-activated", False):
319
+ logger.warning(
320
+ "Truncation was not explicitly activated but `max_length` is provided a specific value, please"
321
+ " use `truncation=True` to explicitly truncate examples to max length. Defaulting to"
322
+ " 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the"
323
+ " tokenizer you can select this strategy more precisely by providing a specific strategy to"
324
+ " `truncation`."
325
+ )
326
+ self.deprecation_warnings["Truncation-not-explicitly-activated"] = True
327
+ truncation = "longest_first"
328
+
329
+ # Get padding strategy
330
+ if padding is not False:
331
+ if padding is True:
332
+ if verbose:
333
+ if max_length is not None and (
334
+ truncation is None or truncation is False or truncation == "do_not_truncate"
335
+ ):
336
+ warnings.warn(
337
+ "`max_length` is ignored when `padding`=`True` and there is no truncation strategy. "
338
+ "To pad to max length, use `padding='max_length'`."
339
+ )
340
+ padding_strategy = PaddingStrategy.LONGEST # Default to pad to the longest sequence in the batch
341
+ elif not isinstance(padding, PaddingStrategy):
342
+ padding_strategy = PaddingStrategy(padding)
343
+ elif isinstance(padding, PaddingStrategy):
344
+ padding_strategy = padding
345
+ else:
346
+ padding_strategy = PaddingStrategy.DO_NOT_PAD
347
+
348
+ # Get truncation strategy
349
+ if truncation is not False and truncation is not None:
350
+ if truncation is True:
351
+ truncation_strategy = (
352
+ TruncationStrategy.LONGEST_FIRST
353
+ ) # Default to truncate the longest sequences in pairs of inputs
354
+ elif not isinstance(truncation, TruncationStrategy):
355
+ truncation_strategy = TruncationStrategy(truncation)
356
+ elif isinstance(truncation, TruncationStrategy):
357
+ truncation_strategy = truncation
358
+ else:
359
+ truncation_strategy = TruncationStrategy.DO_NOT_TRUNCATE
360
+
361
+ # Set max length if needed
362
+ if max_length is None:
363
+ if padding_strategy == PaddingStrategy.MAX_LENGTH:
364
+ if self.model_max_length > LARGE_INTEGER:
365
+ if verbose:
366
+ if not self.deprecation_warnings.get("Asking-to-pad-to-max_length", False):
367
+ logger.warning(
368
+ "Asking to pad to max_length but no maximum length is provided and the model has no"
369
+ " predefined maximum length. Default to no padding."
370
+ )
371
+ self.deprecation_warnings["Asking-to-pad-to-max_length"] = True
372
+ padding_strategy = PaddingStrategy.DO_NOT_PAD
373
+ else:
374
+ max_length = self.model_max_length
375
+
376
+ if truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE:
377
+ if self.model_max_length > LARGE_INTEGER:
378
+ if verbose:
379
+ if not self.deprecation_warnings.get("Asking-to-truncate-to-max_length", False):
380
+ logger.warning(
381
+ "Asking to truncate to max_length but no maximum length is provided and the model has"
382
+ " no predefined maximum length. Default to no truncation."
383
+ )
384
+ self.deprecation_warnings["Asking-to-truncate-to-max_length"] = True
385
+ truncation_strategy = TruncationStrategy.DO_NOT_TRUNCATE
386
+ else:
387
+ max_length = self.model_max_length
388
+
389
+ # Test if we have a padding token
390
+ if padding_strategy != PaddingStrategy.DO_NOT_PAD and self.pad_token is None:
391
+ raise ValueError(
392
+ "Asking to pad but the tokenizer does not have a padding token. "
393
+ "Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` "
394
+ "or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`."
395
+ )
396
+
397
+ # Check that we will truncate to a multiple of pad_to_multiple_of if both are provided
398
+ if (
399
+ truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE
400
+ and padding_strategy != PaddingStrategy.DO_NOT_PAD
401
+ and pad_to_multiple_of is not None
402
+ and max_length is not None
403
+ and (max_length % pad_to_multiple_of != 0)
404
+ ):
405
+ raise ValueError(
406
+ "Truncation and padding are both activated but "
407
+ f"truncation length ({max_length}) is not a multiple of pad_to_multiple_of ({pad_to_multiple_of})."
408
+ )
409
+
410
+ return padding_strategy, truncation_strategy, max_length, kwargs
411
+
412
+
413
+
414
+ def _add_bos_if_not_present(self, token_ids: list[int]) -> list[int]:
415
+ """Do not add bos again if user already added it."""
416
+ if len(token_ids) > 0 and token_ids[0] == self.bos_token_id:
417
+ warnings.warn(
418
+ f"This sequence already has {self.bos_token}. In future versions this behavior may lead to duplicated"
419
+ " bos tokens being added."
420
+ )
421
+ return token_ids
422
+ else:
423
+ return list(self.bos_token_id) + token_ids
424
+
425
+
426
+ def _add_eos_if_not_present(self, token_ids: list[int]) -> list[int]:
427
+ """Do not add eos again if user already added it."""
428
+ if len(token_ids) > 0 and token_ids[-1] == self.eos_token_id:
429
+ warnings.warn(
430
+ f"This sequence already has {self.eos_token}. In future versions this behavior may lead to duplicated"
431
+ " eos tokens being added."
432
+ )
433
+ return token_ids
434
+ else:
435
+ return token_ids + list(self.eos_token_id)
436
+
437
+ def _pad(
438
+ self,
439
+ encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
440
+ max_length: Optional[int] = None,
441
+ padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
442
+ pad_to_multiple_of: Optional[int] = None,
443
+ padding_side: Optional[str] = None,
444
+ return_attention_mask: Optional[bool] = None,
445
+ ) -> dict:
446
+ """
447
+ Pad encoded inputs (on left/right and up to predefined length or max length in the batch)
448
+
449
+ Args:
450
+ encoded_inputs:
451
+ Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
452
+ max_length: maximum length of the returned list and optionally padding length (see below).
453
+ Will truncate by taking into account the special tokens.
454
+ padding_strategy: PaddingStrategy to use for padding.
455
+
456
+ - PaddingStrategy.LONGEST Pad to the longest sequence in the batch
457
+ - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
458
+ - PaddingStrategy.DO_NOT_PAD: Do not pad
459
+ The tokenizer padding sides are defined in `padding_side` argument:
460
+
461
+ - 'left': pads on the left of the sequences
462
+ - 'right': pads on the right of the sequences
463
+ pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
464
+ This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
465
+ `>= 7.5` (Volta).
466
+ padding_side:
467
+ The side on which the model should have padding applied. Should be selected between ['right', 'left'].
468
+ Default value is picked from the class attribute of the same name.
469
+ return_attention_mask:
470
+ (optional) Set to False to avoid returning attention mask (default: set to model specifics)
471
+ """
472
+ # Load from model defaults
473
+ if return_attention_mask is None:
474
+ return_attention_mask = "attention_mask" in self.model_input_names
475
+
476
+ required_input = encoded_inputs[self.model_input_names[0]]
477
+
478
+ if padding_strategy == PaddingStrategy.LONGEST:
479
+ max_length = len(required_input)
480
+
481
+ if (
482
+ max_length is not None
483
+ and pad_to_multiple_of is not None
484
+ and (max_length % pad_to_multiple_of != 0)
485
+ ):
486
+ max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
487
+
488
+ needs_to_be_padded = (
489
+ padding_strategy != PaddingStrategy.DO_NOT_PAD
490
+ and len(required_input) != max_length
491
+ )
492
+
493
+ # Initialize attention mask if not present.
494
+ if return_attention_mask and "attention_mask" not in encoded_inputs:
495
+ encoded_inputs["attention_mask"] = [1] * len(required_input)
496
+
497
+ if needs_to_be_padded:
498
+ if self.patch_padding:
499
+ difference = (max_length - len(required_input)) // len(
500
+ self.byte_head_ints
501
+ )
502
+ mask_patch_size = 4
503
+ else:
504
+ difference = max_length - len(required_input)
505
+ mask_patch_size = 1
506
+
507
+ padding_side = (
508
+ padding_side if padding_side is not None else self.padding_side
509
+ )
510
+
511
+ if padding_side == "right":
512
+ if return_attention_mask:
513
+ encoded_inputs["attention_mask"] = (
514
+ encoded_inputs["attention_mask"]
515
+ + [0] * difference * mask_patch_size
516
+ )
517
+ if "token_type_ids" in encoded_inputs:
518
+ encoded_inputs["token_type_ids"] = (
519
+ encoded_inputs["token_type_ids"]
520
+ + list(self.pad_token_type_id) * difference
521
+ )
522
+ if "special_tokens_mask" in encoded_inputs:
523
+ encoded_inputs["special_tokens_mask"] = (
524
+ encoded_inputs["special_tokens_mask"]
525
+ + [1] * difference * mask_patch_size
526
+ )
527
+ encoded_inputs[self.model_input_names[0]] = (
528
+ required_input + list(self.pad_token_id) * difference
529
+ )
530
+ elif padding_side == "left":
531
+ if return_attention_mask:
532
+ encoded_inputs["attention_mask"] = [
533
+ 0
534
+ ] * difference * mask_patch_size + encoded_inputs["attention_mask"]
535
+ if "token_type_ids" in encoded_inputs:
536
+ encoded_inputs["token_type_ids"] = (
537
+ list(self.pad_token_type_id) * difference
538
+ + encoded_inputs["token_type_ids"]
539
+ )
540
+ if "special_tokens_mask" in encoded_inputs:
541
+ encoded_inputs["special_tokens_mask"] = [
542
+ 1
543
+ ] * difference * mask_patch_size + encoded_inputs[
544
+ "special_tokens_mask"
545
+ ]
546
+ encoded_inputs[self.model_input_names[0]] = (
547
+ list(self.pad_token_id) * difference + required_input
548
+ )
549
+ else:
550
+ raise ValueError(f"Invalid padding strategy:{padding_side}")
551
+
552
+ return encoded_inputs
553
+
554
+
555
+ def build_inputs_with_special_tokens(
556
+ self, token_ids_0: list[int], token_ids_1: list[int] | None = None
557
+ ) -> list[int]:
558
+ """
559
+ Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
+ adding special tokens. A sequence has the following format:
+ - single sequence: `<|begin_of_text|> X <|end_of_text|>`
+ - pair of sequences: `<|begin_of_text|> A <|end_of_text|> <|begin_of_text|> B <|end_of_text|>`
563
+ Args:
564
+ token_ids_0 (`List[int]`):
565
+ List of IDs to which the special tokens will be added.
566
+ token_ids_1 (`List[int]`, *optional*):
567
+ Optional second list of IDs for sequence pairs.
568
+ Returns:
569
+ `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
570
+ """
571
+ token_ids_0 = self._add_bos_if_not_present(token_ids_0)
572
+ token_ids_0 = self._add_eos_if_not_present(token_ids_0)
573
+ if token_ids_1 is None:
574
+ return token_ids_0
575
+ else:
576
+ token_ids_1 = self._add_bos_if_not_present(token_ids_1)
577
+ token_ids_1 = self._add_eos_if_not_present(token_ids_1)
578
+ return token_ids_0 + token_ids_1
579
+
580
+ def _tokenize(self, text: str) -> list[str]:
581
+ """Take as input a string and return a list of strings (tokens) for words/sub-words"""
582
+ token_ids = []
583
+ for c in text:
584
+ token_ids.extend(self.unicode_to_bytes(ord(c)))
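+ # each character expands to a patch of byte ids (a full 4-id patch when `patch_padding` is on)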
585
+
586
+ # Convert to string
587
+ token_ids = [str(i) for i in token_ids]
588
+ return token_ids
589
+
590
+ def _convert_token_to_id(self, token):
591
+ """Converts a token (str) in an id using the vocab."""
592
+ token_id = int(token) + self.offset
593
+ return token_id
594
+
595
+ def _convert_id_to_token(self, index):
596
+ """Converts an index (integer) in a token (str) using the vocab."""
597
+ return str(index - self.offset)
598
+
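+ # Unlike a standard tokenizer this returns a *list* of ids per token: special tokens map to
+ # multi-id tuples, while an ordinary byte token maps to a single id.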
599
+ def _convert_token_to_id_with_added_voc(self, token):
600
+ if token is None:
601
+ return None
602
+
603
+ if token in self._added_tokens_encoder:
604
+ return list(self._added_tokens_encoder[token])
605
+ return [self._convert_token_to_id(token)]
606
+
607
+ def convert_tokens_to_ids(
608
+ self, tokens: Union[str, List[str]]
609
+ ) -> Union[int, List[int]]:
610
+ """
611
+ Converts a token string (or a sequence of tokens) in a single integer id (or a sequence of ids), using the
612
+ vocabulary.
613
+
614
+ Args:
615
+ tokens (`str` or `List[str]`): One or several token(s) to convert to token id(s).
616
+
617
+ Returns:
618
+ `int` or `List[int]`: The token id or list of token ids.
619
+ """
620
+ if tokens is None:
621
+ return None
622
+
623
+ if isinstance(tokens, str):
624
+ return self._convert_token_to_id_with_added_voc(tokens)
625
+
626
+ ids = []
627
+ for token in tokens:
628
+ ids.extend(self._convert_token_to_id_with_added_voc(token))
629
+ return ids
630
+
631
+ def convert_bytes_for_single_char_to_char(self, ids: list[int]) -> str | None:
632
+ byte_ints = []
633
+ byte_offset = 1
634
+
635
+ if self.is_special_token(ids): # special token
636
+ return self.added_tokens_decoder[tuple(ids)].__str__()
637
+
638
+ for byte_position in range(1, len(ids) + 1):
639
+ byte_int = (
640
+ ids[-byte_position]
641
+ - self.byte_head_ints[-byte_position]
642
+ - self.reserve_sizes[-byte_position]
643
+ )
644
+ if byte_int != -self.reserve_sizes[-byte_position]: # not padding
645
+ byte_ints.append(byte_int * byte_offset)
646
+
647
+ byte_offset *= (
648
+ 2 ** self.byte_n_free_bits[-byte_position]
649
+ - self.reserve_sizes[-byte_position]
650
+ )
651
+
652
+ codepoint = sum(byte_ints)
653
+ if codepoint >= int("110000", base=16):
654
+ return None
655
+ else:
656
+ try:
657
+ return chr(codepoint)
658
+ except ValueError:
659
+ return None
660
+
661
+ # def is_special_token(self, ids: list[int]):
662
+ # return ids[0] < self.byte_head_ints[0] + (self.reserve_sizes[0] - 1)
663
+
664
+ def is_special_token(self, ids: list[int]):
665
+ return tuple(ids) in self._added_tokens_decoder
666
+
667
+ def convert_ids_to_tokens(
668
+ self, ids: list[int] | tuple[int], skip_special_tokens: bool = False
669
+ ) -> str | None:
670
+ """convert ids for single/multiple unicode character(s) to unicode character(s)"""
671
+
672
+ decoded_chars = ""
673
+
674
+ if isinstance(ids, tuple):
675
+ ids = list(ids)
676
+
677
+ if self.patch_padding:
678
+ for byte_position in range(0, len(ids), len(self.byte_head_ints)):
679
+ char_bytes = ids[
680
+ byte_position : byte_position + len(self.byte_head_ints)
681
+ ]
682
+ if (
683
+ skip_special_tokens and not self.is_special_token(char_bytes)
684
+ ) or not skip_special_tokens:
685
+ char = self.convert_bytes_for_single_char_to_char(char_bytes)
686
+ if char:
687
+ decoded_chars += char
688
+ return decoded_chars
689
+
690
+ if not self.is_special_token(ids): # not special token
691
+ byte_ints = []
692
+ byte_offset = 1
693
+ for byte_position in range(1, len(ids) + 1):
694
+ if ids[-byte_position] == 0:
695
+ break
696
+ byte_int = (
697
+ ids[-byte_position]
698
+ - self.byte_head_ints[-byte_position]
699
+ - self.reserve_sizes[-byte_position]
700
+ )
701
+ assert byte_int >= 0
702
+ byte_ints.append(byte_int * byte_offset)
703
+ byte_offset *= (
704
+ 2 ** self.byte_n_free_bits[-byte_position]
705
+ - self.reserve_sizes[-byte_position]
706
+ )
707
+
708
+ codepoint = sum(byte_ints)
709
+ if codepoint >= int("110000", base=16):
710
+ return None
711
+ else:
712
+ return chr(codepoint)
713
+ else: # special token
714
+ return self._added_tokens_decoder[tuple(ids)]
715
+
716
+ def unicode_to_bytes(self, codepoint: int) -> list[int]:
717
+ byte_list_reversed = []
718
+ for byte_position_from_right in range(len(self.byte_n_free_bits)):
719
+ byte_n_free_ids = (
720
+ 2 ** self.byte_n_free_bits[-1 - byte_position_from_right]
721
+ - self.reserve_sizes[-1 - byte_position_from_right]
722
+ )
723
+ byte_id = (
724
+ codepoint % byte_n_free_ids
725
+ + self.reserve_sizes[-1 - byte_position_from_right]
726
+ + self.byte_head_ints[-1 - byte_position_from_right]
727
+ )
728
+ codepoint //= byte_n_free_ids
729
+ byte_list_reversed.append(byte_id)
730
+
731
+ if codepoint == 0:
732
+ if self.patch_padding:
733
+ for pad_byte_position_from_right in range(
734
+ len(byte_list_reversed), len(self.byte_n_free_bits)
735
+ ):
736
+ byte_list_reversed.append(
737
+ self.byte_head_ints[-1 - pad_byte_position_from_right] + self.reserve_sizes[-1 - pad_byte_position_from_right]
738
+ )
739
+ byte_list_reversed.reverse()
740
+ return byte_list_reversed
741
+ raise ValueError("codepoint is too large")
742
+
743
+ # ByteTokenizer has no vocab file
744
+ def save_vocabulary(
745
+ self, save_directory: str, filename_prefix: str | None = None
746
+ ) -> tuple[str]:
747
+ return ()
748
+
749
+
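+ # Lays an RGB image out as byte patches: each pixel becomes a 4-id patch
+ # [<|vision_start|> lead byte, R, G, B] and every pixel row ends with a <|vision_br|> patch.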
750
+ def image_to_ids(self, image_data: list[list[list[int]]]) -> list[int]:
751
+ image_data = torch.tensor(image_data)
752
+ x, y, rgb = image_data.size()
753
+ assert rgb == 3
754
+ image_br_token = list(self.added_tokens_encoder["<|vision_br|>"])
755
+ image_special_byte_index = self.added_tokens_encoder["<|vision_start|>"][0]
756
+
757
+ # add img byte by padding to the beginning
758
+ image_data = torch.nn.functional.pad(
759
+ image_data, (1, 0), "constant", value=image_special_byte_index
760
+ ).view(x, y * 4)
761
+
762
+ image_data = torch.concat(
763
+ [image_data, torch.tensor(image_br_token * x).view(x, 4)], dim=1
764
+ ).view(-1)
765
+ return image_data.tolist()
766
+
767
+ def save_pretrained(
768
+ self,
769
+ save_directory: Union[str, os.PathLike],
770
+ legacy_format: Optional[bool] = None,
771
+ filename_prefix: Optional[str] = None,
772
+ push_to_hub: bool = False,
773
+ **kwargs,
774
+ ) -> Tuple[str]:
775
+ """
776
+ Save the full tokenizer state.
777
+
778
+
779
+ This method makes sure the full tokenizer can then be re-loaded using the
+ [`~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`] class method.
+
+ Warning: this won't save modifications you may have applied to the tokenizer after the instantiation (for
783
+ instance, modifying `tokenizer.do_lower_case` after creation).
784
+
785
+ Args:
786
+ save_directory (`str` or `os.PathLike`): The path to a directory where the tokenizer will be saved.
787
+ legacy_format (`bool`, *optional*):
788
+ Only applicable for a fast tokenizer. If unset (default), will save the tokenizer in the unified JSON
789
+ format as well as in legacy format if it exists, i.e. with tokenizer specific vocabulary and a separate
790
+ added_tokens files.
791
+
792
+ If `False`, will only save the tokenizer in the unified JSON format. This format is incompatible with
793
+ "slow" tokenizers (not powered by the *tokenizers* library), so the tokenizer will not be able to be
794
+ loaded in the corresponding "slow" tokenizer.
795
+
796
+ If `True`, will save the tokenizer in legacy format. If the "slow" tokenizer doesn't exist, a value
797
+ error is raised.
798
+ filename_prefix (`str`, *optional*):
799
+ A prefix to add to the names of the files saved by the tokenizer.
800
+ push_to_hub (`bool`, *optional*, defaults to `False`):
801
+ Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the
802
+ repository you want to push to with `repo_id` (will default to the name of `save_directory` in your
803
+ namespace).
804
+ kwargs (`Dict[str, Any]`, *optional*):
805
+ Additional key word arguments passed along to the [`~utils.PushToHubMixin.push_to_hub`] method.
806
+
807
+ Returns:
808
+ A tuple of `str`: The files saved.
809
+ """
810
+ use_auth_token = kwargs.pop("use_auth_token", None)
811
+
812
+ if use_auth_token is not None:
813
+ warnings.warn(
814
+ "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
815
+ FutureWarning,
816
+ )
817
+ if kwargs.get("token", None) is not None:
818
+ raise ValueError(
819
+ "`token` and `use_auth_token` are both specified. Please set only the argument `token`."
820
+ )
821
+ kwargs["token"] = use_auth_token
822
+
823
+ if os.path.isfile(save_directory):
824
+ logger.error(f"Provided path ({save_directory}) should be a directory, not a file")
825
+ return
826
+
827
+ os.makedirs(save_directory, exist_ok=True)
828
+
829
+ if push_to_hub:
830
+ commit_message = kwargs.pop("commit_message", None)
831
+ repo_id = kwargs.pop("repo_id", save_directory.split(os.path.sep)[-1])
832
+ repo_id = self._create_repo(repo_id, **kwargs)
833
+ files_timestamps = self._get_files_timestamps(save_directory)
834
+
835
+ special_tokens_map_file = os.path.join(
836
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + SPECIAL_TOKENS_MAP_FILE
837
+ )
838
+ tokenizer_config_file = os.path.join(
839
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + TOKENIZER_CONFIG_FILE
840
+ )
841
+
842
+ tokenizer_config = copy.deepcopy(self.init_kwargs)
843
+
844
+ # Let's save the init kwargs
845
+ target_keys = set(self.init_kwargs.keys())
846
+ # Let's save the special tokens map (only the strings)
847
+ target_keys.update(["model_max_length", "clean_up_tokenization_spaces"])
848
+
849
+ for k in target_keys:
850
+ if hasattr(self, k):
851
+ tokenizer_config[k] = getattr(self, k)
852
+
853
+ # Let's make sure we properly save the special tokens.
854
+ tokenizer_config.update(self.special_tokens_map)
855
+
856
+ if self.chat_template is not None:
857
+ if isinstance(self.chat_template, dict):
858
+ # Chat template dicts are saved to the config as lists of dicts with fixed key names.
859
+ # They will be reconstructed as a single dict during loading.
860
+ tokenizer_config["chat_template"] = [{"name": k, "template": v} for k, v in self.chat_template.items()]
861
+ else:
862
+ tokenizer_config["chat_template"] = self.chat_template
863
+
864
+ if len(self.init_inputs) > 0:
865
+ tokenizer_config["init_inputs"] = copy.deepcopy(self.init_inputs)
866
+ for file_id in self.vocab_files_names.keys():
867
+ tokenizer_config.pop(file_id, None)
868
+
869
+ # No type fields, so that both old fast and slow tokenizers can load it.
870
+ tokenizer_config = self.convert_added_tokens(tokenizer_config, add_type_field=True, save=True)
871
+
872
+ # Process added tokens separately: allows previous versions to ignore it!
873
+ added_tokens = {}
874
+ for key, value in self.added_tokens_decoder.items():
875
+ added_tokens[key] = value.__getstate__()
876
+ tokenizer_config["added_tokens_decoder"] = added_tokens
877
+
878
+ # Add tokenizer class to the tokenizer config to be able to reload it with from_pretrained
879
+ tokenizer_class = self.__class__.__name__
880
+ # Remove the Fast at the end unless we have a special `PreTrainedTokenizerFast`
881
+ if tokenizer_class.endswith("Fast") and tokenizer_class != "PreTrainedTokenizerFast":
882
+ tokenizer_class = tokenizer_class[:-4]
883
+ tokenizer_config["tokenizer_class"] = tokenizer_class
884
+ if getattr(self, "_auto_map", None) is not None:
885
+ tokenizer_config["auto_map"] = self._auto_map
886
+ if getattr(self, "_processor_class", None) is not None:
887
+ tokenizer_config["processor_class"] = self._processor_class
888
+
889
+ # If we have a custom model, we copy the file defining it in the folder and set the attributes so it can be
890
+ # loaded from the Hub.
891
+ if self._auto_class is not None:
892
+ custom_object_save(self, save_directory, config=tokenizer_config)
893
+
894
+ # remove private information
895
+ if "name_or_path" in tokenizer_config:
896
+ tokenizer_config.pop("name_or_path")
897
+ tokenizer_config.pop("special_tokens_map_file", None)
898
+ tokenizer_config.pop("tokenizer_file", None)
899
+
900
+ with open(tokenizer_config_file, "w", encoding="utf-8") as f:
901
+ out_str = json.dumps(
902
+ make_serializeable(tokenizer_config),
903
+ indent=2,
904
+ sort_keys=True,
905
+ ensure_ascii=False
906
+ ) + "\n"
907
+ f.write(out_str)
908
+ logger.info(f"tokenizer config file saved in {tokenizer_config_file}")
909
+
910
+ # Sanitize AddedTokens in special_tokens_map
911
+
912
+ # Kept for forward compatibility; will be removed in transformers v5. Type fields are not saved for forward compatibility, and the `special` field should not be saved either.
913
+ write_dict = self.convert_added_tokens(self.special_tokens_map_extended, save=True, add_type_field=False)
914
+ with open(special_tokens_map_file, "w", encoding="utf-8") as f:
915
+ out_str = json.dumps(write_dict, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
916
+ f.write(out_str)
917
+ logger.info(f"Special tokens file saved in {special_tokens_map_file}")
918
+
919
+ file_names = (tokenizer_config_file, special_tokens_map_file)
920
+
921
+ save_files = self._save_pretrained(
922
+ save_directory=save_directory,
923
+ file_names=file_names,
924
+ legacy_format=legacy_format,
925
+ filename_prefix=filename_prefix,
926
+ )
927
+
928
+ if push_to_hub:
929
+ self._upload_modified_files(
930
+ save_directory,
931
+ repo_id,
932
+ files_timestamps,
933
+ commit_message=commit_message,
934
+ token=kwargs.get("token"),
935
+ )
936
+
937
+ return save_files
938
+
939
+
940
+ def _save_pretrained(
941
+ self,
942
+ save_directory: Union[str, os.PathLike],
943
+ file_names: Tuple[str],
944
+ legacy_format: Optional[bool] = None,
945
+ filename_prefix: Optional[str] = None,
946
+ ) -> Tuple[str]:
947
+ """
948
+ Save a tokenizer using the slow-tokenizer/legacy format: vocabulary + added tokens.
949
+
950
+ Fast tokenizers can also be saved in a unique JSON file containing {config + vocab + added-tokens} using the
951
+ specific [`~tokenization_utils_fast.PreTrainedTokenizerFast._save_pretrained`]
952
+ """
953
+ if legacy_format is False:
954
+ raise ValueError(
955
+ "Only fast tokenizers (instances of PreTrainedTokenizerFast) can be saved in non legacy format."
956
+ )
957
+
958
+ save_directory = str(save_directory)
959
+
960
+ added_tokens_file = os.path.join(
961
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + ADDED_TOKENS_FILE
962
+ )
963
+ # the new get_added_vocab() also returns special tokens and tokens that have an index < vocab_size
964
+ # added_vocab = {tok: index for tok, index in self.added_tokens_encoder.items() if index >= self.vocab_size}
965
+ added_vocab = {tok: list(index) for tok, index in self.added_tokens_encoder.items()}
966
+ if added_vocab:
967
+ with open(added_tokens_file, "w", encoding="utf-8") as f:
968
+ out_str = json.dumps(added_vocab, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
969
+ f.write(out_str)
970
+ logger.info(f"added tokens file saved in {added_tokens_file}")
971
+
972
+ vocab_files = self.save_vocabulary(save_directory, filename_prefix=filename_prefix)
973
+
974
+ return file_names + vocab_files + (added_tokens_file,)
975
+
976
+
977
+
978
+ @classmethod
979
+ def _from_pretrained(
980
+ cls,
981
+ resolved_vocab_files,
982
+ pretrained_model_name_or_path,
983
+ init_configuration,
984
+ *init_inputs,
985
+ token=None,
986
+ cache_dir=None,
987
+ local_files_only=False,
988
+ _commit_hash=None,
989
+ _is_local=False,
990
+ trust_remote_code=False,
991
+ **kwargs,
992
+ ):
993
+ # We instantiate fast tokenizers based on a slow tokenizer if we don't have access to the tokenizer.json
994
+ # file or if `from_slow` is set to True.
995
+ from_slow = kwargs.get("from_slow", False)
996
+ gguf_file = kwargs.get("gguf_file", None)
997
+ has_tokenizer_file = resolved_vocab_files.get("tokenizer_file", None) is not None
998
+
999
+ # If one passes a GGUF file path to `gguf_file` there is no need for this check as the tokenizer will be
1000
+ # loaded directly from the GGUF file.
1001
+ if (from_slow or not has_tokenizer_file) and cls.slow_tokenizer_class is not None and not gguf_file:
1002
+ slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
1003
+ copy.deepcopy(resolved_vocab_files),
1004
+ pretrained_model_name_or_path,
1005
+ copy.deepcopy(init_configuration),
1006
+ *init_inputs,
1007
+ token=token,
1008
+ cache_dir=cache_dir,
1009
+ local_files_only=local_files_only,
1010
+ _commit_hash=_commit_hash,
1011
+ **(copy.deepcopy(kwargs)),
1012
+ )
1013
+ else:
1014
+ slow_tokenizer = None
1015
+
1016
+ # Prepare tokenizer initialization kwargs
1017
+ # Did we save some inputs and kwargs to reload?
1018
+ tokenizer_config_file = resolved_vocab_files.pop("tokenizer_config_file", None)
1019
+ if tokenizer_config_file is not None:
1020
+ with open(tokenizer_config_file, encoding="utf-8") as tokenizer_config_handle:
1021
+ init_kwargs = json.load(tokenizer_config_handle)
1022
+ # First attempt. We get tokenizer_class from tokenizer_config to check mismatch between tokenizers.
1023
+
1024
+ config_tokenizer_class = init_kwargs.get("tokenizer_class")
1025
+ init_kwargs.pop("tokenizer_class", None)
1026
+ if not has_tokenizer_file:
1027
+ init_kwargs.pop("tokenizer_file", None)
1028
+ saved_init_inputs = init_kwargs.pop("init_inputs", ())
1029
+ if not init_inputs:
1030
+ init_inputs = saved_init_inputs
1031
+ else:
1032
+ config_tokenizer_class = None
1033
+ init_kwargs = init_configuration
1034
+
1035
+ if not _is_local:
1036
+ if "auto_map" in init_kwargs:
1037
+ # For backward compatibility with the old format.
1038
+ if isinstance(init_kwargs["auto_map"], (tuple, list)):
1039
+ init_kwargs["auto_map"] = {"AutoTokenizer": init_kwargs["auto_map"]}
1040
+
1041
+
1042
+ if config_tokenizer_class is None:
1043
+ # Matt: This entire block is only used to decide if the tokenizer class matches the class in the repo.
1044
+ # If not, it raises a warning, but otherwise continues. Since we mostly load tokenizers with
1045
+ # AutoTokenizer these days, it seems like a lot of work (and a source of bugs) for little gain.
1046
+ # Maybe we can just remove this entirely?
1047
+ from transformers.models.auto.configuration_auto import AutoConfig # tests_ignore
1048
+
1049
+ # Second attempt. If we have not yet found tokenizer_class, let's try to use the config.
1050
+ try:
1051
+ config = AutoConfig.from_pretrained(
1052
+ pretrained_model_name_or_path,
1053
+ token=token,
1054
+ cache_dir=cache_dir,
1055
+ local_files_only=local_files_only,
1056
+ trust_remote_code=trust_remote_code,
1057
+ _commit_hash=_commit_hash,
1058
+ )
1059
+ config_tokenizer_class = config.tokenizer_class
1060
+ except (OSError, ValueError, KeyError):
1061
+ # skip if an error occurred.
1062
+ config = None
1063
+ if config_tokenizer_class is None:
1064
+ # Third attempt. If we have not yet found the original type of the tokenizer
1065
+ # we are loading, we see if we can infer it from the type of the configuration file.
1066
+ from transformers.models.auto.tokenization_auto import TOKENIZER_MAPPING_NAMES # tests_ignore
1067
+
1068
+ if hasattr(config, "model_type"):
1069
+ model_type = config.model_type
1070
+ else:
1071
+ # Fallback: use pattern matching on the string.
1072
+ model_type = None
1073
+ for pattern in TOKENIZER_MAPPING_NAMES.keys():
1074
+ if pattern in str(pretrained_model_name_or_path):
1075
+ model_type = pattern
1076
+ break
1077
+
1078
+ if model_type is not None:
1079
+ config_tokenizer_class, config_tokenizer_class_fast = TOKENIZER_MAPPING_NAMES.get(
1080
+ model_type, (None, None)
1081
+ )
1082
+ if config_tokenizer_class is None:
1083
+ config_tokenizer_class = config_tokenizer_class_fast
1084
+
1085
+ if config_tokenizer_class is not None:
1086
+ if cls.__name__.replace("Fast", "") != config_tokenizer_class.replace("Fast", ""):
1087
+ logger.warning(
1088
+ "The tokenizer class you load from this checkpoint is not the same type as the class this"
1089
+ " function is called from. It may result in unexpected tokenization. \nThe tokenizer class you"
1090
+ f" load from this checkpoint is '{config_tokenizer_class}'. \nThe class this function is called"
1091
+ f" from is '{cls.__name__}'."
1092
+ )
1093
+
1094
+ # Update with newly provided kwargs
1095
+ init_kwargs.update(kwargs)
1096
+
1097
+ # Merge resolved_vocab_files arguments in init_kwargs.
1098
+ added_tokens_file = resolved_vocab_files.pop("added_tokens_file", None)
1099
+ special_tokens_map_file = resolved_vocab_files.pop("special_tokens_map_file", None)
1100
+ for args_name, file_path in resolved_vocab_files.items():
1101
+ if args_name not in init_kwargs:
1102
+ init_kwargs[args_name] = file_path
1103
+ tokenizer_file = resolved_vocab_files.pop("tokenizer_file", None)
1104
+
1105
+ if slow_tokenizer is not None:
1106
+ init_kwargs["__slow_tokenizer"] = slow_tokenizer
1107
+ init_kwargs["name_or_path"] = pretrained_model_name_or_path
1108
+
1109
+ #### Handle tokenizer serialization of added and special tokens
1110
+ added_tokens_decoder: Dict[int, AddedToken] = {}
1111
+ added_tokens_map: Dict[str, AddedToken] = {}
1112
+ # if we have info on the slow added tokens
1113
+ if "added_tokens_decoder" in init_kwargs:
1114
+ for idx, token in init_kwargs["added_tokens_decoder"].items():
1115
+ if isinstance(token, dict):
1116
+ token = AddedToken(**token)
1117
+ if isinstance(token, AddedToken):
1118
+ added_tokens_decoder[ast.literal_eval(idx)] = token
1119
+ added_tokens_map[str(token)] = token
1120
+ else:
1121
+ raise ValueError(
1122
+ f"Found a {token.__class__} in the saved `added_tokens_decoder`, should be a dictionary or an AddedToken instance"
1123
+ )
1124
+ else:
1125
+ # begin legacy: read the added_tokens_file and update kwargs with special_tokens_map if modified
1126
+ if special_tokens_map_file is not None:
1127
+ with open(special_tokens_map_file, encoding="utf-8") as special_tokens_map_handle:
1128
+ special_tokens_map = json.load(special_tokens_map_handle)
1129
+ for key, value in special_tokens_map.items():
1130
+ if key in kwargs and kwargs[key]:
1131
+ # This value has already been redefined by the kwargs
1132
+ # We keep this new value and ignore the one stored in the special_tokens_map_file
1133
+ continue
1134
+ if isinstance(value, dict):
1135
+ value = AddedToken(**value, special=True)
1136
+ elif key == "additional_special_tokens" and isinstance(value, list):
1137
+ additional_special_tokens = init_kwargs.pop("additional_special_tokens", []) or []
1138
+ for token in value:
1139
+ token = AddedToken(**token, special=True) if isinstance(token, dict) else token
1140
+ if token not in additional_special_tokens:
1141
+ additional_special_tokens.append(token)
1142
+ value = additional_special_tokens
1143
+ init_kwargs[key] = value
1144
+
1145
+ # slow -> slow|fast, legacy: convert the `"added_tokens.json"` file to `added_tokens_decoder`.
1146
+ # This is for legacy purposes. We don't add the tokens after init for efficiency.
1147
+ if added_tokens_file is not None:
1148
+ special_tokens = []
1149
+ for key in cls.SPECIAL_TOKENS_ATTRIBUTES & init_kwargs.keys():
1150
+ if init_kwargs[key] is not None:
1151
+ if key == "additional_special_tokens":
1152
+ special_tokens += [str(token) for token in init_kwargs[key]]
1153
+ else:
1154
+ special_tokens.append(str(init_kwargs[key]))
1155
+
1156
+ with open(added_tokens_file, encoding="utf-8") as added_tokens_handle:
1157
+ added_tok_encoder = json.load(added_tokens_handle)
1158
+ for str_token, index in added_tok_encoder.items():
1159
+ # if index not in added_tokens_decoder and str_token not in added_tokens_map:
1160
+ special = str_token in special_tokens
1161
+ added_tokens_decoder[index] = AddedToken(
1162
+ str_token, rstrip=False, lstrip=False, normalized=not special, special=special
1163
+ )
1164
+ added_tokens_map[str_token] = added_tokens_decoder[index]
1165
+
1166
+ # allows converting a fast -> slow: add the `tokenizer.json`'s `"added_tokens"` to the slow tokenizer
1167
+ # if `tokenizer_config.json` is `None`
1168
+ if tokenizer_file is not None:
1169
+ # This is for the slow tokenizer, so it can be done before instantiation.
1170
+ with open(tokenizer_file, encoding="utf-8") as tokenizer_file_handle:
1171
+ tokenizer_file_handle = json.load(tokenizer_file_handle)
1172
+ added_tokens = tokenizer_file_handle.pop("added_tokens")
1173
+ for serialized_tokens in added_tokens:
1174
+ idx = serialized_tokens.pop("id")
1175
+ added_tokens_decoder[idx] = AddedToken(**serialized_tokens)
1176
+ added_tokens_map[str(added_tokens_decoder[idx])] = added_tokens_decoder[idx]
1177
+ # end legacy
1178
+
1179
+ # Passing AddedTokens and not strings to the class to prevent it from casting the string to a different AddedToken
1180
+ # convert {'__type': 'AddedToken', 'content': '<ent>', 'lstrip': False, 'normalized': True, ...} to AddedTokens
1181
+ init_kwargs["added_tokens_decoder"] = added_tokens_decoder
1182
+ init_kwargs = cls.convert_added_tokens(init_kwargs, save=False)
1183
+ for key in cls.SPECIAL_TOKENS_ATTRIBUTES & init_kwargs.keys():
1184
+ if added_tokens_map != {} and init_kwargs[key] is not None:
1185
+ if key != "additional_special_tokens":
1186
+ init_kwargs[key] = added_tokens_map.get(str(init_kwargs[key]), init_kwargs[key])
1187
+
1188
+ # Instantiate the tokenizer.
1189
+ try:
1190
+ tokenizer = cls(*init_inputs, **init_kwargs)
1191
+ except OSError:
1192
+ raise OSError(
1193
+ "Unable to load vocabulary from file. "
1194
+ "Please check that the provided vocabulary is accessible and not corrupted."
1195
+ )
1196
+
1197
+ # if added_tokens_decoder != {} and max(list(added_tokens_decoder.keys())[-1], 0) > tokenizer.vocab_size:
1198
+ # logger.warning_advice(
1199
+ # "Special tokens have been added in the vocabulary, make sure the associated word embeddings are"
1200
+ # " fine-tuned or trained."
1201
+ # )
1202
+ return tokenizer
1203
+
1204
+
1205
+
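On the loading side, `_from_pretrained` has to undo the same encoding: the keys of `added_tokens_decoder` arrive as strings such as `"(192, 128, 64, 0)"` and are turned back into tuples with `ast.literal_eval`. A standalone sketch of that step (the sample keys mirror the config file below):

```python
import ast

# String keys as they appear in tokenizer_config.json.
serialized_ids = ["(192, 128, 64, 0)", "(193, 128, 64, 0)", "(194, 128, 64, 0)"]

# ast.literal_eval safely parses each string into a tuple of ints, exactly as
# `added_tokens_decoder[ast.literal_eval(idx)] = token` does above.
decoded_ids = [ast.literal_eval(key) for key in serialized_ids]
print(decoded_ids)  # [(192, 128, 64, 0), (193, 128, 64, 0), (194, 128, 64, 0)]
assert all(isinstance(i, tuple) and len(i) == 4 for i in decoded_ids)
```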
tokenizer_config.json ADDED
@@ -0,0 +1,546 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "(192, 128, 64, 0)": {
4
+ "content": "<|pad|>",
5
+ "lstrip": false,
6
+ "normalized": true,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "(193, 128, 64, 0)": {
12
+ "content": "<|end_of_text|>",
13
+ "lstrip": false,
14
+ "normalized": true,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "(194, 128, 64, 0)": {
20
+ "content": "<|begin_of_text|>",
21
+ "lstrip": false,
22
+ "normalized": true,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "(195, 128, 64, 0)": {
28
+ "content": "<|cls|>",
29
+ "lstrip": false,
30
+ "normalized": true,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "(196, 128, 64, 0)": {
36
+ "content": "<|sep|>",
37
+ "lstrip": false,
38
+ "normalized": true,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ },
43
+ "(197, 128, 64, 0)": {
44
+ "content": "<|mask|>",
45
+ "lstrip": false,
46
+ "normalized": true,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": true
50
+ },
51
+ "(198, 128, 64, 0)": {
52
+ "content": "<|vision_start|>",
53
+ "lstrip": false,
54
+ "normalized": true,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": false
58
+ },
59
+ "(199, 128, 64, 0)": {
60
+ "content": "<|vision_br|>",
61
+ "lstrip": false,
62
+ "normalized": true,
63
+ "rstrip": false,
64
+ "single_word": false,
65
+ "special": false
66
+ },
67
+ "(200, 128, 64, 0)": {
68
+ "content": "<|vision_end|>",
69
+ "lstrip": false,
70
+ "normalized": true,
71
+ "rstrip": false,
72
+ "single_word": false,
73
+ "special": false
74
+ },
75
+ "(201, 128, 64, 0)": {
76
+ "content": "<|start_header_id|>",
77
+ "lstrip": false,
78
+ "normalized": true,
79
+ "rstrip": false,
80
+ "single_word": false,
81
+ "special": false
82
+ },
83
+ "(202, 128, 64, 0)": {
84
+ "content": "<|end_header_id|>",
85
+ "lstrip": false,
86
+ "normalized": true,
87
+ "rstrip": false,
88
+ "single_word": false,
89
+ "special": false
90
+ },
91
+ "(203, 128, 64, 0)": {
92
+ "content": "<|end_of_role|>",
93
+ "lstrip": false,
94
+ "normalized": true,
95
+ "rstrip": false,
96
+ "single_word": false,
97
+ "special": false
98
+ },
99
+ "(204, 128, 64, 0)": {
100
+ "content": "<|extra_id_0|>",
101
+ "lstrip": false,
102
+ "normalized": true,
103
+ "rstrip": false,
104
+ "single_word": false,
105
+ "special": false
106
+ },
107
+ "(205, 128, 64, 0)": {
108
+ "content": "<|extra_id_1|>",
109
+ "lstrip": false,
110
+ "normalized": true,
111
+ "rstrip": false,
112
+ "single_word": false,
113
+ "special": false
114
+ },
115
+ "(206, 128, 64, 0)": {
116
+ "content": "<|extra_id_2|>",
117
+ "lstrip": false,
118
+ "normalized": true,
119
+ "rstrip": false,
120
+ "single_word": false,
121
+ "special": false
122
+ },
123
+ "(207, 128, 64, 0)": {
124
+ "content": "<|extra_id_3|>",
125
+ "lstrip": false,
126
+ "normalized": true,
127
+ "rstrip": false,
128
+ "single_word": false,
129
+ "special": false
130
+ },
131
+ "(208, 128, 64, 0)": {
132
+ "content": "<|extra_id_4|>",
133
+ "lstrip": false,
134
+ "normalized": true,
135
+ "rstrip": false,
136
+ "single_word": false,
137
+ "special": false
138
+ },
139
+ "(209, 128, 64, 0)": {
140
+ "content": "<|extra_id_5|>",
141
+ "lstrip": false,
142
+ "normalized": true,
143
+ "rstrip": false,
144
+ "single_word": false,
145
+ "special": false
146
+ },
147
+ "(210, 128, 64, 0)": {
148
+ "content": "<|extra_id_6|>",
149
+ "lstrip": false,
150
+ "normalized": true,
151
+ "rstrip": false,
152
+ "single_word": false,
153
+ "special": false
154
+ },
155
+ "(211, 128, 64, 0)": {
156
+ "content": "<|extra_id_7|>",
157
+ "lstrip": false,
158
+ "normalized": true,
159
+ "rstrip": false,
160
+ "single_word": false,
161
+ "special": false
162
+ },
163
+ "(212, 128, 64, 0)": {
164
+ "content": "<|extra_id_8|>",
165
+ "lstrip": false,
166
+ "normalized": true,
167
+ "rstrip": false,
168
+ "single_word": false,
169
+ "special": false
170
+ },
171
+ "(213, 128, 64, 0)": {
172
+ "content": "<|extra_id_9|>",
173
+ "lstrip": false,
174
+ "normalized": true,
175
+ "rstrip": false,
176
+ "single_word": false,
177
+ "special": false
178
+ },
179
+ "(214, 128, 64, 0)": {
180
+ "content": "<|extra_id_10|>",
181
+ "lstrip": false,
182
+ "normalized": true,
183
+ "rstrip": false,
184
+ "single_word": false,
185
+ "special": false
186
+ },
187
+ "(215, 128, 64, 0)": {
188
+ "content": "<|extra_id_11|>",
189
+ "lstrip": false,
190
+ "normalized": true,
191
+ "rstrip": false,
192
+ "single_word": false,
193
+ "special": false
194
+ },
195
+ "(216, 128, 64, 0)": {
196
+ "content": "<|extra_id_12|>",
197
+ "lstrip": false,
198
+ "normalized": true,
199
+ "rstrip": false,
200
+ "single_word": false,
201
+ "special": false
202
+ },
203
+ "(217, 128, 64, 0)": {
204
+ "content": "<|extra_id_13|>",
205
+ "lstrip": false,
206
+ "normalized": true,
207
+ "rstrip": false,
208
+ "single_word": false,
209
+ "special": false
210
+ },
211
+ "(218, 128, 64, 0)": {
212
+ "content": "<|extra_id_14|>",
213
+ "lstrip": false,
214
+ "normalized": true,
215
+ "rstrip": false,
216
+ "single_word": false,
217
+ "special": false
218
+ },
219
+ "(219, 128, 64, 0)": {
220
+ "content": "<|extra_id_15|>",
221
+ "lstrip": false,
222
+ "normalized": true,
223
+ "rstrip": false,
224
+ "single_word": false,
225
+ "special": false
226
+ },
227
+ "(220, 128, 64, 0)": {
228
+ "content": "<|extra_id_16|>",
229
+ "lstrip": false,
230
+ "normalized": true,
231
+ "rstrip": false,
232
+ "single_word": false,
233
+ "special": false
234
+ },
235
+ "(221, 128, 64, 0)": {
236
+ "content": "<|extra_id_17|>",
237
+ "lstrip": false,
238
+ "normalized": true,
239
+ "rstrip": false,
240
+ "single_word": false,
241
+ "special": false
242
+ },
243
+ "(222, 128, 64, 0)": {
244
+ "content": "<|extra_id_18|>",
245
+ "lstrip": false,
246
+ "normalized": true,
247
+ "rstrip": false,
248
+ "single_word": false,
249
+ "special": false
250
+ },
251
+ "(223, 128, 64, 0)": {
252
+ "content": "<|extra_id_19|>",
253
+ "lstrip": false,
254
+ "normalized": true,
255
+ "rstrip": false,
256
+ "single_word": false,
257
+ "special": false
258
+ },
259
+ "(224, 128, 64, 0)": {
260
+ "content": "<|extra_id_20|>",
261
+ "lstrip": false,
262
+ "normalized": true,
263
+ "rstrip": false,
264
+ "single_word": false,
265
+ "special": false
266
+ },
267
+ "(225, 128, 64, 0)": {
268
+ "content": "<|extra_id_21|>",
269
+ "lstrip": false,
270
+ "normalized": true,
271
+ "rstrip": false,
272
+ "single_word": false,
273
+ "special": false
274
+ },
275
+ "(226, 128, 64, 0)": {
276
+ "content": "<|extra_id_22|>",
277
+ "lstrip": false,
278
+ "normalized": true,
279
+ "rstrip": false,
280
+ "single_word": false,
281
+ "special": false
282
+ },
283
+ "(227, 128, 64, 0)": {
284
+ "content": "<|extra_id_23|>",
285
+ "lstrip": false,
286
+ "normalized": true,
287
+ "rstrip": false,
288
+ "single_word": false,
289
+ "special": false
290
+ },
291
+ "(228, 128, 64, 0)": {
292
+ "content": "<|extra_id_24|>",
293
+ "lstrip": false,
294
+ "normalized": true,
295
+ "rstrip": false,
296
+ "single_word": false,
297
+ "special": false
298
+ },
299
+ "(229, 128, 64, 0)": {
300
+ "content": "<|extra_id_25|>",
301
+ "lstrip": false,
302
+ "normalized": true,
303
+ "rstrip": false,
304
+ "single_word": false,
305
+ "special": false
306
+ },
307
+ "(230, 128, 64, 0)": {
308
+ "content": "<|extra_id_26|>",
309
+ "lstrip": false,
310
+ "normalized": true,
311
+ "rstrip": false,
312
+ "single_word": false,
313
+ "special": false
314
+ },
315
+ "(231, 128, 64, 0)": {
316
+ "content": "<|extra_id_27|>",
317
+ "lstrip": false,
318
+ "normalized": true,
319
+ "rstrip": false,
320
+ "single_word": false,
321
+ "special": false
322
+ },
323
+ "(232, 128, 64, 0)": {
324
+ "content": "<|extra_id_28|>",
325
+ "lstrip": false,
326
+ "normalized": true,
327
+ "rstrip": false,
328
+ "single_word": false,
329
+ "special": false
330
+ },
331
+ "(233, 128, 64, 0)": {
332
+ "content": "<|extra_id_29|>",
333
+ "lstrip": false,
334
+ "normalized": true,
335
+ "rstrip": false,
336
+ "single_word": false,
337
+ "special": false
338
+ },
339
+ "(234, 128, 64, 0)": {
340
+ "content": "<|extra_id_30|>",
341
+ "lstrip": false,
342
+ "normalized": true,
343
+ "rstrip": false,
344
+ "single_word": false,
345
+ "special": false
346
+ },
347
+ "(235, 128, 64, 0)": {
348
+ "content": "<|extra_id_31|>",
349
+ "lstrip": false,
350
+ "normalized": true,
351
+ "rstrip": false,
352
+ "single_word": false,
353
+ "special": false
354
+ },
355
+ "(236, 128, 64, 0)": {
356
+ "content": "<|extra_id_32|>",
357
+ "lstrip": false,
358
+ "normalized": true,
359
+ "rstrip": false,
360
+ "single_word": false,
361
+ "special": false
362
+ },
363
+ "(237, 128, 64, 0)": {
364
+ "content": "<|extra_id_33|>",
365
+ "lstrip": false,
366
+ "normalized": true,
367
+ "rstrip": false,
368
+ "single_word": false,
369
+ "special": false
370
+ },
371
+ "(238, 128, 64, 0)": {
372
+ "content": "<|extra_id_34|>",
373
+ "lstrip": false,
374
+ "normalized": true,
375
+ "rstrip": false,
376
+ "single_word": false,
377
+ "special": false
378
+ },
379
+ "(239, 128, 64, 0)": {
380
+ "content": "<|extra_id_35|>",
381
+ "lstrip": false,
382
+ "normalized": true,
383
+ "rstrip": false,
384
+ "single_word": false,
385
+ "special": false
386
+ },
387
+ "(240, 128, 64, 0)": {
388
+ "content": "<|extra_id_36|>",
389
+ "lstrip": false,
390
+ "normalized": true,
391
+ "rstrip": false,
392
+ "single_word": false,
393
+ "special": false
394
+ },
395
+ "(241, 128, 64, 0)": {
396
+ "content": "<|extra_id_37|>",
397
+ "lstrip": false,
398
+ "normalized": true,
399
+ "rstrip": false,
400
+ "single_word": false,
401
+ "special": false
402
+ },
403
+ "(242, 128, 64, 0)": {
404
+ "content": "<|extra_id_38|>",
405
+ "lstrip": false,
406
+ "normalized": true,
407
+ "rstrip": false,
408
+ "single_word": false,
409
+ "special": false
410
+ },
411
+ "(243, 128, 64, 0)": {
412
+ "content": "<|extra_id_39|>",
413
+ "lstrip": false,
414
+ "normalized": true,
415
+ "rstrip": false,
416
+ "single_word": false,
417
+ "special": false
418
+ },
419
+ "(244, 128, 64, 0)": {
420
+ "content": "<|extra_id_40|>",
421
+ "lstrip": false,
422
+ "normalized": true,
423
+ "rstrip": false,
424
+ "single_word": false,
425
+ "special": false
426
+ },
427
+ "(245, 128, 64, 0)": {
428
+ "content": "<|extra_id_41|>",
429
+ "lstrip": false,
430
+ "normalized": true,
431
+ "rstrip": false,
432
+ "single_word": false,
433
+ "special": false
434
+ },
435
+ "(246, 128, 64, 0)": {
436
+ "content": "<|extra_id_42|>",
437
+ "lstrip": false,
438
+ "normalized": true,
439
+ "rstrip": false,
440
+ "single_word": false,
441
+ "special": false
442
+ },
443
+ "(247, 128, 64, 0)": {
444
+ "content": "<|extra_id_43|>",
445
+ "lstrip": false,
446
+ "normalized": true,
447
+ "rstrip": false,
448
+ "single_word": false,
449
+ "special": false
450
+ },
451
+ "(248, 128, 64, 0)": {
452
+ "content": "<|extra_id_44|>",
453
+ "lstrip": false,
454
+ "normalized": true,
455
+ "rstrip": false,
456
+ "single_word": false,
457
+ "special": false
458
+ },
459
+ "(249, 128, 64, 0)": {
460
+ "content": "<|extra_id_45|>",
461
+ "lstrip": false,
462
+ "normalized": true,
463
+ "rstrip": false,
464
+ "single_word": false,
465
+ "special": false
466
+ },
467
+ "(250, 128, 64, 0)": {
468
+ "content": "<|extra_id_46|>",
469
+ "lstrip": false,
470
+ "normalized": true,
471
+ "rstrip": false,
472
+ "single_word": false,
473
+ "special": false
474
+ }
475
+ },
476
+ "bos_token": "<|begin_of_text|>",
477
+ "clean_up_tokenization_spaces": true,
478
+ "cls_token": "<|cls|>",
479
+ "end_header_id_token": {
480
+ "__type": "AddedToken",
481
+ "content": "<|end_header_id|>",
482
+ "lstrip": false,
483
+ "normalized": true,
484
+ "rstrip": false,
485
+ "single_word": false,
486
+ "special": false
487
+ },
488
+ "eor_id": {
489
+ "__type": "AddedToken",
490
+ "content": "<|end_of_role|>",
491
+ "lstrip": false,
492
+ "normalized": true,
493
+ "rstrip": false,
494
+ "single_word": false,
495
+ "special": false
496
+ },
497
+ "eos_token": "<|end_of_text|>",
498
+ "mask_token": "<|mask|>",
499
+ "model_max_length": 1000000000000000019884624838656,
500
+ "pad_token": "<|pad|>",
501
+ "sep_token": "<|sep|>",
502
+ "start_header_id_token": {
503
+ "__type": "AddedToken",
504
+ "content": "<|start_header_id|>",
505
+ "lstrip": false,
506
+ "normalized": true,
507
+ "rstrip": false,
508
+ "single_word": false,
509
+ "special": false
510
+ },
511
+ "tokenizer_class": "ByteLMTokenizerV3",
512
+ "vision_br_token": {
513
+ "__type": "AddedToken",
514
+ "content": "<|vision_br|>",
515
+ "lstrip": false,
516
+ "normalized": true,
517
+ "rstrip": false,
518
+ "single_word": false,
519
+ "special": false
520
+ },
521
+ "vision_end_token": {
522
+ "__type": "AddedToken",
523
+ "content": "<|vision_end|>",
524
+ "lstrip": false,
525
+ "normalized": true,
526
+ "rstrip": false,
527
+ "single_word": false,
528
+ "special": false
529
+ },
530
+ "vision_start_token": {
531
+ "__type": "AddedToken",
532
+ "content": "<|vision_start|>",
533
+ "lstrip": false,
534
+ "normalized": true,
535
+ "rstrip": false,
536
+ "single_word": false,
537
+ "special": false
538
+ },
539
+ "auto_map": {
540
+ "AutoTokenizer": [
541
+ "tokenization_utf8_like_byte_v3.ByteLMTokenizerV3",
542
+ null
543
+ ]
544
+ },
545
+ "padding_side": "left"
546
+ }
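The `added_tokens_decoder` above keys every added token by its 4-byte id tuple; only the core control tokens (`<|pad|>`, `<|end_of_text|>`, `<|begin_of_text|>`, `<|cls|>`, `<|sep|>`, `<|mask|>`) are flagged `"special": true`, while the vision, header, and `<|extra_id_*|>` tokens are not. A small sketch that summarizes the file, assuming a local copy of `tokenizer_config.json` in the working directory:

```python
import ast
import json

with open("tokenizer_config.json", encoding="utf-8") as f:
    config = json.load(f)

# List every added token with its parsed 4-byte id and its "special" flag.
for key, entry in config["added_tokens_decoder"].items():
    token_id = ast.literal_eval(key)  # e.g. (192, 128, 64, 0)
    kind = "special" if entry["special"] else "regular"
    print(f"{entry['content']:<22} id={token_id} [{kind}]")

print("bos:", config["bos_token"], "eos:", config["eos_token"], "pad:", config["pad_token"])
```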