Update README.md
README.md CHANGED
@@ -28,6 +28,7 @@ Inspired by [DeBERTa Reward Model Series](https://huggingface.co/OpenAssistant/r
 - Paper: [https://arxiv.org/abs/2306.02561](https://arxiv.org/abs/2306.02561)
 - Space Demo: [https://huggingface.co/spaces/llm-blender/LLM-Blender](https://huggingface.co/spaces/llm-blender/LLM-Blender)
 
+
 ## Statistics
 
 ### Context length
@@ -36,58 +37,6 @@ Inspired by [DeBERTa Reward Model Series](https://huggingface.co/OpenAssistant/r
 | [pair-ranker](https://huggingface.co/llm-blender/pair-ranker)               | 128               | 128                  | 384              |
 | [PairRM](https://huggingface.co/llm-blender/pair-reward-model/) (This model) | 1224              | 412                  | 2048             |
 
-### Performance
-
-PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and exhibits a strong correlation with human preferences,
-despite its extremely small model size (0.4B), approaching the performance of GPT-4.
-
-We test pairwise comparison on:
-- [Auto-J pairwise test data](https://github.com/GAIR-NLP/auto-j#pairwise-response-comparison)
-- [HHH-alignment](https://huggingface.co/datasets/HuggingFaceH4/hhh_alignment)
-- [MT-bench-human-judgements](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments)
-
-#### Auto-J pairwise test data performance
-
-|         Model         |    Summ   |    Exam   |    Code   | Rewriting |   Crea W  |   Func W  |  Comm |    NLP   |  Overall  |
-|:---------------------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:-----:|:--------:|:---------:|
-|  Closed-source Models |
-|        ChatGPT        |    33.3   |    40.3   |    36.6   |    31.6   |    48.2   |    40.4   |  47.6 |   45.8   |    42.7   |
-|        Claude-2       |    30.6   |    36.1   |    41.7   |    34.2   |    48.1   |    42.5   |  40.6 |   48.5   |    42.4   |
-|         GPT-4         |    59.7   |    51.4   |    69.2   |    58.3   |    66.7   |    60.4   |  58.3 |   65.2   |    61.9   |
-|   Open-source Models  |
-|        SteamSHP       |    33.3   |    29.2   |    26.7   |    33.3   |    40.7   |    31.3   |  51.4 |   51.9   |    40.6   |
-|        PandaLM        |    29.2   |    33.3   |    31.7   |    23.3   |    43.5   |    32.9   |  44.8 |   48.9   |    38.9   |
-|    LLaMA-2-Chat-13B   |    20.8   |    27.8   |    19.2   |    20     |    31.5   |    27.5   |  35.8 |   31.8   |    29     |
-|    Vicuna-13B-v1.5    |    30.6   |    23.6   |    35     |    28.3   |    36.1   |    37.5   |  45.5 |   39.8   |    37.3   |
-|   WizardLM-13B-v1.2   |    22.2   |    20.8   |    32.5   |    19.2   |    28.7   |    25.4   |  29.2 |   33     |    27.8   |
-|    LLaMA-2-Chat-70B   |    34.7   |    33.3   |    36.7   |    35.8   |    51.4   |    54.2   |  47.2 |   47.7   |    45.9   |
-|      AUTO-J (13B)     |    45.8   |    38.9   |    59.2   |    47.5   |    54.6   |    57.1   | **58**|   57.6   |    54.8   |
-|    **PairRM (0.4B)**  | **56.94** | **52.78** | **58.33** | **55.83** | **61.57** | **59.17** | 57.64 | **62.5** | **59.05** |
-
-#### HHH-Alignment and MT-bench human judgements
-
-|        Evaluator LM       | HHH ALIGNMENT |           |           |          |            | MT BENCH HUMAN JUDG. |
-|:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:----------:|:--------------------:|
-|                           |     Help.     |   Harm.   |    Hon.   |   Other  | Total Avg. |   Human Preference   |
-|           RANDOM          |       50      |     50    |     50    |    50    |     50     |         34.26        |
-|  STANFORDNLP REWARD MODEL |     69.49     |   60.34   |   52.46   |   51.16  |    58.82   |         44.79        |
-|    ALMOST REWARD MODEL    |     74.58     |   67.24   |   78.69   |   86.05  |    76.02   |         49.9         |
-|       LLAMA2-CHAT 7B      |      66.1     |   81.03   |   70.49   |   74.42  |    72.85   |         51.78        |
-|      LLAMA2-CHAT 13B      |     74.58     |   87.93   |   55.74   |   79.07  |    73.76   |         52.34        |
-|      LLAMA2-CHAT 70B      |      66.1     | **89.66** |   67.21   |   74.42  |    74.21   |         53.67        |
-|  LLAMA2-CHAT 13B+COARSE.  |     68.74     |   68.97   |   65.57   |   67.44  |    67.42   |         46.89        |
-|     GPT-3.5-TURBO-0613    |     76.27     |   87.93   |   67.21   |   86.05  |    78.73   |         57.12        |
-|       PROMETHEUS 7B       |     69.49     |   84.48   |   78.69   |   90.7   |    80.09   |         55.14        |
-|       PROMETHEUS 13B      |     81.36     |   82.76   |   75.41   |   76.74  |    79.19   |         57.72        |
-|      **PairRM (0.4B)**    |   **84.75**   |   84.48   | **80.33** | **90.7** |  **84.62** |        **59**        |
-|         GPT-4-0613        |     91.53     |    93.1   |   85.25   |   83.72  |    88.69   |         63.87        |
-
-**While PairRM is an extremely small model (0.4B) based on DeBERTa, its pairwise comparison agreement approaches GPT-4's performance!**
-
-We attribute this to two factors:
-- PairRM's model architecture is specifically designed for pairwise comparison through bidirectional attention (see the LLM-Blender paper for more details).
-- The high-quality, large-scale human preference annotation data it was trained on (see the training dataset list on this Hugging Face page).
-
 ## Usage Example
 
 ### Installation
@@ -192,6 +141,60 @@ With a `blender.compare()` function, you can easily apply PairRM to popular RLH
 
 Learn more in our LLM-Blender Github [README.md](https://github.com/yuchenlin/LLM-Blender#rank-and-fusion)
 
+### Performance
+
+PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and exhibits a strong correlation with human preferences,
+despite its extremely small model size (0.4B), approaching the performance of GPT-4.
+
+We test pairwise comparison on:
+- [Auto-J pairwise test data](https://github.com/GAIR-NLP/auto-j#pairwise-response-comparison)
+- [HHH-alignment](https://huggingface.co/datasets/HuggingFaceH4/hhh_alignment)
+- [MT-bench-human-judgements](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments)
+
+#### Auto-J pairwise test data performance
+
+|         Model         |    Summ   |    Exam   |    Code   | Rewriting |   Crea W  |   Func W  |  Comm |    NLP   |  Overall  |
+|:---------------------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:-----:|:--------:|:---------:|
+|  Closed-source Models |
+|        ChatGPT        |    33.3   |    40.3   |    36.6   |    31.6   |    48.2   |    40.4   |  47.6 |   45.8   |    42.7   |
+|        Claude-2       |    30.6   |    36.1   |    41.7   |    34.2   |    48.1   |    42.5   |  40.6 |   48.5   |    42.4   |
+|         GPT-4         |    59.7   |    51.4   |    69.2   |    58.3   |    66.7   |    60.4   |  58.3 |   65.2   |    61.9   |
+|   Open-source Models  |
+|        SteamSHP       |    33.3   |    29.2   |    26.7   |    33.3   |    40.7   |    31.3   |  51.4 |   51.9   |    40.6   |
+|        PandaLM        |    29.2   |    33.3   |    31.7   |    23.3   |    43.5   |    32.9   |  44.8 |   48.9   |    38.9   |
+|    LLaMA-2-Chat-13B   |    20.8   |    27.8   |    19.2   |    20     |    31.5   |    27.5   |  35.8 |   31.8   |    29     |
+|    Vicuna-13B-v1.5    |    30.6   |    23.6   |    35     |    28.3   |    36.1   |    37.5   |  45.5 |   39.8   |    37.3   |
+|   WizardLM-13B-v1.2   |    22.2   |    20.8   |    32.5   |    19.2   |    28.7   |    25.4   |  29.2 |   33     |    27.8   |
+|    LLaMA-2-Chat-70B   |    34.7   |    33.3   |    36.7   |    35.8   |    51.4   |    54.2   |  47.2 |   47.7   |    45.9   |
+|      AUTO-J (13B)     |    45.8   |    38.9   |    59.2   |    47.5   |    54.6   |    57.1   | **58**|   57.6   |    54.8   |
+|    **PairRM (0.4B)**  | **56.94** | **52.78** | **58.33** | **55.83** | **61.57** | **59.17** | 57.64 | **62.5** | **59.05** |
+
+#### HHH-Alignment and MT-bench human judgements
+
+|        Evaluator LM       | HHH ALIGNMENT |           |           |          |            | MT BENCH HUMAN JUDG. |
+|:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:----------:|:--------------------:|
+|                           |     Help.     |   Harm.   |    Hon.   |   Other  | Total Avg. |   Human Preference   |
+|           RANDOM          |       50      |     50    |     50    |    50    |     50     |         34.26        |
+|  STANFORDNLP REWARD MODEL |     69.49     |   60.34   |   52.46   |   51.16  |    58.82   |         44.79        |
+|    ALMOST REWARD MODEL    |     74.58     |   67.24   |   78.69   |   86.05  |    76.02   |         49.9         |
+|       LLAMA2-CHAT 7B      |      66.1     |   81.03   |   70.49   |   74.42  |    72.85   |         51.78        |
+|      LLAMA2-CHAT 13B      |     74.58     |   87.93   |   55.74   |   79.07  |    73.76   |         52.34        |
+|      LLAMA2-CHAT 70B      |      66.1     | **89.66** |   67.21   |   74.42  |    74.21   |         53.67        |
+|  LLAMA2-CHAT 13B+COARSE.  |     68.74     |   68.97   |   65.57   |   67.44  |    67.42   |         46.89        |
+|     GPT-3.5-TURBO-0613    |     76.27     |   87.93   |   67.21   |   86.05  |    78.73   |         57.12        |
+|       PROMETHEUS 7B       |     69.49     |   84.48   |   78.69   |   90.7   |    80.09   |         55.14        |
+|       PROMETHEUS 13B      |     81.36     |   82.76   |   75.41   |   76.74  |    79.19   |         57.72        |
+|      **PairRM (0.4B)**    |   **84.75**   |   84.48   | **80.33** | **90.7** |  **84.62** |        **59**        |
+|         GPT-4-0613        |     91.53     |    93.1   |   85.25   |   83.72  |    88.69   |         63.87        |
+
+**While PairRM is an extremely small model (0.4B) based on DeBERTa, its pairwise comparison agreement approaches GPT-4's performance!**
+
+We attribute this to two factors:
+- PairRM's model architecture is specifically designed for pairwise comparison through bidirectional attention (see the LLM-Blender paper for more details).
+- The high-quality, large-scale human preference annotation data it was trained on (see the training dataset list on this Hugging Face page).
+
             
            ## Citation
         | 
| 199 | 
             
            If you are using PairRM in your research, please cite LLM-blender.
         | 
| 200 | 
             
            ```bibtex
         | 
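The hunk header above references `blender.compare()`, while the Usage Example section itself is collapsed in this diff. For context, here is a minimal sketch of how PairRM is typically driven through the `llm_blender` wrapper; the install path and the `Blender()`, `loadranker()`, and `compare()` calls follow the LLM-Blender README, but treat the exact signatures, checkpoint name, and return format as assumptions to verify against that repository.

```python
# Minimal sketch (not part of this diff): pairwise comparison with PairRM
# via the llm_blender wrapper. Assumes installation with
#   pip install git+https://github.com/yuchenlin/LLM-Blender.git
# API names follow the LLM-Blender README; verify against the repo.
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # load the 0.4B PairRM ranker

# One instruction with two candidate responses to compare.
inputs = ["Explain the difference between a list and a tuple in Python."]
candidates_A = ["Lists are mutable, tuples are immutable; both hold ordered items."]
candidates_B = ["They are exactly the same; the names are interchangeable."]

# compare() scores each (input, candidate_A, candidate_B) triple; per the
# LLM-Blender README, a True/positive result means candidate_A is preferred.
comparison_results = blender.compare(inputs, candidates_A, candidates_B)
print(comparison_results)
```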

