f-galkin committed on
Commit
33bec02
·
verified ·
1 Parent(s): 6ad0288

update readme

  1. README.md +79 -29
README.md CHANGED
@@ -14,7 +14,7 @@ tags:
14
  <p align="center">
15
  📃 <a href="https://doi.org/10.1101/2024.07.25.605062" target="_blank">Pre-print</a> • 👾 <a href="https://discord.gg/pSjWbmjX" target="_blank">Discord bot</a> • 🧬 <a href="https://insilico.com/repository/precious3gpt" target="_blank">Validation digest</a> <br>
16
  </p>
17
- <div align=center><img src="P3GPT_architecture.png" width="70%" height="70%" /></div>
18
 
19
  - **Developer**: [Insilico Medicine](https://insilico.com/precious)
20
  - **License**: cc-by-nc-4.0
@@ -22,8 +22,25 @@ tags:
22
  - **Domain**: Biomedical
23
  - **Base architecture**: [MPT](https://huggingface.co/mosaicml/mpt-7b)
24
 
25
- ### Run model using endpoint step by step
26
 
27
 
28
  **Step 1 - connect to endpoint**
29
  ```python
@@ -61,12 +78,12 @@ request_config = {"inputs": config_data, "mode": "meta2diff", "parameters": {
61
 
62
  ```
63
 
64
- **How Precisou3GPT will see given request**
65
  ```text
66
  [BOS]<age_group2diff2age_group><disease2diff2disease><compound2diff2compound><tissue>lung </tissue><age_individ></age_individ><cell></cell><efo>EFO_0000768 </efo><datatype>expression </datatype><drug>curcumin </drug><dose></dose><time></time><case>70.0-80.0 80.0-90.0 </case><control></control><dataset_type></dataset_type><gender>m </gender><species>human </species>
67
  ```
68
 
69
- **Step 3 - send request to endpoint**
70
  ```python
71
  output = query(request_config)
72
  ```
@@ -85,8 +102,11 @@ output = query(request_config)
85
 
86
  }
87
  ```
 
88
  Note: If the ```mode``` was supposed to generate compounds, the output would contain ```compounds: List```.
89
 
 
 
90
  ---
91
 
92
  ### Run model locally
@@ -128,38 +148,47 @@ output = precious3gpt_handler(request_config)
128
  ---
129
  ## Precious3GPT request configuration
130
 
131
- ### Generation Modes (`mode` in config)
132
 
133
- Choose the appropriate mode based on your requirements:
134
 
135
- 1. **meta2diff**: Generate signature (up- and down- gene lists) given meta-data such as tissue, compound, gender, etc.
136
- 2. **diff2compound**: Predict compounds based on signature.
137
- 3. **meta2diff2compound**: Generate signatures given meta-data and then predict compounds based on generated signatures.
 
 
138
 
139
- ---
140
 
 
141
 
142
- ### Instruction (`inputs.instruction` in config)
143
 
144
- 1. disease2diff2disease - generate signature for disease / predict disease based on given signature
145
- 2. compound2diff2compound - generate signature for compound / predict compound based on given signature
146
- 3. age_group2diff2age_group - generate signature for age group / predict age group based on signature
 
 
 
 
147
 
148
 
149
  ### Other meta-data (`inputs.` in config)
150
 
151
- Full list of available values for each meta-data item you can find in ```p3_entities_with_type.csv```
 
 
152
 
153
 
154
 
155
  ## Examples
156
 
157
- In the following examples all possible configuration fields are specified. You can leave some meta-data fields in the ```inputs``` section empty string(```""```) or empty list(```[]```).
158
 
159
- _**Example 1**_
 
 
160
 
161
- If you want to generate a signature given specific meta-data you can use the following configuration. Note, ```up``` and ```down``` fields are empty lists as you want to generate them.
162
- Here we ask the model to generate a signature for a human within the age group of 70-90 years, male, in tissue - Lungs with disease EFO_0000768.
163
 
164
  ```json
165
  {
@@ -178,7 +207,7 @@ Here we ask the model to generate a signature for a human within the age group o
178
  }
179
  ```
180
 
181
- Here is output:
182
  ```json
183
  {
184
  "output": {
@@ -191,13 +220,15 @@ Here is output:
191
  "random_seed": 137
192
  }
193
  ```
 
194
 
195
-
196
- _**Example 2**_
197
-
198
- Now let's generate a signature for a healthy human within the age group of 70-90 years, male, in tissue - whole blood.
199
- Note, here we use ```disease2diff2disease``` instruction, but we expect to generate signatures for a healthy human, that's why we'd set ```efo``` to empty string "".
200
- Alternatively, for this example we can add one more instruction to example 2 - "instruction": ["disease2diff2disease", "age_group2diff2age_group"]
 
201
 
202
  ```json
203
  {
@@ -222,7 +253,7 @@ Alternatively, for this example we can add one more instruction to example 2 - "
222
 
223
  ```
224
 
225
- Here is output:
226
  ```json
227
  {
228
  "output": {
@@ -235,8 +266,27 @@ Here is output:
235
  "random_seed": 137
236
  }
237
  ```
238
-
239
  ---
240
 
241
  ## Multi-Modality
242
- Applies by default in tasks where you pass a signature. For each gene in up- and down- lists the model gets embeddings from Knowledge Graph and Text NNs. Then embeddings are averaged in order to obtain one embedding for each modality for each gene list (4 averaged embeddings in total).
14
  <p align="center">
15
  📃 <a href="https://doi.org/10.1101/2024.07.25.605062" target="_blank">Pre-print</a> • 👾 <a href="https://discord.gg/pSjWbmjX" target="_blank">Discord bot</a> • 🧬 <a href="https://insilico.com/repository/precious3gpt" target="_blank">Validation digest</a> <br>
16
  </p>
17
+ <div align=center><img src="P3GPT_architecture.png" width="80%" height="80%" /></div>
18
 
19
  - **Developer**: [Insilico Medicine](https://insilico.com/precious)
20
  - **License**: cc-by-nc-4.0
 
22
  - **Domain**: Biomedical
23
  - **Base architecture**: [MPT](https://huggingface.co/mosaicml/mpt-7b)
24
 
 
25
 
26
+ <h1 align="center"> Model summary </h1>
27
+
28
+ - Precious3GPT (P3GPT) is a unique language model that has been trained on 1.2MM omics data points, knowledge graphs, and biomedical texts (PubMed) to be used in drug discovery and aging research;
29
+
30
+ - P3GPT simulates biological processes on an omics level to return the transcriptomic, epigenetic, or proteomic signatures of a wide variety of perturbators;
31
+
32
+ - Various modes of execution allow users to replicate the workflows of chemical screenings, case-control observational studies, and other popular research settings;
33
+
34
+ - The context of P3GPT-simulated experiments can be defined with >60k biomedical entities, including 3 species, 569 tissues and cell lines, 635 health conditions, and 22k small molecules;
35
+
36
+ - You may work with P3GPT either by downloading the model weights for a local deployment or by interacting with the Discord bot on Insilico Medicine's official server.
37
+
38
+
39
+ <h1 align="center"> Model usage guide </h1>
40
+
41
+ ### Run model with an endpoint
42
+ <details>
43
+ <summary style="font-weight:600">Details</summary>
44
 
45
  **Step 1 - connect to endpoint**
46
  ```python
 
78
 
79
  ```
80
 
81
+ **Actual request processed by Precious3GPT**
82
  ```text
83
  [BOS]<age_group2diff2age_group><disease2diff2disease><compound2diff2compound><tissue>lung </tissue><age_individ></age_individ><cell></cell><efo>EFO_0000768 </efo><datatype>expression </datatype><drug>curcumin </drug><dose></dose><time></time><case>70.0-80.0 80.0-90.0 </case><control></control><dataset_type></dataset_type><gender>m </gender><species>human </species>
84
  ```
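
The prompt above appears to follow a simple pattern: instruction names become bare tags right after `[BOS]`, and every metadata field is wrapped in a matching tag pair, with each value followed by a space. Below is a minimal sketch of that serialization. It assumes the field names and their order match the tags shown above; the `build_prompt` helper is hypothetical and is not taken from the P3GPT handler, which may do more.

```python
from typing import Dict, List, Union

Value = Union[str, List[str]]


def build_prompt(instructions: List[str], meta: Dict[str, Value]) -> str:
    """Serialize instructions and metadata into a tagged prompt string.

    Hypothetical helper inferred from the prompt text shown in this README.
    """
    parts = ["[BOS]"]
    # Each instruction becomes a bare opening tag.
    parts.extend(f"<{name}>" for name in instructions)
    # Each metadata field becomes <key>value </key>; empty values give <key></key>.
    for key, value in meta.items():
        items = [value] if isinstance(value, str) else list(value)
        body = "".join(f"{item} " for item in items if item)
        parts.append(f"<{key}>{body}</{key}>")
    return "".join(parts)


# Field order matters here because the tags are emitted in dict order.
example_meta = {
    "tissue": ["lung"], "age_individ": "", "cell": "", "efo": "EFO_0000768",
    "datatype": "expression", "drug": "curcumin", "dose": "", "time": "",
    "case": ["70.0-80.0", "80.0-90.0"], "control": "", "dataset_type": "",
    "gender": "m", "species": "human",
}
example_prompt = build_prompt(
    ["age_group2diff2age_group", "disease2diff2disease", "compound2diff2compound"],
    example_meta,
)
print(example_prompt)
```

With these inputs the sketch reproduces the prompt string shown above, which can be handy for checking which fields a request actually fills in.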
85
 
86
+ **Step 3 - send the request to the endpoint**
87
  ```python
88
  output = query(request_config)
89
  ```
 
102
 
103
  }
104
  ```
105
+
106
  Note: If the ```mode``` was supposed to generate compounds, the output would contain ```compounds: List```.
107
 
108
+ </details>
109
+
110
  ---
111
 
112
  ### Run model locally
 
148
  ---
149
  ## Precious3GPT request configuration
150
 
 
151
 
152
+ ### Instruction (`inputs.instruction` in `config`)
153
 
154
+ Instructions define the experimental setting P3GPT will be simulating using the information provided in the prompt.
155
+
156
+ 1. `disease2diff2disease` - generate an omics signature characterizing a disease / determine the disease based on a given signature;
157
+ 2. `compound2diff2compound` - generate an omics signature of a compound-induced perturbation / determine the compound given its omics signature;
158
+ 3. `age_group2diff2age_group` - generate differential omics for age groups / determine age groups given differential gene lists.
159
 
 
160
 
161
+ ### Generation Modes (`mode` in config)
162
 
163
+ Generation modes are not part of the prompt passed to P3GPT, but they may affect how P3GPT's response is processed or presented:
164
 
165
+ 1. `meta2diff`: The ```compound2diff2compound``` instruction can be executed in either direction. This mode tells P3GPT to return differentially expressed genes rather than compounds;
166
+ 2. `diff2compound`: The reverse of the ```meta2diff``` mode. Make sure to fill in ```up``` and ```down``` in the prompt first;
167
+ 3. `meta2diff2compound`: Runs ```meta2diff``` first and then applies ```diff2compound``` to its output, all in one call.
168
+
169
+ See the ```Precious3GPT_example.ipynb``` tutorial notebook to learn more about building P3GPT requests.
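
Because a mode and the contents of `inputs` have to agree, it may help to sanity-check a request before sending it. The sketch below is a hypothetical pre-flight check, not part of P3GPT: it assumes the request has the `inputs` / `mode` layout used in this README and only encodes the rules stated above (three known modes, three known instructions, and a non-empty signature for `diff2compound`).

```python
from typing import Any, Dict, List

KNOWN_MODES = {"meta2diff", "diff2compound", "meta2diff2compound"}
KNOWN_INSTRUCTIONS = {
    "disease2diff2disease",
    "compound2diff2compound",
    "age_group2diff2age_group",
}


def check_request(request_config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found.

    Hypothetical helper; the real handler may accept or reject other cases.
    """
    problems = []
    mode = request_config.get("mode")
    inputs = request_config.get("inputs", {})

    if mode not in KNOWN_MODES:
        problems.append(f"unknown mode: {mode!r}")

    for name in inputs.get("instruction", []):
        if name not in KNOWN_INSTRUCTIONS:
            problems.append(f"unknown instruction: {name!r}")

    # diff2compound starts from a signature, so at least one gene list is needed.
    # (Assumption: flagging only the case where both lists are empty.)
    if mode == "diff2compound" and not (inputs.get("up") or inputs.get("down")):
        problems.append("diff2compound needs genes in 'up' and/or 'down'")

    return problems


request = {
    "inputs": {"instruction": ["compound2diff2compound"], "up": [], "down": []},
    "mode": "diff2compound",
}
for problem in check_request(request):
    print(problem)
```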
170
+
171
+ ---
172
 
173
 
174
  ### Other meta-data (`inputs.` in config)
175
 
176
+ P3GPT can only simulate experiments that feature the biomedical entities and metadata values present in ```p3_entities_with_type.csv```.
177
+
178
+ If you aim to study a tissue, a compound, or something else using P3GPT, make sure to check that the names of the entities you are using match those in this file.
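
One way to perform that check is to load the file once and look names up by entity type. The sketch below is illustrative only: the column names `entity` and `type` are assumptions about the layout of ```p3_entities_with_type.csv``` and should be adjusted to the real header, and the in-memory sample stands in for the actual file.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List, Set, TextIO


def load_entities(handle: TextIO, name_col: str = "entity",
                  type_col: str = "type") -> Dict[str, Set[str]]:
    """Group entity names by their type. Column names are assumed, not confirmed."""
    by_type = defaultdict(set)
    for row in csv.DictReader(handle):
        by_type[row[type_col]].add(row[name_col].strip())
    return dict(by_type)


def unknown_entities(names: Iterable[str], entity_type: str,
                     by_type: Dict[str, Set[str]]) -> List[str]:
    """Return the names that are not listed under the given entity type."""
    known = by_type.get(entity_type, set())
    return [name for name in names if name not in known]


# Tiny stand-in for the real CSV, using the assumed header.
sample = io.StringIO("entity,type\nlung,tissue\ncurcumin,compound\n")
entities = load_entities(sample)
missing = unknown_entities(["lung", "Lungs"], "tissue", entities)
print(missing)
```

With the real file, the handle would come from something like `open("p3_entities_with_type.csv", newline="")`. The lookup is exact and case-sensitive, so a name such as "Lungs" is reported as unknown even when "lung" is listed.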
179
 
180
 
181
 
182
  ## Examples
183
 
184
+ In the following examples, all possible configuration fields are specified. You can leave some metadata fields in the ```inputs``` section as an empty string (```""```) or an empty list (```[]```).
185
 
186
+ _**Example 1: generate a disease signature**_
187
+ <details>
188
+ <summary style="font-weight:600">Details</summary>
189
 
190
+ If you want to generate a signature given specific metadata, you can use the following configuration. Note that the ```up``` and ```down``` fields are empty lists because the model is expected to generate them.
191
+ Here, we ask the model to generate a signature for a male human in the 70-90 years age group, in the "lung" tissue, with "EFO_0000768" (Idiopathic pulmonary fibrosis).
192
 
193
  ```json
194
  {
 
207
  }
208
  ```
209
 
210
+ See the corresponding P3GPT output:
211
  ```json
212
  {
213
  "output": {
 
220
  "random_seed": 137
221
  }
222
  ```
223
+ </details>
224
 
225
+ _**Example 2: generate an aging signature**_
226
+ <details>
227
+ <summary style="font-weight:600">Details</summary>
228
+
229
+ Now, let's generate a signature for the whole blood of a healthy male human in the 70-90 years age group.
230
+ Note that we use the ```disease2diff2disease``` instruction but expect to generate the signature of a healthy human, which is why we set ```efo``` to an empty string ```""```.
231
+ Alternatively, we can add one more instruction to this example: ```"instruction": ["disease2diff2disease", "age_group2diff2age_group"]```
232
 
233
  ```json
234
  {
 
253
 
254
  ```
255
 
256
+ P3GPT's output:
257
  ```json
258
  {
259
  "output": {
 
266
  "random_seed": 137
267
  }
268
  ```
269
+ </details>
270
  ---
271
 
272
  ## Multi-Modality
273
+ By default, all tasks with a signature in the input prompt are executed with multimodal features. For each gene in the up- and down-lists, P3GPT pulls the embeddings from the Knowledge Graph and Text neural modality mappers. Then, the embeddings are averaged to obtain one embedding for each modality and each gene list (4 averaged embeddings in total).
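
The averaging step can be sketched in a few lines. This is an illustration of the pooling described above, not the model's implementation: the lookup tables, gene names, and two-dimensional vectors are toy values, and skipping genes that have no embedding is an assumption.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]


def mean_embedding(vectors: Sequence[Vector]) -> Optional[Vector]:
    """Element-wise mean of equally sized vectors; None for an empty input."""
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def pool_signature(
    up: Sequence[str],
    down: Sequence[str],
    modalities: Dict[str, Dict[str, Vector]],
) -> Dict[Tuple[str, str], Optional[Vector]]:
    """Average per-gene embeddings for every (gene list, modality) pair."""
    pooled = {}
    for list_name, genes in (("up", up), ("down", down)):
        for modality, table in modalities.items():
            # Assumption: genes without an embedding are simply skipped.
            found = [table[gene] for gene in genes if gene in table]
            pooled[(list_name, modality)] = mean_embedding(found)
    return pooled


# Toy lookup tables standing in for the Knowledge Graph and Text embeddings.
kg = {"TP53": [1.0, 0.0], "MYC": [3.0, 2.0]}
text = {"TP53": [0.0, 2.0], "MYC": [2.0, 4.0]}
pooled = pool_signature(["TP53", "MYC"], [], {"kg": kg, "text": text})
print(pooled)
```

Two gene lists times two modalities give the four averaged embeddings mentioned above; in this toy run the down-list is empty, so its two entries are `None`.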
274
+
275
+ ## Cite this model
276
+ Please cite the following bioRxiv pre-print if you use P3GPT in your research papers or other published materials:
277
+
278
+ ```
279
+ @article {Galkin2024.07.25.605062,
280
+ author = {Galkin, Fedor and Naumov, Vladimir and Pushkov, Stefan and Sidorenko, Denis and Urban, Anatoly and Zagirova, Diana and Alawi, Khadija M and Aliper, Alex and Gumerov, Ruslan and Kalashnikov, Aleksand and Mukba, Sabina and Pogorelskaya, Aleksandra and Ren, Feng and Shneyderman, Anastasia and Tang, Qiuqiong and Xiao, Deyong and Tyshkovskiy, Alexander and Ying, Kejun and Gladyshev, Vadim N. and Zhavoronkov, Alex},
281
+ title = {Precious3GPT: Multimodal Multi-Species Multi-Omics Multi-Tissue Transformer for Aging Research and Drug Discovery},
282
+ elocation-id = {2024.07.25.605062},
283
+ year = {2024},
284
+ doi = {10.1101/2024.07.25.605062},
285
+ publisher = {Cold Spring Harbor Laboratory},
286
+ abstract = {We present a multimodal multi-species multi-omics multi-tissue transformer for aging research and drug discovery capable of performing multiple tasks such as age prediction across species, target discovery, tissue, sex, and disease sample classification, drug sensitivity prediction, replication of omics response and prediction of biological and phenotypic response to compound treatment. This model combines textual, tabular, and knowledge graph-derived representations of biological experiments to provide insights into molecular-level biological processes. We demonstrate that P3GPT has developed an intuition for the interactions between compounds, pathologies, and gene regulation in the context of multiple species and tissues. In these areas, it outperforms existing LLMs and we highlight its utility in diverse case studies. P3GPT is a general model that may be used as a target identification tool, aging clock, digital laboratory, and scientific assistant. The model is intended as a community resource available open source as well as via a Discord server. Competing Interest Statement: The authors are affiliated with Insilico Medicine, a commercial company developing and using generative artificial intelligence and other next-generation AI technologies and robotics for drug discovery, drug development, and aging research. Utilizing its generative AI platform and a range of deep aging clocks, Insilico Medicine has developed a portfolio of multiple therapeutic programs targeting fibrotic diseases, cancer, immunological diseases, and a range of age-related diseases.},
287
+ URL = {https://www.biorxiv.org/content/early/2024/07/25/2024.07.25.605062},
288
+ eprint = {https://www.biorxiv.org/content/early/2024/07/25/2024.07.25.605062.full.pdf},
289
+ journal = {bioRxiv}
290
+ }
291
+
292
+ ```