# ⬢⬢⬢ schemist



Cleaning, collating, and augmenting chemical datasets.
- [Installation](#installation)
- [Command-line usage](#command-line-usage)
- [Python API](#python-api)
- [Documentation](#documentation)
## Installation
### The easy way
Install the pre-compiled version from PyPI:
```bash
pip install schemist
```
### From source
Clone the repository, `cd` into it, then run:
```bash
pip install -e .
```
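To confirm that either installation route worked, one option is to ask Python's packaging metadata for the installed version. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of **schemist**.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# `installed_version` is an illustrative helper, not a schemist function.
schemist_version = installed_version("schemist")
print(schemist_version or "schemist is not installed in this environment")
```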
## Command-line usage
**schemist** provides command-line utilities. To list the available commands, run:
```bash
$ schemist --help
usage: schemist [-h] [--version] {clean,convert,featurize,collate,dedup,enumerate,react,split} ...
Tools for cleaning, collating, and augmenting chemical datasets.
options:
-h, --help show this help message and exit
--version, -v show program's version number and exit
Sub-commands:
{clean,convert,featurize,collate,dedup,enumerate,react,split}
Use these commands to specify the tool you want to use.
clean Clean and normalize SMILES column of a table.
convert Convert between string representations of chemical structures.
featurize Convert between string representations of chemical structures.
collate Collect disparate tables or SDF files of libraries into a single table.
dedup Deduplicate chemical structures and retain references.
enumerate Enumerate bio-chemical structures within length and sequence constraints.
react React compounds in silico in indicated columns using a named reaction.
split Split table based on chosen algorithm, optionally taking account of chemical structure during splits.
```
Each command is designed to work on large data files in a streaming fashion, so that the entire file is not held in memory at once. One caveat is that the scaffold-based splits are very slow with tables of millions of rows.
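To illustrate what streaming means here, the sketch below reads a TSV table one row at a time, applies a per-row step, and writes each row out immediately, so memory use does not grow with file size. It is not **schemist**'s actual implementation: the function names and the whitespace-trimming step are assumptions chosen only to show the pattern.

```python
import csv
import io
from typing import Iterable, Iterator, TextIO


def stream_rows(handle: TextIO) -> Iterator[dict]:
    """Yield one row at a time from a TSV stream; the table is never loaded whole."""
    yield from csv.DictReader(handle, delimiter="\t")


def strip_column(rows: Iterable[dict], column: str) -> Iterator[dict]:
    """Hypothetical per-row step: trim whitespace from one named column."""
    for row in rows:
        row[column] = row[column].strip()
        yield row


def write_rows(rows: Iterable[dict], out: TextIO) -> int:
    """Write rows as TSV, taking the header from the first row; return the row count."""
    writer = None
    count = 0
    for row in rows:
        if writer is None:
            writer = csv.DictWriter(
                out, fieldnames=list(row), delimiter="\t", lineterminator="\n"
            )
            writer.writeheader()
        writer.writerow(row)
        count += 1
    return count


# Demo on an in-memory table; in a real pipeline these would be sys.stdin and sys.stdout.
source = io.StringIO("id\tsmiles\n1\t CCO \n2\tc1ccccc1\n")
sink = io.StringIO()
n_written = write_rows(strip_column(stream_rows(source), "smiles"), sink)
```

Because every stage is a generator, only one row is in flight at a time, and further stages can be chained in the same way.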
All commands (except `collate`) take from the input table a named column with a SMILES, SELFIES, amino-acid sequence, HELM, or InChI representation of compounds.
Each tool completes one specific task, and the tools compose easily into analysis pipelines: the TSV table output goes to `stdout` by default, so it can be piped from one tool to the next.
To get help for a specific command, run:
```bash
schemist <command> --help
```
For the Python API, [see below](#python-api).
## Python API
**schemist** can be imported into Python to build custom analyses.
```python
>>> import schemist as sch
```
## Documentation
Full API documentation is at [ReadTheDocs](https://schemist.readthedocs.org).