# ⬢⬢⬢ schemist

![GitHub Workflow Status (with branch)](https://img.shields.io/github/actions/workflow/status/scbirlab/schemist/python-publish.yml)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/schemist)
![PyPI](https://img.shields.io/pypi/v/schemist)

Cleaning, collating, and augmenting chemical datasets.

- [Installation](#installation)
- [Command-line usage](#command-line-usage)
- [Python API](#python-api)
- [Documentation](#documentation)

## Installation

### The easy way

Install the pre-compiled version from PyPI:

```bash
pip install schemist
```

### From source

Clone the repository and `cd` into it, then run:

```bash
pip install -e .
```

## Command-line usage

**schemist** provides command-line utilities. To list the available commands, run:

```bash
$ schemist --help
usage: schemist [-h] [--version] {clean,convert,featurize,collate,dedup,enumerate,react,split} ...

Tools for cleaning, collating, and augmenting chemical datasets.

options:
  -h, --help            show this help message and exit
  --version, -v         show program's version number and exit

Sub-commands:
  {clean,convert,featurize,collate,dedup,enumerate,react,split}
                        Use these commands to specify the tool you want to use.
    clean               Clean and normalize SMILES column of a table.
    convert             Convert between string representations of chemical structures.
    featurize           Convert between string representations of chemical structures.
    collate             Collect disparate tables or SDF files of libraries into a single table.
    dedup               Deduplicate chemical structures and retain references.
    enumerate           Enumerate bio-chemical structures within length and sequence constraints.
    react               React compounds in silico in indicated columns using a named reaction.
    split               Split table based on chosen algorithm, optionally taking account of chemical structure during splits.
```

Each command is designed to work on large data files in a streaming fashion, so that the entire file is not held in memory at once. One caveat is that the scaffold-based splits are very slow with tables of millions of rows.
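To illustrate what row-by-row streaming means, here is a minimal sketch using only the Python standard library. It is not how **schemist** is implemented: the function names (`stream_rows`, `strip_column`, `write_rows`) and the whitespace-stripping transform are hypothetical and chosen only for illustration.

```python
import csv
import io


def stream_rows(handle):
    """Yield one row at a time as a dict, so the whole table is never in memory."""
    yield from csv.DictReader(handle, delimiter="\t")


def strip_column(rows, column):
    """Hypothetical per-row transform: trim whitespace from one named column."""
    for row in rows:
        row[column] = row[column].strip()
        yield row


def write_rows(rows, handle):
    """Write rows as TSV, taking the header from the first row; return the row count."""
    writer = None
    count = 0
    for row in rows:
        if writer is None:
            writer = csv.DictWriter(
                handle, fieldnames=list(row), delimiter="\t", lineterminator="\n"
            )
            writer.writeheader()
        writer.writerow(row)
        count += 1
    return count


# In-memory stand-ins for an input file and stdout.
demo_input = io.StringIO("id\tsmiles\na\t CCO \nb\tCCN\n")
demo_output = io.StringIO()

# Generators are chained lazily: each row is read, transformed, and written
# before the next row is read.
n_written = write_rows(strip_column(stream_rows(demo_input), "smiles"), demo_output)
```

In a real script, `sys.stdin` and `sys.stdout` would take the place of the two `io.StringIO` objects. An operation that needs to see the whole table before it can produce output cannot be streamed in this way.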

All commands (except `collate`) take from the input table a named column with a SMILES, SELFIES, amino-acid sequence, HELM, or InChI representation of compounds.

Each tool completes one specific task. Because the TSV table output goes to `stdout` by default, the tools can be piped from one to another and easily composed into analysis pipelines.

To get help for a specific command, run:

```bash
schemist <command> --help
```

For the Python API, [see below](#python-api).


## Python API

**schemist** can be imported into Python to build custom analyses.

```python
>>> import schemist as sch
```

## Documentation

Full API documentation is at [ReadTheDocs](https://schemist.readthedocs.org).