Input of Structures by means of SMILES strings
SMILES is a simple, concise and rather readable molecular structure
specification format. It is (incompletely) published in
D. Weininger, SMILES, a Chemical Language and Information System.
1. Introduction to Methodology and Encoding Rules, J. Chem. Inf. Comput. Sci.
1988, 28, 31-36.
Daylight has a SMILES WWW tutorial online.
The basic syntax rules are:
- Atoms are specified by their symbol, normally with
an upper-case first letter. Elements not in the organic subset (B, C, N, O, P,
S, F, Cl, Br, I) and their potential attributes
must be enclosed in square brackets. Elements of the organic subset
need to be bracketed only if they have attributes or unusual hydrogen counts.
Example: [Au] is
a gold atom.
- Elements of the organic subset
automatically receive the number of hydrogen atoms necessary to reach
their lowest normal valence consistent with the explicit bonds. Other
elements do not automatically receive hydrogen atoms. Example: C is methane.
- Attached hydrogens and formal charges are always specified
inside brackets. The number of hydrogens is written as H, followed by the count.
Charges are one or more plus or minus symbol(s), optionally followed by a digit.
Example: [NH4+] is the ammonium cation.
- Bonds are represented by the symbols -, =, # and : for
single, double, triple and aromatic bonds. The single bond
symbol is optional. Examples: CC=C is propene, [H][H] is molecular
hydrogen.
- If an atom is sp2 hybridized, this can be
indicated by writing its symbol with lowercase letters. No bond
order specification to neighboring sp2 atoms is necessary,
and the number of automatically added hydrogen atoms is
adjusted accordingly. Example: CC=C and Ccc are alternative representations of
propene.
- Branches are specified by enclosing them in parentheses. They can be
nested. Example: C(C)(C)(C)O is t-butanol.
- Rings are closed by ring link tags, which must follow immediately
after the (possibly bracketed) atom symbol. Multiple ring link tags
can be present at a single atom. They are arbitrary single- or two-digit
numbers. Two-digit numbers must be prefixed with a % sign. These
tags must appear pairwise. A ring link tag can be reused once its closing
occurrence has been encountered when parsing from left to right. Examples:
C1CCC1 is cyclobutane, c1ccccc1c1ccccc1 is biphenyl, and C12C3C4C1C5C4C3C25
is cubane.
- Disconnected structures forming a molecular ensemble are
indicated with a '.' connection. Example: c1cc([O-].[Na+])ccc1 is
(not the most obvious, but a legal representation of) sodium
phenoxide.
- You are free to enter aromatic rings in Kekulé
fashion or with aromatic atoms, i.e. C1=CC=CC=C1 and c1ccccc1 are
identical (benzene). This works even with charged systems.
- Cis/trans stereochemistry at double bonds is specified with
the / and \ characters. Example: Cl/C=C/Cl is trans-dichloroethene.
From the left chlorine atom the bond goes UP to the C=C core and
on the other side UP again to the second chlorine atom. Consequently,
Cl/C=C\Cl or Cl\C=C/Cl is the cis-compound.
- Isotope labelling is expressed by writing the mass number as a prefix
before the atom symbol. Labelled atoms must be enclosed in square brackets.
Example: [13CH4] is 13C-methane and C([2H])([2H])([2H])[2H] is fully
deuterated methane.
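
The atom-level rules above (bracket syntax with isotope prefix, hydrogen
count and charge, and automatic hydrogens for the organic subset) can be
sketched in a few lines of Python. This is only an illustration, not the
code of the conversion routine behind this service: the names
parse_bracket_atom, implicit_hydrogens and NORMAL_VALENCES are hypothetical,
and the valence table is an assumption based on the usual SMILES conventions.

```python
import re

# Illustrative helpers only; they are not part of the decoder described here.
BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"            # optional isotope prefix, e.g. 13
    r"(?P<symbol>[A-Z][a-z]?|[a-z])"  # element symbol, e.g. Au, N, or lowercase n
    r"(?P<h>H(?P<hcount>\d*))?"       # optional attached hydrogens: H, H2, H4
    r"(?P<charge>[+-]\d|\++|-+)?\]"   # optional charge: +, --, +2
)

# Assumed normal valences of the organic subset.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def parse_bracket_atom(token):
    """Split a bracketed atom such as [NH4+] into its attributes."""
    m = BRACKET_ATOM.fullmatch(token)
    if m is None:
        raise ValueError(f"not a bracketed atom this sketch understands: {token!r}")
    # A bare H means one hydrogen; H followed by a digit gives the count.
    hcount = int(m.group("hcount") or 1) if m.group("h") else 0
    charge_text = m.group("charge") or ""
    sign = -1 if charge_text.startswith("-") else 1
    digits = charge_text.lstrip("+-")
    # "+2" carries an explicit digit; "++" is counted by its length.
    charge = sign * (int(digits) if digits else len(charge_text))
    return {
        "isotope": int(m.group("isotope")) if m.group("isotope") else None,
        "symbol": m.group("symbol"),
        "hcount": hcount,
        "charge": charge,
    }


def implicit_hydrogens(symbol, bond_order_sum):
    """Hydrogens added to an unbracketed organic-subset atom."""
    for valence in NORMAL_VALENCES[symbol]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # bond orders already exceed the highest normal valence
```

With these definitions, parse_bracket_atom("[13CH4]") reports isotope 13,
symbol C and four hydrogens, and implicit_hydrogens("C", 0) yields the four
hydrogens of methane.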
The SMILES conversion routine behind the 3D coordinate service
will accept some SMILES strings which are, strictly speaking,
syntactically incorrect but still resolvable (e.g. it
allows ring closure numbers after bond order indicators, and
ignores the case of atoms not resolvable as pi-centers).
The decoder also understands a number of local,
but compatible, SMILES
syntax extensions such as atom lists.
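
As an illustration of the ring link tag rules in the syntax list above
(pairwise tags, % prefix for two-digit tags, reuse after closure), the
following Python sketch pairs up the tags of a SMILES string. It is a
simplified model under stated assumptions, not the decoder of this service:
the function name ring_closures is hypothetical, atoms are numbered in
reading order, and bonds, branches and dots are simply skipped.

```python
def ring_closures(smiles):
    """Return (opening_atom, closing_atom) index pairs, one per ring link tag pair."""
    open_tags = {}  # tag number -> index of the atom that opened it
    pairs = []
    atom = -1       # index of the most recently read atom
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Bracketed atom: skip to the closing bracket (digits inside are not tags).
            i = smiles.index("]", i) + 1
            atom += 1
        elif smiles.startswith(("Cl", "Br"), i):
            # Two-letter symbols of the organic subset.
            i += 2
            atom += 1
        elif ch.isalpha():
            i += 1
            atom += 1
        elif ch == "%" or ch.isdigit():
            if ch == "%":
                tag = int(smiles[i + 1:i + 3])  # two-digit tag after the % sign
                i += 3
            else:
                tag = int(ch)
                i += 1
            if tag in open_tags:
                # Second occurrence closes the ring and frees the tag for reuse.
                pairs.append((open_tags.pop(tag), atom))
            else:
                open_tags[tag] = atom
        else:
            i += 1  # bond symbols, parentheses and dots
    if open_tags:
        raise ValueError(f"unpaired ring link tags: {sorted(open_tags)}")
    return pairs
```

For the biphenyl example, tag 1 is used twice and yields two separate
rings; for the cubane example, the sketch finds five ring closures, which
together with the seven chain bonds give the twelve bonds of the cube.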