
SMILECAS Notations
Descriptions
1. Introduction
SMILES is an acronym for Simplified Molecular Input Line Entry System.
It is a chemical notation system used to represent a molecular structure
by a linear string of symbols. The SMILES notation system was
specifically designed for computer use by chemists. The encoding rules
for SMILES can be learned quickly and easily by anyone with any type of
chemistry background. The history of SMILES notation as a chemical
language and the basic encoding rules for SMILES have been presented by
David Weininger (J. Chem. Inf. Comput. Sci. 28(1): 31-6).
This on-line help outlines the basic rules used to formulate a SMILES
notation for a chemical structure. The encoding rules outlined here
focus directly on the SRC software programs.
Learning to write a
SMILES notation for most chemicals is not difficult. However, writing a
SMILES notation for a complicated ring system can be tricky and
time-consuming. The SMILECAS Database (available on-line with our LogKow
demo and from SRC as an add-on product to our estimation software) is
extremely helpful and time-efficient in obtaining SMILES notations. This
database contains the SMILES notations for 103,000 compounds; all you
need is the CAS (Chemical Abstracts Service) Registry Number.
2. Encoding Rules
A SMILES notation depicts a molecular structure as a two-dimensional
picture as if drawn on a piece of paper. A two-dimensional drawing of a
single chemical structure is possible in many different forms. That is,
a single structure can be depicted correctly by many different drawings.
In a similar manner, a single structure can be depicted correctly by
many different SMILES notations. In fact, any modestly large structure
has literally dozens of SMILES notations that will correctly depict the
structure. Any one of the correct depictions is acceptable for computer
interaction.
SMILES notations are
comprised of atoms (designated by atomic symbols), bonds, parentheses
(used to show branching), and numbers (used to designate ring opening
and closing positions). With the exception of designating ring
positions, numbers are not used in SMILES notation.
2.1. Atoms
Atoms are represented by their atomic symbols. For example:
C is carbon      N is nitrogen      S is sulfur      F  is fluorine
I is iodine      P is phosphorus    O is oxygen      Cl is chlorine
Upper and lower case
letters are important. All aliphatic atoms are entered in upper
case. All aromatic atoms are entered in lower case. The possible
aromatic atoms are carbon, oxygen, sulfur, selenium and nitrogen. Other
potential aromatic atoms are not currently allowed by the SRC programs
because the current estimation methods used in the programs can not
evaluate them.
Atoms with two letter
atomic symbols, such as chlorine or bromine, must have the first letter
entered in upper case. In the case of chlorine or bromine, the second
letter of the atomic symbol can be either upper or lower case. The "r"
in bromine's symbol is usually entered in lower case. It is suggested
that the "l" in chlorine's symbol be entered in upper case ("L") because
it is easy to confuse a lower case "l" with the number one "1".
Therefore, chlorine can be entered as either Cl or CL and bromine can be
entered as either Br or BR.
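The case convention for the two halogens can be sketched in a few lines of Python. This is only an illustration, not the SRC implementation; the function name normalize_halogens is an assumption made for this example.

```python
def normalize_halogens(smiles: str) -> str:
    """Rewrite the upper-case entries CL and BR as Cl and Br."""
    # "L" and "R" are not element symbols, so "CL" and "BR" can only
    # mean chlorine and bromine in this notation.
    return smiles.replace("CL", "Cl").replace("BR", "Br")


print(normalize_halogens("c1ccccc1CL"))
print(normalize_halogens("BRCCBR"))
```

Lower-case aromatic atoms are left untouched because only the exact upper-case pairs are rewritten.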
With very rare
exception (see section 3.3), the hydrogen atom is not included in a
SMILES notation. Hydrogen attachments are determined by the program.
This greatly simplifies a SMILES notation. For example:
Compound     Molecular Formula     SMILES Notation
---------    -----------------     ---------------
Ethylene     CH2=CH2               C=C
Propylene    CH2=CH-CH3            C=CC
2-Butene     CH3-CH=CH-CH3         CC=CC
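How a program can fill in the hydrogens is easiest to see in code. The following Python sketch is not the SRC interpreter: the names VALENCE, BOND_ORDER and hydrogen_counts are invented for this example, the valence table is an assumption (lowest normal valence only), and only upper-case (aliphatic) atoms are handled. It also reads the bond symbols, parentheses and ring numbers described in the sections that follow.

```python
VALENCE = {"C": 4, "N": 3, "O": 2, "S": 2, "P": 3,
           "F": 1, "Cl": 1, "Br": 1, "I": 1}   # assumed normal valences
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def hydrogen_counts(smiles: str) -> list[tuple[str, int]]:
    """Return (symbol, implicit hydrogens) for each atom, in written order."""
    atoms, used = [], []     # element symbols and bond orders already spent
    stack = []               # atoms that opened a branch
    open_rings = {}          # ring digit -> (atom index, bond order)
    prev, order = None, 1    # atom to bond to, and order of the pending bond
    i = 0
    while i < len(smiles):
        ch, two = smiles[i], smiles[i:i + 2]
        sym = two if two in VALENCE else ch if ch in VALENCE else None
        if sym is not None:
            atoms.append(sym)
            used.append(0)
            cur = len(atoms) - 1
            if prev is not None:          # bond to the preceding atom
                used[prev] += order
                used[cur] += order
            prev, order = cur, 1
            i += len(sym)
            continue
        if ch in BOND_ORDER:
            order = BOND_ORDER[ch]
        elif ch == "(":
            stack.append(prev)            # remember the branch point
        elif ch == ")":
            prev = stack.pop()            # return to the branch point
        elif ch.isdigit():
            if ch in open_rings:          # second occurrence closes the ring
                other, ring_order = open_rings.pop(ch)
                bond = max(order, ring_order)
                used[prev] += bond
                used[other] += bond
            else:                         # first occurrence opens the ring
                open_rings[ch] = (prev, order)
            order = 1
        else:
            raise ValueError(f"unsupported character {ch!r}")
        i += 1
    return [(s, VALENCE[s] - u) for s, u in zip(atoms, used)]


def total_hydrogens(smiles: str) -> int:
    return sum(h for _, h in hydrogen_counts(smiles))


for name, smi in [("ethylene", "C=C"), ("propylene", "C=CC"), ("2-butene", "CC=CC")]:
    print(name, smi, total_hydrogens(smi))
```

Each atom simply receives enough hydrogens to fill whatever valence is left after its written bonds are counted.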
2.2. Bonds
The four basic bonds in SMILES notation are single, double, triple, and
aromatic bonds. Single bonds do not need to be shown and are usually
omitted. A single bond can be designated with the hyphen symbol "-". For
example, a correct SMILES notation for propane is C-C-C; however, there
is no advantage to entering the single bond. Therefore, it is not
normally used (the SRC programs automatically remove any hyphens entered
in a SMILES string).
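The hyphen removal can be sketched as follows. This is an illustration only; the function name strip_single_bonds is hypothetical, and keeping a minus sign that sits inside square brackets (a charge, see section 2.9) is an assumption of this sketch rather than documented SRC behavior.

```python
def strip_single_bonds(smiles: str) -> str:
    """Remove "-" single-bond symbols, leaving bracketed charges alone."""
    out = []
    depth = 0  # greater than 0 while inside a [...] atom such as [O-]
    for ch in smiles:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "-" and depth == 0:
            continue  # a hyphen outside brackets is a single bond
        out.append(ch)
    return "".join(out)


print(strip_single_bonds("C-C-C"))
```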
The double bond is designated by the equal symbol "=" and is required
to identify a double bond. The following examples illustrate the double
bond:
Compound     Molecular Formula     SMILES Notation
---------    -----------------     ---------------
Ethylene     CH2=CH2               C=C
Propylene    CH2=CH-CH3            C=CC
2-Butene     CH3-CH=CH-CH3         CC=CC
The triple bond is
designated by the number symbol "#" and is required to identify a triple
bond. The following examples illustrate the triple bond:
Compound         SMILES Notation
-------------    ---------------
Acetylene        C#C
Propyne          C#CC
Butyne           C#CCC
Acetonitrile     CC#N
Acrylonitrile    C=CC#N
The aromatic bond has no designation. It is implied by a "lower case
letter" for carbon, nitrogen, oxygen, selenium and sulfur. For example,
a typical
SMILES notation for benzene is c1ccccc1 and a typical notation for
pyridine is n1ccccc1. The use of the numbers as ring opening and closing
positions is discussed in section 2.4.
2.3. Branches
Branches in molecular structures are designated by enclosures in
parentheses. The examples of SMILES given in the lists above represent
straight, linear compounds. When a structure contains a branch, the
SMILES Notation of the structure requires that the branch be designated
in enclosed parentheses. The figure below illustrates branching.

As previously noted, a
single structure can have more than one valid SMILES notation. As an
example, valid SMILES notations for the isobutyric acid structure (above
figure) include the following:
CC(C)C(=O)O
C(C)(C)C(=O)O
OC(=O)C(C)C
O=C(O)C(C)C
A branch can not begin
a SMILES notation. For example, (C)CCO is an invalid SMILES notation. A
branch must immediately follow the atom to which it is connected. If an
atom has more than one branch, the branches are coded as consecutive
pairs of parentheses. The tert-butanol structure shown above is an
example. The order of the parentheses is not important; for example,
tert-butanol can be either CC(C)(O)C or CC(O)(C)C.
A branch can not
immediately follow a double bond symbol "=" or a triple bond symbol "#";
it must immediately follow the atom. For example: C=(CC)C is invalid; if
the double bond is connected to the carbon inside the parentheses, the
SMILES should be C(=CC)C; if the double bond is connected to the final
carbon, the SMILES should be C(CC)=C.
"Nested branches" or "branches-within-branches" are allowed (and
frequently needed). The following figure illustrates nested branches.

Dozens of different,
valid SMILES notations could be coded for the structure above. The
notation could begin at any carbon in the structure. For example, if the
notation begins at the center-most carbon, the SMILES notation could be:
C(C=C)(CC)(C(C)C)(C(C)(C)C)
The SMILES interpreter
used in the SRC programs does not allow two or more consecutive
left-sided (starting) parentheses such as "((" to be used. An example
would be: CC((CC))CC. The reason is: two left-sided parentheses are
never needed to correctly represent any structure; their use promotes
poorly coded SMILES notations. SMILES notations are usually easiest to
comprehend when they have the fewest possible branches. Unnecessary
branching can complicate a SMILES notation. For example, butane is best
coded as CCCC, although it is valid to code it as C(C(C(C))).
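The branching rules of this section lend themselves to a simple text check. The sketch below is hypothetical (the name branch_errors and the wording of the messages are not part of any SRC program); it tests only the rules stated above and does not verify the chemistry.

```python
def branch_errors(smiles: str) -> list[str]:
    """List violations of the branching rules; an empty list means none found."""
    errors = []
    if smiles.startswith("("):
        errors.append("a branch cannot begin the notation")
    if "((" in smiles:
        errors.append("consecutive opening parentheses are not accepted")
    if "=(" in smiles or "#(" in smiles:
        errors.append("a branch cannot directly follow a bond symbol")
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:   # a ")" with no matching "("
                break
    if depth != 0:
        errors.append("parentheses are not balanced")
    return errors


for smi in ["CC(C)C(=O)O", "(C)CCO", "C=(CC)C", "CC((CC))CC", "C(C(C(C)))"]:
    print(smi, branch_errors(smi))
```

Note that C(C(C(C))) passes: nested branches are fine as long as an atom separates each opening parenthesis from the next.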
2.4. Cyclic Structures
The most difficult aspect of writing SMILES notations is writing a
correct notation for a complicated ring system. Writing SMILES
notations for structures containing only one or two rings is fairly
simple, however. The following encoding rules apply to all cyclic
structures:
(1) Cyclic structures
require numbers to indicate where the ring starts and stops. The numbers
1 through 9 are used to indicate the starting and terminating atoms.
(2) The SAME number is
used to indicate the starting and terminating atom for each ring. The
starting and terminating atom must be connected to each other!
(3) Each number that is used (1, 2, 3, etc.) MUST appear twice and ONLY
twice in the entire SMILES notation. This rule has an exception in the
recent MS-Windows versions of the SRC programs: a SMILES such as
c1ccccc1c1ccccc1 is allowed, and the programs convert it to
c2ccccc2c1ccccc1.
(4) Numbers are entered immediately following the atoms used to
indicate the starting and terminating positions. For example, a number
can not follow a branch as in c1ccccc(Br)1; this notation for
bromobenzene should be written as c1ccccc1(Br) or c1ccccc1Br.
(5) A starting or terminating atom can be associated with two
consecutive numbers. For example, naphthalene can be coded as
c12ccccc1cccc2 (see the example below). The "12" following the first
carbon indicates that the first carbon is connected to both of the
following numbered carbons. Three consecutive numbers are not currently
allowed by the SRC programs.
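Rules (1) through (5) can be checked mechanically on the text of a notation. The Python sketch below is an illustration under stated assumptions: the name ring_number_errors is hypothetical, the strict form of rule (3) is applied (the MS-Windows exception for a reused number is not modelled), and the input is assumed to contain no charges, so every digit is a ring number.

```python
from collections import Counter


def ring_number_errors(smiles: str) -> list[str]:
    """List violations of the ring-number rules; empty means none found."""
    errors = []
    counts = Counter(ch for ch in smiles if ch.isdigit())
    for digit, n in sorted(counts.items()):
        if digit == "0":
            errors.append("ring numbers run from 1 through 9")
        elif n != 2:
            errors.append(f"ring number {digit} appears {n} times, not twice")
    for i, ch in enumerate(smiles):
        # Rule (4): a number must follow an atom, never a closing parenthesis.
        if ch.isdigit() and i > 0 and smiles[i - 1] == ")":
            errors.append(f"ring number {ch} follows a branch")
    run = 0
    for ch in smiles:
        run = run + 1 if ch.isdigit() else 0
        if run == 3:        # rule (5): at most two consecutive numbers
            errors.append("three consecutive ring numbers")
    return errors


for smi in ["c1ccccc1Br", "c12ccccc1cccc2", "c1ccccc(Br)1", "c1ccccc1c1ccccc1"]:
    print(smi, ring_number_errors(smi))
```

The check cannot confirm that the two atoms carrying the same number are really bonded in the intended structure; that part of rule (2) still rests with the person writing the notation.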
Examples are the best
way to understand SMILES notations for cyclic structures. Several
examples are illustrated here. The following concept has been found
useful for writing SMILES notations for ring systems: (a) select one
ring from the entire structure and label the starting and terminating
atoms with the number 1; (b) begin at the starting atom and "snake your
way" (draw a free-hand line) through the cyclic structure so that the
"snake" passes every ring member once and finishes at the terminating
atom. Number each starting and terminating atom of each subsequent ring
as it is passed by the "snake". For complicated structures, it may be
quite a puzzle with many possible solutions. The key is to select an
appropriate ring to start. Once the "snake" has been drawn, simply write
the SMILES notation by starting at the initial atom and then follow the
"snake". The "snake" in the examples below is the curved line that ends
at the arrow head. The "snake" starts at the starting atom and ends at
the terminating atom. Remember that aromatic atoms are entered in lower
case.





The following examples
illustrate ring systems where the rings are not connected to each other
at two or more atoms (not fused):

In certain types of
ring systems, it is impossible to draw the "snake" completely through
all rings. In these situations, it is necessary to use "ring branching".
The examples of benzene and acenaphthene below demonstrate ring
branching; neither of these structures requires it, but it is available.
The strychnine structure example needs it; a SMILES can not otherwise be
written.


2.5. Aromatic Conversions
The SMILES interpreter in the SRC programs will convert certain
aliphatic rings to aromatic rings if aromaticity is detected. For
example, the following conversions are made:
C1=CC=CC=C1  ---->  c1ccccc1  (benzene)
N1C=CC=C1    ---->  n1cccc1   (pyrrole)
O1C=CC=C1    ---->  o1cccc1   (furan)
S1C=CC=C1    ---->  s1cccc1   (thiofuran)
N1=CC=CC=C1  ---->  n1ccccc1  (pyridine)
Other single ring and
fused ring conversions are also made. See section 3 for additional
information pertaining to aromatic conversions.
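One way such a detection could work for the five single rings listed above is a Hückel-style electron count. The Python sketch below is an assumption made for illustration, not the documented SRC algorithm: the name aromatize_single_ring is hypothetical, only an unbranched single ring of C, N, O and S written with ring number 1 is recognized, and anything else is returned unchanged.

```python
import re


def aromatize_single_ring(smiles: str) -> str:
    """Return the lower-case form of a simple 6-electron ring, else the input."""
    m = re.fullmatch(r"([CNOS])1((?:=?[CNOS])+)1", smiles)
    if m is None:
        return smiles
    tokens = re.findall(r"=?[CNOS]", m.group(1) + m.group(2))
    atoms = [t[-1] for t in tokens]
    n = len(atoms)
    # double_after[i] is True when atom i is double-bonded to the next atom;
    # the ring-closing bond (last atom back to first) is always single here.
    double_after = [tokens[(i + 1) % n].startswith("=") for i in range(n)]
    pi = 0
    for i, sym in enumerate(atoms):
        if double_after[i] or double_after[i - 1]:
            pi += 1          # one electron from each double-bonded atom
        elif sym in "NOS":
            pi += 2          # lone pair contributed by the heteroatom
        else:
            return smiles    # a saturated carbon breaks the conjugation
    if pi != 6 or n not in (5, 6):
        return smiles
    return atoms[0].lower() + "1" + "".join(atoms[1:]).lower() + "1"


for smi in ["C1=CC=CC=C1", "N1C=CC=C1", "N1=CC=CC=C1", "C1CCCCC1"]:
    print(smi, "---->", aromatize_single_ring(smi))
```

Fused rings, substituents and selenium are outside the scope of this sketch.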
2.6. Aromatic Limitations
Certain valid
aromatic structures are flagged as "Illegal Structures" by some of the
SRC programs. This is because the estimation techniques used by the
program can not evaluate that type of structure. An example is the
compound azulene, which is composed of a fused 7-member ring and a
5-member ring. It is flagged as illegal because a 7-member aromatic ring
was found. Currently, the SRC programs will not accept an aromatic ring
with 7 or more ring-members. Most estimation methods have not been
extended to include 7-member aromatic rings due to lack of data. Azulene
(and similar structures) can be estimated by entering the "upper case"
(aliphatic) SMILES (e.g. C1=CC=C2C=CC=C2C=C1).
2.7. Notations For Selected Fragments
Most users who are new to writing SMILES notations have trouble coding
certain chemical fragments. The following list should be useful:
Fragment           SMILES        Example
---------------    ----------    -----------
Nitro              N(=O)(=O)     CCN(=O)(=O)    nitroethane
Nitrate            ON(=O)(=O)    CON(=O)(=O)    methyl nitrate
Nitrite            ON(=O)        CON=O          methyl nitrite
Sulfonic acid      S(=O)(=O)O    CS(=O)(=O)O    methyl sulfonic acid
Cyanide/Nitrile    C#N           CC#N           methyl cyanide
Azide              N=N#N         CN=N#N         methyl azide
Azido as N+=N-     N#N
2.8. Metals
Metals are designated by the atomic symbol of the metal enclosed in
square brackets. The current versions of the SRC programs can accept the
following metals:
[Al] Aluminum     [As] Arsenic      [Au] Gold         [Be] Beryllium
[Bi] Bismuth      [Cd] Cadmium      [Ca] Calcium      [Fe] Iron
[Hg] Mercury      [K] Potassium     [Li] Lithium      [Mg] Magnesium
[Na] Sodium       [Ni] Nickel       [Pt] Platinum     [Sb] Antimony
[Sn] Tin          [Zn] Zinc         [Zr] Zirconium
In the SRC programs,
sodium, potassium and lithium can be entered without the square
brackets.
2.9. Charged Species
Examples of charged species are [Na+], [Ca+2] and [O-]. The SRC
programs do not evaluate charged species with the charges; the charges
(including the plus and minus signs and numbers) must be removed. The
current MS-Windows versions of the SRC programs will do this
automatically for all of the metals that can be evaluated. For example,
if sodium acetate is entered as [Na+][O-]C(=O)C, the SRC programs will
convert it to [Na]OC(=O)C.
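A charge-stripping step of this kind can be sketched with a regular expression. The code is illustrative only: the name strip_charges and the ORGANIC set (elements written without brackets once the charge is gone) are assumptions of this example, and only charges written as a sign followed by an optional number are handled.

```python
import re

# Assumed set of elements that are written bare once uncharged.
ORGANIC = {"B", "C", "N", "O", "P", "S", "F", "Cl", "Br", "I"}


def strip_charges(smiles: str) -> str:
    """Remove +/- charges from bracketed atoms such as [Na+] or [O-]."""
    def fix(match):
        symbol = match.group(1)
        # Non-metals lose their brackets; other elements keep them.
        return symbol if symbol in ORGANIC else f"[{symbol}]"

    return re.sub(r"\[([A-Z][a-z]?)[+-]\d*\]", fix, smiles)


print(strip_charges("[Na+][O-]C(=O)C"))
```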
2.10. Disconnected Structures
Disconnected compounds are designated as individual structures or ions
separated by a period ("."). A common example of a simple disconnected
structure is tetramethyl ammonium bromide; the SMILES could be
C[N+](C)(C)C.[Br-]. The SRC programs can not evaluate a disconnected
SMILES string. However, they can evaluate the structure if the
disconnected parts are "connected" by attaching charged atoms.
Tetramethyl ammonium bromide can be evaluated if it is entered as
CN(Br)(C)(C)C. The current MS-Windows versions of the SRC programs will
automatically convert some disconnected SMILES to a "non-disconnected"
SMILES to enable the programs to evaluate the structure. At present, the
automatic conversion is limited to single, charged species such as the
bromide, chloride or iodide ions. More complex disconnected SMILES
require manual "connection".
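The simplest automatic case, a single halide ion paired with a single charged nitrogen, can be sketched as below. This is a hypothetical illustration rather than the SRC conversion routine: the names HALIDES and connect_halide are invented here, and the sketch refuses anything other than exactly one [Br-], [Cl-] or [I-] ion and exactly one [N+] atom.

```python
import re

HALIDES = {"[Br-]": "Br", "[Cl-]": "Cl", "[I-]": "I"}


def connect_halide(smiles: str) -> str:
    """Attach a lone halide ion to the single [N+] of the other part."""
    parts = smiles.split(".")
    if len(parts) != 2:
        raise ValueError("expected exactly two disconnected parts")
    ions = [p for p in parts if p in HALIDES]
    rest = [p for p in parts if p not in HALIDES]
    if len(ions) != 1 or rest[0].count("[N+]") != 1:
        raise ValueError("only one halide ion and one [N+] are handled")
    halogen = HALIDES[ions[0]]
    # Keep any ring numbers directly after the nitrogen, then add the
    # halogen as the first branch.
    return re.sub(r"\[N\+\](\d*)", rf"N\1({halogen})", rest[0])


print(connect_halide("C[N+](C)(C)C.[Br-]"))
```

Placing the halogen after any ring numbers follows rule (4) of section 2.4, which requires numbers to come directly after their atom.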
2.11. Isomeric and Chiral SMILES
Isomeric configuration is specified by the "slash" characters "\" and
"/". These symbols indicate the relative directionality between
connected atoms. Example SMILES for trans- and cis-1,2-dibromoethene
could be Br/C=C/Br and Br/C=C\Br. The current MS-Windows versions of
the SRC programs remove all "slashes" from SMILES notations since they
are not used in any evaluation.
SMILES chirality is specified by the "@" symbol. The current MS-Windows
versions of the SRC programs remove all "@" from SMILES notations since
they are not used in any evaluation.
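Removing these symbols is a plain character filter, as the following Python sketch shows. The function name strip_stereo is an assumption for this example and the code is not taken from the SRC programs.

```python
def strip_stereo(smiles: str) -> str:
    """Drop the "/", "\\" and "@" symbols from a SMILES notation."""
    return smiles.translate(str.maketrans("", "", "/\\@"))


print(strip_stereo("Br/C=C/Br"))
print(strip_stereo("Br/C=C\\Br"))
```

Both isomers reduce to the same notation, which is consistent with the symbols playing no part in the evaluation. Any square brackets that surrounded an "@" are left in place by this sketch.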
3. Supplemental Information
The SMILES interpreter used by the SRC programs was programmed
completely at Syracuse Research Corporation. It is not the same SMILES
interpreter used by the U.S. EPA's PCGEMS programs such as PCFAP and
PCCHEM or by the CLOGP program. Although these SMILES interpreters are
very compatible, there are some differences. These differences primarily
involve the entry and detection of aromaticity. The operation of the SRC
interpreter is discussed below.
3.1. Aromatic Conversion
Aromatic SMILES characters are entered in lower case letters and
aliphatic characters are entered in upper case letters. Both the SRC and
CLOGP interpreters are capable of converting selected aliphatic ring
entries to aromatic rings if aromaticity is detected. For example, if
benzene is entered as C1=CC=CC=C1 it will be converted to the more
common entry of c1ccccc1. For some types of structures, however, the
CLOGP interpreter will convert the entry to an aromatic structure and
the SRC interpreter will not. The most common example of this difference
involves the carbonyl function (C=O). The SRC interpreter will never
convert a C=O entry to an aromatic c=O.
The current versions of the SRC programs will convert some "aromatic"
carbonyl structures to a corresponding "aliphatic" SMILES. For example,
if maleic hydrazide is entered as n1nc(=O)ccc1=O, the SRC programs will
convert it to N1NC(=O)C=CC1=O to allow evaluation.
3.2. Tautomers
"Tautomeric bonds" can not be designated in SMILES notations. It is up
to the user to enter the correct form of the tautomer that needs to be
evaluated. The figure below illustrates the enol form and the keto form
of 2-pyridinol.

Acceptable SMILES
notations are listed for each tautomeric form. The keto form must be
entered with upper case letters. The SRC interpreter will not convert a
keto form tautomer to an aromatic structure; the keto form is evaluated
as an aliphatic by intentional design. The CLOGP interpreter will
convert selected keto form tautomers to aromatics. The 2-pyridinol
example shown in the figure is a tautomer that is comprised of only one
ring. The same rules apply to tautomers with multiple rings. For
example, in a multiple ring keto form tautomer, the C=O must be entered
in upper case letters; in addition, other members of the ring containing
the C=O must be entered in upper case unless they are aromatic members
of other rings. If a tautomer needs to be evaluated as an aromatic
structure, then the enol form of the tautomer should be entered by the
user.
3.3. Entering Hydrogen Directly
The SRC programs allow hydrogen to be entered when it is explicitly
connected to either an aromatic or aliphatic nitrogen atom. However, the
hydrogen is used by the programs only if the valence of the nitrogen
atom is greater than +3. Nitrogens with a valence of +3 ignore the
direct hydrogen entries. For example, a SMILES entry of CCN(H)(H) will
be converted to CCN (ethylamine) because the nitrogen is already
understood to have two hydrogens implicitly connected to it. However,
for various structures, the hydrogens must be entered to specify the
correct structure. For example, the SMILES notation for ethyl ammonium
bromide must include the hydrogens (e.g. CCN(H)(H)(H)Br); if CCNBr is
entered instead, the nitrogen will be evaluated as a +3 valence instead
of a +5 valence. In cases where the nitrogen is greater than a +3
valence, the hydrogens must be explicitly entered in the SMILES.
There are two common
instances where explicit entry of hydrogens is necessary: (1) various
organic hydrochlorides and (2) various zwitterionic compounds. The
evaluation of hydrochlorides and zwitterionic compounds applies
primarily to octanol-water partition coefficients.
3.3.1. Hydrochlorides
SMILES notations for hydrochlorides usually involve a "disconnected"
structure (see section 2.10). For example, the SMILES notation for
benzenepentanamine hydrochloride may be specified as
c1ccccc1CCCCCN.HCL; however, as noted above, the SRC programs can not
evaluate a disconnected SMILES. To specify this compound, the SMILES
should be entered with explicit hydrogens and no "." symbol, as
c1ccccc1CCCCCN(H)(H)(H)CL. The SRC KOWWIN program (octanol-water
partition coefficient) can now
evaluate the hydrochloride as the ionized form of the compound. The
non-ionized form of benzenepentanamine hydrochloride can be evaluated by
removing the hydrochloride altogether (and simply entering
c1ccccc1CCCCCN). For SRC programs, hydrochlorides (and any similar
disconnected structures) must use explicit hydrogens for correct
evaluation.
3.3.2. Zwitterionic Compounds
The User's Guide for the SRC KOWWIN Program contains a more complete
discussion of zwitterionic considerations. A brief discussion is given
here. With the exception of amino acids, zwitterionic forms of compounds
must use SMILES notations containing explicit hydrogens. For example,
consider the non-zwitterionic and zwitterionic forms of
L-leucyl-L-proline as shown here.

Non-zwitterionic form:
SMILES: CC(C)CC(N)C(=O)N1CCCC1C(=O)O
Estimated log P: 0.73

Zwitterionic form:
SMILES: CC(C)CC(N(H)(H)H)C(=O)N1CCCC1C(=O)O
Estimated log P: -1.60
Evaluation is different
for the zwitterionic and non-zwitterionic form (note the estimated log P
values). It is the user's responsibility to explicitly enter a
zwitterionic SMILES into SRC programs (the only exceptions are amino
acids). The SRC SMILECAS database does not use explicit zwitterionic
SMILES.
Explicit zwitterionic entry is another difference between the CLOGP
program and the SRC programs. CLOGP will always consider compounds such
as L-leucyl-L-proline (and drugs such as amoxicillin) to be
zwitterionic; the SRC programs give the user the option.