SMILECAS Notations Descriptions

1. Introduction
SMILES is an acronym for Simplified Molecular Input Line Entry System. It is a chemical notation system used to represent a molecular structure by a linear string of symbols. The SMILES notation system was specifically designed for computer use by chemists. The encoding rules for SMILES can be learned quickly and easily by anyone with any type of chemistry background. The history of SMILES notation as a chemical language and the basic encoding rules for SMILES have been presented by David Weininger (J. Chem. Inf. Comput. Sci. 28(1): 31-6).

This on-line help outlines the basic rules used to formulate a SMILES notation for a chemical structure. The encoding rules outlined in this document focus directly on the SRC software programs.

Learning to write a SMILES notation for most chemicals is not difficult. However, writing a SMILES notation for a complicated ring system can be tricky and time-consuming. The SMILECAS Database (available on-line with our LogKow demo and from SRC as an add-on product to our estimation software) is extremely helpful and time-efficient in obtaining SMILES notations. This database contains the SMILES notations for 103,000 compounds; all you need is the CAS (Chemical Abstracts Service) Registry number.

2. Encoding Rules
A SMILES notation depicts a molecular structure as a two-dimensional picture as if drawn on a piece of paper. A two-dimensional drawing of a single chemical structure is possible in many different forms. That is, a single structure can be depicted correctly by many different drawings. In a similar manner, a single structure can be depicted correctly by many different SMILES notations. In fact, any modestly large structure has literally dozens of SMILES notations that will correctly depict the structure. Any one of the correct depictions is acceptable for computer interaction.

SMILES notations are comprised of atoms (designated by atomic symbols), bonds, parentheses (used to show branching), and numbers (used to designate ring opening and closing positions). With the exception of designating ring positions, numbers are not used in SMILES notation.

2.1. Atoms
Atoms are represented by their atomic symbols. For example:

C   is carbon      N   is nitrogen      S   is sulfur      F   is fluorine
I   is iodine      P   is phosphorus    O   is oxygen      Cl  is chlorine

Upper and lower case letters are important. All aliphatic atoms are entered in upper case. All aromatic atoms are entered in lower case. The possible aromatic atoms are carbon, oxygen, sulfur, selenium and nitrogen. Other potential aromatic atoms are not currently allowed by the SRC programs because the current estimation methods used in the programs can not evaluate them.

Atoms with two-letter atomic symbols, such as chlorine or bromine, must have the first letter entered in upper case. In the case of chlorine or bromine, the second letter of the atomic symbol can be either upper or lower case. The "r" in bromine's symbol is usually entered in lower case. It is suggested that the "l" in chlorine's symbol be entered in upper case ("L") because a lower case "l" is easily confused with the number one "1". Therefore, chlorine can be entered as either Cl or CL and bromine can be entered as either Br or BR.
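The case rules above can be illustrated with a short Python sketch. It is not the SRC interpreter; the function name classify_atoms is hypothetical, bracketed atoms are skipped, and aromatic selenium is left out because this document does not show how it is written.

```python
import re

# Two-letter halogens come first so "Cl"/"CL"/"Br"/"BR" are not read as C or B alone.
ATOM_PATTERN = re.compile(r"C[lL]|B[rR]|[CNOSPFI]|[cnos]")

def classify_atoms(smiles):
    """Return (symbol, kind) pairs for the unbracketed atoms of a SMILES string."""
    # Drop bracketed atoms such as [Na]; they are outside the scope of this sketch.
    unbracketed = re.sub(r"\[[^\]]*\]", "", smiles)
    atoms = []
    for match in ATOM_PATTERN.finditer(unbracketed):
        token = match.group()
        if token[0].islower():
            atoms.append((token, "aromatic"))
        else:
            # Normalize CL -> Cl and BR -> Br.
            atoms.append((token[0] + token[1:].lower(), "aliphatic"))
    return atoms

print(classify_atoms("c1ccccc1CL"))
```

In this sketch the chlorine in c1ccccc1CL is reported as an aliphatic Cl, while the six ring carbons are reported as aromatic.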

With very rare exception (see section 3.3), the hydrogen atom is not included in a SMILES notation. Hydrogen attachments are determined by the program. This greatly simplifies a SMILES notation. For example:

Compound    Molecular Formula    SMILES Notation
---------   -----------------    ---------------
Ethylene    CH2=CH2              C=C
Propylene   CH2=CH-CH3           C=CC
2-Butene    CH3-CH=CH-CH3        CC=CC
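To show how hydrogen attachments can be worked out from a notation, the following Python sketch counts the implied hydrogens on each atom. It is a simplified illustration, not the SRC algorithm: it assumes an unbranched, acyclic chain containing only C, N and O with their usual valences (4, 3 and 2), and the function name implied_hydrogens is hypothetical.

```python
VALENCE = {"C": 4, "N": 3, "O": 2}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}

def implied_hydrogens(smiles):
    """Implied hydrogen count per atom for an unbranched, acyclic chain of C, N, O."""
    atoms = []    # atomic symbols in the order written
    bonds = []    # bonds[i] is the order of the bond between atom i and atom i + 1
    pending = 1   # a bond with no symbol is a single bond
    for ch in smiles:
        if ch in BOND_ORDER:
            pending = BOND_ORDER[ch]
        elif ch in VALENCE:
            if atoms:
                bonds.append(pending)
            atoms.append(ch)
            pending = 1
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    result = []
    for i, symbol in enumerate(atoms):
        left = bonds[i - 1] if i > 0 else 0
        right = bonds[i] if i < len(bonds) else 0
        result.append((symbol, VALENCE[symbol] - left - right))
    return result

print(implied_hydrogens("C=CC"))  # propylene, CH2=CH-CH3
```

For propylene (C=CC) the sketch assigns 2, 1 and 3 hydrogens to the three carbons, which matches the formula CH2=CH-CH3 in the table.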

2.2. Bonds
The four basic bonds in SMILES notation are single, double, triple, and aromatic bonds. Single bonds do not need to be shown and are usually omitted. A single bond can be designated with the hyphen symbol "-". For example, a correct SMILES notation for propane is C-C-C; however, there is no advantage to entering the single bond. Therefore, it is not normally used (the SRC programs automatically remove any hyphens entered in a SMILES string).

The double bond is designated by the equal symbol "=" and is required to identify a double bond. The following examples illustrate the double bond:

Compound    Molecular Formula    SMILES Notation
---------   -----------------    ---------------
Ethylene    CH2=CH2              C=C
Propylene   CH2=CH-CH3           C=CC
2-Butene    CH3-CH=CH-CH3        CC=CC

The triple bond is designated by the number symbol "#" and is required to identify a triple bond. The following examples illustrate the triple bond:

Compound         SMILES Notation
-------------    ---------------
Acetylene        C#C
Propyne          C#CC
Butyne           C#CCC
Acetonitrile     CC#N
Acrylonitrile    C=CC#N

The aromatic bond has no symbol of its own. It is implied by a lower case letter for carbon, nitrogen, oxygen, selenium and sulfur. For example, a typical SMILES notation for benzene is c1ccccc1 and a typical notation for pyridine is n1ccccc1. The use of the numbers as ring opening and closing positions is discussed in section 2.4.


2.3. Branches
Branches in molecular structures are designated by enclosures in parentheses. The examples of SMILES given in the lists above represent straight, linear compounds. When a structure contains a branch, the SMILES Notation of the structure requires that the branch be designated in enclosed parentheses. The figure below illustrates branching.

As previously noted, a single structure can have more than one valid SMILES notation. As an example, valid SMILES notations for the isobutyric acid structure (above figure) include the following:

CC(C)C(=O)O
C(C)(C)C(=O)O
OC(=O)C(C)C
O=C(O)C(C)C

A branch can not begin a SMILES notation. For example, (C)CCO is an invalid SMILES notation. A branch must immediately follow the atom to which it is connected. If an atom has more than one branch, the branches are coded as consecutive pairs of parentheses. The tert-butanol structure shown above is an example. The order of the parentheses is not important; for example, tert-butanol can be either CC(C)(O)C or CC(O)(C)C.

A branch can not immediately follow a double bond symbol "=" or a triple bond symbol "#"; it must immediately follow the atom. For example: C=(CC)C is invalid; if the double bond is connected to the carbon inside the parentheses, the SMILES should be C(=CC)C; if the double bond is connected to the final carbon, the SMILES should be C(CC)=C.

"Nested branches" or "branches-within-branches" are allowed (and frequently needed). The following figure illustrates nested branches.

Dozens of different, valid SMILES notations could be coded for the structure above. The notation could begin at any carbon in the structure. For example, if the notation begins at the center-most carbon, the SMILES notation could be: C(C=C)(CC)(C(C)C)(C(C)(C)C)

The SMILES interpreter used in the SRC programs does not allow two or more consecutive left-sided (starting) parentheses such as "((" to be used. An example would be: CC((CC))CC. The reason is that two left-sided parentheses are never needed to correctly represent any structure; their use promotes poorly coded SMILES notations. SMILES notations are usually easiest to comprehend when they have the fewest possible branches. Unnecessary branching can complicate a SMILES notation. For example, butane is best coded as CCCC, although it is also valid to code it as C(C(C(C))).
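The branch rules in this section can be collected into a small checker. The Python sketch below is only an illustration of those rules as stated here (no leading branch, no branch directly after "=" or "#", no "((", balanced parentheses); the function name check_branches and the message wording are hypothetical and are not taken from the SRC programs.

```python
def check_branches(smiles):
    """Return a list of branch-related problems found in a SMILES string."""
    problems = []
    if smiles.startswith("("):
        problems.append("a branch cannot begin a SMILES notation")
    if "((" in smiles:
        problems.append("consecutive starting parentheses '((' are not accepted")
    if "=(" in smiles or "#(" in smiles:
        problems.append("a branch cannot immediately follow '=' or '#'")
    # Every ")" must close an earlier "(", and every "(" must be closed.
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                break
    if depth != 0:
        problems.append("parentheses are not balanced")
    return problems

for example in ["CC(C)C(=O)O", "(C)CCO", "C=(CC)C", "CC((CC))CC"]:
    print(example, check_branches(example))
```

Only the first of the four examples, isobutyric acid, passes without a reported problem.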

2.4. Cyclic Structures
The most difficult aspect of writing SMILES notations is writing a correct SMILES notation for a complicated ring system. Writing SMILES notations for structures containing only one or two rings is fairly simple, however. The following encoding rules apply to all cyclic structures:

(1) Cyclic structures require numbers to indicate where the ring starts and stops. The numbers 1 through 9 are used to indicate the starting and terminating atoms.

(2) The SAME number is used to indicate the starting and terminating atom for each ring. The starting and terminating atom must be connected to each other!

(3) Each number that is used (1, 2, 3, etc.) MUST appear twice and ONLY twice in the entire SMILES notation. This rule has an exception in the recent MS-Windows versions of the SRC programs. A SMILES such as c1ccccc1c1ccccc1 is allowed; the programs convert this to c2ccccc2c1ccccc1.

(4) Numbers are entered immediately following the atoms used to indicate the starting and terminating positions. For example, a number can not follow a branch as in: c1ccccc(Br)1; this notation for bromobenzene should be written as c1ccccc1(Br) or c1ccccc1Br.

(5) A starting or terminating atom can be associated with two consecutive numbers. For example, naphthalene can be coded as: c12ccccc1cccc2 (see the example below). The "12" following the first carbon indicates that the first carbon is connected to both of the following numbered carbons. Three consecutive numbers are not currently allowed by the SRC programs.
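Rules (3), (4) and (5) lend themselves to a mechanical check. The Python sketch below applies them in their strict form; it does not model the MS-Windows exception to rule (3), it ignores anything inside square brackets, and the function name check_ring_numbers is hypothetical.

```python
import re
from collections import Counter

def check_ring_numbers(smiles):
    """Return a list of ring-number problems found in a SMILES string."""
    # Numbers inside square brackets (for example charges) are not ring numbers.
    unbracketed = re.sub(r"\[[^\]]*\]", "", smiles)
    problems = []
    counts = Counter(ch for ch in unbracketed if ch.isdigit())
    for digit, n in sorted(counts.items()):
        if n != 2:
            problems.append(f"ring number {digit} appears {n} time(s), expected 2")
    if re.search(r"\)\d", unbracketed):
        problems.append("a ring number cannot follow a branch")
    if re.search(r"\d{3}", unbracketed):
        problems.append("three consecutive ring numbers are not accepted")
    return problems

for example in ["c12ccccc1cccc2", "c1ccccc1Br", "c1ccccc(Br)1"]:
    print(example, check_ring_numbers(example))
```

The naphthalene and bromobenzene notations pass, while c1ccccc(Br)1 is reported because the closing number follows a branch.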

Examples are the best way to understand SMILES notations for cyclic structures. Several examples are illustrated here. The following concept has been found useful for writing SMILES notations for ring systems: (a) select one ring from the entire structure and label the starting and terminating atoms with the number 1; (b) begin at the starting atom and "snake your way" (draw a free-hand line) through the cyclic structure so that the "snake" passes every ring member once and finishes at the terminating atom. Number each starting and terminating atom of each subsequent ring as it is passed by the "snake". For complicated structures, it may be quite a puzzle with many possible solutions. The key is to select an appropriate ring to start. Once the "snake" has been drawn, simply write the SMILES notation by starting at the initial atom and then follow the "snake". The "snake" in the examples below is the curved line that ends at the arrow head. The "snake" starts at the starting atom and ends at the terminating atom. Remember that aromatic atoms are entered in lower case.

The following examples illustrate ring systems where the rings are not connected to each other at two or more atoms (not fused):

In certain types of ring systems, it is impossible to draw the "snake" completely through all rings. In these situations, it is necessary to use "ring branching". The examples of benzene and acenaphthene below demonstrate ring branching; neither of these structures requires it, but it is available. The strychnine structure example needs it; a SMILES can not otherwise be written.

2.5. Aromatic Conversions
The SMILES interpreter in the SRC programs will convert certain aliphatic rings to aromatic rings if aromaticity is detected. For example, the following conversions are made:

C1=CC=CC=C1 ----> c1ccccc1 (benzene)

N1C=CC=C1 ----> n1cccc1 (pyrrole)

O1C=CC=C1 ----> o1cccc1 (furan)

S1C=CC=C1 ----> s1cccc1 (thiofuran)

N1=CC=CC=C1 ----> n1ccccc1 (pyridine)

Other single ring and fused ring conversions are also made. See section 3 for additional information pertaining to aromatic conversions.
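As a minimal illustration, the five conversions listed above can be written as a lookup table in Python. This sketch only matches those exact strings; it does not detect aromaticity the way the SRC interpreter does, and the names KNOWN_CONVERSIONS and to_aromatic are hypothetical.

```python
# The five aliphatic-to-aromatic conversions listed in this section.
KNOWN_CONVERSIONS = {
    "C1=CC=CC=C1": "c1ccccc1",   # benzene
    "N1C=CC=C1": "n1cccc1",      # pyrrole
    "O1C=CC=C1": "o1cccc1",      # furan
    "S1C=CC=C1": "s1cccc1",      # thiofuran
    "N1=CC=CC=C1": "n1ccccc1",   # pyridine
}

def to_aromatic(smiles):
    """Return the aromatic form if the entry is one of the listed rings, else the entry unchanged."""
    return KNOWN_CONVERSIONS.get(smiles, smiles)

print(to_aromatic("C1=CC=CC=C1"))
```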

2.6. Aromatic Limitations
Certain valid aromatic structures are flagged as "Illegal Structures" by some of the SRC programs. This is because the estimation techniques used by the program can not evaluate that type of structure. An example is the compound azulene, which is comprised of a 7-member ring fused to a 5-member ring. It is flagged as illegal because a 7-member aromatic ring was found. Currently, the SRC programs will not accept an aromatic ring with 7 or more ring-members. Most estimation methods have not been extended to include 7-member aromatic rings due to lack of data. Azulene (and similar structures) can be estimated by entering the "upper case" (aliphatic) SMILES (e.g. C1=CC=C2C=CC=C2C=C1).

2.7. Notations For Selected Fragments
Most users who are new to writing SMILES notations have trouble coding certain chemical fragments. The following list should be useful:

Fragment           SMILES          Example
---------------    ------------    --------------------------------
Nitro              N(=O)(=O)       CCN(=O)(=O) nitroethane
Nitrate            ON(=O)(=O)      CON(=O)(=O) methyl nitrate
Nitrite            ON(=O)          CON=O methyl nitrite
Sulfonic acid      S(=O)(=O)O      CS(=O)(=O)O methyl sulfonic acid
Cyanide/Nitrile    C#N             CC#N methyl cyanide
Azide              N=N#N           CN=N#N methyl azide
Azido as           N+=N-           N#N

2.8. Metals
Metals are designated by the atomic symbol of the metal enclosed in square brackets. The current versions of the SRC programs can accept the following metals:

[Al]  Aluminum     [As]  Arsenic      [Au]  Gold         [Be]  Beryllium
[Bi]  Bismuth      [Cd]  Cadmium      [Ca]  Calcium      [Fe]  Iron
[Hg]  Mercury      [K]   Potassium    [Li]  Lithium      [Mg]  Magnesium
[Na]  Sodium       [Ni]  Nickel       [Pt]  Platinum     [Sb]  Antimony
[Sn]  Tin          [Zn]  Zinc         [Zr]  Zirconium

In the SRC programs, sodium, potassium and lithium can be entered without the square brackets.

2.9. Charged Species
Examples of charged species are: [Na+] and [Ca+2] and [O-]. The SRC programs do not evaluate charged species with the charges; the charges (including the plus and minus signs and numbers) must be removed. The current MS-Windows versions of the SRC programs will do this automatically for all of the metals that can be evaluated. For example, if sodium acetate is entered as: [Na+][O-]C(=O)C the SRC programs will convert it to: [Na]OC(=O)C
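The charge removal described above can be sketched in Python with a regular expression. This is an illustration that reproduces the sodium acetate example, not the SRC code. It assumes that metals from the list in section 2.8 keep their square brackets and that other charged atoms lose both charge and brackets, as the [O-] in the example does; the names METALS and strip_charges are hypothetical.

```python
import re

# Metals accepted by the SRC programs (section 2.8).
METALS = {
    "Al", "As", "Au", "Be", "Bi", "Cd", "Ca", "Fe", "Hg", "K",
    "Li", "Mg", "Na", "Ni", "Pt", "Sb", "Sn", "Zn", "Zr",
}

# A bracketed symbol followed by a sign and an optional number, e.g. [Na+], [Ca+2], [O-].
CHARGED_ATOM = re.compile(r"\[([A-Z][a-z]?)[+-]\d*\]")

def strip_charges(smiles):
    """Remove charges from bracketed atoms; keep the brackets only for listed metals."""
    def replace(match):
        symbol = match.group(1)
        return f"[{symbol}]" if symbol in METALS else symbol
    return CHARGED_ATOM.sub(replace, smiles)

print(strip_charges("[Na+][O-]C(=O)C"))
```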

2.10. Disconnected Structures
Disconnected compounds are designated as individual structures or ions separated by a period ("."). A common example of a simple disconnected structure is tetramethyl ammonium bromide; the SMILES could be: C[N+](C)(C)C.[Br-]. The SRC programs can not evaluate a disconnected SMILES string. However, they can evaluate the structure if the disconnected parts are "connected" by attaching the charged atoms to one another. Tetramethyl ammonium bromide can be evaluated if it is entered as: CN(Br)(C)(C)C. The current MS-Windows versions of the SRC programs will automatically convert some disconnected SMILES to a "non-disconnected" SMILES to enable the programs to evaluate the structure. At present, the automatic conversion is limited to single, charged species such as the bromide, chloride or iodide ions. More complex disconnected SMILES require manual "connection".
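The manual "connection" of a simple halide salt can be sketched in Python as follows. The sketch handles only the narrow case of one halide ion plus a second part containing exactly one [N+] atom, which is enough to reproduce the tetramethyl ammonium bromide example; how the SRC programs perform the conversion internally is not documented here, and the function name connect_halide_salt is hypothetical.

```python
HALIDE_IONS = {"[Br-]": "Br", "[Cl-]": "Cl", "[I-]": "I"}

def connect_halide_salt(smiles):
    """Attach a lone halide ion to the single [N+] of the other part, or return None."""
    parts = smiles.split(".")
    if len(parts) != 2:
        return None
    # Accept the halide written either after or before the cation.
    for cation, anion in (parts, parts[::-1]):
        if anion in HALIDE_IONS and cation.count("[N+]") == 1:
            # The halide becomes the first branch on the now uncharged nitrogen.
            return cation.replace("[N+]", f"N({HALIDE_IONS[anion]})")
    return None

print(connect_halide_salt("C[N+](C)(C)C.[Br-]"))
```

Entries that do not fit this pattern return None and would still need to be connected by hand.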

2.11. Isomeric and Chiral SMILES
Isomeric configuration is specified by the "slash" characters "\" and "/". These symbols indicate the relative directionality between connected atoms. Example SMILES for trans- and cis-1,2-dibromoethene could be: Br/C=C/Br and Br/C=C\Br. The current MS-Windows versions of the SRC programs remove all "slashes" from SMILES notations since they are not used in any evaluation.

SMILES chirality is specified by the "@" symbol. The current MS-Windows versions of the SRC programs remove all "@" from SMILES notations since they are not used in any evaluation.
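Removing the isomeric and chiral symbols amounts to deleting three characters. A minimal Python sketch (the function name strip_stereo is hypothetical) shows that the trans and cis notations then reduce to the same string:

```python
# Translation table that deletes "/", "\" and "@".
STEREO_SYMBOLS = str.maketrans("", "", "/\\@")

def strip_stereo(smiles):
    """Remove the slash and '@' symbols from a SMILES string."""
    return smiles.translate(STEREO_SYMBOLS)

print(strip_stereo("Br/C=C/Br"))    # trans-1,2-dibromoethene
print(strip_stereo("Br/C=C\\Br"))   # cis-1,2-dibromoethene
```

Both calls print BrC=CBr, so the two isomers are evaluated identically once the symbols are removed.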

3. Supplemental Information
The SMILES interpreter used by the SRC programs was programmed completely at Syracuse Research Corporation. It is not the same SMILES interpreter used by the U.S. EPA's PCGEMS programs such as PCFAP and PCCHEM or by the CLOGP program. Although these SMILES interpreters are very compatible, there are some differences. These differences primarily involve the entry and detection of aromaticity. The operation of the SRC interpreter is discussed below.

3.1. Aromatic Conversion
Aromatic SMILES characters are entered in lower case letters and aliphatic characters are entered in upper case letters. Both the SRC and CLOGP interpreters are capable of converting selected aliphatic ring entries to aromatic rings if aromaticity is detected. For example, if benzene is entered as C1=CC=CC=C1 it will be converted to the more common entry of c1ccccc1. For some types of structures, however, the CLOGP interpreter will convert the entry to an aromatic structure and the SRC interpreter will not. The most common example of this difference involves the carbonyl function (C=O). The SRC interpreter will never convert a C=O entry to an aromatic c=O.

The current versions of the SRC programs will convert some "aromatic" carbonyl structures to a corresponding "aliphatic" SMILES. For example, if maleic hydrazide is entered as: n1nc(=O)ccc1=O the SRC programs will convert it to: N1NC(=O)C=CC1=O to allow evaluation.

3.2. Tautomers
"Tautomeric bonds" can not be designated in SMILES notations. It is up to the user to enter the correct form of the tautomer that needs to be evaluated. The figure below illustrates the enol form and the keto form of 2-pyridinol.

Acceptable SMILES notations are listed for each tautomeric form. The keto form must be entered with upper case letters. The SRC interpreter will not convert a keto form tautomer to an aromatic structure; the keto form is evaluated as an aliphatic by intentional design. The CLOGP interpreter will convert selected keto form tautomers to aromatics. The 2-pyridinol example shown in the figure is a tautomer that is comprised of only one ring. The same rules apply to tautomers with multiple rings. For example, in a multiple ring keto form tautomer, the C=O must be entered in upper case letters; in addition, other members of the ring containing the C=O must be entered in upper case unless they are aromatic members of other rings. If a tautomer needs to be evaluated as an aromatic structure, then the enol form of the tautomer should be entered by the user.

3.3. Entering Hydrogen Directly
The SRC programs allow hydrogen to be entered when it is explicitly connected to either an aromatic or aliphatic nitrogen atom. However, the hydrogen is used by the programs only if the valence of the nitrogen atom is greater than +3. Nitrogens with a valence of +3 ignore the direct hydrogen entries. For example, a SMILES entry of CCN(H)(H) will be converted to CCN (ethylamine) because the nitrogen is already understood to have two hydrogens implicitly connected to it. However, for various structures, the hydrogens must be entered to specify the correct structure. For example, the SMILES notation for ethyl ammonium bromide must include the hydrogens (e.g. CCN(H)(H)(H)Br); if CCNBr is entered instead, the nitrogen will be evaluated as a +3 valence instead of a +5 valence. In cases where the nitrogen is greater than a +3 valence, the hydrogens must be explicitly entered in the SMILES.

There are two common instances where explicit entry of hydrogens is necessary: (1) various organic hydrochlorides and (2) various zwitterionic compounds. The evaluation of hydrochlorides and zwitterionic compounds applies primarily to octanol-water partition coefficients.

3.3.1. Hydrochlorides
SMILES notations for hydrochlorides usually involve a "disconnected" structure (see section 2.10). For example, the SMILES notation for benzenepentanamine hydrochloride may be specified as: c1ccccc1CCCCCN.HCL; however, as noted above, the SRC programs can not evaluate a disconnected SMILES. To specify this compound, the SMILES should be entered with explicit hydrogens and no "." symbol as: c1ccccc1CCCCCN(H)(H)(H)CL. The SRC KOWWIN program (octanol-water partition coefficient) can now evaluate the hydrochloride as the ionized form of the compound. The non-ionized form of benzenepentanamine hydrochloride can be evaluated by removing the hydrochloride altogether (and simply entering c1ccccc1CCCCCN). For SRC programs, hydrochlorides (and any similar disconnected structures) must use explicit hydrogens for correct evaluation.
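The rewrite of the benzenepentanamine hydrochloride entry can be sketched in Python. The sketch is deliberately narrow: it assumes the entry ends in ".HCL" (or ".HCl") and that the base ends in a singly bonded terminal nitrogen, that is, a primary amine written last. Anything else returns None and needs manual entry. The function name connect_primary_amine_hydrochloride is hypothetical and is not part of the SRC programs.

```python
def connect_primary_amine_hydrochloride(smiles):
    """Rewrite 'base.HCL' as a connected SMILES with explicit hydrogens, or return None."""
    base, separator, acid = smiles.rpartition(".")
    if not separator or acid.upper() != "HCL":
        return None
    # The base must end in N, and that N must not be part of C=N or C#N.
    if len(base) < 2 or not base.endswith("N") or base[-2] in "=#":
        return None
    # Two hydrogens already on the amine plus the one from HCl, then the chlorine.
    return base + "(H)(H)(H)CL"

print(connect_primary_amine_hydrochloride("c1ccccc1CCCCCN.HCL"))
```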

3.3.2. Zwitterionic Compounds
The User's Guide for the SRC KOWWIN Program contains a more complete discussion of zwitterionic considerations. A brief discussion is given here. With the exception of amino acids, zwitterionic forms of compounds must use SMILES notations containing explicit hydrogens. For example, consider the non-zwitterionic and zwitterionic forms of 1-leucyl-L-proline as shown here.


Non-zwitterionic form:
SMILES: CC(C)CC(N)C(=O)N1CCCC1C(=O)O
Estimated log P: 0.73

Zwitterionic form:
SMILES: CC(C)CC(N(H)(H)H)C(=O)N1CCCC1C(=O)O
Estimated log P: -1.60

Evaluation is different for the zwitterionic and non-zwitterionic form (note the estimated log P values). It is the user's responsibility to explicitly enter a zwitterionic SMILES into SRC programs (the only exceptions are amino acids). The SRC SMILECAS database does not use explicit zwitterionic SMILES.

Explicit zwitterionic entry is another difference between the CLOGP program and the SRC programs. CLOGP will always consider compounds such as 1-leucyl-L-proline (and drugs such as amoxicillin) to be zwitterionic; the SRC programs give the user the option.