Converting to CML [was Re: [cml/ccml-discuss] RE: [Zinc-fans] database
formats? (fwd from jji [] cgl.ucsf.edu)] (fwd from pm286 [] cam.ac.uk)
- From: Eugen Leitl <eugen [] leitl.org>
- Subject: Converting to CML [was Re: [cml/ccml-discuss] RE:
[Zinc-fans] database formats? (fwd from jji [] cgl.ucsf.edu)] (fwd from pm286 []
cam.ac.uk)
- Date: Thu, 7 Apr 2005 10:38:58 +0200
----- Forwarded message from Peter Murray-Rust <pm286 [] cam.ac.uk>
-----
From: Peter Murray-Rust <pm286 [] cam.ac.uk>
Date: Thu, 07 Apr 2005 08:54:05 +0100
To: Geoff Hutchison <grh25 [] cornell.edu>,
"John Irwin" <jji [] cgl.ucsf.edu>
Cc: cml-discuss [] lists.sourceforge.net,
openbabel-discuss [] lists.sourceforge.net,
simon Tyrrell <simon.tyrrell [] virgin.net>,
cdk-devel [] lists.sourceforge.net, Tom Oinn <tmo [] ebi.ac.uk>
Subject: Converting to CML [was Re: [cml/ccml-discuss] RE: [Zinc-fans]
database formats? (fwd from jji [] cgl.ucsf.edu)]
X-Mailer: QUALCOMM Windows Eudora Version 6.0.1.1
[Crossposted to 3 lists, please be considerate]
[John Irwin]
>>... Can you recommend software for
>>preparing and manipulating CML files? If OE offered CML, we could and might
>>offer CML tomorrow.
There are many good tools for converting files to CML. First, some words
about strategy.
CML is powerful enough to hold compound documents such as compound data
cards, computational chemistry output and (when combined with XHTML)
complete scientific documents. So "converting to CML" can involve
components such as molecules, reactions, their properties, spectra,
eigenproperties, etc. In general CML can hold any information composed of
simple datatypes (numbers, strings, arrays, matrices, etc.) and predefined
schema elements (reactions, spectra...). We are devising a mechanism for
building complex datatypes (e.g. critical point, phase diagrams).
Most people currently want to manage molecular data and I'll stick with
that. JohnI and I have already corresponded usefully, so I believe that a
Zinc entry consists of at least:
* a molecule
* its provenance
* published names
* published properties
* calculated properties
* intellectual property rights
CML can manage all of this except the IPR. To summarise John's mail, Zinc
consists of molecular information supplied by compound suppliers under
contract, for which properties are calculated using software made available
under contract and then collected in a database which itself has
restrictions on use (e.g. only limited subsets can be distributed, and for
restricted use). CML is not capable of managing the complexity of this IPR
so the converter would have to add this, preferably in RDF. [Note that this
problem does not occur for Open data since we can simply add a BOAI or
Creative Commons license.]
The provenance (without rights) is managed by the DublinCore dc:creator and
dc:publisher in CML:
<metadataList>
  <metadata name="dc:creator" content="Foobarchem"/>
  <metadata name="dc:publisher" content="ji [] cgl.ucsf.edu"/>
</metadataList>
CML can, in principle, hold everything else without loss. Since I don't
know the range of properties I don't know which are complex, but assuming
that most are scalar, then the simple approach is to render them as:
<property dictRef="zinc:mpt">
  <scalar units="units:celsius" min="121"
          max="123.5" errorBasis="range"/>
</property>
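For readers who would rather generate such a property than type it, here is a minimal sketch using only the Python standard library. The function name `range_property` is invented for illustration; the element and attribute names are taken from the fragment above, and nothing here checks them against the CML schema.

```python
import xml.etree.ElementTree as ET

def range_property(dict_ref, units, low, high):
    """Build a <property> whose <scalar> is reported as a min/max range.

    Hypothetical helper: it mirrors the hand-written fragment above and
    does no schema validation.
    """
    prop = ET.Element("property", {"dictRef": dict_ref})
    ET.SubElement(prop, "scalar", {
        "units": units,
        "min": str(low),
        "max": str(high),
        "errorBasis": "range",
    })
    return prop

# Melting point range from the example in the text.
mpt = range_property("zinc:mpt", "units:celsius", 121, 123.5)
xml_text = ET.tostring(mpt, encoding="unicode")
print(xml_text)
```

Other scalar properties would follow the same pattern with a different dictRef and units; anything more complex than a scalar needs its own treatment.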
=== OK, most people weren't expecting that! BUT provenance and
redistribution are increasingly important. That is why the default action of
OpenBabel when outputting CML is to add metadata. We would hope that if
users add metadata to the input (only possible in CML) it would be
transported through ===
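What "transported through" could mean in practice is sketched below, under the assumption that both the input and the converted output are CML documents without namespaces, as in the fragments above. The function `carry_metadata` is hypothetical and is not how OpenBabel or any other converter actually does it.

```python
import copy
import xml.etree.ElementTree as ET

def carry_metadata(source_cml, converted_cml):
    """Copy each top-level <metadataList> of the source into the converted document.

    Hypothetical helper: a real converter would do this internally.
    """
    src = ET.fromstring(source_cml)
    out = ET.fromstring(converted_cml)
    for position, mlist in enumerate(src.findall("metadataList")):
        # Insert at the front so provenance precedes the chemistry.
        out.insert(position, copy.deepcopy(mlist))
    return ET.tostring(out, encoding="unicode")

source = """<molecule id="m1">
  <metadataList>
    <metadata name="dc:creator" content="Foobarchem"/>
  </metadataList>
</molecule>"""
converted = '<molecule id="m1"><atomArray/></molecule>'

result = carry_metadata(source, converted)
print(result)
```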
I suspect the question could be rephrased as "how do I convert a file
containing small-molecule information and produce a CML file which contains
the atoms, bonds and their properties without loss? Each molecule is
separately identifiable and there is no contextual linkage between them
(e.g. they aren't poses, supramolecules, etc.). The file(s) may contain many
independent molecules and batch conversion is required."
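To make the shape of that task concrete, here is a rough sketch of a batch conversion from an SD file to CML in plain Python. It is an illustration only, not how any of the tools below work: it reads just the counts line, atom block and bond block of V2000 records, ignores charges, stereo, the name line and SD data items, and all function names are invented. The sample input is built by the hypothetical helper `sample_record` so that the fixed-width columns are right.

```python
import xml.etree.ElementTree as ET

# Molfile bond type -> CML bond order (4 is aromatic); other types pass through.
BOND_ORDER = {1: "1", 2: "2", 3: "3", 4: "A"}

def split_sdf(text):
    """Yield each SD record as a list of lines; '$$$$' ends a record."""
    record = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if record:
                yield record
            record = []
        else:
            record.append(line)
    if any(l.strip() for l in record):
        yield record

def molfile_to_cml(lines, mol_id):
    """Convert the atom and bond blocks of one V2000 record to <molecule>."""
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    mol = ET.Element("molecule", {"id": mol_id})
    atoms = ET.SubElement(mol, "atomArray")
    for i, line in enumerate(lines[4:4 + n_atoms], start=1):
        ET.SubElement(atoms, "atom", {
            "id": f"a{i}",
            "elementType": line[31:34].strip(),
            "x3": line[0:10].strip(),
            "y3": line[10:20].strip(),
            "z3": line[20:30].strip(),
        })
    bonds = ET.SubElement(mol, "bondArray")
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        a, b, kind = int(line[0:3]), int(line[3:6]), int(line[6:9])
        ET.SubElement(bonds, "bond", {
            "atomRefs2": f"a{a} a{b}",
            "order": BOND_ORDER.get(kind, str(kind)),
        })
    return mol

def sdf_to_cml(text):
    """Wrap every record of an SD file in one <cml> document."""
    cml = ET.Element("cml")
    for n, record in enumerate(split_sdf(text), start=1):
        cml.append(molfile_to_cml(record, f"m{n}"))
    return cml

def sample_record(name, atoms, bonds):
    """Build a fixed-width V2000 record for the demonstration (hypothetical)."""
    lines = [name, "", ""]
    lines.append(f"{len(atoms):3d}{len(bonds):3d}  0  0  0  0  0  0  0  0999 V2000")
    for symbol, x, y, z in atoms:
        lines.append(f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3} 0  0  0  0  0")
    for a, b, kind in bonds:
        lines.append(f"{a:3d}{b:3d}{kind:3d}  0")
    lines += ["M  END", "$$$$"]
    return "\n".join(lines)

sdf_text = "\n".join([
    sample_record("first", [("C", 0.0, 0.0, 0.0), ("O", 1.43, 0.0, 0.0)], [(1, 2, 1)]),
    sample_record("second", [("N", 0.0, 0.0, 0.0), ("N", 1.1, 0.0, 0.0)], [(1, 2, 3)]),
])
cml = sdf_to_cml(sdf_text)
print(ET.tostring(cml, encoding="unicode"))
```

Everything this sketch ignores is exactly where "without loss" gets difficult, which is the argument for using a maintained converter rather than a home-grown one.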
I currently know of the following tools, and would approach them in this order:
* Openbabel. This has the widest range of file types and can deal with
lists of molecules. Billy Tyrrell, Chris Morley, Geoff Hutchison and I have
variously developed this and Henry Rzepa has carried out roundtripping. We
intend to maintain this as a flagship for CML conversion - i.e. if there is a
problem we will try to respond.
* JUMBO. We have concentrated on complex formats and currently offer:
    * MDL Molfile, SDF (and RXN). This attempts to follow the
      published spec for V2000 files. However since some of the spec appears
      to be specific to MDL programs it is necessarily a subset, albeit a
      fairly comprehensive one.
    * MOL2 format taken from the Tripos spec. This again is a subset
      and does not address recognition of atom type and fragments. Not
      validated.
    * CDX and CDXML. Most of the spec relating to molecules and
      reaction, but not graphic layout, has been implemented. Since CD is a
      very graphically oriented format it is extremely easy to create objects
      which do not formally represent the semantics of the molecule.
      Conversion of any CDX file is likely to be lossy and fuzzy.
    * CIF. This is a complete interpretation of DDL1 with manual
      coding of some of the core dictionary. Although CIF can contain
      chemical structure information this is virtually never used. Hence we
      have to use heuristics to calculate the chemistry and this is almost
      lossless for GOOD CIFs (as published by Acta Cryst.)
    * SMILES. I think this is fairly complete and should include
      stereochemistry.
* CDK. This has a range of file readers and a CML writer. We haven't been
directly involved in the coding but correspond daily with the group. If
there are any problems then I am sure the CDK group would be keen to
address them and we'll help in the discussions.
* JOELib. This has a wide range of functionality, including the calculation
of properties. Again we are in frequent touch, and although I haven't used
it for CML I am sure the authors are responsive.
* BlueObelisk, WebServices and Taverna.
(http://wwmm.ch.cam.ac.uk/presentations/acs2005) This is a recent movement
among a number of OpenSource and Open Data groups to ensure
interoperability. "File conversions" will increasingly be packaged as
WebServices (http://wwmm.ch.cam.ac.uk/gridsphere/gridsphere) or workflows
(such as http://taverna.sf.net). Scientists can then select the services
they require and compose their own application. This will include
conversion, validation, checks for uniqueness, submission to repository,
etc. I suspect that Zinc actually requires a Taverna-like workflow for its
maintenance. Taverna can be used to wrap closed source programs, but of
course these cannot be distributed. We offer WebServices for OpenBabel and
JUMBO as above so anyone can link their conversion requirements. Also our
WS are Open so anyone can clone them to avoid connection problems. We do
not currently offer WebServices that use closed source programs because
there are usually license restrictions by the suppliers and WS cannot yet
deal with complex IPR negotiations. There is no reason why we might not
create some in the future - if so the WebService wrapper would probably be
OpenSource.
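The workflow idea in the last item (conversion, then validation, then a uniqueness check, then submission) can be sketched as a chain of small steps. This is only an illustration in plain Python, not Taverna and not any existing WebService. All names are invented, the "repository" is a list, and the uniqueness key is a deliberately crude stand-in that cannot tell isomers apart; a real workflow would use a canonical identifier.

```python
import xml.etree.ElementTree as ET

def validate(molecule):
    """Accept a molecule only if it has atoms and every bond refers to declared atoms."""
    atom_ids = {a.get("id") for a in molecule.iter("atom")}
    for bond in molecule.iter("bond"):
        if not set(bond.get("atomRefs2", "").split()) <= atom_ids:
            return False
    return bool(atom_ids)

def composition_key(molecule):
    """Crude, order-independent key (hypothetical stand-in for a canonical identifier)."""
    symbol = {a.get("id"): a.get("elementType") for a in molecule.iter("atom")}
    bonds = sorted(
        tuple(sorted(symbol[ref] for ref in b.get("atomRefs2").split()))
        + (b.get("order"),)
        for b in molecule.iter("bond")
    )
    return (tuple(sorted(symbol.values())), tuple(bonds))

def run_workflow(molecules, repository):
    """Validate, de-duplicate and 'submit' molecules; return a small report."""
    seen = set()
    report = {"accepted": 0, "invalid": 0, "duplicate": 0}
    for mol in molecules:
        if not validate(mol):
            report["invalid"] += 1
            continue
        key = composition_key(mol)
        if key in seen:
            report["duplicate"] += 1
            continue
        seen.add(key)
        repository.append(mol)  # stand-in for submission to a repository
        report["accepted"] += 1
    return report

batch = ET.fromstring("""<cml>
  <molecule id="m1">
    <atomArray><atom id="a1" elementType="C"/><atom id="a2" elementType="O"/></atomArray>
    <bondArray><bond atomRefs2="a1 a2" order="3"/></bondArray>
  </molecule>
  <molecule id="m2">
    <atomArray><atom id="x1" elementType="O"/><atom id="x2" elementType="C"/></atomArray>
    <bondArray><bond atomRefs2="x1 x2" order="3"/></bondArray>
  </molecule>
  <molecule id="m3">
    <atomArray><atom id="a1" elementType="C"/></atomArray>
    <bondArray><bond atomRefs2="a1 a9" order="1"/></bondArray>
  </molecule>
</cml>""")

repository = []
report = run_workflow(batch.findall("molecule"), repository)
print(report)
```

Each step is a separate function so that any one of them could be swapped for a call to a remote service without touching the others.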
There are some other Open Source programs (with whose authors we have had
discussions) which read and/or emit CML including:
* BKChem
* Ghemical
I don't know whether these can be used in batch but as they are open source
then anyone can add this. I am also sure they'd be keen to help. I don't
know the degree of conformance.
There are an increasing number of computational chemistry programs which
emit (and often read) CML but this is out of scope in this thread.
We welcome implementations and use of CML by for-profit organisations. CML
itself is an openly published, read-only specification and does not require
implementations to be OpenSource. It does, however, require best efforts to
conform and we shall write more of this later. Although, in principle, it
is possible to write conformant software by reading the spec, in practice
no spec is completely watertight and we encourage discussion. Obviously any
posting to this list advertises the origin of the poster, so companies may
wish to mail privately and will get a private reply. However we have
limited resources and cannot generally give extended free private advice.
There are some closed source tools which read/emit CML. Some of their
authors have not approached us at all. Others have approached us but
expected us to provide complete CML implementations at our own expense.
Since, at present, this is not an attractive business proposition, we haven't
been able to accept these offers. We note that some of them (unidentified)
have since added "CML". We do not know the degree of conformance or
comprehensiveness. Note that some of them are only available through
purchase and we may not have access to them. We do know that some of them
do not conform to the published CML specification and shall be advising
them that this is inconsistent with the use of the term and mark "CML".
Other list readers might like to comment, but please make sure that
statements are factually correct and avoid political discussions.
* ACDLabs. No public information on conformance.
* CambridgeSoft. No public information on conformance.
* Chemaxon (Marvin). We have had no contact from them. This company lists
the CML elements they support and adds many others in the same namespace
which are not CML. The "CML" is therefore not conformant to the published
Schema. There are also semantics which are incompatible with CML (e.g. the
order of atoms may be important). This is "semantic pollution". We shall
write to them soon, advising them that this is unacceptable. There are
technical fixes to some of this such as the use of proprietary namespaces
for attributes, elements and datatypes.
* Foo. private communication.
* Bar. private communication.
* Xyzzy. private communication.
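The namespace fix mentioned under Chemaxon can be illustrated with a short sketch: vendor additions go in the vendor's own namespace, so that a CML-only consumer can recognise and drop them. Both URIs are assumptions. `CML_NS` is the namespace URI used by later CML schema releases and may not match the schema current at the time of this message. `VENDOR_NS`, the `vx` prefix and the `displayColour` attribute are purely hypothetical.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"            # assumed CML namespace URI
VENDOR_NS = "http://example.com/vendor-extensions"  # hypothetical vendor namespace

ET.register_namespace("cml", CML_NS)
ET.register_namespace("vx", VENDOR_NS)

def strip_foreign_attributes(root):
    """Remove attributes that belong to any namespace other than CML's."""
    for element in root.iter():
        for name in list(element.attrib):
            if name.startswith("{") and not name.startswith("{" + CML_NS + "}"):
                del element.attrib[name]
    return root

# A CML atom carrying one vendor-specific attribute in the vendor's namespace.
atom = ET.Element("{%s}atom" % CML_NS, {"id": "a1", "elementType": "C"})
atom.set("{%s}displayColour" % VENDOR_NS, "grey")
extended = ET.tostring(atom, encoding="unicode")

# A consumer that only understands CML discards what it does not recognise.
plain = ET.tostring(strip_foreign_attributes(ET.fromstring(extended)),
                    encoding="unicode")
print(extended)
print(plain)
```

The same separation applies to elements and datatypes: whatever is not in the published Schema lives under the vendor's URI rather than CML's.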
I shall write separately on compchem and semantics.
P.
Peter Murray-Rust
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069 Fax: +44 1223 763076
----- End forwarded message -----
--
Eugen* Leitl <a href="http://leitl.org">leitl</a>
______________________________________________________________
ICBM: 48.07078, 11.61144 http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
http://moleculardevices.org http://nanomachines.net