Converting to CML [was Re: [cml/ccml-discuss] RE: [Zinc-fans] database formats? (fwd from jji [] cgl.ucsf.edu)] (fwd from pm286 [] cam.ac.uk)



----- Forwarded message from Peter Murray-Rust <pm286 [] cam.ac.uk> -----
 From: Peter Murray-Rust <pm286 [] cam.ac.uk>
 Date: Thu, 07 Apr 2005 08:54:05 +0100
 To: Geoff Hutchison <grh25 [] cornell.edu>,
 	"John Irwin" <jji [] cgl.ucsf.edu>
 Cc: cml-discuss [] lists.sourceforge.net,
 	openbabel-discuss [] lists.sourceforge.net,
 	simon Tyrrell <simon.tyrrell [] virgin.net>,
 	cdk-devel [] lists.sourceforge.net, Tom Oinn <tmo [] ebi.ac.uk>
 Subject: Converting to CML [was Re: [cml/ccml-discuss] RE: [Zinc-fans]
   database formats? (fwd from jji [] cgl.ucsf.edu)]
 X-Mailer: QUALCOMM Windows Eudora Version 6.0.1.1
 [Crossposted to 3 lists, please be considerate]
 [John Irwin]
 >>... Can you recommend software for
 >>preparing and manipulating CML files? If OE offered CML, we could and might
 >>offer CML tomorrow.
 There are many good tools for converting files to CML. First, some words
 about strategy.
 CML is powerful enough to hold compound documents such as compound data
 cards, computational chemistry output and (when combined with XHTML)
 complete scientific documents. So "converting to CML" can involve
 components such as molecules, reactions, their properties, spectra,
 eigenproperties, etc. In general CML can hold any information composed of
 simple datatypes (numbers, strings, arrays, matrices, etc.) and predefined
 schema elements (reactions, spectra...). We are devising a mechanism for
 building complex datatypes (e.g. critical point, phase diagrams).
 Most people currently want to manage molecular data and I'll stick with
 that. JohnI and I have already corresponded usefully, so I believe that a
 Zinc entry consists of at least:
 * a molecule
 * its provenance
 * published names
 * published properties
 * calculated properties
 * intellectual property rights
 CML can manage all of this except the IPR. To summarise John's mail, Zinc
 consists of molecular information supplied by compound suppliers under
 contract, for which properties are calculated using software made available
 under contract and then collected in a database which itself has
 restrictions on use (e.g. only limited subsets can be distributed, and for
 restricted use). CML is not capable of managing the complexity of this IPR
 so the converter would have to add this, preferably in RDF. [Note that this
 problem does not occur for Open data since we can simply add a BOAI or
 Creative Commons license.]
 The provenance (without rights) is managed by the DublinCore dc:creator and
 dc:publisher in CML:
 <metadataList>
   <metadata name="dc:creator" content="Foobarchem"/>
   <metadata name="dc:publisher" content="ji [] cgl.ucsf.edu"/>
 </metadataList>
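
 As a minimal sketch of how a converter might emit this provenance block,
 assuming Python's standard xml.etree.ElementTree module. The helper name
 build_metadata_list and the publisher address are hypothetical, chosen only
 for illustration:

```python
import xml.etree.ElementTree as ET


def build_metadata_list(creator, publisher):
    """Return a <metadataList> element carrying Dublin Core provenance.

    Hypothetical helper: the element and attribute names follow the
    snippet in the message, nothing else is implied about CML tooling.
    """
    metadata_list = ET.Element("metadataList")
    ET.SubElement(metadata_list, "metadata",
                  {"name": "dc:creator", "content": creator})
    ET.SubElement(metadata_list, "metadata",
                  {"name": "dc:publisher", "content": publisher})
    return metadata_list


if __name__ == "__main__":
    # "publisher@example.org" is a placeholder, not a real address.
    element = build_metadata_list("Foobarchem", "publisher@example.org")
    print(ET.tostring(element, encoding="unicode"))
```

 The same element could be attached to each molecule or to the file as a
 whole, depending on how fine-grained the provenance needs to be.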
 CML can, in principle, hold everything else without loss. Since I don't
 know the range of properties I don't know which are complex, but assuming
 that most are scalar, then the simple approach is to render them as:
 <property dictRef="zinc:mpt">
   <scalar units="units:celsius" min="121" max="123.5" errorBasis="range"/>
 </property>
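
 Reading such a property back is equally simple. The following is a sketch
 only, again assuming Python's standard xml.etree.ElementTree; the function
 name read_range_property is hypothetical and the sample text is the snippet
 shown in the message:

```python
import xml.etree.ElementTree as ET

# The property snippet from the message, used here as sample input.
SAMPLE = """<property dictRef="zinc:mpt">
  <scalar units="units:celsius" min="121" max="123.5" errorBasis="range"/>
</property>"""


def read_range_property(xml_text):
    """Extract the dictionary reference, units and numeric range.

    Hypothetical helper: assumes one <scalar> child with min and max
    attributes, as in the example; other property shapes are not handled.
    """
    prop = ET.fromstring(xml_text)
    scalar = prop.find("scalar")
    return {
        "dictRef": prop.get("dictRef"),
        "units": scalar.get("units"),
        "min": float(scalar.get("min")),
        "max": float(scalar.get("max")),
    }


if __name__ == "__main__":
    print(read_range_property(SAMPLE))
```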
 === OK, most people weren't expecting that! BUT provenance and
 redistribution are increasingly important. That is why the default action of
 OpenBabel when outputting CML is to add metadata. We would hope that if
 users add metadata to the input (only possible in CML) it would be
 transported through ===
 I suspect the question could be rephrased as "how do I convert a file
 containing small-molecule information and produce a CML file which contains
 the atoms, bonds and their properties without loss? Each molecule is
 separately identifiable and there is no contextual linkage between them
 (e.g. they aren't poses, supramolecules, etc.). The file(s) may contain many
 independent molecules and batch conversion is required."
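
 One way to sanity-check the result of such a conversion is to confirm that
 every molecule has a unique id and to count its atoms and bonds for
 comparison with the source file. This is a sketch under assumptions: it
 uses Python's standard xml.etree.ElementTree, assumes the output is in the
 http://www.xml-cml.org/schema namespace, and the sample document and the
 function name summarise_molecules are made up for illustration:

```python
import xml.etree.ElementTree as ET

CML_URI = "http://www.xml-cml.org/schema"
CML_NS = {"cml": CML_URI}

# Invented two-molecule sample; real converter output will be richer.
SAMPLE = """<cml xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C"/>
      <atom id="a2" elementType="O"/>
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="2"/>
    </bondArray>
  </molecule>
  <molecule id="m2">
    <atomArray>
      <atom id="a1" elementType="N"/>
    </atomArray>
  </molecule>
</cml>"""


def summarise_molecules(cml_text):
    """Map each molecule id to (atom count, bond count).

    Raises ValueError if a molecule has no id or an id is repeated,
    since each molecule is meant to be separately identifiable.
    """
    root = ET.fromstring(cml_text)
    summary = {}
    for molecule in root.iter("{%s}molecule" % CML_URI):
        mol_id = molecule.get("id")
        if mol_id is None or mol_id in summary:
            raise ValueError("missing or duplicate molecule id: %r" % mol_id)
        atoms = molecule.findall("cml:atomArray/cml:atom", CML_NS)
        bonds = molecule.findall("cml:bondArray/cml:bond", CML_NS)
        summary[mol_id] = (len(atoms), len(bonds))
    return summary


if __name__ == "__main__":
    print(summarise_molecules(SAMPLE))
```

 Matching counts do not prove a lossless conversion, but a mismatch is a
 cheap early warning in a batch run.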
 I currently know of the following tools, and would approach them in this order:
 * Openbabel. This has the widest range of file types and can deal with
 lists of molecules. Billy Tyrrell, Chris Morley, Geoff Hutchison and I have
 variously developed this and Henry Rzepa has carried out roundtripping. We
 intend to maintain this as a flagship for CML conversion - i.e. if there is a
 problem we will try to respond.
 * JUMBO. We have concentrated on complex formats and currently offer
         * MDL Molfile, SDF (and RXN). This attempts to follow the
 published spec for V2000 files. However since some of the spec appears to
 be specific to MDL programs it is necessarily a subset, albeit a fairly
 comprehensive one.
         * MOL2 format taken from the Tripos spec. This again is a subset
 and does not address recognition of atom type and fragments. Not validated.
         * CDX and CDXML. Most of the spec relating to molecules and
 reactions, but not graphic layout, has been implemented. Since CD is a very
 graphically oriented format it is extremely easy to create objects which do
 not formally represent the semantics of the molecule. Conversion of any CDX
 file is likely to be lossy and fuzzy.
         * CIF. This is a complete interpretation of DDL1 with manual
 coding of some of the core dictionary. Although CIF can contain chemical
 structure information this is virtually never used. Hence we have to use
 heuristics to calculate the chemistry and this is almost lossless for GOOD
 CIFs (as published by Acta Cryst.)
         * SMILES. I think this is fairly complete and should include
 stereochemistry.
 * CDK. This has a range of file readers and a CML writer. We haven't been
 directly involved in the coding but correspond daily with the group. If
 there are any problems then I am sure the CDK group would be keen to
 address them and we'll help in the discussions.
 * JOELib. This has a wide range of functionality, including the calculation
 of properties. Again we are in frequent touch, and although I haven't used
 it for CML I am sure the authors are responsive.
 * BlueObelisk, WebServices and Taverna.
 (http://wwmm.ch.cam.ac.uk/presentations/acs2005) This is a recent movement
 among a number of OpenSource and Open Data groups to ensure
 interoperability. "File conversions" will increasingly be packaged as
 WebServices (http://wwmm.ch.cam.ac.uk/gridsphere/gridsphere) or workflows
 (such as http://taverna.sf.net). Scientists can then select the services
 they require and compose their own application. This will include
 conversion, validation, checks for uniqueness, submission to repository,
 etc. I suspect that Zinc actually requires a Taverna-like workflow for its
 maintenance. Taverna can be used to wrap closed source programs, but of
 course these cannot be distributed. We offer WebServices for OpenBabel and
 JUMBO as above so anyone can link their conversion requirements. Also our
 WS are Open so anyone can clone them to avoid connection problems. We do
 not currently offer WebServices that use closed source programs because
 there are usually license restrictions by the suppliers and WS cannot yet
 deal with complex IPR negotiations. There is no reason why we might not
 create some in the future - if so the WebService wrapper would probably be
 OpenSource.
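
 For the Open Babel route listed first above, a batch conversion of an SD
 file to CML might be scripted along these lines. This is a sketch under
 assumptions: it relies on an Open Babel installation that provides the
 obabel command on the PATH, and the helper names obabel_command and
 convert_to_cml are hypothetical:

```python
import shutil
import subprocess
from pathlib import Path


def obabel_command(input_path, output_path, input_format="sdf"):
    """Build the argument list for an obabel conversion to CML.

    Uses obabel's -i<format>, -o<format> and -O <outfile> options.
    """
    return ["obabel", "-i" + input_format, str(input_path),
            "-ocml", "-O", str(output_path)]


def convert_to_cml(input_path, input_format="sdf"):
    """Convert one multi-molecule file to CML next to the input file.

    Assumes obabel is installed; raises RuntimeError otherwise.
    """
    output_path = Path(input_path).with_suffix(".cml")
    if shutil.which("obabel") is None:
        raise RuntimeError("obabel not found on PATH")
    subprocess.run(obabel_command(input_path, output_path, input_format),
                   check=True)
    return output_path


if __name__ == "__main__":
    # Only prints the command; "zinc_subset.sdf" is a placeholder name.
    print(" ".join(obabel_command("zinc_subset.sdf", "zinc_subset.cml")))
```

 Keeping the command construction separate from its execution makes the
 script easy to check without Open Babel present.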
 There are some other Open Source programs (with whose authors we have had
 discussions) which read and/or emit CML including:
 * BKChem
 * Ghemical
 I don't know whether these can be used in batch but as they are open source
 then anyone can add this. I am also sure they'd be keen to help. I don't
 know the degree of conformance.
 There are an increasing number of computational chemistry programs which
 emit (and often read) CML but this is out of scope in this thread.
 We welcome implementations and use of CML by for-profit organisations. CML
 itself is an openly published, read-only specification and does not require
 implementations to be OpenSource. It does, however, require best efforts to
 conform and we shall write more of this later. Although, in principle, it
 is possible to write conformant software by reading the spec, in practice
 no spec is completely watertight and we encourage discussion. Obviously any
 posting to this list advertises the origin of the poster, so companies may
 wish to mail privately and will get a private reply. However we have
 limited resources and cannot generally give extended free private advice.
 There are some closed source tools which read/emit CML. Some of their
 authors have not approached us at all. Others have approached us but
 expected us to provide complete CML implementations at our own expense.
 Since, at present, this is not an attractive business proposition, we haven't
 been able to accept these offers. We note that some of them (unidentified)
 have since added "CML". We do not know the degree of conformance or
 comprehensiveness. Note that some of them are only available through
 purchase and we may not have access to them. We do know that some of them
 do not conform to the published CML specification and shall be advising
 them that this is inconsistent with the use of the term and mark "CML".
 Other list readers might like to comment, but please make sure that
 statements are factually correct and avoid political discussions.
 * ACDLabs. No public information on conformance.
 * CambridgeSoft. No public information on conformance.
 * Chemaxon (Marvin). We have had no contact from them. This company lists
 the CML elements it supports and adds many others in the same namespace
 which are not CML. The "CML" is therefore not conformant to the published
 Schema. There are also semantics which are incompatible with CML (e.g. the
 order of atoms may be important). This is "semantic pollution". We shall
 write to them soon, advising them that this is unacceptable. There are
 technical fixes to some of this such as the use of proprietary namespaces
 for attributes, elements and datatypes.
 * Foo. private communication.
 * Bar. private communication.
 * Xyzzy. private communication.
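
 The namespace fix mentioned above for vendor extensions can be illustrated
 with a small check that flags unknown elements placed in the CML namespace
 while leaving elements in a vendor's own namespace alone. This is a sketch
 only: the vendor namespace URI, the renderHint element and the tiny
 KNOWN_CML_ELEMENTS set are invented for illustration and are not the full
 CML schema:

```python
import xml.etree.ElementTree as ET

CML_URI = "http://www.xml-cml.org/schema"

# Illustrative subset only; a real check would validate against the schema.
KNOWN_CML_ELEMENTS = {"cml", "molecule", "atomArray", "atom",
                      "bondArray", "bond"}

# Invented sample: one extension in a vendor namespace (acceptable) and
# the same extension wrongly placed in the CML namespace.
SAMPLE = """<cml xmlns="http://www.xml-cml.org/schema"
     xmlns:vendor="http://example.com/vendor">
  <molecule id="m1">
    <atomArray><atom id="a1" elementType="C"/></atomArray>
    <vendor:renderHint colour="red"/>
    <renderHint colour="blue"/>
  </molecule>
</cml>"""


def foreign_elements_in_cml_namespace(cml_text, known=KNOWN_CML_ELEMENTS):
    """List local names of CML-namespace elements not in the known set."""
    prefix = "{" + CML_URI + "}"
    offenders = []
    for element in ET.fromstring(cml_text).iter():
        if element.tag.startswith(prefix):
            local_name = element.tag[len(prefix):]
            if local_name not in known:
                offenders.append(local_name)
    return offenders


if __name__ == "__main__":
    print(foreign_elements_in_cml_namespace(SAMPLE))
```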
 I shall write separately on compchem and semantics.
 P.
 Peter Murray-Rust
 Unilever Centre for Molecular Informatics
 Chemistry Department, Cambridge University
 Lensfield Road, CAMBRIDGE, CB2 1EW, UK
 Tel: +44-1223-763069 Fax: +44 1223 763076
 ----- End forwarded message -----
 --
 Eugen* Leitl <a href="http://leitl.org">leitl</a>
 ______________________________________________________________
 ICBM: 48.07078, 11.61144            http://www.leitl.org
 8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
 http://moleculardevices.org         http://nanomachines.net