XML for Bioinformatics Data

From: Gerald Loeffler <Gerald.Loeffler.,at,.vienna.at>
Organization: Apollo Imaging
Subject: XML for Bioinformatics Data
Date: Fri, 30 Apr 1999 11:26:53 +0200
Hi!
 Recently, I've been working a lot with XML (see http://www.w3c.org/xml/
 and e.g. http://www.ibm.com/xml/), which is a standard, human-readable,
 extensible markup language that is rapidly becoming _the_ method of
 choice for exchange and storage of any kind of data and documents. It
 seems to me that XML would simply be _perfect_ for data exchange and
 maybe even data storage in bioinformatics (see end of message for a note
 on chemistry and CML).
 E.g. (off the top of my head), a DNA/protein sequence similarity search
 engine (e.g. NCBI's BLAST server) might return its search results in the
 form of an XML document that could look like this:
 <seq-sim-search-results>
   <query>
     <type>                         protein     </type>
     <seq name="My stupid peptide"> GAVLIFYWSTQ </seq>
     <algorithm>                    FASTA3      </algorithm>
     <db>                           SwissProt   </db>
     <gap-open>                    -12          </gap-open>
     <gap-extension>               -2           </gap-extension>
   </query>
   <hits>
     <hit>
       <accession>      HPS_HUMAN    </accession>
       <organism>       homo sapiens </organism>
       <overlap>        11           </overlap>
       <overlaping-seq> GAEVLFYWTDQ  </overlaping-seq>
       <z-score>        129.3        </z-score>
     </hit>
     <hit>
       <accession>      PA24_MOUSE   </accession>
       <organism>       mus musculus </organism>
       <overlap>        8            </overlap>
       <overlaping-seq> VFIFYWTT     </overlaping-seq>
       <z-score>        133.3        </z-score>
     </hit>
   </hits>
 </seq-sim-search-results>
 There are several important points here:
 1) Without knowing what this XML document is about, a program can check
 that it is well-formed! Such programs exist, are free and are
 applicable to all XML documents!
 2) The rules for the nesting and naming of the tags in XML documents of
 this type can be formally defined with a DTD (document type definition),
 which is part of the XML standard. The above document would be of type
 "seq-sim-search-results", and you could easily write a formal definition
 (in a DTD file) that says that such a document must contain a "query"
 and a "hits" tag; the "query" tag in turn must contain exactly one of
 each "type", "seq", ... The "hits" tag in turn may contain 0 or more
 "hit" tags, which in turn ...
 3) Having a formal definition of documents of this type, a program can
 verify that our above XML document complies with the formal definition
 (is valid). These programs exist, are free and are applicable to all XML
 documents!
 4) Free utilities exist (e.g. IBM's xml4j) that can programmatically
 write and read (parse) any XML document and thus give a program access
 to the structure and content of the document!! (No more Perl parsers for
 BLAST output!!)
 5) This file is human-readable! (in contrast to a CORBA struct or a
 serialized Java object!)
 6) Modern WWW-browsers can (if a style-sheet is supplied) directly
 display this XML document. For old browsers, the XML document can easily
 be converted to HTML for display.
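 To make points 1) and 4) a bit more concrete, here is a minimal sketch
 of a generic well-formedness check and of reading the hits out of the
 example document. It uses Python's standard xml.etree.ElementTree
 module as a stand-in for a parser such as xml4j; the function names
 is_well_formed and read_hits are made up for this illustration, the
 embedded document is an abbreviated copy of the one above, and no DTD
 validation (point 3) is attempted.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example document from this message.
RESULTS_XML = """\
<seq-sim-search-results>
  <query>
    <type> protein </type>
    <seq name="My stupid peptide"> GAVLIFYWSTQ </seq>
  </query>
  <hits>
    <hit>
      <accession> HPS_HUMAN </accession>
      <organism> homo sapiens </organism>
      <overlap> 11 </overlap>
      <overlaping-seq> GAEVLFYWTDQ </overlaping-seq>
      <z-score> 129.3 </z-score>
    </hit>
    <hit>
      <accession> PA24_MOUSE </accession>
      <organism> mus musculus </organism>
      <overlap> 8 </overlap>
      <overlaping-seq> VFIFYWTT </overlaping-seq>
      <z-score> 133.3 </z-score>
    </hit>
  </hits>
</seq-sim-search-results>
"""


def is_well_formed(text):
    """Return True if text parses as XML. This checks well-formedness only."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def read_hits(text):
    """Return one dict per <hit> element, with padding whitespace removed."""
    root = ET.fromstring(text)
    hits = []
    for hit in root.findall("./hits/hit"):
        hits.append({
            "accession": hit.findtext("accession").strip(),
            "organism": hit.findtext("organism").strip(),
            # int() and float() tolerate the surrounding whitespace.
            "overlap": int(hit.findtext("overlap")),
            "z_score": float(hit.findtext("z-score")),
        })
    return hits


if __name__ == "__main__":
    print(is_well_formed(RESULTS_XML))
    for hit in read_hits(RESULTS_XML):
        print(hit["accession"], hit["z_score"])
```

 Note that is_well_formed knows nothing about sequences or hits; the
 same function would work on any XML document, which is the point.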
 I think you get the idea.
 Does such an XML-based approach sound reasonable?
 What does this approach leave to be desired?
 Are efforts underway in this direction?
 Wouldn't it be a better world if we all used XML? (-:
 I know that XML is currently being used for chemistry-related data (CML,
 see http://www.xml-cml.org/), but I haven't heard of any efforts in the
 area of Bioinformatics. So please view this message as targeted towards
 the Bioinformatics community that is not served by CML. (CML has a
 DNA/protein sequence tag.)
         cheers,
         gerald
 --
  Gerald Loeffler
  Email: Gerald.Loeffler.,at,.vienna.at
  Smail: Apollo Imaging, Marchettigasse 7, A-1060 Vienna, Austria
  Phone: +43 676 3289588 (+43 1 5952333 27)
  Fax:   +43 1 5952333 20
  Keywords: Java, CORBA, OOA&D, Databases, Bioinformatics,
            Computational Biology, Computational Biophysics
  "Wir haben nichts zu berichten, als dass wir erbaermlich sind."
                                                (Thomas Bernhard)