XML for Bioinformatics Data
- From: Gerald Loeffler <Gerald.Loeffler.,at,.vienna.at>
- Organization: Apollo Imaging
- Subject: XML for Bioinformatics Data
- Date: Fri, 30 Apr 1999 11:26:53 +0200
Hi!
Recently, I've been working a lot with XML (see http://www.w3c.org/xml/
and e.g. http://www.ibm.com/xml/), a standard, human-readable,
extensible markup language that is rapidly becoming _the_ method of
choice for the exchange and storage of any kind of data and documents.
It seems to me that XML would be _perfect_ for data exchange and maybe
even data storage in bioinformatics (see the end of this message for a
note on chemistry and CML).
E.g. (off the top of my head), a DNA/protein sequence similarity search
engine (e.g. NCBI's BLAST server) might return its search results in the
form of an XML document that could look like this:
<seq-sim-search-results>
  <query>
    <type> protein </type>
    <seq name="My stupid peptide"> GAVLIFYWSTQ </seq>
    <algorithm> FASTA3 </algorithm>
    <db> SwissProt </db>
    <gap-open> -12 </gap-open>
    <gap-extension> -2 </gap-extension>
  </query>
  <hits>
    <hit>
      <accession> HPS_HUMAN </accession>
      <organism> homo sapiens </organism>
      <overlap> 11 </overlap>
      <overlapping-seq> GAEVLFYWTDQ </overlapping-seq>
      <z-score> 129.3 </z-score>
    </hit>
    <hit>
      <accession> PA24_MOUSE </accession>
      <organism> mus musculus </organism>
      <overlap> 8 </overlap>
      <overlapping-seq> VFIFYWTT </overlapping-seq>
      <z-score> 133.3 </z-score>
    </hit>
  </hits>
</seq-sim-search-results>
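As a rough illustration of how a program might consume such a document,
here is a minimal sketch in Python, using only the standard library's
xml.etree.ElementTree module. Python is just a convenient choice for the
sketch; the function names is_well_formed and read_hits are made up for
this illustration, and the embedded document is a trimmed copy of the
one above (gap parameters and overlapping sequences omitted).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the search-result document shown above.
DOC = """
<seq-sim-search-results>
  <query>
    <type> protein </type>
    <seq name="My stupid peptide"> GAVLIFYWSTQ </seq>
    <algorithm> FASTA3 </algorithm>
    <db> SwissProt </db>
  </query>
  <hits>
    <hit>
      <accession> HPS_HUMAN </accession>
      <organism> homo sapiens </organism>
      <overlap> 11 </overlap>
      <z-score> 129.3 </z-score>
    </hit>
    <hit>
      <accession> PA24_MOUSE </accession>
      <organism> mus musculus </organism>
      <overlap> 8 </overlap>
      <z-score> 133.3 </z-score>
    </hit>
  </hits>
</seq-sim-search-results>
"""


def is_well_formed(text):
    """Return True if text parses as XML; no knowledge of the tags is needed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def read_hits(text):
    """Return (accession, organism, overlap, z_score) for every <hit>."""
    root = ET.fromstring(text)
    hits = []
    for hit in root.findall("./hits/hit"):
        hits.append((
            hit.findtext("accession").strip(),
            hit.findtext("organism").strip(),
            int(hit.findtext("overlap")),    # int() tolerates the padding spaces
            float(hit.findtext("z-score")),  # so does float()
        ))
    return hits


if __name__ == "__main__":
    print(is_well_formed(DOC))
    for accession, organism, overlap, z_score in read_hits(DOC):
        print(accession, organism, overlap, z_score)
```

The same few lines would work for any search engine that agreed on this
(hypothetical) tag vocabulary, since nothing in them depends on the
layout of the text, only on the element names.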
There are several important points here:
1) Without knowing what this XML document is about, a program can verify
that it is well-formed (every tag is closed and properly nested)! Such
programs exist, are free and are applicable to all XML documents!
2) The rules for the nesting and naming of the tags in XML documents of
this type can be formally defined in XML. The above document would be of
type "seq-sim-search-results", and you could easily write a formal
definition (in a DTD file) that says that such a document must contain a
"query" and a "hits" tag; the "query" tag in turn must contain exactly
one of each of "type", "seq", ... The "hits" tag in turn may contain 0
or more "hit" tags, which in turn ...
3) Given a formal definition of documents of this type, a program can
verify that our above XML document complies with the formal definition
(is valid). Such programs exist, are free and are applicable to all XML
documents!
4) Free utilities exist (e.g. IBM's xml4j) that can programmatically
write and read (parse) any XML document and thus give a program access
to the structure and content of the document!! (No more Perl parsers for
BLAST output!!)
5) This file is human-readable! (In contrast to a CORBA struct or a
serialized Java object!)
6) Modern web browsers can (if a style sheet is supplied) display this
XML document directly. For old browsers, the XML document can easily be
converted to HTML for display.
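To make points 2) and 3) a little more concrete, here is a rough Python
sketch of the kind of check a validating parser would perform. Python's
standard library does not include a validating parser, so the sketch
checks the rules by hand instead of reading a DTD; the function name
structure_errors, the element order "query" then "hits", and the list of
children required inside "query" are assumptions made for this
illustration. The DTD lines in the comment show how such rules might be
declared and are likewise only a guess at what a real definition would
contain.

```python
import xml.etree.ElementTree as ET

# A DTD for this document type might contain declarations such as:
#   <!ELEMENT seq-sim-search-results (query, hits)>
#   <!ELEMENT query (type, seq, algorithm, db, gap-open, gap-extension)>
#   <!ELEMENT hits (hit*)>
# The hand-written check below counts the children of <query> but,
# unlike a DTD sequence, does not enforce their order.

# Assumed rule: each of these must occur exactly once inside <query>.
QUERY_CHILDREN = ("type", "seq", "algorithm", "db", "gap-open", "gap-extension")


def structure_errors(text):
    """Return a list of rule violations; an empty list means the rules hold."""
    root = ET.fromstring(text)  # raises ET.ParseError if not well-formed
    if root.tag != "seq-sim-search-results":
        return ["root element must be seq-sim-search-results"]
    if [child.tag for child in root] != ["query", "hits"]:
        return ["root must contain one query followed by one hits"]

    query, hits = list(root)
    errors = []
    for name in QUERY_CHILDREN:
        count = len(query.findall(name))
        if count != 1:
            errors.append(f"query must contain exactly one {name}, found {count}")
    for child in hits:
        if child.tag != "hit":
            errors.append(f"hits may only contain hit, found {child.tag}")
    return errors


if __name__ == "__main__":
    sample = (
        "<seq-sim-search-results><query>"
        "<type>protein</type><seq>GAVLIFYWSTQ</seq>"
        "<algorithm>FASTA3</algorithm><db>SwissProt</db>"
        "<gap-open>-12</gap-open><gap-extension>-2</gap-extension>"
        "</query><hits/></seq-sim-search-results>"
    )
    print(structure_errors(sample))
```

With a real validating parser none of these rules would live in program
code; they would live in the DTD, and the same generic validator would
serve every document type.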
I think you get the idea.
Does such an XML-based approach sound reasonable?
What does this approach leave to be desired?
Are efforts underway in this direction?
Wouldn't it be a better world if we all used XML? (-:
I know that XML is currently being used for chemistry-related data (CML,
see http://www.xml-cml.org/), but I haven't heard of any efforts in the
area of bioinformatics. So please view this message as targeted towards
the bioinformatics community that is not served by CML. (CML has a
DNA/protein sequence tag.)
cheers,
gerald
--
Gerald Loeffler
Email: Gerald.Loeffler.,at,.vienna.at
Smail: Apollo Imaging, Marchettigasse 7, A-1060 Vienna, Austria
Phone: +43 676 3289588 (+43 1 5952333 27)
Fax: +43 1 5952333 20
Keywords: Java, CORBA, OOA&D, Databases, Bioinformatics,
Computational Biology, Computational Biophysics
"Wir haben nichts zu berichten, als dass wir erbaermlich sind."
(Thomas Bernhard)