From chemistry-request@server.ccl.net  Fri Apr 30 05:32:25 1999
Received: from www.ccl.net (www.ccl.net [192.148.249.5])
	by server.ccl.net (8.8.7/8.8.7) with ESMTP id FAA31403
	for <chemistry@ccl.net>; Fri, 30 Apr 1999 05:32:23 -0400
Received: from mordor.ai.private (fw.apollo-imaging.com [195.26.208.226])
        by www.ccl.net (8.8.3/8.8.6/OSC/CCL 1.0) with ESMTP id FAA20461
        Fri, 30 Apr 1999 05:28:47 -0400 (EDT)
Received: from vienna.at (gl@bombadil.ai.private [10.33.246.70])
	by mordor.ai.private (8.9.2/8.9.2/Debian/GNU) with ESMTP id MAA13341
	for <chemistry@www.ccl.net>; Fri, 30 Apr 1999 12:31:50 +0200 (CEST)
Sender: gl@apollo-imaging.com
Message-ID: <3729775D.C5D6C8BA@vienna.at>
Date: Fri, 30 Apr 1999 11:26:53 +0200
From: Gerald Loeffler <Gerald.Loeffler@vienna.at>
Organization: Apollo Imaging
X-Mailer: Mozilla 4.5 [en] (X11; I; Linux 2.0.36 i686)
X-Accept-Language: en
MIME-Version: 1.0
To: Computational Chemistry Mailing List <chemistry@www.ccl.net>
Subject: XML for Bioinformtics Data
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Hi!

Recently, I've been working a lot with XML (see http://www.w3c.org/xml/
and e.g. http://www.ibm.com/xml/), which is a standard, human-readable,
extensible markup-language that is rapidly becoming _the_ method of
choice for exchange and storage of any kind of data and documents. It
seems to me that XML would simply be _perfect_ for data exchange and
maybe even data storage in bioinformatics (see end of message for a note
on chemistry and CML).

E.g. (off the top of my head), a DNA/protein sequence similarity search
engine (e.g. NCBI's BLAST server) might return its search results in the
form of an XML document that could look like this:

<seq-sim-search-results>
  <query>
    <type>                         protein     </type>
    <seq name="My stupid peptide"> GAVLIFYWSTQ </seq>
    <algorithm>                    FASTA3      </algorithm>
    <db>                           SwissProt   </db>
    <gap-open>                    -12          </gap-open>
    <gap-extension>               -2           </gap-extension>
  </query>
  <hits>
    <hit>
      <accession>       HPS_HUMAN    </accession>
      <organism>        Homo sapiens </organism>
      <overlap>         11           </overlap>
      <overlapping-seq> GAEVLFYWTDQ  </overlapping-seq>
      <z-score>         129.3        </z-score>
    </hit>
    <hit>
      <accession>       PA24_MOUSE   </accession>
      <organism>        Mus musculus </organism>
      <overlap>         8            </overlap>
      <overlapping-seq> VFIFYWTT     </overlapping-seq>
      <z-score>         133.3        </z-score>
    </hit>
  </hits>
</seq-sim-search-results>
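
A rough sketch of how a search server might emit such a document. This
is an illustration only: it assumes Python and its standard
xml.etree.ElementTree module, and the function name build_results and
the layout of its dict arguments are hypothetical. Only some of the tags
from the example are written.

```python
import xml.etree.ElementTree as ET


def build_results(query, hits):
    """Build a <seq-sim-search-results> document and return it as a string.

    query: dict with keys "type", "seq" and "name" (hypothetical layout).
    hits:  list of dicts with keys "accession", "organism" and "z-score".
    """
    root = ET.Element("seq-sim-search-results")

    # The <query> element echoes what the user asked for.
    query_el = ET.SubElement(root, "query")
    ET.SubElement(query_el, "type").text = query["type"]
    ET.SubElement(query_el, "seq", name=query["name"]).text = query["seq"]

    # The <hits> element holds zero or more <hit> children.
    hits_el = ET.SubElement(root, "hits")
    for hit in hits:
        hit_el = ET.SubElement(hits_el, "hit")
        for tag in ("accession", "organism", "z-score"):
            ET.SubElement(hit_el, tag).text = str(hit[tag])

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_results(
        {"type": "protein", "seq": "GAVLIFYWSTQ", "name": "My stupid peptide"},
        [{"accession": "HPS_HUMAN", "organism": "Homo sapiens", "z-score": 129.3}],
    ))
```

The library takes care of escaping and of closing every tag, so the
output is well-formed by construction.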

There are several important points here:

1) Without knowing what this XML document is about, a program can assert
that it is well-formed! These programs exist, are free and are
applicable to all XML documents!

2) The rules for the nesting and naming of the tags in XML documents of
this type can be formally defined in a DTD (document type definition),
which is part of the XML standard. The above document would be of
type "seq-sim-search-results" and you could easily write a DTD that says
that such a document must contain a "query" and a "hits" element; the
"query" element in turn must contain exactly one of each "type", "seq",
... The "hits" element in turn may contain 0 or more "hit" elements
which in turn ...

3) Having a formal definition of documents of this type, a program can
verify that our above XML document complies with the formal definition
(is valid). These programs exist, are free and are applicable to all XML
documents!

4) Free utilities exist (e.g. IBM's xml4j) that can programmatically
write and read (parse) any XML document and thus give a program access
to the structure and content of the document!! (No more Perl parsers for
BLAST output!!)

5) This file is human-readable! (in contrast to a CORBA struct or a
serialized Java object!)

6) Modern WWW-browsers can (if a style-sheet is supplied) directly
display this XML document. For old browsers, the XML document can easily
be converted to HTML for display.
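
Points 1) and 4) above can be sketched in a few lines. This is only an
illustration, not xml4j: it assumes Python and its standard
xml.etree.ElementTree module, the function names is_well_formed and
read_hits are made up for the example, and DOC is an abbreviated copy of
the document above.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example document (only some tags per hit).
DOC = """\
<seq-sim-search-results>
  <query>
    <type> protein </type>
    <seq name="My stupid peptide"> GAVLIFYWSTQ </seq>
  </query>
  <hits>
    <hit>
      <accession> HPS_HUMAN </accession>
      <z-score> 129.3 </z-score>
    </hit>
    <hit>
      <accession> PA24_MOUSE </accession>
      <z-score> 133.3 </z-score>
    </hit>
  </hits>
</seq-sim-search-results>
"""


def is_well_formed(text):
    """Return True if text parses as XML; the tag names are never inspected."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def read_hits(text):
    """Return an (accession, z-score) pair for every <hit> in the document."""
    root = ET.fromstring(text)
    hits = []
    for hit in root.findall("./hits/hit"):
        accession = hit.findtext("accession").strip()
        z_score = float(hit.findtext("z-score"))  # float() ignores the padding
        hits.append((accession, z_score))
    return hits


if __name__ == "__main__":
    print(is_well_formed(DOC))                   # True
    print(is_well_formed("<hits><hit></hits>"))  # False: <hit> is never closed
    print(read_hits(DOC))
```

Note that is_well_formed knows nothing about sequence searches, which is
the point of 1). Checking validity against a DTD, as in 3), needs a
validating parser, which this standard-library module is not.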

I think you get the idea.

Does such an XML-based approach sound reasonable?
What does this approach leave to be desired?
Are efforts underway in this direction?
Wouldn't it be a better world if we all used XML? (-:

I know that XML is currently being used for chemistry-related data (CML,
see http://www.xml-cml.org/), but I haven't heard of any efforts in the
area of bioinformatics. So please view this message as targeted towards
the bioinformatics community that is not served by CML. (CML has a
DNA/protein sequence tag.)

        cheers,
        gerald
-- 
 Gerald Loeffler
 Email: Gerald.Loeffler@vienna.at
 Smail: Apollo Imaging, Marchettigasse 7, A-1060 Vienna, Austria
 Phone: +43 676 3289588 (+43 1 5952333 27)
 Fax:   +43 1 5952333 20
 Keywords: Java, CORBA, OOA&D, Databases, Bioinformatics,
           Computational Biology, Computational Biophysics

 "Wir haben nichts zu berichten, als dass wir erbaermlich sind."
                                               (Thomas Bernhard)
From chemistry-request@server.ccl.net  Fri Apr 30 07:00:50 1999
Received: from www.ccl.net (www.ccl.net [192.148.249.5])
	by server.ccl.net (8.8.7/8.8.7) with ESMTP id HAA32085
	for <chemistry@ccl.net>; Fri, 30 Apr 1999 07:00:50 -0400
Received: from comsig.nibsc.ac.uk (comsig.nibsc.ac.uk [193.62.43.13])
        by www.ccl.net (8.8.3/8.8.6/OSC/CCL 1.0) with ESMTP id GAA20724
        Fri, 30 Apr 1999 06:57:12 -0400 (EDT)
Received: from nibsc.ac.uk (dlinmf.nibsc.ac.uk [193.62.42.144]) by comsig.nibsc.ac.uk (980427.SGI.8.8.8/970903.SGI.AUTOCF) via ESMTP id LAA01369; Fri, 30 Apr 1999 11:55:05 +0100 (BST)
Message-ID: <37298C17.B38AB66@nibsc.ac.uk>
Date: Fri, 30 Apr 1999 11:55:19 +0100
From: Mark Forster <mforster@nibsc.ac.uk>
Organization: NIBSC
X-Mailer: Mozilla 4.05 [en] (Win95; I)
MIME-Version: 1.0
To: Gerald Loeffler <Gerald.Loeffler@vienna.at>, chemistry@www.ccl.net
Subject: Re: CCL:XML for Bioinformtics Data
References: <3729775D.C5D6C8BA@vienna.at>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Dear Gerald

That is a nice summary of the capabilities and possibilities offered by
XML. Some work in this area has already been done. For more
information on the Biosequence Markup Language (BSML)
see the WWW page of Visual Genomics Inc. at

    http://www.visualgenomics.com/bsml/index.html

A BSML browser and examples are available for download.

What is not currently clear to me is whether a given markup language
must be approved by the WWW Consortium. The Mathematical Markup
Language 1.0 (http://www.w3.org/Math/) was released as a W3C
Recommendation in April 98, but is this required?



--

  Dr Mark J Forster Ph.D.
  Principal Scientist
  Informatics Laboratory
  National Institute for Biological Standards and Control
  Blanche Lane, South Mimms,
  Hertfordshire EN6 3QG, United Kingdom.

  Tel  +44 (0)1707 654753
  FAX  +44 (0)1707 646730
  E-mail  mforster@nibsc.ac.uk


