Mass storage devices - summary
Hi people!
I've received several messages addressing the relative performance of mass
storage devices. Thanks a lot to everyone who responded to my question!
Here is the original question:
**************************************************************************************
A feature common to most of the computational chemistry software currently
available is the huge amount of data that is frequently read from and written
to scratch files. The performance of a computational chemistry application
therefore depends largely on transfer rates to and from hard disks or other
mass storage devices. In this context, let me address a couple of questions
to the list:
i) what is the best strategy (in terms of best performance): write a few, huge
files to huge disk partitions or, alternatively, split big files into smaller
ones over many disk partitions?
ii) could someone point to additional information on fast access mass storage
devices?
I'll summarize to the list!
Claudio.
**************************************************************************************
And the replies:
From: Tony Ferreira <aferreir -x- at -x- memphis.edu>
To: PEROTT -x- at -x- if.ufrgs.br
Date: Tue, 07 Mar 2000 14:04:08 -0600
Claudio,
There are several ways to address this problem. The first is to employ
"direct" methods, which calculate the one- and two-electron integrals as
needed, thus avoiding the disk storage problem entirely.
On Unix systems it is possible to create "striped" volumes which can span
more than one physical device. This tends to be the fastest method (using a
single scratch directory), but some tuning may be needed (e.g., setting the
block size to match the output scheme of the software). Splitting files over
several physical volumes is a good idea IF you run them through separate
controllers. Accessing several drives through the same controller will slow
you down by increasing bus traffic through the drive controller. If you have
multiple controllers in your machine, the best option for splitting files
across several drives is to have each drive controlled independently.
I hope this helps.
Tony Ferreira
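The splitting idea above can be sketched manually in Python: write a scratch
stream round-robin in fixed-size stripes across several directories, each
assumed to sit on its own drive and controller. The mount points and stripe
size below are hypothetical placeholders; a real OS-level striped volume does
this transparently and far faster.

```python
import os

# Hypothetical mount points, each assumed to be on a separate drive/controller.
SCRATCH_DIRS = ["/scratch0", "/scratch1", "/scratch2"]
STRIPE_SIZE = 64 * 1024  # tune to match the application's dominant write size

def striped_write(name, data):
    """Write `data` round-robin in STRIPE_SIZE chunks across SCRATCH_DIRS."""
    handles = [open(os.path.join(d, name), "wb") for d in SCRATCH_DIRS]
    try:
        for i in range(0, len(data), STRIPE_SIZE):
            stripe = i // STRIPE_SIZE
            handles[stripe % len(handles)].write(data[i:i + STRIPE_SIZE])
    finally:
        for h in handles:
            h.close()

def striped_read(name, total_size):
    """Reassemble the stream by reading stripes back in the same order."""
    handles = [open(os.path.join(d, name), "rb") for d in SCRATCH_DIRS]
    try:
        chunks = []
        for i in range(0, total_size, STRIPE_SIZE):
            stripe = i // STRIPE_SIZE
            chunks.append(handles[stripe % len(handles)].read(STRIPE_SIZE))
        return b"".join(chunks)
    finally:
        for h in handles:
            h.close()
```

As Tony suggests for the OS-level block size, STRIPE_SIZE would be set to
match the application's output scheme.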
From: Soaring Bear <bear -x- at -x- dakotacom.net>
To: PEROTT -x- at -x- if.ufrgs.br
Date: Tue, 07 Mar 2000 13:39:42 -0700
This is a very interesting topic which has application outside
of chemistry, as well, and I would appreciate a copy of the
correspondence you receive.
A lot of separate files are likely to slow down disk
access compared with streaming a bunch of sequential
data in one big pull. But if you only need a little
bit of info at a time, small files might be faster than
hunting through one huge file.
Soaring Bear Ph.D. Research Pharmacologist, Informatics,
Chemistry & Biochemistry, Herbalist & Healthy Lifestyles
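This trade-off can be measured directly. A rough Python sketch (file names
and split counts are illustrative) times one sequential write/read of a big
buffer against the same payload split over many small files:

```python
import os
import time

def time_big_file(payload, path):
    """Write `payload` as one sequential file, read it back, return seconds."""
    t0 = time.perf_counter()
    with open(path, "wb") as f:
        f.write(payload)
    with open(path, "rb") as f:
        f.read()
    return time.perf_counter() - t0

def time_small_files(payload, directory, n_files):
    """Split `payload` over `n_files` files in `directory`, read all back."""
    chunk = (len(payload) + n_files - 1) // n_files
    t0 = time.perf_counter()
    for i in range(n_files):
        with open(os.path.join(directory, "part%d" % i), "wb") as f:
            f.write(payload[i * chunk:(i + 1) * chunk])
    for i in range(n_files):
        with open(os.path.join(directory, "part%d" % i), "rb") as f:
            f.read()
    return time.perf_counter() - t0
```

On a cached filesystem both numbers mostly reflect per-file open/close
overhead rather than seek time, so the payload should exceed RAM for
realistic disk figures.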
From: Eugene Leitl <eugene.leitl -x- at -x- lrz.uni-muenchen.de>
To: PEROTT -x- at -x- if.ufrgs.br
Date: Tue, 07 Mar 2000 13:14:45 -0800 (PST)
It depends on the operating system. For Linux, get as much RAM as you
can (0.5-1 GByte), since unused memory is automatically used for
file caching; buy several large disks (e.g., 2-3 of the 40 GByte
Diamond Max), operate each of them on its own EIDE host adapter, and use
software RAID.
Try http://linuxdoc.org/HOWTO/Software-RAID-HOWTO.html
From: phil stortz <pstortz -x- at -x- coffey.com>
To: PEROTT -x- at -x- if.ufrgs.br
Date: Tue, 07 Mar 2000 14:42:17 -0700
SCSI drives with large on-drive caches may be faster, as SCSI
drives can hold one command while running another, should reads
and writes overlap. Also, a RAID array with data striping can
nearly triple the data rate: the data is automagically split
between the three drives in small chunks, effectively using the drives
in parallel. Of course, avoiding fragmentation and using drives with
fast seek times will also help.
Additionally, some RAID controller cards have a large or expandable cache.
I'd say partitions are probably a bad idea, as they force fragmentation
and extra seek distance/time.
From: "Dr. T. Daniel Crawford" <crawdad -x- at -x- ne095.cm.utexas.edu>
To: PEROTT -x- at -x- if.ufrgs.br
Date: Tue, 07 Mar 2000 20:01:01 -0600 (CST)
Claudio,
I/O efficiency is a complicated issue for which there is no single,
best solution. Operating systems (particularly commercial ones designed
for high-performance workstations and supercomputers) vary widely in their
ability to handle transfers of large quantities of data to and from disk.
Although others on the CCL are probably more qualified to answer (and
I very much want to hear what they have to say!), there are perhaps a
few concepts to follow and a few to avoid. For example, when splitting
read/write calls across filesystems, greater efficiency can be obtained
when each filesystem has its own hardware controller (in particular,
with its own memory buffer). This allows the OS to send a buffer of data
to one controller while another read/write is pending on another disk.
On the other hand, there may be no reason to resort to "micromanaging"
the I/O on many modern operating systems, which often choose their own
read/write pattern regardless of what your program requests. At one time,
for example, programmers worried about requesting particular buffer sizes of
data (e.g., 1024k/read or write) in order to ensure alignment with the disk
sector length and avoid delays as the disk spun to orient sector boundaries.
In my own (perhaps limited) I/O testing on IBM RS6ks and DEC Alphas, such
sector-based I/O is apparently unnecessary in general since the OS uses a
highly efficient I/O buffering scheme already.
Again, I would very much like to hear about the experiences of other
programmers on the CCL in dealing with I/O efficiency issues. Good question,
Claudio!
Best regards,
-Daniel
--
T. Daniel Crawford, Ph.D.                 Institute for Theoretical Chemistry
crawdad -x- at -x- jfs1.cm.utexas.edu     Departments of Chemistry and Biochemistry
http://zopyros.ccqc.uga.edu/~crawdad/     The University of Texas
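Daniel's point about request sizes can be checked on a given system by timing
a full sequential read with several buffer sizes, bypassing Python's own
buffering. This is only a sketch; on a cached file it mostly measures memory
copies, so for real disk numbers the file must be larger than RAM or the
cache dropped first.

```python
import time

def sweep_read_buffers(path, sizes=(4096, 65536, 1048576)):
    """Time a full sequential read of `path` with each buffer size.

    Returns a dict mapping buffer size -> elapsed seconds. Opening with
    buffering=0 issues raw reads of the requested size, so the OS sees
    the request sizes we choose rather than Python's defaults."""
    results = {}
    for size in sizes:
        t0 = time.perf_counter()
        with open(path, "rb", buffering=0) as f:
            while f.read(size):
                pass
        results[size] = time.perf_counter() - t0
    return results
```

If the timings barely differ across sizes, the OS buffering scheme is doing
the alignment work already, as Daniel observed on the RS6ks and Alphas.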
From: Laurence Cuffe <Laurence.Cuffe -x- at -x- ucd.ie>
To: PEROTT -x- at -x- if.ufrgs.br
Date: Wed, 08 Mar 2000 10:11:40 +0000
Hmm. The fastest way to get big data sets in and out of a machine
is to stripe them, i.e., write the data over an array of disk platters
and sets of disks so that all disks are being read at once. I know
Digital Unix can do this and I'm sure many other OSes can do it
too. Again, big disk caches help. This kind of problem has arisen in
large database applications and in digital recording, where you
want sustained transfer rates: a lot of hard disks can look good on
peak data transfer rates but fall down for sustained transfers.
Hope this helps, Larry Cuffe
From: Edgardo Garcia <garciae -x- at -x- boojum.Colorado.EDU>
To: PEROTT -x- at -x- if.ufrgs.br
Date: Wed, 08 Mar 2000 13:56:06
Hi Claudio,
Regarding your questions: I recently had problems with the Gaussian
software precisely because it puts everything into a single file.
During an MP2 calculation, which consumes a lot of memory, the file
grew so large that it exceeded the maximum size allowed by the
operating system, in this case 4 GB. We will have to try reconfiguring
the system to see whether we can raise that limit or make it unlimited.
My suggestion is that it would be better to put the data in separate
files, especially if they are blocks of different information that may
even be accessed in different ways and only at certain moments.
Smaller files are also faster to read.
Best regards,
Edgardo
Edgardo Garcia
Universidade de Brasilia
From: Jochen Küpper <jochen -x- at -x- pc1.uni-duesseldorf.de>
To: PEROTT -x- at -x- if.ufrgs.br
Cc: chemistry -x- at -x- server.ccl.net
Date: Wed, 08 Mar 2000 17:05:51 +0100 (CET)
Use RAID. That defers the first problem back to the OS or hardware
designers. Striping should give you the best average performance.
Many OSes allow software RAID without any additional hardware, but
if you are willing to spend the cash, you could use hardware RAID
as well :-)
Fail-safety is something else to be considered, but probably not for
your "scratch"-files.
Jochen
--
Heinrich-Heine-Universität
Institut für Physikalische Chemie I
Jochen Küpper
Universitätsstr. 1, Geb. 26.43 Raum 02.29
40225 Düsseldorf, Germany
phone ++49-211-8113681, fax ++49-211-8115195
http://www.Jochen-Kuepper.de
From: Eugene Leitl <eugene.leitl -x- at -x- lrz.uni-muenchen.de>
To: Jochen Küpper <jochen -x- at -x- pc1.uni-duesseldorf.de>
Cc: PEROTT -x- at -x- if.ufrgs.br, chemistry -x- at -x- server.ccl.net
Date: Wed, 08 Mar 2000 15:09:20 -0800 (PST)
Jochen Küpper writes:
> Use RAID. That defers the first problem back to OS' or hardware
> designers. Striping should give you best average performance.
I would add the following to this very good advice:
1) Use modern, large EIDE drives (currently, 40 GBytes for ~$250:
   http://www.pricewatch.com/1/26/2119-1.htm ). If money is not an
   issue at all, use 10k rpm (UW etc.) SCSI drives:
   http://www.pricewatch.com/1/26/2150-1.htm
   If you have a money printing press, use hardware RAID with them:
   http://www.pricewatch.com/1/26/1537-1.htm
   http://www.linuxdoc.org/HOWTO/mini/DPT-Hardware-RAID.html
2) Use several large, modern EIDE drives as soft or hard RAID:
   http://www.linuxdoc.org/HOWTO/Software-RAID-HOWTO.html
   If you use EIDE drives, put each one on an individual EIDE host
   adapter interface. These are ~$30 apiece, for instance:
   http://www.buy.com/comp/product.asp?sku=10023443
3) If you're using Linux, use lots of RAM: 512 MByte-1 GByte or more
   (make sure it is being recognized; you might have to supply
   options at boot, though probably not with newer kernels). Linux
   utilizes extra memory for file caching, which can speed things up
   dramatically.
4) Use a recent (preferably cutting, not bleeding, edge) kernel,
   because improvements in caching algorithms/drivers sometimes
   translate into very noticeable improvements in overall performance.
5) If you have large files (>2 GByte) you'll need 64-bit-clean
   filesystems (patches for vanilla Linux ext2 are available). Also,
   look around on http://www.beowulf-underground.org/ ; there are
   patches/software to be found there which might improve performance
   even for nonparallel systems.
Regards,
Eugene Leitl
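Edgardo's 4 GB ceiling and Eugene's 2 GByte filesystem limit can be probed
without actually filling the disk: seek past the limit and write a single
byte, producing a sparse file whose reported size exceeds the limit only if
the OS and filesystem handle large files. A small sketch (the probe file is
created with tempfile and removed afterwards):

```python
import os
import tempfile

def supports_large_files(directory, limit=2**31):
    """Return True if `directory`'s filesystem accepts a file larger
    than `limit` bytes. Uses a sparse seek-and-write so almost no
    actual disk space is consumed by the probe."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.seek(limit)       # position one byte past the limit
            f.write(b"\0")      # fails with OSError if the FS refuses
        return os.path.getsize(path) > limit
    except (OSError, OverflowError):
        return False
    finally:
        os.remove(path)
```

On a system without large-file support, the write or the size query fails
and the probe reports False instead of crashing the run mid-calculation.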
--
****************************************************
Claudio A. Perottoni
Universidade Federal do Rio Grande do Sul
Instituto de Fisica - Laboratorio de Altas Pressoes
Av. Bento Goncalves, 9500
CAIXA POSTAL 15051
91501-970 PORTO ALEGRE - RS
BRAZIL
PHONE:55-51-316-6500
FAX :55-51-319-1762
http://www.if.ufrgs.br/~perott/index.html
****************************************************