Mass storage devices - summary
Hi people!
I've received several messages addressing the relative performance of mass
storage devices. Thanks a lot to everyone who responded to my question!
Here is the original question:
**************************************************************************************
A feature common to most of the computational chemistry software currently
available is the huge amount of data that is frequently read from and written
to scratch files. The performance of a computational chemistry application
therefore depends largely on transfer rates to and from hard disks or other
mass storage devices. In this context, let me address a couple of questions
to the list:
i) what is the best strategy (in terms of best performance): write a few, huge
files to huge disk partitions or, alternatively, split big files into smaller
ones over many disk partitions?
ii) could someone point to additional information on fast access mass storage
devices?
I'll summarize to the list!
Claudio.
**************************************************************************************
And the replies:
From: Tony Ferreira <aferreir -x- at -x- memphis.edu>
To: PEROTT -x- at -x- if.ufrgs.br
Date: Tue, 07 Mar 2000 14:04:08 -0600
Claudio,
There are several ways to address this problem. The first is to employ
"direct" methods, which calculate the one- and two-electron integrals as
needed, thus avoiding the disk storage problem entirely.
On Unix systems it is possible to create "striped" volumes which can span
more than one physical device. This tends to be the fastest method (using a
single scratch directory), but some tuning may be needed (e.g., setting the
block size to match the output scheme of the software). Splitting files over
several physical volumes is a good idea IF you run them through separate
controllers. Accessing several drives through the same controller will slow
you down by increasing bus traffic through the drive controller. If you have
multiple controllers in your machine, the best option for splitting files
across several drives is to have each drive controlled independently.
I hope this helps.
Tony Ferreira
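The splitting idea above can be sketched manually in Python: write a scratch
stream round-robin in fixed-size stripes across several directories, each
assumed to sit on its own drive and controller. The mount points and stripe
size below are hypothetical placeholders; a real OS-level striped volume does
this transparently and far faster.

```python
import os

# Hypothetical mount points, each assumed to be on a separate drive/controller.
SCRATCH_DIRS = ["/scratch0", "/scratch1", "/scratch2"]
STRIPE_SIZE = 64 * 1024  # tune to match the application's dominant write size

def striped_write(name, data):
    """Write `data` round-robin in STRIPE_SIZE chunks across SCRATCH_DIRS."""
    handles = [open(os.path.join(d, name), "wb") for d in SCRATCH_DIRS]
    try:
        for i in range(0, len(data), STRIPE_SIZE):
            stripe = i // STRIPE_SIZE
            handles[stripe % len(handles)].write(data[i:i + STRIPE_SIZE])
    finally:
        for h in handles:
            h.close()

def striped_read(name, total_size):
    """Reassemble the stream by reading stripes back in the same order."""
    handles = [open(os.path.join(d, name), "rb") for d in SCRATCH_DIRS]
    try:
        chunks = []
        for i in range(0, total_size, STRIPE_SIZE):
            stripe = i // STRIPE_SIZE
            chunks.append(handles[stripe % len(handles)].read(STRIPE_SIZE))
        return b"".join(chunks)
    finally:
        for h in handles:
            h.close()
```

As Tony suggests for the OS-level block size, STRIPE_SIZE would be set to
match the application's output scheme.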
From: Soaring Bear <bear -x- at -x- dakotacom.net>
To: PEROTT -x- at -x- if.ufrgs.br
Date: Tue, 07 Mar 2000 13:39:42 -0700
This is a very interesting topic which has application outside
of chemistry, as well, and I would appreciate a copy of the
correspondence you receive.
A lot of separate files are likely to slow down disk
access compared with streaming a bunch of sequential
data in one big pull. But if you only need a little
bit of info at a time, small files might be faster than
hunting through one huge file.
Soaring Bear Ph.D. Research Pharmacologist, Informatics,
Chemistry & Biochemistry, Herbalist & Healthy Lifestyles
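This trade-off can be measured directly. A rough Python sketch (file names
and split counts are illustrative) times one sequential write/read of a big
buffer against the same payload split over many small files:

```python
import os
import time

def time_big_file(payload, path):
    """Write `payload` as one sequential file, read it back, return seconds."""
    t0 = time.perf_counter()
    with open(path, "wb") as f:
        f.write(payload)
    with open(path, "rb") as f:
        f.read()
    return time.perf_counter() - t0

def time_small_files(payload, directory, n_files):
    """Split `payload` over `n_files` files in `directory`, read all back."""
    chunk = (len(payload) + n_files - 1) // n_files
    t0 = time.perf_counter()
    for i in range(n_files):
        with open(os.path.join(directory, "part%d" % i), "wb") as f:
            f.write(payload[i * chunk:(i + 1) * chunk])
    for i in range(n_files):
        with open(os.path.join(directory, "part%d" % i), "rb") as f:
            f.read()
    return time.perf_counter() - t0
```

On a cached filesystem both numbers mostly reflect per-file open/close
overhead rather than seek time, so the payload should exceed RAM for
realistic disk figures.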
From: Eugene Leitl <eugene.leitl -x- at -x- lrz.uni-muenchen.de>
To: PEROTT -x- at -x- if.ufrgs.br
Date: Tue, 07 Mar 2000 13:14:45 -0800 (PST)
It depends on the operating system. For Linux, get as much RAM as you
can (0.5-1 GByte), since unused memory is automatically used for
file caching; buy several large disks (e.g., 2-3 of the 40 GByte
Diamond Max), operate each of them on its own EIDE host adapter, and use
software RAID.
Try http://linuxdoc.org/HOWTO/Software-RAID-HOWTO.html
From: phil stortz <pstortz -x- at -x- coffey.com>
To: PEROTT -x- at -x- if.ufrgs.br
Date: Tue, 07 Mar 2000 14:42:17 -0700
SCSI drives with large on-drive caches may be faster, as SCSI
drives can hold one command while running another, should reads
and writes overlap. Also, a RAID array with data striping can
nearly triple the data rate: the data is automagically split
between the three drives in small chunks, effectively using the drives
in parallel. Of course, avoiding fragmentation and using drives with
fast seek times will also help.
Additionally, some RAID controller cards have a large or expandable cache.
I'd say partitions are probably a bad idea, as they force fragmentation
and extra seek distance/time.
From: "Dr. T. Daniel Crawford" <crawdad -x- at -x- ne095.cm.utexas.edu>
To: PEROTT -x- at -x- if.ufrgs.br
Date: Tue, 07 Mar 2000 20:01:01 -0600 (CST)
Claudio,
I/O efficiency is a complicated issue for which there is no single,
best solution. Operating systems (particularly commercial ones designed
for high-performance workstations and supercomputers) vary widely in their
ability to handle transfers of large quantities of data to and from disk.
Although others on the CCL are probably more qualified to answer (and
I very much want to hear what they have to say!), there are perhaps a
few concepts to follow and a few to avoid. For example, when splitting
read/write calls across filesystems, greater efficiency can be obtained
when each filesystem has its own hardware controller (in particular,
with its own memory buffer). This allows the OS to send a buffer of data
to one controller while another read/write is pending on another disk.
On the other hand, there may be no reason to resort to "micromanaging"
the I/O on many modern operating systems, which often choose their own
read/write pattern regardless of what your program requests. At one time,
for example, programmers worried about requesting particular buffer sizes of
data (e.g., 1024k/read or write) in order to ensure alignment with the disk
sector length and avoid delays as the disk spun to orient sector boundaries.
In my own (perhaps limited) I/O testing on IBM RS6ks and DEC Alphas, such
sector-based I/O is apparently unnecessary in general since the OS uses a
highly efficient I/O buffering scheme already.
Again, I would very much like to hear about the experiences of other
programmers on the CCL in dealing with I/O efficiency issues. Good question,
Claudio!
Best regards,
-Daniel
--
T. Daniel Crawford, Ph.D.                 Institute for Theoretical Chemistry
crawdad -x- at -x- jfs1.cm.utexas.edu     Departments of Chemistry and Biochemistry
http://zopyros.ccqc.uga.edu/~crawdad/     The University of Texas
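Daniel's point about request sizes can be checked on a given system by timing
a full sequential read with several buffer sizes, bypassing Python's own
buffering. This is only a sketch; on a cached file it mostly measures memory
copies, so for real disk numbers the file must be larger than RAM or the
cache dropped first.

```python
import time

def sweep_read_buffers(path, sizes=(4096, 65536, 1048576)):
    """Time a full sequential read of `path` with each buffer size.

    Returns a dict mapping buffer size -> elapsed seconds. Opening with
    buffering=0 issues raw reads of the requested size, so the OS sees
    the request sizes we choose rather than Python's defaults."""
    results = {}
    for size in sizes:
        t0 = time.perf_counter()
        with open(path, "rb", buffering=0) as f:
            while f.read(size):
                pass
        results[size] = time.perf_counter() - t0
    return results
```

If the timings barely differ across sizes, the OS buffering scheme is doing
the alignment work already, as Daniel observed on the RS6ks and Alphas.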
From: Laurence Cuffe <Laurence.Cuffe -x- at -x- ucd.ie>
To: PEROTT -x- at -x- if.ufrgs.br
Date: Wed, 08 Mar 2000 10:11:40 +0000
Hmm. The fastest way to get big data sets in and out of a machine
is to stripe them, i.e., write the data over an array of disk platters
and sets of disks so that all disks are being read at once. I know
Digital Unix can do this and I'm sure many other OSes can do it
too. Again, big disk caches help. This kind of problem has arisen in
large database applications and in digital recording, where you
want sustained transfer rates: a lot of hard disks can look good on
peak data transfer rates but fall down for sustained transfers.
Hope this helps, Larry Cuffe
From: Edgardo Garcia <garciae -x- at -x- boojum.Colorado.EDU>
To: PEROTT -x- at -x- if.ufrgs.br
Date: Wed, 08 Mar 2000 13:56:06
Hi Claudio,
Regarding your questions: I recently had problems with the Gaussian
software precisely because it puts everything into a single file.
During an MP2 calculation, which consumes a lot of memory, the file
grew so large that it exceeded the maximum size allowed by the
operating system, in this case 4 GB. We will have to try reconfiguring
the system to see whether we can raise that limit or make it unlimited.
My suggestion is that it would be better to put the data in separate
files, especially if they are blocks of different information that may
even be accessed in different ways and only at certain moments.
Smaller files are also faster to read.
Best regards,
Edgardo
Edgardo Garcia
Universidade de Brasilia
From: Jochen Küpper <jochen -x- at -x- pc1.uni-duesseldorf.de>
To: PEROTT -x- at -x- if.ufrgs.br
Cc: chemistry -x- at -x- server.ccl.net
Date: Wed, 08 Mar 2000 17:05:51 +0100 (CET)
Use RAID. That defers the first problem back to the OS or hardware
designers. Striping should give you the best average performance.
Many OSes allow software RAID without any additional hardware, but
if you are willing to spend the cash, you could use hardware RAID
as well :-)
Fail-safety is something else to be considered, but probably not for
your "scratch"-files.
Jochen
--
Heinrich-Heine-Universität
Institut für Physikalische Chemie I
Jochen Küpper
Universitätsstr. 1, Geb. 26.43 Raum 02.29
40225 Düsseldorf, Germany
phone ++49-211-8113681, fax ++49-211-8115195
http://www.Jochen-Kuepper.de
From: Eugene Leitl <eugene.leitl -x- at -x- lrz.uni-muenchen.de>
To: Jochen Küpper <jochen -x- at -x- pc1.uni-duesseldorf.de>
Cc: PEROTT -x- at -x- if.ufrgs.br, chemistry -x- at -x- server.ccl.net
Date: Wed, 08 Mar 2000 15:09:20 -0800 (PST)
Jochen Küpper writes:
> Use RAID. That defers the first problem back to OS' or hardware
> designers. Striping should give you best average performance.
I would add the following to this very good advice:
1) Use modern, large EIDE drives (currently, 40 GBytes for ~$250:
   http://www.pricewatch.com/1/26/2119-1.htm ). If money is not an
   issue at all, use 10k rpm (UW etc.) SCSI drives:
   http://www.pricewatch.com/1/26/2150-1.htm
   If you have a money printing press, use hardware RAID with them:
   http://www.pricewatch.com/1/26/1537-1.htm
   http://www.linuxdoc.org/HOWTO/mini/DPT-Hardware-RAID.html
2) Use several large, modern EIDE drives as soft or hard RAID:
   http://www.linuxdoc.org/HOWTO/Software-RAID-HOWTO.html
   If you use EIDE drives, put each one on an individual EIDE host
   adapter interface. These are ~$30 apiece, for instance:
   http://www.buy.com/comp/product.asp?sku=10023443
3) If you're using Linux, use lots of RAM: 512 MByte-1 GByte or more
   (make sure it is being recognized; you might have to supply
   options at boot, though probably not with newer kernels). Linux
   utilizes extra memory for file caching, which can speed things up
   dramatically.
4) Use a recent (preferably cutting, not bleeding, edge) kernel,
   because improvements in caching algorithms/drivers sometimes
   translate into very noticeable improvements in overall performance.
5) If you have large files (>2 GByte) you'll need 64-bit-clean
   filesystems (patches for vanilla Linux ext2 are available). Also,
   look around on http://www.beowulf-underground.org/ ; there are
   patches/software to be found there which might improve performance
   even for nonparallel systems.
Regards,
Eugene Leitl
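Edgardo's 4 GB ceiling and Eugene's 2 GByte filesystem limit can be probed
without actually filling the disk: seek past the limit and write a single
byte, producing a sparse file whose reported size exceeds the limit only if
the OS and filesystem handle large files. A small sketch (the probe file is
created with tempfile and removed afterwards):

```python
import os
import tempfile

def supports_large_files(directory, limit=2**31):
    """Return True if `directory`'s filesystem accepts a file larger
    than `limit` bytes. Uses a sparse seek-and-write so almost no
    actual disk space is consumed by the probe."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.seek(limit)       # position one byte past the limit
            f.write(b"\0")      # fails with OSError if the FS refuses
        return os.path.getsize(path) > limit
    except (OSError, OverflowError):
        return False
    finally:
        os.remove(path)
```

On a system without large-file support, the write or the size query fails
and the probe reports False instead of crashing the run mid-calculation.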
--
****************************************************
Claudio A. Perottoni
Universidade Federal do Rio Grande do Sul
Instituto de Fisica - Laboratorio de Altas Pressoes
Av. Bento Goncalves, 9500
CAIXA POSTAL 15051
91501-970 PORTO ALEGRE - RS
BRAZIL
PHONE:55-51-316-6500
FAX :55-51-319-1762
http://www.if.ufrgs.br/~perott/index.html
****************************************************