From chemistry-request@ccl.net Fri Mar 31 14:10:29 2000
Date: Fri, 31 Mar 2000 13:44:45 -0300
From: Claudio Perottoni
Subject: Mass storage devices - summary
To: chemistry@ccl.net
Reply-to: perott@if.ufrgs.br

Hi people!

I've received several messages addressing the relative performance of mass storage devices. Thanks a lot to all who responded to my question! Here is the original question:

**************************************************************************************
A feature common to most of the computational chemistry software currently available is the huge amount of data that is frequently read from and written to scratch files. The performance of a computational chemistry application therefore depends largely on transfer rates to and from hard disks or other mass storage devices. In this context, let me address a couple of questions to the list:

i) What is the best strategy (in terms of performance): writing a few huge files to huge disk partitions or, alternatively, splitting big files into smaller ones over many disk partitions?

ii) Could someone point me to additional information on fast-access mass storage devices?

I'll summarize to the list!

Claudio.
**************************************************************************************

And the replies:

From: Tony Ferreira
To: perott@if.ufrgs.br
Date: Tue, 07 Mar 2000 14:04:08 -0600

Claudio,

There are several ways to address this problem. The first is to employ "direct" methods, which calculate the one- and two-electron integrals as needed, thus avoiding the disk storage problem entirely.
On Unix systems it is possible to create "striped" volumes that span more than one physical device. This tends to be the fastest method (using a single scratch directory), but some tuning may be needed (e.g., setting the block size to match the output scheme of the software). Splitting files over several physical volumes is a good idea IF you run them through separate controllers. Accessing several drives through the same controller will slow you down by increasing bus traffic through that controller. If you have multiple controllers in your machine, the best option for splitting files across several drives is to have each drive controlled independently.

I hope this helps.

Tony Ferreira

From: Soaring Bear
To: perott@if.ufrgs.br
Date: Tue, 07 Mar 2000 13:39:42 -0700

This is a very interesting topic which has applications outside of chemistry as well, and I would appreciate a copy of the correspondence you receive. A lot of separate files is likely to slow down disk access, in contrast to streaming a bunch of sequential data in one big pull. But if you only need a little bit of information at a time, small files might be faster than hunting through one huge file.

Soaring Bear, Ph.D.
Research Pharmacologist, Informatics, Chemistry & Biochemistry, Herbalist & Healthy Lifestyles

From: Eugene Leitl
To: perott@if.ufrgs.br
Date: Tue, 07 Mar 2000 13:14:45 -0800 (PST)

It depends on the operating system. For Linux, get as much RAM as you can (0.5-1 GByte), since unused memory is automatically used for file caching; buy several large disks (e.g. 2-3 of the 40 GByte DiamondMax), operate each on its own EIDE host adapter, and use software RAID. Try http://linuxdoc.org/HOWTO/Software-RAID-HOWTO.html

From: phil stortz
To: perott@if.ufrgs.br
Date: Tue, 07 Mar 2000 14:42:17 -0700

SCSI drives with large on-drive caches may be faster, as SCSI drives can hold one command while running another, should reads and writes overlap.
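[As an illustration of the striping idea discussed above, here is a toy sketch in pure Python. The chunk size, member count, and file paths are arbitrary stand-ins; real striping is done by the RAID layer or volume manager, not in application code.]

```python
# Toy illustration of RAID-0 style striping: split a data stream into
# fixed-size chunks and distribute them round-robin across several
# "member" files (stand-ins for separate physical disks).
import os
import tempfile

CHUNK = 4096       # stripe unit; real arrays tune this to the I/O pattern
N_MEMBERS = 3      # number of member "disks"

def stripe_write(data, members):
    # Chunk i goes to member i mod N, just as a striped volume would lay it out.
    for i in range(0, len(data), CHUNK):
        members[(i // CHUNK) % len(members)].write(data[i:i + CHUNK])

def stripe_read(members, total_len):
    # Read the chunks back in the same round-robin order to reassemble.
    for m in members:
        m.seek(0)
    out = bytearray()
    i = 0
    while len(out) < total_len:
        out += members[i % len(members)].read(CHUNK)
        i += 1
    return bytes(out)

tmp = tempfile.mkdtemp()
paths = [os.path.join(tmp, f"member{i}") for i in range(N_MEMBERS)]
data = os.urandom(1 << 20)  # 1 MiB of pretend scratch data
members = [open(p, "w+b") for p in paths]
stripe_write(data, members)
assert stripe_read(members, len(data)) == data  # round-trips intact
for m in members:
    m.close()
```

The payoff on real hardware comes from the members being separate spindles on separate controllers, so the per-chunk transfers proceed in parallel.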
Also, a RAID array with data striping can nearly triple the data rate: the data is automagically split between the three drives in small chunks, effectively using the drives in parallel. Of course, avoiding fragmentation and having fast seek times on the drives involved will also help. Additionally, some RAID controller cards have a large/expandable cache. I'd say partitions are probably a bad idea, as they force fragmentation and extra seek distance/time.

From: "Dr. T. Daniel Crawford"
To: perott@if.ufrgs.br
Date: Tue, 07 Mar 2000 20:01:01 -0600 (CST)

Claudio,

I/O efficiency is a complicated issue for which there is no single best solution. Operating systems (particularly commercial ones designed for high-performance workstations and supercomputers) vary widely in their ability to handle transfers of large quantities of data to and from disk. Although others on the CCL are probably more qualified to answer (and I very much want to hear what they have to say!), there are perhaps a few concepts to follow and a few to avoid.

For example, when splitting read/write calls across filesystems, greater efficiency can be obtained when each filesystem has its own hardware controller (in particular, with its own memory buffer). This allows the OS to send a buffer of data to one controller while another read/write is pending on another disk. On the other hand, there may be no reason to resort to "micromanaging" the I/O on many modern operating systems, which often choose their own read/write pattern regardless of what your program requests. At one time, for example, programmers worried about requesting particular buffer sizes of data (e.g., 1024k per read or write) in order to ensure alignment with the disk sector length and avoid delays as the disk spun to align sector boundaries. In my own (perhaps limited) I/O testing on IBM RS/6000s and DEC Alphas, such sector-based I/O is apparently unnecessary in general, since the OS already uses a highly efficient I/O buffering scheme.
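[The buffering point above can be made concrete with a small sketch: issuing one system call per tiny record versus letting a large user-space buffer coalesce the writes. The record size, record count, and scratch paths are illustrative assumptions, and actual timings depend entirely on the OS and hardware.]

```python
# Compare many tiny unbuffered writes against the same writes coalesced
# through a large buffer, echoing the buffer-size discussion above.
import os
import tempfile
import time

def timed_write(path, record, n_records, buffering):
    """Write n_records copies of record; return elapsed wall time."""
    start = time.perf_counter()
    with open(path, "wb", buffering=buffering) as f:
        for _ in range(n_records):
            f.write(record)
        f.flush()
        os.fsync(f.fileno())  # force the data to disk before timing stops
    return time.perf_counter() - start

record = b"x" * 128   # small logical record, as a QC code might emit per integral batch
n = 20000
tmp = tempfile.mkdtemp()
t_raw = timed_write(os.path.join(tmp, "a"), record, n, buffering=0)        # one syscall per record
t_buf = timed_write(os.path.join(tmp, "b"), record, n, buffering=1 << 20)  # 1 MiB buffer
print(f"unbuffered: {t_raw:.3f}s  buffered: {t_buf:.3f}s")
```

On most systems the buffered run is markedly faster, which is exactly why modern OS-level buffering usually makes application-side sector alignment unnecessary.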
Again, I would very much like to hear about the experiences of other programmers on the CCL in dealing with I/O efficiency issues. Good question, Claudio!

Best regards,
-Daniel
--
T. Daniel Crawford, Ph.D.
Institute for Theoretical Chemistry
Departments of Chemistry and Biochemistry
The University of Texas
crawdad@jfs1.cm.utexas.edu
http://zopyros.ccqc.uga.edu/~crawdad/

From: Laurence Cuffe
To: perott@if.ufrgs.br
Date: Wed, 08 Mar 2000 10:11:40 +0000

Hmm. The fastest way to get big data sets into and out of a machine is to stripe them, i.e. write the data over an array of disk platters and sets of disks so that all disks are being read at once. I know Digital Unix can do this, and I'm sure many other OSes can do it too. Again, big disk caches help. This kind of problem has arisen in large database applications and in digital recording, where you want sustained transfer rates: a lot of hard disks can look good on peak data transfer rates but fall down on sustained transfers.

Hope this helps,
Larry Cuffe

From: "Edgardo Garcia" <garciae@boojum.Colorado.EDU>
To: perott@if.ufrgs.br
Date: 8-MAR-2000 13:56:06

Hi Claudio,

Regarding your questions: I recently had problems with the Gaussian software precisely because it puts everything into a single file. During an MP2 calculation that consumes a lot of memory, the file grew so large that it exceeded the maximum size allowed by the operating system, in this case 4 GB. We have to try to reconfigure the system to see whether we can raise that limit or make it unlimited. My suggestion is that it would be better to put the data in separate files, especially if they are blocks of different information that may even be accessed in different ways and only at certain times. Reading smaller files is faster.
Regards,
Edgardo

Edgardo Garcia
Universidade de Brasilia

From: Jochen Küpper
To: perott@if.ufrgs.br
Cc: chemistry@server.ccl.net
Date: Wed, 08 Mar 2000 17:05:51 +0100 (CET)

Use RAID. That defers the first problem back to the OS or hardware designers. Striping should give you the best average performance. Many OSes allow software RAID without any additional hardware, but if you are willing to spend the cash, you could use hardware RAID as well :-) Fail-safety is something else to be considered, but probably not for your "scratch" files.

Jochen
--
Jochen Küpper
Heinrich-Heine-Universität, Institut für Physikalische Chemie I
Universitätsstr. 1, Geb. 26.43, Raum 02.29
40225 Düsseldorf, Germany
phone +49-211-8113681, fax +49-211-8115195
http://www.Jochen-Kuepper.de

From: Eugene Leitl
To: Jochen Küpper
Cc: perott@if.ufrgs.br, chemistry@server.ccl.net
Date: Wed, 08 Mar 2000 15:09:20 -0800 (PST)

Jochen Küpper writes:
> Use RAID. That defers the first problem back to the OS or hardware
> designers. Striping should give you the best average performance.

I would add the following to this very good advice:

1) Use modern, large EIDE drives (currently, 40 GBytes for ~$250: http://www.pricewatch.com/1/26/2119-1.htm ). If money is not an issue at all, use 10k rpm (UW etc.) SCSI drives: http://www.pricewatch.com/1/26/2150-1.htm If you have a money printing press, use hardware RAID with them: http://www.pricewatch.com/1/26/1537-1.htm http://www.linuxdoc.org/HOWTO/mini/DPT-Hardware-RAID.html

2) Use several large, modern EIDE drives as soft or hard RAID: http://www.linuxdoc.org/HOWTO/Software-RAID-HOWTO.html If you use EIDE drives, put each on its own EIDE host adapter interface. These are ~$30 apiece, for instance: http://www.buy.com/comp/product.asp?sku=10023443

3) If you're using Linux, use lots of RAM: 512 MByte-1 GByte or more (make sure it is being recognized; you might have to supply options at boot, though probably not with newer kernels).
Linux uses extra memory for file caching, which can speed things up dramatically.

4) Use a recent (preferably cutting-edge, not bleeding-edge) kernel, because sometimes improvements in caching algorithms/drivers translate into very noticeable improvements in overall performance.

5) If you have large files (>2 GByte), you'll need 64-bit-clean file systems (patches for vanilla Linux ext2 are available). Also, look around on http://www.beowulf-underground.org/ ; there are patches and software to be found there which might improve performance even for non-parallel systems.

Regards,
Eugene Leitl

--
****************************************************
Claudio A. Perottoni
Universidade Federal do Rio Grande do Sul
Instituto de Fisica - Laboratorio de Altas Pressoes
Av. Bento Goncalves, 9500
CAIXA POSTAL 15051
91501-970 PORTO ALEGRE - RS
BRAZIL
PHONE: 55-51-316-6500
FAX: 55-51-319-1762
http://www.if.ufrgs.br/~perott/index.html
****************************************************
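[As a footnote to Eugene's point about >2 GByte files: whether a platform supports files past the historical 32-bit 2 GiB barrier can be probed with a short sketch. The file is sparse, so almost no disk space is actually consumed; the scratch path is illustrative.]

```python
# Probe large-file support: create a sparse file whose logical size
# exceeds the old 32-bit 2 GiB limit, then check its reported size.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sparse.bin")
limit = 2 * 1024**3                # the historical 2 GiB barrier
with open(path, "wb") as f:
    f.seek(limit + 1)              # seek past the limit (sparse; no real I/O)
    f.write(b"\0")                 # one byte at the far end sets the logical size
size = os.path.getsize(path)
print("supports >2 GiB files:", size > limit)
os.remove(path)
```

On a filesystem without large-file support, the seek or write would fail instead, which is the failure mode Edgardo ran into with Gaussian's single 4 GB scratch file.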