Re: Linux -malign-double?

From: "M. Nicklaus" <mn1 ^at^ helix.nih.gov>
Subject: Re: Linux -malign-double?
Date: Fri, 26 Jun 1998 15:08:33 -0400 (EDT)
 Hi,
 On Tue, 23 Jun 98 15:44:32, Dave Close <R29CLOSE ^at^ ETSU.ETSU.EDU>
 wrote:
 >   Recently I posted a LINUX makefile for installing G94 on a Pentium
 > running RedHat Linux.  Milan Hodoscek wrote me to say that I had
 > omitted the -malign-double compiler option.  In fact I recalled his
 > posting of this several months ago.  When we tried it here there was
 > hardly any difference in runtime.  Milan has written me again to say
 > this option could make a large difference in run time on a Pentium.
 >   My question is does anyone know about this option?  Is it now a default
 > setting so now it is not important?  Or have I done something wrong?
 >   To see what we used go to:
 >
 >   http://physweb,etsu.edu/c_r_g94_linux.html
 In the G94 benchmark paper I've mentioned before (Milan Hodoscek is a
 co-author; in press in JCICS), we've analyzed the influence of various
 compiler options quite extensively.  One of the most important ones--on
 Pentium-based systems (only)--is indeed "-malign-double".  In many
 runs, both for the small Gaussian test jobs and for a larger 'real-life'
 DFT job, G94 executables compiled with "-malign-double" were always
 faster than G94 compiled with the standard f2c makefile.
 However, the degree to which "-malign-double" sped up G94 varied greatly
 from machine to machine (but was totally reproducible on each machine,
 at least under the benchmarking conditions of an empty machine).
 The speed increase ranged from virtually unnoticeable to about 40% for
 the test jobs, and more than 60% for the DFT jobs in some cases.
 For other programs, such as CHARMM, we've seen differences of up to 100%.
 Both Pentium Pro (P6) and Pentium II (PII) systems are affected by this.
 We don't know exactly what causes this very different behavior of individual
 machines.  We think it's a subtle interplay of the placement of binaries in
 RAM coupled with details of the memory hardware architecture, chip sets etc.
 of the system (...not very explanatory, I know.)  It's interesting to note
 in this context that the 300, 333, 350, and 400 MHz PII's are actually
 surprisingly different on a deep hardware level as far as cache and memory
 characteristics are concerned.  A good place to read up on this is
 http://www2.tomshardware.com/cpuslot1.html.
 We believe that "-malign-double" doesn't actually gain you anything, but
 rather cures a defect that makes your PII system work slower than it should.
 With "-malign-double", the timings got much more consistent with CPU and
 bus speeds than without.  So, in your case, Dave, I'd say you may be one
 of the lucky ones whose system is already close to optimal even without
 this option.  Still, we've never found it to hurt, so we now routinely
 compile with "-malign-double" on Pentium systems.
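 As a rough illustration of what the option changes (a sketch, not taken
 from the G94 sources or the f2c makefile): with gcc on 32-bit x86, a
 double inside a struct is placed on a 4-byte boundary by default, and
 "-malign-double" moves it to an 8-byte boundary.  The struct and function
 names below are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record: an int followed by a double. */
struct mixed {
    int    n;
    double x;
};

/* Offset of the double inside the struct.  Expected to be 4 with the
   default 32-bit x86 gcc layout and 8 with -malign-double (8 is also
   the default on x86-64). */
size_t double_offset(void)
{
    return offsetof(struct mixed, x);
}

/* Returns 1 if the address is a multiple of 8, else 0. */
int is_8byte_aligned(const void *p)
{
    return ((uintptr_t)p % 8) == 0;
}

/* Demo driver, only built when ALIGN_DEMO_MAIN is defined. */
#ifdef ALIGN_DEMO_MAIN
int main(void)
{
    struct mixed m = { 1, 2.0 };

    printf("offsetof(x) = %zu, sizeof(struct mixed) = %zu\n",
           double_offset(), sizeof(struct mixed));
    printf("&m.x is %s8-byte aligned\n",
           is_8byte_aligned(&m.x) ? "" : "not ");
    return 0;
}
#endif
```

 Building this twice for a 32-bit x86 target, e.g. with
 "gcc -m32 -DALIGN_DEMO_MAIN align.c" and again with "-malign-double"
 added, should show the offset change (align.c is a made-up file name).
 Because the option changes struct layout, all code that shares such
 structs needs to be compiled the same way.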
 Another factor affecting G94 speed on Pentiums may be noteworthy here.
 This is our--seemingly counterintuitive--finding that, in *all cases* we've
 tested, G94 jobs ran faster the *less* memory we gave them (via %Mem=...).
 This was the case, again, for both P6 and PII, and for all 7 test jobs as
 well as the DFT job (so please don't send me angry e-mails saying "But this
 doesn't work for my CBS-Q [or whatever] job!").  The curves are steepest in
 the beginning, i.e. for the smallest RAM amounts.  The increase in CPU time
 when going from 8MB to 128MB for the DFT job ranged from 20% to 40%, again
 depending on the machine.  Of course, you need to give the job at least the
 amount of memory it needs to run to completion.  So if you're running a large
 frequency calculation, you'll only be able to use the much flatter part of
 the curve.
 We think this has to do with the very high CPU speed vs. the much lower
 memory access speeds (and bandwidths) on Pentium systems.  A very large
 amount of very fast cache might help, but the cache sizes and/or speeds
 on Pentiums don't really make a dent in this problem with Gaussian 94.
 (We compared 256kB and 512kB cache P6's, and found clear, but small effects
 of 0%...5% speedup.)  In effect, it seems that on modern Pentium systems,
 "direct" is actually more efficient than "in-core".  Of course, if you
 force your job to go through more passes, or even to store integrals
 on disk by not giving it enough RAM, you'll lose all speed advantages.
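 The cache-versus-RAM effect described above can be probed outside
 Gaussian with a small C timing loop.  This is only a sketch under
 assumptions: it performs the same number of additions over working sets
 of different sizes, the function names and sizes are made up for
 illustration, and it says nothing about how G94 itself uses its %Mem
 allocation.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sum an array of n doubles `passes` times: n * passes additions. */
double sweep(const double *a, size_t n, int passes)
{
    double sum = 0.0;
    for (int p = 0; p < passes; p++)
        for (size_t i = 0; i < n; i++)
            sum += a[i];
    return sum;
}

/* Time roughly `total_adds` additions over a working set of `kb`
   kilobytes.  Returns CPU seconds, or -1.0 if the size is zero or the
   allocation fails. */
double time_working_set(size_t kb, size_t total_adds)
{
    size_t n = kb * 1024 / sizeof(double);
    if (n == 0)
        return -1.0;

    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return -1.0;
    for (size_t i = 0; i < n; i++)
        a[i] = 1.0;

    int passes = (int)(total_adds / n);
    clock_t t0 = clock();
    volatile double sink = sweep(a, n, passes);  /* keep the result live */
    clock_t t1 = clock();
    (void)sink;

    free(a);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

/* Demo driver, only built when CACHE_DEMO_MAIN is defined. */
#ifdef CACHE_DEMO_MAIN
int main(void)
{
    /* Illustrative sizes: from cache-sized to well beyond cache. */
    const size_t sizes_kb[] = { 64, 256, 8192, 65536 };

    for (size_t i = 0; i < sizeof sizes_kb / sizeof sizes_kb[0]; i++)
        printf("%6zu kB: %.3f s\n", sizes_kb[i],
               time_working_set(sizes_kb[i], (size_t)1 << 26));
    return 0;
}
#endif
```

 If the larger working sets take noticeably longer for the same amount of
 arithmetic, the machine is limited by memory access rather than by the
 CPU, which is the situation suspected above.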
 We found 8MB (1MW) to be the minimum amount of RAM to run any G94 job
 on P6 or PII systems.  So it may be a useful strategy to try first with
 this amount, and if the job bombs, to go up from there in, say, 1MW steps.
 For one single-point calculation, this will not usually be worthwhile,
 but for an optimization, and, even more so, a lengthy potential energy
 scan, this can save a lot of CPU time.
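 The trial-and-error strategy above amounts to a simple schedule of %Mem
 values.  A minimal C sketch, assuming 8-byte words (so 1MW = 8MB, as
 stated above) and assuming the MW suffix on the %Mem line; the helper
 names and the upper limit are hypothetical, and actually launching the
 job is left out.

```c
#include <stdio.h>

#define BYTES_PER_WORD 8L   /* assumption: one word = one 8-byte double */

/* Convert megawords to megabytes (1MW = 8MB). */
long mw_to_mb(long mw)
{
    return mw * BYTES_PER_WORD;
}

/* Write the %Mem line for a given setting into buf.
   Returns the number of characters snprintf reports. */
int format_mem_line(char *buf, size_t len, long mw)
{
    return snprintf(buf, len, "%%Mem=%ldMW", mw);
}

/* Next setting to try after a failure at cur_mw.
   Returns -1 once the hypothetical limit max_mw would be exceeded. */
long next_mem_mw(long cur_mw, long step_mw, long max_mw)
{
    long next = cur_mw + step_mw;
    return next <= max_mw ? next : -1;
}

/* Demo driver, only built when MEM_DEMO_MAIN is defined. */
#ifdef MEM_DEMO_MAIN
int main(void)
{
    char line[32];

    /* Start at 1MW and go up in 1MW steps, here to an arbitrary 4MW. */
    for (long mw = 1; mw != -1; mw = next_mem_mw(mw, 1, 4)) {
        format_mem_line(line, sizeof line, mw);
        printf("%s   (%ld MB)\n", line, mw_to_mb(mw));
    }
    return 0;
}
#endif
```

 In practice each printed line would go at the top of the input file, and
 the loop would stop at the first setting with which the job completes.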
 Hope this is useful,
 Marc
 ------------------------------------------------------------------------
  Marc C. Nicklaus                        National Institutes of Health
  E-mail: mn1 ^at^ helix.nih.gov               Bldg 37, Rm 5B29
  Phone:  (301) 402-3111                  37 Convent Dr, MSC 4255
  Fax:    (301) 496-5839                  BETHESDA, MD 20892-4255    USA
       http://rex.nci.nih.gov/RESEARCH/basic/medchem/mcnbio.htm
     Laboratory of Medicinal Chemistry, National Cancer Institute,  &
   Center for Molecular Modeling, Ctr. for Information Technology, NIH
 ------------------------------------------------------------------------