From owner-chemistry(+ at +)ccl.net Sat Oct 1 01:34:01 2005
From: "Perry E. Metzger perry===piermont.com"
To: CCL
Subject: CCL: W:hardware for computational chemistry calculations
Message-Id: <-29444-050930234757-27004-rNRnCFiiyV7ZwWtS63fJ2Q!A!server.ccl.net>
X-Original-From: "Perry E. Metzger"
Content-Type: text/plain; charset=us-ascii
Date: Fri, 30 Sep 2005 23:47:44 -0400
MIME-Version: 1.0

Sent to CCL by: "Perry E. Metzger" [perry*|*piermont.com]

"Eric Bennett ericb-,-pobox.com" writes:

> Perry Metzger writes:
>> A strong recommendation though that I'll bring up here because it is
>> vaguely OS related -- do NOT use more threads than processors in your
>> app if you know what is good for you. Thread context switching is NOT
>> instant, and you do not want to burn up good computation cycles on
>> useless thread switching.
>
> Somewhat relevant to this: I have seen about a 25% throughput
> increase in my MM calculations when using hyperthreading, running
> four processes on a 2 CPU Xeon machine with hyperthreading on, as
> compared to two processes with hyperthreading off. In the special
> case of hyperthreading sometimes you can benefit.

Hyperthreading is an entirely different thing -- it is unfortunate that
the two terms share a word. An Intel processor with hyperthreading can
do something useful while it is waiting on other things that are
blocked some of the time -- it is somewhat like having 1.25 processors
instead of one. In that case, for selected apps, you want to treat the
one processor as though it were two and run two threads. This is still
an instance of my rule, though -- you just treat a hyperthreaded
processor as though it were more than one processor.

In my comment, I'm referring to the more general case -- you don't want
to incur context-switch penalties inside your program if you can help
it. Event dispatch costs about as much as a procedure call, but a
thread switch takes tens to hundreds of times longer. Use threads ONLY
to exploit the parallelism of the multiple processors on your machine,
not for things like I/O multiplexing.

> Having enough RAM is always the most important thing. If you don't
> have enough memory to hold your software and its working data set in
> RAM, that will for certain be the limiting factor in your speed.
>
> 15,000 RPM drives are only available with SCSI interfaces; the SATA
> drives, even with their higher data density, don't have performance
> specs that match up (15K SCSI gets you max sustained transfers of
> around 90 MB/sec). So if you are doing something disk-intensive like
> large QM calculations, there are still people who will buy SCSI. QM
> jobs can end up writing over 10 GB of scratch files. For MM apps like
> dynamics the disk speed is not critical.

Let's say you have a computation that is I/O bound on access to a 10 GB
file. Right now, an additional 10 GB of DRAM will cost ~$1200. The
lowest-priced 15,000 RPM drives you can buy are ~$210, plus you need a
decent SCSI controller, which can be another $200, so call it $410. If
you want to stripe a couple of drives, the price goes up more. So the
question becomes: does having enough memory to hold the whole scratch
file in the buffer cache speed your app up enough to justify the
marginal ~$1000 cost? That depends on how I/O bound you are. If you are
very I/O bound, the answer is a clear "yes"; if you are only lightly
I/O bound, the answer is not clear.
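To put rough numbers on that question, here is a minimal back-of-the-
envelope sketch in Python. The prices come from the figures above; the
per-machine cost and the assumption that extra RAM removes the I/O wait
entirely are illustrative guesses, not measurements:

    # Back-of-the-envelope sketch of the RAM-versus-fast-disk question above.
    # All numbers are illustrative assumptions (~$1200 for 10 GB of extra
    # DRAM, ~$410 for a 15K SCSI drive plus controller, and a guessed
    # per-machine cost) -- not measurements.

    def machines_saved(cpu_utilization):
        # If RAM removes the I/O wait entirely, one box does the work of
        # 1/utilization boxes; the rest is capacity you no longer have to buy.
        return 1.0 / cpu_utilization - 1.0

    def ram_pays_off(cpu_utilization, machine_cost,
                     ram_cost=1200.0, disk_cost=410.0):
        # Value of the extra throughput, measured in machines you don't buy,
        # compared against the marginal cost of RAM over the fast-disk option.
        value_of_speedup = machines_saved(cpu_utilization) * machine_cost
        return value_of_speedup >= (ram_cost - disk_cost)

    if __name__ == "__main__":
        for utilization in (0.30, 0.85):
            print("CPU %d%% busy -> extra RAM worth it? %s"
                  % (utilization * 100,
                     ram_pays_off(utilization, machine_cost=3000.0)))

With those assumed numbers, a box that is only 30% busy clearly
justifies the RAM, while one that is already 85% busy does not -- the
same shape of conclusion the following paragraphs reach in words.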
If you are very I/O bound, the added RAM (versus a fast disk) will
essentially eliminate your I/O time, switching you to being compute
bound. This can sometimes increase your speed enough that you can get
by with a fraction of the number of computers. So, if you are really
strongly I/O bound -- that is, if your CPU is idle most of the time
because it is waiting for the disk -- the answer is a clear "yes". Say
you're only using 30% of the CPU: eliminating the I/O bottleneck with
RAM is worth a couple of additional (expensive) computers to you,
because you'll suddenly be using 100% of the machine. (A quick way to
check how much time a box actually spends waiting on the disk is
sketched at the end of this message.)

Some years ago, I remember when it first became obvious that for some
servers I was dealing with, buying 4 GB of memory so that the entire
working set of files would fit in RAM meant that one machine could
perform something like five or ten times better than boxes with even
very fast disks. That was a giant win -- effectively the extra couple
of gigabytes of RAM meant we didn't need four other computers.

However, as the degree of I/O bottleneck goes down, the equation
shifts. If you're only idle, say, 15% of the time, the economics become
fuzzier. You have to do the calculation pretty carefully, but you may
find that you're better off without the RAM if you are only waiting
slightly for the disk. If your working set is, say, 40 GB, there is no
way to fit enough memory into the box, and you just have to bite the
bullet (or buy a really big honking RAID array). All such calculations
are economics, in the end.

By the way, if your scratch file access is not random, your working set
may be smaller than you think, and you may be able to get most of the
effect with less memory, which again shifts the economics of the
calculation. Of course, if your working set even slightly exceeds RAM,
you totally lose, because you're constantly waiting for I/O. Testing to
determine your true working set size can be very important. Knowing how
to tune your OS so that you get maximum cache hits is also critical --
a bit of an esoteric skill, but one that is very important to pick up.

--
Perry E. Metzger                perry[a]piermont.com
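As a rough way to check how I/O bound a box actually is before spending
the money, here is a minimal sketch. It assumes a Linux 2.6-era kernel
whose /proc/stat "cpu" line carries an iowait column (fifth field); the
sampling interval is arbitrary, and tools such as vmstat or iostat
report the same figure with less ceremony:

    # Minimal sketch: estimate what fraction of time a Linux box spends
    # waiting on the disk by sampling the aggregate "cpu" line of /proc/stat.
    # Assumes a 2.6-era kernel whose fields are:
    #   user nice system idle iowait irq softirq ...
    # A rough indicator only; per-process accounting gives more detail.

    import time

    def read_cpu_times():
        with open("/proc/stat") as f:
            fields = f.readline().split()[1:]   # drop the leading "cpu" label
        return [int(x) for x in fields]

    def iowait_fraction(interval=5.0):
        before = read_cpu_times()
        time.sleep(interval)
        after = read_cpu_times()
        deltas = [b - a for a, b in zip(before, after)]
        total = sum(deltas)
        iowait = deltas[4] if len(deltas) > 4 else 0   # fifth field is iowait
        return iowait / float(total) if total else 0.0

    if __name__ == "__main__":
        print("fraction of time spent waiting on I/O: %.1f%%"
              % (100 * iowait_fraction()))

If that fraction is large, the RAM-for-scratch-files argument above
applies in full; if it is small, the fast disk is probably the better
buy.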