From chemistry-request@server.ccl.net Fri Apr 25 18:43:26 2003
Received: from ultra.chem.ucsb.edu ([128.111.114.119])
	by server.ccl.net (8.11.6/8.11.0) with ESMTP id h3PMhQa30715
	for <chemistry@ccl.net>; Fri, 25 Apr 2003 18:43:26 -0400
Received: from localhost (localhost [127.0.0.1])
	by ultra.chem.ucsb.edu (Postfix) with ESMTP
	id 223F450357; Fri, 25 Apr 2003 15:43:17 -0700 (PDT)
Received: from ultra.chem.ucsb.edu ([127.0.0.1])
 by localhost (ultra [127.0.0.1]) (amavisd-new, port 10024) with ESMTP
 id 00875-08; Fri, 25 Apr 2003 15:43:13 -0700 (PDT)
Received: by ultra.chem.ucsb.edu (Postfix, from userid 3016)
	id E10A050351; Fri, 25 Apr 2003 15:43:11 -0700 (PDT)
Received: from localhost (localhost [127.0.0.1])
	by ultra.chem.ucsb.edu (Postfix) with ESMTP
	id D9E011707D2; Fri, 25 Apr 2003 15:43:11 -0700 (PDT)
Date: Fri, 25 Apr 2003 15:43:11 -0700 (PDT)
From: John Bushnell <bushnell@chem.ucsb.edu>
To: jmmckel@attglobal.net
Cc: chemistry@ccl.net
Subject: Re: CCL:AMD Dual Processor Boxes
In-Reply-To: <3EA80E5C.20386F5A@attglobal.net>
Message-ID: <Pine.GSO.4.10.10304251535070.23677-100000@ultra.chem.ucsb.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Virus-Scanned: by amavisd-new

Well, we've got around 64 Tyan S2466N boards that have been
running in 2U rackmounts for approx. 1 year on average.  I
don't think we've lost a motherboard yet.  The most frequent
problems have been cpu-fans, followed by hard drives.

    - John

On Thu, 24 Apr 2003 jmmckel@attglobal.net wrote:

> Folks,
> 
> My usual hardware supplier has an outstanding service warranty: bad
> parts pass replacement parts in the mail for the first year.  They no
> longer like to sell Tyan dual AMD processor systems, but will if the
> customer is willing to deal directly with Tyan when a motherboard
> crashes...  They claim a 33% failure rate in the first year.
> 
> I know there are lots of happy AMD dual processor Tyan board users out
> there...  Who has had problems in the last year?
> 
> [ A dual Tyan with 2-2000MP's and about 1/gig of memory can be had for
> about $900, while a dual 2.4Xeon/Supermicro with 1/2 Gig of memory is
> about $1200.  Maybe on price performance this isn't so bad.  I've never
> had a bad SuperMicro mobo out of 4.]
> 
> Regards,
> 
> John McKelvey
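
The two figures in this exchange (a claimed 33% first-year failure rate
versus roughly 64 boards with no motherboard lost) can be compared with a
little arithmetic.  The sketch below is only illustrative: it assumes that
board failures are independent, that every board has been in service for a
full year (the message says "approx. 1 year on average"), and that both
parties mean the same thing by "failure".  The function names are made up
for this example.

```python
def prob_no_failures(n_boards: int, annual_failure_rate: float) -> float:
    """Chance that none of n_boards fails within a year.

    Assumes each board fails independently with the same probability.
    """
    return (1.0 - annual_failure_rate) ** n_boards


def expected_failures(n_boards: int, annual_failure_rate: float) -> float:
    """Average number of boards expected to fail within a year."""
    return n_boards * annual_failure_rate


if __name__ == "__main__":
    n_boards = 64          # "around 64 Tyan S2466N boards"
    claimed_rate = 0.33    # supplier's claimed first-year failure rate
    print("expected failures:", expected_failures(n_boards, claimed_rate))
    print("P(no failures):   ", prob_no_failures(n_boards, claimed_rate))
```

Under those assumptions, seeing no failures at all would be extremely
unlikely if the 33% figure applied to this board model and this
installation, which suggests the two experiences differ in hardware,
cooling, or the definition of a failure.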



From chemistry-request@server.ccl.net Fri Apr 25 16:53:15 2003
Received: from kafka.net.nih.gov ([165.112.130.10])
	by server.ccl.net (8.11.6/8.11.0) with ESMTP id h3PKrFa28138
	for <CHEMISTRY@ccl.net>; Fri, 25 Apr 2003 16:53:15 -0400
Received: from I.T.S.ME ([129.43.27.170])
	by kafka.net.nih.gov (8.12.9/8.12.9) with ESMTP id h3PKrFAd000617;
	Fri, 25 Apr 2003 16:53:15 -0400 (EDT)
Date: Fri, 25 Apr 2003 16:52:57 -0400 (Eastern Daylight Time)
From: "M. Nicklaus" <mn1@helix.nih.gov>
To: CHEMISTRY@ccl.net
cc: mn1@helix.nih.gov
Subject: SUMMARY -- Temperature monitoring [AMD Dual Processor Boxes]
Message-ID: <Pine.WNT.4.21.0304251606170.-154897@lmchcaddpc1.nci.nih.gov>
X-X-Sender: mn1@helix.nih.gov
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

Dear CCL members,

Quite a while ago I asked the question below; and since John McKelvey's
post from today touches on the same problem, I'm taking this as an
opportunity to summarize the answers I got, as well as to briefly report
on the current status regarding these problems with our cluster.

First off, quite a few people recommended using lm-sensors.  
Unfortunately, this doesn't seem to work well with this specific
motherboard/chipset, so we figured this would not really be of help to us.  
But thanks to everyone suggesting this, in particular to
Greg Galperin <grg22@ai.mit.edu>
Szilveszter Juhos <szilva@ribotargets.com>
"Christian Muck-Lichtenfeld" <cml@uni-muenster.de>
Gali Drudis Sole <gali@klingon.uab.es>

Florian Barth's answers proved to be the most interesting ones.  We think it
may indeed be a subtle design flaw of the Tyan Tiger MP dual-Athlon
motherboards that makes them somewhat more heat-sensitive than they should
be, especially for CPU 2.  It appears that some of the components may be
running at the limits of their specifications, and depending on the load,
the cooling setup, the ambient temperature etc., individual motherboards
will show greater or lesser propensity to hang, depending on minor
manufacturing variations.  For some motherboards, these hangs occur at
ambient and CPU temperatures that are well below what the systems
*should* be able to tolerate.  
    A differential sensitivity between CPU 1 and 2 would also explain some
of the seemingly bizarre behavior we've observed, such as one user's jobs
dying quite regularly while a different user's jobs virtually never
hung -- until it became clear that the first user tended to submit jobs
that would typically run in dual-CPU mode, whereas the second user's jobs
mostly led to single-CPU runs.
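
One rough way to probe such a difference in software is to load one CPU at
a time and see which run provokes a hang.  The sketch below pins the
current process to a single logical CPU and busy-loops there; it is
Linux-only, the function name is invented, and which logical CPU index
corresponds to which physical socket is an assumption that has to be
checked per machine.  A Python loop also produces far less heat than a
dedicated tool such as cpuburn, so treat this as an illustration of the
pinning step rather than a real stress test.

```python
import os
import time


def burn_on_cpu(cpu_index, seconds):
    """Pin this process to one logical CPU and busy-loop for `seconds`.

    Returns the number of loop iterations completed.  The previous CPU
    affinity is restored afterwards.
    """
    previous = os.sched_getaffinity(0)      # remember the allowed CPUs
    os.sched_setaffinity(0, {cpu_index})    # restrict to a single CPU
    try:
        deadline = time.monotonic() + seconds
        iterations = 0
        x = 1.0001
        while time.monotonic() < deadline:
            x = (x * x) % 7.0 + 1.0001      # arbitrary floating-point work
            iterations += 1
        return iterations
    finally:
        os.sched_setaffinity(0, previous)


if __name__ == "__main__":
    for cpu in sorted(os.sched_getaffinity(0)):
        print(f"loading logical CPU {cpu} ...")
        burn_on_cpu(cpu, 5.0)
```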

Bottom line: Anything you can do to keep temperature down helps.  We
are now experimenting with stronger fans.  The ones we're trying are
Thermaltake Volcano 7 -- not exactly cheap at $35-$40 apiece.  In
preliminary, very limited tests, they seem to make the system more stable.
No endorsement or final assessment yet, though -- just our first
impression.

Hope this helps,

Marc


##### ORIGINAL QUESTION #####

> We have a Linux cluster consisting of dual XP1900+ Athlon nodes. We've
> found them to be quite heat-sensitive.  We would therefore like to
> monitor component temperature in them (ideally selectively for CPU,
> memory etc.) to pinpoint where the problem lies, and at what
> temperatures the nodes become likely to go down.
>
> What kind of approaches - both hardware and software - have people found
> to be effective for this goal?


##### RECENT POSTING #####

On Thu, 24 Apr 2003 jmmckel@attglobal.net wrote:

> My usual hardware supplier has an outstanding service warranty: bad
> parts pass replacement parts in the mail for the first year.  They no
> longer like to sell Tyan dual AMD processor systems, but will if the
> customer is willing to deal directly with Tyan when a motherboard
> crashes...  They claim a 33% failure rate in the first year.
> 
> I know there are lots of happy AMD dual processor Tyan board users out
> there...  Who has had problems in the last year?

------------------------------------------------------------------------

##### ANSWERS #####

------------------------------

From: Robert J. Doerksen <rjd@cmm.chem.upenn.edu>

        Not directly answering your question (though I suspect it is the
cpu itself that needs cooling), often the fans in such units are not
sufficient. Try to get them modified so at least the heat will be removed
better from within the boxes. Other than that, air conditioning must be
made strong enough to get rid of the heat that the AMD's produce.

------------------------------

From: Geoff Skillman <skillman@www.eyesopen.com>

many of the dual athlon motherboards have built in temperature sensors.

my apologies if you've already checked on this simple solution, but it
would be a pity to miss.  you can often access these sensors through the BIOS
interface.

------------------------------

From: Alessandro Contini <alessandro.contini@unimi.it>

Hi, I use a dual MP 1800+ Athlon workstation with a dual SCSI disk array,
and I manage the heat this way:
2 fans cooling the power supply
2 large alloy coolers for the CPUs
2 tachometric 5000 rpm fans for the CPUs
1 extractor fan
2 fans for the hard disks

With this setup I can keep the motherboard T < 35C and the CPUs' T <
50C.  Obviously for your cluster it's better to keep it in an
air-conditioned room (max 20C).

------------------------------

From: Forlani Roberto <roberto.forlani@nikemresearch.com>

        I haven't tried it on our cluster, but on my Linux laptop I've
patched the kernel for ACPI support and it works very well.  There are also
a lot of applications (with KDE I use akpi) that can display the status
graphically.

------------------------------

From: dominik.auer@mail.uni-wuerzburg.de

We also have this type of dual Athlon running. Our general
experience is that outside our air-conditioned server room they freeze
after 3-4 hours (especially in summer) due to heat problems. In my
opinion no heat monitoring will help to solve this problem. I would
suggest testing different cooling systems, such as other fan types with
better silicone grease, Peltier elements or even liquid cooling.

------------------------------

From: Florian Barth <bio_hazard@gmx.de>

I have built a cluster with MP1800+ processors over the last year. I
also experienced serious instability problems due to temperature and
also thought that the CPUs were the cause. After some (very quick and
dirty) measurements I found out that the CPUs were well within
their allowed temperature limit (50-65C), but some components of the
motherboard (the MOSFETs for the power supply of the CPUs and chipset)
were getting far too hot. Only after cooling them separately with
additional fans was I able to get a stable system. The
boards we are using are Tyan Tiger MP; if you use the same boards,
maybe you have the same problems.

The first thing is, Tyan does not regard what I
think is a design flaw as a hardware problem they are responsible for
(at least that's the statement from the German technical support). I have
not contacted the Tyan headquarters, since in the meantime I have found a
way to run the PCs.
  In the cluster we have boards of revision A,B and C (which I think is
the latest). All of them have the same problem that I could fix in the
same way.
  Maybe you have already noticed that the CPU in socket 2 gets
somewhat hotter than the other CPU. The same problem applies to the
MOSFETs (transistors) below and to the left of socket 2 (there are four of
them if I remember right), which are responsible for the power supply of
CPU 2 and parts of the chipset and PCI bus.
  These MOSFETs do fine with Athlons up to 1600+ (a colleague of mine has
some of these boxes) but get far too hot if you use Athlon 1800+ and above
(the board is only specified up to Athlon 1900+). I could not get very
precise measurements, since I lack the equipment to do so (which would be
a thermal camera), but when I cool the area below CPU socket 2 (without any
other changes) the whole system runs stable where it failed before
(the test being to run cpuburn for an hour). Another indication is that
when you run the system with only one processor at a time and test it with
the cpuburn software, it will only fail with CPU 2.
  A Tyan technician assured me that the MOSFETs are still running
inside their specifications, but I think they are really at their limits
(some of the boards, about 20%, run fine without additional cooling).
Notably, the MP is the only Tyan dual-CPU board not recommended by AMD.
I compared the Tyan MP with its successor, the Tyan MPX. On the MPX they
use different MOSFETs and far bigger cooling areas (the tin lines on the
back of the motherboard) for them. The MPX is very stable; it runs even
with minimal cooling (just 1 case fan) without failure.

Now to my cooling solution (which is really very homemade and was quite
work-intensive): we disassembled all our PCs (130) and drilled a 60mm
hole in the PC case backplane that holds the motherboard. The hole was
placed right behind the cooling areas for the MOSFETs of CPU 2. Then we
installed a 60x60x10mm fan in front of the hole and mounted the
motherboard again. Now we have this fan blowing right onto these cooling
areas. I found this solution more reliable than using a fan on top of
the MOSFETs, which would have been far easier to install. If you want, I
can send you some pictures of the modified case and system.
  In addition to this fan we have only two more fans in the case (not
counting the fan of the PSU): one in the bottom front blowing cool air in
and one in the back blowing hot air out. Both of them are 80x80x25mm fans
with >7000rpm.

------------------------------

From: Morten Stroem <ms@scali.com>

We read your posting on CHEMISTRY@ccl.net and believe we have the
software you need to do the necessary monitoring on your system. Our Scali
Manage software has all the features needed to do what you require. Please find
attached a PDF file with the data sheet for Scali Manage and our white paper for
the same [omitted -MCN].

Scali Manage Benefits
 *  Single management system for mixed interconnects
 *  Rapid installation/reinstallation of software and nodes
 *  Single point-of-management for one or many clusters
 *  Hardware monitoring for early fault detection
 *  Controlled installation of 3rd party applications & user data
 *  Advanced network and user administration
 *  Remote and secure system architecture
 *  Out-of-band management and disaster recovery
 *  Preventative management and monitoring tools
 *  Scalability to manage small and large systems

##### END OF ANSWERS #####



------------------------------------------------------------------------
 Marc C. Nicklaus, Ph.D.                 NIH/NCI at Frederick
 E-mail: mn1@helix.nih.gov               Bldg 376, Rm 207
 Phone:  (301) 846-5903                  376 Boyles Street
 Fax:    (301) 846-6033                  FREDERICK, MD 21702      USA
          Head, Computer-Aided Drug Design MiniCore Facility
     Laboratory of Medicinal Chemistry, Center for Cancer Research,
 National Cancer Institute at Frederick, National Institutes of Health
       http://rex.nci.nih.gov/RESEARCH/basic/medchem/mcnbio.htm
------------------------------------------------------------------------






From chemistry-request@server.ccl.net Sat Apr 26 00:08:57 2003
Received: from t12mail.lanl.gov ([128.165.22.99])
	by server.ccl.net (8.11.6/8.11.0) with ESMTP id h3Q48ua03027
	for <chemistry@ccl.net>; Sat, 26 Apr 2003 00:08:57 -0400
Received: from strontium.lanl.gov (strontium.lanl.gov [128.165.22.138])
	by t12mail.lanl.gov (8.11.1 PATCHED 03/04/03/8.11.1) with ESMTP id h3Q48o406787
	for <chemistry@ccl.net>; Fri, 25 Apr 2003 22:08:50 -0600 (MDT)
Date: Fri, 25 Apr 2003 22:08:48 -0600
From: Artem Masunov <amasunov@LANL.gov>
Reply-To: Artem.Masunov@LANL.gov
To: chemistry@ccl.net
Subject: meetings of interest for computational chemists
Message-ID: <Pine.SGI.4.44.0304252134340.158694-100000@strontium.lanl.gov>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

Dear CCLers,
 I am looking for the meetings of interest for
material science oriented computational chemists.
So far I found 16.
Please let me know if I missed something.
The updated list will be available at

http://www.t12.lanl.gov/home/amasunov/Page.htm

Thanks in advance for your input,
 Artem


Deadline        When?           Where?
________________________________________________

5/1/2003	11/16-20/2003	San Francisco, CA
AIChE/21001: Recent Advances in Molecular Simulation
http://www.aiche.org/annualapp/

5/1/2003	9/7-11/2003	New York, NY
ACS/COMP
http://oasys.acs.org/acs/226nm/comp/papers/index.cgi

5/2/2003	9/7-11/2003	New York, NY
ACS/POLY: Organic and Polymer Materials for Plastic and Molecular Electronics
http://oasys.acs.org/acs/226nm/poly/papers/index.cgi

5/3/2003	5/23-24/2003	Clemson, SC
SouthEastern Theoretical Chemistry Association
http://setca03.ces.clemson.edu

5/6/2003	6/12-14/2003	Iowa SU, Ames, IA
Midwest Theoretical Chemistry Conference
http://www.pmodels.org/~mwtcc

5/15/2003	6/16-20/2003	Ohio SU, Columbus, OH
International Symposium on Molecular Spectroscopy
http://molspect.mps.ohio-state.edu/symposium/index.html

5/25/2003	6/25-27/2003	UofU, Salt Lake City, UT
Electronic Materials Conference
http://www.chemsoc.org/CFCONF/alldetails.cfm?ID=10744

6/1/2003 	8/11-16/2003 	Los Alamos, NM
Excited state processes in organic materials
http://cnls.lanl.gov/~esp

6/5/2003	12/1-5/2003	Boston, MA
MRS meeting
http://www.mrs.org/meetings/fall2003

8/7/2003	3/7-12/2004	Chicago, IL
PITTCON
http://www.pittcon.org

10/1/2003	11/??/2003	Jackson SU, MS
Current Trends in Computational Chemistry
http://cctcc.ccmsi.us

10/1/2003	11/13-15/2003	TexTechU, Lubbock, TX
SouthWestern Theoretical Chemistry Conference
http://www.depts.ttu.edu/chemistry/conf.htm

1/5/2004	2/28-6/2004	St.Augustine, FL
Sanibel Symposium
http://www.qtp.ufl.edu/~sanibel

6/2/2004	10/9-12/2003	San Francisco, CA
Foresight Conference on Molecular Nanotechnology
http://www.foresight.org/conference/MNT11/index.html

6/4/2004	7/4-9/2004	Holderness School, NH
Gordon Conference on Computational Chemistry
http://www.grc.uri.edu/04sched.htm

asap	11/5-7/2003	Santa Fe, NM
Int. Conference on Computational Methods in Materials Characterization
http://www.wessex.ac.uk/conferences/2003/materials03


________________________________________________________
     __    ___________        Artem.Masunov@LANL.gov
    /  \  /  __   __  \   www.t12.lanl.gov/home/amasunov
   /    \/\  \ \  \ \  \  505.665.2635, Fax:505.665.3909
  /  /\  \ \  \ \  \ \  \  Theoretical Division, MS B268
 /  ____  \ \  \ \  \ \  \    Los Alamos National Lab
/__/\ _/\ _\ \ _\ \ _\ \ _\     Los Alamos NM 87545
\ _\/  \/__/\ __/\ __/\ __/ ____________________________
           Let the beauty we love be what we do. -- Rumi




