CCL: PBS



Thank you Marcel, Torben, Reinaldo, and Reuti for your prompt answers to my e-mail and for the list of suggestions you provided.
 
I think I found a solution that is weird, but it worked (at least it has been working for a few hours now).  I checked the scheduler logs and noticed a message I had overlooked before:
"Not enough memory available"
 
This message makes no sense, since I have 2 GB on each node (1 GB per processor) and I was requesting 100 MB; in addition, I had set the server variable resources_max.mem=1GB.  I tried different values for this variable, but the solution was to unset it altogether (qmgr -c 'unset server resources_max.mem').
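
For anyone who wants to repeat this, the inspect-and-unset sequence might look like the sketch below. It assumes qmgr is on the PATH and is run from an account with PBS manager rights; the function names and the DRY_RUN switch are my own illustrative names, not part of PBS. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: inspect the server settings and remove the server-wide memory cap.
# run_qmgr, pbs_mem_fix and DRY_RUN are illustrative names, not part of PBS.

run_qmgr() {
    # Always print the command; execute it only when DRY_RUN=0.
    echo "qmgr -c '$1'"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        qmgr -c "$1"
    fi
}

pbs_mem_fix() {
    run_qmgr "print server"                     # show the current server settings
    run_qmgr "unset server resources_max.mem"   # remove the memory cap
    run_qmgr "print server"                     # confirm the attribute is gone
}

pbs_mem_fix
```

Running it with DRY_RUN=0 in the environment applies the change for real; without it, nothing on the server is touched.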
 
Now, if any of you wants to enlighten me about why this is, I would appreciate it, but at this point it is just curiosity.
 
I got a couple of tips that may be useful to some of you dealing with PBS, so I am going to share them.
 
Thanks again
 
Pedro
 
 
********************************************************************************************************
Do you have pbs_sched running?
And did you start the daemons in the right order?
E.g. first ctl then sched or so ..
[see manual]
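
A quick way to answer the first question is sketched below. It assumes the usual OpenPBS daemon names (pbs_server and pbs_sched on the head node, pbs_mom on the compute nodes) and that pgrep is installed; is_running is only an illustrative helper name. It does not check the start order, for which the manual remains the reference.

```shell
#!/bin/sh
# Sketch: report which PBS daemons are currently running on this host.
# The daemon names are the usual OpenPBS ones; adjust them if yours differ.

is_running() {
    # pgrep -x matches the exact process name; exit status 0 means found.
    pgrep -x "$1" >/dev/null 2>&1
}

for daemon in pbs_server pbs_sched pbs_mom; do
    if is_running "$daemon"; then
        echo "$daemon: running"
    else
        echo "$daemon: NOT running"
    fi
done
```

On a compute node you would expect only pbs_mom to show up as running.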
********************************************************************************************************
We have also experienced another oddity with this system on a perhaps
related matter. We observed that the load as displayed by "top" was
very high (>3) when we ran single cpu frequency jobs with Jaguar using
kernel 2.4.20-8smp, that came with the RH 9 distribution (I have not
tested Gaussian, so I can't say for sure that this particular quirk
isn't program specific). However, when we upgraded the kernel to
2.4.20-31.9smp the load dropped to the expected value of ca. one and
the walltime dropped drastically, whereas cpu- and systime were almost
unchanged. The odd thing is that as far as I have understood, there is
nothing in this kernel upgrade that can explain this observation - for
example the I/O subsystem was not improved significantly. However, the
observations are nonetheless reproducible, so if you are not already
running the latest kernel revision I would recommend that you upgrade!
********************************************************************************************************
I also use (open)PBS in our Linux cluster under
RedHat 7.2. You should put in your pbs_server.conf
file the line

set server node_pack = False

This should prevent node packing of jobs.
In any case, there is a mailing list related
to OpenPBS. It's outdated, as the software is no longer
distributed, but it still has useful information.
Have a look at http://www.openpbs.org and then go to
the UserArea (you must register first; it's free,
don't worry).
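
The node_pack line is written in qmgr syntax, so one way to apply it might be the sketch below. This assumes the attribute is set through qmgr, in the same way as resources_max.mem, rather than through the pbs_server.conf file mentioned in the reply; APPLY is an illustrative switch, not a PBS variable. By default the script only prints the command.

```shell
#!/bin/sh
# Sketch: disable node packing through qmgr (an assumed route; the reply
# above mentions a pbs_server.conf file instead).
# APPLY is an illustrative switch, not a PBS variable.

node_pack_cmd="set server node_pack = False"

echo "qmgr -c '$node_pack_cmd'"
if [ "${APPLY:-0}" = "1" ]; then
    qmgr -c "$node_pack_cmd"                  # change the server attribute
    qmgr -c "print server" | grep node_pack   # show the resulting setting
fi
```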

 
********************************************************************************************************
What PBS version are you running? If it's OpenPBS, you may want to think
about changing to SGE:

http://gridengine.sunsource.net

In addition to our own cluster with SGE, we have access to a cluster at
another university, which has nothing but problems with OpenPBS because
some jobs never start running. The only solution there is to kill and
resubmit the jobs. Maybe it's similar to your problem.