Thank you Marcel, Torben, Reinaldo, and Reuti for
your prompt answers to my e-mail and for the list of suggestions you
provided.
I think I found a solution. It is weird, but it
worked (at least it has been working for a few hours now). I checked the
scheduler logs and noticed a message I had overlooked before:
"Not enough memory available"
This message makes no sense, since I have 2 GB on
each node (1 GB per processor) and I was requesting 100 MB. In
addition, I had set the server variable resources_max.mem=1GB. I
tried different values for this variable, but the solution was to unset it
altogether (qmgr -c 'unset server resources_max.mem').
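For anyone who wants to check the same setting, here is a minimal sketch of the steps. It assumes that qmgr is on your PATH, that you have PBS manager privileges, and that "print server" lists the limit on a line containing "resources_max". The helper name show_resource_limits is made up for illustration.

```sh
#!/bin/sh
# Sketch only: inspect the server-wide resource limits, then remove the memory cap.
# Assumption: qmgr is on PATH and you have PBS manager privileges.

# Hypothetical helper: keep only the resources_max lines from
# "qmgr -c 'print server'" output read on stdin.
show_resource_limits() {
    grep 'resources_max' || true
}

if command -v qmgr >/dev/null 2>&1; then
    # Show the limits currently set on the server.
    qmgr -c 'print server' | show_resource_limits
    # Remove the memory cap (the change described above).
    qmgr -c 'unset server resources_max.mem'
    # Confirm that resources_max.mem no longer appears.
    qmgr -c 'print server' | show_resource_limits
else
    echo "qmgr not found; run this on the PBS server host" >&2
fi
```

If the second listing still shows resources_max.mem, the unset did not take effect, most likely because of missing manager privileges.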
If any of you can enlighten me as to why unsetting the limit fixed
this, I would appreciate it, but at this point it is just curiosity.
I got a couple of tips that may be useful to some
of you dealing with PBS, so I am going to share them.
Thanks again
Pedro
********************************************************************************************************
Do you have pbs_sched running?
And did you start the daemons in the right order? E.g. first ctl, then sched or so ... [see manual]
********************************************************************************************************
We have also experienced another oddity with this system on a perhaps
related matter. We observed that the load as displayed by "top" was very high (>3) when we ran single-CPU frequency jobs with Jaguar using kernel 2.4.20-8smp, which came with the RH 9 distribution (I have not tested Gaussian, so I can't say for sure that this particular quirk isn't program specific). However, when we upgraded the kernel to 2.4.20-31.9smp, the load dropped to the expected value of ca. one and the wall time dropped drastically, whereas CPU and system time were almost unchanged. The odd thing is that, as far as I have understood, there is nothing in this kernel upgrade that can explain this observation - for example, the I/O subsystem was not improved significantly. The observations are nonetheless reproducible, so if you are not already running the latest kernel revision I would recommend that you upgrade!
********************************************************************************************************
I also use (Open)PBS on our Linux cluster under
RedHat 7.2. You should put the following line in your pbs_server.conf file:
set server node_pack = False
This should prevent node packing of jobs. In any case, there is a mailing list related to OpenPBS. It is outdated, as the software is no longer distributed, but it still has useful information. Have a look at http://www.openpbs.org and then go to the UserArea (you must register first; it's free, don't worry).
********************************************************************************************************
What PBS version are you running? If it's OpenPBS, you may think about
changing to SGE: http://gridengine.sunsource.net
In addition to our own cluster with SGE, we have access to a cluster at another university, which has nothing but problems with OpenPBS because of some jobs that never run. The only solution there is to kill and resubmit the jobs. Maybe it's similar to your problem.