CCL:G: Sun Grid Engine or General Cluster Management
- From: Roger Kevin Robinson <scu98rkr|gmail.com>
- Subject: CCL:G: Sun Grid Engine or General Cluster Management
- Date: Mon, 02 Nov 2009 11:07:50 +0000
Sent to CCL by: Roger Kevin Robinson [scu98rkr..gmail.com]
Dear All,
We have a 64-node cluster consisting of 32 dual-processor machines (each
processor counts as one node). There are about 3 users. Two of the users run
single-processor jobs that usually last between 7 hours and 1-2 days, and they
tend to queue up a batch of test cases, i.e. up to 70 jobs each.
I'm running Gaussian and run many different types of jobs, from 1-2 hours to
composite calculations lasting several days. Recently I've started running
dual-processor open mp2 jobs. I tend to run just a few jobs at a time,
although I occasionally run batches.
We've never really come to a satisfactory conclusion about how to manage
the resources most efficiently. Quite often 1-2 users will not be using
the cluster, so I want all of the resources open to everyone. I've set
up the share policy to 33% each, so queued jobs are ordered according
to how much computing power each user is using on the cluster. This is
good, but it still means the user with the fewest jobs has to wait until
running jobs have finished before their (possibly 1-hour) job will run.
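As a rough illustration of the ordering described above (a toy model, not SGE's actual scheduling algorithm), the queue can be thought of as sorted by how many slots each owner currently occupies. The user names, job ids and the `order_queue` helper are hypothetical.

```python
from collections import Counter


def order_queue(running_owners, queued):
    """Order queued jobs so that owners using the fewest slots go first.

    running_owners: one owner name per busy slot (hypothetical input format).
    queued: list of (job_id, owner) tuples in submission order.
    """
    usage = Counter(running_owners)
    # sorted() is stable, so jobs from equally loaded owners keep submit order.
    return sorted(queued, key=lambda job: usage[job[1]])


# Two users fill all 64 slots; a third user has nothing running.
running_owners = ["user_a"] * 40 + ["user_b"] * 24
queued = [("a71", "user_a"), ("b25", "user_b"), ("c1", "user_c")]
ordered = order_queue(running_owners, queued)
print(ordered)
```

In this model `user_c` moves to the front of the queue, but the job still cannot start until one of the 64 running jobs finishes, which is the waiting problem described above.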
Also, as I mentioned earlier, I've started running dual-processor jobs.
I've just come back after the weekend to find that none of my jobs have run,
even after being at the front of the queue, because at no point have 2
nodes on the same machine been free (rather unsurprisingly). I can
pretend the jobs only use 1 processor, but I've noticed that if you specify 2
processors and someone else starts a job on the same machine, the
computation becomes much slower than if you'd specified 1 processor.
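The starvation of the dual-processor jobs can be sketched with a small model: free slots can exist across the cluster without any single machine having two free at once. The machine names and the `machines_with_free_slots` helper are made up for illustration and are not part of SGE.

```python
def machines_with_free_slots(busy_slots, slots_per_machine=2, needed=2):
    """Return machines that have at least `needed` free slots.

    busy_slots: dict mapping machine name -> number of busy slots.
    """
    return [
        machine
        for machine, busy in busy_slots.items()
        if slots_per_machine - busy >= needed
    ]


# Hypothetical snapshot: two free slots in total, but on different machines.
busy = {"m01": 1, "m02": 2, "m03": 1}
fits_dual = machines_with_free_slots(busy, needed=2)
fits_single = machines_with_free_slots(busy, needed=1)
print(fits_dual, fits_single)
```

With a steady stream of single-processor jobs, every slot that frees up is taken immediately, so the list for `needed=2` can stay empty indefinitely.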
What I really need SGE to do is monitor the usage of each user and check whether
any user is using more than 33% of the cluster. If any other jobs are
currently queued, it needs to suspend jobs belonging to the user over 33% and
replace them with the queued jobs. SGE doesn't seem to have any problem
suspending jobs, so can it run other jobs in the space freed by the suspension?
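The behaviour being asked for can be written down as a short sketch. This is only a statement of the desired policy, not something SGE is known to provide in this form; the `jobs_to_suspend` function, the tuple formats, and the choice to suspend the most recently started jobs first are all assumptions.

```python
def jobs_to_suspend(running, queued, total_slots, share=1 / 3):
    """Pick running jobs to suspend so that waiting users can start.

    running: list of (job_id, owner, slots) in start order (oldest first).
    queued:  list of (owner, slots) for waiting jobs.
    Only owners above `share` of the cluster lose jobs, and only as many
    slots are freed as the waiting users within their share are asking for.
    """
    limit = share * total_slots
    usage = {}
    for _, owner, slots in running:
        usage[owner] = usage.get(owner, 0) + slots

    # Slots requested by users who are not themselves over their share.
    wanted = sum(slots for owner, slots in queued if usage.get(owner, 0) <= limit)

    suspend, freed = [], 0
    for job_id, owner, slots in reversed(running):  # newest jobs first
        if freed >= wanted:
            break
        if usage[owner] > limit:
            suspend.append(job_id)
            usage[owner] -= slots
            freed += slots
    return suspend


# One user holds all 64 slots; another user queues a dual-processor job.
running = [(f"j{i}", "user_a", 1) for i in range(64)]
picked = jobs_to_suspend(running, [("user_c", 2)], total_slots=64)
nothing_waiting = jobs_to_suspend(running, [], total_slots=64)
print(picked, nothing_waiting)
```

With nothing queued, no job is suspended, so a single user can still fill the whole cluster. The sketch ignores placement: the two freed slots may sit on different machines, so a dual-processor job would additionally need both suspended jobs to come from the same machine.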
I don't want to limit people's access to queues, because I want the whole
cluster available to 1 user if there is space.
Thanks, Roger