CCL:G: Sun Grid Engine or General Cluster Management



 Sent to CCL by: Roger Kevin Robinson [scu98rkr..gmail.com]
 Dear All,
 We have a 64-processor cluster consisting of 32 dual-processor machines.
 There are three users. Two of the users run single-processor jobs that
 usually last between 7 hours and 1-2 days, and they tend to queue up a
 batch of test cases, i.e. up to 70 jobs each.
 I'm running Gaussian with many different types of jobs, from 1-2 hour
 runs to composite calculations taking several days. Recently I've
 started running dual-processor OpenMP MP2 jobs. I tend to run just a
 few jobs at a time, although I occasionally run batches.
 We've never really come to a satisfactory conclusion on how to manage
 the resources most efficiently. Quite often one or two users will not
 be using the cluster, so I want all of the resources open to everyone.
 I've set the share policy to 33% each, so queued jobs are ordered
 according to how much computing power each user is currently using on
 the cluster. This is good, but it still means the user with the fewest
 jobs has to wait until the running jobs have finished before their
 (possibly one-hour) job will run.
 Also, as I mentioned earlier, I've started running dual-processor jobs.
 I've just come back after the weekend to find none of my jobs have run,
 even after being at the front of the queue, because at no point have
 two processors on the same machine been free (rather unsurprisingly).
 I can pretend the job only uses one processor, but I've noticed that if
 you specify two processors and someone else starts a job on the same
 machine, the computation becomes much slower than if you'd specified
 one processor.
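 [On the two-slots-on-one-host problem: the usual SGE approach is a
 parallel environment with `allocation_rule $pe_slots`, which forces all
 requested slots onto a single host, combined with resource reservation
 so the two-slot job isn't starved by a stream of single-slot jobs. A
 sketch, where the PE name `smp` and the job script name are arbitrary:]

```shell
# Create a shared-memory parallel environment (qconf -ap smp):
pe_name            smp
slots              64
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots   # all requested slots must come from one host
control_slaves     FALSE
job_is_first_task  TRUE

# Add the PE to the queue's pe_list (qconf -mq <queue>), then submit
# with reservation so slots are held back until two are free together:
qsub -pe smp 2 -R y job.sh

# Reservation only takes effect if max_reservation > 0 in the
# scheduler configuration (qconf -msconf).
```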
 What I really need SGE to do is monitor each user's usage and check
 whether anyone is using more than 33% of the cluster. If any other
 jobs are queued, it needs to suspend the over-quota user's jobs and
 replace them with the queued ones. SGE doesn't seem to have any
 problem suspending jobs, so can it run other jobs in the space freed
 by a suspended job?
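 [SGE has no built-in suspend-and-replace on a per-user share basis, but
 subordinate queues give a coarse approximation: when a job starts in a
 superior queue, jobs in its subordinate queue on the same host are
 suspended, and resumed when the superior job finishes. A sketch,
 assuming two hypothetical queues named high.q and low.q:]

```shell
# In the superior queue's configuration (qconf -mq high.q):
subordinate_list   low.q=1   # suspend low.q jobs on a host as soon as
                             # one slot of high.q on that host is busy

# Jobs routed into low.q (e.g. long, over-quota batches) then yield
# automatically to jobs submitted to high.q on the same host.
```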
 I don't want to limit people's access to queues, because I want the
 whole cluster available to a single user if there is space.
 Thanks Roger