From owner-chemistry -A_T- ccl.net Mon Nov 2 06:05:01 2009
From: "Roger Kevin Robinson scu98rkr/a\gmail.com"
To: CCL
Subject: CCL:G: Sun Grid Engine or General Cluster Management
Message-Id: <-40588-091102060315-15792-cj9X1MRkMVAZGFxIFBODbw/a\server.ccl.net>
X-Original-From: Roger Kevin Robinson
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=ISO-8859-1
Date: Mon, 02 Nov 2009 11:07:50 +0000
MIME-Version: 1.0

Sent to CCL by: Roger Kevin Robinson [scu98rkr..gmail.com]

Dear All,

We have a 64-node cluster consisting of 32 dual-node machines. There are about 3 users. Two of the users run single-processor jobs that usually last between 7 hours and 1-2 days, and they tend to queue up batches of test cases, i.e. up to 70 jobs each. I'm running Gaussian and run many different types of jobs, from 1-2 hour jobs to composite calculations lasting several days. Recently I've started running dual-processor open mp2 jobs. I tend to run just a few jobs at a time, although I occasionally run batches.

We've never really come to a satisfactory conclusion about how to manage the resources most efficiently. Quite often 1-2 users will not be using the nodes, so I want all of the resources open to everyone. I've set up the share policy at 33% each, so queued jobs are ordered according to how much computing power each user is currently using on the cluster. This is good, but it still means the user with the fewest jobs has to wait until previously started jobs have finished before their (possibly 1-hour) job will run.

Also, as I mentioned earlier, I've started running dual-processor jobs.
I've just come back after the weekend to find that none of my jobs have run, even though they were at the front of the queue, because at no point have 2 nodes on the same machine been free (rather unsurprisingly). (I can pretend the jobs only use 1 processor, but I've noticed that if you specify 2 processors and someone else starts a job on the same machine, the computation becomes much slower than if you'd specified 1 processor.)

What I really need SGE to do is monitor the usage of each user and check whether any user is using more than 33% of the cluster. If there are any other jobs queued, it needs to suspend jobs belonging to the user who is over 33% and replace them with the queued jobs. SGE doesn't seem to have any problem suspending jobs, so can it run other jobs in that suspended space?

I don't want to limit people's access to queues, because I want the whole cluster available to 1 user if there is space.

Thanks,
Roger
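
P.S. To make the policy described above concrete, here is a rough Python sketch of one scheduling pass. It is not SGE configuration and uses no SGE API; the names Job and plan_preemption are hypothetical, "slots" stands for processors (what this post calls nodes), and the equal split between 3 users is taken from the 33% figure. It also ignores the requirement that both slots of a dual-processor job sit on the same machine.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical job record, not an SGE data structure.
    job_id: int
    user: str
    slots: int = 1  # processors the job occupies


def plan_preemption(running, queued, total_slots, n_users=3):
    """Return (to_suspend, to_start) for one scheduling pass.

    Idle slots may be used by anyone, even beyond their share. Running jobs
    are suspended only when they belong to a user holding more than
    total_slots / n_users, and only to make room for a queued job whose
    owner would stay within that share.
    """
    quota = total_slots / n_users
    used = Counter()
    for job in running:
        used[job.user] += job.slots
    free = total_slots - sum(used.values())

    to_suspend, to_start = [], []
    for waiting in queued:  # assumed to be in priority order already
        if waiting.slots <= free:
            # Idle capacity: start the job without disturbing anyone.
            free -= waiting.slots
            used[waiting.user] += waiting.slots
            to_start.append(waiting)
            continue
        if used[waiting.user] + waiting.slots > quota:
            continue  # never preempt on behalf of a user at or over their share

        # Tentatively pick victims, most recently listed running job first.
        victims, freed, trial = [], 0, used.copy()
        for job in reversed(running):
            if free + freed >= waiting.slots:
                break
            if job in to_suspend:
                continue
            if trial[job.user] > quota:
                victims.append(job)
                trial[job.user] -= job.slots
                freed += job.slots
        if free + freed < waiting.slots:
            continue  # cannot make room without touching users within quota

        to_suspend.extend(victims)
        used = trial
        used[waiting.user] += waiting.slots
        free = free + freed - waiting.slots
        to_start.append(waiting)
    return to_suspend, to_start
```

With one user holding all 64 slots and a 2-slot job from another user queued, this suspends two of the first user's jobs and starts the queued one. With the cluster idle, a single user can still fill it completely.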