Extending Grid Engine job runtimes with an execd softstop (2012-06-16)

Notes

Slightly modified from the Emacs Muse source.


I was in the situation where someone had submitted a long-running job without bothering to do any checkpointing and then found, as they got towards the end of the 2880 hours they had asked for (having survived a machine room power outage during which we thought that particular node would lose power), that they'd "like another 1000 hours".

Because of that lack of checkpointing, I did not know of a way to extend the job's runtime, but I found a users@gridengine.org

https://gridengine.org/mailman/listinfo/users

thread that offered some hope.

The starting post in the thread

http://gridengine.org/pipermail/users/2012-January/002429.html

can also be identified as follows:

List:    users@gridengine.org
Subject: [gridengine users] qalter not successful
Poster:  Schmidt U. uschmidt at mpi-halle.mpg.de via gridengine.org
Date:    11 January 2012 19:42

and the follow-up I started is here

http://gridengine.org/pipermail/users/2012-June/003919.html

however, the suggestion there was that one might lose the ability to run other jobs on the node that had effectively been taken offline, which was not really an option for me.

The following notes summarise my investigations into starting a second Grid Engine execd on the same node as one on which there is an orphaned job from an original execd, based on the follow-ups from Reuti, Alex Chekholko and a final "word to the wise" from Ron Chen.

Anyroad, I submitted a ten-minute job into a three-node (master and two exec nodes) Xen VM mimic of a RHEL5.6-based grid running Grid Engine 6.2u5.

for i in {1..10} ; do
  echo -n $i " "
  date
  sleep 60
done

where the default all.q has been modified so that it has an s_rt set.

s_rt                  00:02:15
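As a sketch of how this test rig might be set up (tenmin.sh is a hypothetical name for the loop script above, and the invocations assume a stock 6.2u5 install):

```shell
# Set a soft runtime limit of 2m15s on all.q without opening an editor;
# qconf -mattr changes a single attribute of an existing queue.
qconf -mattr queue s_rt 00:02:15 all.q

# Submit the ten-minute loop (saved as tenmin.sh for this example)
# under a plain Bourne shell.
qsub -S /bin/sh tenmin.sh
```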

The job runs (here's the pstree)

     +-sge_execd-+-sge_shepherd---sh---sleep
     |           +-4*[{sge_execd}]

and eventually gets killed

qsub_time    Sun Jun 17 11:43:05 2012
start_time   Sun Jun 17 11:43:05 2012
end_time     Sun Jun 17 11:45:21 2012

delivering this output

1  Sun Jun 17 11:48:20 NZST 2012
2  Sun Jun 17 11:49:20 NZST 2012
3  Sun Jun 17 11:50:20 NZST 2012

After submitting a new job

     +-sge_execd-+-sge_shepherd---sh---sleep
     |           +-4*[{sge_execd}]

we softstop the original execd

# service sgeexecd.vuwscifachpc01 softstop

and are left with these processes

sgeadmin  1971     1  0 11:48 ?        00:00:00 sge_shepherd-6 -bg
buckleke  1972  1971  0 11:48 ?        00:00:00 -sh /var/opt/gridengine/default/spool/scifachpc-c01n03/job_scripts/6

with the pstree as follows

   +-sge_shepherd---sh---sleep

and the job then runs on past the kill time

1  Sun Jun 17 11:48:20 NZST 2012
2  Sun Jun 17 11:49:20 NZST 2012
3  Sun Jun 17 11:50:20 NZST 2012
4  Sun Jun 17 11:51:20 NZST 2012
5  Sun Jun 17 11:52:20 NZST 2012

Starting up the execd without altering anything

06/17/2012 11:52:33|  main|scifachpc-c01n03|I|starting up GE 6.2u5 (lx24-amd64)
06/17/2012 11:52:33|  main|scifachpc-c01n03|W|job 6.1 exceeded soft wallclock time - initiate soft notify method

we see it die.

qsub_time    Sun Jun 17 11:48:08 2012
start_time   Sun Jun 17 11:48:20 2012
end_time     Sun Jun 17 11:52:33 2012

Now start another job but this time, in between the softstop and the restart, replace the execute host's configuration, which just had these defaults

execd_spool_dir              /var/opt/gridengine/default/spool
gid_range                    20000-20100

by creating a local conf for it with new values as follows (the separate spool directory and gid_range keep the new execd clear of the orphaned job's spool files and of the additional group ID Grid Engine uses to track job processes)

execd_spool_dir              /var/opt/gridengine/default/spool2
gid_range                    20101-20200

noting that we may need to create the spool2 dir (actually, no)

scifachpc-c01n04# /etc/init.d/sgeexecd.vuwscifachpc01 softstop

scifachpc10# qconf -sconf scifachpc-c01n04

scifachpc-c01n04# /etc/init.d/sgeexecd.vuwscifachpc01 start
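Put together, the whole swap might look like the following sketch (hostnames are the ones from this test grid; qconf -aconf opens the new host-local configuration in an editor, where the spool2 and gid_range values above would be entered):

```shell
# On the exec host: stop the execd but leave the running shepherd
# (and hence the job) alone.
/etc/init.d/sgeexecd.vuwscifachpc01 softstop

# On the qmaster: add a host-local configuration for the exec host,
# overriding execd_spool_dir and gid_range.
qconf -aconf scifachpc-c01n04

# Back on the exec host: start a fresh execd, which picks up the new
# local configuration and creates the spool2 directory itself.
/etc/init.d/sgeexecd.vuwscifachpc01 start
```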

The restart even creates the new spool directory for us.

A qstat still shows the job on that node with a slot taken

# qstat -f -u \*
queuename                      qtype resv/used/tot. load_avg arch        states
-------------------------------------------------------------------------------
all.q@scifachpc-c01n03.local   BIP   0/0/1          0.00     lx24-amd64
-------------------------------------------------------------------------------
all.q@scifachpc-c01n04.local   BIP   0/1/1          0.00     lx24-amd64
      7 0.55500 qsub2.sh   buckleke     r     06/17/2012 12:00:05     1

A pstree shows a new execd tree and the orphaned job

     +-sge_execd---4*[{sge_execd}]
     +-sge_shepherd---sh---sleep

Altering the configuration to add another slot also works

# qstat -f -u \*
queuename                      qtype resv/used/tot. load_avg arch        states
-------------------------------------------------------------------------------
all.q@scifachpc-c01n03.local   BIP   0/0/1          0.00     lx24-amd64
-------------------------------------------------------------------------------
all.q@scifachpc-c01n04.local   BIP   0/1/2          0.00     lx24-amd64
      7 0.55500 qsub2.sh   buckleke     r     06/17/2012 12:00:05     1

Submitting another job to the same queue gives

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
      7 0.55500 qsub2.sh   buckleke     r     06/17/2012 12:00:05 all.q@scifachpc-c01n04.local       1
      8 0.55500 qsub3.sh   buckleke     r     06/17/2012 12:07:05 all.q@scifachpc-c01n04.local       1

with the pstree showing both

     +-sge_execd-+-sge_shepherd---sh---sleep
     |           +-4*[{sge_execd}]
     +-sge_shepherd---sh---sleep

with the Grid Engine now believing that both slots are used

# qstat -f -u \*
queuename                      qtype resv/used/tot. load_avg arch        states
-------------------------------------------------------------------------------
all.q@scifachpc-c01n03.local   BIP   0/0/1          0.00     lx24-amd64
-------------------------------------------------------------------------------
all.q@scifachpc-c01n04.local   BIP   0/2/2          0.01     lx24-amd64
      7 0.55500 qsub2.sh   buckleke     r     06/17/2012 12:00:05     1
      8 0.55500 qsub3.sh   buckleke     r     06/17/2012 12:07:05     1

Eventually, the newer job stops as normal, yet the qmaster thinks the old one is still running, even though it has finished

# qstat -f -u \*
queuename                      qtype resv/used/tot. load_avg arch        states
-------------------------------------------------------------------------------
all.q@scifachpc-c01n03.local   BIP   0/0/1          0.00     lx24-amd64
-------------------------------------------------------------------------------
all.q@scifachpc-c01n04.local   BIP   0/1/2          0.00     lx24-amd64
      7 0.55500 qsub2.sh   buckleke     r     06/17/2012 12:00:05     1

and the Grid Engine knows nothing about it finishing either

# qacct -j 7
error: job id 7 not found

and nor does the user looking for their job

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
      7 0.55500 qsub2.sh   buckleke     r     06/17/2012 12:00:05 all.q@scifachpc-c01n04.local       1

even though that job has run its course on the node we mangled, with a pstree there now only showing

     +-sge_execd---4*[{sge_execd}]

To get back to the "original" environment, we "softstop" the new execd, although, with no jobs running on it, we could just stop it.

Modify the execd's conf back to what it was (in this case, the defaults, so we could just delete the local config).
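Assuming the host-local configuration existed only for this exercise, the cleanup might be sketched as (qconf -dconf deletes a host-local configuration, dropping the host back to the global defaults):

```shell
# Softstop the temporary execd (a plain stop would do if no jobs are
# running under it).
/etc/init.d/sgeexecd.vuwscifachpc01 softstop

# On the qmaster: delete the host-local configuration so the host
# reverts to the global execd_spool_dir and gid_range.
qconf -dconf scifachpc-c01n04

# Restart the execd with the original settings, pointing at the
# original spool again.
/etc/init.d/sgeexecd.vuwscifachpc01 start
```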

The system now thinks the job that was orphaned finished when it did (after ten minutes)

qsub_time    Sun Jun 17 11:59:53 2012
start_time   Sun Jun 17 12:00:05 2012
end_time     Sun Jun 17 12:10:05 2012

with the "moved offside" execd log showing

06/17/2012 12:00:22|  main|scifachpc-c01n04|I|controlled shutdown 6.2u5
06/17/2012 12:23:32|  main|scifachpc-c01n04|W|local configuration scifachpc-c01n04.local not defined - using global configuration
06/17/2012 12:23:32|  main|scifachpc-c01n04|I|starting up GE 6.2u5 (lx24-amd64)
06/17/2012 12:23:32|  main|scifachpc-c01n04|W|job 7.1 exceeded soft wallclock time - initiate soft notify method

So the restarted system just went to kill the job anyway, once it became aware of it again.

Meanwhile, in the temporary execd's spool messages, we just see things happen as normal

06/17/2012 12:02:09|  main|scifachpc-c01n04|I|starting up GE 6.2u5 (lx24-amd64)
06/17/2012 12:09:20|  main|scifachpc-c01n04|W|job 8.1 exceeded soft wallclock time - initiate soft notify method
06/17/2012 12:21:52|  main|scifachpc-c01n04|I|controlled shutdown 6.2u5

This will get my user out of a major bind, so thanks to all for the insight and feedback.

Please follow up via the users@gridengine.org mailing list.