Extending Grid Engine job runtimes with an execd softstop (2012-06-16)
Notes
Slightly modified from the Emacs Muse source.
Extending Grid Engine job runtimes.
I was in the situation where someone had submitted a long-running job without bothering to do any checkpointing. As they got towards the end of the 2880 hours they had asked for (having along the way survived a machine room power outage in which we thought that particular node was going to lose power), they decided they'd "like another 1000 hours".
Because they had no checkpointing, simply killing and resubmitting the job was not an option, and I was not aware of any way to extend a running job's runtime, but I found a users@gridengine.org thread that offered some hope.
The starting post in the thread
http://gridengine.org/pipermail/users/2012-January/002429.html
can also be identified as follows:
List:    users@gridengine.org
Subject: [gridengine users] qalter not successful
Poster:  Schmidt U. (uschmidt at mpi-halle.mpg.de) via gridengine.org
Date:    11 January 2012 19:42
and the follow-up I started is here
http://gridengine.org/pipermail/users/2012-June/003919.html
However, the suggestion there was that one might lose the ability to run other jobs on the node that had effectively been taken offline, which was not really an option for me.
The following notes summarise my investigations into starting a second
Grid Engine execd on a node that still has an orphaned job left behind
by the original execd, based on the follow-ups from Reuti, Alex
Chekholko and a final "word to the wise" from Ron Chen.
Anyroad, submit a ten-minute job into a three-node (master and two
exec nodes) Xen VM mimic of a RHEL5.6-based grid running a
6.2u5 Grid Engine.
for i in {1..10} ; do
echo -n $i " "
date
sleep 60
done
where the default all.q has been modified so that it has an
s_rt set.
s_rt 00:02:15
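Roughly, that queue tweak and the job submission look like this (a sketch only; qconf -mattr is just one way to set s_rt, and the qsub1.sh script name and -cwd flag are my assumptions rather than a record of the exact commands used):

# qconf -mattr queue s_rt 00:02:15 all.q
$ qsub -cwd qsub1.sh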
The job runs (here's the pstree)
+-sge_execd-+-sge_shepherd---sh---sleep
| +-4*[{sge_execd}]
and eventually gets killed
qsub_time    Sun Jun 17 11:43:05 2012
start_time   Sun Jun 17 11:43:05 2012
end_time     Sun Jun 17 11:45:21 2012
delivering this output
1  Sun Jun 17 11:48:20 NZST 2012
2  Sun Jun 17 11:49:20 NZST 2012
3  Sun Jun 17 11:50:20 NZST 2012
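(Those qsub_time / start_time / end_time lines, here and below, look like qacct output, i.e. something along these lines, where the job ID is a guess on my part given that the later jobs appear as 6, 7 and 8:)

# qacct -j 5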
After submitting a new job
+-sge_execd-+-sge_shepherd---sh---sleep
| +-4*[{sge_execd}]
we softstop the original execd
# service sgeexecd.vuwscifachpc01 softstop
and are left with these processes
sgeadmin  1971     1  0 11:48 ?  00:00:00 sge_shepherd-6 -bg
buckleke  1972  1971  0 11:48 ?  00:00:00 -sh /var/opt/gridengine/default/spool/scifachpc-c01n03/job_scripts/6
with the pstree as follows
+-sge_shepherd---sh---sleep
and the job then runs on past the kill time
1  Sun Jun 17 11:48:20 NZST 2012
2  Sun Jun 17 11:49:20 NZST 2012
3  Sun Jun 17 11:50:20 NZST 2012
4  Sun Jun 17 11:51:20 NZST 2012
5  Sun Jun 17 11:52:20 NZST 2012
Starting up the execd without altering anything
06/17/2012 11:52:33| main|scifachpc-c01n03|I|starting up GE 6.2u5 (lx24-amd64)
06/17/2012 11:52:33| main|scifachpc-c01n03|W|job 6.1 exceeded soft wallclock time - initiate soft notify method
sees the job killed:
qsub_time    Sun Jun 17 11:48:08 2012
start_time   Sun Jun 17 11:48:20 2012
end_time     Sun Jun 17 11:52:33 2012
Now start another job, but this time, in between the softstop and
the restart, replace the execute host's sconf, which just had these
defaults
execd_spool_dir    /var/opt/gridengine/default/spool
gid_range          20000-20100
by creating a local conf for it with new values as follows
execd_spool_dir    /var/opt/gridengine/default/spool2
gid_range          20101-20200
noting that we may need to create the spool2 dir (actually, no)
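One way to put that local configuration in place (a sketch; qconf -aconf opens the host's local configuration in an editor on the qmaster, into which the two values above go):

scifachpc10# qconf -aconf scifachpc-c01n04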
scifachpc-c01n04# /etc/init.d/sgeexecd.vuwscifachpc01 softstop
scifachpc10# qconf -sconf scifachpc-c01n04
scifachpc-c01n04# /etc/init.d/sgeexecd.vuwscifachpc01 start
The restart even creates the new spool for us
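(A quick look on the node is one way to convince yourself of that:)

scifachpc-c01n04# ls -ld /var/opt/gridengine/default/spool2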
A qstat still shows the job on that node with a slot taken
# qstat -f -u \*
queuename qtype resv/used/tot. load_avg arch states
-------------------------------------------------------------------------------
all.q@scifachpc-c01n03.local BIP 0/0/1 0.00 lx24-amd64
-------------------------------------------------------------------------------
all.q@scifachpc-c01n04.local BIP 0/1/1 0.00 lx24-amd64
7 0.55500 qsub2.sh buckleke r 06/17/2012 12:00:05 1
A pstree shows a new execd tree and the orphaned job
+-sge_execd---4*[{sge_execd}]
+-sge_shepherd---sh---sleep
Even altering the queue configuration to add another slot on that node works, as the following shows.
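One way of adding that slot for just this queue instance (a sketch; editing the slots list via qconf -mq all.q would do equally well):

scifachpc10# qconf -mattr queue slots 2 all.q@scifachpc-c01n04.local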
# qstat -f -u \*
queuename qtype resv/used/tot. load_avg arch states
-------------------------------------------------------------------------------
all.q@scifachpc-c01n03.local BIP 0/0/1 0.00 lx24-amd64
-------------------------------------------------------------------------------
all.q@scifachpc-c01n04.local BIP 0/1/2 0.00 lx24-amd64
7 0.55500 qsub2.sh buckleke r 06/17/2012 12:00:05 1
Submitting another job to the same queue sees
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
7 0.55500 qsub2.sh buckleke r 06/17/2012 12:00:05 all.q@scifachpc-c01n04.local 1
8 0.55500 qsub3.sh buckleke r 06/17/2012 12:07:05 all.q@scifachpc-c01n04.local 1
with the pstree showing both
+-sge_execd-+-sge_shepherd---sh---sleep
| +-4*[{sge_execd}]
+-sge_shepherd---sh---sleep
with the Grid Engine now believing that both slots are used
# qstat -f -u \*
queuename qtype resv/used/tot. load_avg arch states
-------------------------------------------------------------------------------
all.q@scifachpc-c01n03.local BIP 0/0/1 0.00 lx24-amd64
-------------------------------------------------------------------------------
all.q@scifachpc-c01n04.local BIP 0/2/2 0.01 lx24-amd64
7 0.55500 qsub2.sh buckleke r 06/17/2012 12:00:05 1
8 0.55500 qsub3.sh buckleke r 06/17/2012 12:07:05 1
Eventually the newer job stops as normal, yet the qmaster thinks
the old one is still running, even though it has finished
# qstat -f -u \*
queuename qtype resv/used/tot. load_avg arch states
-------------------------------------------------------------------------------
all.q@scifachpc-c01n03.local BIP 0/0/1 0.00 lx24-amd64
-------------------------------------------------------------------------------
all.q@scifachpc-c01n04.local BIP 0/1/2 0.00 lx24-amd64
7 0.55500 qsub2.sh buckleke r 06/17/2012 12:00:05 1
and the Grid Engine knows nothing about it finishing either
# qacct -j 7
error: job id 7 not found
and nor does the user looking for their job
$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
7 0.55500 qsub2.sh buckleke r 06/17/2012 12:00:05 all.q@scifachpc-c01n04.local 1
even though that job has run its course on the node we mangled, with
a pstree there now only showing
+-sge_execd---4*[{sge_execd}]
To get back to the "original" environment, we "softstop" the new execd, although, with no jobs running on it, we could just stop it.
Then modify the execd's conf back to what it was (in this case the defaults, so we can just delete the local config) and start the original execd again.
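Roughly, the restore sequence mirrors the earlier one (a sketch; qconf -dconf drops the host-local configuration so the global defaults apply again):

scifachpc-c01n04# /etc/init.d/sgeexecd.vuwscifachpc01 softstop
scifachpc10# qconf -dconf scifachpc-c01n04
scifachpc-c01n04# /etc/init.d/sgeexecd.vuwscifachpc01 start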
The system now thinks the job that was orphaned finished when it did (after 10 minutes)
qsub_time    Sun Jun 17 11:59:53 2012
start_time   Sun Jun 17 12:00:05 2012
end_time     Sun Jun 17 12:10:05 2012
with the "moved offside" execd's log showing
06/17/2012 12:00:22| main|scifachpc-c01n04|I|controlled shutdown 6.2u5
06/17/2012 12:23:32| main|scifachpc-c01n04|W|local configuration scifachpc-c01n04.local not defined - using global configuration
06/17/2012 12:23:32| main|scifachpc-c01n04|I|starting up GE 6.2u5 (lx24-amd64)
06/17/2012 12:23:32| main|scifachpc-c01n04|W|job 7.1 exceeded soft wallclock time - initiate soft notify method
So the restarted system just went to kill the job anyway, once it became aware of it again.
Meanwhile in the temporary execd's spool messages, we just see things
happen as normal
06/17/2012 12:02:09| main|scifachpc-c01n04|I|starting up GE 6.2u5 (lx24-amd64)
06/17/2012 12:09:20| main|scifachpc-c01n04|W|job 8.1 exceeded soft wallclock time - initiate soft notify method
06/17/2012 12:21:52| main|scifachpc-c01n04|I|controlled shutdown 6.2u5
This will get my user out of a major bind, so thanks to all for the insight and feedback.
Please follow up via the users@gridengine.org mailing list.