qsub-sge.pl -- control processes running on a Linux SGE system
This program submits jobs to a Linux SGE system and controls them while they run. It reads jobs
from an input shell file. One line is the smallest unit of a single job; however, you can also specify the
number of lines that form a single job. Sequential commands should be put
on a single line, separated by semicolons. Any "&" is removed
automatically, wherever it appears. The program terminates when all of its jobs have finished successfully.
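
The grouping rule described above can be sketched as follows. This is a minimal illustration, not the script's own code: the helper name group_jobs and the sample commands are hypothetical, and removing every "&" is an assumption taken from the description.

```perl
use strict;
use warnings;

# Hypothetical helper: group command lines into jobs of $lines_per_job lines
# each, appending the completion mark to every command line.
sub group_jobs {
    my ($lines_per_job, @commands) = @_;
    my (@jobs, @current);
    for my $cmd (@commands) {
        next if $cmd =~ /^\s*$/;    # skip empty lines
        $cmd =~ s/&//g;             # assumption: "&" is removed everywhere
        $cmd =~ s/\s+$//;           # drop trailing whitespace
        $cmd =~ s/;\s*$//;          # drop a trailing ";" to avoid ";;"
        push @current, "$cmd; echo This-Work-is-Completed!";
        if (@current == $lines_per_job) {
            push @jobs, join("\n", @current) . "\n";
            @current = ();
        }
    }
    push @jobs, join("\n", @current) . "\n" if @current;   # last, shorter job
    return @jobs;
}

my @jobs = group_jobs(2, "blast a.fa &", "sort b.txt;", "wc -l c.txt");
print "job ", $_ + 1, ":\n$jobs[$_]" for 0 .. $#jobs;
```

With two lines per job, the three sample commands form two jobs, and the second job holds the single remaining line.
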
If you have many jobs, the efficiency depends on how many CPUs you can get,
and on which queue you have chosen with the --queue option. You can use the --maxjob option to
limit the number of jobs submitted at once, in order to leave some CPUs for other people.
When each job takes a long time, you can use the --interval option to lengthen the interval
between qstat checks, in order to reduce the load on the head node.

SGE can only recognize absolute paths, so you should use absolute paths everywhere.
We have developed several ways to deal with path problems:
(1) The program converts local paths to absolute
paths automatically. If you prefer to write absolute paths yourself, turn this
function off by setting the "--convert no" option.
(2) For a local path, write
"./me.txt" instead of only "me.txt", because "/" is the key mark that distinguishes a path.
(3) If an existing file "me.txt" appears before the redirect character ">",
or a not-yet-created file "out.txt" appears after the redirect character ">",
the program adds the path "./" to the file name automatically. This avoids many
of the problems caused by forgetting to write "./" before a file name.
However, I still advise you to write "./me.txt" instead of just "me.txt"; this is a good habit.
(4) The redirect characters ">" and "2>" must have space characters
both before and after them; this is another good habit.
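
The path rules above can be illustrated with a small sketch. It is not the script's own subroutine: the helper name to_absolute, the base directory, and the file-existence callback are hypothetical, and the callback stands in for a real file test so that the example runs anywhere.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the "./" and absolute-path rules to one command.
# $exists is a callback that says whether a word names an existing file.
sub to_absolute {
    my ($command, $base_dir, $exists) = @_;
    my @words = split /\s+/, $command;
    for my $i (1 .. $#words) {
        next if $words[$i] =~ m{/};    # already marked as a path
        my $after_redirect = $words[$i-1] eq ">" || $words[$i-1] eq "2>";
        $words[$i] = "./$words[$i]" if $after_redirect || $exists->($words[$i]);
    }
    for my $word (@words) {
        # a word that contains "/" but does not start with "/" is a local path
        $word = "$base_dir/$word" if $word =~ m{/} && $word !~ m{^/};
    }
    return join " ", @words;
}

my $cmd = to_absolute("perl ./run.pl me.txt -n 5 > out.txt",
                      "/home/user/proj", sub { $_[0] eq "me.txt" });
print "$cmd\n";
```

Note that the redirect target is only recognized because ">" is surrounded by spaces, which is why rule (4) matters.
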
Several mechanisms make sure that all the jobs have really finished:
(1) An automatic job completion mark "This-Work-is-Completed!" is appended to the end of each job and checked after the job finishes.
(2) The program checks for "GLIBCXX_3.4.9 not found" to make sure that the C/C++ library on the computing nodes is in a good state.
(3) The "--secure" option lets users define their own job completion mark. You can print a mark
(for example, "my job complete") to STDERR at the end of your program, and set --secure "my job complete" for
this program. Do this when you are not sure whether there is a bug in your program.
(4) The "--reqsub" option resubmits the unfinished jobs automatically, until all the jobs are
really finished. This option is off by default; set it explicitly when needed. The maximum
number of reqsub cycles allowed is 10000.
(5) The program detects dead computing nodes automatically.
(6) The program checks for "iprscan: failed" for iprscan.
(7) The program checks the queue status; only "r", "t", and "qw" are considered correct.
(8) The program checks for "failed receiving gdi request".
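
The output checks in items (1), (2), (3), and (6) can be sketched as one function over a job's STDOUT and STDERR text. The helper name job_problems and the problem labels are hypothetical; the real script reads the job output files and records failures in its log instead.

```perl
use strict;
use warnings;

# Hypothetical helper: return the list of reasons why a job looks unfinished.
sub job_problems {
    my ($stdout, $stderr, $secure_mark) = @_;
    my @problems;
    push @problems, "completion mark missing"
        if $stdout !~ /This-Work-is-Completed!/;
    push @problems, "GLIBCXX_3.4.9 not found"
        if $stderr =~ /GLIBCXX_3\.4\.9/ && $stderr =~ /not found/;
    push @problems, "iprscan: failed"
        if $stderr =~ /iprscan: failed/;
    # the user mark is optional and is looked for in STDERR as a literal string
    push @problems, "user mark missing"
        if defined $secure_mark && $stderr !~ /\Q$secure_mark\E/;
    return @problems;
}

my @problems = job_problems("partial output\n", "iprscan: failed\n", "my job complete");
print "problem: $_\n" for @problems;
```

An empty list means the job passed every check; any entry would make the job a candidate for resubmission under --reqsub.
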
Normally, the result of this program contains 3 parts (note that the number 24137 is the process ID of this program):
(1) work.sh.24137.globle stores the shell script after conversion to global paths.
(2) work.sh.24137.qsub stores the intermediate files, such as the job scripts, job STDOUT results, and job STDERR results.
(3) work.sh.24137.log stores the list of failed jobs, which have been submitted more than once.

I advise you to always use the --reqsub option and to check the .log file after this program has finished. If you find "All jobs finished!",
then all the jobs have been completed. The other records list the jobs that failed in each submission cycle, but
don't worry, they have also been completed if you used the --reqsub option.
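
Checking the .log file can be automated with a few lines. This is a minimal sketch under the assumption that failure records end with "may be unfinished" and that success is recorded as the line "All jobs finished!"; the helper name summarize_log and the sample record are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether the log contains the final success line
# and how many failure records were written along the way.
sub summarize_log {
    my ($log_text) = @_;
    my $finished = $log_text =~ /^All jobs finished!$/m ? 1 : 0;
    my @failures = grep { /may be unfinished/ } split /\n/, $log_text;
    return ($finished, scalar @failures);
}

my $log = join "\n",
    'In qsub cycle 1, In work_00001.sh.o123, "This-Work-is-Completed!" is not found, so this work may be unfinished',
    'All jobs finished!',
    '';
my ($finished, $failed) = summarize_log($log);
print "finished=$finished, failure records=$failed\n";
```

A log with failure records but a final success line corresponds to jobs that failed in an early cycle and completed after resubmission.
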
For the resource requirement, the --resource option is set to vf=1.2G by default, which limits the total
memory of one job to 1.2G. With vf=1.9G, for example, you can run 8 jobs on one computing node, because the
total memory of one computing node is 15.5G. If your job exceeds the maximum memory allowed,
it will be killed. For large jobs, you must specify the --resource option manually; it
has the same format as the "qsub -l" option. If you have many small jobs and want them to run faster, you
should also specify a smaller memory requirement, so that more jobs run at the same time. The key
point is that you should always consider the memory usage of your program in order to improve efficiency.
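
The memory arithmetic above can be checked with a short sketch. The helper name jobs_per_node is hypothetical, and the calculation assumes that memory is the only limit on how many jobs a node accepts; the 15.5G node size and the vf values are the ones mentioned in this documentation.

```perl
use strict;
use warnings;

# Hypothetical helper: how many jobs with the given "vf=" request fit into
# a node with $node_mem_gb gigabytes of memory.
sub jobs_per_node {
    my ($resource, $node_mem_gb) = @_;
    my ($value, $unit) = $resource =~ /vf=([\d.]+)([GM])/i
        or die "cannot parse resource string '$resource'\n";
    my $job_mem_gb = uc($unit) eq "G" ? $value : $value / 1024;
    return int($node_mem_gb / $job_mem_gb);
}

for my $resource ("vf=1.9G", "vf=1.2G") {
    printf "%s allows %d jobs on a 15.5G node\n",
        $resource, jobs_per_node($resource, 15.5);
}
```

A smaller vf value raises the number of jobs per node, but a job that uses more memory than it requested is killed, so the request should stay above the real memory usage.
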
Author: Fan Wei, fanw@genomics.org.cn
Author: Hu Yujie, huyj@genomics.org.cn
Version: 8.3, Date: 2010-2-16
  perl qsub-sge.pl <jobs.txt>
    --global            only output the global shell, but do not execute
    --queue <str>       specify the queue to use, default all available queues
    --interval <num>    set interval time of checking by qstat, default 30 seconds
    --lines <num>       set number of lines to form a job, default 1
    --maxjob <num>      set the maximum number of jobs submitted at once, default 30
    --convert <yes/no>  convert local path to absolute path, default yes
    --secure <mark>     set the user defined job completion mark, default not needed
    --reqsub            resubmit the unfinished jobs until they are finished, default no
    --resource <str>    set the required resource used in qsub -l option, default vf=1.2G
    --jobprefix <str>   set the prefix tag for submitted jobs, default work
    --verbose           output verbose information to screen
    --help              output help information to screen
  1. work with default options (the simplest way)
  perl qsub-sge.pl ./work.sh

  2. work with user specified options (to select the queue, set the checking interval, set the number of lines in each job, and set the maximum number of running jobs)
  perl qsub-sge.pl --queue all.q --interval 1 --lines 3 --maxjob 10 ./work.sh

  3. do not convert paths because they are already absolute (note that errors may happen when local paths are converted to absolute paths automatically)
  perl qsub-sge.pl --convert no ./work.sh

  4. add a user defined job completion mark (this makes sure that your program has executed to its last statement)
  perl qsub-sge.pl --interval 1 --secure "my job finish" ./work.sh

  5. resubmit the unfinished jobs until all jobs are really completed (the maximum allowed number of reqsub cycles is 10000)
  perl qsub-sge.pl --reqsub ./work.sh

  6. work with user defined memory usage
  perl qsub-sge.pl --resource vf=1.9G ./work.sh

  7. recommended combination of options for common applications (I think this will suit 99% of your work)
  perl qsub-sge.pl --queue all.q --resource vf=1.9G --maxjob 10 --reqsub ./work.sh
use FindBin qw($Bin $Script);
use File::Basename qw(basename dirname);
##get options from command line into variables and set default values
my ($Global, $Queue, $Interval, $Lines, $Maxjob, $Convert, $Secure, $Reqsub, $Resource, $Job_prefix, $Verbose, $Help);
	"maxjob:i"=>\$Maxjob,
	"interval:i"=>\$Interval,
	"convert:s"=>\$Convert,
	"secure:s"=>\$Secure,
	"resource:s"=>\$Resource,
	"jobprefix:s"=>\$Job_prefix,
	"verbose"=>\$Verbose,
##$Queue ||= "all.q";
$Resource ||= "vf=1.2G";
$Job_prefix ||= "work";
die `pod2text $0` if (@ARGV == 0 || $Help);
my $work_shell_file = shift;
my $work_shell_file_globle = $work_shell_file.".$$.globle";
my $work_shell_file_error = $work_shell_file.".$$.log";
my $Work_dir = $work_shell_file.".$$.qsub";
my $current_dir = `pwd`; chomp $current_dir;
if ($Convert =~ /y/i) {
	absolute_path($work_shell_file,$work_shell_file_globle);
	$work_shell_file_globle = $work_shell_file;
if (defined $Global) {
## read from input file, make the qsub shell files
my $Job_mark="00001";
my @Shell;  ## store the file names of the qsub shell files
## "or" is used instead of "||" so that a failed open really triggers die
open IN, $work_shell_file_globle or die "failed to open $work_shell_file_globle";
	if ($line_mark % $Lines == 0) {
		open OUT, ">$Work_dir/$Job_prefix\_$Job_mark.sh" or die "failed to create $Job_prefix\_$Job_mark.sh";
		# open OUT,">$Job_prefix\_$Job_mark.sh" || die "failed creat $Job_prefix\_$Job_mark.sh";
		push @Shell,"$Job_prefix\_$Job_mark.sh";
	s/;\s*$//;  ##delete the last character ";", because two ";;" characters will cause error in qsub
	print OUT $_."; echo This-Work-is-Completed!\n";
	if ($line_mark % $Lines == $Lines - 1) {
print STDERR "make the qsub shell files done\n" if($Verbose);
## run jobs by qsub, until all the jobs are really finished
	## submit jobs by qsub
	##we treat the jobs on dead nodes as unfinished jobs
	my %Alljob; ## store all the job IDs of this cycle
	my %Runjob; ## store the real running job IDs of this cycle
	my %Error;  ## store the unfinished jobs of this cycle
	chdir($Work_dir); ##enter the qsub working directory
	my $job_cmd = "qsub -cwd -S /bin/sh ";  ## -l h_vmem=16G,s_core=8
	$job_cmd .= "-q $Queue " if(defined $Queue); ##set queue
	$job_cmd .= "-l $Resource " if(defined $Resource); ##set resource
	for (my $i=0; $i<@Shell; $i++) {
			my $run_num = run_count(\%Alljob,\%Runjob);
			if ($i < $Maxjob || ($run_num != -1 && $run_num < $Maxjob) ) {
				my $job_return = `$job_cmd $Shell[$i]`;
				my ($job_id) = $job_return =~ /Your job (\d+)/; ##capture the job id from the qsub reply
				$Alljob{$job_id} = $Shell[$i];  ## job id => shell file name
				print STDERR "submitted job $job_id in the $qsub_cycle cycle\n" if($Verbose);
				print STDERR "wait for submitting next job in the $qsub_cycle cycle\n" if($Verbose);
	chdir($current_dir); ##return to the original directory
	###waiting for all jobs to finish
		my $run_num = run_count(\%Alljob,\%Runjob);
		last if($run_num == 0);
		print STDERR "There are $run_num jobs still running in the $qsub_cycle cycle\n" if(defined $Verbose);
	print STDERR "All jobs finished in the $qsub_cycle cycle\n" if($Verbose);
	##run the secure mechanism to make sure all the jobs are really completed
	open OUT, ">>$work_shell_file_error" or die "failed to create $work_shell_file_error";
	chdir($Work_dir); ##enter the qsub working directory
	foreach my $job_id (sort keys %Alljob) {
		my $shell_file = $Alljob{$job_id};
		if (-f "$shell_file.o$job_id") {
			#open IN,"$shell_file.o$job_id" || warn "fail $shell_file.o$job_id";
			#$content = join("",<IN>);
			$content = `tail -n 1000 $shell_file.o$job_id`;
			##check whether the job has been killed during running time
			if ($content !~ /This-Work-is-Completed!/) {
				$Error{$job_id} = $shell_file;
				print OUT "In qsub cycle $qsub_cycle, In $shell_file.o$job_id, \"This-Work-is-Completed!\" is not found, so this work may be unfinished\n";
		if (-f "$shell_file.e$job_id") {
			#open IN,"$shell_file.e$job_id" || warn "fail $shell_file.e$job_id";
			#$content = join("",<IN>);
			$content = `tail -n 1000 $shell_file.e$job_id`;
			##check whether the C/C++ library is in good state
			if ($content =~ /GLIBCXX_3\.4\.9/ && $content =~ /not found/) {
				$Error{$job_id} = $shell_file;
				print OUT "In qsub cycle $qsub_cycle, In $shell_file.e$job_id, GLIBCXX_3.4.9 not found, so this work may be unfinished\n";
			##check whether iprscan is in good state
			if ($content =~ /iprscan: failed/) {
				$Error{$job_id} = $shell_file;
				print OUT "In qsub cycle $qsub_cycle, In $shell_file.e$job_id, iprscan: failed , so this work may be unfinished\n";
			##check the user defined job completion mark (matched as a literal string in the STDERR file)
			if (defined $Secure && $content !~ /\Q$Secure\E/) {
				$Error{$job_id} = $shell_file;
				print OUT "In qsub cycle $qsub_cycle, In $shell_file.e$job_id, \"$Secure\" is not found, so this work may be unfinished\n";
	##make @Shell for next cycle, which contains unfinished tasks
	foreach my $job_id (sort keys %Error) {
		my $shell_file = $Error{$job_id};
		push @Shell,$shell_file;
	if($qsub_cycle > 10000){
		print OUT "\n\nProgram stopped because the reqsub cycle number has reached 10000, the following jobs unfinished:\n";
		foreach my $job_id (sort keys %Error) {
			my $shell_file = $Error{$job_id};
			print OUT $shell_file."\n";
		print OUT "Please check carefully for what errors happen, and redo the work, good luck!";
		die "\nProgram stopped because the reqsub cycle number has reached 10000\n";
	print OUT "All jobs finished!\n" unless(@Shell);
	chdir($current_dir); ##return to the original directory
	print STDERR "The secure mechanism is performed in the $qsub_cycle cycle\n" if($Verbose);
	last unless(defined $Reqsub);
print STDERR "\nqsub-sge.pl finished\n" if($Verbose);
`rm -r $work_shell_file_globle $Work_dir`;
####################################################
################### Sub Routines ###################
####################################################
	my($in_file,$out_file)=@_;
	my($current_path,$shell_absolute_path);
	#get the current path ;
	#get the absolute path of the input shell file;
	if ($in_file=~/([^\/]+)$/) {
		my $shell_local_path=$`;
		if ($in_file=~/^\//) {
			$shell_absolute_path = $shell_local_path;
		else{$shell_absolute_path="$current_path"."/"."$shell_local_path";}
	#change all the local paths of programs in the input shell file;
	open (IN,"$in_file") || die "failed to open $in_file";
	open (OUT,">$out_file") || die "failed to create $out_file";
		##s/>/> /; ##convert ">out.txt" to "> out.txt"
		##s/2>/2> /; ##convert "2>out.txt" to "2> out.txt"
		my @words=split /\s+/, $_;
		##improve the command, add "./" automatically
		for (my $i=1; $i<@words; $i++) {
			if ($words[$i] !~ /\//) {
					$words[$i] = "./$words[$i]";
				}elsif($words[$i-1] eq ">" || $words[$i-1] eq "2>"){
					$words[$i] = "./$words[$i]";
		##prefix every local path (contains "/" but does not start with "/") with the absolute path
		for (my $i=0;$i<@words ;$i++) {
			if (($words[$i]!~/^\//) && ($words[$i]=~/\//)) {
				$words[$i]= "$shell_absolute_path"."$words[$i]";
		print OUT join(" ", @words), "\n";
##get the IDs and count the number of running jobs
##the all-job list and user id are used to make sure that the job id belongs to this program
##also detects jobs on dead computing nodes
	my $user = `whoami`; chomp $user;
	my $qstat_result = `qstat -u $user`;
	if ($qstat_result =~ /failed receiving gdi request/) {
		return $run_num; ##the system is not responding
	my @jobs = split /\n/,$qstat_result;
	foreach my $job_line (@jobs) {
		$job_line =~s/^\s+//;
		my @job_field = split /\s+/,$job_line;
		next if($job_field[3] ne $user);
		if (exists $all_p->{$job_field[0]}){
			died_nodes(\%died); ##the compute node is down; sometimes a node is dead but still shows a normal state
			my ($node_name) = $job_field[7] =~ /(compute-\d+-\d+)/;
			if ( !exists $died{$node_name} && ($job_field[4] eq "qw" || $job_field[4] eq "r" || $job_field[4] eq "t") ) {
				$run_p->{$job_field[0]} = $job_field[2]; ##job id => shell file name
				`qdel $job_field[0]`;
	return $run_num; ##the jobs in the qstat result that are in a normal running state, excluding zombie jobs on dead nodes
	##HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
	##compute-0-24 lx26-amd64 8 - 15.6G - 996.2M -
	my @lines = split /\n/,`qhost`;
	shift @lines; shift @lines; shift @lines;  ##remove the first three title lines
		my $node_name = $t[0];
		my $memory_use = $t[5];
		$died_p->{$node_name} = 1 if($t[3]=~/-/ || $t[4]=~/-/ || $t[5]=~/-/ || $t[6]=~/-/ || $t[7]=~/-/);