winsup/cygwin/DevDocs/CgfNotes.OLD

   1 2014-04-26  cgf-000026
   2
   3 Forgot to clear to the end of screen when restoring a screen buffer.
   4 That worked, for some reason, with Take Command but not with normal
   5 consoles.  I don't remember why I didn't resize the screen like a Linux
   6 X terminal emulator but that might have made things work a little
   7 better.  Right now, there is a scroll bar for apps like less or vi and
   8 that doesn't feel right.
   9
  10 2014-03-29  cgf-000025
  11
  12 Reorganized _cygtls::signal_debugger to avoid sending anything to the
  13 debugger if we've seen an exception.  I think it used to work that way
  14 and I changed it without noting why.  It sure seems like, if we don't do
  15 this, gdb will see two signals (and it really does) when there has been
  16 a Windows-recognized exception.
  17
  18 2014-02-15  cgf-000024
  19
  20 Wow.  It's hard getting the screen handling stuff working correctly when
  21 there is a screen buffer larger than screen size and vice versa.  These
  22 changes attempt to use SetConsoleWindowInfo whenever possible so that
  23 the contents of the screen buffer are never wiped out.  They also fix
  24 some previously misbehaving "scroll the screen" commands.
  25
  26 2013-06-07  cgf-000023
  27
  28 Given the fact that the signal thread never exits there is no need
  29 for exit_thread to ever block.  So, nuke this code.
  30
  31 2013-01-31  cgf-000022
  32
  33 While researching the lftp behavior reported here:
  34
  35 http://cygwin.com/ml/cygwin/2013-01/msg00390.html
  36
  37 after a frenzy of rewriting sigflush handling to avoid blocking in the
  38 signal thread (which is now and should always have been illegal), it
  39 dawned on me that we're not supposed to be flushing the tty input buffer
  40 every time a signal is received.  We're supposed to do this only when
  41 the user hits a character (e.g., CTRL-C) which initiates a signal
  42 action.  So, I removed sigflush from sigpacket::process and moved it to
  43 tc ()->kill_pgrp ().  This function should only be called to send
  44 signals related to the tty so this should have the desired effect.
  45
  46 2013-01-11  cgf-000021
  47
  48 Apparently I got the signal handling semantics of select() wrong again
  49 even though I would have sworn that I tested this on Linux and Windows.
  50
  51 select() is apparently *always* interrupted by a signal and *never*
  52 restarts.  Hopefully, between the comment added to the code and this
  53 note, I'll not make this mistake again.
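
A minimal sketch of how the behavior above can be observed on a POSIX
system.  The function and handler names are made up for illustration.
The handler is installed with SA_RESTART on purpose: on Linux, select()
is documented as never being restarted after a signal handler runs, so
the call is still expected to fail with EINTR.

```cpp
#include <cassert>
#include <cerrno>
#include <signal.h>
#include <sys/select.h>
#include <sys/time.h>

// Empty handler: its only job is to make SIGALRM a "caught" signal.
static void on_alarm(int) {}

// Returns the errno left by select() after SIGALRM interrupts it,
// or 0 if select() simply timed out.
int select_errno_after_signal()
{
  struct sigaction sa = {};
  sa.sa_handler = on_alarm;
  sa.sa_flags = SA_RESTART;   // request restart; select() does not honor it
  sigemptyset(&sa.sa_mask);
  int rc = sigaction(SIGALRM, &sa, nullptr);
  assert(rc == 0);

  // Fire SIGALRM once, 50ms from now.
  struct itimerval timer = {};
  timer.it_value.tv_usec = 50000;
  rc = setitimer(ITIMER_REAL, &timer, nullptr);
  assert(rc == 0);

  // Block in select() for up to 2 seconds with no descriptors at all.
  struct timeval timeout = {2, 0};
  rc = select(0, nullptr, nullptr, nullptr, &timeout);
  return rc == -1 ? errno : 0;
}
```

A caller that wants "restart" semantics has to loop on EINTR itself and
recompute the remaining timeout.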
  54
  55 2013-01-02  cgf-000020
  56
  57 (This entry should have been checked in with the changes but... I forgot)
  58
  59 This is a fairly big revamp of the way that Windows signals are handled.
  60 The intent is that all signal decisions should be made by the signal
  61 thread; not by the exception handler.
  62
  63 This required the ability to pass information from the exception handler
  64 to the signal thread, so a si_cyg field was added to siginfo_t.  This
  65 contains information needed to generate a "core dump".  Hmm.  Haven't
  66 checked to see if this breaks Cygwin's hardly-ever-used real core dump
  67 facility.
  68
  69 Anyway, I moved signal_exit back into exceptions.cc and removed it from
  70 the sigpacket class.  This function is now treated like a signal handler
  71 function - Cygwin will cause it to be dispatched in the context of
  72 whatever thread caught the signal.  signal_exit also makes the
  73 determination about when to write a stackdump.
  74
  75 The signal-handler thread will no longer ever attempt to exit.  It will
  76 just keep processing signals (it will not process real signals after
  77 Cygwin starts shutting down, however).  This should make it impossible
  78 for the signal thread to ever block waiting for the process lock since
  79 it now never grabs the process lock.  The signal-handler thread will
  80 notify gdb when it gets a signal now but, in theory, gdb should see the
  81 context of the thread which received the signal, not the signal-handler
  82 thread.
  83
  84 2012-12-28  cgf-000019
  85
  86 (I forgot to mention that cgf-000018 was reverted.  Although I never saw
  87 a hang from this, I couldn't convince myself that one wasn't possible.)
  88
  89 This fix attempts to correct a deadlock where, when a true Windows
  90 signal arrives, Windows creates a thread which "does stuff" and attempts
  91 to exit.  In the process of exiting Cygwin grabs the process lock.  If
  92 the signal thread has seen the signal and wants to exit, it can't
  93 because the newly-created thread now holds it.  But, since the new
  94 thread is relying on the signal thread to release its process lock,
  95 it exits and the process lock is never released.
  96
  97 To fix this, I removed calls to _cygtls::signal_exit in favor of
  98 flagging that we were exiting by setting signal_exit_code (almost forgot
  99 to mark that NO_COPY: that would have been fun).  The new function
 100 setup_signal_exit() now handles setting things up so that the ReadFile loop
 101 in wait_sig will do the right thing when it terminates.  This function
 102 may just Sleep indefinitely if a signal is being sent from a thread
 103 other than the signal thread.  wait_sig() was changed so that it will
 104 essentially drop into asynchronous-read-mode when a signal which exits
 105 has been detected.  The ReadFile loop is exited when we know that the
 106 process is supposed to be exiting and there is nothing else in the
 107 signal queue.
 108
 109 Although I never actually saw this happen, exit_thread() was also
 110 changed to release the process lock and just sleep indefinitely if it is
 111 detected that we are exiting.
 112
 113 2012-12-21  cgf-000018
 114
 115 Re: cgf-000017
 116
 117 It occurred to me that just getting the process lock during
 118 DLL_THREAD_DETACH in dll_entry() might be adequate to fix this
 119 problem.  It's certainly much less intrusive.
 120
 121 There are potential deadlock problems with grabbing a lock in
 122 this code, though, so this check-in will be experimental.
 123
 124 2012-12-21  cgf-000017
 125
 126 The changes in this set are to work around the issue noted here:
 127
 128 http://cygwin.com/ml/cygwin/2012-12/threads.html#00140
 129
 130 The problem is, apparently, that the return value of an ExitThread()
 131 will take precedence over the return value of TerminateProcess/ExitProcess
 132 if the thread is the last one exiting.  That's rather amazing...
 133
 134 For the fix, I replaced all calls to ExitThread with exit_thread().  The
 135 exit_thread function creates a handle to the current thread and sends
 136 it in a packet via sig_send(__SIGTHREADEXIT).  Then it acquires the
 137 process lock and calls ExitThread.
 138
 139 wait_sig will then wait for the handle, indicating that the thread has
 140 exited, and, when that has happened, removes the process lock on behalf
 141 of the now-defunct thread.  wait_sig will now also avoid actually
 142 exiting since it could trigger the same problem.
 143
 144 Holding process_lock should prevent threads from exiting while a Cygwin
 145 process is shutting down.  They will just block forever in that case -
 146 just like wait_sig.
 147
 148 2012-08-17  cgf-000016
 149
 150 While debugging another problem I finally noticed that
 151 sigpacket::process was unconditionally calling tls->set_siginfo prior to
 152 calling setup_handler even though setup_handler could fail.  In the
 153 event of two successive signals, that would cause the second signal's
 154 info to overwrite the first even though the signal handler for the first
 155 would eventually be called.  Doh.
 156
 157 Fixing this required passing the sigpacket si field into setup_handler.
 158 Making setup_handler part of the sigpacket class seemed to make a lot of
 159 sense so that's what I did.  Then I passed the si element into
 160 interrupt_setup so that the infodata structure could be filled out prior
 161 to arming the signal.
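
The ordering problem described above can be shown with a toy model.
Everything here is hypothetical (toy_tls, deliver_old, deliver_new are
not Cygwin names); it only illustrates why the signal info must be
stored after, not before, the point where arming can fail.

```cpp
#include <cassert>

// Toy stand-in for per-thread signal state.
struct toy_tls
{
  int armed_sig = 0;   // nonzero while a handler is armed but has not run
  int info = 0;        // stands in for the siginfo the handler will see
};

// Old ordering: info is stored even when the handler cannot be armed,
// so a second signal clobbers the info belonging to the first.
bool deliver_old(toy_tls &tls, int sig, int info)
{
  tls.info = info;
  if (tls.armed_sig)
    return false;
  tls.armed_sig = sig;
  return true;
}

// New ordering: info is stored only once arming is known to succeed.
bool deliver_new(toy_tls &tls, int sig, int info)
{
  if (tls.armed_sig)
    return false;
  tls.info = info;
  tls.armed_sig = sig;
  return true;
}
```

With two back-to-back signals, deliver_old leaves the second signal's
info in place for the first signal's handler; deliver_new keeps the
first signal's info intact.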
 162
 163 The other changes checked in here eliminate the ResetEvent for
 164 signal_arrived since previous changes to cygwait should handle the
 165 case of spurious signal_arrived detection.  Since signal_arrived is
 166 not a manual-reset event, we really should just let the appropriate
 167 WFMO handle it.  Otherwise, there is a race where a signal comes in
 168 a "split second" after WFMO responds to some other event.  Resetting
 169 the signal_arrived would cause any subsequent WFMO to never be
 170 triggered.  My current theory is that this is what is causing:
 171
 172 http://cygwin.com/ml/cygwin/2012-08/msg00310.html
 173
 174 2012-08-15  cgf-000015
 175
 176 RIP cancelable_wait.  Yay.
 177
 178 2012-08-09  cgf-000014
 179
 180 So, apparently I got it somewhat right before wrt signal handling.
 181 Checking on Linux, it appears that signals will be sent to a thread
 182 which can accept the signal.  So resurrecting and extending the
 183 "find_tls" function is in order.  This function will return the tls
 184 of any thread which 1) is waiting for a signal with sigwait*() or
 185 2) has the signal unmasked.
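
A rough sketch of the selection rule just described, using a made-up
thread record rather than the real tls structures.  The names
toy_thread, bit and find_thread are assumptions for illustration, as is
the choice to return the first match in list order.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-thread record; one bit per signal number.
struct toy_thread
{
  int id;
  std::uint64_t blocked;     // bit n set: signal n is masked
  std::uint64_t sigwaiting;  // bit n set: thread is in sigwait*() for n
};

inline std::uint64_t bit(int sig) { return std::uint64_t{1} << sig; }

// Returns the first thread that can accept sig, or nullptr if none can.
// A thread qualifies if it is sigwait*()ing for sig or has sig unmasked.
const toy_thread *find_thread(const std::vector<toy_thread> &threads, int sig)
{
  assert(sig > 0 && sig < 64);
  for (const toy_thread &t : threads)
    if ((t.sigwaiting & bit(sig)) || !(t.blocked & bit(sig)))
      return &t;
  return nullptr;
}
```

A thread that masks a signal but is sitting in sigwait*() for it still
qualifies, which is why both conditions are checked.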
 186
 187 In redoing this it became obvious that I had the class designation wrong
 188 for the threadlist handling so I moved the manipulation of the global
 189 threadlist into the cygheap where it logically belongs.
 190
 191 2012-07-21  cgf-000013
 192
 193 These changes reflect a revamp of the "wait for signal" functionality
 194 which has existed in Cygwin through several signal massages.
 195
 196 We now create a signal event only when a thread is waiting for a signal
 197 and arm it only for that thread.  The "set_signal_arrived" function is
 198 used to establish the event and set it in a location referencable by
 199 the caller.
 200
 201 I still do not handle all of the race conditions.  What happens when
 202 a signal comes in just after a WF?O succeeds for some other purpose?  I
 203 suspect that it will arm the next WF?O call and the subsequent call to
 204 call_signal_handler could cause a function to get an EINTR when possibly
 205 it shouldn't have.
 206
 207 I haven't yet checked all of the test cases for the URL listed in the
 208 previous entry.
 209
 210 Baby steps.
 211
 212 2012-06-12  cgf-000012
 213
 214 These changes are the preliminary for redoing the way threads wait for
 215 signals.  The problems are shown by the test case mentioned here:
 216
 217 http://cygwin.com/ml/cygwin/2012-05/msg00434.html
 218
 219 I've known that the signal handling in threads wasn't quite right for
 220 some time.  I lost all of my thread signal tests in the great "rm -r"
 221 debacle of a few years ago and have been less than enthusiastic about
 222 redoing everything (I had PCTS tests and everything).  But it really is
 223 time to redo this signal handling to make it more like it is supposed to
 224 be.
 225
 226 This change should not introduce any new behavior.  Things should
 227 continue to behave as before.  The major differences are a change in the
 228 arguments to cancelable_wait and cygwait now uses cancelable_wait and,
 229 so, the returns from cygwait now mirror cancelable_wait.
 230
 231 The next change will consolidate cygwait and cancelable_wait into one
 232 cygwait function.
 233
 234 2012-06-02  cgf-000011
 235
 236 The refcnt handling was tricky to get right but I had convinced myself
 237 that the refcnts were always incremented/decremented under a lock.
 238 Corinna's 2012-05-23 change to refcnt exposed a potential problem with
 239 dup handling where the fdtab could be updated while not locked.
 240
 241 That should be fixed by this change but, on closer examination, it seems
 242 like there are many places where it is possible for the refcnt to be
 243 updated while the fdtab is not locked since the default for
 244 cygheap_fdget is to not lock the fdtab (and that should be the default -
 245 you can't have read holding a lock).
 246
 247 Since refcnt was only ever called with 1 or -1, I broke it up into two
 248 functions but kept the Interlocked* operation.  Incrementing a variable
 249 should not be as racy as adding an arbitrary number to it but we have
 250 InterlockedIncrement/InterlockedDecrement for a reason so I kept the
 251 Interlocked operation here.
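
As a portable illustration of the split described above, here is a
sketch that uses std::atomic in place of the Windows
InterlockedIncrement/InterlockedDecrement calls.  The class name
toy_refcnt and its members are invented for this example and do not
reflect the actual fhandler code.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a file handle's reference count.
class toy_refcnt
{
  std::atomic<long> count{0};
public:
  // Each call is a single atomic read-modify-write and returns the
  // new value, so no separate lock is needed for the counter itself.
  long inc() { return ++count; }
  long dec()
  {
    long now = --count;
    assert(now >= 0);   // going negative means inc/dec ordering is wrong
    return now;
  }
};
```

Atomicity of the counter says nothing about whether the object it
guards is torn down at the right time; that still depends on the order
in which callers increment and decrement.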
 252
 253 In the meantime, I'll be mulling over whether the refcnt operations are
 254 actually safe as they are.  Maybe just ensuring that they are atomically
 255 updated is enough since they control the destruction of an fh.  If I got
 256 the ordering right with incrementing and decrementing then that should
 257 be adequate.
 258
 259 2012-06-02  cgf-000010
 260
 261 <1.7.16>
 262 - Fix emacs problem which exposed an issue with Cygwin's select() function.
 263   If a signal arrives while select is blocking and the program longjmps
 264   out of the signal handler then threads and memory may be left hanging.
 265   Fixes: http://cygwin.com/ml/cygwin/2012-05/threads.html#00275
 266 </1.7.16>
 267
 268 This was try #4 or #5 to get select() signal handling working right.
 269 It's still not there but it should now at least not leak memory or
 270 threads.
 271
 272 I mucked with the interface between cygwin_select and select_stuff::wait
 273 so that the "new" loop in select_stuff::wait() was essentially moved
 274 into the caller.  cygwin_select now uses various enum states to decide
 275 what to do.  It builds the select linked list at the beginning of the
 276 loop, allowing wait() to tear everything down and restart.  This is
 277 necessary before calling a signal handler because the signal handler may
 278 longjmp away.
 279
 280 I initially had this all coded up to use a special signal_cleanup
 281 callback which could be called when a longjmp is called in a signal
 282 handler.  And cygwin_select() set up and tore down this callback.  Once
 283 I got everything compiling, it of course dawned on me that just because
 284 you call a longjmp in a signal handler it doesn't mean that you are
 285 jumping *out* of the signal handler.  So, if the signal handler invokes
 286 the callback and returns it will be very bad for select().  Hence, this
 287 slower, but hopefully more correct implementation.
 288
 289 (I still wonder if some sort of signal cleanup callback might still
 290 be useful in the future)
 291
 292 TODO: I need to do an audit of other places where this problem could be
 293 occurring.
 294
 295 As alluded to above, select's signal handling is still not right.  It
 296 still acts as if it could call a signal handler from something other
 297 than the main thread but, AFAICT, from my STC, this doesn't seem to be
 298 the case.  It might be worthwhile to extend cygwait to just magically
 299 figure this out and not even bother using w4[0] for scenarios like this.
 300
 301 2012-05-16  cgf-000009
 302
 303 <1.7.16>
 304 - Fix broken console mouse handling.  Reported here:
 305   http://cygwin.com/ml/cygwin/2012-05/msg00360.html
 306 </1.7.16>
 307
 308 I did a cvs annotate on smallprint.cc and see that the code to translate
 309 %characters > 127 to 0x notation was in the 1.1 revision.  Then I
 310 checked the smallprint.c predecessor.  It was in the 1.1 version of that
 311 program too, which means that this odd change has probably been around
 312 since <= 2000.
 313
 314 Since __small_sprintf is supposed to emulate sprintf, I got rid of the
 315 special case handling.  This may affect fhandler_socket::bind.  If so, we
 316 should work around this problem there rather than keeping this strange
 317 hack in __small_sprintf.
 318
 319 2012-05-14  cgf-000008
 320
 321 <1.7.16>
 322 - Fix hang when zero bytes are written to a pty using
 323   Windows WriteFile or equivalent.  Fixes:
 324   http://cygwin.com/ml/cygwin/2012-05/msg00323.html
 325 </1.7.16>
 326
 327 cgf-000002, as usual, fixed one thing while breaking another.  See
 328 Larry's predicament in: http://goo.gl/oGEr2 .
 329
 330 The problem is that zero byte writes to the pty pipe caused the dread
 331 end-of-the-world-as-we-know-it problem reported on the mailing list
 332 where ReadFile reads zero bytes even though there is still more to read
 333 on the pipe.  This is because that change caused a 'record' to be read
 334 and a record can be zero bytes.
 335
 336 I was never really keen about using a throwaway buffer just to get a
 337 count of the number of characters available to be read in the pty pipe.
 338 On closer reading of the documentation for PeekNamedPipe it seemed like
 339 the sixth argument to PeekNamedPipe should return what I needed without
 340 using a buffer.  And, amazingly, it did, except that the problem still
 341 remained - a zero byte message still screwed things up.
 342
 343 So, we now detect the case where there are zero bytes available as a
 344 message but there are bytes available in the pipe.  In that scenario,
 345 return the bytes available in the pipe rather than the message length of
 346 zero.  This could conceivably cause problems with pty pipe handling in
 347 this scenario but since the only way this scenario could possibly happen
 348 is when someone is writing zero bytes using WriteFile to a pty pipe, I'm
 349 ok with that.
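
The decision described above boils down to a small rule.  The sketch
below assumes the two counts have already been obtained (for example
from PeekNamedPipe's total-bytes-available and bytes-left-in-this-message
outputs); bytes_to_report is a hypothetical name, not the function in
the Cygwin source.

```cpp
#include <cassert>

// total_in_pipe:   total bytes currently available in the pipe
// left_in_message: bytes left in the message at the front of the pipe
unsigned long bytes_to_report(unsigned long total_in_pipe,
                              unsigned long left_in_message)
{
  // A zero-byte message sitting in front of real data must not be
  // reported as "nothing to read".
  if (left_in_message == 0 && total_in_pipe > 0)
    return total_in_pipe;
  return left_in_message;
}
```

In the normal case the message length wins, which preserves the
line-at-a-time behavior; the pipe total is only used as a fallback.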
 350
 351 2012-05-14  cgf-000007
 352
 353 <1.7.16>
 354 - Fix invocation of strace from a cygwin process.  Fixes:
 355   http://cygwin.com/ml/cygwin/2012-05/msg00292.html
 356 </1.7.16>
 357
 358 The change in cgf-000004 introduced a problem for processes which load
 359 cygwin1.dll dynamically.  strace.exe is the most prominent example of
 360 this.
 361
 362 Since the parent handle is now closed for "non-Cygwin" processes, when
 363 strace.exe tried to dynamically load cygwin1.dll, the handle was invalid
 364 and child_info_spawn::handle_spawn couldn't retrieve information
 365 from the parent.  This eventually led to a strace_printf error due to an
 366 attempt to dereference an unavailable cygheap.  Probably have to fix
 367 this someday.  You shouldn't use the cygheap while attempting to print
 368 an error about the unavailability of said cygheap.
 369
 370 This was fixed by saving the parent pid in child_info_spawn and calling
 371 OpenProcess for the parent pid and using that handle iff a process is
 372 dynamically loaded.
 373
 374 2012-05-12  cgf-000006
 375
 376 <1.7.16>
 377 - Fix hang when calling pthread_testcancel in a canceled thread.
 378   Fixes some of: http://cygwin.com/ml/cygwin/2012-05/msg00186.html
 379 </1.7.16>
 380
 381 This should fix the first part of the reported problem in the above
 382 message.  The cancel seemed to actually be working but, the fprintf
 383 eventually ended up calling pthread_testcancel.  Since we'd gotten here
 384 via a cancel, it tried to recursively call the cancel handler causing a
 385 recursive loop.
 386
 387 2012-05-12  cgf-000005
 388
 389 <1.7.16>
 390 - Fix pipe creation problem which manifested as a problem creating a
 391 fifo.  Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00253.html
 392 </1.7.16>
 393
 394 My change on 2012-04-28 introduced a problem with fifos.  The passed
 395 in name was overwritten.  This was because I wasn't properly keeping
 396 track of the length of the generated pipe name when there was a
 397 name passed in to fhandler_pipe::create.
 398
 399 There was also another problem in fhandler_pipe::create.  Since fifos
 400 use PIPE_ACCESS_DUPLEX and PIPE_ACCESS_DUPLEX is an or'ing of
 401 PIPE_ACCESS_INBOUND and PIPE_ACCESS_OUTBOUND, using PIPE_ACCESS_OUTBOUND
 402 as a "never-used" option for PIPE_ADD_PID in fhandler.h was wrong.  So,
 403 fifo creation attempted to add the pid of a pipe to the name which is
 404 wrong for fifos.
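
The flag collision is easy to see with the numeric values.  The three
access constants below mirror the values documented for the Windows
PIPE_ACCESS_* flags; the two add_pid_* names and the "distinct" value
are purely illustrative and are not what fhandler.h uses.

```cpp
#include <cassert>

// Values as documented for the Windows pipe access modes.
constexpr unsigned pipe_access_inbound  = 0x1;
constexpr unsigned pipe_access_outbound = 0x2;
constexpr unsigned pipe_access_duplex   = 0x3;

static_assert(pipe_access_duplex
              == (pipe_access_inbound | pipe_access_outbound),
              "duplex is the OR of the two one-way modes");

// The mistaken private flag: it reuses the OUTBOUND bit.
constexpr unsigned add_pid_colliding = pipe_access_outbound;
// A private flag outside the two access bits (illustrative value only).
constexpr unsigned add_pid_distinct = 0x100;

// True if the open mode asks for the pid to be added to the pipe name.
constexpr bool wants_pid(unsigned open_mode, unsigned add_pid_flag)
{
  return (open_mode & add_pid_flag) != 0;
}
```

With the colliding flag, every duplex open (which is what fifos use)
looks like a request to add the pid; with a distinct bit it does not.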
 405
 406 2012-05-08  cgf-000004
 407
 408 The change for cgf-000003 introduced a new problem:
 409 http://cygwin.com/ml/cygwin/2012-05/msg00154.html
 410 http://cygwin.com/ml/cygwin/2012-05/msg00157.html
 411
 412 Since a handle associated with the parent is no longer being duplicated
 413 into a non-cygwin "execed child", Windows is free to reuse the pid of
 414 the parent when the parent exits.  However, since we *did* duplicate a
 415 handle pointing to the pid's shared memory area into the "execed child",
 416 the shared memory for the pid was still active.
 417
 418 Since the shared memory was still available, if a new process reuses the
 419 previous pid, Cygwin would detect that the shared memory was not created
 420 and had a "PID_REAPED" flag.  That was considered an error, and, so, it
 421 would set procinfo to NULL and pinfo::thisproc would die since this
 422 situation is not supposed to occur.
 423
 424 I fixed this in two ways:
 425
 426 1) If a shared memory region has a PID_REAPED flag then zero it and
 427 reuse it.  This should be safe since you are not really supposed to be
 428 querying the shared memory region for anything after PID_REAPED has been
 429 set.
 430
 431 2) Forego duping a copy of myself_pinfo if we're starting a non-cygwin
 432 child for exec.
 433
 434 It seems like 2) is a common theme and an audit of all of the handles
 435 that are being passed to non-cygwin children is in order for 1.7.16.
 436
 437 The other minor modification that was made in this change was to add the
 438 pid of the failing process to fork error output.  This helps slightly
 439 when looking at strace output, even though in this case it was easy to
 440 find what was failing by looking for '^---' when running the "stv"
 441 strace dumper.  That found the offending exception quickly.
 442
 443 2012-05-07  cgf-000003
 444
 445 <1.7.15>
 446 Don't make Cygwin wait for all children of a non-cygwin child program.
 447 Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00063.html,
 448        http://cygwin.com/ml/cygwin/2012-05/msg00075.html
 449 </1.7.15>
 450
 451 This problem is due to a recent change which added some robustness and
 452 speed to Cygwin's exec/spawn handling by not trying to force inheritance
 453 every time a process is started.  See ChangeLog entries starting on
 454 2012-03-20, and multiple on 2012-03-21.
 455
 456 Making the handle inheritable meant that, as usual, there were problems
 457 with non-Cygwin processes.  When Cygwin "execs" a non-Cygwin process N,
 458 all of its N + 1, N + 2, ...  children will also inherit the handle.
 459 That means that Cygwin will wait until all subprocesses have exited
 460 before it returns.
 461
 462 I was willing to make this a restriction of starting non-Cygwin
 463 processes but the problem with allowing that is that it can cause the
 464 creation of a "limbo" pid when N exits and N + 1 and friends are still
 465 around.  In this scenario, Cygwin dutifully notices that process N has
 466 died and sets the exit code to indicate that but N's parent will wait on
 467 rd_proc_pipe and will only return when every N + ...  Windows process
 468 has exited.
 469
 470 The removal of cygheap::pid_handle was not related to the initial
 471 problem that I set out to fix.  The change came from the realization
 472 that we were duping the current process handle into the child twice and
 473 only needed to do it once.  The current process handle is used by exec
 474 to keep the Windows pid "alive" so that it will not be reused.  So, now
 475 we just close parent in child_info_spawn::handle_spawn iff we're not
 476 execing.
 477
 478 In debugging this it bothered me that 'ps' identified a nonactive pid as
 479 active.  Part of the reason for this was the 'parent' handle in
 480 child_info was opened in non-Cygwin processes, keeping the pid alive.
 481 That has been kluged around (more changes after 1.7.15) but that didn't
 482 fix the problem.  On further investigation, this seems to be caused by
 483 the fact that the shared memory region pid handles were still being
 484 passed to non-cygwin children, keeping the pid alive in a limbo-like
 485 fashion.  This was easily fixed by having pinfo::init() consider a
 486 memory region with PID_REAPED as not available.  A more robust fix
 487 should be considered for 1.7.15+ where these handles are not passed
 488 to non-cygwin processes.
 489
 490 This fixed the problem where a pid showed up in the list after a user
 491 does something like: "bash$ cmd /c start notepad" but, for some reason,
 492 it does not fix the problem where "bash$ setsid cmd /c start notepad".
 493 That bears investigation after 1.7.15 is released but it is not a
 494 regression and so is not a blocker for the release.
 495
 496 2012-05-03  cgf-000002
 497
 498 <1.7.15>
 499 Fix problem where too much input was attempted to be read from a
 500 pty slave.  Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00049.html
 501 </1.7.15>
 502
 503 My change on 2012/04/05 reintroduced the problem first described by:
 504 http://cygwin.com/ml/cygwin/2011-10/threads.html#00445
 505
 506 The problem then was, IIRC, due to the fact that bytes sent to the pty
 507 pipe were not written as records.  Changing pipe to PIPE_TYPE_MESSAGE in
 508 pipe.cc fixed the problem since writing lines to one side of the pipe
 509 caused exactly that number of characters to be read on the other
 510 even if there were more characters in the pipe.
 511
 512 To debug this, I first replaced fhandler_tty.cc with the 1.258,
 513 2012/04/05 version.  The test case started working when I did that.
 514
 515 So, then, I replaced individual functions, one at a time, in
 516 fhandler_tty.cc with their previous versions.  I'd expected this to be a
 517 problem with fhandler_pty_master::process_slave_output since that had
 518 seen the most changes but was surprised to see that the culprit was
 519 fhandler_pty_slave::read().
 520
 521 The reason was that I really needed the bytes_available() function to
 522 return the number of bytes which would be read in the next operation
 523 rather than the number of bytes available in the pipe.  That's because
 524 there may be a number of lines available to be read but the number of
 525 bytes which will be read by ReadFile should reflect the mode of the pty
 526 and, if there is a line to read, only the number of bytes in the line
 527 should be seen as available for the next read.
 528
 529 Having bytes_available() return the number of bytes which would be read
 530 seemed to fix the problem but it could subtly change the behavior of
 531 other callers of this function.  However, I actually think this is
 532 probably a good thing since they probably should have been seeing the
 533 line behavior.
 534
 535 2012-05-02  cgf-000001
 536
 537 <1.7.15>
 538 Fix problem setting parent pid to 1 when process with children execs
 539 itself.  Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00009.html
 540 </1.7.15>
 541
 542 Investigating this problem with strace showed that ssh-agent was
 543 checking the parent pid and getting a 1 when it shouldn't have.  Other
 544 stuff looked ok so I chose to consider this a smoking gun.
 545
 546 Going back to the version that the OP said did not have the problem, I
 547 worked forward until I found where the problem first occurred -
 548 somewhere around 2012-03-19.  And, indeed, the getppid call returned the
 549 correct value in the working version.  That means that this stopped
 550 working when I redid the way the process pipe was inherited around
 551 this time period.
 552
 553 It isn't clear why (and I suspect I may have to debug this further at
 554 some point) this hasn't always been a problem but I made the obvious fix.
 555 We shouldn't have been setting ppid = 1 when we're about to pass off to
 556 an execed process.
 557
 558 As I was writing this, I realized that it was necessary to add some
 559 additional checks.  Just checking for "have_execed" isn't enough.  If
 560 we've execed a non-cygwin process then it won't know how to deal with
 561 any inherited children.  So, always set ppid = 1 if we've execed a
 562 non-cygwin process.
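
One reading of the two rules above, written as a predicate.  The
function and parameter names are hypothetical and the real check lives
in Cygwin's process exit path; this only captures the decision table.

```cpp
#include <cassert>

// True when the exiting process should hand its children to pid 1.
// Children keep their parent only when we have execed a Cygwin process,
// since only a Cygwin process knows how to adopt inherited children.
bool reparent_children_to_init(bool have_execed, bool execed_cygwin_process)
{
  if (have_execed && execed_cygwin_process)
    return false;
  return true;
}
```

So a plain exit and an exec of a non-Cygwin program both set ppid = 1,
while an exec of a Cygwin program leaves the parent pid alone.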