IDL Falls Asleep!

QUESTION: You are not going to believe this, but we are starting to hear rumors of IDL 5.2, running on UNIX machines, falling asleep right in the middle of a big processing job. Here is a recent description of the problem and a typical response on the IDL newsgroup.

Subject: Re: IDL falls to sleep!
From: throop@colorado.edu
Date: Sun, 02 May 1999 09:26:33 GMT

Robert S. Mallozzi wrote:

> Terje Fredvik writes:
> >
> > One of my programs is generating large data arrays which is saved
> > to file using the ordinary "save, [filename], [variable]" command.
> > The program has worked excellent, but yesterday something strange
> > started to happen; after saving a few arrays IDL went into sleep
> > mode [...]
>
> One of my co-workers is also having this problem, although it's not
> when writing files.  He was doing many fits in a loop using SVDFIT.
> At a random point in the loop, IDL halts (no CPU activity), CNTRL-C
> does nothing, and IDL must be killed "brutally".
> I traced the line at which it halted - it was at a call to
> NR__SVDFIT (two underscores) [...]

Praise the lord!  RSI has tried to convince me that I am the only one in the
world having this problem with 5.2, and that it has something to do with my
installation.

I have experienced this problem on three Solaris machines under IDL 5.2.  I
cannot reproduce the problem using 5.0.3 on the two machines that have it
installed.  The systems are an Ultra 5, Ultra 10, and Sparc 10.  I was unable
to reproduce the problem in NT.

I have a long-ish (2000-line) code that consistently causes this problem,
after 1-10 hours of computation.  I've sent it to RSI, who says they're
unable to reproduce it.  I don't know why this would be, but at this point
they tell me there's nothing to be done.  If someone else has a
self-contained program that demonstrates this situation -- or general
observations about it -- _please_ send it to RSI.  E-mail me and I'd be happy
to pass along my code and my exchanges with RSI support.

For background, my program executes about 10^5 iterations through a set
of matrix transformations.  It's not writing any files, and I don't believe
I use the NR__SVDFIT routine that Robert mentions above.  Memory usage is
~ 50 MB, and I get the same behavior from IDL and IDLDE.  In the former,
ctrl-c will sometimes give the 'Interrupt encountered' message, not always.

-Henry

And here is another report from the always sober Paul Mix at Sandia National Lab.

Subject: Re: IDL falls to sleep!
From: "L. Paul Mix" 
Date: Mon, 03 May 1999 12:59:56 -0600

We too have observed IDL hanging on large applications.
RSI has been unable to reproduce the results.

We have observed the problem most frequently on multiple processor systems.
(10 processor Sun and 4 processor HP)
Apparently all of the RSI systems have single processors.

--
L. Paul Mix

Other people have responded on the IDL newsgroup with similar reports, although there does not appear to be a pattern--yet. It does seem to happen only after IDL has been cranking away on something for a fairly long time.

So, bottom line, we are looking for more information to track this problem down. If you have anything to share, please contact me and I'll make sure a copy makes it to RSI's technical support folks.

It's good to know you're not crazy, isn't it? :-)

Here is some additional information I have collected on this topic.

Subject: IDL falls to sleep! 
From: Henry Throop  
Date: Fri, 08 Oct 1999 08:42:51 GMT 
Newsgroups: comp.lang.idl-pvwave 
Organization: University of Colorado 

Back in May, several users (myself included) reported problems
with IDL 5.2 literally falling asleep.  (Most of us were using 5.0.3
as a result.)  I talked at length with RSI and my local support people,
who claimed that it was not a problem, they couldn't reproduce it, it
was a coding error, and/or that it was not fixable.  The symptom was,
under Solaris, that CPU usage would drop to zero and IDL would stop
responding to any commands at all, and the process had to be killed.

I finally got RSI's attention when the following program reproduced
the bug consistently, stopping after ~ 1 hour.  (Biking down
to their office and walking up to the tech support cubicles didn't
hurt, either):

pro hang_bug
  for i = 0LL, 90000000LL do begin
    a = dist(50)+randomu(seed,50,50)
    a = a^2 + a
    if (i mod 1000 eq 0) then print, i
  end
end

With this in hand, one of the RSI support people -- although still
unable to reproduce it -- suggested we try a different license
manager.  I'm not familiar with the nitty-gritty of licenses, but I
believe it has to do with switching from a FLEX_LM license to a GENVER
one.  The problem was apparently that IDL was hanging when it was
unable to successfully ping the license server, decided the license
was bad, and stopped.

Bingo -- no 5.2 problems since!

The computer people here apparently had to change every one of LASP's
100+ licenses, but I've not heard problems since then.  I'm not
sure whether the issue's been fixed in 5.3, but I thought I'd pass this
on to anyone still having problems.

-Henry
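
Because the hang shows up only after hours of computation, it can help to have something watching for the moment the progress counter stops. The following is a minimal sketch, not anything RSI or the posters supplied. It assumes the output of the test program is redirected to a log file (the name hang_bug.log below is hypothetical), and it treats a log that has not been touched for longer than a chosen limit as a sign that IDL has gone to sleep.

```python
import os
import time


def seconds_since_update(log_path, now=None):
    """Return how many seconds have passed since log_path was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(log_path)


def looks_hung(log_path, quiet_limit_s, now=None):
    """True if the log has been silent for longer than quiet_limit_s seconds.

    Assumption: the IDL job appends to log_path at regular intervals
    (for example the "print, i" every 1000 iterations in hang_bug).
    """
    return seconds_since_update(log_path, now) > quiet_limit_s


if __name__ == "__main__":
    LOG = "hang_bug.log"  # hypothetical file name
    if os.path.exists(LOG) and looks_hung(LOG, quiet_limit_s=600):
        print("No progress for 10 minutes; the IDL job may be asleep.")
```

Run from cron or a shell loop, a check like this records roughly when the job stalled, which is useful information to pass along with a bug report.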

Subject: more IDL falls asleep
From: deja_jlin@my-deja.com (Johnny Lin)
Date: 24 May 2001 14:57:44 -0700

hi all,

back in 1998 and 1999, a few folks described a problem with long IDL
jobs on solaris going to sleep (0% CPU usage) mid-way through.  Henry
Throop provided a fix involving switching from the FLEX license
manager to GENVER.

i'm encountering the same problem now w/ a node-locked license running
FLEX, for both IDL 5.3 and 5.2.1 (this is a multi-processor machine).
the test code Henry used:

pro hang_bug
  for i = 0LL, 90000000LL do begin
    a = dist(50)+randomu(seed,50,50)
    a = a^2 + a
    if (i mod 1000 eq 0) then print, i
  end
end

causes IDL to fall asleep around 1-3 hours in.  our sys admin folks say
this shouldn't be a license manager problem, since it's node-locked, and
thus only checks the license at the beginning of the job.

i've also encountered similar problems w/ a server license running IDL 5.4.
it also is a solaris machine, but only has a single processor.

has anyone else figured out another fix?  any help would be much appreciated.

thanks!

best,
-Johnny

-------------------------------------------
Johnny Lin
CIRES, University of Colorado
Work Phone:  (303) 735-1636
Web:  http://cires.colorado.edu/~johnny/
-------------------------------------------

Subject: Re: more IDL falls asleep
From: Wayne Landsman 
Date: Fri, 25 May 2001 03:36:38 -0400

Johnny Lin wrote:

> i've also encountered similar problems w/ a server license running IDL 5.4.
> it also is a solaris machine, but only has a single processor.
> has anyone else figured out another fix?  any help would be much appreciated.

You're not going to like my answer -- except for the comfort that it might give
in knowing that you are not the only one....

We've had problems with IDL falling asleep under a node-locked V5.4 license and
a dual-processor machine running Solaris 2.7.     We strongly suspected the
license manager since it occurs with both V5.3 and V5.4 under the V5.4 license
manager, but never with V5.3 under the V5.3 license manager.    The problem also
seems to be more likely to occur if the machine has many other processes
running.

The strange thing is that we have another apparently identical machine that has
had no problems with V5.4!       So we are at a loss in trying to understand how
to further diagnose the problem, and have simply downgraded the offending
machine to IDL V5.3.

Wayne Landsman

Subject: Re: more IDL falls asleep
From: deja_jlin@my-deja.com (Johnny Lin)
Date: 29 May 2001 13:39:14 -0700

hi all,

the folks at RSI responded re this problem, and it has to do with the
license manager.  here's the applicable parts of their reply (the technical
term for IDL sleeping and never waking up again is "deadlock"):

  If it is indeed deadlock, then I believe you are encountering a weakness in
  FLEXlm licensing that is generally only observable in very "large" IDL
  processes. It is based on an implementation of malloc() in FLEXlm that is
  not thread-safe. To keep a constant tab on licensing status in its network
  FLEXlm sends out a periodic query to each of its clients. This query uses
  the unsafe malloc() call. There is a very minute probability that this call
  might be concurrent with an IDL use of system malloc(), and that is where
  the deadlock occurs. Very large IDL processes that have many calls
  allocating new memory are capable of defying the odds and experiencing this
  deadly concurrency. We have actually never been able to (knowingly)
  reproduce this in-house, but identified the clash in 'pstack' output from
  our customers. You also would probably see the concurrent malloc() calls in
  your own run of 'pstack' on the ID of a process that has "fallen asleep."

  Our only solution is Research Systems' own licensing protocol, Genver
  licensing. The downside to Genver licensing is that it must be renewed every
  180 days, and you must have a separate license for each individual host that
  is running your long IDL programs.

they're working on fixing this for the 5.5 release, but aren't sure if it will
be ready then, since fixing it is actually quite tricky.

hope this helps others experiencing this problem!

best,
-Johnny
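
For readers who want a feel for why a collision inside malloc() produces a process that sits at 0% CPU forever, here is a small sketch. It is only an illustration under stated assumptions: RSI's reply says no more than that the FLEXlm query and IDL can call a non-thread-safe malloc() concurrently, so the model below (an allocation that is interrupted by the license query while it still holds the allocator's lock) is one plausible way such a hang can arise, not a description of FLEXlm's actual code. The names allocate and license_query are hypothetical.

```python
import threading

# Stand-in for the internal lock of an allocator that cannot be re-entered.
# A plain threading.Lock cannot be acquired a second time while it is held.
_heap_lock = threading.Lock()


def license_query(timeout_s=0.1):
    """Hypothetical periodic license check that also needs the allocator."""
    if _heap_lock.acquire(timeout=timeout_s):
        _heap_lock.release()
        return "query answered"
    # With an ordinary blocking acquire the caller would wait here forever:
    # no CPU use, no response to commands.
    return "stuck"


def allocate(interrupted_by=None):
    """Hypothetical allocation, optionally interrupted while holding the lock."""
    with _heap_lock:
        if interrupted_by is not None:
            # The interruption runs before the lock is released.
            return interrupted_by()
        return "allocated"


if __name__ == "__main__":
    print(allocate())               # normal case
    print(license_query())          # query on its own
    print(allocate(license_query))  # query lands in the middle of an allocation
```

The timeout exists only so the demonstration returns instead of hanging. The window in which the two calls overlap is tiny, which fits the reports above: the more allocations a job makes and the longer it runs, the more chances it has to hit that window.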
