Summary: ACQI/su problems

From: Letitia Yao <letitia_at_nmr.chem.umn.edu>
Date: Tue, 13 Jun 2000 09:25:13 -0500 (CDT)

We want to thank everybody for their suggestions to our su/acqi problems.
Unfortunately, we are still experiencing them with vnmr6.1b and have gone
back to vnmr6.1a. We have replaced the ethernet cable and tried a newer
version of the MSR board, both with no success. This problem has been
reported in the vnmr buglist. Following the bug report, I have
summarized the replies and suggestions that we have gotten. We will
continue to trouble-shoot this when we get more ideas, but for now I'd
rather have a system that doesn't crash every day.

---------------------------
su - Hardware setup hangs Bug-ID: su.6102 (++)

Summary: About one in every three times that the su command is typed,
     a time is shown for "Completion Time" and the setup never
     completes. If go is then typed, the experiment is
     queued. Occasionally after clicking on "Abort Acq" to quit the
     setup, one gets the following message after typing ga:

             ga cannot proceed now, abort of ACQI not complete

     Clicking on "Abort Acq" again then gives the message

             No Abort Acq when system is in interactive mode.

     Turning off DNS has no effect. The problem was not present in
     VNMR 6.1A. Bruce Adams reports that after either a swap of the
     acq. CPU and/or the MSR board, or after an adjustment of the 5V
     supply for the acquisition CPU the situation improved, but bug
     acquisition.6118 now appears with arrayed experiments - this may
     be the same bug. Related: acquisition.6118

Bad versions: VNMR 6.1B / Solaris 2.5 / UNITY INOVA

Reported by: Chris Blake, Australian National University (1999-04-14)

Confirmed by: Bruce Adams, Varian

Workaround: Click on "Abort Acq". Sometimes it is necessary to reboot
the acquisition console.

----------------------------
I can't offer any help, but I can tell you that we see this problem very rarely
(but enough to notice) with our new MercuryVX 200, which runs 6.1B on an
Ultra5/360.

See if you can make Varian fix this in 6.1C, due out in a few weeks!
----------------------------
All of this really doesn't seem like a software problem to me; it sounds
more like you're intermittently losing hardware communication over the
Ethernet link that connects the console to the Sun. Is there a chance your
console Ethernet cables might be damaged? Stepped on, crushed, pulled out
of their jacks, etc.? One thing I'd try is replacing one of the cables.
You'd need a good quality Category 5 crossover patch cable which can be
bought at bigger computer stores (it must say "crossover"; regular patch
cables won't work). Another possibility, if your Suns have 2 Ethernets, is
to take the Sun off the network and try running the console from the 1st
Ethernet interface (you'll need to run setacq again to tell the system which
Ethernet to use to talk to the console).
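
One way to look for a flaky cable or interface before swapping hardware is to
watch the per-interface error counters. The sketch below is only an
illustration: it assumes that `netstat -i` prints a header row containing
Ierrs and Oerrs columns (the Solaris layout), and the function name and the
interface names in the usage comment are made up for the example.

```shell
#!/bin/sh
# List interfaces whose input or output error counters are non-zero.
# Reads `netstat -i` style output on stdin; the column positions are
# located from the header row (assumed to contain "Ierrs" and "Oerrs").
iface_errors() {
    awk '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "Ierrs") ie = i
                if ($i == "Oerrs") oe = i
            }
            next
        }
        # Print: interface name, input errors, output errors
        ie && oe && NF >= oe && ($ie + $oe) > 0 { print $1, $ie, $oe }
    '
}

# Usage on the host (not run here; hme0/hme1 are hypothetical names):
#   netstat -i | iface_errors
```

Counters that keep climbing on the interface wired to the console would point
toward the cable or transceiver rather than the software.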
----------------------------
Quite puzzling problem. You may want to check the MSR board. We upgraded
our MSR board on going from 6.1A to 6.1B. The board may have been upgraded
as part of our service agreement and could well be coincidental with
loading the new software, but it may be worth checking. The MSR version
number was 3 before, and one spectrometer was upgraded to version 5B and
the other is 6. Both spectrometers are running Solaris 7, VNMR 6.1b on an
Ultra 5 and Ultra 10 with no problems. I do know you need to plug the
shims into the MSR board on going to 6.1B, but that shouldn't cause the
problems that you see, and I would guess you have that covered based on
what you've done so far.
----------------------------
Have you tried to reset the INOVA Host Computer yet? This has to be done by
root.

----------------------------
        We occasionally lose communications, but nowhere near the extent that
you do. Sometimes it is the process on the Sun that dies, sometimes it is
the console processor that hangs. If it is the Sun process that dies,
Acqstat will show inactive and 'su acqproc' will reconnect. This happens
about once a week.
        If it is the console that locks up, we get additional messages
from su acqproc saying that 'channel # on the console is already
connected to another process'. We then su acqproc until acqproc is killed,
then reboot the console (we reset the MSR board, then the CPU). We wait
until the lights flash on the remote status (about 2 min), then su acqproc,
then su in vnmr. This protocol has always restored the spectrometer
connection. We have found that we can kill the console process by
submitting acquisition jobs too quickly (go jexp2 go jexp3 go, etc.). We
are OK if we wait until acquisition has started or queued before
submitting a new job. We can routinely get into trouble if we type go or
ga and then abort acquisition very quickly.
        The additional errors you get suggest that your problem is more severe.
Perhaps you have bad ethernet transceivers or cables (or the 10Base-T cables
are picking up interference from something else). We are also running
Solaris 2.6 and vnmr6.1B. It might be something related to 2.7. We also
have NOT installed scsi drivers.
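
To tell the two cases apart from a shell window, one can check whether the
acquisition daemon is still alive on the Sun. This is a rough sketch, assuming
the daemon shows up in the process list under the name Expproc (the name used
elsewhere in this thread); the function name and the messages are illustrative
only.

```shell
#!/bin/sh
# Classify the acquisition daemon state from a process listing on stdin.
# Input: one command name per line, for example from `ps -e -o comm`.
# Any leading directory is stripped before comparing against "Expproc".
expproc_state() {
    awk '
        { name = $1; sub(/.*\//, "", name); if (name == "Expproc") n++ }
        END {
            n += 0
            if (n == 0)      print "none: start it with su acqproc"
            else if (n == 1) print "ok: one Expproc running"
            else             print "duplicate: " n " Expproc processes"
        }
    '
}

# Usage on the host (not run here):
#   ps -e -o comm | expproc_state
```

If the daemon is running but the console still does not answer, the hang is
more likely on the console side, and the reset sequence described above
applies.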
----------------------------
    We have had a similar type of problem on a completely different system and
computer platform. It turned out to be RF interference. It was the devil to
troubleshoot. I needed to check the ground and RF cables for continuity.
I literally spent weeks trying to figure it out.
----------------------------
Your question is interesting because it is similar to a problem I have seen
with acqi, but only a few times in 6 months.

My problem showed up as acquisitions not automatically doing wft after
acquisition. It turned out this was because there was no "acquisition
complete" message being sent back from the console. I was still able to
get spectra with a ga command.

My problem was fixed with two "su acqproc"s (one to kill, one to start).
I do recall other occasions where I've had to reset the console computer
card, but no recurrent problems.

Our system is a Unity 500 running 6.1b on a sun ultra 5.
-----------------------------
These are very weird problems. Seems like you have tried reasonable
solutions. Sorry, I can't provide any help for you. However, I too have
Inovas with Ultra 5s and vnmr6.1b, so I would be very interested in
learning about the solution(s) to them.

There is one thing: we find that if we start an acquisition, decide to
change a parameter, type aa, then quickly change the parameter and type ga,
this will start an acquisition before the first has been completely
aborted. This starts an invisible queue which doesn't stop with aa.
-----------------------------
I get this problem infrequently, usually an 'su acqproc' fixes it.
How are the mains? When this happens it seems to affect all the
spectrometers in the room, indicating a power glitch. Perhaps a
UPS/voltage regulator would help. The recommended ones are not cheap,
about $3-4000. I'm investigating this here, but don't have the budget.

Is the ethernet cable between the spectrometer and computer good? I
had some problems with a cheap cross-over cable [Varian missed one on
installation, so I bought a cheap one. The Varian one arrived later
and improved matters.]

Please let the list know what solves this problem!
----------------------------
Wow - you seem to have tried everything. We too see problems like yours,
but at a very low frequency which is not really a problem. The loss of
communication and/or console crashes seem to be an endemic problem with
Varian systems.

One thing that may help is to check the revision level on your MSR board.
It may not seem directly related, but we had a problem ca. 18 months ago
when they brought out patch 108 for 6.1A. We loaded the patch on two
systems and everything went crazy (not identical to what you describe, but
somewhat similar). The main feature of the problems we had was not ejecting
samples, or taking a long time to do so, and getting messages like
'acquisition responded xxxx' where xxxx is some gobbledygook nonsense, but
there were symptoms that looked like communication problems. Varian said
"We've tested the patch on our systems, and everything is fine." Eventually
the problem was traced to an old revision of the MSR board in our consoles.
Patch 111 (with the old board) was OK, but gave occasional probs of a
similar type so there may well be something in 6.1B that doesn't like old
MSR boards.

I'm not sure that this is the cause of your problems, but it may be worth
checking out. I was interested to see that your problems go away with 6.1A.
What patch level? If they come back with patch 111 or higher on 6.1A, then
this does look suspiciously like the problem we saw. Don't be put off if
the Varian technical guys tell you the revision of the board doesn't matter,
'cos sometimes it does. The vintage of the 'bad' board was around 1995. Let
me know how you get on, I'd be interested to hear if this is/is not the
cause of your problems.

-----------------------------
I have seen this problem occasionally on our Inova 400 since we purchased
it in 1995 (therefore it has persisted through computer and software
upgrades). On our system, the problem has never continued for more than
half a day. Other than shutting down the console and rebooting, the only
other suggestion I have is to log in as root or superuser, go to /vnmr/bin
and do a ./setacq (once the computer & console have come back up). My
guess is that you have already tried this, but it is the only suggestion I
have. I have no set routine for handling this problem when I see it. I
go through the same steps that you have taken and the system usually comes
up and stays up for several months before seeing this problem again. We
have been having some "odd" communication problems with our new Inova 500
as well. I believe that Varian has a continuing problem with the
way their acquisition CPU, MS&R and ADC boards "communicate" with the Suns,
since we have been seeing these types of problems since the 400 purchase in
1995. I hope you find a resolution to your problem! Good luck.
-----------------------------
We have seen something similar on our system here. However, we have
been fortunate that simply resetting the console (not a full
powerdown) has been enough to recover from it.
-----------------------------
What a strange set of problems that you are having with your two
Inova systems. Have you figured out the cause and found the solution
yet?

Your troubleshooting results seem to point to something (i.e.
hardware, electrical, or whatever) that vnmr6.1b is having a problem
with but vnmr6.1a can live with. Did you just upgrade these systems
to vnmr6.1b? We have four Inova systems (two 500's, one 600, and one
750) all running Solaris 7/VNMR 6.1B, one on an Ultra5 and the rest
on Sparc5's, and they are all as well behaved as they were with
vnmr6.1a. So I don't think there is a general incompatibility
between vnmr6.1b and Inova consoles. Instead of downgrading to
vnmr6.1a, you might want to try out a beta copy of vnmr6.1c - you may
be in luck and your problems will just go away.
----------------------------
We too suffer from console communication problems, which have been a
feature of Varian (and other) spectrometers since the introduction of Unix
workstations. Our problems have been less severe than yours, but also vary
substantially in incidence. In general, the heavier the computing load,
the more frequent the problems. Our open access INOVA 300, which runs 30k+
routine 1H/13C samples a year, only occasionally shows problems. Our other
INOVA 300 and 400 instruments, with near identical hardware and software,
show an incidence which increases dramatically with the complexity of the
experiments being done and with the amount of operator interaction. Doing
development work which involves frequent aborting of acquisitions and
queuing of experiments may result in a breakdown every few hours.

If you do get to the bottom of the problem we would be very interested to
hear of it; in the mean time we will try the fix given at the end of your
message.

----------------------------------------------------------------
ORIGINAL MESSAGE:

We are having major communication problems with 2 of our spectrometers.
This problem occurs on both our Inova500 and Inova300 running under
vnmr6.1b and solaris2.7 on ultra5 workstations.

Both seem to lose communication with the console. 'Su acqproc' and
resetting the console appear to restore communication temporarily. But
as soon as you type 'su', several errors may occur:

1. 'Su cannot proceed; abort of ACQI not complete'
2. 'Su cannot proceed; console is powered down or not connected'
3. And in the shell window:
'channel # on the console is already connected to another process'
which repeats over and over. Or sometimes:
'No heartbeat reply; console is powered down or not connected' or
'msgehandler.c line490: chkExpQ: Console connection not established
yet.'
4. Sometimes an experiment will queue if you do additional 'su' commands.
5. If you try to abort an acquisition or "ga", you get the error:
'No abort acq [or ga] when system is in interactive mode.'
6. Always, buttons in the Acqi window (lock, insert, eject) disappear
because the spectrometer thinks you are doing something.
7. Sometimes there is no lock signal in the lock window (ie, the yellow
line is missing)
Sometimes aborting the acquisition will restore communication.
Sometimes just typing su again will restore it. But never for very
long.

What we've done:
--We've shut down the console and rebooted the computer several times.
Sometimes it will come back, but only for about half an hour.
--We hooked up another computer (an Ultra 10) and installed vnmr6.1b
fresh, but saw the same problems.
--All fans are running.
--Voltages across the power supply are close--there is one that is 11
instead of 10. Checking the power supply with an oscilloscope looks ok.
We were originally running at a voltage of 204 when the problem first
occurred, but we have since installed a transformer to boost the voltage
to 218 and the problem persists.
--The problem can be avoided on the Inova500 if we run 6.1a instead of
6.1b, still on the Ultra5. We occasionally get hangups, but one or two
sequences of 'su acqproc' will fix it.
--The problem can be avoided almost completely on the Inova300 if we run
6.1a on a sparcstation 5 (instead of the Ultra5).

--Varian Tech Support suggested the following, which we have also done.
This allows us to run under vnmr6.1b again for a few days at a time before
the problem occurs again.

  Open a terminal window and do the following
    1. First make sure that you do not have more than one
       Expproc running when this happens. You can check this
       by typing on a terminal window the following
        ps -fea
    2. Then be sure that Expproc and associated processes
       are not active. (This includes acqstat and acqi
       windows.)
    3. Log in as vnmr1 and type the following:
           vnmr1> cd /vnmr/bin
           vnmr1> ls -al
           check the permission of the following files:
 
 -rwsr-sr-x 1 vnmr1 nmr 1575260 Dec 18 1996 Vnmr
 -rwsr-sr-x 1 vnmr1 nmr 7916 Dec 18 1996 send2Vnmr
 -rwsr-sr-x 1 vnmr1 nmr 113896 Dec 18 1996 Expproc
 
      If they do not have the setuid and setgid bits set, make the
      following correction:

           vnmr1> chmod 6755 Vnmr
           vnmr1> chmod 6755 send2Vnmr
           vnmr1> cd ../acqbin
           vnmr1> chmod 6755 Expproc
    4. Log in as root and type the following:
      # cd /vnmr/bin
      # ./rmipcs a
      Answer y to the question.
    5. Reboot the spectrometer console and restart Expproc:
          su acqproc
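
The permission check in step 3 can be scripted rather than read by eye. The
following is a minimal sketch, not part of Varian's instructions: it takes
`ls -l` lines on stdin and reports any file whose mode string lacks the s
flags that chmod 6755 would set. The function names are invented for the
example.

```shell
#!/bin/sh
# Succeed if an ls-style mode string has both the setuid and setgid bits,
# i.e. an s (or S) in the owner-execute and group-execute positions.
has_setid_bits() {
    case "$1" in
        ???[sS]??[sS]???*) return 0 ;;
        *)                 return 1 ;;
    esac
}

# Read `ls -l` lines on stdin and print "name mode" for every file that
# is missing either bit. Lines without a file name ("total N") are skipped.
report_missing_setid() {
    while read -r mode links owner group size mon day when name; do
        [ -n "$name" ] || continue
        has_setid_bits "$mode" || echo "$name $mode"
    done
}

# Usage on the host (not run here):
#   ls -l /vnmr/bin/Vnmr /vnmr/bin/send2Vnmr /vnmr/acqbin/Expproc | report_missing_setid
```

No output means all listed files already carry both bits, and the chmod
commands in step 3 can be skipped.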


----------------------
Letitia J. Yao | letitia_at_nmr.chem.umn.edu
NMR Laboratory | yao_at_chem.umn.edu
Department of Chemistry | (612) 625-8374 (office)
University of Minnesota | (612) 626-7541 (fax)
Received on Tue Jun 13 2000 - 20:14:20 MST
