Dave Lloyd
Computer Services Branch
Software Engineering Department
September 19, 2000
2. The user must supply an initialization file or $ADAPSTABLES/adapsmonitor.default must exist.
An example of the initialization file is located in section 2.4.
Figure 1 -- ADAPS Monitor Main Window
To stop the ADAPSMONITOR press the Exit button on the Main window. An Exit
Confirmation window (See Figure 2) will pop up. Pressing the Yes button will
exit the application. Pressing the No button will remove the Exit Confirmation
and the application will continue to run.
Figure 2 -- ADAPS Monitor Exit Confirmation Window
At periodic intervals the monitor will recheck items in the list and update
their status in the main window. If, however, the status of an item is not
normal, the user is notified (See Figure 3).
Figure 3 -- ADAPS Monitor Alert Window
The alert window will remain on the display as long as the conditions that
caused the alert remain or until the user presses either the Acknowledge or
the Apply Changes button. Pressing either of the aforementioned buttons will
not correct the conditions that caused the alert. The alert window will
reappear if the error conditions are not corrected.
Users may disable future alerts on an item by pressing the button, labeled
"Enabled", to the left of the item. When pressed these buttons toggle between
"Enabled" and "Disabled". The background color also changes to yellow to
indicate that a change has been made (See Figure 4). Note that changing the button to
disabled on the alert window has no effect until the Apply Changes button is
pressed.
Figure 4 -- ADAPS Monitor Alert Window with a Disabled button
It is also possible to change which items are to be monitored without waiting
for an Alert window. Pressing the
Modify button on the main window opens the Modify window (See Figure 5). This window
contains a list of all the items with an Enabled/Disabled toggle button for
each one. These buttons work in the same manner as the toggle buttons on the
alert window. Again, no changes take place until they are applied by pushing
the Apply button. Pressing the Cancel button removes the Modify window
without applying any changes. The Apply button also removes the Modify
window.
Figure 4 -- ADAPS Monitor Modify Window
Figure 5 -- File Select Window
This is a standard Motif file selection window that allows the user to select
a file. Either select the file from the Files listed or enter the file name
in the Selection box and press OK. The file will be loaded into the Modify
window. Once the Apply button is pressed the new list will be used for all
further checks. Pressing the Cancel button will retain the original list of
items.
Another consideration is not to over-specify the name. For instance, a qstat
may reveal that archive_daemon.712659 is in the queue. If the full name is
specified and the archive daemon later fails and is restarted, its name will
have a different process id suffix. For this reason, it is safest to
specify only the daemon name 'archive_daemon'.
To monitor a system, the system name must be listed in both this field and the
next (System) and must be exact.
b. System. This specifies the system on which to check the item. The system
name must be the exact name as reported from a 'uname -n' command. For the
current acquisition system specify vaxh.
c. Item Type. This indicates what type of item to monitor. The valid values
for Item Type are: datafile, batch, process, vaxprocess and system.
d. Monitor Interval. This is the interval, in minutes, between checks of the
item. There are two ways to think of the
interval, one for items of type system and vaxprocess, and another for all other
item types.
For system and vaxprocess types this is a simple interval. Every X minutes
the item's status is checked and reported. However, for other types, it must
be thought of as a check for activity within the past X minutes. The monitor
uses the Last Modified time of the file or, in the case of batch and process
types, the log file to determine if activity has occurred. For example,
assume there
is a batch process named ingest_daemon with a monitor interval of 10 minutes
and a log file of $ADAPSLOG/ingest.log. Every 10 minutes the monitor will check
the Last Modified time of $ADAPSLOG/ingest.log. If the log has been modified
within the last 10 minutes the ingest_daemon will be deemed to be running. If,
however, the log is 11 minutes old an alert will be issued warning that the
daemon is not running. Items of type process also use the specified log file
to check for activity. Items of type datafile
use the data file itself for the time check.
e. RWAST Monitor Interval. This interval, in minutes, specifies how long a VMS
process may remain in the RWAST state before triggering an alert.
This field applies only to VMS processes; for all other types a value of -1
should be specified.
f. Initial Monitor State. This specifies the monitoring state of an item at
initial start up. Its valid values are "enabled" or "disabled". An item with
"enabled" in this field is checked from startup onward; an item with
"disabled" is not checked. An item "disabled" at startup can later be
"enabled" through the Modify window.
g. Number of Batch Jobs. This specifies how many batch jobs with this name
should be in the batch queue. Items that are not of type "batch" should specify a
value of -1. If an item, such as refaid_daemon, specifies 2 batch jobs, there
must be two refaid_daemon daemons in the batch queue. If there is only one
job, a 'Not in queue' alert will be issued.
h. User ID. This specifies the user id for batch items.
For example, assume there are two ncepin daemons running in
batch, one under opsavhrr and another as release. If the item specifies a
user id of opsavhrr, the monitor will only recognize the ncepin daemon that
opsavhrr is running. For non-batch items NONE should be specified.
i. Logfile. This is the file that will be checked for activity for items of
type "batch" and "process". Items of other types should specify NONE. The log
file specified should be a file that is regularly updated at some interval.
The Monitor Interval should be slightly longer than the log update interval.
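The activity check described under Monitor Interval and Logfile can be sketched as follows. This is a minimal illustration in Python, not the monitor's actual code: the function name is_active and its parameters are hypothetical, and treating a missing or unreadable file as "no activity" is an assumption.

```python
import os
import time


def is_active(path, interval_minutes, now=None):
    """Return True if `path` was modified within the last `interval_minutes`.

    `path` is the data file itself for datafile items, or the log file for
    batch and process items. `now` may be supplied (seconds since the epoch)
    to make the check repeatable; it defaults to the current time.
    """
    if now is None:
        now = time.time()
    try:
        modified = os.path.getmtime(path)  # the file's Last Modified time
    except OSError:
        # Assumption: a missing or unreadable file counts as no activity.
        return False
    return (now - modified) <= interval_minutes * 60
```

With a Monitor Interval of 10, a log last modified 9 minutes ago passes the check, while a log that is 11 minutes old fails it and would lead to an alert.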
2.5.2 Process Type. The process's 'Log File' must have been modified within
the last 'Monitor Interval' minutes.
2.5.3 Batch Type. There must be 'Number of Batch Jobs' jobs listed in the batch
queue under 'Userid'. If that rule is met, the process's 'Log
File' must also have been modified within the last 'Monitor Interval' minutes.
2.5.4 Vaxprocess Type. The process must be listed in the VMS 'sh sys'
command. If the process state is RWAST, it must not have been in that state
for more than 'RWAST Monitor Interval' minutes.
2.5.5 System Type. The named system must be able to respond to an 'rsh date'
command.
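The job-count part of the Batch Type rule can be sketched as follows. This is an illustration only: the names matching_jobs and batch_jobs_ok are hypothetical, the queue is assumed to have already been reduced to (job name, owner) pairs because the qstat output format is not described in this document, and substring matching on the item name is an assumption based on the Item Name guidelines in section 2.4.

```python
def matching_jobs(queue, item_name, userid):
    """Return the queued job names that match an item.

    `queue` is an iterable of (job_name, owner) pairs.
    Assumption: a job matches when the item name appears anywhere in the
    job name and the job is owned by the specified user id.
    """
    return [job for job, owner in queue
            if item_name in job and owner == userid]


def batch_jobs_ok(queue, item_name, userid, required):
    """Return True if at least `required` matching jobs are in the queue."""
    return len(matching_jobs(queue, item_name, userid)) >= required


# Hypothetical queue contents for illustration.
queue = [
    ("archive_daemon.712659", "opsavhrr"),
    ("ncepin_daemon.1", "opsavhrr"),
    ("ncepin_daemon.2", "release"),
    ("upnav.3", "opsavhrr"),
    ("updatesch.4", "opsavhrr"),
]
```

Under these assumptions, 'archive_daemon' still matches after a restart changes the numeric suffix, only one of the two ncepin jobs counts for opsavhrr, and an item named 'up' matches both upnav and updatesch, which is why such a short name is not unique.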
2.5.7 Not up to date. The data file has not been modified within the last
'Monitor Interval' minutes. Correction of this error depends on the
data file and the method used to keep it up to date. Typical examples are
the ADAPSTABLES ".dat" and ".timcor" files.
b. ".dat" files are ephemeris files that are pushed to the production system by
PC normally once every work day. If these files are not up to date on the
production system, check with PC. UPNAV propagates these files to other
systems in the same manner as the ".timcor" files.
2.5.8 Not in queue. There are fewer than the specified number of batch jobs
with the specified user id in the batch queue for the item. Restart batch
jobs for the item that caused the alert until there are the correct number of
jobs with the same user id. Make sure the user id is the one specified in the
initialization file or the alerts will continue.
2.5.9 RWAST State. The item has been in the RWAST state for more than 'RWAST
Monitor Interval' minutes. This normally indicates that the VMS process is
hung waiting for resources and the process will have to be restarted or the
system rebooted.
2.5.10 Not responding. The monitored system is not responding. This could
be an indication of network problems or a system error. Contact the system
administrators. Note: the monitor uses 'rsh system -n date' for its check.
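For troubleshooting, the same kind of check can be reproduced outside the monitor. The sketch below is an illustration in Python, not the monitor's implementation: the function name system_responds, the 30 second timeout and the injectable run parameter are assumptions. Only the command 'rsh system -n date' comes from this document.

```python
import subprocess


def system_responds(system, timeout_seconds=30, run=subprocess.run):
    """Return True if 'rsh <system> -n date' completes successfully.

    `run` defaults to subprocess.run and can be replaced for testing.
    Assumption: a non-zero exit status, a timeout, or a failure to launch
    rsh all count as "not responding".
    """
    command = ["rsh", system, "-n", "date"]
    try:
        result = run(command,
                     stdout=subprocess.DEVNULL,
                     stderr=subprocess.DEVNULL,
                     timeout=timeout_seconds)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

Calling system_responds("sg1") from an operator's account shows whether the failure is specific to the monitor or affects any remote shell connection to that system.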
2.5.11 Not running. The process has not updated its 'Log File' within the
last 'Monitor Interval' minutes. This normally means that the process is
hung and needs to be stopped and restarted. However, it could be
just waiting for processing to complete and hasn't been able to update the
log in a timely manner. Two notable processes exhibit this problem --
NOAAFTP and NCEPIN. Both of these processes ftp large data files from NOAA
and occasionally, due to network traffic or errors, the ftp will take longer
than the 'Monitor Interval' to complete and an alert will be issued. Make
sure this is not the case before restarting these types of processes.
Number  Date and Sections    Notes
1       September 19, 2000   Document Created
Contents
2 OPERATION
2.1 Starting and stopping ADAPSMONITOR
2.2 Normal operation
2.3 Loading New Defaults
2.4 The Defaults File
2.5 Alerts and What They Mean
1 INTRODUCTION
1.1 Identification
ADAPSMONITOR is an application that notifies computer operations personnel
of the status of critical AVHRR processes, systems and data files.
1.2 System Overview
ADAPSMONITOR continuously monitors critical AVHRR processes to ensure
that they are operational. It also monitors critical systems and data
files. If one of the critical processes fails to produce output, or a system
goes down or a data file is not updated, computer operations personnel are
notified. The user's interface with the ADAPSMONITOR is through a Graphical
User Interface (GUI).
1.3 Document Overview
The purpose of this document is to provide instructions for operation and
use of the ADAPSMONITOR.
1.4 Environment
The ADAPSMONITOR is designed using the X-Window system and uses the Motif
widget set. The monitor should run on any Unix-based system running X11R5.
2 OPERATION
2.1 Starting and stopping ADAPSMONITOR
Before starting the monitor two things must be set up:
1. ADAPSBASE must be in the user's path. ADAPSBASE is usually defined
as $EDCSOFT/run/adaps/base.
Given that both of the above requirements are satisfied, to start the monitor
with a user initialization file enter:
adapsmonitor -f user_supplied_initialization_file <ENTER>
To start the monitor with the default file enter:
adapsmonitor <ENTER>
A new window titled "ADAPS Monitor Main Window" will appear (See Figure 1).
Note that, depending on the number of remote systems,
processes and files that are to be monitored, the initial window may not be
updated for several minutes. During this time the user will see either a window
with no contents or a list of items that are "Unchecked", and the window's mouse
pointer will change to an hourglass.
2.2 Normal operation
Once started, the monitor will check each item and report on its status in the
main window. As long as everything is normal there are no other user inputs
needed.
2.3 Loading New Defaults
During normal operations, the only thing that can be changed is whether or
not an item specified in the list is monitored.
If, for example, it is desired to monitor an item not
on the current list, a new list must be loaded. To load a new list press the
Load button on the Modify window. This pops up a File Select window (See Figure 5).
2.4 The Defaults File
The Defaults file is a text file that contains the information the monitor needs
to do its job: a default system configuration. The file specifies, among other
things, what to monitor, where each item resides, what type of item it is, and
how often to check it. There are nine items of information that must be
specified for each item to be monitored:
1 item name
2 the system it is on
3 the type of item
4 the monitor interval in minutes
5 an RWAST interval for vax jobs
6 the initial monitor state
7 number of batch jobs
8 user ID
9 log file associated with batch job
All nine fields must be specified for each item even if they don't apply to
the item. If the information doesn't apply, -1 or NONE can be specified.
The layout of the file is as follows:
# lines beginning with the pound sign are taken to be comments and are ignored
#                                                      RWAST     Initial   Number
#                                Item        Monitor   Monitor   Monitor   of Batch
# Item Name             System   Type        Interval  Interval  State     Jobs      Userid    Logfile
ingest_daemon           edcsgs4  batch       10        -1        enabled   2         opsavhrr  $ADAPSLOG/ingest.log
ACQUIRE                 vaxh     vaxprocess  15        20        enabled   -1        NONE      NONE
sg1                     sg1      system      4         -1        disabled  -1        NONE      NONE
$ADAPSTABLES/21263.dat  edcsgs4  datafile    1445      -1        enabled   -1        NONE      NONE
Lines must not be longer than 255 characters. The fields do not have to be
aligned in columns, but they must appear in the proper order on the line.
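The file layout described above can be read with a few lines of code. The following is a minimal sketch in Python, not the monitor's own parser: the names MonitorItem and parse_line are hypothetical, and it assumes that fields are separated by whitespace and never contain spaces.

```python
import os
from typing import NamedTuple, Optional


class MonitorItem(NamedTuple):
    name: str
    system: str
    item_type: str
    interval: int        # minutes
    rwast_interval: int  # minutes, -1 when not applicable
    enabled: bool        # initial monitor state
    batch_jobs: int      # -1 when not applicable
    userid: str          # "NONE" when not applicable
    logfile: str         # "NONE" when not applicable


VALID_TYPES = {"datafile", "batch", "process", "vaxprocess", "system"}


def parse_line(line: str) -> Optional[MonitorItem]:
    """Parse one line of the defaults file; return None for comments/blanks."""
    if len(line.rstrip("\n")) > 255:
        raise ValueError("line is longer than 255 characters")
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    fields = text.split()  # assumption: whitespace-separated fields
    if len(fields) != 9:
        raise ValueError("expected 9 fields, got %d" % len(fields))
    name, system, item_type, interval, rwast, state, jobs, userid, logfile = fields
    if item_type not in VALID_TYPES:
        raise ValueError("unknown item type: " + item_type)
    if state not in ("enabled", "disabled"):
        raise ValueError("unknown monitor state: " + state)
    return MonitorItem(
        os.path.expandvars(name),   # expands $ADAPSTABLES and similar
        system,
        item_type,
        int(interval),
        int(rwast),
        state == "enabled",
        int(jobs),
        userid,
        os.path.expandvars(logfile),
    )
```

A check like this can be run against a new initialization file before loading it, so that a missing field or a misspelled item type is caught early.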
Use the following guidelines for choosing values for each column:
a. Item Name. This is the name of the item to be monitored. If it is a
file, it must be fully specified, e.g. "/sgs4/km1/adaps/file.name". Environment
variables such as $ADAPSTABLES or $EDCSOFT may be used. Vax and batch
process names should closely resemble the name of the process as it appears
in the 'show sys' command on the vax or the 'qstat' command on sgs4, and they
should be unique from other processes. For example, one could use 'up' as the
item name for monitoring upnav, but that would not be unique since it also
matches updatesch.
2.5 Alerts and What They Mean
The purpose of the Alert Window is to notify the user that something is not
quite right. The monitor periodically checks each item in the list and
determines if the monitor interval has elapsed for that item. If the
interval has elapsed a further check is made to see if it is operating
properly or up to date. Each item type has a different set of rules to make
the determination.
2.5.1 Datafile Type. The data file must have been modified within the last
'Monitor Interval' minutes.
If any of the criteria are not met an alert will be issued for the item.
This alert will continue to be issued periodically until the error condition
is corrected or monitoring of the item is disabled. The alert conditions,
their meaning and suggestions for correction follow:
2.5.6 Unchecked. The monitor was unable to make a connection with another
system or was unable to check the batch queue. The monitor relies heavily on
remote shell calls to check the status of processes and files. If the remote
shell connection cannot be made it is interpreted as an error and an alert is
issued. This could be an indication of network problems or the user may not
have access to the system. Note: this is also the normal condition of an
item that was disabled at start up.
a. ".timcor" files are time correction files that are pushed to the
production system by the acquisition system after each live acquisition. The
current acquisition system, vaxh, uses NETCOPY to push the files.
These files are then copied to other systems by UPNAV several times a day.
If the files are out of date on the production system, check that NETCOPY is
running on vaxh. Another reason that the files could be out of date on the
production system is that there have been no live acquisitions for the
satellite. For example, at the time of this writing NOAA 15 was unreliable and
we changed to NOAA 12 acquisitions. Since no NOAA 15 acquisitions were scheduled,
the noaa15.timcor file could not be updated on the production system since
the time correction measurement requires a live acquisition. Because of
this, only the ".timcor" files for scheduled live acquisitions should be
monitored.