ADAPSMONITOR
Design Document

Dave Lloyd
Computer Services Branch
Software Engineering Department

June 30, 2000



Signatures

Prepared by: Dave Lloyd, Signature on file
  Software Engineer,
  Raytheon, ITSS
Concurred by: Tim Baltzer, Signature on file
  Software Project Lead,
  Raytheon, ITSS
Approved by: Vicki Neuheisel, Signature on file
  Computer Operations,
  Raytheon, ITSS

Document History

Number  Date and Sections  Notes
1       June 30, 2000      Document Created
2       July 21, 2000      Updated with additions, modifications and clarifications per design review.
3
4
5
6


Contents

Introduction

Functional Summary

Monitor critical AVHRR processes, systems and data files. Notify computer operations personnel when a problem is detected.

Comments

None.

Background

The AVHRR Data Acquisition and Processing System (ADAPS) has many critical components. Several processes run continuously and are essential to the smooth flow of AVHRR data from acquisition to archive. These processes must communicate with and send data to systems other than the production system. There are also data files that must be updated frequently; these files are necessary for scene navigation and geometric registration.

Occasionally a critical AVHRR process will hang or abort, or an essential system will go down. If the error goes undetected, data may be lost or production delayed. In the past, computer operations personnel were relied upon to monitor the status of all ADAPS processes. However, such manual monitoring is not entirely reliable.

The normal procedure for checking the critical processes is to run qstat to see if all the batch jobs are in the queue. Qstat lists all processes that are in the queue on the current system. The listing is long and includes batch jobs submitted by other users, so it is easy to overlook a missing job. Another drawback to qstat is that there have been several instances where a process was listed as running when, in fact, it was hung.

To check processes on the AVHRR acquisition system, operators run the VAX command 'sh sys', which lists the users' processes and their status. Again, this is a manual procedure that is subject to error and depends on operators taking the time to run the check. A frequent problem is that the live acquisition software goes into an RWAST state and never comes out. The RWAST state means the process is waiting for some resource, usually a hardware resource.

There is currently no procedure in place to ensure that the critical data files are being updated, other than making sure that the update and propagate routines are running in the queue.

An application was needed to monitor all the critical processes, systems and data files automatically and notify computer operations personnel of any problems. That is the purpose of the ADAPSMONITOR.

Description from SRF 2119:

A job should be written to monitor the AVHRR jobs that should always be running in batch (MISSFTP_DAEMON, etc.). If one of them is not in batch queue, or is in the queue but appears to be hung, take an appropriate action to alert the operators. This SRF is not very concise because all the problems have not been addressed, but hopefully could be at some future date. The need for this functionality arises because sometimes a critical batch job gets blown out of queue for whatever reason, and undetected for 1, 2 days. Data is consequently lost. Questions: How would one tell if a batch job was hung??, How does one properly "alert the operators"??

Scope/Limitations

Overall design

The ADAPSMONITOR will be an X Windows application and provide a GUI for the operators. The interface will provide a main window display and several popup windows. The main window will list all monitored items and their status as well as the time of the last status check. An "Update" popup will be provided to allow the user to enable or disable monitoring of specific processes/systems/files. When status changes are detected, an "Alert" popup will inform the user and provide buttons for disabling future warnings. A "Load" dialog window will allow the user to specify a defaults file to be loaded during program execution.

The status of each process will be monitored by two methods: 1) checking the queue status and 2) making sure process logs are updated. Alerts will be issued if a process is not in the queue, is in a state other than 'running', or if its log file has not been touched at regular intervals. System status alerts will be issued if the system does not respond to a ping. Data files will also be checked to see if they have been touched as necessary.
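
The two process checks described above can be sketched as follows. This is only an illustration in Python, not the design's implementation: the names MonitoredProcess, log_is_stale and check_process, the field names, and the alert wording are all hypothetical. The queue state is passed in as a plain string (or None when the job is absent) because the format of the queue listing is not specified here, and a missing log file is assumed to count as stale.

```python
import os
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MonitoredProcess:
    # Field names are illustrative assumptions, not taken from the design.
    name: str
    log_path: str
    max_log_age_s: float  # longest allowed gap between log updates, in seconds


def log_is_stale(path: str, max_age_s: float, now: Optional[float] = None) -> bool:
    """Return True if the file is missing or was last modified more than max_age_s ago."""
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        return True  # assumption: a missing or unreadable log counts as stale
    return (now - mtime) > max_age_s


def check_process(proc: MonitoredProcess,
                  queue_state: Optional[str],
                  now: Optional[float] = None) -> List[str]:
    """Collect alert messages for one process.

    queue_state is None when the job is not in the queue; otherwise it is the
    state reported for the job (assumed to be 'running' when healthy).
    """
    alerts = []
    if queue_state is None:
        alerts.append(f"{proc.name}: not in queue")
    elif queue_state.lower() != "running":
        alerts.append(f"{proc.name}: state is '{queue_state}', expected 'running'")
    if log_is_stale(proc.log_path, proc.max_log_age_s, now):
        alerts.append(f"{proc.name}: log file not updated within {proc.max_log_age_s:.0f} s")
    return alerts
```

Checking the log timestamp independently of the queue state is what catches the case noted in the Background section, where a job is listed as running but is in fact hung: the queue check passes while the stale log still raises an alert.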

Items to be monitored, their initial states, item type, monitor intervals and batch information will be stored in an ASCII defaults file. The defaults file will be used to initialize the system and to provide major changes to the monitor. Major changes are those that add or remove entries or change the monitor interval.
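
As an illustration of how such a defaults file might be read, the sketch below parses one possible layout. The layout is an assumption made for this example only, since the actual file format is not given here: one item per line, whitespace-separated fields in the order NAME TYPE on|off INTERVAL, followed by optional free-form batch information, with '#' starting a comment. The type names, the on/off keywords, and the interval unit (minutes) are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MonitorItem:
    name: str
    item_type: str     # 'process', 'system' or 'file' (assumed type names)
    enabled: bool      # initial monitoring state
    interval_min: int  # monitor interval; minutes is an assumed unit
    batch_info: str    # free-form batch information, empty if none


def parse_defaults(text: str) -> List[MonitorItem]:
    """Parse the hypothetical layout: NAME TYPE on|off INTERVAL [BATCH INFO...]."""
    items = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # strip comments and surrounding blanks
        if not line:
            continue
        fields = line.split(None, 4)  # the fifth field keeps any embedded spaces
        if len(fields) < 4:
            raise ValueError(f"line {lineno}: expected at least 4 fields, got {len(fields)}")
        name, item_type, state, interval = fields[:4]
        if item_type not in ("process", "system", "file"):
            raise ValueError(f"line {lineno}: unknown item type '{item_type}'")
        if state not in ("on", "off"):
            raise ValueError(f"line {lineno}: state must be 'on' or 'off'")
        batch_info = fields[4] if len(fields) > 4 else ""
        items.append(MonitorItem(name, item_type, state == "on", int(interval), batch_info))
    return items
```

Under these assumptions, a line such as "MISSFTP_DAEMON process on 10 queue=example_batch" would yield an enabled process entry checked every 10 minutes; the queue name in that line is invented for the example.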

In this design the following conventions are used:

Flow diagrams

Figure 1 -- User Interface Flow Diagram
Figure 2 -- Main Flow Diagram
Figure 3 -- InitStruct Flow Diagram
Figure 4 -- CheckStatus Flow Diagram

Algorithm

Detail Design

Example Windows

Figure 5 -- Main Display
Figure 6 -- Alert Popup
Figure 7 -- Modify Popup
Figure 8 -- Exit Confirmation Dialog
Figure 9 -- Load Dialog
Figure 10 -- Error Message Dialog

Design Notes

This section is a collection of notes that were compiled during initial discussions of the feasibility of, and possible solutions for, the ADAPSMONITOR. They are presented here for historical purposes only and are not necessarily included in the actual design.