Dave Lloyd
Computer Services Branch
Software Engineering Department
June 30, 2000
Prepared by: | Dave Lloyd, Signature on file |
---|---|
Software Engineer, Raytheon, ITSS | |
Concurred by: | Tim Baltzer, Signature on file |
Software Project Lead, Raytheon, ITSS | |
Approved by: | Vicki Neuheisel, Signature on file |
Computer Operations, Raytheon, ITSS | |
Number | Date and Sections | Notes |
---|---|---|
1 | June 30, 2000 | Document Created |
2 | July 21, 2000 | Updated with additions, modifications and clarifications per design review. |
3 | | |
4 | | |
5 | | |
6 | | |
Occasionally a critical AVHRR process hangs or aborts, or an essential system goes down. If the error goes undetected, data may be lost or production delayed. In the past, computer operations personnel were relied upon to monitor the status of all ADAPS processes. However, such manual monitoring is not entirely reliable.
The normal procedure for checking the critical processes is to run qstat to see if all the batch jobs are in the queue. Qstat lists all processes that are in the queue on the current system. The listing is long and includes batch jobs submitted by other users, so it is easy to overlook a missing job. Another drawback of qstat is that there have been several instances where a process was listed as running when, in fact, it was hung.
To check processes on the AVHRR acquisition system, operators run the VAX command 'sh sys', which lists the user's processes and their status. Again, this is a manual procedure that is subject to error and depends on operators taking the time to run the check. A frequent problem is that the live acquisition software goes into an RWAST state and never comes out. The RWAST state means the process is waiting for some resource, usually a hardware resource.
There is currently no procedure in place to ensure that the critical data files are being updated, other than making sure that the update and propagate routines are running in the queue.
An application was needed to monitor all the critical processes, systems and data files automatically and notify computer operations personnel of any problems. That is the purpose of the ADAPSMONITOR.
Description from SRF 2119:
A job should be written to monitor the AVHRR jobs that should always be running in batch (MISSFTP_DAEMON, etc.). If one of them is not in batch queue, or is in the queue but appears to be hung, take an appropriate action to alert the operators. This SRF is not very concise because all the problems have not been addressed, but hopefully could be at some future date. The need for this functionality arises because sometimes a critical batch job gets blown out of queue for whatever reason, and undetected for 1, 2 days. Data is consequently lost. Questions: How would one tell if a batch job was hung??, How does one properly "alert the operators"??
The status of each process will be monitored by two methods: 1) checking the queue status and 2) making sure process logs are updated. Alerts will be issued if a process is not in the queue, is in a state other than 'running' or if its log file has not been touched at regular intervals. System status alerts will be issued if the system doesn't respond to a ping. Data files will also be checked to see if they have been touched as necessary.
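The log-file and data-file check described above can be sketched in a few lines. This is only an illustration in Python, not the ADAPSMONITOR implementation; the function name `is_stale` and its parameters are hypothetical, and the sketch assumes that "touched" means the file's modification time.

```python
import os
import time


def is_stale(path, interval_minutes, now=None):
    """Return True if `path` has not been modified within the interval.

    `now` is a Unix timestamp. It defaults to the current time and is a
    parameter only so the check can be exercised with a fixed clock.
    """
    if now is None:
        now = time.time()
    age_seconds = now - os.path.getmtime(path)
    return age_seconds > interval_minutes * 60
```

A process whose log file is stale by this measure would be reported as hung, and a data file as not updated.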
Items to be monitored, their initial states, item type, monitor intervals and batch information will be stored in an ASCII defaults file. The defaults file will be used to initialize the system and to provide major changes to the monitor. Major changes are those that add or remove entries or change the monitor interval.
In this design the following conventions are used:
Initialize system
Realize Main window
Loop
    Check for event
    Run Event Call Back (CB)
End-Loop

Event CB's
    1. Clean Up and Exit
    2. Check status of Items in list
    3. Alert user of Status changes
    4. Modify the process states
    5. Load new process state
    6. Notify user of Errors
    7. 12 additional button and dialog CB's to accept user input such as Cancel, Apply or Acknowledge buttons
Inputs:
    Initialization file name (optional)
    Minimum timer interval (optional)
    X resource definitions (optional)
Outputs:
    None
Returns:
    None

The main program sets up the user interface by creating the widgets and realizing the main window. The main window contains a list of Items, their status and the last time the status was checked (see Figure 5). The main data structure is a list of Items. The Items describe what is to be monitored, how often it is checked, the type of item it is and its state or status. When setup is complete, the event loop is entered. The event loop never exits; its purpose is to intercept Events and execute the appropriate Call Back.
Call InitStruct to initialize data structures (ItemList)
If InitStruct return is fatal
    Clean up (free widgets, memory and structures)
    Exit
End-if
Check for timer interval specification on command line
If specified
    Set interval to that value
Else
    Set interval to default (1 minute)
End-If
Initialize Shell, Forms and Widgets for main display and popups
Register the call backs for buttons and alerts
Register a timer for the CheckStatus call back with very short expiration
    (this will force a first check of the ItemList and update the main display)
Enter event loop
Inputs:
    Initialization file name
    Default flag (Y/N): use the default file if the initialization file cannot be opened
Outputs:
    None
Returns:
    Initialized Item list structure

Input to InitStruct will be an initialization file name and a default flag. The named file is expected to be in the following format:

    Item name (process/system/complete file name)
    System on which the item runs or resides (e.g. edcsgs4, vaxh, sg1)
    Item type (batch, process, vax process, system or file)
    Monitor interval (in minutes)
    RWAST interval (only used for vax process)
    Initial monitor state (enable/disable monitoring)
    Number of jobs in batch (ignored for non-batch Items)
    User ID of job (ignored for non-batch Items)
    Process log file names

An attempt will be made to open the file. If the open fails and the default flag is Y, an attempt to open $ADAPSTABLES/adapsmonitor.default will be made. If the first open fails and the flag is N, or if both opens fail, a fatal error will be returned. On a successful open, each line of the file will be read and checked for proper format. A fatal error will be returned if any line is not correct. The ItemList data structure will be built and initialized from the properly formatted lines. In addition to the information listed in the initialization file, the ItemList data structure will contain the following information:

    Current monitor state (enabled or disabled)
    Current state of the item (unchecked, running, up to date, not in queue, RWAST, suspended, hung, or not responding)
    Local time of the last item check
    RWAST state flag
    RWAST initial condition time
    Batch job number

Batch jobs will have one entry per job in the ItemList. If the end of file is reached without error, ItemList will be returned. Fatal errors will be logged.
While error opening initialization file
    If initialization file = the default file name
        Write error to log file
        Return fatal error
    Else If DefaultFlag = Yes
        Change initialization file name to default file name
    Else
        Write error to log file
        Return fatal error
    End-If
End-While
While Reading line from input file doesn't reach EOF
    Parse input line checking for discrepancies
    If error parsing
        Write error to log file
        Return fatal
    End-If
    Update the ItemList data structure
    If error updating
        Write error to log file
        Return fatal
    End-If
End-While
Return ItemList
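The parsing half of InitStruct might look roughly like the Python sketch below. The design does not specify the field delimiter, so the sketch assumes one Item per line with comma-separated fields in the order listed above; the `Item` class, its attribute names, `InitError`, `parse_line` and `init_struct` are all hypothetical. Expansion of a batch entry into one entry per job, opening the file, and the fallback to the default file are omitted.

```python
from dataclasses import dataclass, field

ITEM_TYPES = {"batch", "process", "vax process", "system", "file"}


class InitError(Exception):
    """Raised for a malformed line; the design treats this as fatal."""


@dataclass
class Item:
    name: str
    system: str
    item_type: str
    monitor_interval: int       # minutes
    rwast_interval: int         # minutes, only meaningful for vax process
    enabled: bool
    jobs_in_batch: int
    user_id: str
    log_files: list = field(default_factory=list)
    state: str = "unchecked"    # runtime field, not read from the file
    last_checked: float = 0.0   # runtime field, not read from the file


def parse_line(line, lineno):
    """Parse one comma-separated line (assumed delimiter) into an Item."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) < 8:
        raise InitError(f"line {lineno}: expected at least 8 fields")
    name, system, item_type, interval, rwast, state, jobs, user = parts[:8]
    if item_type not in ITEM_TYPES:
        raise InitError(f"line {lineno}: unknown item type {item_type!r}")
    if state not in ("enable", "disable"):
        raise InitError(f"line {lineno}: bad monitor state {state!r}")
    try:
        return Item(name, system, item_type, int(interval), int(rwast),
                    state == "enable", int(jobs), user,
                    [p for p in parts[8:] if p])
    except ValueError as exc:
        raise InitError(f"line {lineno}: {exc}") from exc


def init_struct(lines):
    """Build the item list; the first bad line aborts the whole load.

    Blank lines are skipped, which is a choice made for this sketch.
    """
    return [parse_line(text, n)
            for n, text in enumerate(lines, 1) if text.strip()]
```

Raising on the first bad line mirrors the design's rule that any incorrectly formatted line is a fatal error, so a partially loaded list is never returned.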
Inputs:
    Parent widget
    Item list
Outputs:
    Updated Item list
Returns:
    None

This CB code is run as the result of the expiration of a timer event. Once the timer has expired, it must be reset at the minimum timer interval so that the event can trigger this CB again. To determine whether it is time to check the status of an Item, get the current system time. Go through the ItemList and compare the current time with each Item's last check time. If the difference between the two is greater than the Item's check interval, it is time to check the Item again. Check the status of the required Items and set an alarm for each Item that is not correct. Update the last time checked of the Item. Update the Main window display with any changes. If, after checking the status of the entire list, any alarms are set, issue an Alert event. Alerts will be logged.
Register timer for CheckStatus call back at minimum timer interval
Get current time
For each item in the ItemList
    If it is time to check the item (current time - last checked time > item check interval)
        Check the item status: call CheckItem
        If status is NOT OKAY
            Set alarm for item
        End-if
        Update item's data structure with current time
    End-if
End-for
Update main display
If alarms are set
    Write the alert message to the log
    Issue alert callback
End-if
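The "is it time to check" test and the alarm collection can be sketched as follows. This is a minimal Python illustration with hypothetical names (`items_due`, `check_status`, and the dictionary keys); it assumes that Items whose monitoring is disabled are skipped, which the pseudocode above does not state explicitly. Timer registration and display updates are left out.

```python
def items_due(items, now):
    """Return the enabled items whose check interval has elapsed.

    Each item is a dict with 'enabled', 'last_checked' (Unix seconds)
    and 'interval' (minutes); these key names are illustrative.
    """
    return [it for it in items
            if it["enabled"]
            and now - it["last_checked"] > it["interval"] * 60]


def check_status(items, now, check_item):
    """One timer tick: check due items, stamp them, return alarmed names."""
    ok_states = {"running", "up to date", "system responding"}
    alarms = []
    for it in items_due(items, now):
        it["state"] = check_item(it)   # check_item stands in for CheckItem
        it["last_checked"] = now
        if it["state"] not in ok_states:
            alarms.append(it["name"])
    return alarms
```

Passing the clock and the CheckItem function in as arguments keeps the scheduling logic testable without a running X event loop.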
Inputs:
    Item structure
Outputs:
    None
Returns:
    Item status: Running, hung, not in queue, up to date, RWAST, ...

The CheckItem function is called with one Item member of the ItemList structure. The Item is checked based on the Item type. The Item status is returned.
If Item type is batch
    Get queue status of all Items that match user ID
    For each Item status returned
        If status is for current Item
            break from loop
        End-if
    End-for
    If Item status is not in queue
        Return not in queue
    Else if Item status is not running
        Return hung
    End-if
End-if
If Item type is batch OR process OR file
    If Item is file
        Retrieve date/time of file
    Else
        Retrieve date/time of log file
    End-if
    If current time - retrieved date/time > monitor interval
        If Item is file
            Return not updated
        Else
            Return hung
        End-if
    Else
        If Item is file
            Return up to date
        Else
            Return running
        End-if
    End-if
End-if
If Item type is vax process
    Issue a "rsh sh sys | grep 'Item'"
    Extract 3rd field of results
    If 3rd field is RWAST
        If RWAST flag in Item structure is set
            Compare initial RWAST time with current time
            If difference > RWAST state interval
                Return RWAST
            End-if
        Else
            Set RWAST flag
            Set initial RWAST time to current time
        End-if
        Return running
    Else-if 3rd field is SUSP
        Unset RWAST flag if set
        Return SUSP
    Else
        Unset RWAST flag if set
        Return running
    End-if
End-if
If Item type is system
    Issue a "rsh date"
    If system doesn't respond
        Return system not responding
    Else
        Return system responding
    End-if
End-if
Return error
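The RWAST handling is the least obvious branch of CheckItem, because a process is only reported as RWAST after it has stayed in that state longer than the RWAST interval. A minimal Python sketch of that bookkeeping follows. The function name and dictionary keys are hypothetical, a single `rwast_since` timestamp stands in for the RWAST flag plus initial condition time, and the sketch assumes that the first sighting of RWAST is still reported as running.

```python
def vax_process_state(item, observed_state, now):
    """Map the state reported by the VAX to a monitor status.

    `item` is a dict holding 'rwast_since' (Unix seconds, or None when
    the process was not in RWAST at the last check) and
    'rwast_interval' (minutes). Both key names are illustrative.
    """
    if observed_state == "RWAST":
        if item["rwast_since"] is None:
            # First time RWAST is seen: start the clock.
            item["rwast_since"] = now
        elif now - item["rwast_since"] > item["rwast_interval"] * 60:
            return "RWAST"
        return "running"
    # Any other state clears the RWAST bookkeeping.
    item["rwast_since"] = None
    return "SUSP" if observed_state == "SUSP" else "running"
```

Clearing the timestamp whenever another state is observed means a process that briefly enters RWAST and recovers never raises an alert.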
Inputs:
    Parent widget
    Client data: Item list
    Call data: unused
Outputs:
    None
Returns:
    None

Pressing the Exit button on the Main window will issue an Exit event and execute the Exit CB. The purpose of the Exit CB is to pop up an exit confirmation window. The exit confirmation window contains a dialog asking if the user really wants to exit. There will be two buttons on the window: "Cancel" and "Really Exit" (see Figure 8).
Realize the exit confirmation popup
Inputs:
    Parent widget
    Client data: Item list
    Call data: unused
Outputs:
    None
Returns:
    None

If the user really wants to exit, pressing the "Really Exit" button on the Exit Confirmation window issues an ExitConfirmation event and executes the ExitConfirmation CB. The ItemList structure is used to create or overwrite the $ADAPSTABLES/adapsmonitor.state file. This CB then releases all memory and destroys the widgets. The Main window is destroyed and the process exits. The exit is logged as a "User termination."
Create/overwrite state file
Clean up, release memory and widgets
Log user termination
Destroy main display
Exit
Inputs:
    Parent widget
    Client data: unused
    Call data: unused
Outputs:
    None
Returns:
    None

If the user doesn't want to exit, pressing the Cancel button on the Exit Confirmation window issues a CancelConfirmation event and executes this CB. The CancelConfirmation CB simply destroys the Exit Confirmation window.
Destroy ExitConfirmation popup
Inputs:
    Parent widget
    Client data: Item list
    Call data: unused
Outputs:
    None
Returns:
    None

This CB is registered to the Alert event. The Alert window may already be realized, so check for its existence before realizing the window. If it already exists, either update the window with new information or destroy the current window before realizing a new one. The Alert window contains a list of all alerts: the Item name and its current status along with an Enable/Disable toggle button (see Figure 6). There are Acknowledge and Apply buttons on the window.

If alert popup is already active
    Update popup
    Bring popup to top
Else
    Realize Alert popup
End-if

Or

If alert popup is already active
    Destroy Alert popup
End-if
Realize Alert popup
Inputs:
    Parent widget
    Client data: Item list
    Call data: unused
Outputs:
    Updated Item list
Returns:
    None

Pressing the Apply button on the Alert window issues the AlertApply event and executes the AlertApply CB. Any changes to the monitor status (enable/disable) of an Item listed on the Alert window will be applied to the ItemList, and the Modify and Main windows are updated with the new information. The ItemList structure is used to create or overwrite the $ADAPSTABLES/adapsmonitor.state file. The Alert window is then destroyed and the changes logged.

Update data structures with user changes
UpdateModifyWindow
UpdateMainWindow
Create/overwrite state file
Log changes
Destroy the alert popup widget
Inputs:
    Parent widget
    Client data: unused
    Call data: unused
Outputs:
    None
Returns:
    None

Pressing the Acknowledge button on the Alert window issues the AcknowledgeAlert event and executes the AcknowledgeAlert CB. The acknowledgement is logged and the Alert window destroyed.

Log acknowledgement
Destroy the alert popup widget
Inputs:
    Parent widget
    Client data: Item list
    Call data: unused
Outputs:
    Updated Item list
Returns:
    None

Pressing the Modify button on the Main window issues the Modify event and executes the Modify CB. If the Modify window is already present, raise the window to the top; otherwise realize the Modify window. The window contains a list of all Items and an Enable/Disable toggle button (see Figure 7). The monitor state of the Item is indicated by the toggle button. There are also three buttons on the window: Apply, Load and Cancel.

If Modify popup is active
    Raise Modify to top
Else
    Realize the modify popup widget
End-If
Inputs:
    Parent widget
    Client data: Item list, new Item list
    Call data: unused
Outputs:
    Updated Item list
Returns:
    None

Pressing the Apply button on the Modify window issues the ApplyModify event and executes the ApplyModify CB. Any changes to the monitor status are made to the Item structure. If the changes are due to a LoadDefault CB, a short duration timer is issued so that the Item list fields of the newly initialized list are updated. The ItemList structure is used to create or overwrite the $ADAPSTABLES/adapsmonitor.state file. The Modify window is destroyed and changes logged.

Make changes to data structure
Create/overwrite state file
Log changes
If a LoadDefault occurred
    Issue a very short duration timer event
End-if
Destroy the modify popup widget
Inputs:
    Parent widget
    Client data: unused
    Call data: unused
Outputs:
    None
Returns:
    None

Pressing the Cancel button on the Modify window issues the CancelModify event and executes the CancelModify CB. The Modify window is destroyed.
Destroy the modify popup widget
Inputs:
    Parent widget
    Client data: unused
    Call data: unused
Outputs:
    Initialization file name
Returns:
    None

Pressing the Load button on the Modify window issues the LoadDefault event and executes the LoadDefault CB. The Load Dialog window is realized. The window contains a place for the user to enter a file name. There are also Cancel and Load buttons (see Figure 9).
Realize FileDialog popup widget
Inputs:
    Parent widget
    Client data: unused
    Call data: unused
Outputs:
    None
Returns:
    None

Pressing the Cancel button on the Load Dialog window issues the CancelLoadDefault event and executes the CancelLoadDefault CB. The Load Dialog window is destroyed.
Destroy FileDialog popup widget
Inputs:
    Parent widget
    Client data: Item list
    Call data: unused
Outputs:
    Initialized Item list
Returns:
    None

Pressing the Load button on the Load Dialog window issues the OpenLoadDefault event and executes the OpenLoadDefault CB. After retrieving the file name from the Load Dialog, the file is checked for existence. If the file doesn't exist, an ErrorDialog CB is issued with "File not found!" as the text and OpenLoadDefault returns. If the file exists, InitStruct is called with the file name and the default flag set to N. If an error is returned from InitStruct, an ErrorDialog CB is issued with "Unable to read file [XXXXX]!" as the text and OpenLoadDefault returns. If InitStruct returns successfully, the Load Dialog window is destroyed, the Modify window is updated with the new information and the new structure is returned.

Retrieve file name from FileDialog widget
Check for file existence
If file exists
    Call InitStruct with file name
    If error returned
        Issue ErrorDialog
        Return
    Else
        Destroy FileDialog widget
        UpdateModifyWindow
        Return new ItemList
    End-if
Else
    Issue ErrorDialog
End-if
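The decision logic of this call back, separated from the widget handling, can be sketched in Python as below. The function name `open_load_default` and its return convention are hypothetical; `init_struct` is a stand-in for InitStruct passed in by the caller and is assumed to raise an exception on a fatal error. The two message strings are the ones given in the description above.

```python
import os


def open_load_default(file_name, init_struct):
    """Return (item_list, error_message); exactly one of them is None.

    `init_struct` is any callable that takes a file name and returns an
    item list, raising an exception if the file cannot be loaded.
    """
    if not os.path.isfile(file_name):
        return None, "File not found!"
    try:
        return init_struct(file_name), None
    except Exception:
        # Any loader failure is reported to the user the same way.
        return None, f"Unable to read file [{file_name}]!"
```

The caller would pass a non-empty error message to the ErrorDialog CB and leave the Load Dialog open; on success it would destroy the dialog and update the Modify window.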
Inputs:
    Parent widget
    Client data: error message
    Call data: unused
Outputs:
    None
Returns:
    None

The ErrorDialog call back is executed when errors occur that need to be sent to the user. The Error Message window is realized. The window contains the message and an Acknowledge button (see Figure 10). The error message is logged.

Log the error message
Realize ErrorDialog popup widget
Inputs:
    Parent widget
    Client data: unused
    Call data: unused
Outputs:
    None
Returns:
    None

Pressing the Acknowledge button on the Error Message window issues the AckErrorDialog event and executes the AckErrorDialog CB. The acknowledgement is logged and the Error Message window destroyed.

Log the acknowledgement
Destroy ErrorDialog widget
Inputs:
    Parent widget
    Client data: Item list
    Call data: unused
Outputs:
    None
Returns:
    None

Update the main window display with any changes to an Item's monitor state, name, status or last checked time. If the Item list is new, the current display scrolled list is destroyed and a new one built.

If Item list is new
    Remove all items from scrolled list
End-If
For each Item in Item list
    If Item is not in scrolled list
        Convert display string to an X string
        Append display string to end of scrolled list
        Store scrolled list index in Item
    Else-If Item needs to be Redisplayed
        Replace scrolled list entry with new display string
    End-If
End-For
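The append-or-replace logic above can be illustrated without any widget toolkit by treating the scrolled list as a plain Python list of strings. Everything in this sketch is hypothetical: the function name, the dictionary keys, the `dirty` flag used to mean "needs to be redisplayed", and the display string layout. Conversion to an X string is omitted.

```python
def update_main_window(rows, items, is_new_list=False):
    """Synchronise `rows` (display strings) with `items` (dicts).

    Each item keeps its row position in 'index' (None if not yet shown)
    and a 'dirty' flag set when its displayed fields have changed.
    """
    if is_new_list:
        # A new Item list invalidates every stored row position.
        rows.clear()
        for it in items:
            it["index"] = None
    for it in items:
        text = f'{it["name"]}  {it["state"]}  {it["last_checked"]}'
        if it["index"] is None:
            rows.append(text)
            it["index"] = len(rows) - 1
        elif it["dirty"]:
            rows[it["index"]] = text
        it["dirty"] = False
    return rows
```

Storing the row index in the Item, as the design specifies, lets a changed entry be replaced in place rather than rebuilding the whole list on every check.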
Inputs:
    Parent widget
    Client data: Item list
    Call data: unused
Outputs:
    None
Returns:
    None

Update the Modify window display with any changes to an Item's monitor state. If the Item list is new, the current Modify window display is destroyed and a new one built.

If Item list is new
    Remove all forms from current display
End-If
For each Item in Item list
    If Item is not on display
        Add Item to display by creating a new widget
        Store form widget in Item
    Else-If Item monitor state has changed
        Change displayed monitor state
    End-If
End-For
00000132 ACQUIRE HIB 4 22998 0 00:14:02.52 1114 1586
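CheckItem extracts the third field of a line like the one above to obtain the process state. A minimal Python sketch of that extraction follows; the function name `process_state` is hypothetical, and the sketch assumes the process name contains no spaces, since a name with embedded spaces would shift the field positions.

```python
def process_state(sh_sys_line):
    """Return the third whitespace-separated field of a 'sh sys' line."""
    fields = sh_sys_line.split()
    if len(fields) < 3:
        raise ValueError("unexpected 'sh sys' line: %r" % sh_sys_line)
    return fields[2]


line = "00000132 ACQUIRE HIB 4 22998 0 00:14:02.52 1114 1586"
print(process_state(line))  # prints HIB
```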
/sg1/csb/dlloyd [10] baq -Qkgeneral
kgeneral: submissions queuing; executions running.
ID    Job                   Owner     Time    State
2336  noaaftp_daemon.2203_  opsavhrr  26705   running
2374  auditsch_daemon.2203  opsavhrr  136077  running
2419  updatesch_daemon.220  opsavhrr  94973   running
/sg1/csb/dlloyd [11]
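A queue listing like this one could be reduced to a per-job status along the lines of the Python sketch below. The function names are hypothetical, and the sketch assumes the listing arrives with one job per line and five whitespace-separated columns (ID, Job, Owner, Time, State). Job names appear to be truncated in the listing, so jobs are matched by prefix; a job that is present but not in the 'running' state is reported as hung, following the CheckItem design.

```python
def queue_states(listing, owner):
    """Map job name -> state for the jobs owned by `owner`."""
    states = {}
    for line in listing.splitlines():
        fields = line.split()
        # Job rows have five columns and start with a numeric ID.
        if len(fields) == 5 and fields[0].isdigit() and fields[2] == owner:
            states[fields[1]] = fields[4]
    return states


def batch_status(states, job_prefix):
    """Return 'running', 'hung' or 'not in queue' for one batch job."""
    for job, state in states.items():
        if job.startswith(job_prefix):
            return "running" if state == "running" else "hung"
    return "not in queue"
```

Filtering on the owner first addresses the problem noted in the introduction, where jobs submitted by other users make a missing job easy to overlook.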