Chapter 110 SYSH (Syshealth) Messages

The messages in this chapter are sent to $0 by the Syshealth subsystem. You can view operator messages using either the Viewpoint application or the Syshealth event viewing screen. To view messages on the event viewing screen, you must first select the $0 message log using the File menu Use Log command. Syshealth operator messages are listed as subsystem SYSH.




	NOTE: Negative-numbered messages are common to most subsystems. If you receive a negative-numbered message that is not described in this chapter, see Chapter 15.

1000

module-title (process-name) - Module launched [with undefined externals].

`module-title`	is the name of the Syshealth module that was launched. It is derived from the unique module identification number. This ID is globally defined for all modules in Syshealth.
`process-name`	is the process name or the processor number and process identification number (PIN) of the launched process.

Cause The Persistence Monitor is launching a Syshealth module.

Effect The Syshealth module is started.

Recovery Informational message only; no corrective action is needed.

1001

module-title (process-name) - Too many module relaunch attempts.

`module-title`	is the name of the Syshealth module that failed to start. It is derived from the unique module identification number. This ID is globally defined for all modules in Syshealth.
`process-name`	is the process name of the failed module. It is the process name or the processor number and process identification number (PIN) of the launched process.

Cause The monitored module has failed to start.

Effect The module is not running.

Recovery Analyze the previous Syshealth MODULE-FAILED error (#1002) and the error buffer to determine the cause of the problem. Correct the problem and use the Syshealth Management screen to start the module.

1002

module-title (process-name) - Failed due to { CPU failure | ABEND }

`module-title`	is the name of the Syshealth module that failed. It is derived from the unique module identification number. This ID is globally defined for all modules in Syshealth.
`process-name`	is the process name or the processor number and process identification number (PIN) of the launched process.

Cause The monitored module failed.

Effect The module is not running.

Recovery Informational message only; no corrective action is needed. The Persistence Monitor tries to relaunch the module.

1003

module-title (process-name) - Internal Error. Error Information:

`module-title`	is the name of the module that had an error. It is derived from the unique module identification number. This ID is globally defined for all modules in Syshealth.
`process-name`	is the process name of the module in error. It is the process name or the processor number and process identification number (PIN) of the launched process.

Cause The Persistence Monitor failed because of an internal error.

Effect The Persistence Monitor is not running.

Recovery Analyze the error information in this error message to determine the cause of the problem. Correct the problem and use the Syshealth Management screen to start the Persistence Monitor.
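
Messages 1000 through 1003 share the layout module-title (process-name) - text. The following Python sketch shows one way a log-scanning script might split such a line into its parts. The regular expression, the function name, and the sample line (including the process name $ZHM) are illustrative assumptions; the exact spacing of real message text is not specified in this chapter.

```python
import re

# Hypothetical pattern for "module-title (process-name) - text".
# The single spaces around the parentheses and the dash are an assumption.
PREFIX = re.compile(
    r"^(?P<module_title>.+?) \((?P<process_name>[^)]*)\) - (?P<text>.*)$"
)


def split_module_message(line):
    """Return (module_title, process_name, text), or None if the line does not match."""
    match = PREFIX.match(line.strip())
    if match is None:
        return None
    return (
        match.group("module_title"),
        match.group("process_name"),
        match.group("text"),
    )


# Made-up sample line for illustration only.
sample = "Health Monitor ($ZHM) - Failed due to ABEND"
print(split_module_message(sample))
```

A line that does not follow the layout yields None, so a caller can skip messages from other subsystems without raising an error.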

2000

Unable to access Syshealth Command Database file. File error file-error while attempting proc-failing.

`file`	is the name of the Syshealth command database file on which the error was detected.
`file-error`	is the file-system error associated with the failure.
`proc-failing`	is the file-system procedure associated with the failure.

Cause When TMDSAUTO is run, it first tries to open the Syshealth command database. Then it performs a read on the database’s header record. If the open or read fails, this error is generated.

Effect The Syshealth monitoring processes (Persistence Monitor, System Health Monitor, and Notification system) and the Syshealth user interface depend on the command database to implement command security. Syshealth runs, but it uses its default command security.

Recovery If the file was not found, it either was not installed or has been purged. Reinstall the file using the INSTALL program. If the error was other than “File Not Found,” see Appendix B for a definition of the specified error. For more detailed information, including recovery actions, see the Guardian Procedure Errors and Messages Manual.

2001

Unable to access Syshealth Management Database file during init-phase. File error file-error while attempting proc-failing. Syshealth was not started.

`file`	is the name of the Syshealth management database file on which the error was detected.
`init-phase`	indicates what phase of the Syshealth initialization failed. If message 2001 is generated, this parameter indicates how the management database was being used when the error occurred.
`file-error`	is the file-system error associated with the failure.
`proc-failing`	is the file-system procedure associated with the failure.

Cause When TMDSAUTO is run, the management database is used to bring up Syshealth. First the database file is opened. Then the file description records are read and used in checking the accessibility of all Syshealth files. The Persistence Monitor process startup information is read and used to start the monitor. Finally, other processes’ startup information is read and sent to the Persistence Monitor to start them.

If any of these steps fails due to an error when accessing the management database, message 2001 is generated.

Effect Syshealth depends on the management database being accessible. If it is not accessible, Syshealth is not started, although TMDSAUTO starts the Tandem Maintenance and Diagnostic Subsystem (TMDS) collector ($ZLOG).

Recovery See Appendix B for a definition of the specified error. For more detailed information, including recovery actions, see the Guardian Procedure Errors and Messages Manual.

2002

Unable to access Syshealth Configuration Database file. File error file-error while attempting proc-failing.

`file`	is the name of the Syshealth configuration database file on which the error was detected.
`file-error`	is the file-system error associated with the failure.
`proc-failing`	is the file-system procedure associated with the failure.

Cause When TMDSAUTO is run, it attempts to open the Syshealth configuration database and perform a read on the database file’s header record. If the database file is not found, TMDSAUTO attempts to create the file. If the create, open, or read fails, then message 2002 is generated.

Effect The Syshealth user interface depends on the configuration database being accessible. If it is not, the Syshealth user interface cannot alter any of the configurable options of Syshealth.

Recovery See Appendix B for a definition of the specified error. For more detailed information, including recovery actions, see the Guardian Procedure Errors and Messages Manual.
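
Messages 2000 through 2002 all end with the phrase "File error file-error while attempting proc-failing". As a rough sketch, a script could pull those two values out of the message text before looking up the error. The pattern, the function name, and the sample text (including the file name and the error number) are assumptions made for illustration, not output captured from a real system.

```python
import re

# Hypothetical pattern for the trailing "File error ... while attempting ..." phrase.
FILE_ERROR = re.compile(
    r"File error (?P<file_error>\S+) while attempting (?P<proc_failing>[^.\s]+)"
)


def extract_file_error(text):
    """Return (file_error, proc_failing) as strings, or None if the phrase is absent."""
    match = FILE_ERROR.search(text)
    if match is None:
        return None
    return match.group("file_error"), match.group("proc_failing")


# Made-up message text; the file name and values are placeholders.
sample = (
    "Unable to access Syshealth Configuration Database $VOL.SUBVOL.CONFIG. "
    "File error 11 while attempting OPEN."
)
print(extract_file_error(sample))
```

Both values are returned as strings so that the caller decides how to interpret them when consulting the error documentation.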

2003

Syshealth not properly installed. n Syshealth files missing

n

is the number of Syshealth files that were not found on the system.

Cause When TMDSAUTO is run, it checks that all of the appropriate Syshealth files are present. It does this by reading the file-verification records in the management database, then by opening each file that should be on the current system. If any files are missing, this error is generated. If the open fails for any other reason, this error also is generated.




	NOTE: This error message is generated only if one of the Syshealth monitor files (Health Monitor, Persistence Monitor, or Notification module) is inaccessible.

Effect If a monitor file is missing (that is, one critical to the operation of Syshealth automatic monitoring), then Syshealth does not run. The effects of other files missing depend on the file: some screens might not be available, help might be missing, and so on. Note that even if Syshealth does not run, TMDSAUTO starts the Tandem Maintenance and Diagnostic Subsystem (TMDS) message collector ($ZLOG).

Recovery Examine the error using the TMDS FIND command (or, if sufficient portions of Syshealth are available, using the Syshealth Event Viewing screen) to see exactly which files are missing. The missing files either were not installed or have been purged. Reinstall the files using the INSTALL program.

2004

Syshealth not properly installed. n files were inaccessible.

n

is the number of Syshealth files that were not accessible to TMDSAUTO.




	NOTE: This error message is generated only if one of the Syshealth monitor files (Health Monitor, Persistence Monitor, or Notification module) is inaccessible.

Cause When TMDSAUTO is run, it checks that all of the appropriate Syshealth files are present. It does this by reading the file-verification records in the management database, then by opening each file that should be on the current system. If the open fails for any reason other than “File Not Found,” then message 2004 is generated.

Effect Syshealth may not be installed or secured correctly. If Syshealth cannot access a monitor file (a file critical to the operation of Syshealth automatic monitoring), it does not run. If other files are inaccessible, some Syshealth functions may be lost, depending on which files are not available. Note that even if Syshealth does not run, TMDSAUTO starts the Tandem Maintenance and Diagnostic Subsystem (TMDS) collector ($ZLOG).

Recovery See Appendix B for a definition of the specified error. For more detailed information, including recovery actions, see the Guardian Procedure Errors and Messages Manual.

2005

Syshealth Persistence Monitor could not be started due to [ No definition of the process in the Syshealth Management Database.] [ Newprocess error newprocess-error, with program file filename. ] [ File error file-error when performing proc-failing. ] [ SPI error spi-error during proc-failing. ] [ TMDSAUTO internal error during proc-failing. ] Syshealth was not started.

`newprocess-error`	is the error associated with the failure of a NEWPROCESS procedure call.
`filename`	is the name of the program file that experienced the NEWPROCESS error.
`file-error`	is the file-system error associated with the failure.
`proc-failing`	is the procedure associated with the failure.
`spi-error`	is the Subsystem Programmatic Interface (SPI) error code associated with the failure.

Cause When TMDSAUTO is run, it performs initial checks and then starts the Syshealth Persistence Monitor. If any of the functions (memory allocation, NEWPROCESS, OPEN, or WRITEREAD) fail, then Syshealth does not start and message 2005 is generated. Also, if the Syshealth management database was improperly built or corrupted, the Persistence Monitor definition may not be accessible.

Effect The Syshealth Persistence Monitor is a process pair that starts the remaining Syshealth processes (which are not process pairs) and monitors them to ensure that they remain running. If the Persistence Monitor cannot be started, the remaining Syshealth processes are not started and Syshealth automatic fault monitoring and reporting is not operational, although TMDSAUTO starts the Tandem Maintenance and Diagnostic Subsystem (TMDS) message collector ($ZLOG) and TMDS fault analysis ($ZMOM).

Recovery If a file-system error is specified, see Appendix B for a definition of the error. See the Guardian Procedure Errors and Messages Manual for a description of the file-system and NEWPROCESS errors and their recovery actions.

If the problem resulted from an internal error, contact the Global NonStop Solution Center (GNSC) and provide all relevant information as follows:

Descriptions of the problem and accompanying symptoms
Details from the message or messages generated
Supporting documentation such as Event Management Service (EMS) logs, trace files, and a processor dump, if applicable

If your local operating procedures require contacting the Global Mission Critical Solution Center (GMCSC), supply your system number and the numbers and versions of all related products as well.

2006

Syshealth Process Startup Error. target-process Not Started. { File Error | SPI Error } error when attempting proc‑failing while communicating with the Persistence Monitor.

`target-process`	is the name of a Syshealth process that was not started due to the failure to communicate with the Persistence Monitor.
`error`	is the file-system error or the Subsystem Programmatic Interface (SPI) error code associated with the failure, if any.
`proc-failing`	is the procedure associated with the failure.

Cause This message reports a failure when an SPI ADD MODULE or SPI LAUNCH command is sent to the Syshealth Persistence Monitor. The failure can be either an SPI error or a file-system OPEN or WRITEREAD error. Each Syshealth process is added to and then launched by the Persistence Monitor when starting Syshealth. If any of the SPI commands fail, the startup terminates.

Effect If any portion of Syshealth fails to start, all of Syshealth is shut down. In this case, Syshealth is not running. Nevertheless, TMDSAUTO starts the Tandem Maintenance and Diagnostic Subsystem (TMDS) message collector ($ZLOG).

Recovery See Appendix B for a definition of the specified error. For more detailed information, including recovery actions, see the Guardian Procedure Errors and Messages Manual. If the error is an SPI error, contact your service provider.

2007

Syshealth Startup Verify Error. target-process failed verification. [{SPI error | File Error} error when attempting proc‑failing.] [Verify code = verify-code]

`target-process`	is the name of a Syshealth process that was not started due to the failure to communicate with the Persistence Monitor.
`error`	is the file-system error or the Subsystem Programmatic Interface (SPI) error code associated with the failure, if any.
`proc-failing`	is the procedure associated with the failure.
`verify-code`	is the error returned by the Syshealth target process, describing its internal verification failure. Verify codes are defined under “Recovery.”

Cause Syshealth processes are started one at a time and are verified (using an SPI VERIFY command) prior to starting the next process. If the verify fails because the process is down, does not respond within a 60-second time-out, or returns a verify code indicating that it is not functioning correctly, then message 2007 is generated.

Effect If Syshealth processes are down or did not verify, Syshealth is not started, although TMDSAUTO starts the Tandem Maintenance and Diagnostic Subsystem (TMDS) message collector ($ZLOG).
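
The start-then-verify sequence described under Cause can be sketched in Python as follows. This is only an outline under stated assumptions: `launch` and `verify` are hypothetical callables standing in for the SPI LAUNCH and SPI VERIFY commands, and `verify` is assumed to return a verify code (0 meaning OK) or None when the process is down.

```python
VERIFY_TIMEOUT_SECONDS = 60  # time-out stated in the Cause text


def start_all(processes, launch, verify):
    """Launch processes in order, stopping at the first failed verification.

    launch(name) and verify(name, timeout) are hypothetical stand-ins for the
    SPI LAUNCH and SPI VERIFY commands. verify is assumed to return a verify
    code (0 means OK) or None if the process is down.

    Returns (started, failed): the names verified so far and the name that
    failed, or None for failed when every process verified.
    """
    started = []
    for name in processes:
        launch(name)
        code = verify(name, VERIFY_TIMEOUT_SECONDS)
        if code != 0:
            # Corresponds to the condition that produces message 2007.
            return started, name
        started.append(name)
    return started, None
```

Because each process is verified before the next is launched, a failure leaves every later process unlaunched, which matches the statement that Syshealth is not started.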

Recovery If a Syshealth process is down, check to see whether a NEWPROCESS error occurred, and examine the NEWPROCESS error codes in the Guardian Procedure Errors and Messages Manual to determine the fault and corrective action.

If the process failed its verification, check the codes below for the fault and corrective actions:

Verify Code	Description	Action
0	OK	None required.
1	Timeout	Restart the process that did not start.
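
For scripted handling, the table above could be held as a small lookup, as in this Python sketch. Only codes 0 and 1 are listed in this chapter; the fallback text for other codes and the function name are assumptions.

```python
# Verify codes as listed in the table above: code -> (description, action).
VERIFY_CODES = {
    0: ("OK", "None required."),
    1: ("Timeout", "Restart the process that did not start."),
}


def describe_verify_code(code):
    """Return (description, action); codes not listed here get a placeholder."""
    return VERIFY_CODES.get(code, ("Unknown", "Not documented in this chapter."))


print(describe_verify_code(1))
```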

If the process was restarted the maximum number of times and failed, check to see whether there is some reason (such as processors failing) that the process stopped. If not, contact the Global NonStop Solution Center (GNSC) and provide all relevant information as follows:

Descriptions of the problem and accompanying symptoms
Details from the message or messages generated
Supporting documentation such as Event Management Service (EMS) logs, trace files, and a processor dump, if applicable

If your local operating procedures require contacting the Global Mission Critical Solution Center (GMCSC), supply your system number and the numbers and versions of all related products as well.

2008

Syshealth started OK.

Cause When TMDSAUTO is run, it performs the initial verification of Syshealth to check that:

The command and management databases are accessible.
All Syshealth files are present and accessible.

Both Tandem Maintenance and Diagnostic Subsystem (TMDS) and the Syshealth user interface can start the Persistence Monitor. Then the remainder of the Syshealth processes are registered with the Persistence Monitor using Subsystem Programmatic Interface (SPI) ADD MODULE commands and are launched by the Persistence Monitor. TMDSAUTO (or the user interface) sends SPI VERIFY commands to all Syshealth processes.

If everything is working and there were no errors, message 2008 is generated.

Effect Syshealth is operational on the system.

Recovery Informational message only; no corrective action is needed. This message indicates that Syshealth was started correctly.

2009

A Syshealth Shutdown command was issued for target-system.

target-system

is the name of the system to which the Syshealth user interface commands are directed.

Cause The Syshealth user interface provides commands to shut down and start up the Syshealth processes. Syshealth, $ZLOG, and $ZMOM have been shut down.

Effect Syshealth has been shut down. If the Tandem Maintenance and Diagnostic Subsystem (TMDS) alternate collector was shut down, no TMDS errors are logged on the system, so information about hardware faults is lost. With Syshealth shut down, no hardware fault coverage is provided and there are no dial-outs caused by system faults.

Recovery None, if the shutdown was anticipated. Otherwise, the operators should check to ensure that Syshealth has not been left inoperative, which would compromise system fault detection and reporting.

3000

Coldload report of report-severity severity for system‑serial‑number system. resource-name: specific-problem. There are related-alarms related alarms.

report-severity

defines the perceived severity of the system-load report. The levels of severity correspond to the Open Systems Interconnection (OSI) perceived severity definitions for managed objects. This value is determined by using the highest (most critical) perceived severity of all the alarms contained in the system-load report. This variable can have one of the following values:

CRITICAL-SEVERITY	indicates that a service-affecting condition has occurred and an immediate corrective action is needed. Such a severity can be reported, for example, when a resource becomes totally out of service and its capability must be restored.
MAJOR-SEVERITY	indicates that a service-affecting condition has developed and urgent corrective action is required. Such a severity can be reported, for example, when there is the potential for a single point of failure (fault tolerance lost) or severe degradation in resource capability, and full capability must be restored.
MINOR-SEVERITY	indicates that a non-service-affecting fault condition exists and that corrective action should be taken in order to prevent a more serious (for example, service-affecting) fault. Such a severity can be reported, for example, when the detected alarm condition is not currently degrading the capacity of the resource.
WARNING-SEVERITY	indicates the detection of a potential or impending service-affecting fault before any significant effects have been felt. Action should be taken to further diagnose (if necessary) and correct the problem in order to prevent it from becoming a more serious service-affecting fault.

report-severity

is the highest (most critical) perceived severity of all the alarms contained in the system-load report. The values for this variable are defined for message 3001.

system-serial-number

is the serial number of the NonStop Kernel from which the system-load report originated.

resource-name

is the name of the system resource involved in the most critical alarm in the system‑load report.

specific-problem

is the specific problem of the most critical alarm in the system-load report.

related-alarms

is a count of the number of alarms in the system-load report.

Cause The NonStop Kernel identified by system-serial-number has completed a system load. The Syshealth Health Monitor has generated a summary of the outstanding alarms after the system load finished.

Effect If the system-load report contains alarms, then one or more system resources have been compromised. The system-load report contains a detailed status for each problem resource.

Recovery On a remote system, use Syshealth to examine this system-load report on the Syshealth Main screen and the Alarm Viewing screen. If corrective action is needed, dial in to the system that generated the report. On a local system or while dialed in remotely, use Syshealth and other Tandem Maintenance and Diagnostic Subsystem (TMDS) diagnostic tools to locate and repair the problem.

3001

Problem report of report-severity severity for system‑serial-number system. resource-name:specific-problem. There are related-alarms related alarms.

system-serial-number

is the serial number of the NonStop Kernel from which the problem report originated.

report-severity

defines the perceived severity of the problem report. The levels of severity correspond to the Open Systems Interconnection (OSI) perceived severity definitions for managed objects. This value is determined by using the highest (most critical) perceived severity of all the alarms contained in the problem report.

The levels of severity can be one of the following values:

CRITICAL-SEVERITY	indicates that a service-affecting condition has occurred and an immediate corrective action is required. Such a severity can be reported, for example, when a resource becomes totally out of service and its capability must be restored.
MAJOR-SEVERITY	indicates that a service-affecting condition has developed and urgent corrective action is required. Such a severity can be reported, for example, when there is the potential for a single point of failure (fault tolerance lost) or severe degradation in resource capability, and full capability must be restored.
MINOR-SEVERITY	indicates that a non-service-affecting fault condition exists and that corrective action should be taken in order to prevent a more serious (for example, service-affecting) fault. Such a severity can be reported, for example, when the detected alarm condition is not currently degrading the capacity of the resource.
WARNING-SEVERITY	indicates the detection of a potential or impending service-affecting fault, before any significant effects have been felt. Action should be taken to further diagnose (if necessary) and correct the problem in order to prevent it from becoming a more serious service-affecting fault.

resource-name

is the name of the system resource involved in the most critical alarm in the problem report.

specific-problem

is the specific problem of the most critical alarm in the problem report.

related-alarms

is a count of the number of alarms in the problem report.

Cause The Health Monitor on the indicated system has detected faults or error conditions in one or more system resources. The severity field defines the urgency of the problem report.

Effect System resources have been compromised. The problem report contains a detailed status for each problem resource.

Recovery On a remote system, use Syshealth to decode this problem report on the Syshealth Main screen and the Alarm Viewing screen. If corrective action is needed, dial in to the system that generated the report. On a local system or while dialed in remotely, use Syshealth and other Tandem Maintenance and Diagnostic Subsystem (TMDS) diagnostic tools to locate and repair the problem.
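
The report-severity value is described as the highest (most critical) perceived severity among the alarms in a report. A minimal Python sketch of that selection is shown here. The numeric ranking is an assumption that follows the order in which the four severities are listed, and the function name is illustrative.

```python
# Assumed ranking, most critical highest, following the listed order
# CRITICAL, MAJOR, MINOR, WARNING.
SEVERITY_RANK = {
    "CRITICAL-SEVERITY": 4,
    "MAJOR-SEVERITY": 3,
    "MINOR-SEVERITY": 2,
    "WARNING-SEVERITY": 1,
}


def report_severity(alarm_severities):
    """Return the most critical severity among the alarms, or None if there are none."""
    if not alarm_severities:
        return None
    return max(alarm_severities, key=SEVERITY_RANK.__getitem__)


print(report_severity(["MINOR-SEVERITY", "MAJOR-SEVERITY", "WARNING-SEVERITY"]))
```

A report with one MAJOR-SEVERITY alarm and any number of less critical alarms is therefore reported as MAJOR-SEVERITY.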

3002

System summary report for system-serial-number system.

system-serial-number

is the serial number of the NonStop Kernel from which the system report originated.

Cause The Syshealth Health Monitor is reporting a summary of system activity since the last system summary report.

Effect None

Recovery Informational message only; no corrective action is needed.

3100

System Health Monitor Internal Error. Error Level: error‑level Error Type: error-type Error Code: error-code Error Tag: error-tag

`error-level`	is the severity level of the error.
`error-type`	is the type of internal error that occurred; for example, Event Dispatcher Error.
`error-code`	further defines the cause of the error. This value is different for each error type.
`error-tag`	is the location in code at which the internal error occurred. The value has the format FILENAME_LINE-NUMBER. For example, if an error occurred in source file DISPC at line 25, the value for `error-tag` is DISPC_25.

Cause The Syshealth Health Monitor encountered an unexpected internal error during execution.

Effect The Syshealth Health Monitor abends.

Recovery The Persistence Monitor restarts the Health Monitor a certain number of times in case the condition is due to an intermittent problem. This message should be reported to Tandem Development.
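
When reporting the error, it can help to split the error-tag value into its source file and line number. The following Python sketch does this for the FILENAME_LINE-NUMBER format described above. The function name is an assumption, and splitting on the last underscore is a design choice made in case a file name itself contains an underscore.

```python
def parse_error_tag(tag):
    """Split an error tag such as 'DISPC_25' into (source_file, line_number)."""
    # Split on the last underscore so that the line number is always the tail.
    source_file, sep, line_text = tag.rpartition("_")
    if not sep or not source_file or not line_text.isdecimal():
        raise ValueError(f"unexpected error-tag format: {tag!r}")
    return source_file, int(line_text)


print(parse_error_tag("DISPC_25"))
```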

3101

System Health Monitor External Error. Error Level: error‑level Error Type: error-type Error Code: error-code Error Tag: error-tag

`error-level`	is the severity level of the error.
`error-type`	is the type of internal error that occurred. The following types of internal errors are defined: Subsystem Programmatic Interface (SPI) error. Operating system procedure call failed. Library function failed. Variable overflowed. Event Management Service (EMS) Distributor failed. Dynamic System Configuration (DSC) NEWPROCESS call failed. DSC SPI open failed. DSC communication error occurred. DSC returned unexpected error. Subsystem Control Point (SCP) NEWPROCESS call failed. SCP SPI open failed. SCP communication error occurred. SCP returned unexpected error. Memory error occurred. String list error occurred. Task Queue error occurred. Event missing required tokens. Required resource object missing. Open of $0 collector failed. Open of Tandem Maintenance and Diagnostic Subsystem (TMDS) collector failed. Unable to load required filter file. Alarm missing required tokens. Error accessing help file. Alarm database operation error. Error accessing configuration file. $RECEIVE input/output (I/O) error. Scripting error. Program abnormally terminated. Event Dispatcher error. Program terminated because of too many takeovers.
`error-code`	further defines the cause of the error. This value is different for each error type.
`error-tag`	is the location in code at which the internal error occurred. The variable has the format FILENAME_LINE-NUMBER. For example, if an error occurred in source file DISPC at line 25, the value for `error-tag` is DISPC_25.

Cause The Syshealth Health Monitor encountered an external error during execution.

Effect The Syshealth Health Monitor stops.




	NOTE: The Persistence Monitor does not restart the Health Monitor for an external error condition, because the Health Monitor cannot continue to execute until the external problem is resolved.

Recovery Fix the external error condition and then start the Health Monitor from the Syshealth Management screen.

4001

Authorization to Deliver Remote Notification Needed. Notification ID = action-id

action-id

is the ID (a 16-bit integer) of the notification that needs to be authorized or denied authorization. This value is the same as the notification identifier used in notification Subsystem Programmatic Interface (SPI) commands.

Cause An occurrence on the indicated system has triggered a remote notification. What occurrences result in a notification depends on the configuration of the notification module on that system. Typically, notification triggers include problem reports (arising from system resources that have encountered a problem), system-load summary reports, or periodic system reports.

Effect A system resource on the specified system may be unavailable.

Recovery Run Syshealth on the indicated system, and examine unauthorized notifications through the notification screen. Either authorize or deny authorization to pending notifications. Authorized notifications are forwarded through the configured notification ports (typically to the NonStop Support Center).

4002

Authorization to Deliver Remote Notification disposition. Notification ID = action-id.

`disposition`	indicates whether authorization has been granted. It can have the value GRANTED or DENIED.
`action-id`	is the ID (a 16-bit integer) of the notification that was authorized or denied authorization. This value is the same as the notification identifier used in notification Subsystem Programmatic Interface (SPI) commands.

Cause A pending notification has been authorized for delivery.

Effect The indicated notification is delivered to all appropriate destinations.

Recovery Informational message only; no corrective action is needed.

4003

Dial-out test performed

Cause A Syshealth user has issued a Test Dial-out command.

Effect This message should cause a dial-out to occur, unless dial-out is not enabled. Use the Syshealth Remote Notification screen to examine the test results.

Recovery Informational message only; no corrective action is needed.

4004

Remote Notification Port Failing. error-text

error-text

is the text of the last port error message.

Cause A notification path is failing. Examine the error-text portion of the message to determine the exact cause.

Effect Remote notification for system resource problems is not occurring on the specified system.

Recovery Examine the error message and take corrective action. If no corrective action is possible, disable the notification port through the Syshealth user interface. Notify the Global NonStop Solution Center (GNSC) of system failures manually until the port is repaired.

4005

REMOTE NOTIFICATION PROCESS notif-process ENCOUNTERED EXCEPTION. CODE = module-error:text

`notif-process`	contains the name of the notification port process that failed.
`module-error`	is the fatal error that was encountered by the notification module.
`text`	is the text of the last notification module error message.

Cause The notification module encountered an exception that it recovered from. Typical exceptions include the inability of the module to find or read the notification database, a filter file, or other file.

Effect The effect of this error depends on the nature of the exception. If the notification module does not find the database or cannot read it, the notification module creates a new one. Filter files that cannot be found or read are ignored, leading to an increased number of dial-outs. In any case, the notification function continues.

Recovery Recovery action depends on the type of exception.

4006

REMOTE NOTIFICATION PROCESS notif-process FAILED. ERROR = module-error:text

`notif-process`	contains the name of the notification port process that failed.
`module-error`	is the fatal error that was encountered by the notification module.
`text`	is the text of the last notification module error message.

Cause The notification module encountered an unrecoverable internal error.

Effect Remote notification is unavailable for a brief period, until the Syshealth Persistence Monitor restarts it.

Recovery Informational message only; no corrective action is needed.

4007

Remote notification redelivery started. Notification ID = action-id.

action-id

is the ID (a 16-bit integer) of the notification that was redelivered. This value is the same as the notification identifier used in notification Subsystem Programmatic Interface (SPI) commands.

Cause A Syshealth user has requested redelivery of a remote notification that could not be delivered earlier.

Effect The indicated notification will be delivered to all appropriate destinations.

Recovery Informational message only; no corrective action is needed.