MAP4080 SRC BE11D05A Heartbeat failure

There is a heartbeat failure between the functional code (CPSS) kernel and the fire hose dump (FHD) kernel. One of the two storage facility partitions (also called LPARs) in the storage facility image (SFI) is failing. The working LPAR has reported the serviceable event. The failing LPAR must be repaired first, and then the next level of support will force a crash dump recovery.

MAP4080 Section-1

Procedure

  1. The serviceable event that sent you here is reported by the storage facility image (SFI) LPAR (partition) that is working. The other LPAR of this SFI is not responding to the working LPAR and must be repaired first. Display serviceable events that need to be repaired.

    Is there a serviceable event for the LPAR or CEC enclosure that is failing?

    • Yes, repair the serviceable event. When the LPAR or CEC enclosure is working normally, go to step 3.
    • No, go to step 2.
  2. Using MAP1100, determine the states of the server and the partition.

    The normal state of the server is Operating. The normal state of the partition is Running. Are the states of the server and partition normal for the failing CEC enclosure?

    • Yes, go to step 3.
    • No, attempt to begin the repair using the visual symptom of the displayed state or contact your next level of support. When the state of the LPAR is normal, go to step 3.
  3. To recover from the heartbeat failure for the serviceable event that sent you here, contact your next level of support, who must issue the getdebug crashremote command at the storage facility partition that did not report the BE11D05A. Issuing this command causes the remote storage facility partition (the one that did report the BE11D05A) to force an AIX dump and reboot, which triggers a failover/failback sequence and returns the SFI to the "dual-cluster" mode of operation.
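
The decision flow in steps 1 through 3 can be summarized as a short Python sketch. This is only an illustration of the branching logic, not part of any service tool: the function name, the parameter names, and the NextAction values are hypothetical. Only the state names Operating and Running come from the procedure above.

```python
from enum import Enum


class NextAction(Enum):
    """Hypothetical labels for the outcomes of MAP4080 Section-1."""
    REPAIR_SERVICEABLE_EVENT = "step 1: repair the serviceable event, then go to step 3"
    REPAIR_FROM_DISPLAYED_STATE = "step 2: repair from the displayed state or contact support"
    CONTACT_SUPPORT_FOR_RECOVERY = "step 3: contact next level of support for recovery"


def map4080_next_action(has_event_for_failing_lpar: bool,
                        server_state: str,
                        partition_state: str) -> NextAction:
    """Return the next action for the failing LPAR (illustrative only)."""
    # Step 1: an open serviceable event for the failing LPAR or CEC
    # enclosure must be repaired before anything else.
    if has_event_for_failing_lpar:
        return NextAction.REPAIR_SERVICEABLE_EVENT
    # Step 2: the normal server state is Operating and the normal
    # partition state is Running; anything else needs repair first.
    if server_state != "Operating" or partition_state != "Running":
        return NextAction.REPAIR_FROM_DISPLAYED_STATE
    # Step 3: the failing LPAR looks normal, so the heartbeat failure
    # can be recovered by the next level of support.
    return NextAction.CONTACT_SUPPORT_FOR_RECOVERY
```

For example, with no open serviceable event and both states normal, the sketch returns the step 3 action. In every other case the failing LPAR is repaired first, and step 3 follows once it is working normally.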