WebSphereHighAvailabilityManagerFailureDetector

The failure detector currently combines two detection mechanisms. The first is active heartbeating: members ping each other periodically. If a member has not been pinged by a peer for N consecutive heartbeat intervals, it tells the current set of JVMs to suspect that peer. The peer is then dropped from the membership, and any singletons that were running on it are recovered on the survivors. WebSphere lets the customer change both the heartbeat interval and the number of missed heartbeats through custom properties on the core group: IBM_CS_FD_PERIOD_SECS (the number of seconds between heartbeats) and IBM_CS_FD_CONSECUTIVE_MISSED (the number of heartbeats that must be missed to mark a peer as a suspect). The defaults are a heartbeat every 20 seconds and 10 missed heartbeats, which gives a default failover time of 20 x 10 = 200 seconds. Customers should lower these values for production environments.
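The heartbeat bookkeeping described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not WebSphere's implementation: the class HeartbeatDetectorSketch and its methods are hypothetical names, and the two constructor arguments merely stand in for the IBM_CS_FD_PERIOD_SECS and IBM_CS_FD_CONSECUTIVE_MISSED custom properties.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical heartbeat bookkeeping; not the actual WebSphere code. */
class HeartbeatDetectorSketch {
    private final long periodMillis;     // stands in for IBM_CS_FD_PERIOD_SECS
    private final int consecutiveMissed; // stands in for IBM_CS_FD_CONSECUTIVE_MISSED
    private final Map<String, Long> lastHeard = new HashMap<>();

    HeartbeatDetectorSketch(int periodSecs, int consecutiveMissed) {
        this.periodMillis = periodSecs * 1000L;
        this.consecutiveMissed = consecutiveMissed;
    }

    /** Nominal failover time: heartbeat period times the allowed misses. */
    long detectionTimeSecs() {
        return (periodMillis / 1000) * consecutiveMissed;
    }

    /** Record that a ping from the given peer arrived at nowMillis. */
    void recordHeartbeat(String peer, long nowMillis) {
        lastHeard.put(peer, nowMillis);
    }

    /** Peers silent for at least consecutiveMissed full heartbeat periods. */
    List<String> suspects(long nowMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeard.entrySet()) {
            long missed = (nowMillis - entry.getValue()) / periodMillis;
            if (missed >= consecutiveMissed) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

With the default values (20 and 10), detectionTimeSecs() gives the 200 seconds mentioned above. A production setting would pass smaller numbers, trading faster failover for a higher risk of suspecting a peer that is merely slow.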

The second mechanism uses TCP sockets. A single socket is opened between every pair of JVMs currently in the membership. When a member detects that the socket to a peer has been closed, it immediately raises that peer as a suspect. WebSphere itself does not close these sockets except during shutdown, so a closed socket usually means that the JVM has crashed or that the box running it has failed. The failure detection time depends on the scenario. If a JVM crashes, the OS closes the socket within a short period of time. If the hardware crashes, the socket closes only when the TCP keep-alive logic kicks in. An untuned operating system can take as long as 2 hours for this to happen, so the keep-alive settings should be tuned so that the operating system reacts in less than 30 seconds.
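The socket-based check can be illustrated with standard java.net calls. The sketch below is an assumption about how such a check might look, not WebSphere's code: SocketSuspectSketch, peerLooksDead and demo are hypothetical names, and the sketch assumes the socket carries no application data, since the probing read would consume a byte. A read that returns -1 means the peer's end was closed; an I/O error such as a reset is treated the same way.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical socket-closure check; not the actual WebSphere code. */
class SocketSuspectSketch {

    /** Returns true if the peer's end of the socket appears to be closed. */
    static boolean peerLooksDead(Socket socket, int waitMillis) {
        try {
            // Ask the OS to probe idle connections; the timing itself is an
            // operating system setting and is not controlled from Java here.
            socket.setKeepAlive(true);
            socket.setSoTimeout(waitMillis);
            // read() returns -1 once the peer has closed the connection.
            return socket.getInputStream().read() == -1;
        } catch (SocketTimeoutException stillOpen) {
            return false; // nothing arrived, but the connection is still open
        } catch (IOException brokenConnection) {
            return true;  // reset or similar error: raise the peer as a suspect
        }
    }

    /** Loopback demonstration: the peer is suspected only after it closes. */
    static boolean demo() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket accepted = server.accept()) {
            boolean before = peerLooksDead(client, 200);
            accepted.close(); // simulates the peer JVM going away
            boolean after = peerLooksDead(client, 2000);
            return !before && after;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

This covers the JVM-crash case, where the OS closes the socket promptly. In the hardware-crash case no close is ever sent, so the read only fails after the operating system's keep-alive probes give up, which is why the keep-alive tuning described above matters.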



Revision r1 - 2005-01-21 - 16:51:00 - bnewport