WebSphereHighAvailabilityManagerFailureDetector
The failure detector currently uses two detectors. The first is an active heartbeat approach: members ping each other periodically. If a member receives no ping from a peer for N consecutive heartbeat intervals, it tells the current set of JVMs to suspect that peer. The peer is then dropped from the membership, and any singletons running on it are recovered on the survivors.
WebSphere allows both the heartbeat interval and the number of missed heartbeats to be changed through custom properties on the core group. The defaults are a heartbeat every 20 seconds and 10 missed heartbeats to indicate failure, which gives a default failover time of 200 seconds (20 × 10). Customers should lower this for production environments. The custom properties are IBM_CS_FD_PERIOD_SECS (the number of seconds between heartbeats) and IBM_CS_FD_CONSECUTIVE_MISSED (the number of heartbeats that must be missed to mark a peer as a suspect).
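The relationship between the two properties and the detection time can be sketched in a few lines of Java. This is only an illustration of the missed-heartbeat idea, not WebSphere's implementation: the class HeartbeatDetector and its methods are hypothetical names, and the two constructor arguments simply mirror IBM_CS_FD_PERIOD_SECS and IBM_CS_FD_CONSECUTIVE_MISSED.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a missed-heartbeat detector; not WebSphere's implementation.
final class HeartbeatDetector {
    private final long periodMillis;      // mirrors IBM_CS_FD_PERIOD_SECS
    private final int consecutiveMissed;  // mirrors IBM_CS_FD_CONSECUTIVE_MISSED
    private final Map<String, Long> lastHeard = new ConcurrentHashMap<>();

    HeartbeatDetector(int periodSecs, int consecutiveMissed) {
        this.periodMillis = periodSecs * 1000L;
        this.consecutiveMissed = consecutiveMissed;
    }

    /** Silence, in seconds, after which a peer is suspected (period x missed count). */
    long detectionTimeSecs() {
        return periodMillis / 1000L * consecutiveMissed;
    }

    /** Record a heartbeat received from a peer at the given time. */
    void onHeartbeat(String peer, long nowMillis) {
        lastHeard.put(peer, nowMillis);
    }

    /** Return the peers that have been silent for the configured number of intervals. */
    List<String> suspects(long nowMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeard.entrySet()) {
            long missed = (nowMillis - entry.getValue()) / periodMillis;
            if (missed >= consecutiveMissed) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

With the defaults quoted above, new HeartbeatDetector(20, 10).detectionTimeSecs() evaluates to 200. Lowering either value shortens the failover time, at the cost of more heartbeat traffic or a greater chance of suspecting a peer that is merely slow.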
The other approach uses TCP sockets. A single socket is opened between each pair of JVMs currently in the membership. When a member detects that a socket to a peer has been closed, it immediately raises the peer as a suspect. WebSphere itself does not close these sockets except during shutdown, so a closed socket usually means that the JVM has crashed or that the box running the JVM has failed. The failure detection time depends on the scenario. If a JVM crashes, the OS closes the socket within a short period of time. If the hardware crashes, the socket closes only when the TCP keepalive logic kicks in. An untuned operating system can take as long as 2 hours for this to happen, so it should be tuned to react in less than 30 seconds.
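The socket-based detector can be illustrated with standard java.net calls. The sketch below is an assumption about how such a detector might look, not WebSphere's code: SocketFailureDetector, configure and awaitClosure are hypothetical names. It enables SO_KEEPALIVE on the socket and then treats either end-of-stream or an I/O error as grounds for suspecting the peer.

```java
import java.io.IOException;
import java.net.Socket;

// Hypothetical sketch of socket-closure detection; not WebSphere's implementation.
final class SocketFailureDetector {
    /** Enable SO_KEEPALIVE so the OS probes an idle connection to a dead host. */
    static void configure(Socket socket) throws IOException {
        socket.setKeepAlive(true);
    }

    /**
     * Block until the socket to the peer closes or breaks, then return the reason.
     * read() returns -1 when the peer's end is closed cleanly (for example by the
     * OS after a JVM crash) and throws IOException on a reset or keepalive timeout.
     */
    static String awaitClosure(Socket socket) {
        try {
            while (socket.getInputStream().read() != -1) {
                // Discard application data; only closure matters for this sketch.
            }
            return "closed";
        } catch (IOException e) {
            return "broken";
        }
    }
}
```

Either return value would be the point at which a member raises the peer as a suspect. Note that setKeepAlive only switches keepalive on; how long the OS waits before probing is an operating system setting. On Linux, for example, the relevant parameters are net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl and net.ipv4.tcp_keepalive_probes, and the default idle time of 7200 seconds corresponds to the 2 hours mentioned above.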
Revision r1 - 2005-01-21 - 16:51:00 - bnewport