NAME

failoverd - Provide rudimentary failover capability for Linux


FUNCTION

failoverd - Watches interfaces, reacts to traffic cessation.


DESCRIPTION

Failoverd is an attempt to provide basic failover capability for Linux. It uses the Net::Pcap module so that observed network traffic is the metric that decides host status: up or down.


DETAIL

Generic

Failoverd's packet sniffer reads the header of each packet that passes each monitored interface and increments a counter. While it does so, a timer/alarm is ticking away; if the alarm is not reset before it goes off, an evaluation process begins, which is described below under Failing.

Failoverd consists of two working programs:

        failoverd
        watchd

Failoverd should be launched by itself. It will determine its host type via /etc/failover/host_type (or as defined) and take appropriate action. Failoverd forks a ping process that is intended to be a keepalive ``heartbeat'' for the other machine, and vice-versa.
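As an illustration of the host type lookup, here is a minimal sketch that parses one line of the host_type file, whose format is given under CONFIGURATION as boot:master or boot:slave. The function name parse_host_type is hypothetical, and accepting any word in the first field is an assumption; failoverd itself may read the file differently.

```perl
use strict;
use warnings;

# Split one line of the host_type file into its two fields.
# Returns an empty list when the line does not look like state:role.
sub parse_host_type {
    my ($line) = @_;
    return unless defined $line;
    $line =~ s/^\s+|\s+$//g;                          # trim whitespace and newline
    return unless $line =~ /^(\w+):(master|slave)$/;  # assumed: any word, then the role
    return ($1, $2);
}

my ($state, $role) = parse_host_type("boot:master\n");
print "state=$state role=$role\n";   # prints "state=boot role=master"
```

Returning an empty list for a malformed line lets the caller refuse to start rather than guess a role.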

Watchd monitors the ethernet interface it was launched with, as in:

        watchd eth0

Watchd will count packets up to the number specified by $packet_count. During this time, the alarm will be ticking away. When the alarm expires, the subroutine Failover::panic will be initiated. If the interface is in boot mode, meaning that this is the first alarm since startup and it went off before the interface counted enough packets to reset it, no failure will occur. This first alarm condition will un-set the boot mode flag.
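The counting side of this can be sketched without a live capture. The following is a minimal sketch, not watchd's actual code: make_packet_counter and the $reset_timer hook are hypothetical names, and in a real daemon the hook would presumably call alarm($timer_seconds) while the returned closure is invoked once per captured packet.

```perl
use strict;
use warnings;

# Build a per-interface packet callback. $packet_count mirrors the
# configuration variable of the same name; $reset_timer is a hypothetical
# hook that would push the alarm back in a real daemon.
sub make_packet_counter {
    my ($packet_count, $reset_timer) = @_;
    my $seen = 0;
    return sub {
        $seen++;
        return $seen if $seen < $packet_count;
        $seen = 0;          # enough traffic: start a new counting window
        $reset_timer->();   # and reset the timer before it goes off
        return 0;
    };
}

# Demonstration with a tiny threshold instead of a live capture.
my $resets = 0;
my $on_packet = make_packet_counter(3, sub { $resets++ });
$on_packet->() for 1 .. 7;
print "timer resets: $resets\n";   # prints "timer resets: 2"
```

If the timer is never reset within $timer_seconds, the alarm fires and the evaluation described under Failing takes over.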

Booting

The variable $mresume may be either ``yes'' or ``no''. If set to ``yes'', this will cause failoverd to contact the slave machine at boot time, ordering it to disable its interfaces, and assume an inactive slave role. Setting this value to ``no'' prevents this. Note that doing so can, and likely will, cause duplicate IP addresses until someone intervenes manually. ``no'' is intended for debugging, when you are physically present (or attached via a terminal server).

Failing

When an alarm is triggered, three points are checked:

1. ${$dev}[5] (i.e. @eth0[5])

This value is initially set to ``boot''. This will prevent an immediate failure caused by not reading enough packets at startup. Once the first alarm has gone off, this value is unset, and further alarms will be interpreted as real events, and measured appropriately. In other words:

if ( ${$dev}[5] eq "boot" ) { return(0); } else { evaluate it }

2. $low_water

If this is set to 0, no failure will occur based on the ethernet interfaces; however, the plip interface (always up) will still initiate a failure. If this is greater than 0, and the number of packets read is less than this number, we will fail. In other words:

if ( ($low_water > 0) && ($packet_count < $low_water) ) { Failover::failover; }

3. ${$dev}[2] (i.e. @eth0[2])

This value is the currently read number of packets. If it is initially 0, and we are in boot mode, the number is pushed to location @eth0[3] to be used for later comparison.

If it is 0 and we are not in boot mode, we push it to location ${$dev}[3] for later comparison.

If this value is 0, and ${$dev}[3] is also 0, we have successively read zero packets, indicating reason to fail over. However, if $low_water is 0, this will prevent the failure. In other words:

if ( (${$dev}[2] == 0) && (${$dev}[3] == 0) ) { Failover::failover unless ( $low_water == 0 ); }

If this value is greater than 0, that value is pushed to ${$dev}[3] for later comparison (aka ``last read packets''), the timer is reset, and the process starts over.
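The three checks above can be put together in one routine. This is a minimal sketch under stated assumptions, not the code in Failover::panic: new_state and evaluate_alarm are hypothetical names, a hash stands in for slots 3 and 5 of the per-device array, and the packet count (slot 2) is passed in as an argument.

```perl
use strict;
use warnings;

# Hypothetical per-interface record. The original keeps these values in
# slots 3 and 5 of an array named after the device.
sub new_state { return { last => 0, mode => 'boot' } }

# Decide what one alarm means for one interface.
# $count is the number of packets read since the previous alarm (slot 2).
# Returns 'boot', 'fail' or 'ok'.
sub evaluate_alarm {
    my ($state, $count, $low_water) = @_;

    # Point 1: the first alarm after startup never fails; it only clears
    # the boot flag and remembers the count.
    if ($state->{mode} eq 'boot') {
        $state->{mode} = '';
        $state->{last} = $count;
        return 'boot';
    }

    # Point 2: too little traffic while a low-water mark is configured.
    return 'fail' if $low_water > 0 && $count < $low_water;

    # Point 3: two successive empty reads, unless low_water is 0.
    return 'fail' if $count == 0 && $state->{last} == 0 && $low_water != 0;

    # Otherwise remember the count as "last read packets" and carry on.
    $state->{last} = $count;
    return 'ok';
}

my $eth0 = new_state();
my @results = map { evaluate_alarm($eth0, $_, 10) } (40, 25, 3);
print "@results\n";   # prints "boot ok fail"
```

In this sketch a zero count with a positive $low_water is already caught by point 2, so the point 3 test is kept only to mirror the text.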

In the event of a failure, failoverd will open a socket connection to the slave machine, enter the secret word, and issue the command (presently it is ``fail''). Upon notification, the client will take appropriate action, assume the master's ethernet address, and take over its duties as defined in the ha-switch script.
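The notification step might look roughly like the following. This is a sketch only: the wire format (secret on one line, command on the next) is an assumption, and send_failover and notify_slave are hypothetical names. IO::Socket::INET is used for the connection, matching the IO::Socket requirement listed below.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Write the secret and the command to an already-open handle.
# One line each is an assumption about the wire format.
sub send_failover {
    my ($fh, $secret, $command) = @_;
    print {$fh} "$secret\n"  or return 0;
    print {$fh} "$command\n" or return 0;
    return 1;
}

# Connect to the slave and deliver the ``fail'' command.
# Returns 0 if the connection cannot be made.
sub notify_slave {
    my ($host, $port, $secret) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => 5,
    ) or return 0;
    my $ok = send_failover($sock, $secret, 'fail');
    close $sock;
    return $ok;
}
```

A caller would pass the slave's private address, the configured TCP port, and the word read from the secret file, for example notify_slave("1.0.0.2", 2000, $word). Keeping the write separate from the connect makes the message format easy to check against an ordinary filehandle.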


REQUIREMENTS

        Perl 5.x
        Net::Daemon
        Net::Pcap
        Net::Ping
        IO::Socket

Failoverd should also have a private interface to be used as the ``always good'' connection. The assumption is that if this interface ever goes down, that is reason to initiate the failover process. We use a plip interface, which has worked reasonably well. There is more than enough bandwidth on this interface to do what is needed.


CONFIGURATION

Everything but the ha-switch script is configured via failoverd.conf.

low_water is the minimum number of packets we should read between panic sessions. Since counts as low as 1 would otherwise keep us from failing, this forces the failure. If this value is 0, we will not fail on an ethernet request. PLIP interface problems will still cause a failure.

$low_water = "0";

interfaces we care to monitor.

@interfaces = ('eth0','eth1');

log lots of stuff?

$verbose = "yes";

notify slave on boot to down itself?

$mresume = "yes";

how many packets to read on an ethernet interface. decrease on a slow network, increase on a fast one.

$packet_count = "500";

seconds to set the timer for. We expect to read $packet_count packets before the timer goes off. set to a higher value to prevent unnecessary failures

$timer_seconds = "20";

where is watchd, for the interfaces to watch?

$watchd = "./watchd";

need bsd-style output. this is for terminating processes on startup

$ps = "/usr/local/bin/ps ax";

port to connect to for failures

$tcp_port = "2000";

us, our private interface

$primary = "1.0.0.1";

their private interface (for heartbeat)

$secondary = "1.0.0.2";

where to find host_type, etc.

$failover_root = "/etc/failover";

this file defines what the current host does. its format is: boot:master or boot:slave

$host_type = "$failover_root/host_type";

the word to pass when we connect to the other end. They must have the same secret.

$secret = "$failover_root/secret";

private interface. Suggestive, no?

$plip_interface = "plip0";

same concept as ethernet, this seems to work pretty well.

$plip_timer = "11";

$plip_hbeat = "10";

$log = "/var/log/watchd.log";

$failover_prog = "/usr/local/failoverd/ha-switch";

dump it. IO bound otherwise, and we don't need it to dump network captures. We don't process it, we only count it.

$dumpfile = "/dev/null";


OTHER

An editor has been developed and is called ha-edit. ha-edit uses rsync to maintain the state of both machines. The author's setup uses two dual 450MHz machines with 1 GB of RAM each, dual EEpro NICs, and CVS to keep everything in line. All of /etc on the master machine has been checked into CVS, and the repository has been replicated to the slave machine. Any edits to files under /etc are checked in and replicated. So far, ha-edit has been very helpful. ha-edit can be had from http://ps-ax.com


BUGS

There are many bugs.

If you are constantly hitting the alarm before reading any packets, try tuning the timer as follows:

        High traffic environment: many packets, few seconds
        Low traffic environment: few packets, many seconds

The idea is to count up to number X, set the timer, and hope we can read enough packets to reset the timer before it goes off.
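A quick way to reason about the tuning is to work out the average packet rate an interface must sustain for the timer to be reset in time. The helper name required_rate is hypothetical; the numbers are the defaults shown under CONFIGURATION.

```perl
use strict;
use warnings;

# Average packets per second needed to reset the alarm in time.
sub required_rate {
    my ($packet_count, $timer_seconds) = @_;
    die "timer_seconds must be positive\n" unless $timer_seconds > 0;
    return $packet_count / $timer_seconds;
}

printf "%.1f packets/second\n", required_rate(500, 20);   # prints "25.0 packets/second"
```

If the interface normally sees less traffic than this figure, the alarm will go off on every cycle, so either lower $packet_count or raise $timer_seconds.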


LICENSE

This software is released according to the terms of the GPL.


Gratis

I would like to thank Exactis.com for providing me the time to write this for them, and more importantly, allowing me to release this to the world so that others can take over where I am not competent :)


Modifications

Modifications are expressly allowed per the GPL. For the time being, I would like to remain the maintainer of this code. Access to the CVS repository will be given as needed.


Author / Maintainer

 Brad Doctor
 http://ps-ax.com/