Linux-HA cluster setup

After a pretty nasty crash of our LDAP server in December 2007, we started rethinking our concepts of increasing service availability. When the original OpenLDAP setup was devised, slapd replication was about the best you could get, but with the recent hype around Xen and High Availability solutions like heartbeat, we decided to kick it up a notch.

cluster schema

This article describes our setup of an active-active high availability cluster for virtual Xen machines, based on Debian etch. The layout is sketched above and shows the basic idea: two identical servers provide storage and network connectivity for the cluster, which holds the virtual Xen machines as cluster resources. Each server has two NICs: eth0 connects the physical machines to our server network and is used for the DRBD traffic (see below), one heartbeat channel and dom0 login. eth1 is a network bridge that connects the virtual machines to the DPHYS network. For HA cluster operation we use two heartbeat channels: one via eth0 as described above, the other over a USB-to-RS232 converter that provides a serial heartbeat channel.

The filesystem stack consists of four layers: on the hardware level, each server provides a RAID1 mirror. These two RAIDs are combined into a DRBD8 device, which is basically RAID1 over ethernet. This two-level redundancy protects against hardware failures not only of individual disks, but of an entire server. The DRBD volume holds an LVM container which houses the logical Xen volumes.

Many people prefer a cluster filesystem like OCFS2 on top of DRBD, but we decided against this for three main reasons. First, the added complexity of a cluster fs is not needed in our case, as we don't need concurrent access to the same files. Furthermore, cluster filesystems don't offer the same performance as single-node filesystems like ext3. Finally, out-of-the-box OCFS2 uses disk-based heartbeating that conflicts with the 'real' heartbeat mechanism used by Linux-HA. The proper way of combining OCFS2 and Linux-HA would be to integrate OCFS2 into the Linux-HA heartbeat stack. SLES is going this way, and we've tried to port their kernel and OCFS2 patches to our Debian software stack, but in the end it just wasn't worth the trouble.
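
Once the full stack described below is up, each layer can be inspected from dom0; a rough sketch, assuming the device and volume group names used later in this article:

cat /proc/mdstat     # layer 1: RAID1 state (only if software RAID is used)
cat /proc/drbd       # layer 2: DRBD connection and sync state
pvs && vgs drbdvg    # layer 3: LVM physical volume / volume group on top of DRBD
lvs drbdvg           # layer 4: logical volumes holding the Xen domUs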

The Xen domains act as virtual hardware that lives on the cluster. In normal operation they are distributed across the nodes to take full advantage of all hardware available in the cluster. If one node fails, the affected domains are restarted on the remaining node. The only hardware 'waste' is the RAM requirement: each node must be equipped with sufficient RAM to run all Xen domains that could be assigned to it. In our 2-node, 2-domain setup this effectively means double the RAM - not a big deal with today's RAM prices.

The remainder of this document describes the steps required to get the cluster up and running.

We start by installing a basic etch system on all nodes. Configure at least two network cards: we use 129.132.80.x on eth0 and 129.132.86.x on eth1, but no dom0-local IP address for eth1 - it's just a bridge. Before you leave the server room, don't forget to install openssh-server. Also make sure all nodes are present in /etc/hosts on each node and that /etc/nsswitch.conf says hosts: files dns, in order to be immune to DNS lookup problems.
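
A minimal sketch of those two files - the addresses are placeholders, substitute your real ones:

# /etc/nsswitch.conf (relevant line)
hosts: files dns

# /etc/hosts (example entries)
129.132.80.x   phd-hacnode1      # replace x with the node's real address
129.132.80.y   phd-hacnode2

Once the OS is running, we can install Xen: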

aptitude install xen-linux-system-2.6.18-6-xen-amd64 xen-tools linux-headers-xen

or whatever the current version might be when you type this. Next, we need the etch backports in /etc/apt/sources.list for DRBD8:

deb http://backports.ethz.ch/debian-backports etch-backports main
deb-src http://backports.ethz.ch/debian-backports etch-backports main

Then do

apt-get install debian-backports-keyring

to import the backports signing key. Next we install DRBD; get the backported drbd8 packages:

aptitude update && aptitude install drbd8-source drbd8-utils

Build drbd modules:

m-a a-i drbd8-source
update-modules; depmod -a; modprobe drbd
echo drbd >> /etc/modules
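
A quick check that the module is really available and loaded (optional):

lsmod | grep drbd
cat /proc/drbd       # shows the DRBD version once the module is loaded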

DRBD's config is given by /etc/drbd.conf. The relevant parts are:

common {
  syncer { rate 90M; }
}

resource r0 {
  protocol C;
  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
    outdate-peer "/usr/sbin/drbd-peer-outdater";
  }

  startup {
    degr-wfc-timeout 120;    # 2 minutes.
  }

  disk {
    on-io-error   detach;
  }

  net {
    allow-two-primaries;        #this allows us to use an active-active configuration
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
    rr-conflict disconnect;
  }

  syncer {
    rate 90M;
    al-extents 257;
  }

  on phd-hacnode1 {             #node 1 of the cluster
    device     /dev/drbd0;
    disk       /dev/sda3;
    address 192.168.0.1:7788;
    meta-disk  internal;
  }

  on phd-hacnode2 {             #node 2 of the cluster
    device    /dev/drbd0;
    disk      /dev/sda3;
    address   192.168.0.2:7788;
    meta-disk internal;
  }
}

You might have to change some values to reflect your setup. In our case /dev/sda3 will hold the DRBD container for LVM.
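
Before continuing, it is worth checking that the backing partition exists and has the same size on both nodes, for example with:

fdisk -l /dev/sda    # run on both nodes and compare the size of /dev/sda3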

Before we create the DRBD device, we install heartbeat and friends so that we can fix some things DRBD will complain about:

aptitude install heartbeat-2 python-gtk2 xbase-clients xutils python-glade2
passwd hacluster        #set a password for the cluster GUI

If we now initialize the DRBD container:

drbdadm create-md r0

it will complain about the heartbeat plugin not having the right privileges. Fix this by running

chgrp haclient /sbin/drbdsetup
chmod o-x /sbin/drbdsetup
chmod u+s /sbin/drbdsetup

chgrp haclient /sbin/drbdmeta
chmod o-x /sbin/drbdmeta
chmod u+s /sbin/drbdmeta

and then continue with DRBD:

/etc/init.d/drbd restart
drbdsetup /dev/drbd0 primary -o
drbdadm primary r0
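
The forced promotion (-o, overwrite data of peer) triggers a full initial sync; its progress can be followed with, for example:

watch -n1 cat /proc/drbd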

/etc/init.d/drbd status should now report a Primary/Primary and UpToDate/UpToDate DRBD device. Ok, now we have levels 1 and 2 of our fs stack. LVM is level 3:

aptitude install lvm2

Before creating the physical volume, we need to prevent LVM from complaining about duplicate devices (/dev/sda3 and /dev/drbd0): in /etc/lvm/lvm.conf we have to replace

filter = [ "r|/dev/cdrom|" ]

with

filter = [ "r|/dev/cdrom|", "r|/dev/sda3|", "a|/dev/drbd0|" ]

This makes LVM ignore /dev/sda3 and see the physical volume only through the DRBD device, rather than through the hard disk directly. On one cluster node only, we do:

pvcreate /dev/drbd0
vgcreate drbdvg /dev/drbd0

The second node only needs to read the LVM information:

vgscan
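
Both nodes should now agree on the volume group; a quick check (optional):

pvs            # should list /dev/drbd0 as the only physical volume of drbdvg
vgs drbdvg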

At this point we don't create any logical volumes - this will be done by Xen later. We can now continue by setting up the Linux-HA framework:

In /etc/ha.d/ we need to create some config files for our cluster. ha.cf defines the basic cluster configuration:

crm             on      #enables heartbeat2 cluster manager - we want that!

use_logd        on
logfacility     syslog
keepalive       1
deadtime        10
warntime        10
udpport         694
auto_failback   on      #resources move back once node is back online

bcast eth1 eth0         #networks to use for heartbeating

node phd-hacnode1       #hostnames of the nodes
node phd-hacnode2
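
The configuration above only uses the two broadcast channels; if you also want the serial heartbeat channel over the USB-to-RS232 converter mentioned earlier, ha.cf accepts something along these lines (the device name is an assumption, adjust it to your converter):

baud    19200           #serial line speed
serial  /dev/ttyUSB0    #serial heartbeat channel (device name may differ)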

Next we need to configure heartbeat security. Create /etc/ha.d/authkeys containing

auth 1
1 sha1 someVerySecretPassphrase

and chmod 600 it. We also have to set up passwordless, key-based ssh authentication between the nodes. On both cluster nodes, do

ssh-keygen                   # accept the defaults, empty passphrase
scp /root/.ssh/id_rsa.pub othernode:/root/.ssh/authorized_keys
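
The external/ssh STONITH plugin used below relies on this working without a password, so a quick test from each node doesn't hurt:

ssh phd-hacnode2 hostname    # from phd-hacnode1, and vice versa - must not ask for a password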

Now we have to adjust some Xen settings. Delete /etc/rc2.d/S21xendomains, /etc/rc6.d/K20xendomains and /etc/rc0.d/K20xendomains (the Xen domains will be started as cluster resources), then add

dom0-cpus 1     #fixes #410807

to /etc/xen/xend-config.sxp to fix a nasty bug,

(network-script 'network-bridge netdev=eth1')

to attach the Xen network bridge to the second network and set

lvm = drbdvg

in /etc/xen-tools/xen-tools.conf if you intend to use xen-tools to create your domUs. Also make sure that kernel, initrd, mirror and dist have the values you want. Create /etc/init.d/sharedfs that sets up our DRBD8/LVM fs stack upon boot:

#! /bin/sh
### BEGIN INIT INFO
# Description:       ext3-on-LVM-on-DRBD8-on-RAID
### END INIT INFO
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
case "$1" in
 start)
        # make sure we are secondary, wait for the peer to connect, then become primary
        sleep 1
        /sbin/drbdadm secondary r0
        sleep 1
        /sbin/drbdadm wait-connect r0
        sleep 1
        /sbin/drbdadm primary r0
        sleep 1

        DRBDOK=`cat /proc/drbd | grep "cs:Connected st:Primary/Primary ds:UpToDate/UpToDate" | wc -l`
        COUNT=1
        while ([ $DRBDOK -ne 1 ] && [ $COUNT -lt 6 ])
        do
                echo "Trying to restart DRBD - attempt $COUNT"
                /sbin/drbdadm secondary r0
                /etc/init.d/drbd stop
                sleep 1
                /etc/init.d/drbd start
                sleep 1
                /sbin/drbdadm secondary r0
                /sbin/drbdadm wait-connect r0
                /sbin/drbdadm primary r0
                COUNT=$(($COUNT+1))
                sleep 1
                DRBDOK=`cat /proc/drbd | grep "cs:Connected st:Primary/Primary ds:UpToDate/UpToDate" | wc -l`
        done
        sleep 1
        DRBDOK=`cat /proc/drbd | grep "cs:Connected st:Primary/Primary ds:UpToDate/UpToDate" | wc -l`
        if [ $DRBDOK -eq 1 ]; then
                echo "DRBD started."
                /sbin/vgscan
                /sbin/lvscan
                /sbin/vgchange -a y drbdvg
        else
                echo "*** DRBD could not be started!! ***"
        fi
        ;;
 stop)
        /sbin/vgchange -a n drbdvg
        /sbin/drbdadm secondary r0
        ;;
 restart|force-reload)
        $0 stop
        sleep 1
        $0 start
        ;;
 *)
        echo "Usage: $0 {start|stop|restart|force-reload}" >&2
        exit 1
        ;;
esac
exit 0

On our fast octocore Opterons we encountered some serious split brain problems upon rebooting the nodes. After careful inspection we attribute them to a race condition in the DRBD init scripts. The problem was solved by commenting out the two lines

 #$DRBDADM wait-con-int # User interruptible version of wait-connect all
 #$DRBDADM sh-b-pri all # Become primary if configured

in /etc/init.d/drbd, combined with the additional retry logic in /etc/init.d/sharedfs shown above. The sharedfs script has to be symlinked into the relevant runlevels:

cd /etc/rc2.d; ln -s ../init.d/sharedfs S71sharedfs
cd /etc/rc6.d; ln -s ../init.d/sharedfs K07sharedfs
cd /etc/rc0.d; ln -s ../init.d/sharedfs K07sharedfs
chmod a+x /etc/init.d/sharedfs

Finally, restart the nodes so that they boot into the Xen hypervisor.
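
The cluster resources defined below expect the domU configuration files /etc/xen/xen1.cfg and /etc/xen/xen2.cfg to exist on both nodes, so create the domains now, either with xen-tools (xen-create-image) or by hand. As a rough sketch only - kernel paths, memory, disk names and the bridge name are assumptions that depend on your setup - such a file might look like:

# /etc/xen/xen1.cfg - sketch only, adjust to your environment
kernel  = '/boot/vmlinuz-2.6.18-6-xen-amd64'
ramdisk = '/boot/initrd.img-2.6.18-6-xen-amd64'
memory  = 1024
name    = 'xen1'
vif     = [ 'bridge=eth1' ]       # bridge name may differ (e.g. xenbr1)
disk    = [ 'phy:drbdvg/xen1-disk,sda1,w', 'phy:drbdvg/xen1-swap,sda2,w' ]
root    = '/dev/sda1 ro'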

All right, now we have a solid foundation for our HA cluster. You can use the cluster GUI /usr/lib/heartbeat/haclient.py (from Debian package heartbeat-2) to take a look at the (empty) cluster configuration:

cluster_empty.png

The nodes are present, but nothing exciting is happening yet. To fill the cluster with resources, we use the new heartbeat2 XML file syntax. All config files will reside in /var/lib/heartbeat/crm. The basic cluster properties are defined in basics.xml:

<cluster_property_set id="bootstrap">
        <attributes>
                <nvpair id="bootstrap01" name="transition-idle-timeout" value="60"/>
                <nvpair id="bootstrap02" name="default-resource-stickiness" value="INFINITY"/>
                <nvpair id="bootstrap03" name="default-resource-failure-stickiness" value="-500"/>
                <nvpair id="bootstrap04" name="stonith-enabled" value="true"/>
                <nvpair id="bootstrap05" name="stonith-action" value="reboot"/>
                <nvpair id="bootstrap06" name="symmetric-cluster" value="true"/>
                <nvpair id="bootstrap07" name="no-quorum-policy" value="stop"/>
                <nvpair id="bootstrap08" name="stop-orphan-resources" value="true"/>
                <nvpair id="bootstrap09" name="stop-orphan-actions" value="true"/>
                <nvpair id="bootstrap10" name="is-managed-default" value="true"/>
                <nvpair id="cib-bootstrap-options-stonith_enabled" name="stonith_enabled" value="True"/>
        </attributes>
</cluster_property_set>

and can be applied by running cibadmin -C -o crm_config -x basics.xml. stonith.xml contains the settings for STONITH, which makes sure that zombie nodes are shut down and cannot interfere with the cluster:

<clone id="stonithclone" globally_unique="false">
        <instance_attributes id="stonithclone1">
                <attributes>
                        <nvpair id="stonithclone01" name="clone_node_max" value="1"/>
                </attributes>
        </instance_attributes>
        <primitive id="stonithclone" class="stonith" type="external/ssh" provider="heartbeat">
                <operations>
                        <op name="monitor" interval="5s" timeout="20s" prereq="nothing" id="stonithclone-op01"/>
                        <op name="start" timeout="20s" prereq="nothing" id="stonithclone-op02"/>
                </operations>
                <instance_attributes id="stonithclone">
                        <attributes>
                                <nvpair id="stonithclone02" name="hostlist" value="phd-hacnode1,phd-hacnode2"/>
                                <nvpair id="stonithclone:1_target_role" name="target_role" value="started"/>
                        </attributes>
                </instance_attributes>
        </primitive>
</clone>

Apply with cibadmin -C -o resources -x stonith.xml. constraints.xml tells the cluster which resources should preferably run where:

<constraints>
   <rsc_location id="place_xen1" rsc="xen1">
     <rule id="prefered_place_xen1" score="100">
       <expression attribute="#uname" id="5cd826fa-eced-476c-ab9a-5b349d41d887" operation="eq" value="phd-hacnode1"/>
     </rule>
   </rsc_location>
   <rsc_location id="place_xen2" rsc="xen2">
     <rule id="prefered_place_xen2" score="100">
       <expression attribute="#uname" id="b916f946-03b5-4f8f-85b4-8c4f2d3e7ef1" operation="eq" value="phd-hacnode2"/>
     </rule>
   </rsc_location>
 </constraints>

Apply with cibadmin -C -o constraints -x constraints.xml. Finally, xen.xml describes the Xen domains as cluster resources:

<resources>
  <primitive id="xen2" class="ocf" type="Xen" provider="heartbeat">
     <operations>
       <op id="xen2-op01" name="monitor" interval="10s" timeout="60s" prereq="nothing"/>
       <op id="xen2-op02" name="start" timeout="60s" start_delay="0"/>
       <op id="xen2-op03" name="stop" timeout="300s"/>
     </operations>
     <instance_attributes id="xen2">
       <attributes>
         <nvpair id="xen2-attr01" name="xmfile" value="/etc/xen/xen2.cfg"/>
         <nvpair id="xen2_target_role" name="target_role" value="started"/>
       </attributes>
     </instance_attributes>
     <meta_attributes id="xen2-meta01">
       <attributes>
         <nvpair id="xen2-meta-attr01" name="allow_migrate" value="true"/>
       </attributes>
     </meta_attributes>
   </primitive>

   <primitive id="xen1" class="ocf" type="Xen" provider="heartbeat">
     <operations>
       <op id="xen1-op01" name="monitor" interval="10s" timeout="60s" prereq="nothing"/>
       <op id="xen1-op02" name="start" timeout="60s" start_delay="0"/>
       <op id="xen1-op03" name="stop" timeout="300s"/>
     </operations>
     <instance_attributes id="xen1">
       <attributes>
         <nvpair id="xen1-attr01" name="xmfile" value="/etc/xen/xen1.cfg"/>
         <nvpair id="xen1-attr02" name="target_role" value="started"/>
       </attributes>
     </instance_attributes>
     <meta_attributes id="xen1-meta01">
       <attributes>
         <nvpair id="xen1-meta-attr01" name="allow_migrate" value="true"/>
       </attributes>
     </meta_attributes>
   </primitive>
</resources>

Run cibadmin -C -o resources -x xen.xml. Now your cluster should be fully operational and the GUI should look something like this:

cluster_full.png
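
Besides the GUI, the heartbeat2 command line tools are handy for day-to-day operation; a few examples (exact options may vary slightly between heartbeat2 versions, resource and node names as used above):

crm_mon -1                                   # one-shot overview of nodes and resources
crm_verify -L -V                             # sanity-check the live cluster configuration
crm_resource -M -r xen1 -H phd-hacnode2      # migrate domain xen1 to the other node
crm_resource -U -r xen1                      # remove the migration constraint again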