Building a Linux cluster using PXE, DHCP, TFTP and NFS

Building a small Linux cluster is a lot simpler than I thought it would be. That said, there are a number of snags and pitfalls along the way, and it’s hard to find a comprehensive and up-to-date set of instructions online. There are also different approaches: either doing everything manually or using a system such as LTSP. This post describes my experiences setting up a cluster manually.

Warning: This is a long post! The steps are all relatively simple but there are a lot of them. Feedback is welcome in the comments section at the bottom. And if you need a Linux cluster but the instructions below sound like too much work, we’re available for hire! Contact us for a quote.

The plan

The basic idea is to have one cluster head node or master, with one or more identical slave nodes. The head node should have two network cards and is essentially a conventional Linux install. One network card will be connected to your main network, and presumably from there to the internet. The other network card will be connected to a switch, to which the slave nodes are then connected. The slave nodes boot over this small network and get their configuration from the head node. They all share the same read-only NFS filesystem, although if you need local scratch space, e.g. for /tmp, that’s fairly straightforward to set up. I used Ubuntu 16.04 but similar steps will apply to other Linux distros.

Prerequisites

Before starting this guide, you’ll need to install Ubuntu onto your head node. I won’t go into the details of how to do that as there are many, many guides online about setting up an Ubuntu server. Once that’s done, you’ll need to configure each of the slave nodes to boot using PXE. This is usually a BIOS/EFI setting, specific to the particular model of computer. Again, there are plenty of instructions online about how to do this so I won’t cover it here. Come back once you’ve sorted out those two parts. Finally (and optionally) you may want to get the network hardware MAC addresses of the slave nodes. Once again, searching the internet is your friend. I’ll just mention here that if you boot a live CD on each slave node in turn and run ifconfig, you’ll see something like this:

eth0      Link encap:Ethernet  HWaddr 01:23:45:67:89:00  
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Interrupt:20 Memory:d0700000-d0720000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:8286 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8286 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1 
          RX bytes:699676 (699.6 KB)  TX bytes:699676 (699.6 KB)

The HWaddr part on the eth0 line is its MAC address. If you have multiple network interfaces, make sure you note down the MAC address of the one you’re going to use to connect the node to the cluster.
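
If you have a lot of nodes, copying MAC addresses by hand gets tedious. Here is a rough sketch of a helper that pulls them out of ifconfig output in the old format shown above. The function name list_macs is my own invention, and it assumes the HWaddr label is present; newer tools such as ip link label the address differently, so treat this as a starting point only.

```shell
# Print "interface MAC" for each line of old-style ifconfig output that
# carries an HWaddr field. Reads the ifconfig text on stdin, so it also
# works on output saved to a file. (list_macs is a hypothetical name.)
list_macs() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "HWaddr") print $1, $(i + 1) }'
}

# Demo using the eth0 line from the sample output above.
printf 'eth0      Link encap:Ethernet  HWaddr 01:23:45:67:89:00\n' | list_macs
# prints: eth0 01:23:45:67:89:00
```

On a live CD you would run it as ifconfig -a | list_macs. The loopback interface has no HWaddr field, so it is skipped.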

...

Done with the prerequisites? Right, you’re now ready to start the configuration. All of these instructions should be carried out on the head node. Don’t touch the slave nodes until the end. All commands should be run as root; I’m just too lazy to keep typing sudo. If you run sudo -s before you start, that will get you a root prompt.

DHCP server

First, install a DHCP server using apt-get install isc-dhcp-server. This allows the head node to hand out IP addresses to the slave nodes. You then need to configure it by editing the file /etc/dhcp/dhcpd.conf. Here’s an example one, assuming that the network addresses are 192.168.144.x (chosen so as not to conflict with most common network setups):

allow booting;
allow bootp;

subnet 192.168.144.0 netmask 255.255.255.0 {
	range 192.168.144.20 192.168.144.250;
	option domain-name "example.com";
	option domain-name-servers 192.168.144.1;
	option broadcast-address 192.168.144.255;
	option routers 192.168.144.1;
	next-server 192.168.144.1;
	option subnet-mask 255.255.255.0;

	filename "/pxelinux.0";
}

# force the client to this ip for pxe.
# This isn't strictly necessary but forces each computer to always have the same IP address
host node21 {
	hardware ethernet 01:23:45:a8:50:26;
	fixed-address 192.168.144.21;
}

host node22 {
	hardware ethernet 01:23:45:a8:50:1e;
	fixed-address 192.168.144.22;
}

If you have more slave nodes, add more host nodeXX { ... } blocks. As I mentioned, this part is optional, but can be helpful in debugging. The hardware ethernet XX:XX:XX:XX:XX:XX part is the MAC address you looked up earlier.
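
Typing out a host block per node is error-prone once you have more than a handful. The sketch below generates the blocks from a list of "node number, MAC address" pairs. It assumes the 192.168.144.x addressing and nodeXX naming from the example above; host_block and host_blocks are names I made up for illustration.

```shell
# Print one dhcpd.conf host block.
# Arguments: node number (also used as the final octet), MAC address.
host_block() {
    printf 'host node%s {\n\thardware ethernet %s;\n\tfixed-address 192.168.144.%s;\n}\n\n' "$1" "$2" "$1"
}

# Read "number MAC" pairs on stdin, one per line, and print a block for each.
# Blank lines are ignored.
host_blocks() {
    while read -r num mac; do
        if [ -n "$num" ]; then
            host_block "$num" "$mac"
        fi
    done
}

# Demo with the two nodes from the example config.
printf '21 01:23:45:a8:50:26\n22 01:23:45:a8:50:1e\n' | host_blocks
```

You could append the output to /etc/dhcp/dhcpd.conf, then check the syntax with dhcpd -t -cf /etc/dhcp/dhcpd.conf before restarting the service (systemctl restart isc-dhcp-server on Ubuntu 16.04).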

One quick note: if, like me, you end up here because you were trying to get LTSP to work and decided it was too hard to debug, make sure that the file /etc/ltsp/dhcpd.conf is NOT present on the head node! The Ubuntu DHCP server prefers the LTSP DHCP config file to its own, so you may end up bashing your head against the wall wondering why your changes aren’t being recognised.

TFTP

Install a TFTP server with apt-get install tftpd-hpa. Configure it by putting the following in /etc/default/tftpd-hpa:

TFTP_USERNAME="tftp"
TFTP_DIRECTORY="/var/lib/tftpboot"
TFTP_ADDRESS="192.168.144.1:69"
TFTP_OPTIONS="--secure --listen --create"

Make sure you replace the IP address with the correct one for your setup, if you decided to use a different one to the one in the DHCP section above.

Now put the necessary files for boot in the /var/lib/tftpboot directory. You’ll need a kernel image, an initrd to go with it, a BIOS image, a PXELINUX image, and a configuration file to go with the PXELINUX image. For simplicity’s sake I’m assuming that your slave nodes have pretty similar architecture to the head node so you can reuse the kernel, and boot via BIOS. You will need to research online if you want to set up PXELINUX with a different kernel or boot method. If the syslinux and PXELINUX paths below don’t exist on your head node, installing the pxelinux and syslinux-common packages should provide them. Here are the commands you need:

cp /boot/vmlinuz-$(uname -r) /var/lib/tftpboot/
cp /boot/initrd.img-$(uname -r) /var/lib/tftpboot/
cp /usr/lib/syslinux/modules/bios/ldlinux.c32 /var/lib/tftpboot/
cp /usr/lib/PXELINUX/pxelinux.0 /var/lib/tftpboot/
mkdir /var/lib/tftpboot/pxelinux.cfg/

Finally, create the file /var/lib/tftpboot/pxelinux.cfg/default and put the following contents in it:

DEFAULT linux
LABEL linux
KERNEL vmlinuz-4.4.0-24-generic
APPEND root=/dev/nfs initrd=initrd.img-4.4.0-24-generic nfsroot=192.168.144.1:/clusternfs,ro ip=dhcp ro
IPAPPEND 2

You’ll need to replace the exact kernel and initrd.img filenames with the ones you copied into /var/lib/tftpboot. If you chose a different IP address block, you’ll need to change that part too. The /clusternfs part is the location of the slave node filesystem on the head node. We’ll set this up in the next section. You can use a different path if you want, but you’ll need to make sure it’s consistent between TFTP and NFS. The final line, IPAPPEND 2, makes PXELINUX add a BOOTIF= option containing the MAC address of the boot interface to the kernel command line. It’s there to work around a bug in Ubuntu affecting PXE boot.
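
Since the kernel and initrd filenames have to match what you copied with $(uname -r), one way to avoid typos is to generate the file. This is a minimal sketch, assuming the same layout as the example above; pxe_default is a hypothetical helper name, not part of PXELINUX.

```shell
# Print a pxelinux.cfg/default for the given kernel version, server IP
# and NFS root path. The layout mirrors the example config above.
pxe_default() {
    kver=$1
    server=$2
    nfsroot=$3
    cat <<EOF
DEFAULT linux
LABEL linux
KERNEL vmlinuz-${kver}
APPEND root=/dev/nfs initrd=initrd.img-${kver} nfsroot=${server}:${nfsroot},ro ip=dhcp ro
IPAPPEND 2
EOF
}

# Demo: print the config for the kernel currently running on this machine.
pxe_default "$(uname -r)" 192.168.144.1 /clusternfs
```

Redirect the output to /var/lib/tftpboot/pxelinux.cfg/default once you’re happy with it. This only works if the head node is still running the same kernel you copied into /var/lib/tftpboot.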

Create slave node filesystem

We’re now ready to create the base Ubuntu image that the cluster nodes will boot into. Here’s how.

mkdir /clusternfs
debootstrap xenial /clusternfs/
cp -a /lib/modules /clusternfs/lib/

The final line is optional, but you’ll need it if you want the slave nodes to be able to load any extra kernel modules, e.g. for hardware sensors. If you want a different version of Ubuntu, replace xenial with the version of your choice.

This will give you a very basic filesystem. It’s highly likely that you’ll want to customise it. For simple changes, you can just edit the files in /clusternfs directly. For more complicated things, such as installing extra software using apt-get or adding users via adduser, you’ll need to chroot into the slave node setup with the command chroot /clusternfs followed by mount -t proc none /proc. When you’re done, run umount /proc and then make sure to exit the chroot by logging out of it.

NFS

Now you have your slave node filesystem set up, but you need to make it accessible over the network. To do that, we’re going to use NFS. First, install it with apt-get install nfs-kernel-server. Then edit /etc/exports and put the following line into it:

/clusternfs 192.168.144.0/24(ro,no_root_squash,async,insecure,no_subtree_check)

Then tell NFS to pick up the changes with exportfs -r.
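
The NFS path has to agree in two places: the nfsroot= option in the PXELINUX config and the first field of the line in /etc/exports. Here is a small sketch of a check for that. The function names are mine, and the parsing assumes the nfsroot=SERVER:PATH,options form used in this post.

```shell
# Extract PATH from the first nfsroot=SERVER:PATH[,options] in a
# PXELINUX config file. Argument: path to the config file.
nfsroot_path() {
    sed -n 's/.*nfsroot=[^: ]*:\([^, ]*\).*/\1/p' "$1" | head -n 1
}

# Succeed (exit status 0) if that path is the first field of some line
# in the exports file. Arguments: PXELINUX config file, exports file.
nfsroot_is_exported() {
    path=$(nfsroot_path "$1")
    [ -n "$path" ] && awk -v p="$path" '$1 == p { found = 1 } END { exit !found }' "$2"
}

# Example use on the head node (not run here):
#   nfsroot_is_exported /var/lib/tftpboot/pxelinux.cfg/default /etc/exports \
#       && echo "paths match"
```

This only compares the paths. It doesn’t check that the export is actually live; exportfs -v on the head node will show you what is currently being exported.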

You’ll also need to tell the slave nodes that the root filesystem lives on an NFS share. We’ve already told the kernel that in the PXE boot setup above, but the change needs to be made in one other place. Edit /clusternfs/etc/fstab and make sure it contains the following two lines:

proc            /proc         proc   defaults       0      0
/dev/nfs        /             nfs    defaults       0      0

Firing it up

Now, the moment of truth. Connect all of your slave nodes to a network switch, and connect that switch to the second network card on your head node. Boot the first slave node, preferably with a monitor attached, so you can see what it’s doing. It should tell you that it’s PXE booting, and that it’s picked up an IP address. After that it will tell you that it’s loaded the kernel over the network, followed by the initrd. From there it should only be a short wait until it brings you to the standard login prompt. If it doesn’t, see the troubleshooting section below.

Assuming it worked, well done! Boot the remaining slave nodes and you now have a working Linux cluster. The following bonus points section will helpfully give you a few pointers on how you might actually use it.

Bonus points

You can run processes on the slave nodes by installing an SSH server into the slave node image. You’ll probably want to use SSH keys to allow passwordless logins from the head node. You might also want to consider something like Open MPI to manage dividing jobs up between the slave nodes rather than rolling your own job management system. Unfortunately, that’s outside the scope of this blog post, which is already far too long as it is!

All nodes will have the same hostname. This can be mildly confusing, and can cause problems with Open MPI. To get each node to dynamically set its hostname based on its IP address, add something like the following to /clusternfs/etc/rc.local:

hostname node$(ifconfig | sed -En 's/127.0.0.1//;s/.*inet (addr:)?(([0-9]*\.){3}[0-9]*).*/\2/p' | cut -d . -f 4)

Now each slave node will have the hostname nodeXX, where XX is the final octet of its IP address. Make sure you put the line above the exit 0 at the end of /clusternfs/etc/rc.local.
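
If the sed expression is hard to read, the "final octet" part on its own can be done with plain shell parameter expansion. This is just an illustrative sketch; node_name is a hypothetical helper, and you still need to get the node’s IP address from somewhere.

```shell
# Turn an IPv4 address into nodeXX, where XX is the final octet.
# ${1##*.} strips everything up to and including the last dot.
node_name() {
    printf 'node%s\n' "${1##*.}"
}

node_name 192.168.144.21   # prints node21
```

In rc.local you might combine it with hostname -I, e.g. hostname "$(node_name "$(hostname -I | cut -d ' ' -f 1)")", assuming the first address listed is the one on the cluster network.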

You might want swap and/or scratch space on each slave node. To set that up, each node will need an internal disk. Make sure that you partition each one the same way, e.g. swap in partition /dev/sda1, ext4 filesystem for /tmp in /dev/sda2, etc. Then add something like the following two lines into /clusternfs/etc/fstab:

/dev/sda1    none       swap   sw             0      0
/dev/sda2    /tmp       ext4   nodev,nosuid,noexec,noatime 0      2

Troubleshooting

There are a number of places where this can go wrong, and I can’t cover them all in depth. First, the slave nodes might try to boot by some means other than PXE. You’ll need to consult the documentation that came with them on how to configure that, as it varies between different motherboards and network hardware. Second, the DHCP server might not work. To troubleshoot that you’re probably best off plugging in a laptop or similar and seeing if you can get that to pick up an IP address from the head node via DHCP. After that, the slave nodes might try to PXE boot but fail to find the kernel and/or initrd. If that happens, you may find gPXE useful in debugging. Once the kernel and/or initrd are loading, the NFS mount might fail. Again, your best bet here is probably to plug in a laptop and try to mount the NFS share manually. The error messages you receive while doing that should help you track down the cause of the problem. And finally, if the slave nodes start the boot process successfully but are very slow reaching login, with error messages like “IP-Config: no response after 3 secs - giving up”, or even a kernel panic due to timeouts, make sure that you’ve included the line IPAPPEND 2 in /var/lib/tftpboot/pxelinux.cfg/default. Good luck!

Footnote

Just in case anyone’s wondering why I didn’t use LTSP: I looked at that first and spent quite a lot of time on it. After hours of trawling the internet for fixes to the slave nodes repeatedly crashing partway through the boot process and dropping down to a BusyBox prompt in the initramfs, I decided that LTSP was too complicated for me. I’m sure it’s great when you understand it and it works, but I don’t and it didn’t. The manual method might be a bit more fiddly but at least I understand every step of it, which makes troubleshooting a lot simpler.
