Building a Linux cluster using PXE, DHCP, TFTP and NFS

Building a small Linux cluster is a lot simpler than I thought it would be. That said, there are a number of snags and pitfalls along the way, and it’s hard to find a comprehensive and up to date set of instructions online. There are also different approaches, either doing everything manually or using a system such as LTSP. This post describes my experiences setting up a cluster manually.

Warning: This is a long post! The steps are all relatively simple but there are a lot of them. Feedback is welcome in the comments section at the bottom. And if you need a Linux cluster but the instructions below sound like too much work, we’re available for hire! Contact us for a quote.

The plan

The basic idea is to have one cluster head node, with one or more identical worker nodes. The head node should have two network cards and is essentially a conventional Linux install. One network card will be connected to your main network, and presumably from there to the internet. The other network card will be connected to a switch, to which the worker nodes are then connected. The worker nodes boot over this small network and get their configuration from the head node. They all share the same read-only NFS filesystem, although if you need local scratch space, e.g. for /tmp, that’s fairly straightforward to set up. I used Ubuntu 16.04 but similar steps will apply to other Linux distros.

Prerequisites

Before starting this guide, you’ll need to install Ubuntu onto your head node. I won’t go into the details of how to do that as there are many, many guides online about setting up an Ubuntu server. Once that’s done, you’ll need to configure each of the worker nodes to boot using PXE. This is usually a BIOS/EFI setting, specific to the particular model of computer. Again, there are plenty of instructions online about how to do this so I won’t cover it here. Come back once you’ve sorted out those two parts. Finally (and optionally) you may want to note down the MAC address of each worker node’s network card. Once again, searching the internet is your friend; I’ll just mention here that if you boot a live CD on each worker node in turn and run ifconfig, you’ll see something like this:

eth0      Link encap:Ethernet  HWaddr 01:23:45:67:89:00
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Interrupt:20 Memory:d0700000-d0720000 

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:8286 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8286 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:699676 (699.6 KB)  TX bytes:699676 (699.6 KB)

The HWaddr part on the eth0 line is its MAC address. If you have multiple network interfaces, make sure you note down the MAC address of the one you’re going to use to connect the node to the cluster.
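If you’d rather not pick through the full ifconfig output, most modern kernels also expose the MAC address directly through sysfs:

cat /sys/class/net/eth0/address

Replace eth0 with the name of the interface you’re actually going to use.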

...

Done with the prerequisites? Right, you’re now ready to start the configuration. All of these instructions should be carried out on the head node; don’t touch the worker nodes until the end. All commands should be run as root, as I’m too lazy to keep typing sudo. If you run sudo -s before you start, that will get you a root prompt.
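One other bit of groundwork: the cluster-facing network card on the head node needs a static IP address, since it will be the DHCP, TFTP and NFS server for the worker nodes. On Ubuntu 16.04 that’s a stanza in /etc/network/interfaces along these lines (eth1 is an assumption here; substitute the name of your second network card, and a different address range if you prefer):

auto eth1
iface eth1 inet static
	address 192.168.144.1
	netmask 255.255.255.0

Bring the interface up with ifup eth1 (or just reboot) before carrying on.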

DHCP server

First, install a DHCP server using apt-get install isc-dhcp-server. This allows the head node to hand out IP addresses to the worker nodes. You then need to configure it by editing the file /etc/dhcp/dhcpd.conf. Here’s an example one, assuming that the network addresses are 192.168.144.x (chosen so as not to conflict with most common network setups):

allow booting;
allow bootp;

subnet 192.168.144.0 netmask 255.255.255.0 {
	range 192.168.144.20 192.168.144.250;
	option domain-name "example.com";
	option domain-name-servers 192.168.144.1;
	option broadcast-address 192.168.144.255;
	option routers 192.168.144.1;
	next-server 192.168.144.1;
	option subnet-mask 255.255.255.0;

	filename "/pxelinux.0";
}

# force the client to this ip for pxe.
# This isn't strictly necessary but forces each computer to always have the same IP address
host node21 {
	hardware ethernet 01:23:45:a8:50:26;
	fixed-address 192.168.144.21;
}

host node22 {
	hardware ethernet 01:23:45:a8:50:1e;
	fixed-address 192.168.144.22;
}

If you have more worker nodes, add more host nodeXX { ... } blocks. As I mentioned, this part is optional, but can be helpful in debugging. The hardware ethernet XX:XX:XX:XX:XX:XX part is the MAC address you looked up earlier.
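It’s also worth telling the DHCP server to listen only on the cluster-facing network card, otherwise it may start handing out 192.168.144.x addresses to machines on your main network. On Ubuntu 16.04 that’s done in /etc/default/isc-dhcp-server (again, eth1 is an assumption; use whichever interface is connected to the cluster switch):

INTERFACES="eth1"

Restart the server with service isc-dhcp-server restart once both files are in place.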

One quick note: if, like me, you end up here because you were trying to get LTSP to work and decided it was too hard to debug, make sure that the file /etc/ltsp/dhcpd.conf is NOT present on the head node! The Ubuntu DHCP server prefers the LTSP DHCP config file to its own, so you may end up bashing your head against the wall wondering why your changes aren’t being recognised.

TFTP

Install a TFTP server with apt-get install tftpd-hpa. Configure it by putting the following in /etc/default/tftpd-hpa:

TFTP_USERNAME="tftp"
TFTP_DIRECTORY="/var/lib/tftpboot"
TFTP_ADDRESS="192.168.144.1:69"
TFTP_OPTIONS="--secure --listen --create"

Make sure you replace the IP address with the correct one for your setup, if you chose a different address range from the one in the DHCP section above.

Now put the necessary files for boot in the /var/lib/tftpboot directory. You’ll need a kernel image, a matching initrd, the ldlinux.c32 SYSLINUX module (the BIOS build), the pxelinux.0 PXELINUX bootloader, and a configuration file for PXELINUX. On Ubuntu 16.04, ldlinux.c32 and pxelinux.0 come from the syslinux-common and pxelinux packages respectively, so install those if the files below aren’t already present. For simplicity’s sake I’m assuming that your worker nodes have a similar enough architecture to the head node that you can reuse its kernel, and that they boot via BIOS. You will need to research online if you want to set up PXELINUX with a different kernel or boot method. Here are the commands you need:

cp /boot/vmlinuz-$(uname -r) /var/lib/tftpboot/
cp /boot/initrd.img-$(uname -r) /var/lib/tftpboot/
cp /usr/lib/syslinux/modules/bios/ldlinux.c32 /var/lib/tftpboot/
cp /usr/lib/PXELINUX/pxelinux.0 /var/lib/tftpboot/
mkdir /var/lib/tftpboot/pxelinux.cfg/

Finally, create the file /var/lib/tftpboot/pxelinux.cfg/default and put the following contents in it:

DEFAULT linux
LABEL linux
KERNEL vmlinuz-4.4.0-24-generic
APPEND root=/dev/nfs initrd=initrd.img-4.4.0-24-generic nfsroot=192.168.144.1:/clusternfs,ro ip=dhcp ro
IPAPPEND 2

You’ll need to replace the exact kernel and initrd.img filenames with the ones you copied into /var/lib/tftpboot. If you chose a different IP address block, you’ll need to change that part too. The /clusternfs part is the location of the worker node filesystem on the head node. We’ll set this up in the next section. You can use a different path if you want, but you’ll need to make sure it’s consistent between TFTP and NFS. The final line is to work around a bug in Ubuntu affecting PXE boot.
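At this point it’s worth a quick sanity check that the TFTP server is actually serving the files. Restart it with service tftpd-hpa restart, then try pulling down the bootloader with a TFTP client, either from the head node itself or from a laptop plugged into the cluster switch:

apt-get install tftp-hpa
tftp 192.168.144.1 -c get pxelinux.0

If pxelinux.0 turns up in your current directory, the TFTP side is working.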

Create worker node filesystem

We’re now ready to create the base Ubuntu image that the cluster nodes will boot into. Here’s how.

apt-get install debootstrap
mkdir /clusternfs
debootstrap xenial /clusternfs/
cp -a /lib/modules /clusternfs/lib/

The final line isn’t necessary, but if you want to be able to load any extra kernel modules, e.g. for hardware sensors, you’ll need to do it. If you want a different version of Ubuntu, replace xenial with the version of your choice.

This will give you a very basic filesystem. It’s highly likely that you’ll want to customise it. For simple changes, you can just edit the files in /clusternfs directly. For more complicated things, such as installing extra software with apt-get or adding users with adduser, you’ll need to chroot into the worker node filesystem: run chroot /clusternfs, then mount -t proc none /proc inside the chroot. Make sure you unmount /proc and exit the chroot once you’re done.
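A typical session might look something like this (vim here is just an example package to install):

cp /etc/resolv.conf /clusternfs/etc/    # so apt can resolve hostnames inside the chroot
chroot /clusternfs
mount -t proc none /proc
apt-get update
apt-get install vim                     # example package only
umount /proc
exit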

NFS

Now you have your worker node filesystem set up, but you need to make it accessible over the network. To do that, we’re going to use NFS. First, install it with apt-get install nfs-kernel-server. Then edit /etc/exports and put the following line into it:

/clusternfs 192.168.144.0/24(ro,no_root_squash,async,insecure,no_subtree_check)

Then tell NFS to pick up the changes with exportfs -r.
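You can double-check that the export is visible with showmount, run on the head node:

showmount -e 192.168.144.1

It should list /clusternfs as exported to 192.168.144.0/24.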

You’ll also need to tell the worker nodes that the root filesystem lives on an NFS share. We’ve already told the kernel that in the PXE boot setup above, but the change needs to be made in one other place. Edit /clusternfs/etc/fstab and make sure it contains the following two lines:

proc            /proc         proc   defaults       0      0
/dev/nfs        /             nfs    defaults       0      0

Firing it up

Now, the moment of truth. Connect all of your worker nodes to a network switch, and connect that switch to the second network card on your head node. Boot the first worker node, preferably with a monitor attached, so you can see what it’s doing. It should tell you that it’s PXE booting, and that it’s picked up an IP address. After that it will tell you that it’s loaded the kernel over the network, followed by the initrd. From there it should only be a short wait until it brings you to the standard login prompt. If it doesn’t, see the troubleshooting section below.

Assuming it worked, well done! Boot the remaining worker nodes and you now have a working Linux cluster. The bonus points section below gives a few pointers on how you might actually use it.

Bonus points

You can run processes on the worker nodes by installing an SSH server into the worker node image. You’ll probably want to use SSH keys to allow passwordless logins from the head node. You might also want to consider something like Open MPI to manage dividing jobs up between the worker nodes rather than rolling your own job management system. Unfortunately, that’s outside the scope of this blog post, which is already far too long as it is!
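As a very rough sketch of the SSH part, assuming you’re happy to log in as root on the workers: install openssh-server into the image (using the chroot technique from the filesystem section), then put the head node’s public key into the image’s authorized_keys:

chroot /clusternfs apt-get install openssh-server
ssh-keygen -t rsa                                     # on the head node, if you don't already have a key
mkdir -p /clusternfs/root/.ssh
cat /root/.ssh/id_rsa.pub >> /clusternfs/root/.ssh/authorized_keys

After that, ssh 192.168.144.21 (and so on for the other nodes) from the head node should log you in without a password.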

All nodes will have the same hostname. This can be mildly confusing, and can cause problems with Open MPI. To get each node to dynamically set its hostname based on its IP address, add something like the following to /clusternfs/etc/rc.local:

hostname node$(ifconfig | sed -En 's/127.0.0.1//;s/.*inet (addr:)?(([0-9]*\.){3}[0-9]*).*/\2/p' | cut -d . -f 4)

Now each worker node will have the hostname nodeXX, where XX is the final octet of its IP address. Make sure you put the line above the exit 0 at the end of /clusternfs/etc/rc.local.
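For reference, the finished file ends up looking something like this (the #!/bin/sh -e line and the exit 0 are already there in the stock Ubuntu rc.local):

#!/bin/sh -e
# set the hostname from the last octet of this node's IP address
hostname node$(ifconfig | sed -En 's/127.0.0.1//;s/.*inet (addr:)?(([0-9]*\.){3}[0-9]*).*/\2/p' | cut -d . -f 4)
exit 0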

You might want swap and/or scratch space on each worker node. To set that up, each node will need an internal disk. Make sure that you partition each one the same way, e.g. swap in partition /dev/sda1, ext4 filesystem for /tmp in /dev/sda2, etc. Then add something like the following two lines into /clusternfs/etc/fstab:

/dev/sda1    none       swap   sw             0      0
/dev/sda2    /tmp       ext4   nodev,nosuid,noexec,noatime 0      2

Troubleshooting

There are a number of places where this can go wrong, and I can’t cover them all in depth. First, the worker nodes might try to boot by some means other than PXE. You’ll need to consult the documentation that came with them on how to configure that, as it varies between different motherboards and network hardware. Second, the DHCP server might not work. To troubleshoot that you’re probably best off plugging in a laptop or similar and seeing if you can get that to pick up an IP address from the head node via DHCP. After that, the worker nodes might try to PXE boot but fail to find the kernel and/or initrd. If that happens, you may find gPXE useful in debugging. Once the kernel and initrd are loading, the NFS mount might fail. Again, your best bet here is probably to plug in a laptop and try to mount the NFS share manually; the error messages you receive while doing that should help you track down the cause of the problem. And finally, if the worker nodes start the boot process successfully but are very slow reaching login, with error messages like “IP-Config: no response after 3 secs - giving up”, or even kernel panic due to timeouts, make sure that you’ve included the line IPAPPEND 2 in /var/lib/tftpboot/pxelinux.cfg/default. Good luck!
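A general tip that helps with most of the above: watch the logs on the head node while a worker tries to boot. The DHCP server logs requests and leases to syslog, so running

tail -f /var/log/syslog

on the head node will show you how far each boot attempt gets. If you also want to see individual TFTP transfers, add --verbose to TFTP_OPTIONS in /etc/default/tftpd-hpa and restart it.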

Footnote

Just in case you’re wondering why I didn’t use LTSP: I looked at that first and spent quite a lot of time on it. After hours of trawling the internet for fixes to the worker nodes repeatedly crashing partway through the boot process and dropping down to a BusyBox prompt in the initramfs, I decided that LTSP was too complicated for me. I’m sure it’s great when you understand it and it works, but I don’t and it didn’t. The manual method might be a bit more fiddly but at least I understand every step of it, which makes troubleshooting a lot simpler.

Comments

Anonymous - Wed, 24/05/2017 - 12:39

Thank you so much for this, helped me a lot, especially the part in which you say to install 'debootstrap'! That's the utility I was missing and didn't know how to prep the directory.
Going to figure out why it is stuck at boot though, debootstrapped debian doesn't like nfs?

Thanks for the quick and easy workaround for the known bug. What is it doing and why does it work?

Mahir Sayar - Thu, 20/09/2018 - 00:40

Hey there, this is awesome reading.. but i'm just trying to get my head around this..

so we've set up a head node, and set up a root image the workers will boot... but how exactly does that give me a cluster for high performance tasks?

am i to understand that by running a program on the head machine, it will utilize the CPU of the worker machines?

You will need some kind of parallel processing/queuing framework to distribute the computational tasks that you want to run. For example, Open MPI or Gearman, although there are many alternatives out there. You will need to identify a suitable framework for the kind of jobs that you want to run and the programming skills that you have available in your organisation. Unfortunately that topic is very open-ended and a bit outside the scope of this post!

In a situation where the head node has two network cards (one for the cluster and one for the internet) and each worker has only one, how would you add internet connectivity to the worker nodes?

Hello:
The kernel and the initrd boot well but then the system drops to busybox; it seems the system finds /dev/nbd0 instead of nfs.
Thanks

Hello:
If nfs is mounted ro, how can I modify the files in my home directory? Thanks a lot

Hi Luis, if you need write access to home directories, you should follow similar steps to the ones noted above for setting up a swap partition or /tmp. Add a line like this to /clusternfs/etc/fstab:

/dev/sda3 /home ext4 nodev,nosuid,noexec,noatime 0 2

Bear in mind that files in home directories will only be accessible on the individual node; they will not be shared between nodes. If you need shared file access between nodes, you could set up a separate NFS directory for /home and mount that rw. In this case you need to be extremely careful about file locking and write conflicts though.
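As a rough sketch of that (the paths here are just examples): add an extra export alongside the existing one in /etc/exports on the head node, e.g.

/clusterhome 192.168.144.0/24(rw,no_root_squash,async,insecure,no_subtree_check)

and a matching line in /clusternfs/etc/fstab so the workers mount it at boot:

192.168.144.1:/clusterhome  /home  nfs  rw  0  0

Run exportfs -r on the head node after editing /etc/exports.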

Hope that helps.
