Folding@Home on AWS to kick the arse of coronavirus

Folding@Home popped up on my radar due to a recent announcement that their computational research platform is adding a bunch of projects to study (and ultimately help fight) the COVID-19 virus. Previously I didn’t have any good machine at hand to help in such efforts (my 9-year-old Lenovo X201 is still cozy to work with, but doesn’t pack a computing punch). At work, however, I get to be around GPU machines much more, and that gave me ideas on how to contribute a bit more.

Poking around the available GPU instance types on AWS, I saw that there are some pretty affordable ones in the G4 series, going as low as roughly $0.60/hour for a decent & recent CPU and an NVIDIA Tesla T4 GPU. This drops even further with spot instances: looking around the different regions, I’ve seen available capacity at $0.16-0.20/hour, which feels firmly in the bargain category. Thus I thought of spinning up a Folding@Home server in the cloud on spot instances, to help out and hopefully learn a thing or two, at the price of roughly 2 cups of gourmet London coffee (or taking the tube to work) per day.

Looking at the instance types, there are a few others besides the mentioned g4dn.xlarge to choose from, but I’m going to stick with that for the time being:

  • larger g4dn instances aren’t really worth it, since the GPU does the heavy lifting, and it’s the same GPU all the way up to 12xlarge, which comes with 4 GPUs but is more than 4x as expensive, so it would rather be wasted.
  • The more powerful p3 instances also don’t seem particularly worth it, as the difference between their NVIDIA V100 and the T4 is a much smaller multiplier than the price difference (based on a quick search for benchmarks: performance is roughly 2x, while the price of the smallest machine, the 2xlarge, is 5-6x).
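To put the coffee comparison in numbers, a quick back-of-envelope at an assumed ~$0.18/hour spot price (the actual price varies by region and over time):

```shell
# Rough running cost at an assumed spot price of $0.18/hour
hourly=0.18
daily=$(awk -v h="$hourly" 'BEGIN { printf "%.2f", h * 24 }')
monthly=$(awk -v h="$hourly" 'BEGIN { printf "%.2f", h * 24 * 30 }')
echo "daily: \$$daily, monthly: \$$monthly"   # daily: $4.32, monthly: $129.60
```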

Software setup

I’ve spun up an instance easily enough, and with a bit of trial & error got the setup sorted.

Using an Ubuntu system, the required fahclient package installed just fine as per the documentation, but the GPU side needed some extra poking. Things were unblocked by installing the NVIDIA drivers and OpenCL packages (thanks to the F@H forums), in my case:

sudo apt install -qy nvidia-headless-435 ocl-icd-opencl-dev
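To sanity-check that the driver actually exposes an OpenCL device (which is what the GPU folding slot uses), the clinfo tool can help; on a g4dn instance it should list the Tesla T4:

```shell
sudo apt install -qy clinfo
# list the OpenCL devices the installed drivers expose
clinfo | grep -i 'device name'
```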

The next step was adding a good Folding@Home config, again with a bit of trial and error. The docs say lots of the pieces can be left to self-configure (the folding slots in particular), but I’ve found that setting them explicitly works better overall. Thus my /etc/fahclient/config.xml file looks something like this:

<config>
  <!-- Client Control -->
  <fold-anon v='true'/>

  <!-- Folding Slot Configuration -->
  <gpu v='true'/>

  <!-- Slot Control -->
  <power v='full'/>

  <!-- User Information -->
  <passkey v='111111111111111111111'/>
  <team v='xxxxx'/>
  <user v='YYYYYYYYYYYY'/>

  <!-- Folding Slots -->
  <slot id='0' type='CPU'/>
  <slot id='1' type='GPU'/>

  <allow>127.0.0.1 A.B.C.D</allow>
  <web-allow>127.0.0.1 A.B.C.D</web-allow>

  <!-- Remote Command Server -->
  <password v='zzzzzzzzz'/>
</config>

Here I omitted my user name and passkey (naturally), so others can fill in their own. I’ve also joined the ArchLinux team (number 45032 ;), but to each their own. The last part of the allow/web-allow sections is my VPN’s IP address, so I can connect to the server remotely without opening it up to the rest of the world. That part (A.B.C.D) can be removed, and one could, for example, use SSH port forwarding to connect to the server instead (forwarding the required port 7396). Finally, the password setting allows the remote FAHControl graphical interface to connect to the folding service remotely (without port forwarding).
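For reference, the SSH forwarding alternative is a one-liner along these lines (the instance address is a placeholder):

```shell
# Forward the web interface (port 7396) to the local machine;
# http://localhost:7396 then shows the remote client
ssh -N -L 7396:localhost:7396 ubuntu@<instance-address>
```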

This setup then got folding. To ensure that things were running fine on the GPU, I also built nvtop on the machine and checked that the unit is maxed out:

nvtop when folding happily

Launch Template

So far so good, but let’s make things more automatic. Spot instances can be killed, or I might want to spin up some extra instances, and I’d rather have as little manual work to do as possible. What I converged on is a Launch Template which sets up everything needed, so I can start a new folding instance with a couple of clicks. In there I’ve set:

  • the instance type, g4dn.xlarge
  • an Ubuntu 18.04 system
  • the security group, which allows all traffic from my VPN (otherwise port 22 for ssh would be enough, with the SSH tunneling mentioned above)
  • that these are spot requests
  • my default AWS keypair for ssh access
  • some tags for housekeeping (definitely optional)
  • user data that does the whole setup on system start
Launch templating: it took 5 versions to converge

Of the parts above, naturally the user data took the most time to figure out, because of some peculiarities of the setup.

First, FAHClient wants to interactively set things up when it is installed, so I had to get around that. If I pre-create the correct config.xml file before the install, fortunately only a single question remains (whether the service should be started automatically), and that one thing is taken care of by a bit of expect scripting.

#!/bin/bash

# apt should not ask questions during setup
export DEBIAN_FRONTEND=noninteractive
sudo apt update
sudo apt install -qy nvidia-headless-435 ocl-icd-opencl-dev expect

# Download the FAHClient package and pre-create its config,
# so the install has (almost) no interactive questions left
wget https://download.foldingathome.org/releases/public/release/fahclient/debian-testing-64bit/v7.4/fahclient_7.4.4_amd64.deb
sudo mkdir /etc/fahclient/ || true
sudo chmod 777 /etc/fahclient
cat <<EOF > "/etc/fahclient/config.xml"
<config>
  <!-- Client Control -->
  <fold-anon v='true'/>

  <!-- Folding Slot Configuration -->
  <gpu v='true'/>

  <!-- Slot Control -->
  <power v='full'/>

  <!-- User Information -->
  <passkey v='111111111111111111111'/>
  <team v='xxxxx'/>
  <user v='YYYYYYYYYYYY'/>

  <!-- Folding Slots -->
  <slot id='0' type='CPU'/>
  <slot id='1' type='GPU'/>

  <allow>127.0.0.1 A.B.C.D</allow>
  <web-allow>127.0.0.1 A.B.C.D</web-allow>

  <!-- Remote Command Server -->
  <password v='zzzzzzzzz'/>
</config>
EOF

# The one remaining question is answered by a small expect script
cat <<EOF > "/home/ubuntu/install.sh"
#!/usr/bin/expect
spawn dpkg -i --force-confdef --force-depends fahclient_7.4.4_amd64.deb
expect "Should FAHClient be automatically started?"
send "\r"
# accept the default answer above, then wait for the install to finish
expect eof
EOF

chmod +x /home/ubuntu/install.sh

sudo /home/ubuntu/install.sh

With this script passed to the instance as user data it all falls into place, and I can spin up a new folding instance any time.
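In fact, with the template in place, launching another instance doesn’t even need the console; something along these lines with the AWS CLI should do it (the template name here is a placeholder for whatever the template is called):

```shell
# Launch one more spot folder from the latest version of the launch template
aws ec2 run-instances \
    --launch-template LaunchTemplateName=folding,Version='$Latest' \
    --count 1
```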

Then there are two ways to connect to the server and monitor it remotely:

  • the web client, on port 7396, with an interface like the one at the top of this post, or
  • using the FAHControl desktop client, which can monitor and control multiple folding instances, and which I feel has better control over & more information about what’s being done. This listens on port 36330 by default, and for it to work remotely, a “password” has to be set in the configuration.
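If the graphical client isn’t handy, port 36330 also speaks a simple text protocol, so (if I read the third-party client docs right) the slots can be checked with plain netcat too; the password and address placeholders are the same ones as in the config above:

```shell
# Query FAHClient's remote command interface directly:
# 'auth' logs in with the configured password, 'slot-info' lists the slots
printf 'auth zzzzzzzzz\nslot-info\nexit\n' | nc A.B.C.D 36330
```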

Using these settings, the remote workload (both the CPU and GPU slots) pops up, and it’s possible to monitor & control it:

And that should do for now…

Notes & Future

Thus far I’ve learned:

  • A bit about spot instances. There are a lot more options which I haven’t touched and which might be useful in general, such as targets & instance pools, time-limited spot instances, etc., but those weren’t needed in this particular case.
  • A lot about launch templates. They seem handy, though one feature request would be the ability to edit a template’s description, or to have the description pre-filled when starting from a previous version (currently it is not, unlike all the other settings).
  • Some apt/dpkg coercion tricks for non-interactive setup, though there seems to be more to know. How nice it is on ArchLinux that non-interactive mode is basically a single -y flag away in pacman.
  • How to use user data, though that’s definitely just scratching the surface. What would be much better is learning cloud-init instead, which seems much more like the proper way to supply files to install and scripts to run on these virtual machines.
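For example, the user data script above could probably be reshaped into something like this cloud-config (an untested sketch; the packages/write_files/runcmd sections mirror the script’s steps):

```yaml
#cloud-config
packages:
  - nvidia-headless-435
  - ocl-icd-opencl-dev
  - expect
write_files:
  - path: /etc/fahclient/config.xml
    content: |
      <config>
        <!-- same contents as the config above -->
      </config>
runcmd:
  - wget https://download.foldingathome.org/releases/public/release/fahclient/debian-testing-64bit/v7.4/fahclient_7.4.4_amd64.deb
  - /home/ubuntu/install.sh
```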

I’ve also experienced that Folding@Home might be struggling a bit with the current load: earlier today the work servers seemed overloaded (though by now they seem to be okay), and the statistics servers too, so I’m guessing the whole infrastructure is under pressure. I wonder how they are set up, and where their bottlenecks are…

But now this is done, the ball is in the court of the researchers: keep the computational biochemistry coming. In the meantime stay safe, everyone. Wash your hands, don’t touch your face, and take good care of the people around you.

Edit 2020/03/14: Looking at their server stats and connecting the dots with their project stats, they might have run out of relevant work items for the time being. That’s kinda both good (likely a large response to their shout-out) and a bummer (resources just idling).