Improving reliability and power efficiency of my CI rack

This article is about hacking a UPS so it can manage power to my CI rack (which has 30 physical and virtual machines on it), so the rack can handle outages cleanly and be switched off when there are no jobs to do.

Storm Eunice Public Domain. By NASA - https://worldview.earthdata.nasa.gov/, Public Domain, https://commons.wikimedia.org/w/index.php?curid=115344722

In mid-Feb 2022 there were a series of storms in the UK that caused widespread damage. Our physical damage was limited to our bin shed being blown over and reduced to matchwood. But due to a power outage, which is unusual in the UK, there was quite a bit more virtual damage.

The LWS CI rack runs a mixture of 30 physical and virtual platforms and some of those did not react well to the outage. After a day figuring out what was broken in each virtual context I fixed most of it, but today, over a month later, there are still several platforms out. After updating the host OS for the nspawns, I can no longer share the related ttyUSB device nodes for the embedded platforms into the VM, and Xenial and another platform can no longer boot. So it’s in a bit of a degraded state, and although it will be fixed, I don’t want to have to run around dealing with the same amount of fallout again from events I can’t predict or control.

In addition to that, energy security is going to be a global problem in the medium term. With prices that will only be going up, it’s no longer reasonable to just burn the power this uses 24h a day when it is idle.

So Something Must Be Done.

Step 1: Get a suitable UPS

Since power is normally very reliable in the UK, I did not bother with a UPS until now; but nothing else can be improved until there’s a way both to handle outages cleanly and programmatically power off the rack as a whole. I chose a 3U rackmount 1200VA one, at under GBP200 it’s a pretty good deal. They’re available from other vendors than Amazon if you look around. It has an LCD display showing instantaneous power usage as a percentage of the 1200VA budget, which proved very interesting. But if you read on, you’ll see it has problems for this use-case and we’ll be deep-diving into the guts of that and hacking our way around it.

The UPS comes with a mini CD which is about as useful as a stone tablet in 2022. You can find a copy of the Windows 8 driver that’s on it if you look around the net, but that’s no use to me. There’s no vendor Linux support.

I expected UPSes to all use the UPS USB class devices nowadays, but no, the UPS market is a big stinking mess of proprietary hacks on silicon vendor reference designs with each one doing things slightly differently. The FOSS project for UPS management, NUT (Network UPS Tools), is therefore in a difficult position, and the userland driver part of that is also by necessity a big mess of duplicated one-time hacks nobody is really able to unify, since they don’t own all the 100+ models of UPS they support for testing and the original user lost interest when it worked well enough for him.

The typical flow is discussion on a mailing list between someone with the UPS and a dev, who guesstimates the required changes and gets the user to test it iteratively. He can never test it directly since he doesn’t have it and was never in the same room with one either.

This particular UPS model is not explicitly supported by NUT, and is somewhat bizarrely set to use a USB vid:pid of 0001:0000, which Fry’s Electronics got associated with first; this is not the sign of competence you would hope to see. It also reports a product ID MEC0003, which it seems is also found in many other UPS products, although with the variety of configurations that are supposed to work with products reporting like that, clearly that refers to the inner protocol and not the USB layer one. After some trial and error using Fedora’s packaged NUT I was able to get it to report all zeros as its status, which is something but not much use.

read: (000.0 000.0 000.0 000 00.0 0.00 00.0 00000000

With the CD, windows 8, stoneage USB proprietary protocol, it looks like it was designed in the early 2000s against a silicon vendor reference design and they just keep churning them out since they are good enough to be competitive.

I cloned the NUT sources and started fiddling in there. In the sources is a comment suggesting that Powercool users might have luck with nutdrv_qx userland driver set to use the hunnox subdriver. This subdriver does not even exist in the packaged Fedora NUT version. That did indeed work, it buys me sane lowlevel status

read: (238.0 000.0 240.0 023 50.1 27.5 29.0 00001000
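The reply is easier to read once it is split into named fields. The following is a minimal sketch, assuming the field order follows the usual Megatec / Q1 layout (input voltage, fault voltage, output voltage, load percent, frequency, battery voltage, temperature, status bits); `parse_q1` and the labels it prints are names made up for this example, not anything NUT provides.

```shell
#!/bin/sh
# Sketch: split a Q1-style status reply into named fields.
# ASSUMPTION: field order follows the usual Megatec/Q1 layout; the
# labels on the left are chosen for this example only.
parse_q1() {
    reply=${1#\(}          # drop the leading "(" if present
    set -f                 # no globbing while we split on spaces
    set -- $reply
    set +f
    if [ $# -ne 8 ]; then
        echo "unexpected field count: $#" >&2
        return 1
    fi
    echo "input.voltage=$1"
    echo "input.voltage.fault=$2"
    echo "output.voltage=$3"
    echo "load.percent=$4"
    echo "input.frequency=$5"
    echo "battery.voltage=$6"
    echo "temperature=$7"
    echo "status.bits=$8"
}

parse_q1 "(238.0 000.0 240.0 023 50.1 27.5 29.0 00001000"
```

The battery voltage, temperature and load values line up with what upsc reports for the same UPS, which is some evidence the field order is right for this unit.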

The magic ups.conf for that (remember hunnox only exists in the git NUT) is

[powercool]
        driver=nutdrv_qx
        vendorid=0001
        productid=0000
        product=MEC0003
        subdriver=hunnox
        langid_fix=0x409
        port = auto

Using upsc to cook the low-level driver status, we get

battery.voltage: 27.60
device.type: ups
driver.name: nutdrv_qx
driver.parameter.langid_fix: 0x409
driver.parameter.pollfreq: 30
driver.parameter.pollinterval: 2
driver.parameter.port: auto
driver.parameter.product: MEC0003
driver.parameter.productid: 0000
driver.parameter.subdriver: hunnox
driver.parameter.synchronous: auto
driver.parameter.vendorid: 0001
driver.version: 2.7.4-5059-ga8e3687a
driver.version.data: Q1 0.07
driver.version.internal: 0.32
driver.version.usb: libusb-1.0.23 (API: 0x1000107)
input.frequency: 50.0
input.voltage: 243.0
input.voltage.fault: 0.0
output.voltage: 245.0
ups.beeper.status: disabled
ups.delay.shutdown: 30
ups.delay.start: 180
ups.load: 22
ups.productid: 0000
ups.status: OL
ups.temperature: 29.0
ups.type: online
ups.vendorid: 0001

… which looks sane.

Step 2: No more Always On

Until now I just left the CI rack on all the time and hooked up to the Sai server, ready to go. But actually the rack doesn’t have anything to do except CI jobs, and although those sometimes come thick and fast, typically the rack is in fact idle.

The LCD on the UPS shows the power being used as a percentage of the 1200VA capacity. After arranging that the contents of the rack, and its DUT LAN switch, are powered via the UPS, I can see the rack idles at around 30%, which is about 360VA. This… is a lot of power when it’s on 24h a day and mainly idle.
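The LCD percentage converts to VA with simple arithmetic. This is a minimal sketch, assuming the display really is a percentage of the 1200VA rating; the function names are made up for the example, and real watts depend on the power factor, which is not measured here.

```shell
#!/bin/sh
# Sketch: convert the UPS LCD percentage into VA and VAh per day.
# ASSUMPTION: the display is a percentage of the 1200VA rating.
load_va() {        # usage: load_va PERCENT CAPACITY_VA
    echo $(( $1 * $2 / 100 ))
}
daily_vah() {      # usage: daily_vah VA
    echo $(( $1 * 24 ))
}

va=$(load_va 30 1200)
# prints: idle load: 360VA, about 8640VAh per day
echo "idle load: ${va}VA, about $(daily_vah "$va")VAh per day"
```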

What would be desirable is if the rack was off - the UPS can turn it all off programmatically - except when there is something to do on the Sai server. That’s easy to describe, but harder to implement.

To do this, there needs to be a “UPS manager” device that the UPS is plugged into, which is the only thing that is “always ON”. It needs to be on and have access to the internet normally when mains power is available, to check for remote jobs, regardless of whether the UPS has the rest of the rack powered or not. And it needs to be ON, with local DUT LAN access, when mains power has failed, so it can inform the different devices in the rack they need to perform an orderly shutdown.

UPS monitoring architecture

Rack physical devices are on their own ethernet switch + DUT LAN subnet with noi (which is the big PC we hope to be mainly OFF) acting as a router on to the home LAN, so it means on backup power the UPS Manager RPi can talk to anything on the DUT subnet, which will also be taking backup power. During an outage, the UPS Manager RPi is just trying to inform all the machines in the rack they should shut down in an orderly fashion due to an outage, and then turn off the UPS Backup and wait for happier days.

It boils down to the “UPS manager” device must be powered from both sides and have dual ethernet interfaces, one to access the internet when there is mains power and connectivity, and the other to access the rack devices on their LAN to inform them when they must cleanly shut down.
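The manager's job reduces to a small decision over three facts: is mains present, are jobs waiting, and is the load powered. The sketch below is only an illustration of that decision; `decide_action` and the action names it prints are hypothetical, and how the three facts are obtained is left out.

```shell
#!/bin/sh
# Sketch of the UPS manager's decision. All names are hypothetical.
# Arguments are yes/no flags: mains present, jobs pending, load powered.
decide_action() {
    mains=$1
    jobs=$2
    load=$3
    if [ "$mains" = no ]; then
        # Outage: if the rack is up, tell it to shut down cleanly.
        if [ "$load" = yes ]; then echo "shutdown-rack"; else echo "wait"; fi
    elif [ "$jobs" = yes ] && [ "$load" = no ]; then
        echo "power-up"
    elif [ "$jobs" = no ] && [ "$load" = yes ]; then
        echo "power-down"
    else
        echo "wait"
    fi
}

decide_action yes yes no    # mains ok, jobs waiting, rack off
```

Keeping the decision separate from the commands that act on it makes it possible to test the logic without touching the UPS.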

Step 3: Create the UPS manager RPi with Rocky

I redeployed an RPi4 in the rack with Rocky Linux 8.5 on it, and rebuilt NUT from git on that to the point I could reproduce the same UPS operation on the RPi4. The rackmount kit that I used adds a 2.4mm DC jack to the RPi at the back so there are already two power sources, the USB-C on the RPi and this jack, and I checked I can run both and unplug either without crashing the RPi.

Switchmode PSUs naturally adapt their output to the voltage at the load dynamically, so they don’t find it strange if their load is already close to or above their target voltage, they just stop driving the load until it falls below their target voltage.

The Rocky image for RPi can be found here

https://download.rockylinux.org/pub/rocky/8/rockyrpi/aarch64/images/

Don’t bother trying to use an SD card for this, they are too unreliable. Update the RPi4 bootloader to support USB boot and xzcat | dd the image on to a USB3 flash drive and use that as your storage. Hopefully the next gen of RPi boards will have an eMMC.

After install, don’t forget to add your own user and userdel their default rocky user since it has a fixed password rockylinux. Similarly, set up .ssh/authorized_keys and associated chmod for your user with your main PC user key so you can ssh in, you should check it works and then also change PasswordAuthentication no in /etc/ssh/sshd_config and restart sshd service.

The Rocky xz image expands to a fixed ~4GB size, run this script included with Rocky to resize the partition and expand the fs to fill your storage device.

$ sudo rootfs-expand

Building nut from git requires adding the nut user with useradd nut; the group already exists in Rocky.

I added a udev rule in /lib/udev/rules.d/52-nut-usbups.rules

ATTR{idVendor}=="0001", ATTR{idProduct}=="0000", ATTRS{product}=="MEC0003", MODE="0774", GROUP="nut", SYMLINK+="usb-ups"

then

$ sudo udevadm control --reload-rules && sudo udevadm trigger

to coldplug it and get the correct group on the device node.

After that, enable the PowerTools repo by setting enabled=1 in /etc/yum.repos.d/Rocky-PowerTools.repo and run dnf update -y.

Open the default nut port so clients will be able to connect to us

$ sudo firewall-cmd --permanent --add-port 3493/tcp

Set the hostname in /etc/hostname to something like ups-monitor.

The DUT LAN side ethernet must use a manual / static address / subnet / DNS / gateway, because otherwise it may not be able to reacquire DHCP properly when the RPi stayed up and the DHCP server went and stayed down.

$ sudo nmcli c m "Wired connection 2" ipv4.addresses 192.168.xx.xx/24
$ sudo nmcli c m "Wired connection 2" ipv4.dns 192.168.xx.1
$ sudo nmcli c m "Wired connection 2" ipv4.gateway 192.168.xx.1
$ sudo nmcli c m "Wired connection 2" ipv4.method manual

Reboot with sudo shutdown -r now

Step 4: Building NUT from git

First bring in the build prerequisites

$ sudo dnf install usbutils make git autoconf automake libtool libusb-devel openssl-devel python39

(yes, nut brings in a whole dependency on python, just to parse a config file) then

$ git clone https://github.com/networkupstools/nut.git
$ cd nut
$ ./autogen.sh
$ ./configure
$ make -j8 && sudo make install
$ sudo mkdir -p /var/state/ups
$ sudo chgrp nobody /var/state
$ sudo chgrp nobody /var/state/ups

It defaults to install its stuff in /usr/local/ups/...

Step 5: Realize the UPS has some problems and working around them…

Generally the UPS was workable as a UPS with hunnox and git NUT. Although we do need it to act like a traditional UPS and provide battery backup and failure indication so we can shut down, for us the main use of it is to power down the rack cleanly, either programmatically or upon an outage, keep it mostly powered off, and power it back up again automatically when there is work to do.

The UPS has two problems with that..

1) it will not stay powered down for longer than ~30m .. 2hr (shutdown.stayoff), it will autonomously repower itself presumably due to hardware bugs, and

2) it will not come back up again on command (load.on) or indeed by sending it any variation thereof, I have to press the frontpanel button for 3s to bring it back up with the load on.

As we will see, once it enters “PWR DN”, communication becomes sporadic, presumably due to powersaving sleeps inside the UPS, since it may be effectively “running from last dregs of battery” if it’s like that due to an outage and no mains. As it is, it works only as an always-on battery backup that drains its battery every time. It is not able to work as a programmatic switch for the load.

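| State    | Button action | Result                                               |
|----------|---------------|------------------------------------------------------|
| Load ON  | Press         | Instant OFF                                          |
| Load OFF | Press 1s      | Show display backlight for a few secs, keep load OFF |
| Load OFF | Press 3s      | Bring load ON                                        |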

After understanding the flow in NUT and not finding the main problems there, I removed the power, waited a bit and removed the 8 screws holding the front panel.

WARNING - I don’t recommend you do this unless you understand the dangers from having a 240VAC generator exposed to your hands… there is a “cold” side to the UPS that’s referenced to Earth ground, the USB connector and the metal case are on the safe, cold side. But inside, there is a “hot” side that is referenced to 240VAC as its “hot 0V”, touching this or anything referenced to it is hazardous to life. The internal wiring and PCBs are on this “hot” side. Since it is battery powered, simply having unplugged it has not made it safe at all.

I was initially a bit nonplussed, there are two sealed lead-acid batteries, with space for two more, a large wound transformer and a board with heatsink-ed discretes. But I could not see any ICs that might contain the smarts. I realized later that for power domain isolation standards reasons, all the PCBs are single-sided, and the traces for the “hot side” pcb are below it where you can’t see any SMT. So it’s under the “hot” pcb which I didn’t want to touch while there is 24V battery powering a 240V inverter on that board.

daughterboard

There’s a daughterboard at the back that has the USB connector and two RJ45s, but the RJ45s are not connected to anything but each other and a couple of diodes, it’s trying to be some kind of pointless surge protector or so.

Four wires come back from that to the hot board.

daughterboard

The chip is a CY7C63313 low-speed HID controller.

The -13 variant seems to have 8KB flash, there are two opto-isolators mounted there too. These turned out to be RX and TX for a 2400bps link.

I studied what travels on the link from the “cold” side, it’s literally the Q1 type protocol sent over the serial link by the nut hunnox driver, what comes back from the hot-side controller chip is the (000.0 ... stuff at 2400 bps 8/N/1.

I had thought I might replace or reflash the CY7C63313 but since it’s just dumbly passing through the serial protocol from USB <-> 2400bps UART, that’s not the source of the problems.

ups innards from USB

After musing for a bit I brought out an LTV816 5kV-rated isolator I had lying around and added a parallel optoisolated way to programmatically “press the front panel button”. Holding the button for 3s does bring us out of “PWR DN” with the load powered; the isolator hack allows us to control that from the RPi.
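From the RPi side, a button press is just a GPIO line held high for a fixed time. A minimal sketch, assuming the isolator is driven from line 26 of gpiochip 0 and that gpioset is the libgpiod v1 tool; `press_button` and the `DRY_RUN` switch are hypothetical names added here so the command can be inspected without touching the hardware.

```shell
#!/bin/sh
# Sketch: "press" the front panel button through the optoisolator.
# ASSUMPTIONS: isolator on line 26 of gpiochip 0, libgpiod v1 gpioset.
# DRY_RUN is a hypothetical switch: print the command, do not run it.
press_button() {   # usage: press_button SECONDS
    cmd="gpioset --mode=time -s $1 0 26=1"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$cmd"
    else
        sudo $cmd
    fi
}

DRY_RUN=1
press_button 3    # a 3s press brings the load ON from PWR DN
```

Since a press while the load is ON switches it off instantly, any real caller should check the load state before pressing.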

There are still problems… when the UPS is in “PWR DN”, the hot side does not issue status data except at > 30s intervals, the USB HID controller then seems to reply with 0x05 byte indicating that it is still there, but it did not get any response from the hot side to forward back to the USB host.

hunnox does not understand this since it’s not the ( it expected for status. However the NUT driver continues to report the last actual status that it had as if it was current: but this is stale garbage.
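The distinction a consumer would need to make is simple: only a reply starting with ( carries fresh status. This is a rough sketch of that check, assuming replies are handled as text; `classify_reply` is a made-up helper and not part of NUT or the hunnox subdriver.

```shell
#!/bin/sh
# Sketch: tell a real status reply from the 0x05 "still here" byte.
# ASSUMPTION: replies are handled as text; names are hypothetical.
classify_reply() {
    case $1 in
        "("*)                 echo status ;;
        "$(printf '\005')"*)  echo keepalive ;;
        *)                    echo unknown ;;
    esac
}

classify_reply "(238.0 000.0 240.0 023 50.1 27.5 29.0 00001000"
classify_reply "$(printf '\005')"
```

Anything other than "status" means the last known values should be treated as stale rather than current.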

However even allowing for this hot-side narcolepsy where it is waking only once per 30s briefly when in PWR DN, it does not respond to sending it even continuous load.on from NUT (which sends C at the protocol level, to cancel the shutdown) for over a minute. So there is no way to switch the UPS back on programmatically over USB as it stands.

It is also part of the reason why you see complaints around the net that “FSD” state reported by upsc is “sticky”, it was the last state received by the USB HID controller from the now OFF “hot side controller” and just keeps telling it in the absence of any new information. The main reason the FSD state is sticky is that nut-server holds the state and won’t stop reporting everything is about to shutdown until you restart the service.

So bringing the load back up and having it stay up can be done by

  • sudo systemctl restart nut-server on the RPi4
  • have the RPi4 “press the front panel UPS button” for 3s

(Update 2022-04-06: when in PWR DN, unfortunately sending it load.on over and over is not effective to bring the load back on by itself).

And there is a second problem, that even with the NUT RPi powered down, the UPS will restart the load between 30m .. 2h by itself, it looks like the “hot side controller” has a leakage problem when it is OFF that, eg, its power rail used to pull up the button interrupt gradually leaks when depowered until it looks like the button is pressed and it reapplies the load autonomously.

After some experimentation, when in PWR DN, sending shutdown.stayoff every 10m is an effective way to keep it in PWR DN for as long as you want.

Tying it all together, all the hacks are aimed at this state diagram:

ups states

Helper scripts:

/usr/local/bin/ups-up.sh

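#!/bin/sh
# Bring the load back up: clear NUT's sticky state, then "press" the button

sudo rm -f /tmp/last-stayoff
sudo touch /tmp/last-on
# restart nut-server so it drops any stale FSD state
sudo systemctl restart nut-server
sleep 4s
sudo /usr/local/ups/bin/upsrw -s ups.delay.start=1 -u nutuser -p nutpassword powercool
sudo /usr/local/ups/bin/upsrw -s ups.delay.shutdown=1 -u nutuser -p nutpassword powercool
# hold the optoisolated front panel button for 3s
sudo /usr/local/bin/gpioset --mode=time -s 3 0 26=1

# poll every 2s, 18 times, until the output voltage is back
for i in 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 ; do
    sleep 2s
    OUT=$(/usr/local/ups/bin/upsc powercool output.voltage | cut -d'.' -f1)
    # an empty reply counts as "not up yet" instead of breaking the test
    if [ "${OUT:-0}" -gt 90 ] ; then
        echo "back up"
        exit 0
    fi
done

exit 0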

/usr/local/bin/ups-down.sh

#!/bin/sh

sudo rm -f /tmp/last-on
sudo systemctl restart nut-driver@powercool
sleep 5s
sudo /usr/local/ups/bin/upsrw -s ups.delay.shutdown=20 -u nutuser -p nutpassword powercool
sudo touch /tmp/last-stayoff
sudo /usr/local/ups/bin/upscmd -u nutuser -p nutpassword  powercool shutdown.stayoff
sudo /usr/local/ups/bin/upscmd -u nutuser -p nutpassword  powercool shutdown.stayoff

/usr/local/bin/ups-poll.sh

#!/bin/sh

# powercool UPS will auto-wake after 30m - 2h if left alone in PWR-DN
# remind it to stay off every 10m while that's what we want avoids this

# use %Y (mtime, which touch updates); %W (birth time) may read as 0
# or "-" where stat or the filesystem does not report it
ON=0
if [ -e /tmp/last-on ] ; then
    ON=$(stat -c %Y /tmp/last-on)
fi
OFF=0
if [ -e /tmp/last-stayoff ] ; then
    OFF=$(stat -c %Y /tmp/last-stayoff)
fi

if [ "$ON" -gt "$OFF" ] ; then
    echo "exiting: ON more recently than OFF"
    exit 0
fi

OUT=$(/usr/local/ups/bin/upsc powercool output.voltage | cut -d'.' -f1)
if [ -z "$OUT" ] ; then
    echo "Empty output.voltage"
    OUT=0
fi

if [ "$ON" -eq 0 ] && [ "$OFF" -eq 0 ] && [ "$OUT" -gt 90 ] ; then
    echo "exiting: No on or off info, and output.voltage > 90"
    exit 0
fi

if [ "$OUT" -lt 90 ] ; then
    echo "telling it to stay off"

    /usr/local/ups/bin/upscmd -u nutuser -p nutpassword  powercool shutdown.stayoff
fi

/etc/crontab

  0,10,20,30,40,50 * * * * root /usr/local/bin/ups-poll.sh

… with this set of workarounds we’re finally back in business after a hardware hack and a software hack bound to an external controller solves two faults the UPS shipped with: it’s a bit messier than expected but that’s what you get for your cheapo GBP200 rackmount UPS.

Step 5a: Building libgpiod

The button hack needs libgpiod, naturally that is not packaged in Rocky. Mostly it uses the pieces already needed to build nut from git.

$ sudo dnf install autoconf-archive
$ git clone git://git.kernel.org/pub/scm/libs/libgpiod/libgpiod.git
$ cd libgpiod
$ ./autogen.sh
$ ./configure --enable-tools
$ make && sudo make install

Step 6: Set up the networking on NUT

NUT is a bit complicated but it does deal with the distributed shutdown process that is necessary when we have a lot of physical devices hanging off the UPS. There are several different systemd services (config paths are for git make install default configuration). “Server Side” means runs on the device that has the connection to the UPS, “Client Side” means a device that is powered via the UPS, but does not have a connection to it, and wants to be told about power status by the server.

For the Powercool UPS I have, the git version of NUT is needed on the server. For the clients though, it’s possible to use older distro NUT. The main difference is older NUT uses master / slave nomenclature and newer uses primary / secondary. Non-git, distro NUT is also likely built to use distro path conventions, like /etc/ups/.

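| Side   | Service            | Config                         | Functionality                                                                     |
|--------|--------------------|--------------------------------|-----------------------------------------------------------------------------------|
| Server | nut-driver@upsname | /usr/local/ups/etc/ups.conf    | Userland driver for connection to UPS                                             |
| Server | nut-server         | /usr/local/ups/etc/upsd.conf   | Network listener that accepts clients and informs them about UPS status           |
| Client | nut-monitor        | /usr/local/ups/etc/upsmon.conf | Network client that connects to a nut-server and reacts locally to UPS status there |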

For Server /usr/local/ups/etc/ups.conf:

[powercool]
    driver=nutdrv_qx
    vendorid=0001
    productid=0000
    product=MEC0003
    subdriver=hunnox
    langid_fix=0x409
    port = auto

Then start the nut userland driver

$ sudo systemctl enable nut-driver@powercool
$ sudo systemctl start nut-driver@powercool

For Server /usr/local/ups/etc/upsd.conf:

LISTEN 0.0.0.0 3493

For Server /usr/local/ups/etc/upsd.users:

[nutuser]
        password = nutpassword
        actions = set
        actions = fsd
        instcmds = all
        upsmon primary

… and for the clients /usr/local/ups/etc/upsmon.conf (if older NUT, then use slave instead of secondary)

MONITOR powercool@myserver 1 nutuser nutpassword secondary
SHUTDOWNCMD "/sbin/shutdown -h +0"

… on the UPS monitor, set it instead to

MONITOR powercool@localhost 1 nutuser nutpassword primary
SHUTDOWNCMD "/usr/local/ups/bin/shutdown-if-no-mains.sh"

… and create a file /usr/local/ups/bin/shutdown-if-no-mains.sh containing

#!/bin/sh

INFREQ=`/usr/local/ups/bin/upsc powercool input.frequency | cut -d'.' -f1`

if [ $INFREQ -gt 47 ] ; then
        echo "Skipping shutdown as input freq $INFREQ"
        exit 0
fi

/sbin/shutdown -h +0

Also chmod +x that.
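The test at the heart of that script can be made a little more defensive. The following is a sketch only, assuming that anything which is not a number above 47 should count as "no mains", including an empty reply from upsc; `mains_present` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: decide whether mains is present from input.frequency.
# ASSUMPTION: anything that is not a number above 47 counts as "no
# mains", including an empty reply from upsc.
mains_present() {   # usage: mains_present "50.0"
    hz=${1%%.*}                       # integer part, like cut -d. -f1
    case $hz in
        ''|*[!0-9]*) return 1 ;;      # empty or non-numeric
    esac
    [ "$hz" -gt 47 ]
}

if mains_present "50.0"; then echo "mains ok"; else echo "no mains"; fi
```

Treating an unreadable value as an outage keeps the same fall-through behaviour as the script above: when in doubt, the shutdown goes ahead.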

Then on the UPS monitor, all of these; on the clients just the last two

$ sudo systemctl enable nut-server
$ sudo systemctl start nut-server
$ sudo systemctl enable nut-driver@powercool
$ sudo systemctl start nut-driver@powercool
$ sudo systemctl enable nut-monitor
$ sudo systemctl start nut-monitor

From the clients or the UPS monitor RPi, you should be able to run upsc to check the status of the ups remotely, if the NUT server IP is in /etc/hosts as ups-monitor, then

$ upsc powercool@ups-monitor

at the client machine (which is not hooked up to the UPS USB, but uses UPS backup power) should show something like

battery.voltage: 27.50
device.type: ups
driver.name: nutdrv_qx
driver.parameter.langid_fix: 0x409
driver.parameter.pollfreq: 30
driver.parameter.pollinterval: 2
driver.parameter.port: auto
driver.parameter.product: MEC0003
driver.parameter.productid: 0000
driver.parameter.subdriver: hunnox
driver.parameter.synchronous: auto
driver.parameter.vendorid: 0001
driver.version: 2.7.4-5059-ga8e3687a
driver.version.data: Q1 0.07
driver.version.internal: 0.32
driver.version.usb: libusb-1.0.23 (API: 0x1000107)
input.frequency: 50.0
input.voltage: 244.0
input.voltage.fault: 0.0
output.voltage: 245.0
ups.beeper.status: disabled
ups.delay.shutdown: 30
ups.delay.start: 180
ups.load: 22
ups.productid: 0000
ups.status: OL
ups.temperature: 29.0
ups.type: online
ups.vendorid: 0001
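upsc can fetch a single variable directly when given its name, as the helper scripts do. When working from a saved listing like the one above, a small filter does the same job. This is a minimal sketch; `upsc_get` is a made-up helper, not a NUT tool.

```shell
#!/bin/sh
# Sketch: pull one variable out of "key: value" lines as printed by upsc.
# upsc_get is a hypothetical helper; it reads the listing on stdin and
# exits non-zero if the key is not present.
upsc_get() {   # usage: upsc powercool@ups-monitor | upsc_get ups.status
    awk -v key="$1" '
        index($0, key ": ") == 1 {
            print substr($0, length(key) + 3)   # text after "key: "
            found = 1
        }
        END { exit !found }'
}

printf 'ups.load: 22\nups.status: OL\n' | upsc_get ups.status
```

In NUT, a ups.status of OL means the UPS is on line power and OB means it is on battery, so this value is the one a client would normally watch.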

NUT problem: monitor deps

NUT has (or had) a bug in the nut-monitor.service file: it orders itself After nut-server, but on a client that is not true, since nut-server does not run there. So you may have to hack the service file to remove this on client-only installs.

Otherwise NUT will not auto-start next reboot.
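One way to make that edit repeatable is a small filter over the unit file. This is only a sketch: it assumes the dependency is spelled nut-server.service on an After= line, the sample input is invented for the demonstration, and `strip_server_dep` is a hypothetical name. Check your installed nut-monitor.service before applying anything like it.

```shell
#!/bin/sh
# Sketch: drop nut-server.service from the After= line of a unit file.
# ASSUMPTION: the dependency is written as "nut-server.service" on an
# After= line; the sample input below is made up for the demonstration.
strip_server_dep() {   # reads a unit file on stdin, writes it to stdout
    sed '/^After=/s/ *nut-server\.service//'
}

printf '[Unit]\nAfter=local-fs.target network.target nut-server.service\n' |
    strip_server_dep
```

After changing a unit file, systemd needs sudo systemctl daemon-reload to pick up the new contents.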

NUT problem: target enables

Related to the above, you must also

$ sudo systemctl enable nut.target nut-driver.target

… so the deps in the NUT service files can be satisfied.