Don't Let Production Test Be Special

Lesson 3: Test is not special

In embedded work, test is commonly the "red-headed stepchild": nobody wants to take care of it, and by common, silent consent it is always left until last. As the date to go to the factory nears, the need for a test plan eventually becomes overwhelming, and the task is assigned to the most junior engineers available, since everybody knows that test is the death knell of your career. Coming in cold, and shut out of the inner workings of an already-existing project, these engineers try to create some kind of test coverage as best they can. At Openmoko two giant test suites were created, DM1 and DM2, written by people who were learning C for the first time. I got the job of modernizing this code, so I know from experience that it was truly terrible and bitrotted at an alarming rate. Still, I had to admire the guys who wrote it: with everything against them and little experience, they managed to create something that did provide test coverage at the factory, however much it was on life support.

Totentanz

Similarly, Openmoko used production test jigs: special additional PCBs that formed a kind of custom test environment for the PCB under test. On one version of GTA03 so many test points had been added that there was serious concern the board would break under the total pressure needed to mate the spring-loaded test probes to them. Jigs and test points have an obvious advantage in terms of test throughput, but there are some big disadvantages. First, you have to design and build the jig, and keep it in step with changes to the actual device. This effort is completely disconnected from moving your actual product on, except that it's meant to help in production. Second, test points don't test your connectors; the test point may be connected OK while the connector pin the user actually accesses is not. Third, you need something else outside the device to assess what is happening on the test points, and the code for that also has to be written and maintained against changes in the actual product. It also means the tests can't be casually performed outside the factory, except perhaps by the original engineers if they have access to the ATE gear themselves.

Pain into torture

Additionally, the bringup of GTA02 required special versions of U-Boot and the kernel with added "test magic", created by the test guys and unknown to anyone else. These versions were seldom uplevelled. Since GTA02 had raw NAND, it needed to be filled with the rootfs at the factory. The way to do this was via a very fragile OpenOCD setup using a custom, bitbanged USB-serial device. It only worked with certain versions of the USB library needed to talk to it. All of these quirks and requirements at the factory made production runs difficult and expensive to get right.

I only hurt you because I love you

I spent a lot of time thinking about how to avoid this end result the next time I design something. The mistakes started, I concluded, with having anything special for test. The jig: special, and so evil. Test kernels or bootloaders: special, so evil. Test rootfs: evil. Test software like Openmoko's DM1 and DM2: evil. The device should naturally be able to test itself using the arrangements that already exist inside it for normal operation. The answer to the problem of "production test" is to completely subsume it into the rest of the design. So it is the responsibility of the Linux drivers to provide enough functionality, through probe errors or sysfs features, that one can perform test and diagnosis. The "test suite" should boil down to a bash script that uses features exposed in a normal shipping rootfs and kernel. Bash is ideal because most of the test action will be calling existing command-line tools like ifconfig, ping and l2ping, then grepping their output or looking at their return code, which is what bash is best at. It's also easily understood and edited by anyone who has worked with Linux for a while. The bootloader is required for test in only one capacity: it is the only part of the system capable of running the SDRAM tests, because once you enter Linux you can't perform a full SDRAM test any more. But even that should be done by the one shipping bootloader image. In many cases device interfaces can be tested with external loopback connectors. This proves connectivity through the connectors, and it leaves open the possibility of end users running the same tests on the shipping rootfs.
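
As a rough illustration of what such a script might look like, here is a minimal sketch. Everything specific in it is an assumption made up for the example and not taken from any Openmoko code: the `check` helper, the `eth0` interface name, the `PEER_IP` address of a test peer on the bench, and the pattern used to spot probe failures in the kernel log. The only point is that each test is an ordinary command whose return code decides pass or fail.

```shell
#!/bin/bash
# Hypothetical self-test sketch for a shipping rootfs.
# All device names, addresses and patterns below are assumptions.

pass=0
fail=0

# Run one check: the first argument is a label, the rest is the command.
# The command's exit status alone decides pass or fail.
check() {
    local label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS $label"
        pass=$((pass + 1))
    else
        echo "FAIL $label"
        fail=$((fail + 1))
    fi
}

# Succeeds only if the kernel log shows no line that looks like a
# driver probe failure (the pattern is a guess; real messages vary).
no_probe_errors() {
    ! dmesg | grep -qiE 'probe.*fail'
}

# The checks themselves, built only from tools and sysfs entries a
# normal rootfs would already have.
run_device_checks() {
    check "no driver probe errors" no_probe_errors
    check "eth0 registered"        test -d /sys/class/net/eth0
    check "ping test peer"         ping -c 1 -W 2 "${PEER_IP:-192.168.0.1}"
}

# Only touch the hardware when asked to, so the helpers above can be
# sourced and tried out on any machine.
if [ "${1:-}" = "--run" ]; then
    run_device_checks
    echo "passed: $pass  failed: $fail"
    [ "$fail" -eq 0 ]
fi
```

Run with `--run` on the device, the script prints one line per check and exits non-zero if anything failed, so a factory station or an end user can act on the exit code alone. A Bluetooth check with l2ping, or a loopback-connector check, would be one more `check` line in the same style.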