Archive for November, 2007

Whirlygig GPL’d HWRNG

Saturday, November 24th, 2007

Hardware random for the masses

I made available the result of the ring oscillator random generator as a GPL project called Whirlygig. It’s a 2.75cm x 4cm PCB with a mini USB connector, it provides a sustained 5.5Mbps (~620KBytes/sec) of apparently very high quality random bits using the Linux hw_random API. The large amount of randomness should make it useful for statistical tests as well as hard crypto.

I prototyped it using a couple of boards I had lying around, so I know it works fine, but I am waiting for the PCBs to come back from fabrication to actually build a final one. I placed the CPLD VHDL, the board hardware design, the driver software and the firmware for the USB controller into http://git.warmcat.com.

Dieharder

I spent some time worrying about how to test the quality of the result — I found that “diehard” mentioned in an earlier post has been superceded by “dieharder”. This has a much tougher general testing regime, even though many of its test are reproductions of the diehard ones — it runs each test many times and forms histograms of the p-value results from the many runs, and gives an assessment of fail, poor, possibly weak or pass on the spread of results rather than a single result.

At first the RNG failed three of the 18 tests, but on looking closer one of the tests (#2) currently fails for all RNG input and is marked up as not for use with assessing RNG quality, and the two others required by default more than the 400MBytes of randomness I had prepared. Unfortunately in that case they simply rewind the randomness file and re-use the same data to make up the balance! Of course this is no longer quite “random”. When I adjusted those two tests to use a smaller sample that fitted into the 400MBytes without repetition, the output of the RNG get a “pass” on all 17 of the relevant dieharder suite tests.

Max Entropy

During the validation phase I changed the RNG algorithm in the CPLD significantly. The scheme is described on the project page, but basically I moved away from a bit-centric to a byte-centric design with 8 identical sets of 3 oscillators. To stop any characteristic of a particular oscillator’s routing from being associated with a particular bit of the result byte and creating a bias, I introduced a “mixer” that first generates 8 random bits by combining six oscillator outputs each with XOR, then rotates these oscillator sets between the result bits sequentially at 24MHz. I also removed the toggling action and used the random bit directly.

I also found the Linux rng-tools suite which repeatedly runs FIPS-140-2 tests on the bits, this fails 1 in 1200 or so packets of testing over 20 billion bits, I believe this is normal for a real random generator that it will produce sequences with low probability that don’t look very random in the short term.

Aside from passing dieharder and FIPS-140-2, the changes also got me a reported 8.000000 bits of entropy per byte from the ENT test, so there are reasons to imagine the quality of the output is very good.

FIPS-140-2 and ENT validation vs ring RNG

Thursday, November 15th, 2007

NIST lists some more test suites. NIST also have their own suite, but it is now Windows-only, and lacks a necessary DLL to run there. The last UNIX version segfaulted here before giving any results… sigh.

I ran the last 10MByte sample against ENT and TestU01… to cut a long story short

$ ./ent ../die.c/dump3
Entropy = 7.999980 bits per byte.

Optimum compression would reduce the size
of this 10002432 byte file by 0 percent.

Chi square distribution for 10002432 samples is 281.26, and randomly
would exceed this value 25.00 percent of the times.

Arithmetic mean value of data bytes is 127.4958 (127.5 = random).
Monte Carlo value for Pi is 3.140111525 (error 0.05 percent).
Serial correlation coefficient is -0.000212 (totally uncorrelated = 0.0).

7.9999 bits of entropy per byte! TestU01 is less turnkey than the other suites — it’s literally a test library with some example code. I amended an example to call the FIPS-140-2 tests:

============== Summary results of FIPS-140-2 ==============

 File:             dump3
 Number of bits:   20000

       Test          s-value        p-value    FIPS Decision
 --------------------------------------------------------
 Monobit               9933           0.83       Pass
 Poker                11.88           0.69       Pass

 0 Runs, length 1:     2482                      Pass
 0 Runs, length 2:     1227                      Pass
 0 Runs, length 3:      630                      Pass
 0 Runs, length 4:      319                      Pass
 0 Runs, length 5:      161                      Pass
 0 Runs, length 6+:     166                      Pass

 1 Runs, length 1:     2466                      Pass
 1 Runs, length 2:     1302                      Pass
 1 Runs, length 3:      620                      Pass
 1 Runs, length 4:      311                      Pass
 1 Runs, length 5:      140                      Pass
 1 Runs, length 6+:     146                      Pass

 Longest run of 0:       16           0.14       Pass
 Longest run of 1:       14           0.46       Pass
 ----------------------------------------------------------
 All values are within the required intervals of FIPS-140-2

So the design’s output is compliant to FIPS-140-2, a requirement for many uses.

Diehard validation vs ring RNG

Wednesday, November 14th, 2007

RNG Quality assessment

A timely article flew by on Reddit about the RANDU pseudo-random generator algorithm widely used in the 1960s, which it turns out was very flawed indeed. It was explained to one student that ”We guarantee that each number is random individually, but we don’t guarantee that more than one of them is random”. Basically it produced numbers that belonged to one of 15 “planar” groupings and nothing in the gaps between the planes. It isn’t just a minor annoyance, because many statistical studies in the 60s and 70s used it, and it can easily have contaminated their results. That’s definitely not what I am trying to reproduce with the ring oscillator device — so how can I figure out how “good” the randomness is in an objective way?

RNG quality test suites

It turns out that empirically testing RNG outputs has been the subject of a lot of work for decades, and there are some established testing suites available online. A major one seems to be the “diehard” suite — I guess it is a pun on die as the plural of dice.

It needs you to fetch 10M bytes of random numbers or more and let it run a bunch of tests on them. The output was a little hard to assess initially: most tests issue a “p” number which only suggests something is bad if it is 0.000… OR 0.999…. All other numbers inbetween are to be taken as a good result as I understood it. Except there is a warning that even good RNGs can produce the occasional test fail.

Thus you should not be surprised with occasional p-values near 0 or 1, such as .0012 or .9983. When a bit stream really FAILS BIG, you will get p`s of 0 or 1 to six or more places. By all means, do not, as a Statistician might, think that a p < .025 or p> .975 means that the RNG has “failed the test at the .05 level”. Such p`s happen among the hundreds that DIEHARD produces, even with good RNGs. So keep in mind that “p happens”

I duly fetched 10M bytes of 115kbps randomness from the device and fed it to diehard. It seemed to give fine results except on “Count the 1s stream” and “Squeeze” (devastating p=0.000000), “Count the 1s specific” for bits 1-11 (p=0.000030) and 9-16 (p=0.000064), and QQSO 2-6 (p=0.000005). It passed the dozens of other tests but it was disappointing, looks like a big fat ‘failed’.

Triple Scoop

Well, since my test CPLD was an XC95288XL with 288 Macrocells to burn, I naturally wondered if I could improve matters by tripling the amount of ring oscillators getting Xor-ed — that is to implement the three varying sized oscillators 3 times each, totaling nine, and sum them with a big XOR. They’ll all be drifting around individually as much as together, it should be a mighty noise-fest.

I edited the VHDL and blew it into the CPLD… visually the summed RNG output “bit” was an awful lot more noisy than before. I pulled another 10M bytes from that setup: but just looking at the byte distribution as I did before told me something is still up.

That sawtooth type distribution is “not random” to coin a phrase. If you look at the large jump at 0×80 (128) it is telling us that we are more likely to get 1000000 binary than we are to get 01111111, in other words, since this is over 10M bytes, there is a distribution problem favouring ’0′. When I analyze the distributions of 1s and 0s I find

0: 40436204, 1: 39563804... delta=872400, skew=1.090500%

You can see the same thing even better looking at 0×00 (42,000 hits) vs 0xFF (36,000 hits), they are like 8% off the median of 39,000. Clearly that distribution of 1s and 0s has to have a very small skew to stop these kinds of effects showing up, and equally clearly this is telling us something deep about the RNG hardware.

Spiky

Although the individual oscillators are quite slow thanks to the number of inverter stages, at 4 – 6MHz, the way they are being summed makes for trouble from bandwidth limitations inside the CPLD. At the moment it just uses a dumb asynchronous XOR action, that means that potentially very fast spikes can be seen when one “slow” oscillator changes state very shortly after another “slow” oscillator. For example:

You can see on the left (this is 5ns/div notice) a runt pulse where this happened, the XOR was convinced to rise by one oscillator changing and then countermanded when another oscillator changed state less than 5ns later, resulting in a doubtful pulse that was probably not visible as a ’1′. This also happens when going from ’1′ to ’0′, but maybe the threshold for the transistors in the CPLD is not at exactly 50% of the 3.3V supply. So we suddenly have it seeing more ’0′s than ’1′s on average when spikes are involved.

This whole high bandwidth summing step is completely needless, it’s only there because it is a literal interpretation of the diagram in the original RFC. I changed it instead to have nine latches sample the nine oscillators every 125ns (there is an 8MHz clock on the prototype board) and sum those results with XORs into a single bit. In turn this output is sampled by another latch at 8MHz to hide any metastability.

Latched up

The latched summing version performs much better and has gotten rid of most of the bit skew, and the sawtooth behaviour:

…but there is still a problem with 0×00…. the bit skew looks like this

0: 39960076, 1: 40039932... delta=79856, skew=0.099820%

so the skew is now on the side of ’1′s but only by 0.1%. You can see the byte count spread is much tighter than before too — 1800 instead of 6000 counts before.

Balancing out the skew

Well if the remaining skew is something to do with the ratio of rise to fall times, or the non-squareness of the oscillator outputs for some other reason by something as low as 0.1%, that is hard to do much about, especially as it may vary on the specific silicon die.

But it shouldn’t matter — now the bandwidth situation at the XOR summer is sane, if we invert the summed output 50% of the time it should spread any excess on ’1′s or ’0′s to the opposite as well, cancelling any bias. I added a couple of terms to the summer to xor against the UART bit index LSB and a bit which toggles after every byte sent by the UART. It’s the equivalent of xor with 0×55 for the first byte and then 0xAA for the second byte, over and over.

That glitch in the middle is actually at 134 (0×86), maybe it is random but I guess we will see…. the skew is further reduced as anticipated

0: 39974218, 1: 40025790... delta=51572, skew=0.064465%

Diehard sequel

I ran 10M bytes from this version through Diehard again… the really bad p-value results are gone. For example Squeeze was a deadly 0.000000 before and is now 0.255260.

I made one last adjustment, I added the current state of the latched random value to the XOR term. That means it decides whether to keep or invert the latched value, it no longer directly accepts the value from the RNG. This got me to the promised land: 0.0005% skew between ’1′ and ’0′.

0: 40000206, 1: 39999802... delta=404, skew=0.000505%

This also gets me the apparently good diehard results with no obvious failures on any tests, you can see the actual results here. So it seems the current version can tentatively be called a “real RNG”.

Ring oscillator RNG performance

Monday, November 12th, 2007

Pretty random

After some scrabbling around porting my Jtag SVF interpreter to Octotux and creating a kernel module for the PIO end of it — and moving to a different board with a XC95288XL CPLD to prototype it, the triple ring oscillator RNG is working. It issues a 9600 baud result, but after some initial confusion I modified it 1/8th of the time to sit out a sample time leaving “break” on the serial line. This should make sure that the receiving UART does not get confused by the data as a start bit. The true data rate is something like 800 random bytes per second at 9600 baud.

Here are the three chains of inverters (19, 23 and 29 long) oscillating at the different fundamentals

… and here is what the xor summing looks like, first over 1s then sampled once.

Although the single shot sample doesn’t look very random, the oscillators are drifting around all the time. If you wait a little while between samples (currently it is 104us, a 9600 baud bit-period) it’s pretty hard to guess what phase all the oscillators have drifted to — at least, that’s the plan.

Distribution of binary levels

The first test I did was to see what the distribution of ’1′ and ’0′ in the results was… clearly if the device is really random it should on average be 50% each. I fetched 1M random bytes, or 8Mbits:

0: 4008913, 1: 3991095… delta=17818, skew=0.222725%

Its okay for a really random source to deviate to 50:50 at any given time, although on average it should be 50:50.

Octet distribution

Next I looked at the distribution of the results from 0×00 through 0xFF as the result “random byte”. This would show up if the RNG fails to ever issue some result or favours certain results over others — every result should on average have an equal chance of showing up and so an equal count. I ran it for 1M random bytes…

This is pretty decent, every possible result is seen with a frequency within +/-200 counts of the 3,900 average after 1M bytes.

115200 baud results

Encouraged by this I cranked the baud rate up to 115220 or 8.68us between samples and around 10K random bytes per second. The skew is increased somewhat and the spread of result counts is increased a little.

0: 4028746, 1: 3971262… delta=57484, skew=0.718549%

So far so good!

Adding entropy to /dev/random

Wednesday, November 7th, 2007

A hard RNG is good to find

The recent statistical analysis for drumbeat reminded me I could do with a proper source of random numbers, not generated by a pseudorandom feedback action. Back in the early 1990s I was looking at statistical profiling of execution on microcontrollers, I was surprised then to discover that only by making the sampling period random could I get a true picture of execution distribution. If the address bus was sampled at a fixed rate, say 100kHz, instead of a true picture it would be distorted by activity that was happening at some fraction or harmonic of the sampling frequency. So you would alias out pieces of loops completely or get a bloated count for other areas. Only by true randomness in the sampling timing could you see the reality — a paradox.

Analogue RNG methodologies

A Google or two around showed that most of the techniques are analogue one way or the other. Many of the methods suffer from a problematic need to amplify some very tiny source of noise, a Zener diode or avalanche transistor junction, by really huge amounts, 90dB or more. There are a couple of suppliers of RF “noise diodes” with flat spectra across a wide frequency range, but they are hard to source.

Digital non-pseudorandom technique

However there is one technique which while still relying on analogue noise is basically digital — to run multiple chains of unlocked inverting oscillators and xor the outputs. The unlocked oscillators have no reference at all, they’re basically an inverter fed back on its own input — in fact a chain of inverters. Such a circuit oscillates according to the period of the total delay through the inverter chain… and that is highly sensitive to temperature. Normally with synchronous digital design we choose a clock rate for a circuit that is just below the maximum possible at the worst temperature it is expected to operate at — and after that we can forget about temperature. But with this asynchronous unlocked oscillator concept, the micro- and macro- temperature dependence is revealed in all its freaky glory, causing the oscillation to drift unpredictably slightly every cycle and over larger period with gross temperature fluctuations.

RFC4086

RFC4086 mentions a recommendation for a RNG based on unlocked inverter chains that is found in IEEE 802.11i.

             |\     |\                |\
         +-->| >0-->| >0-- 19 total --| >0--+-------+
         |   |/     |/                |/    |       |
         |                                  |       |
         +----------------------------------+       V
                                                 +-----+
             |\     |\                |\         |     | output
         +-->| >0-->| >0-- 23 total --| >0--+--->| XOR |------>
         |   |/     |/                |/    |    |     |
         |                                  |    +-----+
         +----------------------------------+      ^ ^
                                                   | |
             |\     |\                |\           | |
         +-->| >0-->| >0-- 29 total --| >0--+------+ |
         |   |/     |/                |/    |        |
         |                                  |        |
         +----------------------------------+        |
                                                     |
             Other randomness, if available ---------+

This has three unlocked, wandering oscillator chains of different lengths being summed at an XOR gate.

Implementing the RFC4086 RNG

Since it needs 71 inverters, you would need 12 74hc04 or similar, it makes more sense to put it all in one CPLD. I have an old XC95108 lying around, so I wrote up the design in VHDL and added a UART interface to issue the sampled random data. This brings up the issue of how quickly it can be sampled and still get high quality randomness… clearly if we sampled it at 10ps it wouldn’t be very random at all, since it didn’t have time to change between samples. On the other hand if we sampled it at some high multiple of the fastest free-running oscillator period, then there is a lot of opportunity for each oscillator phase to have been affected over the longer time. By using the UART we can control how often we sample the RNG by the baud rate… I initially set it to 9600 baud or 104us/sample. The oscillators should have periods on the order of 150 – 200ns (5 – 6MHz), so this is allowing 500+ cycles of jitter to accumulate in each oscillator before the summed sample is taken.

I’m currently waiting for a programming tool to be delivered so I can program another device to allow programming the XC95108 — I no longer have any PCs with a printer port I realized yesterday. I am very interested to see what the performance and quality of the randomness is like!