Adding entropy to /dev/random

November 7th, 2007

A hard RNG is good to find

The recent statistical analysis for drumbeat reminded me I could do with a proper source of random numbers, not generated by a pseudorandom feedback action. Back in the early 1990s I was looking at statistical profiling of execution on microcontrollers, I was surprised then to discover that only by making the sampling period random could I get a true picture of execution distribution. If the address bus was sampled at a fixed rate, say 100kHz, instead of a true picture it would be distorted by activity that was happening at some fraction or harmonic of the sampling frequency. So you would alias out pieces of loops completely or get a bloated count for other areas. Only by true randomness in the sampling timing could you see the reality — a paradox.

Analogue RNG methodologies

A Google or two around showed that most of the techniques are analogue one way or the other. Many of the methods suffer from a problematic need to amplify some very tiny source of noise, a Zener diode or avalanche transistor junction, by really huge amounts, 90dB or more. There are a couple of suppliers of RF “noise diodes” with flat spectra across a wide frequency range, but they are hard to source.

Digital non-pseudorandom technique

However there is one technique which while still relying on analogue noise is basically digital — to run multiple chains of unlocked inverting oscillators and xor the outputs. The unlocked oscillators have no reference at all, they’re basically an inverter fed back on its own input — in fact a chain of inverters. Such a circuit oscillates according to the period of the total delay through the inverter chain… and that is highly sensitive to temperature. Normally with synchronous digital design we choose a clock rate for a circuit that is just below the maximum possible at the worst temperature it is expected to operate at — and after that we can forget about temperature. But with this asynchronous unlocked oscillator concept, the micro- and macro- temperature dependence is revealed in all its freaky glory, causing the oscillation to drift unpredictably slightly every cycle and over larger period with gross temperature fluctuations.

RFC4086

RFC4086 mentions a recommendation for a RNG based on unlocked inverter chains that is found in IEEE 802.11i.

             |\     |\                |\
         +-->| >0-->| >0-- 19 total --| >0--+-------+
         |   |/     |/                |/    |       |
         |                                  |       |
         +----------------------------------+       V
                                                 +-----+
             |\     |\                |\         |     | output
         +-->| >0-->| >0-- 23 total --| >0--+--->| XOR |------>
         |   |/     |/                |/    |    |     |
         |                                  |    +-----+
         +----------------------------------+      ^ ^
                                                   | |
             |\     |\                |\           | |
         +-->| >0-->| >0-- 29 total --| >0--+------+ |
         |   |/     |/                |/    |        |
         |                                  |        |
         +----------------------------------+        |
                                                     |
             Other randomness, if available ---------+

This has three unlocked, wandering oscillator chains of different lengths being summed at an XOR gate.

Implementing the RFC4086 RNG

Since it needs 71 inverters, you would need 12 74hc04 or similar, it makes more sense to put it all in one CPLD. I have an old XC95108 lying around, so I wrote up the design in VHDL and added a UART interface to issue the sampled random data. This brings up the issue of how quickly it can be sampled and still get high quality randomness… clearly if we sampled it at 10ps it wouldn’t be very random at all, since it didn’t have time to change between samples. On the other hand if we sampled it at some high multiple of the fastest free-running oscillator period, then there is a lot of opportunity for each oscillator phase to have been affected over the longer time. By using the UART we can control how often we sample the RNG by the baud rate… I initially set it to 9600 baud or 104us/sample. The oscillators should have periods on the order of 150 – 200ns (5 – 6MHz), so this is allowing 500+ cycles of jitter to accumulate in each oscillator before the summed sample is taken.

I’m currently waiting for a programming tool to be delivered so I can program another device to allow programming the XC95108 — I no longer have any PCs with a printer port I realized yesterday. I am very interested to see what the performance and quality of the randomness is like!

CE Technical Documentation

October 25th, 2007

The other week I went on a workshop to learn more about the new 89/336/EEC regulations that came into force in the UK on 20th July 2007. Here are some notes cribbed from my notes, they’re intended to be an overview: obviously for something this important you should get your own advice.

Out with the old ways…

For a long time it has been a requirement to certify that any product you manufacture for sale in the EU meets the “relevant standards”, so it can have a “CE” mark. Until this July in the UK you could either do that by:

  • paying for test at a “competent body”, a company with a ton of test gear that will empirically test your device against the emission and immunity standards, or
  • Writing up a Technical Construction File, or TCF, which described the product design in a deep way, and included tests and logic showing why you are compliant

Device torture at the House of Pain

The last design I completed, for a smart 4-channel Analogue telephony device that can hook to the Internet, I went down the empirical test route. At a total cost of several thousand pounds the production device was tested at a real competent body with calibrated receivers and emitters, blasted with wideband radio signals, zapped with +/- 8kV discharges. The resulting report gave a very clear okay except on a minor issue to do with the AC power supply we had used. We also had to do specialized testing for the analogue telephony end, which we again passed, although not until getting a component supplier to make a special that actually complies with the standard.

(Actually walking the device through this testing is a pretty sweaty business, since time is literally money at the test facility and wiggle room in the case of trouble is also in short supply. In one instance for example I was able to patch the sources in realtime when an issue came up during ESD testing that broke normal operation but wasn’t enough to trigger the watchdog, turning a fail into a pass. So I would never send a device unaccompanied for testing, or go to a test house without a laptop with full sources to expect the unexpected.)

In with the new ways…

However the demands in the new regulations have changed significantly. You must now generate “Technical Documentation” for any product you will be selling in the EU. This is basically the old TCF route to compliance, but it doesn’t itself necessarily remove or even perhaps reduce the need for absolute tests for a given device.

Less well known is that if you are still selling devices first sold before 20 July 2007 come 20 July 2009, you will need to have made a new style Technical Documentation for them, or stop selling them. A lot of tech products from 2007 will be old hat by 2009 solving the problem, but it is not true in all markets.

Typically the Technical Director of the company is the “responsible” as the French say who must sign off that the device meets the standards. What you are signing off on is that WHEN it is properly installed and maintained, and used for the purpose it is intended for:

  • the device creates an EM disturbance low enough that radio and telecoms equipment can operate as intended
  • it has a level of intrinsic immunity which is adequate to enable it to operate as intended

So what is done with this “Technical Documentation”?

Nothing if you’re lucky. The only people who can ask to look at it are the regulatory authorities, OFCOM in the UK. You don’t publish it or register a copy of it. But you have to keep it for ten years after the last sale of the device for the authorities to ask for. It literally only exists to keep the signatory out of jail if the authorities ask for it. Not kidding about the jail — if you don’t have a satisfactory Technical Documentation to show, the criminal penalties can include a GBP5,000 fine and/or 3 months in jail.

The key words about the Technical Documentation are that it should be “reasonable” and “duly diligent”, as in “All reasonable steps are exercised and all due diligence to avoid committing the offense”. That really sums up the job of writing it, you are trying to have an answer for anything that could be said was unreasonable or not duly diligent. While meeting budget constraints from the customer :-/

Spread of outcomes

The ways that problems might pan out were discussed informally. It was proposed that roughly a third of companies, the presenter reckoned, have their head totally in the sand about it, and could expect trouble. Another third had made some effort in the right direction and another third spent the money and were golden. Another factor in how much shit would rain down in the event of problems was the number of devices sold, if it was millions and they were crap, expect maximum warp to jail. If it is five and they don’t quite comply despite obvious efforts to prove it, maybe that won’t be so bad. But who knows, some overkill is called for.

How likely is my Technical Documentation to be demanded?

In Germany, we were told, the authorities have a system of testing 10,000 models of devices a year, spread over the various types of product. In the first year (IIRC) it resulted in 105 prosecutions :-/

Another tidbit is in the UK, OFCOM are allegedly looking at training up 85 new enforcement officers. The mobile phone companies, due to the ruinously expensive spectrum auctions of a few years ago, are apparently agitating for more enforcement of the cleanliness of their expensive 3G spectrum.

What goes in the Technical Documentation then?

Here is the briefest outline:

  • Description of apparatus – brand/model/manufacturer, intended function, limitations on operation… Technical description – block diagram, technical drawings, interconnections, variations, versions of design documents referenced
  • Procedures used to ensure conformity – Technical Rationale: what you’re testing against, why you did particular tests; Details of design: EMC features, component specifications, QA to control variation; Test data: Logical processes to decide if the tests are adequate, EMC tests and their results, external test reports on subassemblies/components

You can also get a Competent Body to “comment” on your Technical Documentation, as some fairly convincing assurance that it is adequate. This is really a seal on the “due diligence” aspect so you can really show you totally ticked every box to make sure it was compliant, but I guess only large companies can afford it.

Conclusion

If you manufacture or import stuff to sell in the EU, you are going to have to have Technical Documentation to keep yourself out of jail.

For a standalone device, that means you’re really going to have to not only look to dealing with EMC early in the design, with some kind of inhouse testing ability, but find the budget of a few thousand pounds to take it for testing at a Competent Body so you have something convincing and calibrated to put into that Technical Documentation.

Not only that, but even determining which are the applicable standards is a huge headache if you try to do it yourself, there are hundreds of them: a Competent body can also help select the basic issue of which tests you are targeting.

But it’s not all bad — if you make product variations around the same base, you can choose which variation to actually test as a baseline, and then for each variation see if it stands up to show they would not push the original base design over the edge. I have done this in the last couple of weeks, creating for a customer Technical Documentation for a sister device to one that went through actual testing at a Competent Body, and using the very large similarities to limit the amount of retesting needed.

There are definite advantages to requiring this level of design scrutiny and justification, but the change to requiring Technical Documentation and the trend to increased enforcement over the ten years you must keep the documentation has definitely pushed the minimum cost and effort of bringing something to market up several notches.

Drumbeat

October 25th, 2007

The magic code project has gained a name and there are some new results to share.

The full sources for the experimental modem using the correlator codes is available at http://git.warmcat.com under GPL2+ license.

The big change is what I call “scrambling”.  The 126-bit magic correlator sequence is disordered and xor-ed in 258 random ways, which are selected to not correlate well with each other at any offset (less than 43/128 match).  This allows us to issue whole bytes in one code by selecting which disordered correlation code to transmit.  The other two codes are for start and end of packet markers. I first tried 22 scrambles to allow 4 bits per code and some extra codes, this worked fine but I was able to extend it to 258 without really damaging the “best” scramble-scramble false correlation score too much.

The gain here is that the self-ordering and threshold properties of the code now reaches up to entire bytes: you always get a clean, aligned byte or you don’t get anything: with high probability you don’t get a wrong byte.

I also changed the demodulator code to something that is currently quite expensive to run.  Instead of tracking a reference phase at the receiver, which is tough to do in the presence of extreme noise and phase wrapping, at each sample it tries to demodulate using that sample as “0 degrees” and a 50% phase slicer to get the bits.  It’s expensive because it has to do that against each of the 258 scrambles every sample — but on the plus side the correlator does not run at all for 124 symbols out of 126 (98.4% of the time) after there has been a successful decode, since it knows it is partway through a 126-symbol code. Still the number of demodulation attempts can definitely be reduced, maybe by subsequently just doing it on the last winning phase and a few either side.

I found that the noise performance of the demodulator is strongly dependent on the relationship between the sample rate and the carrier frequency.  It’s much less dependent on the number of carrier cycle periods per symbol, so long as you have 4 or 5 or above (2 works but is killed by hardly any noise).  Since there is plenty of filtering to the carrier going on I didn’t really expect that. Here is a graph of the noise performance when there are 96 samples per carrier (48kHz sample rate, 500Hz carrier, 100baud — these are intra-code symbols, it’s 6.25bits/sec effective):

The noise performance doesn’t look too bad, at least with these synthetic tests and unrealistic 48kHz sampling. It can recover all 7 bytes that are sent even at -28dB, when only 4% of the received power is the signal and the rest is white noise. And because of the code properties, these are definite bytes being captured with quite high probability, not the kind of uncertain decode that would normally be expected in such noise.

The “average quality %” shown in the graph is the percentage of demodulated bits in the matched code scramble that actually were “right”, averaged over all the bytes. If a byte is missing because its quality was below the threshold of 70/128, it counts as 0% quality. Up until -20dB, the demodulator is doing well and the recovering individual bits almost perfectly. Without the Correlator code, after -20dB you would be dealing with a rapid increase in bit errors from the demodulator and relying on ECC. By using the Correlator coding though, we are able to push performance another 8dB into the noise (we even recover half the symbols at -30dB, or 10dB further) and still maintain the alignment and high probability of correct decode advantages.

This shows the effect of keeping everything else the same, but bringing the sample rate down to 8kHz from 48kHz

You can see the BPSK demodulation starts to fail at -10dB instead of -20dB and the correlation code again buys you another 8 – 10dB into the noise after that.

Here is the spectrum after bandpass and lowpass filtering that is actually “transmitted”, the peak is the carrier at 500Hz in this case.

Another aspect of this setup is that although the demodulator is currently pretty expensive in CPU (and power), the modulator is much simpler. It just requires a 1KByte table for the scrambles and a precooked integer sine table if you are running a separate carrier than the RF one itself. Then it just needs enough logic to walk through the scramble table entry at the symbol rate (and the sine tables at the sample rate if you’re using it). That can fit in part of a tiny flash controller, using a 1-bit output of a shifter to switch the phase of the transmitter. It can be simplified even further by not using scrambles but just sending a bit per code sending the code forwards or backwards to signal a ‘1′ or ‘0′.

It seems that I can begin to understand where this system fits into the existing high noise codings already used by ham radio folks. The correlator code is a special case of using an ECC code, in this case where 8-ish bits are exploded into an 126-bit “space” filled with correct decodes, damaged but recoverable decodes and invalid decodes. The error correction performance of adding 8-10dB “coding gain” is probably a bit (not so much, from what I can work out) poorer than an optimal use of 126 bits to code for 8, but the advantages that attracted me to the code in the first place can offset that for some applications, the guaranteed self alignment right down from demodulation to byte boundaries, and a quite firm decode success threshold. In addition, it seems some of the more optimal decoding schemes for stuff like Viterbi can be very compute-intensive, whereas although at times we do a lot of it, the correlation action is simple and lends itself to being done in parallel. Turbo codes are patented. So it’s not a one size fits all technique, but it has its niche.

I guess the next stage is looking to BPSK modulate and recover an RF carrier directly, but that will need some thinking on because it will be a very frequency-specific design, unlike these baseband tests where you can just edit the important variables and recompile. For example if it can be done initially at the UK 40MHz ISM band it will be possible to consider logic looking at carrier zero-crossings for phase assessment easily enough, and to autocorrelate just the averaged recovered phase at 4 times the symbol rate.

Heading deeper into the noise

September 28th, 2007

QPSK abandoned for BPSK

The noise performance of the QPSK decoder wasn’t what I was hoping after several iterations. I pretty quickly threw out the Costas loop because loss of phase lock was the weakest link in the chain, and moved to a system with a tight bandpass filter at the carrier frequency, and measuring zero crossings of the carrier against a local oscillator. But still the noise performance was not impressive at 300baud / 3kHz carrier.

I had hoped to go back to a damaged QPSK code recovery and to correct the data phase on the broken code bits, because we know what they should be according to the code. But there was no visible commonality between what was happening on the phase information that carried the code and the phase information carrying the data.

So I have moved back for now to a two-phase BPSK system interleaving the code and the payload symbol by symbol, in order to understand better how far the code can be pushed with noise.

Improved noise performance

Currently I use a 4.8kHz carrier and 1200baud symbols. Here is the performance today:

The RED line is the best decode seen for the correct match offset, the BLUE line is the best decode for all other (wrong) offsets. The PURPLE line is the mean decode for the correct match offset over the 100 runs. GREEN is the “quality cutoff” basically keeping us from getting near the blue line’s false hits: because of the properties of the code it is extremely unlikely noise will get near the green line.

Here is a close-up on the noisy end of things:

So far in terms of detecting the code, we can on average do so when the input energy is 82% noise (-15dB SNR). We can still detect codes ~1% of the time even at 92% noise (-21dB SNR). Here is a further close-up (it is 1000 runs, not 100)

suggesting you can still get good recoveries ~0.1% of the time at 95% noise (-25.5dB SNR). And of course we are now measuring the whole system performance here including the demodulator part, not just damaging the code bits directly.

Current BPSK receiver

Here is the receiver for the current BPSK method:

I spent several days meddling with the QPSK version and arriving at the carrier zero-crossing method for phase detection. The original plan to have a symbol sampler running from a locked LO hung around causing lots of problems. The indications of a change in symbol — detected by zero crossings — were variously delayed by the phase itself and the filters used, making it difficult to convert their jittery indications into a guide for the symbol recovery clock. This resulted in double bits being sampled.

The code is the symbol clock

Because of this I eventually realized that no symbol clock was needed, by running the correlator at the sample frequency, and sampling symbols at a fixed period, because of the false match rejection properties of the code it would “discover” the correct phase and offset from the behaviour of the correlator output. And when the code was recovered best then the data interleaved with it will be recovered best too. This sounds power-unfriendly, but after the first offset is used, you don’t run the correlator until enough time has passed for 256 symbols to be acquired, and then you should still be locked from when you did the first 256 symbols: if not you run the correlator a few dozen times to find lock again.

Also of note is that what is stored in the ringbuffer is a weighted average of the last four phase results, this includes information from the phase of multiple carrier cycles for the same symbol, helping to reduce the effect of noise on the decode.

The code is the data!

Well recovering the code at high SNR is interesting, but how useful is it if the data bits interleaved with it have been subject to the same beating without the properties of the code to protect them? We can use the autosyncing properties of the code to help with trying to get some payload signal gain through averaging, but I think I have seen where this is headed now… the code IS the signalling system for the payload data. That means throwing out the interleaved data concept.

A ‘0′ can be signalled as the normal code, and a ‘1′ as a time-reversed code. Because it’s intended for low data rate communication at VERY low signal levels compared to noise, it’s okay if we are reduced to 1 byte/sec, which will be the end result of this at 1200 baud. Basically with this we carry over all of the great robustness qualities of the code to be attributes of the payload data.

So what is the point compared to just blasting 128 symbol times of the same phase carrier, which is a hell of a lot simpler?

  • Three high accuracy results, “no result”, a ‘0′ or a ‘1′ detected. The carrier-only method will happily return a bogus result if there is noise energy at the carrier — if the code method ever claims a ‘0′ or a ‘1′ you can be almost certain it is genuine
  • “Fuzzy” robust damage-tolerant signal detection, better performance than a simple threshold comparator
  • Automatic bit sync (”bit clock recovery”) with almost no chance of wrong sync, sync recaptured each symbol; bit sync for the carrier-only version in high noise is unreliable
  • Absolute phase polarity can be recovered from one symbol despite the 180 degree lock uncertainty for BPSK — you can even lose lock one or more times inside the symbol and the code with absolute phase can still be recovered; the carrier-only concept needs a coding at a higher level to determine absolute recovered phase, in turn needing multiple correct symbols recovered without losing lock

QPSK demodulator / slicer / correlator vs noise

September 18th, 2007

Well, the first cut of the QPSK demodulator, bit slicing and correlation code works, and this stochastic performance graph sums it up.

It shows 1,000 Monte Carlo runs each from 0% to 100% noise on the raw channel, considering correct correlator matches only, the blue line shows the WORST correlator match result at that noise level, the red line is the BEST correlator result seen at that noise level, and the purple line is the mean correlator match result seen at that noise level.

The Green line is the +64 threshold as a reference… below this we don’t consider the correlator to have matched, something we chose based from studying the code response to noise a couple of posts ago.

Basically it shows that up to about 20% channel noise there is a very strong probability we return perfect results (+128 result means that no bit errors were present in the recovered correlator code).

After that there is a region up to about 50% noise where perfect results are sometimes seen, normally there is some corruption, but on average we can still recover a corrupted but high probability correctly sync’d code (whether the attached data payload can survive that beating is another issue… we can at least have reasonably correct bit-sync to whatever is there, allowing payload averaging for example).

But after 60% noise, the probability of finding a usable recovered code is less than 1 in 1000.

If you consider the purple mean line as the overall average recovery capability, it’s clear that at the moment after 40% noise things stop being much fun.

These figures reflect the whole reception system: at the moment the data slicer is a primitive edge-triggered type thing, pretty sure that is the limiting factor here. When I eyeball the recovered data payload when it is challenged by heavy channel noise, It is often broken in a “sticky” way:

hello magic code
hello ma��������
hello magic code
gllo magic co$�
hello magic code
���oo magic code
hello magic code
Yello magic co�7
hello magic code
hello magic cod


I think the sequential bit errors will be harder to create if a more robust bit-slicer is figured out. Here is an idea of what 45% noise in the channel looks like:

and what happens to that nice clean, digital-looking demodulation when you give it that instead of the nice clean sine waves:

That last one is the input into the bit slicer, which has the job of choosing where the bit boundaries are… at the moment it looks at zero-crossings, if it were possible to do a more sophisticated job than that a lot more packets could be saved you would think. I have an extremely cool idea about this I will look into next.

Magic Correlator and baseband QPSK

September 18th, 2007

Time for some practical experiments with the robustness of the magic correlator code. Rather than build any hardware, although some interesting RF hardware is available, I decided to first model the system in software, so I can change things around much easier while there is a lot uncharacterized about the performance and capabilities of the coding.

Testing plan

The general plan is to bind payload data to the magic correlator code, such that the correlator code alignment acts as both a frame sync and assurance that the recovered bit clock is correct (in fact it should act as a clock recovery synthesis pilot as well, since we know the sequence). Because of the properties of the correlator code, it should be possible to just add FEC-coded payload without further ado, either interleaved bit-by-bit or bound another way.

At the moment I am sending plain payload without FEC. Further I am going to initially do all the testing at baseband directly, in fact specifically at audio frequencies. This will eventually allow real world testing using a laptop “transmitting” the coding through its speakers and my moving another microphone-enabled laptop around the house seeing how far we can push the data recovery vs what I can hear myself. My house it pretty noisy, with cars going by, kids jumping around and so on, there is a good mix of whitish noise and short duration dropout crud in the audio spectrum. I am really interested to see how far it can be pushed, particularly if averaging is possible to get working.

Modulation strategy

Currently the test C code modulates the data and the magic correlator code bits together on QPSK. QPSK has four “modulation states” encoded as four phase angles of the carrier, so you are signalling two bits per symbol — one payload bit and one magic correlator code bit. Eventually, because we can recover even badly damaged correlator code quite often, it will be possible to score the likelihood of false payload recovery bit by bit based on looking back at the magic code bits that turned out to be wrong in a particular symbol: the payload and the coding bits were transmitted in the same symbol. This “distrust” indication per payload bit can open the door to some novel, if possibly expensive, error correction.

The current C code uses 44.1kHz sample rate and a 3kHz carrier modulated with QPSK carrying 300 baud symbols, ie, there are 10 carrier cycles per symbol. Each symbol contains two bits due to the QPSK coding. Maybe as we go on it will be necessary to drop the symbol rate to allow more cycles and better recovery, but this is a starting point. You can listen to the QPSK coded WAV file here, this has the magic code and a 16 byte ASCII message payload modulated on QPSK. The symbol transitions look like this:

So, the “transmitting” part is fine… don’t be fooled by those little spikes where it changes phase, those easily lost spikes are not carrying the information. The whole rest of the bit period carrier goes on at the new phase after that discontiguity, the next ten cycles of carrier phase encodes the information. The spike could and probably should be completely filtered out and not cause trouble on recovery.

QPSK symbol recovery

QPSK recovery requires a coherent oscillator in frequency and phase sync with the incoming carrier. You recover the symbols by watching the gyrations the local oscillator has to perform to keep the phase sync.

The RF guys were on to all this stuff back in ancient times, a version of a Costas loop capable of separately dealing the with quadrature “carriers” in QPSK was invented way back when. Its basic idea is to feed the coherent local oscillator to two mixers, one with the local oscillator via a 90 degree phase shift, then lowpass filter the result from the mixers and use that to feed back an error term to the local oscillator. The lowpassed mixer outputs are the “result” from the loop for the two bits encoded in the QPSK modulation.

Now described there and the literature one can find in Google, it’s simple, right? Some of the PDFs on QPSK recovery from Google even had beautiful smooth digital recovery pictures. But the actual implementation is a lot trickier. The problems come from the need to tune various filters in the Costas loop, they need to be tuned for the carrier and VCO lock performance considering the symbol rate. I got started by looking at the Costas loop implementation in GnuRadio, but this has no filters at all in it (I guess this is for flexibility you can add the filters outside it). No doubt someone somewhere has been through all this back in 1970 and written up all the equations, but I couldn’t find it. In the end I found some general advice about matching the lowpass filter 3db point to half the symbol rate and meddled around until it worked. Of course doing it in software I didn’t even have the nightmares of filter matching for the two mixers which bedevil an analogue implementation. I also found a lot of 2 x carrier noise in my loop, which I tried to notch out with some success. Anyway here is what the recovery looks like right now, being fed “perfect” noise-free signal… it looks okay but I am pretty uncertain about how it will react to noise

QPSK absolute phase uncertainty

The receiver locked phase compared to the transmitter 0 degrees phase is unknown, therefore the decoded bits can appear on either of the two output bitstreams and be inverted. This has to be taken into account when looking at what you’re getting. Initially at least you have to run the four possible correlations against the magic code to find out the effective phase offset / symbol coding you have locked at. I guess while the receiver does not lose lock (one can study the error term in the Costas loop I guess) you can just use the lock you previously determined.

Incidentally received absolute phase determination is yet another exploitation of the magic correlation code properties of robust matching.

Next stop

Next task is to recover the bit clock from the received bits and perform the correlation action, and to try to recover the message. All of will still be in pure software with no noise yet.

AT91RM9200 FIQ FAQ and simple Example code / patch

September 17th, 2007

One of the coolest features of the AT91RM9200 we have been designing with for a couple of years now is the FIQ, or Fast Interrupt Request. This is basically the NMI of the ARM world. It is a bit difficult to get working (Milosch Meriac helped me with the initial version some years ago) because its job is to interrupt WHATEVER is happening and start running your FIQ handler within 1us or so NO EXCUSES. This is the very very hard end of hard realtime, it does what it claims but it does not respect any privacy that Linux may need as an OS and we will see that needs care.

Here is a patch against 2.6.20 with the AT91RM9200 patches: it should probably apply okay to later kernels. The patch adds many comments so you probably want to read that and this at the same time. First the things you can’t do, which will save you much pain from finding them out yourself.

Things you can’t do with FIQ

One of the “private times” Linux needs to itself is the virtual memory pagetable management action. It shuts off all interrupts and rewrites the pagetable at intervals, and then goes on as before. That stops driver interrupts coming in and trying to do stuff while the pagetable is empty, or incomplete or just full of garbage.

However FIQ ignores any claim to privacy performed by shutting off normal interrupt response. That means your FIQ ISR code can come in at a “bad time”, if it tries to access memory areas mapped through the pagetable it will instead find nothing or the wrong thing or… the end result is that the FIQ ISR cannot touch any memory mapped by vmalloc.

Unfortunately, when a kernel module is loaded, its various memory footprints including the module code are allocated by… yep, vmalloc. That means your FIQ ISR code cannot live in a module, it has to be part of the monolithic kernel.

Finally all sorts of Linux code also wants “private time” or to guard against multiple access to objects by spinlocks or whatever. FIQ ISRs cannot play those games, it comes it in the middle of anything and has to get out quickly again too. So unless it is a simple macro, you can’t use any Linux APIs in the FIQ ISR.

Things you can do with FIQ

Well reading those constraints, you’re probably wondering if it is still useful. It sure is!

You can touch the memory-mapped IO in the chip using the AT91RM9200 APIs.

FIQ has the super power it will run your ISR within ~1us NO EXCUSES. That means you can rely on the ISR code to act like hardware, you trigger it and 1us later your programmed sequence occurs without fail. In turn that means FIQ is perfect for many hardware interfacing tasks, in particular management of low latency (small buffer) PDC DMA setup.

Low latency for audio traffic for example is highly desirable, but of course if there are ANY delays setting up the next PDC DMA, you get dropouts and clicks. If you allow the FIQ to handle generation of samples and management of PDC DMA, there won’t be any delays for sure, you will have perfect audio.

We found that the AT91RM9200 at 180MHz can easily handle 8kHz FIQs (a common rate for telephony) with an ISR duration of ~8us, without affecting the Ethernet or USB performance.

IPC between the FIQ and kernel worlds

The general communication for FIQ ISRs with the “real world” in the patch is to define a struct in include/asm-arm/arch-at91rm9200/at91rm9200_fiq_ipc_type.h that contains all of the data that is shared between FIQ and normal kernel code. The example one looks like this:

struct at91rm9200_fiq_ipc {
	int nCountFiqEvents;
};

One of these structs is defined in the main part of the patch code in arch/arm/mach-at91rm9200/at91_fiq.c like this

struct at91rm9200_fiq_ipc at91rm9200_fiq_ipc;
EXPORT_SYMBOL(at91rm9200_fiq_ipc);

That means in your other kernel code — which can be in a module, only the FIQ ISR must be in the kernel — you can have

#include <arch/at91rm9200_fiq_ipc_type.h>
extern struct at91rm9200_fiq_ipc at91rm9200_fiq_ipc;

and use the same struct to communicate with the FIQ ISR.

How to customize summary

1) Change struct at91rm9200_fiq_ipc in include/asm-arm/arch-at91rm9200/at91rm9200_fiq_ipc_type.h to have the data types you need

2) Add your FIQ ISR code to arch/arm/mach-at91rm9200/at91_fiq.c where it says “your C code goes here”

3) Import extern struct at91rm9200_fiq_ipc at91rm9200_fiq_ipc; in your own kernel module and communicate with the FIQ ISR using that.

FIQ shadowing with IRQ

Ultimately the FIQ actions are going to need to interface to Linux kernel objects sooner or later, perhaps there has to be some locking or blocking action for usermode access. But we are banned from using Linux APIs in the FIQ ISR.

A powerful solution is to physically tie the FIQ signal to an IRQ input additionally. Code that is REALLY hard realtime, like the PDC DMA management, goes in FIQ, and a count of FIQs is kept. Data out of FIQ can go into a software FIFO. The less reliable IRQ watches the count of FIQs and compares it to its own count of IRQs, if it sees it has blacked out and has fallen behind, it will loop up to a certain number of times “catching up”, using the data placed by FIQ in the FIFO.

In this way, by splitting the code into “no excuses” realtime in the FIQ ISR, and “reliable only on average” realtime in the IRQ ISR, it is possible to bind actions in the FIQ to code in the ISR which can execute Linux APIs, and have the best of both worlds.

Magic correlator code analysis

September 12th, 2007

Intrigued by the magic correlator possibilities, I wrote some code to simulate a proper worst-case Monte Carlo analysis of the performance vs noise, with fascinating results. (Although I tried to choose reasonably large number of random runs considering the CPU time needed, please bear in mind the numbers identified in the rest of this are only as accurate as the number of runs allows.)

False indication rejection when given only noise

What about the reaction to having no signal at all… can it tell there is no transmission or does it falsely detect correlation? What is the highest false correlation result seen when challenged with noise? Here is the distribution of highest correlation results for 50 Million runs feeding it only white-ish binary noise with no correlation sequence component.

The largest false response seen even once in the runs is +58, out of a full-scale match with a 0% bit-error rate of +128: one can put it that the probability of seeing a match better than +58 from noise is something greater than 1 in 50M. So we learn from this we can’t trust any correlation result lower than, say, +64, to allow some margin. (This +64 requirement is shown with a blue line in the following graphs).

Random bit-error rate response

In this graph I ran the self-correlation 10,000 times per offset with different noise each time, and picks the worst (lowest) correct “position 0″ sync match value (red) and plots it against the best (highest) wrong offset match value (green) in absolute match quality. The thin blue line shows the absolute correlation value of +64 we selected based on the first graph.

On the left where there is no noise, we can tell the correct sync by a wide margin. Where the red line crosses the green, at around 0.2 bit-error probability, it means the correct sync position can no longer be distinguished from a false match. But before then, the absolute correlation value for the correct offset has fallen below our +64 limit (selected because noise can create a +58 result) so detection is lost first at a 0.12 ber.

Here is a plot of the ranking of the correct offset vs all of the other offsets. I expected the correct one to start at #1 and then slip down the rankings, but instead it starts at #1 and falls right to the bottom when it can’t be selected as #1 any more.

What it means is that up to around 0.12 – 0.15 ber (equates to 15 – 19 randomly selected flipped bits of the 128 in the pattern) you can detect the pattern VERY reliably. Any higher ber – with randomly selected bit errors – and your probability of detecting the pattern is very low.

Multibit dropout tolerance

From my WiFi work I know that a common failure mode in RF packets is a multibit continuous dropout, that’s different from the random bit errors introduced above. These graphs show the effect on worst correct offset margin from dropouts of all possible lengths randomly placed in the packet, where the dropout is filled with white noise, all zeroes or all ones.



Clearly it is beautifully insensitive to multibit contiguous dropouts. If the problem is that you have white noise crapping on the transmission, the loss of 39 contiguous bits can be sustained without dropping below the +64 result limit. If the problem is events that cause continuous static 1 or 0 to be read during the disturbance, the code is very insensitive to this and can still be detected with fully half of the bits sequentially zero’d out or up to 50 set to ‘1′. So the sync detection performance faced with contiguous dropouts actually exceeds that of random dropouts.

This last dropout graph shows performance when there are TWO dropped-out areas randomly (5,000 runs at each dropout length) placed in the packet at various dropout lengths (the dropout length is the same for both and they can overlap, explaining the noise at the end as they grow larger).

Again looking at the absolute result values for the graph (blue line) the optimal absolute result cutoff of +64 is seen at two blocks of 18 contiguous bits contaminated with noise. These are very severe insults that still allow a correct sync detection.

Conclusion

This means (to the accuracy of these simulations) if you draw a line at 15% bit-error rate, if you ever see any offset of the correlator giving an absolute result of +64 or better, there is a very high probability that:

  • there is a genuine transmission in progress
  • the offset reporting that result is the correct sync offset, and
  • your bit-error rate is 15% or less

Conversely if no correlator offset gives +64 or better:

  • the bit-error rate is higher than 15%, or
  • there is no transmission

This is a very robust correlator pattern! It can be improved further: at the moment the “score” for correlation adds 1 for a matched binary bit level and subtracts 1 for a binary mismatch. If the demodulator that is providing these bits gives a probability of a ‘1′ or a ‘0′ instead of a binary ‘1′ or ‘0′, then the result can be made from more information. A few “looks a bit like a 0″ inputs will more weakly override many “definitely a 1″ inputs, for example.

There is another great advantage to interleaving this pattern with the payload. If the sync pattern can be recovered considering the 15% bit-error rate that is allowed, it is possible to identify then which bits of the pattern were corrupted. Because the correlator code bits are interleaved with the payload, it suggests that if the payload is broken, that the problem is coming from the payload bits next to the known-bad correlator code bits. For example, if it is shown that say three contiguous bits of the correlator code channel are wrong, one has to wonder about the two payload bits that are inbetween them. If there are a small number of bits involved, it can be possible to “fuzz” the suspected bad payload bits to see if an otherwise unrecoverable ECC error can be solved.

One more advantage is that the robustness margin of 15% allows channel bit-error rate to be continually assessed during reception.

Autocorrelation code and weak signal recovery

September 12th, 2007

Looking at weak signal capture at the moment, there has been considerable work done on this by Radio hams. The extreme cases for these guys are bouncing signals off the moon or meteors to reach other places on the planet. The most recent protocol I could find is called JT65, and it makes some pretty extraordinary claims for data recovery: 100% recovery at -27dB SNR, ie, the noise floor is 27dB above the signal. Unfortunately it seems the author of this otherwise cool and interesting protocol took it a step too far, and used “forbidden Black Magic” in his implementation to get results at that level.

However removing the black magic the claim of 100% recovery at -22dB SNR using another “forbidden” but less magical technology is not being disputed. This is a patent-encumbered “soft” Reed-Solomon decoder which is able to recover from more damage faster than the normal “hard decision” decoder: this means you have to give up another few dB to get a distributable implementation. An open source implementation exists at berlios but it’s written in freaking Fortran. Multithreaded Fortran with a Python GUI. This provides a normal Reed-Solomon FEC implementation which is used if you don’t have the external forbidden one.

One awfully limiting “trick” and two really interesting techniques are used in the protocol. The bad news is that very very long symbol-times are used for transmission, 372ms per 6-bit symbol. Considering the various bloatages it’s about one byte per two seconds. They are sending one of 64 “tones” to encode the six bits during that time… obviously the symbol duration helps with recovery. This “trick” is the core feature of weak signal recovery… repeat what you are doing a lot, in this case repeat the “tone” cycles a lot to “amplify” the signal at a receiver which knows how to take advantage of looking for something happening multiple times to increase probability of detection.

The first interesting trick is just the amount of Reed-Solomon used… this is not new to me since I used it as part of Penumbra. But in this protocol, every 72-bit packet has an additional 306 bits of error correction attached to it :-O. That’s more than 4 times as much ECC as data, and despite that it still pays off for capturing the signal.

The second cool technique is to interleave the payload data with a binary autocorrelation “clock”. Since the noise level is so crazy, it’s of little use to expect a 1-bit channel in the data to be usable as a “start of frame” marker or somesuch as you would normally expect with digital serialized communication. Instead, they spread the sync information in this interleaved “channel” using a 126-bit sequence which has a magically cool property… if you autocorrelate the sequence with itself, even in the presence of a fair bit of noise, every correlation offset except the right one matches MUCH worse than the 1:1 lineup. Here is the sequence extended to 128 bits and correlating with itself. The y axis is the number of bits that match…. obviously that is 128 when it compares itself to itself at the 0 offset on the X axis. The cool part is how low the self-correlation is everywhere else, no better than 20, or a 14dB “SNR” between a match and a non-match.

This remains the case even under pretty bad noise, up to 25% of the bits being trashed (still 9dB sync SNR):

but at 30% of the bits being trashed, the performance falls off a cliff:

Not only does the noise floor rise due to falsely improved correlations, but the one true correlation is also falsely degraded. After about 28% bit errors the reliability is gone. (Note the noise is one-shot with the test program, rather than being Monte Carlo’d, but I ran it several times and the graphs shown are representative).

But that isn’t the end of the story for this code. First the correlation action is a filter for transmission presence all by itself. And if you detect the transmission by the presence of the correlation code, you have also sync’d the receiver to the transmitted frame, since the correlation bits are interleaved with the actual data and the “0″ offset marks the start of the frame.

With deep memory and a known period of retransmission from the source, temporally averaged autocorrelation can take place to increase the chances to find the presence of a transmitter and to sync up to its data. After a transmitter “sync” has been found in the averaged data with high probability, the averaging memory can be turned to only store the times when a transmission was expected from the known schedule of the transmitter.

Here is the magic code with the 128-bit sequence and the test loops

#include 

static unsigned int u8Auto[] = {
	0x19, 0xbf, 0xa2, 0x89, 0xf3, 0xf6, 0x58, 0xcd,
	0x2a, 0x81, 0x01, 0x4b, 0xab, 0x4c, 0xc2, 0xbf
 };

#define AC_LEN 128

char GetAc(int n)
{
	n = n & (AC_LEN - 1);
	return (u8Auto[n >> 3] >> (n & 7)) & 1;
}

int main(int argc, char ** argv)
{
	int n, n1;
	int nSum;
	int nNoise = 0;
	int nSeed;
	FILE *f = fopen("/dev/urandom", "r");

	fread(&nSeed, sizeof(nSeed), 1, f);
	fclose(f);
	srand(nSeed);

	if (argc == 2)
		nNoise = (1024 * atoi(argv[1])) / 100;

	fprintf(stderr, "Noise: %d%%\n", nNoise);

	for (n = -(AC_LEN - 1); n < AC_LEN; n++) {
		nSum = 0;
		for (n1 = 0; n1 < AC_LEN; n1++) {
			char c = GetAc(n + n1);
			/* simulate white noise */
			if ((rand()&1023) < nNoise)
				c = c ^ 1;

			if (GetAc(n1) == c)
				nSum++;
			else
				nSum--;
		}
		printf("%d %d ", n, nSum);
	}
	return 0;
}


and the graph command that generated the graphs (the 28 is the percentage of noise to graph)

gcc test.c -o test ; ./test 28 | graph -Tpng --bitmap-size 1200x1200 -FHersheySans>temp.png && convert temp.png -scale 300x300 png:temp1.png

Embedded procmail and dovecot

September 6th, 2007

For over a year I have been using an 32MB ARM9-based board I designed with a 1GB USB stick as my mailserver. It is powered from a USB port on my firewall box and takes around 1W.

I use our Octotux Linux distro with Postfix as the MTA, gps for the greylisting and Dovecot IMAP to provide secure access to the mailstore over SSL. This has worked out really well, the warmcat.com MX record points directly to the external IP here, and the firewall box port-forwards port 25 to the embedded device. It’s silent and runs cold 24 hours a day and has never missed a beat.

A couple of weeks ago I had to look at the box again because the greylisting software was hanging. I discovered that we were being bombarded with spam, one new spam every two seconds on average, from all over the world. I adjusted the ordering of the filtering in postfix to first reject on an unknown username, that stopped so many concurrent gps sessions being needed. The server weathered that storm and the spam people gave up a few days later without getting a single one through. (They were also targeting the warmcat.com A record IP, I suppose in case it was a backup for the real MX, but they had zero luck with that either).

However it reminded me of the one inadequacy of this mailserver… when you wake up your laptop in the morning, thunderbird takes ages to run all the filters and move the new mails remotely into the right IMAP folders. That’s pretty annoying when you see the titles of mails you want to read but the USB stick on the server is maxxed out for a couple of minutes sorting eight hours worth of new emails into folders on the server. I have been pondering changing the box to one with USB2 High Speed, but it occurred to me that otherwise, the existing USB 1.1 “Full Speed”, that is, 12Mbps is completely adequate. Changing folders and moving to other emails in thunderbird is snappy. It’s just the client mail filters under the load of 500 mails in the morning. So I decided to port procmail to ARM9 Octotux, in effect to do the folder sorting as each mail came in, so there would no longer be any processing done at the client for that.

Read the rest of this entry »