Hyperram bandwidth in context

Now that the Hyperram Bus Interface Unit (BIU) is working, the next problem is sharing its bandwidth between the other functions in the FPGA confidently and efficiently.

On the other side of the Hyperram pipe, under the pseudo-static veneer, it's nothing more than traditional SDRAM. That means it's expensive in latency to select a random starting address, but after that it's cheap to spam out contiguous data.

In other words, since we have to spend 3 SDR clocks plus latency (at least 2 more, and as many as 5 more at our low speed) to select a random address, plus dead time around CS assertion and deassertion, it's relatively expensive to move to a new address unless it's the previous address plus one word. But once you have started the packet that sets a random address, bursting is very cheap, with 16 bits coming every SDR clock (15.6ns at 64MHz SDR).

(CPU access to the Hyperram over SPI is a special case in this system; it's handled with a burst size of 1 every time, since it's relatively uncommon.)

Interfacing to a bursty bus

You could design the system with peripherals that only make individual 1-word "bursts" in sequence, i.e., ignore bursting. (In fact I tested this once the BIU was working, with two masters doing wordwise accesses, and it works fine; it blocks the bus for around 250ns each time overall.) But as explained above, that goes directly against the main characteristic of SDRAM, that bursting is cheap.
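To put rough numbers on it (back-of-envelope only, assuming the ~250ns single-access overhead stays roughly constant regardless of burst length), moving 128 bytes looks like this:

    64 x 1-word accesses:  64 x 250ns           = 16000ns for 128 bytes  (~8 MB/s)
    1 x 64-word burst:     ~250ns + 63 x 15.6ns = ~1233ns for 128 bytes  (~104 MB/s)

Even a modest burst length recovers an order of magnitude of usable bandwidth.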

Therefore the solution involves:

https://warmcat.com/biu.png

  • Designing the peripherals to favour sequential accesses. As we will see, that may mean that instead of storing a "struct" or "descriptor" as one contiguous record, it may be advantageous to store the members separately in their own sequential storage.

  • The peripherals must be able to wait until ALL the data they need is available: another peripheral may be bursting on the bus, and only one peripheral can do that at a time.

  • Independent FIFOs that leverage the high-bandwidth burst action, so we can get in and grab a chunk of sequential data, then hand it out to the peripheral on demand, usually with no latency. The RAMs act as a rate adapter between the constrained Hyperram burst and the unconstrained FIFO consumer.

  • An arbitration scheme between the possibly many clients that need to share the bus at reasonably low latency.

ICE5 block RAM

The big ICE5 chip has 20 x 256x16 dual-port RAMs on the die for exactly this kind of task. These can be converted to Hyperram-compatible FIFOs with a modest amount of glue logic, giving us a primitive that can read from a range of Hyperram addresses into the 256x16 SRAM (aka "block RAM"), but present a dumb FIFO interface on the other side.

This is the interface for a VHDL component that buffers burst reads from the Hyperram:

https://warmcat.com/cbl_brfifo_read.png

The dumb FIFO side signals when something is there to read, and gets a 1-clock strobe from the consumer to say it has been used. That's all the downstream consumer needs to concern itself with; it might have to wait for "some reason" before the next FIFO data is available, but the reasons why, such as having only serialized access to a single SDRAM, or the number of steps needed to initiate a burst there, are nothing it has to understand or deal with.
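As a rough illustration of the shape of such a component, here is a minimal VHDL entity sketch; the port names and widths are my assumptions for illustration, not the actual cbl_brfifo_read interface:

    library ieee;
    use ieee.std_logic_1164.all;

    -- Illustrative sketch only: names and widths are assumed, not the real
    -- cbl_brfifo_read port list.
    entity brfifo_read_sketch is
      port (
        clk            : in  std_logic;
        rst            : in  std_logic;

        -- arbitrator / BIU side: burst-read a range of Hyperram addresses
        hr_req         : out std_logic;                      -- want the bus
        hr_grant       : in  std_logic;                      -- arbitrator says go
        hr_addr        : out std_logic_vector(31 downto 0);  -- 32-bit logical address
        hr_continue    : out std_logic;                      -- keep the burst going
        hr_word        : in  std_logic;                      -- this cycle's read data is valid
        hr_rdata       : in  std_logic_vector(15 downto 0);

        -- dumb FIFO side, towards the consumer
        fifo_valid     : out std_logic;                      -- something is there to read
        fifo_rdata     : out std_logic_vector(15 downto 0);
        fifo_rd_strobe : in  std_logic                       -- 1-clock "I used it" strobe
      );
    end entity;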

Fundamentally these FIFOs abstract everything about the Hyperram into a "stream", which the consumer can draw down as data becomes available and as it wants to use it, without negatively impacting the bandwidth characteristics needed to interoperate with the Hyperram.

The probability of having to wait because the FIFO is empty is related to

  • the latency of gaining access to the SDRAM, itself related to
    • the number of competing peers on the bus, and
    • the restriction placed on burst length,

  • the depth of the FIFO, and

  • the rate at which it is drawn down from the peripheral side.

Bus Arbitration

Since this model involves many competing FIFOs that may simultaneously need access to the SDRAM, there has to be some arbitrator to decide who will get that access next.

Today I use the simplest form, a sequential round-robin arbitrator. If you have 16 FIFOs that may want to access the Hyperbus, it spends a clock checking each candidate in turn after the previous allocation completes, even if that candidate doesn't need the SDRAM right now. So FIFO 0 has a go, then next clock it checks FIFO 1, and so on. It's "correct" but it's not optimized, and it becomes more expensive the more bus masters appear. (Currently, with 4 masters, this isn't an issue, but it will grow to a dozen or so.)

There are more efficient ways of handling this, where the priority, dependent on who went last time, is burned into a single logic expression for each state. So if FIFO 0 went last time, the expression to choose who goes next prioritizes FIFO 1, then 2, and so on, with FIFO 0 last. That way it always chooses someone to use the bus in a single clock, if anyone at all was waiting. But the number of client FIFOs is still fluid at the moment, so optimizing this is left for later; a sketch of the simple sequential form is below.
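As a concrete sketch of the simple sequential form, something along these lines; the entity, names and "done" handshake are assumptions for illustration rather than the actual RTL:

    library ieee;
    use ieee.std_logic_1164.all;

    -- One candidate FIFO is examined per clock, so an idle candidate costs a
    -- wasted cycle, exactly as described above.
    entity rr_arbitrator_sketch is
      generic (N : natural := 4);                      -- number of bus-master FIFOs
      port (
        clk   : in  std_logic;
        rst   : in  std_logic;
        req   : in  std_logic_vector(N - 1 downto 0);  -- FIFO n wants the bus
        done  : in  std_logic;                         -- BIU finished the current burst
        grant : out std_logic_vector(N - 1 downto 0)   -- one-hot: FIFO n owns the bus
      );
    end entity;

    architecture rtl of rr_arbitrator_sketch is
      signal idx  : natural range 0 to N - 1 := 0;
      signal busy : std_logic := '0';
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if rst = '1' then
            idx   <= 0;
            busy  <= '0';
            grant <= (others => '0');
          elsif busy = '0' then
            -- look at one candidate per clock; a wasted cycle if it is idle
            if req(idx) = '1' then
              grant      <= (others => '0');
              grant(idx) <= '1';
              busy       <= '1';
            end if;
            -- advance to the next candidate either way
            if idx = N - 1 then
              idx <= 0;
            else
              idx <= idx + 1;
            end if;
          elsif done = '1' then
            -- the burst finished (or was truncated by the BIU): release the bus
            grant <= (others => '0');
            busy  <= '0';
          end if;
        end if;
      end process;
    end architecture;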

In more complex scenarios the individual FIFOs may need to be tagged with a general latency limit: for example, FIFOs related to video generally have a higher priority, since they cannot defer issuing the next pixel. But in my design it is possible to defer issuing the next video pixel, so this additional prioritization or deadline scheduling is not necessary.

The most expensive part of the arbitrator is the mux required to allow every FIFO to control the Hyperram address, write data, and various strobes. The logical addresses for everything are 32-bit to allow expansion, so there are a lot of signals being muxed. As the number of FIFOs needing access grows, the mux complexity grows accordingly, putting a brake on the maximum system clock rate.
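A sketch of why that mux gets heavy: every master presents its own 32-bit address, write data and strobes, and whichever one holds the grant drives the single set of BIU inputs. Names and the flattened-bus packing are assumptions for illustration:

    library ieee;
    use ieee.std_logic_1164.all;

    entity biu_mux_sketch is
      generic (N : natural := 4);
      port (
        grant        : in  std_logic_vector(N - 1 downto 0);       -- one-hot from the arbitrator
        m_addr       : in  std_logic_vector(N * 32 - 1 downto 0);  -- per-master 32-bit addresses
        m_wdata      : in  std_logic_vector(N * 16 - 1 downto 0);  -- per-master write data
        m_continue   : in  std_logic_vector(N - 1 downto 0);

        biu_addr     : out std_logic_vector(31 downto 0);
        biu_wdata    : out std_logic_vector(15 downto 0);
        biu_continue : out std_logic
      );
    end entity;

    architecture rtl of biu_mux_sketch is
    begin
      process (grant, m_addr, m_wdata, m_continue)
      begin
        -- default to master 0, then let whoever holds the grant win
        biu_addr     <= m_addr(31 downto 0);
        biu_wdata    <= m_wdata(15 downto 0);
        biu_continue <= '0';
        for i in 0 to N - 1 loop
          if grant(i) = '1' then
            biu_addr     <= m_addr(32 * i + 31 downto 32 * i);
            biu_wdata    <= m_wdata(16 * i + 15 downto 16 * i);
            biu_continue <= m_continue(i);
          end if;
        end loop;
      end process;
    end architecture;

Each extra master widens every one of those selections, which is where the pressure on the clock rate comes from.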

System burst limit

The Hyperram itself will happily burst forever (although if you want to do that, you may need to observe the individual wait states signalled back to the master on RWDS during reads). But although that's nice for the empty FIFO that is getting its turn as bus master, it's a disaster for the worst-case latency of the FIFOs that need topping up but are stuck waiting their turn to talk to the Hyperram.

For that reason, it's necessary that the Hyperram BIU RTL component itself reserves the right to end burst transactions unilaterally, in the name of putting a limit on how long everybody else can get stuck waiting.
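The mechanism can be as simple as a word counter in the BIU that stops honouring the master's "continue" request past a ceiling. A minimal sketch, with the limit value and signal names assumed for illustration:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity burst_limit_sketch is
      generic (MAX_BURST : natural := 32);  -- ceiling in words; an assumed figure
      port (
        clk             : in  std_logic;
        burst_active    : in  std_logic;    -- a burst is currently in progress
        word_strobe     : in  std_logic;    -- one word transferred this clock
        master_continue : in  std_logic;    -- the master would like to keep going
        biu_continue    : out std_logic     -- what the BIU actually honours
      );
    end entity;

    architecture rtl of burst_limit_sketch is
      signal count : unsigned(7 downto 0) := (others => '0');
    begin
      -- honour the master's wish only while we are under the ceiling
      biu_continue <= master_continue when count < MAX_BURST else '0';

      process (clk)
      begin
        if rising_edge(clk) then
          if burst_active = '0' then
            count <= (others => '0');       -- new transaction, restart the count
          elsif word_strobe = '1' then
            count <= count + 1;             -- another word has gone by
          end if;
        end if;
      end process;
    end architecture;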

Dependency on multiple FIFOs

A common pattern is that a peripheral requires multiple streams, from various places in the SDRAM map, to be processed together. That's no problem with this scheme, since the peripheral can simply stall until every FIFO it requires reports it has the next data to draw down, i.e., it stalls until all the dependencies ANDed together are met.
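The gate itself is just an AND reduction over the dependent FIFOs' valid signals; a minimal sketch for a peripheral with three dependent streams, names assumed:

    library ieee;
    use ieee.std_logic_1164.all;

    entity dep_and_sketch is
      port (
        fifo_valid  : in  std_logic_vector(2 downto 0);  -- one bit per dependent FIFO
        fifo_strobe : out std_logic_vector(2 downto 0)   -- draw one word from each
      );
    end entity;

    architecture rtl of dep_and_sketch is
      signal all_ready : std_logic;
    begin
      -- stall unless every dependent stream has data to offer
      all_ready   <= '1' when fifo_valid = "111" else '0';

      -- only then is one word consumed from every FIFO together
      fifo_strobe <= (others => all_ready);
    end architecture;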

If you consider this structure starting up, initially all the FIFOs are reset and empty; then in turn each gets an opportunity from the arbitrator to burst until it is at least semi-full (the policy for how empty to get before triggering a burst, and how much to burst at one sitting, is up to the FIFO implementation, plus or minus the BIU vetoing the continuation of a burst it considers too long). When the last dependent FIFO gets its turn to burst, starts to fill, and finally signals it has valid data, the peripheral, which has been ANDing together the data-valid signals of all its dependent FIFOs, starts to process data from all the FIFOs it cares about.

Pipeline-friendly IO primitive

In order not to introduce latency around using the streaming data or deciding when to stop, there must be two critical pipelined signals between the BIU and each master. The basic semantics of these signals can be summed up as:

  • "continue" - from master to BIU: when high, it means we want to continue the burst one more cycle

  • "word" - from BIU to master: when high, it means the word of data to / from the Hyperram is valid

These two are enough to encapsulate the BIU streaming functionality while being able to stream at the rate of one word per SDR clock.
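From a read master's point of view the whole interaction reduces to driving "continue" while it still has room, and capturing data whenever "word" is high. A minimal sketch, with the grant and space-accounting signals assumed for illustration:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity burst_master_sketch is
      port (
        granted     : in  std_logic;                      -- the arbitrator gave us the bus
        fifo_space  : in  unsigned(8 downto 0);           -- free words left in our 256x16 RAM
        hr_word     : in  std_logic;                      -- BIU: this cycle's word is valid
        hr_rdata    : in  std_logic_vector(15 downto 0);
        hr_continue : out std_logic;                      -- us: please keep bursting
        bram_we     : out std_logic;                      -- write strobe into the block RAM
        bram_wdata  : out std_logic_vector(15 downto 0)
      );
    end entity;

    architecture rtl of burst_master_sketch is
    begin
      -- ask for another word only while we hold the grant and still have room,
      -- keeping a word of slack for the pipelined response
      hr_continue <= '1' when granted = '1' and fifo_space > 1 else '0';

      -- every valid word goes straight into the block RAM, one per SDR clock
      bram_we    <= hr_word;
      bram_wdata <= hr_rdata;
    end architecture;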

What we learned this time

  • Peripherals must take advantage of the Hyperram's burst characteristics to get any kind of efficiency; that means using a block SRAM FIFO per bus-master stream

  • Just having a BIU for the Hyperram is a critical part of the puzzle, but it's just a part of the puzzle

  • The BIU must be preceded by an Arbitrator, which can make sure all of the masters (there may be dozens) get access in turn below some latency limit.

  • The Arbitrator comprises a round-robin scheme, so everybody gets a turn within a limited latency, and a wide mux, so each master can individually control the address (to 32 bits) and the other critical Hyperram signals.

  • The master should define its own burst length (and start address), but the BIU must be able to truncate it, with the master dealing with any remainder next time, in the interests of achieving a reasonable unconditional system latency.

  • Peripherals needing to use the data from the FIFO bus masters must AND together the valid signals from the FIFOs they are dependent on before being able to process anything

  • "continue" and "word" are the two key semantics for no-latency stream processing