SPI as video and alpha compositor

SPI LCD extreme sports

As I wrote up a month ago, there are some nice LCD panels available with an SPI interface and a frame buffer in the controller chip. You can see all kinds of projects on the Internet using them, many contaminated by Arduino, since you can buy a shield with the panel on it and SPI is easy to wire up and even bitbang if necessary.

But all these projects simply use a CPU to scribble in the on-panel RAM. That means none of them look much like people expect LCDs to look in a post-Android age. Redraws are not locked to the panel refresh, because none of these panels provide both SPI and TE (Tearing Effect, aka VSYNC), and there is no support for pageflipping at the panel, so you can see the CPU doing updates on the display. This reflects an assumption that "nobody will do video on SPI".

Another issue with the typical usage of these panels is that people using Arduino as a crutch are screwed for reading back the framebuffer data, due to a design flaw in the shield implementation discussed in the other article. Even if you can read back, it is relatively slow to read the background and then apply the new foreground, and once you have written the result back the original background is lost. So effects like alpha must be composed in software, holding the related planes in CPU memory simultaneously, and in many cases the CPU is too weak or resource-constrained to do this well. In fact, given the very limited scope of most projects using this kind of panel, driving the panel well tends not to be the focus: it usually gets some blocky fixed-width numbers on it and that's it. Overall, there's a mismatch between what these nice panels can do and what people are using them for.

We can't fix the lack of TE, but the limited approaches to leveraging these nice panels just reflect a prejudice that "SPI is not real video", one not borne out by any technical restriction.

SPI as real video

The ST7735 (which has nearly identical second sources under other names, e.g. ILI9163V) has an efficient SPI command, RAMWR, which says all following SPI data will be copied into a user-definable 2D region until the transaction ends. So nothing stops us setting the 2D region to the whole display and using the FPGA to spam 565 video at <= 20MHz SPI rate. For 160x128, we can easily deliver 30fps this way: SPI is then truly acting as a digital video streaming protocol.
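
To put rough numbers on that, here is a small Python model of the full-frame setup and the bandwidth arithmetic. The opcodes are the usual column-set, row-set and memory-write commands on ST7735-class controllers; the window orientation (160 columns by 128 rows, no panel offset) and all function names are assumptions for illustration, not taken from the FPGA design.

```python
# Rough model of full-frame streaming to an ST7735-class panel.
CASET = 0x2A  # column address set
RASET = 0x2B  # row address set
RAMWR = 0x2C  # memory write: pixel data follows until the transaction ends

WIDTH, HEIGHT = 160, 128   # assumed orientation, no panel offset
BITS_PER_PIXEL = 16        # RGB565


def window_commands(x0, y0, x1, y1):
    """Return (opcode, parameter bytes) pairs selecting an inclusive 2D region."""
    def span(start, end):
        # Each address is sent as two bytes, high byte first.
        return bytes([start >> 8, start & 0xFF, end >> 8, end & 0xFF])
    return [(CASET, span(x0, x1)), (RASET, span(y0, y1)), (RAMWR, b"")]


def max_fps(spi_hz, width=WIDTH, height=HEIGHT, bpp=BITS_PER_PIXEL):
    """Upper bound on frame rate, ignoring command overhead and gaps."""
    return spi_hz / (width * height * bpp)


def required_spi_hz(fps, width=WIDTH, height=HEIGHT, bpp=BITS_PER_PIXEL):
    """Minimum SPI clock needed to sustain the given frame rate."""
    return fps * width * height * bpp
```

One 160x128 frame in 565 is 327,680 bits, so 30fps needs a little under 10MHz of SPI clock, and a 20MHz clock leaves roughly a factor of two in hand.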

The impact of no TE signal (or no VSYNC lock in other words) depends on what and how we update in the frame.

For updates where large contiguous areas change completely (in intensity or chroma, although from which colour to which matters), we will noticeably tear.
For regions that "scrolled by one pixel" though, any tearing is restricted to part of one line of pixels. That may be completely unnoticeable.
For regions where characters update-in-place, like a counter type display, the tearing can impact any amount of the character update area, although only in the characters that changed anyway and only for the one frame they changed in.

So in cases where only restricted areas of the display update, or things move in one-pixel increments, tearing is not necessarily an issue.

Display Subsystem Architecture

Block Diagram

This is an overview of the video subsystem implemented in the FPGA. It's designed to be paired with a very resource-constrained CPU.

Video compositor

Since we have hardware FIFO-buffered access to hyperram, huge bandwidth to the hyperram, and no lack of FIFOs (20 x 256 x 16 on the FPGA), we can also compose (combine) the video stream in realtime from multiple sources. Since we have all the data at the time we issue it, alpha composition or other effects between the layers are relatively simple to implement.

In my case the FPGA will collect data autonomously and render it into one separate hyperram buffer, and this also can be a hardware overlay plane dynamically composed into the output video stream without needing CPU intervention.

Overall there are three hardware planes composed together.

The CPU retains its own interleaved access to hyperram, so we don't lose any flexibility.

Blitter

Even though the display resolution isn't huge, a big problem is the amount of CPU time needed for copying font glyphs into the display plane. CPU access to the hyperram, like direct access to the LCD panel, is via SPI, so although interleaved access from the CPU to the hyperram is easy, it's still relatively slow. Drawing large characters in a display plane will block the CPU for a relatively long time, and in my case some text is updated very often, potentially inside an IRQ.

Since only a small proportion of the text is updated at a high rate, I considered for a while implementing a complex sprite unit which dynamically composed 2D areas at the video rate... this can work, but it quickly becomes more complicated than we will ever get any benefit from when you consider overlapping or different-height sprites on the same scanline. In an FPGA, stuff that takes a lot of LUT real estate (and / or design time...) has to pay for itself reasonably directly in functionality; since this isn't much like PacMan, sprites don't fit.

Instead I implemented a blitter engine: this is basically an autonomous "2D memcpy" that copies a given width x height from a source to a destination. This works fine, but considering it will be doing copies as fine-grained as individual font glyphs at, say, 8x8px, it's difficult to synchronize just the blitter itself to the CPU activity without excessive overhead.
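
As a software model of what "2D memcpy" means, the sketch below copies a rectangle between two planes that live in the same flat memory. The parameter names and the use of a separate stride per plane are assumptions for illustration; the real engine's interface is not described here.

```python
def blit(mem, src, src_stride, dst, dst_stride, width, height):
    """Copy a width x height rectangle of words inside one flat memory.

    src and dst are word offsets of the top-left corners; each stride is the
    word distance between the starts of consecutive rows in that plane.
    """
    for row in range(height):
        s = src + row * src_stride
        d = dst + row * dst_stride
        # One row is a plain linear copy; the stride handles the 2D stepping.
        mem[d:d + width] = mem[s:s + width]
```

The separate strides are what let a narrow glyph be lifted out of a wide font bitmap and dropped into a display plane of a different width.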

While working with Fujitsu in Taiwan, I wrote a Linux kernel driver for a much more complex but basically the same kind of blitter hardware on their silicon: although it's very efficient for medium - large bitmaps, for many small actions like glyph blitting, the driver CPU overhead of

formatting the blit in userspace
transferring it by IOCTL
queuing it in a driver (soft) ringbuffer
receiving the idle interrupt, and
setting up the transfer repeatedly

massively overwhelmed the benefit of getting hardware to blit 8x8 or 16x16 compared to the CPU just poking it into the plane memory. So not only is the blitter's local efficiency important but, if it's going to be useful at the system level, so is how it cooperates with the CPU.

Blitter descriptor ring

Cognizant of that, I added a hardware blitter descriptor engine in front of the blitter, controlled by a large descriptor ringbuffer held in hyperram... big enough for 1K descriptors. The CPU can then "fire and forget" by appending an 8-word descriptor per blit here and kicking the blitter unit when one or more descriptors have been updated. It will autonomously read the blit descriptors in order and perform them synchronously until it runs out of active descriptors, where it will wait idle to be kicked again to service more appended descriptors in the ringbuffer. This lets multiple different processes share the descriptor ringbuffer, lets it interleave blits between completely different source and / or destination planes without problems, and allows the CPU to append whole strings of font blits quickly to hyperram.

Because the descriptor is only 8 words, even for a tiny font like 8x8 px that's definitely < 1/8th of the CPU load compared to drawing it into memory from the CPU, and it has the advantage that the font can live in hyperram (on platforms like ESP8266, there is no RAM for it to live in since it's > 64KB). At the average 10 x 21 px font I plan to use, the CPU saving is 95% from writing the 8-word descriptor instead of drawing the font directly; considering we are usually updating a string instead of isolated characters, the benefits of being able to treat the whole string as "fire and forget" while it's drawn asynchronously to the CPU add up quickly.
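
The arithmetic behind those figures can be checked with a couple of lines. This counts only the 16-bit words written over SPI, one word per 565 pixel, and assumes the per-word cost is the same in both cases; it is a back-of-envelope estimate, not a measurement.

```python
DESCRIPTOR_WORDS = 8


def descriptor_write_ratio(glyph_w, glyph_h, descriptor_words=DESCRIPTOR_WORDS):
    """Words written for one descriptor, relative to drawing the glyph directly."""
    return descriptor_words / (glyph_w * glyph_h)


def cpu_saving(glyph_w, glyph_h):
    """Fraction of word writes avoided by queuing a descriptor instead."""
    return 1.0 - descriptor_write_ratio(glyph_w, glyph_h)
```

An 8x8 glyph is 64 words against 8 for the descriptor, and a 10 x 21 glyph is 210 words against 8, which is where the roughly 95% comes from.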

The last descriptor word being zero indicates that the descriptor is invalid, and causes the descriptor engine to stop when it fetches it, until "kicked" over SPI to look again and attempt to restart. Making the last word hold the validity attribute simplifies protecting against the hardware seeing half-written descriptors, since when the last word is written, the descriptor is both valid and completely written.
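
The write ordering can be modelled in a few lines of Python. The ring size and the 8-word descriptor with the validity word last come from the description above; the class and method names are hypothetical, the payload layout is left opaque, and having the engine zero the validity word once a descriptor has been executed is an assumption made so the model can reuse slots.

```python
WORDS_PER_DESC = 8
RING_ENTRIES = 1024
LAST = WORDS_PER_DESC - 1  # index of the validity word within a descriptor


class DescriptorRing:
    def __init__(self):
        self.words = [0] * (WORDS_PER_DESC * RING_ENTRIES)
        self.head = 0  # next slot the CPU fills
        self.tail = 0  # next slot the engine fetches

    def append(self, payload):
        """CPU side: write the 7 payload words, then the validity word last."""
        if len(payload) != LAST:
            raise ValueError("expected 7 payload words")
        base = self.head * WORDS_PER_DESC
        if self.words[base + LAST] != 0:
            raise RuntimeError("ring full")
        for i, word in enumerate(payload):
            self.words[base + i] = word
        self.words[base + LAST] = 1  # any nonzero value marks it valid
        self.head = (self.head + 1) % RING_ENTRIES

    def kick(self, execute):
        """Engine side: run descriptors in order until an invalid one is fetched."""
        done = 0
        while True:
            base = self.tail * WORDS_PER_DESC
            if self.words[base + LAST] == 0:
                return done  # idle until kicked again
            execute(self.words[base:base + LAST])
            self.words[base + LAST] = 0  # assumed: the engine retires the slot
            self.tail = (self.tail + 1) % RING_ENTRIES
            done += 1
```

If the engine fetches a slot whose payload is only partly written, the last word is still zero, so it simply stops there and picks the descriptor up on the next kick.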

Font

There's no problem using a proportional font with this scheme, although since the font glyph is simply copied into place the font cannot do dynamic pair kerning. However for many Sans type fonts, there are no overlapping serifs to make trouble.

The font is arranged into a bitmap with 16 characters per line, each character starting at the left of a 16 pixel space. So finding the start of a character is simple. The CPU has a small table of character widths it uses when advancing the X part of the descriptor, providing the proportional behaviour.
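
A sketch of that lookup and advance in Python follows. The 16-per-row layout and 16 pixel cell come from the description above; the cell height, the assumption that the bitmap starts at ASCII space, the tuple format and the function names are all hypothetical.

```python
CELL_W = 16         # each glyph sits at the left of a 16 px wide cell
CHARS_PER_ROW = 16  # 16 characters per line of the font bitmap
CELL_H = 21         # assumed glyph height
FIRST_CHAR = 32     # assumed: the bitmap starts at ASCII space


def glyph_origin(ch):
    """Top-left pixel of a character's cell inside the font bitmap."""
    index = ord(ch) - FIRST_CHAR
    return (index % CHARS_PER_ROW) * CELL_W, (index // CHARS_PER_ROW) * CELL_H


def layout_string(text, x, y, widths):
    """Return one (src_x, src_y, dst_x, dst_y, width, height) blit per character."""
    blits = []
    for ch in text:
        w = widths[ord(ch) - FIRST_CHAR]
        sx, sy = glyph_origin(ch)
        blits.append((sx, sy, x, y, w, CELL_H))
        x += w  # proportional advance: step by this glyph's own width
    return blits
```

Each tuple would then become one descriptor in the ring, so a whole string costs the CPU one table lookup and one descriptor write per character.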

The overlays are additive per primary, with clipping per primary if they overflow; there is no space in the FPGA for the multipliers needed for real alpha. The background largely being black to facilitate this additive scheme naturally leads to black being transparent in the upper overlays, since it's basically "adding 0" to each colour channel.

For this reason the font bitmap uses black as its background; the font bitmap itself is 565 though, so there is some "additive alpha blending" in the anti-aliased parts of the glyph. In addition coloured characters are possible in the font for status symbols, etc.
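
The additive scheme is easy to model per pixel. The sketch below adds two 565 values channel by channel and clips each channel at its maximum; the function names are hypothetical and it is only a software illustration of the arithmetic, not the FPGA logic.

```python
def add_565(a, b):
    """Add two RGB565 pixels per primary, clipping each channel at full scale."""
    r = min((a >> 11) + (b >> 11), 0x1F)
    g = min(((a >> 5) & 0x3F) + ((b >> 5) & 0x3F), 0x3F)
    bl = min((a & 0x1F) + (b & 0x1F), 0x1F)
    return (r << 11) | (g << 5) | bl


def compose(planes):
    """Combine one pixel from each overlay plane, bottom plane first."""
    out = 0
    for pixel in planes:
        out = add_565(out, pixel)
    return out
```

Black (0x0000) adds nothing to any channel, which is why it behaves as transparent, and a half-intensity anti-aliased edge pixel simply brightens whatever is underneath it.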

What we learned this time

The 160x128 LCD panels are unexpectedly capable, even with SPI and no TE / VSYNC, if given some hardware support
By adding the right hardware, the limitations people expect of inexpensive SPI panels can be transcended.
We can once again leverage the hyperram to get a quite sophisticated video unit running in about 20% of the FPGA. That includes 3 x overlay mixing, associated FIFOs and a blitter
Three overlay layers with additive alpha blending simplifies object update by removing the need to care about background handling from software. You can choose the effective Z order when you choose which overlay plane to blit to
The blitter is well matched to the CPU via a descriptor ringbuffer, so it can handle 1024-deep font glyph rendering in one step, at the cost of writing an 8-word descriptor per transaction, independent of the bitmap dimensions
We can do antialiased, proportional fonts simply and rapidly, in multiple sizes
Updated video streams to the panel in 565 format, over SPI, at 30fps