Libwebsockets Lightweight Embedded JSON Stream Parser
Overview
- CMake option: LWS_WITH_LEJP (default ON)
- Public header: include/libwebsockets/lws-lejp.h (included by libwebsockets.h)
- Implementation: ./lib/misc/lejp.c
- Helper app: ./test-apps/test-lejp.c
- Example: ./lib/secure-streams/policy.c
Introduction
JSON is deservedly very popular, and there are a lot of JSON parsers to choose from.
With JSON, one trades off minimizing the representation of the data for readability and extensibility. Often this tradeoff is gratefully accepted by programmers with tears in their eyes, compared to having to deal with, log, debug and maintain some proprietary binary coding directly. If you bloated some binary coding from 4 bytes sent very infrequently to 400 bytes, but you or the next guys are going to be able to understand, debug and extend that 20 years from now easily, plus casually read the logs and understand who was trying to do what, it’s a great deal.
But depending on what you’re doing, while you may be able to accept increasing the size of the data transferred accordingly, you may not have the memory at the receiving side to even store all the JSON in one place at one time, let alone instantiate parser objects for what might be a deep hierarchy.
Most JSON parser libraries require the JSON to be in one linear array and then transform it into objects which can be walked by the user code to transform it once again, before destroying the parser objects. In this model, the existence of the JSON all in one place and the generation of the JSON object model overlap, meaning the peak heap usage is both together… you can free the JSON after creating the model, but then you are creating your user representation in heap too.
The JSON parsing action can’t commence until all the JSON has been received and is in one place, and the transformation into the parser object model has happened.
In addition, many implementations recurse, with a potentially large impact on stack usage.
lws LEJP model
lws offers a JSON stream parser with insanely amazing characteristics compared to “the usual”.
- LEJP is a stateful stream parser: it processes data as it arrives, so it does not need all the JSON in one place at one time, but processes and discards each chunk as it becomes available. The “chunk” size can be as low as 1 byte, so it is completely immune to fragmentation issues.
- It does not allocate any heap at all, and has a fixed-size parsing context object of around 560 bytes on a 32-bit machine that exists while parsing is ongoing. As JSON objects are parsed, a user callback is informed and can produce the related user objects directly. Peak heap is drastically reduced compared to having a JSON parser model in memory while the user’s object is created, because there is no JSON parser model.
- It does not recurse on the stack at all; it manages its own parsing stack inside the LEJP parsing context object.
- It handles floats as a type of string, i.e., it does not bring float types into the picture itself; your user code can choose whether to use floating point or, e.g., fractional scaling using integers.
- Long strings are “chunked” into a 254-byte buffer (already allocated in the parsing context object) and passed to user code together with information about whether the chunk is the beginning and / or end of the string. So huge strings are supported cleanly, without huge buffers or needing the whole string in one place at one time.
- The code size for all this generic functionality on 32-bit ARM is 2.1KB!
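Because non-integer numbers reach user code as strings, the choice of numeric representation stays with the caller. As a minimal sketch of the “fractional scaling using integers” option, the hypothetical helper `parse_milli()` below (not part of lws) converts a decimal string into thousandths using integer arithmetic only. It assumes plain decimal notation, truncates extra precision, and does not guard against overflow.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of lws: convert a decimal string such as
 * "12.345" into an integer count of thousandths (12345) without using any
 * floating point.  Digits beyond the third decimal place are truncated and
 * exponents ("1e3") are rejected.  Returns 0 on success, -1 on bad input.
 */
static int
parse_milli(const char *s, size_t len, int64_t *out)
{
	int64_t v = 0;
	int neg = 0, frac = -1, digits = 0;	/* frac == -1: no '.' seen yet */
	size_t i = 0;

	if (len && s[0] == '-') {
		neg = 1;
		i = 1;
	}

	for (; i < len; i++) {
		if (s[i] == '.') {
			if (frac >= 0)
				return -1;	/* second decimal point */
			frac = 0;
			continue;
		}
		if (s[i] < '0' || s[i] > '9')
			return -1;		/* not a plain decimal */
		digits++;
		if (frac >= 3)
			continue;		/* truncate extra precision */
		v = v * 10 + (s[i] - '0');
		if (frac >= 0)
			frac++;
	}

	if (!digits)
		return -1;

	/* pad up to exactly three fractional digits */
	for (i = (size_t)(frac < 0 ? 0 : frac); i < 3; i++)
		v *= 10;

	*out = neg ? -v : v;

	return 0;
}
```

A callback that receives the string "12.345" could pass it and its length to `parse_milli()` and store 12345 in an integer field, keeping floating point out of the build entirely.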
Understanding the parsing model
lws includes a helpful test app for LEJP that’s built and installed with lws when LEJP is
enabled in CMake. This allows you to parse arbitrary strings from stdin and decompose
them to LEJP’s parsing events and paths, so you can see the correct paths to “program”
the lejp context with for your schema. For example, with this in /tmp/my.json:
{
"schema":"xxx",
"uid":1004,
"len":194,
"timestamp":1641458307868,
"channel":2,
"finished":0,
"task_uuid":"2a31db22f1180d77734ccaee5af18472c733bce4078405eadc8568d8173eb855"
}
You can find out the paths and events that lejp will use to parse it with the test tool
$ cat /tmp/my.json | libwebsockets-test-lejp
[2020/03/07 10:55:49:3890] N: libwebsockets-test-lejp (C) 2017 - 2018 andy@warmcat.com
[2020/03/07 10:55:49:3891] N: usage: cat my.json | libwebsockets-test-lejp
[2020/03/07 10:55:49:3891] N: LEJPCB_CONSTRUCTED: path match 0 statckp 0
[2020/03/07 10:55:49:3891] N: LEJPCB_START: path match 0 statckp 0
[2020/03/07 10:55:49:3892] N: LEJPCB_OBJECT_START: path match 0 statckp 0
[2020/03/07 10:55:49:3892] N: path: 'schema' (LEJPCB_PAIR_NAME)
[2020/03/07 10:55:49:3892] N: LEJPCB_PAIR_NAME: path schema match 0 statckp 6
[2020/03/07 10:55:49:3892] N: LEJPCB_VAL_STR_START: path schema match 0 statckp 6
[2020/03/07 10:55:49:3892] N: value 'xxx' (LEJPCB_VAL_STR_END)
[2020/03/07 10:55:49:3892] N: path: 'uid' (LEJPCB_PAIR_NAME)
[2020/03/07 10:55:49:3892] N: LEJPCB_PAIR_NAME: path uid match 0 statckp 3
[2020/03/07 10:55:49:3892] N: value '1004' (LEJPCB_VAL_NUM_INT)
[2020/03/07 10:55:49:3892] N: path: 'len' (LEJPCB_PAIR_NAME)
[2020/03/07 10:55:49:3893] N: LEJPCB_PAIR_NAME: path len match 0 statckp 3
[2020/03/07 10:55:49:3893] N: value '194' (LEJPCB_VAL_NUM_INT)
[2020/03/07 10:55:49:3893] N: path: 'timestamp' (LEJPCB_PAIR_NAME)
[2020/03/07 10:55:49:3893] N: LEJPCB_PAIR_NAME: path timestamp match 0 statckp 9
[2020/03/07 10:55:49:3893] N: value '1641458307868' (LEJPCB_VAL_NUM_INT)
[2020/03/07 10:55:49:3893] N: path: 'channel' (LEJPCB_PAIR_NAME)
[2020/03/07 10:55:49:3893] N: LEJPCB_PAIR_NAME: path channel match 0 statckp 7
[2020/03/07 10:55:49:3893] N: value '2' (LEJPCB_VAL_NUM_INT)
[2020/03/07 10:55:49:3893] N: path: 'finished' (LEJPCB_PAIR_NAME)
[2020/03/07 10:55:49:3893] N: LEJPCB_PAIR_NAME: path finished match 0 statckp 8
[2020/03/07 10:55:49:3894] N: value '0' (LEJPCB_VAL_NUM_INT)
[2020/03/07 10:55:49:3894] N: path: 'task_uuid' (LEJPCB_PAIR_NAME)
[2020/03/07 10:55:49:3894] N: LEJPCB_PAIR_NAME: path task_uuid match 0 statckp 9
[2020/03/07 10:55:49:3895] N: LEJPCB_VAL_STR_START: path task_uuid match 0 statckp 9
[2020/03/07 10:55:49:3895] N: value '2a31db22f1180d77734ccaee5af18472c733bce4078405eadc8568d8173eb855' (LEJPCB_VAL_STR_END)
[2020/03/07 10:55:49:3895] N: LEJPCB_OBJECT_END: path task_uuid match 0 statckp 9
[2020/03/07 10:55:49:3895] N: Parsing Completed (LEJPCB_COMPLETE)
[2020/03/07 10:55:49:3895] N: LEJPCB_COMPLETE: path task_uuid match 0 statckp 9
[2020/03/07 10:55:49:3896] N: okay
[2020/03/07 10:55:49:3896] N: LEJPCB_DESTRUCTED: path task_uuid match 0 statckp 9
Setting up the lejp context
There are three pieces to the puzzle. First, specify the paths you want to identify easily from your callback. These are matched before the callback gets called and reduced to a single uint8_t in ctx.path_match, so if there are just some patterns you are interested in, you can list them here. A path_match of 0 means no match, and a value of 1 or more means that paths[path_match - 1] matched.
static const char * const paths[] = {
	"release",
	"product",
	"schema-version",
	"via-socks5",
	"retry[].*.backoff",
	"retry[].*.conceal",
	"retry[].*.jitterpc",
	...
};

/* one entry per path, in the same order as paths[] */
enum {
	LSSPPT_RELEASE,
	LSSPPT_PRODUCT,
	LSSPPT_SCHEMA_VERSION,
	LSSPPT_VIA_SOCKS5,
	LSSPPT_BACKOFF,
	LSSPPT_CONCEAL,
	LSSPPT_JITTERPC,
	...
};
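Patterns like name[].*.entry follow the path scheme lejp uses to track its path during parsing. You can feed the test app your JSON and watch what is happening in ctx.path to see how it works. The enum lists the same entries in the same order as paths, so path_match - 1 can be compared directly against its values.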
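The “0 means no match, otherwise index + 1” convention can be sketched without lws. The matcher below is a simplified illustration and not lejp’s implementation: it assumes `*` stands for a single path element. `path_matches()`, `find_path_match()`, `demo_paths` and the `DEMO_*` names are all hypothetical, and "retry[].default.backoff" is only an assumed example of a concrete path.

```c
#include <stddef.h>

/* Hypothetical path list and matching enum, in the same order */
static const char * const demo_paths[] = {
	"release",
	"product",
	"retry[].*.backoff",
};

enum { DEMO_RELEASE, DEMO_PRODUCT, DEMO_BACKOFF };

/*
 * Simplified illustration, NOT lejp's code: '*' consumes characters of the
 * concrete path up to the next '.' or '[', or everything that is left if
 * '*' is the last character of the pattern.  Returns 1 on match, else 0.
 */
static int
path_matches(const char *pattern, const char *path)
{
	const char *q = pattern, *p = path;

	while (*p || *q) {
		if (*q != '*') {
			if (*p != *q)
				return 0;	/* literal mismatch */
			p++;
			q++;
			continue;
		}
		q++;				/* step over the '*' */
		while (*p && ((*p != '.' && *p != '[') || !*q))
			p++;
	}

	return 1;				/* both fully consumed */
}

/* Returns 0 for no match, or (index into paths) + 1, like ctx.path_match */
static unsigned char
find_path_match(const char * const *paths, size_t count, const char *path)
{
	size_t n;

	for (n = 0; n < count; n++)
		if (path_matches(paths[n], path))
			return (unsigned char)(n + 1);

	return 0;
}
```

With these definitions, looking up "product" yields DEMO_PRODUCT + 1, and a path that matches none of the patterns yields 0, so user code can switch on the result minus one.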
Second, specify the callback that handles parsing events from lejp:
static signed char
cb(struct lejp_ctx *ctx, char reason)
{
	...
	return 0;
}
reason is one of enum lejp_callbacks, describing the reason for the callback.
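A typical callback filters on path_match and then acts on the reason. The sketch below shows one way to accumulate a chunked string value into a bounded user buffer. So that it builds without libwebsockets, `struct demo_ctx`, the `DEMO_STR_*` codes, `struct task` and `demo_cb()` are hypothetical stand-ins; real code would use struct lejp_ctx and the LEJPCB_VAL_STR_START, LEJPCB_VAL_STR_CHUNK and LEJPCB_VAL_STR_END reasons, and the field names here only mimic what such a context provides.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins so the sketch builds without libwebsockets */
enum demo_reason { DEMO_STR_START, DEMO_STR_CHUNK, DEMO_STR_END };

struct demo_ctx {
	void *user;			/* opaque pointer owned by user code */
	const char *buf;		/* the current string chunk */
	size_t npos;			/* number of valid bytes in buf */
	unsigned char path_match;	/* 0 = no match, else index + 1 */
};

/* The user object filled in directly from the callback */
struct task {
	char uuid[65];			/* 64 hex characters + NUL */
	size_t used;
};

static signed char
demo_cb(struct demo_ctx *ctx, char reason)
{
	struct task *t = (struct task *)ctx->user;

	if (ctx->path_match != 1)	/* only the first listed path */
		return 0;

	switch (reason) {
	case DEMO_STR_START:
		t->used = 0;		/* a new string value begins */
		break;
	case DEMO_STR_CHUNK:
	case DEMO_STR_END:
		if (t->used + ctx->npos >= sizeof(t->uuid))
			return -1;	/* too long: report failure */
		memcpy(t->uuid + t->used, ctx->buf, ctx->npos);
		t->used += ctx->npos;
		t->uuid[t->used] = '\0';
		break;
	}

	return 0;
}
```

The user object never needs more storage than its own fields, however the incoming string happens to be split.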
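Third, construct the context, feed it the JSON as it arrives, and destruct it when you are done:
struct lejp_ctx ctx;
int m;

lejp_construct(&ctx, cb, NULL, paths, LWS_ARRAY_SIZE(paths));
/* call lejp_parse() for each chunk received; buf holds n bytes of JSON */
m = lejp_parse(&ctx, (uint8_t *)buf, n);
if (m < 0 && m != LEJP_CONTINUE) {
	/* a negative result other than LEJP_CONTINUE is a parsing error */
	...
}
lejp_destruct(&ctx);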
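The feed loop implied above can be sketched in a self-contained way. Everything in this block is a hypothetical stand-in rather than lws API: `toy_parse()` plays the role of lejp_parse() but only tracks nesting, and `DEMO_CONTINUE` plays the role of LEJP_CONTINUE. The point is that all state lives in a fixed-size context, so the input can be split at any byte boundary.

```c
#include <stddef.h>

#define DEMO_CONTINUE	(-1)	/* stand-in: parser wants more input */
#define DEMO_ERROR	(-2)	/* stand-in: any other negative result */

/* All parser state lives here; zero-initialize it before first use */
struct toy_ctx {
	int depth;	/* current {} / [] nesting level */
	int in_string;	/* currently inside "..." */
	int escaped;	/* previous character was a backslash in a string */
};

/* Toy stand-in for a stream parser: tracks nesting only, no heap, no recursion */
static int
toy_parse(struct toy_ctx *c, const unsigned char *buf, int len)
{
	int i;

	for (i = 0; i < len; i++) {
		unsigned char ch = buf[i];

		if (c->in_string) {
			if (c->escaped)
				c->escaped = 0;
			else if (ch == '\\')
				c->escaped = 1;
			else if (ch == '"')
				c->in_string = 0;
			continue;
		}
		if (ch == '"') {
			c->in_string = 1;
		} else if (ch == '{' || ch == '[') {
			c->depth++;
		} else if (ch == '}' || ch == ']') {
			if (--c->depth < 0)
				return DEMO_ERROR;	/* unbalanced close */
			if (!c->depth)
				return len - i - 1;	/* done: bytes left unused */
		}
	}

	return DEMO_CONTINUE;
}

/* Feed json to the parser in pieces of at most chunk bytes */
static int
feed_in_chunks(struct toy_ctx *c, const char *json, int len, int chunk)
{
	int pos = 0, m = DEMO_CONTINUE;

	while (pos < len && m == DEMO_CONTINUE) {
		int n = len - pos < chunk ? len - pos : chunk;

		m = toy_parse(c, (const unsigned char *)json + pos, n);
		pos += n;
	}

	return m;
}
```

Feeding the same document one byte at a time or in larger pieces gives the same result, and a truncated document leaves the loop with the “continue” code, so the caller knows more input is expected.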
What did we learn this time?
- You can have a full-featured JSON parser in a couple of KB suitable for a microcontroller
- It handles subobjects, arrays of objects, huge strings, floats-as-strings etc
- It doesn’t use any heap, and doesn’t recurse
- It’s stateful and can process the JSON in arbitrary chunks as it comes in