Libwebsockets string processing helpers
Overview
- CMake option: part of core lws
- Public header: include/libwebsockets/lws-misc.h (included by libwebsockets.h)
- Public header: include/libwebsockets/lws-purify.h (included by libwebsockets.h)
- Public header: include/libwebsockets/lws-tokenize.h (included by libwebsockets.h)
- Implementation: ./lib/core/libwebsockets.c
- API unit tests: ./minimal-examples/api-tests/api-test-lws_tokenize
Introduction
Writing string processing code in C is a dangerous and unrewarding occupation: you always have to be mindful of how the code will respond to malicious input. To operate well on machines with very limited resources, without limiting the size of input you can handle, it’s sometimes desirable to operate on C strings in memory, but at other times it’s desirable to operate in a stateful way on partial blocks of input without restriction, ie, with complete immunity to any fragmentation attack, because the code is designed for that case from the start.
Libwebsockets has a number of important safe helpers, like wrappers on strncpy() and snprintf() that guarantee there won’t be overflows and that there will always be a terminating NUL, as well as more complex helpers… let’s look at the simple ones first.
lws_strncpy()
You might be surprised to learn that libc strncpy() fails at being safe: it can return without having applied a NUL at the end. lws_strncpy() is an alternative that will always safely truncate at the limit.

lws_strncpy() adjusts the size limit back by 1 to always make space for the NUL. For that reason, you can safely just feed it sizeof(dest) as the length limit.
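For instance, a minimal sketch of copying untrusted input into a fixed-size buffer (the variable names and input here are illustrative, not part of lws):

const char *user_input = "some externally-provided string that may be too long";
char dest[16];

/* dest is always NUL-terminated; the copy is truncated to fit if needed */
lws_strncpy(dest, user_input, sizeof(dest));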
lws_strnncpy()
This is a variation of strncpy() that takes two limit numbers at the end; the lower of the two is used to restrict or truncate the copy, and a NUL is always applied at the end of the destination. This is useful when you want to copy a string where you know the length of the source string, but there is no terminating NUL in the source. With this you can do, eg,

lws_strnncpy(dest, src_no_NUL, srclen, destlen);

and under all conditions end up with a NUL-terminated, possibly truncated copy of the string in dest. This is used widely in lws to cover for the fact that some platforms do not provide the “%.*s” format string option, which allows printing non-NUL-delimited strings of a given length.

destlen is corrected back by 1 to allow for the NUL at the end, so again it’s safe to set this to sizeof(dest).
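For instance, a minimal sketch of copying a length-delimited field out of a larger, non-NUL-terminated buffer (the buffer and lengths are illustrative):

/* a 5-byte field inside a larger buffer with no terminating NUL */
const char buf[] = { 'h', 'e', 'l', 'l', 'o', ',', 'x' };
char dest[64];

/* copies at most 5 bytes, truncates to fit dest, always NUL-terminates */
lws_strnncpy(dest, buf, 5, sizeof(dest));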
lws_snprintf()
This is the swiss army knife of string generation: it’s a safe version of snprintf(). It uses the platform vsnprintf() but guarantees NUL termination of the destination. It’s very convenient for additively composing onto a large string, where the return value, which is the length actually written to the destination, is used to advance where the next lws_snprintf() will write to.

If we have already reached the end of the destination, it returns 0; at the end of the sequence, the resulting string length can be compared to the destination size to discover whether we “crumpled up at the end”. But it won’t crash or write past the end of the destination in the meanwhile.
char dest[1234];
size_t n = 0;
n += lws_snprintf(dest + n, sizeof(dest) - n, "%s etc", etc);
n += lws_snprintf(dest + n, sizeof(dest) - n, "%s etc", etc);
n += lws_snprintf(dest + n, sizeof(dest) - n, "%s etc", etc);
if (n >= sizeof(dest) - 1)
/* truncated */
lws_purify_
Lws provides “purification” helpers for arguments that will be expressed in JSON, sqlite, or filenames that include externally-provided input. The sqlite and JSON versions use escaping, and their arguments are formatted like strncpy(), ie, dest, source, dest len.

lws_filename_purify_inplace(), as the name suggests, purifies the filename in place, by replacing scary characters like .. or / in the filename with _.

Notice that in the worst case, JSON escaping can turn one character into six, and if the input is an attack the whole string may consist of such characters, so you should allow for the destination being 6x the size of the input.
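For instance, a sketch along those lines; the buffers, sizes and inputs are illustrative, and the exact prototypes should be checked against lws-purify.h:

/* externally-provided values we want to use safely */
const char *user = "O'Brien \"special\" name";
char esc[96];                       /* sized generously for escaping */
char fname[] = "../../etc/passwd";  /* attacker-influenced filename */

/* escape for embedding in sqlite statements: dest, source, dest len */
lws_sql_purify(esc, user, sizeof(esc));

/* neutralize path tricks in place: scary sequences like .. or / become _ */
lws_filename_purify_inplace(fname);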
lws_tokenize
Overview
Given a UTF-8 string, lws_tokenize robustly and consistently separates it into syntactical units, following flags given to it that control how ambiguous things should be understood. By default, the only contiguous characters allowed in a token are alphanumerics and _, but the flags can modify that.

Whitespace outside of quoted strings is swallowed by the parser, so it is immune to different behaviours based on different types or amounts of whitespace between tokens. It silently consumes whitespace where it’s valid and just reports the delimiter or token abutting it. Similarly, if comments are enabled, the comments are silently and wholly swallowed.

lws_tokenize is designed to operate on an all-in-memory chunk that typically covers “one line”; using it chunked is possible, but code outside lws_tokenize must collect enough chars to cover whole tokens first, whatever that means for your use-case.

Its use-cases cover decoding small or large strings easily and robustly, where lws_tokenize has already taken care of syntax-level error checking, like the correctness of comma-separated lists or float format, and rejected nonsense… user code just has to look at the flow of tokens and delimiters and decide if that’s valid for its purpose. For example, lws_tokenize is helpful for decoding header content where the header has some structure but is otherwise quite free-form… it’s difficult for user code to parse that from scratch without missing some validation or introducing bugs, but much easier to deal with a stream of tokens and delimiters whose syntax has already been restricted.

lws_tokenize also covers complex usage like parsing config files robustly, including comments.
Token types
Type | Description |
---|---|
LWS_TOKZE_ENDED | We found a NUL and parsing has completed successfully |
LWS_TOKZE_DELIMITER | Some character that can’t be in a token appeared, like , |
LWS_TOKZE_TOKEN | A token appeared, like my_token; this is reported as a unit |
LWS_TOKZE_INTEGER | A token that seems to be an integer appeared, like 1234 |
LWS_TOKZE_FLOAT | A token that seems to be a float appeared, like 1.234 |
LWS_TOKZE_TOKEN_NAME_EQUALS | A token followed by = appeared |
LWS_TOKZE_TOKEN_NAME_COLON | A token followed by : appeared (only if the LWS_TOKENIZE_F_AGG_COLON flag is enabled) |
LWS_TOKZE_QUOTED_STRING | A quoted string appeared, like "my,s:t=ring" |
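To make the token types concrete, with default flags an input like a=1, hello would be reported as a sequence along these lines (a sketch of the expected flow, not captured tool output):

a=1, hello

LWS_TOKZE_TOKEN_NAME_EQUALS  "a"
LWS_TOKZE_INTEGER            "1"
LWS_TOKZE_DELIMITER          ","
LWS_TOKZE_TOKEN              "hello"
LWS_TOKZE_ENDED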
Parsing Errors
Error | Description |
---|---|
LWS_TOKZE_ERR_COMMA_LIST | We were told to expect a comma-separated list, but we saw things like “,tok” or “tok,,” |
LWS_TOKZE_ERR_NUM_ON_LHS | We encountered nonsense like 123= |
LWS_TOKZE_ERR_MALFORMED_FLOAT | We saw a floating point number with nonsense, like “1..3” or “1.2.3” (float parsing can be disabled by flag) |
LWS_TOKZE_ERR_UNTERM_STRING | We saw a " and started parsing a quoted string, but the string ended before the close quote |
LWS_TOKZE_ERR_BROKEN_UTF8 | We encountered a UTF-8 sequence that is invalid |
Parser modification flags
There are many different conventions for tokenizing depending on what you’re doing… the default is restrictive, in that only alphanumerics and _ can be in a token, but for different cases you will want to modify this. There are several flags allowing selection of a suitable parsing regime for what you’re doing.
Flag | Meaning |
---|---|
LWS_TOKENIZE_F_MINUS_NONTERM | treat - as part of a token, so my-token is reported as one token, not my - token |
LWS_TOKENIZE_F_AGG_COLON | Report token: or token : as the special token type LWS_TOKZE_TOKEN_NAME_COLON, instead of a token followed by a : delimiter |
LWS_TOKENIZE_F_COMMA_SEP_LIST | Enforce comma-separated list syntax, eg “a”, or “a, b” but not “,a” or “a, b,” |
LWS_TOKENIZE_F_RFC7230_DELIMS | Allow more characters in a token, following HTTP (RFC 7230) conventions |
LWS_TOKENIZE_F_DOT_NONTERM | Allows, eg, “warmcat.com” to be treated as one token |
LWS_TOKENIZE_F_NO_FLOATS | This allows you to process, eg, “192.168.0.1” as a token instead of a floating point format error |
LWS_TOKENIZE_F_NO_INTEGERS | Don’t treat strings consisting of numbers as integers, just report them as a string token |
LWS_TOKENIZE_F_HASH_COMMENT | Take a # on the line as meaning the rest of the line is a comment |
LWS_TOKENIZE_F_SLASH_NONTERM | Allow / inside string tokens, so multipart/related is a single token |
Typical usage
{
        struct lws_tokenize ts;
        const char *str = "mytoken1, mytoken, my-token";

        lws_tokenize_init(&ts, str, LWS_TOKENIZE_F_NO_INTEGERS |
                                    LWS_TOKENIZE_F_MINUS_NONTERM);
        do {
                ts.e = lws_tokenize(&ts);
                switch (ts.e) {
                case LWS_TOKZE_TOKEN:
                        /* token is in ts.token, length ts.token_len */
                        break;
                case LWS_TOKZE_DELIMITER:
                        /* delimiter character is in ts.token[0] */
                        break;
                case LWS_TOKZE_ENDED:
                        /* reached end of string and tokenizer had no objections */
                        break;
                default:
                        /* a negative ts.e is one of the parsing errors above */
                        break;
                }
        } while (ts.e > 0); /* stop at LWS_TOKZE_ENDED (0) or an error (< 0) */
}
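The same loop shape also suits config-file style input once comment handling is enabled; a minimal sketch, using a hypothetical config line and only the flags documented above:

struct lws_tokenize ts;

lws_tokenize_init(&ts, "listen-port=443 # tls listener",
                  LWS_TOKENIZE_F_MINUS_NONTERM |
                  LWS_TOKENIZE_F_HASH_COMMENT);

/*
 * running the same do / while loop as above should then report
 * LWS_TOKZE_TOKEN_NAME_EQUALS for "listen-port", LWS_TOKZE_INTEGER for
 * 443, and LWS_TOKZE_ENDED; the # comment is silently swallowed
 */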
lws_strexp
Overview
lws_strexp implements generic streaming, stateful string expansion for embedded symbols like ${mysymbol}, in an input of unlimited size chunked to arbitrary sizes for both input and output. It doesn’t deal with the symbols itself, but passes each instance of a symbol name that needs substitution to a user-provided callback as it is found; the callback privately looks up the symbol and emits the substituted data inline.

Neither the input nor the output needs to be all in one place at one time, and either can be arbitrarily fragmented down to single-byte buffers safely, so this API is immune to fragmentation-type attacks. Any size of input and output can be processed without using any heap other than a ~64-byte context object and the input and output chunk buffers; depending on what you’re doing, all of these can be on the stack.
expansion api return | meaning |
---|---|
LSTRX_DONE | We reached the end OK |
LSTRX_FILLED_OUT | We filled up the output buffer; once you have drained it, call again to continue |
LSTRX_FATAL_NAME_TOO_LONG | Met a name longer than 31 chars |
LSTRX_FATAL_NAME_UNKNOWN | Callback reported it doesn’t know the symbol name |
Example usage
The symbol substitution callback should look like this, so that it can deal with arbitrary output chunking:
int
exp_cb1(void *priv, const char *name, char *out, size_t *pos, size_t olen,
size_t *exp_ofs)
{
const char *replace = NULL;
size_t total, budget;
if (!strcmp(name, "test")) {
replace = "replacement_string";
total = strlen(replace);
goto expand;
}
return LSTRX_FATAL_NAME_UNKNOWN;
expand:
budget = olen - *pos;
total -= *exp_ofs;
if (total < budget)
budget = total;
memcpy(out + *pos, replace + (*exp_ofs), budget);
*exp_ofs += budget;
*pos += budget;
if (budget == total)
return LSTRX_DONE;
return LSTRX_FILLED_OUT;
}
… and performing the substitution…
const char *in = "hello ${test} world";  /* example input string */
size_t in_len = strlen(in), used_in, used_out;
lws_strexp_t exp;
char obuf[128];
int n;
lws_strexp_init(&exp, NULL, exp_cb1, obuf, sizeof(obuf));
/* for large input, you would do this in a loop */
n = lws_strexp_expand(&exp, in, in_len, &used_in, &used_out);
if (n != LSTRX_DONE) {
lwsl_err("%s: lws_strexp failed: %d\n", __func__, n);
return 1;
}
What did we learn this time?
If you deal with strings that have internal structure, C can require a lot of code that is unforgiving of security issues, and difficult to rearrange after it’s written, or to extend without creating a rat’s nest.

- The tokenizer provides your code with robust, well-formed tokens and delimiters, and hides details like whitespace and, if selected, comma-separated list sequencing.
- You can configure it at runtime for a wide variety of situations.
- You can very easily deploy ${symbol} string substitution without needing the input or output in one place at one time, even if the substitution is huge.