Overview

Introduction

Writing string processing in C is a dangerous and unrewarding occupation: you always have to be mindful of how the code will respond to malicious input. To operate well on machines with very limited resources, without limiting the size of the input you can handle, sometimes it’s desirable to operate on whole C strings in memory, but other times it’s desirable to operate statefully on partial blocks of input without restriction, ie, with complete immunity to fragmentation attacks because the code is designed for that case from the start.

Libwebsockets has a number of important safe helpers, like wrappers on strncpy() and snprintf() that guarantee there won’t be an overflow and the result will be NUL-terminated, as well as more complex helpers… let’s look at the simple ones first.

lws_strncpy()

You might be surprised to learn that libc strncpy() fails at being safe: it can return without having applied a NUL at the end of the destination. lws_strncpy() is an alternative that always NUL-terminates, safely truncating at the limit if necessary.

lws_strncpy() adjusts the size limit back by 1 to always leave space for the NUL. For that reason, you can safely just feed it sizeof(dest) as the length limit.
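
A minimal sketch of typical usage (the buffer size and source string here are just illustrative):

        char dest[16];

        /* copies at most sizeof(dest) - 1 bytes and always NUL-terminates */
        lws_strncpy(dest, "a source string that may not fit", sizeof(dest));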

lws_strnncpy()

This is a variation of strncpy() that takes two limit numbers at the end; the lower of the two is used to restrict or truncate the copy, and a NUL is always applied at the end of the destination. This is useful when you want to copy a string where you know the length of the source, but there is no terminating NUL in the source. With this you can do, eg,

        lws_strnncpy(dest, src_no_NUL, srclen, destlen);

and under all conditions end up with a NUL-terminated, possibly truncated copy of the string in dest. This is used widely in lws to cover for the fact that some platforms do not provide the “%.*s” format string option that allows printing non-NUL-delimited strings where you give the length.

destlen is corrected back by 1 to allow for the NUL at the end, so again it’s safe to set this to sizeof(dest).
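
For example, on a platform lacking “%.*s”, a length-delimited fragment can still be printed via a bounded copy… frag and frag_len here are hypothetical names for a non-NUL-terminated string and its length:

        char tmp[128];

        /* the equivalent of printf("%.*s", (int)frag_len, frag) */
        lws_strnncpy(tmp, frag, frag_len, sizeof(tmp));
        printf("%s\n", tmp);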

lws_snprintf()

This is the swiss army knife of string generation: a safe version of snprintf(). It uses the platform vsnprintf() but guarantees NUL termination of the destination. It’s very convenient for additively composing onto a large string, where the return value, the length written to the destination, is used to advance where the next lws_snprintf() will write.

In the event we already reached the end of the destination, it will return 0; at the end of the sequence, the result string length can be compared to the destination size to discover if we “crumpled up at the end”. But it won’t crash or blow past the end of the destination in the meanwhile.

        char dest[1234];
        const char *etc = "some string"; /* illustrative source */
        size_t n = 0;

        n += lws_snprintf(dest + n, sizeof(dest) - n, "%s etc", etc);
        n += lws_snprintf(dest + n, sizeof(dest) - n, "%s etc", etc);
        n += lws_snprintf(dest + n, sizeof(dest) - n, "%s etc", etc);

        if (n >= sizeof(dest) - 1) {
            /* we ran out of room and the output was truncated */
        }

lws_*_purify helpers

Lws provides “purification” helpers for arguments that will be expressed in JSON, sqlite, or filenames, and may include externally-provided input. The sqlite and JSON versions escape into a separate buffer; their arguments are ordered like strncpy()’s, ie, dest, source, dest len.

lws_filename_purify_inplace(), as the name suggests, purifies the filename in place, by replacing scary characters like .. or / in the filename with _.

Notice that in the worst case, JSON escaping can turn one character into six, and the whole string may consist of such characters if it’s an attack. So you should allow for the destination being 6x the size of the input.
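
A sketch of typical usage, where user_input is a hypothetical externally-provided string:

        char esc[256], fname[] = "../../etc/passwd";

        /* escape user_input for safe inclusion in a sqlite statement */
        lws_sql_purify(esc, user_input, sizeof(esc));

        /* scary path characters are replaced with _, in place */
        lws_filename_purify_inplace(fname);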

lws_tokenize Overview

Given a UTF-8 string, lws_tokenize robustly and consistently separates it into syntactical units, following flags given to it that control how ambiguous things should be understood. By default, the characters allowed in a token are only alphanumerics and _, but the flags can modify that.

Whitespace outside of quoted strings is swallowed by the parser, so it is immune to behaving differently based on the type or amount of whitespace between tokens. It silently consumes whitespace where it’s valid and just reports the delimiter or token abutting it. Similarly, if comments are enabled, they are silently and wholly swallowed.

lws_tokenize is designed to operate on an all-in-memory chunk that typically covers “one line”. Using it chunked is possible, but code outside lws_tokenize must first collect enough characters to cover whole tokens, whatever that means for your use-case.

Its use-cases cover decoding small or large strings easily and robustly, where lws_tokenize has already taken care of syntax-level error checking like correctness of comma-separated lists or float format, and rejected nonsense… user code just has to look at the flow of tokens and delimiters and decide if that’s valid for its purpose. For example, lws_tokenize is helpful for decoding header content where the header has some structure but is otherwise quite free-form… it’s difficult for user code to parse that from scratch without missing some validation or introducing bugs, but much easier to deal with a stream of tokens and delimiters that has already restricted the syntax.

lws_tokenize also covers complex usage like parsing config files robustly, including comments.

Token types

LWS_TOKZE_ENDED: we found a NUL and parsing completed successfully
LWS_TOKZE_DELIMITER: some character that can’t be in a token appeared, like ,
LWS_TOKZE_TOKEN: a token appeared, like my_token; this is reported as one unit
LWS_TOKZE_INTEGER: a token that seems to be an integer appeared, like 1234
LWS_TOKZE_FLOAT: a token that seems to be a float appeared, like 1.234
LWS_TOKZE_TOKEN_NAME_EQUALS: a token followed by = appeared
LWS_TOKZE_TOKEN_NAME_COLON: a token followed by : appeared (only if the LWS_TOKENIZE_F_AGG_COLON flag is enabled)
LWS_TOKZE_QUOTED_STRING: a quoted string appeared, like "my,s:t=ring"

Parsing Errors

LWS_TOKZE_ERR_COMMA_LIST: we were told to expect a comma-separated list, but saw things like “,tok” or “tok,,”
LWS_TOKZE_ERR_NUM_ON_LHS: we encountered nonsense like 123=
LWS_TOKZE_ERR_MALFORMED_FLOAT: we saw a floating point number with nonsense, like “1..3” or “1.2.3” (float parsing can be disabled by flag)
LWS_TOKZE_ERR_UNTERM_STRING: we saw a " and started parsing a quoted string, but the string ended before the close quote
LWS_TOKZE_ERR_BROKEN_UTF8: we encountered an invalid UTF-8 sequence

Parser modification flags

There are many different conventions for tokenizing depending on what you’re doing… the default is restrictive, in that only alphanumerics and _ can be in a token, but for different cases you will want to modify this. Several flags allow selection of a suitable parsing regime for what you’re doing.

LWS_TOKENIZE_F_MINUS_NONTERM: treat - as part of a token, so my-token is reported as one token, not the three units my, - and token
LWS_TOKENIZE_F_AGG_COLON: token: or token : is reported as the special type LWS_TOKZE_TOKEN_NAME_COLON, instead of a token followed by a : delimiter
LWS_TOKENIZE_F_COMMA_SEP_LIST: enforce comma-separated list syntax, eg, “a” or “a, b”, but not “,a” or “a, b,”
LWS_TOKENIZE_F_RFC7230_DELIMS: allow more characters in a token, following http style
LWS_TOKENIZE_F_DOT_NONTERM: allows, eg, “warmcat.com” to be treated as one token
LWS_TOKENIZE_F_NO_FLOATS: process, eg, “192.168.0.1” as a token instead of a floating point format error
LWS_TOKENIZE_F_NO_INTEGERS: don’t treat strings consisting of digits as integers; just report them as string tokens
LWS_TOKENIZE_F_HASH_COMMENT: take a # on the line as meaning the rest of the line is a comment
LWS_TOKENIZE_F_SLASH_NONTERM: allow / inside tokens, so multipart/related is a single token

Typical usage

{
    struct lws_tokenize ts;
    const char *str;
...

    str = "mytoken1, mytoken, my-token";

    lws_tokenize_init(&ts, str, LWS_TOKENIZE_F_NO_INTEGERS |
                                LWS_TOKENIZE_F_MINUS_NONTERM);

    do {
        ts.e = lws_tokenize(&ts);
        switch (ts.e) {
        case LWS_TOKZE_TOKEN:
            /* token is in ts.token, length ts.token_len */
            break;
        case LWS_TOKZE_DELIMITER:
            /* delimiter is in ts.token[0] */
            ...
            break;
        case LWS_TOKZE_ENDED:
            /* reached end of string and tokenizer had no objections */
            ...
            break;
        default:
            /* negative ts.e values are the parsing errors listed above */
            break;
        }
    } while (ts.e > 0); /* tokens and delimiters are > 0; 0 is ENDED, < 0 error */
...
}

lws_strexp Overview

lws_strexp implements generic streaming, stateful string expansion for embedded symbols like ${mysymbol}, on input of unlimited size chunked to arbitrary sizes for both input and output. It doesn’t deal with the symbols itself, but passes each instance of a symbol name needing substitution to a user-provided callback as it is found; the callback privately looks up the symbol and emits the substituted data inline.

Neither the input nor the output needs to be all in one place at one time, and either can be arbitrarily fragmented down to single-byte buffers safely, so this API is immune to fragmentation-type attacks. Input and output of any size can be processed without using any heap beyond a ~64-byte context object and the input and output chunk buffers; depending on what you’re doing, all of these can be on the stack.

Expansion API returns

LSTRX_DONE: we reached the end OK
LSTRX_FILLED_OUT: we filled the output buffer; once you have drained it, call again to continue
LSTRX_FATAL_NAME_TOO_LONG: met a symbol name longer than 31 chars
LSTRX_FATAL_NAME_UNKNOWN: the callback reported it doesn’t know the symbol name

Example usage

The symbol substitution callback should look like this, so it can deal with arbitrary output chunking:

int
exp_cb1(void *priv, const char *name, char *out, size_t *pos, size_t olen,
    size_t *exp_ofs)
{
    const char *replace = NULL;
    size_t total, budget;

    /* is this a symbol name we know how to expand? */
    if (!strcmp(name, "test")) {
        replace = "replacement_string";
        total = strlen(replace);
        goto expand;
    }

    return LSTRX_FATAL_NAME_UNKNOWN;

expand:
    /*
     * copy as much of the replacement as fits in the remaining output,
     * resuming from *exp_ofs if we already emitted part of it earlier
     */
    budget = olen - *pos;
    total -= *exp_ofs;
    if (total < budget)
        budget = total;

    memcpy(out + *pos, replace + (*exp_ofs), budget);
    *exp_ofs += budget;
    *pos += budget;

    if (budget == total)
        return LSTRX_DONE;

    return LSTRX_FILLED_OUT;
}

… and performing the substitution…

        size_t in_len, used_in, used_out;
        lws_strexp_t exp;
        char obuf[128];
        int n;

        /* in / in_len: the input buffer holding ${...} symbols to expand */

        lws_strexp_init(&exp, NULL, exp_cb1, obuf, sizeof(obuf));

        /* for large input, you would do this in a loop, as sketched below */

        n = lws_strexp_expand(&exp, in, in_len, &used_in, &used_out);
        if (n != LSTRX_DONE) {
            lwsl_err("%s: lws_strexp failed: %d\n", __func__, n);

            return 1;
        }
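
For larger inputs, the expansion can be driven chunk by chunk, something like the sketch below. It assumes lws_strexp_reset_out() is available to rearm the output buffer once it has been drained, and emit_chunk() is a hypothetical stand-in for whatever consumes the output:

        const char *p = in;
        size_t rem = in_len;

        do {
            n = lws_strexp_expand(&exp, p, rem, &used_in, &used_out);
            if (n < 0)
                return 1; /* fatal: name unknown or too long */

            emit_chunk(obuf, used_out); /* hypothetical output consumer */

            p += used_in;
            rem -= used_in;

            if (n == LSTRX_FILLED_OUT)
                /* output buffer is full: after draining it, rearm it */
                lws_strexp_reset_out(&exp, obuf, sizeof(obuf));
        } while (n != LSTRX_DONE);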

What did we learn this time?

  • If you deal with strings that have internal structure, C can require a lot of code that is unforgiving of security issues, difficult to restructure after it’s written, and hard to extend without creating a rat’s nest.

  • The tokenizer provides your code with robust, well-formed tokens and delimiters, hiding details like whitespace and, if selected, enforcing comma-separated-list sequencing.

  • You can configure it at runtime for a wide variety of situations.

  • You can very easily deploy ${symbol} string substitution without needing the input or output in one place at one time, even if the substitution is huge.