Libwebsockets string processing helpers
Overview
- CMake option: part of core lws
- Public header: include/libwebsockets/lws-misc.h (included by libwebsockets.h)
- Public header: include/libwebsockets/lws-purify.h (included by libwebsockets.h)
- Public header: include/libwebsockets/lws-tokenize.h (included by libwebsockets.h)
- Implementation: ./lib/core/libwebsockets.c
- API unit tests: ./minimal-examples/api-tests/api-test-lws_tokenize
Introduction
Writing string processing code in C is a dangerous and unrewarding occupation: you always have to be mindful of how the code will respond to malicious input. To operate well on machines with very limited resources, without limiting the size of input you can handle, it’s sometimes desirable to operate on C strings in memory, but at other times it’s desirable to operate in a stateful way on partial blocks of input without restriction, ie, with complete immunity to any fragmentation attack, because the code is designed for that case from the start.
Libwebsockets has a number of important safe helpers, like wrappers on strncpy() and snprintf() that guarantee there won’t be overflows and that there will always be a terminating NUL, as well as more complex helpers… let’s look at the simple ones first.
lws_strncpy()
You might be surprised to learn that libc strncpy() fails at being safe: it can return without having applied a NUL at the end. lws_strncpy() is an alternative that will always safely truncate at the limit.

lws_strncpy() adjusts the size limit back by 1 to always make space for the NUL. For that reason, you can safely just feed it sizeof(dest) as the length limit.
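For instance, a minimal sketch of copying untrusted input into a fixed-size buffer (the variable names and input here are illustrative, not part of lws):

const char *user_input = "some externally-provided string that may be too long";
char dest[16];

/* dest is always NUL-terminated; the copy is truncated to fit if needed */
lws_strncpy(dest, user_input, sizeof(dest));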
lws_strnncpy()
This is a variation of strncpy() that takes two limit numbers at the end; the lower of the two is used to restrict or truncate the copy, and a NUL is always applied at the end of the destination. This is useful when you want to copy a string where you know the length of the source string, but there is no terminating NUL in the source. With this you can do, eg,

lws_strnncpy(dest, src_no_NUL, srclen, destlen);

and under all conditions end up with a NUL-terminated, possibly truncated copy of the string in dest. This is used widely in lws to cover for the fact that some platforms do not provide the “%.*s” format string option, which allows printing non-NUL-delimited strings of a given length.

destlen is corrected back by 1 to allow for the NUL at the end, so again it’s safe to set this to sizeof(dest).
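For instance, a minimal sketch of copying a length-delimited field out of a larger, non-NUL-terminated buffer (the buffer and lengths are illustrative):

/* a 5-byte field inside a larger buffer with no terminating NUL */
const char buf[] = { 'h', 'e', 'l', 'l', 'o', ',', 'x' };
char dest[64];

/* copies at most 5 bytes, truncates to fit dest, always NUL-terminates */
lws_strnncpy(dest, buf, 5, sizeof(dest));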
lws_snprintf()
This is the swiss army knife of string generation: it’s a safe version of snprintf(). It uses the platform vsnprintf() but guarantees NUL termination of the destination. It’s very convenient for additively composing onto a large string, where the return value, which is the length actually written to the destination, is used to advance where the next lws_snprintf() will write to.

If we have already reached the end of the destination, it returns 0; at the end of the sequence, the resulting string length can be compared to the destination size to discover whether we “crumpled up at the end”. But it won’t crash or write past the end of the destination in the meanwhile.
char dest[1234];
size_t n = 0;
n += lws_snprintf(dest + n, sizeof(dest) - n, "%s etc", etc);
n += lws_snprintf(dest + n, sizeof(dest) - n, "%s etc", etc);
n += lws_snprintf(dest + n, sizeof(dest) - n, "%s etc", etc);
if (n >= sizeof(dest) - 1)
/* truncated */
lws_purify_
Lws provides “purification” helpers for arguments that will be expressed in JSON, sqlite, or filenames that include externally-provided input. The sqlite and JSON versions use escaping, and their arguments are formatted like strncpy(), ie, dest, source, dest len.

lws_filename_purify_inplace(), as the name suggests, purifies the filename in place, by replacing scary characters like .. or / in the filename with _.

Notice that in the worst case, JSON escaping can turn one character into six, and if the input is an attack the whole string may consist of such characters, so you should allow for the destination being 6x the size of the input.
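For instance, a sketch along those lines; the buffers, sizes and inputs are illustrative, and the exact prototypes should be checked against lws-purify.h:

/* externally-provided values we want to use safely */
const char *user = "O'Brien \"special\" name";
char esc[96];                       /* sized generously for escaping */
char fname[] = "../../etc/passwd";  /* attacker-influenced filename */

/* escape for embedding in sqlite statements: dest, source, dest len */
lws_sql_purify(esc, user, sizeof(esc));

/* neutralize path tricks in place: scary sequences like .. or / become _ */
lws_filename_purify_inplace(fname);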
lws_tokenize
Overview
Given a UTF-8 string, lws_tokenize robustly and consistently separates it into syntactical units, following flags given to it that control how ambiguous things should be understood. By default, the only contiguous characters allowed in a token are alphanumerics and _, but the flags can modify that.

Whitespace outside of quoted strings is swallowed by the parser, so it is immune to different behaviours based on different types or amounts of whitespace between tokens. It silently consumes whitespace where it’s valid and just reports the delimiter or token abutting it. Similarly, if comments are enabled, the comments are silently and wholly swallowed.

lws_tokenize is designed to operate on an all-in-memory chunk that typically covers “one line”; using it chunked is possible, but code outside lws_tokenize must collect enough chars to cover whole tokens first, whatever that means for your use-case.

Its use-cases cover decoding small or large strings easily and robustly, where lws_tokenize has already taken care of syntax-level error checking, like the correctness of comma-separated lists or float format, and rejected nonsense… user code just has to look at the flow of tokens and delimiters and decide if that’s valid for its purpose. For example, lws_tokenize is helpful for decoding header content where the header has some structure but is otherwise quite free-form… it’s difficult for user code to parse that from scratch without missing some validation or introducing bugs, but much easier to deal with a stream of tokens and delimiters whose syntax has already been restricted.

lws_tokenize also covers complex usage like parsing config files robustly, including comments.
Token types
Type | Description |
---|---|
LWS_TOKZE_ENDED | We found a NUL and parsing has completed successfully |
LWS_TOKZE_DELIMITER | Some character that can’t be in a token appeared, like , |
LWS_TOKZE_TOKEN | A token appeared, like my_token; this is reported as a unit |
LWS_TOKZE_INTEGER | A token that seems to be an integer appeared, like 1234 |
LWS_TOKZE_FLOAT | A token that seems to be a float appeared, like 1.234 |
LWS_TOKZE_TOKEN_NAME_EQUALS | A token followed by = appeared |
LWS_TOKZE_TOKEN_NAME_COLON | A token followed by : appeared (only if the LWS_TOKENIZE_F_AGG_COLON flag is enabled) |
LWS_TOKZE_QUOTED_STRING | A quoted string appeared, like "my,s:t=ring" |
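To make the token types concrete, with default flags an input like a=1, hello would be reported as a sequence along these lines (a sketch of the expected flow, not captured tool output):

a=1, hello

LWS_TOKZE_TOKEN_NAME_EQUALS  "a"
LWS_TOKZE_INTEGER            "1"
LWS_TOKZE_DELIMITER          ","
LWS_TOKZE_TOKEN              "hello"
LWS_TOKZE_ENDED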
Parsing Errors
Error | Description |
---|---|
LWS_TOKZE_ERR_COMMA_LIST | We were told to expect a comma-separated list, but we saw things like “,tok” or “tok,,” |
LWS_TOKZE_ERR_NUM_ON_LHS | We encountered nonsense like 123= |
LWS_TOKZE_ERR_MALFORMED_FLOAT | We saw a floating point number with nonsense, like “1..3” or “1.2.3” (float parsing can be disabled by flag) |
LWS_TOKZE_ERR_UNTERM_STRING | We saw a " and started parsing a quoted string, but the string ended before the close quote |
LWS_TOKZE_ERR_BROKEN_UTF8 | We encountered a UTF-8 sequence that is invalid |
Parser modification flags
There are many different conventions for tokenizing depending on what you’re doing… the default is restrictive, in that only alphanumerics and _ can be in a token, but for different cases you will want to modify this. There are several flags allowing selection of a suitable parsing regime for what you’re doing.
Flag | Meaning |
---|---|
LWS_TOKENIZE_F_MINUS_NONTERM | treat - as part of a token, so my-token is reported as one token, not my - token |
LWS_TOKENIZE_F_AGG_COLON | Report token: or token : as the special token type LWS_TOKZE_TOKEN_NAME_COLON, instead of a token followed by a : delimiter |
LWS_TOKENIZE_F_COMMA_SEP_LIST | Enforce comma-separated list syntax, eg “a”, or “a, b” but not “,a” or “a, b,” |
LWS_TOKENIZE_F_RFC7230_DELIMS | Allow more characters in a token, following HTTP (RFC 7230) conventions |
LWS_TOKENIZE_F_DOT_NONTERM | Allows, eg, “warmcat.com” to be treated as one token |
LWS_TOKENIZE_F_NO_FLOATS | This allows you to process, eg, “192.168.0.1” as a token instead of a floating point format error |
LWS_TOKENIZE_F_NO_INTEGERS | Don’t treat strings consisting of numbers as integers, just report them as a string token |
LWS_TOKENIZE_F_HASH_COMMENT | Take a # on the line as meaning the rest of the line is a comment |
LWS_TOKENIZE_F_SLASH_NONTERM | Allow / inside string tokens, so multipart/related is a single token |
Typical usage
{
        struct lws_tokenize ts;
        const char *str = "mytoken1, mytoken, my-token";

        lws_tokenize_init(&ts, str, LWS_TOKENIZE_F_NO_INTEGERS |
                                    LWS_TOKENIZE_F_MINUS_NONTERM);
        do {
                ts.e = lws_tokenize(&ts);
                switch (ts.e) {
                case LWS_TOKZE_TOKEN:
                        /* token is in ts.token, length ts.token_len */
                        break;
                case LWS_TOKZE_DELIMITER:
                        /* delimiter character is in ts.token[0] */
                        break;
                case LWS_TOKZE_ENDED:
                        /* reached end of string and tokenizer had no objections */
                        break;
                default:
                        /* a negative ts.e is one of the parsing errors above */
                        break;
                }
        } while (ts.e > 0); /* stop at LWS_TOKZE_ENDED (0) or an error (< 0) */
}
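The same loop shape also suits config-file style input once comment handling is enabled; a minimal sketch, using a hypothetical config line and only the flags documented above:

struct lws_tokenize ts;

lws_tokenize_init(&ts, "listen-port=443 # tls listener",
                  LWS_TOKENIZE_F_MINUS_NONTERM |
                  LWS_TOKENIZE_F_HASH_COMMENT);

/*
 * running the same do / while loop as above should then report
 * LWS_TOKZE_TOKEN_NAME_EQUALS for "listen-port", LWS_TOKZE_INTEGER for
 * 443, and LWS_TOKZE_ENDED; the # comment is silently swallowed
 */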
lws_strexp
Overview
lws_strexp implements generic streaming, stateful string expansion for embedded symbols like ${mysymbol}, in an input of unlimited size chunked to arbitrary sizes for both input and output. It doesn’t deal with the symbols itself, but passes each instance of a symbol name that needs substitution to a user-provided callback as it is found; the callback privately looks up the symbol and emits the substituted data inline.

Neither the input nor the output needs to be all in one place at one time, and either can be arbitrarily fragmented down to single-byte buffers safely, so this API is immune to fragmentation-type attacks. Any size of input and output can be processed without using any heap other than a ~64-byte context object and the input and output chunk buffers; depending on what you’re doing, all of these can be on the stack.
expansion api return | meaning |
---|---|
LSTRX_DONE | We reached the end OK |
LSTRX_FILLED_OUT | We filled up the output buffer; once you have drained it, call again to continue |
LSTRX_FATAL_NAME_TOO_LONG | Met a name longer than 31 chars |
LSTRX_FATAL_NAME_UNKNOWN | Callback reported it doesn’t know the symbol name |
Example usage
The symbol substitution callback should look like this, so that it can deal with arbitrary output chunking:
int
exp_cb1(void *priv, const char *name, char *out, size_t *pos, size_t olen,
size_t *exp_ofs)
{
const char *replace = NULL;
size_t total, budget;
if (!strcmp(name, "test")) {
replace = "replacement_string";
total = strlen(replace);
goto expand;
}
return LSTRX_FATAL_NAME_UNKNOWN;
expand:
budget = olen - *pos;
total -= *exp_ofs;
if (total < budget)
budget = total;
memcpy(out + *pos, replace + (*exp_ofs), budget);
*exp_ofs += budget;
*pos += budget;
if (budget == total)
return LSTRX_DONE;
return LSTRX_FILLED_OUT;
}
… and performing the substitution…
const char *in = "hello ${test} world";  /* example input string */
size_t in_len = strlen(in), used_in, used_out;
lws_strexp_t exp;
char obuf[128];
int n;
lws_strexp_init(&exp, NULL, exp_cb1, obuf, sizeof(obuf));
/* for large input, you would do this in a loop */
n = lws_strexp_expand(&exp, in, in_len, &used_in, &used_out);
if (n != LSTRX_DONE) {
lwsl_err("%s: lws_strexp failed: %d\n", __func__, n);
return 1;
}
What did we learn this time?
If you deal with strings that have internal structure, C can require a lot of code that is unforgiving of security issues, and difficult to rearrange after it’s written, or to extend without creating a rat’s nest.

- The tokenizer provides your code with robust, well-formed tokens and delimiters, and hides details like whitespace and, if selected, comma-separated list sequencing.
- You can configure it at runtime for a wide variety of situations.
- You can very easily deploy ${symbol} string substitution without needing the input or output in one place at one time, even if the substitution is huge.