Parse, Don’t Validate AKA Some C Safety Tips

If you’ve read the original post on “Parse, Don’t Validate” you may have noticed that it focuses primarily on conceptual correctness. Here, I’ll build on that by showing how this technique can be used outside of niche academic languages by demonstrating it in a language that is as practical as it is dangerous - C.

“Parse, Don’t Validate” - the TLDR version

Your first instinct, when your system receives as input an email address (for example), is to perform validateEmail(untrustedInput) and then pass the validated string further into the depths of the system for usage.

The problem is that the validated value is still just a string, so code deep within the system cannot tell whether it has already been checked. Every function that processes it still needs to validate it before use.

I’ll bet good money that those processing functions, being logically far away from the boundary, will either validate it a different way or fail to validate it altogether.

// Pseudocode
if (validateEmail(untrustedInput) != true) {
   return error;
}
// Rest of system uses `untrustedInput`

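Instead, parse the input once at the boundary into a dedicated type, and pass only that type onward:

// Pseudocode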
email_t theEmail = parseEmail(untrustedInput);
if (theEmail == PARSE_ERROR) {
   return error;
}
// Rest of system uses `theEmail`

This removes the opportunity for errors to creep in within the rest of the system, such as some other code running a different validateEmail function on the untrustedInput.

Some conventions for Safety in C.

Much to the surprise of, well, everybody, C actually has type safety. Sure, it isn’t as enforceable as (for example) Rust… and, sure, if you are willing to do extra work you can bypass it, but, at the end of the day, the compiler will still warn you if you try to add a number to a string and assign the result to a function.

With some exceptions, when you mismatch types, the compiler will tell you about it.

The problem isn’t that C lacks type safety (it clearly enforces most types in most expressions), it’s that raw pointers do not encode semantics (e.g., a char * doesn’t tell you if it’s an email, a name, or a filename).

This is pretty much the same in every language; if you have a function store_user() which accepts two strings, an email and a user name, then no type safety in the world is going to save you if you accidentally swap the arguments around when calling the function.

But, you still have options – even in C – by creating new string types; one for email and another for user name.

When writing in C, instead of passing char * around as strings, or (safer, but not by much) using an existing string library that stores length + buffer, create an opaque type for each kind of value.

You parse the input into the correct type once, and then functions which accept that type will produce a compile error if you mix things up.

When you create the correct types for data entering the system, you can then do this:

// C code, not pseudocode
email_t *email = parse_email(untrusted_input);
if (!email) {
   // Handle error
}

In addition to the safety from using opaque types, there are further levels of safety here too.

This is not only for char * types though; you can do this for all values entering your system. You parse them once into the correct data type, and then code deep in the belly of the system only ever sees data that has already been parsed into specific types, which leaves malicious input far fewer places to do damage.
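As an illustration of parsing a non-string value, here is a minimal sketch for a network port number. The `port_t` type and `port_parse()` function are hypothetical names introduced for this example, and the accepted range (1 to 65535, decimal digits only) is an assumption rather than a rule from this post.

```c
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical type: a port number already known to be in 1..65535.
typedef struct port_t {
   uint16_t value;
} port_t;

// Parses `untrusted` as a decimal port number.
// Returns true and fills `*out` on success.
// Returns false and leaves `*out` untouched on any failure.
bool port_parse (const char *untrusted, port_t *out)
{
   if (!untrusted || !out)
      return false;

   // strtol() would skip leading whitespace and accept a sign;
   // insisting on a leading digit rejects both (and the empty string).
   if (!isdigit ((unsigned char) untrusted[0]))
      return false;

   char *end = NULL;
   errno = 0;
   long n = strtol (untrusted, &end, 10);

   if (errno != 0 || *end != '\0')   // out of range for long, or trailing garbage
      return false;
   if (n < 1 || n > 65535)           // outside the assumed valid port range
      return false;

   out->value = (uint16_t) n;
   return true;
}
```

Functions further inside the system can then take a `port_t` instead of an `int` or a `char *`, so they never see a value that skipped the range check.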

When your functions never accept char * parameters your risk of pwnage is reduced. By leveraging the typing guarantees in C, you can ensure that the system won’t compile even if some heretic decides that they want to pass a char * to a function expecting an email_t.

Only the functions on the boundary of the system, interfacing to the outside world, should parse input. Everything else should accept only type-checkable parameters.

That alone is a big reason to use this approach, but I’ll point out two more opportunities to reduce the attack surface of your system using an actual compilable example consisting of separate compilation units. The untrusted input email and name come from outside the system.

// callee.h

typedef struct email_t email_t;
typedef struct name_t name_t;

#ifdef __cplusplus
extern "C" {
#endif
   email_t *email_parse (const char *untrusted);
   name_t *name_parse (const char *untrusted);

   // Additional tip: letting the callee set the caller's pointer to NULL
   // when the value is freed guards against double-frees via that pointer
   void email_del (email_t **email);
   void name_del (name_t **name);

#ifdef __cplusplus
}
#endif

// callee.c
#include <string.h>
#include <stdlib.h>

#include "callee.h"

struct email_t {
   // In a real program, you might want to store the two components
   // of the email address (before and after the `@`) separately.
   // This example simply copies the input.
   char *email;
};

struct name_t {
   char *name;
};


email_t *email_parse (const char *untrusted)
{
   if (!untrusted)
      return NULL;

   email_t *ret = malloc (sizeof *ret);
   if (!ret)
      return NULL;

   // In a real program, you'll parse this correctly
   if (!(ret->email = strdup (untrusted))) {
      free (ret);
      ret = NULL;
   }

   return ret;
}

name_t *name_parse (const char *untrusted)
{
   if (!untrusted)
      return NULL;

   name_t *ret = malloc (sizeof *ret);
   if (!ret)
      return NULL;

   if (!(ret->name = strdup (untrusted))) {
      free (ret);
      ret = NULL;
   }

   return ret;
}

void email_del (email_t **email)
{
   if (email && *email) {
      free ((*email)->email);
      free (*email);
      *email = NULL;
   }
}

void name_del (name_t **name)
{
   if (name && *name) {
      free ((*name)->name);
      free (*name);
      *name = NULL;
   }
}

// caller.c (file name assumed; a separate compilation unit)
#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>

#include "callee.h"

void store_record_old (char *email, char *name)
{
   // Do something with the parameters here
   (void)email;
   (void)name;
}

void store_record_new (email_t *email, name_t *name)
{
   // Do something with the parameters here
   (void)email;
   (void)name;
}

bool rx_untrusted_input (char *untrusted_name, char *untrusted_email)
{
   email_t *email = email_parse (untrusted_email);
   name_t *name = name_parse (untrusted_name);
   if (!email || !name) {
      email_del (&email);
      name_del (&name);
      return false;
   }
   // Whoops - we accidentally specified the parameters in the wrong order!
   // Compiler cannot tell that this is a mistake!
   store_record_old (untrusted_name, untrusted_email);

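   // Same mistake with opaque types, but now the compiler catches it!
   // error: incompatible pointer types passing 'name_t *' to parameter of type 'email_t *'
   store_record_new (name, email);

   // Free the parsed values (assuming store_record_new() keeps no pointers)
   email_del (&email);
   name_del (&name);
   return true;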
}

There is now no way, short of an explicit cast, for any non-boundary code in your system to accidentally use an email value in place of a name value.
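The email_parse() in the example only copies its input. As a rough sketch of what actual parsing could look like, the block below splits the address at the `@` and stores the two halves separately. The `parsed_email_t` type, its functions, and the acceptance rules (exactly one `@`, something on both sides, no whitespace or control characters) are all assumptions made for illustration; they are far looser than the real email address grammar. The struct is left visible here only so the block stands alone; in the layout used above it would stay hidden in the .c file.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical parsed form: the two components stored separately.
typedef struct parsed_email_t {
   char *local;    // text before the '@'
   char *domain;   // text after the '@'
} parsed_email_t;

// Copies `len` bytes of `src` into a fresh NUL-terminated buffer.
static char *copy_span (const char *src, size_t len)
{
   char *dst = malloc (len + 1);
   if (dst) {
      memcpy (dst, src, len);
      dst[len] = '\0';
   }
   return dst;
}

void parsed_email_del (parsed_email_t **email)
{
   if (email && *email) {
      free ((*email)->local);
      free ((*email)->domain);
      free (*email);
      *email = NULL;
   }
}

parsed_email_t *parsed_email_parse (const char *untrusted)
{
   if (!untrusted)
      return NULL;

   // Reject whitespace and control characters anywhere in the input.
   for (const char *p = untrusted; *p; p++) {
      if (isspace ((unsigned char) *p) || iscntrl ((unsigned char) *p))
         return NULL;
   }

   // Require exactly one '@' with something on both sides of it.
   const char *at = strchr (untrusted, '@');
   if (!at || at == untrusted || at[1] == '\0' || strchr (at + 1, '@'))
      return NULL;

   parsed_email_t *ret = calloc (1, sizeof *ret);
   if (!ret)
      return NULL;

   ret->local = copy_span (untrusted, (size_t) (at - untrusted));
   ret->domain = copy_span (at + 1, strlen (at + 1));
   if (!ret->local || !ret->domain)
      parsed_email_del (&ret);   // frees any partial result, sets ret to NULL

   return ret;
}
```

Because the only way to obtain a `parsed_email_t *` is through parsed_email_parse(), any function that receives one can rely on both components being present and non-empty.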

Let me count the ways…

This is a practical way of hardening your system against attacks: Parse, Don’t Validate.

Another one is shown in the code snippet above - your “destructor” functions which free a value should always be written to take the address of a pointer to that value.

Why, you ask? It’s because then the destructor function can set the pointer at the caller’s location to NULL, so even if a caller accidentally calls the email_del() destructor function twice on the same pointer, nothing will happen the second time around.
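The convention can be sketched in isolation as follows. The `token_t` type and its two functions are hypothetical stand-ins, used here only so the block compiles on its own; they mirror the shape of email_del() above.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical heap-allocated value, standing in for email_t or name_t.
typedef struct token_t {
   char *text;
} token_t;

// Allocates a token holding a private copy of `text`; NULL on failure.
token_t *token_new (const char *text)
{
   if (!text)
      return NULL;

   token_t *ret = malloc (sizeof *ret);
   if (!ret)
      return NULL;

   size_t len = strlen (text);
   ret->text = malloc (len + 1);
   if (!ret->text) {
      free (ret);
      return NULL;
   }
   memcpy (ret->text, text, len + 1);   // copies the terminating NUL too
   return ret;
}

// Takes the ADDRESS of the caller's pointer so it can be reset to NULL.
// A second call through the same pointer finds NULL and does nothing.
void token_del (token_t **tok)
{
   if (tok && *tok) {
      free ((*tok)->text);
      free (*tok);
      *tok = NULL;
   }
}
```

Calling `token_del (&t)` twice in a row is harmless because the second call sees `t == NULL`. The protection only covers the pointer variable that was passed in: if the caller made a copy of the pointer elsewhere, that copy still dangles after the free.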

And finally, the last upside: with different type names for different kinds of values, a caller who accidentally swaps the parameters in a function call gets a compiler diagnostic, even though the two types are identical under the hood!

Depending on the compiler and its version, that diagnostic may be reported as a warning rather than an error; building with -Werror=incompatible-pointer-types (accepted by both GCC and Clang) turns it into a hard error.

Summary: Why Parse, Don’t Validate?

By leveraging the typing guarantees we eliminate entire classes of bugs while making the code more robust and maintainable. Instead of just checking values for correctness, we parse them once and then the compiler enforces some typing guarantees for us.
