Serialising Structs to Files

Structures are the cornerstone of data representation in every non-trivial C program. Due to the lack of reflection in the C language, serialising structures often means writing separate struct_write and struct_read functions for every type of structure in the program.


Posted by Lelanthran
2017-12-24

This article examines multiple techniques for serialising structs. Each approach uses a structure containing typical fields found in most structs and two functions to perform the serialisation: s_read and s_write.

Finally, a general method for serialising arbitrary structures to and from files is presented, implemented as a pair of functions driven by a format string of field specifiers.

Introduction

The struct datatype is a cornerstone of composite data representation in C programs. All non-trivial programs will represent much of the data during program execution as one or more struct types. My previous article covered the basics of abstracting away the structure as an opaque data type, with its own functions and hidden internals. This article focuses on storing a struct in a form that can be later loaded back into the program.

The structure that I will use as an example will have fields that represent all the common types we see in fields within a struct:

typedef struct foo_t foo_t;
struct foo_t {
   // All different widths of integers
   uint8_t  u8;
   uint16_t u16;
   uint32_t u32;
   uint64_t u64;

   // Floating point precision numbers
   float    fp_f;
   double   fp_d;

   // A null-terminated C string
   char    *cstring;

   // A 25-element long array of a primitive type
   uint16_t a16[25];

   // Length of allocated array below
   uint16_t ptr16len;
   // A pointer to a runtime allocated array of a primitive type
   uint16_t *ptr16;
};

I’ve left out one commonly seen field of a struct, the pointer to some other struct. In other words this field is missing:

struct foo_t {
   ...
   // A pointer to a runtime allocated structure.
   struct bar_t *bar;
   ...
};

That use-case is omitted because it must be serialised using the same approach as the parent struct, i.e. whichever approach is chosen for the parent struct must also be applied recursively to every child struct.

Quick Solution: fread() and fwrite()

Most programmers confronted with the need to save and load structures will turn to fread and fwrite as the first solution. It’s not a bad idea to use functions available in the standard library, and because fread and fwrite are both standard functions you can be certain that they are available on all platforms with a conforming compiler.

A first pass of structure serialisation would probably be:

bool s_write (const foo_t *src, FILE *outf)
{
   size_t n_items = fwrite (src, sizeof *src, 1, outf);
   return n_items == 1;
}

// dst cannot be const: fread stores the loaded bytes into it.
bool s_read (foo_t *dst, FILE *inf)
{
   size_t n_items = fread (dst, sizeof *dst, 1, inf);
   return n_items == 1;
}

Unfortunately this solution has a number of problems:

  1. All the padding bytes between the fields are saved and loaded. This is a lot of extra space that is used, and it can add up to significant waste on embedded platforms.
  2. The padding bytes will not be constant between different machines. The alignment of each field between platforms is not guaranteed so saving the struct on a 32-bit ARM platform and then trying to load it on a 64-bit server will result in the fields having different (possibly random) values.
  3. The padding and alignment are not guaranteed even on the same platform. A different compiler for the same hardware is allowed to choose different alignment and therefore insert different padding between the fields.
  4. Even the same compiler and same compiler version will, at the drop of the correct flag, change the padding and remove the alignment restrictions resulting in packed structures.
  5. The data that the pointer fields point to is not saved, only the pointer values themselves. Upon reloading, those addresses are meaningless and the pointer fields will contain junk!

It should be clear that relying on the compiler to always maintain the struct layout is not a good idea; a simple recompilation using the same compiler and same version of compiler with a different flag (or, heaven forbid, a pragma buried deep within some source file) can make your program incompatible with its previously saved data. Unfortunately the trivial implementation of s_write and s_read saves and loads structs using the current layout used by the compiler, and the field layout is subject to change across platforms, compilers and compiler settings.
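The effect of padding is easy to observe. The following is a minimal sketch, using a cut-down struct (padded_t, a hypothetical name that is not part of the article's code), that compares the size the compiler chose with the number of bytes the fields actually need:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

// Cut-down stand-in for foo_t: just enough fields to provoke padding.
struct padded_t {
   uint8_t  u8;
   uint64_t u64;
   uint16_t u16;
};

// Number of bytes the fields themselves need, ignoring any padding.
static size_t payload_size (void)
{
   return sizeof (uint8_t) + sizeof (uint64_t) + sizeof (uint16_t);
}

// Print the layout chosen by this particular compiler and set of flags.
static void print_layout (FILE *outf)
{
   assert (outf != NULL);
   fprintf (outf, "sizeof (struct padded_t) = %zu, payload = %zu\n",
            sizeof (struct padded_t), payload_size ());
   fprintf (outf, "offsets: u8=%zu u64=%zu u16=%zu\n",
            offsetof (struct padded_t, u8),
            offsetof (struct padded_t, u64),
            offsetof (struct padded_t, u16));
}
```

Calling print_layout (stdout) on a typical 64-bit desktop compiler commonly reports a struct of 24 bytes for 11 bytes of payload, but the exact figures are precisely what is not guaranteed: they may change with the platform, the compiler and the flags.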

Even if nothing in the program changes, this approach will not work when there are fields that are pointers to other objects.

Minimal Per-Field Approach

Serialisation for Each Field

The easiest way to ensure that we read all the fields in a struct correctly is, when saving the struct, to write each field out individually and, when loading the struct, to read each field in individually. Naturally this results in two very linear but long and tedious functions. Due to the length I will only display the s_write function; the s_read function is very similar - one line of code and one conditional for error checking, per field, with an extra line of code, an extra conditional and a loop construct for each field that has a length. For example, to serialise just a single field in the struct you would have to do the following:

bool s_write (const foo_t *src, FILE *outf)
{
...
   size_t n_items;
   n_items = fwrite (&src->u16, sizeof src->u16, 1, outf);
   if (n_items != 1) {
      // Handle error
   }
...
}

A simple struct of ten fields requires approximately 45 lines of code just to write it out. A similar ratio exists for reading the struct back from the file. A non-trivial program would likely define and use tens of structs, each probably needing serialisation.

An additional problem is serialising strings: the string length needs to be stored before the string when saving the struct so that the reader knows how much memory to allocate when loading the string. This means that each string (or array) costs the reader extra steps (reading the length, calling malloc and checking for errors) and costs the writer an extra write operation (writing the length before writing the field).

The code to write a single string field now looks like this:

bool s_write (const foo_t *src, FILE *outf)
{
...
   uint32_t s_len = strlen (src->cstring) + 1;
   size_t n_items;

   // First write the length so that the reader knows how many characters
   // are in this string.
   n_items = fwrite (&s_len, sizeof s_len, 1, outf);
   if (n_items != 1) {
      // Handle error
   }

   // Then write the actual string.
   n_items = fwrite (src->cstring, 1, s_len, outf);
   if (n_items != s_len) {
      // Handle error
   }
...
}
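The reader's side of a length-prefixed string is not shown above. A minimal sketch of it follows, together with a matching writer so that the pair can be tried on its own. The helper names write_lp_string and read_lp_string are hypothetical, and the length is still written in the machine's native byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: write the length (including the terminating NUL)
// followed by the bytes of the string.
static bool write_lp_string (const char *s, FILE *outf)
{
   assert (s != NULL && outf != NULL);
   uint32_t s_len = (uint32_t) (strlen (s) + 1);
   if (fwrite (&s_len, sizeof s_len, 1, outf) != 1)
      return false;
   return fwrite (s, 1, s_len, outf) == s_len;
}

// Hypothetical helper: read the length, allocate, then read the bytes.
// Returns a malloc'ed string that the caller must free, or NULL on error.
static char *read_lp_string (FILE *inf)
{
   assert (inf != NULL);
   uint32_t s_len = 0;
   if (fread (&s_len, sizeof s_len, 1, inf) != 1 || s_len == 0)
      return NULL;

   char *s = malloc (s_len);
   if (s == NULL)
      return NULL;

   // Reject short reads and data that is not NUL-terminated.
   if (fread (s, 1, s_len, inf) != s_len || s[s_len - 1] != '\0') {
      free (s);
      return NULL;
   }
   return s;
}
```

A real reader would probably also want an upper limit on s_len, so that a corrupted file cannot request an enormous allocation.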

Endianness

The other problem with the above approach is that, although it performs the minimum amount of error-checking required, it still does not ensure that we can read the struct back on a different platform. It assumes that the endianness of the integer and floating-point types will always be the same on all machines.

This is not true; in a multibyte value (i.e. a value consisting of 2, 4 or 8 bytes) some machines will store the lowest byte first while others will store the highest byte first. This becomes a problem if you save a value into a file on a particular machine, copy the file to a machine which has a different endianness and then try to load the value in that file. What happens is that the bytes of the value are read in reverse order, and so you get an incorrect value.

The correct thing to do is to save each multibyte value in one fixed byte order so that the loading function s_read will know how to read the value “forwards”. The easiest way to do this is to shift each byte of the multibyte value into the lowest position, mask it off with the bitwise AND operation and write each byte separately. When reading the value back in, shift each byte into place and use the bitwise OR operation to piece together the multibyte value. Because shifts and masks operate on values rather than on memory layout, they work the same way whatever the endianness of the machine, even when transporting the saved file between machines of different endianness.

I’ve made available a set of functions to convert uintXX_t to a sequence of bytes and back again. This standalone C module for use in your own projects can be downloaded from here.

Unfortunately the endian-agnostic requirement increases the effort to serialise structs - each field must now also be converted after reading and before writing. The previous code snippet now becomes:

...
   size_t n_items;
   uint8_t buffer[2];
   endian_save16 (buffer, src->u16);
   n_items = fwrite (buffer, sizeof buffer, 1, outf);
   if (n_items != 1) {
      // Handle error
   }
...

That’s just for a single field, in a single struct. For multiple structs much of that code will be repeated with very few changes.

Using Formatted IO

We can ignore the endianness of integers by using formatted IO. The C standard library’s formatted output function, fprintf, writes integers in a human-readable form that fscanf can read back in. Using the formatted IO functions can save a lot of the work that the per-field approach requires while still including all of the error-checking.

We can ignore the order of the fields in a struct and write them in the most convenient order. For this example, using struct foo_t, we write all the fixed-length fields first, plus a few extra fields containing the length of the variable-length fields so that the reader can read them all in a single function call.

bool s_write (const foo_t *src, FILE *outf)
{
   // fprintf returns the number of characters written, or a negative value
   // on error. Unlike fscanf it does not return a field count.
   int rc = 0;

   size_t s_len = strlen (src->cstring) + 1;

   // First write out all the fixed-length scalar fields.
   // The PRI* macros come from <inttypes.h>.
   rc = fprintf (outf, "0x%02" PRIx8     // 1. src->u8
                       " 0x%04" PRIx16   // 2. src->u16
                       " 0x%08" PRIx32   // 3. src->u32
                       " 0x%016" PRIx64  // 4. src->u64
                       " %.9g"           // 5. src->fp_f (9 digits round-trip
                                         //    an IEEE 754 float)
                       " %.17g"          // 6. src->fp_d (17 digits round-trip
                                         //    an IEEE 754 double)
                       " %zu"            // 7. length of src->cstring
                       " %" PRIu16,      // 8. length of src->ptr16
                 src->u8,          // 1
                 src->u16,         // 2
                 src->u32,         // 3
                 src->u64,         // 4
                 src->fp_f,        // 5
                 src->fp_d,        // 6
                 s_len,            // 7
                 src->ptr16len);   // 8
   if (rc < 0) {
      return false;  // Error - write failed
   }

   // Write out all the fixed-length array fields
   for (size_t i=0; i<sizeof src->a16 / sizeof src->a16[0]; i++) {
      rc = fprintf (outf, " %04" PRIx16, src->a16[i]);
      if (rc < 0) {
         return false;  // Error - write failed
      }
   }

   // Write out the variable-length fields.
   // First write out the cstring field
   for (size_t i=0; i<s_len; i++) {
      rc = fprintf (outf, " %c", src->cstring[i]);
      if (rc < 0) {
         return false;  // Error - write failed
      }
   }
   // Then write out the ptr16 field
   for (size_t i=0; i<src->ptr16len; i++) {
      rc = fprintf (outf, " %04" PRIx16, src->ptr16[i]);
      if (rc < 0) {
         return false;  // Error - write failed
      }
   }

   return true;   // Every error path above has already returned false.
}

The above function writes the entire struct foo_t out to a file, checks for error on each write and also ensures that a different endian machine can read the values back in.

The function to read in the struct written by s_write is slightly more complicated because two of the fields are dynamically allocated. This means that:

  1. The fixed-length fields must be read in,
  2. The lengths of the variable-length fields must be read in,
  3. The memory for the variable-length fields must be allocated,
  4. The variable-length fields must be read in.

In turn, this means slightly more error-checking because s_read must check if the memory allocation succeeded before continuing to read in the variable-length fields.
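The four steps can be sketched on a cut-down struct. Everything in the following block (mini_t, mini_write and mini_read) is hypothetical and exists only for illustration; it is not the s_read function for foo_t, although that function would follow the same shape with more fields.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// Cut-down stand-in for foo_t: one scalar and one malloc'ed array.
typedef struct {
   uint32_t  u32;
   uint16_t  ptr16len;
   uint16_t *ptr16;
} mini_t;

static bool mini_write (const mini_t *src, FILE *outf)
{
   assert (src != NULL && outf != NULL);
   if (fprintf (outf, "0x%08" PRIx32 " %" PRIu16,
                src->u32, src->ptr16len) < 0)
      return false;
   for (size_t i = 0; i < src->ptr16len; i++) {
      if (fprintf (outf, " %04" PRIx16, src->ptr16[i]) < 0)
         return false;
   }
   return true;
}

static bool mini_read (mini_t *dst, FILE *inf)
{
   assert (dst != NULL && inf != NULL);
   dst->ptr16 = NULL;

   // Steps 1 and 2: the fixed-length field and the array length.
   // fscanf returns the number of fields it converted and stored.
   if (fscanf (inf, "%" SCNx32 " %" SCNu16, &dst->u32, &dst->ptr16len) != 2)
      return false;

   // Step 3: allocate. At least one element is requested so that a zero
   // length cannot be mistaken for an allocation failure.
   size_t n_alloc = dst->ptr16len ? dst->ptr16len : 1;
   dst->ptr16 = malloc (n_alloc * sizeof *dst->ptr16);
   if (dst->ptr16 == NULL)
      return false;

   // Step 4: read the variable-length field, cleaning up on failure.
   for (size_t i = 0; i < dst->ptr16len; i++) {
      if (fscanf (inf, "%" SCNx16, &dst->ptr16[i]) != 1) {
         free (dst->ptr16);
         dst->ptr16 = NULL;
         return false;
      }
   }
   return true;
}
```

On success the caller owns dst->ptr16 and must free it.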

Specification for Generalised Readers and Writers

The above serialisation function s_write needs to be rewritten for every structure that we want to serialise, and involves lots of very similar code and logic: write/read scalar fields, write/read the length of array fields and write/read the actual array fields, allocating space if necessary. As these steps will need to be taken by every serialisation function it might be in our interests to see if we are able to parameterise the writing and reading of the fields.

Our parameterised functions must take a specification that tells them how each field should be written, very similar to the printf and scanf family of functions. While it is indeed possible to reuse the standard format specifiers as our own field specifiers it might not be a good idea to do so as this would break The Principle of Least Astonishment [1]. This is because anyone who is changing the code and who sees a string literal with well-known format specifiers such as %02x and %c would naturally (but incorrectly) assume that all format specifiers are supported. We do not wish to confuse the reader.

Our format specification will be different and thus needs to be recognised as different. It is still perfectly possible to make the specification easy and intuitive while ensuring that it is wholly different to the existing printf and scanf family of format specifiers. To this end I adopted the following convention for specifying a single field:

   #[num]<type-spec>[width]
WHERE
   #: Marks the start of a field

   num:
      Optional number indicating that the field is an array (not malloced)
      of 'num' elements. Ignored if type-spec is 's'. When num starts with
      'm' the array is treated as a malloced array: the number immediately
      following the 'm' gives the bit width of the array's length, and the
      length itself is given by the next argument in the function call.

   <type-spec>:
      Mandatory type specifier that is one of:
         u:    Integer
         f:    Floating point number
         s:    C-style NUL-terminated string

   width:
      Optional bitwidth that specifies 8, 16, 32 or 64 bits. Defaults to
      32 if not specified and ignored if type-spec is 's'.

As an example, a field that is an array of uint64_t of 12 elements would have a format specifier of "#12u64". For malloced fields where the size cannot be known during compilation we specify that the number of elements is given by the next argument in the function call using, for example, #m32u64 to specify that the field is a malloced array and the field length is stored in the next parameter to this function which is a uint32_t.
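To make the convention concrete, here is a minimal sketch of how a single field specifier might be parsed. It is not the implementation behind sstruct_write and sstruct_read; the names field_spec_t, read_number and parse_field are hypothetical, and treating the number after 'm' as mandatory is an assumption drawn from the description above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical parsed form of one "#[num]<type-spec>[width]" field.
typedef struct {
   bool     malloced;   // true when num starts with 'm'
   unsigned len_width;  // bit width of the length argument (malloced only)
   unsigned count;      // element count for fixed arrays, 1 otherwise
   char     type;       // 'u', 'f' or 's'
   unsigned width;      // element bit width, 0 for strings
} field_spec_t;

// Read a run of decimal digits at *p, advancing *p past them.
static unsigned read_number (const char **p)
{
   unsigned n = 0;
   while (isdigit ((unsigned char) **p))
      n = n * 10 + (unsigned) (*(*p)++ - '0');
   return n;
}

// Parse one field starting at spec. Returns a pointer just past the field,
// or NULL if the text does not follow the convention.
static const char *parse_field (const char *spec, field_spec_t *out)
{
   assert (spec != NULL && out != NULL);
   *out = (field_spec_t) { .count = 1 };

   if (*spec++ != '#')
      return NULL;

   if (*spec == 'm') {
      spec++;
      if (!isdigit ((unsigned char) *spec))
         return NULL;            // assumed: the length width is mandatory
      out->malloced  = true;
      out->len_width = read_number (&spec);
   } else if (isdigit ((unsigned char) *spec)) {
      out->count = read_number (&spec);
   }

   if (*spec != 'u' && *spec != 'f' && *spec != 's')
      return NULL;
   out->type = *spec++;

   // The width defaults to 32 bits when it is not given.
   unsigned width = isdigit ((unsigned char) *spec) ? read_number (&spec) : 32;
   if (out->type == 's') {
      // num and width are both ignored for strings.
      *out = (field_spec_t) { .count = 1, .type = 's' };
      return spec;
   }
   if (width != 8 && width != 16 && width != 32 && width != 64)
      return NULL;
   out->width = width;
   return spec;
}
```

A full implementation would call a function like this in a loop over the format string, skipping the spaces between fields and consuming one or two arguments per field.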

The functions to implement the above specification can be named in a generic manner, such as sstruct_write and sstruct_read. Once created we can use them for just about any serialisation operation on structs by writing our s_write and s_read functions as a single error-checked line that calls the sstruct_* functions.

For example, our s_write function will be as simple as this:

bool s_write (const foo_t *src, FILE *outf)
{
   // One argument per field, plus the length for the malloced array.
   return (sstruct_write (outf,
         "#u8 #u16 #u32 #u64 #f32 #f64 #s #25u16 #m16u16",
                                src->u8,
                                src->u16,
                                src->u32,
                                src->u64,
                                src->fp_f,
                                src->fp_d,
                                src->cstring,
                                src->a16,
                                src->ptr16,
                                src->ptr16len));
}

Of course, this function will have to be written for every struct that needs to be serialised, but it is trivially easy to do so and has only a single decision path with extremely low complexity. The next struct we want to serialise will have an almost identical function, with the only difference being the format specifier string literal and the argument list.

The s_read function also becomes embarrassingly trivial when we have an sstruct_read function:

bool s_read (foo_t *dst, FILE *inf)
{
   // dst->a16 is an array inside the struct, so it is passed as-is.
   return (sstruct_read (inf,
         "#u8 #u16 #u32 #u64 #f32 #f64 #s #25u16 #m16u16",
                               &dst->u8,
                               &dst->u16,
                               &dst->u32,
                               &dst->u64,
                               &dst->fp_f,
                               &dst->fp_d,
                               &dst->cstring,
                               dst->a16,
                               &dst->ptr16,
                               &dst->ptr16len));
}

The difference is that the sstruct_read function needs pointers as arguments in order to write the values we are reading from the file and for some of the fields sstruct_read will have to allocate memory.
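One detail of the writing side deserves a note: a variadic function receives uint8_t and uint16_t arguments promoted to int, so the format string is the only thing that tells it which type to fetch for each field. The following is a minimal sketch of that idea and not the downloadable implementation. The function mini_sstruct_write is hypothetical, it understands only scalar #u fields, it writes plain hexadecimal text, and it assumes that int is no wider than 32 bits.

```c
#include <assert.h>
#include <ctype.h>
#include <inttypes.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

// Hypothetical, heavily cut-down writer: understands only scalar "#u8",
// "#u16", "#u32" and "#u64" fields separated by spaces.
static bool mini_sstruct_write (FILE *outf, const char *fmt, ...)
{
   assert (outf != NULL && fmt != NULL);
   bool ok = true;
   va_list ap;
   va_start (ap, fmt);

   while (ok && *fmt != '\0') {
      if (*fmt == ' ') {
         fmt++;
         continue;
      }
      if (fmt[0] != '#' || fmt[1] != 'u') {
         ok = false;
         break;
      }

      // Parse the optional bit width; the convention defaults to 32.
      const char *p = fmt + 2;
      unsigned width = 0;
      while (isdigit ((unsigned char) *p))
         width = width * 10 + (unsigned) (*p++ - '0');
      if (p == fmt + 2)
         width = 32;
      fmt = p;

      uint64_t value = 0;
      switch (width) {
         // uint8_t and uint16_t arguments arrive promoted to int.
         case 8:
         case 16: value = (uint64_t) va_arg (ap, int);  break;
         case 32: value = va_arg (ap, uint32_t);        break;
         case 64: value = va_arg (ap, uint64_t);        break;
         default: ok = false;                           break;
      }
      if (ok && fprintf (outf, "%" PRIx64 " ", value) < 0)
         ok = false;
   }

   va_end (ap);
   return ok;
}
```

As with printf, the compiler cannot check that the arguments match this custom format string, so a mismatch between the two is undefined behaviour rather than a reported error.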

In the course of producing this article I developed the sstruct_read and sstruct_write functions. You can download these functions to use in your own projects from this directory over here. Note that you will also need the endian library.

Next…

I listed the most common and simplest strategies for serialising a composite data object to and from a file in a platform-independent manner. In the process of listing these strategies I also presented a generalised struct serialisation method using format specifiers to read and write multiple fields in a single function call, including static and dynamic arrays. This simplifies the code needed to read and write structs to a file.

In my next article dealing with composite data types I shall examine the serialisation of composite data objects, AKA structs, with media other than files. For example, it is frequently useful to send composite data objects over a network, so serialising structs to a data packet or data stream is very useful for transmitting data across different platforms (say, from a mobile phone to a server).

Another fairly important persistence mechanism for composite data objects is Object-Relational Mapping [2]. In future articles I will examine the requirements for mapping struct objects to database rows and columns with a view to producing a simple database persistence layer for C structs.

Both of these ideas build on the format-string specifier method presented here for structs.



  1. See Wikipedia for a full explanation.

  2. Here’s a good explanation of an ORM.