Structures are the cornerstone of data representation in every non-trivial C program. Because the C language lacks reflection, serialising structures often means writing separate struct_write and struct_read functions for every type of structure in the program.
This article examines multiple techniques for serialising structs. Each approach uses a structure containing typical fields found in most structs and two functions to perform the serialisation: s_read and s_write.
Finally, a general method for serialising any arbitrary structure to and from files is presented, implemented as functions that take an enhanced format string containing field specifiers.
The struct datatype is a cornerstone of composite data representation in C programs. All non-trivial programs will represent much of the data during program execution as one or more struct types. My previous article covered the basics of abstracting away the structure as an opaque data type, with its own functions and hidden internals. This article focuses on storing a struct in a form that can be later loaded back into the program.
The structure that I will use as an example will have fields that represent all the common types we see in fields within a struct:
typedef struct foo_t foo_t;
struct foo_t {
   // All different widths of integers
   uint8_t u8;
   uint16_t u16;
   uint32_t u32;
   uint64_t u64;
   // Floating point precision numbers
   float fp_f;
   double fp_d;
   // A null-terminated C string
   char *cstring;
   // A 25-element long array of a primitive type
   uint16_t a16[25];
   // Length of allocated array below
   uint16_t ptr16len;
   // A pointer to a runtime allocated array of a primitive type
   uint16_t *ptr16;
};
I’ve left out one commonly seen field of a struct, the pointer to some other struct. In other words this field is missing:
struct foo_t {
   ...
   // A pointer to a runtime allocated structure.
   struct bar_t *bar;
   ...
};
That particular use-case is omitted because it must be serialised using the same approach used for the parent struct, i.e. whichever approach is chosen for the parent struct must also be used recursively for every child struct.
Most programmers confronted with the need to save and load structures will turn to fread and fwrite as the first solution. It’s not a bad idea to use functions available in the standard library, and because fread and fwrite are both standard functions you can be certain that they are available on all platforms with a conforming compiler.
A first pass of structure serialisation would probably be:
bool s_write (const foo_t *src, FILE *outf)
{
   size_t n_items = fwrite (src, sizeof *src, 1, outf);
   return n_items == 1;
}
// dst must not be const: fread stores the loaded bytes into it.
bool s_read (foo_t *dst, FILE *inf)
{
   size_t n_items = fread (dst, sizeof *dst, 1, inf);
   return n_items == 1;
}
Unfortunately this solution has a number of problems:
Saving a struct on a 32-bit ARM platform and then trying to load it on a 64-bit server will result in the fields having different (possibly random) values.
It should be clear that relying on the compiler to always maintain the struct layout is not a good idea; a simple recompilation using the same compiler and the same compiler version with a different flag (or, heaven forbid, a pragma buried deep within some source file) can make your program incompatible with its previously saved data. Unfortunately the trivial implementations of s_write and s_read save and load structs using the layout currently chosen by the compiler, and that layout is subject to change across platforms, compilers and compiler settings.
Even if nothing in the program changes, this approach will not work when there are fields that are pointers to other objects: only the address held in the pointer is written, not the data it points to.
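To make the layout problem concrete, here is a minimal sketch that measures how many bytes of a struct belong to no field at all. The demo_t type and both function names are hypothetical and exist only for this illustration; the actual numbers depend on the compiler and its settings, which is exactly the problem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical two-field struct, used only to illustrate padding.
typedef struct {
   uint8_t  tag;    // 1 byte of data
   uint32_t value;  // 4 bytes of data
} demo_t;

// Number of bytes in demo_t that belong to no field. Calling fwrite on
// the whole struct writes these bytes to the file as well.
size_t demo_padding (void)
{
   size_t data_bytes = sizeof (uint8_t) + sizeof (uint32_t);
   assert (sizeof (demo_t) >= data_bytes);
   return sizeof (demo_t) - data_bytes;
}

// Offset of the second field: 1 if the compiler packs the struct,
// commonly 4 if it aligns uint32_t on a 4-byte boundary.
size_t demo_value_offset (void)
{
   return offsetof (demo_t, value);
}
```

If two builds of the same program disagree on either result, a file written by one build with the whole-struct fwrite cannot be read correctly by the other.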
The easiest way to ensure that we read all the fields in a struct correctly is to write each field out individually when saving the struct and to read each field in individually when loading it. Naturally this results in two very linear but long and tedious functions. Due to the length I will only display the s_write function; the s_read function is very similar: one line of code and one conditional for error checking per field, with an extra line of code, an extra conditional and a loop construct for each field that has a length. For example, to serialise just a single field in the struct you would have to do the following:
bool s_write (const foo_t *src, FILE *outf)
{
   ...
   size_t n_items;
   n_items = fwrite (&src->u16, sizeof src->u16, 1, outf);
   if (n_items != 1) {
      // Handle error
   }
   ...
}
A simple struct of ten fields requires approximately 45 lines of serialisation code. A similar ratio exists for reading the struct back from the file. A non-trivial program would likely define and use tens of structs, each probably needing serialisation.
An additional problem is serialising strings: the string length needs to be stored before the string when saving the struct so that the reader knows how much memory to allocate when loading the string. This means that each string (or array) causes the reader to perform extra operations (such as calling malloc and checking for errors) and the writer to perform extra operations (such as writing the length before writing the field).
The code to write a single string field now looks like this:
bool s_write (const foo_t *src, FILE *outf)
{
   ...
   uint32_t s_len = strlen (src->cstring) + 1;
   size_t n_items;
   // First write the length so that the reader knows how many characters
   // are in this string.
   n_items = fwrite (&s_len, sizeof s_len, 1, outf);
   if (n_items != 1) {
      // Handle error
   }
   // Then write the actual string.
   n_items = fwrite (src->cstring, 1, s_len, outf);
   if (n_items != s_len) {
      // Handle error
   }
   ...
}
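The matching read side is not shown above, so here is a rough sketch of both halves as a pair of standalone helpers. The names lstring_write and lstring_read are hypothetical, and the length is still stored in the machine's native byte order, exactly as in the snippet above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: write the length (including the terminator)
// followed by the characters, in the machine's native byte order.
bool lstring_write (const char *src, FILE *outf)
{
   assert (src != NULL && outf != NULL);
   uint32_t s_len = (uint32_t) (strlen (src) + 1);
   if (fwrite (&s_len, sizeof s_len, 1, outf) != 1) {
      return false;
   }
   return fwrite (src, 1, s_len, outf) == s_len;
}

// Hypothetical helper: read the length, allocate, then read the
// characters. Returns NULL on any failure; the caller frees the result.
char *lstring_read (FILE *inf)
{
   assert (inf != NULL);
   uint32_t s_len = 0;
   if (fread (&s_len, sizeof s_len, 1, inf) != 1 || s_len == 0) {
      return NULL;
   }
   char *dst = malloc (s_len);
   if (dst == NULL) {
      return NULL;
   }
   // Reject data that is short or not properly terminated.
   if (fread (dst, 1, s_len, inf) != s_len || dst[s_len - 1] != '\0') {
      free (dst);
      return NULL;
   }
   return dst;
}
```

The reader needs three checks (length read, allocation, character read) for what the writer does in two, which is the extra work per string or array field described above.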
The other problem with the above approach is that it is only the minimal code needed to write a struct to a file. While it does perform the minimum amount of error-checking required, it still does not ensure that we can read back the struct on a different platform, because this approach assumes that the endianness of the integer and floating-point types will always be the same on all machines.
This is not true; in a multibyte value (i.e. a value consisting of 2, 4 or 8 bytes) some machines will store the lowest byte first while others will store the highest byte first. This becomes a problem if you save a value into a file on a particular machine, copy the file to a machine which has a different endianness and then try to load the value in that file. What happens is that the bytes are read in reverse order, and so you get an incorrect value.
The correct thing to do is to save each multibyte value in a fixed byte order so that the loading function s_read will know how to read the value “forwards”. The easiest way to do this is to use bit shifts and the bitwise AND operation to extract each byte of the multibyte value and write each byte separately. When reading the value back in, use shifts and the bitwise OR operation to piece the multibyte value together. In this way it doesn’t matter which endianness is in use, as the operations work the same way on every machine, even when transporting the saved file between machines of different endianness.
I’ve made available a set of functions to convert uintXX_t to a sequence of bytes and back again. This standalone C module for use in your own projects can be downloaded from here.
Unfortunately the endian-agnostic requirement increases the effort to serialise structs: each field must now also be converted before writing and after reading. The previous code snippet now becomes:
...
size_t n_items;
uint8_t buffer[2];
endian_save16 (buffer, src->u16);
n_items = fwrite (buffer, sizeof buffer, 1, outf);
if (n_items != 1) {
   // Handle error
}
...
That’s just for a single field, in a single struct. For multiple structs much of that code will be repeated with very few changes.
We can ignore the endianness of integers by using formatted IO. The C standard’s formatted output function fprintf will write the integers in a human-readable form that fscanf can read back in. Using the formatted IO functions can save a lot of the work that the per-field approach requires while still including all of the error-checking.
We can ignore the order of the fields in a struct and write them in the most convenient order. For this example, using struct foo_t, we write all the fixed-length fields first, plus a few extra fields containing the length of the variable-length fields so that the reader can read them all in a single function call.
bool s_write (const foo_t *src, FILE *outf)
{
   size_t s_len = strlen (src->cstring) + 1;
   // fprintf returns the number of characters written, or a negative
   // value on error, so every call is checked for a negative result.
   // First write out all the fixed-length scalar fields
   int rc = fprintf (outf,
                     "0x%02" PRIx8      // 1. src->u8
                     " 0x%04" PRIx16    // 2. src->u16
                     " 0x%08" PRIx32    // 3. src->u32
                     " 0x%016" PRIx64   // 4. src->u64
                     " %f"              // 5. src->fp_f
                     " %lf"             // 6. src->fp_d
                     " %zu"             // 7. length of src->cstring
                     " %zu",            // 8. src->ptr16len
                     src->u8,                  // 1
                     src->u16,                 // 2
                     src->u32,                 // 3
                     src->u64,                 // 4
                     src->fp_f,                // 5
                     src->fp_d,                // 6
                     s_len,                    // 7
                     (size_t) src->ptr16len);  // 8, cast to match %zu
   if (rc < 0) {
      return false;
   }
   // Write out all the fixed-length array fields
   for (size_t i=0; i<sizeof src->a16 / sizeof src->a16[0]; i++) {
      if (fprintf (outf, " %04" PRIx16, src->a16[i]) < 0) {
         return false;
      }
   }
   // Write out the variable-length fields.
   // First write out the cstring field, one hex byte per character so
   // that spaces and the terminator survive being read back with fscanf.
   for (size_t i=0; i<s_len; i++) {
      if (fprintf (outf, " %02x", (unsigned char) src->cstring[i]) < 0) {
         return false;
      }
   }
   // Then write out the ptr16 field
   for (size_t i=0; i<src->ptr16len; i++) {
      if (fprintf (outf, " %04" PRIx16, src->ptr16[i]) < 0) {
         return false;
      }
   }
   return true; // On error we have already returned in the error handling
                // code above.
}
The above function writes the entire struct foo_t out to a file, checks for an error on each write and also ensures that a machine of different endianness can read the values back in.
The function to read in the struct written by s_write is slightly more complicated because two of the fields are dynamically allocated. This means that s_read must check that each memory allocation succeeded before continuing to read in the variable-length fields.
The above serialisation function s_write needs to be rewritten for every structure that we want to serialise, and involves lots of very similar code and logic: write/read scalar fields, write/read the length of array fields and write/read the actual array fields, allocating space if necessary. As these steps will need to be taken by every serialisation function, it might be in our interests to see if we are able to parameterise the writing and reading of the fields.
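As a rough sketch of the allocation check that s_read needs for a variable-length field, the helper below reads a length followed by that many hexadecimal values. The name read_u16_array and the exact text layout (a decimal count, then the elements) are assumptions made for this example only.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: reads a decimal element count followed by that
// many hexadecimal values, e.g. " 3 0001 00ff 1234". Returns a malloced
// array (the caller frees it) and stores the count in *len, or returns
// NULL on any failure.
uint16_t *read_u16_array (FILE *inf, size_t *len)
{
   assert (inf != NULL && len != NULL);
   size_t count = 0;
   if (fscanf (inf, " %zu", &count) != 1 || count == 0) {
      return NULL;
   }
   if (count > SIZE_MAX / sizeof (uint16_t)) {
      return NULL;   // the byte count would overflow size_t
   }
   uint16_t *dst = malloc (count * sizeof *dst);
   if (dst == NULL) {
      return NULL;   // allocation failed: stop before reading elements
   }
   for (size_t i = 0; i < count; i++) {
      if (fscanf (inf, " %" SCNx16, &dst[i]) != 1) {
         free (dst);
         return NULL;
      }
   }
   *len = count;
   return dst;
}
```

Every variable-length field in every struct needs this same count, allocate, loop sequence, which is the repetition that a parameterised approach aims to remove.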
Our parameterised functions must take a specification that tells them how each field should be written, very similar to the printf and scanf family of functions. While it is indeed possible to reuse the standard format specifiers as our own field specifiers, it might not be a good idea to do so as this would break the Principle of Least Astonishment[1]. Anyone changing the code who sees a string literal with well-known format specifiers such as %02x and %c would naturally (but incorrectly) assume that all format specifiers are supported. We do not wish to confuse the reader.
Our format specification will be different and thus needs to be recognised as different. It is still perfectly possible to make the specification easy and intuitive while ensuring that it is wholly different to the existing printf and scanf family of format specifiers. To this end I adopted the following convention for specifying a single field:
#[num]<type-spec>[width]
WHERE
   #:           Marks the start of a field
   num:         Optional number indicating that the field is an array (not malloced)
                of 'num' elements. Ignored if type-spec is 's'. When num is 'm' then
                the array is treated as a malloced array with the number of elements
                given by the next argument in the function call; the width of that
                length argument is given by the number immediately following the 'm'.
   <type-spec>: Mandatory type specifier that is one of:
                   u: Integer
                   f: Floating point number
                   s: C-style NUL-terminated string
   width:       Optional bitwidth that specifies 8, 16, 32 or 64 bits. Defaults to
                32 if not specified and ignored if type-spec is 's'.
As an example, a field that is an array of 12 uint64_t elements would have a format specifier of "#12u64". For malloced fields, where the size cannot be known during compilation, we specify that the number of elements is given by the next argument in the function call using, for example, #m32u64 to specify that the field is a malloced array of uint64_t and that the field length is passed in the next parameter to this function, which is a uint32_t.
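To show that the convention is straightforward to parse, here is a minimal sketch of a parser for a single field specifier. It is not taken from the downloadable implementation; field_spec_t, read_number and field_spec_parse are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical parsed form of one "#[num]<type-spec>[width]" field.
typedef struct {
   bool     malloced;   // true when num was 'm'
   unsigned len_width;  // bit width of the length argument (malloced only)
   unsigned count;      // element count; 1 for a scalar field
   char     type;       // 'u', 'f' or 's'
   unsigned width;      // bit width of each element; 32 by default
} field_spec_t;

// Reads a run of decimal digits at *s, advancing *s past them.
// Returns dflt when there are no digits.
static unsigned read_number (const char **s, unsigned dflt)
{
   if (!isdigit ((unsigned char) **s)) {
      return dflt;
   }
   unsigned n = 0;
   while (isdigit ((unsigned char) **s)) {
      n = n * 10 + (unsigned) (**s - '0');
      (*s)++;
   }
   return n;
}

// Parses one field starting at s. Returns a pointer to the character
// after the field, or NULL if s does not start with a valid field.
const char *field_spec_parse (const char *s, field_spec_t *out)
{
   assert (s != NULL && out != NULL);
   if (*s != '#') {
      return NULL;
   }
   s++;
   out->malloced = (*s == 'm');
   out->len_width = 0;
   out->count = 1;
   if (out->malloced) {
      s++;
      out->len_width = read_number (&s, 0);
      if (out->len_width == 0) {
         return NULL;   // 'm' must be followed by the length's width
      }
   } else {
      out->count = read_number (&s, 1);
   }
   if (*s != 'u' && *s != 'f' && *s != 's') {
      return NULL;
   }
   out->type = *s++;
   out->width = read_number (&s, 32);
   return s;
}
```

Calling field_spec_parse repeatedly on the returned pointer, skipping the spaces between fields, walks a whole format string. A caller would ignore count and width when type is 's', as the convention requires.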
The functions to implement the above specification can be named in a generic manner, such as sstruct_write and sstruct_read. Once created we can use them for just about any serialisation operation on structs by writing our s_write and s_read functions as a single error-checked line that calls the sstruct_* functions.
For example, our s_write function will be as simple as this:
bool s_write (const foo_t *src, FILE *outf)
{
   return (sstruct_write (outf,
                          "#u8 #u16 #u32 #u64 #f32 #f64 #s #25u16 #m16u16",
                          src->u8,
                          src->u16,
                          src->u32,
                          src->u64,
                          src->fp_f,
                          src->fp_d,
                          src->cstring,
                          src->a16,
                          src->ptr16,
                          src->ptr16len));
}
Of course, this function will have to be written for every struct that needs to be serialised, but it is trivially easy to do so and has only a single decision path with extremely low complexity. The next struct we want to serialise will have an almost identical function, with the only difference being the format specifier string literal and the argument list.
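To give a feel for how a format-driven writer can consume its argument list, here is a deliberately tiny sketch. It is not the sstruct_write implementation: mini_write is a hypothetical function that understands only scalar "#u" fields, writes each value most-significant byte first, and assumes int is no wider than 32 bits.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical miniature of a format-driven writer. It understands only
// scalar "#u", "#u8", "#u16", "#u32" and "#u64" fields separated by
// spaces, and writes each value most-significant byte first.
bool mini_write (FILE *outf, const char *fmt, ...)
{
   assert (outf != NULL && fmt != NULL);
   va_list ap;
   bool ok = true;
   va_start (ap, fmt);
   while (ok && *fmt != '\0') {
      if (*fmt == ' ') {
         fmt++;
         continue;
      }
      if (fmt[0] != '#' || fmt[1] != 'u') {
         ok = false;   // unsupported field
         break;
      }
      char *end = NULL;
      unsigned long width = strtoul (fmt + 2, &end, 10);
      if (end == fmt + 2) {
         width = 32;   // no width given: default to 32 bits
      }
      uint64_t value = 0;
      // Arguments narrower than int are promoted when passed through
      // "...", so 8- and 16-bit fields are fetched as unsigned int.
      switch (width) {
         case 8:
         case 16: value = va_arg (ap, unsigned int); break;
         case 32: value = va_arg (ap, uint32_t);     break;
         case 64: value = va_arg (ap, uint64_t);     break;
         default: ok = false;                        break;
      }
      for (unsigned long shift = width; ok && shift > 0; shift -= 8) {
         ok = fputc ((int) ((value >> (shift - 8)) & 0xff), outf) != EOF;
      }
      fmt = end;
   }
   va_end (ap);
   return ok;
}
```

A call such as mini_write (outf, "#u8 #u16 #u32", src->u8, src->u16, src->u32) then has the same shape as the s_write function above. Arrays, strings and floating-point fields would each need their own branch.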
The s_read function also becomes embarrassingly trivial when we have an sstruct_read function:
bool s_read (foo_t *dst, FILE *inf)
{
   return (sstruct_read (inf,
                         "#u8 #u16 #u32 #u64 #f32 #f64 #s #25u16 #m16u16",
                         &dst->u8,
                         &dst->u16,
                         &dst->u32,
                         &dst->u64,
                         &dst->fp_f,
                         &dst->fp_d,
                         &dst->cstring,
                         dst->a16,
                         &dst->ptr16,
                         &dst->ptr16len));
}
The difference is that the sstruct_read function needs pointers as arguments in order to store the values read from the file, and for some of the fields sstruct_read will have to allocate memory.
In the course of producing this article I developed the sstruct_read and sstruct_write functions. You can download these functions to use in your own projects from this directory over here. Note that you will also need the endian library.
I listed the most common and simplest strategies for serialising a composite data object to and from a file in a platform-independent manner. In the process of listing these strategies I also presented a generalised struct serialisation method using format specifiers to read and write multiple fields in a single function call, including static and dynamic arrays. This simplifies the code needed to read and write structs to a file.
In my next article dealing with composite data types I shall examine the serialisation of composite data objects, AKA structs, with media other than files. For example, it is frequently useful to send composite data objects over a network, so serialising structs to a data packet or data stream is very useful for transmitting data across different platforms (say, from a mobile phone to a server).
Another fairly important persistence mechanism for composite data objects is Object-Relational Mapping[2]. In future articles I will examine the requirements for mapping struct objects to database rows and columns with a view to producing a simple database persistence layer for C structs.
Both of these ideas build on the format-string specifier method presented here for structs.
[2] Here’s a good explanation of an ORM.