Building better serialization

The "When in Rome, do as the Romans do" blog post series used serialization as an example of what happens when you apply coding patterns that are inappropriate for a given language. It demonstrated how field-based serialization does not fit well in Delphi, and how published-property-based serialization is more appropriate.

But it also briefly covered some poor serialization practices that are not language-specific. For instance, attribute-based configuration is widely used across various languages and serialization frameworks, and for the same reasons such a configuration style is a poor practice in Delphi, it is a poor practice in any other language.

Knowing what constitutes a good serialization framework is important regardless of whether you are assessing an existing framework or building your own.

While this article approaches serialization from a Delphi point of view, the listed principles are valid in—and can be applied to—any language.



Configuration and attribute-based serialization

Attribute- (annotation-) based serialization creates tight coupling between serialization frameworks, serialization formats and serialized classes. It makes it hard to change or use additional frameworks and formats. And that is equally bad in any language.

I cannot explain why such serialization configuration is so popular. Maybe because most business classes will only be serialized in one particular, fixed format, and configuring serialization directly inside the class makes it easier to see what the serialized object will look like. It is convenient to use. Attribute-based serialization works perfectly until you need to support more than one input/output format, or you need to switch to a different framework—which does not happen often, or will happen late in the development process, often after the initial public release.

In other words, you will not have problems until they smack you right in the face, and at that point it will be too late to prevent a major code refactoring.

How much code refactoring will be needed depends on the serialization framework itself. If the framework supports other means of fully configuring the serialization process, then adapting will be much easier. If not, not only will you have to refactor away attributes, you will also need to shop for another serialization framework.

A good framework must be configurable outside business classes. If you are writing your own framework, adding support for attributes is a pure waste of time. Every single feature in a serialization framework contributes to the serialization process and has an impact on performance. While some features give you more flexibility and the ability to adjust serialization to your classes and not the other way around, attributes are just redundant.
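To illustrate configuration that lives outside business classes, here is a minimal sketch. It is written in Python only to keep it short and runnable; the same idea applies to Delphi. The class `Customer`, the two configuration tables and the helper `to_dict` are hypothetical names invented for this example, not part of any real framework.

```python
import json
from dataclasses import dataclass


@dataclass
class Customer:
    # Plain business class: it knows nothing about serialization.
    name: str
    email: str


# Hypothetical external configurations, one per output format.
# Each maps a class member to the key used in the serialized output.
API_CONFIG = {"name": "fullName", "email": "emailAddress"}
LEGACY_CONFIG = {"name": "CUST_NAME", "email": "CUST_EMAIL"}


def to_dict(obj, config):
    """Build the output mapping from configuration held outside the class."""
    return {key: getattr(obj, member) for member, key in config.items()}


customer = Customer("Ada", "ada@example.com")
api_json = json.dumps(to_dict(customer, API_CONFIG))
legacy_json = json.dumps(to_dict(customer, LEGACY_CONFIG))
```

Supporting a second format here means adding another configuration table. The `Customer` class itself does not change, which is exactly what attributes placed inside the class cannot offer.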

What gets serialized

We have already established that field-based serialization in Delphi is a poor fit, and published properties should be used instead. Any Delphi-based serialization framework should support published properties as the default serialization method. When it comes to other languages, good frameworks should support the most commonly used approach.

Having said that, there is nothing wrong with using other, less standard conventions, as long as they can be easily configured. A framework can let you choose to serialize only published properties, or also public, protected and private properties, or even fields, in any combination.

Another useful option is to provide the ability to specify whether any individual property or field will be serialized or not. While this is seldom used, it is indispensable in those cases where it's appropriate.
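A rough sketch of such member selection follows, in Python for brevity. Python has no published or private sections, so the leading-underscore naming convention stands in for a non-published member here. The class `Account` and the function `select_members` are hypothetical names used only for illustration.

```python
class Account:
    def __init__(self):
        self.owner = "Ada"
        self.balance = 100
        self._cache = {}  # stand-in for a non-published member


def select_members(obj, include_private=False, include=(), exclude=()):
    """Pick the members to serialize based on configuration, not on the class."""
    selected = {}
    for name, value in vars(obj).items():
        if name in exclude:
            continue  # individually switched off
        is_private = name.startswith("_")
        if is_private and not include_private and name not in include:
            continue  # not covered by the general rule or an individual opt-in
        selected[name] = value
    return selected


default_members = select_members(Account())
without_balance = select_members(Account(), exclude=("balance",))
with_cache = select_members(Account(), include=("_cache",))
```

The general rule (`include_private`) covers the common case, while `include` and `exclude` handle the rare individual member that needs different treatment.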

Name aliases and type name aliases

Pascal case, camel case, snake case...

Interoperability is an important consideration for serialization frameworks. Quite often, different platforms and different technologies will be used to process the same data. Every language has commonly used case conventions. Development teams can have their own rules that will be different from the commonly used ones. They can also adapt naming conventions to match their most commonly used toolset.

Different programming languages have different keywords, and not all names are available when naming properties or fields. Sometimes you will have to use an existing class and serialize it using different names. Specifying name and type name aliases for particular entries is a crucial feature for any serialization framework.

While name aliases can be used to resolve case convention incompatibilities, or field prefixes when field-based serialization is used, the ability to specify a general case convention or prefix saves time and avoids writing unnecessary configuration code for each and every data entry.
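The combination of a general case convention and individual aliases could look roughly like the sketch below (Python, with hypothetical function names). The alias example assumes a property named `Kind` that must be written as `type` in the output, a name that is a reserved word in Delphi.

```python
import re


def to_camel_case(name):
    # PascalCase -> camelCase: lower the first character only.
    return name[:1].lower() + name[1:]


def to_snake_case(name):
    # PascalCase -> snake_case: underscore before every capital except the first.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def output_name(name, convention=to_camel_case, aliases=None):
    """An explicit alias wins; otherwise the general convention applies."""
    aliases = aliases or {}
    if name in aliases:
        return aliases[name]
    return convention(name)


aliases = {"Kind": "type"}
names = [output_name(n, to_snake_case, aliases) for n in ("FirstName", "Kind")]
```

With the general convention in place, only the exceptions need an explicit entry in the configuration.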

Converters

The serialization process includes converting data from one type to another. Every serialization framework will implement some commonly used converters that can handle conversions to and from strings, integers, floating-point numbers, booleans, dates, and times. The same data type can also be represented in various ways. A date, for example, can be written as an ISO 8601 string or as a numeric timestamp.

The ability to choose which data converter will be used for a particular class or even for a particular entry, as well as the ability to create custom data converters, is also a must-have in any decent serialization framework. Without it, serializing any even remotely more complex class can turn into an impossible mission.
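One possible shape for this, sketched in Python with hypothetical names: a table of default converters keyed by type, plus optional per-entry overrides that take precedence. `DEFAULT_CONVERTERS`, `days_since_epoch` and `convert` are made up for this example.

```python
from datetime import date

# Default converters, chosen by the value's type.
DEFAULT_CONVERTERS = {
    date: lambda d: d.isoformat(),
}


def days_since_epoch(d):
    # Custom converter: represent a date as days since 1970-01-01.
    return (d - date(1970, 1, 1)).days


def convert(name, value, per_entry=None):
    """A per-entry converter wins; otherwise fall back to the type default."""
    per_entry = per_entry or {}
    if name in per_entry:
        return per_entry[name](value)
    converter = DEFAULT_CONVERTERS.get(type(value))
    return converter(value) if converter else value


iso_value = convert("born", date(2000, 1, 2))
custom_value = convert("born", date(1970, 1, 11), {"born": days_since_epoch})
```

Here the same `date` value is written as an ISO string by default and as a day count when the entry `born` is configured with a custom converter.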

Error handling

It is important to define behavior for when you have extra or missing fields. Is such data invalid or perfectly acceptable? What happens when the conversion of a field fails because the data is in an inappropriate format?

If the framework does not allow you to configure error handling, you will need additional code to handle all supported cases. In general, less strict serialization frameworks are easier to work with than rigid frameworks that do not allow any deviations in the input data, since you only need additional verification after deserialization (parsing) is complete.
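A configurable policy for extra fields, missing fields and failed conversions might be sketched like this. It is written in Python, and the function `parse` and its `strict` switch are assumptions for illustration, not any framework's actual API.

```python
def parse(data, schema, strict=False):
    """Convert raw input using schema (field name -> conversion callable).

    strict=True raises on the first deviation; strict=False collects
    problems so the caller can verify the result afterwards.
    """
    result, problems = {}, []

    def report(message):
        if strict:
            raise ValueError(message)
        problems.append(message)

    # Extra fields: present in the input but unknown to the schema.
    for key in data:
        if key not in schema:
            report(f"unknown field: {key}")

    # Missing fields and conversion failures.
    for name, convert in schema.items():
        if name not in data:
            report(f"missing field: {name}")
            continue
        try:
            result[name] = convert(data[name])
        except (TypeError, ValueError):
            report(f"invalid value for field: {name}")

    return result, problems


schema = {"id": int, "name": str}
result, problems = parse({"id": "7", "extra": True}, schema)
```

In lenient mode the caller gets both the parsed data and a list of deviations, and decides afterwards whether they matter. In strict mode the same deviations stop processing immediately.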

Defaults

Choosing good default options, ones that will satisfy most use cases, not only saves time when writing configuration code, but also saves time during the serialization process. The more entries and rules you have in your configuration, the slower the process will be. Every setting needs to be read and applied.

This is the one place where choosing appropriate coding patterns and styles for a particular language matters the most. If the framework uses uncommon conventions, it will require more configuration code.


These are the most important features that need to be considered when choosing or writing serialization frameworks. The simplest frameworks will possibly lack some features, or will not give as many fine-grained configuration options. Make sure that the framework has all the features you need before you start using it at scale. It is also important to test speed and memory consumption. Even if every other aspect of a framework is just perfect, if it is too slow or consumes too much memory, it will be of no use, especially if you need to process large amounts of data.

Most textual formats use UTF-8 encoding. If performance and, even more so, memory consumption are paramount, frameworks that process UTF-8 encoded data directly will generally run faster and consume less memory than frameworks that parse or store UTF-16 encoded data during processing. That does not mean that business classes must use UTF-8 encoded strings in fields.
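The memory difference is easy to see with a small check (Python, using a made-up sample document). For text that is mostly ASCII, as JSON keys and punctuation usually are, the UTF-16 form takes twice as many bytes as the UTF-8 form.

```python
# Hypothetical sample document consisting only of ASCII characters.
text = '{"name": "Ada"}'

utf8_size = len(text.encode("utf-8"))       # one byte per ASCII character
utf16_size = len(text.encode("utf-16-le"))  # two bytes per ASCII character

ratio = utf16_size / utf8_size
```

The ratio shrinks for text with many non-ASCII characters, so it is worth measuring with data that resembles what you actually process.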

When you write your own framework, it is good to start with the YAGNI principle. Write only the core code you need to make the framework work. You can expand it later on, when and if you really need particular features.

Still, it is good to be aware of all the features that serialization frameworks generally consist of, in order to make your code more extendable. Sooner or later, you will hit cases where the default behavior just won't be sufficient, and you will need most of the features mentioned here.
