Java serialization possibilities

for the inspectIT storage solution

To be able to store inspectIT data to disk, serialization of the data has to be performed; that much is simple and clear. However, Java native serialization comes with high overhead and many other problems, thus I performed an examination of the different possibilities that can be used to persist data to disk effectively. Effectively means as quickly as possible, with the resulting file also being as small as possible. Since serialization to binary data takes less space than any human-readable (or other) format, I have only analyzed the possibilities that serialize data into an "array of bytes".
Generally, there are three different solutions that I believe are acceptable:

  1. Cross-language data transformation libraries: Protocol Buffers or Thrift
  2. Fast Java serialization library Kryo
  3. Externalizable Interface

Data transformation libraries

Both Google and Facebook developed cross-language data transformation libraries to fit their needs of transferring data between the different languages that they use. Thus, Google created Protocol Buffers and Facebook created Thrift (although word has it that Thrift was actually developed by ex-Google employees who moved to Facebook). Thrift was started as an open-source project right away and was later given to the Apache foundation, while Google opened the source of Protocol Buffers from version 2.0.
The approach of both libraries is quite the same. Both define an IDL for describing the structure of the data that needs to be transferred. These IDLs are more or less the same: they both provide general types that exist in most programming languages and neither supports polymorphism, but there are some small differences. For example, I like that Thrift defines types like Map, Set and List that directly map to HashMap, HashSet and ArrayList in Java, while Protocol Buffers only has a List type of collection. Furthermore, Thrift also allows defining services in the IDL, so that transfer of the data can be easier (but we totally don't need this (smile) ). Here is an example of such a definition in Protocol Buffers:

message Person {
	required int32 id = 1;
	required string name = 2;
	optional string email = 3;
}

or in Thrift

struct UserProfile {
	1: i32 uid,
	2: string name,
	3: string blurb
}

So basically, for each class that we wish to serialize with these libraries, we have to create a similar definition. From this definition, code is generated via the compilers provided by Protocol Buffers and Thrift. It is possible to generate code for any supported language, Java of course as well. These generated classes know how to serialize and de-serialize objects. Thus, we can say this is some kind of static serialization: because all the definitions are known in advance, it is much faster than Java serialization or other serialization libraries that use reflection.

What I definitively did not like are the generated classes (I only produced them for Protocol Buffers). They look somehow ugly, have many inner classes, and are somehow confusing. Not to mention that for a, I would say, very simple definition, the compiler generated a class with 2,000 lines. Furthermore, there is a usage problem for everything that you want to serialize. For example, for our TimerData we would first need to create an object of a TimerDataProtos class (which basically has the same getters and setters as our class and is actually the one that can be serialized fast), then "clone" the data from our object to this other one and do the serialization, and do the same thing in the other direction for de-serializing. Seems like a lot of new classes and a lot of new code.
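
To illustrate the usage, here is a minimal sketch for the Person definition above. The outer class name PersonProtos and its package are assumptions; they depend on the options in the .proto file:

import com.example.protos.PersonProtos; // hypothetical generated class

public class PersonSerializationExample {

	public static void main(String[] args) throws Exception {
		// copy the data into the generated builder and build the immutable message
		PersonProtos.Person person = PersonProtos.Person.newBuilder()
				.setId(25)
				.setName("John Doe")
				.setEmail("john.doe@example.com")
				.build();

		// serialization
		byte[] bytes = person.toByteArray();

		// de-serialization
		PersonProtos.Person parsed = PersonProtos.Person.parseFrom(bytes);
		System.out.println(parsed.getName());
	}
}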

What I definitively like is the backward and forward compatibility that these two libraries provide out of the box, if some basic rules are followed. Compatibility is achieved by associating a number with each field (see the sample definition code). Thus, new fields can be added to the definition without any problems, and when a field should be removed in a new class version, the definition just has to keep the old field, so that old data can still be read (the field will simply be ignored).

Although both libraries are quite simple, I found a nice table that compares the two:

|                           | Thrift                                                                                          | Protocol Buffers                                                                                     |
|---------------------------|-------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| Backers                   | Facebook, Apache (accepted for incubation)                                                      | Google                                                                                                 |
| Bindings                  | C++, Java, Python, PHP, XSD, Ruby, C#, Perl, Objective C, Erlang, Smalltalk, OCaml, and Haskell | C++, Java, Python (Perl, Ruby, and C# under discussion)                                                |
| Output Formats            | Binary, JSON                                                                                    | Binary                                                                                                 |
| Primitive Types           | bool, byte, 16/32/64-bit integers, double, string, byte sequence, map<t1,t2>, list<t>, set<t>   | bool, 32/64-bit integers, float, double, string, byte sequence; "repeated" properties act like lists   |
| Enumerations              | Yes                                                                                             | Yes                                                                                                    |
| Constants                 | Yes                                                                                             | No                                                                                                     |
| Composite Type            | Struct                                                                                          | Message                                                                                                |
| Exception Type            | Yes                                                                                             | No                                                                                                     |
| Documentation             | Bad                                                                                             | Good                                                                                                   |
| License                   | Apache                                                                                          | BSD                                                                                                    |
| Compiler Language         | C++                                                                                             | C++                                                                                                    |
| RPC Interfaces            | Yes                                                                                             | Yes                                                                                                    |
| RPC Implementation        | Yes                                                                                             | No                                                                                                     |
| Composite Type Extensions | No                                                                                              | Yes                                                                                                    |


It is clear that Thrift wins in functionality, and this is probably the consequence of having people who had already worked on Protocol Buffers development actually creating it, so they were able to add the missing functionality straight away. But on the other side, Thrift lacks any kind of documentation, and this is a serious problem. Except for one very good white paper, it is very difficult to find any other information, tutorials, explanations, etc.

Kryo library

While I was trying to find out which had better performance results, Thrift or Protocol Buffers, I stumbled upon a nice benchmark of different serialization libraries (http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking). The benchmark measures, among other things: total time (create object, serialize and de-serialize), serialization time, de-serialization time, and the size in bytes after serialization.
Kryo is at the top of all of these lists, so I supposed it was worth checking out. And I stumbled upon a very small and very nice library that really has some nice features. So let's jump to the features right away:

  • Works with Java NIO
  • Provides automatic serialization (no description files, no generated classes, etc.)
    • But classes have to be registered before the serialization, or the registration option has to be left open, which influences the performance and the size of the serialized bytes (see the usage sketch after this list)
    • Registration of the classes has to be done in the same order every time (quite a non-flexible constraint)
  • Serialization
    • Provides a set of existing serialization classes for Java primitive types, Maps, Collections, etc.
    • Provides a FieldSerializer class that can serialize arbitrary classes. This class uses ReflectASM to do the serialization. The fastest serialization is possible only if the class fields are public, or protected/default and setAccessible(true) is permitted by the security manager. Otherwise, when they are private, it uses "fast" reflection (which is still fast, even faster than the libraries above)
    • Provides a CompatibleFieldSerializer that has backward and forward compatibility, with some overhead compared to the one above
    • And enables the user to develop custom serializers for every class. This should be the fastest solution, but compatibility control then also has to be done manually
  • Provides support for circular references in the object graph (especially important for the invocation sequence data)
  • Multi-thread support is available
  • Provides several data compression options
    • Deflate algorithm – this somehow was not working in my tests; first, I did not see any difference in size, and secondly, the time needed for serialization was 15 times slower than without it
    • Delta mechanism – only the difference from the last object is serialized, but for this to work, objects mostly have to be of the same class and have similar field values
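
For illustration, here is a minimal usage sketch with the Kryo 2 API, assuming our TimerData class and a placeholder file name:

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import com.esotericsoftware.kryo.serializers.CompatibleFieldSerializer;
import java.io.FileInputStream;
import java.io.FileOutputStream;

public class KryoExample {

	public static void main(String[] args) throws Exception {
		Kryo kryo = new Kryo();
		// registration has to happen in the same order on every run
		kryo.register(TimerData.class, new CompatibleFieldSerializer(kryo, TimerData.class));

		// serialization to disk
		Output output = new Output(new FileOutputStream("data.bin"));
		kryo.writeObject(output, new TimerData());
		output.close();

		// de-serialization from disk
		Input input = new Input(new FileInputStream("data.bin"));
		TimerData copy = kryo.readObject(input, TimerData.class);
		input.close();
	}
}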

Kryo seems like a nice solution. It has amazing results (also in my tests) and it is very simple and very easy to learn. Furthermore, I see two different solutions here:

  1. Use the CompatibleFieldSerializer and have everything out of the box
  2. Create custom serializers and deal with the versioning control on our own (a sketch follows below). This is much more work, but should provide an at least 20%-30% faster solution. This is actually very close to the Externalizable interface, but in my opinion much easier, because serializers for the Java classes already exist in the library.
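
A hand-written serializer could look roughly like this (Kryo 2 API; the TimerData fields and accessors here are simplified assumptions):

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

public class TimerDataSerializer extends Serializer<TimerData> {

	@Override
	public void write(Kryo kryo, Output output, TimerData object) {
		// fields are written in a fixed order, without names or any schema
		output.writeLong(object.getCount());
		output.writeDouble(object.getDuration());
	}

	@Override
	public TimerData read(Kryo kryo, Input input, Class<TimerData> type) {
		TimerData data = new TimerData();
		data.setCount(input.readLong());
		data.setDuration(input.readDouble());
		return data;
	}
}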

Backward/forward compatibility problems with Kryo

As already said, Kryo provides the CompatibleFieldSerializer class that offers limited backward/forward compatibility. Limited, because the type of a field cannot be changed. However, the way Kryo handles the compatibility is quite problematic.

What this serializer does is attach the names of all fields that will be serialized before the data. Thus, for every object there is first a list of field names, and after it the real data. This is not problematic for huge objects, where this "schema" is just a small part of the whole serialized data, but in our case it would influence the size of the serialized objects too much. A simple example: TimerData has 12 double and 2 long fields; without names it can maximally take cca. 56 bytes, while including the field names costs cca. 100+ bytes more, thus an increase in size of roughly 200%.

The solution could be to externalize the schema from the serialized data, and here I can see two possibilities:

  1. Save the schema of each class as a part of the stored data. Thus, when we need to de-serialize, we can refer to the schema that comes with the data. This is a huge constraint, but can work in our case. Since we are serializing a large amount of data, adding the schema information would barely influence the size of the data. On the other hand, for every "recording" we would have a schema created even if nothing has changed.
  2. Create a general schema for each class and provide it as a "configuration file". Thus, every class would have a schema in a specific directory that can be loaded when the CMR is loading. The schema structure can be very simple, and we can reuse the idea that Protocol Buffers and Thrift had about connecting the field name with a unique integer number:
- 1: someFieldName
- 2: someOtherFieldName

Then, when serializing, instead of adding the names of the fields to the data, we can add a list of integers that define which fields are serialized. Since such a small integer needs only one byte, this would not be a huge overhead. The only constraint is that the unique number given to a field can never be changed; for example, number 1 is then forever reserved for "someFieldName". If this field is removed from the class, number 1 cannot be taken by any other field. When de-serializing, if we encounter a field number that is not present in the current schema, we simply skip it.
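
The write and read methods of a custom serializer like the one sketched earlier could then change along these lines. The field numbers are illustrative, and a complete implementation would additionally need length or type information per field in order to actually skip unknown data:

@Override
public void write(Kryo kryo, Output output, TimerData object) {
	output.writeInt(1, true); // schema number of "count", varint encoded
	output.writeLong(object.getCount());
	output.writeInt(2, true); // schema number of "duration"
	output.writeDouble(object.getDuration());
	output.writeInt(0, true); // 0 marks the end of the object
}

@Override
public TimerData read(Kryo kryo, Input input, Class<TimerData> type) {
	TimerData data = new TimerData();
	int fieldNumber;
	while ((fieldNumber = input.readInt(true)) != 0) {
		switch (fieldNumber) {
		case 1:
			data.setCount(input.readLong());
			break;
		case 2:
			data.setDuration(input.readDouble());
			break;
		default:
			// field from another schema version: skip it (in practice this needs the field length)
			break;
		}
	}
	return data;
}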

The pros and cons of these two possibilities are quite clear. The first one produces less data, because it does not have to write which fields are serialized, but on the other hand de-serialization is not possible without the schema that was stored with the data.

Results of the improved compatible serializer

A prototype implementing solution number 2 described above has shown major improvements in the serialization. A simple test class was created:

public class Test {

	int value;

	String name;

	long longValue;
...

The serialization was tested with the values (5, "Ivan", 45545454). The new compatible implementation serializes this data into 19 bytes. The old (ineffective) Kryo compatible serializer serializes it into 37 bytes; that is almost 100% more data needed by the old solution. As expected, the time needed for the serialization also decreased significantly, namely the new implementation is 50%+ faster.

The new serializer was also compared to the Kryo serializer that does not provide backward/forward compatibility and that should have the best performance in general. This non-compatible Kryo serializer needs 50 microseconds to serialize the object, while our new implementation needs 60 microseconds. So the price that we pay for the forward/backward compatibility with the new solution is around 20% slower serialization. However, this is still an improvement over the compatible serializer provided by Kryo.

Externalizable Interface

This would be the last solution, using no additional libraries. It should provide very fast serialization, but it requires a lot of work and can also be error prone.
It consists of making all our classes implement the Externalizable interface, thus implementing these methods:

public void writeExternal(ObjectOutput out) throws IOException;
public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException;

In these methods we have to write and read the data on our own, using the basic methods provided by the ObjectOutput and ObjectInput interfaces.
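
For the Test class used above, a minimal sketch could look like this:

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class Test implements Externalizable {

	int value;
	String name;
	long longValue;

	// the Externalizable contract requires a public no-argument constructor
	public Test() {
	}

	@Override
	public void writeExternal(ObjectOutput out) throws IOException {
		// fields are written manually, in a fixed order
		out.writeInt(value);
		out.writeUTF(name);
		out.writeLong(longValue);
	}

	@Override
	public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
		// fields must be read in the exact same order
		value = in.readInt();
		name = in.readUTF();
		longValue = in.readLong();
	}
}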

I don't see any advantage over option number 2 with the Kryo library. Forward and backward compatibility would also need to be built by hand. Thus, I think there is no reason to consider this a valid option for us.

Links

Discover the secrets of the Java Serialization API – Basic introduction
Comparison of data serialization formats – Wikipedia
Protocol Buffers – Official page
Getting started with Apache Thrift
Thrift vs ProtocolBuffers
Kryo Project – Official page
Java serialization benchmarking
Java NIO