Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. Cross-language data transformation libraries: Protocol Buffers or Thrift
  2. Fast Java serialization library Kryo
  3. Externalizable Interface

Data transformation libraries

Both Google and Facebook developed a cross-language data transformation libraries to fit their needs of transferring the data between different languages that they use. Thus, Google created Protocol Buffers and Facebook Thrift (although there is a word that actually Thrift was developed by the ex-Google employees that changed to Facebook). Thrift was immediately started as an open-source project, that was later on given to Apache foundation, while Google "open" the source from the version 2.0.
The approach that both libraries have is quite the same. Both defined an IDL for defining the stricture of the data that needs to be transferred. These IDLs are more or less the same, they both have defined general types that exist in most of the programing languages, they both don't support polymorphism, but they have some little differences. For example, I like that in Thrift they created types like Map, Set and List that directly map to the HashMap, HashSet and ArrayList in Java for example, while Protocol Buffers only have a List type of collection. Furthermore, Thrift included also possible definition of service into IDL, so that transfer of the data can be easier (but we totally don't need this (smile) ). Here is example of such a definition:

...


It is clear that Thrift wins in functionality, and this is probably the consequence of having people how already worked on ProtoBuf development, actually creating it, thus they were able to add any non-existing functionality straight forward. But on the other side, Thrift lack any kind of documentation, and this is a serous problem. Except one very good White-paper, it is very difficult to find any other information, tutorials, explanation, etc.

Kryo library

While I was trying to find out who was better performance results, Thrift or Protocol Buffers, I stumbled to a nice benchmarking of different serialization libraries (http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking). The results that they showed are something like this:
Total time: Create object, serialize and deserialize
Serialize time
Deserialize time
Bytes length after serialization
Kryo is in the top all of the list, thus I was supposing that it is worth of checking it. And I stumbled to a very small and very nice library, that really have some nice features. So lets just jump to features right away:

...

  1. Use CompatibleFieldSerializer and have everything out of the box
  2. Create custom serializers and deal with the versioning control on our own. This is much more work, but should provide at least 20%-30% faster solution. This is actually very close to Externalizable Interface, but in my opinion much easier because serializers for Java classes already exist in the library.

Backward/forward compatibility problems with Kyro

As already said Kyro provides the CompatibleFieldSerializer class that can provide limited backward/forward compatibility. Limited because the type of the field can not be changed. However the way the Kyro handles the compatibility is quite problematic.

...

The pros and cons from these two possibilities are quite clear. First one has less data because it does not have to write also which fields are serialized, but on the other hand de-serialization is not possible with out schema provided.

Results of the improved compatible serializer

A prototype that is implementing a solution number 2. described above has shown major improvements in the serialization. The simple test class was created:

Code Block

public class Test {

	int value;

	String name;

	long longValue;
...

The serialization was tested with values (5, "Ivan", 45545454). The new compatible implementation serialize this data into 19 bytes. The old (ineffective) Kryo compatible serializer serialize to 37 bytes. This is almost 100% more data needed by the old solution. As excepted the time needed for the serialization has significant decrease also, namely new implementation is 50%+ faster.

The new serializer was also compared to the Kryo serializer that does not provide the backward/forward compatibility, and that should have the best performance in general. This non-compatible Kyro serializer needs 50 micro seconds to serialize the object, while our new implementation needs 60 micro seconds. So the price that we pay for the forward/backward compatibility with new solution is around 20% slower serialization time. However, this is still improvement from the compatible serializer provided by Kyro.

Externalizable Interfaces

This would be the last solution, thus using no additional libraries. It should provide very fast serialization, but requires a lot of work and can be error prone also.
It consists of making all our classes implement Externalizable interface, thus implementing these methods:

...

I don't see any advantage to the option number 2 with Kryo library. Forward and backward compatibility also needs to be created. Thus, I think that there is no reason to consider it as a valid option for us.

Discover the secrets of the Java Serialization API – Basic introduction
Comparison of data serialization formats – Wikipedia
Protocol Buffers – Official page
Getting started with Apache Thrift
Thrift vs ProtocolBuffers
Kryo Project – Official page
Java serialization benchmarking
Java NIO