Serialization options for the inspectIT storage solution
To be able to store inspectIT data to disk, serialization of the data has to be performed; that much is simple and clear. However, Java native serialization comes with high overhead and many other problems, so I examined different possibilities that can be used to persist data on disk effectively. Effectively means as quickly as possible, with the resulting file as small as possible. Since serialization to binary data takes less space than any human-readable (or other) format, I only analyzed possibilities that serialize data into an "array of bytes".
Generally there are three different solutions that I believe are acceptable:
- Cross-language data transformation libraries: Protocol Buffers or Thrift
- Fast Java serialization library Kryo
- Externalizable Interface
Data transformation libraries
Both Google and Facebook developed cross-language data transformation libraries to fit their need to transfer data between the different languages they use. Google created Protocol Buffers and Facebook created Thrift (although word has it that Thrift was actually developed by ex-Google employees who moved to Facebook). Thrift was started as an open-source project from the beginning and was later handed over to the Apache foundation, while Google opened the source of Protocol Buffers from version 2.0.
The approach both libraries take is much the same. Both define an IDL for describing the structure of the data that needs to be transferred. These IDLs are more or less equal: they both define the general types that exist in most programming languages, and neither supports polymorphism, but there are some small differences. For example, I like that Thrift defines types like Map, Set and List that directly map to HashMap, HashSet and ArrayList in Java, while Protocol Buffers only has a List-like type of collection. Furthermore, Thrift also allows services to be defined in the IDL, so that transfer of the data can be easier (but we don't need this at all). Here is an example of such a definition in Protocol Buffers:
    message Person {
      required int32 id = 1;
      required string name = 2;
      optional string email = 3;
    }
or in Thrift:

    struct UserProfile {
      1: i32 uid,
      2: string name,
      3: string blurb
    }
So basically, for each class that we wish to serialize with these libraries, we have to create a similar definition. From this definition, code is generated via the compilers provided by Protocol Buffers and Thrift. It is possible to generate code for any supported language, Java of course included, and these generated classes know how to serialize and de-serialize objects. We can say that this is a kind of static serialization, because all the definitions are known in advance, and this is the basic reason why it is much faster than Java serialization or other serialization libraries that use reflection.
What I definitely did not like are these generated classes (I only produced them for Protocol Buffers). They look somewhat ugly, have many inner classes, and are somewhat confusing. Not to mention that for what I would call a very simple definition, the compiler generated a class of 2,000 lines. Furthermore, there is a usage problem with everything that you want to serialize. For example, for our TimerData we would first need to create an object of a TimerDataProtos class (which basically has the same getters and setters as our class and is actually the one that can be serialized fast), then "clone" the data from our object to this other one and do the serialization, and do the same thing in the other direction for de-serializing. Seems like a lot of new classes and a lot of new code.
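To illustrate that extra copying step, here is a minimal sketch. The TimerDataProtos class and its fields are hypothetical, but the builder/parseFrom pattern is the standard one the Protocol Buffers compiler generates for Java:

    // Minimal sketch: TimerData is our domain class, TimerDataProtos.TimerData
    // is what the protoc compiler would generate from its .proto definition
    // (both are hypothetical here).
    public class TimerDataMapper {

        public byte[] serialize(TimerData timerData) {
            // "Clone" our object into the generated builder, then serialize.
            return TimerDataProtos.TimerData.newBuilder()
                    .setId(timerData.getId())
                    .setDuration(timerData.getDuration())
                    .build()
                    .toByteArray();
        }

        public TimerData deserialize(byte[] bytes)
                throws com.google.protobuf.InvalidProtocolBufferException {
            // Parse with the generated class, then copy the values back.
            TimerDataProtos.TimerData proto = TimerDataProtos.TimerData.parseFrom(bytes);
            TimerData timerData = new TimerData();
            timerData.setId(proto.getId());
            timerData.setDuration(proto.getDuration());
            return timerData;
        }
    }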
What I definitely like is the backwards and forwards compatibility these two libraries provide, which is there out of the box if some basic rules are followed. Compatibility is achieved by associating a number with each field (see the sample definition code). Thus, new fields can be added to the definition without any problems, and when a field should be removed in a new class version, the definition just has to keep the old field, so that old data can still be read (the field will simply be ignored).
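For example, a hypothetical newer version of the Person definition above could look like this; the existing field numbers stay untouched, so old serialized data remains readable:

    message Person {
      required int32 id = 1;
      required string name = 2;
      optional string email = 3;  // kept even if our class dropped it, so old data can be read
      optional string phone = 4;  // new field, added under a new number
    }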
Although both libraries are quite simple, I found a nice table that compares the two:
| | Thrift | Protocol Buffers |
|---|---|---|
| Backers | Facebook, Apache (accepted for incubation) | Google |
| Bindings | C++, Java, Python, PHP, XSD, Ruby, C#, Perl, Objective C, Erlang, Smalltalk, OCaml, and Haskell | C++, Java, Python (Perl, Ruby, and C# under discussion) |
| Output Formats | Binary, JSON | Binary |
| Primitive Types | bool, byte, 16/32/64-bit integers, double, string (plus map, set and list containers) | bool, 32/64-bit integers, float, double, string (lists via "repeated" fields) |
| Enumerations | Yes | Yes |
| Constants | Yes | No |
| Composite Type | Struct | Message |
| Exception Type | Yes | No |
| Documentation | Bad | Good |
| License | Apache | BSD |
| Compiler Language | C++ | C++ |
| RPC Interfaces | Yes | Yes |
| RPC Implementation | Yes | No |
| Composite Type Extensions | No | Yes |
It is clear that Thrift wins in functionality, and this is probably a consequence of having people who already worked on Protocol Buffers development actually creating it, so they were able to add the missing functionality straight away. On the other side, Thrift lacks any kind of documentation, and this is a serious problem. Except for one very good whitepaper, it is very difficult to find any other information, tutorials, explanations, etc.
Kryo library
While I was trying to find out which of the two, Thrift or Protocol Buffers, has better performance, I stumbled upon a nice benchmark of different serialization libraries (http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking). The results they showed compare, among other things:
- Total time: create object, serialize and de-serialize
- Serialize time
- De-serialize time
- Bytes length after serialization
Kryo is at or near the top of all these lists, so I supposed it was worth checking out. What I found is a very small and very nice library that really has some nice features. So let's jump to the features right away:
- Works with Java NIO
- Provides automatic serialization (no description files, no generated classes, etc)
- But classes have to be registered before serialization, or registration can be left open, which influences the performance and the size of the serialized bytes (see the sketch after this list)
- Registration of the classes has to be done in the same order every time (quite an inflexible constraint)
- Serialization
- Provides a set of ready-made serializers for Java primitive types, Maps, Collections, etc
- Provides a FieldSerializer class that can serialize arbitrary classes. This class uses ReflectASM to do the serialization. The fastest serialization is only possible if the class fields are public, or protected/default and setAccessible(true) is permitted by the security manager. Otherwise, when they are private, it uses "fast" reflection (which is still faster, even faster than the libraries above)
- Provides a CompatibleFieldSerializer that adds backwards and forwards compatibility, with some overhead compared to the one above
- And enables the user to develop custom serializers for every class. This should be the fastest solution, but compatibility control then also has to be done manually.
- Provides support for circular references in object graph (especially important for the Invocation sequence data)
- Multi thread support is available
- Provides several data compressing options
- Deflate algorithm – this somehow was not working in my tests: first I did not see any difference in size, and second, the time needed for serialization was 15 times slower than without it
- Delta mechanism – where only the difference from the last object is serialized; but for this to work, the objects mostly have to be of the same class and have similar field values
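To give a feeling for the API, here is a minimal sketch of registration plus write/read. Note that the API differs between Kryo versions; this sketch assumes the Output/Input style of Kryo 2.x and our hypothetical TimerData class:

    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.io.Input;
    import com.esotericsoftware.kryo.io.Output;

    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.FileOutputStream;

    public class KryoExample {
        public static void main(String[] args) throws FileNotFoundException {
            Kryo kryo = new Kryo();
            // Registration assigns a numeric ID to the class, which is why
            // classes must always be registered in the same order.
            kryo.register(TimerData.class);

            // Serialize to a file.
            Output output = new Output(new FileOutputStream("timer.data"));
            kryo.writeObject(output, new TimerData());
            output.close();

            // De-serialize from the file.
            Input input = new Input(new FileInputStream("timer.data"));
            TimerData timerData = kryo.readObject(input, TimerData.class);
            input.close();
        }
    }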
Kryo seems like a nice solution. It has amazing results (also in my tests) and it is very simple and very easy to learn. Furthermore, I see two different solutions here:
- Use CompatibleFieldSerializer and have everything out of the box
- Create custom serializers and deal with the versioning control on our own. This is much more work, but should provide an at least 20%-30% faster solution (see the sketch after this list). This is actually very close to the Externalizable interface, but in my opinion much easier, because serializers for the standard Java classes already exist in the library.
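A minimal sketch of the second option, again assuming the Kryo 2.x API and the hypothetical TimerData class; writing a version byte first is one simple way to do the compatibility control ourselves:

    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.Serializer;
    import com.esotericsoftware.kryo.io.Input;
    import com.esotericsoftware.kryo.io.Output;

    // Custom serializer: fastest option, but field order and versioning
    // are completely our responsibility.
    public class TimerDataSerializer extends Serializer<TimerData> {

        private static final byte VERSION = 1;

        @Override
        public void write(Kryo kryo, Output output, TimerData object) {
            output.writeByte(VERSION); // manual compatibility control
            output.writeLong(object.getId());
            output.writeDouble(object.getDuration());
        }

        @Override
        public TimerData read(Kryo kryo, Input input, Class<TimerData> type) {
            byte version = input.readByte(); // branch here when the format evolves
            TimerData object = new TimerData();
            object.setId(input.readLong());
            object.setDuration(input.readDouble());
            return object;
        }
    }

The serializer would then be hooked in with kryo.register(TimerData.class, new TimerDataSerializer()).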
Externalizable Interface
This would be the last solution, using no additional libraries. It should provide very fast serialization, but requires a lot of work and can also be error prone.
It consists of making all our classes implement the Externalizable interface, thus implementing these methods:
    public void writeExternal(ObjectOutput out) throws IOException;
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException;
In these methods we have to create the byte stream on our own, using the basic methods provided by the ObjectOutput interface.
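For illustration, a minimal sketch with a hypothetical TimerData class; note that Externalizable additionally requires a public no-arg constructor:

    import java.io.Externalizable;
    import java.io.IOException;
    import java.io.ObjectInput;
    import java.io.ObjectOutput;

    public class TimerData implements Externalizable {

        private long id;
        private double duration;

        // Externalizable requires a public no-arg constructor, because
        // readExternal() is called on a freshly created instance.
        public TimerData() {
        }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            // Write the fields manually, in a fixed order.
            out.writeLong(id);
            out.writeDouble(duration);
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
            // Read the fields back in exactly the same order.
            id = in.readLong();
            duration = in.readDouble();
        }
    }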
I don't see any advantage of this over option number 2 with the Kryo library. Forward and backward compatibility also needs to be implemented by hand. Thus, I think that there is no reason to consider it a valid option for us.
Interesting links
Discover the secrets of the Java Serialization API – Basic introduction
Comparison of data serialization formats – Wikipedia
Protocol Buffers – Official page
Getting started with Apache Thrift
Thrift vs ProtocolBuffers
Kryo Project – Official page
Java serialization benchmarking
Java NIO