To illustrate the problem of the growing number of storage files, here are the current structure of the storage indexing tree and the newly proposed one. Both are described in detail below.
Current structure
As the picture above shows, a complex set of indexers creates many leaves at the end of the tree, which makes it easy to find the wanted element(s). Invocations are the only elements for which we remember the position and the size in the file (ArrayBasedLeaf), since we need to retrieve each one by its id. Leaves for all other objects do not keep this information (LeafWithNoDescriptors); they only remember the total size of the file, because all the data in one file shares the same properties (in the given example, only SQL statements with the same platformId and the same SQL string within the same 5-minute period end up in the same leaf). This leaves us with two major problems regarding the growing number of storage files:
- The number of files increases over time, so a storage that holds data over a longer time span will have more files
- The SQL string indexer creates thousands of branches if the SQL statements are dynamic rather than prepared
In addition, a large number of files slows down the download/export of a storage, and more HTTP requests are needed when the requested data spans several files. However, the time we spend de-serializing the data is minimal.
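To make the difference between the two leaf types concrete, here is a minimal sketch in Java. It is not the actual implementation: the class names ArrayBasedLeafSketch, SizeOnlyLeafSketch and Descriptor are hypothetical, a HashMap stands in for whatever array structure the real ArrayBasedLeaf uses, and objects are assumed to be appended one after another to the leaf's file.

```java
import java.util.HashMap;
import java.util.Map;

/** Position and size of one serialized object inside a storage file (hypothetical helper). */
final class Descriptor {
    final long position;
    final long size;

    Descriptor(long position, long size) {
        this.position = position;
        this.size = size;
    }
}

/** Sketch of a leaf that can locate every object by its id, as needed for invocations. */
final class ArrayBasedLeafSketch {
    private final Map<Long, Descriptor> descriptors = new HashMap<>();
    private long nextPosition = 0;

    /** Records an object of the given size, assumed to be appended at the end of the file. */
    void put(long objectId, long size) {
        descriptors.put(objectId, new Descriptor(nextPosition, size));
        nextPosition += size;
    }

    /** Returns where the object is stored, or null if the id is unknown. */
    Descriptor get(long objectId) {
        return descriptors.get(objectId);
    }
}

/** Sketch of a leaf that keeps no per-object information, only the total file size. */
final class SizeOnlyLeafSketch {
    private long totalSize = 0;

    /** Accounts for one more written object; its position is not remembered. */
    void put(long size) {
        totalSize += size;
    }

    long getTotalSize() {
        return totalSize;
    }
}
```

In this sketch the first kind of leaf grows with every object it indexes, while the second stays constant in size no matter how many objects are written, which is why it is cheap as long as each leaf owns its file.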
Proposed improvement
The main change here is ID passing, i.e. the sharing of files. We keep the same structure of the indexing tree, but the writing is no longer split into so many files. As the picture shows, we can define a level in the tree below which all created leaves share the same id, meaning that their data is written to the same file.
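The idea of a file sharing level can be sketched as follows. This is only an illustration under stated assumptions: the class FileIdAssigner, its methods, and the representation of a tree path as a list of strings are all hypothetical and not taken from the real indexing tree. The point is that only the part of the path above the sharing line decides which file id a leaf gets.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that hands out file ids based on a configurable tree level. */
final class FileIdAssigner {
    private final int sharingLevel;
    private final Map<List<String>, Integer> idsByPrefix = new HashMap<>();
    private int nextId = 1;

    /** @param sharingLevel number of indexer levels (from the root) that still split files */
    FileIdAssigner(int sharingLevel) {
        this.sharingLevel = sharingLevel;
    }

    /** Returns the file id for a leaf reached via the given indexer keys. */
    int fileIdFor(List<String> path) {
        // Only the keys above the sharing line take part in choosing the file.
        int depth = Math.min(sharingLevel, path.size());
        List<String> prefix = new ArrayList<>(path.subList(0, depth));
        Integer id = idsByPrefix.get(prefix);
        if (id == null) {
            id = nextId++;
            idsByPrefix.put(prefix, id);
        }
        return id;
    }

    /** Number of distinct files handed out so far. */
    int fileCount() {
        return idsByPrefix.size();
    }
}
```

With a sharing level of 1, two leaves under the same platform ident but with different SQL strings receive the same file id. Setting the level to the full depth of the tree reproduces the current behavior, where every leaf has its own file.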
As for invocations, everything stays more or less the same, except that files are no longer split based on time. Note that this does not mean all invocations end up in the same file. In this picture, all invocations with the same platform ident go to the same file, but we can easily add the method ident indexer above the file sharing line, and invocations will then be spread over different files based on their method id (in reality we have this indexer in place). A problem can occur when all invocations start from the same method (for example, when doFilter() is instrumented). Then all invocations may go to one file, and this file can get very big (invocations are the biggest data structures we have).
On the other hand, many LeafWithNoDescriptors can now point to the same file. As a result, saving only the size of the file is no longer enough; we need to remember the ranges of the file that belong to each leaf. In the best case the ranges are big and there are as few of them as possible. In reality, however, one range is expected to often represent only a single object in the file. Just imagine many SQL statements arriving and being indexed in different leaves: the asynchronous write spreads the data all over the file and we end up with a structure similar to ArrayBasedLeaf. Because of this, the indexing tree can grow really fast.
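The effect of interleaved writes on the number of ranges can be illustrated with a small sketch. The class RangeLeafSketch and its methods are hypothetical, and the merging rule (extend the last range when a new write starts exactly where it ends) is an assumption about how ranges might be kept compact, not a description of the planned implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical leaf that remembers which byte ranges of a shared file belong to it. */
final class RangeLeafSketch {
    // Each entry is {start, endExclusive}.
    private final List<long[]> ranges = new ArrayList<>();

    /** Records a write of the given size at the given position in the shared file. */
    void addWrite(long position, long size) {
        if (!ranges.isEmpty()) {
            long[] last = ranges.get(ranges.size() - 1);
            if (last[1] == position) {
                // The write continues the previous range, so no new range is needed.
                last[1] = position + size;
                return;
            }
        }
        ranges.add(new long[] {position, position + size});
    }

    /** Number of ranges the indexing tree has to keep for this leaf. */
    int rangeCount() {
        return ranges.size();
    }

    /** Total number of bytes that belong to this leaf. */
    long totalSize() {
        long total = 0;
        for (long[] range : ranges) {
            total += range[1] - range[0];
        }
        return total;
    }
}
```

If one leaf writes two objects back to back, the sketch keeps a single range. If two leaves alternate their writes in the same file, each write starts a new range, so every leaf ends up with one range per object, which is the ArrayBasedLeaf-like growth described above.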
Compared to the current structure, we still de-serialize only the needed data, so we remain at the minimum there. The number of files definitely decreases, which is the advantage. But we get a new problem regarding the size of the indexing tree. In addition, some data files can grow really big because we no longer write to so many files.