Problem with memory footprint
Overview
Currently there is a problem in the indexing structure: the heap space needed to keep the indexing tree is too high. The indexing tree usually occupies 10%-25% of the complete space reserved for the buffer, depending on the structure of the data received from the agent. There are two main factors that influence the size of the indexing tree:
- The average number of children in the invocations. If the buffer is filled with smaller invocations (up to 100 children), the indexing size will tend to be closer to the 10% occupancy line, because the different data objects are then equally distributed in the indexing tree (thus spread over several maps). When we have invocations with many children, the leafs for SQLs and especially timers tend to be much bigger, thus occupying more space due to the map expansion policies: storing 600K elements in a map requires an array of 1M entries (see the sketch after this list).
- The time interval in which the data in the buffer is created. When we have a high load and the agent sends a lot of data in a short interval, we get a bigger indexing structure, because our time-stamp indexer indexes in intervals of 15 minutes. If all data in the buffer is created in the same 15-minute interval, it all falls into the same branch, which effectively means fewer, but correspondingly larger, maps for storing the data, which again triggers the expansion overhead described above.
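To make the expansion effect concrete, here is a minimal, hypothetical sketch (not inspectIT code) of the bucket-array length a HashMap-style map ends up with for a given number of entries, assuming the default 0.75 load factor; for 600K elements it yields an array of roughly 1M entries:

```java
public class TableSizeEstimate {

    /**
     * Next power of two >= n / loadFactor, i.e. the bucket array length a
     * HashMap-style map allocates after storing n entries (default load
     * factor 0.75). Illustrative only; real implementations differ in detail.
     */
    static int tableLengthFor(int n) {
        int needed = (int) Math.ceil(n / 0.75d);
        return needed <= 1 ? 1 : Integer.highestOneBit(needed - 1) << 1;
    }

    public static void main(String[] args) {
        // storing 600K elements requires a table of ~1M entries
        System.out.println(tableLengthFor(600_000)); // prints 1048576
    }
}
```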
Thus, the conclusion is that the main problem is the amount of memory occupied by the maps in which we index the data. The problem is bigger when the number of maps is small or when one or more maps hold many elements (500K+).
Java maps are known to be storage-unfriendly. It is assumed that often 80%+ of a map's size goes to maintenance structures, while only 20% is the actual data stored in the map. Note that we use ConcurrentHashMap.
Current size
A test was performed to check the number of indexed elements in the maps as well as the size of the complete indexing tree. The test was executed with a normal load pace and invocations having ~50-100 children. The max size of the buffer is 335MB. The results are:
```
2014-11-14 15:44:18,125: 1239302 [indexing-thread] DEBUG it.cmr.cache.impl.AtomicBuffer - Indexing tree new size: 32934038
2014-11-14 15:44:18,125: 1239302 [indexing-thread] DEBUG it.cmr.cache.impl.AtomicBuffer - Indexing occupancy percentage: 10.049%

2014-11-14 15:49:13,842: 1535019 [pool-2-thread-1] INFO ctit.indexing.buffer.impl.Leaf - Leaf map size prior to clean = 17245
2014-11-14 15:49:13,843: 1535020 [pool-2-thread-1] INFO ctit.indexing.buffer.impl.Leaf - Leaf map size prior to clean = 300778
2014-11-14 15:49:13,843: 1535020 [pool-2-thread-1] INFO ctit.indexing.buffer.impl.Leaf - Leaf map size prior to clean = 4554
2014-11-14 15:49:13,843: 1535020 [pool-2-thread-1] INFO ctit.indexing.buffer.impl.Leaf - Leaf map size prior to clean = 32965
2014-11-14 15:49:13,843: 1535020 [pool-2-thread-1] INFO ctit.indexing.buffer.impl.Leaf - Leaf map size prior to clean = 644
2014-11-14 15:49:13,843: 1535020 [pool-2-thread-1] INFO ctit.indexing.buffer.impl.Leaf - Leaf map size prior to clean = 10604
2014-11-14 15:49:13,843: 1535020 [pool-2-thread-1] INFO ctit.indexing.buffer.impl.Leaf - Leaf map size prior to clean = 598
2014-11-14 15:49:13,843: 1535020 [pool-2-thread-1] INFO ctit.indexing.buffer.impl.Leaf - Leaf map size prior to clean = 10604
```
As seen, the indexing tree in this situation occupies 10% of the buffer, totaling over 30MB. In my opinion this is the ideal situation; below 10% we cannot go.
We have a total of 8 maps with different sizes, the biggest one having 300,778 indexed elements.
ConcurrentHashMap compared with other thread-safe maps
We furthermore tested ConcurrentHashMap against two other map implementations: NonBlockingHashMapLong from Cliff Click's high-scale-lib and a synchronized Long2ObjectMap from the FastUtil lib. The results show that NonBlockingHashMapLong heavily beats the other maps in both size and performance measurements:
Size
| Number of elements | ConcurrentHashMap size (bytes) | NonBlockingHashMapLong size (bytes) | Ch. (%) | Long2ObjectMap size (bytes) | Ch. (%) |
|---|---|---|---|---|---|
| Empty | 224 | 712 | +218% | 568 | +154% |
| 8 | 800 | 824 | +3% | 696 | -13% |
| 32 | 2,720 | 1,400 | -48% | 1,496 | -45% |
| 256 | 20,640 | 7,672 | -63% | 10,904 | -47% |
| 1024 | 82,080 | 29,176 | -64% | 43,160 | -47% |
| 4096 | 327,840 | 115,192 | -65% | 172,184 | -47% |

Ch. (%) is the size change relative to ConcurrentHashMap.
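For reference, a minimal sketch of how such deep sizes can be measured, here using OpenJDK JOL as an assumed tool (the original measurement harness is not shown):

```java
import java.util.concurrent.ConcurrentHashMap;

import org.openjdk.jol.info.GraphLayout;

// Hedged sketch, not the original harness: measures the deep (retained) size
// of a map instance with OpenJDK JOL (requires jol-core on the classpath).
public class MapSizeProbe {

    public static void main(String[] args) {
        ConcurrentHashMap<Long, Object> map = new ConcurrentHashMap<>();
        Object sharedValue = new Object();
        for (long i = 0; i < 4096; i++) {
            map.put(i, sharedValue);
        }
        // Subtract the shared value so only the map's own overhead is counted.
        long deepSize = GraphLayout.parseInstance(map).totalSize()
                - GraphLayout.parseInstance(sharedValue).totalSize();
        System.out.println("Deep size with 4096 entries: " + deepSize + " bytes");
    }
}
```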
As seen in the table, there are three stages depending on the number of elements:
- with an empty map, the Java implementation is the smallest
- with 8 elements, the maps have a similar size
- from 256 elements on, the maximum saving for NonBlockingHashMapLong (about 65%) and Long2ObjectMap (about 47%) is reached
Performance
PerfTest in Martin Thompson's style
2 producers, 2 consumers. Each producer puts a key-value pair consisting of its ID and a random object; each consumer removes with the key that is its ID. Basically a bunch of <1, O> or <2, O> puts and removes with <1> or <2> (see the sketch after the results).
```
Java
_______________________
0 - 2 producers 2 consumers: 4,782,614 ops/sec - MapsPerfTest
1 - 2 producers 2 consumers: 4,194,545 ops/sec - MapsPerfTest
2 - 2 producers 2 consumers: 4,726,670 ops/sec - MapsPerfTest
3 - 2 producers 2 consumers: 5,339,190 ops/sec - MapsPerfTest
4 - 2 producers 2 consumers: 4,952,414 ops/sec - MapsPerfTest

Cliff's
______________________
0 - 2 producers 2 consumers: 23,030,573 ops/sec - MapsPerfTest
1 - 2 producers 2 consumers: 31,244,208 ops/sec - MapsPerfTest
2 - 2 producers 2 consumers: 30,630,539 ops/sec - MapsPerfTest
3 - 2 producers 2 consumers: 28,825,699 ops/sec - MapsPerfTest
4 - 2 producers 2 consumers: 26,588,300 ops/sec - MapsPerfTest

Long2ObjectMap (FastUtil)
______________________
0 - 2 producers 2 consumers: 5,569,578 ops/sec - MapsPerfTest
1 - 2 producers 2 consumers: 5,940,526 ops/sec - MapsPerfTest
2 - 2 producers 2 consumers: 5,883,532 ops/sec - MapsPerfTest
3 - 2 producers 2 consumers: 5,798,396 ops/sec - MapsPerfTest
4 - 2 producers 2 consumers: 5,537,185 ops/sec - MapsPerfTest
```
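A minimal sketch of such a producer/consumer test, with hypothetical class and method names (the original MapsPerfTest is not shown):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not the original MapsPerfTest: producers put
// <id, new Object()>, consumers remove <id>, total operations are counted.
public class MapsPerfSketch {

    static final ConcurrentHashMap<Long, Object> MAP = new ConcurrentHashMap<>();
    static final AtomicLong OPS = new AtomicLong();
    static volatile boolean running = true;

    public static void main(String[] args) throws InterruptedException {
        for (long id = 1; id <= 2; id++) {
            final long key = id;
            new Thread(() -> { // producer for this id
                while (running) {
                    MAP.put(key, new Object());
                    OPS.incrementAndGet();
                }
            }).start();
            new Thread(() -> { // consumer for this id
                while (running) {
                    MAP.remove(key);
                    OPS.incrementAndGet();
                }
            }).start();
        }
        Thread.sleep(1_000); // run for one second
        running = false;
        System.out.println(OPS.get() + " ops/sec (rough, single run)");
    }
}
```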
Using JMH Approach 1
2 producers, 2 consumers. The map is initially empty. Producers execute put with a random long 0-10; consumers remove with a random long 0-10.
```
Benchmark                                                          Mode  Samples      Score      Error   Units
i.n.i.ThreadSafeMapsPerfTest.concurrentHashMap                    thrpt        5  17205.845 ± 1549.392  ops/ms
i.n.i.ThreadSafeMapsPerfTest.concurrentHashMap:putChm             thrpt        5   8031.802 ±  627.409  ops/ms
i.n.i.ThreadSafeMapsPerfTest.concurrentHashMap:removeChm          thrpt        5   9174.043 ±  938.380  ops/ms
i.n.i.ThreadSafeMapsPerfTest.nonBlockingHashMapLong               thrpt        5  22120.230 ± 1674.090  ops/ms
i.n.i.ThreadSafeMapsPerfTest.nonBlockingHashMapLong:putNbhml      thrpt        5  10369.226 ±  505.723  ops/ms
i.n.i.ThreadSafeMapsPerfTest.nonBlockingHashMapLong:removeNbhml   thrpt        5  11751.004 ± 1391.759  ops/ms
```
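A sketch of how this benchmark could look in JMH, using the asymmetric group feature; the class and method names here are assumptions, not the original ThreadSafeMapsPerfTest:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Group;
import org.openjdk.jmh.annotations.GroupThreads;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// Hypothetical JMH sketch of approach 1: 2 putter and 2 remover threads
// working on one initially empty map with random keys 0-10.
@State(Scope.Group)
public class ThreadSafeMapsSketch {

    private ConcurrentHashMap<Long, Object> map;

    @Setup
    public void setup() {
        map = new ConcurrentHashMap<>();
    }

    @Benchmark
    @Group("concurrentHashMap")
    @GroupThreads(2)
    public Object putChm() {
        return map.put(ThreadLocalRandom.current().nextLong(10), Boolean.TRUE);
    }

    @Benchmark
    @Group("concurrentHashMap")
    @GroupThreads(2)
    public Object removeChm() {
        return map.remove(ThreadLocalRandom.current().nextLong(10));
    }
}
```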
Using JMH Approach 2
3 threads. The map is filled with 1000 elements. Each thread removes an element with a random key 0-1000 and then immediately puts the removed element back into the map with the same key.
```
Benchmark                                              Mode  Samples      Score     Error   Units
i.n.i.ThreadSafeMapsPerfTest.concurrentHashMap        thrpt        5  16033.299 ± 642.117  ops/ms
i.n.i.ThreadSafeMapsPerfTest.nonBlockingHashMapLong   thrpt        5  27084.222 ± 323.981  ops/ms
```
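The second approach could be sketched in JMH as follows; again the names are hypothetical, not the original test:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

// Hypothetical JMH sketch of approach 2: the map is pre-filled with 1000
// entries and each of 3 threads removes a random key and puts the removed
// value right back under the same key.
@State(Scope.Benchmark)
@Threads(3)
public class RemovePutSketch {

    private ConcurrentHashMap<Long, Object> map;

    @Setup
    public void setup() {
        map = new ConcurrentHashMap<>();
        for (long i = 0; i < 1000; i++) {
            map.put(i, Boolean.TRUE);
        }
    }

    @Benchmark
    public void removeThenPut() {
        long key = ThreadLocalRandom.current().nextLong(1000);
        Object value = map.remove(key);
        if (value != null) { // another thread may momentarily hold this entry
            map.put(key, value);
        }
    }
}
```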
Saving With NonBlockingHashMapLong
Taking the map sizes reported by the indexing structure, we can estimate the expected size of the indexing structure if the maps were changed to NonBlockingHashMapLong, using the per-element costs derived from the measurements above (roughly 80 bytes per element for ConcurrentHashMap and 28 bytes per element for NonBlockingHashMapLong):
Map # | Number of elements | Approx. ConcurrentHashMap size (bytes) | Approx. NonBlockingHashMapLong size (bytes) |
---|---|---|---|
1 | 17245 | 1,380,273 | 478,677 |
2 | 300778 | 24,073,989 | 8,348,829 |
3 | 4554 | 364,497 | 126,407 |
4 | 32965 | 2,638,487 | 915,024 |
5 | 644 | 51,721 | 18,348 |
6 | 10604 | 848,734 | 294,339 |
7 | 598 | 48,026 | 17,038 |
8 | 10604 | 848,734 | 294,339 |
Total | 377992 | 30,254,461 | 10,493,001 |
As seen, with the change to the high-scale lib we would save almost 20MB of storage (for the given example), a decrease of about 65%.
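The same estimate can be reproduced with simple arithmetic; the per-element costs below are derived from the 4096-element measurements above, so this is an approximation rather than an exact accounting:

```java
// Back-of-the-envelope check of the savings table: scale the measured
// per-element costs (327840/4096 ≈ 80 B for ConcurrentHashMap,
// 115192/4096 ≈ 28 B for NonBlockingHashMapLong) to the leaf map sizes.
public class SavingEstimate {

    public static void main(String[] args) {
        int[] leafSizes = {17245, 300778, 4554, 32965, 644, 10604, 598, 10604};
        long chm = 0;
        long nbhml = 0;
        for (int n : leafSizes) {
            chm += n * 80L;
            nbhml += n * 28L;
        }
        System.out.printf("CHM ~%,d B, NBHML ~%,d B, saving ~%,d B (%.0f%%)%n",
                chm, nbhml, chm - nbhml, 100d * (chm - nbhml) / chm);
    }
}
```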
Further options to optimize indexing tree size
The simplest option would be to decrease the number of elements indexed. We can rethink our use cases and the real need to index every single element that is in the buffer (timers and SQLs are often needed in aggregated form only).
Another option to optimize the indexing tree size would be to move away from maps and implement a different structure.
An additional option would be to not map the element ID to the element itself (which is what we currently store in the maps). The element ID is only needed for direct element retrieval, and we have this use case only for invocation sequences. For everything else we could spare that space by holding the elements in an array list, for example (see the sketch below).
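A hypothetical sketch of that last option (the names are ours, not existing inspectIT classes): a leaf that keeps elements in a plain list instead of an ID-to-element map, usable wherever retrieval by ID is not needed:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the list-based leaf: no per-entry map node and no
// bucket array, just one slot per element in a growing array. Retrieval by
// element ID is deliberately unsupported; only bulk access is offered.
public class ListLeaf<E> {

    private final List<E> elements = new ArrayList<>();

    public synchronized void index(E element) {
        elements.add(element);
    }

    public synchronized List<E> getAll() {
        return new ArrayList<>(elements); // defensive copy for safe iteration
    }
}
```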