CMR Investigations during the Integration Tests
While running the integration tests, the CMR was additionally monitored with VisualVM in order to investigate the CMR's resource consumption. Surprisingly, the CMR behaved completely differently between the integration tests of the current agent queue implementation and the disruptor implementation.
Important: During the integration tests, the Gatling load test produced a high number of failures for create and delete booking requests. To eliminate this source of error, additional load tests were run without the create and delete booking use case. In addition, the load test duration was reduced from 30 to 15 minutes.
Investigations
Version 1.7.5.88
The two screenshots below show the CPU and memory usage of the CMR while using the agent with the current agent queue implementation. During the first few minutes of the load test, CPU and memory usage rise. After that, the CMR buffer is full and needs to be freed regularly. During that time the CPU usage is somewhere between 28 and 50 %, depending on the memory usage. After a while the CPU usage drops to 10-20 %.
Version 1.7.5.88 Disruptor
In comparison, the top left and right images of the screenshot below show the CPU and memory usage of the CMR while using the agent with the new disruptor implementation. It is clearly visible that the resource consumption is much smoother and that the CPU usage stays roughly between 5 and 10 % during the whole load test.
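To illustrate why the disruptor implementation smooths the load, the following is a minimal, simplified sketch of a disruptor-style ring buffer (not the actual inspectIT or LMAX Disruptor code): the producer claims slots and drops on overflow, and the consumer drains everything published since its last read as one batch, so data leaves the agent in many small batches instead of one large chunk every 5 seconds.

```python
class RingBuffer:
    """Minimal single-producer ring buffer, loosely modeled on the
    LMAX Disruptor pattern used by the disruptor agent."""

    def __init__(self, size):
        self.size = size           # capacity; a power of two in the real Disruptor
        self.slots = [None] * size
        self.cursor = -1           # last published sequence
        self.consumed = -1         # last consumed sequence

    def publish(self, item):
        # Drop when the buffer is full instead of blocking the producer,
        # mirroring the drop-on-overflow behavior described above.
        if self.cursor - self.consumed >= self.size:
            return False
        self.cursor += 1
        self.slots[self.cursor % self.size] = item
        return True

    def drain(self):
        # The consumer takes all currently available items as one batch,
        # so the sending frequency adapts to the load instead of being
        # fixed to a 5-second interval.
        batch = [self.slots[s % self.size]
                 for s in range(self.consumed + 1, self.cursor + 1)]
        self.consumed = self.cursor
        return batch
```

Because the consumer keeps up in small batches, the buffer rarely fills and items are seldom dropped; with the fixed-interval queue, the whole interval's worth of data hits the receiver at once.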
Analysis
Unknown user helped with the analysis of this issue.
First assumption: Less data is sent to the CMR with the disruptor implementation due to dropping on the agent
The table shows the requests sent from Gatling, the elements added to the CMR buffer, and the elements dropped on the CMR. We expected to see a lower number of elements added to the CMR buffer for the 1.7.5.88 disruptor agent due to dropping on the agent. Surprisingly, we observed the opposite: for the disruptor agent, the number of elements added to the buffer nearly equals the number of Gatling requests, with only a little dropping at the end. In comparison, we saw a high number of dropped elements on the CMR for the current 1.7.5.88 agent, and the sum of elements added and dropped is far lower than the Gatling request count. Possibly there was also dropping on the agent (unfortunately, no logs are available anymore).
| Agent | Requests Gatling Report | Elements Added to the CMR buffer | Elements dropped on the CMR |
|---|---|---|---|
| 1.7.5.88 | 536.937 | 213.596 | 98.089 |
| 1.7.5.88 Disruptor | 537.599 | 537.070 | 769 |
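A quick back-of-envelope accounting of these figures makes the gap explicit. The small helper below is only an illustration of the arithmetic; the clamping to zero covers minor counting mismatches between Gatling and the CMR counters.

```python
def account(gatling, added, dropped_cmr):
    """Split Gatling's request count into what reached the CMR and what
    apparently never arrived (presumably dropped on the agent)."""
    reached = added + dropped_cmr
    missing = max(gatling - reached, 0)  # clamp small counting mismatches
    return reached, missing

# 1.7.5.88 (fixed 5-second sending interval): only ~58 % of the requests
# ever reach the CMR; the remaining ~42 % presumably were dropped on the
# agent (no logs available to confirm this).
reached, missing = account(536_937, 213_596, 98_089)
print(reached, missing)  # 311685 225252

# 1.7.5.88 Disruptor: essentially everything reaches the CMR.
reached, missing = account(537_599, 537_070, 769)
print(missing)  # 0
```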
As a result, we assumed that memory is kept longer on the CMR in the case of the 1.7.5.88 agent. The difference between both versions is the sending rate: the 1.7.5.88 agent sends every 5 seconds, while the 1.7.5.88 disruptor agent sends irregularly in small batches (every few milliseconds). This means that the 1.7.5.88 agent sends around 2.500 invocation sequences every 5 seconds, which all need to be processed and added to the buffer at once. We assumed that this difference in data chunk size causes the high memory usage on the CMR.
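The per-interval figure can be roughly sanity-checked from the numbers above, assuming the reduced 15-minute test duration and that approximately each Gatling request produces one invocation sequence (both assumptions, not measured values):

```python
# Rough sanity check of the "~2.500 invocation sequences every 5 seconds"
# estimate for the fixed-interval 1.7.5.88 agent.
requests = 537_000    # approximate Gatling request count from the table
duration_s = 15 * 60  # reduced load test duration (assumption: full 15 min)
interval_s = 5        # fixed sending interval of the 1.7.5.88 agent

per_interval = requests / duration_s * interval_s
print(round(per_interval))  # 2983
```

This lands in the same ballpark as the ~2.500 figure, i.e. roughly 2.500-3.000 invocation sequences arriving at the CMR in a single chunk.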
Second assumption: 1.7.5.88 agent memory is kept longer while processing incoming data and adding to the CMR
Unknown user assumed that storing the charting data in the H2 database is the reason for this. Therefore, the HTTP profile was changed and the charting option was disabled.
The table below shows the results of re-running the load test with both agent versions without storing charting data in the H2 database. Now the CMR handles the load with the 1.7.5.88 agent as well, without dropping any data. The disruptor agent again drops some data because it arrives too quickly at the CMR (Unknown user mentioned that ~ 50 agent deliveries were being stored).
| Agent | Requests Gatling Report | Elements Added to the CMR buffer | Elements dropped on the CMR | CMR CPU | CMR Memory |
|---|---|---|---|---|---|
| 1.7.5.88 | 537.709 | 537.581 | 0 | | |
| 1.7.5.88 Disruptor | 537.606 | 537.231 | 642 | | |
Verification Test - InfluxDB to the Rescue
As analyzed above, storing time series data in the H2 database caused the performance issues on the CMR with the 1.7.5.88 agent. We wanted to make sure that we would not run into the same problems if the time series data is stored in the InfluxDB.
Therefore, an additional test was performed. Initially, the long-term data is stored in the InfluxDB. During the test, the database is switched and the data is stored in the H2 database for 5 minutes. After that, the database is switched back and the data is again stored in the InfluxDB.
The pictures below show the CPU and memory usage of the CMR during this mixed-database load test. It is obvious that both resource utilizations increase considerably while data is stored in the H2 database.