We have seen here in Brngr that the agent can disconnect under heavy load or during long GC pauses (here the GC pauses are also longer than 5 s). The disconnection is most likely initiated by the server because sending the data takes too long.
We should try to reconnect the agent. The agent is initialized anyway and keeps collecting data, so it's stupid that we just drop everything on the first disconnect.
By the way, Dscv also reported that agents disconnect after some time and there is nothing in the logs. I guess this can also happen after network problems or similar, so reconnection should help in those cases as well.
We can implement the reconnect in the keep-alive method, so that a keep-alive signal can trigger a reconnect attempt (with some kind of exponentially growing counter so that we don't try on every signal).
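To make the counter idea concrete, here is a minimal sketch of how it could look. It is not the agent's actual code: the class `ReconnectBackoff`, the method `onKeepAlive` and the `maxInterval` cap are hypothetical names, and the real reconnect call is passed in as a `BooleanSupplier` that is assumed to return true on success.

```java
import java.util.function.BooleanSupplier;

/**
 * Hypothetical helper (not part of the existing agent code): decides on which
 * keep-alive signals a reconnect is actually attempted while disconnected.
 */
final class ReconnectBackoff {

    /** Upper bound for the number of keep-alive signals between two attempts. */
    private final int maxInterval;

    /** Number of keep-alive signals to wait before the next attempt. */
    private int interval = 1;

    /** Keep-alive signals seen since the last attempt. */
    private int signalsSinceAttempt = 0;

    ReconnectBackoff(int maxInterval) {
        this.maxInterval = maxInterval;
    }

    /**
     * Call on every keep-alive signal while the agent is disconnected.
     *
     * @param reconnect assumed to perform the reconnect and return true on success
     * @return true if the connection was re-established by this call
     */
    boolean onKeepAlive(BooleanSupplier reconnect) {
        signalsSinceAttempt++;
        if (signalsSinceAttempt < interval) {
            return false; // skip this signal, too early for the next attempt
        }
        signalsSinceAttempt = 0;
        if (reconnect.getAsBoolean()) {
            interval = 1; // reset so a later disconnect starts with a quick retry
            return true;
        }
        // failed: double the waiting interval, but never exceed the cap
        interval = Math.min(interval * 2, maxInterval);
        return false;
    }
}
```

With a cap of 8 and a server that stays unreachable, attempts would happen on keep-alive signals 1, 3, 7 and 15, and then on every 8th signal after that. The cap is there so that an agent that has been disconnected for a long time still retries at a bounded rate.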
Yes, that's what I had in mind, without the need to keep an open socket connection to the server. I know this is slower performance-wise, but it also has some advantages: for example, we could easily send data from a .NET agent or anything else without having to figure out how to handle this kryonet connection, etc. Plus, we should really think about running this connection over SSL and introducing some security, and I think this is much better than what we have right now.
OK, let's do the following:
Let's take this topic independently of this ticket. Right now I think we know what to do to solve this problem directly (increase the timeout/tries). I propose that you, Ivan, prepare a small decision table of what we would gain (from your point of view) by changing to something like RESTful communication.
Then we can discuss. Do all agree?
The implementation I provided can reconnect, but it works only if the CMR has not been stopped. Let's discuss whether we also need to handle the situation where the CMR is stopped completely. In that case we need to ensure that the agent can be reconnected (for example, that its ID exists in H2, the class cache is created, the status data is refreshed, etc.).
SUCCESS: Integrated in