Ringbuffer in the inspectIT agent
Meetings
Date | Focus | Link |
---|---|---|
13.09.2012 | Initial discussion of the requirements and possible approaches | 2012-09-13 Ringbuffer meeting |
21.09.2012 | Discussion of approaches | 2012-09-21 Ring-buffer |
29.09.2012 | Current status, next steps: Create sample realization | 2012-09-29 Ringbuffer status update |
25.10.2012 | Current status, next steps: caliper, another approach | 2012-10-25 Ringbuffer status update |
Requirements
The ringbuffer will replace the buffering strategy in the agent completely, while the sending strategy will be kept, as it defines when data should be sent.
The ringbuffer uses a FIFO policy:
- multiple threads put data at the end of this buffer (measurements that have been gathered and are to be sent to the CMR)
- ONE sending thread takes data from "the beginning" of this buffer and sends the data to the CMR
- the sending thread takes all the elements it finds in the buffer
As we are dealing with a highly multi-threaded environment, the following requirements for the ringbuffer are important:
- thread-safe
- correctly synchronized
- highly concurrent and performant
- as fast as possible
- as little locking as possible
Furthermore, the size of the buffer is limited, either by
- maximum number of elements or
- maximum memory
In case the buffer is full and an insertion operation is desired, the oldest element(s) must be dropped to make room for the new element(s).
Moreover, when the limit is expressed in terms of memory occupancy, a utility for calculating object sizes is also required.
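As a rough illustration of such a utility (this is not the actual object-size utility [5]; all class names and layout constants below are assumptions about a 64-bit HotSpot JVM with compressed oops and a char[]-backed String), the memory occupied by a String could be estimated like this:

```java
// Hypothetical object-size estimate; the real agent uses the utility from [5].
// Constants assume a 64-bit HotSpot JVM with compressed oops.
public class ObjectSizeEstimate {
    static final long OBJECT_HEADER = 12; // mark word + compressed class pointer
    static final long ARRAY_HEADER = 16;  // object header + array length field
    static final long REFERENCE = 4;      // compressed object reference

    // objects are aligned to 8-byte boundaries
    static long align(long size) {
        return (size + 7) & ~7L;
    }

    // String shell: header + hash (int) + reference to the char[],
    // plus the char[] itself (2 bytes per char)
    static long stringSize(String s) {
        long shell = align(OBJECT_HEADER + 4 + REFERENCE);
        long array = align(ARRAY_HEADER + 2L * s.length());
        return shell + array;
    }
}
```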
Approach
There are two issues that have to be dealt with:
- implementing the concurrent access, i.e. the enqueue and dequeue operations
- implementing the eviction of old elements when the buffer has reached its size limit
Approach for implementing a concurrent queue [1][2]
- non-blocking - for better performance in highly multi-threaded environments
- linked list:
- each element holds a reference to the next element
- sentinel nodes: head and tail
- head - points to the first element in the list
- tail - points to the last element in the list
- the last element has a NULL reference for next element
- non-blocking enqueue/dequeue: use compare-and-set to set the values of head and tail
- lazy approach - enqueue/dequeue are both done in two steps [1]
- two-step enqueue:
- compare-and-set to append new element
- compare-and-set to change the value of tail
- "every other method call must be prepared to encounter a half-finished enqueue call, and to finish the job"
- an enqueue call never waits for a dequeuer
- two-step dequeue:
- read the value from the first element
- compare-and-set to change the value of head
- make sure that tail does not lag behind
Refer to Section 10.5 "An Unbounded Lock-Free Queue" in [1] and Section 2 in [2] for a more detailed explanation.
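The two-step enqueue and dequeue described above can be sketched in Java as follows, closely following [1] (Section 10.5); class and field names are illustrative, not taken from the inspectIT code base:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the lazy, lock-free queue from [1], Section 10.5.
public class LockFreeQueue<T> {
    private static class Node<T> {
        final T value;
        final AtomicReference<Node<T>> next = new AtomicReference<>(null);
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> head;
    private final AtomicReference<Node<T>> tail;

    public LockFreeQueue() {
        Node<T> sentinel = new Node<>(null); // dummy node
        head = new AtomicReference<>(sentinel);
        tail = new AtomicReference<>(sentinel);
    }

    public void enqueue(T value) {
        Node<T> node = new Node<>(value);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (last == tail.get()) {
                if (next == null) {
                    // step 1: compare-and-set to append the new element
                    if (last.next.compareAndSet(null, node)) {
                        // step 2: swing tail; failure is fine, another thread helped
                        tail.compareAndSet(last, node);
                        return;
                    }
                } else {
                    // a half-finished enqueue was encountered: finish the job
                    tail.compareAndSet(last, next);
                }
            }
        }
    }

    public T dequeue() {
        while (true) {
            Node<T> first = head.get();
            Node<T> last = tail.get();
            Node<T> next = first.next.get();
            if (first == head.get()) {
                if (first == last) {
                    if (next == null) {
                        return null; // queue is empty
                    }
                    // make sure that tail does not lag behind
                    tail.compareAndSet(last, next);
                } else {
                    T value = next.value;
                    // swing head; the old first node becomes the new sentinel
                    if (head.compareAndSet(first, next)) {
                        return value;
                    }
                }
            }
        }
    }
}
```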
Another possibility to implement a concurrent queue:
- using two locks, one for enqueuing and one for dequeuing
- livelock-free
- but higher lock contention under load
Both the lock-based and the lock-free approach will be implemented in order to determine which performs best.
Approach for eviction of old elements
- similar to CMR atomic buffer implementation [4]
- use AbstractObjectSizes class for calculating object sizes [5]
- separate thread for analyzing and calculating the buffer size
- separate thread for eviction operation by FIFO policy; this operation has to be suspended while the size of all objects in the buffer is smaller than the buffer maximum size
- in the analysis thread, signal the eviction thread if needed
- use conditions for suspending and activating the threads that work on eviction and analysis
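The condition-based hand-off between the analysis and eviction threads could look like the following sketch; the class and member names are illustrative, not taken from the inspectIT code base:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of suspending the eviction thread until the buffer exceeds its
// maximum size, with the analysis thread signalling it when needed.
public class EvictionCoordinator {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition evictionNeeded = lock.newCondition();
    private final long maxSize;
    private final AtomicLong currentSize = new AtomicLong();

    public EvictionCoordinator(long maxSize) {
        this.maxSize = maxSize;
    }

    // called by the analysis thread after adding an element's size
    public void addSize(long delta) {
        if (currentSize.addAndGet(delta) > maxSize) {
            lock.lock();
            try {
                evictionNeeded.signal(); // wake the eviction thread
            } finally {
                lock.unlock();
            }
        }
    }

    // eviction thread: suspended while the buffer is below its maximum size
    public void awaitEvictionNeeded() throws InterruptedException {
        lock.lock();
        try {
            // re-check the condition in a loop to handle spurious wake-ups
            while (currentSize.get() <= maxSize) {
                evictionNeeded.await();
            }
        } finally {
            lock.unlock();
        }
    }

    // called by the eviction thread after dropping elements
    public void subtractEvicted(long delta) {
        currentSize.addAndGet(-delta);
    }
}
```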
The object size can also be computed when enqueueing an element instead of on a different thread. Both approaches introduce some overhead: size calculation on a separate thread means locking and context switching, while synchronous size calculation means slowing down the enqueue operation, which has to be very fast. Nevertheless, it is hard to say which approach is faster, so both of them will be implemented.
Therefore, we have the following approaches that have to be implemented and compared:
Size Calculation / Locking | Lock-based | Lock-free |
---|---|---|
Asynchronous (separate thread) | 1 | 2 |
Synchronous (at enqueue) | 3 | 4 |
Realization
Class diagram for a lock-free RingBuffer with separate eviction and analysis threads (Approach 2):
The following diagram shows how the enqueueing and dequeueing are performed without using locks.
The first node in the buffer is always a dummy node with a null object. The tail is an atomic reference to the last element in the buffer. In this example, there is no need for an atomic reference to the first element in the buffer, as there is only one sending thread that can dequeue elements and thus change the dummy node's next element.
However, it is impossible to implement the eviction functionality in this approach without adding a lock. This lock is held by both dequeue and evict, so that these two operations cannot run at the same time. The enqueue remains lock-free and should therefore be very fast.
A completely lock-free approach is described below.
The third approach is lock-based and uses two locks:
- tail lock - used by enqueue, dequeue
- head lock - used by dequeue, evict
Next, the three operations performed on the buffer in this lock-based implementation are described step by step.
Enqueue:
- allocate node, compute and set size
- lock tail lock
- link node to tail
- swing tail to new node
- unlock tail lock
- change buffer size
- signal eviction if buffer is full
Dequeue:
- if buffer is empty, return
- lock tail lock
- lock head lock
- get the reference of the first element
- swing head and tail to dummy node
- change size to 0
- unlock head lock
- unlock tail lock
- go through buffer, add objects to list
- return list of dequeued nodes
Evict:
- wait until there is need for eviction
- lock head lock
- go through list until the required eviction size is reached
- change head to new first element
- unlock head lock
- subtract evicted size from total buffer size
- we assume here that eviction will never empty the buffer, so it does not need to acquire the tail lock
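The three step-by-step operations above can be sketched as follows; names are illustrative, and object sizes are supplied by the caller (in the real agent they would come from the object-size utility [5]):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the two-lock buffer: enqueue takes the tail lock, evict the
// head lock, and dequeue takes both (tail first, then head).
public class TwoLockBuffer<T> {
    private static class Node<T> {
        final T value;
        final long size;
        Node<T> next;
        Node(T value, long size) { this.value = value; this.size = size; }
    }

    private final ReentrantLock tailLock = new ReentrantLock();
    private final ReentrantLock headLock = new ReentrantLock();
    private final AtomicLong totalSize = new AtomicLong();
    private Node<T> head = new Node<>(null, 0); // dummy node
    private Node<T> tail = head;

    public void enqueue(T value, long size) {
        Node<T> node = new Node<>(value, size);
        tailLock.lock();
        try {
            tail.next = node; // link node to tail
            tail = node;      // swing tail to new node
        } finally {
            tailLock.unlock();
        }
        totalSize.addAndGet(size); // change buffer size
        // signalling the eviction thread when the buffer is full would go here
    }

    // takes all the elements currently in the buffer
    public List<T> dequeueAll() {
        Node<T> first;
        tailLock.lock();
        headLock.lock();
        try {
            first = head.next;
            if (first == null) {
                return new ArrayList<>(); // buffer is empty
            }
            head.next = null; // swing head and tail back to the dummy node
            tail = head;
            totalSize.set(0);
        } finally {
            headLock.unlock();
            tailLock.unlock();
        }
        // go through the detached chain outside the locks
        List<T> result = new ArrayList<>();
        for (Node<T> n = first; n != null; n = n.next) {
            result.add(n.value);
        }
        return result;
    }

    // drops oldest elements until at least requiredSize is freed; assumes
    // (as noted above) that eviction never empties the buffer, so the
    // tail lock is not acquired
    public void evict(long requiredSize) {
        long evicted = 0;
        headLock.lock();
        try {
            while (evicted < requiredSize && head.next != null) {
                head = head.next; // the evicted node becomes the new dummy
                evicted += head.size;
            }
        } finally {
            headLock.unlock();
        }
        totalSize.addAndGet(-evicted);
    }
}
```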
Testing
Issues that need to be checked:
- Performance
- measure time for different number of threads
- compare the different approaches
- Correctness
- the number/size of elements enqueued has to be equal to the sum of number/size of elements dequeued, evicted and remaining in the buffer at the end of the test:
- enqueued = dequeued + evicted + remaining
The general structure of a test application for the RingBuffer is:
- 1 sending thread that loops:
  - sleeps 5 s
  - dequeues all the elements in the buffer
  - prints them
- n threads that enter data (n = 2, 4, 8, ..., 256), each looping:
  - adds elements continuously
  - no sleeping between enqueues, to keep the load high
- limiting the size by number of elements:
- use random Floats as data
- object size not important, only the number of elements
- limiting the size by memory
- use Strings with variable length
The threads run for some time (e.g. 1 minute) and then are interrupted and the statistics about total number of elements added, retrieved and evicted are printed out.
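This structure can be sketched as follows, with Java's ConcurrentLinkedQueue [3] standing in for the RingBuffer implementations under test; the run time and element counts are scaled down from the real test:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of the test application; ConcurrentLinkedQueue [3] is an
// unbounded stand-in, so there is no eviction and the invariant reduces to
// enqueued = dequeued + remaining.
public class RingBufferTest {
    public static void main(String[] args) throws InterruptedException {
        ConcurrentLinkedQueue<Float> buffer = new ConcurrentLinkedQueue<>();
        AtomicLong enqueued = new AtomicLong();
        AtomicLong dequeued = new AtomicLong();
        AtomicBoolean done = new AtomicBoolean(false);

        // one sending thread: periodically drains the buffer
        Thread sender = new Thread(() -> {
            while (!done.get()) {
                while (buffer.poll() != null) {
                    dequeued.incrementAndGet();
                }
                try {
                    Thread.sleep(5); // the real test sleeps 5 s
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        sender.start();

        // n producer threads enqueue random Floats without sleeping
        int n = 8; // the real test uses n = 2, 4, 8, ..., 256
        Thread[] producers = new Thread[n];
        for (int i = 0; i < n; i++) {
            producers[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) {
                    buffer.offer((float) Math.random());
                    enqueued.incrementAndGet();
                }
            });
            producers[i].start();
        }

        for (Thread p : producers) {
            p.join();
        }
        done.set(true);
        sender.join();

        // print statistics and check the correctness invariant
        long remaining = buffer.size();
        System.out.println("enqueued=" + enqueued + " dequeued=" + dequeued
                + " remaining=" + remaining);
        if (enqueued.get() != dequeued.get() + remaining) {
            throw new IllegalStateException("buffer invariant violated");
        }
    }
}
```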
Results
Comparing the number or size of elements added in 1 min for the three implementations (see the figures below), the lock-based implementation is the least efficient, while the other two implementations give similar results, with the completely lock-free implementation performing slightly better, as expected. The figures are based on the average values of three runs for each case.
References
[1] Maurice Herlihy, Nir Shavit, "The Art of Multiprocessor Programming, Revised Reprint"
[2] Maged M. Michael, Michael L. Scott, "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms"
[3] ConcurrentLinkedQueue implementation in Java
[4] CMR buffer specifications
[5] Calculation of object sizes